AI for the public sector: Specialized LLMs for document search

The ongoing digitalization of public administration in Germany not only simplifies access to public services but also enables more efficient, modern internal workflows. This makes both services for citizens and jobs in public administration more attractive. Semantic search engines and retrieval-augmented generation systems can make working with large document archives easier and less time-consuming. For example, citizens can use plain language to look up the regulations for business registration, vehicle registration, and tax returns. Internal research is simplified as well: employees can ask questions when they are uncertain about an approval procedure or run into IT problems.

Motivation

One advantage of digitalization is that information on processes, facts, and regulations is available in large digital document archives in public institutions. To make these archives easily searchable, intelligent search engines are needed that analyze and compare queries and document content at the semantic level, so that citizens and employees can find the right information in their own words. Improved processes and services in turn strengthen trust in the long-term viability and attractiveness of state and public institutions.

Challenges

Classic keyword search can provide satisfactory results for general search queries. However, when information is sought in subject-specific documents and the query is not phrased in the correct technical language, keyword-based search engines fall short. For example, the available information on vehicle registration or tax returns is written in complex administrative language that most citizens are unfamiliar with, so the search must match technical language against freely formulated queries at the semantic level. In addition, maintaining a keyword database and matching queries against it is resource-intensive.
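The vocabulary gap described above can be illustrated with a minimal sketch. The sample sentences, the stopword list, and the function names are illustrative assumptions, not taken from a real administrative corpus or from any particular search engine.

```python
# Toy illustration of the vocabulary gap in keyword search.
# All texts and the stopword list are made-up examples.

STOPWORDS = {"a", "an", "the", "of", "to", "for", "do", "i", "my", "how", "is", "in"}


def tokenize(text):
    """Lowercase the text and split it into a set of words without punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}


def keyword_overlap(query, document):
    """Return the content words that query and document have in common."""
    return (tokenize(query) - STOPWORDS) & (tokenize(document) - STOPWORDS)


document = "Application for the admission of a motor vehicle to road traffic"
lay_query = "How do I register my car?"
expert_query = "admission of a motor vehicle"

lay_hits = keyword_overlap(lay_query, document)
expert_hits = keyword_overlap(expert_query, document)

print("lay query hits:   ", sorted(lay_hits))
print("expert query hits:", sorted(expert_hits))
```

In this toy case the lay query shares no content words with the document even though it asks about the same topic, while the query phrased in administrative language matches directly. This is the gap that a semantic comparison is meant to close.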

To do so, semantic search engines must be able to understand the technical and administrative language of the documents and compare it with a search query. This can be achieved by adapting modern neural language models, such as LLMs, to the subject domain. Doing so requires a sufficient amount of annotated training data and the construction of intelligent document retrieval systems.

Solution approaches

Neural language models based on the transformer architecture are used for the semantic analysis of search queries. They can be adapted using annotated training data so that they model the similarity between search queries and specialist texts. This enables them to find the appropriate text passage or document for a search query.
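How such a similarity model is used at query time can be sketched as follows, assuming that an adapted encoder has already mapped the query and each passage to a vector. The three-dimensional vectors below are hand-picked placeholders rather than real model output, and the names `cosine_similarity` and `rank_passages` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (passage_id, score) pairs sorted from most to least similar."""
    scored = [
        (passage_id, cosine_similarity(query_vec, vec))
        for passage_id, vec in passage_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Placeholder embeddings (assumption): a real system would obtain these
# from a transformer encoder adapted to administrative language.
PASSAGE_VECS = {
    "vehicle_registration": [0.9, 0.1, 0.0],
    "business_registration": [0.1, 0.9, 0.1],
    "tax_return": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for "How do I register my car?"

ranking = rank_passages(query_vec, PASSAGE_VECS)
for passage_id, score in ranking:
    print(f"{passage_id}: {score:.3f}")
```

With these placeholder vectors the vehicle registration passage ranks first. In practice, the quality of the ranking depends entirely on how well the encoder has been adapted to the domain.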

In addition, large language models (LLMs) and retrieval-augmented generation (RAG) can be used to provide a factually correct answer to the search query. LLMs are generically pre-trained generative models that can be adapted to specific tasks in a resource-saving and data-efficient manner. RAG adds an intelligent document retrieval component whose results are used to improve the LLM's output. In the case of vehicle registration, it would thus be possible to develop a subject-specific generative AI system with fewer than 50 training examples that returns not only the relevant text passages but also a correct, summarized formulation of their content.
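The RAG flow can be sketched in a few lines, under the assumption that retrieval and generation are supplied as plain functions. The names `retrieve` and `generate`, the prompt wording, and the placeholder passages are all hypothetical. No real LLM or retrieval library is called here.

```python
from typing import Callable, List, Tuple


def build_rag_prompt(question: str, passages: List[str]) -> str:
    """Combine the retrieved passages and the question into a single prompt."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the numbered passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_with_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
    top_k: int = 2,
) -> Tuple[str, List[str]]:
    """Retrieve passages, prompt the generator, and return the answer with its sources."""
    passages = retrieve(question)[:top_k]
    prompt = build_rag_prompt(question, passages)
    return generate(prompt), passages


# Stand-ins (assumptions): a real system would query a semantic search index
# and call an adapted LLM instead of these stubs.
def fake_retrieve(question: str) -> List[str]:
    return [
        "Placeholder passage A about vehicle registration.",
        "Placeholder passage B about required documents.",
        "Placeholder passage C about an unrelated topic.",
    ]


def fake_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


answer, sources = answer_with_rag("How do I register my car?", fake_retrieve, fake_generate)
print(answer)
print(sources)
```

Returning the source passages alongside the generated answer mirrors the idea above that the system reproduces the appropriate text sections as well as a summarized formulation, so that users can check the answer against the original wording.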

More Use Cases in Government & Public Sector

Automatic creation of technically correct responses to ongoing public procedures

Classification and information extraction of incoming documents for completeness and fraud checks

Crop type classification for subsidy control

Semantic search and RAG for searching document archives