The ongoing digitalization of public administration in Germany simplifies access to public services and creates potential for more efficient processing of citizens' affairs. However, manually processing incoming documents is time-consuming and ties up considerable staff capacity. With modern ML and NLP methods, incoming documents can be automatically categorized, checked for signs of fraud, and assigned to the relevant personnel, while relevant information is extracted and prepared for subsequent tasks. For example, government grant and funding programs can be made more efficient by assigning applicants' incoming invoices to the correct funding category, extracting audit-relevant fields such as invoice amount and date, and identifying features that indicate fraud.
Motivation
As part of the digitalization of public administration, many services for citizens' affairs have already been made accessible via online applications. This eliminates long waiting times and helps prevent incomplete submissions. However, personnel and time bottlenecks can still arise when digitally submitted documents are processed and checked. Such checks include formal checks, e.g. whether all required fields and checkboxes have been completed and signatures have been added correctly, and content checks, e.g. whether the evidence submitted with an application for authorization to perform medical services is appropriate. In addition, fraud checks may be required, e.g. whether evidence of subsidized or funded work has been falsified or manipulated. NLP- and ML-supported solutions can automate such steps and also extract personal and audit-relevant attributes, such as the date of the application and the name of the applicant, for further processing. Modern LLMs in particular transfer well to new tasks, so that different application procedures with different modalities and schemes can be supported efficiently. The checked documents and extracted information can then be assigned to the relevant specialist procedure or personnel and stored in the right place in the IT system.

Challenges
For incoming documents that follow a fixed structure or layout, e.g. application forms for a specific approval procedure in which the input fields and text boxes are always arranged in the same way, rule-based approaches are often sufficient. However, if a single ML model is to cover several approval procedures whose documents come in different layouts and formats, the model must be able to process their differing textual and structural features. For example, in the above-mentioned process for reviewing subsidized or funded work, the submitted invoices and certificates usually differ in how costs and prices are presented and arranged. With an annotated dataset that covers these different structures and terminologies, one ML model can be adapted to several specific processes at the same time.
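To illustrate why rule-based extraction works for a fixed layout but breaks as soon as the layout changes, here is a minimal sketch. The field labels ("Invoice date:", "Total:"), the German number format, and both sample texts are assumptions made for this illustration and are not taken from a real form.

```python
import re
from datetime import date, datetime
from decimal import Decimal
from typing import Optional

# Hypothetical patterns for ONE fixed invoice layout (assumed labels).
DATE_RE = re.compile(r"Invoice date:\s*(\d{2}\.\d{2}\.\d{4})")
TOTAL_RE = re.compile(r"Total:\s*([\d.]+,\d{2})\s*EUR")


def extract_date(text: str) -> Optional[date]:
    """Return the invoice date, or None if the expected label is missing."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d.%m.%Y").date()


def extract_total(text: str) -> Optional[Decimal]:
    """Return the invoice total, or None if the expected label is missing."""
    match = TOTAL_RE.search(text)
    if match is None:
        return None
    # German number format: "." separates thousands, "," marks the decimals.
    return Decimal(match.group(1).replace(".", "").replace(",", "."))


# Illustrative sample texts (invented for this sketch).
fixed_layout = "Invoice date: 03.05.2024\nTotal: 1.250,00 EUR"
other_layout = "Dated 2024-05-03\nAmount due (gross) EUR 1250.00"

print(extract_date(fixed_layout), extract_total(fixed_layout))
print(extract_date(other_layout), extract_total(other_layout))
```

The rules recover both fields from the layout they were written for and return `None` for the second layout, although it carries the same information. Every additional layout would need its own rules, which is the gap a model trained on varied, annotated documents is meant to close.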
Another important aspect of public services is that documents in other languages also need to be processed. In this case, either the ML model must have been trained on multilingual documents, or the documents must first be translated by a separate language model.
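The two options for foreign-language documents can be sketched as a small routing step. The set of supported languages, the function name `prepare_for_model`, and the placeholder translator are all hypothetical. In practice the language code would come from a language-identification step and the translation from an actual translation model.

```python
from typing import Callable

# Assumption: languages the downstream model was trained on (illustrative).
SUPPORTED_LANGUAGES = {"de", "en"}


def prepare_for_model(
    text: str,
    language: str,
    translate: Callable[[str, str], str],
    target: str = "de",
) -> str:
    """Pass supported languages through unchanged; translate everything else."""
    if language in SUPPORTED_LANGUAGES:
        return text
    return translate(text, target)


def placeholder_translate(text: str, target: str) -> str:
    """Stand-in for a real translation model; it only tags the text."""
    return f"[{target}] {text}"


print(prepare_for_model("Rechnung Nr. 17", "de", placeholder_translate))
print(prepare_for_model("Factura n.º 17", "es", placeholder_translate))
```

Passing the translator in as a parameter keeps the routing logic independent of whichever translation model is eventually used.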

Solution approaches
Neural language models such as BERT, or LLMs such as Llama or Mistral, are used to classify texts automatically. Such models capture the semantic relationships between words and their context as well as subject-specific terminology, e.g. administrative language. This allows them to classify and categorize text fields and sections based on context. By training on annotated sample documents, a language model can learn, for example, whether a text field contains a signature or a date.
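The idea of learning field types from annotated examples can be shown with a deliberately simple stand-in. The sketch below is not a neural language model: it uses a bag-of-words nearest-centroid classifier so that it runs with the standard library alone. The training texts and labels are invented, and the tokenizer only handles unaccented Latin letters.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase words; every run of digits becomes the placeholder '<num>'."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return ["<num>" if token.isdigit() else token for token in tokens]


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(examples):
    """Sum the token counts of all annotated examples per label."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(tokenize(text))
    return centroids


def predict(centroids, text):
    """Return the label whose centroid is most similar to the text."""
    vector = Counter(tokenize(text))
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Invented annotations: (text field content, field type).
annotated_fields = [
    ("Date of application: 12 March 2024", "date"),
    ("Application submitted on 3 May 2024", "date"),
    ("Signature of the applicant: J. Doe", "signature"),
    ("Signed by the authorised representative", "signature"),
]
centroids = train(annotated_fields)

print(predict(centroids, "Date: 7 June 2023"))
print(predict(centroids, "Signature: M. Example"))
```

A pre-trained language model replaces the raw word counts with contextual representations, which is what lets it cope with paraphrases and administrative terminology that never appear verbatim in the annotations.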
Pre-trained models for text classification and information extraction are available that generalize well and, in some cases, have already been trained on several languages. Depending on the complexity of the use case, special methods and algorithms can adapt these models with just a few annotated sample documents. For the grant program mentioned above, fewer than 50 representative invoices and certificates could be sufficient for an ML system to process and check such documents automatically.
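The text does not say which few-shot method is meant. One possible option, assumed here purely for illustration, is in-context prompting of an LLM: a handful of annotated documents are placed in the prompt ahead of the new document. The sketch below only builds such a prompt and does not call any model. The instruction wording, the field names, and the sample invoices are hypothetical.

```python
import json


def build_few_shot_prompt(examples, new_document):
    """Assemble a prompt from annotated (text, fields) pairs and a new document."""
    parts = [
        "Extract the invoice amount and date as JSON "
        "with the keys 'amount' and 'date'."
    ]
    for text, fields in examples:
        # Each annotated example shows the model the expected output format.
        parts.append(
            f"Document:\n{text}\nFields:\n{json.dumps(fields, sort_keys=True)}"
        )
    # The new document ends with an open 'Fields:' slot for the model to fill.
    parts.append(f"Document:\n{new_document}\nFields:")
    return "\n\n".join(parts)


# Invented annotated examples in two different layouts.
annotated = [
    (
        "Invoice date: 03.05.2024\nTotal: 1.250,00 EUR",
        {"amount": "1250.00", "date": "2024-05-03"},
    ),
    (
        "Dated 2024-06-11\nAmount due EUR 480.50",
        {"amount": "480.50", "date": "2024-06-11"},
    ),
]
prompt = build_few_shot_prompt(
    annotated, "Rechnungsdatum 01.02.2024, Summe 99,90 EUR"
)
print(prompt)
```

Other adaptation methods, such as fine-tuning a smaller model on the same annotations, would use the annotated pairs as training data instead of prompt content.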
