Natural Language Processing Case Study

Semantic Search for public administration

Introduction

The digitalisation of public administration services is gaining momentum throughout Europe. When a new digital public service is established, interaction between users and the service must be easy and intuitive so that the service is adopted widely and democratically.

However, public authorities often use their own jargon, which is far from intuitive for most citizens. Furthermore, citizens using a digital service cannot put open questions to an official as directly as they could during an on-site visit. A language barrier can therefore arise quickly. This can lead to frustration among citizens and, in the worst case, to a lack of acceptance of the digital service. To remedy this problem, dida developed an AI-based algorithm that extracts relevant information from authority documents.

Starting Point

Citizens registering a new business in Germany have to provide an industry code along with their registration. This industry code is chosen from a list of over 800 different codes, each described and defined in complicated “public administration language”. Finding the correct code from all these options is hard, especially if someone is not accustomed to the language used in these descriptions.

The client already had a chatbot solution that could serve as the interface: citizens type in a description of their business in colloquial language and receive the five most relevant industry codes as the chatbot's response.

The chatbot solution routes different user questions to corresponding API endpoints, so we received the colloquial business descriptions written by users as API calls. Our algorithm had to answer each of these API calls with a response containing the five most relevant industry codes.
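The exchange described above can be sketched in a few lines. This is only an illustration: the JSON field names (description, industry_codes, code, score) and the helper names are assumptions made for this example and are not taken from the client's actual API.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CodeSuggestion:
    code: str     # industry code identifier (placeholder values in this sketch)
    score: float  # relevance score assigned by the model


def parse_request(body: str) -> str:
    """Extract the colloquial business description from a JSON request body.

    The field name "description" is hypothetical.
    """
    return json.loads(body)["description"].strip()


def build_response(ranked: list[CodeSuggestion]) -> str:
    """Serialise at most five already-ranked suggestions as a JSON string."""
    payload = {"industry_codes": [asdict(s) for s in ranked[:5]]}
    return json.dumps(payload, ensure_ascii=False)
```

In a real deployment these two functions would sit behind an HTTP endpoint; the ranking itself is produced by the model described under Solution.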

Challenges

The input to the algorithm is the citizen’s own description of the business they want to register.

As authorities often use words and turns of phrase that differ widely from colloquial language, finding the correct industry code for a specific business registration is a non-trivial task. Simple text search algorithms, which rely on matching words, were therefore not sufficient to find the correct industry code.
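A small sketch of why word matching breaks down, assuming a naive search that scores a code by the number of words it shares with the query. Both texts are invented for illustration and are not taken from the real list of industry codes.

```python
import re


def tokens(text: str) -> set[str]:
    """Lower-case a text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_overlap(query: str, document: str) -> int:
    """Count the distinct words that query and document have in common."""
    return len(tokens(query) & tokens(document))


# Invented example texts (hypothetical, not from the official code list).
official = "Retail sale of bread, cakes and confectionery in specialised stores"
colloquial = "I want to open a small bakery shop"

print(keyword_overlap(colloquial, official))  # prints 0
```

Although both texts describe the same kind of business, they share no words, so a search based purely on word overlap cannot connect them.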

An existing solution based on basic word embeddings often produced unsatisfactory results. This indicated that a better semantic understanding was needed, both of the definitions and descriptions of the industry codes and of the colloquial business descriptions.

Solution

An AI architecture especially suited to natural language processing tasks was adapted and trained to solve the task. It outputs the relevant industry codes for a given colloquial business description. The training data, provided by the client, consisted of historical colloquial business descriptions and the corresponding industry codes.

Technologies used: Python, spaCy, PyTorch, FastAPI, Docker, Elasticsearch

The basis for our algorithm is a version of BERT, a neural network architecture developed by Google, that had already been pretrained on a large German text corpus. We fine-tuned the existing layers and enriched BERT's output with custom features. By adding a few new layers and postprocessing steps, we were able to build a text classifier that leverages the full semantic capabilities of BERT while being performant enough to run on a CPU.
This final classifier was then trained on historical pairs of business descriptions and industry codes. It takes a business description as input and outputs a relevance score for each industry code. The five highest-ranking industry codes are then sent back in ranked order as the response to the original user request.
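The last step, selecting the five highest-ranking codes from the per-code relevance scores, can be sketched as follows. This is a minimal illustration rather than the project's actual postprocessing; the function name top_k_codes and the placeholder codes are assumptions.

```python
import heapq


def top_k_codes(scores: dict[str, float], k: int = 5) -> list[tuple[str, float]]:
    """Return the k (code, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Placeholder codes and scores, standing in for the classifier's output.
example_scores = {
    "CODE-A": 0.02,
    "CODE-B": 0.61,
    "CODE-C": 0.11,
    "CODE-D": 0.09,
    "CODE-E": 0.01,
    "CODE-F": 0.13,
    "CODE-G": 0.03,
}

for code, score in top_k_codes(example_scores):
    print(code, score)
```

With more than 800 codes, heapq.nlargest avoids sorting the full score list, although at this scale a plain sort would work equally well.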

Product

The final product was deployed as a feature of the client's existing chatbot. New industry codes that the algorithm was not trained on can be integrated with very little effort. Below you can see three example outputs of the algorithm for different business descriptions. Because machine learning models for natural language processing are usually language-specific, the examples below are in German.
