Semantic search for public administration

dida developed an AI-based algorithm to extract relevant information from authority documents

Case Study
Natural Language Processing

General

Starting point

Citizens registering a new business in Germany have to provide an industry code along with their registration. This industry code is chosen from a list of over 800 different codes, each described and defined in complicated “public administration language”. Finding the correct code from all these options is hard, especially if someone is not accustomed to the language used in these descriptions.

Challenges

The input to the algorithm is the citizen's own description of the business they want to register.

Because authorities often use words and turns of phrase that differ widely from colloquial language, finding the correct industry code for a specific business registration is a non-trivial task.

Therefore, simple text search algorithms were not sufficient to find the correct industry code.

Solution

An AI architecture especially suited for Natural Language Processing tasks was adapted and trained to solve the task. It outputs the relevant industry codes for a given colloquial business description.

The training data consisted of historical colloquial business descriptions and their corresponding industry codes. This data was provided by the client.

Product

The final product was deployed as a feature of the client's existing chatbot. New industry codes that the algorithm was not trained on can be integrated with very little effort.

Below you can see three example outputs of the algorithm for different business descriptions. Because machine learning models for natural language processing are usually language-specific, the examples are in German.

Technical

Starting point

The client already had a chatbot that could serve as the interface: citizens type in a description of their business in colloquial language and receive the five most relevant industry codes in response.

The chatbot routes different kinds of user questions to corresponding API endpoints, so we received the colloquial business descriptions written by users as API calls. Our algorithm had to respond to each call with the five most relevant industry codes.
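The request/response contract described above could look roughly like the following. This is a minimal sketch using only the Python standard library; the field names (`description`, `industry_codes`), the `handle_request` function, and the `stub_scorer` placeholder are hypothetical and not taken from the actual service.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List, Tuple

# A scorer maps a business description to (industry_code, relevance_score) pairs.
Scorer = Callable[[str], List[Tuple[str, float]]]


@dataclass
class CodeSuggestion:
    code: str
    score: float


def handle_request(body: str, scorer: Scorer, top_k: int = 5) -> str:
    """Parse a JSON request, score all codes, and return the top_k as JSON."""
    payload = json.loads(body)
    description = str(payload.get("description", "")).strip()
    if not description:
        return json.dumps({"error": "description must not be empty"})

    # Sort by relevance score, highest first, and keep the best top_k codes.
    ranked = sorted(scorer(description), key=lambda pair: pair[1], reverse=True)
    suggestions = [CodeSuggestion(code, score) for code, score in ranked[:top_k]]
    return json.dumps({"industry_codes": [asdict(s) for s in suggestions]})


def stub_scorer(description: str) -> List[Tuple[str, float]]:
    """Placeholder for the trained model: returns fixed, made-up scores."""
    return [("A", 0.1), ("B", 0.9), ("C", 0.4), ("D", 0.7), ("E", 0.2), ("F", 0.8)]
```

Passing the scorer in as a callable keeps the endpoint logic independent of the model, so the placeholder can be swapped for a trained classifier without touching the request handling.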

Challenges

An existing solution based on basic word embeddings often produced unsatisfactory results. This indicated that the task required a better semantic understanding of both the definitions and descriptions of the industry codes and the colloquial descriptions of businesses.
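For context, a basic word-embedding approach typically averages the word vectors of a text and ranks candidates by cosine similarity. The sketch below illustrates that general technique, not the client's earlier system; the tiny three-dimensional vectors, the vocabulary, and the code labels are made up for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]

# Toy word vectors (made up); a real system would load pretrained embeddings.
EMBEDDINGS: Dict[str, Vector] = {
    "bread": [0.9, 0.1, 0.0],
    "bakery": [0.8, 0.2, 0.1],
    "software": [0.0, 0.1, 0.9],
    "programming": [0.1, 0.0, 0.8],
}


def embed(text: str) -> Vector:
    """Average the vectors of all known words; unknown words are ignored."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def rank_codes(description: str, code_texts: Dict[str, str]) -> List[str]:
    """Return code labels ordered by similarity to the description."""
    query = embed(description)
    return sorted(
        code_texts,
        key=lambda code: cosine(query, embed(code_texts[code])),
        reverse=True,
    )
```

Averaging discards word order and context, and words missing from the vocabulary contribute nothing, which is one plausible reason such an approach struggles when colloquial and official wording diverge.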

Solution

Technologies used

Backend: Python, spaCy, PyTorch, NumPy, pandas, FastAPI, Pydantic, Docker, Elasticsearch
Infrastructure: Google Cloud (training), Git, DVC, TensorBoard

The basis for our algorithm is a version of BERT, a neural network architecture developed by Google, that had already been pretrained on a large German text corpus. We fine-tuned the existing layers and enriched BERT's output with custom features. By adding a few new layers and postprocessing steps, we built a text classifier that leverages the full semantic capabilities of BERT while remaining efficient enough to run on a CPU.
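The idea of enriching the encoder output with custom features and adding layers on top can be sketched as follows. This is an illustration only: the source does not specify the custom features, the layer sizes, or the activation, so all of those are assumptions here, and plain Python lists stand in for PyTorch tensors and for the BERT sentence embedding.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class ClassifierHead:
    """Hypothetical head: [embedding ; custom features] -> one score per code."""

    def __init__(self, embed_dim: int, feature_dim: int, num_codes: int, seed: int = 0):
        rng = random.Random(seed)
        self.in_dim = embed_dim + feature_dim
        # Randomly initialised here; in a trained model these weights are learned.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(self.in_dim)]
            for _ in range(num_codes)
        ]
        self.bias = [0.0] * num_codes

    def forward(self, embedding: Vector, features: Vector) -> Vector:
        enriched = embedding + features  # list concatenation, not addition
        if len(enriched) != self.in_dim:
            raise ValueError(f"expected {self.in_dim} inputs, got {len(enriched)}")
        return [sigmoid(z) for z in linear(self.weights, self.bias, enriched)]
```

Producing an independent score per industry code, rather than a single predicted class, is what allows the codes to be ranked afterwards.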

This final classifier was then trained on historical pairs of business descriptions and industry codes. It takes a business description as input and outputs a relevance score for each industry code. The five highest-ranking industry codes are then returned, in ranked order, as the response to the original user request.
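Selecting the five highest-ranking codes from the per-code relevance scores is a simple top-k step. A minimal sketch, assuming the scores arrive as a dictionary from code to score; the function name `top_codes` and the code labels and scores in the example are invented for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def top_codes(scores: Dict[str, float], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k highest-scoring codes as (code, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Made-up scores for six hypothetical industry codes.
example_scores = {
    "code_a": 0.12,
    "code_b": 0.91,
    "code_c": 0.45,
    "code_d": 0.78,
    "code_e": 0.05,
    "code_f": 0.66,
}
ranked = top_codes(example_scores)
```

With several hundred codes, `heapq.nlargest` avoids sorting the full list, although at this scale a plain `sorted` call would work just as well.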

