Semantic Search for Public Administration

Our algorithm helps citizens through the bureaucracy of registering a business.

Input: Free-text business descriptions
Output: The corresponding industry codes
Goal: Simplify business registration for citizens

Starting Point

Citizens registering a new business in Germany have to provide an industry code along with their registration. This industry code is chosen from a list of over 800 different codes, each described and defined in complicated “public administration language”. Finding the correct code from all these options is hard, especially if someone is not accustomed to the language used in these descriptions.

Our client PublicPlan developed an online service portal for the federal state of North Rhine-Westphalia that gives citizens access to public administration services. PublicPlan wanted to extend the portal with an intuitive search function for industry codes, integrated into both the portal's chatbot and the business registration assistant.

Challenges

The algorithm takes as input a citizen’s free-text description of the business they want to register.

Because authorities often use words and turns of phrase that differ widely from colloquial language, finding the correct industry code for a specific business registration is a non-trivial task.

Therefore, simple text search algorithms are not sufficient to find the correct industry code.
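
To see why plain word matching falls short, the following minimal sketch scores a colloquial query against a code description by word overlap. Both example texts are invented for illustration and are not taken from the official code list; the function names are hypothetical as well.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase a text and split it into a set of word tokens."""
    return set(re.findall(r"[a-zäöüß]+", text.lower()))


def overlap_score(query: str, description: str) -> float:
    """Jaccard overlap between the word sets of query and description."""
    q, d = tokens(query), tokens(description)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


# Hypothetical texts: an administrative-style description and a citizen's wording.
official = "Retail sale of bakery products and confectionery in specialised stores"
colloquial = "I want to open a small shop that sells bread and cakes"

shared = tokens(colloquial) & tokens(official)
score = overlap_score(colloquial, official)
print(shared, round(score, 3))
```

The only word the two texts share is "and", so the overlap score is close to zero even though both describe the same kind of business. A purely lexical search would therefore rank this code very low.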

Solution

We adapted and trained an AI architecture that is particularly well suited to Natural Language Processing. Given a colloquial business description, the trained model suggests the relevant industry codes.

The model was trained on historical colloquial business descriptions and their corresponding industry codes, which the client provided.

The final product can be deployed in various settings. Currently, it is used as a feature of the client's existing chatbot and business registration assistant. Our solution is flexible and easy to maintain: new industry codes can be integrated with very little effort.

Below you can see three example outputs of the algorithm for different business descriptions. Because Machine Learning models for Natural Language Processing are usually language-specific, the examples are in German.

Technical Details

The Client's Requirements

The client already had a chatbot and a business registration assistant. Both could serve as the interface through which citizens type in their business descriptions in colloquial language and receive the five most relevant industry codes in response.

Both the chatbot and the registration assistant can route different user questions to corresponding API endpoints, so we received the users' colloquial business descriptions as API calls. Our algorithm had to answer each call with the five most relevant industry codes.
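
The request and response flow described above can be sketched with the Python standard library alone. The JSON field names (`description`, `industry_codes`), the `handle_request` and `build_response` helpers, and the placeholder classifier are all assumptions made for illustration. The actual API schema is not documented here, and in the real stack this role would more likely be filled by FastAPI routes with Pydantic models.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class IndustryCodeSuggestion:
    code: str         # industry code identifier (placeholder values below)
    relevance: float  # model score, higher means more relevant


def build_response(description: str, ranked: list, limit: int = 5) -> str:
    """Serialise the first `limit` suggestions of an already ranked list."""
    body = {
        "description": description,
        "industry_codes": [asdict(s) for s in ranked[:limit]],
    }
    return json.dumps(body, ensure_ascii=False)


def handle_request(body: str, classify) -> str:
    """Parse the incoming JSON body, validate it and return a JSON answer."""
    payload = json.loads(body)
    description = str(payload.get("description", "")).strip()
    if not description:
        return json.dumps({"error": "description must not be empty"})
    return build_response(description, classify(description))


def dummy_classify(description: str) -> list:
    # Stand-in for the trained model: fixed placeholder codes, best first.
    scores = [0.90, 0.70, 0.30, 0.20, 0.10, 0.05]
    return [IndustryCodeSuggestion(f"CODE-{i}", s) for i, s in enumerate(scores)]


raw = handle_request('{"description": "I want to open a bakery"}', dummy_classify)
response = json.loads(raw)
print([item["code"] for item in response["industry_codes"]])
```

Keeping the classifier behind a plain callable means the same handler could serve both the chatbot and the registration assistant, whichever one forwards the request.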

The existing search relied on basic word embeddings and often produced unsatisfactory results. This indicated that a better semantic understanding was needed, both of the definitions and descriptions of the industry codes and of the colloquial business descriptions.

Our Solution

Backend: Python, spaCy, PyTorch, NumPy, pandas, FastAPI, Pydantic, Docker, Elasticsearch
Infrastructure: Google Cloud (training), Git, DVC, TensorBoard

The basis for our algorithm is a version of BERT, a neural network architecture developed by Google, that had already been pre-trained on a large German text corpus. We fine-tuned the existing layers and enriched BERT's output with custom features. By adding a few new layers and postprocessing steps, we built a text classifier that leverages the full semantic capabilities of BERT while remaining efficient enough to run on a CPU.
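
The idea of enriching BERT's output with custom features before a small classification layer can be illustrated without a deep learning library. In the sketch below, plain Python lists stand in for tensors. The vector sizes, the feature values and the random weights are made up, and the single dense layer is only one possible reading of "a few new layers". The real model would use trained PyTorch layers instead.

```python
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row, y_i = w_i . x + b_i."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def classify(text_embedding, custom_features, weights, bias):
    """Concatenate both inputs and map them to one raw score per industry code."""
    combined = list(text_embedding) + list(custom_features)
    return linear(combined, weights, bias)


# Toy sizes for readability. BERT-base, for example, uses 768-dimensional
# hidden states, and the code list in this project has over 800 entries.
EMBEDDING_DIM, FEATURE_DIM, NUM_CODES = 8, 3, 4

rng = random.Random(0)
embedding = [rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]  # stand-in for BERT output
features = [1.0, 0.0, 0.5]  # hypothetical custom features
weights = [
    [rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM + FEATURE_DIM)]
    for _ in range(NUM_CODES)
]
bias = [0.0] * NUM_CODES

scores = classify(embedding, features, weights, bias)
print(len(scores))
```

Because the added layer is small compared with BERT itself, most of the inference cost stays in the pre-trained network, which is where any optimisation for CPU use would have to focus.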

This final classifier was then trained on historical pairs of business descriptions and industry codes. It takes a business description as input and outputs a relevance score for each industry code. The five highest-ranking industry codes are then returned, in ranked order, as the response to the original user request.
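
Selecting the five best codes from the per-code relevance scores is a standard top-k operation. The following is a minimal sketch using `heapq.nlargest` from the standard library. The code labels and scores are placeholders, and `top_k_codes` is a hypothetical helper name.

```python
import heapq


def top_k_codes(scores: dict, k: int = 5) -> list:
    """Return the k (code, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Placeholder codes and scores for illustration only.
scores = {"A": 0.02, "B": 0.41, "C": 0.13, "D": 0.88, "E": 0.07, "F": 0.35, "G": 0.26}

top5 = top_k_codes(scores)
print([code for code, _ in top5])
```

With more than 800 codes and a small k, `heapq.nlargest` avoids sorting the full score list, although at this scale a plain `sorted` call would perform just as well in practice.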

Further Projects

A selection of projects we have done

Smart Access Control with Facial Recognition

We developed a multi-level security system with facial recognition for automatic access control.

Numeric Attribute Extraction from Product Descriptions

Automatically extract numerical attributes from product descriptions in order to enrich the existing database.

Automatic Planning of Solar Systems

Creative solutions enabled us to automate the process of planning solar systems.