The Client's Requirements
The client already operated two solutions, a chatbot and a business registration assistant. Either could serve as the interface: citizens type a description of their business in colloquial language and receive the five most relevant industry codes in response.
Both solutions can route different kinds of user questions to corresponding API endpoints, so the colloquial business descriptions written by users reached us as API calls. Our algorithm had to answer each of these calls with the five most relevant industry codes.
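The request and response exchange described above can be sketched roughly as follows. This is only an illustration of the contract, not the actual API: the field names `description` and `industry_codes`, the `handle` function, and the `fake_classify` stand-in are all assumptions, and the sketch uses only the standard library rather than the project's web framework.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ClassificationRequest:
    # Colloquial business description typed by the citizen (hypothetical field name).
    description: str


@dataclass
class ClassificationResponse:
    # Industry codes in ranked order, most relevant first (hypothetical field name).
    industry_codes: list[str] = field(default_factory=list)


def handle(raw_body: str, classify) -> str:
    """Parse a JSON request body, classify it, and return a JSON response body."""
    request = ClassificationRequest(**json.loads(raw_body))
    codes = classify(request.description)[:5]  # never return more than five codes
    return json.dumps(asdict(ClassificationResponse(industry_codes=codes)))


def fake_classify(description: str) -> list[str]:
    # Stand-in for the real classifier; returns placeholder codes, already ranked.
    return [f"CODE-{i}" for i in range(1, 8)]


response_body = handle(
    '{"description": "I fix bikes and sell spare parts"}', fake_classify
)
print(response_body)
```

Keeping the classifier behind a plain callable means the same handler could sit behind either interface without knowing which one forwarded the request.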
An earlier matching approach based on basic word embeddings already existed, but it often produced unsatisfactory results. This indicated the need for a better semantic understanding of both the definitions and descriptions of the industry codes and the colloquial descriptions of a business.
Our Solution
Backend: Python, spaCy, PyTorch, NumPy, Pandas, FastAPI, Pydantic, Docker, Elasticsearch
Infrastructure: Google Cloud (training), Git, DVC, TensorBoard
Our algorithm is based on a version of BERT, a neural network architecture developed by Google, that had already been pre-trained on a large German text corpus. We fine-tuned the existing layers and enriched BERT's output with custom features. By adding a few new layers and postprocessing steps, we built a text classifier that leverages the full semantic capabilities of BERT while remaining efficient enough to run on a CPU.
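To make the idea of enriching BERT's output with custom features more concrete, here is a shape-level sketch of a classification layer that concatenates a pooled sentence vector with extra features. Everything specific in it is an assumption: the hidden size of 768 (that of a BERT-base model), the number of features and codes, the single linear layer, and the softmax scoring. Random arrays stand in for the real BERT output, and NumPy is used only to keep the sketch light; it is not the project's PyTorch code.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 768      # pooled output size, assuming a BERT-base model
N_FEATURES = 4    # number of custom features (hypothetical)
N_CODES = 10      # number of industry codes (placeholder value)

# Stand-ins for a batch of two descriptions: pooled BERT vectors and custom features.
pooled = rng.normal(size=(2, HIDDEN))
features = rng.normal(size=(2, N_FEATURES))

# Randomly initialised weights of the added layer; in practice these are learned.
W = rng.normal(scale=0.02, size=(HIDDEN + N_FEATURES, N_CODES))
b = np.zeros(N_CODES)


def classify(pooled: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Return one relevance score per industry code for each description."""
    x = np.concatenate([pooled, features], axis=1)  # enrich BERT output
    logits = x @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)  # softmax over codes


scores = classify(pooled, features)
print(scores.shape)
```

Concatenating before the final layer lets the added weights learn how much to trust the hand-crafted features relative to the BERT representation.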
The final classifier was then trained on historical pairs of business descriptions and industry codes. It takes a business description as input and outputs a relevance score for each industry code. The five highest-scoring industry codes are then sent back, in ranked order, as the response to the original user request.
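The final ranking step can be sketched in a few lines. The function name `top_k_codes`, the code labels, and the scores below are made up for illustration; only the "take the five highest scores, best first" logic comes from the description above.

```python
import heapq


def top_k_codes(scores: dict[str, float], k: int = 5) -> list[tuple[str, float]]:
    """Return the k highest-scoring (code, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Placeholder relevance scores for seven hypothetical industry codes.
scores = {"A": 0.02, "B": 0.41, "C": 0.07, "D": 0.19, "E": 0.11, "F": 0.15, "G": 0.05}

top5 = top_k_codes(scores)
for code, score in top5:
    print(code, score)
```

`heapq.nlargest` avoids sorting the full score list, which is a small convenience when the number of industry codes is much larger than five.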