© unsplash/@youxventures

Back Office Automation

Extraction of entities from invoices and orders in SAP

Context

Incoming invoices and orders from external companies arrive in a variety of formats. To process the documents further, the most relevant information needs to be extracted and entered into the company's software, such as SAP. For incoming invoices, at least the bank account number, the net amount, the value added tax, the supplier's entity name, the due date for payment and the invoice date have to be extracted.
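As an illustration, the minimum set of invoice fields listed above could be held in a small typed record before it is handed to the target system. This is only a sketch: the class name, the field names and the example values are assumptions made for illustration, not an SAP schema.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class InvoiceFields:
    """Minimum set of extracted invoice fields; all names here are hypothetical."""

    bank_account_number: str  # kept as text, e.g. an IBAN
    net_amount: Decimal       # Decimal avoids float rounding on money
    value_added_tax: Decimal
    supplier_name: str
    due_date: date
    invoice_date: date

    @property
    def gross_amount(self) -> Decimal:
        # Derived value, usable as a plausibility check against the printed total.
        return self.net_amount + self.value_added_tax


example = InvoiceFields(
    bank_account_number="DE00 0000 0000 0000 0000 00",  # placeholder, not a real account
    net_amount=Decimal("100.00"),
    value_added_tax=Decimal("19.00"),
    supplier_name="Example Supplier GmbH",
    due_date=date(2024, 2, 14),
    invoice_date=date(2024, 1, 15),
)
```

Typed fields make simple consistency checks possible, for example that the due date does not precede the invoice date.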

Challenges

As the invoices and orders come from a variety of suppliers and clients, there is no uniform extraction method that can be applied. This challenge is even more pronounced if the company is active in different countries with different languages and formats, and has a high number of clients or suppliers with only occasional orders or deliveries. Optical character recognition (OCR) and template-based solutions will thus, in most cases, not deliver satisfactory results.

Potential solution approaches

As a first step, the document content has to be extracted with OCR tools such as Tesseract, ABBYY FineReader or the Google Vision API. Once the right OCR tool is chosen, a data pipeline with a large number of labels ('ground truth data') has to be established so that a machine learning (ML) model can be trained. For information inside a table, table segmentation needs to be applied to locate an entity's position within the table.
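One simplified building block of table segmentation is grouping OCR words into rows by their vertical position. The sketch below assumes that the OCR step yields each word with pixel coordinates; the Word type, the function name and the tolerance value are hypothetical, and real table segmentation usually needs considerably more than this heuristic.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word box, in pixels
    y: float  # vertical centre of the word box, in pixels


def group_into_rows(words: List[Word], y_tolerance: float = 5.0) -> List[List[str]]:
    """Cluster words into rows by vertical position, then order each row left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Stay in the current row while the word is close to the row's first word.
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


words = [
    Word("19.00", 300, 101),
    Word("Net", 50, 60),
    Word("VAT", 50, 100),
    Word("100.00", 300, 62),
]
table = group_into_rows(words)
```

Once the rows are recovered, a label such as "VAT" and its value end up in the same row, which makes the value's position within the table explicit.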

For simple extraction fields such as dates, rule-based approaches might be sufficiently reliable. For more complex fields, options include random forests or Naive Bayes classifiers (for example on TF-IDF features) and LSTM networks.
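A rule-based approach for dates can be sketched with regular expressions. The two patterns below cover only ISO dates and day-first numeric dates, which is an assumption about the input; month-first formats and written-out month names would need further rules. The function name is hypothetical.

```python
import re
from datetime import date
from typing import List

# Two common numeric layouts: ISO (2024-01-15) and day-first (15.01.2024 or 15/01/2024).
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
DAY_FIRST_DATE = re.compile(r"\b(\d{1,2})[./](\d{1,2})[./](\d{4})\b")


def extract_dates(text: str) -> List[date]:
    """Return all valid dates found in text, ISO matches first."""
    candidates = [(int(y), int(m), int(d)) for y, m, d in ISO_DATE.findall(text)]
    candidates += [(int(y), int(m), int(d)) for d, m, y in DAY_FIRST_DATE.findall(text)]
    found = []
    for year, month, day in candidates:
        try:
            found.append(date(year, month, day))
        except ValueError:
            # The digits matched a pattern but do not form a real calendar date.
            continue
    return found


dates = extract_dates("Invoice date: 15.01.2024, due 2024-02-14, ref 31.02.2024")
```

Validating each match against the calendar filters out strings that merely look like dates, such as 31.02.2024 in the example.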

Graph neural networks are a further option to map relationships between entities (such as due date and payment date). In most cases, a combination of regular expressions, traditional machine learning methods and deep neural networks delivers the best results.
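One possible way to combine these methods is a cascade: a precise regular expression is tried first, and a learned model is consulted only when the rule finds nothing. The sketch below is an assumption about how such a combination might look, not a description of the actual solution; the model is reduced to a hypothetical callable, and stub_model is a stand-in that returns a fixed guess.

```python
import re
from typing import Callable, Optional, Tuple

# A model is represented only by its interface: text in, (value, confidence) out.
Model = Callable[[str], Tuple[Optional[str], float]]

DUE_DATE_RULE = re.compile(r"due date[:\s]+(\d{4}-\d{2}-\d{2})", re.IGNORECASE)


def extract_due_date(text: str, model: Model, min_confidence: float = 0.8) -> Optional[str]:
    """Try the precise rule first; fall back to the model if the rule finds nothing."""
    match = DUE_DATE_RULE.search(text)
    if match:
        return match.group(1)
    value, confidence = model(text)
    # Low-confidence predictions are dropped so they can go to manual review instead.
    return value if confidence >= min_confidence else None


def stub_model(text: str) -> Tuple[Optional[str], float]:
    # Stand-in for a trained classifier or neural network; returns a fixed guess.
    return "2024-02-14", 0.9


by_rule = extract_due_date("Due date: 2024-03-01", stub_model)
by_model = extract_due_date("Please pay within 30 days", stub_model)
rejected = extract_due_date("Please pay within 30 days", stub_model, min_confidence=0.95)
```

The confidence threshold is a design choice: a higher value sends more documents to manual review, while a lower value accepts more model predictions unchecked.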

Related Case Studies

Natural Language Processing

Legal review of rental contracts

Different methods from the field of NLP helped us to create software that spots errors in legal contracts.

Related webinars

Text recognition (OCR) - The first step on the way to a successful implementation of an NLP project

In this talk we will deal with the topic of text recognition.

Ewelina Fiebig

Machine Learning Scientist

Fabian Gringel

Machine Learning Scientist

Labeling Tools - The second step on the way to the successful implementation of an NLP project

The success of an NLP project consists of a series of steps from data preparation to modeling and deployment. Since the input data are often scanned documents, the data preparation step initially involves the use of text recognition tools (OCR for short) and later on also the use of so-called labeling tools. In this webinar we will deal with the topic of selecting a suitable labeling tool.

Ewelina Fiebig

Machine Learning Scientist

Fabian Gringel

Machine Learning Scientist

Semantic search and understanding of natural text with neural networks: BERT

In this webinar you will get an introduction to the application of BERT for Semantic Search using a real case study: Every year millions of citizens interact with public authorities and are regularly overwhelmed by the technical language used there. We have successfully used BERT to deliver the right answer from government documents with the help of colloquial queries - without having to use technical terms in the queries.

Konrad Schultka

Machine Learning Scientist

Jona Welsch

Machine Learning Scientist

Recurrent neural networks: How computers learn to read

The webinar will give an introduction to the functioning of RNNs and illustrate their use in an example project from the field of legal tech.

Fabian Gringel

Machine Learning Scientist