© unsplash/@mbaumi

Back Office Automation

Extracting information from tables

Context

Information in documents often comes in the form of tables. For example, landlords send tenants a service charge statement once a year, which gives an overview of the service charges for their apartment. Most of these tables include core information about the total costs to the tenant, so a machine learning solution must be able to extract the right entries and validate their values.

Challenges

Although this looks like a standard machine learning application, one major challenge lies in recognizing and extracting the tabular data. Without a table recognition step up front, many OCR tools simply extract the text as one continuous stream and lose the tabular structure. The right combination of table recognition algorithm and OCR is therefore essential for extracting tabular data.
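To illustrate what recovering the tabular structure involves, here is a minimal sketch that groups OCR word boxes into rows by their vertical position. The `WordBox` type, the `group_into_rows` helper and the `y_tolerance` parameter are assumptions made for this example and do not belong to any particular OCR tool. The sketch also assumes one word per cell; real documents usually need an additional column clustering step.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """A single OCR word with a simplified position (hypothetical structure)."""
    text: str
    x: float  # left edge of the word's bounding box
    y: float  # vertical centre of the word's bounding box


def group_into_rows(words, y_tolerance=5.0):
    """Cluster words into table rows.

    A word joins the current row if its vertical centre lies within
    y_tolerance of the first word of that row; otherwise it starts a new row.
    """
    rows = []
    for word in sorted(words, key=lambda w: w.y):
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, order the words from left to right.
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]
```

An OCR tool that returns only continuous text would emit these words as a flat sequence; grouping by position is what turns them back into label and amount pairs.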

Moreover, tabular data in a contract can be specific to a client or a client group, or it can belong to a list of general conditions that might - or might not - apply to the client. Therefore, the relevance of the table in the context of the entire document needs to be examined carefully.

Potential solution approaches

As a first step, OCR tools such as ABBYY FineReader or Amazon Textract are recommended, since they come with table extraction out of the box. However, the performance of generic tools is often not sufficient. For specific use cases, additional domain knowledge can be used to achieve significantly better results, in particular when the type of document can be narrowed down (e.g. only invoices).
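As a rough sketch of the out-of-the-box route, the following Python code requests table analysis from Amazon Textract via `boto3` and converts the returned blocks into plain row/column grids. The helper names `analyze_file` and `textract_tables_to_grids` are assumptions made for this example; the code ignores merged cells and selection elements, and it presumes that AWS credentials are already configured.

```python
def textract_tables_to_grids(response):
    """Turn a Textract AnalyzeDocument response into one grid per table.

    Each grid is a list of rows, and each row is a list of cell strings.
    """
    blocks = {block["Id"]: block for block in response["Blocks"]}

    def child_ids(block):
        # Collect the ids of all blocks linked through CHILD relationships.
        return [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]

    grids = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks[i]["Text"]
                for i in child_ids(cell)
                if blocks[i]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in the response.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        grids.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return grids


def analyze_file(path):
    """Send a document image to Textract and request table analysis."""
    import boto3  # imported here so the parser above works without AWS access

    client = boto3.client("textract")
    with open(path, "rb") as f:
        return client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES"],
        )
```

Keeping the parser separate from the API call makes it possible to check the grid reconstruction against stored responses before any domain-specific post-processing is added.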

Custom solutions can be based on rather simple regular expression/string matching techniques or make use of sophisticated network architectures like graph neural networks (modelling the geometric relations between, e.g., word boxes) or convolutional neural networks (using the document image as input). Often a combination of different approaches yields the best result.
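At the simple end of that spectrum, a regular expression can already pull line items out of OCR text and support a basic plausibility check. The sketch below assumes a hypothetical line layout (a label followed by an amount such as `1,234.56` and an optional `EUR`) and a row labelled `Total`; the pattern and the helper names are illustrative and would need adapting to real statements.

```python
import re
from decimal import Decimal

# Hypothetical layout: "<label>  <amount>[ EUR]", e.g. "Heating  420.00 EUR".
LINE_PATTERN = re.compile(
    r"^(?P<label>.+?)\s+(?P<amount>\d{1,3}(?:,\d{3})*\.\d{2})(?:\s*EUR)?\s*$"
)


def parse_line_items(text):
    """Map each recognised label to its amount; skip lines that do not match."""
    items = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            amount = match["amount"].replace(",", "")  # drop thousands separators
            items[match["label"].strip()] = Decimal(amount)
    return items


def total_is_consistent(items, total_label="Total"):
    """Check that the individual items add up to the stated total."""
    total = items.get(total_label)
    if total is None:
        return False
    return sum(v for k, v in items.items() if k != total_label) == total
```

Using `Decimal` instead of `float` avoids rounding artefacts when the extracted amounts are summed and compared with the stated total.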

Related webinars

Text recognition (OCR) - The first step on the way to a successful implementation of an NLP project

In this talk we will deal with the topic of text recognition.

Ewelina Fiebig

Machine Learning Scientist

Fabian Gringel

Machine Learning Scientist

Labeling Tools - The second step on the way to the successful implementation of an NLP project

The success of an NLP project depends on a series of steps, from data preparation to modeling and deployment. Since the input data are often scanned documents, the data preparation step initially involves text recognition tools (OCR for short) and later on so-called labeling tools. In this webinar we will deal with the topic of selecting a suitable labeling tool.

Ewelina Fiebig

Machine Learning Scientist

Fabian Gringel

Machine Learning Scientist

Recurrent neural networks: How computers learn to read

The webinar will give an introduction to how RNNs work and illustrate their use in an example project from the field of legal tech.

Fabian Gringel

Machine Learning Scientist