Extracting information from tables
Information in documents often comes in the form of tables. For example, landlords send tenants a yearly service charge statement that gives an overview of the service charges for their apartment. Most such tables contain core information about the total costs to the tenant, so a machine learning solution must be able to extract the right values and validate them.
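To make the validation step concrete, here is a minimal sketch of one plausible check: that the extracted line items add up to the stated total. The function name `total_is_consistent`, the dictionary layout, and the rounding tolerance are assumptions chosen for illustration, not part of any particular tool.

```python
from decimal import Decimal


def total_is_consistent(line_items, stated_total, tolerance=Decimal("0.01")):
    """Return True if the line items sum to the stated total.

    `line_items` maps a (hypothetical) cost label to its amount.
    `tolerance` absorbs small rounding differences in the statement.
    """
    computed_total = sum(line_items.values(), Decimal("0"))
    return abs(computed_total - stated_total) <= tolerance


# Hypothetical values as they might come out of an extraction step.
items = {"Heating": Decimal("412.50"), "Water": Decimal("96.30")}
consistent = total_is_consistent(items, Decimal("508.80"))
```

Using `Decimal` rather than `float` avoids binary rounding artefacts when comparing monetary amounts.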
Although this appears to be a standard machine learning application, a major challenge lies in recognizing and extracting the tabular data. Without a table recognition step up front, many OCR tools simply extract continuous text and lose the tabular structure. The right combination of table recognition algorithm and OCR is therefore essential for extracting tabular data.
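As a rough illustration of what recovering tabular structure can mean, the sketch below groups OCR word boxes into rows by their vertical position and orders each row from left to right. The `WordBox` type, its fields, and the `y_tolerance` value are assumptions for this example. Real table recognition also has to handle skewed scans, multi-line cells, and column detection, none of which is covered here.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    text: str
    x: float  # left edge of the word box, in pixels
    y: float  # vertical centre of the word box, in pixels


def group_into_rows(words, y_tolerance=5.0):
    """Group word boxes into rows, then sort each row left to right.

    A word joins the current row if its vertical centre is within
    `y_tolerance` of the previously added word; otherwise it starts
    a new row.
    """
    rows = []
    for word in sorted(words, key=lambda w: w.y):
        if rows and abs(word.y - rows[-1][-1].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


# Hypothetical OCR output, deliberately out of reading order.
words = [
    WordBox("412.50", 200, 101),
    WordBox("Water", 10, 120),
    WordBox("Heating", 10, 100),
    WordBox("96.30", 200, 119),
]
table = group_into_rows(words)
```

An OCR tool that ignores the geometry would return these four words as one continuous string, whereas the grouping restores one list per table row.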
Moreover, tabular data in a contract can be specific to a client or client group, or it can belong to a list of general conditions that may or may not apply to the client. The relevance of a table therefore needs to be examined carefully in the context of the entire document.
Potential solution approaches
As a first step, OCR tools that come with table extraction out of the box, such as ABBYY FineReader or Amazon Textract, are recommended.
However, the performance of generic tools is often not sufficient. For specific use cases, additional domain knowledge can yield significantly better results, particularly when the type of document can be narrowed down (e.g. only invoices).
Custom solutions can be based on fairly simple regular expression or string matching techniques, or they can use more sophisticated architectures such as graph neural networks (which model geometric relations between, e.g., word boxes) or convolutional neural networks (which take the document image as input). Often a combination of approaches yields the best result.
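On the simple end of that spectrum, a regular expression can already pull line items out of OCR text when the layout is known. The sketch below assumes a hypothetical format in which each row reads "label amount EUR", with a dot as decimal separator and optional commas as thousands separators. The pattern and the function name `parse_line_items` are illustrative and would need adapting to real statements.

```python
import re
from decimal import Decimal

# Assumed row layout: "<label> <amount> EUR", e.g. "Heating 412.50 EUR".
LINE_ITEM = re.compile(
    r"^(?P<label>.+?)\s+(?P<amount>\d+(?:,\d{3})*\.\d{2})\s*EUR$"
)


def parse_line_items(text):
    """Return a dict of label -> Decimal amount for every matching line."""
    items = {}
    for line in text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match:
            # Drop thousands separators before converting to Decimal.
            items[match["label"]] = Decimal(match["amount"].replace(",", ""))
    return items


# Hypothetical OCR text of a service charge statement.
ocr_text = """Service charge statement 2023
Heating 412.50 EUR
Water supply 96.30 EUR
Total 1,508.80 EUR"""
parsed = parse_line_items(ocr_text)
```

Lines that do not fit the assumed layout, such as the title line, are skipped silently. In practice one might log them so that unexpected formats become visible.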