Machine Learning for Information Extraction from Contracts

Context

Identifying and validating contract clauses is time consuming and ties up well-trained, expensive lawyers. This is especially true when the service is offered through an online customer platform where users upload their contracts for review.
An automated service should identify the relevant clauses and check their validity against current law.

Challenges

As a first step, processing contract documents requires a suitable optical character recognition (OCR) tool to extract the text from uploaded images or PDF documents. Because this data is uploaded by users, its quality can vary widely, and the extracted text needs additional post-processing.
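The post-processing step could look roughly like the sketch below. It is a minimal illustration, not the pipeline used in this use case: the function name `clean_ocr_text` and the specific cleanup rules (ligature normalization, rejoining words split across line breaks, whitespace cleanup) are assumptions about what typical OCR output needs.

```python
import re
import unicodedata


def clean_ocr_text(raw):
    """Apply a few generic cleanup rules to raw OCR output (illustrative only)."""
    # Normalize Unicode compatibility forms, e.g. the "fi" ligature becomes "fi"
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "termi-\nnation" -> "termination"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Remove leading and trailing spaces on every line
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Reduce three or more newlines to a single paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "The  termi-\nnation   clause\n\n\n\nis \ufb01nal. "
print(clean_ocr_text(sample))
```

Note that the hyphen rule also merges genuinely hyphenated compounds that happen to break at a line end, so a real pipeline would probably check the joined word against a dictionary first.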

The right paragraph can be identified with keyword-based regular expressions or by matching against custom dictionaries. The analysis of contract clauses can then be automated with a machine learning model that extracts the central, relevant information from the identified paragraph and clause by interpreting the meaning of text passages, single sentences and titles.
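A keyword-based paragraph lookup of this kind can be sketched in a few lines. The dictionary `CLAUSE_KEYWORDS`, the clause types and the sample contract below are hypothetical and only illustrate the idea; a real dictionary would be curated by legal experts.

```python
import re

# Hypothetical dictionary: clause type -> keyword patterns (assumed for illustration)
CLAUSE_KEYWORDS = {
    "rent": [r"\brent\b", r"\bper square meter\b"],
    "termination": [r"\bterminat\w*", r"\bnotice period\b"],
    "deposit": [r"\bdeposit\b"],
}


def split_paragraphs(text):
    """Split text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def find_clauses(text):
    """Return (clause_type, paragraph) pairs for paragraphs matching a keyword."""
    hits = []
    for paragraph in split_paragraphs(text):
        for clause_type, patterns in CLAUSE_KEYWORDS.items():
            if any(re.search(p, paragraph, flags=re.IGNORECASE) for p in patterns):
                hits.append((clause_type, paragraph))
    return hits


contract = """1. Rent
The monthly rent is 950 EUR.

2. Termination
Either party may terminate the agreement with a notice period of three months.

3. Miscellaneous
This agreement is governed by German law."""

for clause_type, paragraph in find_clauses(contract):
    print(clause_type, "->", paragraph.splitlines()[0])
```

Only the first two paragraphs are returned here; the third contains none of the keywords and is skipped.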

Moreover, training data has to be collected in which the clauses in contract documents are labeled with information about their validity. As the reasons for the validity or invalidity of a clause can be manifold, the labeling process requires legal knowledge, along with handcrafted features and rules of thumb, so that the system outputs reasonable suggestions, based on the latest law, for the lawyers handling the customer request. As the law can change over time, e.g. with the capping of rents per square meter in Berlin, the algorithm needs to be adapted to reflect legislative changes.
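One way to keep such rules of thumb adaptable is to store them as dated data rather than hard-coding them. The sketch below assumes a rent-cap rule; the class `RentCapRule`, the dates and the EUR-per-square-meter limits are placeholders invented for illustration and are not real legal values.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class RentCapRule:
    valid_from: date
    max_eur_per_sqm: float  # placeholder value, NOT a real legal limit


# Hypothetical rule table; new entries are appended when the law changes
RULES = [
    RentCapRule(date(2020, 1, 1), 10.0),
    RentCapRule(date(2022, 1, 1), 12.0),
]


def rule_for(contract_date):
    """Return the most recent rule in force on the contract date, if any."""
    applicable = [r for r in RULES if r.valid_from <= contract_date]
    return max(applicable, key=lambda r: r.valid_from) if applicable else None


def check_rent_clause(clause, contract_date):
    """Return a suggestion for the reviewing lawyer, not a legal verdict."""
    rent = re.search(r"(\d+(?:\.\d+)?)\s*EUR", clause)
    area = re.search(r"(\d+(?:\.\d+)?)\s*(?:m2|sqm|square meters)", clause)
    rule = rule_for(contract_date)
    if not (rent and area and rule):
        return "needs manual review"
    per_sqm = float(rent.group(1)) / float(area.group(1))
    return "possibly invalid" if per_sqm > rule.max_eur_per_sqm else "no issue found"


clause = "The monthly rent for the 50 sqm apartment is 650 EUR."
print(check_rent_clause(clause, date(2021, 6, 1)))
```

The check deliberately falls back to "needs manual review" whenever a value cannot be extracted or no rule applies, so that uncertain cases still reach a lawyer.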

Potential solution approaches

In a first step, the documents need to be digitized using an OCR tool such as ABBYY FineReader or Google Cloud Vision, which in our experience perform best on document images of varying quality.

Identifying the relevant text passages can be accomplished with a natural language processing (NLP) model that learns the relations between words and sentences. Commonly used techniques for text classification tasks are TF-IDF features, Naive Bayes classifiers, word embedding methods and LSTM networks.
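To make the classification idea concrete, here is a minimal sketch of one of the listed techniques, a multinomial Naive Bayes classifier over plain word counts with add-one smoothing. It is written with the standard library only to stay self-contained; the class name, the four training sentences and the labels are made up for illustration, and a production system would more likely rely on an established machine learning library and far more labeled data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes on word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data, invented for this example
train_texts = [
    "The tenant shall pay a monthly rent of 900 EUR",
    "Rent is due on the third working day of each month",
    "Either party may terminate this agreement with three months notice",
    "The landlord may terminate the lease for good cause",
]
train_labels = ["rent", "rent", "termination", "termination"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("The rent is payable each month"))
print(model.predict("The landlord may terminate with notice"))
```

With this toy data the first query is assigned to the rent class and the second to the termination class. Four sentences are of course far too few to generalize, so the example only shows the mechanics.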

More Use Cases in Back Office

Automated question answering for CVs of candidates

Classification of customer requests

Classification of PDF documents

Extracting information from tables

Extraction of entities from invoices and orders in SAP

Identification and validation of contract clauses