© unsplash/@cytonn_photography
Back Office Automation

Identification and validation of contract clauses

Context

The identification and validation of contract clauses can be very time-consuming and requires well-trained, expensive lawyers. This is especially true if the service is offered via an online customer platform where users can upload their contracts for review. The automated service should be able to identify the relevant clauses and validate their rightfulness against current case law.

Challenges

Processing contract documents first requires an appropriate optical character recognition (OCR) tool to extract the text from uploaded images or PDF documents. Since this data is uploaded by users, its quality can vary widely and may require additional post-processing.
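Such post-processing can start with simple normalization of the raw OCR output. A minimal sketch in Python (the cleanup rules shown here are illustrative assumptions, not an exhaustive pipeline):

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Normalize noisy OCR output before further processing."""
    text = raw.replace("\u00ad", "")         # remove soft hyphens
    text = re.sub(r"-\n(\w)", r"\1", text)   # join words hyphenated across line breaks
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # limit consecutive blank lines
    return text.strip()

noisy = "The ten-\nant   shall pay\n\n\n\nthe rent."
print(clean_ocr_text(noisy))  # prints "The tenant shall pay", blank line, "the rent."
```

In practice the rule set would grow with the observed OCR error modes, e.g. confusions between similar glyphs or broken table layouts.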

The relevant paragraphs can be identified using regular expressions built from keywords, or by matching against custom dictionaries. The analysis of contract clauses can then be automated with a machine learning model that extracts the central and relevant information from the identified paragraphs and clauses by understanding the meaning of text passages, individual sentences and titles.
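The keyword-based identification step can be sketched with regular expressions; the clause types and keywords below are hypothetical examples, not an actual rule set:

```python
import re

# Hypothetical keyword patterns signalling clause types of interest.
CLAUSE_PATTERNS = {
    "deposit": re.compile(r"\b(deposit|security payment)\b", re.IGNORECASE),
    "termination": re.compile(r"\b(terminat\w+|notice period)\b", re.IGNORECASE),
    "rent_increase": re.compile(r"\b(rent increase|indexation)\b", re.IGNORECASE),
}

def identify_clauses(paragraphs):
    """Return (clause_type, paragraph) pairs for paragraphs matching a keyword pattern."""
    hits = []
    for para in paragraphs:
        for clause_type, pattern in CLAUSE_PATTERNS.items():
            if pattern.search(para):
                hits.append((clause_type, para))
    return hits

contract = [
    "The tenant shall pay a deposit of three monthly rents.",
    "The contract may be terminated with a notice period of three months.",
    "The apartment is located on the second floor.",
]
for clause_type, para in identify_clauses(contract):
    print(clause_type, "->", para)
```

Paragraphs flagged this way would then be passed on to the machine learning model for the actual clause analysis.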

Moreover, training data has to be collected so that the clauses in contract documents are labeled with information about their validity. As the reasons for the (in)validity of clauses can be manifold, the labeling process requires legal knowledge, as well as handcrafted features and rules of thumb, to output reasonable suggestions, based on the latest case law, for the lawyers handling the customer request. Since the law may change over time, e.g. with the capping of rents per square meter in Berlin, the algorithm needs to be adapted to reflect legislative changes.
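The labeled training data could, for instance, take the form of clause texts paired with a validity label and a short reason supplied by a legal expert. A sketch of such a schema (the field names and example records are invented for illustration):

```python
# Illustrative schema for expert-labeled clauses; the records are invented examples.
labeled_clauses = [
    {
        "text": "The deposit amounts to five monthly rents.",
        "valid": False,
        "reason": "Deposit exceeds the permissible maximum of three monthly rents.",
    },
    {
        "text": "The notice period for the tenant is three months.",
        "valid": True,
        "reason": "Consistent with the statutory notice period.",
    },
]

# Simple sanity check of the label distribution before training.
invalid = [c for c in labeled_clauses if not c["valid"]]
print(f"{len(invalid)} of {len(labeled_clauses)} clauses labeled invalid")
```

Keeping the expert's reason alongside the label makes it easier to audit the dataset when the underlying case law changes.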

Potential solution approaches

In a first step, the documents need to be digitized using an OCR tool such as ABBYY FineReader or Google Cloud Vision, which in our experience perform best on document images of varying quality.

Identifying the relevant text passages can be accomplished with a natural language processing (NLP) model that learns the relations between words and sentences. Commonly used techniques for text classification tasks include TF-IDF features, Naive Bayes classifiers, word embedding methods and LSTM networks.
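A minimal baseline combining two of these techniques, TF-IDF features with a Naive Bayes classifier, can be sketched with scikit-learn (the toy clauses and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy training set: clause texts with validity labels.
clauses = [
    "The deposit is limited to three monthly rents.",
    "The notice period of three months applies to both parties.",
    "The landlord may terminate the contract at any time without notice.",
    "The tenant waives all statutory rights.",
]
labels = ["valid", "valid", "invalid", "invalid"]

# TF-IDF turns each clause into a weighted bag-of-words vector;
# Naive Bayes then learns per-class word likelihoods from those vectors.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(clauses, labels)

print(model.predict(["The tenant waives any statutory notice rights."])[0])
```

Such a baseline is easy to train and interpret; embedding-based models or LSTMs would typically replace it once enough labeled data is available.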

Related Case Studies

Natural Language Processing

Legal review of rental contracts

Different methods from the field of NLP helped us to create a software that spots errors in legal contracts.

Related webinars

Text recognition (OCR) - The first step on the way to a successful implementation of an NLP project

In this talk we will deal with the topic of text recognition.

Ewelina Fiebig

Machine Learning Scientist

Fabian Gringel

Machine Learning Scientist

Labeling Tools - The second step on the way to the successful implementation of an NLP project

The success of an NLP project consists of a series of steps from data preparation to modeling and deployment. Since the input data are often scanned documents, the data preparation step initially involves the use of text recognition tools (OCR for short) and later on also the use of so-called labeling tools. In this webinar we will deal with the topic of selecting a suitable labeling tool.

Ewelina Fiebig

Machine Learning Scientist

Fabian Gringel

Machine Learning Scientist

Semantic search and understanding of natural text with neural networks: BERT

In this webinar you will get an introduction to the application of BERT for Semantic Search using a real case study: Every year millions of citizens interact with public authorities and are regularly overwhelmed by the technical language used there. We have successfully used BERT to deliver the right answer from government documents with the help of colloquial queries - without having to use technical terms in the queries.

Konrad Schultka

Machine Learning Scientist

Jona Welsch

Machine Learning Scientist