Our software protects tenants against excessive service charge bills.
Scanned service charge statement
Assessment of the validity of the cost items
Lower the cost of the review process
Our client is an online tenant protection club that provides its customers with the expertise of tenancy lawyers. Many customer inquiries concern the validity of service charge statements.
These may be invalid for various reasons:
Cost items not allocable to tenants
Costs disproportionately high
Distribution key inadmissible
Statement not issued on time
Incorrect billing period
Statement and allocation of costs not comprehensible
Until now, tenancy law experts have checked every incoming service charge statement manually, which is time-consuming and correspondingly expensive.
The goal of the project was to develop AI-powered software that automatically checks service charge statements for these errors.
The check should run live, i.e. it should take no more than a few seconds.
The source data are mostly scanned documents uploaded by customers. Therefore, the software must be robust against poor scan quality.
The results must be interpretable and transparent so that they can be reviewed by a legal expert if necessary.
The logic of the checks should be adaptable to potential future legal changes.
To develop the algorithms, we combined techniques from the fields of Natural Language Processing and Computer Vision, including:
OCR (optical character recognition)
Fuzzy string search
Neural Networks (R-CNN)
The developed software checks a three-page service charge statement in about 10 seconds and achieves accuracies of 88-95% for the checks of the different error types. It is thus on par with the performance of a tenancy law expert.
In almost all service charge statements, a large part of the relevant information is summarized in a single table. The recognition and extraction of this table are essential for many steps in the checks.
Since the performance of existing commercial and open source solutions for table extraction was not sufficient in tests (only 60-70% of tables were correctly recognized), we decided to develop our own custom solution:
We use CascadeTabNet, a neural network for table detection, to identify the areas of the document that contain tables. This identification works purely at the image level.
Subsequently, we analyze the positions of the strings within these areas and their relative arrangement to each other in order to recognize columns and rows of the tables and to be able to read out their contents in a structured way.
Using this approach, we were able to increase the accuracy of table recognition to 93%.
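The second step, reconstructing rows and columns from string positions, can be sketched roughly as follows. This is a minimal illustration and not the project's actual implementation: it assumes that OCR delivers one text box per table cell, and the `TextBox` type, the function names and the pixel tolerances are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """One OCR string inside a detected table area (hypothetical structure)."""
    text: str
    x: float  # left edge of the bounding box, in pixels
    y: float  # vertical centre of the bounding box, in pixels


def group_rows(boxes, y_tol=5.0):
    """Group boxes into rows: a box joins the current row if its y value
    lies within y_tol of the first box of that row."""
    rows = []
    for box in sorted(boxes, key=lambda b: b.y):
        if rows and abs(box.y - rows[-1][0].y) <= y_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b.x) for row in rows]


def column_anchors(boxes, x_tol=15.0):
    """Cluster left edges: a new column starts whenever a left edge lies
    more than x_tol to the right of the current column anchor."""
    anchors = []
    for x in sorted(b.x for b in boxes):
        if not anchors or x - anchors[-1] > x_tol:
            anchors.append(x)
    return anchors


def to_grid(boxes, y_tol=5.0, x_tol=15.0):
    """Return the table as a list of rows, each a list of cell strings."""
    anchors = column_anchors(boxes, x_tol)
    grid = []
    for row in group_rows(boxes, y_tol):
        cells = [""] * len(anchors)
        for box in row:
            # Assign the box to the column with the nearest anchor.
            col = min(range(len(anchors)), key=lambda i: abs(anchors[i] - box.x))
            cells[col] = (cells[col] + " " + box.text).strip()
        grid.append(cells)
    return grid


# Made-up coordinates with slight scan jitter.
boxes = [
    TextBox("Cost item", 50, 100), TextBox("Total", 300, 101), TextBox("Your share", 420, 99),
    TextBox("Property tax", 52, 130), TextBox("1200.00", 305, 131), TextBox("100.00", 425, 129),
    TextBox("Cable fees", 49, 160), TextBox("600.00", 310, 160), TextBox("50.00", 428, 161),
]
grid = to_grid(boxes)
```

Here `grid` contains three rows of three cells each, with the header row first. Clustering on left edges is a simplification: right-aligned number columns, skewed scans or multi-line cells would need tolerances tuned to the layout or a more robust clustering step.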
Review of the cost items
Based on the extracted table, listed cost items can be read out and evaluated. We want to check whether they can actually be passed on to the tenants.
Because uploaded documents are often of poor quality, we chose an approach that is robust against OCR errors: each cost item is compared (as a string) with a list of known admissible items and a list of known inadmissible items.
The comparison is done using a fuzzy string search, which outputs a similarity value for a pair of strings to be compared:
>>> fuzz.ratio("cable fees", "cable fees") -> 100
>>> fuzz.ratio("cable fees", "cable/TV fees") -> 87
>>> fuzz.ratio("cable fees", "property tax") -> 18
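As a rough, self-contained illustration of the list comparison, the sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for `fuzz.ratio`. Scores from the library used in the project may differ slightly. The reference lists, the helper names and the threshold of 80 are hypothetical and chosen only for this example.

```python
from difflib import SequenceMatcher

# Hypothetical reference lists; the real lists are maintained by legal experts.
ADMISSIBLE = ["cable fees", "property tax", "waste disposal"]
INADMISSIBLE = ["administration costs", "repair costs"]


def ratio(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(item: str, candidates):
    """Return (score, candidate) for the most similar list entry."""
    return max((ratio(item, c), c) for c in candidates)


def classify(item: str, threshold: int = 80) -> str:
    """Label a cost item by its closest list entry; items that resemble
    nothing on either list are left for manual review."""
    ok_score, _ = best_match(item, ADMISSIBLE)
    bad_score, _ = best_match(item, INADMISSIBLE)
    if max(ok_score, bad_score) < threshold:
        return "unknown"
    return "admissible" if ok_score >= bad_score else "inadmissible"
```

With these lists, an OCR-damaged string such as `"cab1e fees"` is still matched to `"cable fees"` and labelled admissible, while a string that resembles no list entry falls into the `"unknown"` bucket.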
Because fuzzy string search algorithms differ in how they define string similarity, we trained a machine learning classifier that takes several types of similarity score into account and weights them. From these scores, the classifier estimates whether a given item is allocable.
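The idea of weighting several similarity scores can be illustrated with a small logistic regression written in plain Python. This is only a sketch under stated assumptions: the source does not say which classifier or which similarity measures were used, so the three measures, the reference lists, the toy training set and all function names below are hypothetical.

```python
import math
from difflib import SequenceMatcher

# Hypothetical reference lists and toy training data (1 = allocable).
ADMISSIBLE = ["cable fees", "property tax", "waste disposal"]
INADMISSIBLE = ["administration costs", "repair costs", "bank charges"]
TRAINING = [
    ("cab1e fees", 1), ("tax property", 1), ("waste disposa1", 1),
    ("adminstration costs", 0), ("costs repair", 0), ("bank charqes", 0),
]


def char_ratio(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def token_sort_ratio(a, b):
    """Character-level similarity after sorting the words of each string."""
    return char_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def token_jaccard(a, b):
    """Share of words that the two strings have in common."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)


MEASURES = [char_ratio, token_sort_ratio, token_jaccard]


def features(item):
    """Per measure: best score against the admissible list minus best score
    against the inadmissible list (positive values suggest 'allocable')."""
    item = item.lower()
    return [
        max(m(item, c) for c in ADMISSIBLE) - max(m(item, c) for c in INADMISSIBLE)
        for m in MEASURES
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=500, lr=0.5):
    """Fit one weight per similarity measure by batch gradient descent.
    No bias term: the difference features are already centred on zero."""
    xs = [features(text) for text, _ in samples]
    ys = [label for _, label in samples]
    w = [0.0] * len(MEASURES)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wj - lr * gj / len(xs) for wj, gj in zip(w, grad)]
    return w


def predict(w, item):
    """Estimated probability that the item is allocable."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features(item))))


weights = train(TRAINING)
```

After training, `predict(weights, "cable/TV fees")` lies above 0.5 and `predict(weights, "repair c0sts")` below it. The learned weights show how much each similarity measure contributes to the decision, which fits the requirement that results stay interpretable for a legal reviewer.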