© unsplash/@imattsmart

Back Office Automation

Classification of PDF documents

Context

Document classification is often the first step in processing incoming documents, e.g. PDF or JPG files received via email. To enable further processing and reduce response times, documents have to be classified by type and, ideally, forwarded to the right agent.

Challenges

Each company has established its own workflows for forwarding documents to the right agent, and most companies use a ticket system to cope with the volume of incoming documents. Creating a ticket and forwarding it to an agent may be rule-based (e.g. based on document type, ZIP code or receiving email account) or done manually. Automatically extracting the information needed to assign a document directly to the best-suited agent can significantly reduce processing time while improving customer and supplier relationships.
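A rule-based assignment of this kind could be sketched as follows. This is a minimal illustration, not an actual ticket-system integration: the `IncomingDocument` fields, the queue names, the example mailbox address and the ZIP code rule are all hypothetical and would have to be replaced by a company's own routing rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncomingDocument:
    doc_type: str            # e.g. "invoice"; assumed to be known already
    zip_code: Optional[str]  # sender ZIP code, if it could be extracted
    mailbox: str             # receiving email account


# Hypothetical routing table: (predicate, target queue), checked in order.
RULES = [
    (lambda d: d.doc_type == "invoice", "accounting"),
    (lambda d: d.doc_type == "claim" and (d.zip_code or "").startswith("1"),
     "claims-team-north"),
    (lambda d: d.mailbox == "support@example.com", "customer-support"),
]


def route(doc: IncomingDocument, default_queue: str = "manual-review") -> str:
    """Return the queue of the first matching rule, or the manual fallback."""
    for predicate, queue in RULES:
        if predicate(doc):
            return queue
    return default_queue


if __name__ == "__main__":
    doc = IncomingDocument("claim", "10115", "info@example.com")
    print(route(doc))
```

Documents that match no rule fall through to a manual queue, which mirrors the mixed rule-based and manual assignment described above.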

Potential solution approaches

Assuming the output of the optical character recognition (OCR) is of good quality, document classification is a standard machine learning task. A coarse classification, e.g. by document type, can often be implemented with rule-based approaches (e.g. searching for keywords like "invoice"), whereas a more fine-grained classification can be achieved by training machine learning algorithms on a labeled dataset. Algorithms that have proven effective for document classification include support-vector machines, naive Bayes and logistic regression. They are trained on vectorized representations of the text, such as the bag-of-words model or TF-IDF, which encode word frequencies rather than word order.
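To make the machine learning approach more concrete, the following is a minimal sketch of a multinomial naive Bayes classifier on bag-of-words counts, written in plain Python. The four training texts and the labels "invoice" and "cancellation" are made up for illustration; a real project would train on a much larger labeled dataset of OCR output and would typically rely on an established machine learning library rather than a hand-written classifier.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # bag-of-words per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        # Words never seen during training carry no information here.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy training data (assumption, for illustration only).
TRAIN_TEXTS = [
    "Invoice number 4711, total amount due within 30 days",
    "Please pay the invoice amount by bank transfer",
    "I want to cancel my contract at the end of the month",
    "Cancellation of my subscription, please confirm the termination",
]
TRAIN_LABELS = ["invoice", "invoice", "cancellation", "cancellation"]

classifier = NaiveBayesClassifier().fit(TRAIN_TEXTS, TRAIN_LABELS)

if __name__ == "__main__":
    print(classifier.predict("Attached is the invoice with the amount due"))
    print(classifier.predict("I would like to cancel my subscription"))
```

Replacing the raw counts with TF-IDF weights, or the classifier with logistic regression or a support-vector machine, leaves the overall structure unchanged: vectorize the text, fit on labeled examples, predict the class of new documents.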

Related webinars

Text recognition (OCR) - The first step on the way to a successful implementation of an NLP project

In this talk we will deal with the topic of text recognition.

Ewelina Fiebig

Machine Learning Scientist

Fabian Gringel

Machine Learning Scientist

Labeling Tools - The second step on the way to the successful implementation of an NLP project

The success of an NLP project depends on a series of steps, from data preparation to modeling and deployment. Since the input data are often scanned documents, the data preparation step initially involves the use of text recognition tools (OCR for short) and later also the use of so-called labeling tools. In this webinar we will deal with the topic of selecting a suitable labeling tool.

Ewelina Fiebig

Machine Learning Scientist

Fabian Gringel

Machine Learning Scientist

Semantic search and understanding of natural text with neural networks: BERT

In this webinar you will get an introduction to the application of BERT for Semantic Search using a real case study: Every year millions of citizens interact with public authorities and are regularly overwhelmed by the technical language used there. We have successfully used BERT to deliver the right answer from government documents with the help of colloquial queries - without having to use technical terms in the queries.

Konrad Schultka

Machine Learning Scientist

Jona Welsch

Machine Learning Scientist

Recurrent neural networks: How computers learn to read

The webinar will give an introduction to how RNNs work and illustrate their use in an example project from the field of legal tech.

Fabian Gringel

Machine Learning Scientist