Automatic Classification of PDF Documents

Context

Document classification is often the first step in processing incoming documents, e.g. PDF or JPG files received via email. To enable further processing and reduce response time, documents have to be classified by type and, ideally, forwarded to the right agent.

Challenges

Each company has established its own workflows for forwarding documents to the right agent, and most companies use a ticket system to cope with the volume of incoming documents. Creating a ticket and forwarding it to an agent may be rule-based (e.g. based on document type, ZIP code or receiving email account) or done manually. Extracting the relevant information so that each document can be assigned directly to the best-suited agent can significantly reduce processing time while improving customer and supplier relationships.
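As an illustration of the rule-based variant, the following minimal sketch routes a document to a queue based on its type, ZIP code or receiving email account. The queue names, mailbox address, document type labels and the rule table itself are hypothetical placeholders, not part of any specific ticket system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncomingDocument:
    doc_type: str            # e.g. "invoice", "complaint" (hypothetical labels)
    zip_code: Optional[str]  # ZIP code extracted from the document, if any
    mailbox: str             # email account that received the document


# Hypothetical routing rules, checked top to bottom; the first match wins.
# Each rule is (field name, expected value or prefix, target queue).
ROUTING_RULES = [
    ("mailbox", "billing@example.com", "accounts-payable"),
    ("doc_type", "invoice", "accounts-payable"),
    ("doc_type", "complaint", "customer-care"),
    ("zip_code", "9", "regional-office-west"),  # ZIP codes starting with 9
]

DEFAULT_QUEUE = "manual-triage"  # fallback when no rule applies


def route(doc: IncomingDocument) -> str:
    """Return the name of the queue the ticket should be assigned to."""
    for field_name, expected, queue in ROUTING_RULES:
        value = getattr(doc, field_name)
        if value is None:
            continue
        if field_name == "zip_code":
            matched = value.startswith(expected)  # prefix match for regions
        else:
            matched = value.lower() == expected   # exact, case-insensitive
        if matched:
            return queue
    return DEFAULT_QUEUE


print(route(IncomingDocument("invoice", None, "info@example.com")))
```

Keeping the rules in a plain table makes the order of precedence explicit, and documents that match no rule fall back to manual assignment.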

Potential solution approaches

Assuming the output of optical character recognition (OCR) is of good quality, document classification is a standard machine learning task. While a coarse classification, e.g. by document type, might be implemented with rule-based approaches (e.g. searching for keywords like "invoice"), a more detailed classification can be achieved by training machine learning algorithms on a labeled dataset. Algorithms proven to be effective in document classification tasks include support-vector machines, naive Bayes and logistic regression. These algorithms are trained on vectorized representations of the document text, such as bag-of-words counts or TF-IDF weights.
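To make the machine learning approach concrete, here is a minimal sketch of a multinomial naive Bayes classifier trained on bag-of-words counts, written with the Python standard library only. The four training texts and the labels "invoice" and "order" are invented toy data standing in for OCR output; in practice a library such as scikit-learn and a much larger labeled dataset would normally be used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes over bag-of-words counts, add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_tokens = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior of the class
            for token in tokenize(text):
                if token not in self.vocabulary:
                    continue  # ignore words never seen during training
                # Smoothed log likelihood of the token given the class.
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data (invented for illustration only).
train_texts = [
    "Invoice number 1001, total amount due within 30 days",
    "Please find attached the invoice for your recent payment",
    "Purchase order for 20 units, delivery requested next week",
    "Order confirmation: items shipped to delivery address",
]
train_labels = ["invoice", "invoice", "order", "order"]

model = NaiveBayesClassifier().fit(train_texts, train_labels)
print(model.predict("Amount due for invoice 1002"))
```

The sketch uses raw word counts; replacing them with TF-IDF weights, or swapping in a support-vector machine or logistic regression, changes the vectorization or the model but not the overall train-then-predict workflow.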

More Use Cases in Back Office

Automated question answering for CVs of candidates

Classification of customer requests

Classification of PDF documents

Extracting information from tables

Extraction of entities from invoices and orders in SAP

Identification and validation of contract clauses