© unsplash/@imattsmart


Classification of PDF documents

Use Case
Back Office


Document classification is often the first step in processing incoming documents, e.g. PDF or JPG files received via email. To enable further processing and reduce response times, documents have to be classified by type and, ideally, forwarded to the right agent.


Each company has established its own workflows for forwarding documents to the right agent, and most companies use a ticket system to cope with the volume of incoming documents. Creating a ticket and forwarding it to an agent may be rule-based (e.g. based on document type, ZIP code or receiving email account) or done manually. Automatically extracting the information needed for a direct assignment can significantly reduce the time to process incoming documents while improving customer and supplier relationships.
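To make the rule-based variant concrete, here is a minimal sketch of routing by document type, ZIP code and receiving email account. The field names, queue names, email address and the rules themselves are hypothetical and only illustrate the idea; a real workflow would load its rules from the company's own ticket system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncomingDocument:
    """Metadata assumed to be available after an upstream extraction step."""
    doc_type: str            # e.g. "invoice" or "claim" (hypothetical labels)
    zip_code: Optional[str]  # ZIP code found in the document, if any
    inbox: str               # email account that received the document


def route(doc: IncomingDocument) -> str:
    """Return the name of a (hypothetical) ticket queue for a document."""
    # Rule 1: a dedicated inbox or the document type identifies invoices.
    if doc.inbox == "invoices@example.com" or doc.doc_type == "invoice":
        return "accounts-payable"
    # Rule 2: claims are split by region, using the first ZIP digit.
    if doc.doc_type == "claim" and doc.zip_code and doc.zip_code.startswith("1"):
        return "claims-region-east"
    if doc.doc_type == "claim":
        return "claims-general"
    # Fallback: anything the rules do not cover goes to a human.
    return "manual-review"


queue = route(IncomingDocument(doc_type="claim", zip_code="10115", inbox="info@example.com"))
```

Keeping an explicit fallback queue means that documents the rules do not cover still reach a person instead of being dropped.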

Potential solution approaches

Assuming the output of the optical character recognition (OCR) is of good quality, document classification is a standard machine learning task. While a coarse classification, e.g. by document type, might be implemented with rule-based approaches (e.g. searching for keywords like "invoice"), a more detailed classification can be achieved by training machine learning algorithms on a labeled dataset. Algorithms proven to be effective in document classification tasks include support-vector machines, naive Bayes and logistic regression. These algorithms are trained on vectorized representations of the document text, such as the bag-of-words model or TF-IDF.
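The two approaches can be sketched in a few lines of plain Python. The example below assumes the OCR text is already available as a string. It shows a keyword rule for the coarse type and a small multinomial naive Bayes classifier trained on bag-of-words counts. The class and function names, the labels and the four training sentences are invented for illustration; in practice one would more likely use an established library and a much larger labeled dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_type(text):
    """Coarse rule-based check: flag a document that mentions 'invoice'."""
    return "invoice" if "invoice" in tokenize(text) else None


class NaiveBayesClassifier:
    """Multinomial naive Bayes on bag-of-words counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen during training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:                        # log likelihoods
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch.
texts = [
    "Invoice number 4711, total amount due within 30 days",
    "Please pay the invoice amount by bank transfer",
    "I would like to cancel my contract at the end of the month",
    "Cancellation of contract, please confirm the termination",
]
labels = ["invoice", "invoice", "cancellation", "cancellation"]

model = NaiveBayesClassifier().fit(texts, labels)
predicted = model.predict("Please confirm the cancellation of my contract")
```

Replacing the raw counts with TF-IDF weights, or the classifier with a support-vector machine or logistic regression, leaves the overall pipeline unchanged: tokenize, vectorize, train on labeled examples, predict.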
