Classification of PDF documents
Context
Document classification is often the first step in processing incoming documents, e.g. PDF or JPG files received via email. To enable further processing and reduce response time, documents have to be classified by type and, ideally, forwarded to the right agent.
Challenges
Each company has established its own workflows for forwarding documents to the right agent, and most use a ticket system to cope with the volume of incoming documents. Creating a ticket and forwarding it to an agent may be rule-based (e.g. based on document type, ZIP code or receiving email account) or done manually. Extracting the information needed to assign a document directly to the right agent can significantly reduce processing time while improving customer and supplier relationships.
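To make the rule-based variant concrete, the following is a minimal sketch of a first-match routing table keyed on document type, ZIP code prefix and receiving mailbox. The queue names, the mailbox address, the ZIP prefix and the `manual-triage` fallback are hypothetical placeholders, not part of any particular ticket system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncomingDocument:
    doc_type: str   # e.g. the result of an upstream classification step
    zip_code: str   # sender ZIP code extracted from the document
    mailbox: str    # email account that received the document


# Hypothetical rules, checked in order; the first match wins.
# None acts as a wildcard for that field.
# (doc_type, zip_prefix, mailbox, target queue)
RULES = [
    ("invoice", None, None, "accounts-payable"),
    ("claim", "8", None, "claims-south"),
    ("claim", None, None, "claims-general"),
    (None, None, "jobs@example.com", "hr"),
]


def route(doc: IncomingDocument) -> str:
    """Return the queue for a document, or a manual fallback if no rule matches."""
    for doc_type, zip_prefix, mailbox, queue in RULES:
        if doc_type is not None and doc.doc_type != doc_type:
            continue
        if zip_prefix is not None and not doc.zip_code.startswith(zip_prefix):
            continue
        if mailbox is not None and doc.mailbox != mailbox:
            continue
        return queue
    return "manual-triage"
```

Ordering the rules from most to least specific keeps the table readable, and the explicit fallback corresponds to the manual assignment mentioned above.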
Potential solution approaches
Assuming the output of optical character recognition (OCR) is of good quality, document classification is a standard machine learning task. While a coarse classification, e.g. by document type, can be implemented with rule-based approaches (e.g. searching for keywords like "invoice"), a more detailed classification can be achieved by training machine learning algorithms on a labeled dataset. Algorithms that have proven effective in document classification tasks include support-vector machines, naive Bayes and logistic regression. These algorithms are trained on vectorized representations of the document text, such as the bag-of-words model or TF-IDF.
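The two approaches described above can be sketched in a few lines of plain Python: a keyword rule for coarse typing, and a multinomial naive Bayes classifier trained on bag-of-words counts with add-one smoothing. This is an illustration under stated assumptions, not a production setup: the training texts, labels and keyword lists are invented toy data, the tokenizer keeps only lowercase ASCII letters, and a real system would more likely use an established machine learning library and a far larger labeled dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the OCR text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def classify_by_keyword(text, keywords):
    """Rule-based typing: return the first label whose keywords occur in the text."""
    tokens = set(tokenize(text))
    for label, words in keywords.items():
        if tokens & words:
            return label
    return None


class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words counts (add-one smoothing)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen during training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training set; labels are hypothetical document types.
TRAIN = [
    ("Invoice number 4711 total amount due payment within 30 days", "invoice"),
    ("Please pay the outstanding amount on this invoice", "invoice"),
    ("I would like to cancel my contract effective next month", "cancellation"),
    ("Termination of contract please confirm the cancellation", "cancellation"),
    ("Special offer this weekend discount on all products", "flyer"),
    ("Visit our store for the big summer sale and discount", "flyer"),
]

model = NaiveBayes().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])
print(model.predict("Payment reminder: amount due for invoice 4712"))
```

The keyword rule is cheap and transparent but only as good as its keyword list, whereas the trained model weighs all known words in a document. Replacing the raw counts with TF-IDF weights would additionally down-weight words that appear in most documents.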