© unsplash/@imattsmart
Document classification is often the first step in processing incoming documents, e.g. PDF or JPG files received via email. To enable further processing and reduce response time, documents have to be classified by type and, ideally, forwarded to the right agent.
Each company has established its own workflows for forwarding documents to the right agent, and most use a ticket system to cope with the volume of incoming documents. Creating a ticket and forwarding it to an agent may be rule-based (e.g. based on document type, ZIP code or receiving email account) or done manually. Extracting the information needed to assign a document directly to the best-suited agent can significantly reduce processing time while improving customer and supplier relationships.
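As a minimal sketch of what rule-based forwarding could look like, the example below checks an ordered list of rules and falls back to manual handling when none matches. The `Document` fields, the queue names and the email addresses are hypothetical placeholders, not part of any particular ticket system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Document:
    doc_type: str     # e.g. "invoice", assumed to come from an upstream classifier
    zip_code: str     # ZIP code extracted from the document
    received_at: str  # email account that received the document


# Each rule pairs a predicate with a target queue. All names are hypothetical.
Rule = Tuple[Callable[[Document], bool], str]

RULES: List[Rule] = [
    (lambda d: d.received_at == "claims@example.com", "claims-team"),
    (lambda d: d.doc_type == "invoice" and d.zip_code.startswith("1"), "accounting-east"),
    (lambda d: d.doc_type == "invoice", "accounting"),
]


def route(doc: Document, default: str = "manual-review") -> str:
    """Return the queue of the first matching rule, or the manual fallback."""
    for predicate, queue in RULES:
        if predicate(doc):
            return queue
    return default


print(route(Document("invoice", "10115", "info@example.com")))
```

Because the first matching rule wins, more specific rules have to be listed before more general ones.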
Assuming the output of optical character recognition (OCR) is of good quality, document classification is a standard machine learning task. While a coarse classification, e.g. by document type, can be implemented with rule-based approaches (such as searching for keywords like "invoice"), a more fine-grained classification can be achieved by training machine learning algorithms on a labeled dataset. Algorithms that have proven effective for document classification include support-vector machines, naive Bayes and logistic regression. These algorithms are trained on vectorized representations of the text, such as bag-of-words counts or TF-IDF weights.
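To make the idea concrete, here is a small sketch of a multinomial naive Bayes classifier trained on bag-of-words counts, written in plain Python so that every step is visible. The training texts and labels are invented for illustration, and the class and function names are assumptions of this sketch. In practice one would more likely use an established library such as scikit-learn and a much larger labeled dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (bag-of-words)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
train_texts = [
    "invoice number 4711 total amount due payment within 30 days",
    "invoice for services rendered please transfer the amount",
    "I would like to complain about the delayed delivery",
    "complaint regarding damaged goods request for refund",
]
train_labels = ["invoice", "invoice", "complaint", "complaint"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("please find the invoice and the amount due attached")
print(prediction)
```

The classifier scores each label by its log prior plus the smoothed log likelihood of every known token, then picks the label with the highest score. Replacing the raw counts with TF-IDF weights, or the model with logistic regression or a support-vector machine, leaves the overall workflow of vectorizing, training and predicting unchanged.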
Ewelina Fiebig
Machine Learning Scientist
Fabian Gringel
Machine Learning Scientist
Konrad Schultka
Machine Learning Scientist
Jona Welsch
Machine Learning Scientist
April 23rd
2019
August 17th
2020
March 30th
2020
January 20th
2020
September 28th
2020