The goal of this project was to automatically identify and extract specific information from PDF documents attached to emails, at scale. From a machine learning perspective, the technical solution was clear early on: an optical character recognition (OCR) service would extract blocks of raw text from the PDFs, and a neural network would then process these blocks to extract the relevant information. After a short development phase, the general training framework for the model and the data engineering were mostly complete, leaving the team with three main issues to address:
Performing the actual training runs and evaluations of the neural network.
Providing a platform that integrates the trained models seamlessly into an existing Kubernetes ecosystem.
Ensuring that the production service delivered sufficient computational performance (in terms of scaling) as well as predictive performance (in terms of statistical metrics).
The project required these tasks to be handled largely independently of each other: any newly trained model had to be easy to monitor and deploy, regardless of the current stage of development or changes to the data. Additionally, the process was to require no manual input from the team beyond code changes, and to consist only of transparent, reproducible steps captured in the code base.
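As a rough illustration of this decoupling, the two-stage design (OCR first, then the model) can be sketched as a pipeline that depends only on two small interfaces. This is a minimal sketch, not the project's actual code: the names `OcrService`, `ExtractionModel`, `ExtractionPipeline`, `FakeOcr`, and `KeywordModel` are all hypothetical, and the stand-in classes merely imitate an OCR service and a trained network so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Protocol


class OcrService(Protocol):
    """Hypothetical interface: turns a PDF into blocks of raw text."""

    def extract_blocks(self, pdf_bytes: bytes) -> list[str]: ...


class ExtractionModel(Protocol):
    """Hypothetical interface: maps text blocks to extracted fields."""

    def predict(self, blocks: list[str]) -> dict[str, str]: ...


@dataclass
class ExtractionPipeline:
    """Composes the two stages; either one can be swapped independently."""

    ocr: OcrService
    model: ExtractionModel

    def run(self, pdf_bytes: bytes) -> dict[str, str]:
        blocks = self.ocr.extract_blocks(pdf_bytes)
        return self.model.predict(blocks)


class FakeOcr:
    """Stand-in for a real OCR service: treats the input as UTF-8 text."""

    def extract_blocks(self, pdf_bytes: bytes) -> list[str]:
        lines = pdf_bytes.decode("utf-8").splitlines()
        return [line for line in lines if line.strip()]


class KeywordModel:
    """Stand-in for a trained network: finds the block starting with a keyword."""

    def __init__(self, field: str, keyword: str) -> None:
        self.field = field
        self.keyword = keyword

    def predict(self, blocks: list[str]) -> dict[str, str]:
        for block in blocks:
            if block.startswith(self.keyword):
                return {self.field: block[len(self.keyword):].strip()}
        return {}


pipeline = ExtractionPipeline(
    ocr=FakeOcr(),
    model=KeywordModel(field="invoice_number", keyword="Invoice No:"),
)
result = pipeline.run(b"Dear customer\nInvoice No: 12345\n")
```

Because the pipeline only knows the two interfaces, a newly trained model could in principle replace the old one without any change to the OCR stage or the serving code around it.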