In a now well-known 2015 Google paper, the life cycle of complex ML systems was examined through the lens of so-called technical debt. This term describes the hidden costs that accumulate in the development, deployment, and maintenance of large-scale software projects. In classical software engineering, technical debt is typically managed with a set of practices now known as DevOps (a portmanteau of development and operations) that ensure fluid workflows and high software quality. The development and productionization of machine learning systems, however, introduces new problems that are not encountered in classical software engineering. Examples of these problems are manifold:
The goal of this project was to automatically identify and extract specific information from PDF documents attached to emails, at large scale. From a machine learning perspective, the technical solution was clear early on: an optical character recognition (OCR) service would extract blocks of raw text from the PDFs, which would then be processed by a neural network for the final extraction of the relevant information. After a short development phase, the general training framework for the model, as well as the data engineering, was mostly complete, and three main issues remained:
Performing the actual training runs and evaluations of the neural network.
Providing a platform allowing the trained models to be seamlessly integrated into an already existing Kubernetes ecosystem.
Ensuring that the service in production delivered sufficient computational performance (in terms of scaling) as well as predictive performance (in terms of statistical metrics).
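The two-stage extraction flow described above can be sketched as follows. This is a minimal illustration only: the function names, the `TextBlock` structure, and the invoice-number example are assumptions, not the project's actual code, and trivial stubs stand in for the OCR service and the neural network.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A block of raw text as an OCR service might return it (illustrative)."""
    page: int
    text: str


def ocr_extract(pdf_bytes: bytes) -> list[TextBlock]:
    # Placeholder for the OCR service call; a real implementation would
    # send the PDF to an OCR API and parse the returned text blocks.
    return [TextBlock(page=1, text="Invoice No. 42")]


def extract_fields(blocks: list[TextBlock]) -> dict[str, str]:
    # Placeholder for the neural network; a trivial rule stands in for
    # the model's extraction of the relevant information.
    for block in blocks:
        if "Invoice No." in block.text:
            return {"invoice_number": block.text.split("Invoice No.")[-1].strip()}
    return {}


def process_attachment(pdf_bytes: bytes) -> dict[str, str]:
    """Full flow: OCR first, then model-based field extraction."""
    return extract_fields(ocr_extract(pdf_bytes))
```

The key design point is the clean interface between the two stages: the OCR output format is fixed, so the extraction model can be retrained and swapped out without touching the OCR side.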
The project required that these tasks be handled mostly independently of each other: any newly trained model should be easily monitored and effectively deployed, unaffected by the current stage of the development process or changes to the data. Additionally, this process should not require any manual input from the team (beyond code changes) and should consist only of transparent and reproducible steps captured in the code base.
The first step was to automate the training process. The team decided to use Vertex AI Pipelines for this task. This allows the individual components of the training process to be designed as a directed acyclic graph that defines the dependencies between the specific steps (see the figure below). The graph includes components representing Python code (such as train-model and evaluate-model) as well as placeholders for directories, input data, and so-called artifacts (i.e., data created during a run of the pipeline).
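The dependency graph between pipeline components can be modeled as a plain DAG. The sketch below uses Python's standard library rather than the Vertex AI SDK, and the component names other than train-model and evaluate-model are illustrative assumptions:

```python
from graphlib import TopologicalSorter

# Each component maps to the set of components it depends on, mirroring
# how a pipeline definition wires artifacts to downstream inputs.
pipeline = {
    "load-data": set(),
    "preprocess": {"load-data"},
    "train-model": {"preprocess"},
    "evaluate-model": {"train-model"},
    "register-model": {"evaluate-model"},
}

# A topological order is one valid execution schedule; in a real
# pipeline, independent components can also run in parallel.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
# ['load-data', 'preprocess', 'train-model', 'evaluate-model', 'register-model']
```

Because the orchestrator only sees dependencies, adding a new step (say, a data-validation component) is a local change to the graph rather than a rewrite of the run logic.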
A training run can now be easily triggered as a CI/CD pipeline based on commits to the codebase (such as changes to the model configuration or the training protocol), while the data handling and the provisioning of computational resources are handled by Vertex AI. This enables a fully transparent training process that does not require manually managing data or individual VMs to train the model.
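One way to gate pipeline submission on relevant commits is a small check inside the CI job. The watched paths below are assumptions about the repository layout, not the team's actual configuration:

```python
# Paths whose changes should trigger a new training run (assumed layout).
TRIGGER_PREFIXES = ("model_config/", "training/")


def should_trigger(changed_files: list[str]) -> bool:
    """Return True if any changed file affects the model config or training code."""
    return any(path.startswith(TRIGGER_PREFIXES) for path in changed_files)


# In a CI job, changed_files would come from the diff of the commit;
# if the check passes, the job would submit the Vertex AI pipeline run.
```

Keeping this trigger logic in the code base, rather than in ad-hoc manual steps, is what makes the training process reproducible from commits alone.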