In a now well-known 2015 Google paper, the life cycle of complex ML systems was examined through the lens of so-called technical debt. This term describes the hidden costs that accumulate in the development, deployment, and maintenance of large-scale software projects. In classical software engineering, technical debt is typically managed with a set of practices now known as DevOps (a portmanteau of development and operations) that ensure fluid workflows and high software quality. The development and productionization of machine learning systems, however, introduces new problems that are not encountered in classical software engineering. Examples of these problems are manifold:
The goal of this project was to automatically identify and extract specific information from PDF documents attached to emails, at large scale. From a machine learning perspective, the technical solution was clear early on: an optical character recognition (OCR) service would extract blocks of raw text from the PDFs, which would then be processed by a neural network for the final extraction of the relevant information. After a short development phase, the general training framework for the model, as well as the data engineering, was mostly complete, and three main issues remained:
Performing the actual training runs and evaluations of the neural network.
Providing a platform allowing the trained models to be seamlessly integrated into an already existing Kubernetes ecosystem.
Ensuring that the service in production delivered sufficient computational performance (in terms of scaling) as well as predictive performance (in terms of statistical metrics).
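The two-stage extraction flow described above can be sketched as follows. This is a minimal illustration only: the function names, the `TextBlock` structure, and the invoice-number example are assumptions, not the project's actual code, and trivial stubs stand in for the OCR service and the neural network.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A block of raw text as an OCR service might return it (illustrative)."""
    page: int
    text: str


def ocr_extract(pdf_bytes: bytes) -> list[TextBlock]:
    # Placeholder for the OCR service call; a real implementation would
    # send the PDF to an OCR API and parse the returned text blocks.
    return [TextBlock(page=1, text="Invoice No. 42")]


def extract_fields(blocks: list[TextBlock]) -> dict[str, str]:
    # Placeholder for the neural network; a trivial rule stands in for
    # the model's extraction of the relevant information.
    for block in blocks:
        if "Invoice No." in block.text:
            return {"invoice_number": block.text.split("Invoice No.")[-1].strip()}
    return {}


def process_attachment(pdf_bytes: bytes) -> dict[str, str]:
    """Full flow: OCR first, then model-based field extraction."""
    return extract_fields(ocr_extract(pdf_bytes))
```

The key design point is the clean interface between the two stages: the OCR output format is fixed, so the extraction model can be retrained and swapped out without touching the OCR side.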
The project required that these tasks be handled mostly independently of each other: any newly trained model should be easily monitored and effectively deployed, unaffected by the current stage of the development process or changes to the data. Additionally, this process should not require any manual input from the team (beyond code changes) and should consist only of transparent and reproducible steps captured in the code base.
The first step was to automate the training process. The team decided to use Vertex AI Pipelines for this task. This allows the individual components of the training process to be designed as a directed acyclic graph that defines the dependencies between the specific steps (see the figure below). The graph includes components representing Python code (such as train-model and evaluate-model) as well as placeholders for directories, input data, and so-called artifacts (i.e., data created during a run of the pipeline).
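The dependency graph between pipeline components can be modeled as a plain DAG. The sketch below uses Python's standard library rather than the Vertex AI SDK, and the component names other than train-model and evaluate-model are illustrative assumptions:

```python
from graphlib import TopologicalSorter

# Each component maps to the set of components it depends on, mirroring
# how a pipeline definition wires artifacts to downstream inputs.
pipeline = {
    "load-data": set(),
    "preprocess": {"load-data"},
    "train-model": {"preprocess"},
    "evaluate-model": {"train-model"},
    "register-model": {"evaluate-model"},
}

# A topological order is one valid execution schedule; in a real
# pipeline, independent components can also run in parallel.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
# ['load-data', 'preprocess', 'train-model', 'evaluate-model', 'register-model']
```

Because the orchestrator only sees dependencies, adding a new step (say, a data-validation component) is a local change to the graph rather than a rewrite of the run logic.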
A training run can now be easily triggered as a CI/CD pipeline based on commits to the codebase (such as changes to the model configuration or the training protocol), while the data handling and the provisioning of computational resources are handled by Vertex AI. This enables a fully transparent training process that does not require manually managing data or individual VMs to train the model.
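One way to gate pipeline submission on relevant commits is a small check inside the CI job. The watched paths below are assumptions about the repository layout, not the team's actual configuration:

```python
# Paths whose changes should trigger a new training run (assumed layout).
TRIGGER_PREFIXES = ("model_config/", "training/")


def should_trigger(changed_files: list[str]) -> bool:
    """Return True if any changed file affects the model config or training code."""
    return any(path.startswith(TRIGGER_PREFIXES) for path in changed_files)


# In a CI job, changed_files would come from the diff of the commit;
# if the check passes, the job would submit the Vertex AI pipeline run.
```

Keeping this trigger logic in the code base, rather than in ad-hoc manual steps, is what makes the training process reproducible from commits alone.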