ML model deployment: How to deliver a Machine Learning model to the customer

Anton Shemyakov

Congratulations! You have successfully trained your machine learning model. The weights are stored in the project directory on your hard drive. The scripts to load the model and run inference are working as expected, and the model reaches the desired performance on the test dataset. Now, you are facing the challenge of making the model's predictions available to anyone who wants to use them.

Model delivery or deployment is a crucial step in creating an impactful machine learning application, and this blog post will guide you, step by step, through a possible solution to this challenge. Model deployment systems come in all shapes and sizes, ranging from basic to sophisticated. However, this blog post focuses on a practical and straightforward approach. The proposed solution could be a solid starting point for exposing model predictions to customers, and it allows extension if additional requirements arise.

Let's imagine that your machine learning model is a text classification model intended to protect people from scammers on the internet. It takes a suspicious email as input and performs binary classification, determining with extremely high accuracy whether the sender has malicious intent. With that mental image in mind, let's dive into the tutorial.

Create a REST API

How would a data scientist generate model predictions on a piece of text? Probably, they would load the model into a Jupyter notebook and run the predict method. Unfortunately, your average customer is not familiar with Python and Jupyter notebooks, the everyday tools of data scientists. Yet, there is a tool with which almost everyone feels comfortable: the web browser. Web browsers like Google Chrome, Edge, Safari, or Firefox have the potential to make your model available if they can properly communicate with it.

Much of the communication on the internet occurs over HTTP, a protocol designed for a client-server model. In this model, there is an entity called a server that continuously listens for incoming messages and an entity called a client that sends requests. When the server receives a request from the client, it processes the request in a predefined manner and sends back a response. HTTP defines the format of the messages exchanged, including methods, headers, response codes, message payload, and so on. You can learn more about HTTP here. A valid request-response pair of messages might look like this:


POST /predict HTTP/1.1
Content-Type: text/plain
Content-Length: 44

Some text that I want to send to your model.


HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 5

false

As you can see, there are certain rules on how to structure HTTP messages, but at their core, they are pieces of text.
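To make the "pieces of text" point concrete, here is a minimal sketch in plain Python that assembles a request like the one above by hand. The helper name build_request is hypothetical and exists only for illustration; in practice an HTTP client library does this for you. Like the example above, it leaves out the Host header that a real HTTP/1.1 request also has to carry.

```python
def build_request(path: str, body: str) -> bytes:
    """Assemble a raw HTTP/1.1 POST request carrying a plain-text body."""
    payload = body.encode("utf-8")
    head = (
        f"POST {path} HTTP/1.1\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(payload)}\r\n"  # length of the body in bytes
        "\r\n"  # an empty line separates the headers from the body
    )
    return head.encode("ascii") + payload


raw = build_request("/predict", "Some text that I want to send to your model.")
print(raw.decode("utf-8"))
```

Note that Content-Length counts the bytes of the body, which is why the sketch encodes the text before measuring it.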

To make model predictions available over HTTP, you need to create a server by wrapping the model code into something called a REST API. REST stands for representational state transfer and API means application programming interface, but these names are not important. What is important is that a REST API creates a universal interface for your model, which receives input as text (an HTTP request, to be more precise) and returns the result as text back to the sender. As a developer, it is your responsibility to define which messages your REST API accepts and how it reacts to them.
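Before reaching for a framework, it can help to see the whole client-server loop in one place. The following is a rough sketch that uses only the Python standard library: a tiny server wraps a stand-in "model" and a client sends it text. Everything named here (toy_model, PredictHandler, ask) is hypothetical, and the keyword check is of course not a real classifier.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def toy_model(text: str) -> bool:
    """Hypothetical stand-in for a real classifier: flags a single keyword."""
    return "lottery" in text.lower()


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        # Read exactly as many bytes as the client announced
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        answer = ("true" if toy_model(text) else "false").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(answer)))
        self.end_headers()
        self.wfile.write(answer)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def ask(url: str, text: str) -> str:
    """Send text to the service and return the response body."""
    request = urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    # An empty ProxyHandler makes sure the local demo bypasses any system proxy
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return response.read().decode("utf-8")


server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0: the OS picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/predict"

reply_scam = ask(url, "You won the lottery, send your bank details!")
reply_ok = ask(url, "Meeting moved to 3 pm.")
print(reply_scam, reply_ok)

server.shutdown()
server.server_close()
```

The server decides which path and method it accepts and what it sends back, which is exactly the responsibility described above.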

One example of a library for creating REST APIs in Python is FastAPI, while another is Flask. These tools make it easy to create simple REST APIs, but mastering them requires a significant amount of effort.

Here is an example of implementing a basic model prediction API with FastAPI:

from fastapi import Body, FastAPI

from my_model import MyModel # import model class

app = FastAPI() # create FastAPI application
model = MyModel() # create an instance of the model that will make predictions


@app.post("/predict") # declare a path operation function that is called on a POST request at /predict endpoint
async def predict(text: str = Body()) -> bool: # the function expects text in request body and returns either "true" or "false"
    return model(text) == 1 # run model inference and transform binary output to boolean

You can continue learning about building REST APIs in Python here.
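For completeness, here is a hedged sketch of what a Python client for such a service might look like, using only the standard library. It assumes the API is reachable at http://localhost:8000 (a placeholder address) and that the service reads the body as JSON, which, as far as I understand, is how FastAPI treats a single Body() parameter by default. Please verify the expected content type against your own setup. The function names are hypothetical.

```python
import json
import urllib.request


def build_predict_request(base_url: str, text: str) -> urllib.request.Request:
    """Prepare a POST request for the /predict endpoint without sending it."""
    body = json.dumps(text).encode("utf-8")  # assumption: the body is a JSON string
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str, text: str) -> bool:
    """Send the text to a running service and decode its JSON true/false answer."""
    with urllib.request.urlopen(build_predict_request(base_url, text)) as response:
        return json.loads(response.read())


# Only build the request here; calling predict() requires a running service.
request = build_predict_request("http://localhost:8000", "You won a prize!")
print(request.get_method(), request.full_url, request.data)
```

Splitting request construction from sending makes it easy to inspect what goes over the wire before a server is available.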

Containerizing a service

Suppose you have successfully implemented a REST API application that returns predictions from your model over HTTP. We refer to such an application as a service. Because the model deployment service plays the role of a server in a client-server model, it has to run continuously, all the time! After all, requests from potential clients might come at any moment. Naturally, your personal computer is a poor choice to run the service for many reasons; for example, it would constantly consume a fraction of your PC's resources. The best hardware choice to run the application is a dedicated computer node, a server. Let's, for the moment, ignore the question of where to provision a server and focus on a new emerging challenge, familiar to anyone who has tried running their code on another computer.

Even before training a machine learning model, you have to correctly set up the working environment, for example, install the OS and Python packages or define environment variables. This environment should be replicated exactly on another machine to ensure that the application runs without errors, and that is a challenging engineering problem. If only there were a way to create some sort of a container where the application would be loaded together with all dependencies and instructions on how to run it! Then special software could run this container on any server regardless of the system environment. Luckily, there is a way to package an application together with its dependencies into a container; the technology is called containerization, and the tool to create containers and run them is called Docker.

A container is similar to an actual running instance of a program, and in the Docker world, the executable file to spawn a container is called an image. To containerize your service, you need a Dockerfile, which describes how exactly an image should be built from your code. The simplest Dockerfile for deploying a text classification REST API would probably contain instructions to include application code and model weights, install Python packages inside the image, and define the application entry point. The details of building Docker images are beyond the scope of this blog post; you can find more information in this blogpost or directly in the Docker documentation. Still, here is an example of a basic Dockerfile for the REST API we are deploying:

# Use python 3.11 slim as a base image
FROM python:3.11-slim as app
# Set the working directory inside the image to /app
WORKDIR /app
# Copy the application requirements file to the container
COPY requirements.txt requirements.txt
# Install the application requirements
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application source code to the container
COPY . /app
# Start the application when the container starts
ENTRYPOINT ["python", ""]

Once you've built the Docker image with your service, deploying it on a remote server is easy. The image has to be uploaded to the machine and started with the command:

docker run image_name

Finding a suitable host for the application

Now, let's revisit the question of where to find a suitable computer to run the containerized model serving API. This is a very practical question, and while the answer essentially boils down to "buy or rent it from someone," the variety of options available is extensive. One way to categorize the compute infrastructure is to place it on a customization-simplicity scale. The intuition behind this scale is that the more flexibility and control you have over the server configuration, the more setup work you have to do manually, and vice versa.

On the most extreme end of the customization spectrum, there is the so-called on-premise architecture, which essentially means that you buy and maintain your own physical computer or a group of computers. In this case, you are free to choose the exact CPUs, GPUs, RAM, disk configurations, and other hardware options, as well as software options like the operating system. However, you also have to take care of networking, disk space, access management, and many other details on your own.

One step in the direction of simplicity lies the domain of cloud computing. In the cloud computing paradigm, a cloud provider manages its own on-premise architecture and allows customers to use it for a fee. The market for cloud computing is vast, with the most popular providers being Amazon Web Services, Microsoft Azure, and Google Cloud Platform.

Typically, the most customizable offering in cloud computing is Infrastructure-as-a-service (IaaS), where a virtual machine is provided. A virtual machine can be thought of as a computer with predefined hardware and its own operating system, which can be accessed, for example, via SSH. This solution takes some burden off you as the developer, but you still have to install the required OS packages and maintain disk space.

A simpler yet popular solution is Container-as-a-service (CaaS), which means that the cloud provider runs your Docker container. In this scenario, you only have to worry about building the correct image and the rest is taken care of. On the downside, you don't have the flexibility of logging in to the server and executing an arbitrary program.

An even less customizable approach to cloud computing is Function-as-a-service (FaaS). When you select a FaaS offering, you are only responsible for implementing a certain piece of logic as a function written in a programming language of your choice. The cloud provider then executes this function with the provided arguments when the trigger event happens.

This section only scratches the surface of cloud computing, its capabilities, and how it helps you deliver your apps to the customer. Refer directly to the documentation of a cloud provider of your choice for a deep dive into the topic.

As a data scientist, you probably do not want to manage infrastructure and do system administration work, so on-premise and IaaS approaches to infrastructure are not desirable. On the other hand, the Function-as-a-service approach would require loading the model every time a request from the customer arrives, which might lead to suboptimal resource utilization and long response times if the model is large. That means that, in our case, Container-as-a-service seems to be the best-fitting cloud computing option.
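The cost this argument hinges on is the model loading step. As a rough illustration, the sketch below simulates a long-running service that loads its model once and reuses it for every request. DummyModel, get_model, and handle_request are hypothetical names, and the sleep merely stands in for reading large weights from disk.

```python
import functools
import time

load_count = 0  # how many times the expensive loading step has run


class DummyModel:
    """Hypothetical stand-in for a large model."""

    def __call__(self, text: str) -> int:
        return int("lottery" in text.lower())


@functools.lru_cache(maxsize=1)
def get_model() -> DummyModel:
    """Load the model on first use and reuse it for all later calls."""
    global load_count
    load_count += 1
    time.sleep(0.1)  # pretend that reading the weights is slow
    return DummyModel()


def handle_request(text: str) -> bool:
    return get_model()(text) == 1


emails = ["You won the lottery!", "See you at lunch.", "Lottery prize inside"]
results = [handle_request(email) for email in emails]
print(results, "model loaded", load_count, "time(s)")
```

In a container that keeps running between requests, the loading cost is paid once per instance. If every request started from a fresh process instead, the slow step would run each time.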

Deploying the containerized API to Cloud Run

At dida, we like Google Cloud Platform (GCP) as a cloud provider, and GCP's Container-as-a-service tool is called Cloud Run. Supposing you have implemented the REST API for your model, created a Dockerfile, and set up an account on Google Cloud, here are the required steps to deploy a model to Cloud Run:

  1. Locally build and tag a Docker image with your application. Tagging an image just means giving it an identifying name, for example by running docker build --tag my-tag . (the trailing dot is part of the command and points at the current directory).

  2. Log in to GCP Artifact Registry and push your image with docker push my-tag. Artifact Registry is a library of images you maintain on GCP.

  3. Open the Cloud Run dashboard in the Google Cloud console and create a new service. Specify the image name, the service name, and the minimum and maximum number of instances.

  4. Once the container is deployed to Cloud Run, anyone can access your model's predictions by sending HTTP requests to the service URL!
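One practical detail when preparing the image for these steps: Cloud Run tells the container which port to listen on through the PORT environment variable, with 8080 as the default. A small sketch of how the service could pick that up follows; the helper name resolve_port is hypothetical.

```python
import os


def resolve_port(environ=os.environ, default: int = 8080) -> int:
    """Return the port the service should listen on."""
    return int(environ.get("PORT", default))


print(resolve_port({}))                # no PORT set, so the default is used
print(resolve_port({"PORT": "9000"}))  # value injected by the platform
```

Assuming the FastAPI application is served with uvicorn, the result could be passed along as uvicorn.run(app, host="0.0.0.0", port=resolve_port()).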

The minimum and maximum number of instances are part of a great out-of-the-box feature of cloud providers: autoscaling. When your model becomes successful and popular, it could happen that a lot of people try to access it simultaneously. In that case, the load on the server becomes too high and it could slow down or even crash. A potential solution is to spin up a second identical server running the same image on separate hardware that will process half of the incoming requests, just like a second checkout opening at the supermarket when the line gets too long. If, on the other hand, the server is not receiving any requests, maybe at night, it is possible to shut down some of the instances in order to save compute resources. Cloud Run autoscaling makes this possible; you only have to specify the minimum allowed number of running instances and the maximum number of instances available when service use is at its peak.
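The role of the two limits can be illustrated with a toy calculation. This is not Cloud Run's actual scaling algorithm, only a simplified model of the idea: estimate how many instances the current load needs, then clamp the result between the configured minimum and maximum. The function name and the per-instance capacity of 80 requests are assumptions made for the example.

```python
import math


def desired_instances(
    concurrent_requests: int,
    per_instance_capacity: int,
    min_instances: int,
    max_instances: int,
) -> int:
    """Toy autoscaling rule: enough instances for the load, within the limits."""
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    needed = math.ceil(concurrent_requests / per_instance_capacity)
    return max(min_instances, min(needed, max_instances))


# Quiet night, normal day, and a traffic spike, with limits of 1 and 5 instances
for load in (0, 100, 1000):
    print(load, "requests ->", desired_instances(load, 80, 1, 5), "instance(s)")
```

Setting the minimum to zero would let the service scale down completely when idle, at the price of a slower first response while a new instance starts.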


As you aim to solve real-world problems with machine learning algorithms, you should always keep an eye on the end goal: making an impact where it matters. And in many cases, it means embedding your model into existing digital workflows, providing an accessible interface for predictions, and eventually deploying it online. Hopefully, this tutorial gave you a basic understanding of what it takes to deliver your ML model's predictions to users.

Be aware that this post is not a step-by-step recipe that you should strictly follow in your project. Instead, we only tried to illuminate the most essential concepts using a toy example. Model deployment is part of the vast and dynamic field of MLOps, and there is always something new to learn when it comes to deploying or maintaining models in production.