Ways of deploying your model

Microservices: API-ify your model (Pickle, Joblib)
Deploy model together with your application (Python, MLlib)
Deploy model as SQL stored procedure
Shared service: host your model in a dedicated tool, possibly automated
Streaming: load your model in memory (PMML, ONNX)
Many cloud providers and ML tools provide solutions for model deployment that integrate closely with their machine learning and data environments. These can greatly speed up deployment and reduce infrastructure overhead. Examples include:
  • GCP Vertex AI
  • AWS SageMaker
  • MLflow
Once a machine learning model has been generated, it needs to be deployed for use. How this is done depends on your use case and your IT environment.

As a microservice

Why: Your model is intended to provide output for a real-time, synchronous request from a user or system.
How: The model artefact and the accompanying code to generate features and return results are packaged up as a containerised microservice.
Watch out for: The model and microservice code should always be packaged together; this avoids potential schema or feature generation errors and simplifies versioning and deployment.
Your model and feature generation code will need to be performant in order to respond in real time and not cause downstream timeouts.
It will need to handle a wide range of possible inputs from users.
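
The points above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the input fields, the names `MODEL_VERSION`, `WEIGHTS`, `build_features`, `predict` and `handle_request`, and the linear scoring rule are all assumptions standing in for a real model artefact. The HTTP layer is left out so the sketch runs on the standard library alone.

```python
import math

# Hypothetical version tag: weights and feature code ship in one package.
MODEL_VERSION = "1.0.0"

# Stand-in for a trained artefact (in practice loaded from Pickle, Joblib, etc.).
WEIGHTS = {"bias": -1.0, "tenure_months": -0.05, "support_tickets": 0.4}

# Assumed upper bound on accepted input values.
MAX_VALUE = 1e6


def build_features(payload: dict) -> dict:
    """Validate raw user input and turn it into model features."""
    features = {}
    for name in ("tenure_months", "support_tickets"):
        value = payload.get(name)
        # Reject missing and non-numeric values rather than guessing.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"'{name}' must be a number")
        # NaN fails this comparison too, so it is rejected here as well.
        if not (0 <= value <= MAX_VALUE):
            raise ValueError(f"'{name}' must be between 0 and {MAX_VALUE:g}")
        features[name] = float(value)
    return features


def predict(features: dict) -> float:
    """Logistic score from the stand-in linear model."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[k] * v for k, v in features.items())
    # Numerically stable sigmoid: avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def handle_request(payload: dict) -> dict:
    """What a request handler inside the microservice might do."""
    try:
        features = build_features(payload)
    except ValueError as err:
        return {"status": 400, "error": str(err), "model_version": MODEL_VERSION}
    return {"status": 200, "score": predict(features), "model_version": MODEL_VERSION}
```

Returning the model version with every response is one way to make versioning visible to callers; bad input produces a 400-style result instead of an exception that could surface as a downstream timeout.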

Embedded models

Why: Your model is intended to surface its result directly for further use within the context in which it is embedded, e.g. in an application for viewing.
How: The model artefact is packaged up as part of the overall artefact for the application it is contained within, and deployed when the application is deployed.
Watch out for: The latest version of the model should be pulled in at application build time, and covered with automated unit, integration and end-to-end tests.
Real-time performance of the model will directly affect application response times or other latencies.

As a SQL stored procedure

Why: The model output is best consumed as an additional column in a database table.
The model has large amounts of data as an input (e.g. multi-dimensional time-series).
How: The model code is written as a stored procedure (in SQL, Java, Python, Scala etc. dependent on the database) and scheduled or triggered on some event (e.g. after a data ingest).
Modern data warehouses such as Google BigQuery ML or Amazon Redshift ML can train and run ML models as a table-style abstraction.
Watch out for: Stored procedures that are not kept under proper configuration control.
Lack of test coverage of the stored procedures.
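
As a rough illustration of writing model output into an additional column, the sketch below uses Python's built-in `sqlite3` module. SQLite has no stored procedures, so a Python function registered as a SQL function stands in for one; the `orders` table, its columns, and the hand-written `risk_score` rule are all hypothetical. Keeping the scoring logic in a plain function that lives in version control also makes it straightforward to unit test.

```python
import sqlite3


def risk_score(amount: float, n_items: int) -> float:
    """Stand-in model: a hand-written rule in place of a trained artefact."""
    return round(min(1.0, 0.001 * amount + 0.05 * n_items), 3)


def score_orders(conn: sqlite3.Connection) -> None:
    """Write the model output into the 'risk' column of the orders table."""
    # Expose the model to SQL; a data warehouse would use a real stored procedure.
    conn.create_function("risk_score", 2, risk_score)
    conn.execute("UPDATE orders SET risk = risk_score(amount, n_items)")
    conn.commit()


def demo() -> list:
    """Build a small in-memory table, score it, and return (id, risk) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, amount REAL, n_items INTEGER, risk REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (id, amount, n_items) VALUES (?, ?, ?)",
        [(1, 100.0, 2), (2, 2000.0, 1)],
    )
    # In production this call would be scheduled or triggered after an ingest.
    score_orders(conn)
    rows = conn.execute("SELECT id, risk FROM orders ORDER BY id").fetchall()
    conn.close()
    return rows
```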

As part of batch data pipeline

Why: Your model is intended to provide a set of batch predictions or outputs against a batched data ingest or a fixed historical dataset.
How: The model artefact is called as part of a data pipeline and writes the results out to a static dataset. The artefact will be packaged up with other data pipeline code and called as a processing step via an orchestration tool. See our data pipeline playbook for more details.
Watch out for: Feature generation can be rich and powerful across historical data points.
Given the lack of direct user input, the model can rely on clean, normalised data for feature generation.
Parallelisation code for model execution may have to be written to handle large datasets.
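
The parallelisation point can be sketched as below, assuming the batch fits in memory as a list of rows. The function names, the row layout (`series_id`, `history`) and the mean-of-history "model" are illustrative assumptions. A thread pool keeps the sketch simple; for CPU-bound models a real pipeline would more likely use processes or a distributed processing framework, driven by an orchestration tool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List


def chunked(rows: List[dict], size: int) -> Iterator[List[dict]]:
    """Split the batch into fixed-size chunks."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def score_chunk(chunk: List[dict]) -> List[dict]:
    """Stand-in model step: the forecast is the mean of each series' history."""
    return [
        {"series_id": row["series_id"],
         "forecast": sum(row["history"]) / len(row["history"])}
        for row in chunk
    ]


def run_batch(rows: List[dict], chunk_size: int = 2, workers: int = 4) -> List[dict]:
    """Score all chunks in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order the chunks were submitted.
        results = pool.map(score_chunk, chunked(rows, chunk_size))
        return [row for chunk in results for row in chunk]
```

In a pipeline, the list returned by `run_batch` would then be written out to a static dataset such as a table or file.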

As part of a streaming data pipeline

Why: Your model is used in near-real-time data processing applications, for example in a system that makes product recommendations on a website while the users are browsing through it.
How: The model artefact is served in memory within the stream processing framework, using an intermediate format such as ONNX or PMML. The artefact is deployed while the stream keeps running, by means of rolling updates.
Watch out for: Performance and low latency are key. Models should be developed with this in mind; it would be good practice to keep the number of features low and reduce the size of the model.
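
A single-process sketch of serving a model in memory and replacing it while the stream keeps running is shown below. `ModelHolder` and `process_stream` are hypothetical names, the models are plain Python callables rather than artefacts loaded from ONNX or PMML, and a real stream processing framework would coordinate rolling updates across workers instead of using a lock.

```python
import threading
from typing import Callable, Iterable, Iterator, Tuple

Model = Callable[[dict], float]


class ModelHolder:
    """Keeps the current model in memory and lets it be swapped mid-stream."""

    def __init__(self, model: Model, version: str) -> None:
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def swap(self, model: Model, version: str) -> None:
        """Replace model and version together so readers never see a mix."""
        with self._lock:
            self._model, self._version = model, version

    def current(self) -> Tuple[Model, str]:
        """Return the model and version currently being served."""
        with self._lock:
            return self._model, self._version


def process_stream(events: Iterable[dict], holder: ModelHolder) -> Iterator[dict]:
    """Score each event with whichever model is current when it arrives."""
    for event in events:
        model, version = holder.current()
        yield {"user": event["user"], "score": model(event), "model_version": version}
```

For example, calling `holder.swap(new_model, "v2")` between two events means every later event is scored by the new model, with no restart of the loop that consumes the stream.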

Experience report

I worked on a model that was used to forecast aspects of a complex logistics system. The input data was a large number (many thousands) of time-series and we needed to create regular forecasts going into the future for some time, so that downstream users could plan their operations and staffing appropriately. There was a lot of data involved at very high granularity, so it was a complex and time-consuming calculation. However, there was no real-time need and forecast generation once a week was more than enough to meet the business needs.
In this context the right approach was to use a batch process in which forecasts were generated for all parts of the logistics chain that needed them. These were produced as a set of tables in Google BigQuery. I really liked this method of sharing the outputs because it gave a clean interface for downstream use.
One of the challenges in this work was the lack of downstream performance measures. It was very hard to get KPIs in a timely manner. Initially we measured standard precision errors on historical data to evaluate the algorithm and later we were able to augment this with A/B testing by splitting the logistics network into two parts.
Katharina Rasch, Data engineer, Equal Experts, EU