Bridging the Gap Between Machine Learning Development and Production
In the field of Artificial Intelligence, Machine Learning has emerged as a transformative force, enabling businesses to unlock the value of their data and make informed decisions. However, bridging the gap between developing ML models and deploying them into production environments can pose a significant challenge.
In this interview with Alessandro Conflitti (Head of Data Science at Radicalbit, a Bitrock sister company), we explore the world of MLOps, delving into its significance, challenges, and strategies for successful implementation.
Let’s dive right into the first question.
What is MLOps and why is it important?
MLOps is an acronym for «Machine Learning Operations». It describes a set of practices spanning data ingestion, the development of a machine learning (ML) model, its deployment into production, and its continuous monitoring.
In fact, developing a good machine learning model is just the first step of an AI solution. Imagine, for instance, that you have an extremely good model but receive thousands of inference requests (inputs to be predicted by the model) per second: if your underlying infrastructure does not scale well, your model will immediately break down, or at best be too slow for your needs.
Or imagine that your model requires very powerful and expensive infrastructure, e.g. several top-notch GPUs: without careful analysis and a good strategy you might end up losing money on your model, because the infrastructural costs are higher than your return.
This is where MLOps comes into the equation: it integrates an ML model organically into a business environment.
Another example: very often raw data must be pre-processed before being sent to the model, and likewise the output of an ML model must be post-processed before being used. In this case, you can put in place an MLOps data pipeline which takes care of all these transformations.
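To make this more concrete, such a pipeline might be sketched with scikit-learn as follows; the column names, transformers and model are purely illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch of a pre/post-processing pipeline (illustrative only).
# Column names, the scaler and the model are assumptions, not a prescribed setup.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

numeric_cols = ["age", "income"]   # hypothetical numeric features
categorical_cols = ["country"]     # hypothetical categorical feature

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Bundling pre-processing and the model means raw data can be sent directly
# at inference time and is transformed consistently with training.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

def postprocess(probabilities: np.ndarray, threshold: float = 0.5) -> list[str]:
    """Example post-processing: turn raw probabilities into business labels."""
    return ["approve" if p >= threshold else "review" for p in probabilities]

# Usage (with a hypothetical training DataFrame `df` and target column `y`):
# pipeline.fit(df[numeric_cols + categorical_cols], df["y"])
# labels = postprocess(pipeline.predict_proba(new_data)[:, 1])
```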
One last remark: a very hot topic today is model monitoring. ML models must be maintained at all times, since they degrade over time, e.g. because of drift (which, roughly speaking, happens when the training data are no longer representative of the data sent for inference). Having a good monitoring system which analyses data integrity (i.e. that the data sent as input to the model is not corrupted) and model performance (i.e. that the model's predictions are not degrading and are still trustworthy) is therefore paramount.
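As a rough sketch of the idea, one simple way to flag drift on a single numeric feature is a two-sample Kolmogorov–Smirnov test; the threshold and the synthetic data below are illustrative assumptions, not a specific product's monitoring method.

```python
# Minimal drift-check sketch for a single numeric feature (illustrative only).
# The KS test and the 0.05 threshold are assumptions, not a specific product's method.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.05) -> bool:
    """Return True if the live data distribution differs significantly
    from the reference (training) distribution."""
    result = ks_2samp(reference, live)
    return result.pvalue < alpha

# Usage with synthetic data: the live distribution has shifted, so drift is flagged.
rng = np.random.default_rng(42)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
production_feature = rng.normal(loc=0.7, scale=1.0, size=5_000)
print(detect_drift(training_feature, production_feature))  # True (drift detected)
```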
What can be the different components of an MLOps solution?
An MLOps solution may include different components depending on the specific needs and requirements of the project. A common setup may include, in order:
- Data engineering: As a first step, you have to collect, prepare and store data; this includes tasks such as data ingestion, data cleaning and a first exploratory data analysis.
- Model development: This is where you build and train your Machine Learning model. It includes tasks such as feature engineering, encoding, choosing the right metrics, selecting the model architecture, training the model, and hyperparameter tuning.
- Experiment tracking: This can be seen as a part of model development, but I like to highlight it separately because if you keep track of all experiments you can refer to them later, for instance if you need to tweak the model in the future, or if you need to build similar projects using the same or similar datasets. More specifically, you keep track of how different models behave (e.g. Lasso vs Ridge regression, or XGBoost vs CatBoost), but also of hyperparameter configurations, model artefacts, and other results.
- Model deployment: In this step you put your ML model into production, i.e. you make it available to users, who can then send inputs to the model and get back predictions. What this looks like can vary widely, from something as simple as a Flask or FastAPI service to much more complex solutions (a minimal serving sketch is shown after this list).
- Infrastructure management: With a deployed model you need to manage the associated infrastructure, taking care of scalability both vertically and horizontally, i.e. making sure that the model can smoothly handle high-volume and high-velocity data. A popular solution is using Kubernetes, but it is by no means the only one.
- Model monitoring: Once all previous steps are working fine, you need to monitor that your ML model is performing as expected: this means, on the one hand, logging all errors, and on the other hand tracking its performance and detecting drift.
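For the deployment step mentioned above, here is a minimal sketch of serving a model behind a FastAPI endpoint; the model file name and the input fields are hypothetical placeholders.

```python
# Minimal model-serving sketch with FastAPI (illustrative only).
# The model file name and the input fields are hypothetical placeholders.
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical pre-trained pipeline

class PredictionRequest(BaseModel):
    age: float
    income: float
    country: str

@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    # Build a one-row DataFrame so a scikit-learn pipeline sees named columns.
    row = pd.DataFrame([{"age": request.age, "income": request.income, "country": request.country}])
    prediction = model.predict(row)
    return {"prediction": prediction.tolist()}

# Run locally with, for example: uvicorn main:app --reload
```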
What are some common challenges when implementing MLOps?
Because MLOps is a complex endeavour, it comes with many potential challenges, but here I would like to focus on aspects related to Data. After all, Data is one of the most important things; as Sherlock Holmes would say: «Data! data! data! (…) I can’t make bricks without clay.»
For several reasons, it is not trivial to have a good enough dataset for developing, training and testing an ML model. For example, it might not be large enough, it might not have enough variety (e.g. think of a classification problem with very unbalanced, underrepresented classes), or it might not have enough quality (very dirty data, from different sources, with different data formats and types, plenty of missing values or inconsistent values, e.g. {“gender”: “male”, “pregnant”: True}).
Another issue with Data is having the right to access it. For confidentiality or legal (e.g. GDPR) reasons, it might not be possible to move data out of a company server, or out of a specific country (e.g. financial information that cannot be exported); this limits the kinds of technology or infrastructure that can be used, and deployment to the cloud can be hindered (or outright forbidden). In other cases only a very small curated subset of data can be accessed by humans, and all other data are machine-readable only.
What is a tool or technology that you consider to be very interesting for MLOps but might not be completely widespread yet?
This might be the Data Scientist in me talking, but I would say a Feature Store. You surely know about Feature Engineering, which is the process of extracting new features or information from raw data: for instance, given a date, e.g. May 5th, 1821, you compute and add the corresponding weekday, Saturday. This might be useful if you are trying to predict the electricity consumption of a factory, since factories are often closed on Sundays and holidays. Therefore, when working on a Machine Learning model, one takes raw data and transforms it into curated data, with new information/features and organised in the right way. A feature store is a tool that allows you to save and store all these features.
In this way, when you want to develop new versions of your ML model, or a different ML model using the same data sources, or when different teams are working on different projects with the same data sources, you can ensure data consistency.
Moreover, the preprocessing of raw data is automated and reproducible: for example, anyone working on the project can retrieve curated data (the output of feature engineering) computed on a specific date (e.g. the average of the last 30 days relative to the date of computation) and be sure the result is consistent.
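As a rough sketch of the weekday example and of persisting curated features, the snippet below shows the basic idea; the data and file name are illustrative, and a real feature store (e.g. Feast or a similar tool) adds versioning, point-in-time correctness and online serving on top of this.

```python
# Minimal sketch of feature engineering and of storing the result (illustrative only).
# A real feature store adds versioning, point-in-time correctness and serving on top.
import pandas as pd

raw = pd.DataFrame({
    "timestamp": ["1821-05-05", "2024-12-25"],
    "consumption_kwh": [120.0, 45.0],   # hypothetical electricity readings
})

features = raw.copy()
features["timestamp"] = pd.to_datetime(features["timestamp"])
features["weekday"] = features["timestamp"].dt.day_name()       # e.g. "Saturday"
features["is_weekend"] = features["timestamp"].dt.dayofweek >= 5

# Persist the curated features so every team member retrieves the same result.
features.to_parquet("consumption_features_2024-12-31.parquet")  # hypothetical file name
```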
Before we wrap up, do you have any tips or tricks of the trade to share?
I would mention three things that I find helpful in my experience.
My first suggestion is to use a standardised structure for all your Data Science projects. This makes collaboration easier when several people are working on the same project and also when new people are added to an existing project. It also helps with consistency, clarity and reproducibility. From this perspective I like using Cookiecutter Data Science.
Another suggestion is using MLflow (or a similar tool) for packaging your ML model. This makes your model readily available through APIs and easy to share. And finally, I would recommend having a robust CI/CD (Continuous Integration and Continuous Delivery) pipeline in place. In this way, once you push your model artefacts to production, the model is immediately live and available. And you can look at your model running smoothly and be happy about a job well done.
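As a rough sketch, logging and packaging a model with MLflow might look like the following; the experiment name, parameters and model are placeholders rather than a prescribed setup.

```python
# Minimal sketch of packaging a model with MLflow (illustrative only).
# The experiment name, parameters and model are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

mlflow.set_experiment("demo-experiment")   # hypothetical experiment name
with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")

# The logged model can then be served, for example with the MLflow CLI:
#   mlflow models serve -m runs:/<run_id>/model
```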
Main Author: Dr. Alessandro Conflitti, PhD in Mathematics at the University of Rome Tor Vergata & Head of Data Science @ Radicalbit (a Bitrock sister company).
Interviewed by Luigi Cerrato, Senior Software Engineer @ Bitrock