Overview of MLOps

MLOps is emerging as a way to tame the complexity of the machine learning lifecycle. Just as DevOps transformed software development, MLOps is doing the same for machine learning.

Google and Microsoft define MLOps as follows:

Google: MLOps is a set of standardized processes and technology capabilities for building, deploying, and operationalizing ML systems rapidly and reliably.

Microsoft: MLOps, or DevOps for machine learning, enables data science and IT teams to collaborate and increase the pace of model development and deployment via monitoring, validation, and governance of machine learning models.

Although there are similarities between DevOps and MLOps, in that both involve a set of best practices, processes, tools, and philosophy, the latter is significantly more complicated.

For starters, there are problems specific to ML that must be addressed from the outset, such as data drift. Data drift occurs when the data being fed into a model changes relative to the data it was trained on, which can degrade predictions and require the model to be retrained. Teams must therefore develop repeatable and reproducible pipelines that keep working when the data changes. Engineers who prefer a do-it-yourself approach can build such pipelines from several open-source software products. Alternatively, there are MLOps platforms that cover the whole gamut from start to finish.
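As a rough illustration of how a pipeline might notice data drift, the sketch below compares one numeric feature in the training data against recent production data using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distributions). The function names, the sample values, and the 0.3 threshold are assumptions made for this example, not part of any particular MLOps tool. A real pipeline would more likely rely on a statistics library and a threshold tuned for each feature.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Return the two-sample Kolmogorov-Smirnov statistic.

    This is the largest absolute gap between the empirical CDFs of the
    two samples: 0.0 means identical distributions, 1.0 means no overlap.
    """
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The gap between two step functions can only change at observed values,
    # so checking every observed value is enough to find the maximum.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def needs_retraining(reference, current, threshold=0.3):
    """Flag drift when the KS statistic exceeds an (assumed) threshold."""
    return ks_statistic(reference, current) > threshold


# Hypothetical feature values: what the model was trained on vs. what it sees now.
training_values = [20, 22, 25, 27, 30, 31, 33, 35, 38, 40]
similar_values = [21, 23, 24, 28, 29, 32, 34, 36, 37, 39]
shifted_values = [45, 47, 50, 52, 55, 56, 58, 60, 63, 65]

print(needs_retraining(training_values, similar_values))
print(needs_retraining(training_values, shifted_values))
```

In a pipeline, a check like this would typically run on a schedule against each monitored feature, with a positive result triggering the retraining job.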

Development and Training

ML development and training (experimentation) is the most time-consuming part of the lifecycle. Data must be prepared and cleansed, and features must be selected (feature engineering). After that, algorithms must be chosen, hyperparameters tuned, and hundreds or thousands of experiments run. Only once that is accomplished are models deployed and monitored.
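To make the experimentation loop concrete, here is a minimal sketch of a grid search: every combination of hyperparameters is one experiment, each experiment is recorded with its validation error, and the best run is picked at the end. The toy model (a single weight fitted by gradient descent), the data, and the grid values are all assumptions chosen to keep the example self-contained. Real projects would train actual models and log runs to an experiment tracker.

```python
from itertools import product


def train_linear(data, learning_rate, epochs):
    """Fit y ~ w * x by gradient descent on mean squared error."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w


def mse(w, data):
    """Mean squared error of the one-weight model on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grid_search(train, valid, grid):
    """Run one experiment per hyperparameter combination, best run first."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        w = train_linear(train, **params)
        # Record each run so experiments stay comparable and reproducible.
        results.append({"params": params, "val_mse": mse(w, valid)})
    return sorted(results, key=lambda run: run["val_mse"])


# Hypothetical data generated from y = 3x, split into train and validation sets.
train = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]
valid = [(x, 3.0 * x) for x in (4.0, 5.0)]
grid = {"learning_rate": [0.001, 0.01, 0.05], "epochs": [10, 100]}

experiments = grid_search(train, valid, grid)
best = experiments[0]
print(len(experiments), "experiments run; best params:", best["params"])
```

Even this tiny grid produces six runs, which hints at how quickly the number of experiments grows once more hyperparameters, algorithms, and feature sets are involved.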

There are many steps in the end-to-end process. Dozens of startups have entered the landscape, creating open-source tools that cater to each step in the lifecycle. There are also MLOps platforms from AWS, Google, and Azure that take care of the entire lifecycle, gluing together and automating the different open-source products required for each task.

The popular Google paper “Hidden Technical Debt in Machine Learning Systems”, which is quoted regularly by players in the ML industry, explains the ML lifecycle based on the authors' experience. Google has been running sophisticated ML models for a number of years, and the paper describes best practices and pitfalls to avoid. In fact, the paper has been so influential in the industry that startups are popping up to cater to specific areas it describes.

In another white paper, titled “Practitioners guide to MLOps”, Google explains that building ML competency within an organization involves three different areas: data engineering, ML engineering, and app engineering.

Data Engineering: Preparing and curating the data
ML Engineering: Building ML-enabled systems
App Engineering: Monitoring KPIs