The machine learning lifecycle: How to build robust ML systems
Complex network infrastructure systems are great candidates for the application of machine-learning-based solutions for monitoring and management. Consider a machine learning (ML) service developed to forecast outages in a cellular network. When the network operator receives an alert of a potential upcoming outage, they can take steps to proactively mitigate problematic behavior, before it begins to affect customers.
The service is developed with the assistance of the data engineering team, which builds the underlying data pipelines that ingest raw streams of network performance metrics and load them into an ML-optimized database. The data science team then performs the initial data exploration, feature engineering, ML modeling, and hyperparameter optimization. Together they deploy a production-ready ML service. As is often the case, the ML model performs well for several months. Predictions are made at the expected accuracy, network operators are able to quickly resolve network issues, and customers are happy. But in some cases, for reasons we will soon discuss, the quality of the predictions may slowly degrade. Data scientists may retrain the model on newer data and retune the hyperparameters, but performance may not improve.
In a more severe case, there could be a complete ML system outage. How do engineers debug the ML system and identify the root cause of the performance degradation and eventual outage? How do they develop ML monitoring tools to proactively maintain high-quality, high-availability machine learning systems in modern telecommunications networks?
Understanding performance issues through monitoring
The potential root causes of these failures are varied and include both ML-specific reasons, related to changes in the statistical distributions of the underlying data, and non-ML reasons, related to the more general challenges of operating data-intensive applications. In Figure 1, we compare potential reasons for failures across three categories: ML-specific data science reasons, ML-specific engineering reasons, and more general engineering challenges.
ML models are theoretically rooted in statistics and implicitly assume that the statistical distribution of the training data and inference data are the same. When an ML model is trained, its internal parameters are optimized so as to maximize the expected performance on the training dataset. As a result, if the ML models are applied to datasets with different characteristics, their performance may be sub-optimal. Due to the dynamic environment that ML models operate in, it is typical for data distributions to shift over time. In cellular networks, this shift can occur over the course of months, as new base stations are installed and upgraded. But in some cases, for example during a conference or large concert, the shift can occur within hours.
By implementing processes for detecting this statistical distribution drift, as well as outliers and anomalous data points, teams can identify and address these issues before they negatively impact performance and the end-user experience.
The systems engineered to deploy ML models also require unique forms of risk mitigation. Due to the strong dependence of ML models on the underlying data, and the multi-tenant environment that ML infrastructure occupies, additional data management and monitoring is necessary. The data that ML models ingest from various data stores and data lakes, which are often created and maintained by other teams, must be constantly monitored for changes that may have unintended effects on ML model outputs. Furthermore, to ensure issues can be quickly identified and resolved, it’s necessary to maintain informative logs of data and model versions. These engineering challenges are further complicated by the general challenges of deploying and scaling distributed systems. In the upcoming sections, we will discuss these ML-specific challenges in detail.
Preventing machine learning failures with data monitoring
ML models, due to their strong dependence on input data, have strict constraints and expectations of the data format. For example, a model trained on categorical data, such as a set of zip codes, may not produce reliable predictions when new zip codes are encountered. Similarly, a model trained on temperature data represented in Celsius may produce incorrect predictions if the input data is represented in Fahrenheit. These subtle changes in the data often occur silently and can lead to performance degradation. For these reasons, additional ML-specific schema validation is advisable.
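As a sketch of what such ML-specific schema validation might look like, the following example checks incoming records against the categories and value ranges observed at training time. The field names, allowed zip codes, and Celsius bounds are all illustrative assumptions, not taken from any particular production system:

```python
# Illustrative training-time schema: field names, allowed categories,
# and value ranges below are hypothetical examples.
TRAINING_SCHEMA = {
    "zip_code": {"type": str, "allowed": {"07302", "10001", "94103"}},
    "temperature": {"type": float, "range": (-40.0, 60.0)},  # Celsius
}

def validate_record(record, schema):
    """Return a list of human-readable violations (empty if the record is valid)."""
    errors = []
    for field, spec in schema.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
            continue
        if not isinstance(value, spec["type"]):
            errors.append(f"{field}: expected {spec['type'].__name__}")
            continue
        if "allowed" in spec and value not in spec["allowed"]:
            errors.append(f"{field}: unseen category {value!r}")
        if "range" in spec and not spec["range"][0] <= value <= spec["range"][1]:
            errors.append(f"{field}: {value} outside expected range")
    return errors
```

A record with a plausible Celsius reading passes, while a Fahrenheit reading such as 98.6 lands outside the expected Celsius range and is flagged, catching exactly the kind of silent unit change described above.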
Another unique failure mode arises from so-called missing data. Data can sometimes be missing due to measurement failures, for example a faulty temperature sensor on a device or a message lost in transmission. In this case, the data exists in reality, but there is no record of it. In such a scenario it may still be possible for the ML model to produce reliable predictions by imputing reasonable values for the missing data points. However, in other scenarios, in which the data is missing because it simply does not exist, we cannot impute the missing data.
For example, during a time interval in which no phone calls were made, it is not possible to calculate the average length of a call. Inappropriately filling in such data points can lead to less reliable predictions. It’s important to note that not all ML algorithms are robust to missing data, and that the abrupt appearance of large amounts of missing data may indicate problems in downstream data pipelines. It is therefore advisable to monitor for the presence of unexpected missing data.
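A minimal sketch of such monitoring simply tracks the fraction of missing values per field and compares it against a historical baseline. The field name and the 5 percent alert threshold are illustrative assumptions to be tuned per feature:

```python
def missing_fraction(records, field):
    """Fraction of records in which `field` is absent or None."""
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)

def check_missing(records, field, baseline=0.0, alert_threshold=0.05):
    """Alert when the missing fraction exceeds the historical baseline
    by more than `alert_threshold` (an illustrative default)."""
    return missing_fraction(records, field) - baseline > alert_threshold
```

An abrupt jump in this fraction is often the first visible symptom of a broken upstream pipeline, well before model metrics degrade.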
The presence of outliers (i.e., unusually low or high numerical values) and anomalies (i.e., unusual ranges or combinations of values) may indicate issues with upstream data pipelines. The presence of outliers and anomalies in training data can lead to poor, biased models, and in inference data can lead to incorrect predictions. Fortunately, if proactively detected, outliers and anomalous data points can be filtered out before they degrade performance.
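As one sketch of such filtering, the following uses a simple z-score rule. The threshold of 3 standard deviations is a common but illustrative choice; median- and MAD-based rules are more robust when the data itself contains extreme values:

```python
import statistics

def filter_outliers(values, z_threshold=3.0):
    """Split values into (kept, flagged) using a simple z-score rule."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:  # all values identical: nothing to flag
        return list(values), []
    kept, flagged = [], []
    for v in values:
        (flagged if abs(v - mean) / std > z_threshold else kept).append(v)
    return kept, flagged
```

In a monitoring context, the size of the flagged set itself is worth tracking: a sudden spike in flagged points usually indicates an upstream data issue rather than a genuine change in the network.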
Measuring differences between statistical distributions
A common cause of performance decline is the gradual drift between the training and inference data distributions, often referred to as data drift (the related term concept drift refers to changes in the relationship between inputs and outputs). This can occur in the form of a shift in the mean and standard deviation of numerical features (Figure 2a). For example, over time the number of connection attempts to a cellular base station may increase as an area becomes more populated. Several approaches, summarized in Figure 2, can be utilized to measure this statistical drift: the Kolmogorov-Smirnov (KS) test, the Kullback-Leibler (KL) divergence, and comparison of summary statistics.
The KS test (Figure 2b) evaluates the equality of two probability distributions. It is calculated by computing the empirical cumulative distribution functions of the two datasets, identifying their maximum point-wise difference, and deriving the corresponding p-value via the Kolmogorov distribution. This approach enables statistically principled decisions based on pre-decided acceptable bounds for the p-value. Note, however, that because the statistic captures only the maximum point-wise difference, and not broader cumulative differences between the distributions, it can miss subtler forms of drift.
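In practice one would typically call a library routine such as scipy.stats.ks_2samp, which also returns the p-value. The statistic itself is simple enough to sketch in pure Python:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Maximum point-wise difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in set(a) | set(b):  # every observed value is a candidate for the max
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

The statistic ranges from 0 (identical empirical distributions) to 1 (completely disjoint supports), and values from two recent data windows can be compared against a tuned alert threshold.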
An alternative method is to calculate the KL divergence (Figure 2c), or the related Jensen-Shannon divergence, and reject values above a pre-decided threshold. Compared to the KS test, this approach is more holistic in that it takes the entire distribution into account. However, there is no associated p-value, so some interpretability and principled decision making are lost.
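A sketch over binned (histogram) probabilities, assuming both histograms share the same bins, might look as follows:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins (in nats)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric variant, bounded above by ln 2, built from KL divergences
    against the midpoint distribution m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

The Jensen-Shannon form is often preferred for monitoring because the raw KL divergence is undefined whenever the inference histogram has mass in a bin that the training histogram does not.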
A simpler approach is to directly compare dataset summary statistics (Figure 2d): for example, calculate the mean and standard deviation of the training and inference datasets, and measure the similarity between them. This approach is computationally efficient, and fast streaming algorithms for these summary statistics already exist. However, it can be more prone to false positives and false negatives, due to its relative lack of statistical grounding.
Preventing engineering failures in ML systems
By designing a machine learning system that explicitly includes data monitoring and model observability tools, we can mitigate the risk of ML performance degradation. In Figure 3 we summarize these tasks, which can be logically grouped together based on their stage in the ML workflow.
In the data pipeline stage (green) sit tasks involving data monitoring and ML-specific validation. To aid in these tasks, the software community has built various open-source tools for data version control (e.g. DVC), metadata (e.g. Amundsen, DataHub, MLMD), and validation (e.g. Great Expectations, deequ, TFDV).
At the ML model stage (orange and blue) sit tasks for tracking and registering different versions of ML models (e.g. MLflow, ModelDB, TFX), and the infrastructure for serving them to the end users (e.g. KF Serving, TF Serving, Seldon Core). These tasks all ultimately sit within the larger computing infrastructure (purple) comprised of workflow managers (e.g. Airflow), container orchestration tools (e.g. Docker Swarm, Kubernetes), virtual machines, and other cluster management tools.
Versioning and tracking of data and ML models
Because organizational data pipelines can be very long and complex, with individual components owned by different teams, with different goals and obligations, informative data versioning and provenance is imperative for speedy troubleshooting and root cause analysis. If ML performance issues are caused by unexpected changes to data schemas, unexpected changes to feature generation, or errors in intermediate feature transformation steps, historical records prove to be an important tool in pinpointing when the issue first appeared, what data it affected, and which inference results are potentially impacted.
For example, the first step in a data generation process may consist of recording certain metrics relating to quality of service at a cellular base station. This raw data may be transmitted in a streaming manner and stored in a database. In the second stage, this raw data may be filtered, additional derived metrics may be calculated, and results may be stored in a separate data lake. Afterwards, additional ML-specific feature engineering may be performed, and results may be stored in yet another database. Along this pipeline, various data quality validation tests may be performed. A record of the results of these tests, at various stages of the workflow, can make it easier for ML teams to identify unexpected issues in data pipelines.
Of particular interest to ML operations is the phenomenon of training-serving skew, which refers to differences between the data used for training and the data seen at inference. These discrepancies can arise for various reasons. The computational demands of training and inference can be quite different, and are therefore sometimes optimized in ways that lead to different outputs. Training, for example, is often performed as a distributed batch job on heavily preprocessed datasets. Inference, on the other hand, is often performed on streaming data that is processed on the fly. Engineered features that depend on aggregate measures of recent data, such as rolling averages, are particularly at risk, and can be monitored to make sure the training and serving values do not diverge in unexpected ways.
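To make the rolling-average example concrete, here is a sketch of the same feature computed two ways: offline in batch (as at training time) and incrementally (as at serving time). In a real pipeline these two implementations would likely be owned by different teams, and a periodic equality check like the one in the usage note can catch silent skew between them; all names here are illustrative:

```python
from collections import deque

def batch_rolling_mean(values, window):
    """Rolling mean over the trailing `window` values, computed offline in batch."""
    return [sum(values[max(0, i - window + 1): i + 1]) / min(i + 1, window)
            for i in range(len(values))]

class StreamingRollingMean:
    """The same feature computed incrementally at serving time."""
    def __init__(self, window):
        self.buffer = deque(maxlen=window)

    def update(self, value):
        self.buffer.append(value)
        return sum(self.buffer) / len(self.buffer)
```

A skew monitor can periodically replay a sample of serving traffic through the batch implementation and assert that the two outputs agree within a small tolerance.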
In a similar manner to data versioning, there may be many versions of ML models. These versions may simply be retrained on different datasets, but may also have different hyperparameters, or different code altogether. If issues with ML model outputs are reported, it is useful to have a record of which model version is associated with the problematic behavior, which hyperparameters were used, and on what dataset it was trained. A so-called model registry, which stores serialized models and their associated metadata, makes this task easier.
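As a simplified illustration of the idea (a production registry such as MLflow's persists entries to durable storage and adds features like stage transitions; everything below is an in-memory sketch with hypothetical names):

```python
import hashlib
import time

class ModelRegistry:
    """Minimal in-memory model registry: serialized model bytes plus metadata."""
    def __init__(self):
        self._models = {}

    def register(self, name, model_bytes, hyperparams, dataset_id):
        """Store a new version and return its auto-incremented version number."""
        versions = self._models.setdefault(name, [])
        entry = {
            "version": len(versions) + 1,
            "checksum": hashlib.sha256(model_bytes).hexdigest(),
            "hyperparams": hyperparams,
            "dataset_id": dataset_id,
            "registered_at": time.time(),
            "model": model_bytes,
        }
        versions.append(entry)
        return entry["version"]

    def get(self, name, version=None):
        """Latest version by default, or a specific one for troubleshooting."""
        versions = self._models[name]
        return versions[-1] if version is None else versions[version - 1]
```

When problematic predictions are reported, the dataset identifier and hyperparameters attached to the offending version give engineers an immediate starting point for root cause analysis.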
Integrating machine learning systems with existing infrastructure
Finally, the ML system needs to be well integrated into the existing technical infrastructure and business ecosystem. ML-oriented databases and streaming services may need to be configured for ML-optimized queries and load balancing services may need to be employed to maintain high availability and robustness. It is now common practice to utilize microservice architectures, rooted in containers and virtual machines, to deploy ML models. Virtual machines enable dynamic resource scaling in response to varying computational demands, which can range from CPU-intensive batch-like processes during training, to memory-intensive stream-like processes during inference. Each individual step of the data pipeline and ML workflow can be enclosed within its own container, where it can be safely built and maintained by different teams. Orchestration tools such as Docker Swarm and Kubernetes aid in this container management, while tools such as Airflow and Kubeflow coordinate ML workflows.
Challenges in massive data streaming scenarios
In future network architectures the usage of ML-based technologies may become very widespread. At such scale, a massive amount of streaming data may be collected and stored, and the usual algorithms to assess data quality and distribution drift may no longer be computationally efficient. The underlying algorithms and processes may need to be modified. Furthermore, it is likely that future architectures will have an increased movement of computation away from a centralized system, and towards the outer edge, closer to the end consumers. This enables reduced latencies and network traffic, at the cost of a more complex architecture with novel engineering challenges and concerns. In such situations, depending on local governmental regulations, there may be tighter limitations on data collection and data sharing, necessitating more careful approaches towards training and serving ML models in a secure, federated manner.
In writing this post, I would like to thank Wenting Sun, Xuancheng Fan, Kunal Rajan Deshmukh, and Zhaoji Huang for their contributions in related work, and Zeljka Lemaster for proofreading and revision suggestions.
The ML deployment and life cycle management process is an exciting journey filled with many subtleties and complexities. The industry is in a state of constant flux, with new packages and startups emerging to fill needs as they arise over time. For more information about the machine learning lifecycle management processes employed in the industry, please view the conference talks in OpML 2020, as well as the papers Data Validation for Machine Learning and Monitoring and Explainability of Models in Production.
At the Ericsson Blog, we provide insight to make complex ideas on technology, innovation and business simple.