A look at automated fault management with machine learning

The digitization of industry and critical infrastructures, brought to life through virtualized cloud platforms, requires a new approach to fault management altogether. In this blog post, the second in our series about automated fault management, we take a good look at the latest machine learning techniques used to detect and predict faults in cloud systems.

Fetahi Wuhib

Senior researcher

Chunyan Fu

Experienced Researcher

Mbarka Soualhia

Postdoctoral fellow

Fault detection and fault prediction functions based on machine learning techniques are an integral part of a modern automated fault management system. As we argued in our previous post, automating fault detection for management systems using ML, machine learning techniques play an important role in automating these functions. In this post, we describe how different machine learning techniques can be applied in automated fault management systems both to detect faults and anomalies and to predict faults that will eventually occur.

Components of a fault management system

The figure below captures the key functions included in a fault management system and how they relate to each other. To get an overview of each function, we recommend that you read our previous post. In this post, we delve a little more into the specifics of the various techniques.

[Figure: Fault detection architecture]

Basic machine learning components

The two major types of machine learning – supervised and unsupervised learning – have different applications and, as such, address different aspects of problems that are faced by today's fault detection methods.

Most machine-learning-based solutions share the same set of core components: a data component that serves the data used to train and evaluate the machine learning model, a preprocessing component that adapts the data into a format the models can use, and the training/inference pipelines where the actual machine learning magic happens.


Like most machine learning solutions, successful fault detection and prediction require a large amount of data on which to train or fit the models. Such data may already exist as historical monitoring data in most systems with functioning monitoring components. However, it is important to avoid the pitfalls of a human-knows-best approach in selecting metrics to monitor. Specifically, the goal of a monitoring system for ML-based techniques should be to collect as many metrics as possible, as frequently as possible, for as long as possible, while keeping the impact on the monitored system minimal.

There is well-known software for collecting monitoring data and storing the collected metrics. When it comes to storage, we found that time series databases (TSDBs) such as Prometheus or InfluxDB perform much better at storing and retrieving monitoring data than traditional databases. When it comes to collecting metrics, collectd is probably the oldest and most stable option, with support for a large number of systems and applications. However, much better performance can be achieved by using monitoring agents specific to the TSDB in use (e.g. node-exporter or cAdvisor with Prometheus, or Telegraf with InfluxDB). In our test bed, we have deployed the Prometheus TSDB with node-exporter and cAdvisor agents, where node-exporter collects more than 1000 metrics every 10 seconds from each host, while cAdvisor does the same for more than 4000 metrics from each Kubernetes node at the same frequency.
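As a rough illustration of what reading monitoring data back from such a TSDB involves, the sketch below parses a payload in the shape returned by the Prometheus HTTP API range-query endpoint (`/api/v1/query_range`). The sample payload, the metric and instance names, and the helper `parse_range_response` are assumptions made for illustration; this is not the code used in our test bed.

```python
import json

# Hypothetical payload in the shape of a Prometheus range-query response
# (resultType "matrix"). A real payload would come from /api/v1/query_range.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {"__name__": "node_load1", "instance": "host-1:9100"},
        "values": [[1700000000, "0.52"], [1700000010, "0.61"], [1700000020, "0.58"]]
      }
    ]
  }
}
"""


def parse_range_response(payload):
    """Return {(metric name, instance): [(timestamp, value), ...]} with float values."""
    doc = json.loads(payload)
    if doc.get("status") != "success":
        raise ValueError("query did not succeed")
    series = {}
    for item in doc["data"]["result"]:
        labels = item["metric"]
        key = (labels.get("__name__", ""), labels.get("instance", ""))
        # Sample values arrive as strings, so they are converted explicitly.
        series[key] = [(float(ts), float(val)) for ts, val in item["values"]]
    return series


series = parse_range_response(SAMPLE_RESPONSE)
```

Each series ends up as a list of numeric (timestamp, value) pairs keyed by metric and host, which is a convenient starting point for further processing.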


The data stored in a TSDB is often not directly usable for machine learning models. Rather, it has to be preprocessed before it can be used to train/fit the models. This often starts with a data conversion step in which the metrics retrieved from the TSDB, often in human-readable JSON or XML formats, are converted into a numeric format that is suitable for the analytics/ML software. Next, one or more data preprocessing steps are applied, depending on the data itself and the requirements of the models to be trained. The most common ones include:

  • data synchronization where metrics collected from/through several agents are aligned in time with each other
  • data cleansing where unnecessary data is removed (e.g. unusable metrics, such as non-numeric or ephemeral data, empty samples, etc.) and missing data is generated (e.g. imputation of missing data through interpolation)
  • gaugification where counter-type metrics (metrics that increase all the time) are converted to gauge-type metrics (metrics whose values can both increase and decrease), through e.g. the process of differencing
  • normalization where the values of the metrics are scaled such that all metrics have comparable magnitudes (e.g. through min-max normalization and standardization)
  • feature selection where the relevant metrics are identified for use in training the models
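Three of the steps above (imputation through interpolation, gaugification through differencing, and min-max normalization) can be sketched in a few lines. This is a minimal sketch under simplifying assumptions: samples are already aligned in time, missing values are marked with `None`, and counter resets are not handled. The function names and the sample values are hypothetical.

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest known samples."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no usable samples")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:        # leading gap: copy the first known value
            filled[i] = values[right]
        elif right is None:     # trailing gap: copy the last known value
            filled[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled


def gaugify(counter):
    """Turn a counter into a gauge by differencing consecutive samples."""
    return [b - a for a, b in zip(counter, counter[1:])]


def min_max_normalize(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw_counter = [100, 130, None, 190, 200]    # hypothetical counter samples
filled = interpolate_missing(raw_counter)   # [100, 130, 160.0, 190, 200]
gauge = gaugify(filled)                     # [30, 30.0, 30.0, 10]
scaled = min_max_normalize(gauge)           # [1.0, 1.0, 1.0, 0.0]
```

Note that differencing shortens the series by one sample, so any labels or other metrics used alongside it have to be trimmed accordingly.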

Of all the above steps, the last is perhaps the most technical one, in the sense that there are several different techniques to apply for different applications. In general, feature selection allows models to be trained with only a subset of the metrics from the original dataset, making the training process faster and its resource requirements lower. If properly done, it also improves the accuracy of the model by removing noise from the training data. In our experiments, we were able to reduce the number of features of one dataset by an order of magnitude using a process called recursive feature elimination. For another dataset, we were able to reduce the number of features by a factor of 6 using principal component analysis (PCA), while the trained models still maintained a classification accuracy of more than 99%.

Training and inference pipelines

Once the data is (pre)processed, it is put to use through two pipelines: a training pipeline and an inference pipeline. The training pipeline takes the data in bulk, uses it to train/fit a model and makes this model available to the inference pipeline. The inference pipeline then uses the trained model to evaluate real-time samples of the monitored metrics.
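The hand-off between the two pipelines can be sketched as follows. The nearest-centroid classifier used here is a deliberately simple stand-in chosen to keep the example short; it is not one of the models we evaluated, and the function names, the JSON serialization and the sample values are all assumptions made for illustration.

```python
import json
import math


def train_pipeline(samples, labels):
    """Bulk step: fit one centroid per label and return the serialized model."""
    sums, counts = {}, {}
    for sample, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(sample))
        for i, value in enumerate(sample):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {
        label: [total / counts[label] for total in acc]
        for label, acc in sums.items()
    }
    return json.dumps(centroids)


def inference_pipeline(serialized_model, sample):
    """Real-time step: classify one sample using the published model."""
    centroids = json.loads(serialized_model)
    return min(centroids, key=lambda label: math.dist(centroids[label], sample))


# Hypothetical normalized (cpu, disk) samples with their labels.
train_samples = [[0.20, 0.30], [0.30, 0.20], [0.90, 0.80], [0.95, 0.90]]
train_labels = ["normal", "normal", "fault", "fault"]

model = train_pipeline(train_samples, train_labels)
verdict = inference_pipeline(model, [0.85, 0.90])
```

Passing the model between the pipelines as a serialized artifact keeps the two loosely coupled, so the resource-heavy training step can run on different hardware and on a different schedule than inference.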

Depending on the trained model, the inference results vary. When applied to fault detection, for instance, the inference result may indicate whether or not the sample of the monitored metrics points to the existence of a specific type of fault. In the following section, we will describe some of the models we investigated, and the performance results we obtained.

In general, the training pipeline is executed only once. However, in situations where the characteristics of the managed system change over time, the process needs to be executed again. This pipeline is also the more resource-intensive of the two, and it is the one that benefits most from AI accelerator hardware, such as Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs).

ML-based fault detection

The two major types of machine learning – supervised and unsupervised learning – have different applications when used in the context of our architecture and as such address different aspects of problems that are faced by today's fault detection methods.

Supervised fault detection

In supervised learning, the machine learning model is trained with samples of the state of the system (i.e. the dataset). In addition to the samples, the training process also needs a 'label' that indicates whether or not each sample represents a faulty state. Once the model is trained, it can be used to classify the state of the system (fault or no fault) from a sample of the state of the live system. If trained with appropriate data, common machine learning methods such as Support Vector Machines (SVM), Random Forest (RF) or (deep) Neural Networks (NN) can perform well for this purpose. In general, SVM and RF are simpler and hence faster to train, while NN can generally perform better when used with a dataset with a large number of features, at the expense of being slower and more resource-intensive.

In our testbed, using SVM, RF and NN models, we were able to detect non-fatal CPU and HDD overload faults with an F1-score of more than 95%. However, the average training time for the models varied widely when training with 1200 samples, with SVM fastest at around 0.17 sec, RF at 6.19 sec and NN slowest at 102 sec.
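For readers less familiar with the metric, the F1-score is the harmonic mean of precision and recall for the 'fault' class. The sketch below computes it from lists of actual and predicted labels; the labels shown are made up for illustration and are not results from our testbed.

```python
def f1_score(actual, predicted, positive="fault"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)   # share of raised alarms that were real faults
    recall = tp / (tp + fn)      # share of real faults that raised an alarm
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
actual = ["fault", "fault", "fault", "ok", "ok", "ok", "ok", "fault"]
predicted = ["fault", "fault", "ok", "ok", "ok", "fault", "ok", "fault"]
score = f1_score(actual, predicted)   # precision 0.75, recall 0.75, F1 0.75
```

Because faults are rare, the F1-score is more informative here than plain accuracy, which a model could inflate simply by always predicting 'no fault'.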

Unsupervised fault detection

In unsupervised learning, the model is also trained with samples of the state of the system, but without the 'labels' indicating what the actual state of the system is. Because of this, the model is not generally tied to a specific fault or fault characteristics. Rather, it is used as a vehicle for 'anomaly detection', which typically indicates that something is amiss in the monitored system and requires further investigation. For example, clustering techniques such as K-means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), or mixture methods such as the multivariate Gaussian mixture model, can be put to use. When such models are applied to the data collected from a fault-free system, they can identify the clusters to which fault-free states of the system belong. The state of a live system can then be evaluated against these models to classify the state as normal or anomalous.
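The idea of fitting a model on fault-free data and then scoring live samples against it can be sketched as follows. To stay short, the sketch uses a single Gaussian with a diagonal covariance as a simplified stand-in for the clustering and mixture models named above; the function names, the threshold rule (a margin over the worst fault-free score) and the sample values are assumptions.

```python
import statistics


def anomaly_score(profile, sample):
    """Sum of squared z-scores (squared Mahalanobis distance, diagonal covariance)."""
    return sum(
        ((x - m) / s) ** 2
        for x, m, s in zip(sample, profile["means"], profile["stds"])
    )


def fit_normal_profile(fault_free_samples, margin=1.5):
    """Fit per-metric mean and standard deviation on fault-free data only."""
    columns = list(zip(*fault_free_samples))
    profile = {
        "means": [statistics.fmean(col) for col in columns],
        # Guard against constant metrics, whose deviation would be zero.
        "stds": [statistics.pstdev(col) or 1e-9 for col in columns],
    }
    worst = max(anomaly_score(profile, s) for s in fault_free_samples)
    profile["threshold"] = margin * worst
    return profile


def is_anomalous(profile, sample):
    """Flag samples that score above anything seen in fault-free operation."""
    return anomaly_score(profile, sample) > profile["threshold"]


# Hypothetical normalized (cpu, disk) samples from a fault-free system.
FAULT_FREE = [[0.30, 0.20], [0.35, 0.25], [0.28, 0.22], [0.33, 0.18], [0.31, 0.21]]
profile = fit_normal_profile(FAULT_FREE)
```

A mixture model generalizes this by fitting several such components, which matters when normal operation has more than one mode, for example idle and busy periods.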

In contrast to the supervised learning models, which performed consistently well, we obtained mixed results when it comes to the accuracy of our unsupervised learning models. For instance, the multivariate Gaussian mixture model was able to detect about 99% of non-fatal CPU-overload faults but only about 60% of non-fatal HDD-overload faults, which we suspect is because the HDD-overload fault state is 'too similar' to that of a normally operating but highly loaded system.

Supervised vs unsupervised fault detection

Generally speaking, supervised fault detection methods have a much better performance than unsupervised methods in the sense that they have much better accuracy, precision and recall rates. They shine in scenarios where the faults are not easy to detect by other simpler methods. They typically also require a smaller amount of data to be trained with. However, this comes at the expense of requiring labelled data. For most systems, since faults are the exception and not the norm, it is generally difficult to get sufficient data to train the models. For some systems and faults though, it may be possible to artificially introduce the faults and generate more fault data for the training.

Despite the fact that unsupervised learning models need much more data than their supervised counterparts, it is actually easier to obtain training data for them since they just need access to the monitoring data of the system (we assume here that the system is free from faults, or data collected from a faulty system is removed). As unsupervised models are used for anomaly detection, on the positive side, this means that they are capable of detecting a wide range of faults (or performance issues, security incidents, etc.). On the negative side however, the alerts raised by such systems do not identify the root cause, and therefore need further investigation.

Automated fault prediction

As a key feature of automated fault management systems, fault detection enables cloud providers to react to faults once they have occurred. In highly-available (HA) systems, this may be acceptable as the fault's effect can be managed with minimal impact. However, in critical system components, where any kind of fault can have a much higher impact, fault prediction techniques become key. Accurate and timely prediction of faults can enable service providers to maintain their uptime without having to deal with costly service outages. Unfortunately, compared to fault detection, fault prediction is a much harder problem to tackle.

In order to predict a specific fault, e.g. CPU-overload in a system, the fault needs to be identified and 'labeled'. This can be achieved through detection and forecasting. Another factor to consider is that the fault has to be predictable, meaning that either there are observable system state changes (e.g. reflected by specific metric variances) leading up to the fault, or the fault occurs regularly following some pattern(s). This can be achieved through pattern recognition. In general, both cases can be translated into a time-series prediction problem, and we can use the data collected from the TSDB to train a prediction model.

There are many ways to carry out prediction. Classically, statistical methods such as AutoRegressive Integrated Moving Average (ARIMA) have been used to predict the next values of a time series by analyzing whether or not the series is 'stationary' (its values vary in a recognizable pattern around a fixed mean). If the series is not stationary, several techniques, such as differencing, can be applied to make it stationary. ARIMA works well when used with a single time series at a time. However, it suffers from scalability problems when used with several metrics. For instance, the model did not even converge with 20+ metrics as input in our setup.
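The simplest member of the ARIMA family makes the role of differencing concrete. The sketch below differences a trending series once and forecasts with the average step, which corresponds to a random walk with drift, i.e. ARIMA(0,1,0) with a constant. It is a toy illustration under that assumption, not the model configuration used in our setup, and the sample values are made up.

```python
def difference(series):
    """First-order differencing, a common way to remove a trend."""
    return [b - a for a, b in zip(series, series[1:])]


def forecast_with_drift(series, steps):
    """Random walk with drift: extend the last value by the mean step size."""
    diffs = difference(series)
    drift = sum(diffs) / len(diffs)
    return [series[-1] + drift * h for h in range(1, steps + 1)]


cpu_load = [40.0, 42.0, 45.0, 46.0, 50.0]   # hypothetical CPU load in percent
forecast = forecast_with_drift(cpu_load, 3)  # [52.5, 55.0, 57.5]
```

Comparing such a forecast against an overload threshold is one way to turn a predicted metric value into a predicted fault.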

Another method for carrying out prediction is to use NNs. Compared to statistical methods such as ARIMA, deep learning methods based on NNs are more scalable and can work with several metrics at the same time. Several types of Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM), are typically used in prediction problems as they are designed to work with sequences of data, allowing them to learn how the input features change over time. More recently, Convolutional Neural Networks (CNNs) have also been applied to prediction problems where the input data has an ordered relationship across the time steps of a time series. In our experiments, CNN and LSTM had comparable accuracy (e.g. 96.47% vs. 96.88% for the CPU-overload fault, or 85.52% vs. 88.73% for the network fault). However, the training times were vastly different, with CNN models trained an order of magnitude faster than LSTM models.
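Both LSTM and CNN models are usually trained on fixed-length windows of past samples rather than on the raw series. A minimal sketch of that framing step is shown below, assuming one fault flag per time step; the function name `make_windows`, the window and horizon parameters and the sample data are hypothetical.

```python
def make_windows(metrics, fault_flags, window, horizon):
    """Pair each window of past samples with the fault flag `horizon` steps after it."""
    inputs, targets = [], []
    for start in range(len(metrics) - window - horizon + 1):
        end = start + window                 # index just past the window
        inputs.append(metrics[start:end])
        targets.append(fault_flags[end + horizon - 1])
    return inputs, targets


# Hypothetical single-metric series (CPU utilization) and per-step fault flags.
metrics = [[0.20], [0.30], [0.50], [0.70], [0.90], [0.95]]
fault_flags = [0, 0, 0, 0, 1, 1]

inputs, targets = make_windows(metrics, fault_flags, window=3, horizon=2)
# inputs[0] covers steps 0-2 and is paired with the flag at step 4.
```

The horizon determines how early a fault can be announced: a longer horizon leaves more time to react but generally makes the prediction problem harder.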

Learn more

Read more about the basics of machine learning and the role it will play in supporting 5G systems.

Take a look at future technologies with Ericsson Research.

Chunyan Fu
Chunyan Fu is an experienced researcher in the Cloud Systems and Platforms research area at Ericsson Research. Her research activities include SDN fault monitoring and cloud resource scheduling, and she works on supervised machine learning for fault detection and prediction in cloud systems.
Mbarka Soualhia
Mbarka Soualhia is a Postdoctoral Fellow in the Cloud intelligence and Automation team.