Automating fault detection for management systems using ML
In an Industrial Cloud platform, where processes are business-critical and real-time cooperation between machines, software and humans is the norm, faults must be handled effectively and promptly. The same holds for telecom systems, where uptime is expected to exceed five nines (99.999%). The component of a management system that deals with faults is called Fault Management. In this post, we focus on fault detection, a specific function of Fault Management, and on how to automate it using Machine Learning.
Fault management and fault detection
Fault management is a functional area of systems management that relates to the detection, prediction, isolation (root cause analysis) and prevention of faults. Components of a fault management system include:
- A monitoring system which collects up to date information about the managed system
- A fault prediction component which anticipates faults before they occur
- A fault detection component which detects faults that are not possible to predict
- A root cause analysis component which identifies the cause of a fault when that cause is not obvious
- A fault prevention/recovery component which takes the steps needed to recover from a fault (when it has been detected) or to prevent it from occurring (when it has been predicted)
Fault detection today
In most of today's IT and telecom environments fault detection is done in one of three ways.
- Number one: Wait for users to report the faults
The simplest and worst way to detect faults is to have no automated system in place and wait for users to detect and report them. Though this is the worst decision an IT architect can make when it comes to detecting faults, it is unfortunately often made in smaller organizations, where IT cost is an issue and admins spend their time dealing with an incoming flow of user requests instead of improving their IT infrastructure. In such environments, beyond the interruptions and cost caused by the faults themselves, there is also an indirect impact on users' performance, since they cannot rely on the IT infrastructure.
- Number two: Use test suites
Another common way to detect faults in IT/telecom infrastructure is the use of test suites (such as those included in OpenStack Tempest or Nagios check plugins). Such test suites include a battery of automated tests that check the availability and functionality of a service by actually using it. In OpenStack Tempest, for instance, a test may involve creating a virtual machine from a specified image, starting it up, and checking whether the VM can be accessed over the network once it is running. Though such tests have very high accuracy (i.e., they detect all the faults they are designed to detect), they have several major drawbacks. First, they are often expensive to run: the tests consume resources that are (or would have been) used by users, which affects the performance of the IT system. In addition, running the tests takes considerable time and infrastructure (e.g., the default OpenStack Tempest test suite takes about an hour to finish). Finally, the tests have to be run periodically in order to detect faults continuously. Running them at a high frequency means considerable cost and a more severe performance impact, while running them at a lower frequency means that faults will not be detected in time.
- Number three: Use simple rules on monitored metrics to trigger alarms
A third common way to detect faults in IT infrastructures is to apply simple rules to monitored metrics and raise an alarm when a rule fires. Typically, these rules are specified in terms of a threshold and a function computed over a few metrics collected from the monitored system. (To avoid repeated alarms when the value oscillates around the threshold, one typically applies hysteresis: a higher threshold to raise the alarm and a lower one to clear it.) Such rules are implemented by the OpenStack Aodh service, by the warning and critical thresholds in Nagios monitoring configurations, and by triggers in Zabbix. The advantage of such rules is that they are very lightweight, so they can be evaluated frequently with little overhead and minimal impact on the performance of the running system. However, an alarm raised by such a rule does not necessarily indicate a fault; rather, it indicates an anomaly that may or may not be the result of a fault.
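To make the hysteresis idea concrete, here is a minimal sketch of such a rule in Python. It is not how Aodh, Nagios or Zabbix implement their alarms; the class name, the threshold values and the CPU samples are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HysteresisAlarm:
    """Threshold alarm with separate raise and clear levels (illustrative)."""
    raise_above: float   # the alarm is raised when the metric exceeds this
    clear_below: float   # the alarm is cleared when the metric drops below this
    active: bool = False

    def update(self, value: float) -> bool:
        """Feed one metric sample and return the resulting alarm state."""
        if not self.active and value > self.raise_above:
            self.active = True
        elif self.active and value < self.clear_below:
            self.active = False
        return self.active


def count_alarms(samples, raise_above, clear_below):
    """Count how many times the alarm goes from cleared to raised."""
    alarm = HysteresisAlarm(raise_above, clear_below)
    raised = 0
    for value in samples:
        was_active = alarm.active
        if alarm.update(value) and not was_active:
            raised += 1
    return raised


# Hypothetical CPU utilisation samples (percent) oscillating around 90.
cpu = [85, 91, 89, 92, 88, 93, 70, 95]

# One threshold for both raising and clearing: every crossing is a new alarm.
single = count_alarms(cpu, raise_above=90, clear_below=90)

# Raise above 90, clear only below 80: the oscillation is one alarm.
hysteresis = count_alarms(cpu, raise_above=90, clear_below=80)

print(single, hysteresis)
```

With these samples the single threshold raises four separate alarms, while the hysteresis rule raises two: one for the oscillation around 90 and one after the value has genuinely dropped and climbed again.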
Methods two and three are better than method one, but they have problems, some of which are discussed above. In addition, both methods require a lot of manual configuration (be it designing tests or identifying metrics and thresholds) to be of any use. This means that one needs deep knowledge of every layer of the managed system, from the lowest hardware levels up to the highest applications running on it. It is often very difficult to find a person with that level of expertise across all layers. It also implies that this person will spend a very long time identifying by hand which tests and metrics can be used to detect the faults of interest. This is a very difficult task given how frequently software changes in today's world of CI/CD.
Relying on a human to specify rules means that the fault detection system will only be able to detect the faults that person thought of, leaving the system wide open to new faults and to faults that were overlooked or outside the person's expertise. In addition, some faults are not directly detectable. Such faults typically arise in situations where users consume resources or services from providers that they do not administer. In a properly designed highly available (HA) system, for instance, the failure of a component is not visible to the end user. Another example is a non-fatal fault of a server, which typically is not visible to the VMs running on that server. As we will see in the following sections, Machine Learning (ML) based methods for detecting faults do not have these problems.
Fault detection tomorrow
ML-based fault detection methods have applications in several different areas and scenarios. There are two major types – supervised and unsupervised machine learning.
Supervised learning methods, in which a model is trained on several examples of the fault of interest, are particularly useful for detecting known faults that are difficult to detect, mainly because metrics directly associated with the faults are not accessible to the monitoring system. In such situations, the occurrence of a fault has to be deduced from its indirect effect on the metrics that can be monitored.
For example, in our lab testbed, it was possible to detect CPU overload on a Kubernetes worker node by looking only at operating system metrics on the Kubernetes master. However, the effort required to detect faults using supervised learning limits its use to situations where the detection problem itself is hard.
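To illustrate what deducing a fault from indirect metrics can look like, here is a minimal sketch of supervised detection in plain Python. The classifier (nearest centroid), the two metrics and all values are hypothetical and chosen only for illustration; a real deployment would more likely use an ML library and far more data.

```python
import math
from collections import defaultdict


def train_centroids(samples, labels):
    """Compute one mean vector (centroid) per label from labeled samples."""
    grouped = defaultdict(list)
    for sample, label in zip(samples, labels):
        grouped[label].append(sample)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, sample):
    """Return the label whose centroid is closest to the sample."""
    return min(centroids, key=lambda label: math.dist(centroids[label], sample))


# Hypothetical training data: two metrics observed on the master,
# (API response time in ms, scheduling delay in ms), each labeled with the
# state of the worker node at the time the sample was taken.
samples = [(12, 30), (14, 28), (11, 33), (45, 120), (50, 110), (48, 130)]
labels = ["normal", "normal", "normal", "overload", "overload", "overload"]

model = train_centroids(samples, labels)

print(predict(model, (13, 31)))    # close to the fault-free examples
print(predict(model, (47, 115)))   # close to the overload examples
```

The point of the sketch is the workflow rather than the model: labeled examples of the fault are needed up front, which is where most of the effort of the supervised approach goes.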
On the other hand, fault detection systems based on so-called unsupervised learning methods, in which a model is trained on the fault-free state of the system, can be used in all of today's IT and telecom systems, as they have much simpler requirements with respect to system expertise and the data needed for training the ML models. In addition, these models are very powerful in that a single model can be used to detect a very large number of faults. Another particularly attractive feature of fault detection systems based on unsupervised learning is their ability to detect unknown or unexpected faults. On the downside, unsupervised models typically have lower detection performance and require a large amount of data to train effectively.
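As a rough sketch of the unsupervised idea, the example below learns a baseline from fault-free samples only and flags anything that deviates strongly from it. A per-metric z-score is used here as a deliberately simple stand-in for a real ML model; the metrics, the values and the threshold of 3.0 are assumptions made for illustration.

```python
import statistics


def fit_baseline(normal_samples):
    """Learn per-metric mean and standard deviation from fault-free data."""
    columns = list(zip(*normal_samples))
    means = [statistics.fmean(column) for column in columns]
    # Fall back to 1.0 for constant metrics to avoid division by zero.
    stdevs = [statistics.pstdev(column) or 1.0 for column in columns]
    return means, stdevs


def anomaly_score(sample, baseline):
    """Largest absolute z-score of the sample across all metrics."""
    means, stdevs = baseline
    return max(abs(x - m) / s for x, m, s in zip(sample, means, stdevs))


def is_anomalous(sample, baseline, threshold=3.0):
    """Flag a sample that is far from the fault-free baseline."""
    return anomaly_score(sample, baseline) > threshold


# Hypothetical fault-free samples: (CPU %, memory %, request latency in ms).
normal = [
    (40, 55, 120), (42, 54, 118), (38, 56, 125),
    (41, 55, 121), (39, 57, 119), (43, 53, 122),
]
baseline = fit_baseline(normal)

typical = is_anomalous((41, 55, 123), baseline)     # resembles the training data
overloaded = is_anomalous((97, 56, 480), baseline)  # far from anything seen so far

print(typical, overloaded)
```

Note that no example of a fault was needed to build the baseline, and the same baseline flags any sufficiently unusual sample regardless of which fault caused it. The price is the one mentioned above: six samples are nowhere near enough in practice, and an anomaly is not always a fault.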
In our next post, we describe in more detail how ML models can be trained and used to detect faults in an IT system. We will also give a brief summary of the results we obtained from our tests based on OpenStack and Kubernetes.