Automate performance-fault localization in cloud systems to minimize downtime

Netflix, Dropbox and Gmail are just a few of the popular web services that rely on virtualized cloud systems. According to Statista, the number of Netflix streaming subscribers worldwide is currently around 33 million, while Google's e-mail service has an estimated 1 billion active users, and Dropbox has 500 million. With this many end users expecting a high quality of service, cloud service providers are under pressure to minimize service disruptions and downtime.

You see, when consumers use a web service, they want an instantaneous response. A neuroscience study referenced in the Ericsson Mobility Report from February 2016 measured user reactions to network performance and showed that delays in loading web pages and videos can affect people’s stress levels more than watching a horror movie, solving a mathematical problem, or waiting in line at the grocery store! What’s more, time-to-content and additional delays due to re-buffering lead to decreases in Net Promoter Scores.

Cloud service providers can avoid stressing out their end users by deploying an agile and automated service-assurance system. Anomaly detection and root cause analysis (RCA) of the corresponding faults are crucial aspects of service-assurance systems for cloud and data center environments. They become critical in telecom cloud scenarios, where even a short service downtime can result in lost revenue. The effects on mission-critical systems can be much worse.

Cloud management and service assurance experts

We are a team of researchers at Ericsson Research focused specifically on the management and service assurance aspects of the cloud. We will address these critical research problems in Celtic SENDATE EXTEND (WP4), a recently started three-year project.

The main motivation for the research is that anomaly detection and RCA are time-consuming and costly, so automating as much as possible is a winning strategy. The challenge, however, is that automated systems that combine high effectiveness and accuracy, simplicity of use, and operator trust are extremely difficult to develop.

Challenges mainly originate from the inherent complexity of the cloud infrastructure and the mix of services running on top. Factors contributing to this complexity are heterogeneous and disaggregated hardware, multiple layers of software and hardware, sharing of resources, and high system dynamicity.

Service Assurance and RCA in the Cloud

In Celtic SENDATE EXTEND (WP4), we focus on developing methods and tools for anomaly detection and fault-localization techniques including RCA with the aim of supporting service assurance in the cloud. We are aiming for a data-driven approach for problem formulation and solution. Specifically, we are using operational data to learn about the behavior and characteristics of the system without requiring a thorough understanding of deep internals of the system.

RCA in the Cloud

In this diagram you can see how our envisioned service assurance system works:

  • The Cloud Monitoring component is responsible for collecting the performance data of the infrastructure as well as metrics from the services. The Data Pre-processor component is responsible for cleaning the data as well as feature engineering and selection.
  • The processed data is fed into the Fault/Anomaly Detection system component. This component trains and uses statistical models to detect any abnormal behavior. At this point in time this component relies on technology for SLA prediction, which was developed in the VINNOVA REALM project, a predecessor to Celtic SENDATE. Once an SLA violation is detected, the Fault/Anomaly Localization system component is invoked. It uses an advanced statistical correlation methodology to pinpoint faulty or suspicious metrics in the system.
  • Note the decoupling of the detection and localization components; this keeps the system highly modular, which helps with scalability and extensibility. The Fault Classification component is responsible for classifying the detected problem at a higher level, based on the details of the faulty metric(s) pointed out by the fault localization module.
  • The last two components handle system actuation to mitigate the identified faults: one calculates the parameters for an orchestration action, such as vertical scaling, and the other executes that mitigation action.
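The decoupling of detection, localization and classification described above can be illustrated with a minimal Python sketch. This is not the project's implementation: the function names, metric names and thresholds are all hypothetical, and the toy stand-ins simply replace the statistical models with fixed rules to show how the stages plug together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Sample = Dict[str, float]  # metric name -> value for one time interval


@dataclass
class Diagnosis:
    suspicious_metrics: List[str]
    fault_class: str


def assurance_step(
    sample: Sample,
    detect: Callable[[Sample], bool],
    localize: Callable[[Sample], List[str]],
    classify: Callable[[List[str]], str],
) -> Optional[Diagnosis]:
    """Run localization and classification only when detection fires."""
    if not detect(sample):
        return None
    suspects = localize(sample)
    return Diagnosis(suspects, classify(suspects))


# Toy stand-ins for the real components (hypothetical names and thresholds).
def toy_detect(sample: Sample) -> bool:
    # Assumed SLA: response time must stay below 200 ms.
    return sample["response_time_ms"] > 200.0


def toy_localize(sample: Sample) -> List[str]:
    # Flag utilization metrics above an assumed 90% level.
    return sorted(k for k, v in sample.items() if k.endswith("_util") and v > 0.9)


def toy_classify(suspects: List[str]) -> str:
    if not suspects:
        return "unknown"
    return "cpu" if suspects[0].startswith("cpu") else "other"


healthy = {"response_time_ms": 120.0, "cpu_util": 0.95, "mem_util": 0.40}
degraded = {"response_time_ms": 450.0, "cpu_util": 0.97, "mem_util": 0.40}

print(assurance_step(healthy, toy_detect, toy_localize, toy_classify))
print(assurance_step(degraded, toy_detect, toy_localize, toy_classify))
```

Because each stage is passed in as a plain function, any one of them can be swapped for a different model without touching the others, which is the practical benefit of keeping detection and localization separate.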

Fault/Anomaly Localization

We are currently exploring several methods for fault and anomaly detection, with the aim of being capable of finding even unknown types of anomalies and faults. One considered approach utilizes unsupervised machine learning to diagnose performance problems of the services running in a cloud or a cluster environment. Specifically, our approach uses multivariate statistical correlations to quantify the ‘suspiciousness’ of metrics causing performance degradation. The approach is online, robust, scalable, and can adapt to system state changes over time.
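To give a feel for what a correlation-based 'suspiciousness' score might look like, here is a deliberately simplified Python sketch. It is an assumption for illustration only: it ranks each infrastructure metric by the absolute Pearson correlation with a service-level metric over one window, which is a much simpler technique than the multivariate, online approach described above. The metric names and sample values are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_suspicious(
    metrics: Dict[str, List[float]], service_metric: List[float]
) -> List[Tuple[str, float]]:
    """Rank metrics by |correlation| with the service metric, highest first."""
    scores = {
        name: abs(pearson(values, service_metric))
        for name, values in metrics.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical window of samples taken during a performance degradation.
window = {
    "cpu_util": [0.2, 0.3, 0.5, 0.8, 0.9, 0.95],
    "mem_util": [0.40, 0.42, 0.41, 0.43, 0.40, 0.42],
    "disk_io": [10.0, 10.0, 10.0, 10.0, 10.0, 10.0],
}
response_time_ms = [100.0, 110.0, 150.0, 300.0, 380.0, 420.0]

ranking = rank_suspicious(window, response_time_ms)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy window the CPU metric rises together with the response time, so it ends up at the top of the list, while the constant disk metric scores zero. No labeled fault data is needed, which is what makes this family of methods unsupervised.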

Highly accurate anomaly detection

Our preliminary evaluation shows promising results, with high accuracy in detecting CPU, memory and disk file swap anomalies for a video streaming service. A representative set of results, shown below, demonstrates the effectiveness of the Fault Localization system when detecting concurrent faults in the system, such as a CPU hog and a memory leak. CPU hog faults appear in the list of suspicious metrics and in most cases are the top-ranked suspicious metric. Memory leaks are also accurately detected in most faulty time intervals; however, their average fault rank is lower because their impact is lower in the scenario we considered.

SENDATE figure 1


SENDATE figure 2

Overall, there are only a few anomalous time intervals where the Fault Localization system misses the fault (false negatives); these are marked as red dots in the figures. A more extensive collection and evaluation of results is underway to validate the approach under different traffic loads and fault scenarios for a research publication.
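For readers who want to see how figures such as the average fault rank and the number of false negatives could be computed, here is a small Python sketch. It assumes that a fault counts as missed when the injected metric does not appear in the top-k list of suspicious metrics; that definition, the helper names and the sample rankings are all hypothetical and are not our measured results.

```python
from typing import Dict, List, Optional


def fault_rank(ranked: List[str], fault_metric: str, top_k: int) -> Optional[int]:
    """1-based rank of the fault metric within the top-k list, or None if missed."""
    shortlist = ranked[:top_k]
    if fault_metric in shortlist:
        return shortlist.index(fault_metric) + 1
    return None


def summarize(
    intervals: List[List[str]], fault_metric: str, top_k: int
) -> Dict[str, Optional[float]]:
    """Average rank over detected intervals and the count of missed intervals."""
    ranks = [fault_rank(ranked, fault_metric, top_k) for ranked in intervals]
    hits = [r for r in ranks if r is not None]
    return {
        "average_rank": sum(hits) / len(hits) if hits else None,
        "false_negatives": len(ranks) - len(hits),
    }


# Made-up ranked lists for four faulty time intervals with an injected CPU fault.
intervals = [
    ["cpu", "mem", "disk"],
    ["cpu", "disk", "mem"],
    ["mem", "cpu", "disk"],
    ["disk", "net", "cpu"],  # fault falls outside the top 2: a false negative
]

summary = summarize(intervals, fault_metric="cpu", top_k=2)
print(summary)
```

A lower average rank means the operator has to look at fewer candidate metrics before reaching the real culprit, so it is a useful complement to the plain miss count.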

The next step

Since the approach is general and data-driven, it can easily be extended to identify a wider range of anomalies. A more extensive evaluation will continue under the Celtic SENDATE project as we collect more operational data traces from different cloud services and experiment with new types of faults and anomalies in the system. Future plans also include research into aspects such as computation and storage overhead, scalability, and possibilities for integration with other fault detection and localization techniques.

Jawwad Ahmed, Andreas Johnsson, Christofer Flinta
Ericsson Research.