Proactive service assurance for cloud applications – Predicting application issues
- Service assurance is critical for maintaining high-quality advanced mobile network services. While artificial intelligence and machine learning (AI/ML) offer promising proactive solutions, their widespread application is hindered by the challenges of building and running these models.
- This blog post outlines these challenges and describes the first step of proactive service assurance: Effectively predicting application issues.
Emerging network applications, such as massive immersive communication and connected intelligent machines, demand guaranteed end-to-end quality-of-service (QoS) with high reliability and availability. Distributed, heterogeneous clouds serve as the fundamental infrastructures, supporting these demanding applications.
However, due to the dynamic nature of such clouds, where the infrastructure and traffic patterns may change and failures may occur, applications deployed in these environments may face performance issues.
Factors like improper resource allocation, network congestion, hardware failures, or software bugs can all degrade service availability, impacting QoS metrics, including response time and throughput. To sustain application performance and availability, it is essential to identify and address these potential problems, making service assurance mechanisms critically important.
Here, we will discuss in depth the challenges and solutions of service assurance, with an emphasis on AI/ML-based approaches.
Reactive vs. proactive service assurance
In modern service assurance, two primary approaches are employed: Reactive and proactive.
Reactive service assurance involves responding to issues, such as service interruptions and performance degradation after they occur. Typically, a monitoring system detects these issues in applications or infrastructure, triggering alerts to technical experts or automated procedures for investigation and resolution. Speedy response and minimum downtime are critical in this approach.
However, reactive methods can lead to prolonged downtime and customer dissatisfaction if the issues are not promptly detected and resolved. Sometimes, quick fixes like simple restarts are applied without addressing the underlying causes, resulting in recurring issues.
Proactive service assurance, in contrast, anticipates and addresses potential issues before they impact service quality or cause disruptions. It continuously monitors infrastructure performance, using predictive analytics and machine learning to detect trends and patterns, forecast problems, and enable preemptive actions. This proactive approach enhances service reliability, minimizes downtime, and optimizes resource utilization by identifying areas for improvement based on performance insights.
Approaches in use today
Currently, reactive service assurance is the standard way to ensure the runtime QoS of cloud applications. While fault tolerance and high availability strategies can be implemented in the earlier deployment phase, they often struggle to adapt to highly dynamic environment changes.
In contrast to the reactive method, proactive service assurance is a more promising approach for providing highly reliable services. However, its broader adoption faces challenges due to the complexities involved in building and running AI/ML models.
Challenges in applying AI/ML in proactive service assurance
Proactive service assurance requires in-depth research to develop predictive models that can accurately and immediately anticipate both cloud infrastructure failures and application performance issues. In a previous blog post, we discussed the use of Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models for predicting cloud infrastructure failures. In this post, we will focus on predicting performance issues for applications deployed in cloud environments. This brings new challenges in both building the prediction model and sustaining its performance during runtime.
To illustrate the challenges in predicting application performance issues, consider the example of a cloud system hosting applications, such as an extended reality application or a cloud native 5G core, and radio access networks, shown in Fig. 1. Here, an operator aims to develop a deep learning model to predict performance degradations. This model needs to be trained within a “model building” function using data collected from a “monitoring system.” Once trained, it is deployed in the cloud system for “online prediction.” The insights from these predictions can then be used for proactive service assurance, such as root cause analysis and triggering of remediation actions.
Fig. 1: Overview of application performance prediction in cloud system
Challenges related to building a prediction model
The first challenge in developing prediction models for cloud-based application performance issues is feature (or metric) selection. In Fig. 1, the cloud system includes a “Monitoring System” that gathers raw training data, including potential features, like container and infrastructure CPU utilization, memory usage, network latency, and I/O operations. Using all these features for model training is impractical, as it can increase computational costs and potentially reduce the model’s predictive accuracy due to noise.
Conversely, relying solely on historical key performance indicators (KPIs) or features that directly manifest the QoS of the application, such as response time or throughput, is often insufficient. These KPIs often exhibit non-stationary behavior, fluctuating with workload patterns, user activity, or seasonal trends. The causes of application performance degradation are often diverse and complex, involving factors such as server hardware issues, insufficient VM resources, misconfigured load balancers, network congestion, and scheduling problems in cloud management. A careful feature selection process is needed to differentiate impactful features from the less informative ones, thereby improving the model’s predictive performance and efficiency.
The second challenge relates to determining model parameters, such as the training data size (the number of data samples used for each training run) and the appropriate sampling rate (determined by the time difference between two consecutive collected data samples). Using a larger number of data samples with an increased sampling rate can improve model accuracy and its ability to predict subtle performance changes. However, it can also lead to excessive resource consumption for data collection and storage.
Communication service providers (CSPs) may have specific performance expectations despite resource constraints. For example: achieving over 85 percent model accuracy with two CPUs for model training and 8GB of training data storage. Consequently, a careful selection of the data sampling rate and the total number of training samples is required to fulfill both the model accuracy and resource requirements.
As part of a proactive service assurance solution, CSPs need a reasonable lead time to forecast application issues, allowing for timely and effective responses within the permitted service interruption time. For example, a CSP might define a minimal prediction horizon (the forecast period, expressed as the number of future data samples the model should predict) of five minutes, which is equivalent to predicting 30 samples if the sampling rate is 10 seconds. This means the model needs to predict the application’s KPI degradations at least five minutes in advance. To achieve this, it is critical to carefully choose the input window (the number of data samples the model uses for prediction) to ensure that the minimal prediction horizon is fulfilled without compromising the model performance requirements.
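To make the relationship between these parameters concrete, below is a minimal Python sketch of how a KPI series is sliced into input windows and prediction horizons at a given sampling rate. The parameter values and the synthetic response-time series are illustrative assumptions, not the exact pipeline described in this post.

```python
import numpy as np

SAMPLING_INTERVAL_S = 10          # seconds between two consecutive samples
PREDICTION_HORIZON_S = 5 * 60     # requirement: predict at least 5 minutes ahead
HORIZON_SAMPLES = PREDICTION_HORIZON_S // SAMPLING_INTERVAL_S   # 30 future samples
INPUT_WINDOW = 60                 # past samples fed to the model (illustrative)

def make_windows(kpi_series: np.ndarray):
    """Slice a 1-D KPI series into (input window, prediction horizon) pairs."""
    X, y = [], []
    last_start = len(kpi_series) - INPUT_WINDOW - HORIZON_SAMPLES
    for start in range(last_start + 1):
        X.append(kpi_series[start:start + INPUT_WINDOW])
        y.append(kpi_series[start + INPUT_WINDOW:
                            start + INPUT_WINDOW + HORIZON_SAMPLES])
    return np.array(X), np.array(y)

# Example: one hour of synthetic response-time samples at 10-second resolution.
kpi = np.random.default_rng(0).normal(loc=120, scale=10, size=360)
X, y = make_windows(kpi)
print(X.shape, y.shape)           # (271, 60) (271, 30)
```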
Challenges related to maintaining prediction model performance during run time
- Once a prediction model is deployed for online inferencing, the “monitoring system” continues to collect real-time data and feeds it into the trained model for online predictions. At this stage, we want the model to consistently meet its performance requirements. However, given the high variability in configurations and workloads within the cloud system, the distribution of incoming data may change over time, drifting from its original ranges (known as concept drift), degrading the performance of the prediction model.
- In other cases, such as infrastructure changes or unknown failures, the features selected for training the model can become irrelevant for predicting the current application’s performance (known as feature drift). In such cases, the prediction model’s performance may drop significantly, making it unsuitable for proactive assurance.
To address these issues effectively, such drifts must be managed at runtime: they need to be detected in a timely manner, and the prediction model must be adjusted to adapt to them. This significantly helps to restore and maintain the performance of the prediction model.
Next, we will discuss the methods used for addressing these challenges, including feature selection, parameter selection, and runtime drift handling.
Methods for AI/ML model feature selection
To determine which of the thousands of collected features we need for predicting application performance degradation, we introduce a three-step feature selection method: (1) Similarity Analysis, (2) Feature Reduction, and (3) Causal-Temporal Analysis (see Fig. 2). A simplified code sketch of these three steps follows the list below.
Fig. 2: Proposed automated feature selection method
- Similarity analysis – As application performance degradation is manifested through degradation of the application KPI(s), the similarity analysis entity is responsible for finding the infrastructure features whose trends closely follow those of the application KPI(s). This entity receives the normalized data (including infrastructure features and application KPI(s)) as input and reports a similarity score between each infrastructure feature and the application KPI(s) as output. Correlation analysis techniques like the Pearson Correlation Coefficient, which normalizes the covariance of two features by the product of their standard deviations, can be used to compute the similarity score. Additionally, methods like Dynamic Time Warping (DTW) can be employed: since faults at the infrastructure level may take time to impact the application KPI(s), DTW can calculate a distance between each infrastructure feature and KPI pair that tolerates such time shifts.
- Feature reduction – Using the similarity score for each feature, we can now reduce the dimension of the feature set. The feature reduction function first sorts the infrastructure features according to their similarity to the application KPI(s) and then selects the top features as output. The number of features to keep can be set to a fixed threshold or tuned experimentally, for example by incrementally adjusting it until no further performance improvements are observed.
- Causal-temporal analysis – Since there is usually a delay between the occurrence of the problem and its manifestation on the KPIs, it is important to find the features with a causal-temporal relationship to the application KPI, which helps better predict the application performance issues. With the reduced feature set from step 2, we can apply causality discovery algorithms such as Granger Causality (GC) and Time-Lagged Cross Correlation (TLCC) to select the most effective features for training.
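As an illustration only, here is a simplified, self-contained Python sketch of the three steps. It substitutes plain Pearson correlation for the similarity analysis and a time-lagged cross-correlation for the causal-temporal analysis (rather than the full DTW and Granger causality techniques mentioned above), and the thresholds, feature names, and synthetic data are assumptions made for the example.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: covariance normalized by the product of standard deviations."""
    a, b = (a - a.mean()) / a.std(), (b - b.mean()) / b.std()
    return float(np.mean(a * b))

def lagged_xcorr(feature, kpi, max_lag=30):
    """Best absolute correlation when the feature leads the KPI by 1..max_lag samples."""
    return max(abs(pearson(feature[:-lag], kpi[lag:])) for lag in range(1, max_lag + 1))

def select_features(features, kpi, keep_similar=20, keep_causal=5):
    # Step 1: similarity analysis - score every infrastructure feature against the KPI.
    similarity = {name: abs(pearson(series, kpi)) for name, series in features.items()}
    # Step 2: feature reduction - keep only the top-ranked features.
    reduced = sorted(similarity, key=similarity.get, reverse=True)[:keep_similar]
    # Step 3: causal-temporal analysis - prefer features whose past values
    # track future KPI values (the feature "leads" the KPI).
    causal = {name: lagged_xcorr(features[name], kpi) for name in reduced}
    return sorted(causal, key=causal.get, reverse=True)[:keep_causal]

# Synthetic example: cpu_util leads the KPI by 10 samples, the others are noise.
rng = np.random.default_rng(1)
kpi = rng.normal(size=1000)
features = {
    "cpu_util": np.roll(kpi, -10) + 0.1 * rng.normal(size=1000),
    "mem_used": rng.normal(size=1000),
    "net_rtt": rng.normal(size=1000),
}
print(select_features(features, kpi, keep_similar=3, keep_causal=1))   # ['cpu_util']
```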
To read more about the feature selection method, please refer to our paper published in IEEE ICC 2023.
Methods for AI/ML model parameter selection
Parameter selection can be considered a multi-objective optimization problem: finding an optimal combination of training data size, data sampling rate, input window, and prediction horizon. The goal is for a CSP to achieve a balance between prediction performance and resource utilization. We present an automated parameter selection method that uses a surrogate-assisted non-dominated sorting genetic algorithm II (SA-NSGA-II) (see Fig. 3). The method includes three steps.
Fig. 3: Proposed parameter selection method
- The “Search space identifier” defines the search space for the solutions, setting lower and upper bounds for each decision variable: prediction horizon, input window, sampling interval, and training data size. Each solution in the search space is a set of these four decision variables. The search space is tailored to the CSP’s requirements, such as resource constraints and the minimal prediction horizon.
- The “SA-NSGA-II based parameter selection” performs selection using a priority-based ranking method to classify the parameter populations in the search space into different fronts, based on Pareto dominance (that is, one option “dominates” another if it is better in at least one aspect without being worse in any other). To determine what is ‘better’, we evaluate two objectives: the performance of the prediction model and the resource consumption.
- To evaluate the performance of a prediction model for each solution, we need to train a separate model for each solution. This process requires substantial computational resources and time due to the huge number of parameter combinations. To avoid this, we employ a random forest surrogate model that can approximate the model performance for each solution. The surrogate model is trained offline using a limited number of solutions and their observed model performance. It significantly reduces the search time, compared to the time used for a brute force search and the traditional NSGA-II search.
- To evaluate the resource consumption for collecting and storing the training data, we build a linear function of the data dimension. Therefore, the less data we use for training, the lower the resource consumption. The full selection procedure consists of the steps shown in Fig. 4.
- The “Decision maker” in Fig. 3 selects the optimal solution. The ‘optimal’ solution is determined using a weighted-sum methodology that assigns a score to each candidate solution on the Pareto front; the solution with the highest score is selected. A simplified sketch of this search-and-select loop is shown after Fig. 4 below.
Fig. 4: Major procedure of SA-NSGA-II
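To illustrate the overall flow, the sketch below makes several simplifying assumptions: a random forest surrogate approximates model accuracy (trained here on placeholder measurements), a plain random search stands in for the full SA-NSGA-II loop, and the bounds, resource-cost model, and weighted-sum weights are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Decision variables: [prediction horizon (samples), input window (samples),
#                      sampling interval (s), training data size (samples)]
BOUNDS = np.array([[30, 120], [30, 240], [5, 60], [5_000, 100_000]])

def sample_solutions(n):
    return rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1], size=(n, 4))

# Offline step: the surrogate is fitted on a small set of solutions whose real
# prediction-model accuracy was measured by actually training a model
# (replaced by placeholder values here).
observed_x = sample_solutions(50)
observed_acc = rng.uniform(0.7, 0.95, size=50)
surrogate = RandomForestRegressor(n_estimators=100).fit(observed_x, observed_acc)

def resource_cost(solution):
    # Roughly linear in the amount of collected data: more training samples
    # at a shorter sampling interval cost more to collect and store.
    _, _, interval, data_size = solution
    return data_size / interval

# Search step: evaluate many candidates with the surrogate instead of
# training a real model for each one.
candidates = sample_solutions(2_000)
acc = surrogate.predict(candidates)
cost = np.array([resource_cost(c) for c in candidates])

# Keep the (approximate) Pareto front: a candidate is dropped if another
# candidate is strictly more accurate and strictly cheaper.
front = [i for i in range(len(candidates))
         if not np.any((acc > acc[i]) & (cost < cost[i]))]

# Decision maker: a weighted sum over normalized objectives picks one solution.
score = 0.7 * (acc[front] / acc[front].max()) - 0.3 * (cost[front] / cost[front].max())
best = candidates[front[int(np.argmax(score))]]
print("selected [horizon, window, interval, data size]:", np.round(best, 1))
```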
Methods for AI/ML model run time drift handling
For drift handling, the traditional approach is to collect new data, retrain, and redeploy the model. This is cumbersome when drifts are frequent, as retraining a model requires collecting a large amount of data. In addition, simple retraining may not solve every problem, since it cannot cope when the importance of the features used for prediction changes. To tackle such issues, we introduce a runtime drift handling method that first detects drifts, then analyzes their severity, and finally determines and applies the proper drift adaptation method.
Fig. 5 shows the drift handling method. It monitors the performance of the prediction model, using the previously trained prediction model and the new data collected from the current system as inputs. Once it detects a drift, it collects new data to adapt the model and uses the adapted model to update the online prediction model. The method includes two main steps: “Drift detection” and “Drift adaptation”. A simplified code sketch of the detection and adaptation logic follows the list below.
Fig. 5: Proposed run time drift handling method
- The “Drift detection” includes two steps:
- For “model performance monitoring”, techniques such as Cumulative Sum (CUSUM) and Adaptive Windowing (ADWIN) can be used to determine whether the model performance has dropped. Using CUSUM, for example, a drift is detected when the mean of the prediction errors significantly deviates from zero.
- The “feature importance analysis” evaluates the continuing relevance of the features used for predicting application performance issues. In this step, we use a perturbation-based method, which tests how the model performance changes when a feature’s data is replaced with random data. This step outputs the features and their importance scores before and after the drift. This information, together with a drift alert, is passed to the drift adaptation function.
- The “Drift adaptation” also includes two steps:
- The “drift severity analysis” inspects the features and their importance scores to identify the type of drift. If no feature changes its importance score significantly, the drift is considered a ‘concept drift’. If most of the important features have significant score changes, the drift is considered a ‘severe feature drift’, while the intermediate case, in which only a minority of the important features have score changes, is counted as a ‘non-severe feature drift’. The drift type is used to select the model adaptation method.
- “Model adaptation” is responsible for selecting the most appropriate method for adjusting the prediction model based on the type of drift.
- For “concept drift adaptation”, a reinforcement learning (RL) method can be used to select among ‘retraining’, ‘partially updating the whole neural network’, ‘adapting the first layer’, and ‘adapting the last layer’ of the prediction neural network, so that the selected method is resource-efficient while restoring the prediction model performance. Learn more in our IEEE TNSM 2022 paper.
- For “feature drift adaptation”, in addition to changing the prediction model, it is also important to modify the features. In the case of ‘severe feature drift’, feature reselection and model retraining are necessary. In the case of ‘non-severe feature drift’, to save the cost of feature reselection and model retraining, we can drop the features with significant importance changes and adjust the prediction model, for example by modifying the input layers using some new data. Read more in our IEEE TNSM 2024 paper.
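To tie the pieces together, here is a minimal Python sketch of the detection and adaptation logic described above. It assumes the deployed model exposes a `predict` method, uses the mean absolute prediction error as the monitored performance signal, scrambles a feature column as the perturbation test, and treats the thresholds and the adaptation mapping as illustrative assumptions rather than the exact methods from our papers.

```python
import numpy as np

def cusum_drift(errors, baseline, slack=0.5, threshold=5.0):
    """One-sided CUSUM: flag a drift when prediction errors accumulate
    significantly above the baseline error observed at deployment time."""
    s = 0.0
    for t, e in enumerate(errors):
        s = max(0.0, s + (e - baseline - slack))
        if s > threshold:
            return t          # index at which the drift is detected
    return None               # no drift detected

def perturbation_importance(model, X, y, rng=np.random.default_rng(0)):
    """Importance of each feature = how much the prediction error grows
    when that feature's column is scrambled (its signal destroyed)."""
    base_error = np.mean(np.abs(model.predict(X) - y))
    scores = []
    for j in range(X.shape[1]):
        X_pert = X.copy()
        X_pert[:, j] = rng.permutation(X_pert[:, j])
        scores.append(np.mean(np.abs(model.predict(X_pert) - y)) - base_error)
    return np.array(scores)

def classify_drift(importance_before, importance_after,
                   change_threshold=0.3, severe_share=0.5):
    """Map importance-score changes to a drift type, as described above."""
    rel_change = np.abs(importance_after - importance_before) / (
        np.abs(importance_before) + 1e-9)
    changed = rel_change > change_threshold
    if not changed.any():
        return "concept_drift"            # features still relevant, data shifted
    if changed.mean() >= severe_share:
        return "severe_feature_drift"     # most features lost relevance
    return "non_severe_feature_drift"     # only a minority of features changed

# Adaptation options per drift type (a simplified stand-in for the RL-based
# and feature-drift adaptation strategies described in the post).
ADAPTATION = {
    "concept_drift": "adapt model weights, e.g. fine-tune the last layer on new data",
    "non_severe_feature_drift": "drop the changed features and adjust the input layer",
    "severe_feature_drift": "reselect features and retrain the model",
}

# Example: one feature's importance collapses after the drift alert.
drift_type = classify_drift(np.array([0.4, 0.3, 0.2]), np.array([0.05, 0.28, 0.21]))
print(drift_type, "->", ADAPTATION[drift_type])
```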
Summary
In this blog post, we have outlined the challenges associated with applying AI/ML prediction models to proactive service assurance. Selecting the most appropriate features and parameters is essential for effective model building, and so is maintaining the performance of the prediction models during runtime, when concept drift or feature drift occurs.
Effectively predicting application issues is the first step of proactive service assurance; analyzing the root cause and taking proactive remediation actions are the second and third steps, all of which are critical for maintaining application availability and performance. In our next blog posts, we will explore root cause analysis and proactive remediation action selection.