Autoscaling: Value vs. state-detection in the cognitive cloud

New forms of cognitive autoscaling in cloud computing could soon be deployed in safety-critical systems, futuristic networking architectures and virtualized resource management. But are they really up to ten times better than today’s value-based autoscaling solutions? Find out below…

Cloud computing, as a network-centric computing paradigm, has become widely adopted during recent years due to its efficiency, accessibility and scalability.

An introduction to autoscaling in telecom

Scalability, in turn, enables efficient resource management and allocation, which can be implemented through a process called autoscaling. Autoscaling systems scale allocated computational resources, such as memory and CPU time, to help the industry maintain a desired quality of service.

Today’s autoscaling solutions often utilize control loops that use observed metric values of the system in conjunction with pre-configured deployment policies to determine actions for autoscaling. At the same time, state-of-the-art studies improve autoscaling by predicting future metric values of the system with the help of machine learning.
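To make the value-based approach concrete, here is a minimal sketch of one such control loop in Python. The `ThresholdPolicy` class, its threshold values and the sample CPU readings are all illustrative assumptions, not the configuration of any particular product.

```python
from dataclasses import dataclass


@dataclass
class ThresholdPolicy:
    """Pre-configured deployment policy (all values are illustrative)."""
    scale_up_cpu: float = 0.80    # add a replica above this average CPU load
    scale_down_cpu: float = 0.30  # remove a replica below this average CPU load
    min_replicas: int = 1
    max_replicas: int = 10


def value_based_decision(avg_cpu: float, replicas: int, policy: ThresholdPolicy) -> int:
    """Return the new replica count from a single observed metric value."""
    if avg_cpu > policy.scale_up_cpu:
        return min(replicas + 1, policy.max_replicas)
    if avg_cpu < policy.scale_down_cpu:
        return max(replicas - 1, policy.min_replicas)
    return replicas


# Run the control loop over a short series of hypothetical CPU readings.
policy = ThresholdPolicy()
replicas = 2
for observed_cpu in [0.45, 0.85, 0.90, 0.20]:
    replicas = value_based_decision(observed_cpu, replicas, policy)
```

Each decision depends only on the latest observed value and the fixed policy, which is what makes this style of autoscaling simple to operate but sensitive to individual bad readings.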

Cognitive autoscaling in cloud computing

However, at Ericsson, we’re challenging these state-of-the-art studies and, in doing so, are proposing ten-fold improvements to the prevailing solutions. What does this new solution look like? Well, it’s all based on the concept of cognitive orchestration.

Cognitive orchestration is machine-learning-driven orchestration in which the orchestrator is able to handle decision-making, reasoning and problem solving independently.

Through cognitive orchestration, we can automatically discover and detect computational states of a cloud-based service and provision machine learning functionalities for optimized utilization. Computational state may, for example, describe whether the cloud-based service is idle, busy or overloaded.

The test: State detection vs. value observation

Metric value-based scaling is a popular approach in established autoscaling solutions. Generally, these autoscaling solutions perform well, but they encounter several challenges, especially during anomalous behavior. According to our research, there are three major reasons why state detection outperforms value observation in autoscaling:

  • Value-based autoscaling is vulnerable to anomalies, such as measurement errors and runtime failures
  • State detection gives a more holistic view of the state of the system, which means that we can differentiate, for instance, a system anomaly from high demand for the service
  • State detection allows early pro-activity without making assumptions about the future, and it bases its decisions on multidimensional metric data which is more accurate than decisions based on evaluating future values of the system
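The difference between the two approaches can be illustrated with a toy example. In the sketch below, the hand-written rules in `detect_state`, the metric names and every cut-off value are hypothetical stand-ins for a learned model; they only show how several metrics together can separate a genuine traffic peak from an anomalous reading that a single-metric threshold cannot tell apart.

```python
def value_based_wants_scale_up(cpu: float, threshold: float = 0.80) -> bool:
    """Value observation: one metric compared against a fixed threshold."""
    return cpu > threshold


def detect_state(cpu: float, requests_per_s: float, error_rate: float) -> str:
    """Toy state detector using hand-written rules (hypothetical cut-offs)."""
    if cpu > 0.80 and requests_per_s > 500:
        return "high_demand"  # high load that is explained by real traffic
    if cpu > 0.80 or error_rate > 0.05:
        return "anomaly"      # high CPU without traffic, or many failures
    return "normal"


# Two hypothetical samples: (cpu, requests per second, error rate).
traffic_peak = (0.92, 900.0, 0.01)
faulty_reading = (0.95, 40.0, 0.00)

# The single-metric rule reacts identically to both samples...
both_trigger = (value_based_wants_scale_up(traffic_peak[0])
                and value_based_wants_scale_up(faulty_reading[0]))

# ...while the multidimensional view tells them apart.
peak_state = detect_state(*traffic_peak)
faulty_state = detect_state(*faulty_reading)
```

Here only the traffic peak would justify scaling up; scaling on the faulty reading would add a resource that no request ever uses.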

In a recent stress test, we implemented orchestration which was driven by machine learning and could proactively take actions based on the detected computational state before the levels of individual system metrics had significantly diverged. As a result, we increased the resource efficiency and the quality of the cloud-based services, simply by taking advantage of machine learning.

In the next step, we compared the performance of our cognitive solution against a traditional value-based autoscaling solution. In this comparison, we conducted a 10-hour stress test on a web server that was autoscaled by each of the two solutions in a cloud environment. We evaluated the results by measuring the number of unused computational resources.

Figure 1: Comparison of state-of-the-art metric value based autoscaling solution and our proposed cognitive solution

In the graphs above, the red lines depict the number of redundantly scaled computational resources (“idles”), and the blue lines show the average CPU usage of computational resources. We defined a pod to be redundant if its CPU load remained at zero percent for the whole of its deployment lifecycle.

With the value-based autoscaling solution, the average number of redundantly scaled computational resources was 0.20, whereas with our proposed state-detection-based solution the average was only 0.02. By this measure, our proposed cognitive autoscaling solution produced ten times fewer redundant resources than the value-based autoscaling solution in this test.
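The redundancy criterion and the ten-fold figure can be written out in a few lines of Python. The pod names and CPU traces below are hypothetical examples, not measured data; only the two averages (0.20 and 0.02) come from the test described above.

```python
from math import isclose
from typing import Sequence


def is_redundant(cpu_trace: Sequence[float]) -> bool:
    """A pod is redundant if its CPU load stayed at zero for its whole lifecycle."""
    return len(cpu_trace) > 0 and all(load == 0.0 for load in cpu_trace)


# Hypothetical CPU traces (fraction of one core), not measured data.
pods = {
    "pod-a": [0.0, 0.0, 0.0],    # scaled up but never used
    "pod-b": [0.0, 0.35, 0.60],  # picked up load after start-up
}
redundant_pods = [name for name, trace in pods.items() if is_redundant(trace)]

# Average number of redundant pods reported for the two solutions.
avg_redundant_value_based = 0.20
avg_redundant_state_based = 0.02
reduction_factor = avg_redundant_value_based / avg_redundant_state_based

# Floating-point division, so compare with a tolerance rather than ==.
is_tenfold = isclose(reduction_factor, 10.0)
```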

It is also worth noting in the graph that the spikes depicting the redundant computational resources (red lines) are not only higher with the value-based solution, but they also occur more frequently than with the proposed solution that is based on state detection. This demonstrates the value of the state-detection-based autoscaling solution when compared to the value-based autoscaling solution that merely utilizes observed system values.

Making it easy through an automated machine learning pipeline

To support the cognitive capabilities of our orchestration solution, we designed a machine learning pipeline that increases the level of automation. The machine learning pipeline enables automated discovery of computational states in a data-driven manner. With a data-driven approach, we can ensure that our machine learning models can adapt to the characteristics of various cloud-based services.

To stay up to date, the machine learning models need to be retrained frequently. However, model training can involve a complex workflow that requires multiple pre- and post-processing operations. To keep the solution easy to use, we let the pipeline handle these workflows automatically: it takes system metrics from the cloud service as input and produces a trained machine learning model for cognitive orchestration as output (Figure 2 below).

Figure 2: Automated machine learning pipeline for discovery of computational states

In our solution, training the machine learning model differs from the traditional methods. Usually, feature extraction is a manual task where a human assigns corresponding labels by presuming the state of the system. This is problematic because some computational states of the system might be unknown to humans, or they may be very hard to detect.

For this reason, we utilize both unsupervised and supervised machine learning, together with gradient analysis for feature extraction and state identification, in order to fully automate model training. Unsupervised machine learning techniques allow detection of previously unknown computational states. Supervised machine learning enables lightweight deployment of cognitive capabilities in constrained cloud environments, such as the edge cloud.
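The two-stage idea can be sketched with standard-library Python alone. The algorithms here are assumptions chosen for brevity: k-means stands in for the unsupervised step and a nearest-centroid classifier for the supervised step, the gradient analysis is omitted, and the (CPU, memory) samples are made up. The article does not state which algorithms the actual pipeline uses.

```python
from math import dist
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int, iterations: int = 20) -> List[int]:
    """Unsupervised step: group metric samples into k candidate states."""
    centroids = list(points[:k])  # naive deterministic initialisation
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every sample to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(mean(axis) for axis in zip(*members))
    return labels


def train_nearest_centroid(points: Sequence[Point], labels: Sequence[int]) -> Dict[int, Point]:
    """Supervised step: fit a lightweight classifier on the discovered labels."""
    model: Dict[int, Point] = {}
    for lab in sorted(set(labels)):
        members = [p for p, l in zip(points, labels) if l == lab]
        model[lab] = tuple(mean(axis) for axis in zip(*members))
    return model


def predict(model: Dict[int, Point], sample: Point) -> int:
    """Return the state whose centroid is closest to the sample."""
    return min(model, key=lambda lab: dist(sample, model[lab]))


# Hypothetical (cpu, memory) samples; no human-assigned labels are needed.
samples: List[Point] = [
    (0.05, 0.10), (0.50, 0.45), (0.95, 0.90),
    (0.08, 0.12), (0.55, 0.50), (0.92, 0.88),
    (0.04, 0.09), (0.48, 0.52), (0.97, 0.93),
]
discovered = kmeans(samples, k=3)
model = train_nearest_centroid(samples, discovered)
state_of_new_sample = predict(model, (0.93, 0.91))
```

The trained model is just three centroids, which hints at why the supervised stage can be cheap enough to run in a constrained environment while the heavier discovery step runs elsewhere.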

Figure 3: Overview of service-agnostic and self-adaptive cognitive cloud

Furthermore, the obtained machine learning model is utilized, for instance, in cognitive orchestration of autoscaling resources. As an input, the model receives multidimensional system data and outputs a computational state of the system. Based on the detected system state, the orchestrator can execute corresponding autoscaling actions.
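A minimal sketch of this detect-then-act step follows. The state names idle, busy and overloaded are the examples given earlier in this post; the `ACTIONS` table, the `orchestrate` function and the `stub_detector` that stands in for a trained model are hypothetical.

```python
from typing import Callable, Dict, Sequence

# Hypothetical mapping from detected state to a change in replica count.
ACTIONS: Dict[str, int] = {"idle": -1, "busy": 0, "overloaded": +1}


def stub_detector(metrics: Sequence[float]) -> str:
    """Placeholder for the trained model: maps multidimensional data to a state."""
    load = sum(metrics) / len(metrics)
    if load < 0.2:
        return "idle"
    if load > 0.8:
        return "overloaded"
    return "busy"


def orchestrate(detect: Callable[[Sequence[float]], str],
                metrics: Sequence[float],
                replicas: int,
                min_replicas: int = 1) -> int:
    """Detect the computational state, then apply the matching scaling action."""
    state = detect(metrics)
    return max(min_replicas, replicas + ACTIONS[state])


# Hypothetical metric vectors, e.g. (cpu, memory, network) utilisation.
scaled_up = orchestrate(stub_detector, (0.90, 0.85, 0.95), replicas=3)
scaled_down = orchestrate(stub_detector, (0.05, 0.10, 0.02), replicas=1)
```

Because the orchestrator only sees a state label, the detector can be swapped for a different model without changing the scaling logic.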

Figure 4: Computational state is determined based on multidimensional data

Towards zero-touch deployment architecture

In the proposed solution, machine learning procedures are divided between two entities: the cognitive controller and the cognitive supervisor. This division of machine learning procedures supports detection of computational states, runtime failures and network patterns in real time.

Figure 5: Proposed cognitive orchestration architecture at cluster level

A cognitive controller handles computationally-demanding operations and orchestrates the cognitive supervisor(s). More precisely, the cognitive controller is responsible for model training, feature extraction, reconfiguration of orchestration policies and deployment of the cognitive supervisor(s).

When deploying services, the cognitive controller automatically provisions one cognitive supervisor for each service running in the cloud. Generally, the cognitive controller takes actions when the overall computational load on the worker node is low. Thus, the demanding operations of the cognitive controller, such as model training, do not interfere with other computational processes.

On the other hand, a cognitive supervisor handles real-time detection of computational states, runtime anomalies and network patterns. The cognitive supervisor is designed to be computationally light, and, thus, the footprint of real-time computational state detection is low. If a cognitive supervisor detects changes in the performance of a supervised service, the cognitive supervisor alerts the associated cognitive controller which can enforce reconfiguration of orchestration policies accordingly.
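The division of labour between the two entities could be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the class and method names, the `load_threshold` value and the lambda used as a detector are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence

Detector = Callable[[Sequence[float]], str]


class CognitiveController:
    """Handles heavy operations and provisions one supervisor per service."""

    def __init__(self, load_threshold: float = 0.3) -> None:
        self.load_threshold = load_threshold  # hypothetical "low load" cut-off
        self.supervisors: Dict[str, "CognitiveSupervisor"] = {}
        self.pending_reconfig: List[str] = []

    def deploy_service(self, name: str, detect: Detector) -> "CognitiveSupervisor":
        supervisor = CognitiveSupervisor(name, detect, self)
        self.supervisors[name] = supervisor
        return supervisor

    def alert(self, service: str) -> None:
        """Called by a supervisor when the service's behaviour changes."""
        self.pending_reconfig.append(service)

    def run_heavy_tasks(self, node_load: float) -> List[str]:
        """Reconfigure (or retrain) only while the worker node is lightly loaded."""
        if node_load >= self.load_threshold:
            return []
        done, self.pending_reconfig = self.pending_reconfig, []
        return done


class CognitiveSupervisor:
    """Lightweight real-time state detection for a single service."""

    def __init__(self, service: str, detect: Detector,
                 controller: CognitiveController) -> None:
        self.service = service
        self.detect = detect
        self.controller = controller
        self.last_state: Optional[str] = None

    def observe(self, metrics: Sequence[float]) -> str:
        state = self.detect(metrics)
        if self.last_state is not None and state != self.last_state:
            self.controller.alert(self.service)
        self.last_state = state
        return state


controller = CognitiveController()
web = controller.deploy_service("web", lambda m: "overloaded" if m[0] > 0.8 else "busy")
web.observe((0.5,))                                # first observation, no alert
web.observe((0.9,))                                # state changed, controller alerted
deferred = controller.run_heavy_tasks(node_load=0.9)   # node busy: nothing runs
handled = controller.run_heavy_tasks(node_load=0.1)    # node quiet: work is done
```

Keeping the supervisor free of training logic is what lets it stay small, while the controller can defer its expensive work until it will not compete with the services it manages.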

Next steps in 5G and beyond

Our proposed state-detection-based cognitive orchestration solution introduces several new opportunities in the field of autoscaling research because it allows fully automated, service-agnostic and holistic autoscaling of computational resources.

Our cognitive autoscaling solution could be utilized in a number of applications, such as safety-critical systems, futuristic networking architectures and virtualized resource management, as we proceed to 5G and beyond. For example, our zero-touch architecture approach could be utilized in heterogeneous applications, such as 5G service chaining, in order to decrease the operational complexity of services that are composed of multiple components.

Read more

Learn more about the future direction of cloud and cognitive technologies in the Ericsson Technology Review article library.

Visit our future technologies page.

The authors, J. Reijonen and E. Hiltunen, together with M. Komu and M. Elmusrati, recently submitted a research paper to the 1st IEEE International Conference on Autonomic Computing and Self-Organizing Systems (ACSOS 2020).
