Ericsson’s next-gen AI-driven network dimensioning solution
Accuracy is a key requirement in network dimensioning, as an incorrect assessment would result in costly penalties for Ericsson and ultimately damage customer trust. A new data-driven network dimensioning tool is currently being developed at Ericsson, which employs Machine Learning (ML) to create models that successfully predict resource requirements. Additionally, the ML solution facilitates the incorporation of telco domain knowledge into the modeling process, which makes the models more explainable and trustworthy from a telco perspective.
Network dimensioning in telecommunications
Today, telecommunications hardware and software suppliers such as Ericsson have a wide range of telco products and solutions in their portfolio. In the Digital Services space, these typically come in the form of software (SW) applications that realize certain telco network functions. After a proper presales process has been conducted, a Communication Services Provider (CSP), often known as the operator, typically deploys these telco applications on top of its own datacenters. These datacenters are also known as Network Function Virtualization Infrastructure (NFVI).
Network Functions Virtualization (NFV) has many advantages today as it enables the operator to scale up and down faster than the traditional Physical Network Functions (PNF), where the hardware had to be procured, commissioned and connected to be made available for deployment of a new network application. As mentioned before, Virtual Network Functions (VNF) are then deployed as SW applications on top of an NFVI to provide their telco services at the operator’s premises. Recently, Cloud Native Network Functions (CNF) have come into the picture to enable an even more dynamic and changing application ecosystem, with a further breakdown of all the components and domains (into so-called microservices) that realize a telco function, all managed by Kubernetes.
These technology trends and realities imply that Ericsson is faced with the challenge of having to deploy SW applications that provide the expected grade of service of their telco functions, but with little to no control of the resources and dynamics of the target infrastructure. For instance, a common scenario could require Ericsson to deploy its SW applications on an NFVI configuration or variant that has never been tested before. This creates uncertainty about the application’s resource consumption (in the form of CPU load, memory, storage and networking) and about overall telco service performance when deployed in that untested infrastructure.
In summary, accurate network dimensioning becomes a very challenging endeavor in the situations described above and calls for a data-driven approach that can cope with the increased complexity of such deployment scenarios and that can properly scale to the current and future needs according to the trends described before. Our approach has considered historical data from Performance Management (PM) counters, which are normally used for monitoring the network behavior and other factors.
Network dimensioning at Ericsson
Capacity and Node Dimensioning (CANDI) is a tool used across Ericsson to execute network dimensioning. It is operated by diverse dimensioning experts who iteratively set different parameters in it to get a result that is representative of the target deployment or customer network. Data-Driven CANDI is a web application powered by ML to execute network planning and dimensioning. It is envisioned as the next evolutionary step to the existing CANDI tool. It is initially a self-service tool where the user can import customer PM data, and subsequently train an ML model on this data for dimensioning of that customer network to predict standard network load indicators like average CPU load and memory usage. The tool provides a solution for modeling different traffic scenarios easily as well. Data-Driven CANDI’s high-level architecture is shown below.
Fig. 1. Data-Driven CANDI High Level System Architecture
Telco interpretability of network dimensioning ML models
As a small recap, dimensioning practices for telco applications are calling for a data-driven approach to address accuracy needs in the increased complexity of the telco infrastructure ecosystem. Data-Driven CANDI is then the answer to these needs, and it is using ML models for its estimations.
For the trained ML models in Data-Driven CANDI, the main challenge in meeting the accuracy requirements was telco interpretability. In practice, telco interpretability means that the coefficients (for example, individual feature contributions) of the resulting model should make sense to a dimensioning expert so that trust is ensured in the model’s predictions.
To elaborate on telco interpretability, let’s take the example of the Call Session Control Function (CSCF) in the IP Multimedia Subsystem (IMS). CSCF is responsible for routing requests from IMS users, as well as providing the necessary services by triggering corresponding application servers in the network. The CSCF PM counters can be used to compute required traffic model figures (or features) and their corresponding percentage of generated CPU load. The data is produced according to telco PM standards, in an aggregated manner, based on a preconfigured granularity period.
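As a rough illustration of how aggregated PM counters could be turned into traffic model features, the sketch below converts per-period counter totals into per-second rates and pairs them with the measured average CPU load. The counter names, the 900-second granularity period and the `PmSample` structure are assumptions made for this example, not actual CSCF counters or the Data-Driven CANDI data model.

```python
from dataclasses import dataclass


@dataclass
class PmSample:
    """One aggregated PM report (hypothetical structure and counter names)."""
    period_seconds: int           # granularity period of the report
    counters: dict                # counter name -> total events in the period
    avg_cpu_load_percent: float   # target value measured over the same period


def to_feature_row(sample, feature_names):
    """Convert per-period counter totals into per-second rates plus the target."""
    rates = [
        sample.counters.get(name, 0) / sample.period_seconds
        for name in feature_names
    ]
    return rates, sample.avg_cpu_load_percent


# Hypothetical traffic model figures, for illustration only.
FEATURES = ["registrations", "session_setups"]

sample = PmSample(
    period_seconds=900,
    counters={"registrations": 18000, "session_setups": 45000},
    avg_cpu_load_percent=37.5,
)
row, target = to_feature_row(sample, FEATURES)
print(row, target)
```

Each report then contributes one training row: the rates are the features, and the average CPU load is the value the model learns to predict.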
From a domain perspective, certain CSCF traffic model figures contribute the most when estimating Average CPU Load. Thus, it is expected that the trained model coefficients corresponding to these features provide a consistent ranking, in line with CSCF characteristics testing and product verification results.
Challenges of standard linear modelling approaches
Standard linear models applied to the data yielded the lowest prediction error, but the individual feature contributions (model coefficients) did not match the domain knowledge of how the network function’s capacity behaves under the offered traffic. From a domain perspective, the standard linear model predictions were therefore unexplainable and untrustworthy. Further investigation of the data showed that multi-collinearity was the root cause. Because of multi-collinearity, both accuracy and reliability in determining the effects of individual features of the dimensioning model were lost. It was then determined that the modelling strategy should cater for the strong multi-collinearity in the data without dropping important features in the process. Moreover, it would be ideal to be able to inject domain knowledge into the dimensioning models during training.
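To make the multi-collinearity problem concrete, the following sketch computes variance inflation factors (VIFs) on synthetic data in which two features move almost in lockstep. The data, feature meanings and coefficients are invented for illustration and are not Ericsson PM data. A VIF well above roughly 5 to 10 is commonly read as a sign that a feature’s coefficient cannot be estimated reliably.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (n_samples x n_features).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all other columns plus an intercept.
    """
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residual = target - A @ coef
        r2 = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


# Synthetic example: figure B is nearly a scaled copy of figure A.
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 100, n)           # hypothetical traffic figure A
x2 = 0.5 * x1 + rng.normal(0, 1, n)   # figure B, strongly collinear with A
x3 = rng.uniform(0, 100, n)           # independent figure C
X = np.column_stack([x1, x2, x3])
y = 0.2 * x1 + 0.3 * x2 + 0.1 * x3 + rng.normal(0, 1, n)  # synthetic CPU load

vifs = variance_inflation_factors(X)

# Ordinary least squares with an intercept, for comparison.
A = np.column_stack([np.ones(n), X])
ols_coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print("VIF per feature:", np.round(vifs, 1))
print("OLS coefficients (intercept first):", np.round(ols_coef, 3))
```

When the VIFs of A and B are large, ordinary least squares can trade weight between the two features almost freely, which is how coefficients end up contradicting domain expectations even though the overall prediction error stays low.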
Domain knowledge could be specific to each network function or generic. Theoretical analysis also showed that this approach scales well as long as the accuracy and interpretability needs are met. At the same time, it would be even more desirable for the modelling strategy to remain robust in the absence of domain knowledge. Standard techniques could not fulfill all of these requirements.
Ericsson’s solution
From an AI/ML perspective, the telco dimensioning problem can be framed as a regression problem. The proposed solution is Bayesian Regression, which proved to be more robust to multi-collinearity of features. Additionally, our approach allows the incorporation of domain knowledge into the modeling (for example, in the form of priors, bounds and constraints). This avoids dropping network features that are critical for the domain and supports the interpretability requirements that make a model trustworthy.
Fig. 2. Machine Learning Architecture with domain knowledge incorporation
As depicted in figure 2, the domain knowledge is incorporated into the Bayesian Regression model where it influences the cost function and the search space of the optimizer.
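One simple way to picture how priors shape the cost function and bounds shape the search space is a maximum a posteriori (MAP) estimate for a linear model with Gaussian priors and lower bounds on the coefficients. The sketch below is an illustrative assumption, not the model used in Data-Driven CANDI: the function `map_estimate`, the prior values, the noise level and the synthetic data are all invented for this example, and an intercept is omitted for brevity.

```python
import numpy as np


def map_estimate(X, y, prior_mean, prior_std, noise_std, lower=None, n_iter=20000):
    """MAP weights for y = X w + noise with independent Gaussian priors on w.

    Minimises
        ||y - X w||^2 / (2 noise_std^2) + sum((w - prior_mean)^2 / (2 prior_std^2))
    subject to w >= lower (element-wise) via projected gradient descent.
    Without bounds, the closed-form solution is returned.
    """
    prior_mean = np.asarray(prior_mean, dtype=float)
    precision = 1.0 / np.asarray(prior_std, dtype=float) ** 2
    H = X.T @ X / noise_std**2 + np.diag(precision)      # Hessian of the cost
    b = X.T @ y / noise_std**2 + precision * prior_mean  # gradient is H @ w - b
    if lower is None:
        return np.linalg.solve(H, b)
    lower = np.asarray(lower, dtype=float)
    step = 1.0 / np.linalg.eigvalsh(H)[-1]               # 1 / largest eigenvalue
    w = np.maximum(prior_mean, lower)
    for _ in range(n_iter):
        # Gradient step on the cost, then projection back into the bounds.
        w = np.maximum(w - step * (H @ w - b), lower)
    return w


# Synthetic, strongly collinear data (illustration only).
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 1, n)
x2 = x1 + rng.normal(0, 0.05, n)
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 2.0 * x2 + rng.normal(0, 0.1, n)

# Assumed domain knowledge: both contributions are positive and of order 1.
prior_mean = np.array([1.5, 1.5])
prior_std = np.array([0.5, 0.5])

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
w_map = map_estimate(X, y, prior_mean, prior_std, noise_std=0.1)
w_bounded = map_estimate(X, y, prior_mean, prior_std, noise_std=0.1,
                         lower=np.zeros(2))

print("OLS:            ", np.round(w_ols, 3))
print("MAP (priors):   ", np.round(w_map, 3))
print("MAP (+ bounds): ", np.round(w_bounded, 3))
```

The prior term keeps the problem well conditioned even when the features are nearly collinear, and the projection step guarantees that no coefficient falls below its bound, so a negative load contribution cannot appear. A full Bayesian treatment would also yield uncertainty estimates for each coefficient, which this point estimate leaves out.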
Improved interpretability in model coefficients
To discuss the improvements, consider the CSCF virtual network function and the comparison of modeling error metrics in table 1.
| Error Metrics | Standard Benchmark Approach | Ericsson Solution (With domain information) |
| --- | --- | --- |
| Avg. Train Root Mean Square Error (On Cross Validation) | 0.86 | 1.28 |
| Avg. Test Root Mean Square Error | 1.10 | 1.37 |
Table 1. Modeling Error Comparison with Ericsson Solution and standard approaches
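For readers unfamiliar with the metrics in table 1, the sketch below shows one common way to compute average train and validation Root Mean Square Error over k folds. The exact evaluation protocol behind table 1 is not described here, so the fold count, the least-squares model and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Square Error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def cross_validated_rmse(X, y, k=5, seed=0):
    """Average train and validation RMSE of a least-squares fit over k folds."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    train_scores, val_scores = [], []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_val = np.column_stack([np.ones(len(val)), X[val]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        train_scores.append(rmse(y[train], A_train @ coef))
        val_scores.append(rmse(y[val], A_val @ coef))
    return float(np.mean(train_scores)), float(np.mean(val_scores))


# Synthetic data: one traffic feature and a noisy load figure.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 50, size=(100, 1))
y_demo = 0.8 * X_demo[:, 0] + 5.0 + rng.normal(0, 1.0, 100)

train_rmse, val_rmse = cross_validated_rmse(X_demo, y_demo, k=5)
print(f"avg train RMSE: {train_rmse:.2f}, avg validation RMSE: {val_rmse:.2f}")
```

RMSE is expressed in the same unit as the predicted quantity, so for a model that predicts Average CPU Load in percent, an RMSE near 1 corresponds to a typical deviation of about one percentage point.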
With the Standard Benchmark Approach (for example, Linear Models), the trained model’s coefficients did not align with the CSCF domain knowledge. There were even cases where negative coefficients were obtained (for example, the model indicating that the more transactions of a certain traffic model figure there were, the less Average CPU Load was generated).
On the other hand, our solution, the telco domain-inspired model, yielded results that complied with domain expectations by assigning suitable weights to individual features. Additionally, as indicated in table 1, the prediction error of this solution is competitive with baseline modeling, and it is superior to the dimensioning estimations provided by traditional tooling for the given range of prediction.
After testing our modeling techniques with different datasets, and from different IMS network functions through Data-Driven CANDI, we have been able to consistently obtain models with less than 4 percent error, with interpretable coefficients.
Key takeaways
With Ericsson’s next generation ML solution for dimensioning, multi-collinear features of network data, deemed as mandatory for resource consumption estimations, are retained without significantly compromising competitive performance and accuracy needs. With the proposed modelling strategy, telco domain knowledge can be incorporated during model training, achieving interpretability needs. Both accuracy figures and insertion of domain knowledge led to a very trustworthy modeling approach. Additionally, the modelling strategy is robust across various granularities of network traffic data, and shows promising robustness even in the absence of domain knowledge.
The business impact
- Modeling strategy copes with the complexities in the telco industry trends, and the many NFVI variants that in many cases are untested by Ericsson, where dimensioning shall also be accurate and interpretable.
- Modeling strategy can be applied across various network functions in the IP Multimedia Subsystem, with very promising potential to be applied to other domains such as 5G.
- It is also worth noting that Data-Driven CANDI, the vehicle of this modeling strategy, was recognized as a finalist in the Dataiku Frontrunner Awards 2021, in the Value at Scale category.
Want to know more?
Read how NFV is allowing network operators to manage and expand their network capabilities on demand using virtual, software-based applications where physical boxes once stood in the network architecture.
Get an introduction to data-driven network architecture in global telecommunications systems here.