Enhancing mobile networks: The hybrid AI advantage
- Automation is key to cost-effective operation, and AI is a key contributor.
- Hybrid AI combines symbolic approaches with data-driven AI, such as Machine Learning-trained models.
- Hybrid approaches to AI can improve network automation throughout its lifecycle when compared to strictly data-driven approaches.
5G subscribers are expected to reach 4.6 billion by the end of 2028, according to the 2023 edition of the Ericsson Mobility Report. The report also explains that new services being introduced will allow certain devices to automatically select network slices. Specific examples include user equipment route selection policy (URSP) and access for reduced capability (RedCap) devices such as smartwatches, packet routers, and Internet of Things (IoT) devices. In light of this heightened demand for connectivity, automating facets of network operation becomes crucial to ensure cost-effective functioning. Artificial intelligence (AI) technologies are widely regarded as enablers for automation. The most popular category of AI is machine learning (ML)-trained models.
3GPP, the main standardization body, and other standardization organizations specify signaling flows to facilitate the lifecycle management of ML-trained models. One example of an ML-trained model is the artificial neural network (ANN). Provided with input, these models generate output, such as a prediction of a future network condition. ML techniques like deep learning are used to train neural networks from input data observed in the network. If data is available a priori in the form of a training dataset, the models are trained using supervised or unsupervised learning techniques. Training can also be done using real-time data in what is known as online reinforcement learning (RL). In deep learning, the training process involves adjusting a neural network’s numerical weights to minimize a loss function, that is, the difference between the output of the model and the so-called ground truth, which represents the real output that the model should ideally produce.
Challenges with ML models
While ML-trained models theoretically have applicability in network automation, their deployment in production environments presents several challenges.
- Generalization
For trained models, the variance of data in the dataset may not match the variance of data that they encounter when deployed. In such cases, predictive models cannot generalize, meaning that they cannot perform as expected in environments they were not trained for. For example, let’s assume a model is trained to turn off unused radio frequency (RF) ports in radio units (RUs) to save energy. The action of turning on or off a port relies on the model predicting user throughput. If the model is trained using data from a rural environment, where traffic decreases during evening hours, it may not provide accurate predictions when deployed in urban environments, where data traffic increases in the evening and during rush hour.
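One pragmatic guard against this failure mode is to check, before trusting a deployed model, whether live data resembles the training data. Below is a minimal sketch of such a check; the throughput figures are illustrative, and the threshold test is deliberately crude (production systems would use richer statistics, such as a population stability index):

```python
from statistics import mean, stdev

def distribution_shift(train_samples, live_samples, k=3.0):
    """Flag a possible distributional shift when the live mean falls
    outside k standard deviations of the training data. Illustrative
    heuristic only."""
    mu, sigma = mean(train_samples), stdev(train_samples)
    return abs(mean(live_samples) - mu) > k * sigma

# Rural training data: traffic declines through the evening (Mbps per hour).
rural_evening = [40, 35, 30, 25, 20, 15]
# Urban deployment: traffic climbs toward rush hour.
urban_evening = [60, 80, 110, 150, 180, 200]

print(distribution_shift(rural_evening, rural_evening))  # False: same regime
print(distribution_shift(rural_evening, urban_evening))  # True: shifted regime
```

When the check fires, the system can fall back to a safe default policy or trigger retraining instead of acting on unreliable predictions.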
- Safety in learning
For models trained using real-time data, RL algorithms explore the output parameter space (referred to as actions in RL terms) and, through trial and error, progressively improve at taking optimal actions given an input (referred to as state). This is generally known as an exploration-exploitation learning technique. With this technique, exploration is random in the early stages, whereas in later stages, the trained neural network is exploited to produce an output. Exploration can be dangerous, as it may produce collateral negative effects on network operations. For example, let’s assume the same RF port management use case as before, and 4 RF ports per RU. An exploration technique would involve turning off or on all permutations of RF ports in the RU. Some of these actions could cause user traffic to be dropped (for example, when turning off all four RF ports).
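A common mitigation is to wrap exploration in a symbolic safety predicate that rejects known-harmful actions before they reach the live network. A minimal sketch for the four-port example follows; the predicate and the rejection-sampling strategy are illustrative simplifications:

```python
import random

NUM_PORTS = 4

def is_safe(action):
    """Symbolic guardrail: an action is a tuple of on/off flags, one per
    RF port. Never allow every port to be off at once, since that would
    drop all user traffic."""
    return any(action)

def explore_safely(rng):
    """Rejection sampling: draw random on/off configurations, but only
    release actions that pass the safety predicate to the network."""
    while True:
        action = tuple(rng.random() < 0.5 for _ in range(NUM_PORTS))
        if is_safe(action):
            return action

rng = random.Random(0)
# The all-off configuration can never escape the sampler.
assert all(is_safe(explore_safely(rng)) for _ in range(1000))
```

The RL agent still explores most of the action space, but the catastrophic corner of it (all four ports off) is fenced off by knowledge the agent did not have to learn by trial and error.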
- Explainability and causality
Causality and explainability are complementary approaches aimed at informing humans about how and why a trained model produces a certain output given an input. This is important not only for humans to establish trust in the decisions of the AI system but also for troubleshooting purposes in case the output deviates from what was expected.
Causal ML identifies cause-and-effect relationships between model input variables (features) and outputs, specifically discerning the impact that a feature has on producing a certain outcome. Explainable AI, on the other hand, includes techniques that validate causal assumptions. Considering, for example, the energy-saving use case previously mentioned, causal ML can be used to establish that there is a causal relationship between user equipment (UE) throughput and RU power consumption. Explainable AI techniques can then be used to show that turning off RF ports has the largest impact on energy consumption, whereas other features such as electrical tilt adjustment of the antenna, downlink power adjustment, and microsleep have relatively little effect. Currently, it is challenging for deep learning (DL) models to learn causal relationships and perform causal reasoning on their own. Oftentimes, domain expert input is needed in conjunction with external metadata to build causal and explainable models.
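As an illustration of the explainability side, a model-agnostic permutation-style importance score can rank features by how much perturbing them changes a model's output. The sketch below uses a toy linear power model with made-up coefficients, and a deterministic column reversal standing in for a random shuffle:

```python
def power_model(ports_on, tx_power_dbm, tilt_deg):
    """Toy surrogate for RU power draw in watts; the coefficients are
    illustrative assumptions, not measured values."""
    return 20.0 * ports_on + 1.5 * tx_power_dbm + 0.1 * tilt_deg

def permutation_importance(model, rows, feature_idx):
    """Mean absolute change in the model output when one feature column
    is permuted (reversed here, in place of a random shuffle): a simple,
    model-agnostic importance score."""
    permuted = [row[feature_idx] for row in rows][::-1]
    deltas = []
    for row, new_value in zip(rows, permuted):
        perturbed = list(row)
        perturbed[feature_idx] = new_value
        deltas.append(abs(model(*perturbed) - model(*row)))
    return sum(deltas) / len(deltas)

# A small grid of operating points: ports on, Tx power (dBm), tilt (deg).
rows = [(p, tx, tilt)
        for p in (1, 2, 3, 4) for tx in (30, 40) for tilt in (0, 6)]
scores = [permutation_importance(power_model, rows, i) for i in range(3)]
print(scores)  # RF port count dominates, mirroring the article's example
```

On this toy model the port-count feature scores far above Tx power and tilt, which is the kind of ranking an operator would use to decide where energy-saving actions pay off most.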
- Multi-task learning
ML models are typically trained for a specific task, which deprives them of benefiting from other tasks that might be complementary. Looking at the example from the previous challenge, turning RF ports on and off may be part of an overarching strategy for energy savings in RAN. Similar to the RF port management models, there can be other models that handle radio resource management, make decisions on the conditional handover of mobile devices, adjust the electrical tilt of the antenna, control the transmission power on the downlink radio channel, and so on. Even though it’s possible to combine such models through multi-task learning techniques, it can be challenging to know which tasks benefit when combined under which circumstances. Consequently, a different type of intelligence is required, which can easily be extended with external knowledge, without requiring training. We introduce this approach in the next section.
Symbolic AI
Symbolic AI processes symbols, that is, sequences of characters, such as words, that represent real-world knowledge. Symbols may represent objects (for example, mobile devices, radio units, antennas, baseband boards, and so on), or concepts, for example, key performance indicators (KPIs), quality of service (QoS) policies, routing rules, billing plans, and so on. Symbols can also have relationships with each other, for example, an engine being part of a car, or an RF port belonging to an RU. Symbols and their relationships constitute knowledge graphs stored in knowledge bases, utilized by symbolic techniques such as reasoning and planning to generate new knowledge and insights.
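A minimal way to picture this is a set of (subject, relation, object) triples plus a reasoning step that derives facts not stored explicitly. All names below are illustrative:

```python
# Symbols and relationships as (subject, relation, object) triples:
# a minimal stand-in for a knowledge base.
TRIPLES = {
    ("rf_port_1", "part_of", "ru_1"),
    ("rf_port_2", "part_of", "ru_1"),
    ("ru_1", "part_of", "rbs_east"),
    ("kpi_throughput", "measured_at", "ru_1"),
}

def related(relation, obj):
    """Look up all subjects linked to `obj` via `relation`."""
    return sorted(s for s, r, o in TRIPLES if r == relation and o == obj)

def part_of_closure(entity):
    """A simple reasoning step: follow part_of edges transitively to
    derive new knowledge (rf_port_1 belongs to rbs_east, even though no
    single triple says so)."""
    found, frontier = set(), {entity}
    while frontier:
        step = {o for s, r, o in TRIPLES if r == "part_of" and s in frontier}
        frontier = step - found
        found |= step
    return found

print(related("part_of", "ru_1"))            # ['rf_port_1', 'rf_port_2']
print(sorted(part_of_closure("rf_port_1")))  # ['rbs_east', 'ru_1']
```

Real knowledge bases use richer formalisms (ontologies, rule engines, graph databases), but the principle is the same: explicit symbols in, derived knowledge out.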
When compared to ML models, symbolic AI techniques can address the challenges presented in the previous section. Because they are based on human-readable symbolic knowledge, they are better at establishing causal chains, and their decision-making process is transparent and interpretable by humans, giving them a natural advantage in explainability. They can also generalize more easily than their ML counterparts through the addition of symbols to the knowledge graph. Examples of symbolic AI techniques include reasoning and planning.
On the other hand, ML models are better at handling uncertainty, as they can provide predictions using data they have not been trained with before. While they often demand substantial input data for training and verification, ML models do not necessitate explicitly created knowledge in the form of symbols—something that would entail human-in-the-loop involvement.
Recently, hybrid AI solutions, combining the advantages of both symbolic and ML-based approaches, have emerged.
Addressing ML challenges using hybrid AI
In our hybrid AI approach, we explore solutions that integrate both symbolic AI approaches and ML-trained models. The choice between symbolic and ML-based AI combinations depends on the specific use case and resource availability, including data, knowledge, and computational resources. In particular, combinations of ML-trained ANNs and symbolic methods have been grouped under the term neurosymbolic AI and taxonomized by Henry Kautz as well as by Garcez and Lamb.
In addressing the generalization challenge, a strategy often involves splitting tasks between the symbolic and ML domains to adapt to potential distributional shifts. For example, referring to the aforementioned energy-saving use case, a symbolic part can be added to define the environment in which the ML model was trained. This profile may include symbols describing the RU model, operating frequency, bandwidth, as well as UE traffic, and mobility patterns. Subsequently, for each new model deployment (for example, in another radio base station), a symbolic process could compare the reference profile with the current profile of the new deployment and decide whether to transfer the ML model as is or undergo some additional training. Over time, multiple profiles can be created, enhancing the efficiency of ML model deployment.
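Such a symbolic gate could be as simple as comparing profile symbols before deciding whether to transfer the model or retrain it. A sketch follows, where the profile fields and values are hypothetical:

```python
# Deployment profiles as dictionaries of symbols; a symbolic gate decides
# whether a trained model transfers as-is.
reference_profile = {
    "ru_model": "RU-A",
    "band_mhz": 3500,
    "bandwidth_mhz": 100,
    "traffic_pattern": "rural_evening_decline",
}

def transfer_decision(reference, candidate, critical=("traffic_pattern",)):
    """Transfer the model unchanged only when every critical symbol
    matches; otherwise request additional training on local data."""
    mismatches = [key for key in critical
                  if reference.get(key) != candidate.get(key)]
    if not mismatches:
        return "transfer_as_is"
    return ("retrain", mismatches)

urban_site = dict(reference_profile, traffic_pattern="urban_evening_peak")
print(transfer_decision(reference_profile, dict(reference_profile)))
# transfer_as_is
print(transfer_decision(reference_profile, urban_site))
# ('retrain', ['traffic_pattern'])
```

Returning the list of mismatched symbols, rather than a bare yes/no, also tells the operator which aspects of the new site need local training data.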
Applying hybrid AI to radio base station energy efficiency
In this section, we discuss how hybrid AI can be used to combine ML-trained models with symbolic techniques, using radio base station energy efficiency as an example. Specifically, we examine the challenge of saving energy at a radio base station while ensuring end-user QoS. Different predictive, deep learning-based approaches are designed for this purpose. One of them involves turning individual RF ports on or off. Another entails configuring a sleep mode on the cell, with each mode characterized by a sleep duration. A third involves adjusting the downlink transmission (Tx) power. However, not all approaches may be suitable at all times, as contextual knowledge is required to determine which of them would be more effective under certain conditions. Hence, an optimal solution to the problem might involve a combination of deep learning approaches and symbolic planning, where the symbolic planner decides when and for how long each deep learning method should take effect.
Figure 1 illustrates an ML-trained model. The model is a feed-forward type ANN, comprising layers of nodes called neurons. The input is processed in each layer and transmitted to the next.
In our example, this model takes as input measurements of aggregate downlink and uplink radio throughput and the number of active UEs. It then predicts whether an RF port in a radio unit can be turned off when not needed, to save power. Turning off an RF port involves risk, as any uplink or downlink traffic that would have been received or transmitted on the inactive RF port will be dropped.
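To make the structure concrete, here is a minimal forward pass for such a feed-forward network. The weights are hand-picked placeholders chosen to make the toy decision plausible; a real model would learn its weights from network data:

```python
import math

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, layers):
    """One forward pass; each layer is a (weight_matrix, bias_vector)
    pair. ReLU activates the hidden layers; the final layer is linear."""
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        x = z if i == len(layers) - 1 else [relu(v) for v in z]
    return x

# Placeholder weights. Inputs: DL throughput (Mbps), UL throughput (Mbps),
# number of active UEs.
hidden_layer = ([[1.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]], [0.0, 0.0])
output_layer = ([[0.1, 0.5]], [-1.0])

def keep_port_on(dl_mbps, ul_mbps, active_ue):
    """True when the predicted load justifies keeping the RF port powered."""
    score = forward([dl_mbps, ul_mbps, active_ue],
                    [hidden_layer, output_layer])[0]
    return sigmoid(score) >= 0.5

print(keep_port_on(0.5, 0.2, 1))     # False: near-idle cell, port can sleep
print(keep_port_on(50.0, 10.0, 30))  # True: busy cell, keep the port on
```

The sigmoid output can be read as the model's confidence that the port is needed, which is what the surrounding system would threshold or feed into a planner.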
In addition to the above, there could be multiple other power-saving models. Let us describe two examples:
- An ML-trained model to adjust downlink transmission power
In addition to the aggregate throughput on the uplink and downlink channels, as presented previously, the model uses as input an average reference signal measurement (for example, received signal strength indicator) from all UE in the area. For energy conservation, the model is trained to generate the minimum Tx power level required to uphold quality of service for all UE attached to the cell. Consider, for instance, the QoS requirement of a guaranteed bitrate and a maximum latency for all UE attached to the cell. The model would be trained to generate the minimum Tx power in dBm needed to maintain that QoS. There is a risk that setting the Tx power too low may result in dropped traffic for UE situated toward the cell edge.
- An ML-trained model to configure the sleep mode on the cell
This model takes the average packet interarrival time on the output buffer of the RU as input, learns patterns in data packet arrival, and determines the sleep mode for the cell. Longer-duration sleep modes have the potential to save more energy, as more hardware components of the cell are put to sleep. However, there is also a risk that incoming downlink traffic may be dropped.
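A trained model would learn this mapping from data; a crude symbolic stand-in that captures the intuition (longer packet gaps permit deeper sleep) could look like this, with illustrative mode names and thresholds:

```python
# Illustrative sleep-mode table: (minimum mean packet inter-arrival time
# in ms, mode name). Longer gaps between packets allow deeper sleep, as
# more hardware components can be powered down.
SLEEP_MODES = [(10.0, "micro_sleep"),
               (100.0, "light_sleep"),
               (1000.0, "deep_sleep")]

def pick_sleep_mode(interarrival_ms):
    """Choose the deepest sleep mode whose threshold the mean packet gap
    clears; None means traffic is too dense to sleep without drops."""
    mean_gap = sum(interarrival_ms) / len(interarrival_ms)
    chosen = None
    for threshold, mode in SLEEP_MODES:
        if mean_gap >= threshold:
            chosen = mode
    return chosen

print(pick_sleep_mode([2.0, 3.0, 4.0]))    # None: heavy traffic
print(pick_sleep_mode([150.0, 250.0]))     # light_sleep
print(pick_sleep_mode([1500.0, 2500.0]))   # deep_sleep
```

The ML model's advantage over a fixed table like this is that it can anticipate upcoming bursts from learned arrival patterns rather than reacting only to the recent mean.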
Selecting the best model
In many situations, selecting which model or models to use for power-saving depends on knowledge that is not inherently embedded in the models. Some examples include:
- Differing requirements: Different cells may have distinct requirements depending on the owner of the cell, the deployment area, or both. For example, a public macro-cell may be authorized to keep an RF port active at all times, regardless of whether traffic is being transmitted or not, to offer a baseline service for any UE seeking to connect. On the other hand, a small cell used in an indoor location for macro-cell offloading may not be subject to such a rule.
- Varied effects on power consumption: Different methods may have varying effects on power consumption, depending on the location and time of day. For example, the method to turn off the RF ports may be more effective and less risky during nighttime in rural areas. On the other hand, the approach to adjust the Tx power may be more effective in dense deployments in urban environments, where there are more obstacles, and UE is highly mobile.
- Local regulations: Transmissions exceeding a specific power threshold may be temporarily or permanently prohibited to prevent interference in adjacent bands.
A symbolic method
It would be possible to train another ANN to determine which model or models to execute. However, as the requirements above may change over time and across locations, retraining would be necessary, rendering it costly to keep such a model current. Furthermore, re-architecting of the input and output layers, when requirements are added or removed, may also be required, making the process even more expensive.
A symbolic method, on the other hand, stores such knowledge as symbols and relationships in a knowledge base and processes that knowledge to decide which model or models to use. As the system scales and is redeployed from cell to cell, its maintenance cost is therefore smaller than that of a purely ANN-based solution. Figure 2 illustrates the block components of a hybrid system that could implement the proposed functionality.
The process is triggered by a user query. A user, in this case, can be the owner of the mobile network, such as a service provider. The query can be expressed through any type of interface; for the sake of this example, we use a natural language prompt. The query contains the requirements that the system should meet. In the context of this example, the query contains the goal, that is, a desirable end-state. For example, the goal may be to reduce power consumption by a certain amount within a specified time, for the service provider to meet sustainability targets. This query can be in the form of an intent, with requirements stated as expectations. The query is transformed from natural language to a machine-readable form, for example, an entity-relationship diagram, wherein entities are recognizable symbols. This transformation is achieved by a query interpretation module, which can be an ML-trained model such as a long short-term memory network or a transformer.
The symbolic part of the system also includes mechanisms for retrieving the state of the system, that is, variables such as Performance Monitor (PM) counters or Configuration Management (CM) attributes that can be used to select which ANN or ANNs to use. Furthermore, a knowledge base contains all the contextual, external knowledge needed to perform the selection.
The selection process itself, in the context of this example, is a planning problem. Given a goal set by the user or defined by the system, a symbolic engine executes a planner, identifying steps based on knowledge from the knowledge base. In this case, the "steps" are the execution of ANNs and the actuation of their output in the RU. For example, the planner may decide that for some time it is beneficial to adjust the downlink transmission power, and later may switch to turning off RF ports. The plan is then submitted to a neural engine that executes the ANNs in the plan and configures the real system, in this case the RU, through the operation and maintenance (OAM) interface of the baseband board. Over the course of the configuration period, the planner monitors the state of the system through energy efficiency KPIs such as power consumption and replans when necessary to meet the end goal. An example of such plans is shown in Figure 3.
The figure shows daily plans for two different profiles or radio base stations (RBSs), which, in this example, are location related.
The first profile is a rural RBS serving IoT-type devices such as sensors and actuators in a rural environment. For example, it could be an RBS used in precision agriculture or forest monitoring. In such a setting, the devices periodically transmit and receive data; therefore, the sleep mode adjustment ANN could yield better results, as it can easily learn the transmission patterns and switch off hardware components of the RBS’s RUs effectively. The downlink transmission power adjustment ANN is also run periodically in case some of the UE has relocated and/or new UE was installed in areas where current coverage is spotty.
The second profile is that of an urban radio base station where mobility and traffic patterns are more seasonal and dynamic. For example, in late-night and early-morning hours, switching off part of the RU using the RF port control model presented previously may make more sense, as there are few active UEs attached to the mobile network and data traffic and mobility patterns are stable. During rush hour, mobility patterns are more dynamic, which promotes the use of the downlink transmission power adjustment ANN. Similarly, during working hours, when, for example, data traffic throughput significantly increases, the sleep mode adjustment model may be used to adapt the sleeping patterns of RUs to the observed traffic.
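A daily plan of this kind can be represented symbolically as time windows mapped to models, together with a replanning rule that reacts to a missed energy target. The knowledge base contents, window boundaries, and energy figures below are illustrative:

```python
# The knowledge base maps daily time windows (start hour, end hour) to
# the power-saving model judged most effective in that window.
KNOWLEDGE_BASE = {
    "urban": {
        (0, 6): "rf_port_control",
        (6, 9): "tx_power_adjustment",
        (9, 17): "sleep_mode_adjustment",
        (17, 20): "tx_power_adjustment",
        (20, 24): "rf_port_control",
    },
}

def plan_day(profile):
    """Symbolic planning step: turn the knowledge base into an ordered
    schedule telling the neural engine which ANN to run in each window."""
    windows = sorted(KNOWLEDGE_BASE[profile])
    return [(start, end, KNOWLEDGE_BASE[profile][(start, end)])
            for start, end in windows]

def replan(profile, hour, measured_kwh, target_kwh):
    """Replanning rule: if measured consumption misses the target at
    `hour`, switch the remaining windows to the most aggressive method."""
    plan = plan_day(profile)
    if measured_kwh <= target_kwh:
        return plan
    return [(s, e, m) if e <= hour else (s, e, "rf_port_control")
            for s, e, m in plan]

print(plan_day("urban")[0])  # (0, 6, 'rf_port_control')
```

A production planner would search over many more actions and constraints, but the shape of the output is the same: a schedule of model activations that the neural engine executes and the planner revises against measured KPIs.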
Conclusion and future work
Network automation is crucial for the cost-effective operation of current and future generations of mobile networks. Hybrid AI has the potential to enhance the efficiency of data-driven, ML-based approaches by addressing their weaknesses. Specifically:
- By providing the ability to generalize, allowing models trained in one domain to be deployed in another and to scale faster. This is particularly important for mobile networks with a distributed RAN architecture.
- By identifying cause-and-effect relationships between observable parameters and desired outcomes and providing explanations to humans about the decision process to establish trust between human operators and AI.
- By offering a safe learning environment for online learning approaches and RL.
Hybrid AI also encounters ongoing challenges that are actively under research. Specifically, building knowledge graphs in the knowledge base requires human experts. Additionally, stored knowledge graphs generally do not scale well, adding computational complexity to symbolic processes using the knowledge base. Curating the knowledge base, by not only adding information correctly but also removing information that is no longer relevant and/or useful to the system, is an ongoing area of research.
At the Ericsson Blog, we provide insight to make complex ideas on technology, innovation and business simple.