
Accelerating reinforcement learning for intent-based RAN optimization

Reinforcement learning (RL) is a key enabler for autonomous network optimization. However, large-scale RL is challenged by a lack of software enablers and efficient architectures. In this article, we address RL for the intent-based optimization of radio access network (RAN) functions. We first introduce IBA-RP, our internal emulator that facilitates research on intent-based RAN automation, including RL-based optimization. Next, we present HYDRA, our internal proof-of-concept framework for RL, and discuss the motivations behind its key design choices. HYDRA, a hyperscale cloud-native framework, has been applied to intent-based optimization experiments with IBA-RP.

Cellular network functions have grown vastly in complexity to support the proliferation of end-user services in terms of connectivity, mobility, quality-of-service, and related key performance indicators (KPIs) [1]. Recently, intent-based management has been proposed as a means to simplify network operations by allowing operators to specify the desired performance of a select few high-level KPIs [2]. These KPI specifications are translated into lower-level objectives via preference mapping and addressed via standard optimization techniques, a prominent example being reinforcement learning (RL) [1]. We briefly describe our intent-based automation research platform (IBA-RP), developed to perform realistic experiments and evaluations of intent-based RAN system optimization.

Deploying RL at the network scale for intent-based optimization is hard. First, the intent is typically not fixed. Instead, the operator can adapt the intent parameters to capture the latest operational preferences. The RL agent must therefore adapt quickly and efficiently in response to intent changes. This motivates the need for robust and scalable RL training pipelines. Second, RL-based optimization agents need to guard against actions that may degrade the network performance. Third, since radio conditions are dynamic, changes in the operating environment (that is, environmental drift) need to be tracked and addressed. In the rest of this article, we describe HYDRA, short for Hyperscale DevOps for RL Automation, an internal cloud-native framework that addresses these challenges. HYDRA is used for proof-of-concept studies at Ericsson, and it is not available for wider use as a product or a feature.

Intent-based RAN system optimization

Intents specify the expectations of a system without revealing the specific methods used to meet those expectations. Intents declared for a system can often be multi-objective, specifying several high-level expectations or objectives that the operator wants the system to fulfill, for instance, capacity (number of served users), system run-time cost, and energy usage. For such multi-objective intents, a weight needs to be associated with each included expectation to indicate its relative importance compared to the others. Moreover, conflicts between expectations, which can make it hard or even impossible to fulfill a given intent, must be identified and resolved, either during planning or at run-time.
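
As a concrete (hypothetical) illustration of this weighting, a multi-objective intent can be reduced to a single fulfillment score by normalizing each KPI against its target and combining the results with the operator-assigned weights. The KPI names, targets, and weights below are invented for the example and are not IBA-RP’s actual intent schema.

```python
# Hypothetical sketch: combining weighted intent expectations into one fulfillment score.
# KPI names, targets, and weights are illustrative only, not IBA-RP's intent schema.

def intent_fulfillment(kpis: dict, expectations: dict) -> float:
    """Weighted average of per-expectation fulfillment, each clipped to [0, 1]."""
    total_weight = sum(e["weight"] for e in expectations.values())
    score = 0.0
    for name, exp in expectations.items():
        value, target = kpis[name], exp["target"]
        # "min" expectations (e.g. energy) are fulfilled when the KPI is at or below target.
        ratio = target / max(value, 1e-9) if exp["direction"] == "min" else value / target
        score += exp["weight"] * min(max(ratio, 0.0), 1.0)
    return score / total_weight

expectations = {
    "served_users": {"target": 900, "weight": 0.5, "direction": "max"},
    "energy_kwh":   {"target": 40,  "weight": 0.3, "direction": "min"},
    "runtime_cost": {"target": 120, "weight": 0.2, "direction": "min"},
}
print(intent_fulfillment({"served_users": 850, "energy_kwh": 45, "runtime_cost": 100}, expectations))
```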

The fulfillment of a given intent and its included expectations can be measured during run-time using selected key performance indicators (KPIs) produced by the system. The KPIs are derived from low-level metrics and data produced by the system, often at fine granularity and high frequency.


Figure 1: RL agent learns a control policy for optimizing a system towards provided intent [3]

Intent-based Optimization using Reinforcement Learning

RL is a promising technology for optimizing the cellular network towards specified intents. For reinforcement learning to be effective, the intent expectations and the KPIs reported by the system must be converted into a reward value that specifies how effectively the system fulfills its intended objective. The objective of the RL agent is then to learn a control policy that maximizes the cumulative reward value over an extended period. An RL agent learns this control policy by performing controlled actions that reconfigure the internal settings of the RAN system. Each action leads to new state data being reported by the system, in terms of selected internal KPIs and metrics, together with a new reward value specifying how well the given intent is currently being fulfilled. The agent continuously updates its internal control policy based on the states, actions, and rewards observed while interacting with the system. After many such interactions, the RL agent learns a policy for performing actions that maximize the long-term reward value, and thus also the long-term intent fulfillment, of the RAN system. Figure 1 illustrates how an RL agent can be trained to learn a policy that performs actions towards a RAN system so that a given user-provided intent is fulfilled.
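
The interaction loop in Figure 1 can be summarized in a few lines of Python. This is a generic sketch only: `env`, `agent`, and `reward_fn` are placeholder interfaces, not the actual IBA-RP or HYDRA APIs.

```python
# Hypothetical sketch of the online RL loop in Figure 1.
# `env`, `agent`, and `reward_fn` are placeholders, not real IBA-RP/HYDRA interfaces.

def run_control_loop(env, agent, reward_fn, steps: int = 1000):
    state = env.observe()                    # KPIs and metrics reported by the system
    for _ in range(steps):
        action = agent.act(state)            # e.g. reconfigure internal RAN settings
        env.apply(action)
        next_state = env.observe()           # new state data after the reconfiguration
        reward = reward_fn(next_state)       # how well the intent is currently fulfilled
        agent.update(state, action, reward, next_state)  # update the control policy
        state = next_state
```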


Figure 2: Intent-based Automation Research Platform (IBA-RP) with RL agent

Intent-based Automation Research Platform

To study intent-based RAN optimization, Ericsson has developed an in-house RAN system emulator, called Intent-Based Automation Research Platform (IBA-RP), with the vision that studies carried out using IBA-RP will lead to future Ericsson product offerings. IBA-RP implements a containerized version of a 5G RAN system consisting of a set of container-based services. Figure 2 illustrates an IBA-RP system together with an RL agent. The RL agent makes scaling decisions for RAN control-plane services to fulfill the operator-provided intents.

The IBA-RP system also holds a set of platform services used to interact with the RL agent and to define and reason around intents. The operator can specify the targeted intent in an intent portal. The provided intent can consist of several high-level expectations or objectives that the operator wants the IBA-RP system to fulfill. The reward handler translates collected system metrics and KPIs, based on the provided intent, into a reward value specifying how well that intent is currently being fulfilled. The data handler collects low-level system KPIs and metrics from the application, infrastructure, and hardware layers and aggregates them into the state information required by the RL agent. The decision handler provides an interface for the RL agent to perform scaling actions towards the IBA-RP system and realizes these scaling decisions towards the underlying application and infrastructure layers. There are also additional services that collect and visualize selected metrics and KPIs (not included in Figure 2).
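
The split of responsibilities between these platform services can be pictured as a few narrow interfaces. The class and method names below are invented for illustration and do not reflect the actual IBA-RP implementation.

```python
# Illustrative (hypothetical) interfaces for the IBA-RP platform services described above.

class DataHandler:
    """Collects low-level KPIs and metrics from the application, infrastructure,
    and hardware layers and aggregates them into RL state information."""
    def get_state(self) -> dict:
        ...

class RewardHandler:
    """Translates collected KPIs into a scalar reward for the currently active intent."""
    def get_reward(self, state: dict, intent: dict) -> float:
        ...

class DecisionHandler:
    """Exposes scaling actions to the RL agent and realizes them towards the
    underlying application and infrastructure layers."""
    def scale(self, service: str, replicas: int) -> None:
        ...
```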

Challenges for using RL for intent-based RAN system optimization

Figure 2 illustrates how an RL agent deployed as part of the IBA-RP system can, in an online fashion, learn a policy for a given operator-specified intent. To learn a policy, the RL agent needs to perform many actions towards IBA-RP over an extended period, often with some performance degradation during the early stages of training. The latter is in most cases considered unacceptable for a live RAN system. Moreover, if the operator changes the targeted intent, the whole RL agent training process needs to be restarted, since the control policy derived for the previous intent is no longer valid for the new intent.

We have researched solutions to the outlined problems using IBA-RP together with HYDRA. We (1) speed up RL agent training by scaling, allowing multiple IBA-RP systems, each with its own deployed agent, to collectively produce state, action, and reward training data. Next, HYDRA allows us to utilize (2) offline RL to derive policies for different user-provided intents using the collected state, action, and reward data. Finally, a policy derived for a specific intent using offline RL in HYDRA can be (3) seamlessly deployed to a running IBA-RP system instance. The latter allows us to evaluate how well a derived policy fulfills the given intent when applied to IBA-RP online. Figure 3 illustrates schematically how we have connected IBA-RP and HYDRA. The rest of this blog post presents HYDRA in more detail.
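
Step (1), scaled data collection, amounts to pooling transitions from several emulator instances into one offline dataset. The sketch below is hypothetical: `instance.step()` and `instance.id` stand in for whatever interface a deployed agent uses to interact with its IBA-RP system.

```python
# Hypothetical sketch: pool (state, action, reward) transitions from several
# IBA-RP instances into one offline dataset for later offline RL training.
from datetime import datetime, timezone

def collect_transitions(instances, steps_per_instance: int):
    dataset = []
    for instance in instances:                  # each instance runs its own agent
        for _ in range(steps_per_instance):
            s, a, r, s_next = instance.step()   # placeholder: one interaction step
            dataset.append({
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "instance_id": instance.id,
                "state": s, "action": a, "reward": r, "next_state": s_next,
            })
    return dataset
```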


Figure 3: Scaled training for different intents using multiple IBA-RP instances and the HYDRA framework

HYDRA: Hyperscale DevOps for RL Automation

HYDRA is a cloud-native framework designed for the systematic and large-scale operation of RL workflows, including the IBA-RP use case described above. HYDRA implements automated pipelines for data collection, feature manipulation, model training and deployment, and inference. In addition, HYDRA supports model lifecycle management (LCM) and a visualization dashboard to help design, monitor, and track the RL workflows. Finally, HYDRA’s drift detection functionality helps detect changes in the target environment and automatically triggers model refinement to help mitigate their impact. HYDRA is a proof-of-concept framework implemented on the Google Cloud Platform (GCP), with the main goals of identifying challenges and proposing design solutions that help accelerate RL adoption and improve the reliability of RL solutions within the Ericsson product portfolio. The HYDRA architecture, illustrated in Figure 4, implements the following key design choices for the intent use case:

  • Data-centrism: Data has a central place in the overall HYDRA workflow. The RAN KPIs and related metrics, as well as their associated metadata, are stored in a database that serves the data visualization, drift detection, and training pipelines. In addition, the data is mapped to trained models for provenance and used in downstream functionalities such as model explainability analysis. Pre-collected data is also central to offline RL, which is the basis for RL policy learning in the intent-based optimization use case as described in the next section.
  • Recommendations: The deployed RL model is served via a recommender workflow, where the model can be queried by the target application (for example, IBA-RP) with an observed state. The RL model is configured to respond with a recommended action or action probability distribution, and optionally the uncertainty associated with each action. The action recommendation can be combined with action outputs from other policies, for example, a baseline control policy or a rule-based policy, to select a final optimal action. This approach helps balance between RL, which is particularly uncertain during the initial training phases, and more reliable but potentially suboptimal classical policies.
  • Risk Mitigation: HYDRA implements a guardrail component to ensure that invalid or unsafe actions do not propagate to the target application. This guardrail component intercepts actions from the RL model and analyses them before sending them onwards. Actions can be classified as unsafe using statistical means (for example, based on previous action outcomes), application-specific rules that could be defined by a subject matter expert, or a combination of the two, as illustrated in the sketch after this list.
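
A minimal guardrail sketch is shown below. The rule functions and the outcome history are hypothetical placeholders; the point is only to show how rule-based and statistical checks can be combined before an action is forwarded.

```python
# Hypothetical guardrail sketch: screen RL actions before they reach the target application.
# `rules` are expert-defined predicates; `history` holds outcomes of previously applied actions.

def guardrail(action, state, rules, history, min_mean_outcome: float = 0.0):
    # 1) Application-specific rules, e.g. "never scale a service below one replica".
    for rule in rules:
        if not rule(action, state):
            return False, "rejected by expert rule"
    # 2) Statistical check based on previous outcomes of the same action.
    past = [h["reward_delta"] for h in history if h["action"] == action]
    if past and sum(past) / len(past) < min_mean_outcome:
        return False, "rejected by outcome statistics"
    return True, "accepted"
```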

Figure 4: HYDRA architecture that has been implemented on GCP

The HYDRA workflow is tailored towards offline RL. This involves data collection pipelines that store time-stamped tuples of [state, action, reward] in a central database. The data is periodically ingested by the training pipeline for model training, including hyperparameter tuning, which eventually generates trained RL models for deployment to the target application. The deployed RL agent interacts with the target environment (for example, IBA-RP) via the guardrail component responsible for protecting against detrimental actions. The logger continuously collects environment data, which is used for drift detection and subsequent model retraining.
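
As a simplified illustration of the periodic ingestion step (not the actual pipeline code), the training pipeline only needs to select the time-stamped tuples logged since its previous run:

```python
# Hypothetical sketch of periodic ingestion from the central database of logged tuples.
from datetime import datetime, timezone

def ingest_since(records, cutoff):
    """Select time-stamped [state, action, reward] tuples logged after `cutoff`."""
    return [r for r in records if datetime.fromisoformat(r["timestamp"]) > cutoff]

records = [
    {"timestamp": "2024-01-01T10:00:00+00:00", "state": [0.2], "action": 1, "reward": 0.7},
    {"timestamp": "2024-01-01T12:00:00+00:00", "state": [0.4], "action": 0, "reward": 0.9},
]
print(ingest_since(records, datetime(2024, 1, 1, 11, tzinfo=timezone.utc)))
```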

Offline RL for Intent-based Optimization

The classical RL approach is online, where the RL agent interacts with the target application (environment) to explore and learn optimal actions in near-real time. However, for most real-world RAN applications, online RL is risky since it can lead to undesirable outcomes, such as performance degradation and even outages [4]. Further, online RL is constrained by the cost and latency of live environment interaction, which makes it unusable in many scenarios. Recently, offline RL has been proposed as a viable approach, as it learns RL control from pre-collected datasets that contain environment interactions.

An important element of offline RL is off-policy evaluation, which helps evaluate trained RL policies before model deployment [5]. We adopt offline RL for intent-based RAN optimization to take advantage of these benefits. Offline RL enables proactively training RL policies so that a suitably trained agent can be deployed as soon as the intent formulation is modified. Post-deployment, a logger functionality logs the newly generated data samples for policy refinement, drift detection, and other services. This completes the HYDRA offline RL loop for continuous training.
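
Off-policy evaluation can be illustrated with ordinary importance sampling over logged trajectories. This is a textbook estimator shown as a generic sketch; it is not HYDRA’s actual evaluation code, and the policy interfaces are assumptions.

```python
# Generic off-policy evaluation sketch using ordinary importance sampling.
# `target_policy(a, s)` and `behavior_policy(a, s)` return action probabilities (assumed interfaces).
import numpy as np

def importance_sampling_value(trajectories, target_policy, behavior_policy, gamma=0.99):
    estimates = []
    for traj in trajectories:                   # traj: list of (state, action, reward) tuples
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= target_policy(a, s) / max(behavior_policy(a, s), 1e-8)
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))            # estimated value of the target policy
```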

Another key advantage of this approach is that the collected data can be used to generate synthetic offline datasets for new intent formulations. By training RL policies on these datasets, it is possible to proactively learn RL policies for future intent formulations, which greatly accelerates the availability of performant RL policies in the deployed system.
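
Because the reward is a function of the logged KPIs and the intent definition, a dataset for a prospective intent can be produced by relabeling already-collected transitions. The sketch below is hypothetical and assumes each logged record keeps the raw KPIs observed after the action was applied (for example, in a `next_kpis` field).

```python
# Hypothetical sketch: relabel logged transitions with rewards computed for a new intent.
# Assumes each record stores the raw KPIs observed after the action ("next_kpis").

def relabel_for_intent(records, new_intent, reward_fn):
    relabeled = []
    for record in records:
        new_record = dict(record)
        new_record["reward"] = reward_fn(record["next_kpis"], new_intent)
        relabeled.append(new_record)
    return relabeled
```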

An example of the performance differences that can be obtained by proactively training RL policies on different intent formulations can be seen in Figure 5. Using HYDRA, two RL agents, Agent 3 and Agent 4, were trained using a given offline dataset. RL Agent 3 was trained to optimize for “Intent 1”, the same intent that was applied when the offline dataset was collected, while RL Agent 4 was trained to optimize for a different intent, “Intent r5”.

(Two panels, “Intent 1” and “Intent r5”, each plotting probability density against the corresponding reward values for the two RL agents.)

Figure 5: Performance for two offline-trained RL agents when the reward is calculated using Intent 1 and Intent r5. Note that RL Agent 3 was trained to optimize Intent 1, while RL Agent 4 was trained to optimize Intent r5.

It is clear from Figure 5 that RL Agent 3 outperforms RL Agent 4 under Intent 1, while RL Agent 4 outperforms RL Agent 3 under Intent r5. This result was expected since each RL Agent was trained to optimize their respective intent. It is noteworthy, however, that despite never optimizing for Intent r5 during the data collection of the offline dataset, RL Agent 4 was still able to learn behaviors that allowed it to achieve high rewards under Intent r5. This result indicates that, at least under certain conditions, offline RL provides a mechanism for efficiently optimizing under multiple intents as the user’s desired behavior varies over time.

Automated Training Pipeline

HYDRA implements automated pipelines for data acquisition, data ingestion for model training, and hyperparameter tuning. In particular, the training process can be scaled to several compute instances that run in parallel with different hyperparameter tuning configurations. The trained models are evaluated to select the top-performing model, which is subsequently deployed for predicting control actions in the target environment. The training pipeline can be triggered by a range of time-based and data-based triggers. Further, HYDRA employs industry-standard tools for tracking the training runs as well as for its model registry functionality. The top model is also inspected with a technique known as SHapley Additive exPlanations (SHAP), which provides explainability analysis for the model behavior [6].
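
The selection-and-explanation step could look roughly like the sketch below. The candidate models and their validation scores are placeholders, and the SHAP call uses the library’s generic KernelExplainer, which may differ from the explainer actually used in HYDRA.

```python
# Hypothetical sketch: pick the top model from parallel tuning runs, then inspect it with SHAP.
# `candidates` is a list of (model, validation_score) pairs; models are placeholders with .predict().
import shap

def select_and_explain(candidates, background_states, eval_states):
    best_model, _ = max(candidates, key=lambda c: c[1])          # highest validation score
    explainer = shap.KernelExplainer(best_model.predict, background_states)
    shap_values = explainer.shap_values(eval_states)             # per-feature attributions
    return best_model, shap_values
```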

One key aspect of HYDRA training pipelines is that they are configuration-based: users define all the parameters for their training run in a YAML file, which is used to initiate the pipeline. Examples of parameters in this training configuration file are the model hyperparameters to explore, the total number of hyperparameter combinations to test, the different state space feature combinations to test, and the name of the BigQuery table to use for training (in offline RL mode). This configuration file can be version-controlled, so that teams can keep track of the different training pipelines they have initiated over time. Additionally, the configuration file can be used to automate the re-training and deployment of new RL policies over time. For instance, HYDRA users can use cloud scheduler jobs along with their pre-defined training pipeline configuration file to launch training pipelines at regular intervals (for example, hourly, daily, or weekly) to update their deployed RL policy.
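
The exact configuration schema used by HYDRA is not public, so the snippet below is only an illustrative guess at what such a YAML file might contain, loaded with PyYAML.

```python
# Illustrative (hypothetical) HYDRA-style training configuration; field names are invented.
import yaml  # PyYAML

config_text = """
training:
  bigquery_table: my-project.rl_data.offline_transitions   # offline RL data source (example name)
  state_feature_sets:
    - [cpu_load, served_users]
    - [cpu_load, served_users, queue_length]
  hyperparameters:
    learning_rate: [0.0001, 0.001]
    batch_size: [64, 256]
  max_trials: 8
schedule: daily        # e.g. used by a cloud scheduler job to trigger re-training
"""

config = yaml.safe_load(config_text)
print(config["training"]["max_trials"])
```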

An example of the performance improvement that can be obtained by using this iterative re-training approach is shown in Figure 6. An initial offline RL dataset was collected from IBA-RP using an agent that chooses actions at random. After allowing this random agent to control the system for several days, HYDRA was used to train and deploy an RL policy that was allowed to control the system for an additional day. Finally, using the combined data collected from both the random agent’s and the trained agent’s control of the system, a second RL policy was trained using HYDRA and deployed to control the system for an additional day.

(A single panel plotting probability density against the corresponding reward values for the three agents.)

Figure 6: Reward distributions obtained from the random agent (blue), first deployed RL agent (orange), and second deployed RL agent (green) during the iterative offline RL experiment using HYDRA with the IBA-RP system. RL Agent 1 was trained on the data collected from the random agent’s control of the system, and RL Agent 2 was trained on the data collected from both the random agent’s and RL Agent 1’s control of the system.

Additional HYDRA features

A crucial aspect of RL involves the careful monitoring of environmental variables and system rewards over time. This is because these elements can evolve, and their change may impact the efficacy of the RL models. We conclude with an outline of some additional HYDRA functionalities that support RL workflows. These include a feature store, a drift detection mechanism, and a large-scale flexible visualization dashboard.

Feature stores play a vital role in managing and monitoring data used in RL models. In HYDRA, two feature stores have been explored: the Vertex AI feature store provided by the Google Cloud Platform (GCP) [7] and the open-source Feast feature store [8]:

  • Vertex AI Feature Store: This feature store is equipped with built-in data drift detection capabilities. It utilizes various metrics to identify anomalies in data distributions. For categorical features, an L-infinity distance metric is used to calculate a distance score, while for numerical features, the Jensen-Shannon divergence is employed. When these scores exceed predefined thresholds, the system flags these as anomalies.
  • Feast Feature Store: Feast is explored as an alternative to GCP’s feature store, primarily for its cost-effectiveness in online serving. Feast monitors both offline and online data stores. For drift detection, the Kolmogorov-Smirnov (KS) test is employed with Feast. This test compares the distribution of historical data with that of newly ingested data, either in the offline store or directly in the online store. If the KS metric exceeds a certain threshold, it indicates data drift. A minimal sketch of this style of drift check is shown after this list.
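
A minimal drift check in the spirit of the metrics above can be written with SciPy: a two-sample KS test on the raw samples and a Jensen-Shannon distance on binned histograms, each compared against a threshold. This is a generic sketch, not the feature stores’ built-in implementation.

```python
# Generic drift-check sketch (not the feature stores' built-in implementation):
# a two-sample KS test plus Jensen-Shannon distance on binned numerical features.
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

def detect_drift(historical, recent, js_threshold=0.1, ks_alpha=0.05):
    ks_stat, p_value = ks_2samp(historical, recent)          # KS test, as used with Feast
    bins = np.histogram_bin_edges(np.concatenate([historical, recent]), bins=20)
    p, _ = np.histogram(historical, bins=bins)
    q, _ = np.histogram(recent, bins=bins)
    js = jensenshannon(p / p.sum(), q / q.sum())              # JS distance on normalized histograms
    return bool(p_value < ks_alpha or js > js_threshold)

rng = np.random.default_rng(0)
print(detect_drift(rng.normal(0, 1, 1000), rng.normal(0.5, 1, 1000)))  # True: mean shift detected
```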

To automate and streamline the drift detection process, we set up cloud functions and ingestion jobs on GCP designed to listen to the logs coming from the feature store. The system compares two data distributions — the historical data and the newly ingested data — to identify drifts. Upon detecting drift in data distributions, HYDRA triggers the retraining of RL agents. This ensures that the agents are better equipped to handle out-of-sample or out-of-distribution data.

Visualizations are essential to monitor and track machine learning workflows in general, and are especially important for RL, since in RL both the model and the application are expected to change over relatively short time intervals. These visualizations span not only the raw data, but also the state space coverage of RL models, their actions and reward distributions, and the relative metrics for the trained RL agents. In addition, the operator will benefit from easy-to-parse visualizations for various intent formulations. HYDRA employs Looker Studio, which is a powerful tool provided by GCP for data visualization at scale [9].

Looker Studio provides seamless integration with the database containing the experiment data for training and evaluation. This means that all data can be easily visualized in a standardized, comparable way. This allows the user to inspect the results on a deeper level and see issues that might not be visible when only considering the reward. We use Looker Studio to create several plots specific to RL applications, for example, those that describe the state and action space coverage by RL agents. Control elements added to these plots allow filtering by RL agent, time interval, and other configuration parameters. This helps detect anomalies or performance degradations so that these can be addressed proactively. Looker Studio also updates plots and figures as more data is streamed from the dataset to the dashboard.

  1. 5G wireless access: an overview, Ericsson Whitepaper, link
  2. Autonomous networks with multi-layer, intent-based operation, Ericsson Technology Review, link
  3. Shi, H., Sun, Y., Li, G., Wang, F., Wang, D. and Li, J., 2019. Hierarchical Intermittent Motor Control With Deterministic Policy Gradient. IEEE Access, doi: 10.1109/ACCESS.2019.2904910.
  4. Online and offline Reinforcement Learning: What are they and how do they compare?, Ericsson Blog, link
  5. Levine, S., Kumar, A., Tucker, G. and Fu, J., 2020. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv e-prints, link.
  6. Albini, E., Long, J., Dervovic, D. and Magazzeni, D., 2022, June. Counterfactual SHapley additive explanations. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (pp. 1054-1070).
  7. Introduction to feature management in Vertex AI, link.
  8. Feast: Feature Store for Machine Learning, link.
  9. Looker Studio Overview, link