Online and offline Reinforcement Learning: What are they and how do they compare?
In this multi-part series, we aim to make Reinforcement Learning (RL) terminology more accessible by describing the fundamental principles and high-level features of the key RL concepts. We will outline the typical learning behaviors associated with each feature while comparing their common advantages and disadvantages. The first part of the series, Online and Offline Reinforcement Learning, discusses online RL, where a learning agent interacts with the target environment, and juxtaposes it with offline RL, where the agent learns from a pre-recorded dataset. In the context of Radio Access Networks (RAN), offline and online RL feature contrasting elements that need to be considered carefully when designing an RL solution. We exemplify these elements with the help of common RL for RAN use cases.
Reinforcement Learning has seen an enormous surge in applications over the past decade. However, despite its vast reach, RL terminology is still often poorly understood, especially by beginners in the area. This complicates the process of adapting RL to new application domains. An intuitive appreciation of the various RL aspects can go a long way towards improving the quality and development life cycle of RL solutions.
For instance, at the experiment design stage, early insights into the complexity and learning behavior can help filter out poor RL design candidates. Subsequently, a shared, clear understanding of RL terminology helps with cross-functional collaboration as the solution progresses from design to the implementation and deployment stages. Finally, the operation and maintenance of RL deployments are aided by the knowledge of common problems encountered with RL applications, as well as their standard resolution techniques.
For RAN, RL is a key technology that will enable the next cycle of advances in operational efficiency and reliability [1]. Many problems within RAN are characterized by little-to-no optimality labels (that is, the optimal action is either unknown or not recorded). RL’s data-driven and goal-oriented approach makes it better suited to RAN than other “label-hungry” Machine Learning (ML) techniques, or optimization algorithms that require explicit knowledge of the real-world dynamics being modeled. To improve performance, RL has been proposed for RAN control loops with latencies ranging from a few milliseconds, as in Modulation and Coding Scheme (MCS) selection [2], to several minutes for network slicing, and even up to a few hours for antenna tilt steering [3]. While some RAN scenarios are amenable to exploration, for example, those serving best-effort traffic, others have strict real-time requirements that discourage unreliable actions. Therefore, choosing an appropriate RL setup is key to realizing real-world performance gains and deploying long-term RL solutions in RAN.
Online and offline RL: Commonalities and differences
Classical RL is an online (that is, real-time or on-the-fly) learning technique, where training an action policy (a strategy or a set of rules) is tightly coupled with iterative data collection. As the agent becomes more proficient, new experiences become available, potentially describing environment interactions that were previously unknown to and unachievable by the agent. Each training update might use a single environment experience (for example, in traditional State-Action-Reward-State-Action [SARSA] or Q-Learning) or many experiences sampled from a continually evolving “replay buffer” [4]. However, iterative data collection is often time-consuming and expensive, and incremental policy learning frequently produces undesirable control actions. As a result, offline RL, where policy training is decoupled from the data collection process, has recently been proposed as an efficient learning technique [5]. In this section, we describe online RL and offline RL and contrast them in terms of their general learning characteristics.
Online RL refers to the paradigm where the training process for an RL policy interacts with the environment to learn optimal actions. In this case, the RL policy predicts actions and collects the corresponding rewards as soon as they become available. In a strict online learning sense, the policy parameters are updated with all the data samples that have been collected so far, before generating the next action prediction. However, in most practical deployments of online RL, the requirement for immediate updates is relaxed, allowing each training iteration to span several data collection timesteps (a “batch”). The reason for this relaxation is twofold: First, training on batches of data is central to the learning efficiency and robustness of modern parameter update algorithms. Second, the training process benefits from the significant speed-up achieved by running parallel jobs for iterative data collection and model parameter updates [6].
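To make the batched online loop concrete, here is a minimal sketch in Python. It uses tabular Q-Learning, and everything in it is an assumption made for illustration: the toy `ChainEnv` environment, the function name `train_online`, and all hyperparameter values. It is not taken from any RAN system or RL library.

```python
import random


class ChainEnv:
    """Hypothetical toy environment: walk right along a chain to reach the goal."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, action 0 moves left (clipped at state 0).
        move = 1 if action == 1 else -1
        self.state = max(0, self.state + move)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def train_online(env, episodes=200, batch_size=8, alpha=0.5, gamma=0.9,
                 epsilon=0.2, max_steps=100, seed=0):
    """Tabular Q-Learning where data collection and training are interleaved."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(env.n_states)]
    batch = []
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            # Data collection: act in the environment with the current policy.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                best = max(q[state])
                action = rng.choice([a for a in (0, 1) if q[state][a] == best])
            next_state, reward, done = env.step(action)
            batch.append((state, action, reward, next_state, done))
            # Relaxed online learning: update only once a full batch is collected.
            if len(batch) == batch_size:
                for s, a, r, s2, d in batch:
                    target = r if d else r + gamma * max(q[s2])
                    q[s][a] += alpha * (target - q[s][a])
                batch.clear()
            state = next_state
            if done:
                break
    return q


q_table = train_online(ChainEnv())
greedy_policy = [max((0, 1), key=lambda a: q_table[s][a]) for s in range(4)]
print(greedy_policy)
```

Setting `batch_size=1` in this sketch recovers the strict online case, where the policy is updated after every environment step.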
Irrespective of the training batch size, the defining feature of online RL is near real-time access to the environment, enabling the RL policy to explore the environment with any valid action for a given observation and analyze its outcome. At the same time, the policy also strives to frequently exploit optimal actions that maximize returns for the observed environment state. This exploration-exploitation dilemma is a central characteristic of online RL. State-of-the-art online RL algorithms hence employ a variety of techniques that efficiently navigate this dilemma to learn optimal policies.
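One common and simple way to handle this dilemma is an epsilon-greedy rule with an exploration rate that is annealed over time. The sketch below is illustrative only: the function names, the linear schedule, and the default values are assumptions, and many other exploration strategies exist.

```python
import random


def epsilon_at(step, start=1.0, end=0.05, decay_steps=1000):
    """Linearly anneal the exploration rate from `start` to `end`."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def select_action(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties at random so that no action is favored by its index.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


rng = random.Random(0)
early_action = select_action([0.1, 0.9, 0.3], epsilon_at(0), rng)    # mostly exploring
late_action = select_action([0.1, 0.9, 0.3], epsilon_at(5000), rng)  # mostly exploiting
```

Early in training the agent acts almost entirely at random; once the schedule has decayed, it picks the highest-valued action most of the time while still exploring occasionally.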
Offline RL refers to the scenario where an RL algorithm learns a policy from a static, pre-collected dataset. This dataset might have been collected by an arbitrary, even unknown, policy, and it captures that policy's interaction with the environment over several timesteps. These interactions are commonly stored as sequences of (state, action, reward, next state) tuples, in certain cases together with the associated metadata. During training, an offline RL algorithm iterates over the dataset to estimate the value of actions represented in the dataset. In contrast to online RL, the offline scenario does not have access to the environment at training time.
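As a rough illustration of this setup, the sketch below repeatedly sweeps a tiny, fixed dataset with the tabular Q-Learning update and never calls an environment. The dataset contents, the extra `done` flag appended to each tuple, and the function name `train_offline` are assumptions made for the example.

```python
def train_offline(dataset, n_states, n_actions, sweeps=50, alpha=0.5, gamma=0.9):
    """Estimate action values by iterating over a static dataset of transitions."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for s, a, r, s2, done in dataset:
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
    return q


# Hypothetical pre-recorded transitions: (state, action, reward, next_state, done).
logged_data = [
    (0, 1, 0.0, 1, False),
    (1, 1, 0.0, 2, False),
    (2, 1, 1.0, 3, True),
    (1, 0, 0.0, 0, False),
]

q_table = train_offline(logged_data, n_states=4, n_actions=2)
print(q_table[0])  # the value of action 0 in state 0 stays at its initial value
```

Note that action 0 in state 0 never appears in the logged data, so the algorithm has no information about it. Its estimate simply remains at the initial value of zero.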
With offline RL, the accuracy of the value estimates depends on the richness of the dataset in terms of its state and action space coverage. In most scenarios of practical interest, only a subset of the full state-action space can be captured owing to data collection and memory constraints. Therefore, efficient offline RL algorithms must guard against overestimating the values of actions that are not sufficiently represented in the dataset.
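One simple way to picture such a safeguard is a count-based penalty that lowers the estimated value of rarely observed actions. The sketch below is a deliberately simplified heuristic written for this article, not a specific published offline RL algorithm. The function name, the `beta` parameter, and the choice to exclude unseen actions entirely are all assumptions.

```python
import math
from collections import Counter


def pessimistic_values(q, dataset, beta=1.0):
    """Subtract a count-based uncertainty penalty from each action value."""
    counts = Counter((s, a) for s, a, *_ in dataset)
    adjusted = []
    for s, row in enumerate(q):
        adjusted_row = []
        for a, value in enumerate(row):
            n = counts[(s, a)]
            if n == 0:
                # Never observed in the dataset: rule the action out entirely.
                adjusted_row.append(float("-inf"))
            else:
                # Fewer observations mean a larger penalty.
                adjusted_row.append(value - beta / math.sqrt(n))
        adjusted.append(adjusted_row)
    return adjusted


# Hypothetical case: action 1 looks better but was observed only once.
q_estimates = [[0.5, 0.9]]
logged_data = [(0, 0, 0.5, 0, False)] * 100 + [(0, 1, 0.9, 0, False)]
safe_q = pessimistic_values(q_estimates, logged_data)
best_action = max(range(2), key=lambda a: safe_q[0][a])
```

In this example the well-supported action 0 is preferred after the penalty is applied, even though the raw estimate for action 1 is higher.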
An important aspect of offline RL is policy evaluation prior to deployment. Since there is no environment access during training, offline RL must rely on a portion of the static dataset to evaluate trained policies in terms of their predicted performance in the actual environment. Referred to as off-policy evaluation (OPE), this typically involves techniques such as importance sampling to identify effective policies that have been trained offline [5]. Since it is typically quite challenging to obtain a robust dataset for training and evaluating offline RL policies, conventional OPE is often complemented by online validation methods such as A/B testing, canary testing, and graded deployments.
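The basic idea of importance sampling can be sketched in a few lines. The example below implements the ordinary, trajectory-wise importance sampling estimator and assumes that the action probabilities of the data-collecting (behavior) policy are known or were logged, which is not always the case in practice. The function and variable names and the toy data are hypothetical.

```python
def importance_sampling_estimate(trajectories, target_prob, behavior_prob, gamma=1.0):
    """Ordinary importance sampling estimate of the target policy's expected return."""
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0
        discounted_return = 0.0
        for t, (s, a, r) in enumerate(trajectory):
            # Re-weight by how much more (or less) likely the target policy
            # is to take the logged action than the behavior policy was.
            weight *= target_prob(s, a) / behavior_prob(s, a)
            discounted_return += (gamma ** t) * r
        total += weight * discounted_return
    return total / len(trajectories)


# Hypothetical single-step logs: (state, action, reward). Action 1 pays 1.0.
logs = [[(0, 0, 0.0)], [(0, 1, 1.0)], [(0, 1, 1.0)], [(0, 0, 0.0)]]


def behavior(s, a):
    return 0.5  # the logging policy chose both actions uniformly


def target(s, a):
    return 0.9 if a == 1 else 0.1  # the trained policy prefers action 1


estimate = importance_sampling_estimate(logs, target, behavior)
```

Here the logged data has an average reward of 0.5, while the estimate for the trained policy is approximately 0.9. Estimators of this kind can have high variance on long trajectories, which is one reason additional online validation is often used.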
Online and offline learning are two distinct approaches for RL policy training. Below, we contrast the features of each approach and highlight the relative merits that each can present under certain scenarios. It is important to note that online RL and offline RL are not mutually exclusive: Elements of both approaches can be combined advantageously in an end-to-end policy learning architecture.
- Resources: Online RL relies on environment interactions for data collection. In many real-world use cases, this requires provisioning additional compute and memory resources for policy learning close to the environment. On the other hand, offline RL relies on a pre-collected dataset that can be located and processed flexibly in terms of both physical resources and learning schedules.
- Environment Stationarity: Many RL algorithms are designed for stationary environments, where the state transition dynamics do not change over time. However, online RL algorithms have been proposed for scenarios where stationarity does not hold, for example through change detection and down-weighting of historical observations. Since offline RL algorithms only generate static policies learnt from pre-recorded datasets, they do not directly address environment non-stationarity.
- Environment Complexity: Online RL acquires data samples for training iteratively, while offline RL only has access to pre-recorded data. This helps online RL to scale to arbitrarily complex environments that feature large state and/or action spaces, whereas recent works indicate that offline RL might be at a fundamental disadvantage when addressing such use cases [9].
- Asymptotic Performance: With online RL, the rate of exploration typically declines over time as the RL policy converges. Under certain assumptions, the policy it converges to is guaranteed to be asymptotically optimal [8]. On the other hand, the achievable performance of offline RL policies is constrained by the state and action space coverage of the dataset, which is significantly harder to guarantee.
- Safety: A key aspect of online RL is action space exploration, which requires uncertain actions to be executed in the environment. This can lead to suboptimal outcomes, especially during the initial learning phase. Safe online RL is an important area of research that aims to mitigate these issues [11]. In contrast, offline RL iterates over the dataset and converges to an action policy before any environment interaction. As a result, offline RL can be helpful in avoiding detrimental actions, especially early into deployment.
- Architecture: Online RL typically implements two loops that run iteratively: A data collection loop and a policy training loop. Ensuring good performance requires careful design of these loops: For example, parallelization to maximize data throughput and avoid training bottlenecks, maintaining data provenance, and handling unexpected online events. Offline RL decouples data collection from policy training, which leads to implementations that are simpler to design and maintain over time.
- Data Sources: The data for online RL is acquired through a single mode: Policy interactions with one or more environments. Even though data collection can be scaled by running multiple instances of the environment and the policy, it is challenging and often expensive to design effective workflows that combine multiple data collection and policy training instances. In contrast, offline RL can employ multimodal data collection, for example, logged data from baseline policies, prior policy deployments, and even human-in-the-loop interactions. Assimilating diverse data sources in a meaningful and efficient way is the subject of active research [10].
While the above features are expected to hold in general, the learning behavior often varies greatly depending on the use case, the choice of algorithm, hyperparameters, and other related factors. Some problems tend to be better suited for online RL, among them arcade games and simulation-based environments. On the other hand, other domains are better served by offline RL, an example being learning robotic manipulation tasks from human demonstrations. Further, several RL applications can benefit from a hybrid approach that bootstraps an RL policy from offline data and then fine-tunes it through online environment interaction [7].
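The hybrid approach can be sketched by chaining the two phases: first sweep over logged data, then continue with the same value table while interacting with an environment. The code below is a toy illustration with tabular Q-Learning. The `chain_step` dynamics, the logged transitions, and all function names and parameters are assumptions, and this is not the method of the cited work.

```python
import random


def q_update(q, transition, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-Learning update for a single transition."""
    s, a, r, s2, done = transition
    target = r if done else r + gamma * max(q[s2])
    q[s][a] += alpha * (target - q[s][a])


def pretrain_offline(q, dataset, sweeps=20):
    """Phase 1: bootstrap the value table from pre-recorded data only."""
    for _ in range(sweeps):
        for transition in dataset:
            q_update(q, transition)


def finetune_online(q, step_fn, start_state, episodes=50, epsilon=0.1,
                    max_steps=50, seed=0):
    """Phase 2: refine the same table through environment interaction."""
    rng = random.Random(seed)
    n_actions = len(q[0])
    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: q[s][x])
            s2, r, done = step_fn(s, a)
            q_update(q, (s, a, r, s2, done))
            s = s2
            if done:
                break


def chain_step(state, action, n_states=4):
    """Hypothetical toy dynamics: action 1 moves right, action 0 moves left."""
    next_state = max(0, state + (1 if action == 1 else -1))
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done


# Hypothetical logs from a baseline policy that always moves right.
logged_data = [(0, 1, 0.0, 1, False), (1, 1, 0.0, 2, False), (2, 1, 1.0, 3, True)]

q_table = [[0.0, 0.0] for _ in range(4)]
pretrain_offline(q_table, logged_data)
finetune_online(q_table, chain_step, start_state=0)
```

Because the table is already informed by the logged data, the online phase starts from a sensible policy and needs only a small exploration rate to refine values the dataset did not cover.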
Online and offline RL for RAN use cases
For RAN use cases that support simulator-based RL model training, online RL can be a viable paradigm. The advantage of simulator-based RL training is that it can be scaled massively, and arbitrary control actions can be explored freely without the risks associated with degrading a live environment. However, despite their relative fidelity, RAN simulators rarely capture the complexity of real-world scenarios, leading to the development of data-driven digital twins that can supplant existing RAN simulators.
Online exploration in live RANs is typically deemed too risky to be of practical value. Therefore, offline RL is emerging as a useful alternative for cases where accurate RAN models are unavailable or expensive. Offline RL can train RL models based on data collected from live deployments, thus ensuring that the training data accurately represents the target environment. These models can be validated against collected data using OPE, or validated incrementally in the target RAN deployments.
A hybrid approach can also be considered to extract the complementary benefits of these two paradigms: A base RL model can be trained online against a RAN simulator or a digital twin. Subsequent retraining cycles can fine-tune the model through offline RL based on data collected from the live network prior to deployment. Post-deployment model updates can also be either online or offline, depending on the specific requirements and constraints of the use case.
Acknowledgment: We thank Markus Svensén, András Méhes, and Ola Dahl for their valuable inputs.
1. Berggren, V., Dey, K., Jeong, J. and Guldogan, B., 2022. Bringing reinforcement learning solutions to action in telecom networks. Ericsson blog.
2. Soldati, P., Ghadimi, E., Demirel, B., Wang, Y., Sintorn, M. and Gaigalas, R., 2023. Approaching AI-native RANs through generalization and scalability of learning. Ericsson Technology Review.
3. AI: enhancing customer experience in a complex 5G world. Ericsson blog.
4. Sutton, R.S. and Barto, A.G., 2018. Reinforcement learning: An introduction. MIT Press.
5. Levine, S., Kumar, A., Tucker, G. and Fu, J., 2020. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
6. McCandlish, S., Kaplan, J., Amodei, D. and OpenAI Dota Team, 2018. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162.
7. Nair, A., Gupta, A., Dalal, M. and Levine, S., 2020. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359.
8. Dann, C., Mansour, Y., Mohri, M., Sekhari, A. and Sridharan, K., 2022. Guarantees for epsilon-greedy reinforcement learning with function approximation. In International Conference on Machine Learning (pp. 4666-4689). PMLR.
9. Foster, D.J., Krishnamurthy, A., Simchi-Levi, D. and Xu, Y., 2021. Offline reinforcement learning: Fundamental barriers for value function approximation. arXiv preprint arXiv:2111.10919.
10. Zhou, G., Ke, L., Srinivasa, S., Gupta, A., Rajeswaran, A. and Kumar, V., 2022. Real World Offline Reinforcement Learning with Realistic Data Source. In Deep Reinforcement Learning Workshop, NeurIPS 2022.
11. Vannella, F., Jeong, J. and Proutiere, A., 2020. Off-policy Learning for Remote Electrical Tilt Optimization. In IEEE VTC Fall 2020.