On-policy and off-policy Reinforcement Learning: Key features and differences
Reinforcement learning (RL) is shaping up to be a key enabler for artificially intelligent systems and processes. However, real-world applications of RL are hindered by the general challenges of deploying and maintaining machine learning systems. This problem is compounded by the breadth of RL-specific terminology, which can be confusing for new practitioners and makes it difficult to collaborate efficiently on adopting RL techniques for new use cases.
In this multi-part series, we discuss key RL terms that are commonly misunderstood. We also outline the gains and challenges of competing RL techniques and the impact they can have on RL applications. The first part of this series discussed online and offline RL, which concerns whether the training data is acquired interactively (online) or from a pre-recorded dataset (offline) [1]. This follow-up addresses on-policy and off-policy RL, which concerns whether the training process refines the same policy that is used to acquire data (on-policy) or a separate, typically optimal, policy (off-policy). We describe on-policy and off-policy RL using canonical algorithms for each and discuss the contrasts in their learning behavior. Further, we discuss each approach in the context of optimizing radio access networks (RAN) with the help of some example use cases.
RL is central to the ongoing efforts in RAN automation [2]. Since the RAN serves a multitude of service types with diverse Quality of Service (QoS) requirements, the appropriate RL techniques are also likely to vary across use cases and deployment scenarios [3]. Some service types, for example industrial control, require stringent performance in terms of packet error rates and the latency of data acquisition. Other services, most prominently mobile broadband, prioritize throughput over most other performance metrics. Yet others, such as real-time streaming applications in augmented and virtual reality, require both high throughput and low latency for seamless performance. Selecting a suitable RL technique for RAN control therefore requires careful analysis of the available choices and an understanding of their inherent trade-offs.
On-policy and off-policy RL: A comparison
RL policy training uses one of two common approaches. On-policy methods iteratively refine a single policy that also generates control actions within the environment (known as the behavior policy). Off-policy methods, in contrast, use data collected by the behavior policy to train a separate target policy toward an optimization objective. This difference in which policy the training data updates has a profound impact on the learning behavior of the various RL algorithms, as we discuss in the rest of this section. It is important to note that the distinction between on-policy and off-policy methods is generally meaningful only in the context of online RL. With offline RL, the training dataset is used to train an optimal policy irrespective of the policy that generated the data; hence, offline RL almost always employs an off-policy learning scheme.
With on-policy RL, actions are generated in response to observed environment states using a certain RL policy. The outcome of these actions is collected and used to iteratively refine the parameters of the same policy. On-policy RL, therefore, uses a common behavior and target policy, which is responsible for (1) exploring the state and action spaces, and (2) optimizing the learning objective based on the data it has collected so far. Most on-policy algorithms incorporate some form of action randomness to balance between these two goals. Therefore, the RL agent may sometimes explore by selecting an action that has an uncertain outcome, while at other times it may exploit the latest policy by selecting the action that has the highest expected return for the current state. The outcome of the selected action is used to iteratively update the policy parameters, which influences subsequent behavior.
The SARSA algorithm is a canonical example of on-policy learning. In each step, SARSA selects either the current best action or, with some (typically small) probability, an exploratory action. The outcome of this action is used to update the value function for the current policy, and the process repeats. With a careful design of the learning schedule, and under certain assumptions, SARSA can converge to the optimal policy [4].
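To make the on-policy update concrete, below is a minimal, illustrative sketch of tabular SARSA with an epsilon-greedy policy. The ToyChainEnv class, its reset/step interface, and all hyperparameter values are assumptions made only for this example; they are not part of any particular RL framework or RAN system.

```python
import numpy as np

class ToyChainEnv:
    """Hypothetical 5-state chain: move left/right, reward 1 at the right end."""
    def __init__(self, n_states=5):
        self.n_states = n_states
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):  # action 0 = left, 1 = right
        self.state = max(0, self.state - 1) if action == 0 else self.state + 1
        done = self.state == self.n_states - 1
        return self.state, float(done), done  # reward 1 only at the goal state

def epsilon_greedy(Q, state, epsilon, rng):
    """Behavior policy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # exploratory action
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(rng.choice(best))              # greedy action, ties broken randomly

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            # On-policy update: the target uses the action the same
            # epsilon-greedy policy will actually take next (S, A, R, S', A').
            target = reward + gamma * Q[next_state, next_action] * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q

# Example usage: Q = sarsa(ToyChainEnv(), n_states=5, n_actions=2)
```

The detail that makes this on-policy is the update target: it uses Q[next_state, next_action], the value of the action the exploratory behavior policy will actually take, so the same policy both collects the data and is being refined.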
Off-policy RL maintains two distinct policies: A behavior policy and a target policy. While the behavior policy generates control actions for observed environment states, the target policy is trained iteratively using the subsequent outcome of the action. Off-policy RL, therefore, decouples the data collection process from policy training. A key advantage of off-policy methods is that they can learn an optimal target policy, such as a greedy reward-maximizing policy, regardless of whether the behavior policy is exploratory. A common practice with off-policy learning is to periodically update the behavior policy with the latest target policy to maximize learning gains.
Q-learning is a common example of off-policy RL. As with SARSA, the behavior policy generates random exploratory actions with a small probability. Unlike SARSA, however, Q-learning uses the outcome of this action to separately update the value function for a greedy (target) policy. In other words, the data collected by the randomized behavior policy is used to learn an optimal target policy. Q-learning has also been proven to converge to the optimal policy under certain simplifying assumptions [5].
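For comparison, here is a similarly hedged sketch of tabular Q-learning, reusing the epsilon_greedy helper and the illustrative ToyChainEnv from the SARSA sketch above; the environment and hyperparameters remain assumptions made only for this example.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy exploration, exactly as in SARSA.
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = env.step(action)
            # Off-policy update: the target evaluates the greedy (target)
            # policy via a max over actions, independent of the action the
            # behavior policy will actually take in the next step.
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q

# Example usage: Q = q_learning(ToyChainEnv(), n_states=5, n_actions=2)
```

The only substantive difference from the SARSA sketch is the update target: the max over Q[next_state] evaluates the greedy target policy rather than the exploratory behavior policy, which is precisely what makes the method off-policy.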
Learning characteristics
Strictly speaking, on-policy learning can be seen as a special case of off-policy, where the target policy is kept identical to the behavior policy. However, the key benefits of off-policy stem from its ability to explicitly learn an optimal policy irrespective of the behavior policy employed. This distinction has a large impact on the learning characteristics of the two competing techniques, which are outlined below.
- Optimality: One drawback of on-policy learning is that it iteratively refines a single policy, which risks reinforcing locally optimal actions and thereby failing to reach the global optimum. Off-policy methods train an optimal policy directly and are therefore generally better at navigating local peaks or valleys of the solution space to arrive at a globally optimal solution.
- Online performance: The online, that is, real-time, performance of on-policy methods can be better in some cases, since the latest version of the learned policy is used to select actions at every step [7]. This can help avoid the poor performance associated with a suboptimal behavior policy in off-policy learning, especially during the initial learning phase.
- Sample efficiency: On-policy methods are generally less sample-efficient because they rely on a single policy for both exploration and exploitation. With off-policy learning, the behavior policy can be designed such that training of the target policy is sample-efficient [6].
- Flexibility: Compared to on-policy methods, off-policy is more flexible in terms of learning from diverse data sources. For example, an off-policy algorithm can be applied to learn from data generated by a conventional non-learning controller or even human interactions. While on-policy is a strictly online learning method, off-policy can be used with online, batch, or offline learning schemes.
- Convergence: Many on-policy algorithms, such as SARSA, are analytically tractable and provide convergence guarantees under mild assumptions. In the special case where current actions do not influence future environment states, known as a bandit setting, rigorous performance bounds are also available [8]. In contrast, analyzing off-policy methods is more challenging and often requires stronger assumptions to obtain similar convergence results [10].
- Complexity: On-policy algorithms tend to have lower computational complexity, since only a single policy needs to be stored and updated at every step. For the same reason, on-policy algorithms are also generally easier to tune [11].
Between on-policy and off-policy learning, the latter is the more flexible and powerful approach. Some researchers and practitioners even treat on-policy learning as a special case of off-policy learning in which the behavior and target policies coincide. Even so, on-policy learning brings attractive features in terms of simplicity, convergence, and efficiency. Ultimately, the performance of the learning method depends greatly on the type of algorithm used, the exploration parameters (including annealing), and other hyperparameters. Another aspect that influences the choice of learning method is whether the training is online or offline, which was discussed in detail in the previous blog post [1]. With online learning, an RL algorithm selects actions in real time based on interactions with the environment and is free to train the policy using either on-policy or off-policy updates. In contrast, offline RL employs a pre-collected dataset without real-time interactions. Offline RL is, therefore, closely tied to off-policy methods, with many classical off-policy algorithms having offline counterparts.
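As a rough illustration of how an off-policy update carries over to a batch or offline setting, the sketch below fits a Q-table purely from a pre-recorded dataset of transitions, here logged by a random behavior policy on the toy environment from the earlier sketches. This is a simplified illustration under those same assumptions; practical offline RL algorithms add further corrections, for example to handle the mismatch between the behavior and target action distributions, which are beyond the scope of this sketch.

```python
import numpy as np

def batch_q_learning(dataset, n_states, n_actions, sweeps=50, alpha=0.1, gamma=0.99):
    """Train a greedy target policy from logged (s, a, r, s', done) tuples only."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for state, action, reward, next_state, done in dataset:
            # Same greedy target as online Q-learning, but no environment
            # interaction is needed during training.
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
    return Q

# Log transitions with an arbitrary behavior policy (here purely random),
# then train entirely from the logged data.
env, rng, dataset = ToyChainEnv(), np.random.default_rng(0), []
for _ in range(200):
    state, done = env.reset(), False
    while not done:
        action = int(rng.integers(2))
        next_state, reward, done = env.step(action)
        dataset.append((state, action, reward, next_state, done))
        state = next_state
Q = batch_q_learning(dataset, n_states=5, n_actions=2)
```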
On-policy and off-policy RL for RAN optimization
The choice between on-policy and off-policy algorithms for RAN optimization naturally depends on the use case. Some RAN applications, such as Modulation and Coding Scheme selection, have to operate under severe latency and computational limitations due to their urgency and complexity [9]. These applications involve selecting the best possible modulation and coding rate for data transmission based on the current wireless channel conditions, and on-policy methods might find better use here. In contrast, antenna tilt steering and similar applications have a higher tolerance for training and inference latency. As such, they can benefit from advanced off-policy learning schemes, which typically have larger memory and computational footprints [2, 3]. Another problem dimension relates to error tolerance. Some wireless services, for example those used for industrial control, have a low tolerance for intermittent drops in performance. On the other hand, best-effort services, for example file download, are more amenable to the uncertain short-term performance that can stem from the aggressive exploration common with off-policy methods.
Theoretical guarantees and convergence properties can make on-policy learning a better choice for the more exacting RAN use cases. Similarly, in environments where the reward structure is either known or can be estimated reliably, off-policy learning might not provide any significant advantage over the simpler on-policy methods. On the other hand, off-policy algorithms are likely to prove flexible and general-purpose for the many RAN applications where RL policies can be trained within simulation environments, since computational complexity constraints and error rates matter much less there; simulation-based training might therefore favor off-policy learning. Another reason to favor off-policy learning is its better chance of converging to the globally optimal solution, which can be the goal for data-intensive services such as mobile broadband and streaming.
Acknowledgement: We thank Markus Svensén, András Méhes, and Ola Dahl for their valuable inputs.
References
[1] Saxena, V., Guldogan, B. and Nimara, D.D., 2023. Online and offline Reinforcement Learning: What are they and how do they compare? Ericsson Blog, link.
[2] Berggren, V., Dey, K., Jeong, J. and Guldogan, B., 2022. Bringing reinforcement learning solutions to action in telecom networks. Ericsson Blog, link.
[3] Soldati, P., Ghadimi, E., Demirel, B., Wang, Y., Sintorn, M. and Gaigalas, R., 2023. Approaching AI-native RANs through generalization and scalability of learning. Ericsson Technology Review.
[4] Singh, S., Jaakkola, T., Littman, M.L. and Szepesvári, C., 2000. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38, pp. 287-308.
[5] Melo, F.S., 2001. Convergence of Q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep., pp. 1-4.
[6] Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R.E. and Levine, S., 2016. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247.
[7] Perkins, T. and Precup, D., 2002. A convergent form of approximate policy iteration. Advances in Neural Information Processing Systems, 15.
[8] Slivkins, A., 2019. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2), pp. 1-286.
[9] Saxena, V., 2021. Machine Learning for Wireless Link Adaptation: Supervised and Reinforcement Learning Theory and Algorithms. Doctoral dissertation, KTH Royal Institute of Technology.
[10] Melo, F.S., Meyn, S.P. and Ribeiro, M.I., 2008. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning, pp. 664-671.
[11] Fakoor, R., Chaudhari, P. and Smola, A.J., 2020. P3O: Policy-on policy-off policy optimization. In Uncertainty in Artificial Intelligence, pp. 1017-1027. PMLR.