# Comparing model-based and model-free reinforcement learning: Characteristics and Applicability

Reinforcement learning (RL) will drive the next phase of efficiency and autonomy in the radio access network (RAN). However, the pace of progress is hindered by the complexity of designing and operating real-world RL solutions. This challenge is compounded by the broad scope of RL variants and their terminology, which makes it difficult to collaborate efficiently on a number of RAN use cases.

In this multi-part series, we define key RL terms that are commonly misunderstood. We also outline the benefits and challenges of competing RL techniques and the impact they can have on RL applications. The first part of this series discussed online and offline RL, which deals with whether the training data is acquired interactively (online) or from a pre-recorded dataset (offline). The second part addressed on-policy and off-policy RL, which concerns whether the training process refines the same policy that is used to acquire data (on-policy) or a separate, typically optimal, RL policy (off-policy). In this third and final part of the series, we address model-free RL (MFRL) and model-based RL (MBRL), a distinction that concerns whether the data is also used to learn the environment dynamics, which are then leveraged to train RL policies. By learning action policies directly on the data collected from environment interactions, an MFRL agent treats the environment as a black box. In contrast, an MBRL agent learns a model of the environment and subsequently leverages it to learn optimal action policies.

As we discussed in parts 1 and 2 of this blog series, RL-based techniques are going to play a key role in radio access network (RAN) design and automation. The adoption of AI capabilities within RAN is closely associated with RL, allowing agents to learn by interacting with complex RAN systems rather than being provided ground-truth labels^{2}. In such a complex ecosystem, identifying the most suitable RL technique for the required task is paramount^{3}. At the same time, selecting the appropriate RL technique depends on the specific RAN use case, including factors such as the complexity and stationarity of the environment. A key feature of RAN systems is the availability of high-fidelity simulators that capture various aspects of the RAN functionality. These simulators can serve as proxies for learning and testing RL policies before deploying them in a live system. As the requirements and complexity of RAN increase, the underlying simulator used for learning optimal policies might require more sophisticated agents, which are able to capture its dynamics more accurately and facilitate more efficient training. As such, the question of choosing between MFRL and MBRL in RAN is becoming increasingly relevant.

### Model-based and model-free reinforcement learning: A comparison

Any RL problem can be viewed as a Markov Decision Process (MDP), described by the tuple

(S,A,R,T,γ)

where S is the state space, A is the action space, R is the reward function, T(s' | s, a) describes the probability that taking action a in state s leads to the next state s', and γ is the discount factor that quantifies the tradeoff between immediate and future returns.
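To illustrate how γ trades off immediate against future rewards, here is a minimal sketch that computes the discounted return of a finite reward sequence. The function name `discounted_return` and the reward values are illustrative assumptions, not part of the original formulation.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite reward sequence."""
    total = 0.0
    # Walk backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards for three consecutive steps
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward: 1.0
```

A γ close to 0 makes the agent short-sighted, while a γ close to 1 weights distant rewards almost as much as immediate ones.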

RL agents learn by interacting with the environment. They apply actions that follow a policy π(s), which maps observed states to actions. They then receive the next state from the underlying dynamics T and observe the reward given by R^{8}. If T and R are known, the problem can be solved with dynamic programming algorithms, which decompose the optimization problem into a series of sub-problems that have analytical solutions. However, T and R are rarely known in real-world scenarios. In these cases, we can follow one of two paths. If T and R are unknown and we treat the environment as a black box, we apply model-free RL techniques. If T and R are unknown and we choose to learn them alongside our policy π, we apply model-based RL techniques. Another factor that can inform the choice between MFRL and MBRL is sample complexity, that is, the number of environment interactions required to achieve the desired level of performance. MBRL methods typically exhibit lower sample complexity than competing MFRL methods.
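For the case where the dynamics and reward are fully known, the dynamic programming idea can be sketched with value iteration on a tiny tabular MDP. The two-state MDP below (states `low` and `high`, actions `stay` and `move`) is a made-up example, and the dictionary layout for the dynamics and rewards is an assumption chosen for readability.

```python
def q_value(s, a, V, T, R, gamma):
    """One-step lookahead: expected reward plus discounted value of the next state."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())


def value_iteration(states, actions, T, R, gamma, tol=1e-8):
    """T[s][a] maps next states to probabilities; R[s][a] is the expected reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a, V, T, R, gamma) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # Bellman optimality backup
        if delta < tol:
            break
    policy = {s: max(actions, key=lambda a: q_value(s, a, V, T, R, gamma)) for s in states}
    return V, policy


# Hypothetical MDP: staying in "high" pays 1, everything else pays 0.
states, actions = ["low", "high"], ["stay", "move"]
T = {"low": {"stay": {"low": 1.0}, "move": {"high": 1.0}},
     "high": {"stay": {"high": 1.0}, "move": {"low": 1.0}}}
R = {"low": {"stay": 0.0, "move": 0.0}, "high": {"stay": 1.0, "move": 0.0}}

V, policy = value_iteration(states, actions, T, R, gamma=0.9)
print(policy)  # {'low': 'move', 'high': 'stay'}
```

With γ = 0.9, the values converge to roughly 10 for `high` (1 / (1 - 0.9)) and 9 for `low`. No environment interaction is needed, which is exactly what becomes impossible once the dynamics and reward are unknown.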

**Model-free RL** treats the environment as a black box and directly aims to learn an action policy π. There are different ways in which model-free RL can be applied^{8}:

- Value-based methods derive the policy π(s) from their estimates of the state-value function V(s) and/or the action-value function Q(s,a). These two functions quantify the expected accumulated reward that is received when following the current policy π from the current state onwards. Typical examples of this family of methods include Q-learning and SARSA.
- Policy-based methods bypass evaluating V and Q and directly learn a parametrized policy π_{θ}. The REINFORCE algorithm is a typical example of this framework.
- Actor-Critic methods combine both approaches, leveraging estimates of V and Q for more stable updates of π_{θ}. Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) are typical examples.

A common characteristic of all these methods is that they do not try to learn a parametrized model of the reward function or the environment dynamics. As such, an agent can only learn by leveraging experiences originating from the real environment.
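As a concrete instance of a value-based, model-free method, the following is a minimal sketch of tabular Q-learning. The five-state chain environment (`step`), the hyperparameters, and all names are assumptions made for illustration; the agent only ever sees sampled transitions and never builds a model of the dynamics or reward.

```python
import random

N_STATES = 5      # states 0..4; state 4 is terminal
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action):
    """Hypothetical chain environment: reward 1 only when the right end is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy_action(q_row, rng):
    """Pick an action with maximal Q-value, breaking ties at random."""
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = greedy_action(Q[state], rng)  # exploit
            next_state, reward, done = step(state, action)
            # The target is built from the sampled transition only.
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q


Q = q_learning()
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)  # greedy action per non-terminal state
```

After training, the greedy policy moves right in every non-terminal state. Every Q-update here consumes one real environment interaction, which is the source of the higher sample complexity discussed above.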

**Model-based RL** uses machine learning models to, among other things, improve training efficiency and policy performance. For instance, ensembles of feedforward neural networks^{4} or variational autoencoders (VAEs)^{5} might be employed to model the environment.

In the context of RL, environment models can be used for:

- **Simulating the environment**: Leverage the world model as a data-augmentation technique to generate artificial trajectories with s_{t+1} ∼ T_{φ}(s' | s_{t}, a_{t}), r_{t} = r_{φ}(s_{t}, a_{t}), a_{t} ∼ π(s_{t}), where T_{φ} and r_{φ} denote the learned dynamics and reward models. DYNA and MBPO are typical examples.
- **Assisting learning algorithms**: Leverage smooth (differentiable) functions for T_{φ} and r_{φ} to perform gradient-based optimization on entire trajectories. Stochastic Value Gradients (SVG) and Dreamer are typical examples.
- **Strengthening the policy**: Leverage internal planners that simulate the environment before picking the best action. Monte Carlo Tree Search (MCTS) is a typical example.

An MBRL algorithm can combine more than one of the above. Intuitively, we expect that a representation of the environment dynamics, also called a world model, can be beneficial. For instance, we might choose to leverage such a world model to generate new artificial samples, without requiring further interactions with the real environment. Such use of a world model would, therefore, increase training efficiency.
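The idea of using a world model to generate artificial samples can be sketched with a small Dyna-style variant of tabular Q-learning. This is an illustrative sketch only: the chain environment (`real_step`), the hyperparameters, and the table-based model are assumptions, and storing the last observed transition as the model is only valid because this toy environment is deterministic.

```python
import random

N_STATES = 5      # states 0..4; state 4 is terminal
ACTIONS = (0, 1)  # 0 = left, 1 = right


def real_step(state, action):
    """Hypothetical real environment; each call counts as one real interaction."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def dyna_q(episodes=50, planning_steps=20, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    model = {}  # learned tabular world model: (s, a) -> (s', r, done)

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])

    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            best = max(Q[state])
            greedy = rng.choice([a for a in ACTIONS if Q[state][a] == best])
            action = rng.choice(ACTIONS) if rng.random() < epsilon else greedy
            next_state, reward, done = real_step(state, action)
            update(state, action, reward, next_state, done)      # learn from real data
            model[(state, action)] = (next_state, reward, done)  # fit the world model
            for _ in range(planning_steps):                      # learn from simulated data
                s, a = rng.choice(list(model))
                s2, r, d = model[(s, a)]
                update(s, a, r, s2, d)
            state = next_state
            if done:
                break
    return Q, model


Q, model = dyna_q()
print([max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)])
```

Each real interaction is followed by `planning_steps` extra updates drawn from the model, so value information typically spreads through the table with far fewer real interactions than in the purely model-free case.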

Broadly speaking, MBRL can fall into two categories:

- **Background planning**: Here, we learn an optimal policy for every state beforehand. The learning objective mimics the one from MFRL, with the addition of the world model and reward model. Background planning algorithms demonstrate fast inference time and more coherence (meaning similar states lead to similar actions). However, they generally do not perform optimally when venturing into unfamiliar situations.
- **Decision-time planning**: Here, we do not solve an optimization problem for every state, but instead optimize the upcoming sequence of actions from the current state. World models are vital for this approach, as we must be able to “plan ahead” internally without interacting with the real environment. The artificial trajectories that are used for planning are called model rollouts. Decision-time planning algorithms are characterized by their competency in unfamiliar situations, at the expense of slower inference time and less coherence.
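Decision-time planning can be illustrated with a deliberately simple planner that enumerates all short action sequences, rolls each one out in the model, and executes only the first action of the best sequence before planning again. Brute-force enumeration is used here as a stand-in for more capable planners such as MCTS; the one-dimensional `model`, the `TARGET` value, and the horizon are all hypothetical.

```python
from itertools import product

ACTIONS = (-1, 0, 1)
TARGET = 3  # hypothetical goal position


def model(state, action):
    """Stand-in for a learned world model: predicts the next state and the reward."""
    next_state = state + action
    return next_state, -abs(next_state - TARGET)


def plan(state, horizon=3):
    """Return the first action of the best action sequence under the model."""
    best_return, best_first_action = float("-inf"), None
    for sequence in product(ACTIONS, repeat=horizon):
        s, total = state, 0
        for a in sequence:  # model rollout, no real interaction
            s, r = model(s, a)
            total += r
        if total > best_return:
            best_return, best_first_action = total, sequence[0]
    return best_first_action


state = 0
for _ in range(4):
    action = plan(state)  # re-plan from the current state at every step
    state += action       # in this toy setup the real dynamics match the model
print(state)  # 3
```

The cost of this approach is visible in the loop: every decision requires 3^horizon rollouts, which is why decision-time planning trades inference speed for competency in states the agent has not optimized for in advance.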

The following figure gives you a snapshot of algorithms that fall within these categories.

When choosing between MFRL and MBRL, it is crucial to understand that both can be online or offline, on-policy or off-policy. If online and offline answer the question “How is the data acquired?”, and on-policy and off-policy answer “Am I generating data with the same policy I am optimizing?”, then MFRL and MBRL answer “Do I, alongside training my usual RL agent, try learning a representation of the environment?”

### Learning characteristics

Both model-free and model-based RL methods have been extensively studied in the literature, and several algorithms have been developed that employ these methods. Below, we outline some key features and general differences in the learning characteristics of the two learning approaches.

- **Optimality (asymptotic performance)**: Asymptotic performance measures the eventual performance (returns) that a policy would achieve if it were trained indefinitely. **Given sufficient (online or offline) environment interactions**, MFRL tends to exhibit higher asymptotic performance than MBRL. Intuitively, this occurs because MBRL enforces certain assumptions about the environment. For instance, an MBRL agent might assume that the dynamics can be perfectly represented by a differentiable neural network or a quadratic polynomial. When these assumptions do not hold, the resulting model error reduces eventual performance. In contrast, MFRL makes no assumptions about the environment, as it treats it as a black box.
- **Sample complexity (sample efficiency)**: Since MFRL does not make any assumptions about the environment, its sample complexity might be high. In contrast, MBRL is typically more sample-efficient. Some of the reasons for MBRL’s lower sample complexity include: (i) generating artificial data (data augmentation) by leveraging the learned environment models and (ii) training more effectively on real data, by propagating the gradients through real trajectories.
- **Adaptability**: MBRL is typically more adaptable to changing rewards and dynamics. Since the learned reward and dynamics models in MBRL are continuous functions, they can be used to estimate the environment's behavior under new and unseen conditions.
- **Convergence**: There are numerous MBRL and MFRL algorithms that have theoretical convergence guarantees. For MBRL, however, a necessary assumption is that the models employed are sufficiently expressive to describe the underlying dynamics.
- **Complexity**: MBRL is typically more complex. Alongside the usual policies, MBRL requires the training of environment models. Training requires more resources, as more models are being fit to the data. Furthermore, these environment models bring along extra hyperparameters that need to be tuned. MBRL also needs more computational resources at inference time, particularly when employing decision-time planning algorithms.
- **Exploration**: MBRL can explore the environment more efficiently by using the learned reward and dynamics models. For example, alongside policy-related exploration, we can enforce model-induced exploration. Model-induced exploration might lead us to be more explorative in underexplored states (to improve our world model) or where the world model exhibits high uncertainty.
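One common way to realize model-induced exploration is to treat the disagreement between members of a model ensemble as a proxy for world-model uncertainty and to add it to the reward as a bonus. The sketch below assumes a hand-written ensemble of three one-dimensional dynamics models that agree near state 0 and diverge further away; the models, the `beta` weight, and the function names are illustrative assumptions rather than a specific published method.

```python
from statistics import pvariance

# Hypothetical ensemble: imagine each member was trained on data collected near s = 0,
# so their predictions agree there and drift apart in unvisited regions.
ensemble = [
    lambda s, a: 1.0 * s + a,
    lambda s, a: 1.1 * s + a,
    lambda s, a: 0.9 * s + a,
]


def disagreement(state, action):
    """Variance of the ensemble's next-state predictions, used as an uncertainty proxy."""
    predictions = [member(state, action) for member in ensemble]
    return pvariance(predictions)


def exploration_reward(reward, state, action, beta=1.0):
    """Task reward plus a bonus that is larger where the world model is uncertain."""
    return reward + beta * disagreement(state, action)


print(disagreement(0.5, 0.0) < disagreement(5.0, 0.0))  # True
```

A policy trained on `exploration_reward` is nudged towards states where the ensemble members disagree, which in turn yields data that improves the world model in exactly those regions.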

When choosing between MFRL and MBRL, the key deciding factors are: (i) the significance of sample complexity for our learning problem; (ii) the complexity of the environment; and (iii) its variability.

When training on a time-consuming environment simulator, MBRL can achieve performance comparable to MFRL with significantly fewer interactions with the environment. As such, even though the added world models might increase the number of learnable parameters, owing to the decreased amount of environment interactions, the overall training time can significantly decrease.
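The trade-off can be made explicit with a back-of-the-envelope cost model in which total training time is the time spent in the simulator plus the time spent on gradient updates. All numbers below are hypothetical and chosen only to illustrate the reasoning; they are not measurements from any real system.

```python
SECONDS_PER_DAY = 86_400


def training_days(interactions, sim_seconds, updates_per_interaction, update_seconds):
    """Total wall-clock time: simulator time plus model/policy update time."""
    sim_time = interactions * sim_seconds
    update_time = interactions * updates_per_interaction * update_seconds
    return (sim_time + update_time) / SECONDS_PER_DAY


# Hypothetical slow simulator: 60 seconds per environment step.
mfrl = training_days(interactions=100_000, sim_seconds=60,
                     updates_per_interaction=1, update_seconds=0.01)
# Hypothetical MBRL run: 10x fewer interactions, but far more compute per interaction
# because the world model is trained and used for imagined rollouts.
mbrl = training_days(interactions=10_000, sim_seconds=60,
                     updates_per_interaction=10, update_seconds=0.5)

print(round(mfrl, 1), round(mbrl, 1))  # about 69.5 and 7.5 days
```

In this made-up setting the per-interaction update cost of MBRL is 500 times higher, yet the total time drops by roughly an order of magnitude because simulator time dominates. With a fast simulator, the same calculation can tip the other way.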

Environmental complexity can impede MBRL application if the world model employed does not have the necessary capacity to describe the dynamics. This is becoming increasingly less of an issue, however, as more sophisticated world model architectures are being developed.

More variable environments might favor the application of MBRL. With MFRL, if the dynamics shift over time, then the agent typically requires full retraining to accommodate this shift. In contrast, MBRL allows us to leverage the learned world model to reasonably alter the dynamics and train or plan using modified artificial trajectories.

### Model-free and model-based RL for telecommunication

In the previous installment of the series on RL methods, we presented examples of leveraging model-free RL for RAN optimization. The maturity and relative simplicity of training MFRL algorithms make them ideal learning paradigms for telecommunications. However, the use of MBRL can also be considered, particularly in cases where policy adaptiveness and sample complexity are important. Typical telecommunication use cases in which MBRL can thrive are those that train an RL agent within complex, prohibitively slow simulators.

The Ericsson Global Artificial Intelligence Accelerator successfully utilized MBRL to develop agents capable of tuning cavity filters^{6}. Due to manufacturing tolerances, many filters leave the production line detuned, requiring a human expert to intervene and tune them before they are shipped in Ericsson radio products. To automate this process, the proposed RL agent needs to interact with screws that lie on top of the filter, altering the topology of the cavities and changing the filter's frequency response. Prior MFRL models developed internally allowed the tuning of cavity filters by training on a simple, circuit-based simulator. However, the sample complexity of MFRL made it prohibitively expensive to train on more accurate 3D simulators of filters, which are more representative of the real filter dynamics. In this use case, the adaptation of an MBRL method (Dreamer^{5}) achieved results comparable to the prior MFRL baseline (SAC), while requiring an order of magnitude fewer interactions with the environment during training. The MBRL method made it possible to train the RL algorithm on the 3D simulator, now only requiring one week of training. For reference, training the prior MFRL agent in that same simulator would require close to half a year.

#### References

^{1} Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.F. and Dennison, D., 2015. Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems, 28.

^{2} V. Berggren, K. Dey, J. Jeong, and B. Guldogan, 2022. Bringing reinforcement learning solutions to action in telecom networks. Ericsson Blog, link.

^{3} P. Soldati, E. Ghadimi, B. Demirel, Y. Wang, M. Sintorn, R. Gaigalas, 2023. Approaching AI-native RANs through generalization and scalability of learning. Ericsson Technology Review.

^{4} Michael Janner, Justin Fu, Marvin Zhang, Sergey Levine. When to Trust Your Model: Model-Based Policy Optimization. arXiv:1906.08253.

^{5} Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi. Dream to Control: Learning Behaviors by Latent Imagination.

^{6} Doumitrou Daniil Nimara, Mohammadreza-Malek Mohammadi, Petter Ögren, Jieqian Wei, Vincent Huang. Model-Based Reinforcement Learning for Cavity Filter Tuning. PMLR.

^{7} Tutorial on Model-Based Methods in Reinforcement Learning, url: https://sites.google.com/view/mbrl-tutorial

^{8} Sutton, R. S., Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press.
