Model-Free vs Model-Based RL: Core Concepts and Large-Model Applications

This article explains the basic structure of reinforcement learning by contrasting model‑free and model‑based approaches. It covers environment models, planning, data augmentation, expert iteration, and planning embedded in policies, and then examines how large language models use policy‑based methods such as PPO, DPO, and GRPO for RLHF.

Data Party THU

Introduction

This article provides a concise overview of reinforcement learning (RL), outlining its overall architecture and the main classification into model‑free and model‑based methods. It serves as a primer for readers new to RL before diving into classic algorithms.

Model-Free vs Model-Based RL

Model‑Free and Model‑Based are the two core families of RL. The key distinction is whether the agent explicitly learns or uses a model of the environment (i.e., its transition probabilities and reward function). Model‑Based RL first learns a model of "how the world works" and then uses it for planning or policy learning, while Model‑Free RL does not represent the world's rules explicitly and learns policies or value functions directly from trial‑and‑error experience.

Environment Model

In RL, the environment model consists of two components:

State transition probability P(s' | s, a): the probability of moving from state s to state s' after taking action a.

Reward function R(s, a): the expected reward (or reward distribution) received for a given state‑action pair (s, a).

If the agent possesses such a model, it can simulate future trajectories in its "mind" without interacting with the real environment.
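
A minimal sketch of such a model, assuming a tiny tabular problem: the two states, two actions, and every number below are made up for illustration. It stores the transition probabilities and expected rewards in dictionaries and rolls out a fixed action sequence without touching a real environment.

```python
import random

# Hypothetical 2-state, 2-action problem; all numbers are invented for illustration.
# P[(s, a)] maps each next state s' to its probability P(s' | s, a).
P = {
    ("cold", "heat"): {"warm": 0.9, "cold": 0.1},
    ("cold", "wait"): {"cold": 1.0},
    ("warm", "heat"): {"warm": 1.0},
    ("warm", "wait"): {"warm": 0.7, "cold": 0.3},
}
# R[(s, a)] is the expected reward for taking action a in state s.
R = {
    ("cold", "heat"): -1.0,
    ("cold", "wait"): -2.0,
    ("warm", "heat"): 0.0,
    ("warm", "wait"): 1.0,
}

def simulate(state, actions, rng):
    """Roll out a fixed action sequence inside the model only."""
    trajectory, total_reward = [state], 0.0
    for action in actions:
        total_reward += R[(state, action)]
        next_states = list(P[(state, action)])
        weights = [P[(state, action)][s] for s in next_states]
        # Sample the next state from the modeled distribution P(. | s, a).
        state = rng.choices(next_states, weights=weights, k=1)[0]
        trajectory.append(state)
    return trajectory, total_reward

rng = random.Random(0)
trajectory, total_reward = simulate("cold", ["heat", "wait", "wait"], rng)
print(trajectory, total_reward)
```

Calling simulate many times with different action sequences is what lets an agent compare plans "in its mind" before acting.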

Model‑Based RL Workflow

The typical pipeline consists of:

Collect experience from the environment.

Train a dynamics model from the collected data.

Use the learned model either to plan directly or to generate simulated (virtual) experience, which is combined with real data to train a model‑free algorithm.
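
The three steps can be sketched with a Dyna‑style loop. This is an illustrative sketch, not a method taken from this article: the 5‑state chain environment, the lookup‑table model, and the hyperparameters are all assumptions chosen to keep the example small.

```python
import random

N_STATES, GOAL = 5, 4      # hypothetical chain: states 0..4, reward on reaching state 4
ACTIONS = (-1, +1)         # move left or right

def real_step(s, a):
    """The real environment (deterministic here for simplicity)."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

def q_update(Q, s, a, r, s2, alpha=0.5, gamma=0.9):
    """One model-free Q-learning update; no bootstrapping from the terminal state."""
    future = 0.0 if s2 == GOAL else gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + future - Q[(s, a)])

rng = random.Random(0)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
model = {}                 # learned model: (s, a) -> (s', r)

for episode in range(30):
    s = 0
    while s != GOAL:
        a = rng.choice(ACTIONS)            # 1. collect real experience (random exploration)
        s2, r = real_step(s, a)
        model[(s, a)] = (s2, r)            # 2. fit the model (a lookup table suffices here)
        q_update(Q, s, a, r, s2)           #    learn from the real transition
        for _ in range(5):                 # 3. learn from simulated transitions
            ms, ma = rng.choice(list(model))
            ms2, mr = model[(ms, ma)]
            q_update(Q, ms, ma, mr, ms2)
        s = s2

# Greedy action in each non-terminal state after training.
greedy = [max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(GOAL)]
print(greedy)
```

Each real transition here is followed by five updates on transitions replayed from the model, which is where the sample‑efficiency gain of this style of method comes from.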

Four mainstream categories of Model‑Based RL are:

Planning: e.g., Model Predictive Control (MPC) simulates trajectories in the model and solves a short‑horizon planning problem at each decision step, executing only the first planned action before replanning.

Data Augmentation: methods such as MBVE (Model‑Based Value Expansion) and World Models first learn a dynamics model from real data, then generate simulated rollouts to augment the training data for a model‑free algorithm.

Expert Iteration (ExIt): exemplified by AlphaZero, which runs a search algorithm (Monte‑Carlo Tree Search) over a model of the game (in AlphaZero the rules are given rather than learned) to produce improved action choices, which are then distilled into a policy network via imitation learning.

Embedding Planning into Policies: designs a policy architecture that contains an internal planning module (e.g., a neural network calling a planner). The planner's output (optimal actions, values) becomes an intermediate feature for the policy, which is still trained end‑to‑end with model‑free objectives such as policy gradients.
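
To make the planning category concrete, here is a minimal sketch of MPC using random shooting, one simple way to solve the short‑horizon problem. The 1‑D dynamics in model_step, the horizon, and the number of candidates are assumptions for illustration; in practice the model would be learned from data.

```python
import random

def model_step(x, u):
    """Hypothetical dynamics model: a 1-D point moved by action u; reward is -|distance to 0|."""
    x2 = x + u
    return x2, -abs(x2)

def mpc_action(x, rng, horizon=5, n_candidates=200):
    """Random-shooting MPC: sample action sequences, score each one in the model,
    and return only the first action of the best-scoring sequence."""
    best_return, best_first = float("-inf"), 0.0
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        sim_x, total = x, 0.0
        for u in seq:
            sim_x, r = model_step(sim_x, u)
            total += r
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first

rng = random.Random(0)
x = 3.0
for t in range(10):
    u = mpc_action(x, rng)     # replan from the current state at every step
    x, _ = model_step(x, u)    # here the "real" system is assumed to equal the model
print(abs(x))
```

Replanning at every step is the design choice that makes MPC tolerant of model error: a bad prediction several steps ahead is discarded before it is ever executed.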

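Overall, Model‑Based RL seeks a balance between world understanding and efficient action. It typically offers higher sample efficiency, because simulated experience replaces part of the costly real interaction, and a learned model can be inspected, which helps interpretability. The main risk is that model errors compound over long simulated rollouts.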

Model-Free RL

Model‑Free RL does not explicitly model the environment's transition dynamics. Instead, the agent treats the environment as a black box, learning directly from real interaction data to optimize a policy or value function. This simplicity makes Model‑Free methods a natural fit for reinforcement learning on large models: during text generation the state transition is deterministic (the next state is the current context plus the newly generated token), and everything else is determined by human decisions, so an explicit dynamics model adds little.

Key Elements in Large‑Model RL

In the context of large language models (LLMs), the RL components are:

State: the user prompt (at the token level, the prompt plus the tokens generated so far).

Action: the generated text response (a sequence of tokens).

Reward Model (RM): a learned model that scores the quality of a response.

Because the action space is a high‑dimensional discrete sequence, traditional value‑based methods struggle, whereas policy‑based methods naturally handle sequence generation via policy gradients.
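
As a sketch of why policy gradients fit this setting, the probability of a response factorizes over tokens, so its log‑probability is a sum of per‑token terms that an autoregressive model already computes. The notation is introduced here for illustration: x is the prompt, y is a response of length T, r(x, y) is the reward model score, and θ are the policy parameters. The gradient shown is the basic REINFORCE estimator, not a specific production algorithm.

```latex
\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t})

\nabla_\theta J(\theta)
  = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ r(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right]
```

No maximization over the space of all possible responses is needed, only sampling from the policy, which is what makes this tractable where value‑based action selection is not.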

Policy‑Based Optimization for Large Models

The dominant paradigm for aligning LLMs with human preferences is policy‑based RL. The most widely used algorithm is Proximal Policy Optimization (PPO), which clips the probability ratio between the new and old policies to keep each update small. RLHF pipelines typically add a KL‑divergence penalty against a frozen reference model (usually the supervised fine‑tuned model), so that response quality improves without the policy drifting too far from its starting distribution.
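
The two ingredients can be sketched per sample as follows. This is a simplified illustration on scalar log‑probabilities, not a full training loop; the function names, eps=0.2, and beta=0.1 are assumptions, and the KL term uses the sampled log‑ratio as a simple estimate.

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # keep the ratio in [1-eps, 1+eps]
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped * advantage)

def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward model score minus a penalty on the sampled log-ratio to the reference model."""
    return rm_score - beta * (logp_policy - logp_ref)

# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * advantage.
print(ppo_clip_term(math.log(1.5), 0.0, advantage=1.0))
# Ratio 1.5 with a negative advantage: the unclipped (more pessimistic) term is kept.
print(ppo_clip_term(math.log(1.5), 0.0, advantage=-1.0))
# The policy assigns a higher log-probability than the reference, so the reward is reduced.
print(kl_shaped_reward(rm_score=1.0, logp_policy=-1.0, logp_ref=-2.0))
```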

Other approaches include:

Direct Preference Optimization (DPO): does not run an explicit RL loop or train a separate reward model. It optimizes the policy directly on pairwise preference data (chosen vs. rejected responses) with a classification‑style loss derived from the KL‑regularized RL objective, so the reward is implicit in the policy's log‑probability ratio against a reference model.

Group Relative Policy Optimization (GRPO): samples a group of candidate responses for a single prompt, scores each one with a reward model or a rule‑based checker, and computes each response's advantage relative to the group (its reward minus the group mean, divided by the group standard deviation). The policy is then updated with a PPO‑style clipped objective, which removes the need for a separate critic network.
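
Both ideas reduce to short formulas, sketched below on plain numbers. The DPO loss follows the standard form, negative log‑sigmoid of the scaled difference in log‑ratios against a reference model. The advantage function follows the group‑relative normalization used by GRPO (Group Relative Policy Optimization). The function names, beta=0.1, the small eps, and the choice of population standard deviation are assumptions for illustration.

```python
import math
import statistics

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities
    under the policy and under the frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response relative to its group:
    (reward - group mean) / (group std + eps). No critic network is involved."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std is an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Policy identical to the reference: the margin is zero and the loss is log(2).
print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
# Policy raised the chosen response relative to the reference: the loss drops.
print(dpo_loss(-4.0, -7.0, -5.0, -6.0))
# Four responses to one prompt, scored 1 (correct) or 0 (incorrect) by a checker.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses scoring above the group mean get positive advantages and are made more likely; those below the mean are made less likely.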

These methods are chosen because they bypass the need to model complex text dynamics, instead learning directly from preference or reward signals.

Value‑Based Methods in Large‑Model RL

Classic value‑based algorithms (e.g., DQN) are impractical for LLMs due to the astronomically large discrete action space, making explicit Q‑function estimation computationally infeasible.

Nevertheless, the idea of a value function still appears indirectly:

The Reward Model scores a complete response, so its output acts as a proxy for the value of the state‑action pair (prompt, response) and provides the scalar feedback for policy updates.

In an Actor‑Critic framework (e.g., PPO), the critic estimates the state value V(s), from which the advantage A(s,a) is computed to reduce gradient variance. Critic‑free variants such as GRPO drop the learned critic and use the rewards of other sampled responses to the same prompt as the baseline.

Offline preference learning fits a Bradley‑Terry model to pairwise comparisons, which amounts to relative value estimation; it supplies supervision without online action selection.
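
The Bradley‑Terry view of preferences can be sketched in a few lines: the probability that one response is preferred over another is the sigmoid of the difference in their scalar scores, and a reward model is fit by minimizing the negative log‑likelihood of the observed preferences. The function names and the example scores are assumptions for illustration.

```python
import math

def bt_preference_prob(score_chosen, score_rejected):
    """Bradley-Terry: P(chosen preferred over rejected) = sigmoid(score difference)."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))

def reward_model_loss(pairs):
    """Mean negative log-likelihood over (score_chosen, score_rejected) pairs."""
    return -sum(math.log(bt_preference_prob(c, r)) for c, r in pairs) / len(pairs)

# Equal scores: the model is indifferent between the two responses.
print(bt_preference_prob(1.0, 1.0))
# Scores that agree with the human labels give a lower loss than scores that disagree.
print(reward_model_loss([(2.0, 0.0), (1.5, 0.5)]))
print(reward_model_loss([(0.0, 2.0), (0.5, 1.5)]))
```

Only score differences matter in this model, which is why it provides relative rather than absolute value estimates.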

Thus, value‑based components serve as auxiliary signals rather than the primary learning algorithm in large‑model RL.

References

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

https://arxiv.org/pdf/2412.05265

https://arxiv.org/pdf/2412.10400v3

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
