Why a Robust Training Pipeline Beats Fancy LLM Tricks – Lessons from DAPO
This article analyzes the DAPO technical report, arguing that dynamic‑sampling pipelines and careful token‑level loss handling in SFT and RL training beat ad‑hoc algorithmic tricks, and compares the training dynamics of reinforce_baseline and GRPO with concrete code examples.
Background and Motivation
Recent work by Seed & Tsinghua, the DAPO report, demonstrated that a 32B base model can reach an AIME score of 50, and it offers many practical tricks for supervised fine‑tuning (SFT) and reinforcement learning (RL) of large language models (LLMs).
Dynamic Sampling in Online Pipelines
Both SFT and RL benefit from a dynamic‑sampling strategy: the model's current performance on a prompt determines that prompt's sampling budget. Hard prompts receive more samples, easy prompts fewer, and a filtering step discards hard prompts the model cannot solve at all as well as trivially easy ones.
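A minimal sketch of such a budget‑and‑filter loop (the thresholds, group size, and the policy.sample/grade helpers are illustrative assumptions, not the DAPO implementation):

import numpy as np

def dynamic_sample(prompts, policy, grade, group_size=8):
    """Allocate sampling budget by difficulty and drop degenerate prompts.

    `policy.sample(prompt, n)` and `grade(prompt, response)` are assumed
    helpers returning n responses and a 0/1 correctness score.
    """
    batch = []
    for prompt in prompts:
        responses = policy.sample(prompt, n=group_size)
        pass_rate = np.mean([grade(prompt, r) for r in responses])
        # All-wrong or all-right groups carry zero group advantage and
        # hence no learning signal, so they are filtered out.
        if pass_rate == 0.0 or pass_rate == 1.0:
            continue
        # Harder prompts (low pass rate) get a larger sampling budget.
        if pass_rate < 0.25:
            responses += policy.sample(prompt, n=group_size)
        batch.append((prompt, responses))
    return batch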
During SFT, similar strategies appear in code/math rejection sampling and in diverse response selection (embedding plus clustering, length filtering). A well‑designed online dynamic‑sampling pipeline is far more effective than fragile algorithmic tricks.
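For the diversity‑selection step, one possible shape (the embedding callable and thresholds are assumptions):

import numpy as np
from sklearn.cluster import KMeans

def select_diverse(responses, embed, k=4, max_len=2048):
    """Length-filter, embed, cluster, and keep one response per cluster.
    `embed` is an assumed text-embedding callable returning a 1-D vector."""
    kept = [r for r in responses if len(r) <= max_len]
    if len(kept) <= k:
        return kept
    vecs = np.stack([embed(r) for r in kept])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(vecs)
    # One representative per cluster keeps the selected SFT data varied.
    return [kept[int(np.flatnonzero(labels == c)[0])] for c in range(k)]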
Once the SFT pipeline is solid, the same data‑task definitions, response‑synthesis methods, and scoring methods can be transferred to online RL by integrating them into replay‑buffer construction, provided the replay buffer is decoupled from the main training code so it can be controlled flexibly.
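One way to picture that decoupling, as a sketch with illustrative names (not an openrlhf/verl API):

class ReplayBufferBuilder:
    """Replay-buffer construction decoupled from the trainer: synthesis,
    scoring, and filtering plug in as callables reused from the SFT
    pipeline, so the buffer can change without touching training code."""

    def __init__(self, synthesize, score, keep):
        self.synthesize = synthesize  # SFT-style response synthesis
        self.score = score            # SFT-style grading / reward fn
        self.keep = keep              # dynamic-sampling style filter

    def build(self, prompts):
        buffer = []
        for prompt in prompts:
            for response in self.synthesize(prompt):
                reward = self.score(prompt, response)
                if self.keep(prompt, response, reward):
                    buffer.append((prompt, response, reward))
        return buffer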
Key Pitfalls for Teams
Offline SFT pipelines are labor‑intensive, yet they consist of well‑defined steps that could be automated.
RL teams often repeat the SFT team's data‑collection work, wasting resources.
Agent‑based RL requires a stable environment; without prior agent‑SFT data, environment instability becomes a critical bottleneck.
Token‑Level Loss Analysis
DAPO highlights token‑level loss, which becomes problematic when gradient accumulation (GA) is large. Averaging each micro‑batch's token‑mean loss and then averaging across micro‑batches (the GA loss) only equals a single token‑level mean over the full batch when every micro‑batch contains the same number of tokens; most frameworks implement the GA version, leading to a higher loss on long‑text training.
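A toy illustration of the mismatch (the numbers are hypothetical): two micro‑batches, one with 10 response tokens and one with 1,000.

import torch

# Hypothetical per-token losses: one short response, one long response.
short_resp = torch.full((10,), 2.0)     # 10 tokens with loss 2.0 each
long_resp = torch.full((1000,), 0.5)    # 1000 tokens with loss 0.5 each

# GA-style loss: mean within each micro-batch, then mean across them.
ga_loss = (short_resp.mean() + long_resp.mean()) / 2          # 1.25
# Token-level loss: one mean over all tokens in the accumulated batch.
token_loss = torch.cat([short_resp, long_resp]).mean()        # ~0.515

print(ga_loss.item(), token_loss.item())

The GA version gives the 10 short tokens the same total weight as the 1,000 long ones, which is exactly the distortion that computing the loss at the token level across the whole GA window removes.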
In openrlhf/verl, the micro‑batch loss is already token‑level, but gradient accumulation reintroduces the same issue, causing overly aggressive optimization early in training. One fix is to first all‑reduce the total response length across the accumulated micro‑batches and across ranks, as in the following snippet, and then normalize by that global token count.
# Collect experiences until one full gradient-accumulation window is
# buffered, then all-reduce the total response length across ranks so the
# loss can be normalized by the global token count, not the local one.
if len(prefetch) == 0 or len(prefetch) % self.strategy.accumulated_gradient != 0:
    prefetch.append(experience)
if len(prefetch) % self.strategy.accumulated_gradient == 0:
    torch.distributed.barrier()
    # Sum response lengths over every experience in this GA window...
    length_status = {'response_length': prefetch[0].info['response_length'].sum()}
    for exp in prefetch[1:]:
        length_status['response_length'] += exp.info['response_length'].sum()
    # ...and then across data-parallel ranks.
    length_status = self.strategy.all_reduce(length_status, op='sum')

GRPO without token‑level loss shows a noticeably higher repetition rate, while the GA‑level token loss stabilizes training and reduces repetition.
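A sketch of how that all‑reduced count could then be used when stepping through the window (single‑process view; the loop and the compute_per_token_loss helper are illustrative, not the openrlhf code):

# Divide each micro-batch's summed token loss by the window's global token
# count instead of by that micro-batch's own length.
global_tokens = length_status['response_length']
for exp in prefetch:
    token_loss = compute_per_token_loss(exp)  # assumed helper: one value per token
    loss = token_loss.sum() / global_tokens
    loss.backward()                           # gradients sum across the GA window
optimizer.step()
optimizer.zero_grad()

Because every micro‑batch divides by the same global count, the accumulated gradient matches what one large token‑level batch would have produced.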
Reinforce‑Baseline vs. GRPO
The two methods compute the advantage differently: reinforce_baseline uses r − group_mean followed by global (batch‑level) normalization, whereas GRPO uses (r − group_mean) / group_std, i.e. per‑group normalization. The only difference is the scaling coefficient applied to the same centered reward.
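A minimal side‑by‑side sketch of the two estimators (the tensor shapes and the epsilon are assumptions; this is not the verl/openrlhf implementation):

import torch

def grpo_advantage(rewards, eps=1e-6):
    """rewards: [n_groups, group_size]; center and scale within each group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def reinforce_baseline_advantage(rewards, eps=1e-6):
    """Center by the group mean, then normalize over the whole batch."""
    centered = rewards - rewards.mean(dim=1, keepdim=True)
    return centered / (centered.std() + eps)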
When rewards are binary (0/1), the variance estimates diverge: a group with pass rate p has std roughly sqrt(p(1 − p)), so GRPO's per‑group scaling inflates advantages on nearly‑always‑wrong or nearly‑always‑right prompts, while reinforce_baseline's global normalization keeps the scale uniform across prompts. As a result, GRPO converges faster but can be less stable than reinforce_baseline, which is more robust under low‑variance conditions.
Conclusions
Moving the SFT pipeline into online replay‑buffer construction enables stable online RL, given a robust environment and RL method.
Implementing token‑level loss at the GA level matters for training with large gradient accumulation; an alternative is to perform multiple parameter updates per batch of samples (at the cost of being more off‑policy).
Reinforce_baseline and GRPO share similar dynamics; the difference in advantage scaling makes reinforce_baseline generally more stable, while GRPO may show stronger early optimization.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.