What Small Labs Reveal About RL Training: Multi‑Stage, Entropy, and Resource Strategies
The article analyzes Skywork OR1's technical report, detailing how small‑scale teams use GRPO‑based reinforcement learning with multi‑stage training, advantage‑mask variants, high‑temperature sampling, adaptive entropy loss, and resource‑allocation tricks to improve large language model performance while avoiding premature entropy collapse.
Background and Motivation
After the hype around DeepSeek‑R1 subsided, many teams returned to the classic combination of supervised fine‑tuning (SFT) and reinforcement learning (RL). The Skywork OR1 technical report provides a concrete case study from a modest research group, offering reproducible experiments that are less resource‑intensive than those of large corporations.
Why Different Teams Pursue RL Differently
Large‑scale labs (e.g., OpenAI, Google) aim to close the gap with top‑tier models and establish technical barriers.
Independent practitioners and small labs seek to understand RL fundamentals and apply them to daily work.
Enthusiasts may simply want to publish a paper.
Key Strengths of the Skywork Report
Uses ByteDance's verl and Qwen as base models, making replication straightforward.
Establishes a strong baseline (DeepScaleR) for imitation and improvement.
Experiments run on 32, 64, and 128 GPUs, keeping hardware requirements modest.
Focuses on mitigating policy‑entropy collapse, a critical RL issue.
Provides extensive ablations without unnecessary embellishments.
Training Strategy Overview
The report employs the basic GRPO algorithm, removing the response‑length normalization term to reduce length bias. This modification aligns with techniques described in the DAPO paper.
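To make the length-bias point concrete, here is a minimal sketch of a token-level GRPO-style loss that drops the per-response length normalization. The function names and the (B, T) tensor layout are illustrative, not taken from the report or from verl:

```python
import torch

def group_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss_no_length_norm(logprobs, advantages, mask):
    """Policy-gradient loss without per-response length normalization.

    logprobs:   (B, T) log-probs of the sampled tokens under the current policy
    advantages: (B,)   group-normalized scalar advantage per response
    mask:       (B, T) 1 for valid response tokens, 0 for padding
    """
    per_token = -logprobs * advantages.unsqueeze(-1) * mask
    # Original GRPO: average within each response, then across responses,
    # i.e. (per_token.sum(-1) / mask.sum(-1)).mean(), which under-weights long answers.
    # Variant here: a single average over all valid tokens in the batch.
    return per_token.sum() / mask.sum().clamp(min=1)
```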
Multi‑Stage Training
Because long contexts (e.g., 32k tokens) dramatically increase RL compute, the authors split training into stages: first train with an 8k context window, then continue with 16k, and so on. Although training on shorter contexts could theoretically limit the model's ability to generate long responses, empirical results show that the multi-stage curve matches the single-stage curve on the test set, while offering three main benefits (a minimal schedule sketch follows the list):
Compute savings: the model reaches comparable performance earlier.
Higher token efficiency: shorter responses reduce wasted tokens, which matters for token‑based pricing.
Preserved scaling potential: later stages still improve performance, indicating the model’s capacity is not capped.
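As a rough illustration, the staged curriculum amounts to a loop over increasing response-length limits. The step counts and the trainer interface below are placeholders, not the report's actual schedule or verl's API:

```python
# Hypothetical stage schedule: (max response tokens, RL steps); step counts are made up.
STAGES = [(8_192, 1_000), (16_384, 500), (32_768, 500)]

def run_multi_stage(trainer):
    for max_len, steps in STAGES:
        trainer.set_max_response_length(max_len)  # responses beyond this limit are truncated
        trainer.train(num_steps=steps)            # continues from the previous stage's weights
```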
Advantage‑Mask Strategies for Truncated Responses
When training with an 8k limit, many responses exceed the window and are cut off, so the truncation ratio is high early in training. The authors test three masking policies for these truncated samples (sketched in code after the results below):
no mask: all responses contribute to the advantage calculation.
advantage_mask_before: truncated responses receive zero advantage and are excluded from the group‑level advantage statistics.
advantage_mask_after: truncated responses are included in the group statistics but still receive zero individual advantage.
Results show that advantage_mask_before reduces overall accuracy despite improving non‑truncated accuracy—a classic reward‑hacking scenario—so it is discarded in favor of no mask or advantage_mask_after.
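A minimal sketch of the three variants, assuming per-response rewards for one rollout group and a boolean flag marking truncated responses (names are illustrative):

```python
import torch

def masked_group_advantages(rewards, truncated, mode="no_mask"):
    """rewards: (G,) scalar rewards for one rollout group; truncated: (G,) bool mask."""
    if mode == "advantage_mask_before":
        # Exclude truncated responses from the group statistics, then zero their advantage.
        keep = ~truncated
        ref = rewards[keep] if keep.any() else rewards
        adv = (rewards - ref.mean()) / (ref.std() + 1e-6)
        adv = torch.where(truncated, torch.zeros_like(adv), adv)
    elif mode == "advantage_mask_after":
        # Keep truncated responses in the statistics, but zero their individual advantage.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        adv = torch.where(truncated, torch.zeros_like(adv), adv)
    else:  # "no_mask": truncated responses are treated like any other sample
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return adv
```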
High‑Temperature Sampling
Increasing the rollout sampling temperature raises initial entropy, which strengthens early‑stage learning signals and preserves later‑stage improvement potential. Experiments confirm that higher‑temperature sampling yields better final performance while also slowing entropy decay.
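The mechanism is simply dividing the rollout logits by a temperature above 1 before sampling; the 1.2 value below is a placeholder rather than the report's setting:

```python
import torch

def sample_with_temperature(logits, temperature=1.2):
    """Temperature > 1 flattens the token distribution, keeping rollout entropy higher."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```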
Adaptive Entropy Loss
To counter rapid entropy collapse, the authors add an entropy‑loss term whose coefficient is dynamically adjusted. When the model’s entropy falls below a target, the coefficient is increased; when it exceeds the target, the coefficient is decreased. This dynamic control is sensitive to both the loss magnitude and the training data distribution.
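One plausible form of such a controller is an additive update toward a target entropy; the step size and bounds below are illustrative, and the report's exact rule may differ:

```python
def update_entropy_coef(coef, current_entropy, target_entropy,
                        step=1e-3, min_coef=0.0, max_coef=1e-1):
    """Strengthen the entropy bonus when entropy is below target, weaken it otherwise."""
    if current_entropy < target_entropy:
        return min(coef + step, max_coef)
    return max(coef - step, min_coef)

# The bonus is then added to the policy objective each update, e.g.
#   total_loss = policy_loss - coef * mean_token_entropy
```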
Clip‑Higher Technique
Borrowed from DAPO, the clip_higher technique raises the upper clipping bound on the token‑level importance ratio, letting low‑probability tokens be up‑weighted further and thereby slowing entropy decay. Experiments indicate that an upper bound around 0.28 (the DAPO recommendation) yields the best trade‑off.
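A sketch of the asymmetric clipping applied to the token-level importance ratio; the surrogate below follows the standard PPO form, with 0.2/0.28 as the usual DAPO-style bounds:

```python
import torch

def clip_higher_surrogate(ratio, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped policy objective with a wider upper bound on the importance ratio.

    ratio:      (B, T) pi_theta / pi_old for the sampled tokens
    advantages: (B, T) per-token advantages (broadcast from per-response values)
    Raising eps_high lets low-probability tokens gain more weight, slowing entropy decay.
    """
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```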
KL Loss Considerations
Beyond a certain point in training, the KL penalty keeps pulling the policy back toward the reference model: the KL term approaches zero and test‑set performance stops improving. The authors therefore drop the KL loss after roughly 750 steps.
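In practice this amounts to gating the KL penalty on the training step; the coefficient below is a placeholder:

```python
def rl_loss(policy_loss, kl_to_reference, step, kl_coef=1e-3, kl_cutoff_step=750):
    """Apply the KL penalty only in the early phase; drop it once it stops helping."""
    if step < kl_cutoff_step:
        return policy_loss + kl_coef * kl_to_reference
    return policy_loss
```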
Mitigating Entropy Collapse
Entropy collapse leads to poorer model performance.
On‑policy training consistently avoids rapid collapse; off‑policy training inevitably collapses regardless of other hyper‑parameters.
Off‑policy remains valuable for efficiency and asynchronous frameworks, but must be paired with entropy‑preserving techniques.
Even on‑policy only slows collapse; additional measures (entropy loss, clip‑higher) are still needed.
The authors recommend combining entropy loss with the clip_higher trick to maintain a healthy entropy level throughout training.
Training Resource Allocation
Resource constraints are a primary concern for small teams. The report breaks RL training time into three components: rollout time, policy‑update time, and auxiliary operations (reward computation, experience generation). Key observations:
Rollout dominates overall compute.
Increasing the number of policy updates per rollout adds only ~3% overhead while improving sample efficiency (see the sketch after this list).
Scaling the GPU count does not reduce generation time linearly; within a batch, the longest response sets the pace (a straggler or “bucket” effect).
Larger rollout batch sizes and group sizes consistently improve test‑set performance.
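A rough sketch of reusing one rollout batch for several policy updates; the trainer objects and method names are illustrative, not verl's actual API:

```python
def train_on_rollout(policy, rollout_batch, num_updates=4, minibatch_size=256):
    """Take several gradient steps on one (expensive) rollout batch.

    Rollout generation dominates wall-clock time, so the extra update passes
    add little overhead while extracting more learning signal per rollout.
    """
    for _ in range(num_updates):
        for minibatch in rollout_batch.shuffle().split(minibatch_size):
            loss = policy.compute_loss(minibatch)  # e.g., the clipped surrogate above
            loss.backward()
            policy.step_optimizer()
```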
Concluding Remarks
The report emphasizes that RL outcomes depend heavily on the base model’s capabilities, training data quality, and the chosen task domain. While Qwen’s strong mathematical pretraining leads to rapid entropy decay, models like LLaMA may not require the same entropy‑control measures. Readers are encouraged to scrutinize the experimental motivations and not accept conclusions blindly.
Paper: Skywork Open Reasoner 1 Technical Report
Link: https://arxiv.org/abs/2505.22312
Code: https://github.com/SkyworkAI/Skywork-OR1
