How STAPO Improves Large‑Model Fine‑Tuning by Silencing Spurious Tokens
The STAPO (Spurious‑Token‑Aware Policy Optimization) algorithm, introduced by Tsinghua University's iDLab and Didi's Deep Sea Lab, tackles policy‑entropy instability and performance oscillation in reinforcement‑learning fine‑tuning of large models. It does so by mathematically analyzing token collision probability, defining spurious tokens, and applying a Silencing Spurious Tokens mechanism, and it reports state‑of‑the‑art results on multiple math‑reasoning benchmarks.
Key Contributions
An analysis based on collision probability and Shannon‑entropy bounds shows that the norm of token‑level policy gradients is negatively correlated with token‑generation entropy, providing a theoretical basis for large‑model RL design.
Introduces the notion of a spurious token – a token that appears in correct answers but contributes little or negatively to reasoning – and establishes a three‑dimensional analysis framework (policy‑gradient norm, entropy‑change direction, learning potential) to identify such tokens.
Proposes the Silencing Spurious Tokens (S2T) mechanism and integrates it with a group‑advantage objective, yielding the STAPO (Spurious‑Token‑Aware Policy Optimization) algorithm that improves policy‑entropy stability and convergence.
Spurious Token Definition and Detection Criteria
A token t_i is classified as spurious when it satisfies three conditions simultaneously:
Positive advantage: A_i > 0.
Low generation probability: πθ(a_i|s) < τ_p (e.g., τ_p = 0.01).
Low token‑level generation entropy: H_i = -∑_a πθ(a|s) log πθ(a|s) < τ_h (e.g., τ_h set to the 5th percentile of the entropy distribution).
Only tokens meeting all three thresholds are masked as spurious.
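The detection rule maps directly to a per‑token mask. Below is a minimal PyTorch sketch, assuming flat per‑token tensors of advantages, log‑probabilities, and entropies have already been computed; the function name s2t_mask, the threshold defaults, and the use of a batch percentile for τ_h are illustrative assumptions, not the authors' implementation.

```python
import torch

def s2t_mask(advantages: torch.Tensor,
             token_logprobs: torch.Tensor,
             token_entropies: torch.Tensor,
             tau_p: float = 0.01,
             entropy_percentile: float = 5.0) -> torch.Tensor:
    """Per-token silencing mask: 0 for spurious tokens, 1 otherwise.

    A token is spurious only if all three conditions hold:
    positive advantage, generation probability below tau_p, and
    generation entropy below the tau_h percentile threshold.
    """
    # tau_h taken as a low percentile of the batch entropy distribution (assumption)
    tau_h = torch.quantile(token_entropies, entropy_percentile / 100.0)
    token_probs = token_logprobs.exp()            # pi_theta(a_i | s)
    spurious = (advantages > 0) & (token_probs < tau_p) & (token_entropies < tau_h)
    return (~spurious).float()                    # 1 keeps a token, 0 silences it
```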
Silencing Spurious Tokens (S2T) Mechanism
The S2T mechanism defines a binary mask M_i ∈ {0,1} for each token position i:
M_i = 0 if (A_i > 0) ∧ (πθ(a_i|s) < τ_p) ∧ (H_i < τ_h), i.e., the token is spurious
M_i = 1 otherwise
During back‑propagation, each token's policy‑gradient term is multiplied by M_i, which silences gradients from spurious tokens while preserving gradients from informative tokens.
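To make the silencing concrete, the small PyTorch snippet below shows that multiplying each token's loss term by the mask drives the gradient at silenced positions to exactly zero; the numbers are illustrative, not taken from the paper.

```python
import torch

# Illustrative values only: position 0 plays the role of a spurious token
# (positive advantage, very low probability), positions 1-2 are informative.
probs = torch.tensor([0.005, 0.40, 0.70], requires_grad=True)   # pi_theta(a_i | s)
advantages = torch.tensor([1.0, 1.0, -0.5])
mask = torch.tensor([0.0, 1.0, 1.0])                             # M_i: 0 silences position 0

# Per-token policy-gradient terms, scaled by the mask before reduction.
loss = -(mask * advantages * probs.log()).mean()
loss.backward()

print(probs.grad)   # gradient at position 0 is exactly zero: the spurious token is silenced
```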
STAPO Objective
The overall loss combines a group‑advantage term with an entropy regularizer:
L(θ) = - E_{πθ}[ M_i · A_i ] + λ · E_{πθ}[ H(πθ) ]
where λ balances advantage maximization against entropy stability. The expectation is taken over token sequences generated by the current policy πθ, and the mask M_i ensures that only non‑spurious tokens contribute to the advantage term.
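A minimal sketch of the combined objective, following the formula above: the mask and advantages are treated as constants (detached), so gradients flow only through the current policy's log‑probabilities and entropies. The mean reduction over tokens, the λ default, and the variable names are assumptions.

```python
import torch

def stapo_loss(token_logprobs: torch.Tensor,
               token_entropies: torch.Tensor,
               advantages: torch.Tensor,
               mask: torch.Tensor,
               lam: float = 0.01) -> torch.Tensor:
    """Masked group-advantage term plus the entropy regularizer, as in the
    formula above. lam (lambda) and the mean reduction are assumptions."""
    # Mask and advantages are constants w.r.t. the policy parameters.
    advantage_term = -(mask.detach() * advantages.detach() * token_logprobs).mean()
    entropy_term = token_entropies.mean()
    return advantage_term + lam * entropy_term
```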
Experimental Setup
Base models: Qwen‑3 1.7B, 8B, and 14B.
Benchmarks: AIME24, AIME25, AMC23, MATH500, Minerva, OlympiadBench (six math‑reasoning tasks).
Baselines for comparison: GRPO, 20‑Entropy, JustRL.
Sampling settings: top‑p = 1.0 (full sampling) and top‑p = 0.9.
Results
STAPO consistently outperforms all baselines. Under top‑p = 1.0, average accuracy improves by 7.13%; under top‑p = 0.9, the improvement is 3.69%. Policy‑entropy curves show smoother trajectories and no collapse to zero, indicating stable exploration, whereas GRPO exhibits entropy collapse.
Key metrics (accuracy, entropy, training reward) visualized across training steps demonstrate that STAPO achieves higher accuracy and reward while maintaining lower and more stable entropy than 20‑Entropy and JustRL.
Future Directions
The authors plan to extend STAPO to embodied‑intelligence large models, focusing on end‑to‑end autonomous‑driving fine‑tuning tasks to improve generalization in unseen driving scenarios.