How STAPO Improves Large‑Model Fine‑Tuning by Silencing Spurious Tokens
The STAPO (Spurious‑Token‑Aware Policy Optimization) algorithm, introduced by Tsinghua University's iDLab and Didi's Deep Sea Lab, tackles policy‑entropy instability and performance oscillation in reinforcement‑learning fine‑tuning of large models. It does so by mathematically analyzing token collision probability, defining spurious tokens, and applying a Silencing Spurious Tokens mechanism, and it reports state‑of‑the‑art results on multiple math‑reasoning benchmarks.
Key Contributions
An analysis based on collision probability and Shannon‑entropy bounds shows that the norm of token‑level policy gradients is negatively correlated with token‑generation entropy, providing a theoretical basis for large‑model RL design.
Introduces the notion of a spurious token – a token that appears in correct answers but contributes little or negatively to reasoning – and establishes a three‑dimensional analysis framework (policy‑gradient norm, entropy‑change direction, learning potential) to identify such tokens.
Proposes the Silencing Spurious Tokens (S2T) mechanism and integrates it with a group‑advantage objective, yielding the STAPO (Spurious‑Token‑Aware Policy Optimization) algorithm that improves policy‑entropy stability and convergence.
Spurious Token Definition and Detection Criteria
A token t_i is classified as spurious when it satisfies three conditions simultaneously:
Positive advantage: A_i > 0.
Low generation probability: πθ(a_i|s) < τ_p (e.g., τ_p = 0.01).
Low token‑level generation entropy: H_i = -∑_a πθ(a|s) log πθ(a|s) < τ_h (e.g., τ_h set to the 5th percentile of the entropy distribution).
Only tokens meeting all three thresholds are masked as spurious.
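The detection rule maps directly to a per‑token mask. Below is a minimal PyTorch sketch, assuming flat per‑token tensors of advantages, log‑probabilities, and entropies have already been computed; the function name s2t_mask, the threshold defaults, and the use of a batch percentile for τ_h are illustrative assumptions, not the authors' implementation.

```python
import torch

def s2t_mask(advantages: torch.Tensor,
             token_logprobs: torch.Tensor,
             token_entropies: torch.Tensor,
             tau_p: float = 0.01,
             entropy_percentile: float = 5.0) -> torch.Tensor:
    """Per-token silencing mask: 0 for spurious tokens, 1 otherwise.

    A token is spurious only if all three conditions hold:
    positive advantage, generation probability below tau_p, and
    generation entropy below the tau_h percentile threshold.
    """
    # tau_h taken as a low percentile of the batch entropy distribution (assumption)
    tau_h = torch.quantile(token_entropies, entropy_percentile / 100.0)
    token_probs = token_logprobs.exp()            # pi_theta(a_i | s)
    spurious = (advantages > 0) & (token_probs < tau_p) & (token_entropies < tau_h)
    return (~spurious).float()                    # 1 keeps a token, 0 silences it
```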
Silencing Spurious Tokens (S2T) Mechanism
The S2T mechanism defines a binary mask M_i ∈ {0,1} for each token position i:
M_i = 0 if (A_i > 0) ∧ (πθ(a_i|s) < τ_p) ∧ (H_i < τ_h), i.e., the token is spurious
M_i = 1 otherwise
During back‑propagation, each token's policy‑gradient term is multiplied by M_i, which silences gradients from spurious tokens while preserving gradients from informative tokens.
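To make the silencing concrete, the small PyTorch snippet below shows that multiplying each token's loss term by the mask drives the gradient at silenced positions to exactly zero; the numbers are illustrative, not taken from the paper.

```python
import torch

# Illustrative values only: position 0 plays the role of a spurious token
# (positive advantage, very low probability), positions 1-2 are informative.
probs = torch.tensor([0.005, 0.40, 0.70], requires_grad=True)   # pi_theta(a_i | s)
advantages = torch.tensor([1.0, 1.0, -0.5])
mask = torch.tensor([0.0, 1.0, 1.0])                             # M_i: 0 silences position 0

# Per-token policy-gradient terms, scaled by the mask before reduction.
loss = -(mask * advantages * probs.log()).mean()
loss.backward()

print(probs.grad)   # gradient at position 0 is exactly zero: the spurious token is silenced
```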
STAPO Objective
The overall loss combines a group‑advantage term with an entropy regularizer:
L(θ) = - E_{πθ}[ M_i · A_i ] + λ · E_{πθ}[ H(πθ) ]
where λ balances advantage maximization against entropy stability. The expectation is taken over token sequences generated by the current policy πθ, and the mask M_i ensures that only non‑spurious tokens contribute to the advantage term.
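A minimal sketch of the combined objective, following the formula above: the mask and advantages are treated as constants (detached), so gradients flow only through the current policy's log‑probabilities and entropies. The mean reduction over tokens, the λ default, and the variable names are assumptions.

```python
import torch

def stapo_loss(token_logprobs: torch.Tensor,
               token_entropies: torch.Tensor,
               advantages: torch.Tensor,
               mask: torch.Tensor,
               lam: float = 0.01) -> torch.Tensor:
    """Masked group-advantage term plus the entropy regularizer, as in the
    formula above. lam (lambda) and the mean reduction are assumptions."""
    # Mask and advantages are constants w.r.t. the policy parameters.
    advantage_term = -(mask.detach() * advantages.detach() * token_logprobs).mean()
    entropy_term = token_entropies.mean()
    return advantage_term + lam * entropy_term
```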
Experimental Setup
Base models: Qwen‑3 1.7B, 8B, and 14B.
Benchmarks: AIME24, AIME25, AMC23, MATH500, Minerva, OlympiadBench (six math‑reasoning tasks).
Baselines for comparison: GRPO, 20‑Entropy, JustRL.
Sampling settings: top‑p = 1.0 (full sampling) and top‑p = 0.9.
Results
STAPO consistently outperforms all baselines. Under top‑p = 1.0, average accuracy improves by 7.13%; under top‑p = 0.9, the improvement is 3.69%. Policy‑entropy curves show smoother trajectories and no collapse to zero, indicating stable exploration, whereas GRPO exhibits entropy collapse.
Key metrics (accuracy, entropy, training reward) visualized across training steps demonstrate that STAPO achieves higher accuracy and reward while maintaining lower and more stable entropy than 20‑Entropy and JustRL.
Future Directions
The authors plan to extend STAPO to embodied‑intelligence large models, focusing on end‑to‑end autonomous‑driving fine‑tuning tasks to improve generalization in unseen driving scenarios.