Didi Tech
Mar 12, 2026 · Artificial Intelligence
How STAPO Improves Large‑Model Fine‑Tuning by Silencing Spurious Tokens
The STAPO (Spurious-Token-Aware Policy Optimization) algorithm, introduced by Tsinghua University's iDLab and Didi's Deep Sea Lab, targets policy-entropy instability and performance oscillation in reinforcement-learning fine-tuning of large models. It mathematically analyzes token collision probability, defines spurious tokens, and applies a Silencing Spurious Tokens mechanism, which is reported to yield state-of-the-art results on multiple math-reasoning benchmarks.
AI safety · Fine-tuning · Reinforcement learning
7 min read
