Ring-lite: Open‑Source Lightweight MoE Model Sets SOTA on AIME and LiveCodeBench
Ring-lite, an open-source lightweight Mixture-of-Experts (MoE) reasoning model built on Ling-lite-1.5, introduces the C3PO reinforcement-learning training method and achieves state-of-the-art results on benchmarks such as AIME24/25, LiveCodeBench, Codeforces, and GPQA-diamond, while offering full transparency of weights, code, and data.
Introduction
Today we open-source the lightweight reasoning model Ring-lite, built on the previously released Ling-lite-1.5 (MoE architecture, 16.8B total parameters, 2.75B activated parameters). Using the novel C3PO reinforcement-learning training method, Ring-lite achieves state-of-the-art performance on several reasoning benchmarks (AIME24/25, LiveCodeBench, Codeforces, GPQA-diamond), matching the quality of dense models with roughly three times as many activated parameters.
C3PO Reinforcement Learning Method
The first technical innovation is C3PO (Constrained-Contextual-Computation Policy Optimization), which directly addresses the training instability caused by large fluctuations in response length during RL. C3PO fixes the total number of tokens processed per training step: tokens beyond this budget are discarded according to a selection strategy, which stabilizes both gradient norms and system throughput.
Experiments show that when response length drops, the gradient norm and reward decline sharply and system throughput falls. C3PO's token-budget constraint eliminates these spikes and, combined with entropy-loss-based selection of the Long-CoT SFT checkpoint, prevents sudden reward drops.
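To make the token-budget idea concrete, below is a minimal Python sketch of a fixed per-step budget. The function name, the uniform-random selection strategy, and the budget value are illustrative assumptions, not the released C3PO implementation.

```python
import random

def fill_token_budget(rollouts, token_budget=65536, seed=0):
    """Select rollouts until a fixed per-step token budget is filled.

    rollouts: list of dicts, each with a "tokens" list of response
    token ids. Rollouts that no longer fit are discarded, so every
    optimizer step processes (nearly) the same number of tokens
    regardless of fluctuations in response length.
    """
    rng = random.Random(seed)
    pool = list(rollouts)
    rng.shuffle(pool)  # illustrative selection strategy: uniform random

    selected, used = [], 0
    for rollout in pool:
        n = len(rollout["tokens"])
        if used + n <= token_budget:
            selected.append(rollout)
            used += n
    return selected, used
```

Because each step then consumes a near-constant number of tokens, gradient norms and system throughput no longer track swings in average response length.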
Token‑Efficiency‑Based Training Ratio
We analyze the trade-off between SFT and RL from a token-efficiency perspective. Using the ratio of RL tokens to SFT tokens as a metric, we find a sweet spot that balances performance and token efficiency, outperforming both pure Long-CoT SFT and pure RL approaches.
Choosing the SFT checkpoint based on entropy loss yields results close to the optimal trade‑off curve observed in experiments.
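One plausible reading of this entropy-based selection is to score each Long-CoT SFT checkpoint by its mean token entropy on a small probe batch and start RL from a checkpoint in a moderate entropy band. The sketch below assumes a Hugging Face-style causal LM that returns logits; the function and the exact criterion are illustrative, not the released selection rule.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_entropy(model, input_ids, attention_mask):
    """Average per-token predictive entropy on a probe batch.

    A checkpoint whose entropy is neither collapsed (overconfident)
    nor diffuse (undertrained) is a reasonable RL starting point;
    the exact selection rule here is an assumption.
    """
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq)
    mask = attention_mask.float()
    return (entropy * mask).sum() / mask.sum()
```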
Stage‑wise Multi‑Domain Training
Ring‑lite is trained jointly on mathematics, coding, and scientific tasks. Directly mixing all domains degrades performance compared with single‑domain training. A two‑stage strategy—first training on mathematics, then mixing coding and STEM tasks—mitigates cross‑domain conflicts and yields higher scores.
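A minimal sketch of such a stage-wise schedule is shown below; the step counts and mixture weights are assumed values for illustration, not the released training recipe.

```python
import random

# Two-stage curriculum: math only, then a math/code/STEM mixture.
STAGES = [
    {"steps": 1000, "mixture": {"math": 1.0}},
    {"steps": 1000, "mixture": {"math": 0.4, "code": 0.4, "stem": 0.2}},
]

def domain_for_step(step, stages=STAGES):
    """Return the data domain to sample from at a given global step."""
    rng = random.Random(step)  # deterministic per step
    for stage in stages:
        if step < stage["steps"]:
            domains, weights = zip(*stage["mixture"].items())
            return rng.choices(domains, weights=weights, k=1)[0]
        step -= stage["steps"]
    raise ValueError("step is beyond the schedule")
```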
Benchmark Results
Ring-lite was compared with the lightweight reasoning models Qwen3-8B, AceReason-Nemotron-7B, and DeepSeek-R1-Distill-Qwen-14B on standard complex-reasoning benchmarks.
Mathematical reasoning (AIME24/25): 76.61 and 69.11, surpassing all baselines.
Programming contests (LiveCodeBench, Codeforces): 60.66 and 86.45% respectively, leading the field.
Scientific reasoning (GPQA‑diamond): 61.05, comparable to the best baseline.
The average score across these benchmarks exceeds all compared models, despite Ring-lite using only 2.75B activated parameters.
Data Construction
We built a large, high‑quality dataset for Long‑CoT SFT and RL training.
Mathematics: Over 73k cleaned problems from open sources (BigMath, DeepScaleR) and competition archives (AoPS), forming the reinforcement-learning dataset.
Code: 14k samples from CodeContest, TACO, APPS, and QOJ, each with verified executable solutions and test cases.
Science: 3,833 expert-annotated questions from Olympiads and graduate exams.
Data were filtered, deduplicated, and annotated with multi‑dimensional metadata (source, subject, difficulty, etc.) to enable dynamic sampling during RL.
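As one illustration of how this metadata can drive dynamic sampling, the sketch below restricts each RL batch to a difficulty band so prompts are neither trivial nor unsolvable for the current policy; the field names and thresholds are assumptions about the schema, not the released pipeline.

```python
import random

def sample_rl_batch(dataset, batch_size=32, difficulty_range=(3, 8), seed=0):
    """Draw an RL batch using per-example metadata tags.

    Each example is assumed to carry an integer "difficulty" field
    (the schema here is illustrative). Restricting to a band keeps
    training focused on problems of intermediate difficulty.
    """
    rng = random.Random(seed)
    lo, hi = difficulty_range
    eligible = [ex for ex in dataset if lo <= ex["difficulty"] <= hi]
    return rng.sample(eligible, min(batch_size, len(eligible)))
```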
Future Plans
We aim to extend C3PO beyond training stability to inference, allowing dynamic token budgets that grow with model capability, and to pursue end‑to‑end collaborative optimization that bridges training and inference efficiency.
All code, model weights, training scripts, and datasets will be released incrementally, making Ring-lite the first lightweight MoE reasoning model with a fully transparent end-to-end training pipeline.