Can a 3B Model Rival Opus 4.5 in Programming? Inside the Chinese‑Built VibeThinker‑3B

VibeThinker‑3B, a 3‑billion‑parameter Chinese‑built model, posts scores comparable to top‑tier models such as Opus 4.5 on verifiable reasoning benchmarks, including the AIME and HMMT math competitions, LiveCodeBench and LeetCode contests. The results are attributed to its Spectrum‑to‑Signal training pipeline, multi‑stage SFT and RL refinements, and Claim‑Level Reliability scaling.

Machine Heart

Jun 17, 2026

In recent days a 3‑billion‑parameter model called VibeThinker‑3B has attracted attention on X because its performance on several verifiable reasoning tasks, especially programming, falls within the range of frontier models such as Gemini 3 Pro, GPT‑5 high, Claude Opus 4.5, GLM‑5 and Kimi K2.5, while being far smaller.

The model, developed by the Sina Weibo team, is a dense reasoning model designed to push the limits of verifiable reasoning under strict size constraints.

Benchmark results are strong: 94.3 on AIME 26, 89.3 on HMMT 25, 80.2 Pass@1 on LiveCodeBench v6, and a 96.1% pass rate on previously unseen LeetCode weekly and bi‑weekly contests held between 2026‑04‑25 and 2026‑05‑31.
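For readers unfamiliar with the metric, Pass@1 is the probability that a single sampled solution passes every test for a problem. The sketch below uses the widely used unbiased pass@k estimator from the code‑generation evaluation literature. The article does not say how many samples per problem the report drew, so `n`, `c` and the helper `benchmark_pass_at_k` are illustrative assumptions, not details from the report.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of solutions sampled for the problem
    c: number of those samples that passed all tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that k samples drawn without replacement are all failures,
    # subtracted from 1. comb() returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem (hypothetical helper)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n, so a benchmark‑level Pass@1 is simply the mean fraction of passing samples per problem.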

The technical report reveals that VibeThinker‑3B builds on Qwen2.5‑Coder‑3B and uses an upgraded Spectrum‑to‑Signal process for post‑training. Supervised fine‑tuning (SFT) is enhanced with data synthesis, quality filtering and curriculum learning, while MGPO‑style reinforcement learning is extended to multiple verifiable domains, preserving full long‑context reasoning traces with a 64K context window. Offline self‑distillation and Instruct RL further consolidate capabilities.

Training proceeds in four stages:

Two‑phase curriculum SFT: the first phase covers mathematics, programming, STEM reasoning, general dialogue and instruction following; the second phase targets higher‑difficulty, broader‑scope reasoning samples, using diverse‑exploration distillation to retain multiple solution paths.

Multi‑domain reasoning RL: MGPO is reused, applying RL sequentially to math, programming and STEM tasks while keeping the full long‑context reasoning trajectory.

Offline self‑distillation: high‑quality trajectories from the RL checkpoints are filtered and distilled into a unified student model, with a learning‑potential score prioritising correct but poorly mimicked traces.

Instruct RL: the final stage improves controllability on user‑facing prompts, employing rule‑based validators and reward models for format‑sensitive and open‑ended instruction data.
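The article describes the learning‑potential score in the offline self‑distillation stage only qualitatively, so the following is a minimal sketch under stated assumptions rather than the report's formula. It assumes a trace is "correct" when a verifier accepted it, and approximates "poorly mimicked" by a low mean per‑token log‑probability under the current student model. The `Trace` fields and the functions `learning_potential` and `select_for_distillation` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    is_correct: bool             # assumed: verdict from an answer/test verifier
    student_mean_logprob: float  # assumed: mean per-token log-prob under the student (<= 0)


def learning_potential(trace: Trace) -> float:
    """Score a trace higher when it is correct but the student assigns it low likelihood."""
    if not trace.is_correct:
        return 0.0  # incorrect traces carry no distillation value in this sketch
    return -trace.student_mean_logprob  # lower likelihood -> larger score


def select_for_distillation(traces: list[Trace], budget: int) -> list[Trace]:
    """Keep the `budget` correct traces with the highest learning potential."""
    correct = [t for t in traces if t.is_correct]
    return sorted(correct, key=learning_potential, reverse=True)[:budget]
```

Under these assumptions, a correct trace that the student already reproduces confidently ranks below a correct trace it still struggles to imitate, which matches the prioritisation described above.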

A Claim‑Level Reliability (CLR) scaling strategy further boosts performance, raising AIME 26 from 94.3 to 97.1, HMMT 25 from 89.3 to 95.4, and BruMO 25 to 99.2.

AI researcher Sebastian Raschka summarized the report’s key points. The full technical report ("VibeThinker‑3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models", arXiv https://arxiv.org/pdf/2606.16140) and the model weights (HuggingFace https://huggingface.co/WeiboAI/VibeThinker-3B) are publicly available.

The authors acknowledge that the model’s scope is limited; it underperforms in domains requiring broad general knowledge.

They propose a “parameter‑compression coverage hypothesis”: different abilities depend on parameters in distinct ways. Verifiable reasoning is highly compressible, relying on multi‑step inference, constraint satisfaction, self‑correction and answer verification, whereas open‑domain knowledge and long‑tail understanding demand larger parameter counts. VentureBeat quoted the report, noting that the work “reveals a partial decoupling between reasoning ability and factual knowledge, suggesting that reasoning can be compressed more effectively than previously imagined,” with implications for model design, deployment cost and AI accessibility.

Overall, the goal is not to replace large models but to explore the true boundaries of small models along specific ability dimensions, and to show that compact language models can reach frontier‑level performance on tasks with clear feedback and verification signals. This offers a research direction complementary to the traditional scale‑up paradigm.
