Can a 3B Model Rival Claude Opus 4.5? Benchmark Gaps or Aggressive Post‑Training?

VibeThinker‑3B, a 3‑billion‑parameter language model built on Qwen2.5‑Coder‑3B, scores within the range of 671 B‑parameter models on benchmarks such as LiveCodeBench, AIME 26, IMO‑AnswerBench and GPQA. The gains come from two‑stage SFT, multi‑domain reinforcement learning, offline self‑distillation and a claim‑reliability (CLR) evaluator applied at inference time.

Machine Learning Algorithms & Natural Language Processing

Jun 18, 2026

Model and Benchmark Performance

VibeThinker‑3B is a 3 B‑parameter language model built on Qwen2.5‑Coder‑3B. After two‑stage supervised fine‑tuning, long‑context reinforcement learning and offline self‑distillation, it reaches scores comparable to several frontier models on multiple reasoning benchmarks:

LiveCodeBench v6: 80.2 (Claude Opus 4.5 84.8).

AIME 26: 94.3 raw, 97.1 with claim‑reliability (CLR) scoring; the raw score already edges past DeepSeek V3.2 (671 B, 94.2).

IMO‑AnswerBench: 76.4 raw, 80.6 with CLR (DeepSeek V3.2 78.3, GLM‑5 82.5, Kimi K2.5 81.8).

GPQA‑Diamond: 70.2 raw, 72.9 with CLR (still below the strongest flagship models).

IFEval: 93.4.

HMMT 25 and BruMO 25 improve to 95.4 and 99.2 respectively after CLR.

Parameter Efficiency

These results place a 3 B model within the score band of models with hundreds of billions of parameters on tasks with explicit verification signals such as mathematics, code and STEM reasoning.

Post‑Training Pipeline

Supervised fine‑tuning (SFT) consists of two curriculum stages. The first stage covers broad domains (math, code, STEM reasoning, general dialogue, instruction following). The second stage focuses on long‑context samples longer than 5 K tokens. Each long sample is generated by VibeThinker‑1.5B with eight independent samplings; samples with a relative simplicity score below 0.75 are discarded.

Multi‑domain reinforcement learning uses MGPO. For each question the empirical correctness probability p(q) is estimated; questions with p(q) ≈ 0.5 receive higher weight because they lie near the decision boundary.
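The weighting idea can be illustrated with a minimal sketch. The article only states that questions with p(q) near 0.5 are weighted more heavily; the specific form below (an exponential of the KL divergence from a fair coin) and the sharpness parameter `lam` are assumptions chosen for illustration, not the confirmed MGPO formula.

```python
import math


def bernoulli_kl_to_half(p: float, eps: float = 1e-6) -> float:
    """KL divergence (in nats) between Bernoulli(p) and Bernoulli(0.5)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp so log() never sees 0
    return p * math.log(p / 0.5) + (1.0 - p) * math.log((1.0 - p) / 0.5)


def question_weight(num_correct: int, num_rollouts: int, lam: float = 4.0) -> float:
    """Weight in (0, 1] that peaks when the empirical pass rate is 0.5.

    `lam` is a hypothetical sharpness knob: larger values concentrate
    training signal more tightly around p(q) = 0.5.
    """
    p = num_correct / num_rollouts  # empirical correctness probability p(q)
    return math.exp(-lam * bernoulli_kl_to_half(p))


if __name__ == "__main__":
    for correct in (0, 2, 4, 6, 8):
        print(correct, round(question_weight(correct, 8), 4))
```

Under this sketch a question solved in 4 of 8 rollouts gets the maximum weight, while questions the model always solves or always fails contribute little.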

Long‑context RL includes a 64 K context phase (Long2Short Math RL) that reduces early trajectory truncation, and a length‑aware reward shift that reallocates reward within correct trajectories, rewarding shorter correct answers more heavily while keeping the total shift zero.
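A zero‑sum length shift of this kind might look like the sketch below. The article specifies only that reward moves toward shorter correct trajectories while the total shift stays zero; the linear form, the normalization by the length range, and the `alpha` scale are assumptions made for illustration.

```python
from statistics import mean


def length_aware_shift(rewards, lengths, correct, alpha=0.1):
    """Return a copy of `rewards` with a zero-sum shift among correct trajectories.

    Shorter-than-average correct trajectories gain reward and longer ones lose
    the same total amount. Incorrect trajectories are left untouched.
    `alpha` is a hypothetical scale for the largest possible shift.
    """
    idx = [i for i, ok in enumerate(correct) if ok]
    shifted = list(rewards)
    if len(idx) < 2:
        return shifted  # nothing to reallocate between
    correct_lengths = [lengths[i] for i in idx]
    spread = max(correct_lengths) - min(correct_lengths)
    if spread == 0:
        return shifted  # all correct trajectories have equal length
    mean_len = mean(correct_lengths)
    for i in idx:
        # Deviations from the mean sum to zero, so the shifts do too.
        shifted[i] += alpha * (mean_len - lengths[i]) / spread
    return shifted


if __name__ == "__main__":
    print(length_aware_shift([1.0, 1.0, 1.0, 0.0], [100, 200, 300, 50],
                             [True, True, True, False]))
```

Because the shift is centered on the mean length of the correct group, the average reward for a correct answer is unchanged; only its distribution across lengths moves.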

Offline self‑distillation employs diversity‑exploration distillation to retain multiple valid solution paths. Checkpoints are selected based on Pass@K performance, favoring models that generate more effective solutions. Learning‑potential scoring evaluates each correct trajectory by its length‑normalized negative log‑likelihood; higher scores indicate trajectories the student has not yet mastered, increasing their distillation weight.
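The learning‑potential score described above reduces to a short computation over per‑token log‑probabilities. In the sketch below, the score itself follows the article's definition (length‑normalized negative log‑likelihood); turning scores into distillation weights with a softmax, and the `temperature` parameter, are assumptions for illustration.

```python
import math


def learning_potential(token_logprobs):
    """Length-normalized negative log-likelihood of one correct trajectory.

    `token_logprobs` holds the student's log-probability for each token of
    the trajectory. Higher values mean the student finds the path less likely.
    """
    if not token_logprobs:
        raise ValueError("trajectory must contain at least one token")
    return -sum(token_logprobs) / len(token_logprobs)


def distillation_weights(trajectories, temperature=1.0):
    """Softmax over learning-potential scores (a hypothetical weighting rule)."""
    scores = [learning_potential(t) for t in trajectories]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    mastered = [-0.1, -0.1]            # student already assigns high probability
    unfamiliar = [-2.0, -1.0, -3.0]    # student finds this path unlikely
    print(distillation_weights([mastered, unfamiliar]))
```

Normalizing by length keeps long trajectories from dominating simply because they contain more tokens.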

Verifiable Reasoning (CLR)

During inference the model generates 32 candidate reasoning trajectories per query, extracts the final answer and five decision‑related statements, and validates them internally. Each trajectory receives a reliability score; erroneous statements incur a nonlinear penalty. Answers are clustered by equivalence, and the summed reliability scores of trajectories in each cluster determine the final answer. CLR raises scores on answer‑determinable tasks (AIME 26, HMMT 25, BruMO 25, IMO‑AnswerBench).
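The aggregation step can be sketched as reliability‑weighted voting. The article does not give the penalty function or the equivalence test, so the geometric penalty (`penalty_base ** errors`) and the `canonical` helper below are hypothetical stand‑ins; statement verification itself is assumed to have already produced a pass/fail verdict per statement.

```python
from collections import defaultdict
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class Trajectory:
    answer: str                # final answer extracted from the trajectory
    statements_ok: list        # verdicts (bool) for the extracted statements


def canonical(answer: str) -> str:
    """Hypothetical equivalence key: exact rational value if parseable, else text."""
    text = answer.strip().lower()
    try:
        return str(Fraction(text))  # "0.5" and "1/2" both map to "1/2"
    except (ValueError, ZeroDivisionError):
        return text


def reliability(statements_ok, penalty_base=0.5):
    """Nonlinear penalty: each failed statement multiplies the score by penalty_base."""
    errors = sum(1 for ok in statements_ok if not ok)
    return penalty_base ** errors


def select_answer(trajectories):
    """Sum reliability per answer cluster and return the highest-scoring cluster."""
    totals = defaultdict(float)
    for t in trajectories:
        totals[canonical(t.answer)] += reliability(t.statements_ok)
    return max(totals, key=totals.get)


if __name__ == "__main__":
    shaky = [Trajectory("7", [True, True, False, False, False]) for _ in range(3)]
    solid = [Trajectory("1/2", [True] * 5), Trajectory("0.5", [True] * 5)]
    print(select_answer(shaky + solid))
```

In this example plain majority voting would pick "7" (three votes against two), while the reliability‑weighted sum favors the two trajectories whose statements all check out.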

Ability Boundaries

VibeThinker‑3B excels on math, code and STEM tasks where feedback is explicit, which is consistent with the “parameter compression‑coverage hypothesis” that a reusable reasoning core can be packed into a 3 B model. On knowledge‑intensive open‑domain tasks its performance lags behind larger models, indicating parameter‑coverage limits. In recent LeetCode contests, 123 of 128 first submissions passed (96.1 % pass rate). This suggests code generalization beyond static benchmarks, although the model is still far from a universal programming agent.

Resources

Paper: https://arxiv.org/pdf/2606.16140

Code repository: https://github.com/WeiboAI/VibeThinker
