DeepSeek V4 Benchmark Leak Fuels Talk of a New Coding King

A leaked SWE‑Bench score of 83.7% for DeepSeek V4 sparked claims that it outperforms Claude Opus 4.5 and GPT‑5.2, but the figures were later debunked as fabricated, while official signals point to a 1‑million‑token context model and a mid‑February 2026 release.

AI Insight Log

Earlier today, a screenshot circulating on X claimed DeepSeek V4 achieved an 83.7% score on the SWE‑Bench Verified coding test, allegedly surpassing Claude Opus 4.5 (80.9%) and GPT‑5.2 High (80.0%). The post also listed other leaked metrics, such as ~90% on HumanEval, >80% across SWE‑Bench domains, a 1M‑token context window, and a cost 20‑40× lower than OpenAI's models.

Netizens quickly questioned the authenticity. The image showed an AIME 2026 score of 99.4%, a value mathematically impossible according to the official AIME scoring system. Epoch AI director Jaime Sevilla refuted the claim, stating the referenced FrontierMath Tier 4 scores are from a dataset only accessible to OpenAI and the DeepSeek team, and that DeepSeek V4 has not been evaluated.

“In the official scoring system, the highest AIME score is 119/120 (99.2%) or 120/120 (100%). 99.4% cannot occur.”
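The impossibility is simple arithmetic: on the 120‑point scale the quote assumes, every achievable percentage is a multiple of 1/120, so the scores jump straight from 99.2% (119/120) to 100% (120/120). A minimal sketch verifying that the leaked 99.4% falls in that gap:

```python
# Enumerate every percentage achievable on a 120-point scale
# (the granularity assumed in the quote above), rounded to one decimal.
achievable = {round(100 * k / 120, 1) for k in range(121)}

# The two highest achievable values bracket the leaked figure.
assert 99.2 in achievable       # 119/120
assert 100.0 in achievable      # 120/120
assert 99.4 not in achievable   # the leaked score cannot occur
```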

Despite the dubious leak, DeepSeek's official app announced a staged (gray‑release) rollout test of a new long‑text model supporting a 1‑million‑token context, aligning with the "repo‑level reasoning" capability mentioned in the leaked data.

The Overchat.ai page, likely a pre‑release SEO placeholder, described three core features of the upcoming model:

Repo‑level Reasoning: understands an entire project's architecture, not just individual code files.

Engram Conditional Memory System: aims for near‑infinite context retrieval when handling 1M‑token inputs.

Mixture‑of‑Experts (MoE) architecture upgrade: builds on V3's efficiency while specializing for coding tasks.

Both the leaked screenshot and Overchat’s description indicate a planned release around mid‑February 2026 (approximately February 17, coinciding with the Lunar New Year). The article concludes that the 83.7% SWE‑Bench figure is likely false, the 1M‑token model test is genuine, and DeepSeek is poised to launch a new model during the holiday period.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

DeepSeek · large language model · AI Industry · SWE-bench · AI benchmarking · Context Length · Repo-level Reasoning
Written by AI Insight Log

Focused on sharing: AI programming | Agents | Tools