What DeepSeek’s Secret “Model1” Reveals About the Upcoming V4 LLM

By analyzing recent commits to DeepSeek's flashmla repository, this article argues that the mysterious Model1 likely corresponds to DeepSeek‑V4, detailing architectural shifts to a 512‑dimensional head, full support for NVIDIA Blackwell GPUs, token‑level sparse MLA, and new mechanisms such as Value Vector Position Awareness and Engram.

Data Party THU

On January 20, 2025, DeepSeek released the DeepSeek‑R1 model, sparking a wave of interest. A year later, the company's flashmla codebase received several updates, and a new entry named Model1 attracted attention as a possible codename for the upcoming flagship model.


Using Gemini to analyze the recent commits in the flashmla repository, the author extracted technical details that suggest Model1 is the internal development name for DeepSeek‑V4.

1. Core Architecture: Return to 512‑Dimensional Heads

The macro DISPATCH_HEAD_DIM in csrc/api/common.h shows two branches:

V32 (DeepSeek‑V3.2) keeps d_qk = 576, reflecting the asymmetric MLA design (128‑dim RoPE + 448‑dim latent).

Model1 switches to a standard 512‑dimensional head, indicating a "normalization" step for DeepSeek‑V4, likely to align with the Blackwell (SM100) architecture or to improve latent compression.
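The two branches can be sketched in a few lines of Python. This is an illustration of the dispatch idea only, not the repository's actual C++ macro; the function and kernel names here are hypothetical.

```python
# Illustrative sketch of head-dimension dispatch (hypothetical names,
# standing in for the C++ DISPATCH_HEAD_DIM macro described above).
HEAD_DIM_KERNELS = {
    576: "mla_decode_d576",  # V3.2: asymmetric MLA, 128-dim RoPE + 448-dim latent
    512: "mla_decode_d512",  # Model1: standard 512-dim head
}

def dispatch_head_dim(d_qk: int) -> str:
    """Pick the kernel variant for a given query/key head dimension."""
    try:
        return HEAD_DIM_KERNELS[d_qk]
    except KeyError:
        raise ValueError(f"unsupported head dim: {d_qk}")

# Sanity check: the V3.2 asymmetric split adds up to its 576-dim head.
assert 128 + 448 == 576
```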

2. Full Support for NVIDIA Blackwell (SM100) Architecture

The repository contains several Blackwell‑specific optimizations:

In api.cpp a new function FMHACutlassSM100FwdRun targets the SM100 instruction set.

The README states that running on B200 GPUs requires CUDA 12.9.

Performance figures: on B200 the not‑yet‑fully‑optimized Sparse MLA reaches ~350 TFlops, while Dense MLA on H800 (SM90a) achieves ~660 TFlops.

3. Introduction of Token‑Level Sparse MLA

Both test_flash_mla_sparse_decoding.py and test_flash_mla_dense_decoding.py appear, showing parallel sparse and dense decoding paths.
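The idea behind token‑level sparse decoding can be illustrated with a minimal sketch: score every cached token, keep only the top‑k, and run softmax attention over that subset, so per‑step cost scales with k rather than context length. This is a generic toy illustration of the technique, not FlashMLA's actual kernel logic.

```python
import math

def sparse_decode_step(q, keys, values, k=2):
    """Toy token-level sparse attention for one decode step.

    q: query vector; keys/values: per-token cached vectors.
    Only the k highest-scoring tokens enter the softmax, so compute
    and memory traffic depend on k, not on total context length.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
    # Select the indices of the top-k scores (the "sparse" token set).
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax over the selected scores only.
    m = max(scores[i] for i in topk)
    exps = {i: math.exp(scores[i] - m) for i in topk}
    z = sum(exps.values())
    weights = {i: e / z for i, e in exps.items()}
    # Weighted sum of the selected value vectors.
    dim = len(values[0])
    return [sum(weights[i] * values[i][d] for i in topk) for d in range(dim)]
```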

Sparse operators store the KV cache in FP8 but perform matrix multiplication in bfloat16, preserving accuracy while reducing memory for very long contexts.
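The mixed‑precision scheme can be approximated in a few lines: cached values are rounded to an FP8‑like (E4M3‑style) grid when stored, then widened again before the dot product. This sketch models only the mantissa rounding, ignoring E4M3's exponent range and special values, and is not the repository's actual quantizer.

```python
import math

def fp8_round(x: float) -> float:
    """Round x to roughly E4M3 precision (3 explicit mantissa bits).

    Illustrative only: real FP8 also clamps the exponent range and
    handles zeros/NaNs specially; here we keep just mantissa rounding.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)    # x = m * 2**e, with 0.5 <= |m| < 1
    m = round(m * 16) / 16  # keep 4 significant bits (~3 mantissa bits)
    return math.ldexp(m, e)

def cached_dot(q, kv):
    """q stays in wide precision; the cached kv vector is FP8-rounded
    before the dot product, mimicking FP8 storage with bf16 compute."""
    return sum(a * fp8_round(b) for a, b in zip(q, kv))
```

The rounding error is at most one part in sixteen of each stored value, which is why the accuracy loss stays small while the cache footprint halves relative to bf16.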

4. New Mechanisms: Value Vector Position Awareness (VVPA) and Engram

VVPA aims to mitigate positional information decay in long‑text MLA.

Engram, referenced in community discussions, appears to be a novel distributed storage or KV‑compression technique designed for the high‑throughput demands of Model1.


Because the code marks Model1 as a branch parallel to V32 rather than a patch on top of it, Gemini concludes that it represents a new architecture version, logically the next step after V3.2: DeepSeek‑V4.

Readers are invited to consider whether Model1 truly is the rumored DeepSeek V4.

Tags: LLM, DeepSeek, GPU Optimization, DeepSeek V4, MODEL1, Sparse MLA
Written by Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.