Why DeepSeek‑V4 Took Twice as Long: Inside the Training‑Stability Challenges and Engineering Hacks
The DeepSeek‑V4 technical report reveals that the model’s doubled training time stems from massive token and parameter scaling, severe training‑stability issues in MoE layers, and a suite of engineering solutions—including Anticipatory Routing, SwiGLU Clamping, specialist expert training, and a custom sandbox cluster—while also exposing high hallucination rates despite impressive benchmark performance.
DeepSeek released a ~60‑page technical report for V4 after a 484‑day gap, roughly twice the interval that preceded the V3 release, and the report details the model’s architecture, pre‑training, and post‑training pipelines.
Scale jump: V3 was trained on 14.8 T tokens; V4‑Flash used 32 T and V4‑Pro 33 T. Parameter counts also rose dramatically to 1.6 T for Pro and 284 B for Flash. The authors explicitly label the resulting “training stability challenge” as a core difficulty.
Stability problem: Outlier values in MoE layers get amplified by the routing mechanism, causing loss spikes. The team mitigates this with two tricks: Anticipatory Routing, which decouples backbone and router updates by using earlier‑stage parameters, and SwiGLU Clamping, which restricts SwiGLU activations to the range [‑10, 10]. The report states these measures are “significantly effective,” but admits the underlying mechanism remains an open question.
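The report gives only the clamp range, not the implementation. A minimal NumPy sketch of what SwiGLU clamping might look like (function and matrix names are ours, not DeepSeek's):

```python
import numpy as np

def swish(x):
    # Swish / SiLU gate: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_clamped(x, W_gate, W_up, clamp=10.0):
    """SwiGLU with the hidden activation clamped to [-clamp, clamp].

    Clamping bounds the outlier values that the MoE router would
    otherwise amplify, which is the stated motivation for [-10, 10].
    """
    gate = swish(x @ W_gate)   # gating branch
    up = x @ W_up              # linear branch
    h = gate * up              # elementwise product can spike; clamp it
    return np.clip(h, -clamp, clamp)
```

The clamp is a blunt instrument: it caps the forward activation, trading a little expressivity at the tails for bounded gradients.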
Hardware vs. software speculation: Although the report does not name a specific platform, observers suspect that hardware limitations (chip, interconnect, cooling, drivers, compiler stack) may be a major factor, echoing similar “training‑stability” complaints from other labs such as xAI’s experience with Nvidia’s latest chips.
Agent training system: Rather than bluntly transferring a dialogue model to agent tasks, DeepSeek injects massive amounts of agentic data during the mid‑training phase. They introduce a “Specialist Training” regime that first trains separate experts (math, code, agent, instruction‑following) and then merges them via Multi‑teacher On‑Policy Distillation (OPD). To make OPD feasible, teacher logits are not cached; only the final hidden states are stored, and logits are reconstructed on‑the‑fly using a custom TileLang kernel, with samples sorted by teacher index.
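The actual reconstruction runs in a custom TileLang kernel; the underlying math is just a per‑teacher output projection. A hedged NumPy sketch of the caching trade (names and shapes are our assumptions):

```python
import numpy as np

def reconstruct_logits(cached_hidden, lm_heads, teacher_ids):
    """Rebuild each teacher's logits from cached final hidden states.

    cached_hidden: (N, d) final-layer hidden states saved at generation time
    lm_heads:      dict teacher_id -> (vocab, d) output projection matrix
    teacher_ids:   (N,) index of the teacher that produced each sample

    Caching (N, d) states instead of (N, vocab) logits cuts storage by a
    factor of vocab/d; the projection is redone on the fly. Grouping
    samples by teacher (the report mentions sorting by teacher index)
    means each projection matrix is loaded once per group, not per sample.
    """
    vocab = next(iter(lm_heads.values())).shape[0]
    logits = np.empty((cached_hidden.shape[0], vocab))
    for tid in np.unique(teacher_ids):
        mask = teacher_ids == tid
        logits[mask] = cached_hidden[mask] @ lm_heads[tid].T
    return logits
```

With a 100K+ vocabulary and a hidden size in the low thousands, the storage saving is an order of magnitude or more.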
Reward model upgrade: The traditional scalar reward model is replaced by a Generative Reward Model (GRM) that produces a detailed evaluation report based on a predefined rubric. The GRM itself is jointly optimized with the actor via RL, so the same network learns to generate and assess outputs.
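The report does not publish the rubric format or how the evaluation report is scored. A toy sketch of rubric‑conditioned generative reward, where the scalar is regex‑parsed from the GRM's free‑text report (the rubric, prompt layout, and `FINAL SCORE` convention are all our assumptions):

```python
import re

# Illustrative rubric; the real one is not in the report.
RUBRIC = """Evaluate the answer on each criterion (0-10):
1. Correctness
2. Completeness
3. Safety
End your report with a line: FINAL SCORE: <0-10>"""

def grm_prompt(question, answer):
    # Prompt fed to the generative reward model: rubric plus sample to judge.
    return f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}\nReport:"

def parse_reward(report):
    """Extract a scalar reward in [0, 1] from the GRM's evaluation report."""
    m = re.search(r"FINAL SCORE:\s*(\d+(?:\.\d+)?)", report)
    return float(m.group(1)) / 10.0 if m else 0.0
```

Unlike a scalar head, the intermediate report is human‑auditable, which is the usual argument for generative over discriminative reward models.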
Infrastructure breakthroughs: DeepSeek built the DSec sandbox cluster, featuring a 3FS distributed file system and hundreds of thousands of concurrent sandbox instances, enabling massive parallel code‑execution training. Their MegaMoE design fuses communication and computation in a single pipeline kernel, achieving 1.5‑1.73× speed‑ups for general workloads and up to 1.96× for latency‑sensitive RL rollouts.
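DSec's internals are not public, but the pattern it scales, many isolated interpreter processes executing candidate code in parallel, can be sketched with only the standard library (a real sandbox adds filesystem and network isolation that a bare subprocess lacks):

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_in_sandbox(code, timeout=5):
    """Execute one code snippet in a fresh subprocess; return (ok, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode == 0, proc.stdout.strip()

def run_batch(snippets, workers=8):
    # Fan candidate programs out across a pool of sandbox workers.
    # DSec reportedly runs hundreds of thousands of such instances.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_in_sandbox, snippets))
```

The pass/fail signal from each sandbox run is exactly what code‑execution RL training consumes as reward.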
Tool‑calling DSL: A custom XML‑like domain‑specific language was created for tool calls, raising success rates from “luck‑based” to “industrial‑grade reliability.”
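DeepSeek has not published the DSL's grammar; a minimal sketch of why XML‑style markup beats free‑form JSON‑in‑prose is that malformed calls fail loudly at parse time instead of silently misfiring (tag and attribute names below are illustrative, not DeepSeek's):

```python
import xml.etree.ElementTree as ET

def parse_tool_call(text):
    """Parse one XML-style tool call, e.g.
    <tool_call name="search"><arg key="query">weather</arg></tool_call>

    Returns (tool_name, args). Raises on malformed input, so a bad call
    can be rejected or retried rather than executed on a guess.
    """
    root = ET.fromstring(text)  # raises ParseError on broken markup
    if root.tag != "tool_call":
        raise ValueError(f"unexpected tag: {root.tag}")
    args = {a.get("key"): (a.text or "") for a in root.findall("arg")}
    return root.get("name"), args
```

Strict parsing plus constrained decoding against the DSL is the standard route from "luck‑based" to reliable tool calls.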
Reasoning modes: V4 supports a fast Non‑think mode for simple tool selection and higher‑cost High/Max modes for long‑document reasoning, code generation, and complex bug fixing. The new Interleaved Thinking retains full cross‑turn reasoning history in tool‑calling scenarios, unlike V3.2 which discards it.
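The mechanical difference between the two behaviors is just what gets kept when the next turn's context is assembled. A toy sketch, with field names of our own choosing:

```python
def build_history(turns, interleaved=True):
    """Assemble model context from prior turns.

    Each turn is a dict with 'role', 'content', and optional 'thinking'.
    With interleaved=True (V4-style Interleaved Thinking), reasoning
    traces from earlier tool-calling turns stay in context; with
    False (V3.2-style), they are stripped before the next turn.
    """
    history = []
    for t in turns:
        msg = {"role": t["role"], "content": t["content"]}
        if interleaved and t.get("thinking"):
            msg["thinking"] = t["thinking"]
        history.append(msg)
    return history
```

Retaining the traces costs context length but lets a later tool call reuse the plan formed before an earlier one.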
Benchmark results: Running the Intelligence Index benchmark on V4‑Pro cost only $1,071, less than a quarter of Claude Opus 4.7’s $4,811. On the GDPval‑AA agent benchmark, V4‑Pro‑Max scored 1,554, leading all open‑source models. However, the Artificial Analysis report flags a 94 % hallucination rate on the AA‑Omniscience test, highlighting a trade‑off between inference power and factual accuracy.
Despite the stability hiccups and the high hallucination rate, the report earns praise for its unprecedented transparency: it explicitly acknowledges hardware pain points, publishes concrete mitigation patches, and shows how massive sandbox‑level engineering can push agent capabilities forward.
In summary, DeepSeek‑V4 demonstrates that when architectural perfection is lacking, aggressive engineering (efficient MoE kernels, specialist expert training, custom sandbox clusters, and generative reward models) can close the gap, albeit at the cost of increased hallucinations.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
