DeepSeek V Series: Technical Overview of Scaling Laws, Grouped Query Attention, and Mixture‑of‑Experts
The article reviews DeepSeek’s V‑series papers, explaining how scaling‑law insights, Grouped Query Attention, a depth‑first design, loss‑free load balancing, multi‑token prediction and Multi‑Head Latent Attention together enable economical mixture‑of‑experts LLMs that rival closed‑source models while cutting compute and hardware costs.
DeepSeek’s latest V‑series models have attracted wide attention. This article offers an accessible, non‑specialist overview of the four papers that introduce the series.
What you will gain
Quickly grasp the technical logic that runs through the four papers.
Understand how DeepSeek challenges the dominant “spend‑more‑to‑win” narrative of closed‑source LLMs.
See a critical view of the current AI research and industry practices.
The four papers are freely available for download:
2401 – DeepSeek LLM: Scaling Open‑Source Language Models with Longtermism
2405 – DeepSeek‑V2: A Strong, Economical, and Efficient Mixture‑of‑Experts Language Model
2408 – Fire‑Flyer AI‑HPC: A Cost‑Effective Software‑Hardware Co‑Design for Deep Learning
2412 – DeepSeek‑V3 Technical Report
Although the author is not a specialist, the article distills the core ideas of each paper.
Why DeepSeek? The author asks how DeepSeek achieves top‑tier performance at a fraction of the cost incurred by OpenAI, Google, or Microsoft. The answer lies in revisiting the scaling law – the empirical observation that larger models (more parameters and data) tend to perform better – and questioning its universal applicability.
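To make the scaling‑law idea concrete, here is a toy power‑law loss curve of the form studied in scaling‑law work. The constants below are invented for illustration, not fitted to any DeepSeek model:

```python
# Toy illustration of a scaling law: pretraining loss falls as a power law
# in model size N, L(N) = a * N**(-alpha) + irreducible.
# The constants are made up for illustration only.
a, alpha, irreducible = 10.0, 0.07, 1.7

def predicted_loss(n_params: float) -> float:
    """Predicted pretraining loss for a model with n_params parameters."""
    return a * n_params ** (-alpha) + irreducible

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> loss {predicted_loss(n):.3f}")
```

The curve flattens as it approaches the irreducible term, which is exactly why "just scale up" yields diminishing returns and why data quality and architecture become the cheaper levers.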
The first paper (2401) introduces the concept of Longtermism (a long‑term perspective on model scaling) and discusses the limitations of closed‑source products that rely on massive compute and annotation budgets.
The second paper (2405) focuses on three technical improvements:
Higher‑quality data sets, even if not dramatically larger.
Grouped Query Attention (GQA) to reduce computational complexity.
Depth‑First Design (DFD) that increases model depth, improving reasoning and code‑generation tasks.
GQA works like a shared library catalog: several query heads consult the same key‑value “index,” reducing the memory and compute needed to locate information.
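A minimal sketch of the mechanism behind that analogy: in GQA, each group of query heads shares one key/value head, so the KV cache shrinks by the group factor. Shapes and names below are illustrative, not taken from DeepSeek's code:

```python
import numpy as np

# Grouped Query Attention sketch: n_q_heads query heads attend using only
# n_q_heads // n_groups key/value heads, shrinking the KV cache.
def gqa(q, k, v, n_groups):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d),
    where n_kv_heads == n_q_heads // n_groups."""
    n_q_heads, seq, d = q.shape
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // n_groups                      # map query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        out[h] = weights @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))   # 8 query heads
k = rng.normal(size=(2, 4, 16))   # only 2 KV heads -> 4x smaller KV cache
v = rng.normal(size=(2, 4, 16))
print(gqa(q, k, v, n_groups=4).shape)  # (8, 4, 16)
```

With 8 query heads but only 2 KV heads, the cached keys and values are a quarter of the size of standard multi‑head attention.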
The paper also mentions SFT and DPO techniques that improve dialogue safety and alignment.
The third paper (2408) is not about a new model architecture but about a cost‑effective hardware‑software co‑design (including the HFReduce communication library) that lets demanding training software run efficiently on mid‑range GPUs, cutting both cost and energy consumption by roughly half.
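The collective at the heart of such multi‑GPU training is the all‑reduce, which libraries like HFReduce optimize. The sketch below is a plain sum‑then‑broadcast simulation of what an all‑reduce computes, not HFReduce's actual algorithm:

```python
import numpy as np

# Toy all-reduce: every worker contributes its gradient, and every worker
# receives the element-wise sum. Real libraries (NCCL, HFReduce) compute the
# same result with far more efficient communication patterns.
def all_reduce(worker_grads):
    """Sum gradients from every worker and hand the total back to each."""
    total = np.sum(worker_grads, axis=0)
    return [total.copy() for _ in worker_grads]

grads = [np.full(4, i, dtype=float) for i in range(3)]  # 3 workers: 0s, 1s, 2s
reduced = all_reduce(grads)
print(reduced[0])  # each worker now holds [3. 3. 3. 3.] (0 + 1 + 2)
```

The engineering value of a co-designed stack lies in performing this exchange with minimal bandwidth and latency on the hardware actually available.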
The fourth paper (2412) presents DeepSeek‑V3, which builds on the previous innovations and adds two new mechanisms:
Loss‑Free Load Balancing Strategy (LFBS) – dynamically balances expert workload without adding auxiliary loss terms to the training objective.
Multi‑Token Prediction (MTP) – enables the model to anticipate several future steps, improving overall reasoning.
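A sketch of the first mechanism, in the spirit of DeepSeek‑V3's auxiliary‑loss‑free balancing: each expert carries a bias that is added to its routing score only when selecting the top‑k experts, and the bias is nudged down for overloaded experts and up for underloaded ones. All constants and the "popularity" skew below are illustrative:

```python
import numpy as np

# Loss-free load balancing sketch: adjust per-expert routing biases by sign
# feedback instead of adding an auxiliary balancing loss to training.
rng = np.random.default_rng(0)
n_tokens, n_experts, top_k, gamma = 1000, 8, 2, 0.05

popularity = np.linspace(-1.0, 1.0, n_experts)            # skew: some experts favored
scores = rng.normal(size=(n_tokens, n_experts)) + popularity
bias = np.zeros(n_experts)

def expert_load(b):
    """Tokens routed to each expert when top-k is taken over scores + bias."""
    choices = np.argsort(scores + b, axis=1)[:, -top_k:]
    return np.bincount(choices.ravel(), minlength=n_experts)

target = n_tokens * top_k / n_experts                     # uniform target per expert
for _ in range(100):
    bias -= gamma * np.sign(expert_load(bias) - target)   # nudge toward balance

print(expert_load(np.zeros(n_experts)))  # skewed load without bias
print(expert_load(bias))                 # load pushed toward the uniform target
```

Because the bias only affects which experts are selected, not the gradient of the task loss, balancing no longer competes with the model's primary objective.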
DeepSeek‑V3 also relies on Multi‑Head Latent Attention (MLA), first presented in DeepSeek‑V2 and described as a “memory palace” that stores token representations in labeled “rooms,” allowing efficient retrieval and compression compared with traditional flat KV caches.
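The core idea behind MLA can be sketched as low‑rank KV compression: instead of caching full keys and values per token, cache one small latent vector and expand it into K and V on demand. Matrix names and sizes below are illustrative, and DeepSeek's actual formulation has more moving parts (e.g. decoupled rotary embeddings):

```python
import numpy as np

# MLA-style KV compression sketch: cache a small latent per token, then
# reconstruct keys and values from it with up-projections when attending.
rng = np.random.default_rng(0)
d_model, d_latent, seq = 64, 8, 16

W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)

x = rng.normal(size=(seq, d_model))   # token representations
latent_cache = x @ W_down             # only this (seq, d_latent) tensor is cached

k = latent_cache @ W_up_k             # keys reconstructed on demand
v = latent_cache @ W_up_v             # values reconstructed on demand

full_cache = 2 * seq * d_model        # floats for a plain K + V cache
mla_cache = seq * d_latent
print(f"cache: {mla_cache} vs {full_cache} floats "
      f"({full_cache // mla_cache}x smaller)")
```

The latent plays the role of the “room” in the memory‑palace analogy: one compact label per token from which the full keys and values can be regenerated.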
Across the series, DeepSeek demonstrates that strong performance does not require the highest compute budget; instead, careful data curation, modular expert routing, and novel attention mechanisms can achieve economical yet powerful LLMs.
Tencent Cloud Developer