
Technical Overview of DeepSeek Series Models and Innovations

The DeepSeek series introduces a refined Mixture‑of‑Experts architecture built on fine‑grained expert partitioning, shared experts, and learnable load balancing. It also contributes Group Relative Policy Optimization, Multi‑Head Latent Attention, Multi‑Token Prediction, mixed‑precision FP8 training, and the R1/R1‑Zero models, which combine long chain‑of‑thought (Long‑CoT) reasoning, reinforcement‑learning pipelines, and distillation to reach OpenAI‑comparable performance at lower cost.

Tencent Technical Engineering

1. DeepSeek Series Model Technical Innovations

DeepSeek gained widespread attention during the 2025 Spring Festival with DeepSeek‑V3 and DeepSeek‑R1, which introduce several innovations in Mixture‑of‑Experts (MoE) architecture, training efficiency, and inference speed.

1.1 DeepSeek MoE Architecture

Traditional MoE modules consist of multiple Feed‑Forward Network experts selected by a routing gate. DeepSeek improves this by (a) finer‑grained expert partitioning, (b) separating shared and routed experts, and (c) adding a learnable bias to the gate for dynamic load balancing.

(a) Traditional MoE – experts are activated selectively, reducing activation parameters.

(b) Fine‑grained expert division – more experts with smaller hidden dimensions, keeping total parameters constant.

(c) Shared vs. routed experts – shared experts bypass the router, reducing computation.

DeepSeek‑V3 also introduces a new load‑balancing strategy that adds a learnable bias to the gate scores, dynamically adjusting routing preferences without extra loss terms.
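The three ideas above can be combined in a toy sketch. The dimensions, expert count, and linear "experts" below are invented for illustration; the key points it shows are that shared experts bypass the router entirely, routed experts are selected top‑k, and the load‑balancing bias influences only which experts are selected, not the mixing weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, routed_experts, shared_experts, gate_w, gate_bias, k=2):
    """Toy MoE layer: shared experts always run; routed experts are
    picked top-k by (gate score + bias). The bias steers load balancing
    but is excluded from the mixing weights, as in DeepSeek-V3."""
    scores = x @ gate_w                        # one gate score per routed expert
    topk = np.argsort(scores + gate_bias)[-k:] # bias affects selection only
    weights = np.exp(scores[topk])
    weights /= weights.sum()                   # mix with the unbiased scores
    out = sum(shared(x) for shared in shared_experts)
    out += sum(w * routed_experts[i](x) for w, i in zip(weights, topk))
    return out, topk

# Tiny demo: 4 routed experts, 1 shared expert, hidden dim 8.
d = 8
experts = [(lambda W: (lambda x: x @ W))(rng.standard_normal((d, d))) for _ in range(4)]
shared = [(lambda W: (lambda x: x @ W))(rng.standard_normal((d, d)))]
gate_w = rng.standard_normal((d, 4))
bias = np.zeros(4)

x = rng.standard_normal(d)
y, chosen = moe_forward(x, experts, shared, gate_w, bias, k=2)
```

Raising the bias of an underused expert makes the router pick it more often, which is how DeepSeek‑V3 balances load without an auxiliary loss term.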

1.2 Group Relative Policy Optimization (GRPO)

GRPO is an RLHF variant that removes the separate value model, substantially reducing training cost. Instead of a learned value estimate, it computes each output's advantage from the statistics of rewards over a group of outputs sampled for the same prompt.
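The group‑relative baseline is simple to state in code. The sketch below shows the common normalized form (reward minus group mean, divided by group standard deviation); the epsilon term is a numerical‑stability detail added here for illustration.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO replaces the learned value baseline with group statistics:
    each sampled output's advantage is its reward normalized by the
    mean and standard deviation of rewards within its sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four sampled answers for one prompt, scored by a reward model or rule.
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Outputs rewarded above the group average get positive advantages and are reinforced; below‑average outputs are suppressed, with no critic network in the loop.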

1.3 Multi‑Head Latent Attention (MLA)

MLA reduces KV‑cache size by applying low‑rank decomposition to the key‑value matrices, enabling longer context or larger batch sizes with lower memory.
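A minimal sketch of the low‑rank idea, ignoring the decoupled RoPE path of the real design and using invented dimensions: only a small per‑token latent vector is cached, and keys/values are reconstructed from it by up‑projections at attention time.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent, n_tokens = 64, 8, 10

# Down-projection compresses each token's hidden state into a small latent.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
# Up-projections reconstruct keys and values from the cached latent.
W_up_k = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)

h = rng.standard_normal((n_tokens, d_model))
latent_cache = h @ W_down      # (n_tokens, d_latent): the only thing stored
k = latent_cache @ W_up_k      # keys rebuilt on the fly
v = latent_cache @ W_up_v      # values rebuilt on the fly
```

Here the cache holds `n_tokens * d_latent` numbers instead of `2 * n_tokens * d_model` for full keys and values, a 16× reduction at these toy sizes.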

1.4 Multi‑Token Prediction (MTP)

MTP trains the model to predict several future tokens at once, densifying the training signal and enabling faster (e.g., speculative) inference. DeepSeek‑V3 implements MTP with a cascaded serial structure: sequential modules each predict one additional step ahead, rather than independent parallel heads.
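The cascaded structure can be sketched as follows. This is a simplified stand‑in (the real modules include transformer blocks and RMSNorm; the `tanh` projection, dimensions, and vocabulary size here are invented): each module consumes the previous module's hidden state together with the embedding of the next known token, so the depth‑k module predicts token t+k+1.

```python
import numpy as np

rng = np.random.default_rng(2)
d, vocab = 16, 50
depth = 2   # number of extra future tokens predicted

# One projection per MTP module; the output head is shared across modules.
proj = [rng.standard_normal((2 * d, d)) / np.sqrt(2 * d) for _ in range(depth)]
head = rng.standard_normal((d, vocab)) / np.sqrt(d)

def mtp_predict(h_main, future_embs):
    """Cascaded MTP: the hidden state flows serially through the modules,
    yielding one extra next-token prediction per module."""
    h, logits = h_main, []
    for k in range(depth):
        h = np.tanh(np.concatenate([h, future_embs[k]]) @ proj[k])
        logits.append(h @ head)
    return logits

h0 = rng.standard_normal(d)            # main model's hidden state at token t
embs = [rng.standard_normal(d) for _ in range(depth)]
preds = mtp_predict(h0, embs)          # logits for tokens t+2 and t+3
```

The serial chaining is the distinguishing choice: each extra prediction is conditioned on the previous module's state, keeping the causal chain intact.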

1.5 Mixed‑Precision Framework

DeepSeek‑V3 trains with FP8 for most compute‑heavy operations while keeping precision‑sensitive parts in BF16/FP32, roughly doubling training throughput with negligible accuracy loss.
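To build intuition for what FP8 costs in precision, here is a rough simulation of an E4M3 round‑trip (the format FP8 training typically uses for weights and activations). This is an approximation written for illustration, not the actual hardware cast: it scales into range, clips to E4M3's maximum normal value of 448, and rounds the mantissa to 3 bits.

```python
import numpy as np

def quantize_e4m3(x, scale):
    """Simulated FP8 (E4M3) round-trip. frexp gives a mantissa in
    [0.5, 1); rounding it to a 1/16 grid approximates the 3 explicit
    mantissa bits of E4M3. Subnormal handling is ignored."""
    y = np.clip(x * scale, -448.0, 448.0)   # clip to E4M3 max normal
    m, e = np.frexp(y)                       # y = m * 2**e
    m = np.round(m * 16) / 16                # keep ~3 mantissa bits
    return np.ldexp(m, e) / scale            # rebuild and undo scaling

x = np.linspace(-3.0, 3.0, 11)
xq = quantize_e4m3(x, 1.0)                   # relative error stays under ~6%
```

The per‑tensor `scale` matters in practice: choosing it so values fill the representable range is what keeps the quantization error small, which is why FP8 training frameworks track scaling factors per tensor or per tile.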

2. DeepSeek R1‑Zero and R1

2.1 Overview of GPT‑4, GPT‑4o, o1, R1

R1 adopts Long‑CoT reasoning, providing transparent step‑by‑step thought processes.

2.2 Breakthroughs of R1 and R1‑Zero

Strong reasoning capability comparable to OpenAI‑o1.

Improved interpretability via Long‑CoT.

Open‑source and lower cost.

2.3 Technical Details

R1‑Zero is trained solely with reinforcement learning, using GRPO and rule‑based rewards with no supervised fine‑tuning, and spontaneously develops behaviors such as self‑reflection and re‑checking ("aha moments"). R1 adds a four‑stage training pipeline: cold‑start supervised fine‑tuning on CoT data, reasoning‑focused RL, rejection sampling plus supervised fine‑tuning, and full‑scenario RL.
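Rule‑based rewards are checkable programs rather than learned reward models. The sketch below is a hypothetical example in that spirit; the exact tags, answer format, and reward values are assumptions for illustration, not R1's actual reward specification.

```python
import re

def rule_based_reward(output, expected_answer):
    """Hypothetical rule-based reward in the R1-Zero style: a format
    reward for wrapping reasoning in <think> tags, plus an accuracy
    reward when the final boxed answer matches the reference."""
    reward = 0.0
    if re.search(r"<think>.*?</think>", output, re.DOTALL):
        reward += 0.5                          # format reward (assumed value)
    m = re.search(r"\\boxed\{(.+?)\}", output)
    if m and m.group(1).strip() == expected_answer:
        reward += 1.0                          # accuracy reward (assumed value)
    return reward

good = "<think>2 + 2 is four</think> The answer is \\boxed{4}"
```

Because such rewards are deterministic and cheap to verify, they avoid the reward‑hacking risks of a learned reward model, which is part of why pure‑RL training of R1‑Zero is feasible at scale.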

Distillation of R1’s reasoning ability to smaller dense models further boosts their performance.

References

Key papers include DeepSeek‑MoE, DeepSeek‑Math, DeepSeek‑V2, DeepSeek‑V3 technical report, and the R1 arXiv preprint.

Written by Tencent Technical Engineering, the official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.
