
DeepSeek LLM Series (V1‑V3, R1) Technical Overview and Analysis

This technical overview traces the series' evolution from the dense 67 B V1 model, through the 236 B MoE‑based V2 and the 671 B V3 with FP8 training, to the R1 series, whose R1‑Zero variant learns reasoning through reinforcement learning alone, without supervised data. It highlights innovations such as Grouped‑Query Attention, Multi‑Head Latent Attention, auxiliary‑loss‑free MoE load balancing, Multi‑Token Prediction, and knowledge distillation, and reports state‑of‑the‑art benchmark results along with open‑source reproduction projects.

Tencent Cloud Developer

This document provides a comprehensive technical overview of the DeepSeek family of large language models, covering the DeepSeek‑LLM series (V1, V2, V3), the DeepSeek‑R1 series, and associated research contributions.

DeepSeek‑LLM Series

V1 (DeepSeek‑67B) – Released Jan 5 2024. A dense model built on a LLaMA‑2‑style architecture and trained on 2 trillion bilingual tokens. Highlights include Grouped‑Query Attention (GQA) to shrink the KV cache, Pre‑Norm for training stability, RMSNorm, and SwiGLU activation. The 67 B model outperforms LLaMA‑2 70B on bilingual benchmarks.
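As a concrete illustration of the GQA idea, several query heads share one key/value head, so the KV cache shrinks by the group factor. A minimal NumPy sketch with toy sizes, not DeepSeek's actual implementation:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy causal grouped-query attention.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_q_heads // n_kv_heads query heads reads the
    same shared KV head, so only n_kv_heads K/V tensors are cached."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    for h in range(n_q_heads):
        kv = h // group                                # shared KV head index
        scores = q[h] @ k[kv].T / np.sqrt(d)           # (seq, seq)
        scores = np.where(causal, scores, -np.inf)     # causal mask
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

With `n_kv_heads == n_q_heads` this reduces to standard multi-head attention; with `n_kv_heads == 1` it is multi-query attention, and GQA sits in between.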

V2 (DeepSeek‑V2) – Released May 7 2024 (final version Jun 19 2024). Scales to 236 B parameters (21 B active per token) with a 128 K context window. Uses a Mixture‑of‑Experts (MoE) backbone with 160 routing experts and 2 shared experts. Key innovations:

Multi‑Head Latent Attention (MLA), which compresses keys and values (and, during training, queries) into low‑rank latent vectors, reducing the KV cache by 93.3 % and raising maximum generation throughput ~5.8×.
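The cache saving comes from storing only one small latent per token and reconstructing K and V on the fly with up-projections. A toy sketch with hypothetical dimensions (`d_model=512`, `d_latent=64`, which happens to give a ~94 % cache reduction, in the spirit of the 93.3 % figure above):

```python
import numpy as np

class LatentKVCache:
    """Toy MLA-style cache: per token we store only the low-rank
    latent c_t = x_t @ W_down instead of full keys and values."""
    def __init__(self, d_model=512, d_latent=64, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
        self.W_up_k = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
        self.W_up_v = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
        self.latents = []                        # the only per-token state kept

    def append(self, x_t):
        # cache d_latent floats per token instead of 2 * d_model
        self.latents.append(x_t @ self.W_down)

    def keys_values(self):
        c = np.stack(self.latents)               # (t, d_latent)
        return c @ self.W_up_k, c @ self.W_up_v  # reconstructed K, V
```

In the real model the up-projections can be folded into the attention matrices so K and V never need to be materialized; the sketch only shows where the memory saving comes from.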

DeepSeekMoE – replaces traditional FFNs with expert networks, improving parameter efficiency and cutting training cost by ~42.5 %.

Three auxiliary load‑balancing losses (expert‑level, device‑level, communication‑level) plus a token‑dropping strategy to keep training stable.

Token‑level FP8 quantization of KV‑cache.

Long‑context extension to 128 K via YaRN and two‑stage fine‑tuning.

Alignment is performed with Supervised Fine‑Tuning (SFT) on ~1.5 M instruction examples (helpfulness + safety), followed by Direct Preference Optimization (DPO) to teach the model both what to say and what not to say.
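The DPO objective itself is compact: it increases the policy's log-probability margin between the chosen and rejected response, measured relative to a frozen reference model. A per-pair sketch:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))).

    logp_* are the policy's sequence log-probs; ref_* are the frozen
    reference model's; beta controls deviation from the reference."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization (policy equals reference) the loss is log 2; it falls as the policy learns to prefer the chosen response, with no separate reward model or RL loop required.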

DeepSeek‑V3

Technical report published Dec 27 2024. The model contains 61 Transformer layers and 671 B parameters (37 B active per token). It replaces the feed‑forward block in every layer except the first three with an MoE layer (1 shared expert + 256 routed experts, 8 routed experts active per token). Major contributions:

Auxiliary‑loss‑free load balancing with bias injection.
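Instead of adding a loss term, V3 keeps a per-expert bias that only influences top-k selection: overloaded experts get their bias nudged down, underloaded ones up, while the gate weights still come from the unbiased affinities. A simplified sketch (the sign-based update and `gamma` are illustrative):

```python
import numpy as np

def route_with_bias(affinity, bias, k):
    """Auxiliary-loss-free routing sketch: top-k selection uses the
    biased scores, but gate weights come from the unbiased affinities,
    so the bias steers load without distorting the gradient signal."""
    topk = np.argsort(affinity + bias, axis=-1)[:, -k:]       # (tokens, k)
    gates = np.take_along_axis(affinity, topk, axis=-1)
    return topk, gates

def update_bias(bias, topk, n_experts, capacity, gamma=0.001):
    """After each step, push bias down for experts above their target
    capacity and up for those below it (gamma = bias update speed)."""
    load = np.bincount(topk.ravel(), minlength=n_experts)
    return bias - gamma * np.sign(load - capacity)
```

Because the bias never enters the loss, there is no gradient interference between the balancing mechanism and the language-modeling objective.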

Multi‑Token Prediction (MTP) – predicts several future tokens in parallel, improving data efficiency.
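At its core, MTP gives each position several shifted targets instead of one, densifying the training signal. A toy helper showing how the extra targets line up (the depth handling is an illustrative simplification of the paper's sequential MTP modules):

```python
def mtp_targets(tokens, depth):
    """Multi-Token Prediction sketch: for each position that has enough
    future context, build the targets for the next `depth` tokens, so one
    forward pass yields `depth` prediction losses per position instead of one."""
    T = len(tokens)
    # target list d covers offsets 1..depth; all lists share the same length
    return [tokens[d : T - depth + d] for d in range(1, depth + 1)]
```

With `depth=1` this degenerates to ordinary next-token prediction; larger depths trade a little extra compute for more supervision per sequence.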

FP8 mixed‑precision training with DualPipe pipeline parallelism and optimized All‑to‑All communication, achieving a training cost of only 2.788 M H800 GPU‑hours (≈1/15 of Llama‑3).
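The FP8 part can be pictured as per-block scaled quantization to the e4m3 format (4 exponent bits, 3 mantissa bits, maximum finite value 448): scale each block so its largest magnitude fits the format, then round the mantissa. A crude NumPy emulation, not the actual hardware kernel:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def quantize_fp8_block(x):
    """Sketch of per-block scaled FP8 quantization: rescale the block so
    its max magnitude maps to the format's max, then keep ~3 mantissa
    bits by rounding. Returns the quantized block and its scale."""
    scale = np.abs(x).max() / FP8_E4M3_MAX
    scaled = x / scale
    m, e = np.frexp(scaled)                 # mantissa in [0.5, 1), exponent
    q = np.ldexp(np.round(m * 16) / 16, e)  # round to 1 implicit + 3 stored bits
    return q, scale

def dequantize(q, scale):
    return q * scale
```

Per-block scales are why FP8 training stays stable: outliers in one block cannot blow up the precision budget of the others.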

Knowledge distillation from DeepSeek‑R1 to boost reasoning capability.

State‑of‑the‑art performance across a wide range of benchmarks, surpassing open‑source models (Qwen2.5‑72B, Llama‑3.1‑405B) and matching leading closed‑source models such as GPT‑4o and Claude‑3.5‑Sonnet.

DeepSeek‑R1 Series

Paper released Jan 22 2025. R1‑Zero is a reinforcement‑learning‑only model that learns reasoning without any supervised data, using rule‑based reward functions for accuracy and format. It introduces Group Relative Policy Optimization (GRPO), which estimates a baseline by averaging rewards over multiple rollouts, eliminating the need for a separate value model.
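GRPO's key simplification is the baseline: sample a group of rollouts for the same prompt and standardize each reward against the group's mean and standard deviation, so no learned value network is needed. A minimal sketch of the advantage computation:

```python
import statistics

def grpo_advantages(rewards):
    """Group Relative Policy Optimization baseline sketch: given the
    rewards of a group of rollouts for one prompt, use the group mean
    and standard deviation as the baseline:
        A_i = (r_i - mean(r)) / std(r)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard against all-equal groups
    return [(r - mu) / sigma for r in rewards]
```

Rollouts that beat their group's average get positive advantage and are reinforced; the policy-gradient update itself then follows the usual clipped-ratio form.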

R1 builds on R1‑Zero with a cold‑start phase (few‑shot CoT generation) and a second RL phase that aligns the model with human preferences (usefulness, safety, rule‑following). The training pipeline includes:

Pre‑training on 4.8 T tokens (multilingual, with enriched Chinese data).

Two‑stage RL: reasoning‑oriented RL followed by general alignment RL.

Distillation experiments (Logic‑RL, Open‑R1) that transfer R1's reasoning ability to smaller open‑source models, achieving large gains with only ~800 K synthetic samples.

Evaluation shows substantial improvements over V3 on education benchmarks (MMLU, GPQA), mathematics (MATH‑500), and coding (LiveCodeBench). Identified limitations include weaker function calling, multi‑turn dialogue, and language‑mixing issues.

Reproduction Projects

Two open‑source reproductions are highlighted:

Logic‑RL – reproduces R1‑Zero training on a 2 K synthetic logic‑puzzle dataset, demonstrating self‑evolution behaviors such as hesitation tags, multi‑path exploration, and verification steps.

Open‑R1 – uses DeepSeek‑R1‑generated reasoning trajectories (Bespoke‑Stratos‑17K) to distill knowledge into Qwen and LLaMA models, achieving notable performance gains on math, coding, and science tasks.

The original document concludes with community engagement calls, QR codes for joining DeepSeek discussion groups, and promotional material.

Tags: model compression, Mixture of Experts, DeepSeek, large language model, reinforcement learning, scaling laws, AI research
Written by Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.