
DeepSeek Series Overview: Core Technologies, Model Innovations, and Product Highlights

This PPT‑style deep dive surveys the DeepSeek series, from the original LLM through DeepSeekMoE, DeepSeek‑Math, V2, V3, and R1. It highlights core innovations such as Multi‑Head Latent Attention, fine‑grained MoE, GRPO reinforcement learning, Multi‑Token Prediction, DualPipe parallelism, and FP8 training, which together deliver high performance at a fraction of traditional costs, and notes their integration into Tencent's OlaChat intelligent assistant.

Tencent Cloud Developer

This article provides a PPT‑style, in‑depth overview of the DeepSeek series, summarizing recent research papers and key technical innovations from the original DeepSeek LLM to the latest DeepSeek‑V3/R1 models.

1. DeepSeek Series Summary

DeepSeek LLM: First‑generation model based on LLaMA architecture, using Pre‑Norm, RMSNorm, SwiGLU, and Grouped‑Query Attention (GQA). Innovations include an extended hyper‑parameter scaling law and a multi‑step learning‑rate scheduler.
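The multi‑step learning‑rate scheduler can be sketched as follows. The breakpoints (80% and 90% of training) and the step‑down ratios (31.6% and 10% of peak) are the values reported in the DeepSeek LLM paper; warmup is omitted and the function name `multi_step_lr` is illustrative:

```python
def multi_step_lr(step, total_steps, max_lr):
    # Multi-step schedule: hold the peak learning rate for the first 80%
    # of training, then step down to 31.6% and finally 10% of the peak
    # at the 80% and 90% marks (warmup phase omitted for brevity).
    if step < 0.8 * total_steps:
        return max_lr
    if step < 0.9 * total_steps:
        return 0.316 * max_lr
    return 0.1 * max_lr
```

Compared with cosine decay, a step schedule like this makes it easy to reuse an intermediate checkpoint for continued training, since the learning rate is constant within each stage.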

DeepSeekMoE: Introduces a fine‑grained Mixture‑of‑Experts (MoE) architecture with many small experts and shared experts, achieving performance superior to GShard at comparable parameter counts.

DeepSeek‑Math: Proposes the Group Relative Policy Optimization (GRPO) algorithm to stabilize reinforcement‑learning training.

DeepSeek‑V2: Employs Multi‑Head Latent Attention (MLA) to compress KV caches, reducing inference memory by ~93% while maintaining performance.

DeepSeek‑V3: Combines MLA, DeepSeekMoE, GRPO, a token‑balanced loss, and Multi‑Token Prediction (MTP). Its reported training cost is roughly 1/10 that of LLaMA‑70B, and its inference pricing roughly 1/30 of OpenAI's.

DeepSeek‑R1: Uses pure reinforcement learning with GRPO and rule‑based rewards to boost reasoning ability, followed by a small‑sample SFT stage.

2. Core Technologies

DeepSeekMoE Architecture: Replaces the standard Feed‑Forward Network (FFN) in Transformers with sparse MoE layers composed of a gating network and multiple expert FFNs. Fine‑grained expert division and shared experts reduce redundancy and improve specialization.
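A minimal sketch of this routing logic in plain Python, with experts modeled as simple callables. The function name `moe_layer` and the linear gate are illustrative simplifications, not DeepSeek's actual implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_layer(x, routed_experts, shared_experts, gate_weights, top_k=2):
    # Fine-grained MoE: every token always passes through the shared
    # experts, plus its top-k routed experts weighted by renormalized
    # gate scores. All other routed experts are skipped (sparsity).
    scores = softmax([sum(w * xi for w, xi in zip(ws, x))
                      for ws in gate_weights])
    top = sorted(range(len(scores)), key=lambda i: scores[i],
                 reverse=True)[:top_k]
    norm = sum(scores[i] for i in top)
    out = [0.0] * len(x)
    for expert in shared_experts:          # always-on shared experts
        out = [o + v for o, v in zip(out, expert(x))]
    for i in top:                          # sparse routed experts
        w = scores[i] / norm
        out = [o + w * v for o, v in zip(out, routed_experts[i](x))]
    return out
```

Splitting a few large experts into many small ones (and renormalizing only over the selected top‑k) is what lets different experts specialize without each having to duplicate common knowledge, which lives in the shared experts instead.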

GRPO (Group Relative Policy Optimization): A reinforcement‑learning policy‑optimization method that uses relative rewards within a group of generated outputs, eliminating the need for a separate value network and improving training stability.
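The group‑relative advantage can be sketched in a few lines. This is a simplification (the full GRPO objective also includes a KL penalty against a reference policy, omitted here), and the function names are illustrative:

```python
import math

def grpo_advantages(rewards):
    # Group-relative advantage: normalize each sampled output's reward
    # against the group's own mean and std. The group itself serves as
    # the baseline, so no learned value/critic network is required.
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std or 1.0) for r in rewards]

def clipped_objective(ratio, advantage, eps=0.2):
    # PPO-style clipped surrogate, reused by GRPO with the group-relative
    # advantages above in place of critic-based advantage estimates.
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

Dropping the value network roughly halves the memory footprint of RL fine‑tuning, since the critic in PPO is typically the same size as the policy model.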

MLA (Multi‑Head Latent Attention): Compresses KV caches by applying low‑rank decomposition to keys and values, drastically lowering inference memory consumption.
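The caching idea can be sketched as follows: only a small latent vector per token is stored, and keys/values are re‑expanded through up‑projections when needed. This omits details of real MLA such as the decoupled RoPE key, and the matrix names (`W_dkv`, `W_uk`, `W_uv`) follow the paper's notation only loosely:

```python
def matvec(W, x):
    # Plain-Python matrix-vector product: W is a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class LatentKVCache:
    # MLA-style cache: store only the low-rank latent c_t = W_dkv @ h_t
    # (dimension d_c << d_model) instead of full per-head keys and values.
    def __init__(self, W_dkv, W_uk, W_uv):
        self.W_dkv, self.W_uk, self.W_uv = W_dkv, W_uk, W_uv
        self.latents = []

    def append(self, h):
        # Cache d_c floats per token rather than 2 * d_model.
        self.latents.append(matvec(self.W_dkv, h))

    def keys_values(self):
        # Re-expand keys and values from the latents on demand.
        ks = [matvec(self.W_uk, c) for c in self.latents]
        vs = [matvec(self.W_uv, c) for c in self.latents]
        return ks, vs
```

Since the cache grows linearly with sequence length, shrinking the per‑token entry from the full key/value pair down to one small latent is what drives the reported ~93% memory reduction.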

MTP (Multi‑Token Prediction): Extends the training objective to predict multiple future tokens simultaneously, improving data efficiency and global context modeling while enabling parallel prediction.
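The target construction can be sketched as follows (a simplification: DeepSeek‑V3's MTP uses sequential prediction modules rather than independent heads, but the supervision pattern is the same; the function name is illustrative):

```python
def mtp_targets(tokens, depth):
    # Multi-Token Prediction targets: at position i, prediction depth d
    # (1-based) is supervised on token i+d, so a single forward pass
    # receives `depth` training signals per position instead of one.
    n = len(tokens)
    return [[tokens[i + d] for d in range(1, depth + 1)]
            for i in range(n - depth)]
```

For example, with `depth=2` the position holding token 1 is trained to predict tokens 2 and 3, which densifies the training signal and encourages representations that anticipate more than the immediate next token.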

DualPipe Parallelism: Overlaps forward and backward communication/computation phases across pipeline stages, reducing pipeline stalls in large‑scale training.
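A back‑of‑the‑envelope cost model of why such overlap helps. This is purely illustrative, not DualPipe's actual schedule (which also feeds micro‑batches from both ends of the pipeline and splits backward into finer chunks):

```python
def naive_time(n_microbatches, fwd, bwd, comm):
    # Sequential baseline: each micro-batch's compute waits for its own
    # communication, so per-step cost is compute + comm.
    return n_microbatches * (fwd + comm) + n_microbatches * (bwd + comm)

def overlapped_time(n_microbatches, fwd, bwd, comm):
    # With compute/communication overlap, micro-batch i+1's compute runs
    # while micro-batch i's data is in flight, so the steady-state cost
    # per step is max(compute, comm) instead of compute + comm.
    return (n_microbatches * max(fwd, comm)
            + n_microbatches * max(bwd, comm))
```

In the ideal case where compute and communication times are balanced, overlap hides the communication cost almost entirely, which is especially valuable for MoE models whose all‑to‑all expert dispatch is communication‑heavy.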

FP8 Mixed‑Precision Training: Executes most compute‑intensive operations in FP8 while keeping critical layers (e.g., normalization, attention) in higher precision (BF16/FP32) to balance efficiency and numerical stability.
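The precision trade‑off can be illustrated with a toy fake‑quantizer. Real FP8 training uses hardware dtypes (E4M3/E5M2) with per‑tensor scaling factors; this sketch only mimics the mantissa‑rounding precision loss, and both function names are hypothetical:

```python
import math

def fake_quantize_fp8(x, mantissa_bits=3):
    # E4M3-like rounding: keep only `mantissa_bits` bits of mantissa by
    # rounding to the nearest representable step at x's binary exponent.
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    step = 2.0 ** (e - mantissa_bits)
    return round(x / step) * step

def mixed_precision_dot(a, b):
    # Operands "stored" at FP8-like precision, but products accumulated
    # in full precision -- mirroring the recipe of running GEMMs in FP8
    # while keeping accumulation and sensitive layers at higher precision.
    return sum(fake_quantize_fp8(x) * fake_quantize_fp8(y)
               for x, y in zip(a, b))
```

Rounding inputs costs only a small relative error per element, while high‑precision accumulation prevents those errors from compounding over long reductions, which is why the sensitive operations stay in BF16/FP32.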

3. DeepSeek‑V3/R1 Core Issues

Why are DeepSeek‑V3/R1 models both inexpensive and high‑performing? The answer lies in the combination of MLA, DeepSeekMoE, GRPO, MTP, and engineering optimizations such as DualPipe pipeline parallelism, expert parallelism across eight nodes, and FP8 training.

The training pipeline includes:

Base model training with MLA and MoE.

Reinforcement‑learning fine‑tuning using GRPO and rule‑based rewards (accuracy and format).

Optional SFT with curated Chain‑of‑Thought data to improve reasoning.
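The rule‑based rewards in step 2 can be sketched as below. The `<think>`/`<answer>` tag template follows the output format reported for R1, but exact string matching here stands in for the real answer verifiers (e.g., math checkers or code test harnesses), and the function name is illustrative:

```python
import re

def rule_based_reward(response, reference_answer):
    # Format reward: reasoning wrapped in <think> tags followed by the
    # final answer in <answer> tags.
    fmt_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                            response, re.DOTALL))
    # Accuracy reward: extracted answer matches the reference exactly.
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    acc_ok = bool(m and m.group(1).strip() == reference_answer.strip())
    return (1.0 if fmt_ok else 0.0) + (1.0 if acc_ok else 0.0)
```

Because these rewards are computed by rules rather than by a learned reward model, they are cheap to evaluate and immune to reward‑model hacking, which is part of why pure RL training remains stable at scale.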

4. OlaChat Intelligent Assistant

OlaChat is a Tencent data‑analysis product built on large‑model technology, now supporting DeepSeek V1/V3 models for smarter SQL generation, data visualization, and result interpretation.

The article concludes with QR codes and links for joining Tencent Cloud developer communities and DeepSeek discussion groups.

Tags: AI · Mixture of Experts · DeepSeek · large language model · reinforcement learning · FP8 Training · Multi-Head Attention
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
