How DeepSeek Trains and Optimizes Its LLMs: From Pre‑training to Reasoning Models
This article breaks down DeepSeek's LLM training pipeline: the massive pre‑training phase, instruction fine‑tuning, reinforcement learning from human feedback, and the distinct roles of its V3 instruction model and R1 reasoning model, along with reported performance figures and current limitations.
Training Process
Large language model training generally consists of two stages, pre‑training and post‑training; DeepSeek follows the same pattern.
Pre‑training Stage
The goal is to teach the model universal language patterns by predicting the next token in massive web‑scale text corpora, often amounting to trillions of tokens sourced from public datasets such as Common Crawl. This stage relies on a single loss function for autoregressive token prediction and consumes extensive compute resources to produce a foundational model.
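To make that "single loss function" concrete, here is a minimal sketch of the autoregressive objective in PyTorch. The model, vocabulary size, and tensors are placeholders (the article does not describe DeepSeek's actual training code); only the shift‑and‑cross‑entropy structure is the point:

```python
import torch
import torch.nn.functional as F

# Placeholder shapes; in real pre-training, logits come from the model
# and token_ids from a tokenized web-scale corpus.
vocab_size = 32_000
batch, seq_len = 4, 512

token_ids = torch.randint(0, vocab_size, (batch, seq_len))  # tokenized text
logits = torch.randn(batch, seq_len, vocab_size)            # stand-in for model(token_ids)

# Shift so position t predicts token t+1, then apply cross-entropy:
# the single loss that drives the entire pre-training stage.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = token_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(pred, target)
print(loss.item())
```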
Instruction fine‑tuning (also called supervised fine‑tuning, IFT or SFT) is a long‑standing technique that then refines this base model to follow specific command formats, improving its ability to generate coherent, instruction‑compliant responses.
Post‑training Stage
After pre‑training, DeepSeek applies two main post‑training methods:
Instruction tuning: The model learns to recognize and obey structured prompts, e.g., answering "Explain the history of the Roman Empire" with a rich, easy‑to‑understand response.
Reinforcement learning from human feedback (RLHF): Human preference data—initially labeled manually and now partially annotated by AI—is used with a contrastive loss (a common formulation is sketched below) to teach the model to prefer higher‑quality answers and align its outputs with human preferences.
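The article does not spell out the exact form of this contrastive loss; a common choice in RLHF pipelines is a pairwise Bradley–Terry objective over chosen/rejected answer pairs, sketched below with placeholder reward scores (the reward model itself is assumed, not shown):

```python
import torch
import torch.nn.functional as F

# Scalar scores a reward model would assign to the preferred (chosen)
# and dispreferred (rejected) answer for each prompt in a batch.
# Random values stand in for reward_model(prompt, answer) calls.
r_chosen = torch.randn(8)
r_rejected = torch.randn(8)

# -log sigmoid(r_chosen - r_rejected): the loss shrinks as the model
# scores human-preferred answers higher than rejected ones.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss.item())
```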
Model Working Principle
DeepSeek offers two distinct LLM families:
DeepSeek‑V3 – an instruction model similar to ChatGPT. It generates token streams directly from user prompts, often formatting answers as Markdown lists and highlighting key points. Tokens are generated rapidly, with each token representing a word or sub‑word.
DeepSeek‑R1 – a reasoning model. When queried, it first produces an extensive chain‑of‑thought token sequence that restates the problem and breaks it down; only after this reasoning phase does it switch to delivering the final answer.
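As an illustration of this two‑phase output, the open‑weights DeepSeek‑R1 checkpoints wrap their reasoning in `<think>...</think>` tags before the final answer. The helper below is a hypothetical sketch that separates the two; hosted APIs may instead expose the reasoning through a dedicated response field:

```python
def split_reasoning(output: str) -> tuple[str, str]:
    """Separate the chain-of-thought from the final answer.

    Assumes the reasoning is delimited by <think>...</think> tags,
    as in the open-weights DeepSeek-R1 checkpoints.
    """
    open_tag, close_tag = "<think>", "</think>"
    if close_tag in output:
        reasoning, _, answer = output.partition(close_tag)
        return reasoning.replace(open_tag, "").strip(), answer.strip()
    return "", output.strip()  # instruction models: no reasoning phase

raw = "<think>The user wants X. Break it into steps...</think>The answer is 42."
cot, final = split_reasoning(raw)
print(cot)    # chain-of-thought token sequence
print(final)  # final answer delivered after the reasoning phase
```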
Optimization and Innovation
DeepSeek‑R1 is built on the DeepSeek‑V3 base model, a Mixture‑of‑Experts network with 671 billion total parameters of which roughly 37 billion are active per token. Its reasoning ability is trained primarily through reinforcement learning, while supervised fine‑tuning stages refine language fluency. This combination yields strong performance on both reasoning‑heavy tasks and general language tasks.
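For context, DeepSeek's published R1 work describes the RL stage using Group Relative Policy Optimization (GRPO), which scores a group of sampled answers per prompt and normalizes rewards within the group rather than training a separate value network. A minimal sketch of that advantage computation, with placeholder rewards:

```python
import torch

# GRPO-style advantages: sample G answers for one prompt, score each
# with a (rule-based or learned) reward, and normalize within the
# group so no critic network is needed. Rewards here are placeholders,
# e.g. 1.0 for a correct final answer and 0.0 otherwise.
group_rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])

advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-6)

# Each sampled answer's tokens are then reinforced in proportion to
# its advantage via a clipped policy-gradient update, which is what
# strengthens the model's chain-of-thought reasoning over training.
print(advantages)
```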
Evaluating LLM output quality increasingly relies on strategy‑based assessments: researchers embed models in simulated game environments, observe decision‑making and task completion, and infer generation quality from the model's behavior.
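A toy harness along those lines might look like the following; the game, the scoring rule, and `model_act` are all hypothetical stand‑ins for whatever simulated environment and LLM‑backed agent a real study would use:

```python
import random

def model_act(observation: str) -> str:
    """Hypothetical stand-in for an LLM-backed agent choosing an action."""
    return random.choice(["explore", "collect", "finish"])

def run_episode(max_steps: int = 20) -> float:
    """Toy game: reward useful actions and task completion."""
    score, done = 0.0, False
    for step in range(max_steps):
        action = model_act(f"step {step}, score {score}")
        if action == "collect":
            score += 1.0
        elif action == "finish":
            done = True
            break
    return score + (5.0 if done else 0.0)  # bonus for completing the task

# Aggregate behavior over many episodes as a proxy for output quality.
scores = [run_episode() for _ in range(100)]
print(sum(scores) / len(scores))
```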
Performance figures reported for DeepSeek‑R1 include an average response latency of about 2 seconds and an API throughput comparable to GPT‑4.5 and roughly four times that of GPT‑4, enabling fast handling of complex logical and mathematical queries.
Despite these strengths, the reasoning‑centric architecture may face challenges in scalability, adaptability to highly diverse scenarios, and maintaining consistent output quality across all domains.