How DeepSeek’s RL‑Powered Time Scaling Is Redefining AI Model Training

DeepSeek’s rapid rise is examined through its RL‑based Time Scaling paradigm, cost‑effective architecture, innovative training pipeline, open‑source strategy, and security challenges, highlighting how these breakthroughs disrupt traditional AI model development, lower resource demands, and influence industry dynamics.

Alibaba Cloud Developer

Feb 28, 2025

DeepSeek, a fast‑growing AI model, is analyzed for its technical advances, training methodology, and industry impact, emphasizing its disruptive potential.

0x01: AI Technology Uprising – DeepSeek’s Rise

DeepSeek has become a hot topic, prompting IT professionals to study what makes it so productive and to reflect on its implications.

Data Performance

Rapid global adoption: it reportedly reached 100 million users in seven days, a milestone that took ChatGPT about two months.

After the surge, Nvidia’s stock swung sharply, and the AI value chain began to shift.

Industry Performance

GPU manufacturers, cloud providers, and other tech giants are rapidly collaborating with DeepSeek.

Some companies feel forced to abandon large‑model development, then reconsider as opportunities arise.

DeepSeek’s challenge to AI dominance is likened to a bold technological uprising, energizing the domestic AI community.

0x02: Decoding DeepSeek’s Core Technologies

The author identifies two primary reasons for DeepSeek’s breakout, based on extensive research and consultation.

Innovation 1: RL‑Based Time Scaling

Unlike traditional models that rely heavily on compute and data, DeepSeek uses reinforcement learning (RL) to implement a "Time Scaling" paradigm, allowing the model to iteratively refine its answers before submission, leading to emergent intelligence.

This approach, also referred to as "Test Time Scaling" or "RL Scaling," mirrors concepts seen in OpenAI’s o1 and o3. DeepSeek demonstrates that the path is feasible, with performance comparable to o1.

Innovation 2: Low Training Cost, High Inference Performance

DeepSeek achieves comparable results at about 1/27 of the cost of OpenAI’s o1, embodying "integrated innovation" by combining architecture, optimization, and infrastructure advances.

Model Architecture

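The MoE (Mixture‑of‑Experts) architecture replaces the dense FFN with sparsely activated expert FFNs, so only a few experts run per token. This cuts per‑token compute and, with restricted routing, limits communication overhead, enabling larger model scales.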
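To make the sparse‑activation idea concrete, here is a minimal sketch of generic top‑k expert routing for a single token. It is not DeepSeek’s actual router; the function names (`top_k_route`, `moe_layer`), the toy experts, and the router scores are all assumptions made for illustration.

```python
import math

def top_k_route(scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, router_scores, experts, k=2):
    """Weighted sum of the outputs of the k selected experts for one token."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_scores, k):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): each one scales the input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
scores = [0.1, 2.0, 0.3, 2.0]  # hypothetical router logits for this token

routed = top_k_route(scores, k=2)
output = moe_layer(token, scores, experts, k=2)
print(routed, output)
```

Only two of the four experts run for this token, which is the source of the compute savings: total parameters grow with the number of experts, while per‑token work grows only with k.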

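MHA (multi‑head attention) remains the core of Transformer models, but techniques like MLA (Multi‑head Latent Attention), which caches a compressed latent vector instead of full per‑head keys and values, significantly lower KV‑Cache usage.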
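A rough back‑of‑the‑envelope calculation shows why the KV‑Cache saving matters. The sketch below compares the cache size of standard MHA with a latent‑compressed cache. All dimensions (layer count, head count, latent size, sequence length) are illustrative assumptions, not figures from this article.

```python
def kv_cache_bytes(elems_per_token_per_layer, n_layers, seq_len, bytes_per_elem=2):
    """Cache size for one sequence, assuming 2-byte (16-bit) elements by default."""
    return elems_per_token_per_layer * n_layers * seq_len * bytes_per_elem

# Illustrative configuration (assumed values for a large model).
n_layers = 61
n_heads = 128
head_dim = 128
latent_dim = 512   # compressed KV latent cached per token (assumption)
rope_dim = 64      # extra positional key part cached per token (assumption)
seq_len = 32_768

# MHA caches a full key and a full value vector for every head.
mha_elems = 2 * n_heads * head_dim
# A latent-attention cache stores only the compressed latent plus the positional part.
mla_elems = latent_dim + rope_dim

mha_gib = kv_cache_bytes(mha_elems, n_layers, seq_len) / 2**30
mla_gib = kv_cache_bytes(mla_elems, n_layers, seq_len) / 2**30
ratio = mha_elems / mla_elems

print(f"MHA cache: {mha_gib:.1f} GiB, latent cache: {mla_gib:.2f} GiB, ratio: {ratio:.1f}x")
```

Under these assumptions the full MHA cache for one long sequence would not fit on a single GPU, while the compressed cache takes a few GiB.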

Optimization Methods

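GRPO (Group Relative Policy Optimization) simplifies PPO by dropping the separate critic (value) model: it samples a group of answers per prompt and scores each one relative to the group’s average reward. Combined with rule‑based rewards, this avoids training a separate neural reward model for reasoning tasks.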

Multi‑token prediction (MTP) has the model predict more than one future token per position, in effect "taking three steps in two," which can accelerate inference.
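The group‑relative scoring used by GRPO can be sketched in a few lines. This is a simplified illustration, not DeepSeek’s training code: the exact‑match reward, the sample answers, and the use of the population standard deviation are assumptions for the example.

```python
from statistics import mean, pstdev

def rule_based_reward(answer, reference):
    """Hypothetical rule-based reward: 1.0 for an exact match, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    The group statistics stand in for the critic that PPO would need.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, a group of four sampled answers (toy data).
samples = ["42", "41", "42 ", "7"]
rewards = [rule_based_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Answers that beat the group average get a positive advantage and are reinforced; the rest are pushed down. If every answer in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.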

Infrastructure

DualPipe pipeline parallelism keeps GPUs busy by overlapping the computation and communication of forward and backward passes, shrinking pipeline bubbles.

FP8 mixed‑precision training balances accuracy and memory, and performs best on NVIDIA H‑series (Hopper) GPUs, which support FP8 in hardware.

These integrated innovations dramatically cut training and inference costs while maintaining quality, promoting AI democratization.

0x03: DeepSeek‑R1 Training and Distillation Pipeline

The process consists of four stages and six steps.

Stage 1 – RL Feasibility

DeepSeek applies RL directly to the base model, without a prior SFT step, producing the R1‑Zero model, which shows emergent reasoning abilities but unstable outputs.

Stage 2 – Data Distillation

A small seed dataset of long CoT examples is used for cold‑start SFT; rejection sampling then adds about 600k high‑quality CoT samples, which together with about 200k non‑reasoning samples total roughly 800k SFT examples.
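Rejection sampling, as used in this stage, means generating several candidate responses per prompt and keeping only those that pass a check. The sketch below illustrates the idea with a toy generator and an exact‑match filter. `rejection_sample`, `toy_generate`, and `exact_match` are hypothetical names, and the real pipeline’s filters are not described in this article.

```python
import random

def rejection_sample(prompts, generate, is_acceptable, n_candidates=4):
    """Collect SFT examples by keeping only candidates that pass the check."""
    kept = []
    for prompt, reference in prompts:
        for _ in range(n_candidates):
            candidate = generate(prompt)
            if is_acceptable(candidate, reference):
                kept.append({"prompt": prompt, "response": candidate})
                break  # this sketch keeps at most one response per prompt
    return kept

# Toy stand-ins for a model and a verifier (assumptions for illustration).
rng = random.Random(0)

def toy_generate(prompt):
    a, b = map(int, prompt.split("+"))
    return str(a + b + rng.choice([0, 0, 1]))  # sometimes off by one

def exact_match(candidate, reference):
    return candidate == reference

prompts = [("1+1", "2"), ("2+3", "5"), ("10+5", "15")]
sft_data = rejection_sample(prompts, toy_generate, exact_match)
print(sft_data)
```

Prompts for which no candidate passes are simply dropped, so the resulting dataset trades coverage for quality.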

Stage 3 – Reinforcement Training

Two additional SFT rounds followed by two RL rounds yield the final R1 model with strong reasoning capabilities.

Stage 4 – Model Distillation

Distilling R1 into smaller Qwen and Llama models with two SFT rounds reduces model size and deployment cost.

Training a model of more than 600B parameters remains challenging despite the simplified "four‑stage, six‑step" outline.

0x04: Reproducing R1 with Minimal Resources

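Fei‑Fei Li’s team trained s1, an R1‑style reasoning model, using 1K data samples and under $50 of compute, achieving comparable performance on a specific benchmark and illustrating the "time‑for‑effect" principle of trading extra inference time for better answers.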

Other teams found that s1‑32B’s success depends on building on an already strong model (e.g., Qwen).

0x05: Open‑Source Strategy and Data as a Competitive Edge

Open‑sourcing DeepSeek reshapes perceptions of compute, algorithm, and data requirements, influencing market dynamics.

However, the data pipelines remain opaque: the 800k SFT samples and the RL CoT data are critical assets that have not been disclosed.

0x06: Security Challenges

During the holiday period, DeepSeek suffered DDoS attacks and was found to have a publicly exposed ClickHouse database, illustrating the security trade‑offs typical of a business‑first approach.

Beyond infrastructure, large‑model outputs can be manipulated, posing content‑safety risks for consumer applications.

0x07: Final Thoughts

The views expressed are the author’s own and are unaffiliated with any organization; corrections are welcome.

Acknowledgments to colleagues for inspiration.
