Artificial Intelligence 18 min read

Understanding Reasoning Large Language Models: DeepSeek‑R1 and the Rise of Test‑Time Computation

This article explains how reasoning‑oriented large language models such as DeepSeek‑R1 and OpenAI o1‑mini shift AI research from training‑time scaling to test‑time computation, detailing the underlying principles, new scaling laws, verification techniques, reinforcement‑learning pipelines, and practical methods for distilling reasoning capabilities into smaller models.


What Is a Reasoning‑Oriented Large Language Model?

Unlike traditional models that directly output answers, reasoning‑oriented models decompose a question into smaller intermediate steps (chain‑of‑thought), allowing the model to "think" before answering.

Limitations of Training‑Time Computation

Performance improvements have historically relied on three factors: model size, training data volume, and compute (FLOPs). This "training‑time" paradigm now faces diminishing returns, much as oil extraction eventually yields less per unit of effort.

Insights from Scaling Laws

Power‑law relationships (e.g., Kaplan and Chinchilla laws) show that model performance scales with compute, data, and parameters, but gains diminish as these factors grow.
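In standard form, the Kaplan‑style power laws and the Chinchilla loss fit look roughly like this (the exponents and constants are empirical fits from the respective papers):

```latex
% Kaplan et al.: loss as a power law in parameters N, data D, compute C
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}

% Chinchilla (Hoffmann et al.): joint fit with an irreducible-loss term E
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Because each term decays as a power law, doubling any single factor buys progressively smaller loss reductions, which is the diminishing‑returns behavior described above.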

The Rise of Test‑Time Computation

When training‑time scaling hits limits, the focus shifts to test‑time computation, allowing models to generate intermediate reasoning steps that improve answer quality.

Search‑Based Verifiers

Two main verifier types are introduced:

Outcome Reward Model (ORM) – evaluates only the final answer.

Process Reward Model (PRM) – evaluates the reasoning process.

Typical workflow: generate multiple answer candidates, then score them with a verifier; the highest‑scoring answer is selected.
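The workflow can be sketched in a few lines; the sampler and ORM‑style scorer below are toy stand‑ins, not a real API:

```python
from itertools import cycle
from typing import Callable

def verify_and_select(prompt: str,
                      generate: Callable[[str], str],
                      score: Callable[[str, str], float],
                      n: int = 8) -> str:
    """Sample n candidate answers, score each with a verifier, return the best."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))

# Toy stand-ins for a sampler and an ORM-style verifier (both hypothetical):
pool = cycle(["41", "42", "43"])
sample = lambda prompt: next(pool)
orm = lambda prompt, ans: 1.0 if ans == "42" else 0.0

best = verify_and_select("What is 6 * 7?", sample, orm, n=3)
```

In practice `generate` would be a sampled LLM call and `score` a trained reward model; the selection logic itself stays this simple.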

Majority Voting

Generating several answers and selecting the most frequent one (self‑consistency) provides a simple yet effective verification method.
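Self‑consistency needs no verifier model at all; a frequency count over the sampled final answers suffices:

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled chains of thought that ended in these final answers:
samples = ["42", "41", "42", "42", "43"]
winner = majority_vote(samples)
```

The reasoning chains themselves may differ wildly; only agreement on the final answer is counted.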

Best‑of‑N Sampling

Generate N samples, score each with an ORM (or PRM), and pick the top‑scoring answer; weighted variants sum verifier scores across samples that share the same final answer.
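The weighted variant is worth seeing concretely, because it can disagree with plain best‑of‑N (scores here are made up for illustration):

```python
from collections import defaultdict

def weighted_best_of_n(scored):
    """scored: list of (final_answer, verifier_score) pairs.
    Sum scores per distinct answer and return the answer with the
    highest total (weighted self-consistency)."""
    totals = defaultdict(float)
    for answer, s in scored:
        totals[answer] += s
    return max(totals, key=totals.get)

scored = [("42", 0.9), ("41", 0.95), ("42", 0.8)]
# Plain best-of-N would pick "41" (top single score 0.95);
# weighting across duplicates picks "42" (0.9 + 0.8 = 1.7).
choice = weighted_best_of_n(scored)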

Beam Search with Process Reward Models

Beam search expands multiple reasoning paths, scoring each with a PRM and pruning low‑scoring branches, then applies best‑N selection on the remaining candidates.
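A minimal sketch of PRM‑guided beam search, using a toy problem in place of a real model and reward model (both `expand` and `prm` are hypothetical stand‑ins):

```python
def prm_beam_search(root, expand, prm_score, beam_width=2, depth=3):
    """Grow reasoning chains step by step; after each step keep only the
    beam_width chains the PRM scores highest."""
    beams = [[root]]
    for _ in range(depth):
        candidates = [chain + [step] for chain in beams for step in expand(chain)]
        candidates.sort(key=prm_score, reverse=True)
        beams = candidates[:beam_width]      # prune low-scoring branches
    return beams

# Toy problem: "reasoning steps" are digits; the PRM rewards chains
# whose running sum is close to a target of 9.
expand = lambda chain: [1, 2, 3]
prm = lambda chain: -abs(sum(chain) - 9)
best_chains = prm_beam_search(0, expand, prm, beam_width=2, depth=3)
```

With a real model, `expand` would sample candidate next steps and `prm_score` would be a trained process reward model; best‑of‑N selection is then applied to the surviving chains.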

Monte Carlo Tree Search (MCTS)

MCTS iteratively selects, expands, simulates, and back‑propagates scores to balance exploration and exploitation of reasoning steps.
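The four phases map onto a compact UCT loop; the step generator and value function below are toy stand‑ins for a model's proposals and a verifier's reward:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def mcts(root_state, next_steps, rollout_value, iters=100, c=1.4):
    """Tiny UCT loop: select with UCB1, expand one child, evaluate, backprop."""
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(next_steps(node.state)):
            node = max(node.children,
                       key=lambda ch: ch.value / (ch.visits + 1e-9)
                       + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))
        # Expansion: add one untried child, if any remain.
        untried = [s for s in next_steps(node.state)
                   if s not in [ch.state for ch in node.children]]
        if untried:
            node = Node(random.choice(untried), parent=node)
            node.parent.children.append(node)
        # Simulation + backpropagation.
        reward = rollout_value(node.state)
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).state

# Toy: states are running sums; "good" reasoning lands exactly on 10.
random.seed(0)
steps = lambda s: [s + 1, s + 3] if s < 10 else []
value = lambda s: 1.0 if s <= 10 else 0.0   # hypothetical verifier signal
best_first_step = mcts(0, steps, value)
```

The UCB1 term balances exploitation (average value) against exploration (rarely visited branches), which is exactly the trade‑off described above.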

Modifying the Proposal Distribution

Instead of output‑centric verification, this approach trains the model to generate better reasoning tokens by reshaping its token distribution via prompt engineering or dedicated training.

Prompt Engineering

Providing examples or explicit instructions (e.g., "let's think step‑by‑step") nudges the model toward chain‑of‑thought generation.
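The zero‑shot variant is just a prompt template; the trigger phrase below follows the well‑known "Let's think step by step" formulation:

```python
def cot_prompt(question: str) -> str:
    """Wrap a question in a zero-shot chain-of-thought template."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = cot_prompt("If a train travels 60 km in 45 minutes, "
                    "what is its speed in km/h?")
```

Few‑shot variants prepend worked examples with explicit reasoning instead of (or in addition to) the trigger phrase.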

STaR (Self‑Taught Reasoner)

STaR lets the model generate its own reasoning data, which is then used for supervised fine‑tuning, reinforcing desirable reasoning patterns.
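One STaR round can be sketched as a filter‑then‑finetune loop; the generator, checker, and fine‑tuning hook below are hypothetical placeholders:

```python
def star_iteration(problems, generate_rationale, is_correct, fine_tune):
    """One STaR round: sample a rationale + answer per problem, keep only
    those whose final answer checks out, then fine-tune on the survivors."""
    keep = []
    for question, gold in problems:
        rationale, answer = generate_rationale(question)
        if is_correct(answer, gold):
            keep.append((question, rationale, answer))
    fine_tune(keep)   # supervised fine-tuning on self-generated reasoning
    return keep

# Toy stand-ins (hypothetical): arithmetic problems with known answers.
problems = [("2+2", "4"), ("3+3", "6")]
generate = lambda q: ("add the two operands", str(eval(q)))
kept = star_iteration(problems, generate, lambda a, g: a == g, lambda data: None)
```

Repeating this loop lets the model bootstrap: each round's fine‑tuned model produces better rationales for the next round's data collection. (The original paper also rationalizes failed problems by hinting the correct answer, which this sketch omits.)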

DeepSeek‑R1: From Zero to a Powerful Reasoning Model

DeepSeek‑R1 (671B parameters) was built through five stages: cold‑start fine‑tuning, reasoning‑focused RL, rejection sampling, supervised fine‑tuning, and RL across all scenarios with safety rewards.

Key steps include:

Cold‑start on a small high‑quality reasoning dataset.

RL with accuracy and format rewards (using GRPO).

Generating 800k high‑quality samples for supervised fine‑tuning.

Distilling reasoning ability into smaller models (e.g., Qwen‑32B) via teacher‑student training.
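A distinctive piece of this pipeline is GRPO's critic‑free advantage estimate: instead of a learned value function, each sampled answer's reward is normalized against the mean and standard deviation of its own group of rollouts. A minimal sketch of that normalization:

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize each rollout's reward
    against the statistics of its own sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# Rewards for one group of 4 sampled answers to the same prompt
# (e.g., 1.0 when both the accuracy and format checks pass):
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct answers in the group get positive advantage and incorrect ones negative, which is the signal the policy update amplifies; the full GRPO objective (clipped ratios, KL penalty) is omitted here.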

Unsuccessful Attempts

Attempts to use MCTS and PRM‑guided best‑of‑N search suffered from large search spaces and high computational cost, highlighting the practical limits of these techniques.

Future Outlook

Test‑time computation opens new performance pathways and moves LLMs toward true "thinking" AI, promising breakthroughs in complex problem solving and scientific discovery.

Tags: prompt engineering · Large Language Models · DeepSeek-R1 · reinforcement learning · inference · test-time computation
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
