What Is an Inference Large Language Model? A Visual Guide

The article explains inference‑type large language models, how they differ from traditional models by breaking questions into reasoning steps, the shift from training‑time to test‑time compute, scaling‑law insights, validation techniques, proposal‑distribution tricks, and the detailed training pipeline of DeepSeek‑R1, while also discussing failed experiments and future directions.

Top Architect

What Is an Inference Large Language Model?

Inference‑type (reasoning) LLMs such as DeepSeek‑R1, OpenAI o1‑mini and Gemini 2.0 Flash Thinking decompose a question into a chain of intermediate reasoning steps before producing an answer, so the model "thinks" through the problem rather than answering immediately.

illustration of inference model

The article uses more than 40 custom visualizations to walk the reader through the core principles, the test‑time compute mechanism, and DeepSeek‑R1’s technical breakthroughs.

Limits of Training‑Time Compute

Through the first half of 2024, performance improvements relied mainly on three factors: model size (parameter count), dataset size (training tokens), and compute (FLOPs). Together these are called "training‑time compute". The article likens it to a fossil fuel: a finite resource that suffers diminishing returns.

training compute diagram

New Scaling Laws for Test‑Time Compute

Traditional scaling laws (Kaplan¹² and Chinchilla³) relate model performance to training‑time compute. Recent work shows that test‑time compute—how much computation a model spends during inference—follows similar power‑law trends, but with a clear shift toward longer reasoning chains.
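
A power law of the form loss = a * compute^(-b) is a straight line in log-log space, so its exponent can be recovered with ordinary least squares. The sketch below shows that idea in plain Python. The data points are synthetic, generated from an assumed law purely for illustration; they are not taken from Kaplan, Chinchilla, or any test‑time study, and the function name is hypothetical.

```python
import math

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by least squares on log-log data."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic points from an assumed law: loss = 10 * compute**(-0.05).
compute = [1e18, 1e19, 1e20, 1e21, 1e22]
loss = [10 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a = {a:.2f}, b = {b:.3f}")
```

The same fit could in principle be applied with test‑time compute (for example, tokens generated per answer) on the x‑axis, which is the comparison the scaling‑law discussion is about.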

test‑time scaling law

OpenAI’s blog suggests that test‑time compute may obey the same scaling trend as training‑time compute, while a study on board‑game scaling (AlphaZero on Hex) demonstrates a tight coupling between the two regimes.

OpenAI test‑time scaling

Validation Techniques

Test‑time compute can be harnessed through two broad categories:

Search against verifiers (output‑centric): generate many reasoning paths and rank them with a verifier, usually a reward model.

Proposal‑distribution modifiers (input‑centric) that train the model to sample more “reasoning” tokens.

Search‑based methods include:

Majority voting (self‑consistency) – generate multiple answers and pick the most frequent.

Best‑N sampling – generate N candidates, score each with an Outcome Reward Model (ORM), and select the highest‑scoring answer.

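Weighted Best‑N – weight each candidate by its reward‑model score (from an ORM, or from a Process Reward Model (PRM) that evaluates intermediate reasoning steps) and select the final answer with the highest total weight.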

Beam‑search with PRM – keep the top‑k reasoning paths, expand them, and prune low‑scoring branches.

Monte‑Carlo Tree Search (MCTS) – four‑step loop (selection, expansion, rollout, back‑propagation) to balance exploration and exploitation of reasoning steps.
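
The first three selection rules in the list above differ only in how they aggregate candidates, which a few lines of Python can make concrete. This is a minimal sketch: the candidate answers and verifier scores are made up, and the weighted variant follows one common formulation (sum the scores per distinct answer), which is an assumption rather than something the article spells out.

```python
from collections import Counter, defaultdict

# Hypothetical candidates: (final answer, verifier score in [0, 1]).
candidates = [
    ("42", 0.30),
    ("42", 0.35),
    ("41", 0.90),
    ("42", 0.40),
    ("40", 0.20),
]

def majority_vote(cands):
    """Self-consistency: the most frequent final answer wins."""
    return Counter(answer for answer, _ in cands).most_common(1)[0][0]

def best_of_n(cands):
    """Best-of-N: the single highest-scoring candidate wins."""
    return max(cands, key=lambda c: c[1])[0]

def weighted_best_of_n(cands):
    """Weighted Best-of-N: sum scores per answer; the highest total wins."""
    totals = defaultdict(float)
    for answer, score in cands:
        totals[answer] += score
    return max(totals, key=totals.get)

print(majority_vote(candidates))       # 42
print(best_of_n(candidates))           # 41
print(weighted_best_of_n(candidates))  # 42
```

With these made‑up numbers the rules disagree: one confident outlier wins plain Best‑N, while the weighted variant lets three moderately scored agreeing answers outweigh it.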

MCTS diagram
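
The four‑step MCTS loop can be sketched on a toy problem. Everything here is an illustrative assumption: "reasoning steps" are the integers 1 to 3, a path counts as correct when its steps sum to a target, and a random rollout stands in for the reward model that a real system would use to score partial reasoning.

```python
import math
import random

ACTIONS = (1, 2, 3)  # toy "reasoning steps"
TARGET = 7           # a path is correct if its steps sum to TARGET
MAX_STEPS = 3

def is_terminal(state):
    return len(state) == MAX_STEPS or sum(state) >= TARGET

def reward(state):
    return 1.0 if sum(state) == TARGET else 0.0

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # sum of rollout rewards seen through this node
        self.untried = [] if is_terminal(state) else list(ACTIONS)

def ucb(child, parent_visits, c=1.4):
    """Mean reward (exploitation) plus a visit-count bonus (exploration)."""
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def mcts(iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes by UCB.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
        # 2. Expansion: add one unexplored child, if any remain.
        if node.untried:
            action = node.untried.pop()
            child = Node(node.state + (action,), parent=node)
            node.children.append(child)
            node = child
        # 3. Rollout: finish the path with random steps.
        state = node.state
        while not is_terminal(state):
            state = state + (rng.choice(ACTIONS),)
        result = reward(state)
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += result
            node = node.parent
    # Read off the most-visited path as the final answer.
    path, node = [], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
        path.append(node.state[-1])
    return path

path = mcts()
print(path, "sums to", sum(path))
```

The exploration constant `c` controls the balance the article mentions: larger values spread visits across more branches, smaller values concentrate on the branches that have paid off so far.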

Modifying the Proposal Distribution

Instead of searching for the best path, the model can be nudged to sample reasoning‑friendly tokens. Simple prompting such as "Let’s think step‑by‑step" reshapes the token distribution toward chain‑of‑thought behavior, but static prompting alone does not guarantee robust reasoning.

prompt engineering

Two main families of distribution‑modification techniques are:

Prompt‑engineering updates that provide examples (in‑context learning) to steer the model.

Training the model to focus on reasoning tokens, often via supervised fine‑tuning on synthetic reasoning data.
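
The prompt‑based family is simple enough to show directly. The helper below is a hypothetical sketch that assembles a few‑shot chain‑of‑thought prompt as plain text; the "Q:/A:" layout and the worked example are assumptions chosen for illustration, not a format prescribed by the article or by any particular model.

```python
def build_cot_prompt(question, examples=()):
    """Assemble a few-shot chain-of-thought prompt as plain text.

    `examples` holds (question, reasoning, answer) tuples shown to the
    model as demonstrations before the real question.
    """
    parts = []
    for ex_question, ex_reasoning, ex_answer in examples:
        parts.append(
            f"Q: {ex_question}\nA: {ex_reasoning} The answer is {ex_answer}."
        )
    # The trailing cue nudges the model toward step-by-step tokens.
    parts.append(f"Q: {question}\nA: Let's think step-by-step.")
    return "\n\n".join(parts)

examples = [
    ("What is 3 + 4 * 2?", "First 4 * 2 = 8, then 3 + 8 = 11.", "11"),
]
prompt = build_cot_prompt("What is 5 + 6 * 3?", examples)
print(prompt)
```

Calling the helper with no examples yields the zero‑shot variant, where only the step‑by‑step cue shapes the distribution.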

STaR and Self‑Teaching

STaR (Self‑Taught Reasoner) lets a base LLM generate its own reasoning data, which is then filtered for correctness and used to fine‑tune the same model. The pipeline has four stages: generating a rationale and an answer, checking the answer for correctness, collecting (question, rationale, answer) triplets from the correct cases, and supervised fine‑tuning on those triplets.
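
The data‑collection part of this loop can be sketched as follows. This is a toy illustration, not the STaR implementation: `toy_model` is a hypothetical stand‑in for an LLM that answers addition questions and is deliberately wrong some of the time, and the fine‑tuning step is left as a comment.

```python
import random

def toy_model(question, rng):
    """Hypothetical stand-in for an LLM: returns (rationale, answer)."""
    a, b = question
    answer = a + b + rng.choice([0, 0, 1])  # off by one about a third of the time
    return f"{a} plus {b} gives {answer}.", answer

def star_round(dataset, rng):
    """One data-collection round: generate, check, keep correct triplets."""
    triplets = []
    for question, gold in dataset:
        rationale, answer = toy_model(question, rng)
        if answer == gold:  # correctness check against the known answer
            triplets.append((question, rationale, answer))
    return triplets

rng = random.Random(0)
dataset = [((a, b), a + b) for a in range(5) for b in range(5)]
triplets = star_round(dataset, rng)

# A real pipeline would now fine-tune the model on `triplets` and repeat.
print(f"kept {len(triplets)} of {len(dataset)} samples")
```

The filter is what makes the loop self‑improving: only rationales that led to a verified answer are fed back as training data.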

STaR pipeline

DeepSeek‑R1 Training Pipeline

DeepSeek‑R1 (671 B parameters) was built through a five‑step process:

Cold‑start – fine‑tune DeepSeek‑V3‑Base on a small high‑quality reasoning dataset (~5 k tokens) to avoid unreadable outputs.

Reasoning‑oriented RL – apply the GRPO (Group Relative Policy Optimization) algorithm with two rule‑based rewards: accuracy (whether the final answer is correct) and format (whether the output uses the required <think> and <answer> tags).

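Rejection sampling & RL‑generated synthetic data – use the RL‑trained model from the previous step to generate synthetic reasoning data, filter it with the reward model to keep about 600 k high‑quality reasoning samples, and add about 200 k non‑reasoning samples.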

Supervised fine‑tuning – train DeepSeek‑V3‑Base on the 800 k synthetic dataset.

RL for all scenarios – a further RL stage that adds helpfulness and harmlessness rewards and asks the model to summarize its reasoning to improve readability.
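
The rule‑based rewards and the "group‑relative" part of GRPO in the RL step above can be illustrated in a few lines. This is a sketch under stated assumptions: the exact tag pattern (including a <think> section), the exact‑match answer check, the unweighted sum of the two rewards, and the use of the population standard deviation are all choices made for this example. It computes only the per‑sample advantages, not the full policy update.

```python
import re
import statistics

def format_reward(completion):
    """1.0 if reasoning and answer are wrapped in the expected tags."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    matched = re.fullmatch(pattern, completion, flags=re.DOTALL)
    return 1.0 if matched else 0.0

def accuracy_reward(completion, gold):
    """1.0 if the text inside <answer> tags exactly matches the gold answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return 1.0 if match and match.group(1).strip() == gold else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample relative to its own group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical completions sampled for the same prompt ("6 * 7 = ?").
group = [
    "<think>6 * 7 = 42</think><answer>42</answer>",
    "<think>6 * 7 = 41</think><answer>41</answer>",
    "The answer is 42",
    "<think>6 + 7 = 13</think><answer>13</answer>",
]
rewards = [accuracy_reward(c, "42") + format_reward(c) for c in group]
advantages = group_relative_advantages(rewards)

print(rewards)  # [2.0, 1.0, 0.0, 1.0]
print([round(a, 2) for a in advantages])
```

Because advantages are computed against the group's own mean, no separate value network is needed to decide which samples were better than expected; the third completion is penalized even though it states the right number, since it earns neither reward under these rules.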

DeepSeek‑R1 pipeline

Because the 671 B model is impractical for consumer hardware, the authors distilled its reasoning ability into smaller models (e.g., Qwen‑32B) by training the student to match the teacher’s token‑distribution on the 800 k high‑quality samples.
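
Matching a teacher's token distribution is often expressed as a KL‑divergence loss between the two models' next‑token probabilities. The sketch below shows that objective in plain Python; it is one common formulation and an assumption here, not a description of the exact loss used for the DeepSeek‑R1 distilled models. The logits are made up, with two token positions and a vocabulary of three.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits):
    """Mean per-position KL(teacher || student) over a token sequence."""
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)

# Made-up logits: 2 token positions, vocabulary of 3 tokens.
teacher = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
student_far = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]     # uniform guesses
student_near = [[1.9, 0.6, -1.0], [0.0, 2.9, 0.3]]   # close to the teacher

print(distillation_loss(teacher, student_far))
print(distillation_loss(teacher, student_near))
```

The loss shrinks toward zero as the student's distribution approaches the teacher's, which is the signal a distillation run would minimize over the training samples.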

distillation diagram

Unsuccessful Attempts

The authors also tried PRM‑based best‑N sampling and MCTS for reasoning, but ran into an excessively large search space, high computational cost, and difficulty training fine‑grained reward models, and ultimately abandoned those approaches.

Future Outlook

Test‑time compute is emerging as a new performance frontier, pushing LLMs toward genuine "thinking" capabilities. As techniques mature, we may see breakthroughs in complex problem solving, scientific discovery, and other high‑impact domains.

"The shift from training‑time to test‑time compute marks a pivotal step toward AI that can reason like humans." – cited from the article’s concluding remarks.
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
