Why Reasoning LLMs Are Redefining AI: From OpenAI’s o1 to Open‑Source DeepSeek‑R1

This article analyzes the evolution of large language model training, the emergence of reasoning‑oriented LLMs, benchmark breakthroughs of closed‑source models like o1 and o3, and how open‑source projects such as DeepSeek‑R1 replicate and extend these advances through reinforcement learning, scaling laws, and model distillation.

Architect

Historical LLM Training Paradigm

Before reasoning models, large language models (LLMs) were trained in two stages: (1) pre‑training on massive internet text, and (2) alignment via supervised fine‑tuning (SFT) and reinforcement learning from human feedback (RLHF). Scaling‑law research showed that larger models trained on more data consistently improve performance.

Using more data to train larger models yields better results.

Reasoning Paradigm

Reasoning models generate a long chain of thought before producing a final answer. The chain acts like a search procedure: the model decomposes the problem, critiques partial solutions, explores alternatives, and only then emits the answer. Two simple ways to increase the compute spent on reasoning are (1) generating longer token sequences and (2) generating multiple candidate outputs and aggregating them (e.g., majority vote or best‑of‑N).

A long chain of thought resembles a search process more than plain text generation.
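
The two aggregation strategies mentioned above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the function names `majority_vote` and `best_of_n` and the toy scoring function are assumptions made for the example, and in practice the candidates would be final answers parsed from sampled model outputs.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among the sampled outputs."""
    if not answers:
        raise ValueError("need at least one answer")
    # most_common(1) yields the (answer, count) pair with the highest count.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that a scoring function rates highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Hypothetical final answers extracted from five sampled outputs.
sampled = ["42", "41", "42", "42", "17"]

voted = majority_vote(sampled)
# Toy scorer (an assumption): prefers answers close to 40.
picked = best_of_n(sampled, score=lambda a: -abs(int(a) - 40))
print(voted, picked)
```

Majority vote needs no extra model but only works when answers can be compared for equality; best‑of‑N needs a scorer (for example a reward model or a verifier) but also applies to free‑form outputs.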

Closed‑Source Reasoning Models (OpenAI)

OpenAI released several reasoning models:

o1 (with o1‑preview and o1‑mini) : excels on verifiable tasks (math, programming). It uses a long chain of thought, and its accuracy improves further with parallel sampling (e.g., majority vote over 64 samples). Benchmark:

AIME 2024: o1 solves 74 % of problems with a single sample, 83 % with majority vote over 64 samples, and 93 % when re‑ranking 1,000 samples, vs. 12 % for GPT‑4o.

o3 / o3‑mini : further improvements. Reported o3 results:

ARC‑AGI: 87.5 % accuracy in the high‑compute setting (first model above the 85 % human‑level threshold).

SWE‑Bench Verified: 71.7 %.

Codeforces: Elo 2727 (roughly top 200 competitive programmers).

o3‑mini offers low/medium/high reasoning modes that control the length of the chain of thought. In high mode it matches or exceeds previously released OpenAI reasoning models, including the full o1, on many benchmarks, while its per‑token price is 63 % lower than that of o1‑mini.

Open‑Source Reasoning Models (DeepSeek)

DeepSeek‑R1‑Zero is the first publicly documented model that learns reasoning ability solely through large‑scale reinforcement learning (no SFT). It uses rule‑based rewards to enforce correctness and output format, avoiding neural reward‑model hacking.

Training Pipeline of DeepSeek‑R1

Cold‑start SFT : fine‑tune the 671B‑parameter MoE base model (DeepSeek‑V3) on a small set of long chain‑of‑thought examples to give the model an initial reasoning template.

Reasoning‑focused RL : apply Group Relative Policy Optimization (GRPO) with two rule‑based rewards (accuracy and format) to improve performance on verifiable tasks.

Rejection‑sampling SFT : generate many candidate trajectories, filter them with quality checks, and use the best samples to build a larger, more diverse SFT dataset.

General RLHF : combine the rule‑based rewards for reasoning data with neural reward models trained on human preferences for general data, aligning the model while preserving reasoning ability.
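
The rejection‑sampling step in the pipeline above amounts to "sample many, keep only what passes a check". The sketch below shows that loop under stated assumptions: `rejection_sample`, `toy_generate`, and `toy_check` are hypothetical names, the generator is a canned stand‑in for a model, and the real quality checks used by DeepSeek are not reproduced here.

```python
import itertools
from typing import Callable, Dict, List


def rejection_sample(
    prompts: List[str],
    generate: Callable[[str], str],
    is_acceptable: Callable[[str, str], bool],
    samples_per_prompt: int = 4,
) -> List[Dict[str, str]]:
    """Sample several completions per prompt and keep only the accepted ones."""
    kept: List[Dict[str, str]] = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)
            if is_acceptable(prompt, completion):
                kept.append({"prompt": prompt, "completion": completion})
    return kept


# Toy stand-ins (assumptions): a "model" that alternates between a right
# and a wrong completion, and a check that accepts only the right one.
_canned = itertools.cycle(["2 + 2 = 4", "2 + 2 = 5"])


def toy_generate(prompt: str) -> str:
    return next(_canned)


def toy_check(prompt: str, completion: str) -> bool:
    return completion.endswith("= 4")


sft_data = rejection_sample(["What is 2 + 2?"], toy_generate, toy_check)
print(len(sft_data))
```

The surviving prompt/completion pairs form the SFT dataset for the next round of fine‑tuning.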

GRPO was chosen because it reduces RL training cost and eliminates the need for a separate critic model.
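
GRPO can drop the critic because it scores each sampled output relative to the other outputs drawn for the same prompt: the advantage is the reward normalized by the group's mean and standard deviation. The sketch below shows only that normalization step, not the full policy update. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four completions sampled from the same prompt (toy values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that beat the group average get a positive advantage and are reinforced; those below it are pushed down. If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.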

Reward Design

Accuracy reward : binary check whether the final answer matches the ground‑truth (e.g., string match for GSM8K or test‑case pass for code).

Format reward : rewards the model for wrapping its reasoning and its final answer in designated tags (e.g., <think> … </think>), so the answer can be extracted and verified with simple rules.
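
Both rewards can be implemented as plain string checks, which is why no neural reward model is needed for verifiable tasks. The following is a minimal sketch, assuming the completion uses `<think>` and `<answer>` tags and that correctness is an exact string match. The regular expression, function names, and 0/1 reward values are illustrative choices, not DeepSeek's actual code.

```python
import re

# Assumed output template: reasoning in <think>, final answer in <answer>.
TEMPLATE = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed tag template, else 0.0."""
    return 1.0 if TEMPLATE.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the extracted answer exactly matches the reference, else 0.0."""
    match = TEMPLATE.match(completion.strip())
    if match is None:
        return 0.0  # an unparseable output cannot be verified
    return 1.0 if match.group(2).strip() == ground_truth.strip() else 0.0


sample = "<think>2 + 2 equals 4.</think><answer>4</answer>"
print(format_reward(sample), accuracy_reward(sample, "4"))
```

For code tasks the exact‑match check would be replaced by running the extracted program against test cases, but the structure (parse, then verify with a rule) stays the same.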

During RL, DeepSeek‑R1‑Zero learns to generate longer chains of thought, self‑reflect, and explore alternative solutions, leading to large gains on benchmarks such as AIME 2024 (single‑sample accuracy rises from 15.6 % to 71.0 %, and reaches 86.7 % with majority voting over 64 samples).

DeepSeek‑R1 (Full Model)

After the four‑stage pipeline, DeepSeek‑R1 achieves performance comparable to OpenAI’s o1 on reasoning tasks and slightly exceeds it on some benchmarks. The model retains the long chain‑of‑thought behavior while producing more readable output than DeepSeek‑R1‑Zero.

Distillation of Reasoning Ability

Because the full DeepSeek‑R1 (671B MoE) is expensive to run, the team distilled its knowledge into smaller dense models (e.g., Qwen‑2.5‑14B) by fine‑tuning them on reasoning data generated by DeepSeek‑R1. The distilled models match or exceed the performance of prior open‑source reasoning models and even outperform o1‑mini on many benchmarks, demonstrating that reasoning patterns can be transferred efficiently.

Key Trends and Open Questions

Long chains of thought provide a controllable compute knob for inference.

Large‑scale RL (often with simple rule‑based rewards) is the primary driver of reasoning capability.

Reasoning models require far less human‑supervised data than standard LLMs.

Distillation is an effective way to obtain smaller, high‑performance reasoning models.

Open challenges include safe training of long chains, balancing general language ability with reasoning strength, optimal use of SFT, preventing over‑thinking, and efficient deployment of reasoning models.
