Do Machines Really Think? Inside Deep Reasoning, Scaling Laws & RLHF for LLMs

This article examines whether large language models truly think, explores the origins of deep reasoning through transformer architectures and scaling laws, reviews chain‑of‑thought and its variants, and analyzes how reinforcement learning from human feedback—including PPO, DPO, and GRPO—helps internalise step‑by‑step reasoning while pointing to future directions such as atomic thought, hierarchical models, and training‑free in‑context knowledge bases.


Machine Thinking

Large language models (LLMs) perform pattern recognition at a scale that captures the dense network of concepts underlying human knowledge. By ingesting massive corpora, they learn meta‑patterns of how humans think and solve problems, effectively internalising a form of “thought”.

Essence and Necessity of Deep Reasoning

Scaling laws (e.g., Scaling Laws for Neural Language Models, arXiv:2001.08361) show a power‑law relationship between model performance, parameter count, data volume, and compute. Beyond a critical size, models exhibit emergent capabilities such as multi‑step reasoning and cross‑task generalisation. Deep reasoning reduces hallucinations by turning inference into an iterative deduction process that starts from axioms, world knowledge, and the query, and converges toward a ground‑truth answer. In practice this is achieved by prompting the model to “think step‑by‑step”, allocating more tokens and compute so it can fully exploit its internal knowledge.
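
From the cited Kaplan et al. paper, each relationship takes the form of a power law in the limiting resource. With N parameters, D tokens of data, and C compute, the reported fits are approximately

L(N) = (N_c / N)^{\alpha_N}, \quad L(D) = (D_c / D)^{\alpha_D}, \quad L(C) = (C_c / C)^{\alpha_C}

with \alpha_N \approx 0.076, \alpha_D \approx 0.095, and \alpha_C \approx 0.05, so test loss falls predictably, but slowly, as each resource is scaled up.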

External Guidance: Chain‑of‑Thought (CoT) and Variants

CoT prompting adds a simple instruction (e.g., “please think step‑by‑step and output each reasoning step”) to transform a direct end‑to‑end mapping into a transparent, multi‑step reasoning process. Key variants include:

Few‑shot, one‑shot, zero‑shot CoT.

Self‑Consistency – generate multiple CoT answers and vote for the most frequent (sketched in code after this list).

Tree of Thoughts (ToT) and Graph of Thoughts (GoT) – extend the linear chain into a graph‑structured search, enabling broader exploration and back‑tracking.

Other extensions such as Faithful (RAG‑based re‑thinking), Auto‑CoT (question clustering), Step‑Back Prompting (abstract before reasoning), and Multimodal CoT (visual augmentation).

All share the core idea of converting a direct mapping into a stepwise deduction.
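
As a concrete illustration of zero‑shot CoT combined with Self‑Consistency, the sketch below assumes a hypothetical `generate(prompt)` helper that wraps whatever LLM API is available; the prompt wording and sample count are illustrative, not prescribed by the papers above.

```python
from collections import Counter

# Zero-shot CoT instruction appended to the user's question (wording is illustrative).
COT_SUFFIX = (
    "\n\nPlease think step by step, output each reasoning step, "
    "and finish with a line starting with 'Answer:'."
)

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder for an LLM call (swap in any chat/completion API)."""
    raise NotImplementedError

def extract_answer(completion: str) -> str:
    # The final answer is whatever follows the last 'Answer:' marker.
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistent_answer(question: str, n_samples: int = 5) -> str:
    # Self-Consistency: sample several CoT traces and vote on the final answer.
    answers = [extract_answer(generate(question + COT_SUFFIX)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```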

Reinforcement Learning from Human Feedback (RLHF)

To make models spontaneously employ deep reasoning, the community adopts RLHF, which typically follows three stages:

Supervised Fine‑Tuning (SFT) – train on high‑quality <instruction, answer> pairs to teach the model to follow human instructions.

Reward Model (RM) construction – add a scalar head to the SFT checkpoint and train it on human preference data so it can assign a reward score to any generated answer (a minimal sketch follows this list).

Reinforcement Learning Fine‑Tuning – use the RM’s scores as reward signals to update the policy via algorithms such as PPO, DPO, or GRPO.
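
As a sketch of stage 2, the snippet below adds a scalar head on top of a generic transformer backbone and trains it with the standard pairwise (chosen vs. rejected) preference loss; the backbone interface follows a Hugging Face‑style `last_hidden_state` convention and is an assumption here, not a fixed requirement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """SFT backbone plus a scalar head that assigns one reward to a whole response."""
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone               # assumed to return .last_hidden_state
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Use the hidden state of the last non-padding token as the sequence summary.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(pooled).squeeze(-1)   # shape: (batch,)

def preference_loss(reward_chosen, reward_rejected):
    # Push the preferred answer's score above the rejected one's.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```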

PPO (Proximal Policy Optimization) involves four models: the actor (policy) and the value model (critic), both initialised from the SFT checkpoint, plus the trained reward model and a frozen reference copy of the SFT model. The actor generates tokens, the reward model scores the complete response, the value model estimates the expected future reward at each token, and the reference model supplies a KL penalty that keeps updates stable. PPO’s clip‑based loss is robust, but holding four full models in GPU memory makes it resource‑intensive.
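
The clip‑based loss and the KL penalty can be sketched per token as follows; log‑probabilities and advantages are assumed to have been computed upstream, and the coefficients are typical defaults rather than fixed values.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO clipped surrogate objective over a batch of tokens."""
    ratio = torch.exp(logp_new - logp_old)                        # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # maximise the surrogate

def kl_shaped_reward(reward, logp_policy, logp_ref, beta: float = 0.1):
    # Penalise drift from the frozen reference model, as in RLHF-style PPO.
    return reward - beta * (logp_policy - logp_ref)
```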

DPO (Direct Preference Optimization) removes the explicit reward model and directly optimises the policy against human preference pairs. It is computationally cheaper but highly sensitive to the quantity and quality of preference data.
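
The DPO objective from the cited Rafailov et al. paper reduces to a logistic loss over log‑probability margins between the policy and the frozen reference; a minimal sketch over a batch of preference pairs:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Inputs are summed log-probabilities of whole responses under the policy and the reference."""
    policy_margin = logp_chosen - logp_rejected
    reference_margin = ref_logp_chosen - ref_logp_rejected
    # Prefer the chosen response by more than the reference model already does.
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()
```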

GRPO (Group Relative Policy Optimization) discards the value model and computes a “group‑wise” advantage across multiple sampled outputs, reducing memory usage by ~30‑40 % and enabling larger batch sizes.
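
The group‑wise advantage amounts to normalising each sampled output’s reward against its own group, which is what removes the need for a learned value model; a minimal sketch following the DeepSeekMath formulation:

```python
import torch

def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, samples_per_prompt), one group of sampled outputs per row."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Advantage of each sample is its reward standardised within its own group.
    return (rewards - mean) / (std + eps)
```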

Emerging Directions

Atomic Thought

Recent work proposes breaking reasoning into minimal, functionally coherent units called <atom‑think>. During RL the model is rewarded for autonomously generating useful atomic steps (e.g., <OBSERVATION>, <HYPOTHESIS_TESTING>), encouraging logical consistency and mitigating “correct answer, flawed reasoning”.
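
One way to picture the fine‑grained reward is a rule‑based check over the atomic tags present in a generated trace; the tag names and bonus weights below are hypothetical illustrations, not the cited paper’s exact scheme.

```python
import re

# Hypothetical atomic-thought tags and per-tag bonuses (illustrative values only).
ATOM_BONUS = {"OBSERVATION": 0.1, "HYPOTHESIS_TESTING": 0.2, "REFLECTION": 0.1}

def atomic_thought_reward(trace: str, answer_correct: bool) -> float:
    """Outcome reward plus small bonuses for well-formed atomic reasoning steps."""
    reward = 1.0 if answer_correct else 0.0
    for tag, bonus in ATOM_BONUS.items():
        # A step only counts if its tag is properly opened and closed.
        if re.search(rf"<{tag}>.*?</{tag}>", trace, flags=re.DOTALL):
            reward += bonus
    return reward
```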

Hierarchical Reasoning Model (HRM)

HRM introduces two recurrent loops operating at different time scales: a high‑level loop for strategic, abstract planning and a low‑level loop for fast, concrete actions. Training uses a one‑step gradient approximation, reducing memory from O(T) to O(1). A Q‑learning‑style decision module chooses between “halt” and “continue” after each high‑level cycle.
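
A heavily simplified sketch of the two‑timescale recurrence and the halt/continue decision follows; the module choices, the greedy halting rule, and the way the one‑step gradient is mimicked (running all but the last inner update without gradients) are illustrative assumptions rather than the paper’s exact design.

```python
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    """Illustrative two-loop recurrence: a slow planning state steers a fast action loop."""
    def __init__(self, dim: int, low_steps: int = 4):
        super().__init__()
        self.high = nn.GRUCell(dim, dim)    # high-level, abstract planning loop
        self.low = nn.GRUCell(dim, dim)     # low-level, fast concrete loop
        self.halt_head = nn.Linear(dim, 2)  # Q-style scores for "halt" vs "continue"
        self.low_steps = low_steps

    def forward(self, x, z_high, z_low, max_cycles: int = 8):
        for _ in range(max_cycles):
            # Approximate the one-step gradient: unroll most inner steps without gradients...
            with torch.no_grad():
                for _ in range(self.low_steps - 1):
                    z_low = self.low(x + z_high, z_low)
            # ...and keep gradients only for the final low- and high-level updates.
            z_low = self.low(x + z_high, z_low)
            z_high = self.high(z_low, z_high)
            q_halt, q_continue = self.halt_head(z_high).unbind(-1)
            if (q_halt > q_continue).all():   # greedy halt decision after each cycle
                break
        return z_high, z_low
```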

Tiny Recursive Model (TRM)

TRM simplifies HRM to a single tiny recurrent module that simultaneously outputs a latent variable and an initial answer. With only ~7 M parameters, TRM outperforms much larger models on ARC‑AGI benchmarks, demonstrating extreme parameter efficiency through recursive reasoning.
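
In the same spirit, the recursive core can be pictured as one small network repeatedly refining a latent state and an answer embedding together; the shapes and update rule here are illustrative rather than the paper’s exact design.

```python
import torch
import torch.nn as nn

class TinyRecursiveSketch(nn.Module):
    """One small module applied recursively to jointly refine (latent, answer)."""
    def __init__(self, dim: int):
        super().__init__()
        self.core = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, 2 * dim)
        )

    def forward(self, x, z, y, n_recursions: int = 6):
        # x: question embedding, z: latent reasoning state, y: current answer embedding.
        for _ in range(n_recursions):
            z, y = self.core(torch.cat([x, z, y], dim=-1)).chunk(2, dim=-1)
        return y   # refined answer after repeated recursive updates
```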

Training‑Free GRPO & In‑Context Knowledge Bases

Training‑Free GRPO treats the RL target as an external reasoning knowledge base rather than the policy itself. Multiple model outputs are generated, a lightweight (often rule‑based) reward model scores them, and a “semantic relative advantage” is computed via LLM‑generated summaries. These summaries are stored in a controllable knowledge base that can be injected into future prompts, enabling low‑cost, domain‑specific performance gains without further model fine‑tuning.
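
The loop described above can be sketched as follows; `generate` and `rule_based_reward` are placeholders for whatever LLM call and scoring rule are actually used, and the summarisation prompt is an assumption for illustration.

```python
knowledge_base: list[str] = []   # controllable store of distilled reasoning experience

def generate(prompt: str) -> str:
    raise NotImplementedError     # placeholder LLM call

def rule_based_reward(output: str) -> float:
    raise NotImplementedError     # e.g., exact-match or format checks

def training_free_grpo_step(question: str, n_samples: int = 4) -> str:
    # Inject accumulated experience into the prompt instead of updating model weights.
    prompt = "Useful experience:\n" + "\n".join(knowledge_base) + f"\n\nQuestion: {question}"
    outputs = [generate(prompt) for _ in range(n_samples)]
    scores = [rule_based_reward(o) for o in outputs]
    best = outputs[scores.index(max(scores))]
    worst = outputs[scores.index(min(scores))]
    # "Semantic relative advantage": have the LLM summarise why the best output won.
    lesson = generate(
        "Compare the two answers below and state, in one sentence, what made the better one work.\n"
        f"BETTER:\n{best}\n\nWORSE:\n{worst}"
    )
    knowledge_base.append(lesson)  # a reusable textual update instead of a gradient step
    return best
```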

Conclusion

Scaling laws, deep reasoning techniques, and RLHF together form a dominant pipeline for building capable LLMs. Open questions remain about whether future AGI will embed deep reasoning intrinsically or continue to rely on external prompting and alignment mechanisms. Balancing efficiency, interpretability, and alignment will be crucial as the field advances.

References

Kaplan, Jared et al. “Scaling Laws for Neural Language Models.” arXiv:2001.08361. URL: https://arxiv.org/abs/2001.08361

Wei, Jason et al. “Chain of Thought Prompting Elicits Reasoning in Large Language Models.” arXiv:2201.11903. URL: https://arxiv.org/abs/2201.11903

Wang, Xuezhi et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” arXiv:2203.11171. URL: https://arxiv.org/abs/2203.11171

Yao, Shunyu et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” arXiv:2305.10601. URL: https://arxiv.org/abs/2305.10601

Besta, Maciej et al. “Graph of Thoughts: Solving Elaborate Problems with Large Language Models.” arXiv:2308.09687. URL: https://arxiv.org/abs/2308.09687

Schulman, John et al. “Proximal Policy Optimization Algorithms.” arXiv:1707.06347. URL: https://arxiv.org/abs/1707.06347

Rafailov, Rafael et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” arXiv:2305.18290. URL: https://arxiv.org/abs/2305.18290

Shao, Zhihong et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv:2402.03300. URL: https://arxiv.org/abs/2402.03300

Deng, Yong et al. “Atom‑Searcher: Enhancing Agentic Deep Research via Fine‑Grained Atomic Thought Reward.” arXiv:2508.12800. URL: https://arxiv.org/abs/2508.12800

Wang, Guan et al. “Hierarchical Reasoning Model.” arXiv:2506.21734. URL: https://arxiv.org/abs/2506.21734

Jolicoeur‑Martineau, Alexia. “Less is More: Recursive Reasoning with Tiny Networks.” arXiv:2510.04871. URL: https://arxiv.org/abs/2510.04871

Cai, Yuzheng et al. “Training‑Free Group Relative Policy Optimization.” arXiv:2510.08191. URL: https://arxiv.org/abs/2510.08191

Ouyang, Siru et al. “ReasoningBank: Scaling Agent Self‑Evolving with Reasoning Memory.” arXiv:2509.25140. URL: https://arxiv.org/abs/2509.25140

Hendrycks, Dan et al. “A Definition of AGI.” arXiv:2510.18212. URL: https://arxiv.org/abs/2510.18212
