Why Intermediate Tokens Matter: Denny Zhou’s Deep Insights into LLM Reasoning

This article distills Denny Zhou’s Stanford CS25 lecture, explaining how large language models achieve reasoning through intermediate token generation, chain‑of‑thought prompting, self‑consistency, reinforcement‑learning fine‑tuning, and answer aggregation, while highlighting theoretical foundations and practical breakthroughs.

All LLM enthusiasts should know these insights.

This may be the clearest, most accessible explanation yet of how large language models (LLMs) reason.

Recently, Denny Zhou, chief scientist and research director at Google DeepMind, shared deep insights on LLM reasoning during Stanford's CS25 course.

As a leading figure in AI, Zhou systematically walked through how LLMs reason and how that reasoning can be optimized, covering core principles and recent advances.

Four Key Points Summarized by Denny Zhou

Reasoning in LLMs simply means generating a series of intermediate tokens before the final answer; whether this resembles human reasoning is beside the point. What matters is that generating many intermediate tokens makes Transformers dramatically more powerful without increasing model size.

Pre‑trained models possess inference ability even without fine‑tuning, yet the desired output often does not appear at the top of the distribution, so greedy decoding fails to surface it.

Prompting techniques (e.g., chain-of-thought exemplars or "let's think step by step") and supervised fine-tuning were once the common ways to trigger reasoning; today, reinforcement-learning fine-tuning (RL-FT), pioneered at Google by Jonathan Lai, is the most powerful method.

Aggregating multiple responses instead of relying on a single output dramatically improves LLM reasoning performance.

Zhou co‑founded the Reasoning Team at Google Brain, now part of DeepMind, focusing on building LLMs with strong reasoning capabilities to advance general AI.

His research emphasizes chain‑of‑thought prompting, self‑consistency, and LLM optimization, accumulating over 83,000 citations on Google Scholar.

He also co‑organized the CoLM conference, chaired its 2024 edition, and received the 2022 Google Research Tech Impact Award and WSDM Test‑of‑Time Award.

Zhou's talk was part of Stanford's CS25 "Transformers United V5" course, one of the university's most popular and most discussed classes. Past speakers include Geoffrey Hinton, Ashish Vaswani, and Andrej Karpathy, and the recorded lectures attract millions of views on YouTube, covering everything from GPT-era breakthroughs to applications in art, biology, and robotics.

Course page: https://web.stanford.edu/class/cs25/

Why Are Intermediate Tokens Crucial?

Zhou argues that any problem solvable by a Boolean circuit of size T can be solved by a constant-size Transformer that generates on the order of T intermediate tokens, avoiding the need for ever-larger models.

This theory links circuit size (number of logic gates) to problem‑solving capacity, showing that generating intermediate tokens enables fixed‑size models to handle extremely large computational tasks.
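
To make the intuition concrete, here is a toy illustration (my own construction, not taken from the lecture): a fixed loop that evaluates exactly one logic gate per step can still evaluate an arbitrarily large circuit, provided it may write down one intermediate result per gate, much like a fixed-size model emitting one intermediate token per reasoning step.

```python
# Toy illustration: constant work per step, arbitrarily large circuits,
# as long as intermediate results (the "tokens") may be written out.
def evaluate_circuit(inputs, gates):
    """inputs: dict wire -> bool; gates: list of (out_wire, op, in_a, in_b)."""
    wires = dict(inputs)            # everything computed so far (the "context")
    trace = []                      # intermediate results, one "token" per gate
    for out, op, a, b in gates:     # constant work per step
        if op == "AND":
            wires[out] = wires[a] and wires[b]
        elif op == "OR":
            wires[out] = wires[a] or wires[b]
        elif op == "XOR":
            wires[out] = wires[a] != wires[b]
        trace.append(f"{out}={int(wires[out])}")
    return wires[gates[-1][0]], trace

# (x AND y) XOR z, evaluated one gate at a time.
result, trace = evaluate_circuit(
    {"x": True, "y": True, "z": False},
    [("g1", "AND", "x", "y"), ("g2", "XOR", "g1", "z")],
)
print(result, trace)  # True ['g1=1', 'g2=1']
```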

Technical Details of the Reasoning Process

Language models already have reasoning ability; the key lies in the decoding process.

Example: a simple math problem, "I have 3 apples, my dad has 2 more than me. How many apples total?" Greedy decoding may jump straight to the wrong answer "5 apples" (the correct total is 8). By considering multiple candidate continuations and using chain-of-thought decoding, the model can surface a correct step-by-step solution.

Chain-of-thought decoding involves two steps: (1) go beyond greedy decoding and examine alternative top-k candidates; (2) select the candidate in which the model is most confident about its final answer.
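
Below is a minimal sketch of this two-step procedure, assuming a Hugging Face causal LM; the model name ("gpt2" as a stand-in), the branching width, and the confidence proxy (average gap between the top-1 and top-2 token probabilities) are illustrative choices rather than the exact recipe from the chain-of-thought decoding paper.

```python
# Sketch of chain-of-thought decoding with a Hugging Face causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: I have 3 apples, my dad has 2 more than me. How many apples total?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

# Step 1: instead of taking only the greedy first token, branch on the top-k.
with torch.no_grad():
    first_logits = model(**inputs).logits[0, -1]
top_first_tokens = torch.topk(first_logits, k=5).indices

candidates = []
for token in top_first_tokens:
    branch = torch.cat([inputs.input_ids[0], token.view(1)]).unsqueeze(0)
    # Continue each branch greedily, keeping per-step scores.
    out = model.generate(branch, max_new_tokens=60, do_sample=False,
                         output_scores=True, return_dict_in_generate=True)
    text = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    # Step 2: score the branch by how confident the model is in its tokens,
    # here the average gap between the top-1 and top-2 probabilities per step.
    gaps = []
    for step_scores in out.scores:
        top2 = torch.softmax(step_scores[0], dim=-1).topk(2).values
        gaps.append((top2[0] - top2[1]).item())
    candidates.append((sum(gaps) / len(gaps), text))

best_confidence, best_text = max(candidates)
print(best_text)
```

Branching only at the first token keeps the cost at roughly k greedy decodes while still escaping the single greedy path.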

Simple prompting like “let’s think step‑by‑step” can also induce chain‑of‑thought reasoning without complex computation.
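
In code, the zero-shot variant is nothing more than appending the trigger phrase to the question (a toy illustration):

```python
# Zero-shot chain-of-thought prompting: the only change is the trailing trigger phrase.
question = "I have 3 apples, my dad has 2 more than me. How many apples total?"
prompt = f"Q: {question}\nA: Let's think step by step."
# The model is then expected to write out intermediate steps before the final answer.
```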

Supervised fine‑tuning (SFT) collects human‑annotated question‑answer pairs with step‑by‑step solutions, maximizing the likelihood of human solutions during training.
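
A minimal SFT sketch on one such annotated example, assuming a Hugging Face causal LM; the model name, learning rate, and the single toy (question, solution) pair are placeholders for a real dataset.

```python
# Minimal supervised fine-tuning sketch on step-by-step solutions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

data = [
    ("I have 3 apples, my dad has 2 more than me. How many apples total?",
     "Dad has 3 + 2 = 5 apples, so together we have 3 + 5 = 8 apples."),
]

model.train()
for question, solution in data:
    text = f"Q: {question}\nA: {solution}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt")
    # Maximize the likelihood of the human-written step-by-step solution:
    # standard next-token cross-entropy with the inputs reused as labels.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```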

However, SFT's generalization is limited; scaling the data and adopting RL-FT improve performance.

Key lesson: do not blindly scale model size without the right direction.

Improving SFT generalization involves correcting human annotation errors and recognizing that machine‑generated data can sometimes surpass human‑crafted data.

Self‑Improvement Loop

Self-improvement (self-boost) lets the model generate its own training data: the model samples step-by-step solutions to a problem, keeps only those that reach the correct answer, and then maximizes the likelihood of the kept solutions, a process known as rejection sampling.
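
Sketched as code, one round of this loop might look like the following; sample_solutions, extract_answer, and fine_tune are hypothetical callables supplied by the caller, not a specific library API.

```python
# One round of self-improvement via rejection sampling (sketch only).
def self_improve_round(model, problems, sample_solutions, extract_answer, fine_tune, k=8):
    """problems: list of (question, gold_answer) pairs with known answers."""
    kept = []
    for question, gold_answer in problems:
        # Sample several chains of thought per question at non-zero temperature.
        for solution in sample_solutions(model, question, n=k, temperature=0.8):
            # Reject any sample whose final answer is wrong; keep the rest.
            if extract_answer(solution) == gold_answer:
                kept.append((question, solution))
    # SFT step: maximize the likelihood of the model's own correct solutions.
    return fine_tune(model, kept)
```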

Research such as “STaR: Bootstrapping Reasoning With Reasoning” demonstrates that once a better model generates responses or training data, it can iteratively improve itself.

Recent works like “ReFT: Reasoning with Reinforced Fine‑Tuning” (arXiv, Jan 2024) and OpenAI’s o1 highlight the rise of RL‑FT.

Why can machine-generated training data be better than human data? Because it lets training directly optimize the metric we actually care about, improving the model's objective through gradient updates.

Marginalizing over multiple sampled responses and selecting the most frequent final answer, known as self-consistency, yields substantial gains.
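
A minimal self-consistency sketch; generate_answer stands in for any sampling call to an LLM, and the regex-based final-answer extraction is a simplifying assumption for math-style questions.

```python
# Self-consistency: sample several reasoning paths, majority-vote on the answer.
import re
from collections import Counter

def extract_final_answer(text):
    """Use the last number in the response as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

def self_consistency(generate_answer, question, n=16):
    answers = []
    for _ in range(n):
        # Sample a full reasoning path at non-zero temperature.
        response = generate_answer(question, temperature=0.7)
        answer = extract_final_answer(response)
        if answer is not None:
            answers.append(answer)
    # Majority vote over final answers; the reasoning text itself is discarded.
    return Counter(answers).most_common(1)[0][0] if answers else None
```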

Retrieval vs. Reasoning

Distinguishing retrieval from reasoning is often difficult, and the two complement each other.

Providing relevant example problems (retrieval) helps the model recall and apply similar reasoning steps to new tasks.
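
One simple way to implement this is sketched below using sentence-transformers for embedding similarity; the encoder name and the toy example bank are assumptions for illustration.

```python
# Retrieval-augmented prompting: prepend the most similar solved problems
# as few-shot demonstrations before asking the new question.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder encoder

example_bank = [
    ("A book costs 7 dollars and a pen costs 2 dollars less. What is the total cost?",
     "The pen costs 7 - 2 = 5 dollars, so the total is 7 + 5 = 12 dollars."),
    ("I have 4 marbles and my friend has twice as many. How many marbles in total?",
     "My friend has 2 * 4 = 8 marbles, so together we have 4 + 8 = 12 marbles."),
]

def build_prompt(question, k=1):
    # Rank stored problems by cosine similarity to the new question.
    q_emb = encoder.encode(question, convert_to_tensor=True)
    bank_emb = encoder.encode([q for q, _ in example_bank], convert_to_tensor=True)
    scores = util.cos_sim(q_emb, bank_emb)[0]
    top = scores.topk(min(k, len(example_bank))).indices.tolist()
    # Prepend the retrieved worked examples as few-shot demonstrations.
    shots = "\n\n".join(f"Q: {example_bank[i][0]}\nA: {example_bank[i][1]}" for i in top)
    return f"{shots}\n\nQ: {question}\nA:"

print(build_prompt("I have 3 apples, my dad has 2 more than me. How many apples total?"))
```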

Model ensembles—generating answers from multiple models and selecting the most consistent result—also improve reliability.

Final Takeaways from Denny Zhou

LLM reasoning is always better than no reasoning; RL-FT outperforms SFT; aggregating multiple sampled answers beats relying on a single response, though it costs more compute; and combining retrieval with reasoning yields the best results.

Future breakthroughs should focus on building real applications rather than saturating benchmark tests.

Quoting Richard Feynman's "Truth is always simpler than you think," Zhou urged researchers to keep their work concise and clear.

Source: Machine Heart

Tags: LLM, Reasoning, Chain of Thought, Reinforcement Learning