Why Intermediate Tokens Make LLMs Reason Better: Insights from Denny Zhou

The article analyzes Denny Zhou's Stanford CS25 lecture on large language model reasoning, explaining how intermediate token generation, chain‑of‑thought prompting, self‑consistency, reinforcement‑learning fine‑tuning, and answer aggregation together unlock powerful reasoning capabilities beyond traditional greedy decoding.

Data Party THU

Background

Denny Zhou (Google DeepMind) gave a lecture in Stanford CS25 that presented a unified view of reasoning in large language models (LLMs). The lecture defines reasoning as the generation of a sequence of intermediate tokens before the final answer, regardless of whether that process resembles human logical thinking.

Key Technical Insights

Intermediate‑token reasoning: A constant‑size transformer can solve any problem solvable by a Boolean circuit, provided it is allowed to emit enough intermediate tokens. The number of tokens needed grows with the number of logic gates in the circuit, so harder problems call for longer token chains rather than a larger model.

Latent reasoning in pretrained models: Pre‑trained LLMs can already reason, but greedy decoding often hides this ability. Greedy decoding commits to the single most probable token at every step, and the response that reasons its way to the correct answer frequently does not begin with that token.

Prompt engineering: Chain‑of‑thought (CoT) prompts and "let’s think step‑by‑step" instructions surface the latent reasoning through the prompt alone, while self‑consistency does so by sampling several responses and aggregating their answers. None of these methods modifies the model's weights.

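Learning‑based improvements: Supervised fine‑tuning (SFT) maximises the likelihood of human‑written solution traces but scales poorly. Reinforcement‑learning fine‑tuning, for example the ReFT (Reasoning with Reinforced Fine‑Tuning) paradigm, instead optimises a task‑specific reward computed by automatically checking the final answer. The model generates its own candidate solutions, keeps the ones that pass the check (rejection sampling), and improves iteratively without additional human‑written traces. RLHF is a related technique, but its reward comes from human preference labels rather than from an automatic correctness check.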

Aggregation of multiple samples: Sampling many reasoning paths and selecting the most frequent final answer (marginalisation) yields substantially higher accuracy than a single greedy decode, at the cost of higher compute.
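
The gap between a single greedy decode and aggregation over samples can be illustrated with a toy distribution. The sketch below is not from the lecture: the question, the candidate responses, and their probabilities are invented for illustration, and picking the single most probable response is only a stand‑in for greedy decoding.

```python
from collections import defaultdict

# Hypothetical distribution over complete responses to "What is 17 + 26?".
# Each entry is (reasoning path, final answer, probability); all values are made up.
RESPONSES = [
    ("33.", "33", 0.30),
    ("17 + 26 = 17 + 20 + 6 = 43.", "43", 0.25),
    ("7 + 6 = 13, carry 1, 1 + 2 + 1 = 4, so 43.", "43", 0.25),
    ("20 + 26 = 46, minus 3 is 43.", "43", 0.15),
    ("17 + 26 = 42.", "42", 0.05),
]


def most_likely_response(responses):
    """Return the single highest-probability response (stand-in for greedy decoding)."""
    return max(responses, key=lambda r: r[2])


def marginal_answer_probs(responses):
    """Sum the probability of every reasoning path that ends in the same answer."""
    totals = defaultdict(float)
    for _path, answer, prob in responses:
        totals[answer] += prob
    return dict(totals)


def most_likely_answer(responses):
    """Return the answer with the largest total probability across all paths."""
    totals = marginal_answer_probs(responses)
    return max(totals, key=totals.get)


if __name__ == "__main__":
    print("single best response:", most_likely_response(RESPONSES)[1])
    print("marginalised answer: ", most_likely_answer(RESPONSES))
```

In this toy setting the most probable single response ends in 33, while the answer carrying the most total probability is 43. Sampling many responses and counting final answers estimates that second quantity.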

Theoretical Model

The presented theorem states that any Boolean circuit of size G (number of gates) can be simulated by a constant‑size transformer that emits O(G) intermediate tokens. The expressive power of LLM reasoning is therefore governed by how long a token chain the model may generate, not by model depth.
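
One way to build intuition for the "one token per gate" budget is to evaluate a circuit gate by gate and write down each intermediate value. The sketch below is only an analogy, not the transformer construction behind the theorem; the gate encoding, the function name `evaluate_with_trace`, and the parity example are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# A gate is (output name, operation, first input name, second input name).
Gate = Tuple[str, str, str, str]

OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def evaluate_with_trace(inputs: Dict[str, int], gates: List[Gate]) -> Tuple[int, List[str]]:
    """Evaluate gates in order, emitting one 'intermediate token' per gate.

    Gates must be listed so that every input is defined before it is used.
    The value of the last gate is treated as the circuit output.
    """
    values = dict(inputs)
    trace = []
    for out, op, a, b in gates:
        values[out] = OPS[op](values[a], values[b])
        trace.append(f"{out}={values[out]}")
    return values[gates[-1][0]], trace


# 3-bit parity written as a chain of two XOR gates (G = 2).
PARITY3: List[Gate] = [
    ("g1", "XOR", "x1", "x2"),
    ("g2", "XOR", "g1", "x3"),
]

if __name__ == "__main__":
    result, trace = evaluate_with_trace({"x1": 1, "x2": 0, "x3": 1}, PARITY3)
    print(trace, "->", result)
```

The trace has exactly one entry per gate, so its length grows linearly with G while the evaluator itself stays the same size.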

Training Paradigms

Supervised Fine‑Tuning (SFT): Collect human‑annotated question‑answer pairs with step‑by‑step reasoning traces. Optimise the log‑likelihood of the full token sequence. Scaling data and model size is required for further gains.

Reinforced Fine‑Tuning (ReFT): Define a reward function that measures answer correctness (e.g., exact match, numeric tolerance). Use policy‑gradient methods to maximise the expected reward, optionally training on self‑generated data filtered by rejection sampling.

Self‑Consistency: Sample k reasoning paths with a temperature > 0, extract the final answers, and return the most common one. This approximates marginalising over latent intermediate tokens.
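
The correctness reward, the rejection‑sampling filter, and the self‑consistency vote described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lecture's implementation: the `Sampler` interface, the helper names, and the canned sampler are hypothetical, and the policy‑gradient update itself is not shown.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# A sampler takes a question and returns (reasoning trace, final answer).
# It stands in for one temperature > 0 sample from a model (hypothetical interface).
Sampler = Callable[[str], Tuple[str, str]]


def reward(predicted: str, reference: str, rel_tol: float = 1e-6) -> float:
    """Return 1.0 on an exact string match or a numeric match within rel_tol, else 0.0."""
    if predicted.strip() == reference.strip():
        return 1.0
    try:
        close = math.isclose(float(predicted), float(reference), rel_tol=rel_tol)
    except ValueError:
        return 0.0  # at least one side is not a number
    return 1.0 if close else 0.0


def rejection_sample(question: str, reference: str, sampler: Sampler, k: int) -> List[Tuple[str, str]]:
    """Draw k samples and keep (question, trace) pairs whose answer earns full reward."""
    kept = []
    for _ in range(k):
        trace, answer = sampler(question)
        if reward(answer, reference) == 1.0:
            kept.append((question, trace))
    return kept


def self_consistency(question: str, sampler: Sampler, k: int) -> str:
    """Draw k samples and return the most common final answer."""
    answers = [sampler(question)[1] for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]


def canned_sampler(outputs: Iterable[Tuple[str, str]]) -> Sampler:
    """Build a deterministic sampler that replays fixed outputs (for demonstration only)."""
    it = iter(outputs)

    def sample(_question: str) -> Tuple[str, str]:
        return next(it)

    return sample
```

With a real model, `sampler` would wrap a generation call followed by answer extraction. The pairs returned by `rejection_sample` are the self‑generated data that a later fine‑tuning step could train on.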

Practical Examples

Arithmetic: The question "I have 3 apples, my father has 2 more apples than me, how many do we have in total?" is often answered incorrectly by greedy decoding (e.g., Llama, Qwen), which tends to reply "5". The father has 3 + 2 = 5 apples, so the total is 3 + 5 = 8. Using a CoT prompt such as "Let's think step‑by‑step", or sampling 20 reasoning traces and applying self‑consistency, produces the correct answer 8.

Retrieval‑augmented reasoning: A prompt that asks the model to first recall a similar geometry problem lets it retrieve a relevant solution sketch and then complete the new problem. This demonstrates the synergy between retrieving related knowledge and token‑by‑token reasoning.
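
Both examples come down to how the prompt is built and how the final answer is read back out of the generated text. The helpers below are a sketch under stated assumptions: apart from the "Let's think step by step" phrase, the exact prompt wording is invented, and taking the last number in the response as the answer is a simplifying heuristic rather than a method from the lecture.

```python
import re
from typing import Optional


def zero_shot_cot_prompt(question: str) -> str:
    """Append the step-by-step instruction to a question."""
    return f"Q: {question}\nA: Let's think step by step."


def analogical_prompt(question: str) -> str:
    """Ask the model to recall a related problem before solving (wording is an assumption)."""
    return (
        f"Problem: {question}\n"
        "First recall a related problem and how it is solved. "
        "Then solve the problem above step by step."
    )


_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_number(response: str) -> Optional[str]:
    """Return the last number in the response, or None if it contains no number."""
    matches = _NUMBER.findall(response)
    return matches[-1] if matches else None


if __name__ == "__main__":
    print(zero_shot_cot_prompt("What is 12 + 30?"))
    print(extract_final_number("12 + 30 = 42. The answer is 42."))
```

The extracted strings are what a majority vote over sampled traces would count.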

Future Directions

The speaker advocates moving beyond benchmark‑centric evaluation toward real‑world applications. Two promising research avenues are:

Self‑improvement loops: generate new problem‑solution pairs, fine‑tune on them, and repeat.

Deeper integration of retrieval mechanisms with token‑level reasoning to handle tasks lacking a single verifiable answer.

References

Lecture slides: https://dennyzhou.github.io/LLM-Reasoning-Stanford-CS-25.pdf

Video recording: https://www.youtube.com/watch?v=ebnX5Ur1hBk&list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM&index=38

Twitter thread: https://x.com/denny_zhou/status/1948499173986201915

Lecture cover image
Intermediate token illustration
Chain‑of‑thought decoding example
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
