Why Long CoT and In‑Context RL Are the Next Frontier for LLMs
The article analyses recent breakthroughs such as OpenAI's o1, Long CoT, and test‑time search, arguing that enabling LLMs to perform self‑critique and reinforcement learning with long output sequences is essential for future AI performance, while warning against overly structured workflows.
Key observations of OpenAI o1
The model is allowed to make mistakes during reasoning.
It repeatedly reflects on its output and retries, often inserting phrases like "but, wait…".
Its reasoning is unconstrained: it can restate the problem, draw analogies, and decompose tasks.
Long Context vs. Long Chain‑of‑Thought (Long CoT)
Long Context refers to handling very long input sequences with manageable compute cost (e.g., via efficient prefilling systems such as Mooncake). Long CoT, by contrast, generates long output sequences, which is far more expensive and slower. The argument is that performance, not cost, should be the primary driver: extending the reasoning horizon is worthwhile despite the higher compute bill.
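To make the cost asymmetry concrete, here is a back-of-envelope sketch. The throughput numbers are illustrative assumptions, not measurements; the point is that prefill processes all input tokens in parallel while decoding emits output tokens one at a time.

```python
# Back-of-envelope latency comparison: long inputs vs. long outputs.
# Throughput figures below are illustrative assumptions, not benchmarks.
PREFILL_TOKENS_PER_S = 10_000   # prefill is parallel over input tokens
DECODE_TOKENS_PER_S = 50        # decoding is sequential, one token per step

def latency_seconds(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / PREFILL_TOKENS_PER_S + output_tokens / DECODE_TOKENS_PER_S

# Long Context: 100k-token input, 1k-token answer.
print(latency_seconds(100_000, 1_000))    # ~30 s
# Long CoT: 1k-token input, 100k-token reasoning trace.
print(latency_seconds(1_000, 100_000))    # ~2,000 s
```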
Test‑time search and self‑critique
Noam Brown emphasizes that models should be able to perform autonomous test‑time search, similar to the search component in AlphaGo. Hyung Won Chung argues that imposing a fixed reasoning structure limits the model; instead, the model should discover its own reasoning patterns and be incentivized to self‑critique.
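As a rough illustration of what autonomous test-time search can look like, the sketch below performs best-of-N sampling scored by self-critique. `generate(prompt)` and `critique(prompt, trace)` are hypothetical model calls standing in for any LLM backend, not a specific API.

```python
# Minimal best-of-N test-time search sketch: sample several reasoning traces
# and keep the one the model itself rates highest via self-critique.
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              critique: Callable[[str, str], float],
              n: int = 16) -> str:
    candidates = [generate(prompt) for _ in range(n)]           # hypothetical sampler
    scored = [(critique(prompt, trace), trace) for trace in candidates]
    return max(scored, key=lambda pair: pair[0])[1]             # highest self-assigned score
```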
In‑Context Reinforcement Learning (RL) formulation
Generating a Long CoT can be viewed as an in-context RL problem. Each generation episode forms a trajectory s₁, a₁, r₁, a₂, r₂, a₃, r₃, …, where s₁ is the initial context (the problem statement), each aᵢ is an action (a reasoning step), each rᵢ is a reward derived from the model's own reflection (self-critique), and the state at step i is simply the context accumulated so far. This mirrors the formulation in the in-context RL paper (arXiv:2210.14215).
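A minimal sketch of this formulation, assuming hypothetical `propose_step` and `self_critique` model calls: the growing context plays the role of the state, each reasoning step is an action, and the self-critique score is the reward appended back into the context.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    action: str     # a_i: one reasoning step emitted by the model
    reward: float   # r_i: score derived from the model's own reflection

@dataclass
class Trajectory:
    problem: str                                     # s_1: the initial context
    steps: list[Step] = field(default_factory=list)

    def context(self) -> str:
        """The state at step i: everything generated so far."""
        parts = [self.problem]
        for step in self.steps:
            parts.append(step.action)
            parts.append(f"(self-critique: {step.reward:+.1f})")
        return "\n".join(parts)

def rollout(problem: str,
            propose_step: Callable[[str], str],
            self_critique: Callable[[str, str], float],
            max_steps: int = 16) -> Trajectory:
    traj = Trajectory(problem)
    for _ in range(max_steps):
        action = propose_step(traj.context())            # a_i ~ πθ(· | s_i)
        reward = self_critique(traj.context(), action)   # r_i from reflection
        traj.steps.append(Step(action, reward))
    return traj
```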
Simplified reward model
Because intermediate rewards are hard to estimate, a binary reward can be used: any trajectory that eventually yields the correct final answer receives a positive reward, regardless of intermediate errors; trajectories ending with an incorrect answer receive a negative reward. This reduces the problem to a contextual bandit that can be optimized with REINFORCE.
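A minimal sketch of this bandit-style data collection, assuming a hypothetical `sample_cot(prompt)` that returns a reasoning trace plus a final answer; answer checking is reduced to exact string match here purely for illustration.

```python
from typing import Callable, List, Tuple

def binary_reward(predicted: str, reference: str) -> float:
    """Terminal-only reward: +1 if the final answer is correct, -1 otherwise.
    Intermediate mistakes in the trace are deliberately ignored."""
    return 1.0 if predicted.strip() == reference.strip() else -1.0

def collect_bandit_batch(prompt: str,
                         reference_answer: str,
                         sample_cot: Callable[[str], Tuple[str, str]],
                         n_samples: int = 8) -> List[Tuple[str, float]]:
    """Sample several full CoT trajectories for one prompt and score each one
    with the binary terminal reward (the contextual-bandit view)."""
    batch = []
    for _ in range(n_samples):
        trace, answer = sample_cot(prompt)               # hypothetical sampler
        batch.append((trace, binary_reward(answer, reference_answer)))
    return batch
```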
Training with REINFORCE
The basic REINFORCE update increases the log-probability of trajectories that reach the correct answer and decreases it for trajectories that do not. Stability can be improved with KL-regularization toward a reference policy and with reward normalization. The update rule can be expressed as:
∇θ J ≈ E[∇θ log πθ(a|s) · (R − b)], where R is the binary reward and b is a baseline (e.g., a moving-average reward).
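A minimal PyTorch sketch of this update, with a scalar baseline and a KL penalty toward a reference policy; tensor shapes, the dummy data, and the KL coefficient are illustrative assumptions.

```python
import torch

def reinforce_loss(policy_logprobs: torch.Tensor,   # (batch, seq): log πθ(a_t|s_t) of sampled tokens
                   ref_logprobs: torch.Tensor,      # (batch, seq): log π_ref(a_t|s_t), no grad needed
                   rewards: torch.Tensor,           # (batch,): binary reward, +1 correct / -1 wrong
                   baseline: float,                 # b, e.g., a moving-average reward
                   kl_coef: float = 0.1) -> torch.Tensor:
    advantage = rewards - baseline                   # R − b
    seq_logprob = policy_logprobs.sum(dim=-1)        # log-prob of the whole trajectory
    pg_loss = -(seq_logprob * advantage).mean()      # maximize E[log πθ · (R − b)]
    # Monte Carlo estimate of KL(πθ || π_ref) on the sampled tokens.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1).mean()
    return pg_loss + kl_coef * kl

# Usage with dummy tensors (stand-ins for per-token log-probs from real models):
bsz, seqlen = 4, 16
policy_lp = torch.randn(bsz, seqlen, requires_grad=True) - 3.0
ref_lp = torch.randn(bsz, seqlen) - 3.0
rewards = torch.tensor([1.0, -1.0, 1.0, -1.0])
loss = reinforce_loss(policy_lp, ref_lp, rewards, baseline=rewards.mean().item())
loss.backward()
```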
Empirical observation
During RL training, models not only improve task performance but also tend to generate longer token sequences, indicating an emergent increase in reasoning depth.
References
https://arxiv.org/abs/2210.14215
https://www.youtube.com/watch?v=eaAonE58sLU
https://www.youtube.com/watch?v=kYWUEV_e2ss
https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf
https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
https://blog.ml.cmu.edu/2025/01/08/optimizing-llm-test-time-compute-involves-solving-a-meta-rl-problem/