Why Long CoT and In‑Context RL Are the Next Frontier for LLMs

The article analyses recent breakthroughs such as OpenAI's o1, Long CoT, and test‑time search, arguing that enabling LLMs to perform self‑critique and reinforcement learning with long output sequences is essential for future AI performance, while warning against overly structured workflows.


Key observations of OpenAI o1

The model is allowed to make mistakes during reasoning.

It repeatedly reflects on its output and retries, often inserting phrases like "but, wait…".

Its reasoning is unconstrained: it can restate the problem, draw analogies, and decompose tasks.

Long Context vs. Long Chain‑of‑Thought (Long CoT)

Long Context refers to handling very long input sequences, where serving systems such as Mooncake keep prefill compute manageable. Long CoT, by contrast, generates long output sequences, which is far more expensive and slower because decoding is sequential. The argument is that performance, not cost, should be the primary driver; extending the reasoning horizon is worthwhile despite the higher compute.
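The asymmetry between prefill and decode can be made concrete with a back-of-the-envelope latency sketch. This is illustrative only, not a benchmark; the throughput numbers and the helper name are hypothetical, and real systems vary widely.

```python
# Illustrative sketch (not a benchmark): prefill processes the prompt in
# parallel, while decode generates one token per forward pass, so latency
# grows linearly with output length. Throughput numbers are hypothetical.

def latency_estimate(input_tokens: int, output_tokens: int,
                     prefill_tps: float = 10_000.0,
                     decode_tps: float = 50.0) -> float:
    """Rough wall-clock estimate in seconds for one request."""
    prefill = input_tokens / prefill_tps   # one parallel pass over the prompt
    decode = output_tokens / decode_tps    # one forward pass per output token
    return prefill + decode

# A 100k-token prompt with a short answer is cheaper in wall-clock time
# than a short prompt followed by a 10k-token chain of thought.
long_context = latency_estimate(100_000, 200)
long_cot = latency_estimate(500, 10_000)
```

Under these (made-up) rates, the long-context request finishes in seconds while the Long CoT request takes minutes, which is the cost the article argues is worth paying.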

Test‑time search and self‑critique

Noam Brown emphasizes that models should be able to perform autonomous test‑time search, similar to the search component in AlphaGo. Hyung Won Chung argues that imposing a fixed reasoning structure limits the model; instead, the model should discover its own reasoning patterns and be incentivized to self‑critique.

In‑Context Reinforcement Learning (RL) formulation

Generating a Long CoT can be viewed as an in‑context RL problem. The generation process forms a trajectory s₁, a₁, r₁, a₂, r₂, a₃, r₃, …, where aᵢ is an action (a reasoning step) and rᵢ is a reward derived from the model's own reflection (self‑critique). This aligns with work on in‑context RL (arXiv:2210.14215).
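The trajectory view above can be sketched as a small data structure. The class and field names here (Step, Trajectory) are purely illustrative, not from any library; the self-critique rewards are hand-assigned to mirror the "but, wait…" behaviour described earlier.

```python
from dataclasses import dataclass, field

# Minimal sketch of the trajectory s1, a1, r1, a2, r2, ... described above.
# Names are illustrative; rewards stand in for the model's self-critique.

@dataclass
class Step:
    action: str    # a_i: one reasoning step emitted by the model
    reward: float  # r_i: self-critique signal for that step

@dataclass
class Trajectory:
    state: str                       # s_1: the problem statement / prompt
    steps: list = field(default_factory=list)

    def add(self, action: str, reward: float) -> None:
        self.steps.append(Step(action, reward))

traj = Trajectory(state="Prove that sqrt(2) is irrational.")
traj.add("Assume sqrt(2) = p/q in lowest terms.", 0.0)
traj.add("Then p^2 = 2q^2, so p is odd.", -1.0)            # reflection flags an error
traj.add("But wait - p^2 even implies p is even.", +1.0)   # retry after self-critique
```

The negative reward on the second step is exactly the "mistakes are allowed" property from the o1 observations: the trajectory is not discarded, the model reflects and continues.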

Simplified reward model

Because intermediate rewards are hard to estimate, a binary reward can be used: any trajectory that eventually yields the correct final answer receives a positive reward, regardless of intermediate errors; trajectories ending with an incorrect answer receive a negative reward. This reduces the problem to a contextual bandit that can be optimized with REINFORCE.
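The binary terminal reward is simple enough to state as code. A minimal sketch, assuming exact-match answer checking (the function name and string comparison are illustrative; real graders are usually more lenient):

```python
# Sketch of the binary terminal reward: only the final answer matters,
# intermediate mistakes carry no penalty. Exact-match grading is assumed.

def terminal_reward(final_answer: str, gold_answer: str) -> float:
    """+1 if the trajectory ends at the correct answer, -1 otherwise."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else -1.0

# A trajectory with wrong intermediate steps but a correct final answer
# still earns +1: errors that get corrected along the way are free.
assert terminal_reward("42", "42") == 1.0
assert terminal_reward("41", "42") == -1.0
```

Collapsing every trajectory to a single terminal reward is what makes the contextual-bandit reduction work: there is one decision (the whole trajectory) and one reward.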

Training with REINFORCE

The basic REINFORCE update increases the log‑probability of trajectories that end in a correct answer and decreases it for those that end in a wrong one. Stability can be improved with KL‑regularization toward a reference policy and reward normalization. The update rule can be expressed as:

∇θ J ≈ E[∇θ log πθ(a|s) · (R − b)]

where R is the binary reward and b is a baseline (e.g., moving‑average reward).
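The update can be exercised end to end on a toy problem. The sketch below runs REINFORCE with a moving-average baseline on a two-action bandit where one action stands in for "the correct answer"; the softmax policy, learning rate, and reward scheme are all illustrative, and the KL term toward a reference policy is omitted for brevity.

```python
import math
import random

# Toy REINFORCE on a two-action bandit, matching the update above:
# grad J ~ E[grad log pi(a|s) * (R - b)], with b a moving-average baseline.
# Action 1 plays the role of "the correct answer" (R = +1 vs R = -1).

random.seed(0)
logits = [0.0, 0.0]          # theta: one logit per action
baseline, lr, beta = 0.0, 0.5, 0.9

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

for _ in range(500):
    probs = softmax(logits)
    a = 0 if random.random() < probs[0] else 1
    R = 1.0 if a == 1 else -1.0          # binary terminal reward
    adv = R - baseline                   # reward normalization via baseline b
    baseline = beta * baseline + (1 - beta) * R
    # grad of log softmax w.r.t. logit i is (1[i == a] - probs[i])
    for i in range(2):
        logits[i] += lr * ((1.0 if i == a else 0.0) - probs[i]) * adv

assert softmax(logits)[1] > 0.9   # policy concentrates on the rewarded action
```

The baseline b keeps the advantage centered so that, once the policy is mostly correct, correct answers stop pushing the logits and only residual mistakes generate a strong corrective gradient.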

Empirical observation

During RL training, models not only improve task performance but also tend to generate longer token sequences, indicating an emergent increase in reasoning depth.

References

https://arxiv.org/abs/2210.14215
https://www.youtube.com/watch?v=eaAonE58sLU
https://www.youtube.com/watch?v=kYWUEV_e2ss
https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf
https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
https://blog.ml.cmu.edu/2025/01/08/optimizing-llm-test-time-compute-involves-solving-a-meta-rl-problem/
Tags: LLM, model training, AI research, REINFORCE, Long CoT, Self‑Critique, In‑Context RL
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
