Why Do Reasoning LLMs Lose Instruction-Following Ability? A Deep Dive into Recent Findings
This article compares two recent papers that investigate why reasoning‑tuned variants of model families such as Llama and Qwen show degraded instruction‑following performance under chain‑of‑thought prompting, covering attention patterns, training effects, and proposed mitigations.
The author observes that recent reasoning language models are unexpectedly losing their ability to follow instructions, a problem explored in two newly released papers.
Paper Overview
Paper A – “When Thinking Fails: The Pitfalls of Reasoning for Instruction‑Following in LLMs” (arXiv:2505.11423) studies how chain‑of‑thought (CoT) prompting affects the instruction‑following (IF) performance of models that already possess reasoning capabilities.
Paper B – “Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models” (arXiv:2505.14810) compares the IF ability of a Reasoner model with that of its Instruct‑style baseline.
Common Experimental Steps
Verify that the Reasoner’s IF ability is worse than the baseline.
Use heuristics to locate the cause and validate with data.
Try a mitigation, sometimes a seemingly unrelated one, and measure its effect.
Different Research Goals
Paper A focuses on the difference between using CoT and not using CoT with the same model; Paper B focuses on the gap between a Reasoner model and its Instruct base.
Attribution Differences
Paper A attributes the drop to reduced attention on instruction tokens when CoT is applied. Paper B attributes the drop to the length of the CoT sequence, claiming that longer CoT leads to lower IF success.
Key Empirical Findings
Figures from both papers show that, for the Llama and Qwen series, Instruct models consistently outperform their Reasoner counterparts on the IFEval benchmark.
Table 1 in Paper B demonstrates that the same model’s IF accuracy declines when a CoT prompt is required, especially for the Llama family.
Observations on task types reveal that CoT improves IF for format‑heavy instructions (JSON, Markdown) but harms IF for constraints such as word‑count limits, non‑English responses, or strict content‑quality requirements.
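To make the with/without‑CoT comparison concrete, below is a minimal sketch of an IFEval‑style check of a verifiable constraint. The `call_model` stub and the zero‑shot CoT trigger are illustrative assumptions, not either paper’s exact protocol:

```python
# Sketch: measure how often a model satisfies a hard constraint (here, a
# word-count limit) with and without a chain-of-thought prompt.
# `call_model` is a hypothetical stand-in for your inference API.

def call_model(prompt: str) -> str:
    """Hypothetical model call; replace with real inference code."""
    raise NotImplementedError

def satisfies_word_limit(response: str, max_words: int) -> bool:
    # Rule-based check in the spirit of IFEval's verifiable constraints.
    return len(response.split()) <= max_words

def if_pass_rate(tasks, use_cot: bool) -> float:
    passes = 0
    for instruction, max_words in tasks:
        prompt = instruction
        if use_cot:
            prompt += "\nLet's think step by step."  # simple zero-shot CoT trigger
        passes += satisfies_word_limit(call_model(prompt), max_words)
    return passes / len(tasks)

tasks = [("Summarize the plot of Hamlet in at most 50 words.", 50)]
# Compare if_pass_rate(tasks, use_cot=False) against if_pass_rate(tasks, use_cot=True);
# the papers' finding predicts the CoT variant scores lower on constraints like this.
```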
Mechanistic Insight
Attention‑score analysis shows lower attention scores on instruction tokens when CoT is used on tasks where IF performance drops, and higher scores on tasks where CoT helps.
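A rough version of that measurement can be sketched with Hugging Face `transformers`. The checkpoint, prompt, and instruction/CoT boundary below are placeholders, and the papers’ exact aggregation over layers and heads may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; eager attention is needed to return attention weights.
name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

instruction = "Answer in exactly three words: what color is the sky?"
cot = " Let's think step by step. The sky scatters blue light, so..."
instr_len = tok(instruction, return_tensors="pt").input_ids.shape[1]  # approximate span
ids = tok(instruction + cot, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer. Average over
# layers and heads, take the last query position, and sum the probability
# mass that falls on the instruction span.
att = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]  # shape: (seq,)
print("attention mass on instruction tokens:", att[:instr_len].sum().item())
```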
Impact of Reasoner Training
Paper B evaluates three training regimes on the same base model: pure supervised fine‑tuning (SFT), pure reinforcement learning (RL), and SFT + RL. All three degrade IF ability, even though no IF data were used during training.
Proposed Mitigations
Paper A suggests four approaches: few‑shot prompting, forced self‑reflection, letting the model decide whether to use CoT, and adding a binary classifier that selects when to apply CoT. Paper B proposes two methods: explicitly limiting CoT length and restating the IF requirement at the end of the CoT output.
Both papers conclude that a classifier deciding when to apply CoT yields the best trade‑off, while the other methods often hurt reasoning performance.
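For intuition, here is a minimal sketch of the classifier‑gated approach combined with an explicit length cap in the spirit of Paper B. The TF‑IDF features, training labels, token budgets, and `generate` stub are all illustrative assumptions, not either paper’s implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def generate(prompt: str, max_new_tokens: int) -> str:
    """Hypothetical inference call; max_new_tokens doubles as the CoT length cap."""
    raise NotImplementedError

# Instructions labeled 1 if CoT improved IF success in an offline sweep (toy data).
train_instructions = ["Format the answer as a JSON object.",
                      "Reply in at most ten words."]
labels = [1, 0]

vec = TfidfVectorizer().fit(train_instructions)
clf = LogisticRegression().fit(vec.transform(train_instructions), labels)

def answer(instruction: str) -> str:
    use_cot = clf.predict(vec.transform([instruction]))[0] == 1
    if use_cot:
        # CoT judged helpful: think step by step, but under a capped budget.
        return generate(instruction + "\nLet's think step by step.", max_new_tokens=512)
    return generate(instruction, max_new_tokens=128)  # answer directly
```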
Critical Evaluation
The author criticizes both works for stopping at attention‑score analysis without deeper investigation into training data distribution, token bandwidth, or architectural constraints.
Furthermore, the papers reach opposite conclusions about the effect of CoT length: Paper B reports a clear negative correlation, whereas Paper A finds the correlation statistically insignificant.
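One way to adjudicate that disagreement would be to compute the correlation directly from logged evaluation runs. The sketch below uses `scipy.stats.pearsonr` on synthetic placeholder data; real CoT token counts and pass/fail flags would come from benchmark logs:

```python
from scipy.stats import pearsonr

cot_lengths = [120, 340, 512, 75, 890, 410]  # placeholder CoT token counts
if_passed   = [1,   1,   0,   1,  0,   1]    # placeholder IF pass/fail flags

r, p = pearsonr(cot_lengths, if_passed)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
# Paper B's claim predicts r < 0 with a small p-value; Paper A's finding
# predicts a p-value too large to reject the null of no correlation.
```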
References
Paper A – When Thinking Fails: The Pitfalls of Reasoning for Instruction‑Following in LLMs, https://arxiv.org/abs/2505.11423
Paper B – Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models, https://arxiv.org/abs/2505.14810