Why Do Reasoning LLMs Lose Instruction-Following Ability? A Deep Dive into Recent Findings
This article compares two recent papers that investigate why reasoning‑tuned variants of model families such as Llama and Qwen show degraded instruction‑following performance under chain‑of‑thought prompting, covering attention patterns, training effects, and proposed mitigations.
The author observes that recent reasoning language models are unexpectedly losing their ability to follow instructions, a problem explored in two newly released papers.
Paper Overview
Paper A – “When Thinking Fails: The Pitfalls of Reasoning for Instruction‑Following in LLMs” (arXiv:2505.11423) studies how chain‑of‑thought (CoT) prompting affects the instruction‑following (IF) performance of models that already possess reasoning capabilities.
Paper B – “Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models” (arXiv:2505.14810) compares the IF ability of a Reasoner model with that of its Instruct‑style baseline.
Common Experimental Steps
Verify that the Reasoner’s IF ability is worse than the baseline.
Use heuristics to locate the cause and validate with data.
Try a mitigation, sometimes a seemingly unrelated one, and measure its effect.
Different Research Goals
Paper A focuses on the difference between using CoT and not using CoT with the same model; Paper B focuses on the gap between a Reasoner model and its Instruct base.
Attribution Differences
Paper A attributes the drop to reduced attention on instruction tokens when CoT is applied. Paper B attributes the drop to the length of the CoT sequence, claiming that longer CoT leads to lower IF success.
Key Empirical Findings
Figures from both papers show that, for the Llama and Qwen series, Instruct models consistently outperform their Reasoner counterparts on the IFEval benchmark.
Table 1 in Paper B demonstrates that the same model’s IF accuracy declines when a CoT prompt is required, especially for the Llama family.
Observations on task types reveal that CoT improves IF for format‑heavy instructions (JSON, Markdown) but harms IF for constraints such as word‑count limits, non‑English responses, or strict content‑quality requirements.
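To make the with/without‑CoT comparison concrete, below is a minimal sketch of an IFEval‑style check of a verifiable constraint. The `call_model` stub and the zero‑shot CoT trigger are illustrative assumptions, not either paper’s exact protocol:

```python
# Sketch: measure how often a model satisfies a hard constraint (here, a
# word-count limit) with and without a chain-of-thought prompt.
# `call_model` is a hypothetical stand-in for your inference API.

def call_model(prompt: str) -> str:
    """Hypothetical model call; replace with real inference code."""
    raise NotImplementedError

def satisfies_word_limit(response: str, max_words: int) -> bool:
    # Rule-based check in the spirit of IFEval's verifiable constraints.
    return len(response.split()) <= max_words

def if_pass_rate(tasks, use_cot: bool) -> float:
    passes = 0
    for instruction, max_words in tasks:
        prompt = instruction
        if use_cot:
            prompt += "\nLet's think step by step."  # simple zero-shot CoT trigger
        passes += satisfies_word_limit(call_model(prompt), max_words)
    return passes / len(tasks)

tasks = [("Summarize the plot of Hamlet in at most 50 words.", 50)]
# Compare if_pass_rate(tasks, use_cot=False) against if_pass_rate(tasks, use_cot=True);
# the papers' finding predicts the CoT variant scores lower on constraints like this.
```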
Mechanistic Insight
Attention‑score analysis shows lower attention scores on instruction tokens when CoT is used on tasks where IF performance drops, and higher scores on tasks where CoT helps.
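A rough version of that measurement can be sketched with Hugging Face `transformers`. The checkpoint, prompt, and instruction/CoT boundary below are placeholders, and the papers’ exact aggregation over layers and heads may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; eager attention is needed to return attention weights.
name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

instruction = "Answer in exactly three words: what color is the sky?"
cot = " Let's think step by step. The sky scatters blue light, so..."
instr_len = tok(instruction, return_tensors="pt").input_ids.shape[1]  # approximate span
ids = tok(instruction + cot, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer. Average over
# layers and heads, take the last query position, and sum the probability
# mass that falls on the instruction span.
att = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]  # shape: (seq,)
print("attention mass on instruction tokens:", att[:instr_len].sum().item())
```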
Impact of Reasoner Training
Paper B evaluates three training regimes on the same base model: pure supervised fine‑tuning (SFT), pure reinforcement learning (RL), and SFT + RL. All three degrade IF ability, even though no IF data were used during training.
Proposed Mitigations
Paper A suggests four approaches: few‑shot prompting, forced self‑reflection, letting the model decide whether to use CoT, and adding a binary classifier that selects when to apply CoT. Paper B proposes two methods: explicitly limiting CoT length and restating the IF requirement at the end of the CoT output.
Both papers conclude that a classifier deciding when to apply CoT yields the best trade‑off, while the other methods often hurt reasoning performance.
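For intuition, here is a minimal sketch of the classifier‑gated approach combined with an explicit length cap in the spirit of Paper B. The TF‑IDF features, training labels, token budgets, and `generate` stub are all illustrative assumptions, not either paper’s implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def generate(prompt: str, max_new_tokens: int) -> str:
    """Hypothetical inference call; max_new_tokens doubles as the CoT length cap."""
    raise NotImplementedError

# Instructions labeled 1 if CoT improved IF success in an offline sweep (toy data).
train_instructions = ["Format the answer as a JSON object.",
                      "Reply in at most ten words."]
labels = [1, 0]

vec = TfidfVectorizer().fit(train_instructions)
clf = LogisticRegression().fit(vec.transform(train_instructions), labels)

def answer(instruction: str) -> str:
    use_cot = clf.predict(vec.transform([instruction]))[0] == 1
    if use_cot:
        # CoT judged helpful: think step by step, but under a capped budget.
        return generate(instruction + "\nLet's think step by step.", max_new_tokens=512)
    return generate(instruction, max_new_tokens=128)  # answer directly
```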
Critical Evaluation
The author criticizes both works for stopping at attention‑score analysis without deeper investigation into training data distribution, token bandwidth, or architectural constraints.
Furthermore, the papers reach opposite conclusions about the effect of CoT length: Paper B reports a clear negative correlation, whereas Paper A finds the correlation statistically insignificant.
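One way to adjudicate that disagreement would be to compute the correlation directly from logged evaluation runs. The sketch below uses `scipy.stats.pearsonr` on synthetic placeholder data; real CoT token counts and pass/fail flags would come from benchmark logs:

```python
from scipy.stats import pearsonr

cot_lengths = [120, 340, 512, 75, 890, 410]  # placeholder CoT token counts
if_passed   = [1,   1,   0,   1,  0,   1]    # placeholder IF pass/fail flags

r, p = pearsonr(cot_lengths, if_passed)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
# Paper B's claim predicts r < 0 with a small p-value; Paper A's finding
# predicts a p-value too large to reject the null of no correlation.
```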
References
Paper A – When Thinking Fails: The Pitfalls of Reasoning for Instruction‑Following in LLMs, https://arxiv.org/abs/2505.11423
Paper B – Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models, https://arxiv.org/abs/2505.14810