How 800 Data Points Halve LLM Chain‑of‑Thought Length and Boost Accuracy
The ICLR‑2026 paper introduces LCPO, a lightweight preference‑optimization technique that needs only 800 curated examples and 50 training steps to cut the chain‑of‑thought length of large reasoning models by about 50% while maintaining or even improving answer accuracy, which sharply reduces training and inference costs.
Large reasoning models (LRMs) such as DeepSeek‑R1 and Qwen‑32B achieve strong performance on math and coding tasks by generating long chains of thought (CoT), but this often leads to overthinking: verbose reasoning, higher computational cost, slower inference, and occasional errors on simple problems.
Problem and Existing Solutions
Current remedies either truncate the output during inference—resulting in unstable performance and degraded speed—or employ large‑scale online reinforcement learning, which demands hundreds of thousands of training examples and thousands of GPU hours.
Key Research Questions
Is there a short yet correct reasoning path already present in the model's generation space?
How can we steer the model toward that efficient path using minimal data and training?
Core Finding
Experiments with DeepSeek‑R1‑Distill‑Qwen‑7B showed that when the model's answers are ranked by length, the shortest ones retain almost the same accuracy as longer ones, while the longest suffer a sharp drop in correctness. This suggests that models already possess a “concise mode” that is simply not activated by default.
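The length-versus-accuracy comparison described above can be sketched roughly as follows. This is a minimal illustration, not the paper's evaluation code: the function name `accuracy_by_length_rank`, the input layout, and the toy numbers are all assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_length_rank(samples):
    """Accuracy of the r-th shortest answer, averaged over problems.

    `samples` maps a problem id to a list of (num_tokens, is_correct)
    tuples, one per sampled answer (hypothetical layout).
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for answers in samples.values():
        # Rank this problem's answers from shortest to longest.
        ranked = sorted(answers, key=lambda a: a[0])
        for rank, (_, is_correct) in enumerate(ranked):
            hits[rank] += int(is_correct)
            counts[rank] += 1
    return [hits[r] / counts[r] for r in sorted(counts)]


# Toy data for illustration only; not results from the paper.
toy = {
    "p1": [(120, True), (480, True), (2100, False)],
    "p2": [(90, True), (300, True), (1500, True)],
}
print(accuracy_by_length_rank(toy))  # [1.0, 1.0, 0.5]
```

If accuracy at the low ranks stays close to the middle ranks while the last rank drops, that matches the pattern the article reports.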
Method Overview (LCPO)
The proposed three‑step pipeline uses the model’s own correctness to label each problem’s difficulty (Easy, Medium, Difficult), then takes the most concise correct answers as positive examples and the longest answers as negative examples. From an original pool of 22k samples, only 800 Easy examples are selected for training.
Data Selection: Use the model’s answer correctness to label problems as Easy (all correct), Medium (partially correct), or Difficult (all wrong). Train solely on Easy examples, pairing the shortest correct answer (positive) with the longest answer (negative).
Algorithm Innovation: An analysis of existing preference‑optimization objectives (DPO, SimPO, ORPO) shows that the negative log‑likelihood (NLL) term interferes with learning length preferences. LCPO modifies the loss to balance the NLL term directly, isolating the length signal without adding hyper‑parameters.
Training Efficiency: The approach reduces data requirements by 1–2 orders of magnitude and cuts total training cost to roughly 10.4 A100‑GPU hours, compared to thousands of hours for online RL methods.
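The data-selection step above might look something like the sketch below. It assumes each problem comes with several sampled answers that are already graded for correctness. The field names (`text`, `num_tokens`, `correct`), the function names, and the simple first-come cap of 800 pairs are assumptions for illustration; the article does not say how the 800 examples were chosen from the Easy pool.

```python
def label_difficulty(answers):
    """Label a problem from the correctness of its sampled answers."""
    n_correct = sum(a["correct"] for a in answers)
    if n_correct == len(answers):
        return "Easy"       # every sampled answer is correct
    if n_correct == 0:
        return "Difficult"  # every sampled answer is wrong
    return "Medium"         # mixed results


def build_preference_pairs(pool, max_pairs=800):
    """Pair the shortest (chosen) and longest (rejected) answer of each Easy problem.

    `pool` maps a prompt to a list of answer dicts (hypothetical layout).
    """
    pairs = []
    for prompt, answers in pool.items():
        if len(answers) < 2 or label_difficulty(answers) != "Easy":
            continue
        # On Easy problems all answers are correct, so the shortest
        # answer is also the shortest correct one.
        chosen = min(answers, key=lambda a: a["num_tokens"])
        rejected = max(answers, key=lambda a: a["num_tokens"])
        if chosen["num_tokens"] == rejected["num_tokens"]:
            continue  # no length contrast to learn from
        pairs.append(
            {"prompt": prompt, "chosen": chosen["text"], "rejected": rejected["text"]}
        )
        if len(pairs) == max_pairs:
            break
    return pairs


# Toy pool for illustration only.
pool = {
    "q1": [
        {"text": "short", "num_tokens": 50, "correct": True},
        {"text": "long", "num_tokens": 400, "correct": True},
    ],
    "q2": [
        {"text": "a", "num_tokens": 60, "correct": True},
        {"text": "b", "num_tokens": 900, "correct": False},
    ],
}
print(build_preference_pairs(pool))
```

The `prompt` / `chosen` / `rejected` layout is a common convention for preference datasets; the modified LCPO loss itself is not reproduced here because the article does not give its formula.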
Results
On DeepSeek‑R1‑Distill‑Qwen‑1.5B/7B, LCPO halves the average reasoning length while preserving accuracy. Even when evaluated on out‑of‑distribution tasks (MMLU, GPQA‑Diamond, WinoGrande), the method achieves over 55% length reduction with a slight accuracy increase, suggesting the learned “efficient thinking” generalizes beyond the training domain.
In one concrete example, the original model ran eight verification steps on a simple math problem, wasting 79.37% of its tokens; after LCPO training, it solves the same problem with a single verification, drastically reducing token consumption.
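One way to quantify this kind of waste is sketched below. It assumes that "token waste" means the share of tokens generated after the step where the correct answer first appears; the article does not define the metric, so this definition, the function name `token_waste`, and the step counts are all assumptions for illustration.

```python
def token_waste(step_token_counts, first_correct_step):
    """Share of tokens produced after the correct answer first appears.

    `step_token_counts` lists the tokens used by each reasoning step;
    `first_correct_step` is the 0-based index of the step that first
    reaches the correct answer.
    """
    total = sum(step_token_counts)
    useful = sum(step_token_counts[: first_correct_step + 1])
    return (total - useful) / total


# Hypothetical traces, not the paper's data: one solving step followed
# by eight re-verifications, versus one solving step and one verification.
before = [100] + [50] * 8
after = [100, 50]
print(f"{token_waste(before, 0):.2%}")  # 80.00%
print(f"{token_waste(after, 0):.2%}")   # 33.33%
```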
Implications
The work reveals that large models already contain efficient reasoning paths; the challenge is to provide the right preference signals to activate them. This opens a new direction for low‑cost, high‑efficiency alignment of LLM behavior, enabling faster inference, lower API costs, and fewer overthinking‑induced errors.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
