How 800 Data Points Halve LLM Chain‑of‑Thought Length and Boost Accuracy

The ICLR‑2026 paper introduces LCPO, a lightweight preference‑optimization technique that uses only 800 curated examples and 50 training steps to cut large‑model chain‑of‑thought generation length by about 50% while maintaining or even improving answer accuracy, dramatically reducing training and inference costs.

Machine Learning Algorithms & Natural Language Processing

Large reasoning models (LRMs) such as DeepSeek‑R1 and Qwen‑32B achieve strong performance on math and coding tasks by generating long chains of thought (CoT), but this often leads to overthinking: verbose reasoning, higher computational cost, slower inference, and occasional errors on simple problems.

Problem and Existing Solutions

Current remedies either truncate the output during inference—resulting in unstable performance and degraded speed—or employ large‑scale online reinforcement learning, which demands hundreds of thousands of training examples and thousands of GPU hours.

Key Research Questions

Is there a short yet correct reasoning path already present in the model's generation space?

How can we steer the model toward that efficient path using minimal data and training?

Core Finding

Experiments with DeepSeek‑R1‑Distill‑Qwen‑7B showed that, when the model's answers are ranked by length, the shortest ones retain almost the same accuracy as longer ones, while the longest answers suffer a sharp drop in correctness. This indicates that the model already has a “concise mode” that is simply not activated by default.
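The length-versus-accuracy check described above can be illustrated with a minimal sketch. It assumes several answers have been sampled per problem and that each record carries a token count and a correctness flag. The function name `accuracy_by_length_rank`, the tuple layout, and the toy data are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def accuracy_by_length_rank(samples):
    """Return accuracy at each length rank (0 = shortest answer per problem).

    samples: iterable of (problem_id, num_tokens, is_correct) tuples.
    This record layout is an assumption made for illustration.
    """
    by_problem = defaultdict(list)
    for problem_id, num_tokens, is_correct in samples:
        by_problem[problem_id].append((num_tokens, is_correct))

    hits = defaultdict(int)
    counts = defaultdict(int)
    for answers in by_problem.values():
        # Sort each problem's answers from shortest to longest.
        answers.sort(key=lambda a: a[0])
        for rank, (_, is_correct) in enumerate(answers):
            hits[rank] += int(is_correct)
            counts[rank] += 1
    return {rank: hits[rank] / counts[rank] for rank in sorted(counts)}


# Toy data (hypothetical): two problems, three sampled answers each.
toy_samples = [
    ("p1", 120, True), ("p1", 300, True), ("p1", 900, False),
    ("p2", 80, True), ("p2", 250, True), ("p2", 700, True),
]
rank_accuracy = accuracy_by_length_rank(toy_samples)
```

If the accuracy at rank 0 stays close to the middle ranks while the last rank drops, that matches the pattern the article reports.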

Method Overview (LCPO)

The proposed three‑step pipeline takes only the most concise correct answers as positive examples and the longest answers as negative examples, using the model’s own correctness as a difficulty label (Easy, Medium, Difficult). From an original pool of 22k samples, only 800 Easy examples are selected for training.

Data Selection: Use the model’s answer correctness to label problems as Easy (all correct), Medium (partially correct), or Difficult (all wrong). Train solely on Easy examples, pairing the shortest correct answer (positive) with the longest answer (negative).

Algorithm Innovation: An analysis of existing preference‑optimization objectives (DPO, SimPO, ORPO) finds that the negative log‑likelihood (NLL) term interferes with learning length preferences. LCPO modifies the loss to balance the NLL term directly, isolating the length signal without introducing extra hyper‑parameters.

Training Efficiency: The approach reduces data requirements by 1–2 orders of magnitude and cuts total training cost to roughly 10.4 A100‑GPU hours, compared to thousands of hours for online RL methods.
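The data-selection step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `label_difficulty` and `build_easy_pairs`, the input layout, and the `prompt`/`chosen`/`rejected` keys are all hypothetical.

```python
def label_difficulty(correct_flags):
    """Label a problem from the correctness of its sampled answers."""
    if all(correct_flags):
        return "Easy"        # every sampled answer is correct
    if any(correct_flags):
        return "Medium"      # some answers are correct, some are not
    return "Difficult"       # every sampled answer is wrong


def build_easy_pairs(problems):
    """Build (shortest, longest) preference pairs from Easy problems only.

    problems: dict mapping prompt -> list of (answer_text, num_tokens, is_correct).
    The input layout and output keys are assumptions made for illustration.
    """
    pairs = []
    for prompt, answers in problems.items():
        if len(answers) < 2:
            continue  # a pair needs at least two sampled answers
        if label_difficulty([a[2] for a in answers]) != "Easy":
            continue  # train only on Easy problems
        shortest = min(answers, key=lambda a: a[1])
        longest = max(answers, key=lambda a: a[1])
        if shortest[1] == longest[1]:
            continue  # no length contrast to learn from
        pairs.append(
            {"prompt": prompt, "chosen": shortest[0], "rejected": longest[0]}
        )
    return pairs


# Toy data (hypothetical).
toy_problems = {
    "2 + 3 = ?": [("5", 40, True), ("Let me verify ... 5", 400, True)],
    "Hard integral": [("wrong", 900, False), ("right", 600, True)],
}
easy_pairs = build_easy_pairs(toy_problems)
```

Because every answer to an Easy problem is correct, the shortest answer is automatically the shortest correct one, so the pair differs in length but not in correctness.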

Results

On DeepSeek‑R1‑Distill‑Qwen‑1.5B/7B, LCPO halves the average reasoning length while preserving accuracy. Even when evaluated on out‑of‑distribution tasks (MMLU, GPQA‑Diamond, WinoGrande), the method achieves over 55% length reduction with a slight accuracy increase, suggesting the learned “efficient thinking” generalizes beyond the training domain.

A concrete example shows a simple math problem on which the model previously ran eight verification steps, wasting 79.37% of its tokens; after LCPO training, the model solves it with a single verification, drastically reducing token consumption.
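The article does not define how token waste is measured. One common reading, assumed here, is the fraction of tokens generated after the model first reaches the correct answer. The sketch below implements that assumed definition; the function name `token_waste` and the step counts are hypothetical and are not the figures behind the 79.37% in the example.

```python
def token_waste(step_token_counts, first_correct_step):
    """Fraction of tokens spent after the first correct answer appears.

    step_token_counts: tokens per reasoning step, in generation order.
    first_correct_step: index of the step that first states the correct answer.
    This definition is an assumption, not taken from the paper.
    """
    total = sum(step_token_counts)
    if total <= 0:
        raise ValueError("total token count must be positive")
    useful = sum(step_token_counts[: first_correct_step + 1])
    return 1.0 - useful / total


# Hypothetical trace: one solving step followed by eight re-verification steps.
steps = [100] + [50] * 8
waste = token_waste(steps, first_correct_step=0)
```

Under this definition, a response that stops right after its first correct answer has a waste of zero.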

Implications

The work reveals that large models already contain efficient reasoning paths; the challenge is to provide the right preference signals to activate them. This opens a new direction for low‑cost, high‑efficiency alignment of LLM behavior, enabling faster inference, lower API costs, and fewer overthinking‑induced errors.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
