Artificial Intelligence 16 min read

How OmniThought Enables Adaptive Reasoning Chains for Better LLM Performance

This article introduces the OmniThought dataset, which annotates over two million chain‑of‑thought reasoning steps with Reasoning Verbosity and Cognitive Difficulty scores, and explains how these metrics guide the training of DistilQwen‑ThoughtX models that adapt chain length to task difficulty, achieving superior performance compared to existing distilled LLMs.

Alibaba Cloud Big Data AI Platform

May 29, 2025

How OmniThought Enables Adaptive Reasoning Chains for Better LLM Performance

Introduction

Recent breakthroughs in natural language processing (NLP) driven by large language models (LLMs) have transformed language understanding, generation, and reasoning. Advanced reasoning models such as OpenAI o1, DeepSeek‑R1, and QwQ‑32B excel on complex tasks by using Chain‑of‑Thought (CoT) prompting, which mimics incremental human reasoning. However, overly long reasoning chains can cause “overthinking,” slowing responses and increasing errors.

OmniThought Dataset Construction

The PAI team proposed two metrics—Reasoning Verbosity (RV) and Cognitive Difficulty (CD)—and built the OmniThought dataset containing more than 2 million annotated CoT examples. Data sources include OpenThoughts2‑1M (≈640 k problems across math, code, science, puzzles) and DeepMath‑103K (≈103 k math problems). Teacher models (DeepSeek‑R1, QwQ‑32B) generated multiple reasoning chains for each problem, which were validated using an LLM‑as‑a‑judge protocol.

You are a rigorous logical validator analyzing problem‑solving components.
Your task is to separately assess the validity of the reasoning process and final solution.
For SOLUTION VALIDITY: Directly compare it to the correct answer.
For REASONING PROCESS VALIDATION:
  a. Verify stepwise logical coherence and soundness
  b. Confirm all critical problem constraints are addressed
  c. Check for self‑contradictions or unsupported leaps
  d. Verify the process can derive the proposed solution
Evaluation Protocol:
- Solution validity must be FALSE for any numerical mismatch.
- Reasoning validity requires all criteria (a‑d) satisfied.
Output format: reasoning_valid: bool, solution_valid: bool

Each problem in OmniThought has at least two verified correct reasoning chains, resulting in 708 k unique problems.

Reasoning Verbosity (RV)

RV measures the amount of redundant reasoning. Scores range from 0 (minimal explanation) to 9 (extensive, detailed reasoning). The distribution is defined as:

0‑1: Minimal verbosity, direct answer.
2‑3: Low verbosity, clear concise reasoning.
4‑5: Medium verbosity, detailed explanation.
6‑7: High verbosity, thorough exploration.
8‑9: Very high verbosity, deep, nested arguments.

Cognitive Difficulty (CD)

CD reflects the inherent difficulty of the reasoning required. Scores also range from 0 to 9, with higher values indicating more abstract or multi‑step reasoning.

0‑1: Elementary knowledge, single‑step thinking.
2‑3: Multi‑step arithmetic, rule‑based reasoning.
4‑5: Basic logic/algebra, non‑obvious inference.
6‑7: Advanced techniques (determinants, DP, code reasoning).
8‑9: Highly abstract methods, nested proofs, complex algorithm analysis.

Analysis shows most CD scores cluster around 4‑5, while larger models tend to generate higher‑difficulty chains.

Training Adaptive Reasoning Models

Using RV and CD scores, three subsets of OmniThought (short, medium, long) were created and used to fine‑tune Qwen2.5‑7B‑Instruct via SFT, producing DistilQwen‑ThoughtX models with adaptive chain lengths. Experiments on GSM8K, MATH500, and AIME24 demonstrate that longer chains improve accuracy on difficult tasks but may hurt performance on simple ones.

DistilQwen‑ThoughtX Models and Release

DistilQwen‑ThoughtX‑7B and DistilQwen‑ThoughtX‑32B were trained on the filtered OmniThought data and released on Hugging Face and ModelScope. Users can download them with the following code:

from huggingface_hub import snapshot_download
model_name = "alibaba-pai/DistilQwen-ThoughtX-7B"
snapshot_download(repo_id=model_name, cache_dir="./DistilQwen-ThoughtX-7B/")
model_name = "alibaba-pai/DistilQwen-ThoughtX-32B"
snapshot_download(repo_id=model_name, cache_dir="./DistilQwen-ThoughtX-32B/")

The OmniThought dataset itself is also available via load_dataset("alibaba-pai/OmniThought").

Conclusion

OmniThought provides a large, richly annotated CoT resource that enables LLMs to adapt reasoning chain length to task difficulty, mitigating overthinking while enhancing accuracy on complex problems. The released DistilQwen‑ThoughtX models demonstrate the effectiveness of this approach, outperforming prior distilled models and offering a practical toolkit for the community.

References

Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang. "Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations." arXiv preprint.

Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang. "EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models." arXiv preprint.

Wenrui Cai et al. "Training Small Reasoning LLMs with Cognitive Preference Alignment." arXiv preprint.

Chengyu Wang et al. "DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models." ACL 2025.

Yuanhao Yue et al. "Building a Family of Data Augmentation Models for Low‑cost LLM Fine‑tuning on the Cloud." COLING 2025.

Yuanhao Yue et al. "Distilling Instruction‑following Abilities of Large Language Models with Task‑aware Curriculum Planning." EMNLP 2024.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

LLM reasoning dataset Distillation CoT

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.