How HiPO Gives LLMs a Smart Thinking Switch to Cut Costs and Boost Accuracy
This article explains the overthinking problem in large language models, introduces the HiPO framework, whose hybrid data cold‑start and reinforcement‑learning reward mechanism let the model decide when to reason deeply and when to answer directly, and presents experimental results showing significant efficiency gains and accuracy improvements across multiple benchmarks.
When users ask large language models (LLMs) simple questions, the models often generate unnecessarily long chain‑of‑thought reasoning, wasting compute and sometimes producing wrong answers as a result of “overthinking”.
The Overthinking Dilemma
LLMs’ success on complex tasks stems from chain‑of‑thought (CoT) prompting, but this also leads to high token usage, latency, and cost. Existing solutions—training‑time adaptive inference, external prompts, or post‑hoc pruning—have limited scalability or effectiveness.
HiPO: Hybrid Policy Optimization
The KwaiKAT team at Kuaishou and Nanjing University’s NLINK and ARiSE labs propose HiPO, a framework that equips LLMs with an intelligent “thinking switch”. It combines two core components:
Hybrid Data Cold‑Start – A curated dataset containing both “Think‑on” (with reasoning) and “Think‑off” (direct answer) responses, built from high‑quality math and code reasoning corpora.
Hybrid Reinforcement‑Learning Reward System – A reward that balances answer correctness, format, and a bias term to prevent the model from always choosing the more accurate but costlier “Think‑on” mode.
To build the cold‑start data, a strong reasoning model (e.g., DeepSeek‑V3) generates N “Think‑on” and N “Think‑off” responses for each question. All responses are automatically verified, and the mode with the higher pass rate is selected; when the rates are close, the system prefers the cheaper “Think‑off” mode. The shortest correct response from the winning mode becomes the final training sample, and a justification token records why that mode was chosen.
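To make the selection rule concrete, here is a minimal Python sketch of how the winning mode and final sample might be chosen. The Candidate structure, the MARGIN threshold, and the function name are illustrative assumptions, not HiPO’s released code.

```python
# Minimal sketch of the hybrid cold-start selection logic described above.
# Candidate, MARGIN, and select_mode_and_sample are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str        # full response (reasoning + answer, or direct answer)
    correct: bool    # result of automatic verification
    mode: str        # "think_on" or "think_off"

MARGIN = 0.1  # assumed threshold for "pass rates are close"

def select_mode_and_sample(candidates: list[Candidate]):
    """Pick the winning mode and the shortest correct response within it."""
    on = [c for c in candidates if c.mode == "think_on"]
    off = [c for c in candidates if c.mode == "think_off"]
    pass_on = sum(c.correct for c in on) / max(len(on), 1)
    pass_off = sum(c.correct for c in off) / max(len(off), 1)

    # Prefer the mode with the higher pass rate; when the two rates are
    # close, fall back to the cheaper Think-off mode.
    winner = on if pass_on > pass_off + MARGIN else off

    correct = [c for c in winner if c.correct]
    if not correct:
        return None  # discard questions neither mode can solve
    # The shortest correct response becomes the cold-start training sample.
    return min(correct, key=lambda c: len(c.text))
```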
Reward Details
The base reward scores answer correctness (ACC) and format compliance (FORMAT). A dynamic bias (bias_off), computed from the average “Think‑on” reward and scaled by a small factor ω (≈0.01), is added to “Think‑off” responses; this prevents the model from over‑favoring “Think‑on” during RL.
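A rough sketch of how such a biased reward could be computed follows; the exact weights on correctness and format, and the way bias_off enters the reward, are assumptions made for illustration only.

```python
# Illustrative reward sketch for the hybrid RL stage. The 0.5 format weight
# and the additive use of bias_off are assumptions, not HiPO's exact formula.
OMEGA = 0.01  # small scaling factor reported for the dynamic bias

def base_reward(is_correct: bool, format_ok: bool) -> float:
    # Base reward combines answer correctness (ACC) and format compliance (FORMAT).
    return float(is_correct) + 0.5 * float(format_ok)

def think_off_bias(think_on_rewards: list[float]) -> float:
    # Dynamic bias for Think-off, derived from the average Think-on reward.
    if not think_on_rewards:
        return 0.0
    return OMEGA * sum(think_on_rewards) / len(think_on_rewards)

def hybrid_reward(mode: str, is_correct: bool, format_ok: bool,
                  think_on_rewards: list[float]) -> float:
    r = base_reward(is_correct, format_ok)
    if mode == "think_off":
        r += think_off_bias(think_on_rewards)  # nudges RL away from always thinking
    return r
```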
Two advantage functions guide token‑level optimization, as sketched after this list:
Judge Advantage (A_judge) – Encourages reasonable mode selection by comparing global and within‑mode performance.
Answer Advantage (A_answer) – Rewards higher quality answers within the chosen mode.
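The paper’s exact formulas are not reproduced in this summary, but a group‑normalized sketch that mirrors the described intent, with all function names and the normalization choice as assumptions, might look like this:

```python
# Hypothetical, group-normalized sketch of the two advantages:
# judge_advantage compares a mode's mean reward to the global mean (mode choice),
# answer_advantage compares a response's reward to its own mode's mean (answer quality).
from statistics import mean, pstdev

def judge_advantage(rewards_by_mode: dict[str, list[float]], mode: str) -> float:
    all_rewards = [r for rs in rewards_by_mode.values() for r in rs]
    global_mean = mean(all_rewards)
    mode_mean = mean(rewards_by_mode[mode])
    scale = pstdev(all_rewards) or 1.0  # avoid division by zero
    return (mode_mean - global_mean) / scale

def answer_advantage(rewards_by_mode: dict[str, list[float]], mode: str,
                     reward: float) -> float:
    mode_rewards = rewards_by_mode[mode]
    scale = pstdev(mode_rewards) or 1.0
    return (reward - mean(mode_rewards)) / scale
```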
Experimental Results
HiPO was evaluated on Qwen‑3 series models (8B, 1.7B, 32B) across benchmarks such as AIME2024/2025, HumanEval, LiveCodeBench v6, MATH‑500, and GPQA‑Diamond. Compared with baselines (AdaptThink, AutoThink, etc.), HiPO achieved:
≈30% reduction in average token length and 37% lower thinking ratio, cutting inference cost.
≈6.3% absolute gain in accuracy, showing that the efficiency gains do not come at the cost of answer quality.
Consistent superiority over existing adaptive inference methods.
Analysis of RL training shows the “Think‑on” activation rate dropping from 89.5% to 53.1%, and task‑dependent mode usage (high for math/code reasoning, low for simpler code tasks), confirming effective dynamic behavior.
Future Outlook
HiPO’s smart thinking switch paves the way for practical LLM deployment by reducing compute and latency, offers a new direction for model compression, and enhances meta‑cognitive abilities, moving AI from brute‑force reasoning to intelligent efficiency.
The framework and models are open‑sourced on Hugging Face for community use.