How HiPO Gives LLMs a Smart Thinking Switch to Cut Costs and Boost Accuracy

This article explains the overthinking problem of large language models, then introduces the HiPO framework, whose hybrid data cold‑start and reinforcement‑learning reward mechanisms let models decide when to think deeply and when to answer directly. It closes with experimental results showing significant efficiency gains and accuracy improvements across multiple benchmarks.

Kuaishou Tech

When users ask simple questions to large language models (LLMs), the models often generate unnecessarily long chain‑of‑thought reasoning, wasting compute resources and sometimes producing wrong answers due to “overthinking”.

The Overthinking Dilemma

LLMs’ success on complex tasks stems from chain‑of‑thought (CoT) prompting, but this also leads to high token usage, latency, and cost. Existing solutions—training‑time adaptive inference, external prompts, or post‑hoc pruning—have limited scalability or effectiveness.

HiPO: Hybrid Policy Optimization

The KwaiKAT team at Kuaishou, together with Nanjing University's NLINK and ARiSE labs, proposes HiPO, a framework that equips LLMs with an intelligent "thinking switch". It combines two core components:

Hybrid Data Cold‑Start – A curated dataset containing both “Think‑on” (with reasoning) and “Think‑off” (direct answer) responses, built from high‑quality math and code reasoning corpora.

Hybrid Reinforcement‑Learning Reward System – A reward that balances answer correctness, format, and a bias term to prevent the model from always choosing the more accurate but costlier “Think‑on” mode.

During training, a strong reasoning model (e.g., DeepSeek‑V3) generates N "Think‑on" and N "Think‑off" answers for each question. All answers are automatically verified, and the mode with the higher pass rate is selected; if the rates are close, the system prefers the cheaper "Think‑off" mode. The shortest correct answer in the winning mode becomes the final training sample, and a justification token explains the mode choice.
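The selection step above can be sketched as follows. This is a minimal illustration, not the released pipeline: the function name, the answer representation, and the `margin` threshold for "close" pass rates are all assumptions.

```python
def select_cold_start_sample(on_answers, off_answers, margin=0.05):
    """Pick the cold-start training sample for one question.

    Each answer is a (text, is_correct) pair produced by a strong
    reasoning model and automatically verified. `margin` is a
    hypothetical threshold for treating pass rates as "close".
    """
    on_rate = sum(ok for _, ok in on_answers) / len(on_answers)
    off_rate = sum(ok for _, ok in off_answers) / len(off_answers)

    # Prefer the mode with the higher verified pass rate; when the
    # rates are within the margin, fall back to the cheaper Think-off.
    if off_rate >= on_rate - margin:
        mode, pool = "think-off", off_answers
    else:
        mode, pool = "think-on", on_answers

    # The shortest correct answer in the winning mode is kept.
    correct = [text for text, ok in pool if ok]
    if not correct:
        return None  # no verified answer in the winning mode
    return mode, min(correct, key=len)
```

In a real pipeline the returned sample would also carry the justification token explaining why the mode was chosen.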

Reward Details

Base reward scores answer correctness (ACC) and format (FORMAT). A dynamic bias (bias_off) is added to “Think‑off” based on the average “Think‑on” reward, scaled by a small factor ω (≈0.01). This prevents the model from over‑favoring “Think‑on” during RL.
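A hedged sketch of this reward, assuming the simplest reading of the description: the base reward sums correctness and format scores, and "Think‑off" rollouts receive an additive bias scaled by ω from the batch's average "Think‑on" reward. The exact formula is not given in the article, so the functions below are illustrative assumptions.

```python
def think_off_bias(think_on_rewards, omega=0.01):
    """Dynamic bias for Think-off, scaled by a small factor omega
    from the average Think-on reward in the current batch
    (assumed formula; the article only states the scaling behavior)."""
    if not think_on_rewards:
        return 0.0
    return omega * (sum(think_on_rewards) / len(think_on_rewards))

def total_reward(acc, fmt, mode, think_on_rewards, omega=0.01):
    """Base reward = correctness (ACC) + format (FORMAT); Think-off
    additionally gets the bias so the policy does not collapse into
    always choosing the costlier Think-on mode."""
    reward = acc + fmt
    if mode == "think-off":
        reward += think_off_bias(think_on_rewards, omega)
    return reward
```

Because ω is small (≈0.01), the bias only tips the balance when the two modes perform comparably; it cannot compensate for a genuinely wrong "Think‑off" answer.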

Two advantage functions guide token‑level optimization:

Judge Advantage (A_judge) – Encourages reasonable mode selection by comparing global and within‑mode performance.

Answer Advantage (A_answer) – Rewards higher quality answers within the chosen mode.

Experimental Results

HiPO was evaluated on Qwen‑3 series models (8B, 1.7B, 32B) across benchmarks such as AIME2024/2025, HumanEval, LiveCodeBench v6, MATH‑500, and GPQA‑Diamond. Compared with baselines (AdaptThink, AutoThink, etc.), HiPO achieved:

≈30% reduction in average token length and 37% lower thinking ratio, cutting inference cost.

≈6.3% absolute gain in accuracy, showing that efficiency does not sacrifice performance.

Consistent superiority over existing adaptive inference methods.

Analysis of RL training shows the “Think‑on” activation rate dropping from 89.5% to 53.1%, and task‑dependent mode usage (high for math/code reasoning, low for simpler code tasks), confirming effective dynamic behavior.

Future Outlook

HiPO’s smart thinking switch paves the way for practical LLM deployment by reducing compute and latency, offers a new direction for model compression, and enhances meta‑cognitive abilities, moving AI from brute‑force reasoning to intelligent efficiency.

The framework and models are open‑sourced on Hugging Face for community use.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: efficiency · LLM · reinforcement learning · adaptive inference · Hybrid Policy Optimization
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
