Skill‑Driven Reasoning Cuts Tokens by Up to 59% While Boosting Accuracy
The article introduces TRS (Thinking with Reasoning Skills), a framework that distills historical LLM reasoning traces into reusable skill cards. Offline skill‑base construction combined with online retrieval cuts token consumption by 6‑59% while often improving accuracy on math and coding tasks.
Current large‑language‑model (LLM) reasoning pipelines such as OpenAI o1 and DeepSeek‑R1 achieve impressive accuracy but often generate thousands of tokens of intermediate "thinking" steps, inflating inference cost and latency.
The TRS framework, proposed by Qiyuan Technology together with Tsinghua and Peking University, requires no additional training and works with black‑box models. It distills long reasoning trajectories into compact, reusable skill cards, allowing models to reason with fewer tokens while maintaining or even improving accuracy (token reduction of 6‑59% with little to no accuracy loss).
1. Token inflation crisis
Modern reasoning models use explicit chain‑of‑thought (CoT) to boost reliability, but inference cost scales linearly with token count. In commercial API pricing, output tokens are often more expensive than input tokens, and complex problems trigger massive verification, trial‑and‑error, and back‑tracking loops, stressing infrastructure.
Existing speed‑up methods (Chain‑of‑Draft, TALE, NoWait) all try to make the model "think shorter"; however, forcing a shorter reasoning space creates an efficiency‑accuracy trade‑off—simple problems benefit, but difficult ones often fail.
Core question: Can we avoid zero‑shot reasoning and instead invoke already‑distilled solution experience directly?
2. Core insight: from zero‑shot to skill recall
Human experts rarely derive solutions from scratch; they rely on reusable skills such as "find invariants" or "two‑pointer". TRS systematizes this by offline distilling both successful and failed trajectories into structured skill cards.
Offline: Convert long reasoning traces (including successes and failures) into skill cards.
Online: Retrieve the most relevant skill cards for a new query and inject them into the prompt.
Standard CoT for an integral problem requires a long sequence of "integration by parts → trigonometric substitution → trial‑and‑error". TRS retrieves a "chain rule + substitution" skill card and solves the problem in three steps, dramatically cutting token usage.
3. Method details: TRS framework
3.1 Skill Card design
Each skill card is a compact text with five fields:
Trigger: scenario trigger words (e.g., "contains integral").
Do: core operation steps (minimal executable recipe).
Avoid: anti‑patterns or common traps.
Check: constraints or invariants that must be verified.
Risk: edge cases and failure modes.
Correct trajectories capture successful patterns; erroneous trajectories are distilled into "anti‑pattern → correction" strategies, which is crucial for improving accuracy on hard problems.
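The five‑field card above maps naturally onto a small data structure. The sketch below is our illustration, not the paper's code; the field contents are invented examples, and `render` shows one plausible way to serialize a card for prompt injection.

```python
from dataclasses import dataclass

@dataclass
class SkillCard:
    """One distilled reasoning skill with the five fields described above."""
    trigger: str  # scenario trigger words, e.g. "contains integral"
    do: str       # core operation steps (minimal executable recipe)
    avoid: str    # anti-patterns or common traps
    check: str    # constraints or invariants that must be verified
    risk: str     # edge cases and failure modes

    def render(self) -> str:
        # Compact text form suitable for prepending to a prompt.
        return (f"Trigger: {self.trigger}\nDo: {self.do}\n"
                f"Avoid: {self.avoid}\nCheck: {self.check}\nRisk: {self.risk}")

# Hypothetical card contents for illustration only.
card = SkillCard(
    trigger="contains integral",
    do="try substitution first, then integration by parts",
    avoid="blind trigonometric substitution on rational integrands",
    check="differentiate the result to confirm it matches the integrand",
    risk="discontinuities of the antiderivative at substitution boundaries",
)
print(card.render())
```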
3.2 Offline skill‑base construction
Run the reasoning model on source problems to obtain trajectories and results.
Use a stronger distillation model (e.g., Gemini Flash) to compress trajectories into skill cards and 10‑20 retrieval keywords.
Store them in a key‑value store where Key = Concat(question, keywords) and Value = skill card.
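The three steps above can be sketched as a single build loop. Here `run_reasoner` and `distill` are placeholders for the reasoning model and the distillation model (e.g., Gemini Flash), which in practice are API calls; only the key construction (`Key = Concat(question, keywords)`) follows the paper's description.

```python
def build_skill_base(problems, run_reasoner, distill):
    """Offline construction: problems -> {retrieval key: skill card text}."""
    skill_base = {}
    for question in problems:
        # Step 1: run the reasoning model to get a trajectory and result.
        trajectory, result = run_reasoner(question)
        # Step 2: distill the trajectory (success OR failure) into a card
        # plus retrieval keywords; failures yield "anti-pattern -> correction".
        card_text, keywords = distill(trajectory, result)
        # Step 3: store with Key = Concat(question, keywords).
        key = " ".join([question] + keywords)
        skill_base[key] = card_text
    return skill_base
```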
The paper validates the approach on DEEPMATH‑103K (93 K build, 10 K test) and NEMOTRON‑COMPETITIVEPROGRAMMING‑V1 (26.6 K build, 1 K test).
3.3 Online retrieval and injection
Retrieve: use BM25 for math queries or a hybrid BM25 + dense embedding for code queries to fetch top‑k skill cards.
Inject: prepend the selected skill cards to the prompt (Figure 13 shows the template).
Lightweight gating: add an arbitration instruction in the prompt – "use only directly applicable skills; ignore irrelevant or contradictory advice".
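The retrieve-and-inject loop can be sketched as follows. The minimal BM25 scorer and the prompt layout are our simplifications for illustration; the paper's actual template (Figure 13) and retrieval stack may differ.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal BM25 over whitespace-tokenized keys (illustrative only)."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter()
    for t in tokenized:
        df.update(set(t))
    scores = []
    for t in tokenized:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (
                tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

def inject(query, skill_base, k=1):
    """Fetch top-k cards and prepend them, plus the gating instruction."""
    keys = list(skill_base)
    scores = bm25_scores(query, keys)
    top = sorted(range(len(keys)), key=lambda i: -scores[i])[:k]
    cards = "\n\n".join(skill_base[keys[i]] for i in top)
    gating = ("Use only directly applicable skills; "
              "ignore irrelevant or contradictory advice.")
    return f"{cards}\n\n{gating}\n\nProblem: {query}"
```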
Why do tokens drop? Although the prompt grows with the injected skill cards, redundant exploration, trial‑and‑error loops, and repeated verification disappear, so net token count, end‑to‑end cost, and latency all decrease.
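A quick back-of-the-envelope check makes the trade concrete. All numbers below (per-token prices, token counts) are hypothetical assumptions, not figures from the paper; they only illustrate why a few hundred extra input tokens can still lower total cost when they eliminate thousands of output tokens.

```python
# Hypothetical per-token prices; output priced ~4x input, as is common.
price_in, price_out = 0.15 / 1e6, 0.60 / 1e6  # $/token (assumed)

# Baseline: short prompt, long chain-of-thought output (assumed counts).
baseline = 200 * price_in + 8000 * price_out

# With skills: +600 prompt tokens, but far shorter output (assumed counts).
with_skills = (200 + 600) * price_in + 3500 * price_out

# The larger prompt is dwarfed by the output savings.
assert with_skills < baseline
```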
4. Main experiments: breaking the efficiency‑accuracy trade‑off
4.1 Math reasoning (DeepMath‑103K)
Doubao Seed reduces tokens by 53.8% with only –0.2% accuracy loss.
GPT‑4o‑mini gains 1.8% accuracy and cuts cost by 6.9%.
GPT‑OSS‑120B keeps accuracy unchanged while cutting cost by 16.9%.
4.2 Competitive programming
GPT‑4o‑mini: accuracy 22.0% → 24.4% (+2.4%); cost ↓6.3%.
Doubao Seed‑2.0: accuracy 63.6% → 64.4% (+0.8%); cost ↓6.0%.
GPT‑OSS‑120B: accuracy 54.2% → 58.3% (+4.1%); cost ↑4.8% due to the larger prompt, but the accuracy gain is significant.
5. Deep analysis: why TRS wins
5.1 Larger advantage on hard problems vs. TALE/CoD/NoWait
Existing acceleration methods often fail catastrophically on difficult questions. TRS on GPT‑OSS raises accuracy from ~45% to ~80% on the hardest difficulty interval while cutting tokens from ~15 k to ~7 k.
5.2 Control experiments: not just simple RAG
Ablation studies show that the benefit cannot be explained by retrieval alone; the combination of structured skill cards with sufficient coverage is required. Models need executable procedural guidance, not merely "relevant context".
5.3 Cross‑model transfer
Using a skill base built by Doubao for OSS, or vice‑versa, yields positive gains.
Same‑model style alignment gives the biggest benefit (e.g., Doubao using the Doubao‑built library).
Cross‑source skills can sometimes achieve even more aggressive token cuts.
5.4 Retrieval strategy
Math queries have high lexical overlap, so BM25 works; code queries have diverse surface forms, requiring semantic matching, so a hybrid BM25 + dense embedding is used. Default settings: BM25(k=1) for math, Hybrid(k=5) for code.
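The hybrid idea can be sketched by blending a lexical score with an embedding similarity. In this toy version a character-trigram bag stands in for a real dense encoder, the lexical score is a crude overlap ratio rather than full BM25, and `alpha` and all names are our illustrative assumptions.

```python
import math
from collections import Counter

def ngram_vec(text, n=3):
    """Toy stand-in for a dense encoder: character-trigram counts."""
    t = f" {text.lower()} "
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lexical(query, doc):
    """Crude lexical overlap ratio (BM25 stand-in for brevity)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_topk(query, keys, k=5, alpha=0.5):
    """Blend lexical and 'dense' scores, return the top-k keys."""
    qv = ngram_vec(query)
    scored = [(alpha * lexical(query, key)
               + (1 - alpha) * cosine(qv, ngram_vec(key)), key)
              for key in keys]
    return [key for _, key in sorted(scored, reverse=True)[:k]]
```

The semantic component is what lets a query like "two pointer technique" match a card keyed on a different surface form, which matters for code where wording varies widely.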
5.5 External competition transfer: AoPS skill base
Distilled 7,616 skill cards from the AoPS competition set and evaluated on AIME 2024/2025/2026 and HMMT 2025. Across 25 model‑benchmark pairs, 13 showed accuracy improvements and 20 showed cost reductions. Doubao‑1.8 averaged +1.88% accuracy and –2.8% cost; Gemini‑3‑Flash improved accuracy with a slight cost increase.
Table 6 shows AIME 2024 I as the best transfer (+2.54%); later AIME 2026 gains plateau, indicating that domain proximity remains a key factor.
6. Engineering implications
Enterprises can offline distill skill bases with strong models (GPT‑4, Gemini) and deploy them for lightweight models (GPT‑4o‑mini, Doubao) to achieve a "master‑experience, apprentice‑execution" cost structure.
https://github.com/stallone0000/Reasoning-Skill
huggingface.co/datasets/stallone0000/Reasoning-Skill
https://reasoning-skill.onrender.com
https://arxiv.org/pdf/2604.21764
Thinking with Reasoning Skills: Fewer Tokens, More Accuracy
