Why Your Claude Code Quota Drains Fast and How to Save Up to 90% of Tokens
A typical Claude Code session spends roughly 98% of its tokens on input rather than generated code, so most of the budget goes to context, file reads, and system prompts. This article explains the billing model, common waste patterns, monitoring tools, and a four‑layer optimization pyramid that can cut token usage by 50‑90%.
Token Consumption Mechanics
Claude Code uses a four‑dimensional billing system. Tokens are classified as:
Input tokens – prompts, conversation history, file contents, CLAUDE.md, system prompts, tool definitions (baseline price).
Output tokens – model replies and internal “thinking” tokens (3‑5× input price).
Cache‑write tokens – first write of cached content (1.25× input price).
Cache‑read tokens – subsequent reads of cached content (0.1× input price, i.e. 90 % discount).
In a typical session 98 % of tokens are input and only ~0.6 % are actual code output. Each API call resends the entire conversation history, system prompt and tool definitions, so the cost per round grows as the context grows.
Prompt caching mitigates this: immutable parts (system prompt, CLAUDE.md) are cached, and subsequent reads cost only 10 % of the normal price. In one 157‑round session, 98 % of tokens were cache‑reads, bringing the effective input cost down from $5/M to close to the $0.50/M cache‑read rate.
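The blended rate behind that claim is easy to sanity-check. A minimal sketch, using the Opus input and cache-read prices from the table below:

```python
# Effective input cost per million tokens when most of the context is cache-reads.
# Prices from the table below: $5/M for fresh input, $0.50/M for cache-reads.
INPUT_PRICE = 5.00        # $/M tokens, uncached input
CACHE_READ_PRICE = 0.50   # $/M tokens, cached input (90% discount)

def effective_input_price(cache_hit_ratio: float) -> float:
    """Blended $/M input price for a given fraction of cache-read tokens."""
    return cache_hit_ratio * CACHE_READ_PRICE + (1 - cache_hit_ratio) * INPUT_PRICE

# With 98% cache-reads: 0.98 * 0.50 + 0.02 * 5.00 = $0.59/M,
# close to the pure cache-read price and far below the $5/M list price.
print(round(effective_input_price(0.98), 2))
```

Note that this is why keeping the cached prefix stable matters: any edit to the system prompt or CLAUDE.md invalidates the cache and drops the hit ratio back toward zero.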
Pricing (per million tokens)
Opus 4.6 – $5 input, $25 output, $0.50 cache‑read.
Sonnet 4.6 – $3 input, $15 output, $0.30 cache‑read.
Haiku 4.5 – $1 input, $5 output, $0.10 cache‑read.
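Combining those rates with the four billing dimensions gives a rough per-session cost estimate. A sketch with Sonnet rates (the cache-write price is derived as 1.25× input, per the billing model above; the token counts are hypothetical):

```python
# Rough session cost across the four billing dimensions, at Sonnet 4.6 rates
# from the table above ($ per million tokens). Cache-write = 1.25 x input.
PRICES = {"input": 3.00, "output": 15.00, "cache_write": 3.75, "cache_read": 0.30}

def session_cost(tokens: dict) -> float:
    """Total dollar cost given token counts per billing dimension."""
    return sum(tokens[k] / 1_000_000 * PRICES[k] for k in tokens)

# Hypothetical session: heavy on cache-reads, light on actual output.
usage = {"input": 40_000, "output": 6_000,
         "cache_write": 120_000, "cache_read": 2_000_000}
print(round(session_cost(usage), 2))
```

Note how the 2 M cache-read tokens cost only $0.60, while the same volume billed as fresh input would have cost $6.00.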
Built‑in Monitoring Commands
/cost – shows total token usage and a cost estimate.
/context – breakdown of context components (system prompt, MCP tools, conversation history, etc.).
/stats – usage statistics for Pro/Max users.
/usage – current usage versus plan quota.
Community Monitoring Tools
ccusage – CLI (npx ccusage) that parses local JSONL logs and generates daily/weekly token‑usage reports.
claude‑monitor – Python real‑time dashboard refreshed every 3 seconds, predicts burn‑rate.
RTK – Rust CLI (cargo install rtk) that filters Bash output, merges duplicate lines and truncates verbose success output.
Recommendation: Run /cost after each completed feature, similar to checking heart‑rate during a run.
Four‑Layer Optimization Pyramid
Layer 1 – Zero‑Cost, Immediate Wins (≈5 min)
Create a .claudeignore file at the project root to exclude unnecessary files. Example:
node_modules/
.git/
dist/
build/
.next/
*.lock
*.log
coverage/
__pycache__/
*.sqlite
*.wasm
*.min.js
Effect: reduces context consumption by 30‑70 %.
Use precise prompts instead of vague ones. Example:
Vague: “Fix this bug”.
Precise: “Add try‑catch to getUserById in src/auth.ts:45”.
Precise prompts cut token usage by 90‑97 % (e.g., 10 000 → 300 tokens).
Run /clear after each feature or bug fix to discard stale context.
Merge related messages into a single request to avoid sending the same context multiple times.
Write prompts in English; English tokenization is ~30 % more efficient than Chinese.
Layer 2 – Light Configuration (≈30 min)
Trim CLAUDE.md to ≤200 lines and treat it as an index, not a full encyclopedia. Teams reported an 88 % reduction (11 000 → 1 300 tokens).
Model routing (70/20/10 rule):
File discovery, simple transforms, pair‑programming → Haiku 4.5 (≈90 % of Sonnet’s capability at 1/3 the price).
Most coding tasks → Sonnet 4.6 (best speed‑quality balance).
Complex architectural decisions, cross‑file refactoring → Opus 4.6 (deep reasoning, 1 M context).
All‑Opus sessions cost $15‑30; mixed routing drops cost to $3‑7.
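The saving from the 70/20/10 split follows directly from the price table above. A quick sketch on the input side:

```python
# Blended input price for the 70/20/10 routing split, vs. all-Opus.
# Input prices from the pricing table above ($/M): Haiku 1, Sonnet 3, Opus 5.
ROUTING = {"haiku": (0.70, 1.00), "sonnet": (0.20, 3.00), "opus": (0.10, 5.00)}

blended = sum(share * price for share, price in ROUTING.values())
print(blended)                        # blended $/M, vs 5.0 for all-Opus
print(round(1 - blended / 5.00, 2))   # fractional saving on input tokens
```

The blended rate works out to $1.80/M, about 64 % cheaper than all‑Opus; the output side ($5/$15/$25 per M) yields the same ~64 % saving, consistent with the $15‑30 → $3‑7 session figures above.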
Control extended thinking (the most expensive hidden cost). Set a global effort level:
export CLAUDE_CODE_EFFORT_LEVEL=medium
/effort low # simple task
/effort high # complex reasoning
Or cap thinking tokens:
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1
export MAX_THINKING_TOKENS=10000
Manually trigger /compact when context reaches 60 % instead of waiting for auto‑compact at 95 %.
Disconnect unused MCP servers (each consumes 2‑5 K tokens).
Layer 3 – Tool‑Assisted Optimizations (requires installation)
RTK (Rust Token Killer) – CLI proxy that applies four strategies (smart filtering, merging, deduplication, truncation). Benchmarks: cargo test output reduced from 155 lines to 3 lines (98 % compression); overall session tokens 118 K → 24 K (80 % reduction).
Hooks for automatic noise filtering. Example JSON snippet:
{
"hooks": {
"PreToolUse": [
{
"matcher": "Bash",
"hooks": [
{
"type": "command",
"command": "~/.claude/hooks/filter-test-output.sh"
}
]
}
]
}
}
Output is limited to 10 000 characters; the excess is saved to a file.
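The referenced filter-test-output.sh is not shown in the source; as a hypothetical stand-in, the truncate-and-spill behaviour described above might look like this:

```python
# Hypothetical stand-in for the filter-test-output.sh hook above: short output
# passes through unchanged; anything past 10,000 characters is truncated, with
# the full text spilled to a temp file for later inspection.
import tempfile

LIMIT = 10_000

def filter_output(text: str) -> str:
    """Cap tool output at LIMIT characters, saving the full text to disk."""
    if len(text) <= LIMIT:
        return text
    with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as f:
        f.write(text)          # full output stays on disk, out of the context
        path = f.name
    return text[:LIMIT] + f"\n[truncated; full output saved to {path}]"
```

In a real hook, a wrapper script would pipe the tool's stdout through this filter so only the capped version reaches the model's context.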
Skills – on‑demand loading. Split a large CLAUDE.md into many small skill files. At startup only skill names and one‑line descriptions (~450 tokens) are loaded; full content is fetched when the skill is activated. This can save 98 % of tokens for unused skills (e.g., 23 core skills + 158 on‑demand skills reduced load by 42.5 %).
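The pattern can be sketched schematically (real Skills read markdown files from disk; a dict stands in here, and the skill names are hypothetical):

```python
# Schematic sketch of on-demand skill loading: startup touches only the
# one-line blurbs; full bodies are pulled into context only on activation.
class SkillIndex:
    def __init__(self, skills: dict[str, str]):
        # Startup cost: the first line of each skill only (a few tokens each).
        self.blurbs = {name: text.splitlines()[0] for name, text in skills.items()}
        self._source = skills              # stand-in for on-disk skill files
        self._loaded: dict[str, str] = {}  # skills actually paid for so far

    def activate(self, name: str) -> str:
        """Load the full skill body on first use, then cache it."""
        if name not in self._loaded:
            self._loaded[name] = self._source[name]
        return self._loaded[name]

idx = SkillIndex({
    "git-workflow": "Branching and commit conventions\n...200 more lines...",
    "api-style": "REST endpoint naming rules\n...150 more lines...",
})
print(idx.blurbs["git-workflow"])  # only this line is in context at startup
```

Until activate("git-workflow") is called, the 200-line body never enters the context, which is where the 98 % saving on unused skills comes from.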
Advanced tactics:
Fine‑grained sub‑agent management – use sub‑agents only for tasks involving >10 files, >3 parallel jobs, or requiring context isolation. Sub‑agents cost 4‑7× a single agent; use Haiku for sub‑agents.
Precise file reads – Read(file_path="src/service.py", offset=150, limit=50) instead of reading the whole file (≈300 tokens vs ~2 400).
Prefer Edit over Write for small changes: editing a single line in a 500‑line file costs ~200 output tokens, writing the whole file costs ~5 000 (96 % saving).
Batch API and prompt caching – batch non‑real‑time tasks for a 50 % discount; keep system prompts and CLAUDE.md stable to retain cache benefits.
Practical Case Studies
RTK Deployment – 80 % Compression
Before: 118 K tokens in a 30‑minute session.
After: 24 K tokens.
Savings: 80 %.
.claudeignore + CLAUDE.md Slimming – 88 % Reduction
Before: 11 000 tokens.
After: 1 300 tokens.
Savings: 88 %.
Skills Modularisation – 42.5 % Load Reduction
Before: all project conventions loaded each start.
After: 23 core skills always loaded + 158 on‑demand skills.
Load reduced by 42.5 %.
Full‑Stack Optimization – ~90 % Overall Savings
Model routing (70 % Haiku, 20 % Sonnet, 10 % Opus).
.claudeignore to drop build artifacts.
Skills replace monolithic CLAUDE.md.
Frequent /clear after each feature.
Result: ~90 % token reduction with no noticeable quality loss.
Token‑Saving Checklist
Basic (Must‑Do)
Project contains a .claudeignore that excludes node_modules, build outputs, lock files, etc.
CLAUDE.md is ≤200 lines.
Run /clear after each completed task.
Prompts include exact file paths and line numbers.
Combine related requests into a single message.
Intermediate (Recommended)
Use Sonnet for daily coding, Haiku for simple tasks.
Adjust thinking intensity with /effort based on task complexity.
Manually /compact when context reaches 60 %.
Disconnect rarely used MCP servers.
Periodically check /cost and /context.
Advanced (Bonus)
Install RTK or similar CLI compressor.
Configure hooks to filter test output.
Split project conventions into on‑demand Skills.
Use Haiku for sub‑agents, Sonnet/Opus for the main session.
Read files with offset/limit to fetch only needed lines.
Bottom line: The cheapest token is the one you never have to send.
References
[1] GitHub Issue #41930 – https://github.com/anthropics/claude-code/issues/41930
[2] DEV Community analysis – https://dev.to/slima4/where-do-your-claude-code-tokens-actually-go-we-traced-every-single-one-423e
[3] GitHub Issue #13579 – https://github.com/anthropics/claude-code/issues/13579
[4] RTK project – https://github.com/rtk-ai/rtk
[5] Zhihu user experiment – https://zhuanlan.zhihu.com/p/1968758095460147508