How to Slash AI Token Costs: MCP vs Skill and 6 Proven Optimization Techniques
This article explains the fundamental differences between web session tokens and AI tokens, compares MCP and Skill token consumption, presents pricing formulas for major models, and offers practical strategies—including prompt compression, context management, and dynamic toolsets—to dramatically reduce AI token expenses.
Web Session Token
In web applications a token acts as a temporary identity credential generated after login; it is attached to subsequent requests so the server can recognize the user.
Generated once, reused many times; creation costs resources, later use is near‑zero cost.
Small size, usually only dozens to a few hundred bytes.
Security‑oriented, used solely for authentication and authorization, unrelated to AI computation.
AI Token
In large language models a token is the smallest unit of text that the model processes – essentially the model’s "brain cell". Each token triggers a computation step, so token count directly reflects compute cost.
Dimension | Web Session Token | AI Token
---------------------|---------------------|----------
Nature | Identity credential | Compute unit
Billing | Free or negligible | Pay‑per‑token
Quantity per request | 1 | Hundreds to millions
Lifecycle | Hours to days | Fresh per request

AI Token Calculation
Token usage is bidirectional: both the prompt (input) and the model’s response (output) consume tokens, and output tokens are often priced higher than input tokens (e.g., Claude 3.5 Sonnet output costs 5× the input price).
Tokenization
English: 1 token ≈ 0.75 words ("ChatGPT" may be 1‑2 tokens).
Chinese: 1 character ≈ 1‑2 tokens (technical docs usually assume 1.5 tokens per character).
Code: symbols and indentation count; a Python snippet can be 1.5× the raw character count in tokens.
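These rules of thumb can be folded into a quick estimator. This is only a heuristic sketch; real BPE tokenizers (and the exact counts in the examples below) will differ:

```python
import re

def estimate_tokens(text: str) -> int:
    """Rough token estimate based on common rules of thumb.

    Heuristics (not a real BPE tokenizer):
      - CJK characters: ~1.5 tokens each
      - everything else: ~1 token per 4 characters (roughly 0.75 words per token)
    """
    cjk = len(re.findall(r"[\u4e00-\u9fff]", text))
    other = len(text) - cjk
    return round(cjk * 1.5 + other / 4)

print(estimate_tokens("Hello world"))  # 3 (close to the true count of 2)
print(estimate_tokens("你好世界"))      # 6
```

For production use, count with the provider's actual tokenizer; estimates like this are only good enough for budgeting.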
English: "Hello world" = 2 Tokens
Chinese: "你好世界" ("Hello world") = 4‑6 Tokens
Code: "def hello():" = 4‑5 Tokens

Pricing Formula
Typical cost calculation (e.g., DeepSeek, Yuanbao, Alibaba Cloud Bailian):
Total Cost = (input_token_count × input_price + output_token_count × output_price) / 1,000,000

Model pricing examples (price per million tokens):
Model | Input Price (¥) | Output Price (¥)
-------------------|-----------------|-----------------
Tongyi Qianwen‑Max | 2.4 | 9.6
DeepSeek‑V3 | 2.0 | 8.0
GPT‑4 level | 30+ | 60+

Skill vs MCP
MCP (Anthropic’s open protocol) enables AI agents to call external tools, but it loads the full definition of every tool into the context at startup, causing heavy token consumption. Skill, also from Anthropic, is a lightweight, on‑demand capability package that loads only a short description initially and fetches full definitions only when used.
MCP Token Pitfalls
When MCP starts, all tool definitions (name, description, JSON schema, return format, error handling) are loaded at once.
Example: 5 standard MCP tools (GitHub, Slack, Google Drive, etc.) consume ≈ 97 000 tokens, which is 48 % of Claude 3.5 Sonnet’s 200 k context window.
Intermediate results pass through the context twice (once as tool output, once as model input); a 2‑hour meeting transcript alone can add ~50 k tokens.
Skill Advantages
Startup loads only a brief description (~12 tokens).
Full instruction and code are loaded on demand (~300 tokens).
Overall token‑saving rate ≈ 99.6 %.
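The headline saving is back-of-envelope arithmetic, using the figures from the examples above:

```python
mcp_startup_tokens = 97_000      # 5 tools, all definitions loaded up front
skill_startup_tokens = 12        # brief description only
skill_on_demand_tokens = 300     # full instructions, fetched when actually used

skill_total = skill_startup_tokens + skill_on_demand_tokens
saving = 1 - skill_total / mcp_startup_tokens
print(f"Skill saving: {saving:.1%}")  # ~99.7%, in line with the ~99.6% figure
```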
Which Consumes More Tokens?
Conclusion:
MCP consumes far more tokens than Skill, but Skill is not a universal replacement. MCP remains valuable for standardization, real‑time data access, and precise control, while Skill excels at high‑frequency, repeatable tasks, fast responses, and privacy.
MCP Token Optimization Strategies
Code Execution Mode
Anthropic’s official solution: let the AI generate code that calls MCP tools, instead of the AI invoking tools directly. This reduces token usage from ~150 000 to ~2 000 (≈ 98.7 % saving).
Traditional flow: AI → call tool → receive result → call next tool (all intermediate results stay in the AI context).
Code‑execution flow: AI generates a script; the script chains tools; AI only sees the final result.
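A sketch of what the generated script looks like. The tool functions here are hypothetical stand-ins, not a real MCP API; the point is that large intermediate results stay inside the script instead of flowing back through the model's context:

```python
# Hypothetical MCP-backed tool wrappers (illustrative names, stubbed bodies).
def fetch_transcript(meeting_id: str) -> str:
    return "...tens of thousands of tokens of raw transcript..."

def summarize(text: str) -> str:
    return f"summary of {len(text)} chars"

def post_to_slack(channel: str, message: str) -> str:
    return f"posted to {channel}"

# Code-execution flow: the model emits this script once; only the final
# return value re-enters the model's context.
def generated_script() -> str:
    transcript = fetch_transcript("mtg-123")   # large intermediate result,
    summary = summarize(transcript)            # never shown to the model
    return post_to_slack("#standup", summary)  # only this comes back

print(generated_script())  # posted to #standup
```

In the traditional flow, the full transcript would round-trip through the context twice; here it never leaves the sandbox.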
Dynamic Toolsets
Speakeasy’s approach loads only tools relevant to the user’s intent, using semantic search and hierarchical categorization.
Input token reduction ≈ 96.7 %.
Total token reduction ≈ 96.4 %.
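A toy version of intent-based tool filtering. Real implementations (including Speakeasy's) use embedding similarity; plain word overlap stands in for semantic search here:

```python
# Hypothetical tool registry: name -> short description.
TOOLS = {
    "github_create_issue": "create or update issues in a github repository",
    "slack_send_message":  "send a message to a slack channel",
    "drive_search_files":  "search files in google drive",
    "calendar_add_event":  "add an event to the calendar",
}

def relevant_tools(intent: str, top_k: int = 2) -> list[str]:
    """Score tools by word overlap with the user's intent; keep only the top k.

    Only the selected tools' full definitions are loaded into the context,
    instead of all of them.
    """
    words = set(intent.lower().split())
    scored = sorted(
        TOOLS,
        key=lambda name: len(words & set(TOOLS[name].split())),
        reverse=True,
    )
    return scored[:top_k]

print(relevant_tools("send a slack message to the team"))
```

With hundreds of registered tools, loading two definitions instead of all of them is where the ~96% input-token reduction comes from.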
Six Practical Token‑Saving Tips
Prompt Compression
Remove redundant words, use concise phrasing, replace natural language with structured formats (JSON, abbreviations), and omit polite filler.
Before: "请按照以下非常重要的步骤操作:第一步、第二步、第三步" ("Please follow these very important steps below: step one, step two, step three")
After: "步骤:1. 2. 3." ("Steps: 1. 2. 3.")
Saved: 18 → 8 Tokens (55 % reduction)

Context Management
Regularly clear irrelevant dialogue history before new tasks.
Summarize long conversations into a ~100‑word abstract to replace raw text.
Use a sliding window to retain only the most recent N turns.
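A minimal sketch of the sliding-window idea, with a stubbed summarizer standing in for a real (cheap) model call:

```python
def compress_history(history: list[dict], window: int = 4) -> list[dict]:
    """Keep the most recent `window` turns; fold older turns into one summary."""
    if len(history) <= window:
        return history
    older, recent = history[:-window], history[-window:]
    # Stub: in practice this would be a cheap model call producing ~100 words.
    summary = f"[summary of {len(older)} earlier turns]"
    return [{"role": "system", "content": summary}] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compressed = compress_history(history)
print(len(compressed))  # 5: one summary message plus the last 4 turns
```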
Cache Mechanisms
For high‑frequency queries (e.g., stock price), employ multi‑level caching: exact‑match Redis cache → semantic cache → model call.
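A sketch of the three-level lookup. In-memory dicts stand in for Redis and the semantic layer; a real semantic cache would match on embedding similarity rather than word sets:

```python
exact_cache: dict[str, str] = {}
semantic_cache: dict[frozenset, str] = {}  # keyed by the query's word set

def call_model(query: str) -> str:
    return f"model answer for: {query}"  # stub for the expensive LLM call

def answer(query: str) -> str:
    # Level 1: exact string match (cheapest)
    if query in exact_cache:
        return exact_cache[query]
    # Level 2: "semantic" match: same words in any order (toy stand-in)
    key = frozenset(query.lower().split())
    if key in semantic_cache:
        return semantic_cache[key]
    # Level 3: pay for a model call, then populate both caches
    result = call_model(query)
    exact_cache[query] = result
    semantic_cache[key] = result
    return result

answer("current AAPL stock price")         # cache miss: one model call
print(answer("AAPL current stock price"))  # semantic hit: no model call
```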
User query → Redis cache → Semantic cache → Model

Model Routing
Simple tasks (Q&A, translation): use lightweight models such as GPT‑3.5 or Claude Haiku.
Complex tasks (reasoning, code generation): route to flagship models like GPT‑5 or Claude Opus.
Estimated savings: 70 % of requests can use lightweight models, cutting cost by ~80 %.
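A minimal router sketch. The prices and keyword heuristic are illustrative (production routers typically use a classifier, not keyword matching):

```python
# Illustrative per-million-token prices (¥), in the spirit of the table above.
MODELS = {
    "light":    {"name": "lightweight model", "in": 2.0,  "out": 8.0},
    "flagship": {"name": "flagship model",    "in": 30.0, "out": 60.0},
}
COMPLEX_HINTS = ("reason", "prove", "refactor", "generate code", "debug")

def route(task: str) -> dict:
    """Send tasks with complexity markers to the flagship tier, the rest light."""
    tier = "flagship" if any(h in task.lower() for h in COMPLEX_HINTS) else "light"
    return MODELS[tier]

def cost(task: str, in_tokens: int, out_tokens: int) -> float:
    m = route(task)
    return (in_tokens * m["in"] + out_tokens * m["out"]) / 1_000_000

print(cost("translate this paragraph", 1_000, 500))  # ¥0.006 on the light tier
print(cost("debug this stack trace", 1_000, 500))    # ¥0.06 on the flagship tier
```

At these illustrative prices the flagship call costs 10× more for the same tokens, which is why routing 70 % of traffic to the light tier moves the overall bill so much.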
Asynchronous & Batch Processing
Real‑time: 1 000 requests × 3 s ≈ 50 min, $50.
Batch (vLLM): generate 1 000 results in 8 min, $8 – analogous to bulk SQL inserts.
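The client side of this can be sketched with asyncio; the sleep stands in for model latency, and real batch backends (such as vLLM) additionally batch at the token level on the server:

```python
import asyncio

async def call_model(prompt: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for seconds of real model latency
    return f"result for {prompt}"

async def run_batch(prompts: list[str]) -> list[str]:
    # Fire all requests concurrently instead of awaiting them one by one.
    return await asyncio.gather(*(call_model(p) for p in prompts))

results = asyncio.run(run_batch([f"prompt {i}" for i in range(100)]))
print(len(results))  # 100 results in roughly one request's latency
```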
Monitoring & Alerts
Track token consumption per team/project.
Alert when hourly cost exceeds three times the average.
Cost attribution to identify “token whales”.
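A minimal tracker implementing these three points, as a stand-in for real metering tools (such as the token-usage-metrics project in the references):

```python
from collections import defaultdict

class TokenMonitor:
    def __init__(self, alert_factor: float = 3.0):
        self.by_project: dict[str, int] = defaultdict(int)
        self.hourly_costs: list[float] = []
        self.alert_factor = alert_factor

    def record(self, project: str, tokens: int) -> None:
        """Attribute token consumption to a team or project."""
        self.by_project[project] += tokens

    def check_hour(self, cost: float) -> bool:
        """Return True (alert) if this hour exceeds factor × historical average."""
        alert = bool(self.hourly_costs) and cost > self.alert_factor * (
            sum(self.hourly_costs) / len(self.hourly_costs)
        )
        self.hourly_costs.append(cost)
        return alert

    def whales(self, top_k: int = 3) -> list[str]:
        """Projects consuming the most tokens."""
        return sorted(self.by_project, key=self.by_project.get, reverse=True)[:top_k]

m = TokenMonitor()
for cost in (1.0, 1.2, 0.9):
    m.check_hour(cost)
print(m.check_hour(10.0))  # True: a spike above 3× the running average
```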
Future Trends of Token Economics
Context compression: models automatically condense history while preserving key information.
Specialized lightweight models: task‑specific models can improve token efficiency by up to 10×.
Edge computing: offload simple inference to local devices, reserving cloud LLM calls for complex workloads.
Conclusion
Understanding token mechanics is essential for controlling AI‑era costs. Both web developers and AI product managers must develop token awareness, balance MCP’s standardization benefits against Skill’s efficiency, and apply the outlined optimization tactics to keep budgets in check.
References
Atal Upadhyay, "MCP Token Problem: Building Efficient AI Agents with Skills", https://atalupadhyay.wordpress.com/2025/11/11/mcp-token-problem-building-efficient-ai-agents-with-skills/
Speakeasy, "Reducing MCP token usage by 100x", https://www.speakeasy.com/blog/how-we-reduced-token-usage-by-100x-dynamic-toolsets-v2
Intuition Labs, "Claude Skills vs. MCP: A Technical Comparison", https://intuitionlabs.ai/articles/claude-skills-vs-mcp
Anthropic, "Code execution with MCP: building more efficient AI agents", https://www.anthropic.com/engineering/code-execution-with-mcp
Skywork AI, "Claude Skills vs MCP vs General LLM Tools", https://skywork.ai/blog/ai-agent/claude-skills-vs-mcp-vs-llm-tools-comparison-2025/
CData, "Claude Skills vs MCP: Better Together", https://www.cdata.com/blog/claude-skills-vs-mcp-better-together-with-connect-ai
AgiFlow, "token-usage-metrics", https://github.com/AgiFlow/token-usage-metrics