How to Slash AI Token Costs: MCP vs Skill and 6 Proven Optimization Techniques

This article explains the fundamental differences between web session tokens and AI tokens, compares MCP and Skill token consumption, presents pricing formulas for major models, and offers practical strategies—including prompt compression, context management, and dynamic toolsets—to dramatically reduce AI token expenses.


Web Session Token

In web applications, a token acts as a temporary identity credential generated after login; it is attached to subsequent requests so the server can recognize the user.

Generated once, reused many times; creation costs resources, later use is near‑zero cost.

Small size, usually only dozens to a few hundred bytes.

Security‑oriented, used solely for authentication and authorization, unrelated to AI computation.

AI Token

In large language models a token is the smallest unit of text that the model processes – essentially the model’s "brain cell". Each token triggers a computation step, so token count directly reflects compute cost.

Dimension            | Web Session Token   | AI Token
---------------------|---------------------|----------------------
Nature               | Identity credential | Compute unit
Billing              | Free or negligible  | Pay‑per‑token
Quantity per request | 1                   | Hundreds to millions
Lifecycle            | Hours to days       | Fresh per request

AI Token Calculation

Token usage is bidirectional: both the prompt (input) and the model’s response (output) consume tokens, and output tokens are often priced higher than input tokens (e.g., Claude 3.5 Sonnet output costs 5× the input price).

Tokenization

English: 1 token ≈ 0.75 words ("ChatGPT" may be 1‑2 tokens).

Chinese: 1 character ≈ 1‑2 tokens (technical docs usually assume 1.5 tokens per character).

Code: symbols and indentation count; a Python snippet can be 1.5× the raw character count in tokens.

English: "Hello world" = 2 Tokens
Chinese: "你好世界" = 4‑6 Tokens
Code: "def hello():" = 4‑5 Tokens

Pricing Formula

Typical cost calculation (e.g., DeepSeek, Tencent Yuanbao, Alibaba Cloud Bailian):

Total Cost = (input_token_count × input_price + output_token_count × output_price) / 1,000,000

Model pricing examples (price per million tokens):

Model              | Input Price (¥) | Output Price (¥)
-------------------|-----------------|------------------
Tongyi Qianwen‑Max | 2.4             | 9.6
DeepSeek‑V3        | 2.0             | 8.0
GPT‑4 level        | 30+             | 60+
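
Plugging the table's prices into the formula gives a minimal cost calculator; the request sizes in the example are made up for illustration.

# Prices in ¥ per million tokens, from the table above
PRICES = {
    "qwen-max":    {"input": 2.4, "output": 9.6},
    "deepseek-v3": {"input": 2.0, "output": 8.0},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 10,000-token prompt with a 2,000-token answer on DeepSeek-V3:
print(f"¥{request_cost('deepseek-v3', 10_000, 2_000):.4f}")  # ¥0.0360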

Skill vs MCP

MCP (Anthropic’s open protocol) enables AI agents to call external tools, but it loads the full definition of every tool into the context at startup, causing heavy token consumption. Skill, also from Anthropic, is a lightweight, on‑demand capability package that loads only a short description initially and fetches full definitions only when used.

[Figure: Skill vs MCP token comparison]

MCP Token Pitfalls

When MCP starts, all tool definitions (name, description, JSON schema, return format, error handling) are loaded at once.

Example: 5 standard MCP tools (GitHub, Slack, Google Drive, etc.) consume ≈ 97 000 tokens, which is 48 % of Claude 3.5 Sonnet’s 200 k context window.

Intermediate results are paid for twice, since each tool output re-enters the context as input to the next step; a 2‑hour meeting transcript alone can add ~50 k tokens.

Skill Advantages

Startup loads only a brief description (~12 tokens).

Full instruction and code are loaded on demand (~300 tokens).

Overall token‑saving rate ≈ 99.6 %.
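
The mechanism behind these numbers can be sketched in a few lines. The file layout and token figures below are illustrative assumptions, not Anthropic's actual implementation; the point is that only one-line descriptions enter the context at startup, and full instructions are read only on use.

from pathlib import Path

SKILLS_DIR = Path("skills")  # assumed layout: skills/<name>/SKILL.md

def startup_context() -> str:
    """Load only each skill's one-line description (~12 tokens apiece)."""
    lines = []
    for skill in sorted(SKILLS_DIR.iterdir()):
        description = (skill / "SKILL.md").read_text().splitlines()[0]
        lines.append(f"{skill.name}: {description}")
    return "\n".join(lines)

def load_skill(name: str) -> str:
    """Fetch the full instructions (~300 tokens) only when the skill is invoked."""
    return (SKILLS_DIR / name / "SKILL.md").read_text()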

Which Consumes More Tokens?

Conclusion:

MCP consumes far more tokens than Skill, but Skill is not a universal replacement.

MCP remains valuable for standardization, real‑time data access, and precise control, while Skill excels at high‑frequency, repeatable tasks, fast response, and privacy.

MCP Token Optimization Strategies

Code Execution Mode

Anthropic’s official solution: let the AI generate code that calls MCP tools, instead of the AI invoking tools directly. This reduces token usage from ~150 000 to ~2 000 (≈ 98.7 % saving).

Traditional flow: AI → call tool → receive result → call next tool (all intermediate results stay in the AI context).

Code‑execution flow: AI generates a script; the script chains tools; AI only sees the final result.
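
As a concrete sketch of the second flow: search_github and post_to_slack below are hypothetical stand-ins for generated wrappers around MCP tool calls. The agent emits a script like this, the sandbox runs it, and only the final printed line returns to the model's context.

# Script generated by the model and run in a sandbox, not in the context window.
from mcp_tools import search_github, post_to_slack  # hypothetical wrapper module

issues = search_github(repo="acme/backend", label="bug")   # large intermediate result
open_issues = [i for i in issues if i["state"] == "open"]  # filtered in code, not by the LLM

post_to_slack(channel="#eng", text=f"{len(open_issues)} open bugs")
print(f"Reported {len(open_issues)} open bugs")  # only this line returns to the model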

Dynamic Toolsets

Speakeasy’s approach loads only tools relevant to the user’s intent, using semantic search and hierarchical categorization.

Input token reduction ≈ 96.7 %.

Total token reduction ≈ 96.4 %.
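
A minimal version of the idea (not Speakeasy's actual code): embed each tool description once, then expose only the top-k tools closest to the user's request. The embed function below is a deterministic placeholder for a real embedding client.

import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in a real embedding model client here."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(64)

TOOLS = {  # tool name -> one-line description
    "github_search": "Search issues and pull requests on GitHub",
    "slack_post":    "Post a message to a Slack channel",
    "drive_fetch":   "Download a file from Google Drive",
}
TOOL_VECS = {name: embed(desc) for name, desc in TOOLS.items()}  # embedded once at startup

def select_tools(user_query: str, k: int = 2) -> list[str]:
    """Return the k most relevant tools; only their full definitions
    are then loaded into the model's context."""
    q = embed(user_query)
    def score(name: str) -> float:
        v = TOOL_VECS[name]
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(TOOL_VECS, key=score, reverse=True)[:k]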

Six Practical Token‑Saving Tips

Prompt Compression

Remove redundant words, use concise phrasing, replace natural language with structured formats (JSON, abbreviations), and omit polite filler.

Before: "请按照以下非常重要的步骤操作:第一步、第二步、第三步"
After:  "步骤:1. 2. 3."
Saved: 18 → 8 Tokens (55 % reduction)

Context Management

Regularly clear irrelevant dialogue history before new tasks.

Summarize long conversations into a ~100‑word abstract to replace raw text.

Use a sliding window to retain only the most recent N turns.
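
A sliding window takes only a few lines, and summarization layers on top by replacing the dropped turns with an abstract. The summarize function below is a placeholder for a cheap-model call.

def summarize(messages: list[dict]) -> str:
    """Placeholder: in practice, ask a lightweight model for a ~100-word abstract."""
    return f"{len(messages)} earlier turns omitted"

def trim_context(history: list[dict], max_turns: int = 6) -> list[dict]:
    """Keep the system prompt plus the most recent max_turns messages."""
    system = [m for m in history if m["role"] == "system"]
    rest   = [m for m in history if m["role"] != "system"]
    dropped, kept = rest[:-max_turns], rest[-max_turns:]
    if dropped:
        kept = [{"role": "system", "content": "Earlier context: " + summarize(dropped)}] + kept
    return system + kept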

Cache Mechanisms

For high‑frequency queries (e.g., stock price), employ multi‑level caching: exact‑match Redis cache → semantic cache → model call.

User query → Redis cache → Semantic cache → Model
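
A sketch of the three levels, assuming a local Redis for exact matches; the semantic lookup and the model call are placeholders.

import hashlib
import redis  # pip install redis

r = redis.Redis()

def semantic_lookup(query: str) -> str | None:
    return None  # placeholder: embed the query and search a vector index

def call_model(query: str) -> str:
    return "model answer"  # placeholder for the actual LLM API call

def cached_answer(query: str) -> str:
    key = "llm:" + hashlib.sha256(query.encode()).hexdigest()
    if (hit := r.get(key)) is not None:              # level 1: exact match, zero tokens
        return hit.decode()
    if (hit := semantic_lookup(query)) is not None:  # level 2: near-duplicate query
        return hit
    answer = call_model(query)                       # level 3: pay for a real model call
    r.setex(key, 3600, answer)                       # cache the result for an hour
    return answer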

Model Routing

Simple tasks (Q&A, translation): use lightweight models such as GPT‑3.5 or Claude Haiku.

Complex tasks (reasoning, code generation): route to flagship models like GPT‑5 or Claude Opus.

Estimated savings: 70 % of requests can use lightweight models, cutting cost by ~80 %.
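
A routing layer can start as a simple heuristic in front of the API client; the keywords, length threshold, and model identifiers below are placeholder choices, not a recommendation.

SIMPLE_KEYWORDS = ("translate", "summarize", "what is", "define")

def pick_model(prompt: str) -> str:
    """Route cheap tasks to a lightweight model, everything else to a flagship."""
    p = prompt.lower()
    if len(prompt) < 500 and any(k in p for k in SIMPLE_KEYWORDS):
        return "lightweight-model"   # e.g., GPT-3.5 / Claude Haiku tier
    return "flagship-model"          # e.g., reasoning / code-generation tier

print(pick_model("Translate 'hello world' into French"))  # -> lightweight-model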

Asynchronous & Batch Processing

Real‑time: 1 000 requests × 3 s ≈ 50 min, $50.

Batch (vLLM): generate 1 000 results in 8 min, $8 – analogous to bulk SQL inserts.
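
For offline workloads, vLLM's batch interface makes this concrete; the model name below is illustrative and the prompts are synthetic.

# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # any model you can serve locally
params = SamplingParams(temperature=0.2, max_tokens=128)

prompts = [f"Summarize support ticket #{i} in one sentence." for i in range(1000)]

# One batched call: vLLM schedules all 1,000 prompts together
# instead of 1,000 sequential real-time requests.
outputs = llm.generate(prompts, params)
for out in outputs[:3]:
    print(out.outputs[0].text)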

Monitoring & Alerts

Track token consumption per team/project.

Alert when hourly cost exceeds three times the average.

Cost attribution to identify “token whales”.
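
A minimal in-process version of these checks (a real deployment would ship the numbers to a metrics system or billing table instead):

from collections import defaultdict

usage: dict[str, float] = defaultdict(float)  # project -> spend in the current hour
hourly_history: list[float] = []              # totals for past hours

def record(project: str, cost: float) -> None:
    usage[project] += cost

def check_alerts() -> None:
    """Alert when the current hour exceeds 3x the historical average,
    and rank projects to spot the 'token whales'."""
    current = sum(usage.values())
    if hourly_history:
        avg = sum(hourly_history) / len(hourly_history)
        if current > 3 * avg:
            print(f"ALERT: hourly spend ¥{current:.2f} > 3x average ¥{avg:.2f}")
    for project, cost in sorted(usage.items(), key=lambda kv: -kv[1]):
        print(f"{project}: ¥{cost:.2f}")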

Future Trends of Token Economics

Context compression: models automatically condense history while preserving key information.

Specialized lightweight models: task‑specific models can improve token efficiency by up to 10×.

Edge computing: offload simple inference to local devices, reserving cloud LLM calls for complex workloads.

Conclusion

Understanding token mechanics is essential for controlling AI‑era costs. Both web developers and AI product managers must develop token awareness, balance MCP’s standardization benefits against Skill’s efficiency, and apply the outlined optimization tactics to keep budgets in check.

References

Atal Upadhyay, "MCP Token Problem: Building Efficient AI Agents with Skills", https://atalupadhyay.wordpress.com/2025/11/11/mcp-token-problem-building-efficient-ai-agents-with-skills/

Speakeasy, "Reducing MCP token usage by 100x", https://www.speakeasy.com/blog/how-we-reduced-token-usage-by-100x-dynamic-toolsets-v2

Intuition Labs, "Claude Skills vs. MCP: A Technical Comparison", https://intuitionlabs.ai/articles/claude-skills-vs-mcp

Anthropic, "Code execution with MCP: building more efficient AI agents", https://www.anthropic.com/engineering/code-execution-with-mcp

Skywork AI, "Claude Skills vs MCP vs General LLM Tools", https://skywork.ai/blog/ai-agent/claude-skills-vs-mcp-vs-llm-tools-comparison-2025/

CData, "Claude Skills vs MCP: Better Together", https://www.cdata.com/blog/claude-skills-vs-mcp-better-together-with-connect-ai

AgiFlow, "token-usage-metrics", https://github.com/AgiFlow/token-usage-metrics
