How mcpx Cuts Token Overhead in MCP Tool Calls for Local LLMs

The article explains how mcpx reduces MCP tool definition tokens from tens of thousands to a few hundred by discovering tools at execution time, improving accuracy and speed for local large language models while preserving prompt cache integrity.

Traditional MCP integration loads every tool definition into the model’s context before the conversation begins; with several servers configured, the definitions alone can consume 40‑50k tokens, more than a 32k‑context local model can hold at all.

Anthropic’s internal tests show five MCP servers consuming 55k tokens (GitHub 26k, Slack 21k, Sentry 3k, Grafana 3k, Splunk 2k); adding Jira contributes another 17k, and larger setups easily exceed 100k tokens, with extreme cases reaching 134k.

mcpx: dynamic tool discovery

Developer cs50victor built mcpx on top of philschmid’s mcp‑cli. The core idea is to avoid pre‑loading tools at the API layer and instead discover them on demand at execution time.

mcpx                           # list all servers/tools
mcpx grep "*browser*"          # search by pattern
mcpx filesystem/read_file      # show a single tool’s schema
mcpx filesystem/read_file '{"path": "./README.md"}'  # invoke the tool

Using this approach, tool definitions shrink from roughly 47k tokens to about 400, cutting inference cost and latency. Anthropic’s internal benchmarks on a large tool library also report accuracy gains: Opus 4 rises from 49% to 74%, and Opus 4.5 from 79.5% to 88.1%.

Architecture

API Layer:   tools: [bash]           ← static, always cached
Execution:   bash → mcpx discover    ← dynamic, on‑demand

The model only needs the bash tool; additional tools are discovered via shell commands, matching Anthropic’s “tool‑search tool” concept while being more suitable for local deployments.
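
To make this concrete, here is a minimal sketch of that static API layer against a local, OpenAI‑compatible endpoint (Ollama is assumed here; the model name and prompt are placeholders, not from the article). The only tool ever registered is a generic bash tool whose description points the model at mcpx:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:14b",
    "messages": [{"role": "user", "content": "Read ./README.md for me"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "bash",
        "description": "Run a shell command. Use mcpx to discover and invoke MCP tools.",
        "parameters": {
          "type": "object",
          "properties": {"command": {"type": "string"}},
          "required": ["command"]
        }
      }
    }]
  }'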

Because this static prefix never changes, the prompt cache remains intact: adding new MCP servers does not invalidate it, avoiding the extra cost and latency of re‑caching.

mcpx also provides a daemon mode that keeps stateful connections alive (browser sessions, database handles) and a .gitignore‑style global tool‑disable list that mitigates errors from similarly named tools.
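
The article does not show the disable list’s syntax; purely as an illustration of the .gitignore‑style idea, such a list might contain patterns like these (the file name and pattern grammar are assumptions, not mcpx’s documented format):

github/create_issue    # hide one specific tool
slack/*                # hide every tool on a server
*search*               # hide look‑alike tools across servers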

Example configuration

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "."]
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"]
    }
  }
}
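
With this configuration in place, the commands shown earlier resolve against both servers; the comments sketch plausible results rather than verbatim output:

mcpx                                   # lists filesystem/* and github/* tools
mcpx grep "*file*"                     # e.g. filesystem/read_file, filesystem/write_file
mcpx github/create_issue               # print that tool’s input schema
mcpx filesystem/read_file '{"path": "./README.md"}'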

Installation

brew tap cs50victor/mcpx && brew install mcpx
# or
curl -fsSL https://raw.githubusercontent.com/cs50victor/mcpx/dev/install.sh | bash

The approach is especially well suited to local deployments (e.g., Ollama or other self‑hosted models): Anthropic’s official tool‑search feature requires its API, whereas mcpx achieves comparable functionality through plain shell commands.

The trade‑off is that each tool discovery incurs an extra inference call, but Anthropic notes that the token savings and accuracy gains outweigh the additional latency for context‑constrained local models.
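
A rough back‑of‑the‑envelope comparison using the article’s own figures (illustrative, not a benchmark):

# Static loading:  ~47,000 definition tokens prepended to every request
#                  (larger than a 32k‑context model’s entire window)
# mcpx:            ~400 tokens up front + one extra round trip per discovery
# Net effect:      (47,000 − 400) / 47,000 ≈ 99% of standing overhead reclaimed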

Planned work includes support for the MCP registry. Source code: https://github.com/cs50victor/mcpx

[Figure: MCP tool call comparison]
[Figure: mcpx architecture diagram]
Tags: MCP, Tool Calling, Local LLM, Anthropic, Shell Integration, Token Optimization, mcpx