Elegant Solution to Prompt Bloat: Semantic Retrieval of Tools for Efficient LLM Inference
The article explains how the limited context window of large language models causes prompt bloat when many tool descriptions are embedded, and presents the RAG‑MCP architecture that stores tool metadata in a vector database, uses semantic retrieval to select only the most relevant tools, dramatically shortens prompts, and improves inference speed and tool‑call accuracy.
Problem
Large language models (LLMs) have limited context windows. Embedding the full descriptions of many external tools managed via the Model Context Protocol (MCP) consumes a large fraction of the token budget, reduces the model's reasoning capacity, and makes tool selection more error‑prone. This phenomenon is called prompt bloat.
RAG‑MCP Architecture
RAG‑MCP combines Retrieval‑Augmented Generation (RAG) with MCP, here implemented on top of Amazon Bedrock Knowledge Base. Tool metadata are stored in a vector database; a semantic search retrieves only the most relevant tool specifications for a user query, which are then inserted into an augmented prompt sent to the LLM.
Core Concepts
Retrieval‑Augmented Generation (RAG) matches a user query against embeddings in a vector store and injects the top‑k most relevant passages as context, improving answer relevance and reducing token usage.
Model Context Protocol (MCP) standardises tool metadata (name, description, input schema) and separates the MCP Server (exposes tool list and executes calls) from the MCP Client (fetches metadata and forwards calls to the model).
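The top‑k matching that RAG relies on can be sketched in a few lines of plain Python. This is only an illustration, not the article's implementation: the three‑dimensional vectors are invented stand‑ins for real embeddings, and cosine_similarity and top_k are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k entry names whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query_vec, store[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings", invented for illustration; real ones have far more dimensions.
store = {
    "read_file": [0.9, 0.1, 0.0],
    "write_file": [0.1, 0.9, 0.0],
    "search_files": [0.3, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
print(top_k(query, store, k=2))
```

A managed vector store performs the same ranking with approximate nearest‑neighbour indexes rather than a full sort, but the contract is the same: a query vector goes in, the k closest entries come out.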
Tool Definition Schema
{
  "name": "string",          // unique identifier
  "description": "string",   // optional human‑readable description
  "inputSchema": {
    "type": "object",
    "properties": { ... }
  }
}
Filesystem MCP Server Example
=== All Available Tools (11 tools) ===
1. 🔧 get_file_info
Description: Retrieve detailed metadata about a file or directory.
Parameters: path
2. 🔧 write_file
Description: Create a new file or overwrite an existing file with new content.
Parameters: path, content
3. 🔧 move_file
Description: Move or rename files and directories; fails if destination exists.
Parameters: source, destination
4. 🔧 edit_file
Description: Line‑based edits to a text file; returns a git‑style diff.
Parameters: path, edits, dryRun
5. 🔧 read_multiple_files
Description: Read contents of multiple files simultaneously.
Parameters: paths
6. 🔧 create_directory
Description: Create a new directory or ensure it exists.
Parameters: path
7. 🔧 read_file
Description: Read the complete contents of a file.
Parameters: path
8. 🔧 directory_tree
Description: Recursive JSON view of files and directories.
Parameters: path
9. 🔧 list_allowed_directories
Description: List directories the server is allowed to access.
10. 🔧 search_files
Description: Recursively search for files matching a pattern (case‑insensitive).
Parameters: path, pattern, excludePatterns
11. 🔧 list_directory
Description: Detailed listing of files and directories in a path.
Parameters: path
Benefits of RAG‑MCP
Dynamic tool retrieval: Only the semantically closest tool specs are fetched, dramatically shrinking the prompt.
Context augmentation: Retrieved specs are inserted into an augmented prompt, giving the model precise execution guidance.
Scalability: Large, frequently changing tool sets can be maintained without manual prompt edits.
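As a rough sketch of the context augmentation step, the snippet below assembles a prompt from only the retrieved tool specifications. The function name build_augmented_prompt, the prompt wording, and the "string" type given to the path parameter are assumptions made for illustration; the article does not specify the template it uses.

```python
import json


def build_augmented_prompt(user_query, tools):
    """Combine the user query with only the retrieved tool specifications."""
    tool_block = "\n".join(json.dumps(tool) for tool in tools)
    return (
        "You can call the following tools when needed:\n"
        f"{tool_block}\n\n"
        f"User request: {user_query}"
    )


# One retrieved tool; the parameter type is assumed, not taken from the server.
retrieved = [
    {
        "name": "read_file",
        "description": "Read the complete contents of a file.",
        "inputSchema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
        },
    }
]
prompt = build_augmented_prompt("Show me what is in notes.txt", retrieved)
print(prompt)
```

Because only the retrieved entries are serialized, the prompt grows with k rather than with the total number of tools registered on the server.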
End‑to‑End Workflow (12 steps)
1. MCP Client reads all enabled MCP Server tools and writes them to a JSONL file.
2. Upload the JSONL file to an Amazon S3 bucket that serves as a Bedrock Knowledge Base data source.
3. Chunk the JSONL file with a custom chunker so that each tool becomes a separate chunk.
4. Generate embeddings for each chunk using an embedding model (e.g., Amazon Titan Text Embeddings V2).
5. Store the embeddings in a vector database (Amazon OpenSearch Serverless or Aurora pgvector).
6. When a user query arrives, encode it with the same embedding model to obtain a query vector.
7. Perform a similarity search in the vector store and retrieve the top‑k most relevant tool chunks.
8. Build an augmented prompt from the retrieved tool specifications.
9. Send the augmented prompt to the LLM; the model decides whether to invoke a tool.
10. If a tool is needed, the LLM triggers the call via MCP.
11. Return the tool execution result to the client.
12. Steps 6‑10 may repeat for multiple tool calls within a single request.
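The first and third steps of the workflow (writing tools as JSONL, then chunking so that each tool is one chunk) can be sketched as follows. This shows only the data shaping, under the assumption that one line equals one tool; it does not show how a custom chunker is wired into Bedrock Knowledge Base. The helper names tools_to_jsonl and chunk_per_tool and the "string" parameter types are hypothetical.

```python
import json


def tools_to_jsonl(tools):
    """Serialize each tool definition as one JSON object per line."""
    # json.dumps escapes embedded newlines, so every tool stays on one line.
    return "\n".join(json.dumps(tool) for tool in tools)


def chunk_per_tool(jsonl_text):
    """Split JSONL text so that every non-empty line becomes its own chunk."""
    return [line for line in jsonl_text.split("\n") if line.strip()]


# Two tools from the filesystem server; parameter types are assumed.
tools = [
    {
        "name": "read_file",
        "description": "Read the complete contents of a file.",
        "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}},
    },
    {
        "name": "create_directory",
        "description": "Create a new directory or ensure it exists.",
        "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}},
    },
]
jsonl_text = tools_to_jsonl(tools)
chunks = chunk_per_tool(jsonl_text)
print(len(chunks))
print(json.loads(chunks[1])["name"])
```

Keeping one tool per chunk means each embedding represents exactly one tool, so a similarity hit maps back to a complete, self‑contained specification.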
Implementation Snippets
Key Python classes illustrate how to interact with an MCP Server and Bedrock Knowledge Base.
class MCPClient:
    async def __aenter__(self):
        await self.connect()
        return self

    async def connect(self):
        # initialise stdio connection and MCP session
        ...

    async def list_tools(self):
        if not self._session:
            raise MCPToolError("MCP session not initialized")
        tools_response = await self._session.list_tools()
        return tools_response

The query_semantic method shows how to call Bedrock’s retrieve API, parse the JSON results, and return a QueryResult containing the matched tool specifications.
import json  # needed for json.loads below

def query_semantic(self, query_text: str, max_results: int = 10) -> QueryResult:
    # Ask the knowledge base for the chunks closest to the query text.
    response = self.bedrock_client.retrieve(
        knowledgeBaseId=self.knowledge_base_id,
        retrievalQuery={"text": query_text},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": max_results}
        }
    )
    results = []
    for result in response["retrievalResults"]:
        try:
            # Each chunk holds one tool definition serialized as JSON.
            content = json.loads(result["content"]["text"])
            results.append(content)
        except json.JSONDecodeError:
            # Skip chunks that are not valid JSON.
            continue
    return QueryResult(tools=results, total_results=len(results))

Configuration Tips
Vector store: Amazon OpenSearch Serverless for high‑throughput production; Aurora pgvector for cost‑sensitive workloads.
Embedding model: Amazon Titan Text Embeddings V2.
Retrieval top‑k: 5 to 10 results balance prompt length against relevance.
Enable hybrid search (semantic + keyword) for complex queries.
Monitor ingestion jobs with Amazon CloudWatch.
References
RAG‑MCP paper: https://arxiv.org/html/2505.03275v1
MCP specification: https://modelcontextprotocol.io/docs/concepts/architecture
Amazon Bedrock Knowledge Base documentation: https://aws.amazon.com/cn/bedrock/knowledge-bases
Retrieval‑Augmented Generation overview: https://aws.amazon.com/cn/what-is/retrieval-augmented-generation
MCP Python SDK: https://github.com/modelcontextprotocol/python-sdk
GitHub repository with full code: https://github.com/memoverflow/rag-mcp
Amazon Cloud Developers
Official technical community of Amazon Cloud. Shares practical AI/ML, big data, database, modern app development, IoT content, offers comprehensive learning resources, hosts regular developer events, and continuously empowers developers.
