Practical Agent Performance Tuning: Slash Latency 75%, Cut Token Costs 71%, Boost Throughput 217%

The article walks through a systematic performance map of LangChain agents and demonstrates concrete latency, token-usage, and concurrency optimizations (streaming responses, Redis caching, model routing, prompt trimming, context summarisation, dynamic tool selection, parallel graph nodes, and batch processing), reporting real-world gains of up to 75% lower latency, 71% fewer tokens, and a 217% throughput increase.

James' Growth Diary

Hello, I'm James. In the previous post we used LangSmith to X‑ray every step of an agent and locate slow or costly stages. This article moves from diagnosis to action, applying three "surgical knives" to latency, token consumption, and concurrency.

01 Build a Performance Map

A typical LangGraph agent call chain looks like this (times are approximate):

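User request
  │
  ▼
[Node scheduling] ~5 ms
  │
  ▼
[Tool-description injection] ~50-300 ms (token counting)
  │
  ▼
[LLM inference] ~500 ms-3 s ← largest share
  │
  ▼
[Tool execution] ~200 ms-2 s ← second-largest share (external APIs)
  │
  ▼
[Write results back to State] ~5 ms
  │
  ▼
Response returned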

LLM inference plus tool execution account for over 90% of total latency, so the optimisation goal is to reduce LLM calls, shrink input tokens per call, and avoid serial tool waits.
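The 90% figure can be sanity-checked by summing the stage timings from the map. The sketch below is illustrative arithmetic only: the `Stage` type, the stage names, and the `dominantShare` helper are hypothetical, and the numbers are simply the low and high ends of the ranges in the diagram.

```typescript
// Illustrative arithmetic only: stage names and the helper are hypothetical.
type Stage = { name: string; ms: number };

// Fraction of total latency spent in the named stages.
function dominantShare(stages: Stage[], dominant: string[]): number {
  const total = stages.reduce((sum, s) => sum + s.ms, 0);
  const hot = stages
    .filter(s => dominant.includes(s.name))
    .reduce((sum, s) => sum + s.ms, 0);
  return total === 0 ? 0 : hot / total;
}

// Low and high ends of the ranges shown in the performance map.
const bestCase: Stage[] = [
  { name: "schedule", ms: 5 },
  { name: "toolDescriptions", ms: 50 },
  { name: "llm", ms: 500 },
  { name: "tools", ms: 200 },
  { name: "writeState", ms: 5 },
];
const worstCase: Stage[] = [
  { name: "schedule", ms: 5 },
  { name: "toolDescriptions", ms: 300 },
  { name: "llm", ms: 3000 },
  { name: "tools", ms: 2000 },
  { name: "writeState", ms: 5 },
];

const bestShare = dominantShare(bestCase, ["llm", "tools"]);   // 700 / 760
const worstShare = dominantShare(worstCase, ["llm", "tools"]); // 5000 / 5310
console.log(bestShare.toFixed(3), worstShare.toFixed(3));      // 0.921 0.942
```

At both ends of the ranges, LLM inference and tool execution together stay above 90% of the total, which is why the other stages are not worth optimising first.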

02 Latency Optimisation – Make the User Feel Faster

Two angles: lower real latency and improve perceived latency. The most direct perceived‑latency trick is streaming output, which changes the response pattern from "wait for the whole answer" to "send tokens as they are generated".
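On the receiving side, the client has to reassemble those token events as they arrive. The following is a minimal sketch, assuming the server frames each token as a `data: {"token": "..."}` line followed by a blank line; `SseTokenParser` is a hypothetical helper that handles only single-line `data:` events, not the full Server-Sent Events specification.

```typescript
// Minimal sketch: handles only single-line `data:` events, not the full SSE spec.
class SseTokenParser {
  private buffer = "";

  // Feed a raw network chunk; returns the tokens of every complete event in it.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const events = this.buffer.split("\n\n");
    this.buffer = events.pop() ?? ""; // keep the trailing partial event for later
    const tokens: string[] = [];
    for (const event of events) {
      for (const line of event.split("\n")) {
        if (!line.startsWith("data:")) continue;
        const payload = JSON.parse(line.slice(5).trim()) as { token?: unknown };
        if (typeof payload.token === "string") tokens.push(payload.token);
      }
    }
    return tokens;
  }
}

// A network chunk may end in the middle of an event; the parser buffers the remainder.
const parser = new SseTokenParser();
const first = parser.push('data: {"token":"Hel"}\n\ndata: {"tok');
const second = parser.push('en":"lo"}\n\n');
console.log([...first, ...second].join("")); // Hello
```

Buffering matters because chunk boundaries on the wire do not line up with event boundaries, so the UI should only render events that are complete.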

The code below combines streaming with a Redis cache (an identical query is served in ~5 ms instead of ~1500 ms) and model routing (simple tasks go to gpt-4o-mini, roughly 15× cheaper and 2-3× faster):

import express from "express";
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";
import { RedisCache } from "@langchain/community/caches/ioredis";
import { Redis } from "ioredis";

const app = express();

// ① Redis cache: an identical prompt is answered from Redis in ~5 ms.
// The cache is handed to each model instance through the `cache` option.
// Assumption to verify: cache hits apply to invoke(); stream() may bypass the cache depending on the LangChain version.
const cache = new RedisCache(new Redis({ host: "localhost", port: 6379 }), { ttl: 3600 });

// ② Model routing: 80% of requests are simple → use the mini model (15× cheaper, 2-3× faster)
type Complexity = "simple" | "medium" | "complex";
const modelMap: Record<Complexity, string> = {
  simple: "gpt-4o-mini", // cheap & fast
  medium: "gpt-4o",
  complex: "o1-preview",
};
function routeModel(task: string): ChatOpenAI {
  // Keyword heuristics target Chinese queries: 推理 = reasoning, 设计 = design, 分析 = analysis, 代码 = code.
  const isComplex = task.includes("推理") || task.includes("设计") || task.length > 500;
  const isSimple = task.length < 50 && !task.includes("分析") && !task.includes("代码");
  const level: Complexity = isComplex ? "complex" : isSimple ? "simple" : "medium";
  return new ChatOpenAI({ model: modelMap[level], streaming: true, cache });
}

// ③ Streaming endpoint: the first token appears within ~200 ms instead of after a 3 s wait
app.get("/stream", async (req, res) => {
  const q = String(req.query.q ?? "");
  res.setHeader("Content-Type", "text/event-stream");
  const llm = routeModel(q);
  const stream = await llm.stream([new HumanMessage(q)]);
  for await (const chunk of stream) {
    // Each SSE event is a `data:` line terminated by a blank line.
    res.write(`data: ${JSON.stringify({ token: chunk.content })}\n\n`);
  }
  res.end();
});

// Note: the cache key must be a stable hash of prompt + model + temperature; dynamic parts (timestamps, IDs) destroy the hit rate.
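To make the cache-key note concrete, here is a minimal sketch of a stable key. LangChain computes its own cache keys internally, so `stableCacheKey` and `CacheKeyInput` are hypothetical names used only to show why volatile fields have to be stripped before hashing; the timestamp pattern is an assumption about what the prompts contain.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: illustrates key stability, not LangChain's internal key format.
interface CacheKeyInput {
  prompt: string;
  model: string;
  temperature: number;
}

function stableCacheKey({ prompt, model, temperature }: CacheKeyInput): string {
  const normalized = prompt
    .replace(/\d{4}-\d{2}-\d{2}T[\d:.]+Z?/g, "<ts>") // strip ISO-8601 timestamps
    .replace(/\s+/g, " ")                             // collapse whitespace
    .trim();
  return createHash("sha256")
    .update(JSON.stringify([model, temperature, normalized]))
    .digest("hex");
}

const a = stableCacheKey({ prompt: "Now: 2024-05-01T10:00:00Z. Refund policy?", model: "gpt-4o-mini", temperature: 0 });
const b = stableCacheKey({ prompt: "Now: 2024-05-02T11:30:00Z.  Refund policy?", model: "gpt-4o-mini", temperature: 0 });
const c = stableCacheKey({ prompt: "Now: 2024-05-01T10:00:00Z. Refund policy?", model: "gpt-4o", temperature: 0 });
console.log(a === b, a === c); // true false
```

Two requests that differ only in a timestamp map to the same key, while a different model produces a different key, so a cached answer is never served across models.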

In a typical customer‑service agent (80% simple Q&A) the average latency dropped from 1.8 s to 0.6 s and cache hit rate reached 37%.

03 Token‑Usage Optimisation – Trim the Bill

A single agent call consumes tokens across several parts (typical ranges, with optimisation potential in parentheses):

System Prompt: 500‑2000 tokens (high optimisation potential)

History: 1000‑5000 tokens (high)

Tool description: 1000‑3000 tokens (high)

RAG results: 500‑2000 tokens (medium)

User input: 100‑500 tokens (low)

Model output: 300‑2000 tokens (medium)

The three biggest killers are system prompt, history, and tool description. The following three‑step "token surgery" is applied:

import { BaseMessage, AIMessage } from "@langchain/core/messages";
import { StructuredTool } from "@langchain/core/tools";
import { ChatOpenAI } from "@langchain/openai";

// ① System-prompt trimming: one dense line instead of paragraphs of instructions
const compactPrompt = `Professional customer-service agent. Be accurate, friendly, honest, and law-abiding. If unsure, say you don't know.`;

// ② Context trimming with summarisation (uses the cheap mini model)
async function trimMessages(messages: BaseMessage[], keepFirst = 2, keepLast = 6): Promise<BaseMessage[]> {
  if (messages.length <= keepFirst + keepLast) return messages;
  const head = messages.slice(0, keepFirst);
  const tail = messages.slice(-keepLast);
  const middle = messages.slice(keepFirst, -keepLast);
  if (!middle.length) return [...head, ...tail];
  const summarizer = new ChatOpenAI({ model: "gpt-4o-mini", maxTokens: 100 });
  const raw = middle.map(m => `${m._getType()}: ${m.content}`).join("\n");
  const summary = await summarizer.invoke(`Summarise in 2-3 sentences:\n${raw}`);
  // The summary replaces the middle of the conversation; head and tail stay verbatim.
  return [...head, new AIMessage(`[Conversation summary] ${summary.content}`), ...tail];
}

// ③ Dynamic tool injection: only expose the three tools the LLM actually needs
async function selectRelevantTools(query: string, allTools: StructuredTool[]): Promise<StructuredTool[]> {
  const router = new ChatOpenAI({ model: "gpt-4o-mini", maxTokens: 60 });
  const names = allTools.map(t => t.name).join(", ");
  const result = await router.invoke(`Question: ${query}\nTools: ${names}\nOutput the 3 most relevant tool names (comma-separated):`);
  const selected = String(result.content).split(",").map(s => s.trim());
  const picked = allTools.filter(t => selected.includes(t.name));
  // Fall back to the full set if the router output matches no known tool name.
  return picked.length ? picked : allTools;
}

Applying these three levers reduces token consumption per request from ~8000 to ~1800 tokens (≈77% saving).
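Dynamic tool injection is only as reliable as the parsing of the router model's answer: the model may change casing, use full-width commas, or name a tool that does not exist. The sketch below shows one tolerant way to parse such a list; `parseToolNames` and the tool names are hypothetical and not part of LangChain.

```typescript
// Hypothetical helper: tolerant parsing of a router model's tool-name list.
function parseToolNames(raw: string, known: string[], max = 3): string[] {
  const byLower = new Map(known.map(n => [n.toLowerCase(), n] as const));
  const picked: string[] = [];
  // Accept ASCII commas, full-width commas, enumeration commas and newlines.
  for (const part of raw.split(/[,,、\n]/)) {
    const name = byLower.get(part.trim().toLowerCase());
    if (name && !picked.includes(name)) picked.push(name);
    if (picked.length === max) break;
  }
  // Fall back to the full tool set rather than leaving the agent with no tools.
  return picked.length > 0 ? picked : known;
}

const known = ["search_orders", "refund", "track_shipment", "faq_lookup"];
const parsed = parseToolNames("Refund, track_shipment、 unknown_tool", known);
const fallback = parseToolNames("none of these", known);
console.log(parsed, fallback.length); // [ 'refund', 'track_shipment' ] 4
```

Unknown names are dropped silently, and an unusable answer degrades to the untrimmed tool list, which costs tokens but keeps the agent functional.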

04 Concurrency Optimisation – Stop Tools from Queuing

Many agents call tools sequentially (A → B → C). When calls are independent they can run in parallel, cutting total time dramatically.

import { StateGraph, Annotation, START, END } from "@langchain/langgraph";
import { ChatOpenAI } from "@langchain/openai";

// weatherTool, newsTool and stockTool are assumed to be defined elsewhere and to return strings.

const AgentState = Annotation.Root({
  query: Annotation<string>(),
  weatherResult: Annotation<string>(),
  newsResult: Annotation<string>(),
  stockResult: Annotation<string>(),
  finalAnswer: Annotation<string>(),
});

const workflow = new StateGraph(AgentState)
  .addNode("fetchWeather", async s => ({ weatherResult: await weatherTool.invoke(s.query) }))
  .addNode("fetchNews", async s => ({ newsResult: await newsTool.invoke(s.query) }))
  .addNode("fetchStock", async s => ({ stockResult: await stockTool.invoke(s.query) }))
  .addNode("merge", async state => {
    const llm = new ChatOpenAI({ model: "gpt-4o" });
    const answer = await llm.invoke(`Combine the information: weather=${state.weatherResult}; news=${state.newsResult}; stocks=${state.stockResult}; question=${state.query}`);
    return { finalAnswer: String(answer.content) };
  })
  // Three edges out of START fan out: the fetch nodes run in the same step, in parallel.
  .addEdge(START, "fetchWeather")
  .addEdge(START, "fetchNews")
  .addEdge(START, "fetchStock")
  // All three feed into merge, which runs once their results are in the state.
  .addEdge("fetchWeather", "merge")
  .addEdge("fetchNews", "merge")
  .addEdge("fetchStock", "merge")
  .addEdge("merge", END);

const graph = workflow.compile();

// Serial 2100 ms → Parallel 800 ms, 62% faster
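The serial and parallel figures follow from a simple rule: serial time is the sum of the tool latencies, parallel time is the slowest one. In the sketch below the three per-tool latencies are assumed values, chosen only so that they add up to the 2100 ms serial figure; they are not measurements from the article.

```typescript
// Illustrative arithmetic: the per-tool latencies are assumed values that sum to
// the 2100 ms serial figure; they are not measurements.
const toolLatencyMs = { fetchWeather: 800, fetchNews: 700, fetchStock: 600 };

const durations = Object.values(toolLatencyMs);
const serialMs = durations.reduce((sum, ms) => sum + ms, 0); // tools run one after another
const parallelMs = Math.max(...durations);                   // tools run side by side
const savedFraction = (serialMs - parallelMs) / serialMs;

console.log(serialMs, parallelMs, `${Math.round(savedFraction * 100)}%`); // 2100 800 62%
```

The parallel time is bounded by the slowest tool, so after fanning out the next improvement has to come from that single call (a timeout, a cache, or a faster provider).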

Batch processing with exponential back-off further guards against cascading rate-limit (429) errors:

async function batchProcess<T>(items: T[], processor: (item: T) => Promise<string>, batchSize = 5, delayMs = 200): Promise<string[]> {
  const results: string[] = [];
  const sleep = (ms: number) => new Promise(r => setTimeout(r, ms));
  async function withRetry(fn: () => Promise<string>, maxRetries = 3): Promise<string> {
    for (let i = 0; i < maxRetries; i++) {
      try { return await fn(); }
      catch (e: any) {
        if (e.status === 429 && i < maxRetries - 1) { await sleep(Math.pow(2, i) * 1000); continue; }
        throw e;
      }
    }
    throw new Error("Max retries exceeded");
  }
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    const batchResults = await Promise.all(batch.map(item => withRetry(() => processor(item))));
    results.push(...batchResults);
    if (i + batchSize < items.length) await sleep(delayMs);
  }
  return results;
}

Processing 50 documents drops from 30 s (serial) to 6 s (batch size = 5), a 5× speedup.
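A rough wall-clock model helps when choosing `batchSize` and `delayMs`. The sketch below assumes every document takes the same 600 ms (30 s divided by 50 documents) and ignores retries; `estimateBatchMs` and `minDelayForQuotaMs` are hypothetical helpers, and the 300 RPM quota is an example value.

```typescript
// Rough wall-clock model for a batchProcess-style loop (ignores retries).
function estimateBatchMs(items: number, perItemMs: number, batchSize: number, delayMs: number): number {
  const batches = Math.ceil(items / batchSize);
  // Items inside a batch run concurrently, so a batch costs one item's latency.
  return batches * perItemMs + Math.max(batches - 1, 0) * delayMs;
}

// Smallest pause between batches that keeps the request rate under an RPM quota.
function minDelayForQuotaMs(batchSize: number, requestsPerMinute: number, perItemMs: number): number {
  const windowMs = (batchSize * 60_000) / requestsPerMinute; // time budget per batch
  return Math.max(Math.ceil(windowMs - perItemMs), 0);
}

console.log(estimateBatchMs(50, 600, 5, 0));   // 6000: 10 batches × 600 ms
console.log(estimateBatchMs(50, 600, 5, 200)); // 7800: plus 9 pauses of 200 ms
console.log(minDelayForQuotaMs(5, 300, 600));  // 400: 300 RPM allows one batch of 5 per second
```

Under these assumptions the 6 s figure corresponds to no pause between batches; with the default `delayMs = 200` the same run takes closer to 7.8 s, which is still roughly a 4× speedup over serial processing.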

05 Quantified Gains Across All Three Dimensions

Real-world production numbers (AI customer-service agent, 50k MAU):

Average response time: 3.2 s → 0.8 s (‑75%)

Monthly token consumption: 180 M → 52 M (‑71%)

Monthly API cost: $2,700 → $780 (‑71%)

Throughput: 12 req/s → 38 req/s (+217%)

Cache hit rate: 0% → 37%
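The percentages in this list can be reproduced from the before and after values. The small `pctChange` helper below is a hypothetical name used only for this check.

```typescript
// Percent change from `before` to `after`, rounded to the nearest whole percent.
function pctChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 100);
}

const report = {
  responseTime: pctChange(3.2, 0.8),   // seconds
  monthlyTokens: pctChange(180, 52),   // millions of tokens
  monthlyCost: pctChange(2700, 780),   // US dollars
  throughput: pctChange(12, 38),       // requests per second
};
console.log(report);
// { responseTime: -75, monthlyTokens: -71, monthlyCost: -71, throughput: 217 }
```

Token consumption and API cost fall by the same percentage, which is what one would expect if the price per token stayed constant over the period.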

The three biggest contributors are:

Streaming output – cuts perceived latency by 75%.

Prompt trimming + context summarisation – cuts token usage by 55%.

Parallel node execution – raises throughput by over 200%.

06 Common Pitfalls

Cache key includes timestamps or session IDs → hit rate ≈ 0. Fix: keep the prompt template stable and strip dynamic parts.

Over‑aggressive context trimming → LLM forgets earlier user intent. Fix: replace trimmed middle with a concise summary instead of deleting it.

An oversized batch triggers cascading 429 rate-limit errors. Fix: size batches according to the RPM quota and use exponential or jittered back-off.

Streaming tokens are not fully recorded by tracing tools. Fix: log each token in the handleLLMNewToken callback.

Rule‑based model routing misclassifies complex queries as simple, hurting answer quality. Fix: add an LLM‑based intent classifier before routing.
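For the rate-limit pitfall, jitter can be added to a retry loop with very little code. The sketch below uses the "full jitter" variant (a random wait between zero and the exponential ceiling), which is one common choice and not necessarily what any particular product uses; `fullJitterDelayMs` and `withJitteredRetry` are hypothetical names, and the error is assumed to carry an HTTP `status` field.

```typescript
// "Full jitter" back-off: wait a random time in [0, min(cap, base · 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function fullJitterDelayMs(
  attempt: number,
  baseMs = 1000,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Retries only on HTTP 429; any other error is rethrown immediately.
async function withJitteredRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (e: unknown) {
      const status = (e as { status?: number } | null)?.status;
      if (status !== 429 || attempt >= maxRetries - 1) throw e;
      await new Promise(resolve => setTimeout(resolve, fullJitterDelayMs(attempt)));
    }
  }
}

// With a fixed random value of 0.5 the delay is half of the exponential ceiling.
console.log(fullJitterDelayMs(0, 1000, 30_000, () => 0.5));  // 500
console.log(fullJitterDelayMs(3, 1000, 30_000, () => 0.5));  // 4000
console.log(fullJitterDelayMs(10, 1000, 30_000, () => 0.5)); // 15000 (capped at 30 s)
```

Randomising the wait spreads retries from many clients over time, so they do not all hit the API again at the same instant.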

07 How Production‑Grade Products Do It

Below are distilled design choices from two production agents.

7.1 Claude Code

Uses memoize to cache system‑context; cache is cleared precisely when the prompt changes.

Implements compact_boundary messages: when one arrives, the history before the boundary is dropped immediately so it can be garbage-collected.

// memoized git status (Node.js)
export const getGitStatus = memoize(async (): Promise<string | null> => {
  const [branch, mainBranch, status, log, userName] = await Promise.all([
    getBranch(),
    getDefaultBranch(),
    execFileNoThrow(gitExe(), ["status", "--short"], ...),
    execFileNoThrow(gitExe(), ["log", "--oneline", "-n", "5"], ...),
    execFileNoThrow(gitExe(), ["config", "user.name"], ...),
  ]);
  const truncated = status.length > MAX_STATUS_CHARS ? status.substring(0, MAX_STATUS_CHARS) + "\n... (truncated)" : status;
  return [branch, mainBranch, userName, truncated, log].join('\n\n');
});

function setSystemPromptInjection(value: string | null): void {
  systemPromptInjection = value;
  getUserContext.cache?.clear?.(); // precise invalidation
  getSystemContext.cache?.clear?.();
}
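The memoize-then-invalidate pattern can be reduced to a few lines. The sketch below is not the product's code: `memoizeAsync` is a hypothetical stand-in for whatever memoize helper is used, shown only to illustrate how an explicit `cache.clear()` handle gives precise invalidation.

```typescript
// Minimal sketch of a zero-argument async memoizer with an explicit cache handle.
// `memoizeAsync` is a hypothetical stand-in, not the helper used by any product.
function memoizeAsync<T>(fn: () => Promise<T>) {
  let cached: Promise<T> | undefined;
  const memoized = (): Promise<T> => {
    if (!cached) cached = fn(); // compute once, then reuse the same promise
    return cached;
  };
  return Object.assign(memoized, {
    cache: { clear: () => { cached = undefined; } },
  });
}

let computeCount = 0;
const getContext = memoizeAsync(async () => {
  computeCount++; // stands in for an expensive git or filesystem call
  return `context v${computeCount}`;
});

void getContext();
void getContext();         // served from the cache, no recomputation
getContext.cache.clear();  // precise invalidation, e.g. when the prompt changes
void getContext();
console.log(computeCount); // 2
```

Caching the promise rather than the resolved value also means that concurrent callers share one in-flight computation instead of starting their own.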

When a compact_boundary message arrives, the engine splices out all earlier messages, so they become eligible for garbage collection right away.

if (message.subtype === 'compact_boundary') {
  const mutableIdx = this.mutableMessages.length - 1;
  if (mutableIdx > 0) this.mutableMessages.splice(0, mutableIdx); // drop references so old messages can be collected
  const localIdx = messages.length - 1;
  if (localIdx > 0) messages.splice(0, localIdx);
}

7.2 Hermes Agent

Provides a ContextCompressor that triggers when token usage exceeds 75% of the model window, summarises the middle segment with a cheap auxiliary LLM, and replaces it with a concise summary.

Runs external API calls in a dedicated thread (interruptible_api_call) so the main loop can poll for interruption every 100 ms.

Uses decorrelated jitter back‑off to avoid synchronized retries across many clients.

# Python-style context compressor (simplified)
class ContextCompressor:
    def __init__(self, keep_first=2, keep_last=8, threshold=0.75):
        self.keep_first = keep_first
        self.keep_last = keep_last
        self.threshold = threshold

    def should_compress(self, total_tokens, window):
        # Trigger once usage exceeds the given fraction of the model window.
        return total_tokens / window > self.threshold

    async def compress(self, messages, aux_llm):
        head = messages[:self.keep_first]
        tail = messages[-self.keep_last:]
        middle = messages[self.keep_first:-self.keep_last]
        if not middle:
            return messages
        history = "\n".join(f"{m['role']}: {m['content']}" for m in middle)
        prompt = f"Summarise the following conversation history:\n{history}"
        summary = await aux_llm.call_llm(messages=[{"role": "user", "content": prompt}], max_tokens=500)
        return head + [{"role": "assistant", "content": f"[Conversation summary] {summary}"}] + tail

# Interruptible API call (Python, simplified)
import threading

def interruptible_api_call(agent, api_kwargs):
    result = {"response": None, "error": None}
    holder = {"client": None}

    def _call():
        try:
            holder["client"] = agent._create_request_openai_client("chat_completion_request", api_kwargs)
            result["response"] = holder["client"].chat.completions.create(**api_kwargs)
        except Exception as e:
            result["error"] = e
        finally:
            if holder["client"] is not None:
                agent._close_request_openai_client(holder["client"], "request_complete")

    thread = threading.Thread(target=_call, daemon=True)
    thread.start()
    while thread.is_alive():
        thread.join(0.1)  # wake every 100 ms to check for an interrupt
        if agent.interrupted and holder["client"] is not None:
            holder["client"].close()  # abort the in-flight request
            break
    return result["response"], result["error"]
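
# Decorrelated jitter back-off (Python, simplified)
import asyncio
import random

from openai import RateLimitError  # assumption: the OpenAI Python SDK is the client in use

def jittered_backoff(attempt, base=1.0, cap=30.0):
    # The upper bound grows exponentially; the random draw spreads clients apart.
    prev = base * (2 ** max(attempt - 1, 0))
    return min(random.uniform(base, prev * 3), cap)

async def call_with_retry(fn, max_retries=5):
    for i in range(max_retries):
        try:
            return await fn()
        except RateLimitError:
            if i == max_retries - 1:
                raise
            await asyncio.sleep(jittered_backoff(i))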

7.3 Cross‑Product Comparison

Both products cache at the system-prompt or context-injection layer, not inside the LLM call.

Both discard pre-summary history right after summarisation to avoid memory bloat.

Production agents favour jittered back‑off over plain exponential back‑off.

API calls are decoupled from the main event loop via threads or async workers.

Conclusion

Latency: streaming + Redis cache + model routing reduces perceived latency by ~75%.

Token cost: prompt trimming, context summarisation, and selective tool injection cut token usage by ~55%.

Concurrency: LangGraph parallel nodes + batch processing + jittered back‑off raise throughput by >200%.

Overall, a typical AI customer-service agent sees latency drop from 3.2 s to 0.8 s, token spend fall from 180 M to 52 M, cost cut from $2.7k to $0.78k, and request rate climb from 12 req/s to 38 req/s.

Next time we’ll discuss monitoring and alerting for agents in production, because relying on LangSmith traces alone isn’t enough; proactive health checks are required.

[Figure: Agent performance bottleneck map]
[Figure: Latency optimisation triad: streaming, cache, model routing]
[Figure: Token consumption breakdown]
[Figure: Parallel vs serial tool execution, 62% time saved]
[Figure: Quantitative before-after comparison]
[Figure: Common optimisation pitfalls]
[Figure: Industry product optimisation matrix]
