33 min read

LLM Semantic Routing Explained: Model‑Based Intent Classification and Three Keyword‑Matching Pitfalls

This article breaks down LLM semantic routing as a classifier, compares keyword, embedding, and LLM‑based routes, provides full TypeScript implementations, introduces hybrid routing for speed and accuracy, and covers production‑grade observability and dynamic configuration to avoid common pitfalls.

James' Growth Diary

May 14, 2026

LLM Semantic Routing Explained: Model‑Based Intent Classification and Three Keyword‑Matching Pitfalls

Hello, I’m James. In the previous post we completed multimodal RAG pipelines for images, tables, and PDFs; this article moves to the next dimension – routing.

01 What Is Semantic Routing? Three Implementation Paths

Routing is a classifier that takes a user query and outputs the processing chain it should follow. The three technical routes are:

Keyword rules : regular‑expression or string matching; works when intent and vocabulary are fixed; typical error rate 15‑40% for complex intents.

Embedding similarity : compute cosine similarity between the query vector and example vectors; works when intent clusters are clear; typical error rate 8‑15%.

LLM classification : let the model directly output a label; handles fuzzy and composite intents; error rate 2‑5% with strong models.

The core advantage of LLM routing is not raw accuracy but the ability to handle fuzzy and compound intents that keyword or embedding methods miss.

02 Minimal Implementation: withStructuredOutput for Routing

LangChain.js provides the most concise LLM routing by using withStructuredOutput to force the model to emit a JSON object. Each intent is defined with example sentences, and a confidence field is added for production debugging.

// ===== Scheme 1: Keyword router =====
function keywordRouter(query: string): string {
  const rules = [
    { pattern: /退款|退货|取消订单|申请退/, route: "refund" },
    { pattern: /快递|物流|配送|包裹|运单/, route: "logistics" },
    { pattern: /功能|规格|参数|对比|哪款好/, route: "product" }
  ];
  for (const { pattern, route } of rules) {
    if (pattern.test(query)) return route;
  }
  return "general";
}
// Problems: cannot match "帮我把上周下的单子取消掉", rule explosion, etc.

// ===== Scheme 2: Embedding router =====
import { OpenAIEmbeddings } from "@langchain/openai";
import { cosineSimilarity } from "@langchain/core/utils/math";
const embeddingModel = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
const routeLabels = {
  refund: ["我想退款", "这个订单能取消吗", "退货怎么操作"],
  logistics: ["快递到哪了", "物流查询", "几天能送到"],
  product: ["这款有什么功能", "跟竞品比如何", "哪个型号好"],
  general: ["你们是做什么的", "如何联系客服"]
};
// Pre‑compute example vectors, then find the highest cosine score.
// Issues: only captures the primary intent, threshold tuning required.

// ===== Scheme 3: LLM router =====
import { ChatOpenAI } from "@langchain/openai";
import { z } from "zod";
const RouteSchema = z.object({
  route: z.enum(["refund", "logistics", "product", "general"]),
  confidence: z.number().min(0).max(1),
  reasoning: z.string()
});
const llmRouterModel = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
const llmRouterChain = ChatPromptTemplate.fromMessages([
  ["system", `意图分类（每条附例句）：
- refund：退款/退货/取消订单。例："我要退货"
- logistics：快递/物流/配送。例："我的快递到哪了"
- product：产品功能/选型/对比。例："这款有什么功能"
- general：以上之外。例："你们是哪家公司"`],
  ["human", "{query}"]
]).pipe(llmRouterModel.withStructuredOutput(RouteSchema));
async function llmRouter(query: string): Promise<string> {
  const result = await llmRouterChain.invoke({ query });
  return result.confidence >= 0.75 ? result.route : "general";
}
// Advantages: can infer primary and secondary intents, provides confidence and reasoning.</p><h3>Three Routing Code Comparisons</h3><p>The code snippets above illustrate the implementation complexity and boundary conditions of each approach.</p><h2>03 Integrating with LangGraph: Routing Node + Conditional Edges</h2><p>In production a routing decision must be followed by the appropriate business chain. LangGraph’s <code>addConditionalEdges</code> connects the routing node to the downstream nodes.</p><pre><code>import { StateGraph, Annotation } from "@langchain/langgraph";
import { ChatOpenAI } from "@langchain/openai";
import { z } from "zod";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Document } from "@langchain/core/documents";

const GraphState = Annotation.Root({
  query: Annotation<string>(),
  route: Annotation<string>(),
  response: Annotation<string>()
});

async function routerNode(state) {
  const RouteSchema = z.object({
    route: z.enum(["refund", "logistics", "product", "general"]),
    confidence: z.number()
  });
  const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
  const result = await model.withStructuredOutput(RouteSchema).invoke([
    { role: "system", content: "根据问题分类：refund退款、logistics物流、product产品、general其他" },
    { role: "human", content: state.query }
  ]);
  return { route: result.confidence >= 0.75 ? result.route : "general" };
}

async function refundNode(state) {
  const refundDocs = [
    new Document({ pageContent: "退款申请须在收货7天内提交，超时不予受理" }),
    new Document({ pageContent: "退款到账时间：支付宝1-3个工作日，银行卡3-5个工作日" }),
    new Document({ pageContent: "退款提交后24小时内审核，审核通过后原路退回" })
  ];
  const vectorStore = await MemoryVectorStore.fromDocuments(
    refundDocs,
    new OpenAIEmbeddings({ model: "text-embedding-3-small" })
  );
  const docs = await vectorStore.similaritySearch(state.query, 3);
  const context = docs.map(d => d.pageContent).join("
");
  const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0.3 });
  const response = await model.invoke([
    { role: "system", content: `你是退款专员，基于以下退款政策回答用户问题：

${context}` },
    { role: "human", content: state.query }
  ]);
  return { response: response.content };
}

// Similar logisticsNode and productNode omitted for brevity.

const routeMap = { refund: "refundNode", logistics: "logisticsNode", product: "productNode", general: "generalNode" };

const app = new StateGraph(GraphState)
  .addNode("router", routerNode)
  .addNode("refundNode", refundNode)
  .addNode("logisticsNode", async (state) => ({ response: `物流信息：${state.query}` }))
  .addNode("productNode", async (s) => ({ response: `产品咨询：${s.query}` }))
  .addNode("generalNode", async (s) => ({ response: `通用回答：${s.query}` }))
  .addEdge("__start__", "router")
  .addConditionalEdges("router", (s) => routeMap[s.route] ?? "generalNode", {
    refundNode: "refundNode",
    logisticsNode: "logisticsNode",
    productNode: "productNode",
    generalNode: "generalNode"
  })
  .addEdge("refundNode", "__end__")
  .addEdge("logisticsNode", "__end__")
  .addEdge("productNode", "__end__")
  .addEdge("generalNode", "__end__")
  .compile();

const result = await app.invoke({ query: "我昨天买的手机想退货" });
// result.route === "refund", result.response contains the generated answer.

Each business node first retrieves relevant facts from its own vector store (via similaritySearch) and then lets the LLM generate a response, which is far more reliable than pure memorisation.

04 Multi‑Route Concurrency: Handling Compound Intents

When a user asks for two independent intents, e.g., “Check my refund order and tell me if the phone is still in stock”, the system must run both chains in parallel and merge the results. LangGraph’s Send primitive enables this.

import { Send } from "@langchain/langgraph";
async function multiRouterNode(state) {
  const MultiRouteSchema = z.object({
    routes: z.array(z.object({
      route: z.enum(["refund", "logistics", "product", "general"]),
      sub_query: z.string().describe("子查询")
    })).describe("所有独立意图，最多3个")
  });
  const model = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });
  const result = await model.withStructuredOutput(MultiRouteSchema).invoke([
    { role: "system", content: "识别用户查询中所有独立意图，每个意图单独拆分子查询。" },
    { role: "human", content: state.query }
  ]);
  return result.routes.map(r => new Send(r.route + "Node", { ...state, query: r.sub_query }));
}
// The graph will execute all Send targets concurrently and combine the outputs via a reducer.

05 Hybrid Routing: Embedding Fast Path + LLM Fallback

Production‑grade systems usually adopt a hybrid strategy: most high‑frequency, low‑ambiguity intents are resolved with embedding similarity (≈30 ms), while the remaining fuzzy cases fall back to the LLM (≈300 ms). The code below shows the full flow.

const routeExamples = {
  refund: ["我想退款", "订单我要取消", "退货怎么申请"],
  logistics: ["快递到哪了", "包裹几天能到", "物流单号查询"],
  product: ["这款有什么功能", "跟竞品比怎么样", "有没有优惠"],
  general: ["你们是哪家公司", "怎么联系客服"]
};
const emb = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
const exampleVectors = await Promise.all(
  Object.entries(routeExamples).map(async ([route, examples]) => ({
    route,
    vectors: await emb.embedDocuments(examples)
  }))
);

async function llmRouter(query) {
  const RouteSchema = z.object({
    route: z.enum(["refund", "logistics", "product", "general"]),
    confidence: z.number().min(0).max(1)
  });
  const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
  const result = await model.withStructuredOutput(RouteSchema).invoke([
    { role: "system", content: `意图分类规则：
- refund：退款/退货/取消订单。例："我要退货"
- logistics：快递/物流/配送。例："我的快递到哪了"
- product：产品功能/选型/对比。例："这款有什么功能"
- general：以上之外。例："你们是哪家公司"` },
    { role: "human", content: query }
  ]);
  return result.confidence >= 0.75 ? result.route : "general";
}

async function hybridRouter(query) {
  const qVec = await emb.embedQuery(query);
  let bestRoute = "general";
  let bestScore = 0;
  for (const { route, vectors } of exampleVectors) {
    const maxScore = Math.max(...vectors.map(v => cosineSimilarity([qVec], [v])[0][0]));
    if (maxScore > bestScore) { bestScore = maxScore; bestRoute = route; }
  }
  if (bestScore > 0.92) {
    console.log(`[Hybrid] Embedding fast path, route=${bestRoute}, score=${bestScore.toFixed(3)}`);
    return bestRoute;
  }
  console.log(`[Hybrid] LLM fallback, best_score=${bestScore.toFixed(3)}`);
  return await llmRouter(query);
}

Benchmark results (averaged over many queries):

GPT‑4o routing – ~800 ms latency, 97 % accuracy (extreme precision required).

GPT‑4o‑mini routing – ~300 ms, 94 % accuracy (covers most production cases).

Pure embedding routing – ~30 ms, 85 % accuracy (only when intent is highly fixed).

Hybrid (recommended) – ~50 ms (≈90 % fast path), 95 % accuracy, ideal for high‑frequency deterministic intents with occasional fuzzy fallbacks.

06 Routing Observability: Logging, Metrics, and Alerts

Without observability a routing error can silently affect users. The three‑layer stack consists of structured logging, sliding‑window accuracy statistics, and threshold‑based alerts.

import { createLogger, transports, format } from "winston";

const routeLogger = createLogger({
  level: "info",
  format: format.combine(format.timestamp(), format.json()),
  transports: [
    new transports.File({ filename: "logs/routing.log" }),
    new transports.Console({ format: format.simple() })
  ]
});

interface RouteLogEntry {
  sessionId: string;
  query: string;
  route: string;
  confidence: number;
  reasoning: string;
  method: "embedding" | "llm";
  latencyMs: number;
  isFallback: boolean;
}

async function routerWithLogging(query: string, sessionId: string) {
  const start = Date.now();
  const RouteSchema = z.object({
    route: z.enum(["refund", "logistics", "product", "general"]),
    confidence: z.number(),
    reasoning: z.string()
  });
  const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
  const result = await model.withStructuredOutput(RouteSchema).invoke([
    { role: "system", content: "意图分类：refund退款、logistics物流、product产品、general其他" },
    { role: "human", content: query }
  ]);
  const isFallback = result.confidence < 0.75;
  const finalRoute = isFallback ? "general" : result.route;
  const latency = Date.now() - start;
  const logEntry: RouteLogEntry = {
    sessionId,
    query,
    route: finalRoute,
    confidence: result.confidence,
    reasoning: result.reasoning,
    method: "llm",
    latencyMs: latency,
    isFallback
  };
  routeLogger.info("route_decision", logEntry);
  if (isFallback) {
    routeLogger.warn("route_fallback", { sessionId, query, originalRoute: result.route, confidence: result.confidence });
  }
  return { route: finalRoute, logEntry };
}

class RouteAccuracyMonitor {
  private stats = { total: 0, correct: 0, fallbackCount: 0, routeDistribution: {}, windowStartTime: Date.now() };
  private readonly windowMs = 60 * 60 * 1000; // 1 hour
  record(route: string, isCorrect: boolean, isFallback: boolean) {
    if (Date.now() - this.stats.windowStartTime > this.windowMs) this.reset();
    this.stats.total++;
    if (isCorrect) this.stats.correct++;
    if (isFallback) this.stats.fallbackCount++;
    this.stats.routeDistribution[route] = (this.stats.routeDistribution[route] ?? 0) + 1;
  }
  getAccuracy() { return this.stats.total === 0 ? 1 : this.stats.correct / this.stats.total; }
  getFallbackRate() { return this.stats.total === 0 ? 0 : this.stats.fallbackCount / this.stats.total; }
  getReport() {
    return {
      accuracy: `${(this.getAccuracy() * 100).toFixed(1)}%`,
      fallbackRate: `${(this.getFallbackRate() * 100).toFixed(1)}%`,
      total: this.stats.total,
      distribution: this.stats.routeDistribution
    };
  }
  private reset() {
    this.stats = { total: 0, correct: 0, fallbackCount: 0, routeDistribution: {}, windowStartTime: Date.now() };
  }
}

const monitor = new RouteAccuracyMonitor();

async function checkAndAlert() {
  const report = monitor.getReport();
  const accuracy = monitor.getAccuracy();
  const fallbackRate = monitor.getFallbackRate();
  if (accuracy < 0.88) {
    routeLogger.error("route_accuracy_alert", {
      alert: "ACCURACY_DROP",
      accuracy: report.accuracy,
      threshold: "88%",
      action: "检查路由 prompt 或近期新增意图是否未覆盖"
    });
  }
  if (fallbackRate > 0.15) {
    routeLogger.warn("route_fallback_rate_alert", {
      alert: "HIGH_FALLBACK_RATE",
      fallbackRate: report.fallbackRate,
      threshold: "15%",
      action: "查看 fallback query 样本，补充对应意图分类"
    });
  }
  routeLogger.info("route_hourly_report", report);
}
setInterval(checkAndAlert, 60 * 60 * 1000);

07 Dynamic Routing Extension: Adding New Intents Without Code Changes

Product teams can add a new intent (e.g., “coupon”) by editing a JSON configuration file. The system reads the file at startup, builds the Zod enum and system prompt dynamically, and registers a corresponding LangGraph node.

interface IntentConfig {
  key: string;          // routing key, matches LLM label
  label: string;        // human‑readable name
  description: string;   // prompt description
  examples: string[];   // few‑shot examples
  node: string;         // LangGraph node name
}

function loadIntentConfig(path: string): IntentConfig[] {
  const raw = fs.readFileSync(path, "utf-8");
  return JSON.parse(raw) as IntentConfig[];
}

function buildDynamicRouterPrompt(intents: IntentConfig[]): string {
  const lines = intents.map(i => `- ${i.key}: ${i.description}。例：${i.examples.slice(0,2).join('、')}`);
  return `意图分类规则（请严格按规则分类）：
${lines.join('
')}`;
}

async function dynamicRouter(query: string, configPath = "./intents.json") {
  const intents = loadIntentConfig(configPath);
  const intentKeys = intents.map(i => i.key) as [string, ...string[]];
  const DynamicRouteSchema = z.object({
    route: z.enum(intentKeys).describe("意图分类"),
    confidence: z.number().min(0).max(1),
    reasoning: z.string()
  });
  const systemPrompt = buildDynamicRouterPrompt(intents);
  const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
  const result = await model.withStructuredOutput(DynamicRouteSchema).invoke([
    { role: "system", content: systemPrompt },
    { role: "human", content: query }
  ]);
  return { route: result.confidence >= 0.75 ? result.route : "general", confidence: result.confidence };
}

function buildDynamicGraph(intents: IntentConfig[]) {
  let graph = new StateGraph(GraphState).addNode("router", async (state) => {
    const { route } = await dynamicRouter(state.query);
    return { route };
  });
  for (const intent of intents) {
    graph = graph.addNode(intent.node, async (state) => ({ response: `[${intent.label}] 处理 query: ${state.query}` }))
                 .addEdge(intent.node, "__end__");
  }
  const nodeMap = Object.fromEntries(intents.map(i => [i.node, i.node]));
  graph = graph
    .addEdge("__start__", "router")
    .addConditionalEdges("router", (state) => {
      const intent = intents.find(i => i.key === state.route);
      return intent?.node ?? "generalNode";
    }, nodeMap);
  return graph.compile();
}

// Example usage after adding a new "coupon" intent to intents.json:
const intents = loadIntentConfig("./intents.json");
const dynamicApp = buildDynamicGraph(intents);
const result = await dynamicApp.invoke({ query: "有没有优惠码可以用" });
// result.route will be "coupon" and the flow will hit couponNode.

08 Common Pitfalls (Two‑Week Pain Points)

Prompt too vague : Only listing class names leads to ambiguous boundaries. Always add concrete example sentences and clear edge cases.

Missing Zod enum : Without enum constraints the model may output capitalised or Chinese strings (e.g., "Refund"), breaking downstream if (route === "refund") checks.

Routing after RAG : Performing a full‑knowledge‑base retrieval before routing wastes tokens. Route first, then each chain queries its own vector store.

Ignoring low confidence : About 5‑10 % of queries are fuzzy. Without a 0.75 confidence threshold they fall into random chains, causing user complaints.

Lack of observability : Without logs, accuracy windows, and alerts, routing drift is only discovered via complaints.

Addressing these issues with structured prompts, enum validation, early routing, confidence fallback, and the three‑layer monitoring stack eliminates silent errors.

Summary

Routing is fundamentally a classifier; choose keyword, embedding, or LLM based on intent fuzziness.

Use withStructuredOutput and a Zod enum to force strong‑typed labels and capture reasoning for debugging.

Make each downstream node RAG‑aware: retrieve from a dedicated vector store before generating an answer.

Apply a confidence threshold (≈0.75) to fallback to a generic chain and log the event.

Hybrid routing (embedding fast path + LLM fallback) delivers ~50 ms P90 latency and ~95 % accuracy.

Three‑layer observability (structured logs → sliding‑window metrics → alerts) is essential to detect drift early.

Dynamic configuration (JSON intent file) decouples intent addition from code deployment, ideal for fast‑moving products.

In the next post we will explore five design patterns for intent recognition, from simple keywords to self‑routing LLMs, and clarify the boundaries of each solution.

Follow me, James, for more AI‑era engineering insights.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

LLM observability LangChain intent classification hybrid routing LangGraph semantic routing

Written by

James' Growth Diary

I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.