End-to-End Observability with LangSmith: Trace Debugging and RAG Evaluation from Development to Production
This article walks through LangSmith’s three core capabilities—Trace, Evaluation, and Dataset management—showing how to integrate zero‑code tracing, quantify RAG performance with custom evaluators, run version‑comparison experiments, and set up production monitoring with sampling and feedback loops.
LangSmith Core Capabilities
LangSmith records every step of an LLM Agent execution as a Trace. Each node in the trace tree contains latency, token usage, inputs, outputs, and error information. The platform provides three core functions:
Trace – full visibility into execution steps for debugging and production troubleshooting.
Evaluation – quantitative measurement of RAG pipelines on Faithfulness, Relevance, and Correctness.
Dataset – managed collection of test cases for regression testing and continuous evaluation.
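To make the trace tree concrete, here is a minimal sketch of how such a tree could be modeled and aggregated locally. The TraceNode shape and the summarize helper are illustrative assumptions, not LangSmith's actual run schema.

```typescript
// Hypothetical, simplified model of a trace tree (not LangSmith's real run schema).
interface TraceNode {
  name: string;
  latencyMs: number; // wall-clock time spent in this node, including children
  tokens: number;    // tokens consumed by this node alone
  error?: string;    // set when the step failed
  children: TraceNode[];
}

interface TraceSummary {
  totalTokens: number;
  failedSteps: string[];
}

// Walk the tree depth-first, summing tokens and collecting failed step names.
function summarize(node: TraceNode): TraceSummary {
  const summary: TraceSummary = {
    totalTokens: node.tokens,
    failedSteps: node.error ? [node.name] : [],
  };
  for (const child of node.children) {
    const sub = summarize(child);
    summary.totalTokens += sub.totalTokens;
    summary.failedSteps.push(...sub.failedSteps);
  }
  return summary;
}

// Made-up example data for illustration only.
const exampleTrace: TraceNode = {
  name: "RAG Pipeline",
  latencyMs: 1800,
  tokens: 0,
  children: [
    { name: "Retrieve", latencyMs: 300, tokens: 0, children: [] },
    { name: "Generate", latencyMs: 1500, tokens: 950, error: "rate limited", children: [] },
  ],
};

const traceSummary = summarize(exampleTrace);
console.log(traceSummary);
```

This is the kind of roll-up the LangSmith UI performs for you: token totals per trace and a quick view of which step failed.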
Zero‑Code Integration with LangChain.js
Set three environment variables and LangChain.js automatically captures traces without any code changes.
# .env file
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2_pt_your_key_here
LANGSMITH_PROJECT=my-rag-agent
OPENAI_API_KEY=sk-...
After setting the variables, any LangChain chain (e.g., a RAG pipeline) will appear in the LangSmith UI under the specified project.
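Because the integration depends entirely on environment variables, a missing or blank value fails silently: the chain still runs, but no traces appear. A small startup check can catch that early. The sketch below is an assumption about how one might do it; missingLangSmithVars and warnIfTracingMisconfigured are hypothetical helper names, not part of any SDK.

```typescript
// Names of the LangSmith variables from the .env example.
const REQUIRED_VARS = ["LANGSMITH_TRACING", "LANGSMITH_API_KEY", "LANGSMITH_PROJECT"] as const;

type Env = Record<string, string | undefined>;

// Return the names of required variables that are unset or blank.
function missingLangSmithVars(env: Env): string[] {
  return REQUIRED_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Hypothetical startup check: warn rather than crash, since tracing is optional.
function warnIfTracingMisconfigured(env: Env): boolean {
  const missing = missingLangSmithVars(env);
  if (missing.length > 0) {
    console.warn(`LangSmith tracing may be disabled; missing: ${missing.join(", ")}`);
    return false;
  }
  return true;
}
```

In a Node.js service this could be called once at boot as warnIfTracingMisconfigured(process.env).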
Deep Trace Usage
Automatic tracing only covers built‑in LangChain components. Custom functions must be instrumented manually.
Option 1 – traceable wrapper (recommended)
import { traceable } from "langsmith/traceable";
import { RunTree } from "langsmith";
import { LangChainTracer } from "@langchain/core/tracers/tracer_langchain";
import { RunCollectorCallbackHandler } from "@langchain/core/tracers/run_collector";
// Wrap a custom function so that each call is recorded as a run.
const processQuery = traceable(
  async (query: string, topK: number = 5) => {
    // Normalize the query before searching the vector store.
    const cleaned = query.trim().toLowerCase();
    const results = await vectorStore.similaritySearch(cleaned, topK);
    return results;
  },
  { name: "processQuery", metadata: { version: "v2" } }
);
Option 2 – Manual RunTree for complex flows
async function ragPipeline(question: string) {
  // Root run for the whole pipeline.
  const parent = new RunTree({ name: "RAG Pipeline", run_type: "chain", inputs: { question } });
  await parent.postRun();
  // Child run: retrieval.
  const retrieveChild = await parent.createChild({
    name: "Retrieve",
    run_type: "retriever",
    inputs: { query: question }
  });
  await retrieveChild.postRun();
  const docs = await vectorStore.similaritySearch(question, 5);
  await retrieveChild.end({ outputs: { documents: docs } });
  await retrieveChild.patchRun();
  // Child run: generation.
  const generateChild = await parent.createChild({
    name: "Generate",
    run_type: "llm",
    inputs: { question, context: docs.map(d => d.pageContent).join("\n") }
  });
  await generateChild.postRun();
  const answer = await chain.invoke({ question, context: docs });
  await generateChild.end({ outputs: { answer } });
  await generateChild.patchRun();
  // Close the root run once all children have finished.
  await parent.end({ outputs: { answer } });
  await parent.patchRun();
  return answer;
}
Option 3 – Adding metadata and tags
const result = await chain.invoke(
  { question: "...", context: "..." },
  {
    runName: "RAG-v2",
    tags: ["production", "rag"],
    metadata: { userId: "user_123", queryType: "technical", env: "prod" }
  }
);
Metadata fields such as userId or env can be used in the UI to filter traces.
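Filtering only works if every call site attaches the same fields, so it can help to build the options object in one place. The following is a minimal sketch under that assumption; RequestContext and buildRunConfig are hypothetical names, and the returned object simply mirrors the runName, tags, and metadata keys used in Option 3.

```typescript
// Shape of the options object passed as the second argument to chain.invoke in Option 3.
interface RunConfig {
  runName: string;
  tags: string[];
  metadata: Record<string, string>;
}

// Hypothetical per-request context; field names mirror the metadata shown in Option 3.
interface RequestContext {
  userId: string;
  env: "dev" | "staging" | "prod";
  queryType?: string;
}

// Build a consistent config so that every call carries the same searchable fields.
function buildRunConfig(ctx: RequestContext, runName: string, extraTags: string[] = []): RunConfig {
  const metadata: Record<string, string> = { userId: ctx.userId, env: ctx.env };
  if (ctx.queryType) metadata.queryType = ctx.queryType;
  // De-duplicate tags while preserving insertion order.
  const tags = Array.from(new Set([ctx.env, ...extraTags]));
  return { runName, tags, metadata };
}

const runConfig = buildRunConfig(
  { userId: "user_123", env: "prod", queryType: "technical" },
  "RAG-v2",
  ["rag", "prod"]
);
console.log(runConfig);
```

The result can be passed straight through, for example chain.invoke(input, runConfig), so a forgotten userId or env becomes a type error instead of an unsearchable trace.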
RAG Quantitative Evaluation
LangSmith evaluates RAG pipelines on three dimensions:
Faithfulness – does the answer stay within the retrieved context?
Relevance – how well do the retrieved documents match the question?
Correctness – similarity between the answer and a reference answer.
Typical workflow:
Create a Dataset with input questions and reference answers.
Implement custom evaluators (LLM‑as‑Judge) that return a score for each dimension.
Run an evaluate experiment that applies the evaluators to every dataset example.
Example (TypeScript)
import { Client } from "langsmith";
import { evaluate, EvaluationResult } from "langsmith/evaluation";
const client = new Client();
// Step 1: create dataset
const dataset = await client.createDataset("rag-eval-v1", { description: "RAG system evaluation dataset v1" });
await client.createExamples({
  inputs: [
    { question: "What is a Checkpoint in LangGraph?" },
    { question: "How to choose a vector DB?" },
    { question: "What is Rerank?" }
  ],
  outputs: [
    { answer: "Checkpoint is LangGraph's persistence mechanism..." },
    { answer: "Use Chroma for dev, Qdrant for medium, Milvus for large scale." },
    { answer: "Rerank re‑orders candidates using a Cross‑Encoder." }
  ],
  datasetId: dataset.id
});
// Step 2: faithfulness evaluator
const faithfulnessEvaluator = async ({ inputs, outputs }) => {
  const verdict = await model.invoke(`Decide whether the answer relies only on the context and adds no outside information.
Question: ${inputs.question}
Context: ${outputs.context}
Answer: ${outputs.answer}
Reply with only the word faithful or hallucinated.`);
  const passed = verdict.content.toString().trim().toLowerCase() === "faithful";
  return { key: "faithfulness", score: passed ? 1 : 0, comment: passed ? "Faithful" : "Hallucination detected" };
};
// Step 3: relevance evaluator
const relevanceEvaluator = async ({ outputs, referenceOutputs }) => {
  const score = await model.invoke(`Reference answer: ${referenceOutputs.answer}
Actual answer: ${outputs.answer}
Score from 0 to 10. Reply with only the number.`);
  // Normalize the 0-10 reply to the 0-1 range.
  return { key: "answer_relevance", score: parseInt(score.content.toString().trim(), 10) / 10 };
};
// Step 4: run experiment
await evaluate(ragTarget, {
  data: "rag-eval-v1",
  evaluators: [faithfulnessEvaluator, relevanceEvaluator],
  experimentPrefix: "rag-v2",
  metadata: { version: "v2" }
});
Comparative Experiments
To compare two prompt versions, run each one as its own experiment against the same dataset and evaluators. LangSmith’s experiment comparison view then shows the metrics side‑by‑side.
// Baseline version
await evaluate(ragTarget, {
  data: "rag-eval-v1",
  evaluators: [faithfulnessEvaluator, relevanceEvaluator],
  experimentPrefix: "baseline"
});
// Optimized version with Chain‑of‑Thought
const ragTargetCoT = async (input) => {
  const docs = await vectorStore.similaritySearch(input.question, 5);
  const context = docs.map(d => d.pageContent).join("\n");
  const answer = await model.invoke(`First analyze what the question is really asking, then reason step by step from the context, and finally give a concise answer.
Question: ${input.question}
Context: ${context}`);
  return { answer: answer.content.toString(), context };
};
await evaluate(ragTargetCoT, {
  data: "rag-eval-v1",
  evaluators: [faithfulnessEvaluator, relevanceEvaluator],
  experimentPrefix: "cot-v1",
  metadata: { promptVersion: "cot-v1" }
});
Result summary (extracted from the UI):
Baseline – Faithfulness 0.73, Relevance 0.68, Avg latency 1.2 s, Tokens 4,230.
CoT version – Faithfulness 0.89, Relevance 0.81, Avg latency 1.8 s, Tokens 5,610.
The CoT prompt improves quality but increases latency and token cost, enabling a data‑driven trade‑off decision.
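To put numbers on that trade‑off, the relative change of each metric can be computed from the two summaries. The sketch below uses only the figures reported above; the ExperimentSummary type and pctChange helper are illustrative names, not part of LangSmith.

```typescript
interface ExperimentSummary {
  faithfulness: number;
  relevance: number;
  avgLatencySec: number;
  tokens: number;
}

// Figures copied from the result summary.
const baseline: ExperimentSummary = { faithfulness: 0.73, relevance: 0.68, avgLatencySec: 1.2, tokens: 4230 };
const cot: ExperimentSummary = { faithfulness: 0.89, relevance: 0.81, avgLatencySec: 1.8, tokens: 5610 };

// Relative change in percent, rounded to one decimal place.
function pctChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 1000) / 10;
}

const changes = {
  faithfulness: pctChange(baseline.faithfulness, cot.faithfulness),
  relevance: pctChange(baseline.relevance, cot.relevance),
  latency: pctChange(baseline.avgLatencySec, cot.avgLatencySec),
  tokens: pctChange(baseline.tokens, cot.tokens),
};
console.log(changes);
```

With these inputs, faithfulness rises by about 21.9 % and relevance by about 19.1 %, at the cost of 50 % more latency and about 32.6 % more tokens.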
Production Monitoring with Sampling and User Feedback
In production, trace a random 10 % of requests and always trace errors.
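async function invokeWithSampling(question, context, userId) {
  // Trace roughly 10 % of requests. This assumes global tracing is not
  // already switched on through LANGSMITH_TRACING=true.
  const shouldTrace = Math.random() < 0.1;
  const config = shouldTrace ? {
    callbacks: [new LangChainTracer()],
    metadata: { userId, env: "prod", version: process.env.APP_VERSION },
    tags: ["prod", "sampled"]
  } : {};
  try {
    return await chain.invoke({ question, context }, config);
  } catch (error) {
    // Replay untraced failures with tracing on so the error is captured.
    // The replay may fail as well, so its own error is ignored and the
    // original error is rethrown.
    if (!shouldTrace) {
      await chain.invoke({ question, context }, {
        callbacks: [new LangChainTracer()],
        metadata: { userId, error: true }
      }).catch(() => {});
    }
    throw error;
  }
}
User feedback is tied to a trace through its run ID. A RunCollectorCallbackHandler captures the run ID, the application returns it with the response, and the client later sends it back to LangSmith together with the user's rating.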
async function recordFeedback(runId, isPositive, comment) {
  // Attach a binary user rating (1 = positive, 0 = negative) to the traced run.
  await client.createFeedback(runId, "user_feedback", {
    score: isPositive ? 1 : 0,
    comment
  });
}
Key monitoring metrics (configured in the LangSmith dashboard) include:
P95 latency – alert if > 5 s.
Daily token consumption – alert if > 80 % of budget.
Error rate – alert if > 1 %.
User negative‑feedback rate – alert if > 10 %.
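The thresholds above are configured in the dashboard, but it can be useful to see what each rule computes. The following sketch evaluates the same four rules over a window of locally collected statistics. The WindowStats shape, the alert names, and the nearest‑rank percentile method are assumptions made for illustration; LangSmith may calculate percentiles differently.

```typescript
// Nearest-rank percentile: the smallest value with at least p % of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty sample");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical aggregate for one monitoring window (for example, one day).
interface WindowStats {
  latenciesSec: number[];
  tokensUsed: number;
  tokenBudget: number;
  requests: number;
  errors: number;
  feedbackCount: number;
  negativeFeedback: number;
}

// Apply the four thresholds from the list and return the names of the rules that fired.
function firedAlerts(s: WindowStats): string[] {
  const alerts: string[] = [];
  if (s.latenciesSec.length > 0 && percentile(s.latenciesSec, 95) > 5) alerts.push("p95_latency");
  if (s.tokensUsed > 0.8 * s.tokenBudget) alerts.push("token_budget");
  if (s.requests > 0 && s.errors / s.requests > 0.01) alerts.push("error_rate");
  if (s.feedbackCount > 0 && s.negativeFeedback / s.feedbackCount > 0.1) alerts.push("negative_feedback");
  return alerts;
}
```

Guarding each ratio against a zero denominator matters in practice: a quiet window with no feedback should not raise a negative‑feedback alert.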
Common Pitfalls
Trace volume explosion – enabling tracing in production can generate hundreds of thousands of traces per month. Mitigate with the 10 % sampling strategy and error‑only tracing.
Unstable evaluator scores – keep the evaluator model temperature at 0 and include a clear rubric in the prompt.
Missing metadata – inject fields such as userId, sessionId, version, and env via a helper so traces are searchable.
Dataset drift – add new test cases to the Dataset whenever a feature is released; treat Dataset maintenance as code maintenance.
Overly long reference answers – keep ground‑truth answers short (2‑3 sentences) focusing on required key information.
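The score‑stability pitfall also has a parsing side: even at temperature 0, a judge may reply "8/10" or "Score: 8" instead of a bare number, and a naive parse then records NaN or a wrong value. A defensive parser is one possible mitigation. The sketch below is an assumption about how to handle such replies; parseJudgeScore is a hypothetical helper, not a LangSmith function.

```typescript
// Parse a judge reply such as "8", "8/10", or "Score: 7.5" into a 0..1 score.
// Returns null when no usable number is found, so the caller can skip or retry
// instead of recording NaN.
function parseJudgeScore(reply: string, maxScore: number = 10): number | null {
  // Take the first number that appears in the reply.
  const match = reply.match(/-?\d+(\.\d+)?/);
  if (!match) return null;
  const raw = Number(match[0]);
  if (!Number.isFinite(raw)) return null;
  // Clamp to the rubric's range before normalizing.
  const clamped = Math.min(maxScore, Math.max(0, raw));
  return clamped / maxScore;
}
```

Returning null rather than 0 keeps a formatting failure from being counted as a bad answer, which would otherwise drag down the experiment average.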
Summary
Zero‑code integration: three environment variables enable automatic trace capture in LangChain.js.
Trace makes a black‑box Agent transparent: every input, output, latency, and token count is visible.
Evaluation is the core advantage: create a Dataset, write LLM‑as‑Judge evaluators, run comparative experiments, and make data‑driven decisions.
Production monitoring requires a feedback loop: sampling controls cost, user feedback creates quality signals, and dashboards track latency, token usage, and error rates.
Metadata is a first‑class citizen: proper tagging enables fast root‑cause analysis across millions of traces.
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. My goal is to help you build a solid foundation in the AI era.
