How Trace Analysis Turns AI Agents from Black Boxes into Optimized Systems
Trace analysis converts the opaque decision‑making of AI agents into observable data. It enables systematic trace collection, parallel error detection, targeted improvement, and iterative experimentation, while surfacing common failure patterns, reasoning‑budget trade‑offs, over‑fitting risks, and cost‑optimization opportunities through a reusable Trace Analyzer Skill framework.
Why Trace Analysis Matters
AI agents are often treated as black boxes: you give them a task, they return a result, but you cannot see the intermediate decisions, tool calls, latency, token usage, or where the process went wrong. Without this visibility, improving the harness (the surrounding orchestration) is guesswork.
"Models are largely black boxes today; their internal mechanisms are hard to explain. However, we can observe their input and output in text space and use that information to drive our improvement loop." – LangChain engineer
LangChain’s Trace Analyzer Skill demonstrates the power of this approach: by analysing traces, they lifted a programming‑agent benchmark score from 52.8 % to 66.5 % without swapping the underlying model.
Trace Analyzer Skill: A Systematic Improvement Method
The method consists of four repeatable steps.
1. Collect Traces – Pull full execution records from LangSmith (or any compatible tracing platform). Each trace contains step‑by‑step inputs and outputs, tool‑call logs, latency, token consumption, cost, and a task‑success flag.
2. Parallel Error Analysis – Spawn multiple analysis agents, each processing a batch of traces to surface failure patterns. A master agent aggregates the findings into a unified list of common problems and improvement suggestions.
3. Targeted Improvement – Apply concrete fixes to the harness based on the aggregated patterns (e.g., refine system prompts, add verification middleware, adjust time budgets).
4. Re‑run Experiments – Execute the updated harness on the same test suite, compare metrics, and record the impact.
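The four steps above can be sketched as a minimal loop. Everything here is illustrative: `analyze_batch` is a stand‑in for a real analysis sub‑agent, and the trace records are reduced to id/success flags rather than full LangSmith traces.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    pattern: str
    affected: list

def analyze_batch(batch):
    """Sub-agent stand-in: flag traces that failed (hypothetical heuristic)."""
    failed = [t["id"] for t in batch if not t["success"]]
    return Finding("task_failure", failed) if failed else None

def improvement_loop(traces, batch_size=2):
    """One iteration of collect -> parallel analysis -> aggregate."""
    # Step 1: traces are assumed already collected into `traces`
    batches = [traces[i:i + batch_size] for i in range(0, len(traces), batch_size)]
    # Step 2: each batch is analysed independently (in parallel in a real system)
    findings = [f for f in map(analyze_batch, batches) if f]
    # Step 2 (master agent): merge findings into one list per pattern
    aggregated = {}
    for f in findings:
        aggregated.setdefault(f.pattern, []).extend(f.affected)
    # Steps 3-4 would patch the harness and re-run the same suite
    return aggregated

traces = [{"id": "t1", "success": True}, {"id": "t2", "success": False},
          {"id": "t3", "success": False}, {"id": "t4", "success": True}]
print(improvement_loop(traces))  # {'task_failure': ['t2', 't3']}
```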
Typical Failure Patterns Discovered
Reasoning Error – The agent makes a wrong inference early on, steering the whole workflow off‑track. Fix: improve the system prompt’s analysis framework and add planning‑stage checks.
Non‑Compliance – The agent ignores explicit task instructions and follows its own interpretation. Fix: strengthen instruction‑following prompts and insert a PreCompletionChecklistMiddleware.
Missing Verification – Code is generated but never tested. Fix: adopt a "Build & Verify" pattern with a verification middleware.
Timeout – Complex tasks exceed the time budget. Fix: inject time‑budget warnings, split tasks into smaller sprints, and optimise reasoning budget allocation.
Infinite Loop – The same file is edited repeatedly without progress. Fix: use LoopDetectionMiddleware to limit edit iterations.
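As a sketch of that last fix, a minimal LoopDetectionMiddleware might count edits per file and block the call once a threshold is exceeded. The `on_tool_call` hook and tool names here are assumptions for illustration, not a real LangChain API.

```python
from collections import Counter

class LoopDetectionMiddleware:
    """Block repeated edits to the same file (hypothetical middleware hook)."""

    def __init__(self, max_edits: int = 5):
        self.max_edits = max_edits
        self.edit_counts = Counter()

    def on_tool_call(self, tool_name: str, args: dict) -> bool:
        """Return False to block the call once an edit loop is detected."""
        if tool_name == "edit_file":
            path = args.get("path")
            self.edit_counts[path] += 1
            if self.edit_counts[path] > self.max_edits:
                return False  # likely stuck editing the same file
        return True

mw = LoopDetectionMiddleware(max_edits=2)
results = [mw.on_tool_call("edit_file", {"path": "a.py"}) for _ in range(3)]
print(results)  # [True, True, False]
```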
Reasoning Budget Optimisation (The "Reasoning Sandwich")
GPT‑5.2‑Codex offers four reasoning modes: low, medium, high, and xhigh. Higher modes improve reasoning quality but consume more time and tokens. Trace data showed that always using xhigh actually lowered the overall score because many tasks timed out.
The solution is a "Reasoning Sandwich": allocate the highest budget to planning and verification (the stages that need careful thought) while using a medium budget for the implementation stage, which is more mechanical.
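A minimal sketch of the sandwich as configuration, assuming a harness that lets you set the reasoning mode per stage (the stage names and the mapping itself are illustrative):

```python
# Hypothetical stage -> reasoning-mode mapping for the "Reasoning Sandwich":
# heavy thinking at the ends, a cheaper budget for the mechanical middle.
REASONING_SANDWICH = {
    "planning": "xhigh",         # careful thought up front
    "implementation": "medium",  # mostly mechanical
    "verification": "xhigh",     # careful thought at the end
}

def reasoning_mode(stage: str) -> str:
    """Pick a reasoning budget for a pipeline stage; unknown stages get 'medium'."""
    return REASONING_SANDWICH.get(stage, "medium")

print([reasoning_mode(s) for s in ("planning", "implementation", "verification")])
# ['xhigh', 'medium', 'xhigh']
```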
Avoiding Over‑Fitting
When improvements target only a few failing tasks, the harness may over‑fit: performance rises on those tasks but degrades elsewhere. LangChain engineers warned, "Targeted fixes that only improve a handful of traces can cause regression on other tasks." To guard against this:
Analyse a large enough sample of traces to find universal patterns.
Validate every change on the full test suite, not just the failing cases.
Include human review to spot over‑fitting risk.
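One way to make the full‑suite validation concrete is a per‑task regression check. This is a sketch assuming you record pass/fail per task id for both the baseline and the candidate harness:

```python
def regression_check(baseline: dict, candidate: dict) -> dict:
    """Compare per-task pass/fail before and after a harness change.

    Both arguments map task_id -> bool. Returning both newly fixed and
    newly broken tasks catches a "fix" that helps 3 tasks but breaks 5.
    """
    fixed = sorted(t for t in baseline if not baseline[t] and candidate.get(t))
    broken = sorted(t for t in baseline if baseline[t] and not candidate.get(t, True))
    return {"fixed": fixed, "broken": broken, "net": len(fixed) - len(broken)}

baseline = {"t1": True, "t2": False, "t3": True, "t4": False}
candidate = {"t1": True, "t2": True, "t3": False, "t4": True}
print(regression_check(baseline, candidate))
# {'fixed': ['t2', 't4'], 'broken': ['t3'], 'net': 1}
```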
Building Your Own Trace‑Analysis System
Below is a concise decision guide for selecting a tracing backend.
Ecosystem Binding
LangSmith – tightly bound to LangChain, SaaS with free tier.
Langfuse – framework‑agnostic, open‑source, self‑hostable.
Self‑built – fully custom, highest flexibility.
Deployment Model
LangSmith – cloud SaaS / private‑hosted.
Langfuse – open‑source, self‑hosted.
Self‑built – fully self‑hosted.
Cost
LangSmith – pay‑as‑you‑go with free quota.
Langfuse – free software, only server costs.
Self‑built – only infrastructure cost.
Ease of Adoption
LangSmith – ★ extremely low (one‑line integration).
Langfuse – ★★ moderate (install and configure).
Self‑built – ★★★★ high (full development effort).
Choose the option that matches your stage:
# Decision tree (no styling)
Your situation?
│
├─ Just exploring traces? → use LangSmith (free tier)
├─ Need production reliability? → use Langfuse (open‑source self‑hosted)
├─ Have compliance or security constraints? → self‑built or Langfuse private deployment
└─ Using a non‑LangChain framework? → Langfuse or self‑built
Step‑by‑Step Implementation
Select a tracing tool (see guide above).
Define key metrics such as task success rate, average completion time, token consumption, tool‑call count, and failure‑type distribution.
Establish an analysis workflow – run a weekly trace aggregation, detect failure patterns, and synthesize improvement suggestions.
Build an experiment framework – after each harness change, evaluate on the standard test set and record metrics.
Log change history – store every harness modification together with its impact for future reference.
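As a sketch of step 2, the key metrics can be computed from minimal per‑run records. The field names here are assumptions for illustration, not a fixed schema:

```python
def summarize_runs(runs: list) -> dict:
    """Aggregate key metrics from per-run records.

    Each run is a dict with hypothetical keys: success (bool), duration_s
    (float), tokens (int), tool_calls (int), failure_type (str or None).
    """
    n = len(runs) or 1
    failure_types = {}
    for r in runs:
        if r.get("failure_type"):
            failure_types[r["failure_type"]] = failure_types.get(r["failure_type"], 0) + 1
    return {
        "success_rate": sum(r["success"] for r in runs) / n,
        "avg_duration_s": sum(r["duration_s"] for r in runs) / n,
        "avg_tokens": sum(r["tokens"] for r in runs) / n,
        "avg_tool_calls": sum(r["tool_calls"] for r in runs) / n,
        "failure_distribution": failure_types,
    }

runs = [
    {"success": True,  "duration_s": 40, "tokens": 1200, "tool_calls": 6,  "failure_type": None},
    {"success": False, "duration_s": 90, "tokens": 3000, "tool_calls": 14, "failure_type": "timeout"},
]
print(summarize_runs(runs))
```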
Full Production‑Grade Trace Analyzer Skill (Python)
The following code implements a reusable, extensible trace‑analysis pipeline. It includes data models, a heuristic analyzer, an optional LLM‑backed analyzer, a parallel engine, cost analysis, regression‑risk detection, and an experiment tracker.
"""
Trace Analyzer Skill – complete implementation
Features:
1. Collect and parse Agent run traces
2. Parallel launch of multiple sub‑agents for failure‑pattern detection
3. Aggregate results into a structured report
4. Provide harness improvement suggestions
5. Auto‑detect over‑fitting risk
"""
import json
import time
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Any, Optional
from dataclasses import dataclass, field
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor, as_completed
# ==================== Data models ====================
@dataclass
class TraceStep:
    """Single execution step"""
    step_id: str
    step_type: str        # "llm_call" | "tool_call" | "middleware" | "decision"
    input_summary: str    # truncated input
    output_summary: str   # truncated output
    duration_ms: int
    token_count: Optional[Dict[str, int]] = None  # {"input": N, "output": N}
    error: Optional[str] = None
    metadata: Dict = field(default_factory=dict)

@dataclass
class TraceRecord:
    """Complete trace record"""
    trace_id: str
    task_description: str
    task_success: bool
    total_duration_ms: int
    total_tokens: int
    total_cost_usd: float
    steps: List[TraceStep]
    model_info: Dict = field(default_factory=dict)
    harness_config: Dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

@dataclass
class FailurePattern:
    """Identified failure pattern"""
    pattern_id: str
    pattern_name: str
    category: str   # reasoning / compliance / verification / timeout / loop
    description: str
    severity: str   # critical / high / medium / low
    affected_traces: List[str]
    frequency: float
    suggested_fix: str
    related_harness_component: str
    evidence_snippets: List[str]

@dataclass
class AnalysisReport:
    """Analysis report"""
    analysis_id: str
    analyzed_trace_count: int
    time_range: tuple
    summary: Dict[str, Any]
    failure_patterns: List[FailurePattern]
    improvement_suggestions: List[Dict]
    regression_risk: List[Dict]
    cost_analysis: Dict
    generated_at: str
# ==================== Failure pattern library ====================
BUILTIN_FAILURE_PATTERNS = {
    "reasoning_error": {
        "name": "Reasoning error",
        "category": "reasoning",
        "description": "The agent draws a wrong inference while analysing the problem, sending the whole run in the wrong direction",
        "detection_heuristics": [
            "An early step makes a decision inconsistent with the task goal",
            "Later steps keep building on the wrong premise",
            "The final output diverges widely from the original requirement"
        ],
        "suggested_fix": "Improve the analysis framework in the system prompt; add constraint checks at the planning stage",
        "harness_component": "system prompt"
    },
    "non_compliance": {
        "name": "Instruction non-compliance",
        "category": "compliance",
        "description": "The agent does not follow the task requirements and instead acts on its own interpretation",
        "detection_heuristics": [
            "Key points in the task requirements are not covered",
            "The agent adds features that were not requested",
            "Explicitly required steps are skipped"
        ],
        "suggested_fix": "Strengthen instruction-following language in the system prompt; add a PreCompletionChecklistMiddleware",
        "harness_component": "system prompt + middleware"
    },
    "missing_verification": {
        "name": "Missing verification",
        "category": "verification",
        "description": "The agent stops as soon as the code is written, without running tests or verifying the result",
        "detection_heuristics": [
            "The last step is code output rather than a test run",
            "There is no tool call of type test_run",
            "Total step count is below the expected minimum"
        ],
        "suggested_fix": "Adopt the Build & Verify pattern; add a PreCompletionChecklistMiddleware",
        "harness_component": "middleware + verification layer + system prompt"
    },
    "timeout": {
        "name": "Timeout",
        "category": "timeout",
        "description": "The task is too complex and the agent fails to finish within the time limit",
        "detection_heuristics": [
            "The trace ends with a timeout/error",
            "The last few steps speed up noticeably (context anxiety)",
            "Total duration exceeds the threshold"
        ],
        "suggested_fix": "Inject time-budget warnings; split the task into smaller sprints; optimise reasoning-budget allocation",
        "harness_component": "middleware + execution flow + context management"
    },
    "infinite_loop": {
        "name": "Infinite loop",
        "category": "loop",
        "description": "The agent keeps revising the same problem and cannot break out",
        "detection_heuristics": [
            "The same file is edited more than N times",
            "Adjacent steps are highly similar (editing similar content)",
            "The number of no-progress iterations exceeds the threshold"
        ],
        "suggested_fix": "Implement a LoopDetectionMiddleware; set a maximum iteration limit",
        "harness_component": "middleware"
    }
}
# ==================== Analyzer interfaces ====================
class BaseTraceAnalyzer(ABC):
    """Base class for trace analyzers"""

    @abstractmethod
    def analyze(self, traces: List[TraceRecord]) -> List[FailurePattern]:
        """Analyse traces and return discovered failure patterns"""
        pass

    @property
    @abstractmethod
    def name(self) -> str:
        """Analyzer name"""
        pass

class HeuristicAnalyzer(BaseTraceAnalyzer):
    """Heuristic-rule based analyzer"""

    def __init__(self, patterns: Dict = None):
        self.patterns = patterns or BUILTIN_FAILURE_PATTERNS

    @property
    def name(self) -> str:
        return "heuristic_analyzer"

    def analyze(self, traces: List[TraceRecord]) -> List[FailurePattern]:
        patterns_found = []
        for pattern_key, pattern_def in self.patterns.items():
            affected = [t.trace_id for t in traces if self._matches_pattern(t, pattern_def)]
            if affected:
                frequency = len(affected) / len(traces)
                patterns_found.append(FailurePattern(
                    pattern_id=pattern_key,
                    pattern_name=pattern_def['name'],
                    category=pattern_def['category'],
                    description=pattern_def['description'],
                    severity=self._calculate_severity(frequency),
                    affected_traces=affected,
                    frequency=frequency,
                    suggested_fix=pattern_def['suggested_fix'],
                    related_harness_component=pattern_def['harness_component'],
                    evidence_snippets=self._extract_evidence(affected[:3], traces)
                ))
        patterns_found.sort(key=lambda p: p.frequency, reverse=True)
        return patterns_found

    def _matches_pattern(self, trace: TraceRecord, pattern_def: Dict) -> bool:
        """Check whether a trace matches a failure pattern.

        The 'detection_heuristics' entries are documentation for humans;
        the checks below are machine-checkable approximations of them.
        """
        if pattern_def['category'] == 'verification':
            has_test = any('test' in s.step_type.lower() or 'verify' in s.step_type.lower()
                           for s in trace.steps)
            return not has_test and not trace.task_success
        elif pattern_def['category'] == 'loop':
            from collections import Counter
            edit_counts = Counter(s.metadata.get('file_edited')
                                  for s in trace.steps if s.metadata.get('file_edited'))
            return any(c > 5 for c in edit_counts.values())
        elif pattern_def['category'] == 'timeout':
            return (not trace.task_success) and (trace.total_duration_ms > 300_000)  # > 5 min
        # other categories could have more sophisticated checks
        return False

    def _calculate_severity(self, frequency: float) -> str:
        if frequency >= 0.4:
            return "critical"
        elif frequency >= 0.2:
            return "high"
        elif frequency >= 0.1:
            return "medium"
        return "low"

    def _extract_evidence(self, trace_ids: List[str], all_traces: List[TraceRecord]) -> List[str]:
        snippets = []
        for tid in trace_ids:
            trace = next((t for t in all_traces if t.trace_id == tid), None)
            if trace and trace.steps:
                last_step = trace.steps[-1]
                snippets.append(
                    f"Trace {tid}: final step '{last_step.step_type}' -> "
                    f"{last_step.output_summary[:100]}..."
                )
        return snippets
class LLMAnalyzer(BaseTraceAnalyzer):
    """LLM-backed semantic analyzer (framework placeholder)"""

    def __init__(self, llm_client=None):
        self.client = llm_client

    @property
    def name(self) -> str:
        return "llm_analyzer"

    def analyze(self, traces: List[TraceRecord]) -> List[FailurePattern]:
        failed = [t for t in traces if not t.task_success]
        if not failed:
            return []
        prompt = self._build_analysis_prompt(failed)
        # In a real system, send `prompt` to the LLM here. For illustration we mock the result.
        return self._mock_llm_analysis(failed)

    def _build_analysis_prompt(self, traces: List[TraceRecord]) -> str:
        traces_data = []
        for t in traces[:10]:  # limit sample size
            steps_summary = "\n".join(
                f"  [{s.step_type}] {s.input_summary[:80]} -> {s.output_summary[:80]}"
                for s in t.steps[:15]
            )
            traces_data.append(
                f"### Trace {t.trace_id} (success={t.task_success}, "
                f"duration={t.total_duration_ms // 1000}s)\n{steps_summary}"
            )
        return (
            "You are an expert AI-agent trace analyst. Analyse the following failed "
            "agent runs and identify their common failure patterns.\n\n"
            + "\n\n".join(traces_data)
            + '\n\nReturn your findings as JSON:\n'
              '{\n'
              '  "patterns": [\n'
              '    {\n'
              '      "name": "pattern name",\n'
              '      "category": "reasoning/compliance/verification/timeout/loop/other",\n'
              '      "description": "what the pattern looks like",\n'
              '      "severity": "critical/high/medium/low",\n'
              '      "frequency": occurrence_ratio,\n'
              '      "suggested_fix": "how to fix it",\n'
              '      "harness_component": "related harness component"\n'
              '    }\n'
              '  ]\n'
              '}\n'
        )

    def _mock_llm_analysis(self, traces: List[TraceRecord]) -> List[FailurePattern]:
        # Placeholder – return an empty list for the demo
        return []
# ==================== Main analysis engine ====================
class TraceAnalysisEngine:
    """Coordinates multiple analyzers, merges results, and produces a report"""

    def __init__(self, analyzers: List[BaseTraceAnalyzer] = None):
        self.analyzers = analyzers or [HeuristicAnalyzer(), LLMAnalyzer()]
        self.analysis_history: List[AnalysisReport] = []

    def run_analysis(self, traces: List[TraceRecord], max_workers: int = 4) -> AnalysisReport:
        if not traces:
            raise ValueError("run_analysis requires at least one trace")
        print(f"[TraceAnalyzer] Analysing {len(traces)} traces...")
        start_time = time.time()
        all_patterns = []
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(a.analyze, traces): a.name for a in self.analyzers}
            for future in as_completed(futures):
                analyzer_name = futures[future]
                try:
                    patterns = future.result()
                    print(f"[TraceAnalyzer] {analyzer_name} finished, found {len(patterns)} patterns")
                    all_patterns.extend(patterns)
                except Exception as e:
                    print(f"[TraceAnalyzer] {analyzer_name} failed: {e}")
        merged_patterns = self._deduplicate_patterns(all_patterns, total_traces=len(traces))
        suggestions = self._generate_suggestions(merged_patterns)
        regression_risks = self._detect_regression_risks(merged_patterns)
        cost_analysis = self._analyze_costs(traces)
        report = AnalysisReport(
            analysis_id=f"analysis_{int(time.time())}",
            analyzed_trace_count=len(traces),
            time_range=(
                datetime.fromtimestamp(min(t.timestamp for t in traces)).isoformat(),
                datetime.fromtimestamp(max(t.timestamp for t in traces)).isoformat()
            ),
            summary={
                "total_traces": len(traces),
                "success_rate": sum(1 for t in traces if t.task_success) / len(traces),
                "avg_duration_ms": sum(t.total_duration_ms for t in traces) / len(traces),
                "avg_cost_usd": sum(t.total_cost_usd for t in traces) / len(traces),
                "patterns_found": len(merged_patterns)
            },
            failure_patterns=merged_patterns,
            improvement_suggestions=suggestions,
            regression_risk=regression_risks,
            cost_analysis=cost_analysis,
            generated_at=datetime.now().isoformat()
        )
        self.analysis_history.append(report)
        print(f"[TraceAnalyzer] Analysis finished in {time.time() - start_time:.1f}s")
        return report

    def _deduplicate_patterns(self, patterns: List[FailurePattern], total_traces: int) -> List[FailurePattern]:
        """Merge patterns of the same category reported by different analyzers."""
        unique = []
        seen = {}
        for p in patterns:
            if p.category not in seen:
                seen[p.category] = p
                unique.append(p)
            else:
                existing = seen[p.category]
                existing.affected_traces = list(set(existing.affected_traces + p.affected_traces))
                # recompute frequency against the full trace count after merging
                existing.frequency = len(existing.affected_traces) / max(total_traces, 1)
                existing.evidence_snippets.extend(p.evidence_snippets)
        return unique

    def _generate_suggestions(self, patterns: List[FailurePattern]) -> List[Dict]:
        suggestions = []
        by_component = {}
        for p in patterns:
            by_component.setdefault(p.related_harness_component, []).append(p)
        priority_map = {'critical': 4, 'high': 3, 'medium': 2, 'low': 1}
        for component, comp_patterns in by_component.items():
            most_severe = max(comp_patterns, key=lambda p: priority_map.get(p.severity, 0))
            suggestions.append({
                "component": component,
                "priority": most_severe.severity,
                "problem": f"{most_severe.pattern_name} ({most_severe.frequency:.0%} occurrence rate)",
                "action": most_severe.suggested_fix,
                "affected_patterns": [p.pattern_id for p in comp_patterns],
                "estimated_effort": self._estimate_effort(most_severe)
            })
        # highest priority first
        suggestions.sort(key=lambda s: priority_map.get(s['priority'], 0), reverse=True)
        return suggestions

    def _detect_regression_risks(self, patterns: List[FailurePattern]) -> List[Dict]:
        risks = []
        for p in patterns:
            if len(p.affected_traces) <= 2 and p.severity != 'critical':
                risks.append({
                    "level": "warning",
                    "pattern": p.pattern_name,
                    "reason": f"This pattern affects only {len(p.affected_traces)} trace(s); a targeted fix may over-fit",
                    "recommendation": "Collect more data to confirm whether this is a widespread problem"
                })
        if len(self.analysis_history) >= 2:
            prev = self.analysis_history[-2]
            prev_ids = {p.pattern_id for p in prev.failure_patterns}
            for np in [p for p in patterns if p.pattern_id not in prev_ids][:3]:
                risks.append({
                    "level": "info",
                    "pattern": np.pattern_name,
                    "reason": "Newly observed failure pattern (absent from the previous analysis)",
                    "recommendation": "Watch whether it keeps appearing in later analyses"
                })
        return risks

    def _analyze_costs(self, traces: List[TraceRecord]) -> Dict:
        successful = [t for t in traces if t.task_success]
        failed = [t for t in traces if not t.task_success]
        return {
            "total_spent": sum(t.total_cost_usd for t in traces),
            "avg_per_trace": sum(t.total_cost_usd for t in traces) / (len(traces) or 1),
            "avg_successful": sum(t.total_cost_usd for t in successful) / (len(successful) or 1),
            "avg_failed": sum(t.total_cost_usd for t in failed) / (len(failed) or 1),
            "wasted_on_failures": sum(t.total_cost_usd for t in failed),
            "optimization_candidates": self._find_cost_optimization_candidates(traces)
        }

    def _find_cost_optimization_candidates(self, traces: List[TraceRecord]) -> List[Dict]:
        candidates = []
        for trace in traces:
            for step in trace.steps:
                # token_count holds {"input": N, "output": N}; flag steps over 5k total tokens
                tokens = sum((step.token_count or {}).values())
                if tokens > 5000:
                    candidates.append({
                        "trace_id": trace.trace_id,
                        "step_type": step.step_type,
                        "tokens_used": tokens,
                        "suggestion": f"Consider compressing the input/output of {step.step_type} steps"
                    })
        candidates.sort(key=lambda c: c['tokens_used'], reverse=True)
        return candidates[:5]

    def _estimate_effort(self, pattern: FailurePattern) -> str:
        effort_map = {
            "system prompt": "low (prompt text change only)",
            "middleware": "medium (requires writing code)",
            "verification layer": "medium (code plus tests)",
            "tool set": "higher (may require integrating new tools)",
            "context management": "medium (restructure documents)",
            "execution flow": "medium (adjust orchestration logic)"
        }
        return effort_map.get(pattern.related_harness_component, "unknown")
# ==================== Experiment tracking ====================
class ExperimentTracker:
    """Records each harness modification and its impact"""

    def __init__(self, storage_path: str = "./experiments"):
        self.storage = Path(storage_path)
        self.storage.mkdir(exist_ok=True)
        self.history_file = self.storage / "history.jsonl"

    def record_experiment(self, experiment_id: str, harness_changes: Dict,
                          baseline_metrics: Dict, experiment_metrics: Dict,
                          notes: str = "") -> Dict:
        deltas = {}
        for key in baseline_metrics:
            if key in experiment_metrics:
                old_val = baseline_metrics[key]
                new_val = experiment_metrics[key]
                if isinstance(old_val, (int, float)) and isinstance(new_val, (int, float)):
                    delta = new_val - old_val
                    pct = (delta / old_val * 100) if old_val != 0 else 0
                    deltas[key] = {"before": old_val, "after": new_val,
                                   "delta": round(delta, 4), "delta_pct": round(pct, 2)}
        record = {
            "experiment_id": experiment_id,
            "timestamp": datetime.now().isoformat(),
            "harness_changes": harness_changes,
            "baseline": baseline_metrics,
            "experiment": experiment_metrics,
            "deltas": deltas,
            "verdict": self._make_verdict(deltas),
            "notes": notes
        }
        with open(self.history_file, 'a', encoding='utf-8') as f:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')
        return record

    def _make_verdict(self, deltas: Dict) -> str:
        success = deltas.get('success_rate', {})
        if success and success.get('delta_pct', 0) > 5:
            return "🟢 effective improvement"
        elif success and success.get('delta_pct', 0) < -3:
            return "🔴 regression (got worse)"
        else:
            return "🟡 no clear effect"

    def get_history(self, limit: int = 20) -> List[Dict]:
        if not self.history_file.exists():
            return []
        records = []
        with open(self.history_file, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
        return records[-limit:]

Broader Value of Trace Analysis
Beyond harness optimisation, trace analysis helps you:
Detect tool problems – sometimes the agent deviates because a downstream tool returns incorrect or ambiguous data.
Understand model behaviour – large‑scale trace data reveals when the model performs well and when it tends to fail, informing future prompt engineering.
Optimise costs – detailed token usage per step lets you prune high‑cost, low‑impact operations.
Key Takeaways
Systematically collect full traces, not just success/failure flags.
Identify universal failure patterns to avoid over‑fitting to a few cases.
Establish an experiment framework so every change is backed by data rather than intuition.
Qborfy AI
A knowledge base that logs daily experiences and learning journeys, sharing them with you to grow together.