Evaluating Agent Observability: A Multi‑Dimensional Framework for Behavior, Quality, and Cost

The guide outlines a comprehensive, multi‑dimensional observability framework for AI agents—covering behavior insight, quality assessment, latency and token metrics, tool‑call tracking, error tracing, and cost monitoring—while demonstrating practical implementation with OpenTelemetry, Amazon CloudWatch, and open‑source tools such as MLflow and Langfuse.

Amazon Cloud Developers

Dec 24, 2025

Introduction

The article explains why AI agents mark a paradigm shift and why traditional "Metrics→Logs→Traces" observability is insufficient for systems that make autonomous decisions. For agents, it is not enough to know what happened; teams also need to understand why it happened and what to do next.

Core Observability Elements

Agent observability is defined as a multi‑dimensional concept that extends traditional application monitoring with AI‑specific behavior characteristics. The key dimensions are:

Response‑time metrics : TotalTime, TTFT (time to first token), ModelLatency.

Token‑usage metrics : InputTokenCount, OutputTokenCount.

Tool‑usage metrics : InvocationCount, tool execution time.

Examples illustrate each metric, e.g., a request for "Paris weather" may take 500 ms to understand, 300 ms for the weather API, and 200 ms to generate the answer, totaling 1000 ms for TotalTime.
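The arithmetic in the example above can be written out as a short sketch. The stage names and the StageTiming helper are hypothetical and used only for illustration; the durations are the ones quoted in the example, and the stages are assumed to run one after another.

```python
from dataclasses import dataclass


@dataclass
class StageTiming:
    name: str
    duration_ms: int


def total_time_ms(stages):
    """TotalTime as the sum of sequential stage durations."""
    return sum(stage.duration_ms for stage in stages)


# Durations from the "Paris weather" example; stage names are illustrative.
stages = [
    StageTiming("understand_request", 500),  # model interprets the query
    StageTiming("weather_api_call", 300),    # tool execution time
    StageTiming("generate_answer", 200),     # model writes the reply
]

total = total_time_ms(stages)
tool_share = stages[1].duration_ms / total  # fraction of time spent in the tool
print(f"TotalTime={total} ms, tool share={tool_share:.0%}")
```

Breaking TotalTime into stages like this is what makes it possible to tell whether the model or a tool dominates a slow request.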

Tracing for Agents

Tracing provides a complete execution‑chain view that captures the agent’s reasoning steps, tool calls, and context propagation. Unlike metrics and logs, tracing records the full decision path, enabling developers to locate performance bottlenecks, root‑cause errors, and understand reasoning logic.

The article references Amazon X‑Ray and OpenTelemetry best practices, highlighting the need for span IDs and trace IDs to build a hierarchical execution graph.
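To make the hierarchical execution graph concrete, here is a minimal sketch that rebuilds a tree from flat span records. It assumes each span also carries the ID of its parent span, as OpenTelemetry spans do; the span names and IDs below are invented for illustration and do not come from the article.

```python
from collections import defaultdict

# Hypothetical flattened spans for one trace; parent_id is None for the root.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "invoke_agent"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "chat"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "execute_tool web_search"},
    {"trace_id": "t1", "span_id": "d", "parent_id": "a", "name": "chat"},
]


def build_tree(spans):
    """Index spans by parent so the hierarchy can be walked top-down."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span["parent_id"] is None:
            roots.append(span)
        else:
            children[span["parent_id"]].append(span)
    return roots, children


def render(span, children, depth=0):
    """Return indented lines, one per span, in depth-first order."""
    lines = ["  " * depth + span["name"]]
    for child in children[span["span_id"]]:
        lines.extend(render(child, children, depth + 1))
    return lines


roots, children = build_tree(spans)
tree_lines = render(roots[0], children)
print("\n".join(tree_lines))
```

The trace ID groups spans that belong to one request, while the span and parent IDs give the nesting: here, one agent invocation containing a model call, a tool call, and a second model call.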

OpenTelemetry Integration

Agents can embed the OpenTelemetry SDK (for example, by launching a Python agent with the opentelemetry-instrument command) so that spans are generated automatically for each operation. The article provides a sample span in JSON form:

{
  "name": "chat",
  "context": {
    "trace_id": "0x68888fcdba6326c1fc004fe9396ad6a8",
    "span_id": "0x4f4c5c4caf92a36d"
  },
  "kind": "SpanKind.CLIENT",
  "start_time": "2025-07-29T09:09:33.427326Z",
  "end_time": "2025-07-29T09:09:34.932205Z",
  "status": {"status_code": "OK"},
  "attributes": {
    "session.id": "session-1234",
    "gen_ai.system": "strands-agents",
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
    "gen_ai.usage.prompt_tokens": 443,
    "gen_ai.usage.output_tokens": 76,
    "gen_ai.usage.total_tokens": 519
  },
  "events": [
    {"name": "gen_ai.user.message", "timestamp": "2025-07-29T09:09:33.427368Z", "attributes": {"content": "[{'text':'Research and recommend suitable travel destinations...'}]"}},
    {"name": "gen_ai.choice", "timestamp": "2025-07-29T09:09:34.932167Z", "attributes": {"finish_reason": "tool_use", "message": "[{'text':'I'll search for the best traditional cultural experiences in Beijing.'}, {'toolUse':{'toolUseId':'tooluse_JSt-cJ9fRU28RmhdJ1XENA','name':'web_search','input':{'query':'Top traditional cultural attractions and experiences in Beijing 2024'}}}]"}}
  ]
}

This span shows session IDs, model names, token usage, and event details that enable end‑to‑end analysis.
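As a rough illustration of the analysis such a span supports, the sketch below derives the call duration and checks the token totals. The dictionary is a trimmed copy of the sample span above; the helper name span_duration_seconds is hypothetical.

```python
from datetime import datetime

# Trimmed copy of the sample span (only the fields used here).
span = {
    "name": "chat",
    "start_time": "2025-07-29T09:09:33.427326Z",
    "end_time": "2025-07-29T09:09:34.932205Z",
    "attributes": {
        "session.id": "session-1234",
        "gen_ai.request.model": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
        "gen_ai.usage.prompt_tokens": 443,
        "gen_ai.usage.output_tokens": 76,
        "gen_ai.usage.total_tokens": 519,
    },
}

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # matches the timestamps in the sample


def span_duration_seconds(span):
    """Elapsed time between the span's start and end timestamps."""
    start = datetime.strptime(span["start_time"], TIME_FORMAT)
    end = datetime.strptime(span["end_time"], TIME_FORMAT)
    return (end - start).total_seconds()


attrs = span["attributes"]
duration = span_duration_seconds(span)
tokens_consistent = (
    attrs["gen_ai.usage.prompt_tokens"] + attrs["gen_ai.usage.output_tokens"]
    == attrs["gen_ai.usage.total_tokens"]
)
print(f"{span['name']} took {duration:.3f} s; token totals consistent: {tokens_consistent}")
```

The same two derived values, latency per model call and tokens per call, are the raw material for the response-time and token-usage metrics listed earlier.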

Strands Agents Integration

Strands Agents already embed OpenTelemetry according to the Generative AI semantic conventions, automatically emitting spans for user messages, system prompts, model calls, and tool interactions. The article notes that this eliminates the need for custom tracing code.

Amazon CloudWatch Generative AI Observability

Amazon CloudWatch provides a managed observability service for generative AI workloads, including:

End‑to‑end prompt tracing.

Two built‑in dashboards: Model Invocations (usage, token, cost) and AgentCore agents (performance, decision metrics).

Key metrics: total and average invocation counts, token statistics, latency percentiles (P90, P99), error rates, throttling events, and cost attribution.
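For readers unfamiliar with percentile metrics, the following sketch computes P90, P99, and an error rate from a handful of invocation records. The records are made up, and the nearest-rank method shown is only one common convention; it is not claimed to be how CloudWatch computes its percentiles.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical invocation records: (latency in ms, succeeded?)
invocations = [
    (120, True), (135, True), (150, True), (160, True), (180, True),
    (210, True), (250, True), (400, False), (900, True), (2500, False),
]

latencies = [ms for ms, _ in invocations]
p90 = percentile(latencies, 90)
p99 = percentile(latencies, 99)
error_rate = sum(1 for _, ok in invocations if not ok) / len(invocations)
print(f"P90={p90} ms, P99={p99} ms, error rate={error_rate:.0%}")
```

Tail percentiles matter for agents because a single slow tool call or retry can make a small share of requests far slower than the average suggests.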

Data is sent via the OTLP protocol to CloudWatch Logs (aws/spans) and can be queried with Transaction Search, which stores spans as structured logs and supports both detailed (list) and aggregated (group) analyses.

Transaction Search

Transaction Search converts X‑Ray spans to the OpenTelemetry semantic format, stores them in a dedicated log group, and allows arbitrary span‑property queries, eliminating blind spots caused by sampling.
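The difference between detailed (list) and aggregated (group) analysis can be sketched in plain Python over span-like records. This is an illustration of the two query styles only, not the Transaction Search query syntax; the records and the helper names list_spans and group_spans are hypothetical.

```python
from collections import defaultdict

# Hypothetical span records, shaped like structured log entries.
spans = [
    {"name": "chat", "model": "model-a", "duration_ms": 1500, "status": "OK"},
    {"name": "chat", "model": "model-b", "duration_ms": 700, "status": "OK"},
    {"name": "execute_tool", "model": None, "duration_ms": 300, "status": "ERROR"},
    {"name": "chat", "model": "model-a", "duration_ms": 1100, "status": "ERROR"},
]


def list_spans(spans, **filters):
    """Detailed view: every span whose properties match all filters."""
    return [s for s in spans if all(s.get(k) == v for k, v in filters.items())]


def group_spans(spans, key):
    """Aggregated view: count and mean duration per distinct value of key."""
    buckets = defaultdict(list)
    for s in spans:
        buckets[s[key]].append(s["duration_ms"])
    return {k: {"count": len(v), "avg_ms": sum(v) / len(v)} for k, v in buckets.items()}


errors = list_spans(spans, status="ERROR")  # which individual spans failed?
by_name = group_spans(spans, "name")        # how does each operation behave overall?
print(errors)
print(by_name)
```

Because every span is stored rather than a sample, both kinds of question can be asked of any span property after the fact.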

AgentCore Observability

When running on Amazon Bedrock AgentCore Runtime, the service automatically creates CloudWatch log groups, configures IAM permissions, and pre‑populates OTEL environment variables. No manual configuration is required beyond adding the OpenTelemetry SDK.

AgentCore provides differentiated default metrics for different resource types (Agent, Memory, Gateway, Tools), such as invocation counts, session counts, error type breakdowns, and memory‑specific latency metrics.

OpenTelemetry Collector

The Collector processes telemetry with three core components:

Receivers : ingest data from agents.

Processors : transform, filter, sample, and enrich attributes (e.g., add environment tags).

Exporters : forward processed data to backends like CloudWatch.

Processors are especially valuable for Agent observability, enabling custom sampling, sensitive‑data filtering, and batch optimization.
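The real Collector is configured declaratively rather than programmed, so the following is only a conceptual Python sketch of what a processor stage does between receiving and exporting: redact sensitive attributes, add an environment tag, sample, and batch. All function names, attribute keys, and the sampling rule are assumptions made for illustration.

```python
import hashlib

SENSITIVE_KEYS = {"user.email", "content"}  # hypothetical attribute names


def redact(span):
    """Sensitive-data filtering: mask selected attribute values."""
    attrs = {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v)
             for k, v in span["attributes"].items()}
    return {**span, "attributes": attrs}


def enrich(span, environment):
    """Attribute enrichment: add an environment tag (key name is illustrative)."""
    return {**span, "attributes": {**span["attributes"], "deployment.environment": environment}}


def keep(span, sample_percent):
    """Custom sampling: always keep errors, keep a stable share of other traces."""
    if span["status"] == "ERROR":
        return True
    bucket = int(hashlib.sha256(span["trace_id"].encode()).hexdigest(), 16) % 100
    return bucket < sample_percent


def batches(items, size):
    """Batch optimization: yield fixed-size groups for export."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def process(spans, environment="prod", sample_percent=50, batch_size=2):
    kept = [enrich(redact(s), environment) for s in spans if keep(s, sample_percent)]
    return list(batches(kept, batch_size))


spans = [
    {"trace_id": "t1", "status": "OK", "attributes": {"content": "hello", "gen_ai.operation.name": "chat"}},
    {"trace_id": "t2", "status": "ERROR", "attributes": {"user.email": "a@example.com"}},
    {"trace_id": "t3", "status": "OK", "attributes": {}},
]
all_batches = process(spans, sample_percent=100)  # keep everything
errors_only = process(spans, sample_percent=0)    # keep only error spans
print(all_batches)
```

Hashing the trace ID keeps the sampling decision consistent for every span in a trace, and the error override ensures failures are never sampled away.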

Open‑Source Alternatives

Beyond AWS‑managed services, the article showcases third‑party tools:

MLflow : captures spans, inputs, outputs, and metadata for debugging. Example code demonstrates Python decorators that annotate model retrieval, agent creation, and chain execution.

Langfuse : an open‑source observability platform for LLM applications that visualizes traces, token usage, latency, and cost. The article provides a comparative case where Claude 3.7 and Amazon Nova Lite are tested on the same query; Langfuse shows Claude 3.7 has lower cost while Nova Lite has lower latency.
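The decorator style mentioned for MLflow above can be illustrated with a standard-library stand-in. This sketch does not use the MLflow or Langfuse APIs; the traced decorator, the TRACE_LOG list, and the two example functions are hypothetical and exist only to show what a tracing decorator typically records (name, inputs, output, status, duration).

```python
import functools
import time

TRACE_LOG = []  # in-memory stand-in for a tracing backend


def traced(name=None):
    """Record each call's inputs, output, status, and duration as a span-like dict."""
    def decorator(func):
        span_name = name or func.__name__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record = {"name": span_name, "inputs": {"args": args, "kwargs": kwargs}}
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                record.update(status="OK", output=result)
                return result
            except Exception as exc:
                record.update(status="ERROR", error=repr(exc))
                raise
            finally:
                # Runs on success and failure, so every call leaves a record.
                record["duration_ms"] = (time.perf_counter() - start) * 1000
                TRACE_LOG.append(record)
        return wrapper
    return decorator


@traced("retrieve_model")
def retrieve_model(model_id):
    return {"model_id": model_id}  # hypothetical placeholder for a model client


@traced()
def run_chain(model, question):
    return f"{model['model_id']} answered: {question}"


answer = run_chain(retrieve_model("demo-model"), "Where is my order?")
for record in TRACE_LOG:
    print(record["name"], record["status"], f"{record['duration_ms']:.3f} ms")
```

A real tracing library adds what this sketch omits, such as parent and child span relationships, export to a backend, and token or cost metadata.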

Practical Use‑Case: E‑commerce After‑sales Chatbot

A Strands‑based chatbot integrates multiple MCP servers to call an e‑commerce API Gateway. The article walks through the development UI, which displays model and tool calls, and then explains that in production this detail should be hidden from the UI and sent to Langfuse for secure monitoring.

Conclusion

Building a robust observability stack for AI agents—covering latency, token consumption, tool usage, error classification, and cost—allows teams to move from “seeing” agent behavior to truly understanding it. The combination of Amazon CloudWatch Generative AI Observability, Bedrock AgentCore observability, OpenTelemetry, and open‑source tools such as MLflow and Langfuse provides a complete, interoperable solution for both managed and self‑hosted deployments.

For hands‑on guidance, the article links to GitHub repositories containing quick‑start notebooks and sample code.
