Mastering AI Application Observability: From Metrics to Full‑Stack Tracing
This article explains why cost and performance are critical in the AI era, outlines the three main pain points of AI application development, and details a full‑stack observability solution—including architecture layers, key metrics like TTFT and TPOT, OpenTelemetry tracing, and practical tips for frameworks such as Dify—integrated into Alibaba Cloud CloudMonitor 2.0.
In the AI era, rapid evolution of models and applications makes inference cost and performance critical, and end-to-end AI observability is essential.
The AI ecosystem can be divided into three parts: models (e.g., DeepSeek, Qwen), development frameworks (LangChain, LlamaIndex, Spring AI, low‑code platforms), and AI applications (chatbots, Copilot, agents).
When developing AI applications, three pain points arise: how to use models reliably, how to control their cost, and how to ensure output quality.
Usage: inconsistent responses across identical prompts and occasional request stalls.
Cost: token consumption varies; need clear visibility of input vs output tokens.
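To make the cost point concrete, here is a minimal sketch of estimating per-request spend from separate input and output token counts. The prices in `PRICING` are illustrative placeholders, not real model rates:

```python
# Hypothetical per-1K-token prices; real prices depend on the model and provider.
PRICING = {"input": 0.0005, "output": 0.0015}

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one model call from its token counts.

    Input and output tokens are priced separately, which is why
    observability needs to report them as distinct metrics.
    """
    return (input_tokens / 1000) * PRICING["input"] \
         + (output_tokens / 1000) * PRICING["output"]

print(request_cost(1200, 300))
```

Because output tokens are typically several times more expensive than input tokens, a dashboard that only shows total tokens hides where the money actually goes.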
Quality: ensure answers stay within expected range and avoid unsafe or hallucinated outputs.
A typical AI application architecture consists of three layers: user‑facing front‑ends (iOS, Android, mini‑programs) behind an API gateway, the AI application layer (Python or Java services), and the model service layer (multiple models behind an AI gateway that handles traffic protection, token limiting, content filtering, and caching).
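One way to picture the AI gateway's token-limiting role is a budget that is counted in LLM tokens rather than requests. The sketch below is an assumption about how such a limiter could work (a refill-over-time token budget per client), not the actual gateway implementation:

```python
import time

class TokenBudgetLimiter:
    """Illustrative per-client limiter counted in LLM tokens, not requests.

    The budget refills continuously at `tokens_per_minute / 60` tokens
    per second, capped at one minute's worth of tokens.
    """

    def __init__(self, tokens_per_minute: int):
        self.rate = tokens_per_minute / 60.0
        self.capacity = tokens_per_minute
        self.budget = float(tokens_per_minute)
        self.last = time.monotonic()

    def allow(self, tokens_requested: int) -> bool:
        # Refill the budget for the time elapsed since the last check.
        now = time.monotonic()
        self.budget = min(self.capacity, self.budget + (now - self.last) * self.rate)
        self.last = now
        if tokens_requested <= self.budget:
            self.budget -= tokens_requested
            return True
        return False

limiter = TokenBudgetLimiter(tokens_per_minute=6000)
print(limiter.allow(1000))  # within budget
```

Limiting by tokens instead of requests matters because a single long-context request can cost as much as hundreds of short ones.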
Full‑stack observability requires tracing the entire request path and building a unified observability data platform that correlates trace data with metrics such as GPU utilization, KV‑cache hit rate, and other model‑side indicators.
Key AI‑specific metrics are Token, Error, and Duration. Within model inference, two stages are measured: TTFT (time‑to‑first‑token) during the Prefill phase and TPOT (time‑per‑output‑token) during the Decode phase. Through these metrics, one can balance latency and throughput for online versus batch scenarios.
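Given a stream of token arrival timestamps, both metrics fall out of simple arithmetic. This is a minimal sketch of the computation (timestamps in seconds are assumed inputs; collecting them from a streaming API is left out):

```python
def ttft_and_tpot(request_ts: float, token_ts: list[float]) -> tuple[float, float]:
    """Compute TTFT and TPOT from a request start time and per-token arrival times.

    TTFT covers the Prefill phase: the wait until the first token arrives.
    TPOT covers the Decode phase: the average gap between subsequent tokens.
    """
    ttft = token_ts[0] - request_ts
    if len(token_ts) > 1:
        tpot = (token_ts[-1] - token_ts[0]) / (len(token_ts) - 1)
    else:
        tpot = 0.0  # a single-token reply has no decode gaps to average
    return ttft, tpot

ttft, tpot = ttft_and_tpot(0.0, [0.8, 0.85, 0.90, 0.95])
# ttft ≈ 0.8 s (Prefill), tpot ≈ 0.05 s per output token (Decode)
```

Online chat scenarios optimize TTFT so the user sees output quickly, while batch workloads tolerate a higher TTFT in exchange for throughput, which is why the two phases are measured separately.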
We implement tracing with OpenTelemetry, instrumenting the client, API gateway, AI application, AI gateway, and model layers. Manual instrumentation is used for the gateway, while automatic, non‑intrusive instrumentation covers the application and model layers.
For Python services we extend the OpenTelemetry Python agent to support popular AI frameworks (Dify, LangChain, LlamaIndex) and to handle multi‑process (Gunicorn) and coroutine (gevent) environments without stability issues.
Practical tuning for Dify includes raising the Nginx upload size limit, enlarging the PostgreSQL connection pool, replacing Redis‑backed queues with RocketMQ, and moving file storage to cloud object storage.
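For the Nginx piece specifically, the relevant directives look roughly like the fragment below. The 100 MB limit and timeout values are illustrative assumptions, not recommendations from the article; size them to your actual upload and streaming needs:

```nginx
# Allow larger file uploads through the proxy in front of Dify
# (the default client_max_body_size is only 1 MB).
client_max_body_size 100m;

# Streaming LLM responses can stay open for a long time between bytes,
# so the proxy read timeout usually needs raising as well.
proxy_read_timeout 300s;
```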
Observability also covers model quality evaluation: input/output logs are sent to a log platform, where external judge models assess correctness, toxicity, hallucinations, and compliance, with results clustered and visualized.
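The evaluation loop over logged question/answer pairs can be sketched as below. The `judge_model` function here is a deliberately trivial stub (a blocklist check); in the real pipeline it would be a call to an external judge LLM scoring correctness, toxicity, hallucination, and compliance:

```python
# Tiny illustrative blocklist standing in for a real safety judge.
UNSAFE_TERMS = {"password", "exploit"}

def judge_model(question: str, answer: str) -> dict:
    """Stub judge: a real system sends the pair to an external LLM for scoring."""
    return {
        "unsafe": any(term in answer.lower() for term in UNSAFE_TERMS),
        "empty": not answer.strip(),
    }

def evaluate_logs(logs: list[dict]) -> dict:
    """Aggregate judge verdicts over (question, answer) log records."""
    verdicts = [judge_model(rec["question"], rec["answer"]) for rec in logs]
    return {
        "total": len(verdicts),
        "unsafe": sum(v["unsafe"] for v in verdicts),
        "empty": sum(v["empty"] for v in verdicts),
    }

stats = evaluate_logs([
    {"question": "q1", "answer": "here is the admin password"},
    {"question": "q2", "answer": "normal helpful reply"},
])
print(stats)  # {'total': 2, 'unsafe': 1, 'empty': 0}
```

Running the judge offline over the log platform, rather than inline in the request path, keeps evaluation cost and latency out of user-facing traffic.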
Future work includes adding observability for MCP tools (tracking token consumption and latency of each tool call) and releasing this feature by the end of May.
All of these capabilities are integrated into Alibaba Cloud CloudMonitor 2.0, providing unified dashboards for token usage, request counts, topology, and detailed session analysis across AI applications, gateways, and underlying models.
Alibaba Cloud Observability
Driving continuous progress in observability technology!