How to Build a Full‑Stack Observability System for Production‑Grade AI Agents
This article explains how to design and implement a comprehensive, cloud‑native observability framework for AI applications, covering architecture layers, key metrics such as token usage, TTFT and TPOT, OpenTelemetry tracing, Dify deployment tips, model evaluation, and MCP token‑blackhole challenges.
In the AI era, models and applications evolve rapidly, making inference cost and performance critical and end-to-end AI observability essential. The AI application ecosystem can be divided into three parts: emerging models (e.g., DeepSeek, Qwen), development frameworks (LangChain, LlamaIndex, Spring AI, low-code platforms like Dify and Coze), and AI agents ranging from chatbots to Copilot-style assistants.
Key Challenges in AI Application Development
Developers face three main pain points: (1) Usability – inconsistent responses across model versions and occasional request hangs; (2) Cost efficiency – token consumption varies per request and needs clear visibility; (3) Quality – ensuring generated answers meet expectations without hallucinations or policy violations.
Typical AI Application Architecture
A three‑layer architecture is common: the user/business layer (iOS, Android, mini‑programs), the AI application layer (Python or Java services acting as AI agents), and the model service layer. Requests pass through an API gateway (e.g., Higress) for traffic protection, then flow to the AI layer, which may route to multiple models based on cost or priority via an AI gateway that handles token throttling and content filtering.
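The cost- or priority-based routing decision made at the AI gateway can be sketched as a small policy function. A minimal illustration in Python; the model names, prices, and the `route_request` helper are hypothetical and not part of any gateway API:

```python
# Hypothetical model catalog; names and per-token prices are illustrative only.
MODELS = [
    {"name": "deepseek-v3", "cost_per_1k_tokens": 0.001, "tier": "standard"},
    {"name": "qwen-max", "cost_per_1k_tokens": 0.004, "tier": "premium"},
]

def route_request(priority: str) -> str:
    """Pick the cheapest model allowed for the request's priority tier."""
    if priority == "premium":
        candidates = [m for m in MODELS if m["tier"] == "premium"]
    else:
        candidates = MODELS
    return min(candidates, key=lambda m: m["cost_per_1k_tokens"])["name"]
```

A real gateway would layer token throttling and content filtering around this decision; the sketch only shows the routing policy itself.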
Full‑Stack Observability Approach
Observability is achieved through three mechanisms:
Tracing the entire call chain from client to model to pinpoint failures.
Collecting metrics across layers, including GPU utilization, KV‑cache hit rates, and token consumption.
Logging model inputs/outputs for quality assessment.
These data are visualized in Alibaba Cloud’s unified monitoring platform, covering user experience, application latency, gateway health, model inference metrics, and infrastructure utilization.
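Per-layer metric collection can be pictured as a small aggregator keyed by model. A minimal sketch, assuming a simple token-counting shape; the class and field names are illustrative, and GPU utilization or KV-cache hit rates would be scraped from the inference layer in the same fashion:

```python
from collections import defaultdict

class MetricAggregator:
    """Toy in-process aggregator for per-model token consumption."""

    def __init__(self):
        self.tokens = defaultdict(int)

    def record(self, model: str, prompt_tokens: int, completion_tokens: int):
        # Both prompt and completion tokens are billed, so count both.
        self.tokens[model] += prompt_tokens + completion_tokens

    def total(self, model: str) -> int:
        return self.tokens[model]
```

In production these counters would be exported to the monitoring platform rather than held in memory.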
Trace‑Based End‑to‑End Diagnosis
Using OpenTelemetry, instrumentation is added from the client, API gateway, AI application, AI gateway, to the model layer. Manual instrumentation covers the gateway, while automatic, non‑intrusive probes capture Python and Java frameworks (LangChain, LlamaIndex, Dify, etc.). The collected spans are reported to the cloud observability service.
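The span model behind this tracing can be illustrated without the OpenTelemetry SDK. A stdlib-only sketch of timing a nested call chain; the span names and the global `spans` list are stand-ins for what a real exporter would report to the observability backend:

```python
import time
from contextlib import contextmanager

spans = []  # (name, duration_ms) pairs; a real tracer exports these to a backend

@contextmanager
def span(name: str):
    """Record the wall-clock duration of one step, mimicking a tracing span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

# Nested spans mirror the gateway -> agent -> model call chain.
with span("api_gateway"):
    with span("ai_agent"):
        with span("model_inference"):
            time.sleep(0.01)  # stand-in for the model call
```

Each inner span closes before its parent, so the innermost step is recorded first and every parent's duration includes its children's.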
The trace reveals two critical inference phases: Prefill (prompt tokenization and KV-cache building), measured by TTFT (time-to-first-token), and Decode (subsequent token generation), measured by TPOT (time-per-output-token). Combined with token counts, total latency can be estimated as TTFT plus TPOT times the number of remaining output tokens.
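Together the two phases give a back-of-the-envelope latency model. A sketch, assuming streaming generation where TPOT applies to every token after the first:

```python
def estimated_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Total time = time to first token + per-token time for the remaining tokens."""
    return ttft_ms + tpot_ms * max(output_tokens - 1, 0)

# e.g., TTFT 500 ms, TPOT 30 ms, 200 output tokens:
# 500 + 30 * 199 = 6470 ms
```

The numbers above are illustrative; in practice TTFT grows with prompt length (prefill work) while TPOT is roughly constant per token.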
Best Practices for Dify Production Deployment
Dify’s architecture consists of an Nginx reverse proxy, a Flask API backend, Redis for caching and message queuing, and PostgreSQL storage. Recommended optimizations include increasing NGINX_CLIENT_MAX_BODY_SIZE for large document uploads, enlarging the PostgreSQL connection pool (e.g., to 300), replacing Redis‑based queuing with RocketMQ for high‑scale workloads, and moving local vector stores to external services.
Enabling Dify's built-in observability provides per-workflow timing, but it lacks cross-service tracing and stores data in PostgreSQL, which can become a bottleneck at scale. Alibaba Cloud's probe integrates seamlessly, supporting multi-process (e.g., gunicorn) and gevent environments without stability issues.
Model Output Evaluation
All model inputs and outputs are streamed to Alibaba Cloud Log Service, where they can be sampled and evaluated using built‑in or custom templates to detect hallucinations, toxicity, or MCP attacks. Evaluation results can be categorized and clustered for semantic analysis.
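The sample-then-evaluate flow can be sketched in a few lines. A minimal illustration where the keyword rule is a toy stand-in for the built-in or custom evaluation templates mentioned above, and all function names are hypothetical:

```python
import random

def sample_and_evaluate(records, rate, evaluator, seed=0):
    """Sample a fraction of logged input/output pairs and label each one."""
    rng = random.Random(seed)
    sampled = [r for r in records if rng.random() < rate]
    return [{**r, "label": evaluator(r["output"])} for r in sampled]

def toy_evaluator(text: str) -> str:
    # Stand-in for a hallucination/toxicity template: flag overconfident claims.
    return "flagged" if "guaranteed" in text.lower() else "ok"
```

A production evaluator would call a judge model or rule engine per sample; the sampling step keeps evaluation cost bounded as log volume grows.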
MCP Token‑Blackhole Problem
When an AI agent invokes many MCP tools, token consumption can explode (e.g., a single answer consuming thousands of tokens while triggering tens of model calls). Observability must capture each tool’s latency and token usage to expose this hidden cost.
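Exposing this hidden cost comes down to attributing tokens to each tool call within one answer. A sketch with hypothetical call records; the record fields and tool names are assumptions for illustration:

```python
from collections import Counter

def tokens_by_tool(calls):
    """Aggregate token usage per MCP tool across one agent answer."""
    usage = Counter()
    for call in calls:
        usage[call["tool"]] += call["prompt_tokens"] + call["completion_tokens"]
    return usage

# One answer can fan out into many tool-triggered model calls:
calls = [
    {"tool": "web_search", "prompt_tokens": 900, "completion_tokens": 150},
    {"tool": "web_search", "prompt_tokens": 1100, "completion_tokens": 200},
    {"tool": "code_runner", "prompt_tokens": 400, "completion_tokens": 120},
]
```

Grouping the per-call spans this way makes it obvious which tool dominates the token bill for a single user-facing answer.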
Conclusion
The presented observability stack enables end‑to‑end tracing, metric collection, and quality evaluation for AI agents in cloud‑native environments, helping teams diagnose performance bottlenecks, control costs, and ensure reliable AI services.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
