Turning Coding Agents Transparent: Alibaba Cloud’s LoongSuite Observability and Auditing Solution
The article details Alibaba Cloud’s LoongSuite platform, which leverages OpenTelemetry to provide non‑intrusive, end‑to‑end observability, auditing, and cost tracking for various AI Agent types—including coding assistants, personal assistants, and framework‑based agents—by introducing unified data collection, enriched GenAI semantic conventions, and plug‑in architectures that enable full traceability of LLM calls, tool executions, and multi‑round reasoning.
Introduction
When AI agents are deployed at scale, their execution becomes a black box. Three challenges commonly arise: an opaque execution process (LLM calls, tool executions, and multi‑round reasoning are invisible to traditional metrics, logs, and traces), missing behavior traceability (agents can read and write files, execute commands, and call third‑party APIs without leaving an audit trail), and hard-to-measure cost (token consumption is the primary cost and multiplies with each reasoning round and tool call). What is needed is end-to-end observability that covers LLM calls, tool executions, multi‑round reasoning, and memory retrieval.
Agent Shape Classification and Observability Challenges
The market can be divided into three agent categories, each with distinct deployment environments and observability needs:
Coding agents run locally on developers’ machines (e.g., Claude Code, Codex, Cursor, Qoder, QoderWork).
Personal assistants run as standalone services for end users (e.g., OpenClaw, Hermes, QwenPaw).
Framework‑based agents are built on GenAI frameworks such as LangChain, AgentScope, Dify, etc.
LoongSuite Pilot – Side‑car Data Collection for Coding Agents
LoongSuite Pilot is a lightweight daemon that automatically discovers installed coding agents and injects data collection without modifying the agents.
One‑time deployment, full coverage: after a single installation, the daemon monitors Claude Code, Codex, Cursor, Qoder and QoderWork.
Background silent operation: runs as a background process and requires no configuration changes from developers.
Breakpoint‑resume: network glitches or process restarts do not cause data loss or duplication.
Flexible granularity: can record full message content, or only metadata (model name, token count, latency) to meet data‑privacy requirements.
Plug‑in architecture: adding a new coding agent requires implementing only 2‑3 abstract methods.
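The plug‑in and breakpoint‑resume ideas above can be illustrated with a minimal sketch. The article does not publish the Pilot plug‑in interface, so the class name `CodingAgentPlugin`, its two methods, and the JSON checkpoint file are all hypothetical; the sketch only shows how a persisted offset avoids both data loss and duplication after a restart.

```python
from abc import ABC, abstractmethod
import json
import tempfile
from pathlib import Path


class CodingAgentPlugin(ABC):
    """Hypothetical plug-in contract; method names are illustrative only."""

    @abstractmethod
    def discover(self) -> bool:
        """Return True if the agent is installed on this machine."""

    @abstractmethod
    def read_events(self, offset: int) -> list:
        """Return the events recorded after the given offset."""


class InMemoryAgent(CodingAgentPlugin):
    """Toy agent backed by a list, standing in for a real session log."""

    def __init__(self, events):
        self._events = events

    def discover(self) -> bool:
        return True

    def read_events(self, offset: int) -> list:
        return self._events[offset:]


def collect(plugin: CodingAgentPlugin, checkpoint: Path) -> list:
    """Read new events and persist the offset so a restart resumes cleanly."""
    offset = json.loads(checkpoint.read_text())["offset"] if checkpoint.exists() else 0
    events = plugin.read_events(offset)
    checkpoint.write_text(json.dumps({"offset": offset + len(events)}))
    return events


events = [{"id": 1}, {"id": 2}, {"id": 3}]
checkpoint = Path(tempfile.mkdtemp()) / "offset.json"
first = collect(InMemoryAgent(events[:2]), checkpoint)  # initial run sees two events
second = collect(InMemoryAgent(events), checkpoint)     # after a "restart": only the new event
```

Because the offset is written after every read, the second run picks up only the third event instead of re-sending the first two.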
Personal Assistant Plug‑ins – One‑Command Integration
A dedicated plug‑in captures the complete request‑response lifecycle for a personal assistant. It records session, turn and step identifiers (gen_ai.session.id, gen_ai.turn.id, gen_ai.step.id) so that a multi‑turn conversation can be reconstructed from a single trace.
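As a rough illustration of how the three identifiers support reconstruction, the sketch below groups exported spans by session and turn and orders them by step. The span dictionaries, the `name` key, and the assumption that turn IDs sort lexically and step IDs are numeric are all made up for this example; real identifier formats may differ.

```python
from collections import defaultdict

# Hypothetical exported spans, deliberately out of order.
spans = [
    {"gen_ai.session.id": "s1", "gen_ai.turn.id": "t2", "gen_ai.step.id": "1", "name": "LLM"},
    {"gen_ai.session.id": "s1", "gen_ai.turn.id": "t1", "gen_ai.step.id": "2", "name": "TOOL"},
    {"gen_ai.session.id": "s1", "gen_ai.turn.id": "t1", "gen_ai.step.id": "1", "name": "LLM"},
]


def reconstruct(spans):
    """Return {session: {turn: [span names in step order]}}."""
    sessions = defaultdict(lambda: defaultdict(list))
    ordered = sorted(spans, key=lambda s: (s["gen_ai.turn.id"], int(s["gen_ai.step.id"])))
    for span in ordered:
        sessions[span["gen_ai.session.id"]][span["gen_ai.turn.id"]].append(span["name"])
    return {sid: dict(turns) for sid, turns in sessions.items()}


conversation = reconstruct(spans)
```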
Framework‑Based Agents – Zero‑Code Probe Insertion
LoongSuite Python Agent extends OpenTelemetry Python Contrib and provides zero‑code instrumentation for 17 popular GenAI frameworks (LangChain, AgentScope, Dify, MCP, etc.). The workflow is:
# 1. Install LoongSuite Python Agent
pip install loongsuite-distro
# 2. Auto‑detect and install required instrumentation libraries
loongsuite-bootstrap
# 3. One‑line start with probe injection
loongsuite-instrument \
  --traces_exporter otlp \
  --service_name my-agent-app \
  python my_agent_app.py

The probe automatically scans the runtime environment, installs the matching instrumentation packages, and injects spans without any code changes.
Observability Views and Auditing Effects
After data collection, the Cloud Monitoring 2.0 console provides several dimensions:
Full‑link call tree: a hierarchical trace from ENTRY → AGENT → STEP → LLM → TOOL. For a 10‑round ReAct task, the STEP span quickly identifies the problematic round.
Token and cost tracking: fields gen_ai.usage.input_tokens, output_tokens, total_tokens plus Alibaba‑specific input_cost, output_cost, total_cost enable per‑request token breakdown, aggregation by agent/user/time, and cache‑related metrics.
Session and multi‑turn tracing: the three‑level identifiers allow cross‑turn analysis and user‑behavior insights.
Tool call audit: each tool invocation records name, parameters, result and latency, making file reads/writes, command executions and MCP calls fully auditable.
Behavior analysis dashboard: top‑level cards break down tool usage (commands, file ops, web requests, MCP calls) and highlight abnormal volumes.
Security audit overview: a risk snapshot aggregates high‑risk operations (dangerous commands, external web requests, sensitive file access, prompt injection) with trend charts for rapid risk assessment.
High‑risk session drill‑down: sessions are scored by aggregated risk metrics, sorted for immediate analyst focus, and detailed event tables provide full context for each alert.
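The token and cost tracking view can be approximated offline from exported span attributes. The following is a minimal sketch, not the console's implementation: the `agent` key, the sample spans, and the per‑1,000‑token prices are assumptions chosen for illustration, and real pricing depends on the model in use.

```python
from collections import defaultdict

# Hypothetical prices per 1,000 tokens.
INPUT_PRICE_PER_1K = 0.002
OUTPUT_PRICE_PER_1K = 0.006

spans = [
    {"agent": "coder", "gen_ai.usage.input_tokens": 1000, "gen_ai.usage.output_tokens": 500},
    {"agent": "coder", "gen_ai.usage.input_tokens": 2000, "gen_ai.usage.output_tokens": 500},
    {"agent": "helper", "gen_ai.usage.input_tokens": 500, "gen_ai.usage.output_tokens": 0},
]


def aggregate_by_agent(spans):
    """Sum token usage and estimated cost per agent."""
    totals = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0, "total_cost": 0.0}
    )
    for span in spans:
        row = totals[span["agent"]]
        inp = span["gen_ai.usage.input_tokens"]
        out = span["gen_ai.usage.output_tokens"]
        row["input_tokens"] += inp
        row["output_tokens"] += out
        row["total_tokens"] += inp + out
        row["total_cost"] += inp / 1000 * INPUT_PRICE_PER_1K + out / 1000 * OUTPUT_PRICE_PER_1K
    return dict(totals)


report = aggregate_by_agent(spans)
```

The same loop can be keyed by user or by time bucket instead of by agent.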
Extending the OpenTelemetry GenAI Semantic Specification
The OTel GenAI spec (still in Development) defines basic fields such as gen_ai.operation.name, gen_ai.span.kind, and token usage. Real‑world scenarios require richer semantics. Alibaba’s cross‑domain use case (e.g., “order a milk tea” involving Qwen, Flash Buy, Gaode, Alipay agents) motivated the LoongSuite GenAI Semantic Convention, an open‑source vendor‑enhanced standard that adds:
Entry Span (gen_ai.span.kind=ENTRY) to capture the original user request and model response before any system prompts.
Step Span (gen_ai.operation.name=react, gen_ai.react.round) to represent each ReAct iteration, making long traces readable.
Skill semantics (gen_ai.skill.*) to tag business‑level functions, enabling aggregation such as “which skill has the highest error rate”.
These extensions have been applied in OpenClaw, QwenPaw and Hermes Agent, and a proposal has been submitted to the OTel community (issue #3540).
GenAI Utils – Engineering Layer for the Semantic Convention
GenAI Utils provides unified telemetry handling for Python, Node.js, Go (Java forthcoming). It separates data extraction (instrumentation libraries) from semantic conversion (via ExtendedTelemetryHandler), so that any change to the semantic convention only requires updating GenAI Utils; all plug‑ins inherit the change automatically.
Supported invocation types cover the full GenAI lifecycle:
LLMInvocation
InvokeAgentInvocation
CreateAgentInvocation
ExecuteToolInvocation
EmbeddingInvocation
RetrieveInvocation
RerankInvocation
MemoryInvocation
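The separation between data extraction and semantic conversion can be sketched as follows. This is not the GenAI Utils API: the two invocation names come from the list above, but their fields, the class `SketchTelemetryHandler`, and its `to_attributes` method are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LLMInvocation:
    """Filled in by an instrumentation library; fields are hypothetical."""
    model: str
    input_tokens: int
    output_tokens: int


@dataclass
class ExecuteToolInvocation:
    """Filled in by an instrumentation library; fields are hypothetical."""
    tool_name: str
    arguments: str


class SketchTelemetryHandler:
    """Stand-in for the semantic-conversion layer (not the real ExtendedTelemetryHandler)."""

    def to_attributes(self, invocation) -> dict:
        # All attribute naming lives here, so a convention change touches one place.
        if isinstance(invocation, LLMInvocation):
            return {
                "gen_ai.span.kind": "LLM",
                "gen_ai.request.model": invocation.model,
                "gen_ai.usage.input_tokens": invocation.input_tokens,
                "gen_ai.usage.output_tokens": invocation.output_tokens,
            }
        if isinstance(invocation, ExecuteToolInvocation):
            return {
                "gen_ai.span.kind": "TOOL",
                "gen_ai.tool.name": invocation.tool_name,
                "gen_ai.tool.call.arguments": invocation.arguments,
            }
        raise TypeError(f"unsupported invocation type: {type(invocation).__name__}")


handler = SketchTelemetryHandler()
llm_attrs = handler.to_attributes(LLMInvocation("demo-model", 120, 30))
tool_attrs = handler.to_attributes(ExecuteToolInvocation("read_file", '{"path": "README.md"}'))
```

Framework adapters in this sketch never write attribute names themselves; they only populate the invocation objects.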
Comparison with Built‑in Observability (e.g., OpenClaw)
OpenClaw’s built‑in diagnostics emit independent points without parent‑child relationships, resulting in a flat view. LoongSuite’s plug‑ins use OTel context propagation so that all spans share a common traceId and form a complete call tree (ENTRY → AGENT → STEP → LLM / TOOL). In addition, LoongSuite records rich fields such as gen_ai.input.messages, gen_ai.output.messages, gen_ai.tool.call.arguments, and gen_ai.tool.call.result, enabling deep audit and fault isolation.
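To make the difference between flat points and a call tree concrete, here is a small standard‑library sketch of context propagation. It does not use the OpenTelemetry SDK; the `span` helper and the record layout are invented for illustration, and only show how an ambient "current span" gives every child the parent's trace ID and a parent link.

```python
import contextvars
import uuid
from contextlib import contextmanager

_current = contextvars.ContextVar("current_span", default=None)
finished = []  # spans in the order they end


@contextmanager
def span(kind):
    """Open a span that inherits the trace ID of whichever span is currently active."""
    parent = _current.get()
    record = {
        "kind": kind,
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
    }
    token = _current.set(record)
    try:
        yield record
    finally:
        _current.reset(token)  # restore the parent as the active span
        finished.append(record)


with span("ENTRY") as entry:
    with span("AGENT") as agent:
        with span("STEP") as step:
            with span("LLM") as llm:
                pass
            with span("TOOL") as tool:
                pass
```

Without the shared context variable, each record would start its own trace, which corresponds to the flat view described above.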
Automatic Span Types Recognized by the Probe
The probe automatically generates the following span kinds, covering the entire agent lifecycle:
ENTRY – request entry.
AGENT – agent execution unit.
STEP – ReAct iteration (identified by gen_ai.react.round).
LLM – large‑model call (includes request parameters, token usage, input/output messages).
TOOL – tool invocation (name, parameters, result, latency).
MCP – MCP protocol call.
CHAIN – chain orchestration.
RETRIEVER – retrieval operation.
EMBEDDING – vectorization.
RERANKER – re‑ranking.
WORKFLOW – workflow orchestration.
Observability Effects in Practice
For a 10‑round ReAct task, the STEP span pinpoints the failing round, then the nested LLM or TOOL span reveals the exact cause – a top‑down debugging workflow that dramatically reduces mean‑time‑to‑resolution.
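The top‑down workflow can be sketched on a toy trace. The nested dictionary format, the `status` and `error` keys, and the assumption that a failure marks every ancestor span as failed are all illustrative; they are not taken from the LoongSuite data model.

```python
trace = {
    "kind": "ENTRY", "status": "error", "children": [
        {"kind": "AGENT", "status": "error", "children": [
            {"kind": "STEP", "gen_ai.react.round": 1, "status": "ok", "children": [
                {"kind": "LLM", "status": "ok", "children": []},
            ]},
            {"kind": "STEP", "gen_ai.react.round": 2, "status": "error", "children": [
                {"kind": "LLM", "status": "ok", "children": []},
                {"kind": "TOOL", "status": "error", "error": "command timed out", "children": []},
            ]},
        ]},
    ],
}


def find_root_cause(span):
    """Follow the first failing child at each level down to the deepest failing span."""
    path = [span]
    while True:
        failing = [c for c in path[-1]["children"] if c["status"] == "error"]
        if not failing:
            return path
        path.append(failing[0])


path = find_root_cause(trace)
failing_round = next(s["gen_ai.react.round"] for s in path if s["kind"] == "STEP")
root_cause = path[-1]
```

Here the walk stops at the TOOL span in round 2, which mirrors locating the round first and the exact cause second.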
Token‑cost view enables:
Per‑request token breakdown.
Aggregation by agent, user or time.
Cache‑related fields (cache_read.input_tokens, cache_creation.input_tokens) to evaluate cache effectiveness.
Session identifiers (gen_ai.session.id, gen_ai.turn.id, gen_ai.step.id) provide:
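One possible way to evaluate cache effectiveness from these fields is a read ratio across calls. The sketch below uses the field names as abbreviated in the text (the full attribute prefix is not given) and assumes that `input_tokens` already includes tokens served from cache, which varies between model providers.

```python
# Hypothetical per-call usage records.
calls = [
    {"input_tokens": 1000, "cache_read.input_tokens": 600, "cache_creation.input_tokens": 200},
    {"input_tokens": 1000, "cache_read.input_tokens": 0, "cache_creation.input_tokens": 800},
]


def cache_hit_ratio(calls):
    """Share of input tokens served from cache, assuming input_tokens includes cached reads."""
    total = sum(c["input_tokens"] for c in calls)
    cached = sum(c["cache_read.input_tokens"] for c in calls)
    return cached / total if total else 0.0


ratio = cache_hit_ratio(calls)
```

A ratio that stays low while cache_creation.input_tokens stays high would suggest that prompts are being cached but rarely reused.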
Cross‑turn session tracing.
Step‑level analysis within a single turn.
Behavior‑driven user insights.
Security dashboards aggregate high‑risk operations (dangerous commands, external web requests, sensitive file access, prompt‑injection‑triggered actions). Sessions are scored and sorted, allowing analysts to focus on the most critical sessions first.
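Session scoring of this kind can be sketched with a weighted count. The article does not describe the actual scoring formula, so the category keys and weights below are assumptions picked only to show the score‑then‑sort pattern.

```python
from collections import Counter

# Hypothetical weights per high-risk category.
RISK_WEIGHTS = {
    "dangerous_command": 5,
    "prompt_injection": 4,
    "sensitive_file_access": 3,
    "external_web_request": 1,
}

# (session id, risk category) pairs extracted from audited tool calls.
events = [
    ("s1", "external_web_request"),
    ("s2", "dangerous_command"),
    ("s2", "sensitive_file_access"),
    ("s3", "prompt_injection"),
    ("s1", "external_web_request"),
]


def rank_sessions(events):
    """Return (session, score) pairs, highest risk first."""
    scores = Counter()
    for session_id, category in events:
        scores[session_id] += RISK_WEIGHTS.get(category, 0)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_sessions(events)
```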
Why Extend the Community OTel GenAI Spec
The community spec must remain broadly applicable and stable, so its evolution is cautious. As of 2024 it is still in Development and lacks constructs for complex cross‑domain workflows and business‑level skill tagging. Alibaba’s production experience (e.g., multi‑agent “order milk tea” flow) demonstrated the need for:
Entry spans to preserve the original user request.
Step spans to separate each ReAct iteration.
Skill attributes to group operations by business function.
The LoongSuite GenAI Semantic Convention fills these gaps and will be contributed back to the OTel community.
Core Extensions Details
Extension 1 – Entry Span & Step Span
Problem: Long‑running agents generate hundreds of spans, making the trace unreadable.
Solution:
Entry Span (gen_ai.span.kind=ENTRY) is created at the very beginning of an agent call, recording the raw user input and the model’s initial response.
Step Span (gen_ai.operation.name=react, gen_ai.react.round) represents each ReAct cycle, allowing the trace to be segmented by reasoning round.
This modeling has been deployed in OpenClaw, QwenPaw and Hermes Agent.
Extension 2 – Skill Semantics
Problem: Business‑level functions (e.g., “order drink”, “checkout”) are invisible in the generic trace.
Solution: Introduce gen_ai.skill.* attributes (e.g., gen_ai.skill.name, gen_ai.skill.type) attached to execute_tool or invoke_skill spans. This enables queries such as “which skill has the highest error rate” or “does a new skill version increase latency”. The proposal has been submitted as OTel issue #3540.
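A query like "which skill has the highest error rate" reduces to a group‑by on gen_ai.skill.name. The sketch below runs it over in‑memory spans; the sample data and the boolean `error` key are assumptions, and in practice the query would run in the observability backend.

```python
from collections import defaultdict

# Hypothetical spans tagged with the skill attribute.
spans = [
    {"gen_ai.skill.name": "checkout", "error": False},
    {"gen_ai.skill.name": "checkout", "error": True},
    {"gen_ai.skill.name": "checkout", "error": True},
    {"gen_ai.skill.name": "order_drink", "error": False},
    {"gen_ai.skill.name": "order_drink", "error": False},
]


def error_rates(spans):
    """Return {skill name: fraction of spans that failed}."""
    counts = defaultdict(lambda: [0, 0])  # [errors, total]
    for span in spans:
        entry = counts[span["gen_ai.skill.name"]]
        entry[0] += int(span["error"])
        entry[1] += 1
    return {skill: errors / total for skill, (errors, total) in counts.items()}


rates = error_rates(spans)
worst_skill = max(rates, key=rates.get)
```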
Engineering Layer – GenAI Utils
GenAI Utils implements the following principles:
Instrumentation layer only extracts data: framework adapters hook into the runtime (via monkey‑patch or native hooks) and populate language‑specific Invocation objects.
ExtendedTelemetryHandler performs semantic conversion: all span creation, attribute attachment, metric recording and context propagation are centralized, so updating the semantic convention only requires a change in this handler.
Version support: currently available for Python, Node.js and Go; a Java version is forthcoming. The Python and Node.js packages are open‑source.
The supported invocation types (LLMInvocation, InvokeAgentInvocation, CreateAgentInvocation, ExecuteToolInvocation, EmbeddingInvocation, RetrieveInvocation, RerankInvocation, MemoryInvocation) cover the full GenAI lifecycle.
Conclusion
Alibaba Cloud’s LoongSuite delivers a complete, non‑intrusive observability and auditing solution for the three major AI‑agent shapes:
LoongSuite Pilot provides side‑car monitoring for local coding agents.
Specialized plug‑ins give personal assistants full‑chain tracing with a single command.
LoongSuite Python Agent offers zero‑code instrumentation for 17 major GenAI frameworks.
The open‑source LoongSuite GenAI Semantic Convention extends the community OTel spec with Entry/Step spans and Skill semantics, making complex agent workflows transparent, analyzable, governable and evolvable.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
