Why Open‑Source LoongSuite Pilot Is Needed as AI Coding Agents Become Core Infrastructure

The article analyzes how AI coding agents like Cursor, Claude Code, and Codex have become essential developer tools, yet suffer from almost zero observability, and explains how the open‑source LoongSuite Pilot provides a unified collection platform, semantic schema, security controls, dashboards, and ROI metrics to turn these agents into manageable infrastructure.

Alibaba Cloud Developer

Jun 12, 2026

Observability challenges for AI coding agents

AI coding agents (e.g., Cursor, Claude Code, Codex, Qoder) run multi‑step ReAct reasoning loops, often more than 10 iterations per task, each involving model calls, tool selections, and reflections. Traditional metrics, logs, and traces capture only isolated HTTP requests and cannot reconstruct this hierarchical decision flow. Each agent also stores its data in a different format and location (IDE history files, local SQLite databases, session logs), which prevents cross‑agent comparison. Finally, existing observability probes target server‑side processes, whereas AI coding agents run on developers' local machines, so their data is scattered across endpoints.

LoongSuite Pilot design

Choice 1 – All‑in‑One collection platform

Instead of writing a dedicated extractor per agent, Pilot provides a unified platform deployed once to automatically cover all installed agents. Detection rules, deployment modes, and collection configurations are declared in agents.d/*.json files. Adding a new agent only requires understanding its data format and implementing conversion logic; the framework code remains unchanged.

{
  "id": "claude-code",
  "displayName": "Claude Code",
  "deployMode": "hook",
  "detection": {"paths": ["~/.claude"], "commands": ["claude"]},
  "hook": {"settingsPath": "~/.claude/settings.json", "events": ["UserPromptSubmit","PreToolUse","PostToolUse","Stop"], "hookCommand": "$PILOT_DATA/hooks/claude-code-loongsuite-pilot-hook.sh"},
  "input": {"type": "hook-jsonl", "logDir": "$PILOT_DATA/logs/claude-code"}
}
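As a rough illustration of how a declarative descriptor like the one above could drive detection, the following Python sketch loads every file in an agents.d directory and checks the detection block. The matching rule (any listed path exists, or any listed command is on PATH) and the function names are assumptions made for this example; the article does not describe Pilot's actual detection logic.

```python
import json
import shutil
from pathlib import Path


def load_descriptors(agents_dir):
    """Read every *.json descriptor in agents_dir, sorted by file name."""
    return [json.loads(p.read_text(encoding="utf-8"))
            for p in sorted(Path(agents_dir).glob("*.json"))]


def is_installed(descriptor, which=shutil.which):
    """Assumed rule: any listed path exists, or any listed command is on PATH."""
    detection = descriptor.get("detection", {})
    if any(Path(p).expanduser().exists() for p in detection.get("paths", [])):
        return True
    return any(which(cmd) is not None for cmd in detection.get("commands", []))


def detect_agents(agents_dir, which=shutil.which):
    """Return the ids of descriptors whose detection rule matches this machine."""
    return [d["id"] for d in load_descriptors(agents_dir) if is_installed(d, which)]
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it uses `shutil.which` from the standard library.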

Choice 2 – Adapt agents, not the other way around

Because most agents are closed‑source, Pilot does not modify them. It abstracts their heterogeneity into five collection base classes (BaseHookInput, BaseIdeInput, BaseSqliteInput, BaseSessionInput, and BaseCliForwarder), each providing an incremental extraction strategy. A new agent selects a base class and implements 2‑3 methods.
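The "pick a base class, implement a few methods" pattern can be sketched as follows. This is not Pilot's API: the class name IncrementalJsonlInput, the methods poll, accepts, and convert, and the byte-offset strategy are all hypothetical, chosen only to show what incremental extraction from an append-only log might look like.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path


class IncrementalJsonlInput(ABC):
    """Hypothetical base class: reads only the lines appended since the last poll."""

    def __init__(self, log_path):
        self.log_path = Path(log_path)
        self.offset = 0  # byte offset of the first unread byte

    def poll(self):
        """Return converted entries for every complete line added since the last call."""
        if not self.log_path.exists():
            return []
        with self.log_path.open("rb") as f:
            f.seek(self.offset)
            chunk = f.read()
        # Consume only up to the last newline so a half-written line is retried later.
        end = chunk.rfind(b"\n") + 1
        self.offset += end
        entries = []
        for line in chunk[:end].splitlines():
            if line.strip():
                raw = json.loads(line)
                if self.accepts(raw):
                    entries.append(self.convert(raw))
        return entries

    @abstractmethod
    def accepts(self, raw):
        """Decide whether a raw record is relevant."""

    @abstractmethod
    def convert(self, raw):
        """Turn a raw record into a unified entry."""


class DemoAgentInput(IncrementalJsonlInput):
    """Adapter for an imaginary agent: only the two agent-specific methods are written."""

    def accepts(self, raw):
        return "tool" in raw

    def convert(self, raw):
        return {"gen_ai.agent.type": "demo-agent", "gen_ai.tool.name": raw["tool"]}
```

In this sketch the shared logic (offset tracking, partial-line handling) lives in the base class, so an adapter contains only format-specific code.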

Choice 3 – Semantic normalization

Raw data from different agents are normalized into a unified AgentActivityEntry following the LoongSuite GenAI observability semantic conventions (an extension of OpenTelemetry GenAI conventions). This yields consistent field names such as gen_ai.usage.input_tokens, gen_ai.session.id, and gen_ai.tool.call.id across all agents.
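To make the idea of normalization concrete, here is a minimal Python sketch that maps two differently shaped raw records onto the same unified field names. The unified keys come from the article; the agent names agent-a and agent-b, their raw key names, and the FIELD_MAPS table are invented for illustration and do not reflect any real agent's format.

```python
# Hypothetical per-agent mappings from raw keys to the unified field names.
FIELD_MAPS = {
    "agent-a": {"session_id": "gen_ai.session.id", "tool_name": "gen_ai.tool.name"},
    "agent-b": {"conversationId": "gen_ai.session.id", "tool": "gen_ai.tool.name"},
}


def normalize(agent_type, raw):
    """Copy the mapped raw fields into an entry keyed by the unified names."""
    entry = {"gen_ai.agent.type": agent_type}
    for raw_key, unified_key in FIELD_MAPS[agent_type].items():
        if raw_key in raw:
            entry[unified_key] = raw[raw_key]
    return entry


a = normalize("agent-a", {"session_id": "s1", "tool_name": "Bash", "extra": 1})
b = normalize("agent-b", {"conversationId": "s2", "tool": "Read"})
# Both entries now expose the same keys, so downstream queries need no per-agent logic.
```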

Choice 4 – Granularity control with security masking

Collection granularity can be tuned per agent type to emit either full message content or only structured metadata (model name, token usage, tool name). Sensitive data (cloud AccessKeys, API keys, DB connection strings, private keys) is automatically masked using configurable modes (mask.mode: none|all|custom).
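A pattern-based masking step could look roughly like the sketch below. The article names the three mask.mode values but not their semantics, so the interpretation here is an assumption: none disables masking, all applies every built-in pattern, and custom applies only caller-supplied patterns. The regular expressions and the [MASKED:...] placeholder are likewise illustrative, not Pilot's actual rules.

```python
import re

# Illustrative patterns only; real secret detection needs a broader, tested rule set.
BUILTIN_PATTERNS = {
    "private_key": re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
    "connection_string": re.compile(r"\b[a-z][a-z0-9+]*://[^\s:/@]+:[^\s@]+@[^\s]+"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"),
}


def mask_text(text, mode="all", custom_patterns=None):
    """Replace every match of the active patterns with a labeled placeholder."""
    if mode == "none":
        return text
    if mode == "all":
        patterns = BUILTIN_PATTERNS
    elif mode == "custom":
        patterns = custom_patterns or {}
    else:
        raise ValueError(f"unknown mask mode: {mode}")
    for name, pattern in patterns.items():
        text = pattern.sub(f"[MASKED:{name}]", text)
    return text
```

Masking before the data leaves the collector, rather than at query time, means that no output channel ever receives the raw secret.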

Choice 5 – Multi‑target output

Collected data can be emitted in parallel to local JSONL files, Alibaba Cloud SLS Logstore, HTTP endpoints, or OTLP Trace backends, ensuring that a failure in one channel does not block others.
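The failure-isolation property described above can be sketched with a thread pool: each sink receives the entry independently, and an exception in one sink is recorded rather than propagated. The function name emit_to_all and the representation of sinks as plain callables are assumptions for this example, not Pilot's actual interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def emit_to_all(entry, sinks):
    """Send one entry to every sink concurrently; return {sink_name: error} for failures."""
    failures = {}
    with ThreadPoolExecutor(max_workers=max(1, len(sinks))) as pool:
        futures = {name: pool.submit(send, entry) for name, send in sinks.items()}
        for name, future in futures.items():
            exc = future.exception()  # waits for this sink; None if it succeeded
            if exc is not None:
                failures[name] = repr(exc)
    return failures


received = []


def jsonl_sink(entry):
    received.append(entry)


def broken_http_sink(entry):
    raise ConnectionError("endpoint down")


failures = emit_to_all(
    {"event.name": "tool.call"},
    {"jsonl": jsonl_sink, "http": broken_http_sink},
)
# The local sink still received the entry even though the HTTP sink failed.
```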

End‑to‑end workflow

Installation is a single command that downloads the latest Pilot version, deploys it under ~/.loongsuite-pilot/, installs hook scripts, and starts a background process. After installation Pilot silently monitors all supported agents.

Data can be viewed in a built‑in local dashboard (started with loongsuite-pilot monitor start and served at http://127.0.0.1:8765), which shows per‑agent event counts, recent activity, success rates, and Pilot’s CPU/memory usage.

Raw JSONL logs are stored under ~/.loongsuite-pilot/logs/output/. Example Claude Code events illustrate a tool call and its result, including session/turn/step IDs, tool name, arguments, and trace identifiers, enabling precise reconstruction of the execution chain.

Key analytical capabilities

ROI measurement – Full‑trace data enable calculation of task completion rate, token‑efficiency ratio, human‑AI collaboration ratio, and self‑correction rate.

Agent selection – The unified schema (session → turn → step) allows side‑by‑side comparison of agents on identical tasks, revealing latency, token consumption, tool‑usage patterns, and scenario suitability.

Token cost vs. effectiveness – Real‑world sessions show non‑linear relationships; three “token black‑hole” patterns are identified using gen_ai.usage.input_tokens and gen_ai.usage.total_tokens fields:

Loop‑retry: repeated trial‑and‑error cycles generate many STEP events with failed tool calls.

Context bloat: input token count grows stair‑stepwise as the session lengthens, indicating ineffective context management.

Over‑cautious reasoning: LLM spans consume >80% of tokens while tool spans are minimal.
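As one example of how such a pattern might be flagged, the sketch below checks a session for context bloat by looking at gen_ai.usage.input_tokens across successive llm.response events. The heuristic (input tokens never shrink and end at least three times higher than they started) and the thresholds min_calls and growth_factor are arbitrary assumptions for illustration; the article does not specify detection rules.

```python
def input_token_series(events):
    """Input-token counts of the llm.response events, in the order given."""
    return [
        int(e["gen_ai.usage.input_tokens"])
        for e in events
        if e.get("event.name") == "llm.response" and "gen_ai.usage.input_tokens" in e
    ]


def looks_like_context_bloat(events, min_calls=4, growth_factor=3.0):
    """Heuristic: input tokens never decrease and grow by at least growth_factor overall."""
    series = input_token_series(events)
    if len(series) < min_calls:
        return False
    never_shrinks = all(b >= a for a, b in zip(series, series[1:]))
    return never_shrinks and series[-1] >= growth_factor * series[0]
```

The events are expected to belong to a single session and to be in chronological order; token values may be strings or integers, since both are converted with int().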

Security auditing – Pilot captures every agent‑initiated operation, supporting a six‑layer audit pipeline: (1) runtime monitoring, (2) risk‑status dashboard, (3) risk‑entity ranking (e.g., dangerous_command, sensitive_access, model_context_secret_leak), (4) data‑leak chain analysis, (5) entity‑centric investigation, (6) session‑level replay.

Example log entries

{
  "event.name": "tool.call",
  "gen_ai.agent.type": "claude-code",
  "gen_ai.session.id": "8e06a611-d9ae-4c43-b03d-a285e8bda3ab",
  "gen_ai.turn.id": "8e06a611-d9ae-4c43-b03d-a285e8bda3ab:t1",
  "gen_ai.step.id": "8e06a611-d9ae-4c43-b03d-a285e8bda3ab:t1:s3",
  "gen_ai.tool.name": "Bash",
  "gen_ai.tool.call.id": "toolu_vrtx_0115QdGCWqoQ4Mnj6aKwcEuy",
  "gen_ai.tool.call.arguments": "{\"command\":\"ls /workspace/agent-data-collection/docs/\",\"description\":\"List docs directory\"}",
  "trace_id": "09f11db9fca4348e70ad34aa620e810c",
  "span_id": "dc688dbb763a7f4c",
  "parent_span_id": "d5afbe7a280d0d52"
}

{
  "event.name": "tool.result",
  "gen_ai.tool.name": "Bash",
  "gen_ai.tool.call.id": "toolu_vrtx_0115QdGCWqoQ4Mnj6aKwcEuy",
  "gen_ai.tool.call.result": "E2E-REMOTE-TEST-GUIDE.md\n...",
  "gen_ai.step.id": "8e06a611-d9ae-4c43-b03d-a285e8bda3ab:t1:s3",
  "trace_id": "09f11db9fca4348e70ad34aa620e810c",
  "span_id": "dc688dbb763a7f4c"
}

These entries demonstrate the hierarchical identifiers (session.id → turn.id → step.id) that locate a tool call within the full session trace, the shared tool.call.id that pairs a call event with its result event, and the OpenTelemetry identifiers (trace_id, span_id, parent_span_id) that enable end‑to‑end trace reconstruction.
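A short Python sketch shows how a consumer might use these identifiers. The field names are taken from the example entries; the helper names pair_tool_events and split_step_id are made up here, and the step ID layout (session, then :t<n>, then :s<n>) is inferred from the single example above rather than from a published specification.

```python
def pair_tool_events(events):
    """Match each tool.result to the earlier tool.call with the same gen_ai.tool.call.id."""
    pending = {}
    pairs = []
    for event in events:
        call_id = event.get("gen_ai.tool.call.id")
        if event.get("event.name") == "tool.call":
            pending[call_id] = event
        elif event.get("event.name") == "tool.result" and call_id in pending:
            pairs.append((pending.pop(call_id), event))
    return pairs


def split_step_id(step_id):
    """Derive the session, turn, and step identifiers from one step ID."""
    session_id, turn, _step = step_id.rsplit(":", 2)
    return {"session": session_id, "turn": f"{session_id}:{turn}", "step": step_id}


levels = split_step_id("8e06a611-d9ae-4c43-b03d-a285e8bda3ab:t1:s3")
# levels["turn"] is "8e06a611-d9ae-4c43-b03d-a285e8bda3ab:t1"
```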

Quantitative queries (SLS SQL examples)

SELECT "user.id", "gen_ai.agent.type",
  SUM(CAST("gen_ai.usage.input_tokens" AS BIGINT)) AS total_input,
  SUM(CAST("gen_ai.usage.output_tokens" AS BIGINT)) AS total_output,
  SUM(CAST("gen_ai.usage.total_tokens" AS BIGINT)) AS total_tokens
FROM log
WHERE "event.name" = 'llm.response'
  AND "user.id" = '${userID}'
GROUP BY "user.id", "gen_ai.agent.type"
ORDER BY total_tokens DESC;

SELECT "gen_ai.agent.type", "gen_ai.tool.name",
  COUNT(*) AS call_count,
  AVG(CAST("gen_ai.tool.call.duration" AS DOUBLE)) AS avg_duration_ms
FROM log
WHERE "event.name" = 'tool.result'
GROUP BY "gen_ai.agent.type", "gen_ai.tool.name"
ORDER BY call_count DESC
LIMIT 10;

SELECT "gen_ai.session.id", "gen_ai.agent.type", "user.id", "gen_ai.request.model",
  CAST("gen_ai.usage.total_tokens" AS BIGINT) AS total_tokens
FROM log
WHERE "event.name" = 'llm.response'
  AND CAST("gen_ai.usage.total_tokens" AS BIGINT) > 50000
ORDER BY total_tokens DESC
LIMIT 20;

Because all agents share the same field names, a single SQL statement works for Claude Code, Cursor, Codex, and Qoder.
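For readers without an SLS Logstore, a similar aggregation can be approximated over the local JSONL output. The Python sketch below mirrors the first query (token totals from llm.response events) but groups by agent type only. It assumes the local files carry the same field names as the SLS records, which the example entries suggest but the article does not state explicitly; the function name is hypothetical.

```python
import json
from collections import defaultdict


def token_totals_by_agent(lines):
    """Sum input/output/total tokens of llm.response events, grouped by agent type."""
    totals = defaultdict(lambda: {"input": 0, "output": 0, "total": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("event.name") != "llm.response":
            continue
        bucket = totals[event.get("gen_ai.agent.type", "unknown")]
        # int() accepts both numeric and string values, like the CAST in the SQL.
        bucket["input"] += int(event.get("gen_ai.usage.input_tokens", 0))
        bucket["output"] += int(event.get("gen_ai.usage.output_tokens", 0))
        bucket["total"] += int(event.get("gen_ai.usage.total_tokens", 0))
    return dict(totals)
```

Because the function accepts any iterable of lines, an open file handle for a log under ~/.loongsuite-pilot/logs/output/ can be passed in directly.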

Agent comparison case study

Three agents were given the same task—adding unit tests to a module. All completed the task, but their traces revealed distinct behaviors:

Claude Code – Shortest total latency, fastest per‑round LLM inference, tool usage dominated by find and Read, balanced time between reasoning and execution.

Cursor – Fewest LLM calls but longest per‑round inference (high‑thinking mode), diverse tool chain (Shell, Read, Grep, Write), longest total time and highest token output, suitable for deep code understanding.

Qoder – Lowest token consumption, minimal tool latency (lightweight API calls), many LLM rounds with slower inference, “small‑step‑fast‑run” style, ideal for token‑sensitive or iterative scenarios.

The unified schema makes these differences observable without manual log parsing.

Security audit workflow

Pilot’s data feed powers a six‑layer audit view:

Management dashboard showing active agents and runtime status.

Risk‑status dashboard aggregating high‑risk event counts, today’s operations, and sensitive writes.

Risk‑entity ranking across application, user, host, tool, external domain/IP, and command dimensions, labeling entities with tags such as dangerous_command, sensitive_access, model_context_secret_leak.

Data‑leak chain analysis separating attack‑driven exfiltration, model‑context leakage, and sensitive‑data type distribution.

Entity‑centric investigation presenting a panoramic view of all entities involved in a risk event, with drill‑down to detailed behavior.

Session‑level replay showing a chronological timeline of events, parameters, and context for forensic analysis.

Open‑source release

LoongSuite Pilot is released at https://github.com/alibaba/loongsuite-pilot. It completes the LoongSuite observability suite for endpoint agents and invites contributions for additional agent adapters, dashboards, and community templates via the agents.d/ directory. Related projects include LoongCollector, LoongSuite Python/Go/Java agents, and the LoongSuite Semantic Conventions repository.

