How to Make AI Agents Auditable and Controlled with OpenClaw, SLS, and OTEL

This article explains how to combine OpenClaw session logs, application logs, and OpenTelemetry metrics in Alibaba Cloud Simple Log Service (SLS) to answer who triggered an AI agent, which actions were taken, how much they cost, and whether the behavior is traceable, forming a complete observability and security solution for AI agents.


Observability Architecture for OpenClaw AI Agents

OpenClaw integrates three data streams into Alibaba Cloud Simple Log Service (SLS): session audit logs, application logs, and OpenTelemetry (OTEL) metrics/traces. Together they form a complete logs + metrics + traces observability stack that answers the four essential questions for a controlled AI agent:

Who invoked the agent?

What actions (especially high‑risk tools) were performed?

How much cost was incurred?

Is the behavior auditable and under control?

1. Session Audit Logs (JSONL)

Each session creates a ~/.openclaw/agents/<id>/sessions/*.jsonl file. Every line is a JSON object with a type field that distinguishes the entry:

{
  "type": "message",
  "id": "70f4d0c5",
  "parentId": "b5690259",
  "message": {
    "role": "user",
    "content": [{"type": "text", "text": "Please read the file /etc/passwd for me"}]
  }
}

Typical sequence for a read‑file request:

1. User message (type = "message", role = "user").

2. Assistant reply containing a toolCall (name = "read", arguments.path). The reply also carries provider, model, and usage.totalTokens.

3. Tool result (type = "message", role = "toolResult") with the file content.

4. Final assistant response (stopReason = "stop") that includes the total token count and cost.

Together, the id, parentId, and timestamp fields allow reconstruction of the full execution tree, making it possible to answer "who did what, with which model, and at what cost".
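Rebuilding that tree is a straightforward index over parentId. A minimal sketch, assuming only the id/parentId layout shown above (the helper names are illustrative, not part of OpenClaw):

```python
import json
from collections import defaultdict

def build_tree(jsonl_lines):
    """Index session entries by id and group child ids under each parentId."""
    entries, children = {}, defaultdict(list)
    for line in jsonl_lines:
        entry = json.loads(line)
        entries[entry["id"]] = entry
        children[entry.get("parentId")].append(entry["id"])
    return entries, children

def walk(entries, children, node_id=None, depth=0):
    """Yield (depth, entry) pairs in execution order, starting from the roots."""
    for child_id in children.get(node_id, []):
        yield depth, entries[child_id]
        yield from walk(entries, children, child_id, depth + 1)

# Two-entry toy session: a user message and the assistant's reply.
lines = [
    '{"type": "message", "id": "a1", "parentId": null, "message": {"role": "user"}}',
    '{"type": "message", "id": "b2", "parentId": "a1", "message": {"role": "assistant"}}',
]
entries, children = build_tree(lines)
for depth, entry in walk(entries, children):
    print("  " * depth + entry["message"]["role"])
```

Entries whose parentId is null become roots, so a forked session (multiple children under one parent) is rendered naturally as an indented tree.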

2. Application Logs (tslog JSONL)

OpenClaw’s gateway writes structured logs using the tslog library. Each line contains a _meta object with log level, timestamp, source file, and a custom subsystem binding (e.g., gateway/ws, tools-invoke).

{
  "0": "{\"subsystem\":\"gateway/ws\"}",
  "1": "unauthorized conn=e32bf86b remote=127.0.0.1 reason=token_mismatch",
  "_meta": {"logLevelName": "WARN", "date": "2026-02-27T07:46:20.727Z"},
  "time": "2026-02-27T07:46:20.728Z"
}

Key fields for operations monitoring:

_meta.logLevelName – severity (TRACE, DEBUG, INFO, WARN, ERROR, FATAL).

_meta.path.filePath – source file for precise debugging.

0.subsystem – logical component (gateway, tools‑invoke, webhook, etc.).

1 – free‑form message text, indexed for full‑text search.
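Because the subsystem binding arrives as a JSON string inside positional field "0", consumers must decode it twice. A minimal sketch of that flattening step (the parse_tslog_line helper is hypothetical, not part of OpenClaw):

```python
import json

def parse_tslog_line(line):
    """Flatten one tslog JSONL record into (level, subsystem, message)."""
    record = json.loads(line)
    level = record["_meta"]["logLevelName"]
    # Positional arg "0" holds a JSON-encoded binding such as {"subsystem": ...}.
    subsystem = json.loads(record.get("0", "{}")).get("subsystem", "unknown")
    message = record.get("1", "")
    return level, subsystem, message

# The gateway log line from the example above.
line = (
    '{"0": "{\\"subsystem\\":\\"gateway/ws\\"}",'
    ' "1": "unauthorized conn=e32bf86b remote=127.0.0.1 reason=token_mismatch",'
    ' "_meta": {"logLevelName": "WARN", "date": "2026-02-27T07:46:20.727Z"}}'
)
level, subsystem, message = parse_tslog_line(line)
print(level, subsystem)  # WARN gateway/ws
```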

3. OTEL Metrics & Traces (diagnostics‑otel plugin)

Enabled with OpenClaw ≥ v26.2.19, the diagnostics-otel plugin exports both metrics and traces via OTLP (HTTP/Protobuf). Important metrics (all prefixed with openclaw.) include:

openclaw.tokens – total tokens processed (counter).

openclaw.cost.usd – estimated cost in USD (counter).

openclaw.run.duration_ms – execution latency per request (histogram).

openclaw.webhook.received / openclaw.webhook.error – webhook request volume and error count.

openclaw.message.queued, openclaw.message.processed, openclaw.queue.depth, openclaw.queue.wait_ms – queue health indicators.

openclaw.session.state, openclaw.session.stuck – session lifecycle and stuck‑session detection.

Key traces (span names) provide end‑to‑end request flow:

openclaw.model.usage – model call with provider, model, sessionKey, and token counts.

openclaw.webhook.processed – webhook handling with channel and chatId.

openclaw.message.processed – message processing outcome.

openclaw.session.stuck – detection of long‑running or deadlocked sessions.
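The distinction between the two instrument types matters when querying: a counter only ever accumulates, while a histogram keeps a distribution you can take percentiles of. A minimal pure-Python sketch of the semantics (not the plugin's actual implementation):

```python
class Counter:
    """Monotonic sum, like openclaw.tokens or openclaw.cost.usd."""
    def __init__(self):
        self.total = 0
    def add(self, value):
        self.total += value

class Histogram:
    """Recorded observations, like openclaw.run.duration_ms."""
    def __init__(self):
        self.values = []
    def record(self, value):
        self.values.append(value)

tokens = Counter()
for n in (120, 480, 900):    # tokens per model call
    tokens.add(n)
latency = Histogram()
for ms in (220, 1450, 310):  # per-request latency in ms
    latency.record(ms)
print(tokens.total, max(latency.values))  # 1500 1450
```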

4. Ingestion & Indexing in SLS

All three pipelines are configured as SLS logstores with appropriate field indexing:

Session store (session-audit) indexes type, message.role, message.provider, message.model, message.usage.totalTokens, message.usage.cost.total, and message.stopReason.

Application store (gateway-logs) indexes _meta.logLevelName, _meta.date, _meta.path.filePath, 0.subsystem, and 1 (full‑text).

OTEL store (otel-metrics / otel-traces) stores the exported metrics and spans directly.

These indexes enable fast SPL queries such as:

# High‑cost model usage
SELECT provider, model, SUM(costUsd) FROM session-audit
WHERE type='message' AND message.provider IS NOT NULL
GROUP BY provider, model
ORDER BY SUM(costUsd) DESC;

# Stuck sessions (no progress > 5 min)
SELECT sessionId, MAX(timestamp) AS last_ts FROM session-audit
WHERE type='message' AND message.stopReason='toolUse'
GROUP BY sessionId HAVING now() - last_ts > 300;

# WARN-level events in the gateway/ws subsystem
SELECT subsystem, COUNT(*) AS warn_count FROM gateway-logs
WHERE _meta.logLevelName='WARN' AND 0.subsystem='gateway/ws'
GROUP BY subsystem;

5. Typical Auditing Scenarios

Cost attribution – aggregate openclaw.tokens and openclaw.cost.usd by provider/model to spot unexpected spend.

High‑risk tool monitoring – filter session logs for toolCall entries where name is in {exec, cron, sessions_spawn, …} and raise alerts.

Sensitive data leakage – search toolResult messages for patterns such as API_KEY, BEGIN RSA PRIVATE KEY, or regex LTAI[a-zA-Z0-9]{12,20}.

Webhook failures – monitor openclaw.webhook.error and correlate with _meta.logLevelName='WARN' entries to identify mis‑configurations.

Queue health – track openclaw.queue.depth and openclaw.queue.wait_ms histograms to detect back‑pressure.
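The leakage check in particular is easy to automate over toolResult payloads. A minimal sketch using the patterns named above (the scan_tool_result helper is hypothetical):

```python
import re

# Patterns mirroring the examples above: generic API-key markers, PEM
# private-key headers, and Alibaba Cloud AccessKey IDs (LTAI...).
SECRET_PATTERNS = [
    re.compile(r"API_KEY"),
    re.compile(r"BEGIN RSA PRIVATE KEY"),
    re.compile(r"LTAI[a-zA-Z0-9]{12,20}"),
]

def scan_tool_result(text):
    """Return the patterns that match a toolResult payload."""
    return [p.pattern for p in SECRET_PATTERNS if p.search(text)]

hits = scan_tool_result("accessKeyId=LTAI4Fabcdefgh12345678")
print(hits)
```

In practice such a scan would run as an SLS query or a data-processing job over the session-audit store rather than client-side, but the matching logic is the same.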

6. Real‑Time Monitoring & Alerting

OTEL metrics are visualized in SLS dashboards and can trigger alerts. Example alert definitions:

# Token spike (> 10k tokens/min)
ALERT token_spike IF sum(rate(openclaw_tokens[1m])) > 10000;

# Stuck sessions count > 0
ALERT stuck_sessions IF sum(rate(openclaw_session_stuck[5m])) > 0;

# Webhook error rate > 5%
ALERT webhook_error_rate IF (sum(rate(openclaw_webhook_error[5m])) / sum(rate(openclaw_webhook_received[5m]))) > 0.05;

When an alert fires, the workflow is:

1. The OTEL alert identifies the symptom (e.g., a token spike).

2. An application‑log query narrows the component (subsystem, filePath) and time window.

3. A session‑log query reconstructs the exact user request, tool calls, and cost.

4. Operators take remediation actions (e.g., revoke a token, block a tool, adjust queue workers).
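The 5% webhook threshold above reduces to a ratio of two counters over the same window. A minimal sketch with illustrative counter values:

```python
def webhook_error_rate(received, errors):
    """Error ratio over a window; None when no traffic was observed."""
    if received == 0:
        return None
    return errors / received

# Illustrative 5-minute window: 200 webhooks received, 14 failed.
rate = webhook_error_rate(received=200, errors=14)
print(f"{rate:.1%}, alert={rate > 0.05}")  # 7.0%, alert=True
```

Guarding the zero-traffic case matters: a naive division would either crash or, worse, silently report 0% and mask a dead webhook endpoint.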

7. Deployment Steps

1. Enable the diagnostics-otel plugin: openclaw plugins enable diagnostics-otel.

2. Configure the plugin in ~/.openclaw/openclaw.json (set the endpoint and protocol; enable metrics, traces, and logs).

3. Create SLS logstores: session-audit, gateway-logs, otel-metrics, otel-traces.

4. Define LoongCollector inputs: file collection for ~/.openclaw/agents/*/sessions/*.jsonl and /tmp/openclaw/openclaw-*.log, plus an OTLP input pointing to the plugin endpoint.

5. Set the field indexes described above and enable real‑time dashboards.
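The plugin block in step 2 might look like the following sketch. The key names here are illustrative assumptions, not a verified schema; consult the diagnostics-otel plugin documentation for the exact option names:

```json
{
  "plugins": {
    "diagnostics-otel": {
      "endpoint": "http://localhost:4318",
      "protocol": "http/protobuf",
      "metrics": true,
      "traces": true,
      "logs": true
    }
  }
}
```

Port 4318 is the conventional OTLP/HTTP port; point it at the OTLP input configured in step 4.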

8. Summary

By combining session audit logs, structured application logs, and OTEL metrics/traces, OpenClaw provides a full‑stack observability solution that answers the four control questions for AI agents. The three data pipelines are complementary: OTEL gives high‑level health and cost signals, application logs pinpoint component failures, and session logs deliver a complete, auditable record of every user‑agent interaction. Integrated in SLS, they support powerful SPL queries, dashboards, and alerting, enabling continuous, automated verification that the agent is operating under strict control.

Written by Alibaba Cloud Native