Why OpenClaw’s Control Plane Uses a Two‑Phase Protocol and runId for Reliable Agent Jobs
The article explains how OpenClaw’s control plane guarantees reliable, idempotent, and observable agent execution by enforcing a two‑phase protocol, strict handshake, role‑based authorization, layered deduplication, gap‑recovery mechanisms, and schema‑driven validation, turning a simple message flow into a production‑grade job system.
Background and Core Problem
OpenClaw’s earlier overview showed how messages travel, but real‑world usage revealed gaps: lost state on disconnections, unsafe retries, and long‑running tasks blocking the gateway. The fundamental issue is that being able to run a job does not mean it runs reliably – the challenge lies in the control plane (L2).
Why the Control Plane Is a Two‑Phase Protocol
The control plane is not a UI; it is a protocol contract defining who can connect, which methods are allowed, and which events can be subscribed.
The first client frame on a WebSocket connection must be connect. The server pushes connect.challenge (a nonce) for the client to sign, and the client must complete authentication within 10 seconds or the connection is closed (code 1008).
Agent commands follow a two‑phase flow: first accepted(runId) (immediate hard‑ack) and later final (ok/error). This separates acceptance from execution, preventing gateway thread blockage.
connect.challenge(nonce) ← server pushes nonce
connect(req, auth, device) → hello‑ok(methods, events, snapshot, policy)
agent(message, idempotencyKey) → accepted(runId)
event:agent(streaming, seq) ← async push
agent(final) → ok/error(runId, summary)
The protocol defines three frame types (Request, Response, Event) via a type discriminator in GatewayFrameSchema.
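The flow above can be sketched in a few lines. This is a minimal illustration, not OpenClaw's implementation: the frame field names, the "req"/"res"/"event" discriminator values, and the TwoPhaseGateway class are all assumptions made for the example. It shows narrowing on the type discriminator and an accept step that returns a runId at once, with the final outcome written later.

```typescript
// Hypothetical frame shapes. The source only says that a `type` field
// discriminates Request, Response, and Event frames.
type RequestFrame = { type: "req"; id: string; method: string; params: Record<string, unknown> };
type ResponseFrame<P = unknown> = { type: "res"; id: string; ok: boolean; payload?: P; error?: string };
type EventFrame = { type: "event"; event: string; seq: number; payload: unknown };
type GatewayFrame = RequestFrame | ResponseFrame | EventFrame;

// Narrowing on the discriminator gives each branch access to its own fields.
function describeFrame(frame: GatewayFrame): string {
  switch (frame.type) {
    case "req":
      return `request ${frame.method}`;
    case "res":
      return `response to ${frame.id}`;
    case "event":
      return `event ${frame.event}#${frame.seq}`;
  }
}

type RunState = { runId: string; status: "accepted" | "ok" | "error"; summary?: string };

class TwoPhaseGateway {
  private runs = new Map<string, RunState>(); // keyed by idempotencyKey
  private nextRun = 1;

  // Phase 1: acknowledge at once with a runId; no agent work happens here.
  // A retry with the same idempotencyKey gets the same runId back.
  accept(req: RequestFrame): ResponseFrame<RunState> {
    const key = String(req.params.idempotencyKey);
    let run = this.runs.get(key);
    if (run === undefined) {
      run = { runId: `run-${this.nextRun++}`, status: "accepted" };
      this.runs.set(key, run);
    }
    return { type: "res", id: req.id, ok: run.status !== "error", payload: { ...run } };
  }

  // Phase 2: called when the job ends; overwrites the stored state with the outcome.
  finish(requestId: string, idempotencyKey: string, ok: boolean, summary: string): ResponseFrame<RunState> {
    const run = this.runs.get(idempotencyKey);
    if (run === undefined) throw new Error(`no accepted run for key ${idempotencyKey}`);
    run.status = ok ? "ok" : "error";
    run.summary = summary;
    return { type: "res", id: requestId, ok, payload: { ...run } };
  }
}
```

Because accept never waits for the job, the gateway can answer the request immediately, and a client that lost the first response can simply retry with the same key.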
Connection Reliability (Handshake)
OpenClaw enforces three rules to keep the connection trustworthy:
The first frame must be connect. Any other method before a successful handshake causes the server to close the socket.
A 10‑second handshake timeout is hard‑coded; exceeding it results in close(1008, "handshake timeout").
For non‑local connections the server sends a connect.challenge containing a random nonce and timestamp. The client must sign this nonce; old signatures cannot be reused, preventing replay attacks.
Authentication proceeds in a fixed priority order: trusted‑proxy → rate‑limiting → Tailscale verification → token/password check.
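The three rules can be pictured as a small per-connection guard. The sketch below is an assumption-laden illustration: HandshakeGuard, Verdict, and the parameter names are hypothetical, the nonce is compared directly instead of verifying a real signature, the exemption for local connections is left out, and the close code used for a non-connect first frame is a guess (the article gives a code only for the timeout).

```typescript
type Verdict =
  | { action: "continue" }
  | { action: "close"; code: number; reason: string };

const HANDSHAKE_TIMEOUT_MS = 10000; // the 10-second limit stated above

class HandshakeGuard {
  private readonly openedAt: number;
  private readonly nonce: string;
  private connected = false;

  constructor(openedAt: number, nonce: string) {
    this.openedAt = openedAt;
    this.nonce = nonce;
  }

  // Called by a timer; closes sockets that never finished the handshake.
  onTick(now: number): Verdict {
    if (!this.connected && now - this.openedAt >= HANDSHAKE_TIMEOUT_MS) {
      return { action: "close", code: 1008, reason: "handshake timeout" };
    }
    return { action: "continue" };
  }

  // Called for every inbound request frame.
  onFrame(method: string, echoedNonce: string | undefined): Verdict {
    if (this.connected) return { action: "continue" };
    if (method !== "connect") {
      // Assumption: the article says the socket is closed but names no code for this case.
      return { action: "close", code: 1008, reason: "first frame must be connect" };
    }
    if (echoedNonce !== this.nonce) {
      // A response built for an older nonce fails here, which is what blocks replays.
      return { action: "close", code: 1008, reason: "nonce mismatch" };
    }
    this.connected = true;
    return { action: "continue" };
  }
}
```

Passing the clock in as an argument keeps the guard deterministic, so the timeout rule can be tested without real timers.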
Strict Role‑Based Authorization
Method access is governed by two checks: the node role, which may call only three whitelisted methods, and operator scopes (admin, read, write, pairing, approvals). Unknown methods require admin by default, so a newly added API is never exposed by accident.
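A compact way to see why the conservative default matters is a lookup-based check. The following is a simplified sketch under stated assumptions: the METHOD_SCOPES table, the Caller type, and the "example.pair" method name are hypothetical, and the pairing of methods to scopes is invented for illustration.

```typescript
type Scope = "admin" | "read" | "write" | "pairing" | "approvals";
type Caller = { role: "node" | "operator"; scopes: Scope[] };

// The three node-role methods named in the article.
const NODE_ROLE_METHODS = new Set(["node.invoke.result", "node.event", "skills.bins"]);

// Hypothetical table of which scopes satisfy which method.
const METHOD_SCOPES = new Map<string, Scope[]>([
  ["sessions.list", ["read", "write"]], // read-style call; write also suffices
  ["agent", ["write"]],
  ["example.pair", ["pairing"]], // hypothetical method name
]);

function isAllowed(caller: Caller, method: string): boolean {
  // Nodes never fall through to the scope checks.
  if (caller.role === "node") return NODE_ROLE_METHODS.has(method);
  if (caller.scopes.includes("admin")) return true;
  const accepted = METHOD_SCOPES.get(method);
  if (accepted === undefined) return false; // unknown method: admin only
  return accepted.some((scope) => caller.scopes.includes(scope));
}
```

With this shape, forgetting to register a new method fails closed: only admin callers can reach it until someone assigns it a scope.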
// Example of role whitelist
const NODE_ROLE_METHODS = new Set([
  "node.invoke.result", // return call result
  "node.event",         // report event
  "skills.bins",        // list executable skills
]);
Authorization logic (simplified):
❶ node role → only NODE_ROLE_METHODS
❷ operator.admin → allow all
❸ "exec.approvals." prefix → require admin
❹ ADMIN_ONLY_METHODS → require admin
❺ APPROVAL_METHODS → need approvals or write
❻ PAIRING_METHODS → need pairing (write does not override)
❼ READ_METHODS → need read or write
❽ WRITE_METHODS → need write
❾ unknown → require admin (conservative default)
Idempotency and Two‑Layer Deduplication
To make retries safe, OpenClaw makes the idempotency key a required field in the request schema. Deduplication works on two layers:
Layer 1 – context.dedupe: a cross‑request cache storing completed results. If a cached entry exists, the server returns it immediately.
Layer 2 – inflightByContext : a per‑connection WeakMap that merges concurrent identical requests, ensuring the underlying operation runs only once.
// Layer 1 example
const dedupeKey = `agent:${idem}`;
const cached = context.dedupe.get(dedupeKey);
if (cached) {
  respond(cached.ok, cached.payload, cached.error, { cached: true });
  return;
}
// Layer 2 example
const existing = inflight.get(dedupeKey);
if (existing) {
  const result = await existing;
  respond(result.ok, result.payload, result.error);
  return;
}
Both layers write the accepted response immediately (so retries receive the same runId) and later overwrite the cache with the final result or error.
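To see the two layers working together, here is a self-contained sketch. It is an assumption-based illustration: DedupingHandler and its fields are hypothetical, plain Maps stand in for context.dedupe and the inflightByContext WeakMap, and only the final result is cached (the immediate accepted write is left out to keep the example short).

```typescript
type Result = { ok: boolean; payload?: unknown; error?: string };

class DedupingHandler {
  private readonly dedupe = new Map<string, Result>(); // Layer 1: finished results
  private readonly inflight = new Map<string, Promise<Result>>(); // Layer 2: running work
  private readonly run: (key: string) => Promise<Result>;
  executions = 0;

  constructor(run: (key: string) => Promise<Result>) {
    this.run = run;
  }

  // Not declared async on purpose: both lookups and the inflight registration
  // happen synchronously, so two back-to-back calls cannot both start the work.
  handle(idem: string): Promise<Result> {
    const dedupeKey = `agent:${idem}`;

    const cached = this.dedupe.get(dedupeKey);
    if (cached !== undefined) return Promise.resolve(cached);

    const existing = this.inflight.get(dedupeKey);
    if (existing !== undefined) return existing;

    this.executions += 1;
    const pending = this.run(dedupeKey)
      .catch((err: unknown): Result => ({ ok: false, error: String(err) }))
      .then((result) => {
        this.dedupe.set(dedupeKey, result); // failures are cached too
        this.inflight.delete(dedupeKey);
        return result;
      });
    this.inflight.set(dedupeKey, pending);
    return pending;
  }
}
```

Calling handle twice with the same key before the first call settles returns the same promise, and calling it again afterwards returns the cached result without running the operation a second time.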
Event Model and Gap Recovery
Events are not replayed; they carry seq and stateVersion for gap detection. Clients must implement a three‑step recovery after a disconnection:
Reconnect and obtain a fresh hello‑ok snapshot (including snapshot.presence, snapshot.health, and policy).
Pull sessions.list + status + health to synchronize UI state.
If a runId exists without a final result, call agent.wait to fetch the outcome.
This “push‑pull” dual path ensures the UI can recover from network jitter without assuming events are replayed.
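On the client side, the recovery steps reduce to two small pieces: noticing that a seq was skipped, and knowing which calls to issue after reconnecting. The sketch below is illustrative only. GapDetector, recoveryCalls, and the Call type are hypothetical names, it assumes seq grows by exactly one per event and restarts on each connection, and the params shape for agent.wait is a guess.

```typescript
type SeqEvent = { seq: number };
type Call = { method: string; params?: Record<string, unknown> };

class GapDetector {
  private lastSeq: number | null = null;

  // "gap" means at least one event was missed since the last one seen.
  observe(event: SeqEvent): "ok" | "gap" | "stale" {
    const last = this.lastSeq;
    if (last !== null && event.seq <= last) return "stale";
    this.lastSeq = event.seq;
    return last === null || event.seq === last + 1 ? "ok" : "gap";
  }

  // Call after reconnecting (assumption: numbering restarts per connection).
  reset(): void {
    this.lastSeq = null;
  }
}

// Ordered calls for the three recovery steps: fresh handshake, pull state,
// then wait on every run that never delivered a final result.
function recoveryCalls(unfinishedRunIds: string[]): Call[] {
  return [
    { method: "connect" },
    { method: "sessions.list" },
    { method: "status" },
    { method: "health" },
    ...unfinishedRunIds.map((runId) => ({ method: "agent.wait", params: { runId } })),
  ];
}
```

A client would treat a "gap" verdict the same way as a disconnect: stop trusting its local view and run the pull path.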
Schema‑Driven Validation and Error Handling
All inbound messages are validated with TypeBox + AJV. Schemas enforce additionalProperties: false and mark critical fields (e.g., idempotencyKey, runId) as required. Invalid payloads are rejected with HTTP 400 before reaching business logic.
OpenClaw defines five canonical error codes, each returned via errorShape(ErrorCodes.XXX, message) so the frontend can react based on error.code rather than parsing free‑form messages.
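The effect of strict schemas plus coded errors can be shown without any library. What follows is a hand-rolled stand-in, not TypeBox or AJV: StrictSchema, validateStrict, and the local errorShape helper are hypothetical, and INVALID_REQUEST is an invented code name, since the article does not list the five real codes.

```typescript
type FieldType = "string" | "number" | "boolean";
type StrictSchema = { required: Record<string, FieldType>; optional: Record<string, FieldType> };
type ErrorShape = { code: string; message: string };

// Hypothetical code name; the article does not list the real ones.
const INVALID_REQUEST = "INVALID_REQUEST";

// Local stand-in with the (code, message) call shape quoted above.
function errorShape(code: string, message: string): ErrorShape {
  return { code, message };
}

function hasOwn(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Returns null when params pass, otherwise a coded error.
function validateStrict(schema: StrictSchema, input: unknown): ErrorShape | null {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return errorShape(INVALID_REQUEST, "params must be an object");
  }
  const params = input as Record<string, unknown>;
  // Required fields must be present.
  for (const name of Object.keys(schema.required)) {
    if (!hasOwn(params, name)) return errorShape(INVALID_REQUEST, `missing required field: ${name}`);
  }
  // Every field must be declared (the additionalProperties: false rule) and well typed.
  for (const name of Object.keys(params)) {
    const expected = hasOwn(schema.required, name)
      ? schema.required[name]
      : hasOwn(schema.optional, name)
        ? schema.optional[name]
        : undefined;
    if (expected === undefined) return errorShape(INVALID_REQUEST, `unexpected field: ${name}`);
    if (typeof params[name] !== expected) {
      return errorShape(INVALID_REQUEST, `field ${name} must be a ${expected}`);
    }
  }
  return null;
}

// Example schema for the agent call, using the two fields named in the flow above.
const agentParams: StrictSchema = {
  required: { message: "string", idempotencyKey: "string" },
  optional: {},
};
```

A caller can branch on the returned code and show the message only for diagnostics, which is the behaviour the error-code design is meant to encourage.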
12 Invariants for a Robust Control Plane
The article concludes with a checklist that can be copied into design documents:
A connection must complete the handshake first; only connect is allowed before that.
Handshake timeout (default 10 s, close 1008).
Non‑local connections require challenge‑nonce authentication.
All methods undergo schema validation (no extra fields).
Role and scope enforce least‑privilege (operator scopes + node whitelist).
Unknown methods require admin by default.
Side‑effecting methods must carry a mandatory idempotency key.
Long‑running tasks follow the two‑phase pattern (accepted → final, dedupe written twice).
Events carry seq and stateVersion for gap detection.
Push‑pull paths: push events, pull snapshots on reconnection.
Separate health checks (health) from RPC reachability (heartbeat / tick).
Standardized error codes without leaking secrets.
Embedding these invariants as hard constraints rather than best‑practice suggestions makes the system observable, retry‑safe, and operable at scale.
