
Why OpenClaw’s Control Plane Uses a Two‑Phase Protocol and runId for Reliable Agent Jobs

The article explains how OpenClaw’s control plane guarantees reliable, idempotent, and observable agent execution by enforcing a two‑phase protocol, strict handshake, role‑based authorization, layered deduplication, gap‑recovery mechanisms, and schema‑driven validation, turning a simple message flow into a production‑grade job system.

Architect

Feb 23, 2026


Background and Core Problem

OpenClaw’s earlier overview showed how messages travel, but real‑world usage revealed gaps: lost state on disconnections, unsafe retries, and long‑running tasks blocking the gateway. The fundamental issue is that being able to run a job does not mean it runs reliably – the challenge lies in the control plane (L2).

Why the Control Plane Is a Two‑Phase Protocol

The control plane is not a UI; it is a protocol contract that defines who can connect, which methods they may call, and which events they can subscribe to.

The first frame a client sends over the WebSocket must be connect. For non‑local connections the server first pushes connect.challenge (a nonce), which the connect request must answer. The client must complete authentication within 10 seconds, or the server closes the connection with code 1008.

Agent commands follow a two‑phase flow: first accepted(runId) (immediate hard‑ack) and later final (ok/error). This separates acceptance from execution, preventing gateway thread blockage.

connect.challenge(nonce)          ← server pushes nonce
connect(req, auth, device)       → hello‑ok(methods, events, snapshot, policy)
agent(message, idempotencyKey)    → accepted(runId)
event:agent(streaming, seq)       ← async push
agent(final)                      → ok/error(runId, summary)

The protocol defines three frame types (Request, Response, Event) via a type discriminator in GatewayFrameSchema.
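As a rough illustration of that discriminator, the TypeScript sketch below models the three frame kinds as a discriminated union. Only the three kinds and the type field come from the text; the literal values ("req", "res", "event") and all other field names are assumptions made for this example, not the actual GatewayFrameSchema.

```typescript
// Hypothetical frame shapes: only the three kinds and the `type`
// discriminator are taken from the article.
type RequestFrame = { type: "req"; id: string; method: string; params?: unknown };
type ResponseFrame = { type: "res"; id: string; ok: boolean; payload?: unknown };
type EventFrame = { type: "event"; event: string; seq?: number; payload?: unknown };
type GatewayFrame = RequestFrame | ResponseFrame | EventFrame;

// Parse a raw text frame. This sketch checks the discriminator only;
// a real gateway would validate the whole frame against its schema.
function parseFrame(raw: string): GatewayFrame | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof value !== "object" || value === null) return null;
  const type = (value as { type?: unknown }).type;
  if (type === "req" || type === "res" || type === "event") {
    return value as GatewayFrame;
  }
  return null; // unknown frame kind
}

// Switching on `type` lets the compiler narrow each branch to one frame kind.
function describeFrame(frame: GatewayFrame): string {
  switch (frame.type) {
    case "req":
      return `request ${frame.method} (#${frame.id})`;
    case "res":
      return `response #${frame.id}: ${frame.ok ? "ok" : "error"}`;
    case "event":
      return `event ${frame.event}` + (frame.seq !== undefined ? ` seq=${frame.seq}` : "");
  }
}
```

Because every frame carries the same discriminator field, a receiver can route on type before it knows anything else about the payload.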

Connection Reliability (Handshake)

OpenClaw enforces three rules to keep the connection trustworthy:

1. The first frame must be connect. Any other method before a successful handshake causes the server to close the socket.

2. A 10‑second handshake timeout is hard‑coded; exceeding it results in close(1008, "handshake timeout").

3. For non‑local connections the server sends a connect.challenge containing a random nonce and timestamp. The client must sign this nonce; old signatures cannot be reused, preventing replay attacks.

Authentication proceeds in a fixed priority order: trusted‑proxy → rate‑limiting → Tailscale verification → token/password check.
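The handshake rules can be sketched as a small state machine. This is a minimal sketch, not OpenClaw's implementation: HandshakeGate, GateAction, and SignatureCheck are hypothetical names, signature verification is delegated to a caller-supplied function, and using close code 1008 for failures other than the timeout is an assumption (the text states 1008 only for the timeout).

```typescript
const HANDSHAKE_TIMEOUT_MS = 10 * 1000; // 10-second limit from the article

type GateAction =
  | { kind: "proceed" }
  | { kind: "close"; code: number; reason: string };

// Hypothetical hook: returns true if `signature` is valid for `nonce`.
type SignatureCheck = (nonce: string, signature: string) => boolean;

class HandshakeGate {
  private readonly openedAt: number;
  private readonly nonce: string;
  private readonly verify: SignatureCheck;
  private connected = false;

  constructor(openedAt: number, nonce: string, verify: SignatureCheck) {
    this.openedAt = openedAt;
    this.nonce = nonce; // fresh per connection, so old signatures never match
    this.verify = verify;
  }

  // Timeout check, e.g. driven by a timer. It only applies before connect succeeds.
  onTick(now: number): GateAction {
    if (!this.connected && now - this.openedAt >= HANDSHAKE_TIMEOUT_MS) {
      return { kind: "close", code: 1008, reason: "handshake timeout" };
    }
    return { kind: "proceed" };
  }

  // Called for every inbound request frame.
  onFrame(method: string, now: number, signature: string = ""): GateAction {
    const tick = this.onTick(now);
    if (tick.kind === "close") return tick;
    if (this.connected) return { kind: "proceed" };
    if (method !== "connect") {
      // Assumed close code; the article only says the socket is closed.
      return { kind: "close", code: 1008, reason: "first frame must be connect" };
    }
    if (!this.verify(this.nonce, signature)) {
      return { kind: "close", code: 1008, reason: "invalid nonce signature" };
    }
    this.connected = true;
    return { kind: "proceed" };
  }
}
```

Passing the clock in as a parameter keeps the gate free of timers, which makes the timeout rule easy to exercise in tests.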

Strict Role‑Based Authorization

Method access is governed by a combination of the node role (limited to three whitelisted methods) and operator scopes (admin, read, write, pairing, approvals). Unknown methods require admin by default, so a newly added API is never exposed by accident.

// Example of role whitelist
const NODE_ROLE_METHODS = new Set([
  "node.invoke.result", // return call result
  "node.event",          // report event
  "skills.bins"          // list executable skills
]);

Authorization logic (simplified):

❶ node role → only NODE_ROLE_METHODS
❷ operator.admin → allow all
❸ "exec.approvals." prefix → require admin
❹ ADMIN_ONLY_METHODS → require admin
❺ APPROVAL_METHODS → need approvals or write
❻ PAIRING_METHODS → need pairing (write does not override)
❼ READ_METHODS → need read or write
❽ WRITE_METHODS → need write
❾ unknown → require admin (conservative default)
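The nine checks above can be read as a single ordered function. The sketch below follows that order; NODE_ROLE_METHODS is taken from the article, while the members of the other sets are placeholders chosen for illustration (the article names the sets but not their contents), and the authorize signature is hypothetical.

```typescript
type Role = "node" | "operator";
type Scope = "admin" | "read" | "write" | "pairing" | "approvals";

const NODE_ROLE_METHODS = new Set(["node.invoke.result", "node.event", "skills.bins"]);
// Placeholder members: assumptions for this example only.
const ADMIN_ONLY_METHODS = new Set(["config.set"]);
const APPROVAL_METHODS = new Set(["approval.resolve"]);
const PAIRING_METHODS = new Set(["pairing.approve"]);
const READ_METHODS = new Set(["sessions.list", "status", "health"]);
const WRITE_METHODS = new Set(["agent"]);

function authorize(role: Role, scopes: ReadonlySet<Scope>, method: string): boolean {
  if (role === "node") return NODE_ROLE_METHODS.has(method); // 1: node whitelist only
  if (scopes.has("admin")) return true; // 2: admin may call anything
  // From here on the caller is a non-admin operator.
  if (method.startsWith("exec.approvals.")) return false; // 3: admin only
  if (ADMIN_ONLY_METHODS.has(method)) return false; // 4: admin only
  if (APPROVAL_METHODS.has(method)) {
    return scopes.has("approvals") || scopes.has("write"); // 5
  }
  if (PAIRING_METHODS.has(method)) return scopes.has("pairing"); // 6: write does not help
  if (READ_METHODS.has(method)) {
    return scopes.has("read") || scopes.has("write"); // 7
  }
  if (WRITE_METHODS.has(method)) return scopes.has("write"); // 8
  return false; // 9: unknown methods need admin
}
```

The final return false is what makes the default conservative: a method that nobody has classified yet is reachable only through the admin branch.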

Idempotency and Two‑Layer Deduplication

To make retries safe, OpenClaw makes the idempotency key a required field in the request schema. Deduplication works on two layers:

Layer 1 – context.dedupe: a cross‑request cache storing completed results. If a cached entry exists, the server returns it immediately.

Layer 2 – inflightByContext : a per‑connection WeakMap that merges concurrent identical requests, ensuring the underlying operation runs only once.

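// Layer 1 example: a repeated idempotency key is answered from the cache
// (idem is assumed to hold the request's idempotencyKey)
const dedupeKey = `agent:${idem}`;
const cached = context.dedupe.get(dedupeKey);
if (cached) {
  respond(cached.ok, cached.payload, cached.error, { cached: true });
  return;
}

// Layer 2 example: an identical request that is still running is joined, not re-run
const existing = inflight.get(dedupeKey);
if (existing) {
  const result = await existing; // wait for the in-flight promise
  respond(result.ok, result.payload, result.error);
  return;
}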

Both layers write the accepted response immediately (so retries receive the same runId) and later overwrite the cache with the final result or error.
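To show how the two writes and the in-flight merge fit together, here is a self-contained sketch under simplifying assumptions: AgentJobs, submit, and the response shapes are hypothetical names, plain Map objects stand in for context.dedupe and the per-context WeakMap, and cache expiry is left out.

```typescript
type Accepted = { status: "accepted"; runId: string };
type Final = { status: "ok" | "error"; runId: string; summary: string };
type Submission = { first: Accepted | Final; final: Promise<Final> };

class AgentJobs {
  private readonly dedupe = new Map<string, Accepted | Final>(); // layer 1
  private readonly inflight = new Map<string, Promise<Final>>(); // layer 2
  private counter = 0;

  submit(idem: string, work: (runId: string) => Promise<string>): Submission {
    const key = `agent:${idem}`;
    const cached = this.dedupe.get(key);
    // Layer 1: a finished run is answered straight from the cache.
    if (cached !== undefined && cached.status !== "accepted") {
      return { first: cached, final: Promise.resolve(cached) };
    }
    // Layer 2: a retry during execution joins the running promise.
    const running = this.inflight.get(key);
    if (cached !== undefined && running !== undefined) {
      return { first: cached, final: running };
    }
    const runId = `run-${++this.counter}`;
    const accepted: Accepted = { status: "accepted", runId };
    this.dedupe.set(key, accepted); // write 1: retries see the same runId
    const final = work(runId)
      .then((summary): Final => ({ status: "ok", runId, summary }))
      .catch((err: unknown): Final => ({ status: "error", runId, summary: String(err) }))
      .then((result) => {
        this.dedupe.set(key, result); // write 2: overwrite with the outcome
        this.inflight.delete(key);
        return result;
      });
    this.inflight.set(key, final);
    return { first: accepted, final };
  }
}
```

In this sketch the caller can send first to the client immediately as the hard acknowledgement and deliver final whenever the promise settles, so the accept path never waits on the job itself.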

Event Model and Gap Recovery

Events are not replayed; they carry seq and stateVersion for gap detection. Clients must implement a three‑step recovery after a disconnection:

1. Reconnect and obtain a fresh hello‑ok snapshot (including snapshot.presence, snapshot.health, and policy).

2. Pull sessions.list + status + health to synchronize UI state.

3. If a runId exists without a final result, call agent.wait to fetch the outcome.

This “push‑pull” dual path ensures the UI can recover from network jitter without assuming events are replayed.
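On the client side, gap detection and the pull half of recovery might look like the sketch below. It assumes a single sequence counter that increases by one per event and restarts with each connection, which the article does not specify; EventTracker, the final flag on events, and the agent.wait parameter shape are likewise hypothetical.

```typescript
type AgentEvent = { runId: string; seq: number; final?: boolean };
type Call = { method: string; params?: { runId: string } };

class EventTracker {
  private lastSeq: number | null = null;
  private readonly pendingRuns = new Set<string>();

  // Record a runId as soon as the accepted response arrives.
  noteAccepted(runId: string): void {
    this.pendingRuns.add(runId);
  }

  // Returns true when at least one event was missed before this one.
  onEvent(ev: AgentEvent): boolean {
    const gap = this.lastSeq !== null && ev.seq > this.lastSeq + 1;
    if (this.lastSeq === null || ev.seq > this.lastSeq) this.lastSeq = ev.seq;
    if (ev.final === true) this.pendingRuns.delete(ev.runId);
    return gap;
  }

  // Call after reconnecting and applying the fresh hello-ok snapshot.
  // Returns the pull requests to issue: state queries first, then one
  // agent.wait per run that never delivered a final result.
  onReconnected(): Call[] {
    this.lastSeq = null; // assumed: numbering restarts on a new connection
    const calls: Call[] = [
      { method: "sessions.list" },
      { method: "status" },
      { method: "health" },
    ];
    this.pendingRuns.forEach((runId) => {
      calls.push({ method: "agent.wait", params: { runId } });
    });
    return calls;
  }
}
```

A detected gap is treated here as a signal to re-pull state rather than a request for replay, which matches the rule that events are never resent.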

Schema‑Driven Validation and Error Handling

All inbound messages are validated with TypeBox + AJV. Schemas enforce additionalProperties: false and mark critical fields (e.g., idempotencyKey, runId) as required. Invalid payloads are rejected as bad requests (the equivalent of an HTTP 400) before they reach business logic.
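The effect of those two schema rules can be shown without any dependencies. The sketch below is a hand-rolled stand-in, not TypeBox or AJV: validateStrict and FlatSchema are hypothetical names, and the schema covers only the two agent fields that appear in the flow earlier in the article.

```typescript
type FieldType = "string" | "number" | "boolean";
// Every listed field is required; anything not listed is rejected.
type FlatSchema = Record<string, FieldType>;

const hasOwn = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Returns a list of problems; an empty list means the payload is valid.
function validateStrict(schema: FlatSchema, value: unknown): string[] {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return ["payload must be an object"];
  }
  const payload = value as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of Object.keys(schema)) {
    if (!hasOwn(payload, key)) {
      errors.push(`missing required field: ${key}`);
    } else if (typeof payload[key] !== schema[key]) {
      errors.push(`wrong type for field: ${key}`);
    }
  }
  for (const key of Object.keys(payload)) {
    // Mirrors additionalProperties: false.
    if (!hasOwn(schema, key)) errors.push(`unexpected field: ${key}`);
  }
  return errors;
}

// Fields taken from the agent(message, idempotencyKey) call in the article.
const AgentParamsSchema: FlatSchema = {
  message: "string",
  idempotencyKey: "string",
};
```

Rejecting unknown fields means a client typo such as idempotencykey fails loudly instead of silently disabling deduplication.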

OpenClaw defines five canonical error codes, each returned via errorShape(ErrorCodes.XXX, message) so the frontend can react based on error.code rather than parsing free‑form messages.
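A sketch of that pattern follows. The article does not list the five codes, so the three names below are illustrative placeholders, and the ErrorShape fields and clientReaction helper are assumptions made for this example.

```typescript
// Placeholder codes: illustrative only, not the article's list of five.
const ErrorCodes = {
  INVALID_REQUEST: "INVALID_REQUEST",
  UNAVAILABLE: "UNAVAILABLE",
  AGENT_TIMEOUT: "AGENT_TIMEOUT",
} as const;

type ErrorCode = (typeof ErrorCodes)[keyof typeof ErrorCodes];
type ErrorShape = { code: ErrorCode; message: string };

// Build the error payload in one place so every handler returns the same shape.
function errorShape(code: ErrorCode, message: string): ErrorShape {
  return { code, message };
}

type Reaction = "fix-request" | "retry-later" | "poll-agent-wait";

// The client branches on the stable code, never on the message text.
function clientReaction(error: ErrorShape): Reaction {
  switch (error.code) {
    case "INVALID_REQUEST":
      return "fix-request";
    case "UNAVAILABLE":
      return "retry-later";
    case "AGENT_TIMEOUT":
      return "poll-agent-wait";
  }
}
```

Because the switch is exhaustive over ErrorCode, adding a code to ErrorCodes makes the compiler flag every client branch that has not yet handled it.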

12 Invariants for a Robust Control Plane

The article concludes with a checklist that can be copied into design documents:

1. Only connect is allowed until the handshake succeeds.

2. The handshake times out after 10 s (close 1008).

3. Non‑local connections require challenge‑nonce authentication.

4. All methods undergo schema validation (no extra fields).

5. Role and scope enforce least privilege (operator scopes + node whitelist).

6. Unknown methods are denied unless the caller has admin.

7. Side‑effecting methods must carry an idempotency key, enforced as a required schema field.

8. Long‑running tasks follow the two‑phase pattern (accepted → final, dedupe written twice).

9. Events carry seq and stateVersion for gap detection.

10. Push‑pull paths: push events, pull snapshots on reconnection.

11. Separate health checks (health) from RPC reachability (heartbeat / tick).

12. Standardized error codes without leaking secrets.

Embedding these invariants as hard constraints rather than best‑practice suggestions makes the system observable, retry‑safe, and operable at scale.

