
Breaking Cloud‑Native Gateway Limits: Routing & Session Persistence for AI Sandboxes

The article details a cloud‑native gateway design that solves the zero‑loss routing and session‑persistence challenges of massive AI sandbox Web VNC streams by dissecting protocol stages, exposing classic gateway pitfalls, and presenting a two‑phase URL‑plus‑cookie routing architecture built on OpenResty, Lua, and Redis.

Sohu Tech Products

As AI agents evolve from text‑only LLMs to autonomous entities that need to manipulate browsers and terminals, a fundamental infrastructure problem emerges: how to safely run them while observing their GUI interactions in real time.

Problem Context

Each agent runs in an isolated, short‑lived Kubernetes sandbox (often a Docker container or Firecracker micro‑VM). To observe the graphical desktop, the team streams the sandbox’s GUI to the browser via Web VNC (RFB protocol) using noVNC and websockify. When the number of sandboxes scales from dozens to tens of thousands, the classic distributed networking challenge appears: a single public gateway must reliably route every composite Web VNC request—HTML, static assets, and WebSocket upgrades—to the correct, dynamically moving pod without any loss.

Why Conventional Solutions Fail

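Traditional upstream configuration (static upstream blocks in nginx.conf) cannot keep up with the rapid creation and destruction of sandbox pods, whose IPs can change from one minute to the next. Reloading Nginx on every change is no fix either: each reload spawns a fresh set of workers, and the old workers either linger until their long-lived WebSocket connections end or, if a shutdown timeout is configured, drop those connections.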

IP-hash session persistence also breaks. Client IPs change (for example, when a phone moves between Wi-Fi and 4G), and large-scale NAT setups can present the same user from several public IPs. In either case the hash no longer points at the pod that holds the user's sandbox, so the gateway cannot tell which pod a request belongs to.
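A toy illustration of why hashing on the client IP cannot pin a user to a pod. The hash function here is a deliberately simplified stand-in, not nginx's actual algorithm (nginx's ip_hash does key on the first three octets of an IPv4 address, but hashes them differently). The pod addresses and client IPs are made up for the example.

```python
def pick_backend(client_ip: str, backends: list[str]) -> str:
    """Toy IP hash: sum the first three octets, then take it modulo the pool size."""
    first_three = [int(octet) for octet in client_ip.split(".")[:3]]
    return backends[sum(first_three) % len(backends)]


# Hypothetical pool of four sandbox pods.
pods = ["10.0.0.1:6080", "10.0.0.2:6080", "10.0.0.3:6080", "10.0.0.4:6080"]

on_wifi = pick_backend("203.0.113.7", pods)     # user on home Wi-Fi
on_mobile = pick_backend("198.51.100.7", pods)  # same user after switching to 4G

print(on_wifi, on_mobile)  # two different pods for one user
```

The second request lands on a pod that knows nothing about the user's sandbox, which is the failure described above. Pod churn makes it worse: changing the pool size reshuffles the mapping for everyone.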

Initial API‑Gateway Attempts

The team tried a custom Lua plugin on APISIX/Kong that extracted the sandbox ID from the first request URL, looked up the real backend IP in Redis, and rewrote the upstream. This worked only for the initial HTML request; subsequent static asset requests and the WebSocket handshake lost the ID, resulting in 404 errors.

Hard-coding the ID into every noVNC asset URL (?sandbox_id=37579) was deemed unsustainable because each noVNC release would require a massive manual patch.

Two‑Phase Routing Design

The final architecture separates control and data planes:

Control Plane – Sandbox Manager: a long-running microservice that watches pod lifecycle events via Kubernetes List/Watch, writes Sandbox_ID → IP:Port mappings into a highly available Redis hash (vnc_sandboxes), and removes entries when pods terminate.

Data Plane – OpenResty Gateway: uses access_by_lua to run Lua scripts in the access phase, before the request is proxied to a backend.
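The control-plane side can be sketched as a small event handler. This is a minimal sketch under stated assumptions, not the team's code: a plain dict stands in for the vnc_sandboxes Redis hash, events are shaped like Kubernetes watch events (type plus object), and the sandbox-id pod label is a hypothetical way of tagging pods with their sandbox ID. Port 6080 is the pod port mentioned later in the article.

```python
SANDBOX_LABEL = "sandbox-id"  # hypothetical pod label that carries the sandbox ID
VNC_PORT = 6080               # pod-internal port mentioned in the article


def apply_event(mappings: dict[str, str], event: dict) -> None:
    """Apply one watch event to the sandbox_id -> "ip:port" mapping."""
    if event.get("type") not in ("ADDED", "MODIFIED", "DELETED"):
        return  # BOOKMARK and ERROR events carry no pod state to record
    pod = event["object"]
    sandbox_id = pod.get("metadata", {}).get("labels", {}).get(SANDBOX_LABEL)
    if sandbox_id is None:
        return  # not a sandbox pod
    pod_ip = pod.get("status", {}).get("podIP")
    if event["type"] == "DELETED" or not pod_ip:
        # In the real service this would be an HDEL on the Redis hash.
        mappings.pop(sandbox_id, None)
    else:
        # In the real service this would be an HSET on the Redis hash.
        mappings[sandbox_id] = f"{pod_ip}:{VNC_PORT}"


vnc_sandboxes: dict[str, str] = {}
pod = {"metadata": {"labels": {"sandbox-id": "37579"}}, "status": {"podIP": "10.1.2.3"}}
apply_event(vnc_sandboxes, {"type": "ADDED", "object": pod})
```

A pod that has not been assigned an IP yet is treated as absent, so the gateway never routes to a half-started sandbox.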

The routing decision follows a "first‑packet‑driven + state‑stored" chain:

When the initial request GET /vnc/37579/ arrives, a Lua regex ^/vnc/(\d+)(.*)$ extracts the sandbox ID.

The script queries Redis for the backend IP, a lookup that takes on the order of milliseconds, and forwards the request.

Before sending the HTML response, the script injects a cookie that carries the ID (as written, with no Expires or Max-Age attribute, it lasts for the browser session):

Set-Cookie: vnc_session_id=37579; Path=/vnc/; HttpOnly

All subsequent static asset and WebSocket upgrade requests lack the ID in the URL, so the script falls back to the cookie, extracts the ID, re‑queries Redis, and routes the request to the same pod.
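The decision chain above can be sketched in a few lines. This is an illustrative Python model of the logic, not the team's Lua: the regex, cookie name, and cookie attributes come from the article, while the function names, the dict standing in for the Redis hash, and the example asset path are assumptions.

```python
import re
from http.cookies import SimpleCookie
from typing import Optional

# Regex quoted in the article; re.ASCII limits \d to 0-9.
URL_PATTERN = re.compile(r"^/vnc/(\d+)(.*)$", re.ASCII)
COOKIE_NAME = "vnc_session_id"


def resolve_sandbox_id(path: str, cookie_header: str) -> tuple[Optional[str], bool]:
    """Return (sandbox_id, came_from_url); the URL wins, the cookie is the fallback."""
    match = URL_PATTERN.match(path)
    if match:
        return match.group(1), True
    cookies = SimpleCookie()
    cookies.load(cookie_header)
    morsel = cookies.get(COOKIE_NAME)
    if morsel is not None and re.fullmatch(r"[0-9]+", morsel.value):
        return morsel.value, False
    return None, False


def route(path: str, cookie_header: str, mappings: dict[str, str]):
    """Return (backend, extra_response_headers); backend is None when unroutable."""
    sandbox_id, came_from_url = resolve_sandbox_id(path, cookie_header)
    backend = mappings.get(sandbox_id) if sandbox_id else None
    if backend is None:
        return None, {}
    headers = {}
    if came_from_url:
        # First request of the session: plant the cookie for later requests.
        headers["Set-Cookie"] = f"{COOKIE_NAME}={sandbox_id}; Path=/vnc/; HttpOnly"
    return backend, headers
```

With a mapping of {"37579": "10.1.2.3:6080"}, a call such as route("/vnc/37579/", "", mappings) resolves through the URL and returns a Set-Cookie header, while route("/vnc/websockify", "vnc_session_id=37579", mappings) resolves through the cookie and reaches the same backend. A request with neither an ID in the URL nor a valid cookie is unroutable.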

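This "URL probe, Cookie guard" mechanism keeps every request of a session on the same pod, provided the browser returns the cookie and the Redis mapping is current, while the gateway itself holds no per-session state. One caveat follows from the cookie's scope: because it is set for /vnc/ rather than for a single sandbox, opening a second sandbox in the same browser overwrites the first sandbox's ID.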

Performance Optimisations

To keep per-request work to a minimum, the team moved configuration fetching into the init_worker_by_lua phase. Because cosockets cannot be used directly in that phase, the fetch runs inside a non-blocking timer, which uses lua-resty-http to retrieve the Redis credentials and caches them in a shared memory dictionary (lua_shared_dict). Workers then read the cache on each request instead of all fetching the credentials themselves, which eliminates the thundering-herd effect.
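The load-once pattern can be modeled as follows. This is a minimal Python sketch of the idea only: the class, its method names, and the TTL are hypothetical, and an in-process object stands in for the shared dictionary. The key property is that the request path never calls the loader; only the background refresh does.

```python
import time
from typing import Callable, Optional


class SharedConfigCache:
    """In-process stand-in for a shared dict: one loader call, many cheap reads."""

    def __init__(self, loader: Callable[[], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[dict] = None
        self._expires_at = 0.0

    def refresh(self) -> None:
        """Meant to be called from a background timer, not from request handling."""
        self._value = self._loader()
        self._expires_at = self._clock() + self._ttl

    def get(self) -> Optional[dict]:
        """Request path: never calls the loader; returns None if empty or stale."""
        if self._value is None or self._clock() >= self._expires_at:
            return None
        return self._value
```

A request that finds the cache empty or stale can fail fast or fall back to a default, rather than triggering a fetch of its own. That is what keeps a burst of requests from turning into a burst of configuration fetches.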

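Connection pooling is enforced with red:set_keepalive(10000, 100), which returns the Redis connection to a per-worker pool (10-second maximum idle time, pool size of 100) instead of closing the socket. This avoids exhausting ephemeral ports under high load.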
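What the two arguments control can be shown with a simplified pool model. This sketch is an assumption-laden illustration in Python, not lua-resty-redis internals: the class and method names are hypothetical, and the full-pool behavior (refusing the connection) is a simplification.

```python
import time
from collections import deque
from typing import Callable


class KeepalivePool:
    """Reuse idle connections up to pool_size, discarding ones idle for too long."""

    def __init__(self, connect: Callable[[], object], pool_size: int = 100,
                 max_idle_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._connect = connect
        self._pool_size = pool_size
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._idle: deque = deque()  # (connection, time it was released)

    def acquire(self) -> object:
        while self._idle:
            conn, released_at = self._idle.pop()  # most recently released first
            if self._clock() - released_at < self._max_idle:
                return conn
            # Idle for too long: drop it and keep looking.
        return self._connect()  # nothing reusable, open a new connection

    def release(self, conn: object) -> bool:
        """Return True if pooled; False means the pool is full and the caller should close."""
        if len(self._idle) >= self._pool_size:
            return False
        self._idle.append((conn, self._clock()))
        return True
```

Under steady load almost every acquire reuses an existing socket, so the number of sockets opened tracks concurrency rather than request rate.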

Web VNC streams require disabling Nginx buffering (proxy_buffering off;) to avoid frame latency. Proxy timeouts are raised to 3600s, and WebSocket ping/pong frames are sent every 15 seconds to keep idle connections alive.

Debugging Hard‑to‑Find Issues

During production rollout, the team observed intermittent Connection closed (code: 1006) errors, the code a browser reports when a WebSocket closes abnormally without a close frame. Packet captures showed the HTML page was served, but the WebSocket handshake failed. A deep dive revealed a global Nginx config file that cleared the Connection header, so the Upgrade and Connection headers that the location /vnc/ block needed to pass to the backend never arrived intact.
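Why a cleared Connection header is fatal can be checked against the handshake rules in RFC 6455: the request must carry an Upgrade header naming websocket and a Connection header that includes the upgrade token. The function below is a hypothetical diagnostic helper covering only those two headers (a full handshake also needs Sec-WebSocket-Key and Sec-WebSocket-Version).

```python
def is_websocket_upgrade(headers: dict[str, str]) -> bool:
    """Check the two headers a backend needs to see to accept a WebSocket upgrade."""
    normalized = {name.lower(): value for name, value in headers.items()}
    connection_tokens = {
        token.strip().lower()
        for token in normalized.get("connection", "").split(",")
    }
    upgrade_value = normalized.get("upgrade", "").strip().lower()
    return "upgrade" in connection_tokens and upgrade_value == "websocket"


# What the browser sends.
from_browser = {"Upgrade": "websocket", "Connection": "keep-alive, Upgrade"}
# What the backend sees if a proxy layer clears the Connection header.
after_proxy = {"Upgrade": "websocket", "Connection": ""}

print(is_websocket_upgrade(from_browser), is_websocket_upgrade(after_proxy))
```

The backend treats the second request as ordinary HTTP, answers without switching protocols, and the browser reports the abnormal closure.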

Another subtle bug stemmed from a stale DNS resolver configuration in nginx.conf. The resolver pointed to an old internal DNS server, causing the gateway to resolve the Redis hostname to a decommissioned replica. Lookups against that replica found no sandbox mappings, so the gateway answered those requests with 404.

Security Hardening

To prevent path-traversal attacks, the Lua regex only accepts numeric IDs (\d+). The session cookie is marked HttpOnly, which hides it from page scripts, and is scoped to Path=/vnc/, limiting its exposure to other services.

Network isolation is enforced by ensuring sandbox pods have no public Service; all traffic must pass through the OpenResty gateway, which validates the cookie and performs strict ID matching before forwarding to the pod’s internal 6080 port.
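Strict ID matching is easy to get subtly wrong, so here is a small sketch of the validation step in Python. The function name is hypothetical. Two details matter in Python specifically: fullmatch is used because a trailing $ would still accept a value that ends in a newline, and [0-9] is used because \d also matches non-ASCII digits in Python string patterns.

```python
import re
from typing import Optional

SANDBOX_ID = re.compile(r"[0-9]+")


def parse_sandbox_id(raw: str) -> Optional[str]:
    """Return the ID only if the whole value is ASCII digits, else None."""
    return raw if SANDBOX_ID.fullmatch(raw) else None


candidates = ["37579", "../../etc/passwd", "37579\n", "37579/../1", ""]
accepted = [value for value in candidates if parse_sandbox_id(value) is not None]
print(accepted)  # only the plain numeric ID survives
```

Anything that fails validation should be rejected before the Redis lookup, so untrusted input never reaches the key used to select a backend.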

Conclusion

The journey demonstrates that for stateful protocols like Web VNC, true statelessness is impossible; instead, the state must be deliberately moved to the most appropriate layer—in this case, the client browser via an HTTP cookie. By leveraging low‑level OpenResty capabilities and a clear control‑plane/data‑plane split, the team built a highly scalable, observable, and secure gateway that can support tens of thousands of AI sandbox sessions.

