Cloud‑Native Dynamic Routing & Session Persistence for AI Sandboxes via Web VNC

The article details how the team built a high‑performance, reliable cloud‑native gateway for millions of AI sandbox VNC sessions, addressing challenges of dynamic pod IPs, multi‑stage Web VNC traffic, session consistency, and security by using OpenResty, Lua scripts, Redis‑backed routing, cookie‑based state storage, and extensive Nginx tuning.

Sohu Tech Products

Background and Challenge

As AI agents evolve from pure text generators to autonomous entities that can control browsers and terminals, a fundamental infrastructure problem emerges: how to run them safely while observing their GUI interactions in real time. In a cloud‑native environment each agent is placed in a short‑lived, isolated Kubernetes sandbox, and the graphical desktop is streamed to the browser via Web VNC.

Web VNC Composite Session

The VNC session consists of three tightly coupled stages:

First stage – HTML entry page: the browser requests GET /vnc/37579/; the gateway extracts the sandbox ID and returns the noVNC HTML skeleton.

Second stage – Asset storm: the HTML triggers dozens of parallel sub‑requests for rfb.js, ui.js, CSS, JSON, icons, etc. These URLs (e.g., /vnc/app/ui.js) no longer contain the sandbox ID.

Third stage – WebSocket handshake: the client sends a request to /vnc/websockify with Connection: Upgrade and Upgrade: websocket. The response is 101 Switching Protocols, after which a full‑duplex WebSocket carries mouse/keyboard events and screen updates.

The critical requirement is that **all three stages of a single session must be routed to the same backend pod**; any deviation breaks the VNC connection (e.g., “Connection closed (code: 1006)”).
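To make the three stages concrete, here is a minimal sketch in plain Lua that labels a request by stage. The function name classify_stage and the lowercase-keyed headers table are assumptions made for illustration; they do not come from the original gateway code.

```lua
-- Classify a request into one of the three Web VNC stages.
-- `headers` is a plain table keyed by lowercase header name (an assumption of this sketch).
local function classify_stage(uri, headers)
  local upgrade = headers["upgrade"]
  if upgrade and upgrade:lower() == "websocket" then
    return "websocket"   -- stage 3: the URL carries no sandbox ID
  end
  if uri:match("^/vnc/%d+") then
    return "entry"       -- stage 1: the only stage whose URL carries the sandbox ID
  end
  return "asset"         -- stage 2: static files such as /vnc/app/ui.js
end

print(classify_stage("/vnc/37579/", {}))                             -- entry
print(classify_stage("/vnc/app/ui.js", {}))                          -- asset
print(classify_stage("/vnc/websockify", { upgrade = "websocket" }))  -- websocket
```

Only the first label comes with a routing key in the URL, which is why the other two stages need a separate way to find their pod.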

Why Traditional Solutions Fail

Static upstream blocks in Nginx cannot keep up with the rapid creation and destruction of sandbox pods, leading to frequent reloads and connection drops.

IP‑hash session persistence assumes a stable client IP, an assumption that breaks for developers on mobile networks, users behind corporate NAT, and networks with multiple egress IPs.

Custom Lua plugins that rewrite URLs based on the first request cannot handle the later static‑resource and WebSocket requests that lack the sandbox ID, resulting in 404 errors.

Architecture Breakthrough

Control plane – Sandbox Manager: a dedicated microservice watches Kubernetes pod events, writes mappings such as Sandbox_ID_37579 -> 10.18.124.21:6080 into a high‑availability Redis hash (vnc_sandboxes), and removes entries when pods terminate.

Data plane – OpenResty gateway: an Nginx instance whose access_by_lua scripts run the routing logic early in request processing, without storing any state in the gateway process.
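The control-plane bookkeeping described above can be sketched as a small event handler. This is an illustration under stated assumptions, not the Sandbox Manager's actual code: the event table shape, the function names, the use of the bare numeric ID as the hash field, and the in-memory store standing in for Redis are all hypothetical. Only the hash name vnc_sandboxes and port 6080 come from the article, and ADDED, MODIFIED, and DELETED are the standard Kubernetes watch event types.

```lua
-- In-memory stand-in for the Redis hash, exposing only the commands this sketch needs.
local function new_fake_hash_store()
  local hashes = {}
  local store = {}
  function store:hset(key, field, value)
    hashes[key] = hashes[key] or {}
    hashes[key][field] = value
  end
  function store:hdel(key, field)
    if hashes[key] then hashes[key][field] = nil end
  end
  function store:hget(key, field)
    return hashes[key] and hashes[key][field]
  end
  return store
end

local ROUTE_HASH = "vnc_sandboxes"  -- hash name taken from the article
local VNC_PORT = 6080               -- port seen in the article's example mapping

-- Apply one pod event to the routing table.
-- event = { type = "ADDED" | "MODIFIED" | "DELETED", sandbox_id = string, pod_ip = string or nil }
local function apply_pod_event(store, event)
  if event.type == "DELETED" then
    store:hdel(ROUTE_HASH, event.sandbox_id)
  elseif event.pod_ip and event.pod_ip ~= "" then
    store:hset(ROUTE_HASH, event.sandbox_id, event.pod_ip .. ":" .. VNC_PORT)
  end
  -- Pods that have no IP yet are skipped until a later event carries one.
end

local store = new_fake_hash_store()
apply_pod_event(store, { type = "ADDED", sandbox_id = "37579", pod_ip = "10.18.124.21" })
print(store:hget(ROUTE_HASH, "37579"))  -- 10.18.124.21:6080
apply_pod_event(store, { type = "DELETED", sandbox_id = "37579" })
print(store:hget(ROUTE_HASH, "37579"))  -- nil
```

Keeping all writes on the control-plane side means the gateway only ever reads the hash, which matches the article's claim that the gateway process itself holds no state.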

First‑Packet Driven Routing

When the initial request GET /vnc/37579/ arrives, the Lua script matches the URI against ^/vnc/(\d+)(.*)$, extracts 37579, queries Redis for the backend address, and sets a session cookie:

Set-Cookie: vnc_session_id=37579; Path=/vnc/; HttpOnly

This cookie carries the routing key to every subsequent request.
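A minimal sketch of this first-packet step in plain Lua follows. It uses a Lua pattern that approximates the article's PCRE so the snippet runs outside Nginx; inside OpenResty the same match would more likely go through ngx.re.match, and the cookie would be emitted with ngx.header["Set-Cookie"]. The helper names parse_entry_uri and build_session_cookie are hypothetical.

```lua
-- Extract the sandbox ID and the remaining path from a first-stage URI.
-- Lua-pattern approximation of the article's regex ^/vnc/(\d+)(.*)$.
local function parse_entry_uri(uri)
  return uri:match("^/vnc/(%d+)(.*)$")
end

-- Build the Set-Cookie value shown in the article.
local function build_session_cookie(sandbox_id)
  return "vnc_session_id=" .. sandbox_id .. "; Path=/vnc/; HttpOnly"
end

local id, rest = parse_entry_uri("/vnc/37579/")
print(id, rest)
print(build_session_cookie(id))  -- vnc_session_id=37579; Path=/vnc/; HttpOnly

-- In an access_by_lua handler the header would be set roughly like this:
--   ngx.header["Set-Cookie"] = build_session_cookie(id)
```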

Cookie Fallback for Stateless Requests

Static‑resource and WebSocket upgrade requests lack the ID in the URL. The gateway reads ngx.var.cookie_vnc_session_id, retrieves the same backend IP from Redis, and forwards the request, guaranteeing session‑consistent routing.
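The fallback order can be sketched as a pure function: take the ID from the URL when it is present, otherwise from the cookie, then look up the backend. Several details here are assumptions rather than facts from the article: the function names, the lookup callback standing in for the Redis query, the numeric check on the cookie value, and the choice of 400 when no routing key exists at all.

```lua
-- Pick the routing key: URL first, cookie second.
local function resolve_sandbox_id(uri, cookie_value)
  local id = uri:match("^/vnc/(%d+)")
  if id then
    return id, "uri"
  end
  -- The cookie is client-controlled, so only accept a purely numeric value.
  if cookie_value and cookie_value:match("^%d+$") then
    return cookie_value, "cookie"
  end
  return nil
end

-- Returns the backend address on success, or nil plus an HTTP status on failure.
-- `lookup` is any function mapping a sandbox ID to "ip:port" or nil.
local function route(uri, cookie_value, lookup)
  local id = resolve_sandbox_id(uri, cookie_value)
  if not id then
    return nil, 400  -- no routing key in URL or cookie (status choice is an assumption)
  end
  local backend = lookup(id)
  if not backend then
    return nil, 404  -- sandbox unknown or already torn down
  end
  return backend
end

local routes = { ["37579"] = "10.18.124.21:6080" }
local function lookup(id) return routes[id] end

print(route("/vnc/37579/", nil, lookup))         -- 10.18.124.21:6080
print(route("/vnc/app/ui.js", "37579", lookup))  -- 10.18.124.21:6080
print(route("/vnc/websockify", nil, lookup))     -- nil 400
```

In the gateway, cookie_value would come from ngx.var.cookie_vnc_session_id, as the text above describes.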

Performance Optimizations

To avoid DNS lookups on every request, the Redis configuration is fetched once in the init_worker_by_lua phase and cached in a lua_shared_dict. Connections are pooled with resty.redis by calling red:set_keepalive(10000, 100), which keeps up to 100 idle connections per worker for at most 10 seconds each and prevents ephemeral port exhaustion.

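-- fetch_and_cache_redis_config is defined elsewhere in the module (not shown here).
function _M.init_worker()
  -- Only worker 0 fetches; the result goes into a lua_shared_dict that all workers can read.
  if ngx.worker.id() ~= 0 then
    return
  end
  -- Cosockets are unavailable in init_worker_by_lua itself, hence the zero-delay timer.
  local ok, err = ngx.timer.at(0, fetch_and_cache_redis_config)
  if not ok then
    ngx.log(ngx.ERR, "failed to schedule Redis config fetch: ", err)
  end
end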
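The pooled lookup itself might look like the sketch below. It relies only on lua-resty-redis calls that exist in the library (new, set_timeout, connect, hget, close, set_keepalive), but the function name lookup_backend, the cfg table, and the deps injection are assumptions made so the snippet can be exercised outside Nginx. Inside OpenResty, deps would be { redis = require("resty.redis"), null = ngx.null }.

```lua
-- Look up a sandbox's backend in the vnc_sandboxes hash using a pooled connection.
-- deps.redis : the resty.redis module (or a stub with the same methods)
-- deps.null  : the sentinel Redis returns for a missing field (ngx.null in OpenResty)
-- cfg        : { host = ..., port = ..., timeout_ms = ... } (hypothetical shape)
local function lookup_backend(deps, cfg, sandbox_id)
  local red = deps.redis:new()
  red:set_timeout(cfg.timeout_ms)

  local ok, err = red:connect(cfg.host, cfg.port)
  if not ok then
    return nil, "connect failed: " .. tostring(err)
  end

  local res, herr = red:hget("vnc_sandboxes", sandbox_id)
  if not res then
    -- The connection may be in an unknown state after an error; do not pool it.
    red:close()
    return nil, "hget failed: " .. tostring(herr)
  end

  -- Hand the connection back to the per-worker pool instead of closing it:
  -- 10 s max idle time, up to 100 pooled connections (values from the article).
  red:set_keepalive(10000, 100)

  if res == deps.null then
    return nil, "not found"
  end
  return res
end
```

Calling set_keepalive rather than close is what keeps the gateway from opening a new TCP connection, and burning a new local port, on every request.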

Production Issues and Fixes

A global configuration file unintentionally cleared the Connection header (proxy_set_header Connection ""), which stripped the upgrade signal from WebSocket handshakes. Setting the header in the VNC location from a map variable isolated that location from the polluted global block.

The resolver directive (resolver 186.20.35.21 12.19.21.11;) pointed at DNS servers with stale entries, so the OpenResty Lua client resolved the Redis hostname to an out‑of‑date replica and routing lookups came back as 404. Pointing the resolver at the cluster’s CoreDNS eliminated the mismatch.
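For the first fix, the article does not show the map itself. A common Nginx idiom for WebSocket proxying is map $http_upgrade $connection_upgrade { default upgrade; '' close; }, and the plain-Lua function below mirrors that decision so its behaviour can be checked in isolation. Treat it as an illustration of the idiom, not as the team's actual configuration; the function name is hypothetical.

```lua
-- Decide the Connection header to send upstream, mirroring the common map idiom:
-- a non-empty Upgrade request header yields "upgrade", otherwise "close".
local function connection_header_for(upgrade_header)
  if upgrade_header == nil or upgrade_header == "" then
    return "close"
  end
  return "upgrade"
end

print(connection_header_for("websocket"))  -- upgrade
print(connection_header_for(nil))          -- close
```

With a hard-coded empty Connection header, the first case above can never happen, which is why the handshake never reached 101 Switching Protocols.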

Tuning for VNC Traffic

Disable buffering in the VNC location: proxy_buffering off; to stream frames instantly.

Raise proxy_read_timeout and proxy_send_timeout to 3600s to survive long idle periods.

Enable WebSocket ping/pong (client‑side) to keep the connection alive during inactivity.

Security Hardening

The strict regex ^/vnc/(\d+)(.*)$ accepts only numeric sandbox IDs, so an ID segment carrying a path‑traversal sequence such as ../ is rejected before any lookup takes place. The session cookie is marked HttpOnly and scoped to Path=/vnc/, limiting its exposure to other services.
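A quick way to see what the ID check does and does not cover is to run it over a few hostile inputs. The sketch below again uses a Lua-pattern approximation of the regex. The has_dot_dot_segment helper is a hypothetical extra check that the article does not describe; it is included only to show that the trailing (.*) group is not constrained by the regex itself.

```lua
-- Return the numeric sandbox ID, or nil when the URI does not match.
local function sandbox_id_from_uri(uri)
  return (uri:match("^/vnc/(%d+)(.*)$"))
end

-- Hypothetical extra hardening, not from the article: flag ".." path segments.
local function has_dot_dot_segment(path)
  for segment in path:gmatch("[^/]+") do
    if segment == ".." then
      return true
    end
  end
  return false
end

local samples = {
  "/vnc/37579/",
  "/vnc/../etc/passwd",
  "/vnc/abc/",
  "/vnc/37579/../../secret",
}
for _, uri in ipairs(samples) do
  print(uri, sandbox_id_from_uri(uri), has_dot_dot_segment(uri))
end
```

The last sample still yields an ID of 37579, so whether the remainder of the path needs its own check depends on how the gateway normalizes and forwards the URI.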

Conclusion

State cannot be eliminated for Web VNC sessions; instead, it is offloaded to the client via standard HTTP cookies. This keeps the gateway stateless, horizontally scalable, and able to support tens of thousands of concurrent AI sandbox VNC streams with low latency.

[Figure: Gateway dynamic routing architecture]
[Figure: Web VNC session request sequence diagram]