Building a Million-Scale WebSocket Gateway: Architecture, Optimization & Performance
This article details the design, refactoring, and performance testing of a high‑traffic WebSocket gateway for Shimo Docs, covering the evolution from a Node.js Socket.IO version to a Go‑based microservice architecture, TLS memory tuning, socket ID generation, heartbeat handling, custom Kafka headers, and resource‑efficient scaling to half‑a‑million concurrent connections.
Content summary: Web service push technology evolved from long‑polling and short‑polling to the HTML5 WebSocket standard, which simplifies message push and notification. How to implement a WebSocket gateway that supports millions of connections? This article shares the practical experience of Shimo Docs senior engineer Du Minxiang based on the reconstruction of Shimo's WebSocket service.
1 Introduction
In Shimo Docs, features such as document sharing, comments, slide presentations and spreadsheet sync require multi‑client data synchronization and server‑side push, which cannot be satisfied by short‑ or long‑polling, so the industry‑standard solution based on the HTML5 WebSocket specification was chosen.
As daily peak connections grew to millions, memory and CPU usage surged, prompting a gateway redesign.
2 Gateway 1.0
Gateway 1.0 was built with Node.js and Socket.IO and met the early traffic requirements.
2.1 Architecture
Architecture diagram:
Client connection flow:
User connects through NGINX; the event is observed by business services.
Business service queries user data and publishes a message to Redis.
Gateway subscribes to Redis and receives the message.
Gateway looks up the user session in the cluster and pushes the message to the client.
2.2 Pain points
Resource waste: NGINX is used only to terminate TLS and pass requests through, yet it adds CPU and memory overhead; the Node.js gateway is also resource‑heavy.
Lack of monitoring integration.
Business‑gateway coupling prevents targeted horizontal scaling.
3 Gateway 2.0
Gateway 2.0 separates the gateway function (WS‑Gateway) from business processing (WS‑API). WS‑Gateway handles authentication, TLS, and WebSocket management; WS‑API communicates with component services via gRPC, enabling module‑level scaling and reduced hardware consumption.
3.1 Overall architecture
Architecture diagram:
Client connection flow:
Client completes WebSocket handshake with WS‑Gateway.
WS‑Gateway stores the session, caches the connection mapping in Redis, and publishes an online event to Kafka.
WS‑API consumes the online event from Kafka.
WS‑API retrieves necessary data from Redis, applies filtering, and publishes the final message to Kafka.
WS‑Gateway subscribes to Kafka, receives the message, and pushes it to the appropriate client.
3.2 Handshake process
In good network conditions the client follows steps 1‑6 to enter WebSocket mode; in poor conditions the connection degrades to HTTP long‑polling.
1. Client sends the initial handshake request; server returns the session parameters: {"sid":"xxx","upgrades":["websocket"],"pingInterval":xxx,"pingTimeout":xxx}
2. Client resends the request, carrying the sid from step 1.
3. Server returns 40 to acknowledge the request.
4. Client sends a POST request to confirm the downgrade path.
5. Server returns ok, completing the first phase of the handshake.
6. Client attempts the WebSocket connection, which is established once the 2probe/3probe exchange succeeds.
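As a rough illustration of the first handshake step, the sketch below decodes a response of the shape shown above and checks whether a WebSocket upgrade is offered. The `OpenPacket` type, the `ParseOpenPacket` helper, and the sample values are assumptions made for this example (the source elides the real values as xxx); the intervals are assumed to be in milliseconds.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// OpenPacket mirrors the JSON fields of the handshake response.
// The type name is hypothetical; only the field names come from the text.
type OpenPacket struct {
	SID          string   `json:"sid"`
	Upgrades     []string `json:"upgrades"`
	PingInterval int64    `json:"pingInterval"` // assumed to be milliseconds
	PingTimeout  int64    `json:"pingTimeout"`  // assumed to be milliseconds
}

// ParseOpenPacket decodes the handshake body and reports whether the
// server offers an upgrade to WebSocket.
func ParseOpenPacket(body []byte) (OpenPacket, bool, error) {
	var p OpenPacket
	if err := json.Unmarshal(body, &p); err != nil {
		return OpenPacket{}, false, err
	}
	for _, u := range p.Upgrades {
		if u == "websocket" {
			return p, true, nil
		}
	}
	return p, false, nil
}

func main() {
	// Illustrative values only.
	body := []byte(`{"sid":"abc123","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":5000}`)
	p, canUpgrade, err := ParseOpenPacket(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.SID, canUpgrade, time.Duration(p.PingInterval)*time.Millisecond)
}
```

A client that gets `false` back would stay on long‑polling, which corresponds to the degraded path described above.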
3.3 TLS memory optimization
In version 1.0 TLS termination was done by NGINX, consuming ~30% of total memory. In 2.0 the certificate is mounted on the service itself, reducing NGINX load.
Two options to further reduce memory:
Move TLS termination to a layer‑7 load balancer.
Improve Go TLS handshake performance (refer to Go PR #43563 and related benchmarks).
3.4 Socket ID design
Each connection gets a unique ID generated by the SnowFlake algorithm. In physical‑machine deployments the machine number guarantees uniqueness; in Kubernetes a registration service assigns the machine ID, which is stored in a database and reused after restarts.
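A minimal Snowflake‑style generator might look like the sketch below. The bit layout (10 machine bits, 12 sequence bits), the custom epoch, and all names are assumptions for illustration and are not taken from the Shimo implementation; the machine ID would be the value handed out by the registration service. When the per‑millisecond sequence overflows, this sketch advances a logical timestamp instead of waiting for the clock, which is one of several possible choices.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

const (
	machineBits   = 10            // assumed layout
	sequenceBits  = 12            // assumed layout
	maxMachineID  = 1<<machineBits - 1
	maxSequence   = 1<<sequenceBits - 1
	customEpochMs = 1600000000000 // arbitrary epoch chosen for this sketch
)

// Generator produces unique, increasing 64-bit IDs for one machine ID.
type Generator struct {
	mu        sync.Mutex
	machineID int64
	lastMs    int64
	seq       int64
}

// NewGenerator validates the machine ID assigned to this instance.
func NewGenerator(machineID int64) (*Generator, error) {
	if machineID < 0 || machineID > maxMachineID {
		return nil, errors.New("machine ID out of range")
	}
	return &Generator{machineID: machineID}, nil
}

// Next returns the next ID: timestamp | machine ID | sequence.
func (g *Generator) Next() int64 {
	g.mu.Lock()
	defer g.mu.Unlock()

	now := time.Now().UnixMilli()
	if now < g.lastMs {
		now = g.lastMs // clock went backwards or logical time is ahead
	}
	if now == g.lastMs {
		g.seq = (g.seq + 1) & maxSequence
		if g.seq == 0 {
			now = g.lastMs + 1 // sequence exhausted: move to the next logical ms
		}
	} else {
		g.seq = 0
	}
	g.lastMs = now
	return (now-customEpochMs)<<(machineBits+sequenceBits) | g.machineID<<sequenceBits | g.seq
}

func main() {
	g, err := NewGenerator(7)
	if err != nil {
		panic(err)
	}
	fmt.Println(g.Next(), g.Next())
}
```

Because the machine ID occupies its own bit range, two gateway instances with different machine IDs cannot produce the same socket ID, which is why a reused ID after a restart must belong to the same instance slot.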
3.5 Cluster session management – event broadcast
Session data is stored in memory and partially persisted in Redis with keys such as ws:user:clients:${uid}, ws:guid:clients:${guid}, and ws:client:${socket.id}. Two broadcast strategies were evaluated:
Event broadcast – simple but message count grows with node count.
Service registry – clear mapping but adds operational cost.
After testing, Redis was chosen for broadcast because payloads are ~1 KB and the scenario is simple.
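The event‑broadcast strategy can be simulated in memory as follows. This is only a sketch under stated assumptions: `Node`, `Event`, and `Broadcast` are hypothetical stand‑ins for gateway instances and a pub/sub channel, and no real Redis client is involved. It shows why the number of received messages grows with the node count: every node sees every event, but only the node that holds the session delivers it.

```go
package main

import "fmt"

// Key builders following the Redis key patterns quoted above.
func userClientsKey(uid string) string { return "ws:user:clients:" + uid }
func clientKey(socketID string) string { return "ws:client:" + socketID }

// Event is a hypothetical broadcast message addressed to one socket.
type Event struct {
	SocketID string
	Payload  []byte
}

// Node is one gateway instance; outbox stands in for its real connections.
type Node struct {
	Name     string
	outbox   map[string][][]byte // socket ID -> payloads delivered locally
	Received int                 // events seen, whether or not they matched
}

// NewNode creates a node that owns the given socket IDs.
func NewNode(name string, socketIDs ...string) *Node {
	n := &Node{Name: name, outbox: make(map[string][][]byte)}
	for _, id := range socketIDs {
		n.outbox[id] = nil
	}
	return n
}

// Handle runs on every node for every event; only the owner delivers it.
func (n *Node) Handle(e Event) bool {
	n.Received++
	if _, ok := n.outbox[e.SocketID]; !ok {
		return false
	}
	n.outbox[e.SocketID] = append(n.outbox[e.SocketID], e.Payload)
	return true
}

// Broadcast stands in for a pub/sub channel: each node receives each event.
func Broadcast(nodes []*Node, e Event) (delivered int) {
	for _, n := range nodes {
		if n.Handle(e) {
			delivered++
		}
	}
	return delivered
}

func main() {
	nodes := []*Node{NewNode("gw-1", "s1"), NewNode("gw-2", "s2"), NewNode("gw-3")}
	fmt.Println(userClientsKey("u42"), clientKey("s2"))
	fmt.Println("delivered:", Broadcast(nodes, Event{SocketID: "s2", Payload: []byte("hi")}))
}
```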
3.6 Heartbeat mechanism
After a WebSocket connection is established, the server sends heartbeat parameters. Clients report heartbeats at the configured interval; timestamps are first updated in memory, then periodically synced to Redis to avoid spikes.
Server sends heartbeat config.
Client sends heartbeat packets; server updates the session timestamp.
Any upstream data also refreshes the timestamp.
Server periodically clears expired sessions.
Redis‑based timestamps drive cleanup of connection‑user‑file mappings.
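    // t is assumed to be a *time.Ticker that fires once per cleanup interval.
    for {
        select {
        case <-t.C:
            now := time.Now().Unix()
            var clients []*Connection
            dispatcher.clients.Range(func(_, v interface{}) bool {
                client := v.(*Connection)
                lastTs := atomic.LoadInt64(&client.LastMessageTS)
                if now-lastTs > int64(expireTime) {
                    // No heartbeat or upstream data within expireTime: mark for closing.
                    clients = append(clients, client)
                } else {
                    // Not expired: pass the latest timestamp to the Redis mapping cleanup.
                    dispatcher.clearRedisMapping(client.Id, client.Uid, lastTs, clearTimeout)
                }
                return true
            })
            // Close expired connections after the Range call has finished.
            for _, cli := range clients {
                cli.WsClose()
            }
        }
    }

Dynamic heartbeat intervals reduce QPS from 500 000/s to 500 000/y, where y is the configured divisor.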
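The "update in memory first, sync to Redis periodically" idea can be sketched as below. `Session`, `Registry`, `Store`, and `mapStore` are hypothetical names for this example, and `Store` merely stands in for Redis. `Flush` is assumed to be called from a single goroutine, for instance on a ticker, so that many heartbeats between two flushes collapse into one write per session.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Session tracks the last time a client was heard from.
type Session struct {
	ID       string
	lastTS   int64 // updated atomically on every heartbeat or upstream message
	syncedTS int64 // last value written to the store; touched only by Flush
}

// Touch records activity in memory only; it never calls the store.
func (s *Session) Touch(now int64) { atomic.StoreInt64(&s.lastTS, now) }

// Store is a hypothetical stand-in for the Redis write path.
type Store interface {
	SetLastSeen(id string, ts int64)
}

// Registry holds the sessions owned by this gateway instance.
type Registry struct{ sessions sync.Map }

// Add registers a session with the registry.
func (r *Registry) Add(s *Session) { r.sessions.Store(s.ID, s) }

// Flush writes only the timestamps that changed since the previous flush
// and returns how many writes were issued.
func (r *Registry) Flush(st Store) int {
	n := 0
	r.sessions.Range(func(_, v interface{}) bool {
		s := v.(*Session)
		ts := atomic.LoadInt64(&s.lastTS)
		if ts != s.syncedTS {
			st.SetLastSeen(s.ID, ts)
			s.syncedTS = ts
			n++
		}
		return true
	})
	return n
}

// mapStore is an in-memory Store used only for demonstration.
type mapStore struct {
	writes int
	last   map[string]int64
}

func (m *mapStore) SetLastSeen(id string, ts int64) { m.writes++; m.last[id] = ts }

func main() {
	r := &Registry{}
	s := &Session{ID: "s1"}
	r.Add(s)
	st := &mapStore{last: map[string]int64{}}
	for ts := int64(100); ts < 105; ts++ {
		s.Touch(ts) // five heartbeats
	}
	fmt.Println("writes:", r.Flush(st), "last:", st.last["s1"])
}
```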
3.7 Custom Kafka headers
Headers such as X‑ID, X‑Uid, X‑Guid, X‑Event, etc., carry routing and tracing information, avoiding payload decoding in the gateway.
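To illustrate routing on headers alone, the sketch below looks up the target connection from an X-ID header and forwards the payload bytes untouched. `Header` and `Message` are hypothetical types shaped like the key/value headers most Kafka clients expose, not a real client API; the ASCII header spelling and the `Router` type are likewise assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// Header is a key/value pair attached to a message.
type Header struct {
	Key   string
	Value []byte
}

// Message carries routing headers plus an opaque payload.
type Message struct {
	Headers []Header
	Value   []byte // never decoded by the router
}

// headerValue returns the first header with the given key.
func headerValue(m Message, key string) (string, bool) {
	for _, h := range m.Headers {
		if h.Key == key {
			return string(h.Value), true
		}
	}
	return "", false
}

// ErrNoRoute is returned when no local connection matches the message.
var ErrNoRoute = errors.New("no routable socket for message")

// Router maps socket IDs to per-connection outbound queues.
type Router struct {
	conns map[string]chan []byte
}

// Dispatch picks the connection from the X-ID header and enqueues the
// payload as-is, without inspecting its contents.
func (r *Router) Dispatch(m Message) error {
	id, ok := headerValue(m, "X-ID")
	if !ok {
		return ErrNoRoute
	}
	ch, ok := r.conns[id]
	if !ok {
		return ErrNoRoute
	}
	select {
	case ch <- m.Value:
		return nil
	default:
		return errors.New("send queue full")
	}
}

func main() {
	r := &Router{conns: map[string]chan []byte{"1001": make(chan []byte, 1)}}
	msg := Message{
		Headers: []Header{
			{Key: "X-ID", Value: []byte("1001")},
			{Key: "X-Event", Value: []byte("push")},
		},
		Value: []byte{0x81, 0xa1, 0x6b, 0x01}, // opaque bytes
	}
	fmt.Println(r.Dispatch(msg), len(r.conns["1001"]))
}
```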
3.8 Message send/receive
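    type Packet struct { ... }

    // Connect wraps a WebSocket connection with a buffered outbound queue.
    type Connect struct {
        *websocket.Conn
        send chan Packet  // outbound packets, drained by writer()
        mux  sync.RWMutex // guards writes to the underlying connection
    }

    // NewConnect starts one reader and one writer goroutine per connection.
    // N is the send-buffer size; wrapping conn into the embedded
    // *websocket.Conn is elided in the original snippet.
    func NewConnect(conn net.Conn) *Connect {
        c := &Connect{send: make(chan Packet, N)}
        go c.reader()
        go c.writer()
        return c
    }

    func (c *Connect) Write(data []byte) (err error) {
        c.mux.Lock()
        defer c.mux.Unlock()
        // write logic
        return nil
    }

Reducing goroutine count from three to two per connection saves memory.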
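The writer half of such a connection could be sketched as follows, assuming a single goroutine drains a buffered channel so that writes to the socket are serialized without extra goroutines. `Conn`, `memWriter`, and `ErrSlowConsumer` are hypothetical names, and an `io.Writer` stands in for the real WebSocket connection.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// Packet is one outbound message.
type Packet struct{ Data []byte }

// ErrSlowConsumer signals that the client's send buffer is full.
var ErrSlowConsumer = errors.New("send buffer full")

// Conn owns one writer goroutine that drains the send queue.
type Conn struct {
	w    io.Writer
	send chan Packet
	done chan struct{}
}

// NewConn starts the writer goroutine with a send buffer of size buf.
func NewConn(w io.Writer, buf int) *Conn {
	c := &Conn{w: w, send: make(chan Packet, buf), done: make(chan struct{})}
	go c.writer()
	return c
}

// writer is the only goroutine that writes to c.w.
func (c *Conn) writer() {
	defer close(c.done)
	for p := range c.send {
		if _, err := c.w.Write(p.Data); err != nil {
			return
		}
	}
}

// Send enqueues without blocking; it must not be called after Close.
func (c *Conn) Send(p Packet) error {
	select {
	case c.send <- p:
		return nil
	default:
		return ErrSlowConsumer
	}
}

// Close stops accepting packets and waits for the writer to finish.
func (c *Conn) Close() {
	close(c.send)
	<-c.done
}

// memWriter collects written bytes in memory for demonstration.
type memWriter struct{ data []byte }

func (m *memWriter) Write(p []byte) (int, error) {
	m.data = append(m.data, p...)
	return len(p), nil
}

func main() {
	out := &memWriter{}
	c := NewConn(out, 8)
	_ = c.Send(Packet{Data: []byte("hello ")})
	_ = c.Send(Packet{Data: []byte("world")})
	c.Close()
	fmt.Println(string(out.data))
}
```

Returning an error on a full buffer, rather than blocking, keeps one slow client from stalling the goroutine that fans messages out to many connections; whether the original gateway drops, retries, or disconnects in that case is not stated in the source.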
3.9 Core object pooling
    var ConnectionPool = sync.Pool{New: func() interface{} { return &Connection{} }}
    func GetConn() *Connection { return ConnectionPool.Get().(*Connection) }
    func PutConn(cli *Connection) { cli.Reset(); ConnectionPool.Put(cli) }
3.10 Data transmission optimization
MessagePack is used for serialization, and MTU is tuned (e.g., 1400 bytes) to avoid fragmentation.
3.11 Infrastructure support
The service is built with the Ego framework, providing structured logging, dynamic log levels, and integrated monitoring of CPU, latency (P99), memory, and goroutine counts. Redis and Kafka client metrics are also visualized.
4 Performance testing
4.1 Test setup
One 4‑core 8 GB VM as the server targeting 480 k connections.
Eight 4‑core 8 GB VMs as clients, each opening 60 k ports.
4.2 Scenario 1 – 500 k online users
WS‑Gateway on a 16‑core 32 GB machine used 22.38 % CPU and 70.59 % memory; peak connection rate 16 k/s, 47 KB per user.
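The per‑user figure appears to follow from the reported memory utilization. The sketch below redoes the arithmetic under two assumptions that the source does not state: memory is counted in binary units (GiB and KiB), and all used memory is attributed to connections. The `perConnKiB` helper is hypothetical.

```go
package main

import "fmt"

// perConnKiB estimates memory per connection from total RAM, the fraction
// in use, and the number of connections.
func perConnKiB(totalGiB, usedFraction float64, conns int) float64 {
	usedKiB := totalGiB * 1024 * 1024 * usedFraction
	return usedKiB / float64(conns)
}

func main() {
	// 32 GiB machine, 70.59 % memory used, 500 k connections.
	fmt.Printf("%.1f KiB per connection\n", perConnKiB(32, 0.7059, 500000))
}
```

Under these assumptions the result is roughly 47 KiB, in line with the figure quoted above.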
4.3 Scenario 2 – Broadcast with acknowledgments
Every 5 s a message (~1 KB) is sent to all users with receipt. After 5 min the service restarted due to memory exhaustion (broadcast code consumed 9.32 % memory, receipt handling 10.38 %).
4.4 Scenario 3 – Broadcast without acknowledgments
Similar load but no receipts; memory usage rose to 93 % with peak send 100 k msg/s.
4.5 Scenario 4 – High churn (40 k up/down per second)
CPU 46.96 %, memory 65.6 %; connection creation 18.5 k/s, receive 330 k msg/s, send 394 k msg/s, no crashes.
4.6 Summary
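On a 16‑core, 32 GB machine, the gateway sustains 500 k connections with acceptable CPU and memory usage, confirming the redesign meets current scale requirements. The exception is broadcast with acknowledgments (4.3), where the service restarted after 5 min due to memory exhaustion.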
5 Conclusion
Decoupling the gateway from business services, removing the NGINX dependency, and optimizing the handshake, socket ID generation, heartbeat handling, custom headers, message processing, object pooling, and MessagePack serialization collectively reduced resource consumption and improved reliability.
6 Q&A
6.1 Value of SocketID
SocketID allows Kafka consumers to locate the corresponding TCP connection; it also helps preserve message order and avoid loss during rolling updates.
6.2 Why Redis for broadcast
Redis pub/sub is simpler for cluster‑wide broadcast in Kubernetes and maintains compatibility with legacy logic, while Kafka is used for other asynchronous flows.
