Design and Performance Optimization of a High‑Concurrency WebSocket Gateway (Version 2.0)
This article describes the evolution from a Node.js‑based WebSocket gateway to a Go‑powered, gRPC‑enabled architecture, detailing the redesign of the gateway, resource‑saving techniques, heartbeat and TLS optimizations, message‑broker choices, extensive performance testing, and the resulting improvements in CPU, memory, and scalability for millions of concurrent connections.
1 Introduction
In Shimo Docs, features such as document sharing, comments, slide presentations, and spreadsheet follow‑along require multi‑client data synchronization and server‑initiated push, which the HTTP protocol cannot satisfy; therefore a WebSocket solution was adopted.
As the business grew, daily peak connections reached the million range, sharply increasing memory and CPU consumption and prompting a rebuild of the gateway.
2 Gateway 1.0
2.1 Architecture
Gateway 1.0 was implemented with Node.js and Socket.IO, which adequately supported the early traffic volume.
2.2 Pain points
Resource consumption: Nginx did nothing beyond TLS termination and request pass-through, so the resources it used were largely wasted, while the Node gateway itself consumed excessive CPU and memory.
Maintenance & monitoring: No integration with Shimo’s monitoring system, making alerting and troubleshooting difficult.
Business coupling: Business services and gateway logic were bundled together, preventing independent horizontal scaling of the business layer.
3 Gateway 2.0
Gateway 2.0 addresses the above issues by separating the gateway function (WS‑Gateway) from business processing (WS‑API). The new design integrates user authentication, TLS certificate verification, and WebSocket connection management into WS‑Gateway, while WS‑API communicates with component services via gRPC, enabling targeted scaling, removing Nginx, reducing hardware consumption, and joining Shimo’s monitoring ecosystem.
3.1 Overall architecture
The architecture diagram (omitted) shows WS‑Gateway handling TLS handshake and connection management, and WS‑API handling business logic and message routing.
3.2 Handshake process
When network conditions are good, the client completes a six‑step handshake and upgrades to WebSocket; under poor conditions the connection degrades to HTTP long‑polling.
{"sid":"xxx","upgrades":["websocket"],"pingInterval":xxx,"pingTimeout":xxx}

3.3 TLS memory‑usage optimization
In version 1.0 TLS termination was performed by Nginx, consuming about 30% of total memory. In version 2.0 the TLS certificate is mounted directly on the service, reducing the memory footprint.
3.4 Socket ID design
Each connection receives a unique identifier generated by the SnowFlake algorithm. In physical‑machine deployments a fixed machine ID guarantees uniqueness; in Kubernetes the ID is allocated via a registration service that writes the instance information to a database.
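A minimal sketch of a SnowFlake-style generator follows. The bit layout (10 machine bits, 12 sequence bits), the custom epoch, and all names are assumptions chosen for illustration; the article does not state the layout the gateway uses. In Kubernetes, the `machineID` argument would be the value handed out by the registration service.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

const (
	machineBits   = 10                  // assumed width of the machine ID field
	sequenceBits  = 12                  // assumed width of the per-millisecond counter
	maxMachineID  = 1<<machineBits - 1  // 1023
	maxSequence   = 1<<sequenceBits - 1 // 4095
	customEpochMs = 1577836800000       // 2020-01-01T00:00:00Z, hypothetical epoch
)

// Snowflake generates 64-bit IDs: timestamp | machine ID | sequence.
type Snowflake struct {
	mu        sync.Mutex
	machineID int64
	lastMs    int64
	seq       int64
}

// NewSnowflake validates the machine ID, which must be unique per instance.
func NewSnowflake(machineID int64) (*Snowflake, error) {
	if machineID < 0 || machineID > maxMachineID {
		return nil, errors.New("machine ID out of range")
	}
	return &Snowflake{machineID: machineID}, nil
}

// Next returns an ID that is unique and increasing within this generator.
func (s *Snowflake) Next() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()

	ms := time.Now().UnixMilli()
	if ms < s.lastMs {
		ms = s.lastMs // clock moved backwards: keep using the last timestamp
	}
	if ms == s.lastMs {
		s.seq = (s.seq + 1) & maxSequence
		if s.seq == 0 {
			// Sequence exhausted for this millisecond: wait for the next one.
			for ms <= s.lastMs {
				ms = time.Now().UnixMilli()
			}
		}
	} else {
		s.seq = 0
	}
	s.lastMs = ms
	return (ms-customEpochMs)<<(machineBits+sequenceBits) | s.machineID<<sequenceBits | s.seq
}

func main() {
	gen, err := NewSnowflake(7)
	if err != nil {
		panic(err)
	}
	id := gen.Next()
	fmt.Println("socket id:", id)
	fmt.Println("machine id:", (id>>sequenceBits)&maxMachineID)
}
```

Because the machine ID occupies its own bit field, two instances can never emit the same ID as long as their machine IDs differ, which is why the allocation step matters in Kubernetes.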
3.5 Cluster session management – event broadcast
Session data is stored in memory on the gateway node and partially persisted in Redis. The key‑value layout is:
| Key | Description |
| --- | --- |
| ws:user:clients:${uid} | Sorted set mapping a user to its WebSocket connections |
| ws:guid:clients:${guid} | Sorted set mapping a file to its WebSocket connections |
| ws:client:${socket.id} | Hash storing all user and file relationships for a given socket |
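To make the key layout concrete, here is a small sketch of helpers that build these keys and list the Redis commands a node might issue when a socket joins a file. The key prefixes come from the table; the use of a timestamp as the sorted-set score and the hash field names `uid` and `guid` are assumptions, as are the function names.

```go
package main

import "fmt"

// Key builders for the three layouts listed in the table.
func userClientsKey(uid string) string  { return "ws:user:clients:" + uid }
func guidClientsKey(guid string) string { return "ws:guid:clients:" + guid }
func clientKey(socketID string) string  { return "ws:client:" + socketID }

// registerCommands lists, as plain strings, the commands that could record
// a new socket. The score (ts) and the hash field names are hypothetical.
func registerCommands(socketID, uid, guid string, ts int64) []string {
	return []string{
		fmt.Sprintf("ZADD %s %d %s", userClientsKey(uid), ts, socketID),
		fmt.Sprintf("ZADD %s %d %s", guidClientsKey(guid), ts, socketID),
		fmt.Sprintf("HSET %s uid %s guid %s", clientKey(socketID), uid, guid),
	}
}

func main() {
	for _, cmd := range registerCommands("sock-1", "42", "doc-7", 1700000000) {
		fmt.Println(cmd)
	}
	// Prints:
	// ZADD ws:user:clients:42 1700000000 sock-1
	// ZADD ws:guid:clients:doc-7 1700000000 sock-1
	// HSET ws:client:sock-1 uid 42 guid doc-7
}
```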
Two broadcast strategies were evaluated:
| Strategy | Advantages | Disadvantages |
| --- | --- | --- |
| Event broadcast | Simple implementation | Message count grows with node count |
| Service registry | Clear session‑to‑node mapping | Additional operational overhead |
For the message broker, three candidates were compared (Redis, Kafka, RocketMQ). Benchmarks showed Redis performed best for payloads under 10 KB, which matches the typical broadcast size, so Redis was chosen for event broadcasting.
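The event-broadcast strategy can be sketched in-process: every node receives every event and delivers it only to sessions it holds locally. The `Bus` type below merely stands in for the broker (Redis in the article); all type and function names are hypothetical.

```go
package main

import "fmt"

// Event is a broadcast message addressed to one user.
type Event struct {
	UID     string
	Payload string
}

// Node is one gateway instance with its own in-memory sessions.
type Node struct {
	Name      string
	sessions  map[string][]string // uid -> socket IDs connected to this node
	Received  int                 // events seen by this node
	Delivered int                 // pushes made to local sockets
}

func NewNode(name string) *Node {
	return &Node{Name: name, sessions: make(map[string][]string)}
}

// Connect registers a local socket for a user.
func (n *Node) Connect(uid, socketID string) {
	n.sessions[uid] = append(n.sessions[uid], socketID)
}

// Handle runs for every broadcast event, whether or not the user is local.
func (n *Node) Handle(e Event) {
	n.Received++
	n.Delivered += len(n.sessions[e.UID])
}

// Bus stands in for the message broker: Publish fans out to all nodes.
type Bus struct{ nodes []*Node }

func (b *Bus) Subscribe(n *Node) { b.nodes = append(b.nodes, n) }

func (b *Bus) Publish(e Event) {
	for _, n := range b.nodes {
		n.Handle(e)
	}
}

func main() {
	bus := &Bus{}
	nodes := []*Node{NewNode("node-a"), NewNode("node-b"), NewNode("node-c")}
	for _, n := range nodes {
		bus.Subscribe(n)
	}
	nodes[1].Connect("u1", "sock-1") // the user is connected to node-b only

	bus.Publish(Event{UID: "u1", Payload: "hello"})

	for _, n := range nodes {
		fmt.Printf("%s received=%d delivered=%d\n", n.Name, n.Received, n.Delivered)
	}
	// Prints:
	// node-a received=1 delivered=0
	// node-b received=1 delivered=1
	// node-c received=1 delivered=0
}
```

One published event is handled three times but delivered once, which is the "message count grows with node count" cost listed in the comparison.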
3.6 Heartbeat mechanism
After a successful handshake, the client receives heartbeat parameters from the server. The client reports timestamps periodically; the server updates the in‑memory session and synchronizes to Redis at a lower frequency to avoid Redis overload.
for {
	select {
	case <-t.C: // t is the ticker that drives the periodic sweep
		now := time.Now().Unix()
		// Collect expired connections first and close them after Range returns.
		clients := make([]*Connection, 0)
		dispatcher.clients.Range(func(_, v interface{}) bool {
			client := v.(*Connection)
			lastTs := atomic.LoadInt64(&client.LastMessageTS)
			if now-lastTs > int64(expireTime) {
				// No message within expireTime: mark the connection for closing.
				clients = append(clients, client)
			} else {
				// Connection still alive: run the Redis mapping cleanup for it.
				dispatcher.clearRedisMapping(client.Id, client.Uid, lastTs, clearTimeout)
			}
			return true // continue iterating
		})
		for _, cli := range clients {
			cli.WsClose()
		}
	}
}

Dynamic heartbeat intervals lower this load further: the heartbeat rate becomes QPS₂ = 500 000 / y, where y is the maximum interval multiplier, so heartbeat‑generated load drops by up to a factor of y.
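As a worked instance of the formula, take a hypothetical multiplier of y = 5 (the article does not give the value used in production):

```latex
\[
\mathrm{QPS}_2 = \frac{500\,000}{y}
\qquad\Longrightarrow\qquad
y = 5:\quad \mathrm{QPS}_2 = \frac{500\,000}{5} = 100\,000
\]
```

Under that assumption the heartbeat load is one fifth of the 500 000 figure in the numerator.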
3.7 Custom Kafka headers
To avoid costly message‑body decoding at the gateway, essential metadata (e.g., X‑ID, X‑Uid, X‑Guid, X‑Event, X‑Operator) is placed in Kafka headers, enabling efficient routing and full traceability.
| Field | Description | Details |
| --- | --- | --- |
| X-ID | WebSocket ID | Connection identifier |
| X-Uid | User ID | Identifies the user |
| X-Guid | File ID | Identifies the document/file |
| X-Inner | Gateway internal command | Join/leave events |
| X-Event | Gateway event type | Connect/Message/Disconnect |
| X-Operator | API layer command | Unicast, broadcast, internal ops |
| X-Auth-Type | User authentication type | SDKV2, main site, WeChat, mobile, desktop |
| X-Trace-ID | Trace identifier | End-to-end tracing |
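The sketch below shows how routing metadata could be read from headers while the body stays undecoded. `Header` and `Message` are local stand-ins that mirror the key/value shape of a Kafka record header, not a real client library's types; `Route` and `routeOf` are hypothetical, and the header names are written with plain ASCII hyphens.

```go
package main

import (
	"errors"
	"fmt"
)

// Header mirrors the key/value shape of a Kafka record header.
type Header struct {
	Key   string
	Value []byte
}

// Message is a stand-in for a consumed record.
type Message struct {
	Headers []Header
	Body    []byte // opaque to the gateway; never decoded here
}

// headerValue returns the first header with the given key.
func headerValue(hs []Header, key string) (string, bool) {
	for _, h := range hs {
		if h.Key == key {
			return string(h.Value), true
		}
	}
	return "", false
}

// Route holds the metadata needed to dispatch a message.
type Route struct {
	Event, SocketID, UID, GUID string
}

// routeOf builds a Route from headers only. X-Event is treated as required
// in this sketch; that requirement is an assumption.
func routeOf(m Message) (Route, error) {
	event, ok := headerValue(m.Headers, "X-Event")
	if !ok {
		return Route{}, errors.New("missing X-Event header")
	}
	r := Route{Event: event}
	r.SocketID, _ = headerValue(m.Headers, "X-ID")
	r.UID, _ = headerValue(m.Headers, "X-Uid")
	r.GUID, _ = headerValue(m.Headers, "X-Guid")
	return r, nil
}

func main() {
	msg := Message{
		Headers: []Header{
			{Key: "X-ID", Value: []byte("1001")},
			{Key: "X-Uid", Value: []byte("42")},
			{Key: "X-Guid", Value: []byte("doc-7")},
			{Key: "X-Event", Value: []byte("Message")},
		},
		Body: []byte{0x93, 0x01, 0x02, 0x03}, // arbitrary bytes, left untouched
	}
	r, err := routeOf(msg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", r)
	// Prints: {Event:Message SocketID:1001 UID:42 GUID:doc-7}
}
```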
3.8 Message receive & send
The initial implementation used three goroutines per connection (a reader, a writer, and a background task). To reduce memory usage, the writer goroutine was eliminated; writes are now performed synchronously by the caller and guarded by a mutex.
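type Packet struct { ... }

type Connect struct {
	*websocket.Conn
	mux sync.RWMutex // serializes writes on this connection
}

func NewConnect(conn net.Conn) *Connect {
	// No send channel or writer goroutine any more: only the reader runs in the background.
	// (Wrapping conn into the embedded *websocket.Conn is omitted here.)
	c := &Connect{}
	go c.reader()
	return c
}

// Write sends data synchronously; the mutex keeps concurrent callers from interleaving frames.
func (c *Connect) Write(data []byte) (err error) {
	c.mux.Lock()
	defer c.mux.Unlock()
	// write logic
	return nil
}

3.9 Core object pooling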
Connection objects are pooled using sync.Pool to reduce GC pressure. Helper functions GetConn() and PutConn() acquire and release objects.
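// ConnectionPool recycles Connection objects to reduce GC pressure.
var ConnectionPool = sync.Pool{
	New: func() interface{} { return &Connection{} },
}

// GetConn takes a Connection from the pool, allocating one if the pool is empty.
func GetConn() *Connection { return ConnectionPool.Get().(*Connection) }

// PutConn resets the Connection and returns it to the pool.
func PutConn(cli *Connection) {
	cli.Reset()
	ConnectionPool.Put(cli)
}

3.10 Data transmission optimization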
Message bodies are serialized with MessagePack and compressed to keep payloads around 1 KB. MTU is tuned (e.g., 1400 bytes) to avoid IP fragmentation.
4 Performance testing
4.1 Test preparation
One 16‑core, 32 GB VM as the service host, targeting 480 k concurrent connections.
Eight 4‑core, 8 GB VMs as client generators, each opening 60 k ports.
4.2 Scenario 1 – 500 k online users
| Service | CPU cores | Memory | Instances | CPU usage | Memory usage |
| --- | --- | --- | --- | --- | --- |
| WS‑Gateway | 16 | 32 GB | 1 | 22.38% | 70.59% |
Peak connection establishment: 16 k connections/s; each user consumes ~47 KB memory.
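The per-user figure can be checked against the table with a quick calculation, assuming "32 GB" means 32 GiB and that all 500 000 connections were open when memory usage was sampled (neither is stated explicitly):

```latex
\[
\frac{0.7059 \times 32 \times 2^{30}\ \text{B}}{500\,000}
\approx 48\,509\ \text{B}
\approx 47.4\ \text{KiB per connection}
\]
```

This is consistent with the reported figure of roughly 47 KB per user.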
4.3 Scenario 2 – 500 k users, push every 5 s with acknowledgments
After 5 minutes the service restarted due to memory exhaustion (≈9.32% consumed by broadcast code, 10.38% by acknowledgment handling).
4.4 Scenario 3 – 500 k users, push every 5 s without acknowledgments
| Service | CPU cores | Memory | Instances | CPU usage | Memory usage |
| --- | --- | --- | --- | --- | --- |
| WS‑Gateway | 16 | 32 GB | 1 | 30% | 93% |
Peak connection establishment: 11 k connections/s; send peak: 100 k messages/s; flame‑graph shows most CPU spent in the 5‑second broadcast loop.
4.5 Scenario 4 – 500 k users, push every 5 s with acknowledgments and 40 k up/down events per second
| Service | CPU cores | Memory | Instances | CPU usage | Memory usage |
| --- | --- | --- | --- | --- | --- |
| WS‑Gateway | 16 | 32 GB | 1 | 46.96% | 65.6% |
Peak connection establishment: 18.5 k connections/s; receive peak: 329 k messages/s; send peak: 393 k messages/s; no abnormal behavior observed.
4.6 Test summary
On a 16‑core, 32 GB machine, a single instance handled up to 500 k concurrent WebSocket connections. Apart from the memory‑exhaustion restart in Scenario 2, CPU and memory usage stayed within expectations and the service remained stable over long‑duration runs, indicating that the redesign meets current scalability requirements.
5 Conclusion
The rapid growth of user volume makes gateway refactoring essential. Version 2.0 achieves:
Decoupling of gateway and business services, removing Nginx dependency.
Degradable handshake, SnowFlake‑based Socket ID, optimized heartbeat, custom Kafka headers for low‑overhead routing, streamlined message handling, connection‑object pooling, MessagePack compression, and full integration with monitoring infrastructure.
Migration of all business calls to gRPC, providing traceable, controllable entry points for future feature expansion.
6 Technical references
Microservice framework: https://github.com/gotomicro/ego
Kafka, Redis, MySQL client monitoring SDK: https://github.com/gotomicro/ego-component
