
Building a Million-Scale WebSocket Gateway: Architecture, Optimization & Performance

This article details the design, refactoring, and performance testing of a high‑traffic WebSocket gateway for Shimo Docs, covering the evolution from a Node.js Socket.IO version to a Go‑based microservice architecture, TLS memory tuning, socket ID generation, heartbeat handling, custom Kafka headers, and resource‑efficient scaling to half‑a‑million concurrent connections.

21CTO

Nov 30, 2021


Content summary: Web push technology evolved from short polling and long polling to the HTML5 WebSocket standard, which simplifies message push and notification. How can a WebSocket gateway be built to support millions of connections? This article shares the practical experience of Shimo Docs senior engineer Du Minxiang, drawn from the rebuild of Shimo's WebSocket service.

1 Introduction

In Shimo Docs, features such as document sharing, comments, slide presentations and spreadsheet sync require multi‑client data synchronization and server‑side push, which cannot be satisfied by short‑ or long‑polling, so the industry‑standard solution based on the HTML5 WebSocket specification was chosen.

As daily peak connections grew to millions, memory and CPU usage surged, prompting a gateway redesign.

2 Gateway 1.0

Gateway 1.0 was built with Node.js and Socket.IO and met the early traffic requirements.

2.1 Architecture

Architecture diagram:

Client connection flow:

User connects through NGINX; the event is observed by business services.

Business service queries user data and publishes a message to Redis.

Gateway subscribes to Redis and receives the message.

Gateway looks up the user session in the cluster and pushes the message to the client.

2.2 Pain points

Resource waste: NGINX is used only to handle TLS certificates, and most requests simply pass through it, which costs CPU and memory for little benefit; the Node.js gateway itself is also resource-heavy.

Lack of monitoring integration.

Business‑gateway coupling prevents targeted horizontal scaling.

3 Gateway 2.0

Gateway 2.0 separates the gateway function (WS‑Gateway) from business processing (WS‑API). WS‑Gateway handles authentication, TLS, and WebSocket management; WS‑API communicates with component services via gRPC, enabling module‑level scaling and reduced hardware consumption.

3.1 Overall architecture

Architecture diagram:

Client connection flow:

Client completes WebSocket handshake with WS‑Gateway.

WS‑Gateway stores the session, caches the connection mapping in Redis, and publishes an online event to Kafka.

WS‑API consumes the online event from Kafka.

WS‑API retrieves necessary data from Redis, applies filtering, and publishes the final message to Kafka.

WS‑Gateway subscribes to Kafka, receives the message, and pushes it to the appropriate client.

3.2 Handshake process

In good network conditions the client follows steps 1‑6 to enter WebSocket mode; in poor conditions the connection degrades to HTTP long‑polling.

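1. The client sends the initial connection request and the server replies with the session parameters: {"sid":"xxx","upgrades":["websocket"],"pingInterval":xxx,"pingTimeout":xxx}

2. The client resends the request, this time carrying the sid.

3. The server returns 40 to acknowledge the request.

4. The client sends a POST request to confirm the downgrade (long-polling) path.

5. The server returns ok, which completes the first phase of the handshake.

6. The client and server exchange the 2probe/3probe checks, after which the WebSocket connection is established.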
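As an illustration of the first handshake reply, the following Go sketch decodes the JSON body shown above and checks whether the server offers a WebSocket upgrade. The type name `Handshake`, the helper names, the sample values, and the assumption that the ping fields are integers in milliseconds are not from the original service.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Handshake mirrors the fields of the JSON handshake reply.
// The name and the integer types are assumptions for illustration.
type Handshake struct {
	SID          string   `json:"sid"`
	Upgrades     []string `json:"upgrades"`
	PingInterval int      `json:"pingInterval"` // assumed to be milliseconds
	PingTimeout  int      `json:"pingTimeout"`  // assumed to be milliseconds
}

// CanUpgrade reports whether the server offered the named transport.
func (h Handshake) CanUpgrade(transport string) bool {
	for _, u := range h.Upgrades {
		if u == transport {
			return true
		}
	}
	return false
}

// ParseHandshake decodes the handshake reply body.
func ParseHandshake(body []byte) (Handshake, error) {
	var h Handshake
	err := json.Unmarshal(body, &h)
	return h, err
}

func main() {
	// Sample values are illustrative; the article elides the real ones.
	body := []byte(`{"sid":"abc123","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":5000}`)
	h, err := ParseHandshake(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(h.SID, h.CanUpgrade("websocket"))
}
```

A client that sees "websocket" in the upgrades list can attempt the probe exchange; otherwise it stays on long-polling.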

3.3 TLS memory optimization

In version 1.0 TLS termination was done by NGINX, consuming ~30% of total memory. In 2.0 the certificate is mounted on the service itself, reducing NGINX load.

Two options to further reduce memory:

Move TLS termination to a layer‑7 load balancer.

Improve Go TLS handshake performance (refer to Go PR #43563 and related benchmarks).
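For context, terminating TLS inside a Go service (rather than in NGINX) can look roughly like the sketch below. This is only an assumption about the general shape: the function name `newTLSServer`, the port, the `/healthz` route, and the certificate paths in the comment are hypothetical, and the WebSocket upgrade handler itself is left out.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// newTLSServer builds an HTTP server that terminates TLS in-process.
// The name and settings are illustrative, not the gateway's real config.
func newTLSServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:    addr,
		Handler: h,
		TLSConfig: &tls.Config{
			MinVersion: tls.VersionTLS12, // refuse older protocol versions
		},
	}
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	srv := newTLSServer(":8443", mux)
	fmt.Println("configured TLS listener on", srv.Addr)

	// With certificate files mounted into the container (hypothetical paths):
	// log.Fatal(srv.ListenAndServeTLS("/etc/certs/tls.crt", "/etc/certs/tls.key"))
}
```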

3.4 Socket ID design

Each connection gets a unique ID generated by the SnowFlake algorithm. In physical‑machine deployments the machine number guarantees uniqueness; in Kubernetes a registration service assigns the machine ID, which is stored in a database and reused after restarts.
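A minimal sketch of a SnowFlake-style generator follows. It uses the commonly cited layout (timestamp, 10-bit machine ID, 12-bit sequence); the bit widths, the custom epoch, and all names are assumptions, since the article does not give the gateway's actual layout. The machine ID passed to `NewGenerator` stands in for the value a registration service would assign.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	machineBits  = 10 // assumed width of the machine ID field
	sequenceBits = 12 // assumed width of the per-millisecond sequence
	maxMachineID = 1<<machineBits - 1
	maxSequence  = 1<<sequenceBits - 1
	epochMillis  = int64(1609459200000) // 2021-01-01T00:00:00Z, an arbitrary custom epoch
)

// Generator produces unique, time-ordered 64-bit IDs for one machine.
type Generator struct {
	mu        sync.Mutex
	machineID int64
	lastMs    int64
	seq       int64
}

// NewGenerator validates the machine ID, which must be unique per instance.
func NewGenerator(machineID int64) (*Generator, error) {
	if machineID < 0 || machineID > maxMachineID {
		return nil, fmt.Errorf("machine ID %d out of range [0, %d]", machineID, maxMachineID)
	}
	return &Generator{machineID: machineID}, nil
}

// Next returns the next ID; it is safe for concurrent use.
func (g *Generator) Next() int64 {
	g.mu.Lock()
	defer g.mu.Unlock()

	now := time.Now().UnixMilli()
	if now < g.lastMs {
		now = g.lastMs // clock stepped backwards: keep using the last timestamp
	}
	if now == g.lastMs {
		g.seq = (g.seq + 1) & maxSequence
		if g.seq == 0 {
			// Sequence exhausted within this millisecond: wait for the next one.
			for now <= g.lastMs {
				now = time.Now().UnixMilli()
			}
		}
	} else {
		g.seq = 0
	}
	g.lastMs = now
	return (now-epochMillis)<<(machineBits+sequenceBits) | g.machineID<<sequenceBits | g.seq
}

// MachineOf extracts the machine ID field from an ID.
func MachineOf(id int64) int64 {
	return (id >> sequenceBits) & maxMachineID
}

func main() {
	g, err := NewGenerator(1)
	if err != nil {
		panic(err)
	}
	id := g.Next()
	fmt.Println(id, MachineOf(id))
}
```

Because the machine ID is embedded in every value, two instances only collide if they are given the same machine ID, which is why the assignment has to be coordinated in Kubernetes.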

3.5 Cluster session management – event broadcast

Session data is stored in memory and partially persisted in Redis with keys such as ws:user:clients:${uid}, ws:guid:clients:${guid}, and ws:client:${socket.id}. Two broadcast strategies were evaluated:

Event broadcast – simple but message count grows with node count.

Service registry – clear mapping but adds operational cost.

After testing, event broadcast over Redis was chosen because payloads are small (~1 KB) and the scenario is simple.
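The trade-off of event broadcast can be shown with a small in-memory simulation: every node receives every event, but only the node holding the target socket pushes it. The sketch below replaces Redis pub/sub with a plain loop, and the types `Node` and `Event` and the helper names are hypothetical; only the key patterns come from the text.

```go
package main

import "fmt"

// Key helpers follow the Redis key patterns quoted in the text.
func userClientsKey(uid string) string { return "ws:user:clients:" + uid }
func clientKey(socketID string) string { return "ws:client:" + socketID }

// Event is a message addressed to one socket (hypothetical shape).
type Event struct {
	SocketID string
	Payload  string
}

// Node models one gateway instance with its in-memory sessions.
type Node struct {
	Name     string
	sessions map[string][]string // socket ID -> payloads pushed to that client
}

// NewNode creates a node that owns the given socket IDs.
func NewNode(name string, socketIDs ...string) *Node {
	n := &Node{Name: name, sessions: make(map[string][]string)}
	for _, id := range socketIDs {
		n.sessions[id] = nil
	}
	return n
}

// Handle is invoked for every broadcast event; only the owning node pushes.
func (n *Node) Handle(ev Event) bool {
	if _, ok := n.sessions[ev.SocketID]; !ok {
		return false // not our connection: drop the event
	}
	n.sessions[ev.SocketID] = append(n.sessions[ev.SocketID], ev.Payload)
	return true
}

// Broadcast stands in for a pub/sub channel: every node receives the event.
func Broadcast(nodes []*Node, ev Event) (received, pushed int) {
	for _, n := range nodes {
		received++
		if n.Handle(ev) {
			pushed++
		}
	}
	return received, pushed
}

func main() {
	nodes := []*Node{NewNode("gw-1", "s1"), NewNode("gw-2", "s2"), NewNode("gw-3", "s3")}
	received, pushed := Broadcast(nodes, Event{SocketID: "s2", Payload: "hello"})
	fmt.Println(received, pushed, clientKey("s2"), userClientsKey("42"))
}
```

One event costs one delivery per node regardless of where the client is connected, which is the growth with node count noted above; with small payloads this overhead stays modest.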

3.6 Heartbeat mechanism

After a WebSocket connection is established, the server sends heartbeat parameters. Clients report heartbeats at the configured interval; timestamps are first updated in memory, then periodically synced to Redis to avoid spikes.

Server sends heartbeat config.

Client sends heartbeat packets; server updates the session timestamp.

Any upstream data also refreshes the timestamp.

Server periodically clears expired sessions.

Redis‑based timestamps drive cleanup of connection‑user‑file mappings.

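for {
    select {
    case <-t.C: // t is the cleanup ticker
        now := time.Now().Unix()
        var clients []*Connection // sessions whose heartbeat has expired
        dispatcher.clients.Range(func(_, v interface{}) bool {
            client := v.(*Connection)
            // LastMessageTS is written by the read path, so load it atomically.
            lastTs := atomic.LoadInt64(&client.LastMessageTS)
            if now-lastTs > int64(expireTime) {
                // No heartbeat or upstream data within expireTime: close below.
                clients = append(clients, client)
            } else {
                // Still alive: pass the latest timestamp on to the Redis mapping cleanup.
                dispatcher.clearRedisMapping(client.Id, client.Uid, lastTs, clearTimeout)
            }
            return true
        })
        // Close outside Range so the callback stays short.
        for _, cli := range clients {
            cli.WsClose()
        }
    }
}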

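Dynamic heartbeat intervals reduce heartbeat QPS at 500,000 connections from 500,000 per second (one heartbeat per connection per second) to 500,000 / y per second, where y is the configured divisor, that is, the factor by which the heartbeat interval is lengthened.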
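The "update in memory first, sync to Redis periodically" idea from this section can be sketched as follows. Redis is replaced by a small `Store` interface with an in-memory implementation, and every name here (`Conn`, `Registry`, `Flush`, `memStore`) is hypothetical; the sketch also assumes a single flusher goroutine.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Conn holds the heartbeat state of one connection.
type Conn struct {
	ID            string
	LastMessageTS int64 // unix seconds, accessed atomically
	syncedTS      int64 // last value flushed; touched only by the flusher
}

// Touch records activity in memory only, so the hot path makes no network call.
func (c *Conn) Touch(now int64) { atomic.StoreInt64(&c.LastMessageTS, now) }

// Store stands in for Redis; SetLastSeen writes a batch of timestamps.
type Store interface {
	SetLastSeen(batch map[string]int64)
}

// Registry tracks the connections held by this gateway instance.
type Registry struct {
	conns sync.Map // socket ID -> *Conn
}

// Add registers a connection.
func (r *Registry) Add(c *Conn) { r.conns.Store(c.ID, c) }

// Flush writes only timestamps that changed since the previous flush,
// in one batch, and returns how many entries were written.
func (r *Registry) Flush(s Store) int {
	batch := make(map[string]int64)
	r.conns.Range(func(_, v interface{}) bool {
		c := v.(*Conn)
		ts := atomic.LoadInt64(&c.LastMessageTS)
		if ts != c.syncedTS {
			batch[c.ID] = ts
			c.syncedTS = ts
		}
		return true
	})
	if len(batch) > 0 {
		s.SetLastSeen(batch)
	}
	return len(batch)
}

// memStore is an in-memory Store used for the demonstration.
type memStore struct {
	data  map[string]int64
	calls int
}

func (m *memStore) SetLastSeen(batch map[string]int64) {
	m.calls++
	for k, v := range batch {
		m.data[k] = v
	}
}

func main() {
	reg := &Registry{}
	a := &Conn{ID: "a"}
	reg.Add(a)
	a.Touch(100)
	a.Touch(101) // several heartbeats collapse into one store write
	store := &memStore{data: map[string]int64{}}
	fmt.Println(reg.Flush(store), store.data["a"], store.calls)
}
```

In a real service `Flush` would run on a ticker, so a burst of heartbeats between two ticks costs a single batched write rather than one write per heartbeat.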

3.7 Custom Kafka headers

Headers such as X‑ID, X‑Uid, X‑Guid, X‑Event, etc., carry routing and tracing information, avoiding payload decoding in the gateway.
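The benefit of header-based routing can be illustrated as below: the gateway picks the target socket from the headers and treats the payload as opaque bytes. `Header` and `Message` are local stand-ins for a Kafka client's record types, not a real client API, and the assumption that X-ID carries the socket ID is made here only for illustration. The header names are written with plain ASCII hyphens.

```go
package main

import (
	"errors"
	"fmt"
)

// Header is a local stand-in for a Kafka record header.
type Header struct {
	Key   string
	Value []byte
}

// Message is a local stand-in for a consumed Kafka record.
type Message struct {
	Headers []Header
	Value   []byte // opaque payload, never decoded by the gateway
}

// Header names taken from the text.
const (
	HeaderID    = "X-ID"
	HeaderUid   = "X-Uid"
	HeaderGuid  = "X-Guid"
	HeaderEvent = "X-Event"
)

// header returns the first value stored under key.
func header(m Message, key string) (string, bool) {
	for _, h := range m.Headers {
		if h.Key == key {
			return string(h.Value), true
		}
	}
	return "", false
}

// Route picks the target socket and event type from headers alone.
// Treating X-ID as the socket ID is an assumption of this sketch.
func Route(m Message) (string, string, error) {
	id, ok := header(m, HeaderID)
	if !ok {
		return "", "", errors.New("missing X-ID header")
	}
	event, _ := header(m, HeaderEvent)
	return id, event, nil
}

func main() {
	m := Message{
		Headers: []Header{
			{Key: HeaderID, Value: []byte("socket-1")},
			{Key: HeaderUid, Value: []byte("42")},
			{Key: HeaderEvent, Value: []byte("message")},
		},
		Value: []byte{0x81, 0xa1, 0x61, 0x01}, // payload stays encoded
	}
	id, event, err := Route(m)
	fmt.Println(id, event, err)
}
```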

3.8 Message send/receive

type Packet struct { ... }

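// Connect wraps one client connection and its outbound queue.
type Connect struct {
    *websocket.Conn
    send chan Packet  // buffered outbound queue, drained by writer()
    mux  sync.RWMutex // serializes writes on the underlying connection
}

func NewConnect(conn net.Conn) *Connect {
    // N is the send-queue capacity. Wrapping conn into the embedded
    // *websocket.Conn is omitted here.
    c := &Connect{send: make(chan Packet, N)}
    go c.reader() // goroutine 1: reads frames from the client
    go c.writer() // goroutine 2: drains c.send to the client
    return c
}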

func (c *Connect) Write(data []byte) (err error) {
    c.mux.Lock()
    defer c.mux.Unlock()
    // write logic
    return nil
}

Reducing goroutine count from three to two per connection saves memory.
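The writer half of this pattern can be sketched without a WebSocket library by writing to any `io.Writer`. The sketch below is an assumption about the general shape, not the gateway's code: the `Packet` field, `Conn`, `Send`, `Close`, and `recorder` are all hypothetical, and the reader goroutine (which would be symmetrical) is omitted.

```go
package main

import (
	"fmt"
	"io"
)

// Packet carries one outbound frame. The Data field is hypothetical;
// the article elides the real Packet definition.
type Packet struct{ Data []byte }

// Conn owns a buffered send queue drained by a single writer goroutine.
type Conn struct {
	out  io.Writer
	send chan Packet
	done chan struct{}
}

// NewConn starts the writer goroutine for the given destination.
func NewConn(out io.Writer, queue int) *Conn {
	c := &Conn{out: out, send: make(chan Packet, queue), done: make(chan struct{})}
	go c.writer()
	return c
}

// writer is the only goroutine that writes to c.out.
func (c *Conn) writer() {
	defer close(c.done)
	for p := range c.send {
		if _, err := c.out.Write(p.Data); err != nil {
			return // stop on the first write error
		}
	}
}

// Send queues a packet without blocking; false means the queue is full,
// which usually indicates a slow client. It must not be called after Close.
func (c *Conn) Send(p Packet) bool {
	select {
	case c.send <- p:
		return true
	default:
		return false
	}
}

// Close lets the writer drain queued packets and waits for it to exit.
func (c *Conn) Close() {
	close(c.send)
	<-c.done
}

// recorder is a trivial io.Writer used for the demonstration.
type recorder struct{ data []byte }

func (r *recorder) Write(p []byte) (int, error) {
	r.data = append(r.data, p...)
	return len(p), nil
}

func main() {
	rec := &recorder{}
	c := NewConn(rec, 8)
	c.Send(Packet{Data: []byte("hello ")})
	c.Send(Packet{Data: []byte("world")})
	c.Close()
	fmt.Println(string(rec.data))
}
```

Funnelling all writes through one goroutine per connection keeps frames from interleaving, and the non-blocking `Send` prevents one slow client from stalling the code that fans messages out.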

3.9 Core object pooling

var ConnectionPool = sync.Pool{New: func() interface{} { return &Connection{} }}

func GetConn() *Connection { return ConnectionPool.Get().(*Connection) }

func PutConn(cli *Connection) { cli.Reset(); ConnectionPool.Put(cli) }
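A self-contained version of this pooling pattern, including a possible `Reset`, might look like the sketch below. The fields of `Connection` and the body of `Reset` are assumptions, since the article shows neither.

```go
package main

import (
	"fmt"
	"sync"
)

// Connection is a pooled per-client object; its fields are illustrative.
type Connection struct {
	ID            string
	UID           string
	LastMessageTS int64
	buf           []byte
}

// Reset clears per-connection state so that a recycled object never
// leaks data from its previous owner.
func (c *Connection) Reset() {
	c.ID = ""
	c.UID = ""
	c.LastMessageTS = 0
	c.buf = c.buf[:0] // keep the allocated capacity for reuse
}

var connectionPool = sync.Pool{
	New: func() interface{} { return &Connection{} },
}

// GetConn returns a clean Connection, reusing a pooled one when available.
func GetConn() *Connection { return connectionPool.Get().(*Connection) }

// PutConn resets the object and returns it to the pool.
func PutConn(c *Connection) {
	c.Reset()
	connectionPool.Put(c)
}

func main() {
	c := GetConn()
	c.ID, c.UID = "socket-1", "42"
	c.buf = append(c.buf, "payload"...)
	PutConn(c)

	d := GetConn() // may or may not be the same object as c
	fmt.Println(d.ID == "", len(d.buf))
}
```

Resetting on `Put` rather than on `Get` means an object sitting in the pool holds no references to user data, which also lets the garbage collector reclaim anything it pointed to.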

3.10 Data transmission optimization

MessagePack is used for serialization, and MTU is tuned (e.g., 1400 bytes) to avoid fragmentation.
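To show why MessagePack shrinks payloads, the toy encoder below hand-encodes a tiny map and compares its size with the JSON form. It supports only short string keys, booleans, and integers from 0 to 127, and is not a substitute for a real MessagePack library. The `fitsInOnePacket` helper and its 1400-byte limit are an assumed reading of the MTU remark above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Field is one key/value pair; a slice keeps the field order stable.
type Field struct {
	Key   string
	Value interface{}
}

// encodeMap encodes up to 15 fields as a MessagePack fixmap.
// Supported values: bool, and int in the range 0..127 (positive fixint).
func encodeMap(fields []Field) ([]byte, error) {
	if len(fields) > 15 {
		return nil, errors.New("too many fields for a fixmap")
	}
	out := []byte{0x80 | byte(len(fields))} // fixmap header
	for _, f := range fields {
		if len(f.Key) > 31 {
			return nil, errors.New("key too long for a fixstr")
		}
		out = append(out, 0xa0|byte(len(f.Key))) // fixstr header
		out = append(out, f.Key...)
		switch v := f.Value.(type) {
		case bool:
			if v {
				out = append(out, 0xc3) // true
			} else {
				out = append(out, 0xc2) // false
			}
		case int:
			if v < 0 || v > 127 {
				return nil, fmt.Errorf("int %d outside the fixint range", v)
			}
			out = append(out, byte(v)) // positive fixint
		default:
			return nil, fmt.Errorf("unsupported type %T", v)
		}
	}
	return out, nil
}

// maxPayload is the example size from the text (assumed to be a payload cap).
const maxPayload = 1400

// fitsInOnePacket reports whether b stays within the assumed limit.
func fitsInOnePacket(b []byte) bool { return len(b) <= maxPayload }

func main() {
	packed, err := encodeMap([]Field{{"compact", true}, {"schema", 0}})
	if err != nil {
		panic(err)
	}
	asJSON, _ := json.Marshal(map[string]interface{}{"compact": true, "schema": 0})
	fmt.Println(len(asJSON), len(packed), fitsInOnePacket(packed)) // prints "27 18 true"
}
```

The saving comes from dropping quotes, colons, and braces in favour of one-byte type and length prefixes.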

3.11 Infrastructure support

The service is built with the Ego framework, providing structured logging, dynamic log levels, and integrated monitoring of CPU, latency (P99), memory, and goroutine counts. Redis and Kafka client metrics are also visualized.

4 Performance testing

4.1 Test setup

One 4‑core 8 GB VM as the server targeting 480 k connections.

Eight 4‑core 8 GB VMs as clients, each opening 60 k ports.

4.2 Scenario 1 – 500 k online users

WS‑Gateway on a 16‑core 32 GB machine used 22.38 % CPU and 70.59 % memory; peak connection rate 16 k/s, 47 KB per user.
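As a rough sanity check of the per-user figure, the arithmetic can be reproduced as below. It assumes the memory percentage refers to the whole 32 GB machine, that GB means GiB, and that all used memory is attributed to connections; the function name is hypothetical.

```go
package main

import "fmt"

// bytesPerConn estimates memory per connection from total memory (GiB),
// the fraction in use, and the number of connections.
func bytesPerConn(totalGiB, usedFraction float64, conns int) float64 {
	return totalGiB * (1 << 30) * usedFraction / float64(conns)
}

func main() {
	perConn := bytesPerConn(32, 0.7059, 500000)
	fmt.Printf("%.1f KiB per connection\n", perConn/1024)
}
```

Under these assumptions the result is a little over 47 KiB per connection, in line with the reported 47 KB.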

4.3 Scenario 2 – Broadcast with acknowledgments

Every 5 s a ~1 KB message is broadcast to all users, and each client returns a receipt. After 5 min the service restarted because memory was exhausted (the broadcast code accounted for 9.32 % of memory and receipt handling for 10.38 %).

4.4 Scenario 3 – Broadcast without acknowledgments

The same broadcast load, but without receipts: memory usage rose to 93 %, with a peak send rate of 100 k msg/s.

4.5 Scenario 4 – High churn (40 k up/down per second)

CPU 46.96 %, memory 65.6 %; connection creation 18.5 k/s, receive 330 k msg/s, send 394 k msg/s, no crashes.

4.6 Summary

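On 16-core, 32 GB hardware, a single gateway instance held 500 k connections in the steady-state and high-churn scenarios with moderate CPU and memory usage. The broadcast scenarios are the limiting case: memory reached 93 % without acknowledgments, and with acknowledgments the service restarted after running out of memory. Overall, the results indicate that the redesign meets current scale requirements.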

5 Conclusion

Decoupling the gateway from business services, moving TLS off NGINX, and optimizing the handshake, socket ID generation, heartbeat handling, custom headers, message processing, object pooling, and payload encoding (MessagePack) collectively reduced resource consumption and improved reliability.

6 Q&A

6.1 Value of SocketID

SocketID allows Kafka consumers to locate the corresponding TCP connection; it also helps preserve message order and avoid loss during rolling updates.

6.2 Why Redis for broadcast

Redis pub/sub is simpler for cluster‑wide broadcast in Kubernetes and maintains compatibility with legacy logic, while Kafka is used for other asynchronous flows.

