Design and Performance Optimization of a High‑Scale WebSocket Gateway (Version 2.0)

This article describes the architectural evolution from Gateway 1.0 to Gateway 2.0 for a high‑traffic document collaboration platform, detailing the redesign of the WebSocket layer, resource‑usage optimizations, heartbeat mechanisms, custom Kafka headers, message serialization, load‑testing results, and the operational lessons learned to support millions of concurrent connections.

Top Architect

1 Introduction

In several Shimo Docs services (document sharing, comments, slide presentations, spreadsheet sync, etc.) the need for multi‑client data synchronization and server‑side push cannot be satisfied by plain HTTP, so a WebSocket‑based solution is adopted.

As daily peak connections reached the order of a million, memory and CPU consumption grew sharply, which prompted a redesign of the gateway.

2 Gateway 1.0

2.1 Architecture

Gateway 1.0 was built with Node.js and Socket.IO and met the early traffic requirements.

2.2 Pain Points

Resource waste: Nginx only performed TLS termination and passed traffic through, causing high CPU/memory usage.

Lack of monitoring integration with Shimo’s observability platform.

Business‑gateway coupling made horizontal scaling of business logic impossible.

3 Gateway 2.0

3.1 Overall Architecture

Gateway 2.0 separates the gateway function (WS‑Gateway) from business processing (WS‑API). WS‑Gateway handles authentication, TLS, and WebSocket connection management, while WS‑API communicates with component services via gRPC, enabling per‑module scaling and removing Nginx.

3.2 Handshake Process

The client first performs a normal Socket.IO handshake. It sends a GET request, and the server returns:

{"sid":"xxx","upgrades":["websocket"],"pingInterval":...,"pingTimeout":...}

The client then upgrades to WebSocket. Under poor network conditions the connection falls back to HTTP long‑polling.
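The handshake response can be decoded with a small struct. The following is a minimal sketch, not the gateway's actual code: the type and function names are hypothetical, the numeric values are illustrative because the article elides them, and any transport framing around the JSON is assumed to have been stripped already.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Handshake mirrors the four fields of the handshake response quoted above.
// The Go names are hypothetical; only the JSON keys come from the article.
type Handshake struct {
	SID          string   `json:"sid"`
	Upgrades     []string `json:"upgrades"`
	PingInterval int      `json:"pingInterval"`
	PingTimeout  int      `json:"pingTimeout"`
}

// CanUpgrade reports whether the server offered the given transport.
func (h Handshake) CanUpgrade(transport string) bool {
	for _, u := range h.Upgrades {
		if u == transport {
			return true
		}
	}
	return false
}

// ParseHandshake decodes the JSON body of the handshake response.
func ParseHandshake(body []byte) (Handshake, error) {
	var h Handshake
	err := json.Unmarshal(body, &h)
	return h, err
}

func main() {
	// Illustrative values only; the article does not give the real ones.
	body := []byte(`{"sid":"xxx","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":5000}`)
	h, err := ParseHandshake(body)
	if err != nil {
		panic(err)
	}
	if h.CanUpgrade("websocket") {
		fmt.Println("upgrade to websocket for session", h.SID)
	} else {
		fmt.Println("stay on long-polling for session", h.SID)
	}
}
```

A client that does not find "websocket" in the upgrades list, or whose upgrade attempt fails, simply keeps using long‑polling.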

3.3 TLS Memory Optimization

TLS termination moved from Nginx into the service itself, and profiling shows that TLS handshakes consume about 30% of total memory. Two remedies were considered: offloading TLS to a layer‑7 load balancer, or optimizing Go’s TLS implementation (referencing Go PR #43563).

3.4 Socket ID Design

Each connection receives a unique SnowFlake‑generated ID. In physical‑machine deployments the machine ID is fixed; in Kubernetes the replica ID is stored in a database and used as the SnowFlake node identifier.
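To make the role of the node identifier concrete, here is a minimal SnowFlake-style generator. It is a sketch under assumptions: the bit layout (10 node bits, 12 sequence bits), the custom epoch, and all names are chosen for illustration, and the article does not say which layout the gateway uses.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	nodeBits = 10 // up to 1024 nodes (assumed layout)
	seqBits  = 12 // up to 4096 IDs per node per millisecond (assumed layout)
	maxNode  = 1<<nodeBits - 1
	maxSeq   = 1<<seqBits - 1
	epochMs  = int64(1577836800000) // 2020-01-01 UTC, an arbitrary custom epoch
)

// Generator produces IDs laid out as timestamp | node | sequence.
type Generator struct {
	mu     sync.Mutex
	node   int64
	lastMs int64
	seq    int64
}

// NewGenerator validates the node ID, which would be the fixed machine ID
// or the replica ID described above.
func NewGenerator(node int64) (*Generator, error) {
	if node < 0 || node > maxNode {
		return nil, fmt.Errorf("node id %d out of range [0, %d]", node, maxNode)
	}
	return &Generator{node: node}, nil
}

// Next returns a new ID that is unique for this node and strictly increasing.
func (g *Generator) Next() int64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	ms := time.Now().UnixMilli()
	if ms < g.lastMs {
		ms = g.lastMs // clock stepped backwards: keep using the last timestamp
	}
	if ms == g.lastMs {
		g.seq = (g.seq + 1) & maxSeq
		if g.seq == 0 { // sequence exhausted: spin until the next millisecond
			for ms <= g.lastMs {
				ms = time.Now().UnixMilli()
			}
		}
	} else {
		g.seq = 0
	}
	g.lastMs = ms
	return (ms-epochMs)<<(nodeBits+seqBits) | g.node<<seqBits | g.seq
}

// NodeOf extracts the node identifier from an ID.
func NodeOf(id int64) int64 { return (id >> seqBits) & maxNode }

func main() {
	g, err := NewGenerator(3)
	if err != nil {
		panic(err)
	}
	id := g.Next()
	fmt.Println("socket id:", id, "issued by node:", NodeOf(id))
}
```

Because the node bits are part of every ID, two replicas can never issue the same ID as long as their node identifiers differ, which is why the replica ID has to be assigned centrally in Kubernetes.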

3.5 Cluster Session Management (Event Broadcast)

Session data is kept in memory and partially persisted in Redis under keys such as ws:user:clients:${uid}, ws:guid:clients:${guid}, and ws:client:${socket.id}. Two broadcast strategies were evaluated: simple event broadcast, which is easy to implement but whose cost grows with the number of nodes, and a registration‑center approach, which gives a clear mapping of sessions to nodes but adds operational overhead. Event broadcast over Redis was chosen after a benchmark of 1 million operations.
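The key patterns and the event-broadcast idea can be sketched in a few lines. This is an in-process illustration only: the Event and Node types are hypothetical, a plain loop stands in for the Redis channel, and the socket ID is assumed to be the integer SnowFlake ID from section 3.4.

```go
package main

import "fmt"

// Key builders for the three Redis key patterns quoted above.
func UserClientsKey(uid string) string  { return "ws:user:clients:" + uid }
func GUIDClientsKey(guid string) string { return "ws:guid:clients:" + guid }
func ClientKey(socketID int64) string   { return fmt.Sprintf("ws:client:%d", socketID) }

// Event is a hypothetical broadcast message addressed to one user.
type Event struct {
	UID     string
	Payload string
}

// Node stands in for one gateway instance. local maps a user ID to the
// socket IDs whose connections live in this instance's memory.
type Node struct {
	local map[string][]int64
}

// Handle returns the local sockets this node would push the event to.
func (n *Node) Handle(e Event) []int64 { return n.local[e.UID] }

// Broadcast hands the event to every node, as a shared channel would.
// It returns how many nodes received it and which sockets matched.
func Broadcast(nodes []*Node, e Event) (received int, targets []int64) {
	for _, n := range nodes {
		received++
		targets = append(targets, n.Handle(e)...)
	}
	return received, targets
}

func main() {
	nodes := []*Node{
		{local: map[string][]int64{"42": {1001}}},
		{local: map[string][]int64{}},
		{local: map[string][]int64{"42": {1002}}},
	}
	received, targets := Broadcast(nodes, Event{UID: "42", Payload: "comment added"})
	fmt.Println(UserClientsKey("42"), "nodes reached:", received, "sockets:", targets)
}
```

Every node sees every event even when it holds none of the target sockets, which is the cost that grows with node count; a registration center would avoid it by looking up the owning node first.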

3.6 Heartbeat Mechanism

After a successful handshake the server pushes the heartbeat parameters to the client. The client then reports a heartbeat at the configured interval. Each report updates an in‑memory timestamp, which is synced to Redis only periodically to avoid a thundering‑herd effect.

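// t is presumably a *time.Ticker that fires once per sweep interval.
for {
    select {
    case <-t.C:
        now := time.Now().Unix()
        // Collect expired connections first and close them after the scan.
        var clients []*Connection
        dispatcher.clients.Range(func(_, v interface{}) bool {
            client := v.(*Connection)
            lastTs := atomic.LoadInt64(&client.LastMessageTS)
            if now-lastTs > int64(expireTime) {
                // No message within expireTime seconds: mark for closing.
                clients = append(clients, client)
            } else {
                // Still alive: pass the latest timestamp on to the Redis mapping.
                dispatcher.clearRedisMapping(client.Id, client.Uid, lastTs, clearTimeout)
            }
            return true
        })
        for _, cli := range clients {
            cli.WsClose()
        }
    }
}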
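The loop above depends on types the article does not show. A self-contained sketch of the same sweep follows, with a hypothetical Registry type standing in for the dispatcher and the Redis call left out.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Connection carries only the fields the sweep needs.
type Connection struct {
	Id            int64
	LastMessageTS int64 // Unix seconds, read and written atomically
}

// Touch records a heartbeat or any other inbound message.
func (c *Connection) Touch(now int64) { atomic.StoreInt64(&c.LastMessageTS, now) }

// Registry is a hypothetical stand-in for the dispatcher's client map.
type Registry struct {
	clients sync.Map // Id -> *Connection
}

// Add registers a connection.
func (r *Registry) Add(c *Connection) { r.clients.Store(c.Id, c) }

// Sweep removes and returns every connection whose last message is more
// than expire seconds older than now. Expired entries are collected during
// the scan and deleted afterwards, mirroring the loop above.
func (r *Registry) Sweep(now, expire int64) []*Connection {
	var expired []*Connection
	r.clients.Range(func(_, v interface{}) bool {
		c := v.(*Connection)
		if now-atomic.LoadInt64(&c.LastMessageTS) > expire {
			expired = append(expired, c)
		}
		return true
	})
	for _, c := range expired {
		r.clients.Delete(c.Id)
	}
	return expired
}

func main() {
	r := &Registry{}
	stale := &Connection{Id: 1}
	fresh := &Connection{Id: 2}
	stale.Touch(100)
	fresh.Touch(190)
	r.Add(stale)
	r.Add(fresh)
	for _, c := range r.Sweep(200, 60) {
		fmt.Println("closing connection", c.Id)
	}
}
```

Because the timestamp is a plain int64 updated atomically, the reader goroutine can record heartbeats without taking a lock that the sweep would contend on.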

The heartbeat reporting QPS can be reduced dynamically: after a configurable number of successful heartbeats, the interval is lengthened, up to a maximum of y seconds. With 500,000 connections, the reporting rate therefore bottoms out at QPS2 = 500000 / y.
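The arithmetic behind that lower bound is easy to check. In the sketch below only the 500,000 connection count comes from the text; the base interval, step, threshold, and cap are made-up values, and the function names are hypothetical.

```go
package main

import "fmt"

// HeartbeatQPS returns the average number of heartbeat reports per second
// when every connection reports once per intervalSec seconds.
func HeartbeatQPS(connections, intervalSec int) int {
	return connections / intervalSec
}

// NextInterval returns the interval to use after a number of consecutive
// successful heartbeats: it grows by step every `every` successes and is
// capped at limit (the y in the formula above).
func NextInterval(base, step, every, limit, successes int) int {
	iv := base + (successes/every)*step
	if iv > limit {
		return limit
	}
	return iv
}

func main() {
	const connections = 500000
	// Hypothetical tuning: start at 25 s, add 5 s per 10 successes, cap at 50 s.
	for _, successes := range []int{0, 10, 25, 1000} {
		iv := NextInterval(25, 5, 10, 50, successes)
		fmt.Printf("after %4d successes: interval %2d s, about %d reports/s\n",
			successes, iv, HeartbeatQPS(connections, iv))
	}
}
```

With these example numbers the reporting load falls from 20,000 to 10,000 reports per second once every connection has reached the cap.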

3.7 Custom Kafka Headers

To avoid the cost of decoding message bodies at the gateway, operation commands and parameters are carried in Kafka headers (X-ID, X-Uid, X-Guid, X-Operator, and others), which enables traceability and lightweight routing.
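The benefit of header-based routing is that the body stays opaque. The sketch below uses local Header and Message types shaped like the key/value pairs a Kafka record carries instead of a real client library; the header names come from the text, while their exact meaning (for example X-ID as the socket ID), the operator value, and the Route function are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// Header is a key/value pair, shaped like a Kafka record header.
type Header struct {
	Key   string
	Value []byte
}

// Message is a hypothetical record: headers plus an opaque body.
type Message struct {
	Headers []Header
	Value   []byte
}

// NewMessage attaches the routing metadata as headers and leaves the body untouched.
func NewMessage(id int64, uid, guid, operator string, body []byte) Message {
	return Message{
		Headers: []Header{
			{Key: "X-ID", Value: []byte(strconv.FormatInt(id, 10))},
			{Key: "X-Uid", Value: []byte(uid)},
			{Key: "X-Guid", Value: []byte(guid)},
			{Key: "X-Operator", Value: []byte(operator)},
		},
		Value: body,
	}
}

// HeaderValue looks up a header by key.
func HeaderValue(m Message, key string) (string, bool) {
	for _, h := range m.Headers {
		if h.Key == key {
			return string(h.Value), true
		}
	}
	return "", false
}

// Route picks a handler name from X-Operator without reading m.Value.
func Route(m Message) string {
	op, ok := HeaderValue(m, "X-Operator")
	if !ok {
		return "unroutable"
	}
	return "handler:" + op
}

func main() {
	body := []byte{0x81, 0xa1, 0x6b, 0xa1, 0x76} // opaque bytes, never decoded here
	m := NewMessage(1001, "42", "doc1", "broadcast", body)
	uid, _ := HeaderValue(m, "X-Uid")
	fmt.Println(Route(m), "for user", uid)
}
```

The same headers can be logged on both sides of the queue, which gives a trace of each message without touching its payload.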

3.8 Message Send/Receive

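// Packet is one outbound message (fields elided).
type Packet struct { ... }

// Connect wraps a WebSocket connection with a buffered send queue.
type Connect struct {
    *websocket.Conn
    send chan Packet
}

// NewConnect wraps conn and starts the reader and writer goroutines.
// N is the capacity of the send queue.
func NewConnect(conn *websocket.Conn) *Connect {
    c := &Connect{Conn: conn, send: make(chan Packet, N)}
    go c.reader()
    go c.writer()
    return c
}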

The initial implementation used three goroutines per connection. This was later reduced to two (a reader plus an on‑demand writer), and a sync.Pool was introduced for Connection objects to reduce GC pressure.
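One possible reading of "on‑demand writer" is that no goroutine is parked on a send channel and each sender writes directly while holding a mutex. The sketch below illustrates that reading only; it is not the gateway's code. A bytes.Buffer stands in for the socket, messages are newline-delimited instead of WebSocket frames, and all names are hypothetical.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"sync"
)

// Conn has no writer goroutine: Send runs on the caller's goroutine.
type Conn struct {
	r  io.Reader
	w  io.Writer
	mu sync.Mutex // serializes writes from concurrent senders
}

// Send writes one newline-terminated message under the write lock.
func (c *Conn) Send(p []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, err := c.w.Write(p); err != nil {
		return err
	}
	_, err := c.w.Write([]byte{'\n'})
	return err
}

// ReadLoop delivers each inbound message to handle until the stream ends.
// On a real socket this would run in the connection's reader goroutine.
func (c *Conn) ReadLoop(handle func(string)) error {
	sc := bufio.NewScanner(c.r)
	for sc.Scan() {
		handle(sc.Text())
	}
	return sc.Err()
}

// runDemo sends from several goroutines into a loopback buffer and then
// reads everything back. Reading starts only after all writes finish,
// because bytes.Buffer is not safe for concurrent read and write.
func runDemo(senders int) []string {
	var wire bytes.Buffer
	c := &Conn{r: &wire, w: &wire}
	var wg sync.WaitGroup
	for i := 0; i < senders; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = c.Send([]byte(fmt.Sprintf("msg-%d", i)))
		}(i)
	}
	wg.Wait()
	var got []string
	_ = c.ReadLoop(func(s string) { got = append(got, s) })
	return got
}

func main() {
	fmt.Println("messages received intact:", len(runDemo(3)))
}
```

The trade-off is that a slow client now blocks the sender instead of filling a queue, so a real implementation would likely pair this with a write deadline.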

3.9 Core Object Pool

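// ConnectionPool recycles Connection objects across connections.
var ConnectionPool = sync.Pool{New: func() interface{} { return &Connection{} }}

// GetConn takes a Connection from the pool, allocating one if the pool is empty.
func GetConn() *Connection { return ConnectionPool.Get().(*Connection) }

// PutConn clears the Connection's state and returns it to the pool.
func PutConn(c *Connection) { c.Reset(); ConnectionPool.Put(c) }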
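The pool only helps if Reset really clears per-connection state, otherwise data from one user could leak into the next connection. The article does not show Reset or the Connection fields, so the ones below are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Connection holds a few hypothetical per-connection fields.
type Connection struct {
	Id   int64
	Uid  string
	send chan []byte
}

// Reset clears all per-connection state so that a recycled object cannot
// leak one user's data into the next connection.
func (c *Connection) Reset() {
	c.Id = 0
	c.Uid = ""
	c.send = nil
}

var ConnectionPool = sync.Pool{New: func() interface{} { return &Connection{} }}

// GetConn takes a Connection from the pool, allocating one if needed.
func GetConn() *Connection { return ConnectionPool.Get().(*Connection) }

// PutConn resets the Connection and returns it to the pool.
func PutConn(c *Connection) { c.Reset(); ConnectionPool.Put(c) }

func main() {
	c := GetConn()
	c.Id, c.Uid = 1001, "42"
	c.send = make(chan []byte, 8)
	fmt.Println("in use:", c.Id, c.Uid)
	PutConn(c) // c must not be used after this point

	next := GetConn() // may or may not be the same object
	fmt.Println("fresh state:", next.Id == 0 && next.Uid == "" && next.send == nil)
}
```

sync.Pool may drop pooled objects at any garbage collection, so it reduces allocation pressure but gives no guarantee that a particular object is reused.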

3.10 Data Transfer Optimization

MessagePack is used to serialize payloads, which reduces their size. The MTU is also tuned (for example, to 1400 bytes) to avoid fragmentation, and ping tests are used to find the optimal value.
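The relationship between the MTU and the usable payload per packet is simple arithmetic. This sketch assumes IPv4 and TCP headers without options (20 bytes each) and an 8-byte ICMP header for ping; it ignores TLS record and WebSocket frame overhead, and the function names are hypothetical.

```go
package main

import "fmt"

const (
	ipv4Header = 20 // bytes, without options
	tcpHeader  = 20 // bytes, without options
	icmpHeader = 8  // bytes, echo request header
)

// MaxSegmentPayload returns the TCP payload that fits in one packet.
func MaxSegmentPayload(mtu int) int { return mtu - ipv4Header - tcpHeader }

// Segments returns how many packets a payload of the given size needs.
func Segments(payload, mtu int) int {
	mss := MaxSegmentPayload(mtu)
	if mss <= 0 || payload <= 0 {
		return 0
	}
	return (payload + mss - 1) / mss // ceiling division
}

// PingPayload returns the largest ICMP echo payload that fits in one
// packet, which is the size to probe with when testing a candidate MTU.
func PingPayload(mtu int) int { return mtu - ipv4Header - icmpHeader }

func main() {
	const mtu = 1400
	fmt.Println("TCP payload per packet:", MaxSegmentPayload(mtu))
	fmt.Println("packets for a 4096-byte message:", Segments(4096, mtu))
	fmt.Println("ping payload to probe this MTU:", PingPayload(mtu))
}
```

For an MTU of 1400 bytes this gives 1360 bytes of TCP payload per packet, so a serialized message that stays under that size travels in a single packet.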

3.11 Infrastructure Support

The service is built with the EGO framework, which provides structured logging, dynamic log levels, and built‑in metrics (CPU, P99 latency, memory, goroutine count). Monitoring dashboards cover Redis, Kafka, and the gateway itself.

4 Performance Stress Tests

4.1 Test Setup

One 4‑core, 8 GB VM serves as the gateway, with a target of 480 k concurrent connections.

Eight 4‑core, 8 GB VMs act as clients, each opening 60 k ports.

4.2 Scenario 1 – User Login (500 k online)

WS‑Gateway peak connection rate: 16 k connections/s, memory per user ≈ 47 KB.

4.3 Scenario 2 – Broadcast with Ack (500 k online)

Every 5 s a broadcast is sent to all users, and each user acknowledges it. After 5 min the service restarts due to memory pressure: the broadcast code accounts for about 9.3% of memory and ack handling for about 10.4%.

4.4 Scenario 3 – Broadcast without Ack (500 k online)

Peak connection rate: 11 k/s, send rate: 100 k messages/s, memory usage stays below 94%.

4.5 Scenario 4 – Mixed Load (500 k online, 40 k up/down per second)

Peak connection rate: 18.5 k/s, receive rate: 330 k msgs/s, send rate: 394 k msgs/s, no abnormal behavior.

4.6 Test Summary

On a 16‑core 32 GB machine the gateway sustains 500 k concurrent connections across all scenarios with acceptable CPU and memory consumption, confirming the redesign meets current scale requirements.

5 Conclusion

The redesign decouples the gateway from business services and removes the Nginx dependency. It introduces a handshake that can fall back to long‑polling, SnowFlake‑based socket IDs, optimized heartbeat handling, custom Kafka headers for traceability, streamlined send/receive code, connection object pooling, MessagePack serialization, and full observability. Together these changes reduce per‑user resource usage and increase reliability, providing a solid foundation for future growth.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
