Design and Performance Optimization of a High‑Concurrency WebSocket Gateway (Version 2.0)

This article describes the evolution from a Node.js‑based WebSocket gateway to a Go‑powered, gRPC‑enabled architecture, detailing the redesign of the gateway, resource‑saving techniques, heartbeat and TLS optimizations, message‑broker choices, extensive performance testing, and the resulting improvements in CPU, memory, and scalability for millions of concurrent connections.

Architect

Dec 1, 2021

1 Introduction

In Shimo Docs, features such as document sharing, comments, slide presentations, and spreadsheet follow‑along require multi‑client data synchronization and server‑initiated push, which the HTTP protocol cannot satisfy; therefore a WebSocket solution was adopted.

As the business grew, daily peak connections reached the order of a million, which drove memory and CPU consumption up sharply and prompted a rebuild of the gateway.

2 Gateway 1.0

2.1 Architecture

Gateway 1.0 was implemented with Node.js and Socket.IO, which adequately supported the early traffic volume.

2.2 Pain points

Resource consumption: Nginx only performed TLS termination and request pass‑through, wasting resources, while the Node gateway consumed excessive CPU and memory.

Maintenance & monitoring: No integration with Shimo’s monitoring system, making alerting and troubleshooting difficult.

Business coupling: Business services and gateway logic were bundled together, preventing independent horizontal scaling of the business layer.

3 Gateway 2.0

Gateway 2.0 addresses the above issues by separating the gateway function (WS‑Gateway) from business processing (WS‑API). The new design integrates user authentication, TLS certificate verification, and WebSocket connection management into WS‑Gateway, while WS‑API communicates with component services via gRPC, enabling targeted scaling, removing Nginx, reducing hardware consumption, and joining Shimo’s monitoring ecosystem.

3.1 Overall architecture

The architecture diagram (omitted) shows WS‑Gateway handling TLS handshake and connection management, and WS‑API handling business logic and message routing.

3.2 Handshake process

When network conditions are good, the client completes a six‑step handshake and upgrades to WebSocket; under poor conditions the connection degrades to HTTP long‑polling.

{"sid":"xxx","upgrades":["websocket"],"pingInterval":xxx,"pingTimeout":xxx}
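As a rough illustration of how a client might consume this response, the sketch below decodes the handshake payload and checks whether the WebSocket upgrade is offered. The struct, the function names, the sample values, and the assumption that the ping fields are in milliseconds are all hypothetical; the source only shows the JSON shape.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Handshake mirrors the fields of the handshake response shown above.
type Handshake struct {
	SID          string   `json:"sid"`
	Upgrades     []string `json:"upgrades"`
	PingInterval int      `json:"pingInterval"` // assumed to be milliseconds
	PingTimeout  int      `json:"pingTimeout"`  // assumed to be milliseconds
}

// CanUpgrade reports whether the server offered the WebSocket transport.
// If it returns false, the client stays on HTTP long-polling.
func (h Handshake) CanUpgrade() bool {
	for _, u := range h.Upgrades {
		if u == "websocket" {
			return true
		}
	}
	return false
}

// ParseHandshake decodes the raw JSON handshake payload.
func ParseHandshake(raw []byte) (Handshake, error) {
	var h Handshake
	err := json.Unmarshal(raw, &h)
	return h, err
}

func main() {
	// Sample values are illustrative only.
	raw := []byte(`{"sid":"abc","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":5000}`)
	h, err := ParseHandshake(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(h.SID, h.CanUpgrade())
}
```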

3.3 TLS memory‑usage optimization

In version 1.0 TLS termination was performed by Nginx, consuming about 30% of total memory. In version 2.0 the TLS certificate is mounted directly on the service, reducing the memory footprint.

3.4 Socket ID design

Each connection receives a unique identifier generated by the SnowFlake algorithm. In physical‑machine deployments a fixed machine ID guarantees uniqueness; in Kubernetes the ID is allocated via a registration service that writes the instance information to a database.
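The following is a minimal sketch of a Snowflake-style generator, not the gateway's actual code. It assumes the commonly used layout of a 41-bit millisecond timestamp, a 10-bit machine ID, and a 12-bit sequence; the custom epoch and all names are hypothetical. In a Kubernetes deployment the machine ID would be whatever the registration service hands out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

const (
	machineBits  = 10
	sequenceBits = 12
	maxMachineID = 1<<machineBits - 1
	maxSequence  = 1<<sequenceBits - 1

	// epochMs is an arbitrary custom epoch (2021-01-01 UTC), assumed for this sketch.
	epochMs int64 = 1609459200000
)

// Generator produces unique, increasing 64-bit IDs for one machine ID.
type Generator struct {
	mu        sync.Mutex
	machineID int64
	lastMs    int64
	seq       int64
}

// NewGenerator validates the machine ID, which must be unique per instance.
func NewGenerator(machineID int64) (*Generator, error) {
	if machineID < 0 || machineID > maxMachineID {
		return nil, errors.New("machine ID out of range")
	}
	return &Generator{machineID: machineID}, nil
}

// Next returns the next ID. If the clock moves backwards, or the sequence
// overflows within one millisecond, it keeps a logical timestamp that never
// decreases instead of spinning until the wall clock catches up.
func (g *Generator) Next() int64 {
	g.mu.Lock()
	defer g.mu.Unlock()

	now := time.Now().UnixNano() / int64(time.Millisecond)
	if now < g.lastMs {
		now = g.lastMs
	}
	if now == g.lastMs {
		g.seq = (g.seq + 1) & maxSequence
		if g.seq == 0 {
			now = g.lastMs + 1 // sequence exhausted: borrow the next millisecond
		}
	} else {
		g.seq = 0
	}
	g.lastMs = now
	return (now-epochMs)<<(machineBits+sequenceBits) | g.machineID<<sequenceBits | g.seq
}

func main() {
	g, err := NewGenerator(7)
	if err != nil {
		panic(err)
	}
	fmt.Println(g.Next(), g.Next())
}
```

Two instances that share a machine ID can produce colliding IDs, which is why the allocation of that ID matters more than the generator itself.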

3.5 Cluster session management – event broadcast

Session data is stored in memory on the gateway node and partially persisted in Redis. The key‑value layout is:

| Key | Description |
| --- | --- |
| ws:user:clients:${uid} | Sorted set mapping a user to their WebSocket connections |
| ws:guid:clients:${guid} | Sorted set mapping a file to its WebSocket connections |
| ws:client:${socket.id} | Hash storing all user and file relationships for a given socket |

Two broadcast strategies were evaluated:

| Strategy | Advantages | Disadvantages |
| --- | --- | --- |
| Event broadcast | Simple implementation | Message count grows with node count |
| Service registry | Clear session‑to‑node mapping | Additional operational overhead |

For the message broker, three candidates were compared (Redis, Kafka, RocketMQ). Benchmarks showed Redis performed best for payloads under 10 KB, which matches the typical broadcast size, so Redis was chosen for event broadcasting.
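To make the trade-off of event broadcasting concrete, here is a small in-memory sketch. The `Bus` type stands in for the Redis channel and `Node` for a gateway instance; both are hypothetical names, and no real Redis client is involved. Every node receives every event, but only the node that holds the target session delivers it.

```go
package main

import "fmt"

// Node is a hypothetical gateway instance with its locally held sessions.
type Node struct {
	Name      string
	sessions  map[string]bool // socket IDs connected to this node
	Received  int             // broadcast events this node had to inspect
	Delivered []string        // payloads actually pushed to a local socket
}

// NewNode creates a node that holds the given socket IDs.
func NewNode(name string, socketIDs ...string) *Node {
	n := &Node{Name: name, sessions: make(map[string]bool)}
	for _, id := range socketIDs {
		n.sessions[id] = true
	}
	return n
}

// Bus stands in for the shared broadcast channel.
type Bus struct{ Nodes []*Node }

// Publish fans the event out to every node and returns how many copies were
// sent. The fan-out equals the node count, regardless of where the socket lives.
func (b *Bus) Publish(socketID, payload string) int {
	fanout := 0
	for _, n := range b.Nodes {
		n.Received++
		fanout++
		if n.sessions[socketID] {
			n.Delivered = append(n.Delivered, payload)
		}
	}
	return fanout
}

func main() {
	bus := &Bus{Nodes: []*Node{NewNode("a", "s1"), NewNode("b", "s2"), NewNode("c")}}
	fanout := bus.Publish("s2", "hello")
	fmt.Println(fanout, len(bus.Nodes[1].Delivered)) // prints "3 1"
}
```

The sender needs no knowledge of session placement, which is what keeps the implementation simple; the cost is one copy of each event per node.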

3.6 Heartbeat mechanism

After a successful handshake, the client receives heartbeat parameters from the server. The client reports timestamps periodically; the server updates the in‑memory session and synchronizes to Redis at a lower frequency to avoid Redis overload.

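// t is the ticker that drives the periodic heartbeat sweep.
for range t.C {
    now := time.Now().Unix()
    var expired []*Connection
    dispatcher.clients.Range(func(_, v interface{}) bool {
        client := v.(*Connection)
        lastTs := atomic.LoadInt64(&client.LastMessageTS)
        if now-lastTs > int64(expireTime) {
            // No heartbeat within expireTime: collect the connection for closing.
            expired = append(expired, client)
        } else {
            // Still alive: pass the last heartbeat timestamp on so the
            // Redis-side mapping for this connection is maintained.
            dispatcher.clearRedisMapping(client.Id, client.Uid, lastTs, clearTimeout)
        }
        return true
    })
    // Close after the sweep so slow closes do not hold up the Range callback.
    for _, cli := range expired {
        cli.WsClose()
    }
}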

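The heartbeat interval is dynamic. At the longest interval the heartbeat load is QPS₂ = 500 000 / y, where 500 000 is the number of connections and y is the maximum interval multiplier, so heartbeat‑generated load drops by up to a factor of y compared with the base interval.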
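The arithmetic can be checked with a few lines of Go. This is only a worked example of the formula above: the function name is hypothetical, and the multipliers 1, 5, and 10 are illustrative values, not figures from the gateway.

```go
package main

import "fmt"

// HeartbeatQPS returns the heartbeat requests per second generated by conns
// connections when the interval is stretched by the given multiplier
// (multiplier 1 means the base interval). Multipliers below 1 are treated as 1.
func HeartbeatQPS(conns, multiplier int) int {
	if multiplier < 1 {
		multiplier = 1
	}
	return conns / multiplier
}

func main() {
	for _, y := range []int{1, 5, 10} {
		fmt.Printf("y=%d -> %d heartbeats/s\n", y, HeartbeatQPS(500000, y))
	}
}
```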

3.7 Custom Kafka headers

To avoid costly message‑body decoding at the gateway, essential metadata (e.g., X-ID, X-Uid, X-Guid, X-Event, X-Operator) is placed in Kafka headers, enabling efficient routing and full traceability.

| Field | Description | Details |
| --- | --- | --- |
| X-ID | WebSocket ID | Connection identifier |
| X-Uid | User ID | Identifies the user |
| X-Guid | File ID | Identifies the document/file |
| X-Inner | Gateway internal command | Join/leave events |
| X-Event | Gateway event type | Connect/Message/Disconnect |
| X-Operator | API layer command | Unicast, broadcast, internal ops |
| X-Auth-Type | User authentication type | SDKV2, main site, WeChat, mobile, desktop |
| X-Trace-ID | Trace identifier | End‑to‑end tracing |
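A sketch of header-based routing follows. The `Header` type only mimics the key/value shape of a Kafka record header so the example stays free of any client library; `Message`, `Route`, and `RouteFromHeaders` are hypothetical names, and treating a missing X-Operator as an error is an assumption. The point is that the body is never decoded.

```go
package main

import (
	"errors"
	"fmt"
)

// Header mimics the key/value shape of a Kafka record header.
type Header struct {
	Key   string
	Value []byte
}

// Message carries routing metadata in headers and an opaque body.
type Message struct {
	Headers []Header
	Body    []byte // passed through untouched in this sketch
}

// Route holds the routing fields extracted from the headers.
type Route struct {
	SocketID, UID, GUID, Event, Operator, TraceID string
}

// RouteFromHeaders builds a Route by reading headers only.
func RouteFromHeaders(m Message) (Route, error) {
	var r Route
	for _, h := range m.Headers {
		switch h.Key {
		case "X-ID":
			r.SocketID = string(h.Value)
		case "X-Uid":
			r.UID = string(h.Value)
		case "X-Guid":
			r.GUID = string(h.Value)
		case "X-Event":
			r.Event = string(h.Value)
		case "X-Operator":
			r.Operator = string(h.Value)
		case "X-Trace-ID":
			r.TraceID = string(h.Value)
		}
	}
	if r.Operator == "" {
		return r, errors.New("missing X-Operator header")
	}
	return r, nil
}

func main() {
	msg := Message{
		Headers: []Header{
			{Key: "X-Guid", Value: []byte("file-1")},
			{Key: "X-Operator", Value: []byte("broadcast")},
		},
		Body: []byte("opaque payload"),
	}
	r, err := RouteFromHeaders(msg)
	fmt.Println(r.GUID, r.Operator, err)
}
```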

3.8 Message receive & send

The initial implementation used three goroutines per connection (reader, writer, and a background task). To reduce memory usage, the writer goroutine was eliminated; writes are now performed synchronously, guarded by a mutex.

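type Packet struct { ... }

// Connect wraps a WebSocket connection and serializes writes with a mutex.
type Connect struct {
    *websocket.Conn
    mux sync.Mutex // guards concurrent writes to the connection
}

// NewConnect wraps conn and starts only the reader goroutine;
// there is no dedicated writer goroutine or send channel.
func NewConnect(conn *websocket.Conn) *Connect {
    c := &Connect{Conn: conn}
    go c.reader()
    return c
}

// Write sends data synchronously. The mutex keeps concurrent callers
// from interleaving their writes on the same connection.
func (c *Connect) Write(data []byte) (err error) {
    c.mux.Lock()
    defer c.mux.Unlock()
    // write logic
    return nil
}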

3.9 Core object pooling

Connection objects are pooled using sync.Pool to reduce GC pressure. Helper functions GetConn() and PutConn() acquire and release objects.

var ConnectionPool = sync.Pool{New: func() interface{} { return &Connection{} }}

func GetConn() *Connection { return ConnectionPool.Get().(*Connection) }

func PutConn(cli *Connection) { cli.Reset(); ConnectionPool.Put(cli) }
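The snippet above relies on a Reset method that is not shown. A runnable sketch with a trimmed-down, hypothetical Connection type could look like this; the field set is an assumption, and the real struct certainly holds more state.

```go
package main

import (
	"fmt"
	"sync"
)

// Connection is a reduced stand-in for the gateway's connection object.
type Connection struct {
	Id            string
	Uid           string
	LastMessageTS int64
}

// Reset clears all per-connection state so a recycled object never leaks
// data from its previous owner.
func (c *Connection) Reset() { *c = Connection{} }

// ConnectionPool hands out reusable Connection objects.
var ConnectionPool = sync.Pool{New: func() interface{} { return &Connection{} }}

// GetConn takes a Connection from the pool, allocating one if the pool is empty.
func GetConn() *Connection { return ConnectionPool.Get().(*Connection) }

// PutConn resets the Connection and returns it to the pool.
func PutConn(c *Connection) { c.Reset(); ConnectionPool.Put(c) }

func main() {
	c := GetConn()
	c.Id, c.Uid = "sock-1", "user-1"
	PutConn(c) // c must not be used after this point

	c2 := GetConn()
	fmt.Printf("%q %q\n", c2.Id, c2.Uid) // prints "" "": recycled or new, the state is clean
}
```

Resetting on Put rather than on Get means a stale reference held by mistake sees empty fields instead of another user's data.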

3.10 Data transmission optimization

Message bodies are serialized with MessagePack and compressed to keep payloads around 1 KB. MTU is tuned (e.g., 1400 bytes) to avoid IP fragmentation.

4 Performance testing

4.1 Test preparation

One 16‑core, 32 GB VM as the service host, targeting 480 k concurrent connections.

Eight 4‑core, 8 GB VMs as client generators, each opening 60 k ports.

4.2 Scenario 1 – 500 k online users

| Service | CPU | Memory | Count | CPU usage | Memory usage |
| --- | --- | --- | --- | --- | --- |
| WS‑Gateway | 16 cores | 32 GB | | 22.38% | 70.59% |

Peak connection establishment: 16 k connections/s; each user consumes ~47 KB memory.
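The per-user figure is consistent with dividing the memory in use by the connection count. The sketch below assumes decimal gigabytes, whole-machine memory usage, and the 480 k connection target from the test preparation; the function name is hypothetical.

```go
package main

import "fmt"

// PerConnBytes estimates memory per connection from total memory in bytes,
// the fraction of it in use, and the number of open connections.
func PerConnBytes(totalBytes, usedFraction float64, conns int) float64 {
	return totalBytes * usedFraction / float64(conns)
}

func main() {
	// 32 GB at 70.59% usage spread over 480 000 connections.
	b := PerConnBytes(32e9, 0.7059, 480000)
	fmt.Printf("about %.1f KB per connection\n", b/1000)
}
```

Because the numerator is machine-wide usage, the result is an upper bound on what the gateway process itself spends per connection.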

4.3 Scenario 2 – 500 k users, push every 5 s with acknowledgments

After 5 minutes the service restarted due to memory exhaustion (≈9.32% consumed by broadcast code, 10.38% by acknowledgment handling).

4.4 Scenario 3 – 500 k users, push every 5 s without acknowledgments

| Service | CPU | Memory | Count | CPU usage | Memory usage |
| --- | --- | --- | --- | --- | --- |
| WS‑Gateway | 16 cores | 32 GB | | 30% | 93% |

Peak connection establishment: 11 k connections/s; send peak: 100 k messages/s; flame‑graph shows most CPU spent in the 5‑second broadcast loop.

4.5 Scenario 4 – 500 k users, push every 5 s with acknowledgments and 40 k up/down events per second

| Service | CPU | Memory | Count | CPU usage | Memory usage |
| --- | --- | --- | --- | --- | --- |
| WS‑Gateway | 16 cores | 32 GB | | 46.96% | 65.6% |

Peak connection establishment: 18.5 k connections/s; receive peak: 329 k messages/s; send peak: 393 k messages/s; no abnormal behavior observed.

4.6 Test summary

On a 16‑core, 32 GB machine, a single instance sustained about 500 k concurrent WebSocket connections. In Scenarios 1, 3, and 4, CPU and memory usage stayed within expectations and the service remained stable over long‑duration runs; Scenario 2 was the exception, ending in a restart after memory was exhausted. Overall, the results indicate that the redesign meets current scalability requirements.

5 Conclusion

The rapid growth of user volume makes gateway refactoring essential. Version 2.0 achieves:

Decoupling of gateway and business services, removing Nginx dependency.

Degradable handshake, SnowFlake‑based Socket ID, optimized heartbeat, custom Kafka headers for low‑overhead routing, streamlined message handling, connection‑object pooling, MessagePack compression, and full integration with monitoring infrastructure.

Migration of all business calls to gRPC, providing traceable, controllable entry points for future feature expansion.

6 Technical references

Microservice framework: https://github.com/gotomicro/ego

Kafka, Redis, MySQL client monitoring SDK: https://github.com/gotomicro/ego-component
