How Bilibili Scaled Live Chat with GOIM: Architecture and Performance Optimizations
This article explains how Bilibili built the high‑stability, high‑availability, low‑latency GOIM live‑chat system, detailing its component modules, memory and module optimizations, network redesign, testing results, and ongoing monitoring to handle millions of concurrent users.
Background
Live-streaming chat (danmaku, 弹幕) requires three guarantees: high stability (steady connections), high availability (automatic fail-over when a node crashes), and low latency (message delay < 1 s). GOIM, Bilibili's (B-Station's) live-chat service, was built to meet these requirements.
GOIM Architecture
The system consists of the following components (see the architecture diagram):
Client : establishes a long-lived connection to a Comet server.
Comet : maintains the long‑lived TCP/WebSocket connection, handles low‑level protocol details and keeps the link alive.
Logic : performs per‑message processing such as authentication, IP filtering, and blacklist checks.
Router : stores session information and maps each user to a specific machine for routing.
Kafka (third‑party): distributed publish/subscribe queue; each message is tagged with a topic for scalable distribution.
Job : runs on multiple machines, consumes the messages that Logic publishes to Kafka, and forwards them to all Comet instances.
GOIM evolved from an earlier project called Gopush, adding optimizations specific to massive live‑chat workloads.
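The component flow above can be sketched in a few lines of Go. This is a minimal in-process model, not GOIM's actual code: the `Message`, `Logic`, `Comet`, and `RunPipeline` names are hypothetical, a buffered channel stands in for the Kafka topic, and each Comet only records what it would deliver instead of writing to sockets.

```go
package main

import "fmt"

// Message is a hypothetical chat message tagged with its room.
type Message struct {
	Room, User, Body string
}

// Logic holds the per-message checks; only a blacklist is modelled here.
type Logic struct {
	Blacklist map[string]bool
}

// Accept reports whether the sender is allowed to post.
func (l *Logic) Accept(m Message) bool { return !l.Blacklist[m.User] }

// Comet stands in for one connection server. A real Comet would write to
// client sockets; this one only records what it was asked to deliver.
type Comet struct {
	Name      string
	Delivered []Message
}

// Push records one message for delivery.
func (c *Comet) Push(m Message) { c.Delivered = append(c.Delivered, m) }

// RunPipeline filters msgs through Logic, queues the survivors on a
// buffered channel (a stand-in for a Kafka topic), then plays the Job
// role by forwarding every queued message to every Comet.
func RunPipeline(l *Logic, comets []*Comet, msgs []Message) {
	queue := make(chan Message, len(msgs))
	for _, m := range msgs {
		if l.Accept(m) {
			queue <- m
		}
	}
	close(queue)
	for m := range queue {
		for _, c := range comets {
			c.Push(m)
		}
	}
}

func main() {
	logic := &Logic{Blacklist: map[string]bool{"spammer": true}}
	comets := []*Comet{{Name: "comet-1"}, {Name: "comet-2"}}
	RunPipeline(logic, comets, []Message{
		{Room: "room-1", User: "alice", Body: "hello"},
		{Room: "room-1", User: "spammer", Body: "buy now"},
	})
	for _, c := range comets {
		fmt.Println(c.Name, "delivered", len(c.Delivered), "message(s)")
	}
}
```

Putting the queue between Logic and Job is what lets the two sides scale independently: Logic only has to publish, and any number of Job processes can consume and fan out.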
Optimization Paths
Memory Optimizations
Single memory block per message : messages are aggregated in a Job object; Comet holds only a pointer to the aggregated block, eliminating duplicate allocations.
Per‑user memory on the stack : each user’s temporary data is allocated inside its dedicated Goroutine stack, avoiding heap fragmentation.
Self‑managed memory pools : critical paths in the Comet module replace ad‑hoc new / malloc calls with pre‑allocated pools, reducing GC pressure.
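To illustrate the self-managed pool idea, here is a minimal sketch of a fixed-size buffer pool in Go. It is not the pool GOIM uses: `BufferPool`, `NewBufferPool`, `Get`, and `Put` are hypothetical names, and a buffered channel serves as the free list for simplicity.

```go
package main

import "fmt"

// BufferPool hands out pre-allocated byte slices of one fixed size.
type BufferPool struct {
	free chan []byte
	size int
}

// NewBufferPool allocates count buffers of size bytes up front.
func NewBufferPool(count, size int) *BufferPool {
	p := &BufferPool{free: make(chan []byte, count), size: size}
	for i := 0; i < count; i++ {
		p.free <- make([]byte, size)
	}
	return p
}

// Get returns a pooled buffer, or allocates a fresh one if the pool is
// empty so that callers never block.
func (p *BufferPool) Get() []byte {
	select {
	case b := <-p.free:
		return b
	default:
		return make([]byte, p.size)
	}
}

// Put hands a buffer back for reuse. Buffers of the wrong capacity, or
// buffers that arrive when the pool is already full, are left to the GC.
func (p *BufferPool) Put(b []byte) {
	if cap(b) != p.size {
		return
	}
	select {
	case p.free <- b[:p.size]:
	default:
	}
}

func main() {
	pool := NewBufferPool(1, 4096)
	first := pool.Get()
	pool.Put(first)
	second := pool.Get()
	// Both slices share the same backing array, so no new allocation
	// happened on the second Get.
	fmt.Println("reused:", &first[0] == &second[0])
}
```

Because the buffers live for the whole process and are recycled, the garbage collector sees far fewer short-lived objects on the hot path, which is the effect the list above describes.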
Module Optimizations
Parallel, non‑interfering message distribution : each Comet channel operates independently, preventing contention between streams.
Controlled concurrency : a fixed pool of worker Goroutines is created ahead of time; asynchronous tasks are dispatched to this pool to avoid sudden spikes.
Sharded global locks : locks for socket pools and online‑user tables are partitioned by CPU core count, reducing lock contention.
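The last two points can be sketched together. The following is an illustrative example under stated assumptions, not GOIM's implementation: `OnlineTable` and `RunWorkers` are hypothetical names, the bucket count defaults to `runtime.NumCPU()`, and an FNV hash of the user ID picks the bucket.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"runtime"
	"sync"
)

// bucket is one shard: its own lock and its own slice of the user table.
type bucket struct {
	mu    sync.RWMutex
	rooms map[string]string // user ID -> room ID
}

// OnlineTable spreads users over several buckets so that writers touching
// different buckets never wait on the same lock.
type OnlineTable struct {
	buckets []*bucket
}

// NewOnlineTable creates n buckets; n <= 0 falls back to the CPU count.
func NewOnlineTable(n int) *OnlineTable {
	if n <= 0 {
		n = runtime.NumCPU()
	}
	t := &OnlineTable{buckets: make([]*bucket, n)}
	for i := range t.buckets {
		t.buckets[i] = &bucket{rooms: make(map[string]string)}
	}
	return t
}

// bucketFor hashes the user ID to choose a shard.
func (t *OnlineTable) bucketFor(userID string) *bucket {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return t.buckets[h.Sum32()%uint32(len(t.buckets))]
}

// Set records which room a user is in, locking only that user's bucket.
func (t *OnlineTable) Set(userID, roomID string) {
	b := t.bucketFor(userID)
	b.mu.Lock()
	b.rooms[userID] = roomID
	b.mu.Unlock()
}

// Get looks up a user's room under a read lock on one bucket.
func (t *OnlineTable) Get(userID string) (string, bool) {
	b := t.bucketFor(userID)
	b.mu.RLock()
	defer b.mu.RUnlock()
	room, ok := b.rooms[userID]
	return room, ok
}

// Len sums the bucket sizes, locking one bucket at a time.
func (t *OnlineTable) Len() int {
	total := 0
	for _, b := range t.buckets {
		b.mu.RLock()
		total += len(b.rooms)
		b.mu.RUnlock()
	}
	return total
}

// RunWorkers starts a fixed number of goroutines up front and feeds the
// tasks to them, so a burst of work never spawns a burst of goroutines.
func RunWorkers(workers int, tasks []func()) {
	if workers < 1 {
		workers = 1
	}
	queue := make(chan func())
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for task := range queue {
				task()
			}
		}()
	}
	for _, task := range tasks {
		queue <- task
	}
	close(queue)
	wg.Wait()
}

func main() {
	table := NewOnlineTable(0)
	tasks := make([]func(), 0, 1000)
	for i := 0; i < 1000; i++ {
		user := fmt.Sprintf("user-%d", i)
		tasks = append(tasks, func() { table.Set(user, "room-1") })
	}
	RunWorkers(4, tasks)
	fmt.Println("online users:", table.Len())
}
```

With a single global lock, every `Set` would serialize; with one lock per bucket, only users that hash to the same bucket contend with each other.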
Network Optimizations
Initially all services ran in a single IDC, which caused bandwidth bottlenecks and a single point of failure. The architecture was redesigned as a multi-IDC topology (see the topology diagram):
Deploy entry points in several IDC locations; the Svrlist module routes users to the nearest stable node.
Continuously monitor drop‑rate per IDC and dynamically adjust routing based on real‑time statistics.
Automatically disable failed servers to maintain 100 % message delivery.
Apply traffic shaping to respect ISP bandwidth caps.
Cross‑IDC traffic traverses public networks, so redundant telecom lines and backup paths between IDC‑1 and IDC‑2 were added to improve stability and reduce packet loss.
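The routing rules above can be sketched as a small selection function. The source does not describe how Svrlist is implemented, so everything here is an assumption made for illustration: the `Node` fields, the `PickNode` name, and the policy of preferring the user's own IDC when its measured drop rate is under a threshold.

```go
package main

import "fmt"

// Node is a hypothetical entry-point record as a server list might see it.
type Node struct {
	Addr     string
	IDC      string
	DropRate float64 // fraction of messages lost, taken from monitoring
	Healthy  bool    // false once the node has been disabled
}

// PickNode prefers a healthy node in the user's own IDC whose drop rate
// is at most maxDrop. If none qualifies, it falls back to the healthy
// node with the lowest drop rate anywhere. The bool result is false when
// no healthy node exists.
func PickNode(nodes []Node, userIDC string, maxDrop float64) (Node, bool) {
	var local, fallback *Node
	for i := range nodes {
		n := &nodes[i]
		if !n.Healthy {
			continue // disabled servers never receive traffic
		}
		if fallback == nil || n.DropRate < fallback.DropRate {
			fallback = n
		}
		if n.IDC == userIDC && n.DropRate <= maxDrop &&
			(local == nil || n.DropRate < local.DropRate) {
			local = n
		}
	}
	if local != nil {
		return *local, true
	}
	if fallback != nil {
		return *fallback, true
	}
	return Node{}, false
}

func main() {
	nodes := []Node{
		{Addr: "10.0.1.1:8080", IDC: "idc-1", DropRate: 0.20, Healthy: true},
		{Addr: "10.0.1.2:8080", IDC: "idc-1", DropRate: 0.00, Healthy: false},
		{Addr: "10.0.2.1:8080", IDC: "idc-2", DropRate: 0.01, Healthy: true},
	}
	if n, ok := PickNode(nodes, "idc-1", 0.05); ok {
		fmt.Println("route user to", n.Addr, "in", n.IDC)
	}
}
```

In this run the user's local IDC has one lossy node and one disabled node, so the function routes to the other IDC; feeding fresh drop-rate statistics into `nodes` is what makes the routing dynamic.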
Testing and Results
2015 stress test (see the test results figure): two physical machines each handled ~250 k concurrent users, with 20 to 50 messages/s pushed per stream. Peak throughput reached 50 messages/s per stream, or 24.4 M messages/s overall. CPU was saturated while memory use stayed around 4 GB and network traffic around 3 GB, which points to CPU as the bottleneck.
2016 optimization (see the test results figure): heap allocations were moved to stack allocation and pre-allocated pools, and the system was consolidated onto a single machine supporting 1 M concurrent users. Throughput increased substantially, and the limiting factor shifted from CPU to network traffic volume.
Monitoring and Fault Detection
Simulated clients generate traffic to measure message arrival rates.
Real-time CPU profiling (Go's pprof) captures snapshots for performance analysis.
Whitelist specific users to collect server‑side logs for issue tracking.
Server load monitoring with SMS alerts for rapid response.
These mechanisms provide low‑cost, high‑efficiency observability and enable continuous improvement of the service.
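The first mechanism, simulated clients measuring arrival rate, might look like the sketch below. It is an assumption-laden illustration: `Probe` and `Simulate` are hypothetical names, the `deliver` callback stands in for a real send-and-wait round trip through the chat service, and the 99.5 % alert threshold is invented for the example.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Probe counts what simulated clients sent and what actually arrived.
type Probe struct {
	sent     int64
	received int64
}

// MarkSent records one message handed to the service.
func (p *Probe) MarkSent() { atomic.AddInt64(&p.sent, 1) }

// MarkReceived records one message seen by a simulated client.
func (p *Probe) MarkReceived() { atomic.AddInt64(&p.received, 1) }

// ArrivalRate returns received/sent, or 1 when nothing has been sent yet.
func (p *Probe) ArrivalRate() float64 {
	sent := atomic.LoadInt64(&p.sent)
	if sent == 0 {
		return 1
	}
	return float64(atomic.LoadInt64(&p.received)) / float64(sent)
}

// Simulate sends n numbered messages. deliver reports whether a given
// message reached the simulated client.
func Simulate(p *Probe, n int, deliver func(seq int) bool) {
	for seq := 1; seq <= n; seq++ {
		p.MarkSent()
		if deliver(seq) {
			p.MarkReceived()
		}
	}
}

func main() {
	p := &Probe{}
	// Hypothetical lossy link: every 100th message is dropped.
	Simulate(p, 1000, func(seq int) bool { return seq%100 != 0 })
	rate := p.ArrivalRate()
	fmt.Printf("arrival rate: %.2f%%\n", rate*100)
	if rate < 0.995 {
		fmt.Println("ALERT: arrival rate below threshold")
	}
}
```

The counters use atomic operations so that many simulated clients can share one `Probe` from separate goroutines; the alert branch is where an SMS hook would plug in.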
Conclusion
The GOIM system demonstrates a comprehensive approach to building a highly stable, highly available, and low-latency live-chat platform. By refining memory management, controlling module concurrency, sharding locks, and redesigning the network topology across multiple IDC sites, Bilibili raised per-machine capacity from roughly 250 k to 1 M concurrent users while maintaining 100 % message delivery.
dbaplus Community
