Design and Optimization of Bilibili Live Chat (GOIM) System
This article explains how Bilibili's GOIM live chat system was architected and continuously optimized to achieve high stability, high availability, and sub‑second latency through modular backend components, memory and module optimizations, and multi‑IDC network improvements.
With the rapid growth of live streaming, live chat (弹幕, danmaku) has become hugely popular. A live chat system must guarantee high stability, high availability, and low latency; in this article, Bilibili architect Liu Ding presents best practices from the architecture of the GOIM live chat service.
High‑concurrency real‑time chat hinges on three requirements: stable long‑lived connections to ensure real‑time interaction, availability through fallback machines so that a single failure does not interrupt connections, and end‑to‑end latency kept under one second to meet interactive demands.
GOIM was introduced to satisfy these requirements. Its main modules are Client (establishes connections to Comet), Comet (maintains long‑lived connections and handles the wire protocol), Logic (processes messages, performing authentication and filtering), Router (stores session information), Kafka (distributed publish/subscribe message queue), and Job (consumes the queue and distributes messages across Comet machines).
Memory optimization was achieved by ensuring each message occupies a single memory block, placing per‑user memory on the stack within Goroutines, and using memory pools in Comet to control allocations.
Module optimization focused on three points: making message distribution fully parallel and independent, controlling the number of Goroutines by pre‑allocating a fixed pool instead of spawning one per task, and sharding global locks according to the number of CPU cores to reduce contention.
Performance testing in 2015 showed that two physical machines handling 250,000 concurrent users each hit a CPU bottleneck; after all per‑user memory was moved to the stack, a single machine could support one million users, at which point network bandwidth became the limiting factor.
Network optimization introduced multi‑IDC deployment, optimal node selection via the Svrlist module, IDC service‑quality monitoring (drop‑rate), automatic failover of disconnected servers, striving for 100 % message delivery, and traffic control to manage bandwidth costs.
The revised architecture placed Comet instances in multiple IDCs, pooled links across network operators, and switched the data flow from push to pull through a central hub, complemented by fault monitoring (client simulation, CPU profiling, whitelist logging, and load alerts) to ensure high availability.
Overall, Bilibili continues to pursue low‑cost, high‑efficiency improvements to the GOIM system to deliver the best possible live chat experience.