Building a Million‑User Live‑Stream Danmaku System with Short‑Polling
To support Southeast Asian live streaming with up to a million concurrent users per room, we replaced Tencent Cloud’s inadequate danmaku service with a custom solution that optimizes bandwidth via HTTP compression, response simplification, ordering, frequency control, and a short‑polling delivery mechanism, while ensuring reliability through service splitting, caching, and lock‑free ring buffers.
Background
To better support Southeast Asian live streaming, the product added bullet‑screen (danmaku) functionality. The first version used Tencent Cloud, but it suffered from stutter and displayed too few danmaku, which prompted us to build a custom danmaku system capable of supporting a million concurrent users per room.
Problem Analysis
Based on the background, the system faces the following issues:
Bandwidth pressure
If a delivery occurs every 3 seconds, each delivery needs at least 15 messages to avoid visual stutter. Fifteen messages plus HTTP headers exceed 3 KB, so a million users each pulling 3 KB every 3 seconds generate roughly 8 Gbps of egress, while the available bandwidth is only 10 Gbps.
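The estimate above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: the function name is hypothetical, and 3 KB is taken as 3,000 bytes for simplicity.

```python
USERS = 1_000_000       # concurrent viewers in one room (from the text)
PAYLOAD_BYTES = 3_000   # about 3 KB per response: 15 messages plus HTTP headers
INTERVAL_S = 3          # one delivery every 3 seconds


def egress_gbps(users, payload_bytes, interval_s):
    """Total server egress in gigabits per second."""
    bytes_per_second = users * payload_bytes / interval_s
    return bytes_per_second * 8 / 1e9


print(egress_gbps(USERS, PAYLOAD_BYTES, INTERVAL_S))  # 8.0
```

Halving the payload or doubling the interval halves the result, which is why the measures that follow target both payload size and request frequency.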
Stutter and loss caused by weak networks
This issue is already observed in the production environment.
Performance and reliability
With a million concurrent users each polling every 3 seconds, QPS can exceed 300 k (1,000,000 / 3 ≈ 333 k). Ensuring stability during peak events like Double‑Eleven is critical.
Bandwidth Optimization
To reduce bandwidth pressure, we adopted the following measures:
Enable HTTP compression
Gzip compression can achieve over 40% reduction (gzip outperforms deflate by 4‑5%).
Simplify response structure
Optimize content ordering
Higher repetition yields better compression, so placing strings and numbers together improves gzip efficiency.
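A minimal sketch of how one might measure these effects follows. The sample messages and field names (uid, nick, text, ts) are made up for illustration and are not the real response schema; actual ratios depend on real traffic, so the script prints sizes rather than claiming specific numbers.

```python
import gzip
import json


def gzip_size(payload: bytes) -> int:
    """Size of the payload after gzip compression, in bytes."""
    return len(gzip.compress(payload))


# Hypothetical sample: 15 danmaku messages, as in one 3-second delivery.
messages = [
    {"uid": 1000 + i, "nick": f"user{i}", "text": "hello from the stream",
     "ts": 1700000000 + i}
    for i in range(15)
]

# Row-oriented body: every message repeats all field names.
row_body = json.dumps(messages, separators=(",", ":")).encode()

# Column-oriented body: field names appear once and like values
# (all numbers together, all strings together) sit next to each other.
columns = {key: [m[key] for m in messages] for key in messages[0]}
col_body = json.dumps(columns, separators=(",", ":")).encode()

for name, body in (("row", row_body), ("column", col_body)):
    print(name, "raw:", len(body), "gzip:", gzip_size(body))
```

Running this against a capture of real responses, rather than synthetic data, is the reliable way to decide whether a given restructuring is worth the client-side change.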
Frequency control
Bandwidth control: add request interval parameters to make client request rates server‑controllable, providing degraded service during traffic spikes.
Sparse control: during periods of low activity, adjust next request time to avoid unnecessary client requests.
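The two controls can be sketched as a single function that computes the interval returned to the client with each response. Everything here is an assumption made for illustration: the function name, the `load_factor` signal, the doubling rule for quiet rooms, and the 30-second cap are not taken from the original system.

```python
def next_poll_interval(base_s, load_factor, recent_msg_count, max_s=30.0):
    """Seconds the client should wait before its next request.

    base_s           -- normal polling interval (3 s in the article)
    load_factor      -- hypothetical ratio of current load to capacity
    recent_msg_count -- danmaku produced in the room during the last interval
    """
    interval = base_s
    if load_factor > 1.0:
        # Bandwidth control: stretch the interval in proportion to overload.
        interval *= load_factor
    if recent_msg_count == 0:
        # Sparse control: nothing to deliver, so let the client back off.
        interval *= 2
    return min(interval, max_s)


print(next_poll_interval(3, 0.5, 10))  # 3
print(next_poll_interval(3, 2.0, 10))  # 6.0
print(next_poll_interval(3, 0.5, 0))   # 6
```

Because the client simply obeys the returned value, the server can degrade or recover service without shipping a new client build.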
Danmaku Stutter and Loss Analysis
When developing the danmaku system, the key question is the delivery mechanism: push vs pull?
Long Polling via AJAX
The client opens an AJAX request to the server and waits for a response. The server must support request suspension and return data as soon as an event occurs. Enabling HTTP Keep‑Alive can also reduce handshake time.
Advantages: fewer poll cycles, low latency, good browser compatibility. Disadvantages: the server must maintain many connections.
WebSockets
Long polling reduces invalid requests but still requires many connections. WebSocket offers true bidirectional communication, lower header overhead, stronger real‑time performance, and supports binary frames and compression.
Advantages: minimal control overhead (2‑10 bytes header for server‑to‑client, plus 4 bytes mask for client‑to‑server), stronger real‑time capability, and persistent connections.
Long Polling vs WebSockets
Both rely on TCP long connections. How does TCP detect a broken connection?
TCP Keepalive parameters (Linux defaults): keepalive_probes (default 9), keepalive_time (default 2 h), keepalive_intvl (default 75 s).
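To see why keepalive alone is too slow here, consider the worst case for an idle connection: the kernel waits keepalive_time before the first probe, then sends probes keepalive_intvl apart and gives up after keepalive_probes unanswered ones. The sketch below computes that bound; the function name is hypothetical, the first example plugs in stock Linux values (7200 s, 75 s, 9 probes), and the tuned values in the second example are illustrative only.

```python
def keepalive_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Worst-case time for TCP keepalive to declare an idle peer dead."""
    return keepalive_time + keepalive_intvl * keepalive_probes


# Stock Linux values: a little over two hours.
print(keepalive_detection_seconds(7200, 75, 9))  # 7875

# Aggressively tuned values (illustrative): still tens of seconds.
print(keepalive_detection_seconds(30, 5, 3))     # 45
```

Even the tuned case is far longer than a 3-second delivery cycle, so in practice a dead connection is noticed by application traffic first.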
In weak Southeast Asian networks, TCP connections frequently drop:
Long Polling detects disconnection at min(keepalive_intvl, polling_interval). WebSockets detect at min(keepalive_intvl, client_sending_interval).
Since a disconnection may occur before the next packet, TCP long connections are of limited value, and WebSockets become unsuitable under weak networks.
Even after the server detects a broken WebSocket, it cannot push data until the client reconnects, causing potential data loss.
Each reconnection requires a new application‑level handshake.
According to Tencent Cloud’s danmaku system, push mode is used for < 300 users, and polling for larger audiences, likely implemented with WebSocket. Given our constraints, both Long Polling and WebSockets are unsuitable, so we ultimately adopted a short‑polling approach for danmaku delivery.
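A short-polling exchange can be sketched as a stateless handler: the client sends the timestamp of the last second it received, and the server returns everything newer. The names below (`PollResponse`, `handle_poll`, `cursor`, the `store` dictionary) are hypothetical and only illustrate the idea; they are not the production API.

```python
from dataclasses import dataclass


@dataclass
class PollResponse:
    messages: list        # danmaku newer than the client's cursor
    cursor: int           # newest second covered; the client echoes it back
    next_interval_s: int  # server-controlled delay before the next request


def handle_poll(store, cursor, now, interval_s=3, window_s=30):
    """Return danmaku from seconds (cursor, now], bounded by window_s.

    store maps a whole-second timestamp to the danmaku sent in that second.
    """
    start = max(cursor + 1, now - window_s + 1)
    fresh = []
    for ts in range(start, now + 1):
        fresh.extend(store.get(ts, []))
    return PollResponse(fresh, now, interval_s)


store = {100: ["a"], 101: ["b", "c"], 103: ["d"]}
first = handle_poll(store, cursor=99, now=101)
print(first.messages)   # ['a', 'b', 'c']
# Even if a request in between is lost, the next one catches up.
second = handle_poll(store, cursor=first.cursor, now=104)
print(second.messages)  # ['d']
```

Since each request carries its own cursor, a dropped connection costs at most one polling interval and requires no reconnection handshake or server-side session.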
Reliability and Performance
To ensure service stability, we split the system: complex logic resides in the sending side, while the high‑frequency pull side is separated. This prevents the pull service from overwhelming the send service and vice versa, facilitating scale‑up and scale‑out, and clarifying business boundaries.
On the pull side, we introduced a local cache. The pull service periodically calls the danmaku service over RPC to refresh the cache, so subsequent requests are served directly from memory. This drastically reduces latency, moves data closer to users, and lets the pull side keep serving when an external service fails.
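A minimal sketch of such a cache is shown below, assuming a hypothetical `fetch` callable that stands in for the RPC to the danmaku service. The class name, the one-second refresh interval, and the choice to keep the stale snapshot on failure are assumptions for illustration.

```python
import threading


class LocalDanmakuCache:
    """In-memory snapshot refreshed in the background by one thread."""

    def __init__(self, fetch, refresh_interval_s=1.0):
        self._fetch = fetch                # stand-in for the RPC call
        self._interval = refresh_interval_s
        self._snapshot = []
        self._stop = threading.Event()

    def refresh_once(self):
        try:
            # Replace the whole snapshot with a single reference assignment.
            self._snapshot = self._fetch()
        except Exception:
            # Upstream failed: keep serving the previous snapshot.
            pass

    def get(self):
        """Serve a request from memory without touching the upstream."""
        return self._snapshot

    def start(self):
        def loop():
            # Event.wait returns False on timeout, True once stop() is called.
            while not self._stop.wait(self._interval):
                self.refresh_once()

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

With this shape, the number of upstream calls depends on the number of pull instances and the refresh interval, not on the number of viewers.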
Data is sharded by time in a RingBuffer that keeps only a tail pointer. The tail advances one slot per second, each slot stores a timestamp with its list of danmaku, and the buffer retains up to 60 seconds of data. A read takes a copy of the tail pointer and traverses the buffer backward from it, which yields ordered retrieval at low cost.
Write operations are single‑threaded, eliminating concurrency concerns. Reads are lock‑free because the read direction (counter‑clockwise) never overlaps with the write direction (clockwise) within the 30‑second read window.
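The ring buffer described above might look like the following sketch. The class and method names are hypothetical, and the 60-slot size and 30-slot read window are taken from the text. In CPython the final assignment to `_tail` is enough to publish a filled slot; a Java version would need something like a volatile tail field to get the same visibility guarantee.

```python
class DanmakuRing:
    """Time-sharded ring: one slot per second, single writer, many readers."""

    def __init__(self, slots=60, read_window=30):
        if read_window >= slots:
            raise ValueError("read window must be smaller than the ring")
        self._slots = [None] * slots   # each slot: (timestamp, messages)
        self._size = slots
        self._window = read_window
        self._tail = -1                # index of the newest filled slot

    def write(self, ts, messages):
        """Called by the single writer thread once per second."""
        idx = (self._tail + 1) % self._size
        self._slots[idx] = (ts, tuple(messages))
        self._tail = idx               # publish only after the slot is filled

    def read_since(self, since_ts):
        """Return danmaku newer than since_ts in chronological order."""
        tail = self._tail              # work from a copy of the tail pointer
        if tail < 0:
            return []
        batches = []
        for step in range(self._window):   # walk backward from the tail
            slot = self._slots[(tail - step) % self._size]
            if slot is None or slot[0] <= since_ts:
                break
            batches.append(slot[1])
        result = []
        for batch in reversed(batches):    # restore oldest-first order
            result.extend(batch)
        return result
```

Because a reader looks at no more than 30 slots behind its tail copy while the writer only touches the slot ahead of the tail, the writer would have to advance about 30 more slots before it could overwrite anything a reader is still traversing.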
On the send side, we limit the total visible danmaku per user, discarding excess messages. Optional branches such as avatar fetching and sensitive‑word filtering are designed to fail gracefully, providing degraded but functional service.
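Graceful failure of optional branches can be sketched with a small fallback wrapper. All names here are hypothetical, and the fallback policy is an assumption: in particular, passing the text through when the filter is unavailable is only one possible choice, and a deployment could just as well drop the message instead.

```python
DEFAULT_AVATAR = "default.png"  # hypothetical placeholder avatar


def with_fallback(step, fallback):
    """Run an optional step; on any failure return the fallback value."""
    try:
        return step()
    except Exception:
        return fallback


def build_danmaku(user_id, text, fetch_avatar, filter_text):
    """Assemble one danmaku, degrading instead of failing the send."""
    avatar = with_fallback(lambda: fetch_avatar(user_id), DEFAULT_AVATAR)
    # Assumption: if the filter is down, the original text is passed through.
    clean = with_fallback(lambda: filter_text(text), text)
    return {"uid": user_id, "avatar": avatar, "text": clean}


def cap_batch(messages, limit):
    """Keep at most `limit` danmaku per delivery and discard the rest."""
    return messages[:limit]
```

For example, if `fetch_avatar` raises because its backing service is down, the send still succeeds and the message carries the placeholder avatar.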
Conclusion
During the Double‑12 event, even when Redis experienced a brief outage, the service efficiently and stably supported 700 k concurrent users, successfully meeting the target.
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
