Building a Million-User Live-Stream Danmaku System: Bandwidth, Latency, and Reliability Solutions
To support Southeast Asian live-streaming, we designed a custom danmaku system capable of handling up to a million concurrent users per room, tackling bandwidth pressure, weak-network latency, and reliability by employing HTTP compression, response simplification, short-polling, local caching, and lock-free ring buffers.
Background
To better support Southeast Asian live streaming, the product added a danmaku feature. The first version ran on Tencent Cloud, but it suffered from stutter and low danmaku density. That prompted us to build a custom danmaku system capable of supporting up to one million concurrent users per room.
Problem Analysis
Bandwidth pressure: each client pulls 15 danmaku messages every 3 seconds, and with HTTP headers each response exceeds 3 KB. Across a million users this comes to roughly 8 Gbps of traffic, while only 10 Gbps of bandwidth is available.
Weak networks: poor connectivity causes danmaku to stutter or be lost.
Performance and reliability: with a million users online and each polling once every 3 seconds, QPS can exceed 300k. The system must hold up under peak events such as Double Eleven.
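The bandwidth and QPS figures above follow from simple arithmetic. The minimal Python check below reproduces them. It assumes that 1 KB = 1,000 bytes and that every user fetches exactly one response per 3-second poll. The function names are ours, for illustration only.

```python
def pull_qps(users: int, poll_interval_s: float) -> float:
    """Requests per second if every user polls once per interval."""
    return users / poll_interval_s


def egress_gbps(users: int, bytes_per_response: float, poll_interval_s: float) -> float:
    """Aggregate downstream bandwidth in Gbps (assumes 1 KB = 1,000 bytes)."""
    bytes_per_second = users * bytes_per_response / poll_interval_s
    return bytes_per_second * 8 / 1e9


if __name__ == "__main__":
    # Figures from the problem analysis: 1M users, one ~3 KB response every 3 s.
    print(f"QPS:    {pull_qps(1_000_000, 3):,.0f}")
    print(f"Egress: {egress_gbps(1_000_000, 3_000, 3):.1f} Gbps")
```

If each response is made 40% smaller (1,800 bytes), the same formula gives about 4.8 Gbps. This is why shrinking the payload is the first lever to pull.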
Bandwidth Optimization
We reduced bandwidth consumption with the following measures:
Enable HTTP compression: gzip compresses responses by more than 40%, about 4-5% better than deflate.
Simplify the response structure.
Optimize content ordering: placing similar strings and numbers next to each other improves gzip compression.
Frequency control: add a request-interval parameter to limit how often clients poll, and poll more sparsely during low-traffic periods.
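The measures above can be illustrated with Python's standard `gzip` and `json` modules. The sketch below compares two layouts. The first is a verbose, row-oriented response. The second is simplified: short keys, no whitespace, and a column-oriented layout that keeps similar numbers and strings adjacent. The simplified response also carries a server-suggested poll interval. The record fields, the key names (`i`, `u`, `t`, `n`, `c`) and the `next_poll_interval` thresholds are all hypothetical. They are not the production schema.

```python
import gzip
import json

# Hypothetical danmaku records; the field names are illustrative only.
MESSAGES = [
    {"user_id": 100_000 + i, "nickname": f"viewer{i}",
     "content": "Great stream!", "timestamp": 1_700_000_000 + i}
    for i in range(15)  # 15 messages per 3-second pull, as in the problem analysis
]


def verbose_payload(msgs):
    """Row-oriented JSON with long keys, an envelope, and pretty-printing."""
    body = {"code": 0, "message": "success", "data": {"danmaku_list": msgs}}
    return json.dumps(body, indent=2).encode()


def next_poll_interval(msgs_per_sec):
    """Hypothetical frequency control: ask clients to poll less often when the room is quiet."""
    return 3 if msgs_per_sec >= 1 else 10


def compact_payload(msgs, interval_s):
    """Short keys, no whitespace, and a column-oriented layout that keeps
    similar values (ids, timestamps, strings) adjacent for gzip."""
    body = {
        "i": interval_s,  # server-suggested poll interval in seconds
        "u": [m["user_id"] for m in msgs],
        "t": [m["timestamp"] for m in msgs],
        "n": [m["nickname"] for m in msgs],
        "c": [m["content"] for m in msgs],
    }
    return json.dumps(body, separators=(",", ":")).encode()


if __name__ == "__main__":
    verbose = verbose_payload(MESSAGES)
    compact = compact_payload(MESSAGES, next_poll_interval(5))
    for name, raw in (("verbose", verbose), ("compact", compact)):
        print(f"{name}: {len(raw)} B raw, {len(gzip.compress(raw))} B gzipped")
```

Exact sizes depend on real data. The point the sketch makes is that repetition which sits close together (identical strings, numbers sharing long prefixes) is exactly what gzip's back-references exploit.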
Danmaku Stutter and Loss Analysis
The key design decision was choosing a delivery mechanism: push vs pull.
Long Polling via AJAX
The client opens an AJAX request, and the server holds it open until an event occurs or a timeout expires. Enabling HTTP keep-alive avoids a new handshake for each request. Advantages: fewer polling requests, low latency, and good browser compatibility. Drawback: the server must maintain a large number of open connections.
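To make the mechanism concrete, the in-process `asyncio` sketch below holds a "request" until new danmaku arrive or a hold timeout expires. A real deployment would sit behind an HTTP server. `LongPollHub`, `poll` and `publish` are hypothetical names, and sequence numbers stand in for whatever cursor the client sends.

```python
import asyncio


class LongPollHub:
    """In-process sketch of long polling: a poll request is parked until new
    danmaku arrive or the hold timeout expires. Names are hypothetical."""

    def __init__(self):
        self._messages = []               # message i has sequence number i + 1
        self._cond = asyncio.Condition()

    async def publish(self, text):
        async with self._cond:
            self._messages.append(text)
            self._cond.notify_all()       # wake every parked request

    async def poll(self, after_seq, hold_s=30.0):
        """Return (seq, text) pairs with seq > after_seq, waiting up to hold_s."""
        async with self._cond:
            try:
                await asyncio.wait_for(
                    self._cond.wait_for(lambda: len(self._messages) > after_seq),
                    timeout=hold_s,
                )
            except asyncio.TimeoutError:
                pass                      # nothing new: return an empty batch, client re-polls
            return [(after_seq + 1 + i, t) for i, t in enumerate(self._messages[after_seq:])]


async def demo():
    hub = LongPollHub()
    waiter = asyncio.create_task(hub.poll(after_seq=0, hold_s=5))
    await asyncio.sleep(0.1)              # the request is now parked on the server
    await hub.publish("hello")            # ...and is answered as soon as this arrives
    first = await waiter
    empty = await hub.poll(after_seq=1, hold_s=0.1)
    return first, empty


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Every parked `poll` represents one open connection on a real server. That is the cost named in the drawback above.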
WebSockets
WebSocket provides true bidirectional communication with minimal framing overhead: 2-10 bytes of header per server-to-client frame, plus a 4-byte masking key on client-to-server frames. It also offers better real-time performance and supports binary frames and compression.
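The 2-10 byte and +4 byte figures come from the base framing layout in RFC 6455, section 5.2. The sketch below encodes the header of a single final frame. It is an illustration of the wire format, not a complete WebSocket implementation. It omits the handshake, fragmentation, and applying the mask to the payload.

```python
import os
import struct


def ws_frame_header(payload_len: int, masked: bool = False, opcode: int = 0x2) -> bytes:
    """Header of a single final WebSocket frame (RFC 6455, section 5.2).

    opcode 0x1 = text, 0x2 = binary. Client-to-server frames must be masked,
    which appends a random 4-byte masking key to the header.
    """
    first = 0x80 | opcode                        # FIN=1, no RSV bits, opcode
    mask_bit = 0x80 if masked else 0x00
    if payload_len <= 125:                       # 7-bit length fits in byte 2
        header = struct.pack("!BB", first, mask_bit | payload_len)
    elif payload_len <= 0xFFFF:                  # 126 + 16-bit extended length
        header = struct.pack("!BBH", first, mask_bit | 126, payload_len)
    else:                                        # 127 + 64-bit extended length
        header = struct.pack("!BBQ", first, mask_bit | 127, payload_len)
    if masked:
        header += os.urandom(4)                  # masking key
    return header


if __name__ == "__main__":
    for n in (100, 1_500, 70_000):
        print(n, len(ws_frame_header(n)), len(ws_frame_header(n, masked=True)))
```

For a 100-byte, 1,500-byte and 70,000-byte payload, the server-to-client header is 2, 4 and 10 bytes respectively. Masked client frames add 4 bytes to each. A polled HTTP response, by contrast, repeats its full header block on every request.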
However, on weak networks long-lived TCP connections drop frequently, and neither long polling nor WebSocket can detect a disconnection quickly. TCP keepalive probes (keepalive_probes, keepalive_time, keepalive_intvl) help, but they are not enough when the network is unstable.
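The arithmetic shows why keepalive alone is slow. A silent peer is only declared dead after the idle time plus `probes` unanswered probes, spaced `interval` apart. The sketch below computes that bound and enables per-socket keepalive with Python's `socket` module. `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are the Linux per-socket counterparts of the sysctls named above. The constants are not defined on every platform, so the code guards them. The 10/5/3 values are arbitrary examples, not tuned settings.

```python
import socket


def keepalive_detection_seconds(idle_s: int, interval_s: int, probes: int) -> int:
    """Worst case before the kernel gives up on a silent peer: the connection
    must first be idle for idle_s, then `probes` probes go unanswered,
    interval_s apart."""
    return idle_s + interval_s * probes


def enable_keepalive(sock: socket.socket, idle_s: int = 10, interval_s: int = 5, probes: int = 3) -> None:
    """Turn on TCP keepalive and, where the platform exposes them, override
    the system-wide defaults for this socket only."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux socket options matching tcp_keepalive_time / _intvl / _probes;
    # not every platform defines them, hence the hasattr guards.
    for name, value in (("TCP_KEEPIDLE", idle_s), ("TCP_KEEPINTVL", interval_s), ("TCP_KEEPCNT", probes)):
        if hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)


if __name__ == "__main__":
    print("Linux defaults:", keepalive_detection_seconds(7200, 75, 9), "s")   # 7875 s
    print("Aggressive:    ", keepalive_detection_seconds(10, 5, 3), "s")      # 25 s
```

With Linux defaults (7200 s idle, 75 s interval, 9 probes), a dead connection can go unnoticed for over two hours. Even the aggressive example still needs 25 seconds, far longer than a danmaku viewer will tolerate.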
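Given this environment, neither long polling nor WebSocket was suitable, so we adopted short polling for danmaku delivery. The client issues a fresh, short-lived request at a regular interval. A dropped connection therefore costs at most one polling cycle, and no disconnect detection is needed.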
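A short-polling client loop might look like the sketch below. Each round is an independent request that carries a cursor. A failed round is skipped, and duplicates from overlapping windows are dropped. The `fetch` transport, its `(messages, interval)` return shape and the sequence-number cursor are hypothetical stand-ins for the real HTTP API.

```python
import time
from typing import Callable, List, Tuple

# Hypothetical transport: given the last-seen sequence number, returns
# (messages as (seq, text) pairs, server-suggested poll interval in seconds).
Fetch = Callable[[int], Tuple[List[Tuple[int, str]], float]]


def short_poll(fetch: Fetch, rounds: int, default_interval_s: float = 3.0,
               sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Short-polling client: every round is an independent request carrying a
    cursor, so a failed round is simply skipped and the next tick retries."""
    cursor = 0
    interval = default_interval_s
    shown: List[str] = []
    for _ in range(rounds):
        try:
            messages, interval = fetch(cursor)
            for seq, text in messages:
                if seq > cursor:           # drop duplicates from overlapping windows
                    shown.append(text)
            if messages:
                cursor = max(cursor, max(seq for seq, _ in messages))
        except OSError:
            pass                           # weak network: lose one round, not the session
        sleep(interval)
    return shown
```

Injecting `sleep` lets the loop be exercised without real waiting. The interval returned by the server is where the frequency-control parameter from the bandwidth section would plug in.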
Reliability and Performance
We split the service into two parts: a sending side handling complex logic and a pulling side handling high‑frequency read requests. This prevents the high‑QPS pull service from overwhelming the send service and facilitates independent scaling.
On the pull side we introduced a local cache. The pull service periodically calls the danmaku service over RPC to refresh an in-memory buffer, so subsequent requests are served straight from memory. This drastically reduces latency and limits the impact of external dependencies.
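A minimal Python sketch of such a pull-side cache follows. A background thread refreshes a snapshot, and request handlers only ever read memory. `fetch_latest` is a hypothetical stand-in for the RPC. Keeping the previous snapshot when a refresh fails is a design choice of this sketch, in the spirit of reducing dependency impact. The source does not say how failures are handled.

```python
import threading
from typing import Callable, List


class PullSideCache:
    """Background thread periodically calls a (hypothetical) fetch_latest RPC
    and swaps the result into memory; request handlers read only the snapshot."""

    def __init__(self, fetch_latest: Callable[[], List[str]], refresh_s: float = 1.0):
        self._fetch = fetch_latest
        self._refresh_s = refresh_s
        self._snapshot: List[str] = []
        self._stop = threading.Event()

    def refresh_once(self) -> None:
        try:
            self._snapshot = list(self._fetch())   # replace the whole list, never mutate it
        except Exception:
            pass    # RPC failed: keep serving the previous snapshot

    def _loop(self) -> None:
        # Event.wait returns False on timeout, True once stop() has been called.
        while not self._stop.wait(self._refresh_s):
            self.refresh_once()

    def start(self) -> None:
        self.refresh_once()
        threading.Thread(target=self._loop, daemon=True).start()

    def stop(self) -> None:
        self._stop.set()

    def get(self) -> List[str]:
        return self._snapshot                      # no RPC on the request path
```

Because a refresh replaces the whole list instead of mutating it, a handler that grabbed the old snapshot keeps a consistent view while the next one is installed.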
Data is sharded by time into a ring buffer that retains only the last 60 seconds. Each slot stores a timestamp and its associated list of danmaku, which makes reads fast and ordered. No locks are needed for two reasons. Writes come from a single thread. Reads only touch the most recent 30 seconds, while a slot is recycled only after 60 seconds, so a reader never sees a slot the writer is overwriting.
On the sending side, we apply rate limiting to discard excess danmaku. We also use graceful degradation for optional features (avatar fetching, profanity filtering), so core delivery remains unaffected.
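The time-sharded ring buffer can be sketched as follows. There is one slot per second, 60 slots in total, and the read window is 30 seconds. The single writer publishes a new `(second, messages)` tuple instead of mutating a list a reader might hold. The sketch relies on the fact that in CPython rebinding a list element is atomic. A Java or Go version would need a volatile or atomic reference for the same guarantee. The class and method names are hypothetical.

```python
from typing import List, Optional, Tuple


class DanmakuRing:
    """Time-sharded ring buffer: one slot per second, 60 slots. Assumes a
    single writer; readers look back at most 30 s, so they never touch a
    slot the writer is recycling (slots are reused only after 60 s)."""

    SLOTS = 60        # retain the last 60 seconds
    READ_WINDOW = 30  # readers only look back 30 seconds

    def __init__(self) -> None:
        # Each slot is (second, messages) or None.
        self._slots: List[Optional[Tuple[int, List[str]]]] = [None] * self.SLOTS

    def write(self, second: int, messages: List[str]) -> None:
        """Single writer: publish by swapping in a fresh tuple (copy-on-write)."""
        idx = second % self.SLOTS
        slot = self._slots[idx]
        if slot is not None and slot[0] == second:
            messages = slot[1] + messages     # merge into a new list, old one untouched
        self._slots[idx] = (second, list(messages))

    def read(self, since: int, now: int) -> List[Tuple[int, str]]:
        """(second, text) pairs for seconds in (since, now], limited to the
        newest READ_WINDOW seconds, in chronological order."""
        start = max(since + 1, now - self.READ_WINDOW + 1)
        out: List[Tuple[int, str]] = []
        for sec in range(start, now + 1):
            slot = self._slots[sec % self.SLOTS]
            if slot is not None and slot[0] == sec:   # skip empty or recycled slots
                out.extend((sec, text) for text in slot[1])
        return out
```

Storing the second alongside each slot is what lets a reader tell a current slot from one left over from a minute earlier, without any coordination with the writer.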
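For the sending side, the sketch below pairs a rate limiter with degradable optional steps. The source does not name its limiting algorithm. A token bucket is our choice here. `handle_send`, `fetch_avatar` and `filter_profanity` are hypothetical hooks. The degradation policy (deliver without an avatar, or deliver unfiltered, when the respective service fails) mirrors the description above but is an assumption about the details.

```python
import time
from typing import Callable, Dict, Optional


class TokenBucket:
    """Refills `rate` tokens per second up to `burst`; a request that finds
    the bucket empty is rejected, so the danmaku is discarded."""

    def __init__(self, rate: float, burst: float, clock: Callable[[], float] = time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens = burst
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_send(text: str, user_id: int, limiter: TokenBucket,
                fetch_avatar: Callable[[int], str],
                filter_profanity: Callable[[str], str]) -> Optional[Dict[str, object]]:
    """Hypothetical send path: rate-limit first, then optional steps that degrade."""
    if not limiter.allow():
        return None                      # over the limit: discard this danmaku
    msg: Dict[str, object] = {"uid": user_id, "text": text, "avatar": None}
    try:
        msg["avatar"] = fetch_avatar(user_id)
    except Exception:
        pass                             # degrade: deliver without an avatar
    try:
        msg["text"] = filter_profanity(text)
    except Exception:
        pass                             # degrade: deliver unfiltered
    return msg
```

Injecting the clock keeps the limiter deterministic under test. In production a per-room or per-user bucket would typically sit in front of the send service.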
Summary
During the Double Twelve event, the system served 700k concurrent users efficiently and stably, even through a brief Redis outage, and met its target objectives.
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
