Design and Optimization of a High‑Performance Live‑Streaming Danmaku System
This article details the challenges and solutions for building a scalable, low‑latency danmaku service for live streaming, covering background requirements, bandwidth constraints, protocol choices, short‑polling implementation, reliability measures, and performance results that supported 700,000 concurrent users during a major event.
Background
To better support live streaming in Southeast Asia, a danmaku (bullet-comment) feature was added. The first version, built on Tencent Cloud, suffered frequent stutter and lost comments, which prompted development of an in-house danmaku system capable of handling up to one million concurrent users per room.
Problem Analysis
Bandwidth pressure: each client pulls at least 15 comments every 3 seconds, and with HTTP headers a batch exceeds 3 KB; at one million concurrent users this amounts to roughly 8 Gbps of egress, close to the 10 Gbps bandwidth limit.
Weak networks, common in Southeast Asia, cause danmaku stutter and message loss.
Performance and reliability: expected pull QPS can exceed 300,000, and the system must remain robust during peak events such as Double Eleven.
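The 8 Gbps figure follows directly from the numbers above. A quick back-of-envelope check, assuming one 3 KB batch per client every 3 seconds and one million concurrent clients:

```python
def peak_bandwidth_gbps(users: int, batch_bytes: int, interval_s: float) -> float:
    """Aggregate egress if every user pulls one batch per interval."""
    bits_per_user_per_s = batch_bytes * 8 / interval_s
    return users * bits_per_user_per_s / 1e9

# 1,000,000 users, 3 KB per batch, one pull every 3 seconds
estimate = peak_bandwidth_gbps(1_000_000, 3 * 1024, 3.0)  # ~8.2 Gbps
```

At 1 KB/s per user this is about 8 kbps per connection, which only becomes a problem at the scale of a whole room of a million viewers.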
Bandwidth Optimization
Enable HTTP gzip compression, reducing payload size by more than 40%.
Simplify the response structure (see image).
Reorder response content so similar fields sit adjacent, increasing the redundancy that gzip can exploit and improving compression ratios.
Frequency control: include a request-interval parameter in responses so the server can limit client request rates, and apply sparse control (longer intervals) during low-traffic periods.
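The gzip step can be illustrated with a small sketch. The batch shape and field names below are assumptions for illustration, not the article's actual wire format; the repeated keys across comments are exactly the redundancy gzip exploits:

```python
import gzip
import json

# Hypothetical danmaku batch: 15 comments with repetitive structure.
batch = [
    {"uid": 1000 + i, "nick": f"user{i}", "text": "666", "ts": 1700000000 + i}
    for i in range(15)
]

raw = json.dumps(batch).encode("utf-8")
packed = gzip.compress(raw)          # what Content-Encoding: gzip ships
saving = 1 - len(packed) / len(raw)  # fraction of bytes saved
```

Because every comment repeats the same keys, sorting or grouping similar fields before serialization (the "reorder content" step above) further improves the ratio.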
Danmaku Stutter and Loss Analysis
The main dilemma was choosing between push and pull delivery. Long polling via AJAX reduces request overhead but requires holding many open connections. WebSocket offers bidirectional communication with lower header overhead, but it too requires many persistent connections and degrades badly on weak networks.
Neither long polling nor WebSocket fit these constraints, so a short-polling approach was adopted: clients pull at a server-controlled interval, achieving timely delivery while keeping per-connection overhead low.
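A minimal client-side sketch of this short-polling scheme follows. The endpoint shape and the field names (`list`, `id`, `interval`, `after`) are assumptions for illustration, not the article's actual API; the server-returned `interval` is what implements the frequency control described earlier:

```python
import json
import time

def poll_once(fetch, room_id, last_id):
    """One pull cycle. `fetch(url)` returns a raw JSON body; the server's
    "interval" field tells the client how long to wait before the next pull."""
    payload = json.loads(fetch(f"/danmaku?room={room_id}&after={last_id}"))
    msgs = payload.get("list", [])
    new_last = max((m["id"] for m in msgs), default=last_id)
    return msgs, new_last, payload.get("interval", 3.0)

def poll_loop(fetch, room_id, render, cycles):
    """Drive repeated pulls, sleeping for the server-suggested interval
    (sparse control: the server lengthens it during low-traffic periods)."""
    last_id = 0
    for _ in range(cycles):
        msgs, last_id, interval = poll_once(fetch, room_id, last_id)
        for m in msgs:
            render(m)
        time.sleep(interval)
```

Each request is short-lived, so no connection state is held between pulls, and a lost response simply means the next pull re-fetches everything after `last_id`, which tolerates weak networks better than a broken persistent connection.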
Reliability and Performance
Service splitting isolates the heavy send‑danmaku logic from the high‑frequency pull‑danmaku service, allowing independent scaling and preventing one service from overwhelming the other.
Pull‑side uses a local cache refreshed via RPC, reducing latency and external dependency impact.
Send-side employs rate limiting and graceful degradation (e.g., avatar-fetch or profanity-filter failures do not block the core flow).
A time‑based ring buffer stores only the latest 60 seconds of comments, enabling lock‑free reads for up to 30 seconds of data.
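The ring buffer can be sketched as one slot per second, with writers replacing whole slots and readers copying recent slots without taking the write lock. This is a minimal CPython sketch, not the article's implementation; it relies on tuple replacement being atomic under the GIL, which a production system in another language would need a proper memory model for:

```python
import threading
import time

SLOTS = 60  # one slot per second; only the latest 60 s of comments survive

class TimeRing:
    """Time-indexed ring buffer: writers serialize on a lock, readers
    never take it (the lock-free read path), tolerating slight staleness."""

    def __init__(self):
        self._slots = [(0, [])] * SLOTS  # (second, messages) pairs
        self._lock = threading.Lock()

    def append(self, msg, now=None):
        sec = int(now if now is not None else time.time())
        i = sec % SLOTS
        with self._lock:  # writers only; readers never block here
            slot_sec, msgs = self._slots[i]
            if slot_sec != sec:
                msgs = []  # the slot has wrapped around to a new second
            self._slots[i] = (sec, msgs + [msg])  # replace, never mutate

    def read_recent(self, window=30, now=None):
        """Lock-free read of up to `window` seconds of comments."""
        sec = int(now if now is not None else time.time())
        out = []
        for s in range(sec - window + 1, sec + 1):
            slot_sec, msgs = self._slots[s % SLOTS]  # atomic slot snapshot
            if slot_sec == s:  # skip stale slots from a previous minute
                out.extend(msgs)
        return out
```

Because slots are replaced wholesale rather than mutated, a reader either sees the old slot or the new one, never a half-written list, which is what makes the 30-second read path safe without a lock.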
Summary
During the Double Twelve event, despite a brief Redis outage, the system reliably supported 700,000 concurrent users, meeting its performance goals.