How We Scaled a Live‑Streaming Danmaku System to 1M Concurrent Users
This article details the design, bandwidth optimization, and reliability engineering behind a custom live‑streaming danmaku service that supports up to one million simultaneous users, covering problem analysis, compression techniques, polling strategies, service splitting, and performance results from a major traffic event.
Background
To better support Southeast Asian live‑streaming, the product added a danmaku feature. The first version, built on Tencent Cloud, suffered frequent stutter and low comment density, prompting the development of a custom danmaku system capable of supporting up to one million concurrent users per room.
Problem Analysis
The system faces three main challenges:
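Bandwidth pressure: delivering at least 15 comments every 3 seconds makes each response larger than 3 KB. With one million users that is about 8 Gbps in total (3 KB × 8 bits × 1,000,000 ÷ 3 s), against only 10 Gbps of available bandwidth.
Weak networks: unstable connections cause danmaku stutter and loss.
Performance and reliability: with one million users requesting comments every 3 seconds, expected QPS exceeds 300 k (about 333 k), so the service must stay robust during peak events.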
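The bandwidth and QPS figures can be checked with simple arithmetic. The sketch below assumes exactly one million users, one response per user every 3 seconds, and a 3 KB response taken as 3,000 bytes; the function names are illustrative only.

```python
# Back-of-the-envelope check of the bandwidth and QPS estimates.
# Assumptions: 1,000,000 users, one 3,000-byte response per user every 3 seconds.
USERS = 1_000_000
PACKET_BYTES = 3_000
POLL_INTERVAL_S = 3


def required_gbps(users: int, packet_bytes: int, interval_s: float) -> float:
    """Aggregate egress bandwidth in gigabits per second."""
    bits_per_second = users * packet_bytes * 8 / interval_s
    return bits_per_second / 1e9


def required_qps(users: int, interval_s: float) -> float:
    """Requests per second if every user sends one request per interval."""
    return users / interval_s


if __name__ == "__main__":
    print(f"{required_gbps(USERS, PACKET_BYTES, POLL_INTERVAL_S):.1f} Gbps")
    print(f"{required_qps(USERS, POLL_INTERVAL_S):,.0f} QPS")
```

Under these assumptions the result is 8.0 Gbps and roughly 333,000 requests per second, which matches the estimates in the list.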
Bandwidth Optimization
We reduced bandwidth by:
Enabling HTTP compression (gzip achieves >40 % reduction).
Simplifying the response structure.
Reordering content so repetitive strings and numbers are placed together to improve compression.
Implementing frequency control, including bandwidth throttling and sparse‑period throttling.
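To illustrate the compression and reordering steps, here is a minimal sketch that compares a row-oriented JSON response with a column-oriented one before and after gzip. The field names (uid, name, ts, text) and the sample data are assumptions for illustration, not the actual response format of the system.

```python
import gzip
import json

FIELDS = ("uid", "name", "ts", "text")  # hypothetical response fields


def make_comments(n=15):
    """Build n deterministic sample comments (illustrative data only)."""
    return [
        {"uid": 100000 + i, "name": f"user{i}", "ts": 1700000000 + i,
         "text": f"hello from viewer {i}"}
        for i in range(n)
    ]


def row_layout(comments):
    """One JSON object per comment: the keys repeat in every element."""
    return json.dumps(comments, separators=(",", ":")).encode("utf-8")


def column_layout(comments):
    """One array per field: similar strings and numbers sit next to each other."""
    columns = {key: [c[key] for c in comments] for key in FIELDS}
    return json.dumps(columns, separators=(",", ":")).encode("utf-8")


def gzip_size(payload):
    """Size in bytes of the gzip-compressed payload."""
    return len(gzip.compress(payload))


if __name__ == "__main__":
    comments = make_comments()
    for label, payload in (("row", row_layout(comments)),
                           ("column", column_layout(comments))):
        print(label, len(payload), "bytes raw,", gzip_size(payload), "bytes gzipped")
```

The column layout drops the repeated keys, so it is smaller even before compression. How much gzip gains on top depends on the real data, so the sizes printed here should not be read as the production numbers.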
Danmaku Stutter and Loss Analysis
We evaluated push vs pull mechanisms.
Long Polling via AJAX
The client holds an AJAX request open, and the server responds only when an event occurs. Enabling HTTP Keep‑Alive saves handshake time.
WebSockets
WebSocket provides full‑duplex communication with minimal header overhead and supports binary frames and compression.
Both approaches rely on long‑lived TCP connections. On weak Southeast Asian networks, these connections frequently fail their TCP keep‑alive checks (governed by keepalive_time, keepalive_intvl, and keepalive_probes) and are dropped, which makes both long polling and WebSocket unsuitable.
We therefore adopted a short‑polling strategy for danmaku delivery.
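A short-polling client can be sketched as a loop of independent requests. In the sketch below, `fetch`, the `since` cursor, and the `latest` and `next_interval` reply fields are all hypothetical. Letting the server return the next request interval is one possible way to implement frequency control, not a confirmed detail of the system.

```python
import time
from typing import Callable, Dict, List


def poll_danmaku(fetch: Callable[[int], Dict], rounds: int,
                 sleep: Callable[[float], None] = time.sleep,
                 retry_delay: float = 3.0) -> List[str]:
    """Run `rounds` short polls and return every comment received.

    `fetch(since)` stands in for one HTTP GET. It is assumed to return
    {"comments": [...], "latest": <cursor>, "next_interval": <seconds>}.
    """
    since = 0
    received: List[str] = []
    for _ in range(rounds):
        try:
            reply = fetch(since)
        except OSError:
            # A failed request costs one cycle only; the next poll is a fresh
            # request, so there is no long-lived connection to repair.
            sleep(retry_delay)
            continue
        received.extend(reply["comments"])
        since = reply["latest"]           # resume after the last comment seen
        sleep(reply["next_interval"])     # server-controlled polling rate
    return received
```

Because each poll stands alone, a dropped request on a weak network delays comments by one interval instead of forcing a reconnect, and the cursor lets the client resume from the last comment it saw.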
Reliability and Performance
We split the service into two parts: a high‑frequency pull service with local caching and a low‑frequency push service with rate limiting and graceful degradation.
The pull side caches danmaku in memory, updates via periodic RPC, and uses a time‑based ring buffer that retains the last 60 seconds of comments, enabling lock‑free reads for the most recent 30 seconds.
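The time-based ring buffer can be sketched as follows. This is a minimal illustration under assumptions, not the production data structure: the class and method names are hypothetical, each slot holds one second of comments, and the writer publishes a whole second at a time.

```python
from typing import List, Optional, Tuple


class DanmakuRing:
    """Time-indexed ring buffer: slot index = second % size."""

    def __init__(self, size: int = 60) -> None:
        self.size = size
        # Each slot is either None or an immutable (second, comments) pair.
        self.slots: List[Optional[Tuple[int, Tuple[str, ...]]]] = [None] * size

    def write(self, second: int, comments: List[str]) -> None:
        """Publish the comments for one second, replacing data `size` seconds old."""
        # The slot is replaced by a single assignment, so a reader sees
        # either the old pair or the new pair, never a half-written one.
        self.slots[second % self.size] = (second, tuple(comments))

    def read_recent(self, now: int, window: int = 30) -> List[str]:
        """Return comments from the `window` seconds before `now`, oldest first."""
        result: List[str] = []
        for second in range(now - window, now):
            slot = self.slots[second % self.size]
            # Skip empty slots and slots that hold a different lap of the ring.
            if slot is not None and slot[0] == second:
                result.extend(slot[1])
        return result
```

With 60 slots and a 30-second read window that excludes the current second, readers and the writer touch different slots, which is what allows reads to proceed without a lock in this sketch.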
The push side limits comment volume, applies fallback handling for avatar fetching and profanity filtering, and ensures core delivery continues even when auxiliary calls fail.
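The push-side behavior can be sketched as a rate limiter followed by best-effort auxiliary calls. Everything named here is hypothetical: the fixed-window limiter, the `fetch_avatar` and `is_profane` callables, and the default avatar. Publishing a comment when the profanity filter is unavailable is one possible fallback policy, assumed for illustration.

```python
import time
from typing import Callable, Dict, Optional

DEFAULT_AVATAR = "default.png"  # hypothetical placeholder avatar


class RateLimiter:
    """Fixed-window limiter: at most `limit` comments per one-second window."""

    def __init__(self, limit: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.clock = clock
        self.window: Optional[int] = None
        self.count = 0

    def allow(self) -> bool:
        window = int(self.clock())
        if window != self.window:
            # A new second has started: reset the counter.
            self.window, self.count = window, 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


def submit_comment(text: str, uid: int, limiter: RateLimiter,
                   fetch_avatar: Callable[[int], str],
                   is_profane: Callable[[str], bool]) -> Optional[Dict]:
    """Return the comment to publish, or None if it is dropped."""
    if not limiter.allow():
        return None  # shed load before calling any downstream service
    try:
        avatar = fetch_avatar(uid)
    except Exception:
        avatar = DEFAULT_AVATAR  # degrade: avatar service unavailable
    try:
        if is_profane(text):
            return None
    except Exception:
        pass  # degrade (assumed policy): publish when the filter is unavailable
    return {"uid": uid, "text": text, "avatar": avatar}
```

The core path, accepting and publishing the comment, has no hard dependency on either auxiliary call, so a failure in one of them changes what the comment looks like but not whether it is delivered.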
Summary
During the Double‑12 event, despite a brief Redis outage, the system sustained 700 k concurrent users with high efficiency and stability, meeting the target.