How We Scaled a Live‑Streaming Danmaku System to 1M Concurrent Users

This article details the design, bandwidth optimization, and reliability engineering behind a custom live‑streaming danmaku service that supports up to one million simultaneous users, covering problem analysis, compression techniques, polling strategies, service splitting, and performance results from a major traffic event.

Java High-Performance Architecture

Background

To better support Southeast Asian live‑streaming, the product added a danmaku feature. The first version, built on Tencent Cloud, suffered frequent stutter and low comment density, prompting the development of a custom danmaku system capable of supporting up to one million concurrent users per room.

Problem Analysis

The system faces three main challenges:

Bandwidth pressure: a smooth experience needs at least 15 comments every 3 seconds, which pushes each response above 3 KB. At one million users that adds up to about 8 Gbps, while the available bandwidth is only 10 Gbps.

Weak networks cause danmaku stutter and loss.

Performance and reliability: expected QPS exceeds 300 k, requiring robust handling during peak events.
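The bandwidth and QPS figures above can be sanity-checked with a little arithmetic. The sketch below assumes every user pulls one 3 KB response every 3 seconds and treats 3 KB as 3,000 bytes. The class and method names are illustrative, not from the original system.

```java
// Back-of-the-envelope load estimate. Inputs are the article's figures plus
// the assumption that every user polls exactly once per interval.
final class DanmakuLoadEstimate {

    /** Egress in Gbps when every user receives one packet per interval. */
    static double egressGbps(long users, long packetBytes, double intervalSeconds) {
        double bitsPerSecond = users * packetBytes * 8.0 / intervalSeconds;
        return bitsPerSecond / 1e9;
    }

    /** Requests per second when every user polls once per interval. */
    static double requestsPerSecond(long users, double intervalSeconds) {
        return users / intervalSeconds;
    }
}
```

With 1,000,000 users, 3,000 bytes, and a 3-second interval, egressGbps gives 8 Gbps and requestsPerSecond gives roughly 333k, which is consistent with both the 8 Gbps estimate and the expected QPS above 300k.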

Bandwidth Optimization

We reduced bandwidth by:

Enabling HTTP compression (gzip reduced response size by more than 40%).

Simplifying the response structure.

Reordering content so repetitive strings and numbers are placed together to improve compression.

Implementing frequency control: bandwidth throttling to cap how often clients request, and sparse‑period throttling to avoid wasted requests when few comments are arriving.
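To see how compression and response layout interact, the following sketch gzips a batch of comments in two layouts: one object per comment, and a grouped layout where similar values sit next to each other. The article does not show its real payload, so the JSON shape, field names, and class name here are assumptions made purely for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

final class GzipPayloadDemo {

    /** Compresses a response body the way HTTP gzip encoding would. */
    static byte[] gzip(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(body.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // safe: the gzip stream is closed and flushed here
    }

    /** Hypothetical layout A: one object per comment, field names repeated. */
    static String rowLayout(int n) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < n; i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"uid\":").append(1000 + i)
              .append(",\"name\":\"user").append(i)
              .append("\",\"text\":\"hello ").append(i)
              .append("\",\"ts\":").append(1700000000L + i).append('}');
        }
        return sb.append(']').toString();
    }

    /** Hypothetical layout B: similar strings and numbers grouped together. */
    static String columnLayout(int n) {
        StringBuilder uid = new StringBuilder(), name = new StringBuilder(),
                      text = new StringBuilder(), ts = new StringBuilder();
        for (int i = 0; i < n; i++) {
            String sep = i > 0 ? "," : "";
            uid.append(sep).append(1000 + i);
            name.append(sep).append("\"user").append(i).append('"');
            text.append(sep).append("\"hello ").append(i).append('"');
            ts.append(sep).append(1700000000L + i);
        }
        return "{\"uid\":[" + uid + "],\"name\":[" + name
                + "],\"text\":[" + text + "],\"ts\":[" + ts + "]}";
    }
}
```

Comparing gzip(rowLayout(15)).length with gzip(columnLayout(15)).length on representative data is a quick way to decide whether a layout change is worth it. The actual savings depend on the real payload, so they should be measured rather than assumed.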

Danmaku Stutter and Loss Analysis

We evaluated push vs pull mechanisms.

Long Polling via AJAX

The client holds an AJAX request open, and the server responds when an event occurs. Enabling HTTP Keep‑Alive saves handshake time.

WebSockets

WebSocket provides full‑duplex communication with minimal header overhead and supports binary frames and compression.

Both approaches rely on long‑lived TCP connections. On weak Southeast Asian networks these connections break frequently, and TCP keep‑alive probing (governed by keepalive_time, keepalive_intvl, and keepalive_probes) does not cope well with such breaks, making both long polling and WebSocket unsuitable.

We therefore adopted a short‑polling strategy for danmaku delivery.
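A minimal sketch of the short‑polling idea follows. Each poll is an independent request carrying a cursor, so a request lost on a weak network costs one interval and the next successful poll catches up. The transport is abstracted as a function, and the cursor scheme, class names, and fields are assumptions rather than the article's actual protocol.

```java
import java.util.List;
import java.util.function.LongFunction;

final class ShortPoller {

    /** One comment with the server-side timestamp used as the cursor. */
    static final class Comment {
        final long timestampMillis;
        final String text;
        Comment(long timestampMillis, String text) {
            this.timestampMillis = timestampMillis;
            this.text = text;
        }
    }

    // Hypothetical transport: returns comments newer than the given cursor,
    // for example by issuing an HTTP GET. It may throw on network failure.
    private final LongFunction<List<Comment>> fetchSince;
    private long cursor; // timestamp of the newest comment seen so far

    ShortPoller(LongFunction<List<Comment>> fetchSince, long startCursor) {
        this.fetchSince = fetchSince;
        this.cursor = startCursor;
    }

    /** Runs one poll and returns the new comments, or an empty list on failure. */
    List<Comment> pollOnce() {
        List<Comment> batch;
        try {
            batch = fetchSince.apply(cursor);
        } catch (RuntimeException e) {
            return List.of(); // cursor unchanged, so the next poll catches up
        }
        for (Comment c : batch) {
            cursor = Math.max(cursor, c.timestampMillis);
        }
        return batch;
    }

    long cursor() {
        return cursor;
    }
}
```

A client would call pollOnce on a fixed schedule, such as every 3 seconds. Using a timestamp as the cursor is a simplification: comments that share a timestamp with the cursor could be missed, so a real system would likely use a finer-grained sequence.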

Reliability and Performance

We split the service into two parts: a high‑frequency pull service with local caching and a low‑frequency push service with rate limiting and graceful degradation.

The pull side caches danmaku in memory, updates via periodic RPC, and uses a time‑based ring buffer that retains the last 60 seconds of comments, enabling lock‑free reads for the most recent 30 seconds.
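One way to build such a time‑based ring buffer is sketched below: 60 one‑second slots, each holding an immutable bucket that a single refresher thread swaps in atomically, so readers never take a lock. The source does not describe its implementation, so the slot granularity, the single‑writer assumption, and all names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReferenceArray;

final class DanmakuRing {
    private static final int SLOTS = 60;       // seconds retained, per the article
    private static final int READ_WINDOW = 30; // seconds served to readers, per the article

    /** Immutable bucket holding all comments for one wall-clock second. */
    private static final class Slot {
        final long second;
        final List<String> comments;
        Slot(long second, List<String> comments) {
            this.second = second;
            this.comments = List.copyOf(comments);
        }
    }

    private final AtomicReferenceArray<Slot> ring = new AtomicReferenceArray<>(SLOTS);

    private static int index(long second) {
        return (int) Math.floorMod(second, (long) SLOTS);
    }

    /** Called by the single refresher thread after each RPC; replaces a whole bucket. */
    void publish(long second, List<String> comments) {
        ring.set(index(second), new Slot(second, comments));
    }

    /** Lock-free read of seconds in (afterSecond, nowSecond], at most 30 s back. */
    List<String> readSince(long afterSecond, long nowSecond) {
        long from = Math.max(afterSecond + 1, nowSecond - READ_WINDOW + 1);
        List<String> out = new ArrayList<>();
        for (long s = from; s <= nowSecond; s++) {
            Slot slot = ring.get(index(s));
            if (slot != null && slot.second == s) { // skip empty or overwritten slots
                out.addAll(slot.comments);
            }
        }
        return out;
    }
}
```

Because readers only look at the newest 30 seconds while the ring keeps 60, the slots being recycled by the writer are always older than anything a reader asks for. That gap is what makes the lock‑free read safe in this sketch.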

The push side limits comment volume, applies fallback handling for avatar fetching and profanity filtering, and ensures core delivery continues even when auxiliary calls fail.
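The degradation behavior described above might look like the following sketch, where each auxiliary call gets a time budget and a fallback value so the comment is still delivered when the call fails or is slow. The helper, the 50 ms budget, the default avatar name, and the message format are all hypothetical. Letting unfiltered text through when the filter is down is a policy choice that a real system would need to decide deliberately.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class Degradable {

    /** Runs an auxiliary call; on error or timeout returns the fallback instead. */
    static <T> T callOrFallback(Supplier<T> call, T fallback, long timeoutMillis) {
        return CompletableFuture.supplyAsync(call)
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(e -> fallback)
                .join();
    }

    /** Hypothetical send path: auxiliary steps degrade, the comment still goes out. */
    static String buildMessage(String user, String text,
                               Supplier<String> avatarLookup,
                               Supplier<String> profanityFilter) {
        String avatar = callOrFallback(avatarLookup, "default-avatar.png", 50);
        String clean = callOrFallback(profanityFilter, text, 50);
        return user + "|" + avatar + "|" + clean;
    }
}
```

Note that a timed‑out call keeps running in the background in this sketch. Production code would usually also cancel it or bound the executor so slow dependencies cannot pile up.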

Summary

During the Double‑12 event the system served 700k concurrent users efficiently and stably, even through a brief Redis outage, which met the target for the event.
