Design and Evolution of Baidu Live Streaming Message Service
Baidu's live‑streaming message service evolved from a basic group‑chat model into a scalable multicast architecture that delivers messages to up to a million concurrent users per room with second‑level latency, using message aggregation, compression, and priority channels to keep QPS and bandwidth under control and to ensure reliable gift delivery.
The core functions of a live‑streaming service are real‑time audio/video streaming and the exchange of live‑room messages. This article introduces the design practice and evolution of Baidu's live‑streaming message service.
Background : Live‑room interactions are more than simple chat; they also include gifts, entrance notifications, likes, purchases, host recommendations, and link‑mic requests. All these rely on a real‑time message flow that underpins both user interaction and live‑room control.
Live Message 1.0 – Foundations : The system initially followed a classic group‑chat model. Key differences from ordinary group chat include massive concurrent users (up to millions), frequent user join/leave (tens of thousands QPS), and short session duration (hours). Two main problems were identified: maintaining millions of users in a room and delivering messages to millions of online users.
Design Goals :
End‑to‑end delivery latency within seconds (concretely, ≤2 s).
Support >1 million concurrent online users per room.
Allow reasonable message dropping for peak overload.
Assume each room sends ≤20 messages per second.
The core challenge is delivering N (≤20) messages per second to a million online users with an end‑to‑end latency of S (≤2) seconds.
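The arithmetic behind this challenge is worth making explicit: every room‑level message must fan out once per online connection, so the peak downstream push rate is the per‑room message rate times the number of online users. A minimal check in Python, using the numbers stated above:

```python
# Back-of-the-envelope fan-out math for the stated goal:
# <=20 messages/s per room, 1,000,000 concurrent viewers.

def downstream_push_qps(msgs_per_sec: int, online_users: int) -> int:
    """Each room-level message is pushed once per online connection."""
    return msgs_per_sec * online_users

peak = downstream_push_qps(20, 1_000_000)  # 20,000,000 pushes per second
```

Twenty million pushes per second is the scale every downstream component must be sized for, which is why the group‑chat pipeline's per‑message storage and routing lookups break down.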
Group‑Chat Pressure Analysis : Mapping the group‑chat flow to live‑room scenarios reveals six million‑level pressure points (user list split, device lookup, routing lookup, long‑connection push, client pull, read‑state update). Directly reusing the group‑chat pipeline would overwhelm storage, routing, and connection services.
Optimization Attempts :
Merge user‑list and device data to reduce lookups.
Replace pull‑based fetching with push‑only delivery (allowing some loss).
Simplify or drop read‑state handling.
After optimization, three million‑level steps remain: user‑list split, dynamic routing, and long‑connection push.
Multicast (mcast) Solution : Introduce a long‑connection multicast group identified by a global mcastID. Each mcast is a set of connections; the routing layer records which connection‑service instances host members. Clients join/leave via SDK calls (mcastJoin / mcastLeave). Message push follows a two‑stage 1:M × 1:N expansion: the backend selects a router instance, the router forwards the message to all M relevant connection‑service instances, and each instance pushes it to its N local client connections.
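The two‑stage expansion can be sketched as follows. This is a hypothetical in‑memory model of the routing described above; the class and method names (`Router`, `ConnService`, `mcast_join`, `push`) are illustrative, not Baidu's actual API:

```python
from collections import defaultdict

class ConnService:
    """One long-connection service instance holding client connections."""
    def __init__(self):
        self.members = defaultdict(set)  # mcastID -> set of local conn ids
        self.delivered = []              # (conn_id, msg) pairs, for the demo

    def join(self, mcast_id, conn_id):
        self.members[mcast_id].add(conn_id)

    def leave(self, mcast_id, conn_id):
        self.members[mcast_id].discard(conn_id)

    def push(self, mcast_id, msg):
        # Second stage: 1:N expansion to each local connection.
        for conn_id in self.members[mcast_id]:
            self.delivered.append((conn_id, msg))

class Router:
    """Routing layer: records which instances host members of each mcast."""
    def __init__(self):
        self.instances = defaultdict(set)  # mcastID -> conn-service instances

    def mcast_join(self, mcast_id, instance, conn_id):
        instance.join(mcast_id, conn_id)
        self.instances[mcast_id].add(instance)

    def push(self, mcast_id, msg):
        # First stage: 1:M expansion to the relevant instances.
        for instance in self.instances[mcast_id]:
            instance.push(mcast_id, msg)
```

The key property is that the router only tracks instances, not individual users, so its state stays small even for a million‑member mcast.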
Performance Evaluation :
Join/leave traffic stays at tens of thousands of QPS, orders of magnitude below the downstream message‑push fan‑out.
Router split cost is low (tens to hundreds of instances).
One long‑connection instance can handle roughly 5 × 10⁴ to 8 × 10⁴ push QPS; at the conservative 5 × 10⁴ figure, 20 instances suffice for 10⁶ QPS and 100 instances for 5 × 10⁶ QPS.
Capacity table (instances vs. QPS) shows that the architecture can scale to millions of connections with modest horizontal expansion.
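The sizing arithmetic behind that table reduces to a ceiling division, assuming the conservative 5 × 10⁴ push QPS per instance quoted above:

```python
import math

def instances_needed(total_push_qps: int, per_instance_qps: int = 50_000) -> int:
    """Long-connection instances required to sustain a given push rate,
    assuming a conservative 5e4 QPS per instance (figure from the article)."""
    return math.ceil(total_push_qps / per_instance_qps)
```

For example, `instances_needed(1_000_000)` yields 20 and `instances_needed(5_000_000)` yields 100, matching the figures above; horizontal scaling is linear in push QPS.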
Message‑Peak Problem : If the combined rate across all message types reaches 100 messages per second in a million‑user room, downstream push QPS climbs to 10⁸, which would require about 2,000 instances at 5 × 10⁴ QPS each. To mitigate this, a delay‑aggregation strategy batches each second's messages into a single push, adding roughly 500 ms of average latency but cutting downstream QPS by up to 100×.
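A minimal sketch of that delay‑aggregation step, assuming a fixed one‑second window (the class and method names are illustrative): messages submitted within a window are merged into one combined payload, so one downstream push replaces up to 100 individual pushes.

```python
class Aggregator:
    """Buffers messages within a window and emits them as one batch.
    Flushing is assumed to be driven by a once-per-window timer."""
    def __init__(self):
        self.buffer = []

    def submit(self, msg):
        self.buffer.append(msg)

    def flush(self):
        """Called once per window; returns one combined payload, or None."""
        if not self.buffer:
            return None
        batch, self.buffer = self.buffer, []
        return batch
```

The added latency is bounded by the window length: a message arriving just after a flush waits almost a full second, one arriving just before waits almost nothing, averaging ~500 ms.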
Bandwidth Issue : Even with aggregation, a single instance pushing 100 msg/s (2 KB each) to 10,000 connections emits about 15.3 Gbps of egress, exceeding a 10 Gbps NIC. Applying ~6.7:1 message compression brings this down to roughly 2.3 Gbps per instance, a manageable load.
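The bandwidth figures can be verified directly (the article's 15.3 Gbps figure corresponds to binary gigabits, 2³⁰ bits, with 2 KB = 2048 bytes):

```python
def egress_gbps(connections: int, msgs_per_sec: int, msg_bytes: int,
                compression_ratio: float = 1.0) -> float:
    """Per-instance egress in binary gigabits per second, after compression."""
    bits_per_sec = connections * msgs_per_sec * msg_bytes * 8 / compression_ratio
    return bits_per_sec / 2**30

raw = egress_gbps(10_000, 100, 2048)             # ~15.3 Gbps: over a 10G NIC
compressed = egress_gbps(10_000, 100, 2048, 6.7) # ~2.3 Gbps: manageable
```

Note that compression pays off doubly here: each compressed batch is sent to thousands of connections, so one compression pass amortizes over the whole fan‑out.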
Client‑Side Optimizations :
Rate‑limiting per user to avoid UI overload.
Priority‑based real‑time delivery for high‑importance messages (e.g., gifts).
Enhanced heartbeat, HTTPDNS, and multi‑point access to improve connection stability.
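The per‑user rate limiting in the first point is commonly implemented as a token bucket; the sketch below is illustrative, and the rate and capacity values are examples, not Baidu's production settings:

```python
class TokenBucket:
    """Per-user limiter: smooths message rendering to avoid UI overload."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # burst allowance
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # drop or defer the message for this user
```

Since the design goals explicitly allow reasonable message dropping under peak load, messages rejected by the bucket can simply be discarded on the client.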
Reliability for Gifts : A dedicated high‑priority multicast channel ensures gift messages reach hosts reliably, with fallback pull via short‑connection if the long‑connection fails.
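A hedged sketch of that delivery path, with illustrative names (`deliver_gift`, `pull_store` are not Baidu's actual API): the high‑priority channel is tried first, and anything the long connection fails to deliver is stashed for the client's short‑connection fallback pull.

```python
def deliver_gift(msg, long_conn_send, pull_store):
    """Push a gift message over the priority long connection; on failure,
    queue it so the client can fetch it via a short-connection pull."""
    try:
        long_conn_send(msg)
        return "pushed"
    except ConnectionError:
        pull_store.append(msg)  # fallback: client pulls these later
        return "queued_for_pull"
```

Keeping gifts on a dedicated mcast channel also means chat‑message aggregation and dropping never touch them, preserving both their latency and their reliability.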
Further Development : The service now supports multiple client platforms (Android, iOS, H5, mini‑programs, PC), cross‑app message sharing, non‑login users, degradation paths, moderation pipelines, cross‑room messaging, and new interactive scenarios such as live quizzes and e‑commerce.
In summary, the multicast mcast mechanism effectively solves the real‑time delivery problem for millions of concurrent users, offering controllable QPS and bandwidth, easy horizontal scaling, and applicability beyond live streaming.
Baidu App Technology
Official Baidu App Tech Account