
How ZRTC Powers Millions of Live Streams: Architecture & Scaling

ZRTC, the real‑time audio‑video platform behind 作业帮, has been refined for over three years to support massive, multi‑cloud, multi‑protocol live streaming, employing a unified SDK, intelligent scheduling, custom SFU services, and extensive performance tuning to achieve high concurrency, low latency, and robust high‑availability.

Zuoyebang Tech Team

Background

ZRTC, the real‑time audio‑video solution of Zuoyebang (作业帮), has been continuously refined and optimized by the streaming technology team and has run stably in large‑scale production environments for more than three years. It supports a variety of interactive course formats and has accumulated valuable practical experience.

Overall Technical Architecture

1. Overall Technical Framework

The system adopts a multi‑cloud, multi‑protocol, multi‑PaaS fusion architecture. From the SDK to the server side it supports CDN‑based RTMP live streaming and low‑latency RTC streaming. It integrates two major RTC service providers and three CDN providers for redundancy, and deploys active‑active instances across multiple cloud providers for reliability.

General SDK Layer

The unified SDK abstracts a common interface for low‑latency live, multi‑person interaction, traditional live, and VOD. It modularizes capture, codec, transport, synchronization, and rendering, enabling flexible combinations such as push‑multiple, pull‑multiple, and render‑multiple to support diverse course models.

Intelligent Scheduling

The scheduling system acts as the brain, dynamically allocating multi‑PaaS services based on cost, quality, burst capacity, and disaster‑recovery needs. It can downgrade protocols (e.g., RTC to RTMP) and manage per‑user routing, allowing a student to be served by different PaaS providers or downgraded to RTMP as needed.

Core System Layer

This layer handles media distribution and processing. The zrtclive and zrtcmeeting services are high‑performance SFUs built on customized WebRTC code. The zrelay service uses the open‑source KCP protocol for multi‑hop inter‑service forwarding. Protocol conversion, recording, and MCU mixing services provide complete media pipelines.

Other

A dedicated monitoring and alarm system, along with a management backend, provides real‑time alerts and post‑analysis. Reusable components such as libice, librtcbase, libzybrtc, libkcp, and a high‑performance network library (store‑framework) are also abstracted.

2. Large‑Scale Distribution Architecture

Using a Beijing teacher and Guangzhou students as an example, the teacher obtains the optimal server IP from the scheduler, exchanges SDP via HTTP, and connects to an edge zrtclive server via RTC. The stream is forwarded to the IDC relay (zrelay) via KCP, which registers the stream with the scheduler. Guangzhou students pull the stream from the best edge server; if they are the first in the region, the local zrelay fetches the teacher’s stream from the Beijing edge via dynamic KCP back‑source.

The relay nodes form a mesh network, continuously testing network conditions and reporting to the scheduler, which performs pre‑configuration and real‑time routing based on historical and live data. The zrelay and zrtclive nodes form a tree structure supporting multi‑level cascading, enabling high scalability.

Performance testing showed that same‑cloud links are stable, while cross‑cloud, long‑distance links experience jitter. Fixed POP points within the same province or city are deployed to improve stability, and the scheduler dynamically adjusts routes when POP points fail.

How to Achieve High Concurrency

1. Architectural Aspects

Remove room‑level business logic

The system abstracts only atomic push/pull capabilities, avoiding complex room management and thus eliminating the “bucket effect,” where the slowest component caps overall capacity. Signalling uses short‑lived HTTP connections, while optional long‑lived connections are maintained on the business side.

IDC‑level tree distribution

In large‑scale live mode, zrtclive back‑sources only hit a few zrelay nodes, reducing internal consumption. The tree structure enables horizontal scaling for higher concurrency.

IDC‑level mesh interconnection

Since a single IDC’s bandwidth is limited, inter‑IDC relays extend capacity across multiple clouds and regions. A mesh‑based intelligent routing algorithm resolves cross‑cloud network issues.

Optimize scheduling based on business attributes

Different course models receive tailored scheduling: high‑quality large‑scale live for super‑small classes, and low‑cost conference mode for small groups, with students in the same group placed on the same server (and even the same CPU core) to minimize back‑source streams.

2. Programming Aspects

Improved WebRTC architecture

The original multi‑threaded model is replaced by an asynchronous event‑driven model, reducing thread‑switch overhead and lock contention, which is crucial for high‑concurrency servers.

Multi‑core distribution

A lock‑free queue, shared pointers, and message notifications implement a single‑process multi‑core distribution model, fully utilizing multi‑core performance.

System optimization

Network card multi‑queue configuration, sufficient socket buffers (net.core.rmem_*/wmem_*), and appropriate kernel versions are essential for UDP‑heavy workloads.
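As a config sketch, buffer sizing of this kind might look like the following sysctl fragment; the exact values are assumptions and should be tuned per machine, kernel, and traffic profile.

```
# Illustrative sysctl values for a UDP-heavy media server.
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 4194304
net.core.wmem_default = 4194304
net.core.netdev_max_backlog = 250000
```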

Performance analysis tools

Tools like perf and valgrind generate flame graphs for fine‑grained optimization, e.g., reducing costly vfprintf calls.

CPU affinity binding

Binding threads to fixed cores avoids scheduler overhead, but must be used judiciously to prevent resource contention.

sendmmsg interface

sendmmsg batches multiple packets per system call, improving throughput at the cost of higher implementation complexity.

How to Achieve High Availability

1. Architectural Aspects

Load balancing, circuit breaking, rate limiting, disaster recovery, degradation, real‑time alerts, and post‑analysis are combined to keep self‑healing time within 15‑20 seconds. Stream services require rapid reconnection and automatic failover due to their long‑connection nature.

2. Single Program

Each release undergoes unit tests, valgrind memory checks, stress tests, and long‑running soak (“aging”) tests using the custom zrtcbench tool, which simulates massive push/pull scenarios and validates stability.

3. Multi‑cloud Multi‑protocol Multi‑PaaS

The system is designed as a hyper‑converged architecture, integrating multiple cloud providers and RTC services to achieve mutual disaster tolerance, burst‑traffic handling, QoS complementarity, and graceful degradation (e.g., fallback to RTMP).

How to Ensure High Quality

1. Objective Metrics and Dual Validation

Latency metrics using NTP‑synchronized end‑to‑end measurement (≈10 ms error).

Stutter metrics defined by frame interval thresholds (≥200 ms considered a stall).

Single‑scenario testing with simulated weak‑network conditions embedded in the SDK.

2. Proper Use of WebRTC

Adapt WebRTC’s built‑in anti‑weak‑network mechanisms (NACK, FEC, NETEQ, jitter buffer, pacer) to large‑scale live scenarios, e.g., fixing GOP intervals and adjusting I‑frame handling.

3. Server‑Side Importance

SFU servers perform selective forwarding based on frame completeness and dependency, using SVC/LTR techniques to balance smoothness and low latency.

4. Scenario‑Specific Optimizations

Transform LTR from a 1‑to‑1 solution to a 1‑to‑many model, introduce “small I‑frames” generated adaptively based on upstream loss, and enable them only in small‑group live rooms to avoid excessive bitrate.

Reflections

Balance generic design with business‑specific optimizations.

Recognize WebRTC’s original low‑latency focus when scaling to massive audiences.

QOE improvements must align with business priorities to avoid over‑optimization.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: cloud native, performance optimization, system architecture, real-time streaming, high concurrency, WebRTC