How to Build Scalable, High‑Availability Real‑Time Audio‑Video Systems
This talk explains the evolution and practical implementation of large‑scale real‑time audio‑video communication, covering common architectures such as direct P2P, MCU, and SFU, network topologies, scalability, high‑availability techniques, edge computing, and emerging technologies like WebRTC, SDN, and AI‑driven enhancements.
On July 29‑30, 2021, at the CIC 2021 Cloud Computing Summit in Beijing, Shen Weifeng gave a presentation titled "Practice and Evolution of Large‑Scale Real‑Time Audio‑Video Technology Architecture." It covered common real‑time audio‑video communication architectures, network topologies, and the complexity of real‑world scenarios.
Common Real‑Time Audio‑Video Architectures
Direct (P2P)
In a peer‑to‑peer setup, each client registers with a signaling server for discovery, then exchanges media directly with its peer without a media server. NAT traversal typically relies on STUN, with a relay server as the fallback when no direct path can be established.
MCU (Multipoint Control Unit)
Clients send audio‑video streams to a central server that decodes, synchronizes, mixes, and re‑encodes them before forwarding to all participants. This star topology imposes high CPU load on the server.
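The mixing step is where much of the MCU's per-stream work happens. As a rough illustration only, the sketch below mixes already-decoded 16-bit PCM audio by summing samples and clipping to the valid range. The function name `mix_pcm16` is hypothetical, and a production mixer would also handle resampling, synchronization, and gain control, none of which is shown here.

```python
def mix_pcm16(streams):
    """Mix equal-length lists of 16-bit PCM samples by summing and clipping.

    Illustrative only: assumes the streams are already decoded and aligned.
    """
    if not streams:
        return []
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("all streams must have the same number of samples")
    mixed = []
    for samples in zip(*streams):
        total = sum(samples)
        # Clip to the signed 16-bit range to avoid wrap-around distortion.
        mixed.append(max(-32768, min(32767, total)))
    return mixed


if __name__ == "__main__":
    print(mix_pcm16([[1000, -2000], [500, 30000], [0, 10000]]))
```

Because this work (plus video decoding and re-encoding) is repeated for every conference, server CPU grows with the number of participants, which is the load described above.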
SFU (Selective Forwarding Unit)
Clients send streams to a server that forwards them unchanged to subscribed peers. Mixing is done on the client side, reducing server load. SFU can forward different video resolutions (Simulcast/SVC) based on each client’s bandwidth.
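One way to picture per-client layer selection with Simulcast is the sketch below. The layer names, resolutions, and bitrates are assumptions chosen for illustration, not values from the talk, and `pick_layer` is a hypothetical helper rather than any SFU's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    height: int
    bitrate_kbps: int


# Hypothetical simulcast ladder published by one sender.
LAYERS = [
    Layer("low", 180, 150),
    Layer("mid", 360, 500),
    Layer("high", 720, 1500),
]


def pick_layer(layers, available_kbps):
    """Return the highest-bitrate layer that fits, else the lowest layer."""
    ordered = sorted(layers, key=lambda layer: layer.bitrate_kbps)
    best = ordered[0]  # always forward something, even on a weak link
    for layer in ordered:
        if layer.bitrate_kbps <= available_kbps:
            best = layer
    return best


if __name__ == "__main__":
    for kbps in (100, 600, 5000):
        print(kbps, pick_layer(LAYERS, kbps).name)
```

The server only chooses which already-encoded layer to forward, so no transcoding is needed on the server side.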
Comparison and Summary
Direct P2P is unsuitable for large conferences and lacks content moderation. With decreasing compute and bandwidth costs, SFU becomes advantageous for massive concurrency, while MCU remains common in traditional enterprise scenarios.
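The scaling difference can be made concrete by counting streams. The sketch below assumes a simplified model in which every participant publishes one stream and subscribes to everyone else, with P2P running as a full mesh; the function name `stream_counts` is hypothetical.

```python
def stream_counts(architecture, n):
    """Return (uplinks per client, downlinks per client, streams sent by server).

    Simplified model: n participants, each publishing one stream and
    receiving all others. P2P is assumed to be a full mesh.
    """
    if n < 2:
        raise ValueError("need at least two participants")
    if architecture == "p2p":
        # Each client sends to and receives from every other client directly.
        return (n - 1, n - 1, 0)
    if architecture == "mcu":
        # One stream up, one mixed stream down per client.
        return (1, 1, n)
    if architecture == "sfu":
        # One stream up; the server forwards every other stream to each client.
        return (1, n - 1, n * (n - 1))
    raise ValueError(f"unknown architecture: {architecture}")


if __name__ == "__main__":
    for arch in ("p2p", "mcu", "sfu"):
        print(arch, stream_counts(arch, 10))
```

Under this model the client uplink grows linearly in P2P but stays constant with MCU and SFU, while the SFU trades server CPU for outbound bandwidth.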
Network Topology Construction
Ring Topology
Nodes form a closed loop. Routing is simple, but in a basic ring a single node failure breaks the network.
Star/Tree Topology
A star has a central hub, which is easy to manage but makes the hub a bottleneck. A tree extends the star, allowing hierarchical scaling with multiple edge nodes.
Mesh Topology
Every device connects to every other, offering high reliability and low latency but with complex routing and traffic control.
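The routing and management cost of each topology tracks the number of links it needs. The following sketch counts links for n nodes under the textbook definitions of each topology; `link_count` is a hypothetical helper.

```python
def link_count(topology, n):
    """Number of point-to-point links needed to connect n nodes."""
    if n < 2:
        raise ValueError("need at least two nodes")
    if topology == "ring":
        # Each node links to its successor, closing the loop.
        return n if n > 2 else 1
    if topology in ("star", "tree"):
        # Every node except the hub (or root) has exactly one upstream link.
        return n - 1
    if topology == "mesh":
        # Full mesh: one link per unordered pair of nodes.
        return n * (n - 1) // 2
    raise ValueError(f"unknown topology: {topology}")


if __name__ == "__main__":
    for topo in ("ring", "star", "tree", "mesh"):
        print(topo, link_count(topo, 8))
```

The quadratic growth of the full mesh is why its routing and traffic control become complex as the number of nodes rises.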
Diversity of Real‑World Scenarios
Network Access Diversity – Mobile (3G/4G/5G), wired broadband (LAN, ADSL, PON/FTTH), and Wi‑Fi each present different bandwidth and stability characteristics.
Device Diversity – Desktops, mobiles, wearables, IoT devices vary in network modules, cameras, microphones, CPUs, and GPUs.
Server Access Diversity – Multi‑line BGP, multi‑carrier dedicated lines, or single‑carrier lines affect routing and redundancy.
Conditions also change continuously during a session:
Network dynamics: available bandwidth, packet loss, jitter, and latency.
Capture dynamics: noise, distortion, and jitter in the captured audio and video.
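Network jitter is usually tracked as a running estimate rather than a single measurement. The sketch below uses the interarrival jitter estimator defined in RFC 3550 (the RTP specification), applied to timestamps in milliseconds for readability. The talk does not say which estimator its system uses, so this is only one common choice.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate in the style of RFC 3550, section 6.4.1.

    For consecutive packets, D is the change in transit time, and the
    estimate is smoothed with a gain of 1/16.
    """
    if len(send_times) != len(recv_times):
        raise ValueError("timestamp lists must have the same length")
    jitter = 0.0
    for i in range(1, len(send_times)):
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16.0
    return jitter


if __name__ == "__main__":
    # The third packet arrives 5 ms later than its spacing predicts.
    print(interarrival_jitter([0, 20, 40], [5, 25, 50]))
```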
Architecture Evolution and Practice
High Concurrency and High Availability
To achieve high concurrency, services are clustered and load‑balanced across multiple servers.
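A minimal sketch of one possible balancing rule, least utilization, is shown below. The talk does not specify which policy is used; the server names, the `(sessions, capacity)` tuple layout, and `pick_server` are assumptions for illustration.

```python
def pick_server(servers):
    """Pick the server with the lowest utilization that still has room.

    servers maps a name to (current_sessions, capacity).
    Returns None when every server is full.
    """
    utilization = {
        name: used / capacity
        for name, (used, capacity) in servers.items()
        if capacity > 0 and used < capacity
    }
    if not utilization:
        return None
    # Break ties by name so the choice is deterministic.
    return min(utilization, key=lambda name: (utilization[name], name))


if __name__ == "__main__":
    cluster = {"media-a": (80, 100), "media-b": (30, 100), "media-c": (50, 50)}
    print(pick_server(cluster))
```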
Automatic fault recovery and graceful degradation ensure the system remains usable when components fail; for example, disabling video while keeping audio.
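The audio-first degradation described above could be expressed as a small policy function. The thresholds and mode names below are hypothetical values picked for the example, not figures from the talk.

```python
def choose_mode(bandwidth_kbps, video_healthy=True):
    """Choose a service level, preferring to keep audio alive.

    Thresholds (150 and 600 kbps) are illustrative assumptions.
    """
    if not video_healthy or bandwidth_kbps < 150:
        # Drop video entirely so the audio call survives.
        return "audio_only"
    if bandwidth_kbps < 600:
        return "audio_plus_low_video"
    return "audio_plus_hd_video"


if __name__ == "__main__":
    print(choose_mode(80))
    print(choose_mode(2000, video_healthy=False))
    print(choose_mode(300))
    print(choose_mode(2000))
```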
Elastic scaling of compute and network resources is enabled by virtualization and SDN technologies.
Geographic disaster recovery deploys multiple clusters in different locations and routes traffic to healthy clusters when failures occur.
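Routing to a healthy cluster might look like the following sketch, which picks the healthy cluster with the lowest measured round-trip time for the user's region. The cluster names, the dictionary layout, and `route` are assumptions made for the example.

```python
def route(clusters, user_region):
    """Return the name of the healthy cluster closest to the user, or None."""
    healthy = [c for c in clusters if c["healthy"]]
    if not healthy:
        return None
    # Clusters with no measurement for this region are tried last.
    best = min(healthy, key=lambda c: c["rtt_ms"].get(user_region, float("inf")))
    return best["name"]


if __name__ == "__main__":
    clusters = [
        {"name": "beijing", "healthy": False, "rtt_ms": {"cn-north": 8}},
        {"name": "shanghai", "healthy": True, "rtt_ms": {"cn-north": 28}},
        {"name": "singapore", "healthy": True, "rtt_ms": {"cn-north": 75}},
    ]
    # The nearest cluster is down, so traffic moves to the next closest one.
    print(route(clusters, "cn-north"))
```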
These techniques allow the service to achieve 99.95% availability worldwide.
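To put the 99.95% figure in perspective, the arithmetic below converts an availability target into allowed downtime per year and shows how redundancy raises availability. The redundancy formula assumes replicas fail independently, which real deployments only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability):
    """Allowed downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(single, replicas):
    """Availability of N redundant replicas, assuming independent failures."""
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    # 99.95% allows roughly 262.8 minutes (about 4.4 hours) of downtime a year.
    print(round(downtime_minutes_per_year(0.9995), 1))
    # Two independent 99% clusters together reach about 99.99%.
    print(round(combined_availability(0.99, 2), 6))
```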
High‑Quality Service
Quality is maintained on three fronts. On the network side: bandwidth estimation, congestion control, packet loss recovery, forward error correction, and multi‑layer distribution (SVC/Simulcast). On the audio side: noise reduction, echo cancellation, and adaptive volume. On the server side: resource reservation. According to the talk, these measures together sustain a high‑quality experience even at 70% packet loss.
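Forward error correction is the easiest of these techniques to show in a few lines. The sketch below uses a single XOR parity packet per group, which can rebuild exactly one lost packet in that group. It is a teaching example only: surviving very high loss rates requires much heavier redundancy than this, and the talk does not describe the actual FEC scheme used.

```python
def xor_parity(packets):
    """Build one parity packet for a group of equal-length packets."""
    if not packets:
        raise ValueError("need at least one packet")
    parity = bytearray(len(packets[0]))
    for packet in packets:
        if len(packet) != len(parity):
            raise ValueError("packets must have equal length")
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover(received, parity):
    """Rebuild the single missing packet (marked None) from the parity."""
    missing = [i for i, packet in enumerate(received) if packet is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can recover exactly one lost packet")
    present = [packet for packet in received if packet is not None]
    # XOR of all surviving packets and the parity yields the lost packet.
    return xor_parity(present + [parity])


if __name__ == "__main__":
    group = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_parity(group)
    print(recover([b"abcd", None, b"ijkl"], parity))
```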
Massive Scale and Ultra‑High Concurrency
In SFU, selective forwarding reduces bandwidth: instead of forwarding every stream to all participants, the server forwards only the streams needed for each user’s layout (e.g., 1 large + 6 small videos).
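The saving from layout-based forwarding can be estimated with simple arithmetic. In the sketch below, the 1 large + 6 small layout comes from the text, while the bitrates (1500 kbps for a large tile, 150 kbps for a small one) and the function names are assumptions for illustration.

```python
def layout_downlink_kbps(participants, large_kbps=1500, small_kbps=150,
                         large_slots=1, small_slots=6):
    """Per-user downlink when only the streams in the layout are forwarded."""
    others = participants - 1
    large = min(others, large_slots)
    small = min(max(others - large, 0), small_slots)
    return large * large_kbps + small * small_kbps


def forward_all_downlink_kbps(participants, per_stream_kbps=1500):
    """Per-user downlink when every other participant's stream is forwarded."""
    return (participants - 1) * per_stream_kbps


if __name__ == "__main__":
    n = 100
    print(layout_downlink_kbps(n), forward_all_downlink_kbps(n))
```

With these assumed bitrates, the per-user downlink stays flat once the layout is full, no matter how many participants join.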
Edge computing nodes arranged in a tree topology extend conference size and reduce latency for the last mile.
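A rough capacity estimate for such a tree follows from its fan-out. The sketch below assumes a uniform tree in which each node relays to the same number of children and only the last level serves clients; all parameter values are hypothetical.

```python
def tree_capacity(fan_out, depth, clients_per_edge):
    """Return (edge node count, total clients) for a uniform relay tree.

    depth is the number of relay hops between the root and the edge nodes.
    """
    if fan_out < 1 or depth < 0 or clients_per_edge < 0:
        raise ValueError("invalid tree parameters")
    edge_nodes = fan_out ** depth
    return edge_nodes, edge_nodes * clients_per_edge


if __name__ == "__main__":
    # A root feeding 4 relays, each feeding 4 edge nodes with 500 clients each.
    print(tree_capacity(4, 2, 500))
```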
For one‑way live streaming, a CDN can be used, but its latency is high (3‑10 s), which is too slow for interaction. Two‑way communication is therefore served from edge nodes or a central data center instead.
Paile Cloud Audio‑Video System Architecture
In the architecture diagram, the left side handles registration, authentication, configuration, discovery, and scheduling; the right side provides big‑data analytics, health monitoring, alerts, and elastic scaling. Core services include voice calls, video calls, interactive whiteboard, live interaction, and cloud recording.
Industry Trends and Emerging Technologies
WebRTC
Google acquired GIPS in 2010 and open‑sourced its technology as WebRTC in 2011; WebRTC 1.0 became a W3C Recommendation in January 2021. WebRTC has dramatically lowered the barrier to real‑time communication, spawning many services.
SDN
Software‑Defined Networking separates control and data planes, enabling programmable, virtualized networks that simplify path optimization and automation.
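With a central controller that sees the whole network, path optimization reduces to a graph search. The sketch below runs Dijkstra's algorithm over a small latency map to find the lowest-latency route between two sites. The site names and latencies are invented for the example, and this is not tied to any specific SDN controller API.

```python
import heapq


def lowest_latency_path(graph, src, dst):
    """Dijkstra's algorithm over {node: {neighbor: latency_ms}}.

    Returns (total_latency_ms, [nodes...]) or None if dst is unreachable.
    """
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, latency in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + latency, neighbor, path + [neighbor]))
    return None


if __name__ == "__main__":
    # Hypothetical inter-site latencies in milliseconds.
    links = {
        "bj": {"sh": 30, "gz": 45},
        "sh": {"gz": 25, "sg": 70},
        "gz": {"sg": 40},
        "sg": {},
    }
    print(lowest_latency_path(links, "bj", "sg"))
```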
Machine‑Learning‑Based Algorithms
Network: intelligent congestion control, bandwidth estimation, routing.
Video: virtual backgrounds, super‑resolution, video fusion, deepfake.
Audio: speech recognition, enhancement.
VR, AR, and 3D
Combining virtual reality, augmented reality, and 3D technologies promises immersive conference experiences where participants feel as if they share a physical meeting room.
Qingyun Technology Community
Official account of the Qingyun Technology Community, focusing on tech innovation, supporting developers, and sharing knowledge. Born to Learn and Share!
