
Introduction to WebRTC Architecture, Core Concepts, and Multi‑Party Communication Solutions

This article provides a comprehensive overview of WebRTC, covering its origin, its core architecture layers, basic audio‑video capture concepts, the process of a one‑to‑one real‑time call, and a comparison of three multi‑party communication architectures (Mesh, MCU, and SFU), highlighting their advantages and drawbacks.


1. WebRTC Introduction

WebRTC (Web Real‑Time Communication) is a technology that Google open‑sourced after acquiring Global IP Solutions in 2010. It provides core real‑time audio‑video capabilities such as capture, encoding/decoding, network transport, and rendering, and runs on browsers, desktop applications, Android, iOS, and IoT devices, as long as the device has an IP network connection.

WebRTC Architecture Diagram

Diagram Color Legend

Purple and light‑purple: Web developer API layer.

Solid blue: Browser‑vendor API layer.

Dashed blue: Browser‑vendor customizable implementation.

Web API

The Web API layer exposes JavaScript APIs to application developers, letting them build peer‑to‑peer audio‑video communication without dealing with the low‑level details of the underlying stack.

WebRTC C++ API

This layer provides C++ interfaces used by browsers to implement the WebRTC specification.

Session Management

Handles session creation, context management, and media negotiation: the two sides exchange SDP (Session Description Protocol) messages that drive the RTCPeerConnection state. Note that WebRTC itself does not mandate a signaling transport; applications choose their own, commonly WebSocket.
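To make the negotiation messages concrete, here is a heavily trimmed, illustrative SDP body and a tiny helper that lists its media sections. The SDP text below is a simplified sketch, not a complete offer; real offers carry many more attributes.

```javascript
// A trimmed, illustrative SDP body (real offers are much longer).
const sdp = [
  "v=0",
  "o=- 46117317 2 IN IP4 127.0.0.1",
  "s=-",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111",
  "a=rtpmap:111 opus/48000/2",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");

// Each "m=" line opens a media section; the token after "m=" names its kind.
function mediaKinds(sdpText) {
  return sdpText
    .split(/\r?\n/)
    .filter((line) => line.startsWith("m="))
    .map((line) => line.slice(2).split(" ")[0]);
}

console.log(mediaKinds(sdp)); // [ 'audio', 'video' ]
```

During negotiation, each side inspects sections like these to learn which codecs and media kinds the peer supports.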

Engine Layer

Divided into three sub‑modules: Voice Engine (audio), Video Engine (video), and Transport (network). The P2P module relies on STUN, TURN, and ICE for NAT traversal.

Driver Layer

Consists of Audio Capture/Render, Video Capture, and Network I/O modules.

2. Basic Concepts of Audio‑Video Capture

Camera: Captures images and video.

Frame Rate: Number of frames captured per second; higher rates yield smoother video but consume more bandwidth.

Resolution: Determines image clarity; higher resolutions require more bandwidth and are often adjusted dynamically based on network conditions.

Aspect Ratio: Commonly 16:9 (modern) or 4:3 (legacy).

Microphone: Captures audio; characterized by sample rate and bit depth.

Track: An independent media component of a stream (e.g., an audio track or a video track); tracks are carried separately and do not overlap with one another.

Stream: A container for one or more tracks; a MediaStream carries audio/video tracks, while arbitrary application data is sent separately over data channels (RTCDataChannel).
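The interplay of resolution and frame rate with bandwidth can be made concrete with a back‑of‑envelope estimate. The bits‑per‑pixel factor below is an assumed rule of thumb for compressed video (roughly 0.05 to 0.15 for modern codecs), not a WebRTC‑defined constant:

```javascript
// Rough rule-of-thumb bitrate estimate for compressed video:
// bits per second ≈ width × height × frameRate × bitsPerPixel.
// bitsPerPixel is an assumed heuristic, not a standardized value.
function estimateBitrateKbps(width, height, frameRate, bitsPerPixel = 0.1) {
  return Math.round((width * height * frameRate * bitsPerPixel) / 1000);
}

console.log(estimateBitrateKbps(1280, 720, 30)); // 2765 kbps for 720p at 30 fps
console.log(estimateBitrateKbps(640, 480, 15));  // 461 kbps for VGA at 15 fps
```

This is why implementations lower resolution or frame rate when network conditions degrade: both enter the bandwidth cost multiplicatively.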

3. One‑to‑One Real‑Time Audio‑Video Call Process

The diagram consists of two WebRTC endpoints, a signaling server, and a STUN/TURN server.

WebRTC endpoint: Handles capture, encoding/decoding, NAT traversal, and media transport.

Signaling server: Manages signaling messages such as room join/leave and media negotiation.

STUN/TURN server: Provides public IP discovery and relays media when direct NAT traversal fails.
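The STUN/TURN servers are handed to the endpoint as configuration. Below is a sketch of such a configuration object; the server addresses and credentials are hypothetical placeholders, not real services:

```javascript
// Hypothetical server addresses and credentials -- substitute your own.
const rtcConfig = {
  iceServers: [
    // STUN: lets the endpoint discover its public IP/port mapping.
    { urls: "stun:stun.example.com:3478" },
    // TURN: relays media when direct NAT traversal fails.
    {
      urls: "turn:turn.example.com:3478",
      username: "demo-user",
      credential: "demo-pass",
    },
  ],
};

// In the browser, this config is passed to the peer connection:
//   const pc = new RTCPeerConnection(rtcConfig);
```

ICE then gathers candidates from all configured servers and prefers direct paths, falling back to the TURN relay only when necessary.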

When a user joins a room, the device is checked for availability, then audio‑video capture starts. The captured media can be previewed locally or recorded for later upload.

After media is ready, the endpoint sends a “join” message to the signaling server, which creates the room. When the second endpoint joins the same room, the server notifies the first endpoint with a “peer joined” message.

The first endpoint creates an RTCPeerConnection object, encodes the captured media, and attempts P2P transmission. If NAT traversal fails, the media is relayed via the TURN server.

The remote endpoint decodes the received media and renders it, completing the one‑to‑one call. For bidirectional communication, both sides exchange media through their respective RTCPeerConnection objects.
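The signaling server's responsibilities in the flow above can be sketched as a small in‑memory message router. The message names (“join”, “peer-joined”, “offer”) are illustrative, and a real deployment would run this over a network transport such as WebSocket:

```javascript
// Minimal in-memory signaling server sketch (message names are illustrative).
class SignalServer {
  constructor() {
    this.rooms = new Map(); // roomId -> Map(peerId -> inbox array)
  }

  join(roomId, peerId) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, new Map());
    const room = this.rooms.get(roomId);
    // Notify everyone already in the room that a new peer arrived.
    for (const inbox of room.values()) {
      inbox.push({ type: "peer-joined", from: peerId });
    }
    room.set(peerId, []);
  }

  // Relay a negotiation message (offer/answer/ICE candidate) to one peer.
  relay(roomId, from, to, msg) {
    this.rooms.get(roomId).get(to).push({ ...msg, from });
  }

  inbox(roomId, peerId) {
    return this.rooms.get(roomId).get(peerId);
  }
}

const server = new SignalServer();
server.join("room-1", "alice");
server.join("room-1", "bob"); // alice is notified that bob joined
server.relay("room-1", "alice", "bob", { type: "offer", sdp: "..." });

console.log(server.inbox("room-1", "alice")); // [ { type: 'peer-joined', from: 'bob' } ]
console.log(server.inbox("room-1", "bob"));   // [ { type: 'offer', sdp: '...', from: 'alice' } ]
```

Note that the server only routes messages; the media itself flows peer to peer (or via TURN), never through the signaling server.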

4. Multi‑Party Audio‑Video Live Streaming Architectures

Common WebRTC multi‑party architectures include Mesh, MCU, and SFU.

Mesh (Network Topology) Scheme

Each participant establishes a direct connection with every other participant, forming a full mesh. This approach incurs high bandwidth usage on each client because media streams are sent individually to all peers.

Advantages: no media server is needed; client bandwidth is used instead, which reduces server cost. Disadvantages: each client's uplink bandwidth grows linearly with the number of participants, which limits the scheme to a handful of users.
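The scaling limit follows directly from the combinatorics. For n participants, the number of pairwise connections and the uplink load per client can be computed as:

```javascript
// In a full mesh of n participants:
//   pairwise connections        = n(n-1)/2
//   uplink streams per client   = n-1 (one copy of its media to every peer)
function meshConnections(n) {
  return (n * (n - 1)) / 2;
}
function uplinkStreamsPerClient(n) {
  return n - 1;
}

console.log(meshConnections(4));        // 6
console.log(uplinkStreamsPerClient(4)); // 3
console.log(meshConnections(10));       // 45 -- why mesh stops scaling
```

At 10 participants, every client must encode and upload 9 copies of its stream, which exhausts a typical residential uplink long before the server would have been a bottleneck.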

MCU (Multipoint Control Unit) Scheme

The MCU receives media streams from all participants, decodes them, mixes audio/video, re‑encodes the combined stream, and distributes it back to every participant.

Advantages: abstracts codec differences, provides a unified stream. Disadvantages: heavy CPU load due to decoding, mixing, and re‑encoding.

SFU (Selective Forwarding Unit) Scheme

The SFU forwards each incoming media stream to all other participants without mixing. It acts as a router, minimizing CPU usage.

Advantages: low CPU consumption and low latency. Disadvantages: potential synchronization issues and increased client‑side rendering complexity when handling multiple streams.
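The MCU/SFU trade-off can be summarized with a back‑of‑envelope stream count. Assuming each of n participants publishes one stream, the server's outgoing stream counts differ sharply:

```javascript
// Server-side egress stream counts for n participants, each publishing
// one stream (a simplifying assumption; simulcast changes the numbers):
//   MCU: sends one mixed stream back to each participant      -> n
//   SFU: forwards each stream to the other n-1 participants   -> n(n-1)
function mcuEgressStreams(n) {
  return n;
}
function sfuEgressStreams(n) {
  return n * (n - 1);
}

console.log(mcuEgressStreams(5)); // 5
console.log(sfuEgressStreams(5)); // 20
```

In other words, the SFU trades server bandwidth (and client decode work) for the CPU the MCU would have spent on decoding, mixing, and re‑encoding, which is why SFU is the dominant architecture for latency‑sensitive conferencing today.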


Tags: MCU, Video Streaming, Real‑Time Communication, p2p, webrtc, Media Architecture, SFU
Written by

360 Smart Cloud

Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.
