Backend Development · 15 min read

Technical Architecture Evolution for Real-Time Multi-Party Audio and Video Streaming

The article details the progressive architectural decisions and technical solutions behind a real-time multi‑party audio/video streaming platform, covering format research, backend service design, protocol choices, codec selection, node topology redesign, and video integration to achieve low‑latency, cross‑platform communication.


The author, co‑founder of Red Dot Live, shares the architectural exploration behind the product's real‑time connection scenarios, focusing on the choices made and how they evolved during development.

In the first version, the product required iOS recording and cross‑platform playback, which led to research on supported audio live‑stream formats such as HLS and HTTP MP3 streams: HLS played back reliably but with latency consistently above 8 seconds, while HTTP MP3 streams were supported on more than 90% of Android devices.

Due to Android fragmentation, the team adopted HLS for PC and iOS while using HTTP MP3 streams for Android, implementing a single‑machine backend that accepted MP3 uploads, generated HLS segments, and served them via Nginx.

Later, a history‑playback feature was added by uploading finished media files to UPYUN, leveraging third‑party storage to reduce cost and development time.

The next major update introduced multi‑person voice, requiring mixing, a low‑latency codec, UDP‑based transport, and echo cancellation (AEC). MP3 was rejected because of its ~200 ms encoding delay, and Opus was chosen for its superior compression and low latency.
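The mixing requirement is the simplest of those pieces to illustrate: decoded 16‑bit PCM frames from each speaker are summed sample by sample and clamped to avoid wrap‑around distortion. This is a sketch of the idea, not the article's actual implementation:

```go
package main

import "fmt"

// mixPCM mixes several 16-bit PCM frames of equal length by summing
// corresponding samples and clamping to the int16 range — the core
// operation a multi-party voice mixer performs per frame.
func mixPCM(frames [][]int16) []int16 {
	if len(frames) == 0 {
		return nil
	}
	out := make([]int16, len(frames[0]))
	for i := range out {
		var sum int32
		for _, f := range frames {
			sum += int32(f[i])
		}
		// Clamp instead of letting the sum wrap around.
		switch {
		case sum > 32767:
			sum = 32767
		case sum < -32768:
			sum = -32768
		}
		out[i] = int16(sum)
	}
	return out
}

func main() {
	a := []int16{1000, 30000, -30000}
	b := []int16{2000, 10000, -10000}
	fmt.Println(mixPCM([][]int16{a, b})) // [3000 32767 -32768]
}
```

In the real pipeline this sits between the Opus decoder and encoder; Opus's 20 ms default frame size is what makes the end‑to‑end budget workable where MP3's ~200 ms encoding delay did not.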

Key related projects and protocols were evaluated: FFmpeg for media processing, WebRTC for browser‑based real‑time communication, NetEQ and AECM for jitter control and echo cancellation, RTP/RTCP for transport, and Live555 as an open‑source RTP/RTSP server.
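The jitter-control role NetEQ plays can be pictured with a toy reorder buffer: packets arrive over UDP out of order, and the buffer releases them to the decoder in sequence. (NetEQ itself also time-stretches audio to adapt the buffer depth; that is omitted here, and this sketch is not NetEQ's API.)

```go
package main

import "fmt"

// JitterBuffer reorders UDP audio packets by sequence number before
// decoding — a toy version of what WebRTC's NetEQ does.
type JitterBuffer struct {
	packets map[uint16][]byte
	nextSeq uint16
}

func NewJitterBuffer(firstSeq uint16) *JitterBuffer {
	return &JitterBuffer{packets: map[uint16][]byte{}, nextSeq: firstSeq}
}

func (j *JitterBuffer) Push(seq uint16, payload []byte) {
	j.packets[seq] = payload
}

// Pop returns the next in-order payload, or (nil, false) if it has not
// arrived yet; a real implementation would conceal the loss after a
// deadline rather than wait forever.
func (j *JitterBuffer) Pop() ([]byte, bool) {
	p, ok := j.packets[j.nextSeq]
	if !ok {
		return nil, false
	}
	delete(j.packets, j.nextSeq)
	j.nextSeq++
	return p, true
}

func main() {
	jb := NewJitterBuffer(100)
	jb.Push(101, []byte("B")) // arrives out of order
	jb.Push(100, []byte("A"))
	for {
		p, ok := jb.Pop()
		if !ok {
			break
		}
		fmt.Printf("%s", p) // prints A then B
	}
}
```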

The team compared a self‑developed solution with a WebRTC‑based approach, noting that while WebRTC offers a complete media pipeline, it lacks a mature server component and incurs higher integration costs, leading to the decision to continue with the custom solution.

For the multi‑person voice service, a three‑module node (Room, Master, Slave) was initially used, forming a fully connected mesh that simplified reliability but prevented cross‑data‑center deployment.

Subsequently, the node architecture was refactored into a tree topology with Room and Client modules, unique node IDs, and etcd for service coordination, enabling scalable cross‑data‑center deployment.
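The key property of the tree over the mesh is that each inter-node link carries a stream exactly once, so a cross-data-center link is traversed one time regardless of how many listeners sit behind it. A minimal sketch of that fan-out (in the real system node IDs would be registered and discovered via etcd; the types and names here are illustrative):

```go
package main

import "fmt"

// Node is one process in the tree topology: the root is the Room
// module that owns the stream, children are Client modules (possibly
// in other data centers) that relay it further down.
type Node struct {
	ID       string
	Children []*Node
}

// Broadcast walks the tree and delivers the frame to every node
// exactly once — unlike a full mesh, no link carries the same frame
// more than one time.
func (n *Node) Broadcast(frame []byte, deliver func(id string, frame []byte)) {
	deliver(n.ID, frame)
	for _, c := range n.Children {
		c.Broadcast(frame, deliver)
	}
}

func main() {
	// Two data centers hanging off one room node.
	root := &Node{ID: "room-1", Children: []*Node{
		{ID: "dc1-client", Children: []*Node{{ID: "dc1-edge"}}},
		{ID: "dc2-client"},
	}}
	root.Broadcast([]byte("frame"), func(id string, _ []byte) {
		fmt.Println("delivered to", id)
	})
}
```

Unique node IDs make the parent/child relationships unambiguous across data centers, which is exactly the bookkeeping a coordination service like etcd is used for.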

Video support was later added, requiring low latency, retransmission, forward error correction (FEC), and a private protocol that converts streams to RTMP for CDN distribution. The design emphasized H.264 frame handling, polynomial‑based FEC, time‑bounded retransmission, audio‑centric synchronization, and avoiding server‑side mixing to keep delay down.
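The article's FEC is polynomial-based (Reed-Solomon-style codes can recover multiple losses per group); the single-loss special case is plain XOR parity, which is enough to show the mechanism. A hedged sketch, not the platform's actual code:

```go
package main

import "fmt"

// xorParity computes one parity packet over a group of equal-length
// media packets. XOR parity is the one-loss special case of the
// polynomial-based FEC the article describes.
func xorParity(packets [][]byte) []byte {
	parity := make([]byte, len(packets[0]))
	for _, p := range packets {
		for i, b := range p {
			parity[i] ^= b
		}
	}
	return parity
}

// recoverLost rebuilds a single lost packet from the survivors plus
// the parity packet: XOR-ing everything that did arrive yields the
// missing packet.
func recoverLost(survivors [][]byte, parity []byte) []byte {
	return xorParity(append(survivors, parity))
}

func main() {
	p0, p1, p2 := []byte("aaa"), []byte("bbb"), []byte("ccc")
	parity := xorParity([][]byte{p0, p1, p2})
	// Suppose p1 is lost in transit; it can be rebuilt without a retransmit:
	fmt.Println(string(recoverLost([][]byte{p0, p2}, parity))) // bbb
}
```

FEC like this complements time-bounded retransmission: parity covers isolated losses with zero extra round trips, and retransmission is only attempted while the frame can still arrive before its playout deadline.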

Overall, the article chronicles the iterative architectural changes undertaken to improve real‑time multi‑party audio and video communication across mobile and web platforms.

Tags: Real-time Streaming, HLS, WebRTC, Backend Services, mobile platforms, audio/video architecture, Opus codec
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
