Fundamentals of Audio and Video Processing, Compression, and Streaming Protocols
This article provides a comprehensive overview of audio and video fundamentals, including signal conversion, PCM encoding, compression techniques, spatial audio concepts, video encoding standards such as H.264/H.265, streaming protocols, bitrate control, and practical optimization algorithms for both audio and video pipelines.
Fundamentals
Audio Basics
Sound is an energy wave: pitch is determined by frequency, loudness by amplitude (and attenuates with distance), and timbre by waveform. Converting sound to a digital signal involves three steps: acoustic waves reach a microphone diaphragm, the diaphragm's motion produces an analogue electrical signal, and an ADC converts that analogue signal to a digital one.
Digital audio A/D conversion consists of sampling, quantization, and encoding. PCM (Pulse Code Modulation) samples a continuous analogue signal at discrete time intervals and encodes the quantized values into binary code groups for transmission.
Sampling must satisfy the Nyquist theorem (sampling rate ≥ 2× the highest signal frequency) to reconstruct the signal without aliasing; since human hearing extends to roughly 20 kHz, audio is typically sampled at 44.1–48 kHz.
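As a sketch of the sampling/quantization/encoding steps, the following samples a 440 Hz sine wave and quantizes it to 16-bit PCM codes (`pcm_encode` is an illustrative helper, not from the article):

```python
import math

def pcm_encode(duration_s=0.01, freq_hz=440.0, sample_rate=44100, bit_depth=16):
    """Sample a sine wave and quantize it to signed-integer PCM codes."""
    max_code = 2 ** (bit_depth - 1) - 1           # e.g. 32767 for 16-bit audio
    n_samples = int(duration_s * sample_rate)
    codes = []
    for n in range(n_samples):
        t = n / sample_rate                       # sampling: discrete time steps
        amplitude = math.sin(2 * math.pi * freq_hz * t)
        codes.append(round(amplitude * max_code)) # quantization + binary encoding
    return codes

codes = pcm_encode()
print(len(codes))  # 441 samples for 10 ms at 44.1 kHz
```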
Audio Compression
PCM is the lossless “raw” encoding; audio compression applies a second layer of encoding to reduce storage size. Lossless compression (e.g., FLAC, ALAC) preserves original quality, while lossy compression (e.g., MP3, AAC, OGG) discards perceptually redundant information based on psychoacoustic masking.
Video Basics
Encoding Principles
Video consists of a sequence of frames displayed at a given frame rate (FPS). Bitrate largely determines visual quality and required bandwidth. The raw (uncompressed) bitrate is: bitrate = width × height × colour depth × fps.
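The raw bitrate is easy to compute directly; this also shows why compression is essential (`raw_bitrate` is an illustrative helper):

```python
def raw_bitrate(width, height, bits_per_pixel, fps):
    """Uncompressed video bitrate in bits per second."""
    return width * height * bits_per_pixel * fps

# 1080p at 24-bit colour and 30 fps needs ~1.49 Gbit/s before compression.
bps = raw_bitrate(1920, 1080, 24, 30)
print(bps)  # 1492992000
```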
Video compression works by removing several kinds of redundancy: spatial, temporal, coding (statistical), visual (perceptual), and knowledge redundancy.
Video Compression
H.264/AVC uses intra‑frame (spatial) and inter‑frame (temporal) compression. Intra‑compression (I‑frames) resembles JPEG; inter‑compression (P‑ and B‑frames) predicts differences between frames to reduce data.
Encoding steps: grouping frames into GOPs, defining frame types (I, P, B), predicting frames, and transmitting residual data.
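The grouping step can be sketched as follows; `gop_frame_types` is a hypothetical helper that assigns an I-frame followed by repeating B…BP runs (real encoders choose frame types adaptively from content):

```python
def gop_frame_types(gop_size=12, b_frames=2):
    """Assign I/P/B types within one GOP: one I-frame, then repeated B..B,P runs."""
    types = ["I"]
    while len(types) < gop_size:
        run = ["B"] * b_frames + ["P"]
        types.extend(run[: gop_size - len(types)])  # truncate the last run if needed
    return types

print(gop_frame_types())
# ['I', 'B', 'B', 'P', 'B', 'B', 'P', 'B', 'B', 'P', 'B', 'B']
```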
Bitstream Structure
An H.264 bitstream is organized in two layers: the Video Coding Layer (VCL) holds the core compressed data (slice syntax and macroblocks), while the Network Abstraction Layer (NAL) packages VCL output into units suitable for storage and network transport.
In H.264, macroblocks are fixed 16×16 blocks; H.265 replaces them with Coding Tree Units (CTUs) of up to 64×64 pixels that are recursively split into smaller coding and transform blocks (down to 4×4), using both DCT and DST transforms for higher efficiency.
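As an illustration of the NAL layer, the 5-bit `nal_unit_type` can be read from the first header byte of each H.264 NAL unit (`parse_nal_header` and the abbreviated type table below are illustrative):

```python
# H.264 NAL header byte: forbidden_zero_bit (1), nal_ref_idc (2), nal_unit_type (5).
NAL_TYPES = {1: "non-IDR slice", 5: "IDR slice", 7: "SPS", 8: "PPS"}

def parse_nal_header(header_byte):
    forbidden = (header_byte >> 7) & 0x01
    ref_idc = (header_byte >> 5) & 0x03
    nal_type = header_byte & 0x1F
    return forbidden, ref_idc, NAL_TYPES.get(nal_type, f"type {nal_type}")

print(parse_nal_header(0x67))  # (0, 3, 'SPS')
print(parse_nal_header(0x65))  # (0, 3, 'IDR slice')
```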
Audio Optimization
Noise Reduction
Noise can be additive or multiplicative. Time‑domain denoising (e.g., moving average, median) smooths signals, while frequency‑domain methods (FFT, filtering) remove specific frequency components. Wavelet denoising thresholds wavelet coefficients, and adaptive filters (Wiener, Kalman, LMS, RLS) adjust to signal statistics.
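A minimal time-domain example, assuming a simple centred moving-average window (`moving_average` is illustrative):

```python
def moving_average(signal, window=3):
    """Time-domain smoothing: each sample becomes the mean of its neighbourhood."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))  # window shrinks at the edges
    return out

noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(moving_average(noisy))  # high-frequency alternation is flattened
```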
Echo Cancellation
Acoustic Echo Cancellation (AEC) uses adaptive filtering to estimate and subtract echo paths, handling both circuit and acoustic echo.
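A bare-bones sketch of the idea using an LMS adaptive filter (`lms_echo_cancel` is illustrative; production AEC adds delay estimation, double-talk detection, and nonlinear processing):

```python
def lms_echo_cancel(far_end, mic, taps=4, mu=0.05):
    """LMS adaptive filter: estimate the echo of the far-end signal present in
    the microphone signal and subtract it, returning the residual."""
    w = [0.0] * taps                              # adaptive filter weights
    residual = []
    for n in range(len(mic)):
        # Filter input: the most recent `taps` far-end samples.
        x = [far_end[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_est = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[n] - echo_est                     # error = mic minus estimated echo
        w = [wk + mu * e * xk for wk, xk in zip(w, x)]  # LMS weight update
        residual.append(e)
    return residual

# If the mic picks up a scaled copy of the far-end signal, the residual decays
# as the filter converges on the echo path.
far = [1.0, 0.0] * 50
mic = [0.5 * s for s in far]
res = lms_echo_cancel(far, mic)
print(abs(res[0]), abs(res[98]))  # residual shrinks over time
```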
Loudness Normalization
Loudness (perceived volume) differs from raw amplitude; standards such as LKFS/LUFS provide industry‑wide normalization (e.g., Spotify ‑14 LKFS, Apple Music ‑16 LKFS).
| Audio Output | LKFS Normalization |
| --- | --- |
| Spotify | -14 LKFS |
| Apple Music | -16 LKFS |
| Amazon Music | -9 to -13 LKFS |
| YouTube | -13 to -15 LKFS |
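Applying a platform target is then a simple gain computation (the helpers below are illustrative; measuring integrated loudness itself requires the BS.1770 filter chain):

```python
def normalization_gain_db(measured_lufs, target_lufs=-14.0):
    """Gain in dB needed to bring a track's integrated loudness to the target."""
    return target_lufs - measured_lufs

def db_to_linear(gain_db):
    """Convert a dB gain to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20.0)

# A track measured at -10 LUFS must be attenuated by 4 dB for Spotify's -14 target.
gain = normalization_gain_db(-10.0, target_lufs=-14.0)
print(gain, db_to_linear(gain))  # -4.0, ~0.631
```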
Spatial Audio
Spatial audio creates immersive 3D sound using interaural time difference (ITD), interaural level difference (ILD), head‑related transfer functions (HRTFs), and head movement cues.
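For ITD specifically, Woodworth's spherical-head approximation is a common estimate (the head-radius constant below is an assumed average):

```python
import math

SPEED_OF_SOUND = 343.0   # m/s in air
HEAD_RADIUS = 0.0875     # m, assumed average head radius

def itd_seconds(azimuth_deg):
    """Woodworth approximation: ITD = (a / c) * (theta + sin(theta))."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

print(itd_seconds(90) * 1e6)  # source at the side: roughly 650 microseconds
```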
Video Optimization
Rate Control Algorithms
Rate control balances bitrate allocation against quantization parameter (QP) adjustment: a higher QP quantizes more coarsely and spends fewer bits. Common modes include CBR, VBR, and ABR, with hierarchical allocation at the GOP, frame, and basic-unit levels.
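A toy feedback step, assuming a ±10% tolerance band around a per-frame bit budget (`adjust_qp` is illustrative; real controllers use rate-distortion models):

```python
def adjust_qp(qp, actual_bits, target_bits, qp_min=0, qp_max=51):
    """One step of a simple feedback rate controller: raise QP (coarser
    quantization, fewer bits) when over budget, lower it when under."""
    if actual_bits > target_bits * 1.1:
        qp += 1
    elif actual_bits < target_bits * 0.9:
        qp -= 1
    return max(qp_min, min(qp_max, qp))   # clamp to the H.264/H.265 QP range

print(adjust_qp(26, actual_bits=120_000, target_bits=100_000))  # over budget -> 27
print(adjust_qp(26, actual_bits=80_000, target_bits=100_000))   # under budget -> 25
```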
Jitter Buffer
A jitter buffer absorbs network delay variation and packet reordering by holding incoming packets briefly before decoding, trading added latency for playback stability; the extra time also gives retransmission or concealment a chance to cover lost packets.
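A minimal sketch, assuming packets carry a sequence number and the buffer releases packets only once a fixed minimum depth is held:

```python
import heapq

class JitterBuffer:
    """Reorder out-of-order packets and release them in sequence once a
    minimum depth has been buffered (latency traded for stability)."""

    def __init__(self, min_depth=3):
        self.min_depth = min_depth
        self.heap = []                   # min-heap keyed by sequence number

    def push(self, seq, payload):
        heapq.heappush(self.heap, (seq, payload))

    def pop_ready(self):
        """Release buffered packets in sequence order, keeping min_depth held."""
        out = []
        while len(self.heap) > self.min_depth:
            out.append(heapq.heappop(self.heap))
        return out

buf = JitterBuffer(min_depth=2)
for seq in [3, 1, 2, 5, 4]:              # packets arrive out of order
    buf.push(seq, f"pkt{seq}")
print(buf.pop_ready())                   # released in sequence order
```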
Scalable Video Coding (SVC)
SVC encodes video in multiple layers (spatial, temporal, quality) allowing adaptive streaming based on available bandwidth.
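Layer selection under a bandwidth constraint might look like this (the layer table, names, and bitrates are hypothetical):

```python
# Hypothetical ladder: cumulative bitrate needed to decode up to each layer.
SVC_LAYERS = [
    {"name": "base 360p",     "kbps": 400},
    {"name": "spatial 720p",  "kbps": 1500},
    {"name": "quality 1080p", "kbps": 4000},
]

def select_layer(available_kbps):
    """Pick the highest enhancement layer the current bandwidth can sustain;
    the base layer is always the floor."""
    chosen = SVC_LAYERS[0]
    for layer in SVC_LAYERS:
        if layer["kbps"] <= available_kbps:
            chosen = layer
    return chosen["name"]

print(select_layer(2000))  # 'spatial 720p'
print(select_layer(300))   # 'base 360p'
```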
Industry Standards
Codec Families
Major codec families include ISO‑MPEG/ITU‑T (H.264/H.265), AOM (AV1, VP9), and AVS (Chinese national standards). H.264/AVC introduced I, P, B frames; H.265/HEVC adds CTU‑based coding for higher compression.
Container Formats
Common containers: MP4 (box/atom structure), MOV, AVI, MKV, OGV, WebM, FLV, QLV. Each encapsulates video, audio, subtitles, and metadata.
Streaming Protocols
HLS (HTTP-based, using .m3u8 playlists and .ts segments) offers broad compatibility but higher latency. RTMP (TCP-based) provides low latency for live ingest (push). RTSP, a control protocol whose media is typically carried over RTP, is common in surveillance. MPEG-DASH is another adaptive HTTP streaming standard. A minimal HLS playlist looks like this:
#EXTM3U
#EXT-X-VERSION:3 // version info
#EXT-X-TARGETDURATION:11 // max segment duration
#EXT-X-MEDIA-SEQUENCE:0 // start sequence number
#EXTINF:10.5,
index0.ts
#EXTINF:9.6,
index1.ts

Application Scenarios
Human Review Business
Audio‑volume balancing improves reviewer comfort by dynamically adjusting DynamicsCompressorNode parameters during playback.
Live Streaming
Platforms like Douyu, Huya, and Bilibili use a mix of HLS, RTMP, and WebRTC (for P2P or chat). Douyu employs HTTP‑FLV with .xs sub‑streams; Huya relies on pure HLS; Bilibili uses HLS with M4S segments.
Real‑Time Meetings
WebRTC underpins modern video conferencing (e.g., Tencent Meeting, Feishu, DingTalk). Tencent’s xCast engine with Pere protocol handles cross‑platform media transport, converting streams to SIP, TencentRTC, or WebRTC as needed.
Overall, the article details the theoretical foundations, compression algorithms, standards, and practical implementations that underpin modern audio‑video systems.
Rare Earth Juejin Tech Community