Fundamentals of Audio and Video Processing, Compression, and Streaming Protocols
This article provides a comprehensive overview of audio and video fundamentals, including signal conversion, PCM encoding, compression techniques, spatial audio concepts, video encoding standards such as H.264/H.265, streaming protocols, bitrate control, and practical optimization algorithms for both audio and video pipelines.
Fundamentals
Audio Basics
Sound is an energy wave: pitch is determined by frequency, loudness by amplitude (and attenuates with distance), and timbre by waveform. Converting sound to a digital signal involves three steps: acoustic waves reach a microphone diaphragm, the diaphragm's motion produces an analogue electrical signal, and an ADC converts that analogue signal to a digital one.
Digital audio A/D conversion consists of sampling, quantization, and encoding. PCM (Pulse Code Modulation) samples a continuous analogue signal at discrete time intervals and encodes the quantized values into binary code groups for transmission.
Sampling must satisfy the Nyquist theorem (sampling rate ≥ 2× the highest signal frequency) to reconstruct the signal without aliasing; since human hearing extends to roughly 20 kHz, audio is typically sampled at 44.1–48 kHz.
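As a sketch of the sampling/quantization/encoding steps, the following samples a 440 Hz sine wave and quantizes it to 16-bit PCM codes (`pcm_encode` is an illustrative helper, not from the article):

```python
import math

def pcm_encode(duration_s=0.01, freq_hz=440.0, sample_rate=44100, bit_depth=16):
    """Sample a sine wave and quantize it to signed-integer PCM codes."""
    max_code = 2 ** (bit_depth - 1) - 1           # e.g. 32767 for 16-bit audio
    n_samples = int(duration_s * sample_rate)
    codes = []
    for n in range(n_samples):
        t = n / sample_rate                       # sampling: discrete time steps
        amplitude = math.sin(2 * math.pi * freq_hz * t)
        codes.append(round(amplitude * max_code)) # quantization + binary encoding
    return codes

codes = pcm_encode()
print(len(codes))  # 441 samples for 10 ms at 44.1 kHz
```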
Audio Compression
PCM is the lossless “raw” encoding; audio compression applies a second layer of encoding to reduce storage size. Lossless compression (e.g., FLAC, ALAC) preserves original quality, while lossy compression (e.g., MP3, AAC, OGG) discards perceptually redundant information based on psychoacoustic masking.
Video Basics
Encoding Principles
Video consists of a sequence of frames displayed at a given frame rate (FPS). Bitrate largely determines visual quality and required bandwidth. The raw (uncompressed) bitrate is: bitrate = width × height × colour depth × fps.
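The raw bitrate is easy to compute directly; this also shows why compression is essential (`raw_bitrate` is an illustrative helper):

```python
def raw_bitrate(width, height, bits_per_pixel, fps):
    """Uncompressed video bitrate in bits per second."""
    return width * height * bits_per_pixel * fps

# 1080p at 24-bit colour and 30 fps needs ~1.49 Gbit/s before compression.
bps = raw_bitrate(1920, 1080, 24, 30)
print(bps)  # 1492992000
```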
Video compression works by removing several kinds of redundancy: spatial, temporal, coding (statistical), visual (perceptual), and knowledge redundancy.
Video Compression
H.264/AVC uses intra‑frame (spatial) and inter‑frame (temporal) compression. Intra‑compression (I‑frames) resembles JPEG; inter‑compression (P‑ and B‑frames) predicts differences between frames to reduce data.
Encoding steps: grouping frames into GOPs, defining frame types (I, P, B), predicting frames, and transmitting residual data.
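The grouping step can be sketched as follows; `gop_frame_types` is a hypothetical helper that assigns an I-frame followed by repeating B…BP runs (real encoders choose frame types adaptively from content):

```python
def gop_frame_types(gop_size=12, b_frames=2):
    """Assign I/P/B types within one GOP: one I-frame, then repeated B..B,P runs."""
    types = ["I"]
    while len(types) < gop_size:
        run = ["B"] * b_frames + ["P"]
        types.extend(run[: gop_size - len(types)])  # truncate the last run if needed
    return types

print(gop_frame_types())
# ['I', 'B', 'B', 'P', 'B', 'B', 'P', 'B', 'B', 'P', 'B', 'B']
```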
Bitstream Structure
An H.264 bitstream is organized in two layers: the Video Coding Layer (VCL) holds the core compressed data (slice syntax and macroblocks), while the Network Abstraction Layer (NAL) packages VCL output into units suitable for storage and network transport.
In H.264, macroblocks are fixed 16×16 blocks; H.265 replaces them with Coding Tree Units (CTUs) of up to 64×64 pixels that are recursively split into smaller coding and transform blocks (down to 4×4), using both DCT and DST transforms for higher efficiency.
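As an illustration of the NAL layer, the 5-bit `nal_unit_type` can be read from the first header byte of each H.264 NAL unit (`parse_nal_header` and the abbreviated type table below are illustrative):

```python
# H.264 NAL header byte: forbidden_zero_bit (1), nal_ref_idc (2), nal_unit_type (5).
NAL_TYPES = {1: "non-IDR slice", 5: "IDR slice", 7: "SPS", 8: "PPS"}

def parse_nal_header(header_byte):
    forbidden = (header_byte >> 7) & 0x01
    ref_idc = (header_byte >> 5) & 0x03
    nal_type = header_byte & 0x1F
    return forbidden, ref_idc, NAL_TYPES.get(nal_type, f"type {nal_type}")

print(parse_nal_header(0x67))  # (0, 3, 'SPS')
print(parse_nal_header(0x65))  # (0, 3, 'IDR slice')
```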
Audio Optimization
Noise Reduction
Noise can be additive or multiplicative. Time‑domain denoising (e.g., moving average, median) smooths signals, while frequency‑domain methods (FFT, filtering) remove specific frequency components. Wavelet denoising thresholds wavelet coefficients, and adaptive filters (Wiener, Kalman, LMS, RLS) adjust to signal statistics.
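A minimal time-domain example, assuming a simple centred moving-average window (`moving_average` is illustrative):

```python
def moving_average(signal, window=3):
    """Time-domain smoothing: each sample becomes the mean of its neighbourhood."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))  # window shrinks at the edges
    return out

noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(moving_average(noisy))  # high-frequency alternation is flattened
```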
Echo Cancellation
Acoustic Echo Cancellation (AEC) uses adaptive filtering to estimate and subtract echo paths, handling both circuit and acoustic echo.
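A bare-bones sketch of the idea using an LMS adaptive filter (`lms_echo_cancel` is illustrative; production AEC adds delay estimation, double-talk detection, and nonlinear processing):

```python
def lms_echo_cancel(far_end, mic, taps=4, mu=0.05):
    """LMS adaptive filter: estimate the echo of the far-end signal present in
    the microphone signal and subtract it, returning the residual."""
    w = [0.0] * taps                              # adaptive filter weights
    residual = []
    for n in range(len(mic)):
        # Filter input: the most recent `taps` far-end samples.
        x = [far_end[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_est = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[n] - echo_est                     # error = mic minus estimated echo
        w = [wk + mu * e * xk for wk, xk in zip(w, x)]  # LMS weight update
        residual.append(e)
    return residual

# If the mic picks up a scaled copy of the far-end signal, the residual decays
# as the filter converges on the echo path.
far = [1.0, 0.0] * 50
mic = [0.5 * s for s in far]
res = lms_echo_cancel(far, mic)
print(abs(res[0]), abs(res[98]))  # residual shrinks over time
```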
Loudness Normalization
Loudness (perceived volume) differs from raw amplitude; standards such as LKFS/LUFS provide industry‑wide normalization (e.g., Spotify ‑14 LKFS, Apple Music ‑16 LKFS).
| Audio Output | LKFS Normalization |
| --- | --- |
| Spotify | -14 LKFS |
| Apple Music | -16 LKFS |
| Amazon Music | -9 to -13 LKFS |
| YouTube | -13 to -15 LKFS |
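Applying a platform target is then a simple gain computation (the helpers below are illustrative; measuring integrated loudness itself requires the BS.1770 filter chain):

```python
def normalization_gain_db(measured_lufs, target_lufs=-14.0):
    """Gain in dB needed to bring a track's integrated loudness to the target."""
    return target_lufs - measured_lufs

def db_to_linear(gain_db):
    """Convert a dB gain to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20.0)

# A track measured at -10 LUFS must be attenuated by 4 dB for Spotify's -14 target.
gain = normalization_gain_db(-10.0, target_lufs=-14.0)
print(gain, db_to_linear(gain))  # -4.0, ~0.631
```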
Spatial Audio
Spatial audio creates immersive 3D sound using interaural time difference (ITD), interaural level difference (ILD), head‑related transfer functions (HRTFs), and head movement cues.
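For ITD specifically, Woodworth's spherical-head approximation is a common estimate (the head-radius constant below is an assumed average):

```python
import math

SPEED_OF_SOUND = 343.0   # m/s in air
HEAD_RADIUS = 0.0875     # m, assumed average head radius

def itd_seconds(azimuth_deg):
    """Woodworth approximation: ITD = (a / c) * (theta + sin(theta))."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

print(itd_seconds(90) * 1e6)  # source at the side: roughly 650 microseconds
```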
Video Optimization
Rate Control Algorithms
Rate control balances bitrate allocation against quantization parameter (QP) adjustment: a higher QP quantizes more coarsely and spends fewer bits. Common modes include CBR, VBR, and ABR, with hierarchical allocation at the GOP, frame, and basic-unit levels.
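A toy feedback step, assuming a ±10% tolerance band around a per-frame bit budget (`adjust_qp` is illustrative; real controllers use rate-distortion models):

```python
def adjust_qp(qp, actual_bits, target_bits, qp_min=0, qp_max=51):
    """One step of a simple feedback rate controller: raise QP (coarser
    quantization, fewer bits) when over budget, lower it when under."""
    if actual_bits > target_bits * 1.1:
        qp += 1
    elif actual_bits < target_bits * 0.9:
        qp -= 1
    return max(qp_min, min(qp_max, qp))   # clamp to the H.264/H.265 QP range

print(adjust_qp(26, actual_bits=120_000, target_bits=100_000))  # over budget -> 27
print(adjust_qp(26, actual_bits=80_000, target_bits=100_000))   # under budget -> 25
```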
Jitter Buffer
A jitter buffer absorbs network delay variation and packet reordering by holding incoming packets briefly before decoding, trading added latency for playback stability; the extra time also gives retransmission or concealment a chance to cover lost packets.
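A minimal sketch, assuming packets carry a sequence number and the buffer releases packets only once a fixed minimum depth is held:

```python
import heapq

class JitterBuffer:
    """Reorder out-of-order packets and release them in sequence once a
    minimum depth has been buffered (latency traded for stability)."""

    def __init__(self, min_depth=3):
        self.min_depth = min_depth
        self.heap = []                   # min-heap keyed by sequence number

    def push(self, seq, payload):
        heapq.heappush(self.heap, (seq, payload))

    def pop_ready(self):
        """Release buffered packets in sequence order, keeping min_depth held."""
        out = []
        while len(self.heap) > self.min_depth:
            out.append(heapq.heappop(self.heap))
        return out

buf = JitterBuffer(min_depth=2)
for seq in [3, 1, 2, 5, 4]:              # packets arrive out of order
    buf.push(seq, f"pkt{seq}")
print(buf.pop_ready())                   # released in sequence order
```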
Scalable Video Coding (SVC)
SVC encodes video in multiple layers (spatial, temporal, quality) allowing adaptive streaming based on available bandwidth.
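Layer selection under a bandwidth constraint might look like this (the layer table, names, and bitrates are hypothetical):

```python
# Hypothetical ladder: cumulative bitrate needed to decode up to each layer.
SVC_LAYERS = [
    {"name": "base 360p",     "kbps": 400},
    {"name": "spatial 720p",  "kbps": 1500},
    {"name": "quality 1080p", "kbps": 4000},
]

def select_layer(available_kbps):
    """Pick the highest enhancement layer the current bandwidth can sustain;
    the base layer is always the floor."""
    chosen = SVC_LAYERS[0]
    for layer in SVC_LAYERS:
        if layer["kbps"] <= available_kbps:
            chosen = layer
    return chosen["name"]

print(select_layer(2000))  # 'spatial 720p'
print(select_layer(300))   # 'base 360p'
```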
Industry Standards
Codec Families
Major codec families include ISO‑MPEG/ITU‑T (H.264/H.265), AOM (AV1, VP9), and AVS (Chinese national standards). H.264/AVC introduced I, P, B frames; H.265/HEVC adds CTU‑based coding for higher compression.
Container Formats
Common containers: MP4 (box/atom structure), MOV, AVI, MKV, OGV, WebM, FLV, QLV. Each encapsulates video, audio, subtitles, and metadata.
Streaming Protocols
HLS (HTTP-based, using .m3u8 playlists and .ts segments) offers broad compatibility but higher latency. RTMP (TCP-based) provides low latency for live ingest (push). RTSP, a control protocol whose media is typically carried over RTP, is common in surveillance. MPEG-DASH is another adaptive HTTP streaming standard. A minimal HLS playlist looks like this:
#EXTM3U
#EXT-X-VERSION:3 // version info
#EXT-X-TARGETDURATION:11 // max segment duration
#EXT-X-MEDIA-SEQUENCE:0 // start sequence number
#EXTINF:10.5,
index0.ts
#EXTINF:9.6,
index1.ts

Application Scenarios
Human Review Business
Audio‑volume balancing improves reviewer comfort by dynamically adjusting DynamicsCompressorNode parameters during playback.
Live Streaming
Platforms like Douyu, Huya, and Bilibili use a mix of HLS, RTMP, and WebRTC (for P2P or chat). Douyu employs HTTP‑FLV with .xs sub‑streams; Huya relies on pure HLS; Bilibili uses HLS with M4S segments.
Real‑Time Meetings
WebRTC underpins modern video conferencing (e.g., Tencent Meeting, Feishu, DingTalk). Tencent’s xCast engine with Pere protocol handles cross‑platform media transport, converting streams to SIP, TencentRTC, or WebRTC as needed.
Overall, the article details the theoretical foundations, compression algorithms, standards, and practical implementations that underpin modern audio‑video systems.
Rare Earth Juejin Tech Community