
How NetEQ Revolutionizes Audio Jitter Buffering and Packet‑Loss Concealment

This article explains NetEQ, the adaptive audio jitter buffer and packet‑loss concealment technology used in WebRTC, detailing its architecture, core modules, jitter‑estimation algorithms, decision logic, and speed‑change processing, and shows how a custom server‑side implementation can improve stability and listening experience.


1. NetEQ Overview

NetEQ, short for Network Equalizer, is essentially an adaptive audio jitter buffer and one of the two core technologies of the GIPS voice engine, combining adaptive jitter buffering with packet‑loss concealment. Google acquired the technology in 2010 and open‑sourced it as part of WebRTC in 2011.

The module integrates adaptive jitter control and loss‑concealment algorithms with the decoder, allowing high‑quality speech even under severe packet loss.

The diagram shows NetEQ’s four main components: Adaptive Packet Buffer, Speech Decoder, Jitter Control & Error Concealment, and Play‑out, with jitter control and concealment being the core.

1.1 Key Features

Significantly improves speech quality.

Reduces latency by 30‑80 ms compared with the best adaptive jitter buffers.

Deployed only on the receiving side.

Requires minimal network configuration.

Compatible with all standard speech codecs.

1.2 Adaptive Network Jitter Estimation

The blue line represents network delay, while the yellow line shows NetEQ’s response. Unlike traditional adaptive jitter buffers, NetEQ uses a non‑causal processing mechanism that detects delay variations and corrects speech errors before playback, achieving high‑quality audio.

2. Overall Architecture

2.1 Input Path

RTP packets are received and passed to NetEQ via InsertPacket.

If the packet contains RED redundancy, it is unpacked and duplicate original packets are ignored.

The original RTP packet is inserted into the Packet Buffer.

Inter‑arrival time (IAT) is calculated from timestamps and used as a jitter estimate.

IAT values are processed by the DelayManager to produce the target buffer level.
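The insert‑path bookkeeping above can be sketched as follows. This is a minimal illustration, not the actual WebRTC API; the helper name and parameters (`packet_len_ms`, `max_iat`) are assumptions:

```python
def iat_packets(prev_arrival_ms, arrival_ms, packet_len_ms=20, max_iat=64):
    """Inter-arrival time expressed in packet intervals, clamped to [0, max_iat].

    A hypothetical sketch: the gap between consecutive arrivals is divided by
    the nominal packet duration, so 1 means "on time" and larger values mean
    the packet was late by that many packet intervals.
    """
    iat = round((arrival_ms - prev_arrival_ms) / packet_len_ms)
    return max(0, min(max_iat, iat))
```

The DelayManager would then feed each clamped IAT into its probability histogram to derive the target buffer level.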

2.2 Output Path

A timer calls GetAudio every 10 ms to retrieve audio data.

Update PlayedOutTs and compute the number of samples left in the Sync Buffer.

Search the jitter buffer for the smallest timestamp greater than PlayedOutTs (availableTS) and discard late packets.

Calculate bufsize (samples left + buffered data) and derive BufferLevelFilter.

Based on BufferLevelFilter, bufsize, timestamps and previous playback mode, decide MCU control commands.

If needed, extract data from the jitter buffer into shared memory.

Generate DSP commands from MCU decisions.

Decoder reads from shared memory, decodes, and the DSP processes the data according to the playback mode.

Extract 10 ms of data from the voice buffer at curPosition, update the buffer and position, and output the audio.
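The output path above can be condensed into a toy 10 ms pull. All structures here are assumptions for illustration (dicts with `ts`/`pcm` fields, a `played_out_ts` counter); copying PCM stands in for the decoder and zero‑fill stands in for Expand:

```python
def get_audio(state, packet_buffer, sync_buffer, frame_samples=80):
    """One 10 ms pull at 8 kHz: discard late packets, refill, emit a frame."""
    # Discard packets whose timestamp is not after the play-out position.
    packet_buffer[:] = [p for p in packet_buffer if p["ts"] > state["played_out_ts"]]
    # Refill the sync buffer from the earliest remaining packet if needed.
    while len(sync_buffer) < frame_samples and packet_buffer:
        pkt = packet_buffer.pop(0)
        sync_buffer.extend(pkt["pcm"])           # copy stands in for decoding
    frame = sync_buffer[:frame_samples]
    del sync_buffer[:frame_samples]
    frame += [0] * (frame_samples - len(frame))  # zero-fill stands in for Expand
    state["played_out_ts"] += frame_samples
    return frame
```

When the buffers run dry, the real system invokes packet‑loss concealment rather than emitting silence as this sketch does.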

3. Core Modules

3.1 Buffers

3.1.1 Packet Buffer

Purpose: cache incoming packets.

Implementation: std::list<Packet>, ordered by timestamp.

Default capacity: 500 packets.
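In outline, the ordered fixed‑capacity store might look like this. Python stands in for the C++ `std::list<Packet>`, and dropping only the oldest packet on overflow is a simplification of the real flush behavior:

```python
import bisect

class PacketBuffer:
    """Timestamp-ordered packet store with a fixed capacity (sketch)."""

    def __init__(self, capacity=500):
        self.capacity = capacity
        self.packets = []            # list of (timestamp, payload), kept sorted

    def insert(self, timestamp, payload):
        keys = [ts for ts, _ in self.packets]
        i = bisect.bisect_left(keys, timestamp)
        if i < len(keys) and keys[i] == timestamp:
            return False             # duplicate (e.g. a RED copy) is ignored
        self.packets.insert(i, (timestamp, payload))
        if len(self.packets) > self.capacity:
            self.packets.pop(0)      # simplified overflow: drop the oldest
        return True
```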

3.1.2 Decoded Buffer

Purpose: store decoded PCM data.

Implementation: int16 array.

3.1.3 Algorithm Buffer

Purpose: hold PCM after DSP algorithms such as Expand.

3.1.4 Sync Buffer

Purpose: store played and unplayed decoded data, distinguished by curPosition.

Implementation: multi‑channel PCM array with playback markers (circular buffer).
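A minimal ring‑buffer sketch of the played/unplayed split; the name `cur_position` follows the description above, overflow protection is omitted, and a single channel is assumed:

```python
class SyncBuffer:
    """PCM ring with a play-out cursor separating played from unplayed data."""

    def __init__(self, size):
        self.data = [0] * size
        self.cur_position = 0        # index of the next unplayed sample
        self.end = 0                 # one past the last written sample

    def write(self, samples):
        for s in samples:
            self.data[self.end % len(self.data)] = s
            self.end += 1

    def read(self, n):
        n = min(n, self.end - self.cur_position)   # never read past the end
        out = [self.data[i % len(self.data)]
               for i in range(self.cur_position, self.cur_position + n)]
        self.cur_position += n
        return out
```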

3.2 Delay Manager (Jitter Estimation)

3.2.1 Jitter Definition

Jitter is the difference between a packet’s arrival interval and the average arrival interval.

3.2.2 Stable Jitter Estimation

Count the inter‑arrival time IAT in packet intervals, clamped to the range 0–64 (with 20 ms packets, a maximum of 1.28 s).

Update the probability distribution of IAT values with a forgetting factor, increasing the probability of the current IAT and making the distribution more stable over time.

Determine the target buffer level BLo as the smallest IAT value that covers 95 % of the probability mass.
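The histogram update and 95 % lookup can be sketched as below. The forgetting‑factor value is an assumed placeholder, not WebRTC's actual constant:

```python
def update_target_level(hist, iat, forget=0.9997, quantile=0.95):
    """One DelayManager-style update (sketch): age the IAT histogram with a
    forgetting factor, credit the current IAT, and return the smallest IAT
    value covering the 95% quantile as the target level BLo."""
    for i in range(len(hist)):
        hist[i] *= forget                 # decay all old observations
    hist[iat] += 1.0 - forget             # boost the bin we just observed
    total = sum(hist)
    acc = 0.0
    for i, p in enumerate(hist):
        acc += p
        if acc >= quantile * total:
            return i
    return len(hist) - 1
```

With a forgetting factor this close to 1, a short burst of large IATs barely moves the target, while a sustained change gradually shifts it, which is the "more stable over time" behavior described above.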

3.2.3 Peak Jitter Estimation

Two length‑8 arrays store peak amplitudes and intervals.

An IAT exceeding twice the 95 % IAT value is considered a peak. If the interval since the previous peak is under 10 s, the new peak replaces the oldest entry; intervals of 10–20 s are ignored; intervals over 20 s clear the array.

If the peak array has fewer than 8 entries, peak estimation is disabled; otherwise the target level is set to the maximum peak value.
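One possible reading of these peak rules, with the ambiguous cases resolved as noted in the comments; class and method names are illustrative:

```python
class PeakDetector:
    """Sketch of the peak-mode rules above (thresholds follow the text)."""

    def __init__(self):
        self.peaks = []                # up to 8 (interval_s, height) entries

    def report(self, iat, target_95, elapsed_s):
        """Returns the peak-mode target level, or None if peak mode is off."""
        if iat <= 2 * target_95:
            return None                # not a peak at all
        if elapsed_s > 20:
            self.peaks.clear()         # history is stale: start over
        elif elapsed_s >= 10:
            return None                # 10-20 s gap: ignore this peak
        self.peaks.append((elapsed_s, iat))
        if len(self.peaks) > 8:
            self.peaks.pop(0)          # replace the oldest entry
        if len(self.peaks) < 8:
            return None                # too little evidence for peak mode
        return max(height for _, height in self.peaks)
```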

3.2.4 Buffer Level Filter

The adaptive average jitter delay BLc is computed with an exponential filter:

BLc = f · BLc_prev + (1 − f) · B

where f is the forgetting factor and B is the current buffer level. Accelerated or slowed playback adjusts BLc accordingly.
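As a one‑line exponential filter this might look like the sketch below; the default forgetting factor 253/256 is chosen for illustration from the fixed‑point range used by the accelerate logic, and `stretched` models samples added or removed by time stretching:

```python
def filter_buffer_level(blc, buffer_level, forget=253 / 256, stretched=0):
    """One buffer-level-filter step (sketch): exponentially smooth the current
    buffer level, compensating for samples created/removed by time stretching."""
    return forget * blc + (1 - forget) * buffer_level - stretched
```

Because the forgetting factor is close to 1, the filtered level tracks the true buffer occupancy slowly, so momentary dips do not immediately trigger speed changes.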

3.3 Decision Logic

Normal + Normal: normal decoding, with possible accelerate or decelerate commands.

Loss + Normal: invoke PLC (packet loss concealment) and wait up to 100 ms.

Loss + Loss: repeat PLC, reducing energy per frame to avoid distortion.

Normal + Loss: smooth transition (merge) after PLC‑generated frame.
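The four cases map naturally onto a small decision function. The operation names here are illustrative, not NetEQ's actual operation enum values:

```python
def decide(cur_available, prev_was_loss):
    """Toy mapping of the four decision-logic cases above to DSP operations."""
    if cur_available and not prev_was_loss:
        return "normal"          # may still become accelerate / decelerate
    if not cur_available and not prev_was_loss:
        return "expand"          # first concealment (PLC) frame
    if not cur_available and prev_was_loss:
        return "expand_muted"    # keep concealing, fading energy per frame
    return "merge"               # real frame after PLC: smooth the seam
```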

3.3.1 Accelerate Conditions

Accelerate when data arrives on time but buffer delay exceeds network delay. Conditions are illustrated in the following diagram:

3.3.2 Decelerate Conditions

Decelerate when data arrives on time but buffer delay is smaller than network delay. Conditions are shown below:

3.3.3 Packet‑Loss Concealment (Expand)

When a packet is missing, NetEQ uses an iLBC‑style PLC: it reconstructs the LPC coefficients and residual signal from previously received speech, and gradually reduces energy over consecutive lost frames.

3.3.4 Merge (Fusion) Conditions

PLC limit not reached but buffer delay is too large.

Buffer delay is large and PLC limit reached.

Buffer delay is acceptable and PLC limit reached.

Playable frame and target frame differ by more than 100 ms.

3.4 Time‑Scale‑Modification (WSOLA)

WSOLA (Waveform Similarity Overlap‑Add) performs pitch‑preserving speed changes by finding the most similar waveform segment and overlapping with a Hann window.
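A toy WSOLA accelerate step, assuming mono float samples; the window length and search range are arbitrary illustrative values, and normalized cross‑correlation picks the splice point:

```python
import math

def wsola_accelerate(x, seek=100, overlap=40):
    """Toy WSOLA step: find the segment past the overlap region most similar
    to the start of x, then cross-fade into it with a half-Hann ramp so that
    roughly one waveform period is removed without changing pitch."""
    head = x[:overlap]
    best_off, best_score = overlap, -1e18
    for off in range(overlap, overlap + seek):
        seg = x[off:off + overlap]
        if len(seg) < overlap:
            break
        num = sum(a * b for a, b in zip(head, seg))
        den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in seg)) or 1.0
        if num / den > best_score:                 # normalized cross-correlation
            best_score, best_off = num / den, off
    out = []
    for i in range(overlap):
        w = 0.5 - 0.5 * math.cos(math.pi * i / overlap)   # rising half-Hann
        out.append((1 - w) * head[i] + w * x[best_off + i])
    return out + x[best_off + overlap:]
```

On a periodic signal the best offset lands on a multiple of the period, so the output is shorter by whole periods and the splice is nearly inaudible; the decelerate direction instead re‑inserts the matched segment.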

3.5 Normal Processing

When a packet is ready for playback, it is decoded and sent to the voice buffer, taking the previous DSP mode into account.

3.6 Accelerate Implementation

filtered_level = (level_factor_ · filtered_level + (256 − level_factor_) · buffer_size_packets) / 256 − time_stretched_packets

Depending on the target buffer level, level_factor_ is set to 251–254 on a 0–256 fixed‑point scale (forgetting factors of roughly 0.98–0.99); SetTargetBufferLevel selects the factor.

3.7 Decelerate (PreemptiveExpand)

Decelerate stretches the signal using WSOLA, inserting an extra pitch period to increase frame length.

4. Douyu’s Custom NetEQ Design

4.1 Thread Model

Both InsertPacket and GetAudio are intended to run on the same thread to keep the implementation lock‑free.

4.2 Complete Flow and Module Interaction

4.3 Integration with AudioMixer

Replace the existing AudioSynchronization with NetEQ and simulate packet loss/jitter via the RTP packet queue; after NetEQ stabilizes, the whole pipeline (RTP queue + AudioSynchronization) can be replaced.

4.4 Buffer‑Level Logic

Unlike WebRTC’s client‑side NetEQ, the server‑side implementation sacrifices a small amount of latency to compute a more stable target buffer level, dramatically reducing stretch rates and improving overall listening comfort.


Tags: real‑time communication, WebRTC, signal processing, NetEQ, packet loss concealment, audio jitter buffer
Written by Douyu Streaming

Official account of Douyu Streaming Development Department, sharing audio and video technology best practices.