How NetEQ Powers High‑Quality Voice Calls: Inside Google’s Adaptive Jitter Buffer
This article explains NetEQ, the adaptive audio jitter buffer and packet‑loss concealment engine used in WebRTC, detailing its architecture, core modules, jitter estimation, buffer management, and speed‑up/slow‑down algorithms that together ensure robust, low‑latency voice quality even under high packet loss.
1. NetEQ Overview
NetEQ, short for Network Equalizer, is essentially an audio jitter buffer. It is one of the two core technologies of the GIPS voice engine, featuring an advanced adaptive jitter buffer with packet‑loss concealment. Google acquired the technology in 2010 when it bought Global IP Solutions, and released it as part of WebRTC in 2011.
NetEQ integrates adaptive jitter control and packet‑loss concealment algorithms with the decoder, allowing high voice quality even in high‑loss environments.
NetEQ comprises four modules: an adaptive packet buffer, a speech decoder, jitter control with error concealment, and playback. The jitter control and concealment module is the core.
Within the audio processing pipeline, NetEQ sits on the receiving side, between RTP packet reception and the playback device.
1.1 Main Features of NetEQ
Significantly improves voice quality.
Reduces latency introduced by jitter buffering by 30–80 ms compared to the best adaptive jitter buffer techniques.
Deployed only on the receiving side.
Reduces required network configuration.
Compatible with all standard voice codecs.
1.2 Adaptive Network Jitter Estimation
(In the original figure, the blue line represents network delay and the yellow line NetEQ's response to it.) Unlike traditional adaptive jitter buffers, NetEQ uses a non‑sequential processing mechanism that combines intelligent jitter buffering with packet‑loss concealment (PLC) into a single processing unit, detecting delay changes and correcting speech errors before playback.
2. Overall Architecture
2.1 NetEQ Input Path
When the service receives an RTP packet, it calls NetEQ’s InsertPacket function to enter the receive module.
For RTP packets carrying RED redundancy, the payload is unpacked to recover the original packets; duplicates of packets already received are discarded.
The original RTP packet is inserted into the Packet Buffer.
For each original packet, the inter‑arrival time (Iat) is calculated as the receive‑time difference divided by the 20 ms packet interval.
The Iat value is processed by the core network jitter estimation module (DelayManager) to obtain the target buffer level.
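As a minimal sketch of this step (names and the clamping bound are illustrative, assuming the 20 ms packet spacing used throughout this article):

    #include <algorithm>
    #include <cstdint>

    // Inter-arrival time (Iat) in units of the packet interval: 0 means the
    // packet arrived within the same 20 ms slot as its predecessor, 1 means
    // on time, and the statistic is capped at 64 (1.28 s).
    constexpr int kPacketIntervalMs = 20;
    constexpr int kMaxIat = 64;

    int ComputeIat(int64_t arrival_ms, int64_t prev_arrival_ms) {
      int iat = static_cast<int>((arrival_ms - prev_arrival_ms) / kPacketIntervalMs);
      return std::clamp(iat, 0, kMaxIat);
    }

The DelayManager consumes this value on every insertion to keep its jitter statistics current.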
2.2 NetEQ Output Path
The timer calls GetAudio every 10 ms to retrieve 10 ms of audio data from NetEQ.
Update PlayedOutTs and compute the number of samples left to play (sampleLeft).
Traverse timestamps in the jitter buffer to find the smallest timestamp greater than PlayedOutTs (availableTS); discard late packets.
Calculate bufsize as the sum of sampleLeft and buffered data.
Update the buffer level filter (BufferLevelFilter) with bufsize.
Based on BufferLevelFilter, timestamps, previous playback mode, and MCU commands, decide whether to fetch data from the jitter buffer.
If needed, read data from the jitter buffer into shared memory.
Generate DSP commands from MCU decisions.
The decoder reads data from shared memory and decodes it.
DSP commands determine the playback mode and process decoded data and voice buffer accordingly.
Extract 10 ms of data from the voice buffer starting at curPosition.
Update the voice buffer and curPosition.
Adjust endTimeStamp and output the 10 ms of audio.
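The availableTS lookup and late-packet discard described above can be sketched against the std::list-based packet buffer covered later in section 3.1.1; this simplified illustration ignores RTP timestamp wraparound:

    #include <cstdint>
    #include <list>
    #include <optional>

    struct Packet {
      uint32_t timestamp;
      // payload omitted
    };

    // The buffer is kept ordered by timestamp, so after dropping packets that
    // are already behind the playout position, the front holds availableTS.
    std::optional<uint32_t> FindAvailableTs(std::list<Packet>& buffer,
                                            uint32_t played_out_ts) {
      while (!buffer.empty() && buffer.front().timestamp <= played_out_ts) {
        buffer.pop_front();  // late packet: its playout time has passed
      }
      if (buffer.empty()) return std::nullopt;
      return buffer.front().timestamp;
    }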
3. Core Modules
3.1 NetEQ Buffers
3.1.1 Packet Buffer
Purpose: cache packets.
Implementation: std::list<Packet>.
Ordered by timestamp.
Default capacity: 500 packets.
3.1.2 Decoded Buffer
Purpose: store decoded PCM data.
Implementation: int16 array.
3.1.3 Algorithm Buffer
Purpose: hold PCM data after DSP algorithms such as Expand.
3.1.4 Sync Buffer
Purpose: store already‑played data and decoded but not‑yet‑played data, distinguished by curPosition.
Implementation: multi‑channel PCM array with playback markers (circular buffer).
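Taken together, the four buffers can be pictured with the following simplified shapes (the real classes live under modules/audio_coding/neteq/ in the WebRTC tree and are considerably richer):

    #include <cstdint>
    #include <list>
    #include <vector>

    struct Packet {
      uint32_t timestamp;
      std::vector<uint8_t> payload;  // still encoded
    };

    struct NetEqBuffers {
      std::list<Packet> packet_buffer;        // encoded packets, ordered by timestamp
      std::vector<int16_t> decoded_buffer;    // raw PCM straight out of the decoder
      std::vector<int16_t> algorithm_buffer;  // PCM after Expand/Accelerate/etc.
      std::vector<int16_t> sync_buffer;       // played + pending PCM (circular in NetEQ)
      size_t cur_position = 0;                // boundary between played and pending data
    };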
3.2 Jitter Estimation Module (DelayManager)
3.2.1 Jitter Definition
Jitter is the difference between a packet’s arrival interval and the average arrival interval.
3.2.2 Stable Jitter Estimation
Record the arrival interval Iat for each packet in units of the packet time: 0 for an early arrival, 1 for on‑time arrival, capped at 64 (a maximum of 1.28 s at 20 ms packet spacing).
Update the probability distribution of Iat values, applying a forgetting factor, increasing the probability of the current Iat, and ensuring the distribution sums to ~1.
Determine the Iat value that satisfies 95 % probability as the target buffer level BLo.
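A minimal floating-point sketch of this histogram (WebRTC uses Q30 fixed point internally; the forgetting factor below is illustrative):

    #include <array>
    #include <cstddef>

    constexpr size_t kNumBuckets = 65;   // Iat values 0..64
    constexpr double kForget = 0.9993;   // forgetting factor (illustrative)

    class IatHistogram {
     public:
      void Update(size_t iat) {
        // Decay every bucket, then credit the observed Iat so the total
        // probability mass stays close to 1.
        for (double& p : probs_) p *= kForget;
        probs_[iat] += 1.0 - kForget;
      }
      // Target level BLo: smallest Iat whose cumulative probability is >= 95%.
      size_t Percentile95() const {
        double cumulative = 0.0;
        for (size_t i = 0; i < kNumBuckets; ++i) {
          cumulative += probs_[i];
          if (cumulative >= 0.95) return i;
        }
        return kNumBuckets - 1;
      }
     private:
      std::array<double, kNumBuckets> probs_{};
    };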
3.2.3 Peak Jitter Estimation
Two length‑8 arrays track peak Iat values and intervals.
If Iat exceeds twice the 95 % Iat value, it is considered a peak; intervals are managed based on duration thresholds (10 s, 20 s).
When enough peaks are collected, the target level is set to the maximum peak value.
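A rough sketch of that tracker, with the thresholds taken from the text and everything else simplified (the real detector also inspects the stored peak intervals before trusting the peaks):

    #include <algorithm>
    #include <array>
    #include <cstdint>

    class PeakDetector {
     public:
      void MaybeRegisterPeak(int iat, int target_95, int64_t now_ms) {
        if (iat <= 2 * target_95) return;  // not a peak
        size_t slot = count_ % kLen;
        peak_iat_[slot] = iat;
        peak_gap_ms_[slot] = now_ms - last_peak_ms_;  // interval between peaks
        last_peak_ms_ = now_ms;
        ++count_;
      }
      bool PeakModeActive(int64_t now_ms) const {
        // Enough peaks collected and the last one recent enough (20 s window).
        return count_ >= kLen && (now_ms - last_peak_ms_) < 20000;
      }
      int MaxPeak() const {
        int max_iat = 0;
        for (int v : peak_iat_) max_iat = std::max(max_iat, v);
        return max_iat;  // used as the target level while peak mode is active
      }
     private:
      static constexpr size_t kLen = 8;
      std::array<int, kLen> peak_iat_{};
      std::array<int64_t, kLen> peak_gap_ms_{};
      int64_t last_peak_ms_ = 0;
      size_t count_ = 0;
    };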
3.2.4 Buffer Level Filter (Jitter Delay Calculation)
The adaptive average jitter delay is computed as

    BLc(n) = f * BLc(n-1) + (1 - f) * B(n)

where f is the forgetting factor and B is the 95 % Iat value. After accelerated or decelerated playback, BLc is adjusted sharply (the time‑stretched samples are deducted, as in the water‑level formula of section 4.4), so that a single speed change is not retriggered frame after frame, preserving listening comfort.
3.3 Decision Logic Module
Analyzes the current and previous frame states (normal or lost) to decide processing: normal playback, acceleration, deceleration, packet loss concealment (PLC), or merge.
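Condensed to just the cases this article lists, the decision could be sketched as below (the real DecisionLogic weighs more state, such as comfort noise and DTMF, and applies hysteresis around the target level):

    enum class Operation { kNormal, kAccelerate, kPreemptiveExpand, kExpand, kMerge };

    Operation Decide(bool packet_available, bool prev_frame_was_plc,
                     int filtered_level, int target_level) {
      if (!packet_available) return Operation::kExpand;   // PLC
      if (prev_frame_was_plc) return Operation::kMerge;   // PLC -> real data
      if (filtered_level > target_level) return Operation::kAccelerate;
      if (filtered_level < target_level) return Operation::kPreemptiveExpand;
      return Operation::kNormal;
    }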
3.3.1 Acceleration Conditions
Triggered when packets arrive normally but the buffered delay exceeds the network delay, i.e., the filtered buffer level is above the target level.
3.3.2 Deceleration Conditions
Triggered when packets arrive normally but the buffered delay is below the network delay, i.e., the filtered buffer level is under the target level.
3.3.3 Packet Loss Concealment (PLC) Conditions
First‑time PLC when the previous frame was received normally but the current frame is missing.
Continued PLC while waiting for a late packet, for up to 100 ms.
PLC when the jitter buffer is empty.
3.3.4 Merge Conditions
Merge is used to smoothly connect PLC‑generated data with buffered data when any of the following holds:
PLC limit not reached but buffer delay is too large.
Buffer delay is large and PLC limit reached.
Buffer delay is acceptable but PLC limit reached.
Buffered data and target data differ by more than 100 ms.
3.4 Time‑Scale‑Modification (WSOLA) Algorithm
WSOLA (Waveform Similarity Overlap‑Add) is a time‑domain technique that changes duration while preserving pitch. It divides the signal into frames of length N spaced by L, then recombines them with a scaling factor a (a > 1 compresses, a < 1 stretches). A Hann window is applied, and the best‑matching overlap position is found so that the joins introduce no audible discontinuities.
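The two ingredients named above, waveform-similarity search and windowed overlap-add, can be illustrated as follows (frame and overlap sizes are arbitrary here, and n is assumed greater than 1):

    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // (1) Find the overlap offset where the search region best matches the
    // tail of the previous frame. The tail's own energy is constant across
    // offsets, so normalizing by the candidate's energy alone suffices.
    size_t BestOverlapOffset(const std::vector<int16_t>& tail,
                             const std::vector<int16_t>& search,
                             size_t max_shift) {
      double best_score = -1e300;
      size_t best_off = 0;
      for (size_t off = 0; off + tail.size() <= search.size() && off <= max_shift; ++off) {
        double corr = 0.0, energy = 1e-9;
        for (size_t i = 0; i < tail.size(); ++i) {
          corr += static_cast<double>(tail[i]) * search[off + i];
          energy += static_cast<double>(search[off + i]) * search[off + i];
        }
        double score = corr / std::sqrt(energy);
        if (score > best_score) { best_score = score; best_off = off; }
      }
      return best_off;
    }

    // (2) Cross-fade the two segments with complementary Hann-shaped ramps.
    void OverlapAdd(const int16_t* fade_out, const int16_t* fade_in,
                    size_t n, int16_t* out) {
      const double kPi = 3.14159265358979323846;
      for (size_t i = 0; i < n; ++i) {
        double w = 0.5 - 0.5 * std::cos(kPi * i / (n - 1));  // rising half-window
        out[i] = static_cast<int16_t>((1.0 - w) * fade_out[i] + w * fade_in[i]);
      }
    }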
3.5 Normal Processing
When a packet meets playback requirements, it is decoded and placed into the voice buffer for playback, considering the previous DSP processing mode.
3.6 Packet Loss Concealment (Expand)
NetEQ uses iLBC‑based PLC. It reconstructs LPC coefficients from the last sub‑frame, rebuilds the residual signal using pitch‑synchronous synthesis and noise generation, and applies energy decay across consecutive lost frames.
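As a greatly simplified stand-in for that idea, the sketch below just repeats the last pitch period with a per-frame energy decay; real NetEQ works on the LPC residual and mixes in shaped noise, both of which are skipped here (the 0.75 decay factor is illustrative):

    #include <cstdint>
    #include <vector>

    // last_period must be non-empty; consecutive_losses counts how many frames
    // in a row have been concealed, so long bursts fade toward silence.
    std::vector<int16_t> ExpandOneFrame(const std::vector<int16_t>& last_period,
                                        int consecutive_losses,
                                        size_t frame_len) {
      double gain = 1.0;
      for (int i = 0; i < consecutive_losses; ++i) gain *= 0.75;

      std::vector<int16_t> out(frame_len);
      for (size_t i = 0; i < frame_len; ++i) {
        out[i] = static_cast<int16_t>(gain * last_period[i % last_period.size()]);
      }
      return out;
    }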
3.7 Merge Processing
Merge combines PLC‑generated data with decoded data to ensure smooth transitions. It first obtains a 202‑sample expanded sequence, finds the optimal overlap point P, and applies a gradual smoothing coefficient (0.032 per ms) to blend the two streams.
3.8 Acceleration (Accelerate)
Accelerate reduces latency by compressing speech using WSOLA. It extracts a 20 ms frame (960 samples at 48 kHz), estimates the pitch period via short‑term autocorrelation, computes correlation bestCorr, and if > 0.9, cross‑mixes two pitch periods to shorten the frame.
Resulting audio is stored in the Algorithm Buffer.
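The shortening step can be pictured as follows: when two adjacent pitch periods correlate strongly, they are replaced by their cross-fade, deleting one period's worth of samples without changing pitch. This sketch assumes the pitch period has already been estimated and that frame.size() >= 2 * period:

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    std::vector<int16_t> RemoveOnePitchPeriod(const std::vector<int16_t>& frame,
                                              size_t period) {
      assert(period > 0 && frame.size() >= 2 * period);
      std::vector<int16_t> out;
      out.reserve(frame.size() - period);
      // Merge the first two periods into one via a linear cross-fade.
      for (size_t i = 0; i < period; ++i) {
        double w = static_cast<double>(i) / period;
        out.push_back(static_cast<int16_t>((1.0 - w) * frame[i] + w * frame[i + period]));
      }
      // Everything after the second period is kept unchanged.
      out.insert(out.end(), frame.begin() + 2 * period, frame.end());
      return out;
    }

PreemptiveExpand (section 3.9) is essentially the mirror image: the cross-faded period is inserted rather than removed.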
3.9 Deceleration (PreemptiveExpand)
Deceleration stretches speech using WSOLA when network conditions cause insufficient data arrival, inserting an extra pitch period to lengthen the audio.
4. Douyu’s Custom NetEQ Design
4.1 Thread Model Planning
The external API provides two operations, insert_packet and get_audio. Because NetEQ's internal implementation contains no locking, both operations must be invoked from the same thread (or otherwise never run concurrently), and this requirement should be clearly documented.
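If callers cannot all live on one thread, the requirement can instead be met by serializing access externally; the wrapper below is our own illustration (assuming the real constraint is mutual exclusion rather than strict thread affinity), not part of NetEQ:

    #include <mutex>

    template <typename NetEqT>
    class LockedNetEq {
     public:
      explicit LockedNetEq(NetEqT& neteq) : neteq_(neteq) {}

      template <typename PacketT>
      int InsertPacket(const PacketT& packet) {
        std::lock_guard<std::mutex> lock(mutex_);
        return neteq_.insert_packet(packet);
      }

      template <typename FrameT>
      int GetAudio(FrameT* frame) {
        std::lock_guard<std::mutex> lock(mutex_);
        return neteq_.get_audio(frame);
      }

     private:
      NetEqT& neteq_;
      std::mutex mutex_;
    };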
4.2 Complete Flow and Module Interaction
4.3 Integration with AudioMixer
Currently the AudioMixer buffer consists of rtppacketqueue + AudioSynchronization. During development, AudioSynchronization is replaced with NetEQ, and rtppacketqueue is modified to simulate packet loss and jitter for testing. Once NetEQ is stable, it replaces both rtppacketqueue and AudioSynchronization on the audio path (video continues to use rtppacketqueue).
4.4 Water‑Level Calculation Logic
WebRTC’s NetEQ targets ultra‑low latency for end‑to‑end scenarios. Douyu’s server‑side NetEQ adapts the water‑level control logic to sacrifice a small amount of latency for more stable target level calculation, dramatically reducing stretch rates and improving overall listening experience.
    n' = level_factor_ * n + (1 - level_factor_) * buffer_size_packets - time_stretched_packets
The value of level_factor_ is chosen as follows:
    void BufferLevelFilter::SetTargetBufferLevel(int target_buffer_level) {
      if (target_buffer_level <= 1) {
        level_factor_ = 251;
      } else if (target_buffer_level <= 3) {
        level_factor_ = 252;
      } else if (target_buffer_level <= 7) {
        level_factor_ = 253;
      } else {
        level_factor_ = 254;
      }
    }
    // buffer_size_packets = number of samples in packet_buffer + remaining samples in sync_buffer
    // time_stretched_packets = number of packets consumed by time-scale modification (speed change without pitch shift)
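Since level_factor_ is stored in Q8 fixed point (the code scales by 1/256 with a shift), these constants correspond to forgetting factors of 251/256 ≈ 0.980 up to 254/256 ≈ 0.992: with a one‑packet target, roughly 98 % of the previous filtered level survives each update, and the filter becomes slightly smoother as the target level grows.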