How Volcano Engine and CAS Acoustic Institute Won Top Spots at the First Low‑Resource Audio Codec Challenge

Volcano Engine's audio team, together with the Chinese Academy of Sciences Acoustic Institute, secured first‑place, runner‑up, and third‑place finishes in the 2025 Low‑Resource Audio Codec Challenge at ICASSP 2026 by delivering AI‑driven codecs that balance ultra‑low bitrate, low complexity, and high audio quality for real‑time communication and streaming scenarios.

ByteDance SE Lab

The 2025 Low‑Resource Audio Codec (LRAC) Challenge, the world’s first competition focused on ultra‑low‑resource audio coding, was hosted as part of the ICASSP 2026 workshop and attracted participants such as Anker Innovation, Horizon Robotics, and Nanjing University. Volcano Engine’s multimedia lab, in partnership with the Chinese Academy of Sciences Acoustic Institute, entered two teams—teamwzqaq and nano‑codec—winning first and second place in Track 1 and third place in Track 2.

Audio codecs compress and decompress sound, enabling storage and transmission across devices. Traditional algorithms (Opus, xHE‑AAC, EVS) perform well at 12‑64 kbps but degrade at 1‑6 kbps, producing quantization noise and spectral distortion. Recent AI‑driven codecs such as Google’s SoundStream and Meta’s EnCodec can deliver, at 1‑6 kbps, quality approaching that of traditional codecs at 12‑16 kbps, but they require heavy computation (on the order of GMACs). The LRAC challenge therefore emphasized achieving high reconstruction quality under strict latency limits (≤30 ms for Track 1, ≤50 ms for Track 2) and complexity limits (≤700 MFLOPS encoder and ≤300 MFLOPS decoder for Track 1; ≤2600 MFLOPS encoder and ≤600 MFLOPS decoder for Track 2).
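The per-track limits quoted above can be captured in a small lookup for checking a candidate model. This is a minimal sketch: the `TrackLimits` type, the `within_budget` helper, and the example model figures are hypothetical names and numbers introduced for illustration; only the limits themselves come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackLimits:
    max_latency_ms: float      # latency limit
    max_encoder_mflops: float  # encoder complexity limit
    max_decoder_mflops: float  # decoder complexity limit


# Limits as stated in the article; the dictionary keys are illustrative.
LRAC_LIMITS = {
    "track1": TrackLimits(30, 700, 300),
    "track2": TrackLimits(50, 2600, 600),
}


def within_budget(track, latency_ms, encoder_mflops, decoder_mflops):
    """Return True if a model's measured figures fit the track's limits."""
    limits = LRAC_LIMITS[track]
    return (
        latency_ms <= limits.max_latency_ms
        and encoder_mflops <= limits.max_encoder_mflops
        and decoder_mflops <= limits.max_decoder_mflops
    )


# Hypothetical model figures, not results from the challenge.
print(within_budget("track1", latency_ms=30, encoder_mflops=650, decoder_mflops=280))
```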

Track 1 Solution: IRIS Codec

IRIS (Internet Real‑time Intelligent Streaming Codec) adopts an end‑to‑end AI codec architecture. The encoder consists of stacked residual blocks followed by a 1‑D convolution and a GRU layer. Quantization uses Residual Vector Quantization (RVQ) with 12 codebooks of 1024 entries each; the 1 kbps mode uses 2 layers, while the 6 kbps mode uses all 12. The decoder projects the quantized features with a 1‑D convolution, processes them through multiple Conv2FormerBlock units (which outperform Conv2NeXt in synthesis quality), and finally applies an inverse STFT to reconstruct the audio.
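The bitrate figures follow from the codebook layout: 1024 entries need 10 bits per index, so 2 layers at 1 kbps imply roughly 50 frames per second, and 12 layers at that rate give 6 kbps. The sketch below checks that arithmetic and shows a generic greedy RVQ encode step. The 50 Hz frame rate is inferred rather than stated in the article, the feature dimension and random codebooks are stand-ins for learned ones, and the function names are hypothetical; this is not the IRIS implementation.

```python
import math

import numpy as np

CODEBOOK_SIZE = 1024  # entries per codebook (from the article)
NUM_CODEBOOKS = 12    # total RVQ layers (from the article)
FRAME_RATE_HZ = 50    # ASSUMPTION: inferred from 1 kbps with 2 layers


def bitrate_bps(num_layers, frame_rate_hz=FRAME_RATE_HZ, codebook_size=CODEBOOK_SIZE):
    """Bits per second when each frame sends one index per active layer."""
    bits_per_index = math.log2(codebook_size)
    return num_layers * bits_per_index * frame_rate_hz


def rvq_encode(frame, codebooks, num_layers):
    """Greedy RVQ: each layer quantizes the residual left by the previous one."""
    residual = frame.astype(np.float64)
    indices = []
    for codebook in codebooks[:num_layers]:
        distances = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(distances))
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the feature vector as the sum of the selected entries."""
    return sum(codebooks[layer][idx] for layer, idx in enumerate(indices))


rng = np.random.default_rng(0)
dim = 8  # ASSUMPTION: illustrative feature dimension
codebooks = [
    rng.normal(scale=0.5 ** layer, size=(CODEBOOK_SIZE, dim))
    for layer in range(NUM_CODEBOOKS)
]
frame = rng.normal(size=dim)
low_rate = rvq_encode(frame, codebooks, num_layers=2)    # 1 kbps mode
high_rate = rvq_encode(frame, codebooks, num_layers=12)  # 6 kbps mode
print(bitrate_bps(2), bitrate_bps(12))  # 1000.0 6000.0
```

Because the encoding is greedy, the low-rate index list is a prefix of the high-rate one, which is what lets a single model serve both bitrates by truncating layers.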

Track 2 Solution: Enhance‑Nanocodec

Enhance‑Nanocodec extends the codec with denoising and dereverberation. The pipeline operates entirely in the time‑frequency domain: the input waveform is transformed into a spectrogram via STFT, the encoder retains only the magnitude information, and an Energy Content Decoupling (ECD) layer separates spectral energy from content before decoding. Both the encoder and the decoder are built around a Large Kernel Convolution‑Style Attention Block (LKCAB). The decoder predicts magnitude and phase, which are combined and passed through an inverse Fourier transform to produce the enhanced audio.
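The final step, combining a magnitude and a phase estimate and inverting the transform, can be sketched with NumPy alone. In the codec the decoder predicts both quantities; here they are taken from the analysis of a test signal as a stand-in, so the example only demonstrates the recombination and overlap-add. The FFT size, hop, window, and function names are assumptions, not values from the article.

```python
import numpy as np

N_FFT, HOP = 512, 128                # ASSUMPTION: illustrative sizes
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window


def stft(signal):
    """Frame the signal, window each frame, and take a real FFT."""
    starts = range(0, len(signal) - N_FFT + 1, HOP)
    frames = np.stack([signal[s:s + N_FFT] * WINDOW for s in starts])
    return np.fft.rfft(frames, axis=1)


def istft(spectrogram):
    """Inverse FFT per frame, then windowed overlap-add with normalization."""
    frames = np.fft.irfft(spectrogram, n=N_FFT, axis=1)
    length = HOP * (len(frames) - 1) + N_FFT
    output = np.zeros(length)
    window_sum = np.zeros(length)
    for k, frame in enumerate(frames):
        start = k * HOP
        output[start:start + N_FFT] += frame * WINDOW
        window_sum[start:start + N_FFT] += WINDOW ** 2
    return output / np.maximum(window_sum, 1e-8)


def combine(magnitude, phase):
    """Build the complex spectrogram from separate magnitude and phase."""
    return magnitude * np.exp(1j * phase)


rng = np.random.default_rng(0)
waveform = rng.standard_normal(4096)
spec = stft(waveform)
# Stand-ins for the decoder's two outputs.
magnitude, phase = np.abs(spec), np.angle(spec)
reconstructed = istft(combine(magnitude, phase))
```

Away from the signal edges, where the window sum is well conditioned, the reconstruction matches the input to numerical precision.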

Training Strategies for Track 1

Data Augmentation: Noise, reverberation, multi‑speaker mixing, and pitch shifting were applied to expand the limited dataset, improving generalization across diverse speech conditions.
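As one concrete instance of the augmentations listed, additive noise is commonly mixed in at a chosen signal-to-noise ratio. The sketch below shows that generic recipe; the function name, the test tone, and the 10 dB target are illustrative assumptions, and the article does not say which SNR range or noise sources the team used.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add it."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 220.0 * t)  # stand-in for a clean utterance
noise = rng.standard_normal(len(speech))
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```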

Multi‑Discriminator Optimization: Three discriminators—time‑domain multi‑period, frequency‑domain multi‑scale STFT, and multi‑scale mel‑spectrogram—were used jointly to reduce temporal distortion and preserve high‑frequency details.

Multi‑Loss Optimization: The training objective combined STFT spectral loss, mel‑spectral loss, discriminator loss, RVQ codebook loss, and PESQ loss to address both objective metrics and human perception.

Gradient Direct Pass: To mitigate gradient approximation errors introduced by discrete quantization, 50% of training steps omitted the codebook update, allowing gradients to flow directly from the decoder to the encoder.
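One way to read this strategy is as a per-step coin flip between the quantized path and a bypass path in which the decoder receives the encoder output directly. The sketch below shows only that control flow, without a training framework. The `decoder_input` helper, the `bypass_prob` parameter, and the rounding quantizer are hypothetical; the article does not specify how the bypass was implemented.

```python
import random


def decoder_input(latent, quantize, bypass_prob=0.5, rng=random):
    """Pick what the decoder sees on this training step.

    Returns (features, used_quantizer). On bypass steps the continuous
    latent is passed through unchanged, so no quantization approximation
    sits between decoder and encoder and the codebook is left untouched.
    """
    if rng.random() < bypass_prob:
        return latent, False
    return quantize(latent), True


def toy_quantize(latent):
    """Stand-in quantizer (rounding); a real codec would use RVQ here."""
    return [round(value) for value in latent]


rng = random.Random(0)
latent = [0.2, 1.7, -0.6]
steps = [decoder_input(latent, toy_quantize, rng=rng) for _ in range(10_000)]
bypass_fraction = sum(1 for _, used in steps if not used) / len(steps)
print(f"bypass fraction: {bypass_fraction:.2f}")
```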

Two‑Stage Fine‑Tuning: Initial training used standard loss weighting; the second stage reduced the mel‑spectral loss weight and increased the STFT loss weight to restore mid‑ and high‑frequency harmonics.
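The loss combination and the second-stage reweighting amount to a weighted sum whose weights change between stages. The sketch below illustrates the idea; every weight value is hypothetical, since the article gives only the direction of change (mel down, STFT up), and the key names and the `total_loss` helper are assumptions as well.

```python
# All weights are hypothetical; the article gives directions of change only.
STAGE_WEIGHTS = {
    "stage1": {"stft": 1.0, "mel": 1.0, "adversarial": 1.0, "codebook": 1.0, "pesq": 1.0},
    "stage2": {"stft": 2.0, "mel": 0.5, "adversarial": 1.0, "codebook": 1.0, "pesq": 1.0},
}


def total_loss(losses, stage):
    """Weighted sum of the individual loss terms for the given stage."""
    weights = STAGE_WEIGHTS[stage]
    return sum(weights[name] * value for name, value in losses.items())


# Hypothetical per-term loss values for one batch.
batch_losses = {"stft": 0.4, "mel": 0.6, "adversarial": 0.1, "codebook": 0.05, "pesq": 0.2}
stage1_total = total_loss(batch_losses, "stage1")
stage2_total = total_loss(batch_losses, "stage2")
```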

Training Strategies for Track 2

Stage 1: Train the encoder, codebook, and decoder on clean speech only, so that the codebook stores pure speech information; a high‑complexity teacher decoder guides the encoder.

Stage 2: Freeze the quantizer and decoder, then train a student encoder on noisy and reverberant speech so that it learns denoising without increasing model complexity.

Stage 3: Freeze the encoder and quantizer, then train a student decoder within the complexity budget to reconstruct clean waveforms.
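The three stages can be summarized as a schedule of which modules are trainable and which are frozen. The sketch below is one reading of the description: the module names (`teacher_decoder`, `student_encoder`, and so on) are hypothetical labels, and it assumes that the decoder frozen in Stage 2 is the Stage 1 teacher and that the encoder frozen in Stage 3 is the Stage 2 student. The training data for Stage 3 is not stated in the article, so it is left unset.

```python
TRAINING_STAGES = {
    1: {
        "data": "clean",
        "trainable": {"encoder", "quantizer", "teacher_decoder"},
        "frozen": set(),
    },
    2: {
        "data": "noisy_reverberant",
        "trainable": {"student_encoder"},
        "frozen": {"quantizer", "teacher_decoder"},
    },
    3: {
        "data": None,  # not stated in the article
        "trainable": {"student_decoder"},
        "frozen": {"student_encoder", "quantizer"},
    },
}


def is_trainable(stage, module):
    """Return True if the module receives gradient updates in this stage."""
    return module in TRAINING_STAGES[stage]["trainable"]


for stage, plan in sorted(TRAINING_STAGES.items()):
    print(stage, sorted(plan["trainable"]), sorted(plan["frozen"]))
```

Keeping the quantizer frozen after Stage 1 is what ensures the codebook continues to represent clean speech while the student encoder learns to map degraded input onto it.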

Experimental results confirmed that these optimizations satisfied the strict latency and complexity constraints while delivering superior audio quality, demonstrating a viable path for real‑time communication and edge‑device audio transmission.

Future work includes further quality improvements, reduced computational load, packet‑loss compensation, multi‑sample‑rate support, and unified speech/music coding to broaden applicability to live streaming, IoT, and other real‑time scenarios.
