How Volcano Engine and CAS Acoustic Institute Won Top Spots at the First Low‑Resource Audio Codec Challenge
Volcano Engine's audio team, together with the Chinese Academy of Sciences Acoustic Institute, secured first‑place, runner‑up, and third‑place finishes in the 2025 Low‑Resource Audio Codec Challenge at ICASSP 2026 by delivering AI‑driven codecs that balance ultra‑low bitrate, low complexity, and high audio quality for real‑time communication and streaming scenarios.
The 2025 Low‑Resource Audio Codec (LRAC) Challenge, the world’s first competition focused on ultra‑low‑resource audio coding, was hosted as part of the ICASSP 2026 workshop and attracted participants such as Anker Innovation, Horizon Robotics, and Nanjing University. Volcano Engine’s multimedia lab, in partnership with the Chinese Academy of Sciences Acoustic Institute, entered two teams—teamwzqaq and nano‑codec—winning first and second place in Track 1 and third place in Track 2.
Audio codecs compress and decompress sound, enabling storage and transmission across devices. Traditional algorithms (Opus, xHE‑AAC, EVS) perform well at 12‑64 kbps but degrade at 1‑6 kbps, where quantization noise and spectral distortion become audible. Recent AI‑driven codecs such as Google’s SoundStream and Meta’s EnCodec deliver, at 1‑6 kbps, quality close to what traditional codecs reach at 12‑16 kbps, but they require computation on the order of GMACs (billions of multiply‑accumulate operations per second). The LRAC challenge therefore emphasized high reconstruction quality under strict limits on latency (≤30 ms for Track 1, ≤50 ms for Track 2) and complexity (≤700 MFLOPS encoder and ≤300 MFLOPS decoder for Track 1; ≤2600 MFLOPS encoder and ≤600 MFLOPS decoder for Track 2).
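The track limits quoted above can be written down as a small lookup table. The following is a minimal sketch, not the challenge's official tooling: the names `TrackBudget`, `BUDGETS`, and `fits_budget` are hypothetical, and only the numeric limits come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackBudget:
    max_latency_ms: float
    max_encoder_mflops: float
    max_decoder_mflops: float


# Limits as quoted in the article; the structure itself is illustrative.
BUDGETS = {
    1: TrackBudget(max_latency_ms=30.0, max_encoder_mflops=700.0, max_decoder_mflops=300.0),
    2: TrackBudget(max_latency_ms=50.0, max_encoder_mflops=2600.0, max_decoder_mflops=600.0),
}


def fits_budget(track, latency_ms, encoder_mflops, decoder_mflops):
    """Return True if a candidate codec stays within every limit of the track."""
    budget = BUDGETS[track]
    return (
        latency_ms <= budget.max_latency_ms
        and encoder_mflops <= budget.max_encoder_mflops
        and decoder_mflops <= budget.max_decoder_mflops
    )
```

A codec that exceeds any single limit fails the check, which matches the way the constraints are described: latency, encoder cost, and decoder cost each have their own ceiling.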
Track 1 Solution: IRIS Codec
IRIS (Internet Real‑time Intelligent Streaming Codec) adopts an end‑to‑end AI codec architecture. The encoder consists of stacked residual blocks, followed by a 1‑D convolution and a GRU layer. Quantization uses Residual Vector Quantization (RVQ) with 12 codebooks (each 1024 entries); the 1 kbps mode uses 2 layers, while the 6 kbps mode uses all 12. The decoder projects quantized features with a 1‑D convolution, processes them through multiple Conv2FormerBlock units (which outperform Conv2NeXt in synthesis quality), and finally applies inverse STFT to reconstruct audio.
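The relationship between RVQ layers and bitrate can be checked with a short calculation. The sketch below assumes that all transmitted bits are RVQ indices and that the frame rate is fixed. Under those assumptions the quoted figures (12 codebooks of 1024 entries at 6 kbps) imply 50 frames per second, which is an inference rather than a number given in the article. The toy encoder and decoder illustrate generic residual vector quantization in pure Python and are not the IRIS implementation.

```python
import math


def rvq_bitrate_bps(num_layers, codebook_size, frames_per_second):
    """Bits per second when every frame sends one index per RVQ layer."""
    return num_layers * math.log2(codebook_size) * frames_per_second


def rvq_encode(vector, codebooks):
    """Quantize layer by layer; each layer codes the residual left by the previous one."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        best = min(
            range(len(codebook)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, codebook[i])),
        )
        indices.append(best)
        residual = [r - c for r, c in zip(residual, codebook[best])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries; passing fewer indices uses only the first layers."""
    output = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(indices, codebooks):
        output = [o + c for o, c in zip(output, codebook[index])]
    return output


# Frame rate implied by the quoted figures (an inference, not stated in the article).
FRAME_RATE = 6000 / (12 * math.log2(1024))

print(rvq_bitrate_bps(2, 1024, FRAME_RATE))   # 1000.0
print(rvq_bitrate_bps(12, 1024, FRAME_RATE))  # 6000.0
```

Because the decoder simply sums whichever layers it receives, dropping the later indices yields a coarser reconstruction at a lower bitrate, which is how a single set of codebooks can serve both the 1 kbps and 6 kbps modes.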
Track 2 Solution: Enhance‑Nanocodec
Enhance‑Nanocodec extends the codec with denoising and dereverberation. The pipeline operates entirely in the time‑frequency domain: the input waveform is transformed to a spectrogram via STFT, the encoder retains only magnitude information, and an Energy Content Decoupling (ECD) layer separates spectral energy from content before decoding. Both encoder and decoder share a Large Kernel Convolution‑Style Attention Block (LKCAB). The decoder predicts magnitude and phase, which are combined via inverse Fourier transform to produce the enhanced audio.
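The final step, combining predicted magnitude and phase through an inverse Fourier transform, can be illustrated on a single frame. This is a simplified sketch that uses a naive DFT from the standard library; windowing, overlap-add, and the network that predicts the two quantities are all omitted. The function names are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def split_mag_phase(spectrum):
    """Separate a complex spectrum into magnitude and phase lists."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def combine_mag_phase(magnitude, phase):
    """Rebuild the complex spectrum from magnitude and phase."""
    return [cmath.rect(m, p) for m, p in zip(magnitude, phase)]


frame = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
magnitude, phase = split_mag_phase(dft(frame))
reconstructed = idft(combine_mag_phase(magnitude, phase))
```

With the true magnitude and phase the frame is recovered up to floating-point error. Since the encoder in the described pipeline keeps only magnitude, the decoder has to predict phase itself before this recombination can take place.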
Training Strategies for Track 1
Data Augmentation: Noise, reverberation, multi‑speaker mixing, and pitch shifting were applied to expand the limited dataset, improving generalization across diverse speech conditions.
Multi‑Discriminator Optimization: Three discriminators—time‑domain multi‑period, frequency‑domain multi‑scale STFT, and multi‑scale mel‑spectrogram—were used jointly to reduce temporal distortion and preserve high‑frequency details.
Multi‑Loss Optimization: Combined STFT spectral loss, mel‑spectral loss, discriminator loss, RVQ codebook loss, and PESQ loss to address both objective metrics and human perception.
Gradient Direct Pass: To mitigate gradient approximation errors from discrete quantization, 50 % of training steps omitted codebook updates, allowing direct gradient flow from decoder to encoder.
Two‑Stage Fine‑Tuning: Initial training used standard loss weighting; the second stage reduced mel‑spectral loss weight and increased STFT loss emphasis, restoring mid‑high frequency harmonics.
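Two of the strategies above, the gradient direct pass and the two-stage loss reweighting, can be sketched as control flow. This is one possible reading of the description, not the authors' training code. The weight values are hypothetical: the article gives only the direction of the stage-2 change (less mel, more STFT). The `quantize` and `decode` callables are stand-ins for the real modules.

```python
import random

# Hypothetical weights; only the stage-2 direction (mel down, STFT up) is from the article.
STAGE_WEIGHTS = {
    1: {"stft": 1.0, "mel": 1.0, "adversarial": 1.0, "codebook": 1.0, "pesq": 1.0},
    2: {"stft": 2.0, "mel": 0.5, "adversarial": 1.0, "codebook": 1.0, "pesq": 1.0},
}


def total_loss(losses, stage):
    """Weighted sum of the individual loss terms for the given fine-tuning stage."""
    weights = STAGE_WEIGHTS[stage]
    return sum(weights[name] * value for name, value in losses.items())


def forward_step(latent, quantize, decode, rng, bypass_prob=0.5):
    """Run one forward pass, skipping the quantizer on a random share of steps.

    When the quantizer is skipped, the decoder sees the encoder output directly,
    so in an autograd framework gradients would reach the encoder without
    passing through the discrete quantization approximation.
    """
    bypass = rng.random() < bypass_prob
    features = latent if bypass else quantize(latent)
    return decode(features), bypass


rng = random.Random(0)
output, bypassed = forward_step(0.4, quantize=round, decode=lambda x: 2 * x, rng=rng)
```

Choosing the bypass at random per step is an assumption. The article states only that half of the training steps omitted codebook updates, so a fixed alternating schedule would fit the description equally well.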
Training Strategies for Track 2
Stage 1: Train encoder, codebook, and decoder on clean speech only, ensuring the codebook stores pure speech information; a high‑complexity teacher decoder guides the encoder.
Stage 2: Freeze quantizer and decoder, train a student encoder on noisy/reverberant speech to learn denoising without increasing model complexity.
Stage 3: Freeze encoder and quantizer, train a student decoder within the complexity budget to reconstruct clean waveforms.
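The three stages amount to a schedule of which modules receive gradient updates. The sketch below records that schedule as plain data, assuming three coarse modules named `encoder`, `quantizer`, and `decoder`. These names, and the decision to treat the teacher and student networks as occupying the same slots, are simplifications for illustration.

```python
MODULES = ("encoder", "quantizer", "decoder")

# Which modules are trained in each stage, following the description above.
# Stage 1 trains everything on clean speech (with a teacher decoder),
# stage 2 trains only the student encoder on noisy/reverberant speech,
# stage 3 trains only the student decoder within the complexity budget.
TRAINABLE_BY_STAGE = {
    1: {"encoder", "quantizer", "decoder"},
    2: {"encoder"},
    3: {"decoder"},
}


def requires_grad(stage):
    """Map each module name to True (trained) or False (frozen) for a stage."""
    trainable = TRAINABLE_BY_STAGE[stage]
    return {name: name in trainable for name in MODULES}


for stage in (1, 2, 3):
    print(stage, requires_grad(stage))
```

In a deep learning framework the resulting flags would typically be applied to each module's parameters before building the optimizer for that stage. The quantizer stays frozen after stage 1, so the codebook keeps the clean-speech representation it learned there.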
Experimental results confirmed that these optimizations satisfied the strict latency and complexity constraints while delivering superior audio quality, demonstrating a viable path for real‑time communication and edge‑device audio transmission.
Future work includes further quality improvements, reduced computational load, packet‑loss compensation, multi‑sample‑rate support, and unified speech/music coding to broaden applicability to live streaming, IoT, and other real‑time scenarios.
ByteDance SE Lab
Official account of ByteDance SE Lab, sharing research and practical experience in software engineering. Our lab unites researchers and engineers from various domains to accelerate the fusion of software engineering and AI, driving technological progress in every phase of software development.