How JD AI’s Four Interspeech 2020 Papers Advance Speech Processing

JD AI Research Institute presented four accepted Interspeech 2020 papers—covering sound event localization, speech dereverberation, speaker verification, and an efficient WaveGlow vocoder—demonstrating significant advances in audio AI despite the conference’s shift to an online format due to COVID‑19.

JD Cloud Developers

Due to the COVID‑19 pandemic, the Interspeech 2020 conference originally scheduled for Shanghai was moved online. JD AI Research Institute had four papers accepted, covering sound event localization and detection, speech dereverberation, speaker verification, and an efficient WaveGlow vocoder.

Sound Event Localization and Detection Based on Multiple DOA Beamforming and Multi‑task Learning

Sound event detection and localization are crucial for smart home and security applications but face challenges from noise, reverberation, and overlapping sources. This paper proposes a method that combines traditional acoustic beamforming with multi‑task deep learning, extracting directional source signals via fixed beams and providing rich spatial representations without prior source localization.

The approach computes direction‑of‑arrival vectors from cross‑power spectra, which removes the dependence on microphone array geometry, and uses separate networks for source localization and event detection. Evaluated on the DCASE2019 dataset, it achieved the best overall performance.
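To give a feel for how a cross‑power spectrum carries directional information, the sketch below estimates the delay between two microphone channels with GCC‑PHAT, a standard textbook technique. It is not the paper's method, and the function name `gcc_phat_delay` and its parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

def gcc_phat_delay(x, y, max_lag):
    """Return the lag (in samples) by which x trails y, estimated with GCC-PHAT."""
    n = len(x) + len(y)                  # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cps = X * np.conj(Y)                 # cross-power spectrum of the two channels
    cps /= np.abs(cps) + 1e-12           # PHAT weighting: keep the phase only
    cc = np.fft.irfft(cps, n=n)          # generalized cross-correlation
    # Reorder so that index 0 corresponds to lag -max_lag
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag

# Toy data (assumption): white noise and a copy delayed by 5 samples
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
delayed = np.concatenate((np.zeros(5), ref[:-5]))
estimated = gcc_phat_delay(delayed, ref, max_lag=16)
```

Given a known microphone spacing and sampling rate, such a delay can be converted to an arrival angle; a DOA representation that does not require the array geometry, as described above, avoids that conversion step.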

Skip Convolutional Neural Network for Speech Dereverberation using Optimally Smoothed Spectral Mapping

This work, a collaboration between JD AI Research Institute and the University of Texas at Dallas, extends the U‑Net‑style fully convolutional network for speech dereverberation. The proposed SkipConvNet replaces each skip connection with a dedicated convolutional block and adds an optimally smoothed power‑spectral preprocessing step.

Experiments on the REVERB Challenge corpus show significant improvements in objective quality metrics and notable gains in speech recognition and speaker identification under reverberant conditions.

The JD AI Speaker Verification System for the FFSVC 2020 Challenge

Far‑field speaker verification suffers from complex acoustic environments. Leveraging the FFSVC2020 competition data (≈1100 h, 120 speakers), the system explores data augmentation, model architectures (TDNN, TDNN‑F, ResNet, Transformer), and scoring strategies.

1) Apply beamforming, channel switching, and dereverberation to convert far‑field recordings to near‑field.
2) Simulate room impulse responses and convolve them with near‑field data to add realistic reverberation.
3) Inject recorded environmental noises for additive noise augmentation.
4) Use data augmentation in both training and testing to increase diversity, yielding large performance gains.
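Steps 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the system's actual pipeline: the helper names `add_reverb` and `add_noise`, the toy exponentially decaying impulse response, and the 10 dB SNR are all hypothetical.

```python
import numpy as np

def add_reverb(clean, rir):
    """Convolve a near-field signal with a room impulse response (RIR)."""
    return np.convolve(clean, rir)[: len(clean)]   # trim the tail to the input length

def add_noise(signal, noise, snr_db):
    """Mix noise into signal at the requested signal-to-noise ratio in dB."""
    noise = np.resize(noise, len(signal))          # repeat or trim noise to match length
    sig_pow = np.mean(signal ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return signal + gain * noise

# Toy stand-ins (assumptions): random signals instead of real speech and noise
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                 # 1 s at 16 kHz
rir = np.exp(-np.arange(2000) / 300.0) * rng.standard_normal(2000)
noise = rng.standard_normal(8000)

reverberant = add_reverb(clean, rir)
noisy = add_noise(reverberant, noise, snr_db=10)
```

In practice the impulse responses would come from a room simulator or measured recordings, and the SNR would typically be sampled from a range rather than fixed.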

Combining TDNN, ResNet, and Transformer models with score normalization and a two‑stage scoring pipeline reduced minDCF by 0.2393 and EER by 3.16% compared with the baseline.
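For readers unfamiliar with the metrics, the sketch below shows one simple way to approximate the equal error rate (EER) from trial scores, together with z‑norm as one common form of score normalization. The summary does not say which normalization the system used, so `z_norm` is an assumption, and both function names are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping the threshold over all observed scores."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # Miss: a target trial scored below the threshold
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial scored at or above the threshold
        fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2
    return eer

def z_norm(score, cohort_scores):
    """Normalize a raw score by the mean and std of impostor cohort scores."""
    mean = sum(cohort_scores) / len(cohort_scores)
    var = sum((c - mean) ** 2 for c in cohort_scores) / len(cohort_scores)
    return (score - mean) / (var ** 0.5 + 1e-12)

# Toy scores (assumption): higher means "more likely the same speaker"
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
```

The EER is the operating point where the miss rate and the false‑alarm rate are equal; minDCF instead weights the two error types by application‑specific costs and a target prior.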

Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed

WaveGlow‑style neural vocoders are essential for high‑quality speech synthesis. The proposed Efficient WaveGlow replaces the WaveNet‑based affine coupling layers with FFTNet‑based transformations, employs group convolutions, and shares local conditioning across layers to reduce parameters.
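The reason the inner transform of a coupling layer can be swapped out is that the layer stays invertible no matter what function predicts the scale and shift. The sketch below shows this with a toy function; `toy_transform` is a hypothetical stand‑in, not the WaveNet‑based or FFTNet‑based network from either model.

```python
import numpy as np

def toy_transform(x_a, cond):
    """Hypothetical stand-in for the network that predicts log-scale and shift."""
    log_s = np.tanh(x_a + cond)          # bounded log-scale keeps the map well behaved
    t = 0.5 * x_a - cond
    return log_s, t

def coupling_forward(x, cond):
    x_a, x_b = np.split(x, 2)
    log_s, t = toy_transform(x_a, cond)
    z_b = x_b * np.exp(log_s) + t        # only the second half is transformed
    log_det = np.sum(log_s)              # log |det Jacobian| of the layer
    return np.concatenate((x_a, z_b)), log_det

def coupling_inverse(z, cond):
    z_a, z_b = np.split(z, 2)
    log_s, t = toy_transform(z_a, cond)  # z_a equals x_a, so the same scale and shift are recovered
    x_b = (z_b - t) * np.exp(-log_s)
    return np.concatenate((z_a, x_b))

# Toy inputs (assumption): random vectors in place of audio and mel conditioning
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
cond = rng.standard_normal(4)
z, log_det = coupling_forward(x, cond)
x_rec = coupling_inverse(z, cond)
```

Because the first half passes through unchanged, the inverse never needs to invert `toy_transform` itself, which is why a cheaper network can replace it without breaking the flow.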

Compared with the original WaveGlow, Efficient WaveGlow cuts computational cost and model size by more than 12× while maintaining audio quality, achieving a 6× speedup on CPU and 5× on a P40 GPU.
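As a back‑of‑the‑envelope illustration of why group convolutions shrink a model, the sketch below counts the parameters of a 1‑D convolution with and without grouping. The channel and kernel sizes are hypothetical and are not taken from the paper, and this single change does not by itself account for the reported 12× reduction.

```python
def conv1d_params(in_channels, out_channels, kernel_size, groups=1):
    """Parameter count of a 1-D convolution layer with bias."""
    assert in_channels % groups == 0 and out_channels % groups == 0
    # Each output channel only sees in_channels / groups input channels
    weights = (in_channels // groups) * out_channels * kernel_size
    return weights + out_channels

# Hypothetical layer sizes, for illustration only
dense = conv1d_params(256, 256, 3)
grouped = conv1d_params(256, 256, 3, groups=4)
ratio = dense / grouped
```

With `groups=4`, the weight tensor is four times smaller while the bias count is unchanged, so the overall ratio is slightly below 4.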
