Interspeech 2019 Highlights: End‑to‑End Speech AI Technologies and Key Paper Summaries
The article reviews Interspeech 2019, summarizing major trends and representative papers in end‑to‑end speech recognition, synthesis, natural language understanding, speaker recognition, and speech translation, while also highlighting best student papers and providing resources for further study.
DataFun invited Dr. Li Xianggang, head of Didi Voice, to share insights from the 2019 Interspeech conference, focusing on the evolution of speech technologies from algorithmic research to real‑world applications.
Conference Highlights: The first day featured Survey Talks on end‑to‑end ASR modeling (beyond HMMs) and attention mechanisms for speaker state recognition. Attention‑based models dominated the program, appearing in most ASR and speaker‑recognition papers.
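The attention mechanism referenced throughout the program is, at its core, scaled dot‑product attention. A minimal NumPy sketch (not tied to any specific paper's architecture):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (n_q, d) queries; K: (n_k, d) keys; V: (n_k, d_v) values.
    Returns a (n_q, d_v) matrix: each query's softmax-weighted mix of values.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the keys
    return weights @ V
```

In end‑to‑end ASR, the queries typically come from the decoder state and the keys/values from encoded acoustic frames, letting the model align output tokens to input audio without an explicit HMM alignment.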
Speech Recognition: Recent work emphasizes deep self‑attention networks, hybrid vs. attention models, and raw‑waveform acoustic modeling. Notable papers include “Very Deep Self‑Attention Networks for End‑to‑End Speech Recognition” (CMU/KIT) and “RWTH ASR Systems for LibriSpeech: Hybrid vs Attention”. Data augmentation (SpecAugment) and unsupervised pre‑training (wav2vec) were also highlighted.
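The core of SpecAugment is masking random time and frequency bands of the input spectrogram during training. A minimal sketch of that idea (function name, parameters, and the mean‑fill simplification are illustrative, not from the paper's official implementation):

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_width=8,
                 num_time_masks=2, time_width=20, rng=None):
    """SpecAugment-style masking on a (freq_bins, frames) log-mel spectrogram.

    Masked regions are filled with the spectrogram mean, a common simplification.
    Returns a new array; the input is left untouched.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    fill = spec.mean()
    for _ in range(num_freq_masks):          # mask random frequency bands
        w = int(rng.integers(0, freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w)))
        spec[f0:f0 + w, :] = fill
    for _ in range(num_time_masks):          # mask random time spans
        w = int(rng.integers(0, time_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w)))
        spec[:, t0:t0 + w] = fill
    return spec
```

Because the augmentation operates on features rather than raw audio, it is cheap enough to apply on the fly at every training step.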
Speech Synthesis: Discussion covered high‑quality, lightweight synthesis (e.g., the LPCNet vocoder) and GAN‑based voice conversion. StarGAN‑VC2 introduced a new conditional loss for voice conversion, and forward‑backward decoding was proposed to regularize end‑to‑end TTS.
Natural Language Understanding: End‑to‑end spoken language understanding increasingly relies on pre‑training (e.g., BERT‑style models) and multimodal inputs. Techniques such as multi‑task learning, frame‑level modeling, and GAN‑based adversarial training were presented.
Speaker Recognition: Sessions covered diarization, x‑vector clustering with Bayesian HMMs, LSTM‑based similarity scoring, and privacy‑preserving speaker verification. Advances in model architectures (TDNN, ResNet, ArcFace) and data augmentation (e.g., VAE‑based) were noted, as well as the growing concern of spoofing attacks.
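ArcFace, mentioned above, trains embeddings by adding an angular margin to the target class's cosine logit before the softmax. A NumPy sketch of that margin step (names and the margin/scale values are illustrative defaults, not from a specific system):

```python
import numpy as np

def arcface_logits(embeddings, weights, labels=None, margin=0.2, scale=30.0):
    """ArcFace-style logits: cosine similarities with an additive angular
    margin applied to each example's target class, then scaled.

    embeddings: (batch, d); weights: (classes, d); labels: (batch,) ints or None.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T  # cosine between each embedding and each class centre
    if labels is not None:
        idx = np.arange(len(labels))
        theta = np.arccos(np.clip(cos[idx, labels], -1.0 + 1e-7, 1.0 - 1e-7))
        cos[idx, labels] = np.cos(theta + margin)  # penalize the target logit
    return scale * cos
```

Only the target‑class logit is shifted; non‑target logits stay plain cosines, which is what forces embeddings of the same speaker to cluster more tightly on the hypersphere.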
Speech Translation: Both cascade and end‑to‑end approaches were surveyed. Knowledge‑distillation from text‑translation models and sequence‑to‑sequence speech‑to‑speech translation were highlighted as promising directions.
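The knowledge‑distillation idea mentioned here boils down to training the speech‑translation student to match the softened output distribution of a text‑translation teacher. A generic sketch of that objective (not any specific paper's loss; the temperature and T² scaling follow the standard distillation recipe):

```python
import numpy as np

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Generic knowledge-distillation loss: KL(teacher || student) between
    temperature-softened distributions, scaled by T^2."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    log_p = log_softmax(teacher_logits / T)   # teacher (e.g., a text MT model)
    log_q = log_softmax(student_logits / T)   # student (speech translation model)
    p = np.exp(log_p)
    return float((p * (log_p - log_q)).sum(axis=-1).mean() * T * T)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so it can be mixed with the usual cross‑entropy on reference translations.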
Best Student Papers: The awards recognized (1) an adversarially trained Korean singing‑voice synthesis system, (2) evaluation of near‑end listening enhancement algorithms in realistic environments, and (3) “Language Modeling with Deep Transformers”, which was already discussed earlier.
Community Invitation: Readers are encouraged to join the DataFunTalk voice technology group for direct interaction with peers and to follow the DataFunTalk public account for more AI and big‑data content.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.