Semantic Data Augmentation and GigaSpeech: Highlights of Two INTERSPEECH 2021 Papers from the Beike Voice Team
The article summarizes two INTERSPEECH 2021 papers from Beike's voice technology team, detailing a grammar‑based semantic data augmentation method that improves end‑to‑end Chinese speech recognition and introducing GigaSpeech, a 10,000‑hour multi‑domain English speech dataset for robust ASR research.
INTERSPEECH, the flagship international conference of the International Speech Communication Association (ISCA), recently announced its accepted papers for 2021, among them two contributions from Beike's voice technology team.
The first paper proposes a grammar‑based semantic data augmentation technique to address data scarcity in end‑to‑end speech recognition training. By applying Chinese grammatical transformations (e.g., swapping subject and object) to both text and audio, the method expands the semantic diversity of the training set. The augmentation pipeline consists of four steps: (1) word‑level POS tagging and segmentation, (2) MFCC extraction and alignment of audio‑text pairs, (3) conversion of segmented text to phone IDs and alignment with frame‑level labels, and (4) recombination of segments according to seven rule‑based sentence transformations.
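The recombination step above can be sketched in code. The snippet below is a minimal, hypothetical illustration of one rule ("swap subject and object") applied jointly to segmented text and its frame‑aligned audio; the data layout, the noun heuristic, and all names are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of rule-based segment recombination: each word carries
# its POS tag and the audio frames aligned to it, so swapping text segments
# swaps the corresponding audio frames as well.

def swap_subject_object(segments):
    """segments: list of (word, pos, frames) triples, where frames is the
    list of frame-level feature vectors aligned to that word."""
    nouns = [i for i, (_, pos, _) in enumerate(segments) if pos == "n"]
    if len(nouns) < 2:
        return segments  # rule does not apply to this sentence
    i, j = nouns[0], nouns[-1]  # crude heuristic: first noun = subject, last = object
    out = list(segments)
    out[i], out[j] = out[j], out[i]
    return out

# Toy example with fake one-frame "audio" per word.
sent = [("cat", "n", [[0.1]]), ("chases", "v", [[0.2]]), ("dog", "n", [[0.3]])]
aug = swap_subject_object(sent)
aug_text = " ".join(w for w, _, _ in aug)
aug_audio = [f for _, _, frames in aug for f in frames]
```

Because text and audio are recombined together, the augmented pair remains a valid (utterance, transcript) training example rather than synthetic text alone.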
Experiments on HKUST and AISHELL‑1 datasets using Conformer and Transformer models demonstrate that combining audio‑side and text‑side semantic augmentation yields the greatest performance gains, while pure text augmentation alone offers no benefit. The study also reveals that the proportion and variety of augmented sentences affect results, with excessive augmentation leading to diminishing returns.
The second paper introduces GigaSpeech, a 10,000‑hour English speech corpus covering 24 domains and multiple speaking styles (read speech and spontaneous conversation). Compared with existing corpora such as WSJ, Switchboard, TED‑LIUM 3, Fisher, and LibriSpeech, GigaSpeech provides an order‑of‑magnitude increase in scale, enabling more reliable evaluation of ASR algorithms on large‑scale, multi‑domain data. Baseline experiments across four major speech toolkits (Athena, ESPnet, Kaldi, Pika) and various model architectures are reported.
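To make the multi‑domain aspect concrete, the sketch below tallies hours per domain from a list of utterance records, the kind of summary one would compute before training or evaluating on a corpus like GigaSpeech. The record fields (`domain`, `duration`) are assumptions for illustration, not GigaSpeech's actual manifest schema.

```python
# Hypothetical per-domain hour tally for a multi-domain speech corpus.
from collections import defaultdict

def hours_per_domain(utterances):
    """utterances: iterable of dicts with 'domain' and 'duration' (seconds).
    Returns a dict mapping each domain to its total hours of audio."""
    totals = defaultdict(float)
    for u in utterances:
        totals[u["domain"]] += u["duration"]
    return {d: s / 3600.0 for d, s in totals.items()}

# Toy manifest: two podcast utterances and one audiobook utterance.
records = [
    {"domain": "podcast", "duration": 7200.0},
    {"domain": "audiobook", "duration": 3600.0},
    {"domain": "podcast", "duration": 1800.0},
]
summary = hours_per_domain(records)  # e.g. {"podcast": 2.5, "audiobook": 1.0}
```

Such per‑domain breakdowns are what make multi‑domain evaluation meaningful: a model's word error rate can then be reported separately for, say, podcasts versus audiobooks.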
Both papers underscore the importance of large, diverse training data and innovative augmentation strategies for advancing industrial‑grade speech recognition systems.
Beike Product & Technology