
Semantic Data Augmentation and GigaSpeech: Highlights of Two INTERSPEECH 2021 Papers from the Beike Voice Team

The article summarizes two INTERSPEECH 2021 papers from Beike's voice technology team, detailing a grammar‑based semantic data augmentation method that improves end‑to‑end Chinese speech recognition and introducing GigaSpeech, a 10,000‑hour, multi‑domain English speech dataset for robust ASR research.

Beike Product & Technology

INTERSPEECH, the flagship international conference of the International Speech Communication Association (ISCA), recently announced its 2021 accepted papers, and two contributions from Beike's voice technology team were selected for presentation.

The first paper proposes a grammar‑based semantic data augmentation technique to address data scarcity in end‑to‑end speech recognition training. By applying Chinese grammatical transformations (e.g., swapping subject and object) to both text and audio, the method expands the semantic diversity of the training set. The augmentation pipeline consists of four steps: (1) word‑level POS tagging and segmentation, (2) MFCC extraction and alignment of audio‑text pairs, (3) conversion of segmented text to phone IDs and alignment with frame‑level labels, and (4) recombination of segments according to seven rule‑based sentence transformations.
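The recombination step (step 4) can be sketched as a rule operating on POS‑tagged, audio‑aligned word segments. The snippet below is a minimal illustration of one such rule (subject‑object swap), not the paper's exact rule set; the tuple layout and the `frames` placeholder for aligned audio are assumptions for the sketch.

```python
def swap_subject_object(segments):
    """Swap the first noun before the verb with the first noun after it.

    `segments` is a list of (word, pos, frames) tuples, where `frames`
    stands in for the aligned audio frames of that word. Because text
    and audio travel together in each tuple, swapping tuples augments
    both modalities at once.
    """
    verbs = [i for i, (_, pos, _) in enumerate(segments) if pos == "v"]
    if not verbs:
        return segments  # no verb: leave the sentence unchanged
    v = verbs[0]
    subj = next((i for i in range(v) if segments[i][1] == "n"), None)
    obj = next((i for i in range(v + 1, len(segments)) if segments[i][1] == "n"), None)
    if subj is None or obj is None:
        return segments  # rule does not apply to this sentence
    out = list(segments)
    out[subj], out[obj] = out[obj], out[subj]
    return out

# Toy aligned sentence: "小明 喜欢 音乐" ("Xiaoming likes music")
sent = [("小明", "n", "frames_0"), ("喜欢", "v", "frames_1"), ("音乐", "n", "frames_2")]
print([w for w, _, _ in swap_subject_object(sent)])  # ['音乐', '喜欢', '小明']
```

The paper's seven transformations would each be a rule of this shape, applied to segments produced by the earlier POS‑tagging and alignment steps.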

Experiments on HKUST and AISHELL‑1 datasets using Conformer and Transformer models demonstrate that combining audio‑side and text‑side semantic augmentation yields the greatest performance gains, while pure text augmentation alone offers no benefit. The study also reveals that the proportion and variety of augmented sentences affect results, with excessive augmentation leading to diminishing returns.

The second paper introduces GigaSpeech, a 10,000‑hour English speech corpus covering 24 domains and multiple speaking styles (reading and free conversation). Compared with existing corpora such as WSJ, Switchboard, TED‑LIUM 3, Fisher, and LibriSpeech, GigaSpeech provides an order‑of‑magnitude increase in scale, enabling more reliable evaluation of ASR algorithms on large‑scale, multi‑domain data. Baseline experiments across four major speech toolkits (Athena, ESPnet, Kaldi, Pika) and various model architectures are reported.
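The order‑of‑magnitude claim can be made concrete by lining up rough public size figures for the corpora mentioned above; the hour counts below are approximate and are included only to illustrate the scale gap.

```python
# Approximate transcribed-speech sizes (hours) of common English ASR
# corpora; figures are rough public numbers, not taken from the paper.
corpus_hours = {
    "WSJ": 80,
    "Switchboard": 300,
    "TED-LIUM 3": 452,
    "LibriSpeech": 960,
    "Fisher": 2000,
    "GigaSpeech": 10000,
}

for name, hours in sorted(corpus_hours.items(), key=lambda kv: kv[1]):
    print(f"{name:>12}: {hours:>6} h")
```

Even against LibriSpeech, the largest widely used read‑speech corpus in this list, GigaSpeech is roughly ten times larger, which is what allows multi‑domain evaluation at scale.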

Both papers underscore the importance of large, diverse training data and innovative augmentation strategies for advancing industrial‑grade speech recognition systems.

data augmentation, speech recognition, Chinese, end-to-end models, GigaSpeech, INTERSPEECH, semantic augmentation
Written by

Beike Product & Technology

As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.
