How Deep‑FSMN and Low Frame Rate Accelerate Speech Recognition
This article introduces the Deep‑FSMN (DFSMN) architecture and its integration with low‑frame‑rate (LFR) processing, showing how the combined LFR‑DFSMN acoustic model achieves higher accuracy, smaller model size, faster training, and lower latency than traditional BLSTM‑based speech recognition systems on both English and Chinese large‑vocabulary tasks.
Research Background
In recent years, deep neural networks have become the mainstream acoustic models for large-vocabulary continuous speech recognition. Because speech signals exhibit strong long-term dependencies, recurrent neural networks (RNNs) such as LSTMs are widely used, but training them with backpropagation through time (BPTT) is slow and prone to vanishing gradients. We previously proposed the feedforward sequential memory network (FSMN), a non-recurrent architecture that models long-term context efficiently.
FSMN Review
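The original FSMN adds memory blocks to the hidden layers of a feed-forward fully connected network. Each memory block acts like a high-order FIR filter over neighboring hidden activations, so the network can model long-term context without recurrence and trains stably. Variants include the scalar FSMN (sFSMN) and the vector FSMN (vFSMN), each available in unidirectional (past context only) and bidirectional (past and future context) forms.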
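The FIR-filter view of a memory block can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `memory_block`, the list-of-lists layout, and the choice to skip taps that fall outside the utterance are all assumptions made for the example. Coefficients are per-dimension vectors, as in the vector variant; using the same value in every dimension gives the scalar variant.

```python
def memory_block(h, lookback, lookahead=()):
    """Apply an FIR-style memory block to a sequence of hidden vectors.

    h         : list of T hidden vectors (each a list of D floats)
    lookback  : coefficient vectors a_0..a_N1; a_i weights h[t - i]
    lookahead : coefficient vectors c_1..c_N2; c_j weights h[t + j]
                (leave empty for a unidirectional block)
    Taps that fall outside the utterance are skipped (an assumption).
    """
    T, D = len(h), len(h[0])
    out = []
    for t in range(T):
        m = [0.0] * D
        for i, a in enumerate(lookback):            # past and current frames
            if t - i >= 0:
                for d in range(D):
                    m[d] += a[d] * h[t - i][d]
        for j, c in enumerate(lookahead, start=1):  # future frames
            if t + j < T:
                for d in range(D):
                    m[d] += c[d] * h[t + j][d]
        out.append(m)
    return out


h = [[1.0], [2.0], [3.0]]
uni = memory_block(h, lookback=[[1.0], [0.5]])
bi = memory_block(h, lookback=[[1.0], [0.5]], lookahead=[[0.25]])
```

Because every output frame depends only on a fixed window of inputs, all frames can be computed in parallel, which is the property that separates this design from a recurrent layer.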
Compact FSMN (cFSMN) reduces the parameter count by inserting a low-rank linear projection layer after each hidden layer and reformulating the memory block to operate on that projection.
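The saving from a low-rank projection is simple arithmetic, sketched below. The layer sizes (2048 hidden units, 512 projection units) are illustrative assumptions rather than figures from this article, and biases and memory-block coefficients are ignored.

```python
def dense_params(d_in, d_out):
    """Weights in one full hidden-to-hidden matrix (biases ignored)."""
    return d_in * d_out


def low_rank_params(d_in, rank, d_out):
    """Weights when the same mapping is factored through a projection of size `rank`."""
    return d_in * rank + rank * d_out


# Hypothetical sizes chosen only for illustration.
full = dense_params(2048, 2048)
factored = low_rank_params(2048, 512, 2048)
ratio = factored / full
```

With these assumed sizes the factored form needs half the weights of the full matrix; a smaller projection would save more, at the cost of a tighter bottleneck.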
DFSMN Introduction
Building on FSMN, we propose Deep-FSMN (DFSMN), which inserts skip connections between the memory blocks of adjacent layers. Gradients can then flow directly from higher layers to lower ones, so very deep networks can be trained without vanishing gradients. We also add dilation-like stride factors to the memory blocks, which enlarge the effective receptive field while keeping latency under control.
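The two additions, the skip connection and the stride factors, can be sketched as follows. This is a simplified reading of the description above, not the reference implementation: features are reduced to one scalar per frame for brevity, the skip connection is assumed to be an identity mapping, and the names `dfsmn_memory` and `total_lookahead` are hypothetical.

```python
def dfsmn_memory(p, prev_mem, lookback, lookahead, s1=1, s2=1):
    """One DFSMN-style memory block over a scalar-per-frame sequence.

    p         : this layer's projection outputs, one float per frame
    prev_mem  : memory-block outputs of the layer below, or None for the first layer
    lookback  : coefficients a_0..a_N1; a_i weights p[t - s1 * i]
    lookahead : coefficients c_1..c_N2; c_j weights p[t + s2 * j]
    s1, s2    : strides for the look-back and look-ahead taps
    """
    T = len(p)
    out = []
    for t in range(T):
        m = p[t]
        if prev_mem is not None:
            m += prev_mem[t]                 # identity skip connection (assumed)
        for i, a in enumerate(lookback):
            if t - s1 * i >= 0:
                m += a * p[t - s1 * i]       # strided taps into the past
        for j, c in enumerate(lookahead, start=1):
            if t + s2 * j < T:
                m += c * p[t + s2 * j]       # strided taps into the future
        out.append(m)
    return out


def total_lookahead(layers):
    """Future frames the whole stack needs: sum of N2 * s2 over (N2, s2) pairs."""
    return sum(n2 * s2 for n2, s2 in layers)


p = [1.0, 2.0, 3.0, 4.0]
first = dfsmn_memory(p, None, lookback=[0.0, 1.0], lookahead=[1.0], s1=2, s2=1)
second = dfsmn_memory(p, [10.0] * 4, lookback=[0.0, 1.0], lookahead=[1.0], s1=2, s2=1)
frames_ahead = total_lookahead([(1, 1)] * 5)  # hypothetical 5-layer configuration
```

The look-ahead order and stride of each layer add up across the stack, so shrinking them is the lever for trading future context against latency. The five-layer setting in the usage lines is only an example, not a configuration reported in the article.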
LFR‑DFSMN Acoustic Model
We combine DFSMN with a low-frame-rate (LFR) scheme that concatenates several consecutive frames into a single input and reduces the frame rate to one third of the original, which greatly accelerates both training and decoding. The final LFR-DFSMN acoustic model consists of 10 DFSMN layers followed by 2 DNN layers and runs roughly three times faster than its full-frame-rate counterpart.
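The frame-stacking step can be sketched as below. The function name `lfr_stack`, the default of stacking three frames, and the decision to drop a trailing partial window are assumptions for the example; production front ends often stack a wider window and pad the utterance edges.

```python
def lfr_stack(frames, stack=3, skip=3):
    """Concatenate `stack` consecutive frames and advance by `skip` frames.

    frames : list of T feature vectors (each a list of floats)
    Returns roughly T / skip super-frames of dimension stack * D.
    A trailing window with fewer than `stack` frames is dropped (an assumption).
    """
    out = []
    for start in range(0, len(frames) - stack + 1, skip):
        merged = []
        for frame in frames[start:start + stack]:
            merged.extend(frame)             # concatenate along the feature axis
        out.append(merged)
    return out


# Nine 2-dimensional frames become three 6-dimensional super-frames.
frames = [[float(t), float(t) + 0.5] for t in range(9)]
stacked = lfr_stack(frames)
```

The acoustic model then evaluates one super-frame where it previously evaluated three frames, which is where the reduction in compute comes from.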
Experimental Results
English recognition: On a 2000-hour English Fisher (FSH) task, deeper DFSMN models (6, 8, 10, and 12 layers) show consistent word error rate (WER) improvements; the 10-layer DFSMN outperforms the state-of-the-art BLSTM by 1.5% absolute WER with fewer parameters.
Chinese recognition: On a 5000-hour Chinese task, LFR-DFSMN achieves a relative WER reduction of more than 20% over LFR-LCBLSTM, with up to three times faster training and a one-third lower real-time factor. Even with latency limited to 5 frames, DFSMN still outperforms LFR-LCBLSTM.
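The results above mix absolute and relative WER reductions, which are easy to confuse. The helper below shows the difference; the WER values in the usage lines are made up for illustration and are not results from the article.

```python
def absolute_reduction(baseline_wer, new_wer):
    """Difference in WER, in percentage points."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a baseline at 10.0% WER.
abs_gain = absolute_reduction(10.0, 8.5)   # 1.5 points absolute
rel_gain = relative_reduction(10.0, 8.0)   # 20% relative
```

The same absolute gain corresponds to a larger relative gain when the baseline WER is lower, so the two figures are only comparable when the baseline is stated.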
Overall, DFSMN combined with LFR delivers higher accuracy, smaller model size, faster training, and lower decoding latency compared with BLSTM‑based systems.
Authors: Zhang Shiliang, Lei Ming, Yan Zhijie, Dai Lirong. Published in ICASSP‑2018.
