How Deep‑FSMN and Low Frame Rate Accelerate Speech Recognition

This article introduces the Deep‑FSMN (DFSMN) architecture and its integration with low‑frame‑rate (LFR) processing, showing how the combined LFR‑DFSMN acoustic model achieves higher accuracy, smaller model size, faster training, and lower latency than traditional BLSTM‑based speech recognition systems on both English and Chinese large‑vocabulary tasks.

Alibaba Cloud Developer

Oct 31, 2018

Research Background

In recent years, deep neural networks have become the mainstream acoustic models for large‑vocabulary continuous speech recognition. Because speech signals exhibit strong long‑term dependencies, recurrent neural networks (RNNs) such as LSTMs are widely used, but training them with backpropagation through time (BPTT) is slow and prone to vanishing gradients. We previously proposed the feedforward sequential memory network (FSMN), a non‑recurrent architecture that models long‑term context efficiently.

FSMN Review

The original FSMN adds memory blocks to the hidden layers of a feed‑forward fully connected network. Each memory block behaves like a high‑order FIR filter over the hidden activations, so the network can model long‑term context while keeping training stable. Variants include scalar FSMN (sFSMN) and vectorized FSMN (vFSMN), each in unidirectional and bidirectional forms.
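The FIR analogy can be made concrete with a small NumPy sketch of a vectorized memory block. This is an illustration rather than the authors' implementation: the function name `fsmn_memory`, the argument names, and the zero padding at utterance boundaries are assumptions made for this example.

```python
import numpy as np

def fsmn_memory(h, a, c=None):
    """Sketch of a vFSMN-style memory block (hypothetical helper).

    h: (T, D) hidden activations for one utterance.
    a: (N1 + 1, D) look-back taps; a[i] weights h[t - i].
    c: optional (N2, D) look-ahead taps; c[j - 1] weights h[t + j].
    Frames outside the utterance are treated as zeros (an assumption).
    """
    T = h.shape[0]
    out = np.zeros(h.shape, dtype=float)
    # Unidirectional part: a tapped delay line over past frames, like an FIR filter.
    for i in range(min(a.shape[0], T)):
        out[i:] += a[i] * h[:T - i]
    # Bidirectional extension: add taps over future frames.
    if c is not None:
        for j in range(1, min(c.shape[0], T - 1) + 1):
            out[:T - j] += c[j - 1] * h[j:]
    return out

h = np.arange(8.0).reshape(4, 2)
uni = fsmn_memory(h, a=np.ones((2, 2)))
bi = fsmn_memory(h, a=np.ones((2, 2)), c=np.ones((1, 2)))
```

With all taps set to one, each output frame in `uni` is the sum of the current and previous frame, and `bi` additionally adds the next frame. In a scalar FSMN the taps would be one number per delay instead of one vector per delay.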

Compact FSMN (cFSMN) reduces the parameter count by inserting a low‑rank linear projection layer after each hidden layer and attaching the memory block to that projection output instead of to the full hidden layer.
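A rough sketch of one cFSMN layer, under stated assumptions, looks as follows. The names `shift` and `cfsmn_layer`, the ReLU nonlinearity, and the zero padding at utterance edges are choices made for this example and are not taken from the article.

```python
import numpy as np

def shift(x, k):
    """Return y with y[t] = x[t - k]; zero-padded at the edges. k may be negative."""
    T = x.shape[0]
    y = np.zeros(x.shape, dtype=float)
    if 0 <= k < T:
        y[k:] = x[:T - k]
    elif -T < k < 0:
        y[:T + k] = x[-k:]
    return y

def cfsmn_layer(h, V, a, c, U, b):
    """One cFSMN-style layer (hypothetical helper).

    h: (T, D) hidden activations, V: (D, P) low-rank projection with P < D,
    a: (N1 + 1, P) look-back taps, c: (N2, P) look-ahead taps,
    U: (P, D_out) and b: (D_out,) feed the next hidden layer.
    """
    p = h @ V                      # project to the small dimension P
    p_mem = p.copy()               # the memory output keeps the projection itself
    for i in range(a.shape[0]):    # past context
        p_mem += a[i] * shift(p, i)
    for j in range(1, c.shape[0] + 1):  # future context
        p_mem += c[j - 1] * shift(p, -j)
    return np.maximum(0.0, p_mem @ U + b)  # ReLU is an assumption here

rng = np.random.default_rng(0)
h = rng.standard_normal((5, 8))
V = rng.standard_normal((8, 3))
U = rng.standard_normal((3, 8))
out = cfsmn_layer(h, V, np.zeros((3, 3)), np.zeros((2, 3)), U, np.zeros(8))
```

Because the memory taps live in the projected dimension `P` rather than the hidden dimension `D`, both the taps and the weight matrices around them are smaller than in the original FSMN.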

DFSMN Introduction

Building on FSMN, we propose Deep‑FSMN (DFSMN) by inserting skip connections between adjacent memory modules, enabling gradients to flow directly from higher to lower layers and allowing very deep networks without vanishing gradients. We also incorporate dilation‑like stride factors in the memory blocks to enlarge the effective receptive field while controlling latency.
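The skip connection and the stride factors can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions: the helper names `shift`, `dfsmn_memory`, and `lookahead_frames` are hypothetical, the skip connection is taken to be an identity mapping, and frames outside the utterance are treated as zeros.

```python
import numpy as np

def shift(x, k):
    """Return y with y[t] = x[t - k]; zero-padded at the edges. k may be negative."""
    T = x.shape[0]
    y = np.zeros(x.shape, dtype=float)
    if 0 <= k < T:
        y[k:] = x[:T - k]
    elif -T < k < 0:
        y[:T + k] = x[-k:]
    return y

def dfsmn_memory(p, prev_mem, a, c, s1=1, s2=1):
    """DFSMN-style memory block (hypothetical helper).

    p: (T, P) projection output of this layer.
    prev_mem: (T, P) memory output of the layer below (the skip connection).
    a: (N1 + 1, P) look-back taps applied at stride s1.
    c: (N2, P) look-ahead taps applied at stride s2.
    """
    out = prev_mem + p                       # identity skip plus current projection
    for i in range(a.shape[0]):
        out += a[i] * shift(p, s1 * i)       # taps at t, t - s1, t - 2*s1, ...
    for j in range(1, c.shape[0] + 1):
        out += c[j - 1] * shift(p, -s2 * j)  # taps at t + s2, t + 2*s2, ...
    return out

def lookahead_frames(n2_per_layer, s2_per_layer):
    """Total future context, in frames, needed by a stack of memory blocks."""
    return sum(n2 * s2 for n2, s2 in zip(n2_per_layer, s2_per_layer))

p = np.arange(6.0).reshape(6, 1)
mem = dfsmn_memory(p, np.zeros((6, 1)), a=np.ones((2, 1)), c=np.ones((1, 1)), s1=2, s2=2)
```

The stride enlarges the span each block covers without adding taps, while `lookahead_frames` shows why latency stays controllable: only the look-ahead order and stride of each layer contribute to the delay, so shrinking them trades future context for lower latency.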

LFR‑DFSMN Acoustic Model

We combine DFSMN with a low‑frame‑rate (LFR) scheme that concatenates several consecutive frames into a single input and runs the model at one‑third of the original frame rate, which greatly accelerates both training and decoding. The final acoustic model consists of 10 DFSMN layers plus 2 DNN layers with LFR applied, achieving a three‑fold speedup.
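The frame stacking behind LFR can be sketched as below. The article states only that several frames are concatenated and the frame rate drops to one third, so the window size of 3, the repetition of the last frame as padding, and the name `lfr_stack` are assumptions made for this example.

```python
import numpy as np

def lfr_stack(feats, stack=3, stride=3):
    """Stack `stack` consecutive frames and keep one stacked frame every `stride` frames.

    feats: (T, D) acoustic features for one utterance.
    Returns an array of shape (ceil(T / stride), stack * D).
    The tail is padded by repeating the last frame (an assumption).
    """
    T, D = feats.shape
    if T == 0:
        return np.zeros((0, stack * D), dtype=float)
    n_out = (T + stride - 1) // stride          # ceil(T / stride)
    needed = (n_out - 1) * stride + stack       # frames required for full windows
    if needed > T:
        pad = np.repeat(feats[-1:], needed - T, axis=0)
        feats = np.concatenate([feats, pad], axis=0)
    windows = [feats[k * stride:k * stride + stack].reshape(-1) for k in range(n_out)]
    return np.stack(windows)

feats = np.arange(14.0).reshape(7, 2)   # 7 frames, 2 features each
lfr = lfr_stack(feats)                  # 3 stacked frames, 6 features each
```

The model then evaluates one stacked frame where it previously evaluated three, which is where the reduction in acoustic model computation comes from.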

Experimental Results

English recognition: On a 2000‑hour English Fisher (FSH) task, deeper DFSMN models (6, 8, 10, and 12 layers) show consistent WER improvements; the 10‑layer DFSMN outperforms the state‑of‑the‑art BLSTM by 1.5% absolute WER with fewer parameters.

Chinese recognition: On a 5000‑hour Chinese task, LFR‑DFSMN achieves more than 20% relative WER reduction over LFR‑LCBLSTM, with up to three times faster training and a one‑third lower real‑time factor. Even with only 5 frames of latency, DFSMN still outperforms LFR‑LCBLSTM.

Overall, DFSMN combined with LFR delivers higher accuracy, smaller model size, faster training, and lower decoding latency compared with BLSTM‑based systems.

Authors: Zhang Shiliang, Lei Ming, Yan Zhijie, Dai Lirong. Published in ICASSP‑2018.
