
How DFSMN Sets a New Record in Speech Recognition Accuracy and Speed

Alibaba's DAMO Academy has open‑sourced the Deep‑Feedforward Sequential Memory Network (DFSMN), a next‑generation speech‑recognition model that achieves a world‑record 96.04% accuracy on LibriSpeech, trains three times faster than LSTM, halves model size, and dramatically speeds up real‑time decoding.

Alibaba Cloud Developer

Jun 8, 2018


Alibaba Open‑Sources DFSMN Speech‑Recognition Model

Alibaba's DAMO Academy recently released the Deep‑Feedforward Sequential Memory Network (DFSMN), a new‑generation acoustic model that raises the reported speech‑recognition accuracy record on the LibriSpeech benchmark to 96.04%, which corresponds to an error rate of 3.96%.

Why DFSMN Improves Over Traditional Models

Compared with the widely used LSTM acoustic model, DFSMN trains faster and yields higher recognition accuracy. In applications such as smart speakers and home‑automation devices, DFSMN‑based models train about three times faster and decode roughly twice as fast.

DFSMN Architecture

DFSMN builds on the Feedforward Sequential Memory Network (FSMN) by adding deeper layers, low‑frame‑rate (LFR) processing, and skip connections that alleviate vanishing gradients. The memory blocks act like high‑order FIR filters over the hidden activations, so the network can model long‑term dependencies with fewer parameters than recurrent networks.
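
To make the FIR-filter analogy concrete, here is a minimal NumPy sketch of a bidirectional memory block: each output frame is a tap-weighted sum of the current, past, and future hidden frames. The function name `fsmn_memory`, the tap arrays `a` and `c`, and the zero padding at the sequence edges are assumptions made for illustration, not details taken from the released model.

```python
import numpy as np

def fsmn_memory(h, a, c):
    """Apply an FIR-style memory block to hidden activations.

    h: (T, D) hidden activations, one row per frame
    a: (N1 + 1, D) look-back taps; a[i] weights frame t - i
    c: (N2, D) look-ahead taps; c[j - 1] weights frame t + j
    Frames outside [0, T) are treated as zeros (an assumption).
    """
    T, _ = h.shape
    n_back, n_ahead = a.shape[0] - 1, c.shape[0]
    # After padding, padded[n_back + t] holds h[t].
    padded = np.pad(h, ((n_back, n_ahead), (0, 0)))
    m = np.zeros_like(h, dtype=float)
    for i in range(n_back + 1):
        # Element-wise tap applied to h[t - i] for every t at once.
        m += a[i] * padded[n_back - i : n_back - i + T]
    for j in range(1, n_ahead + 1):
        # Element-wise tap applied to h[t + j] for every t at once.
        m += c[j - 1] * padded[n_back + j : n_back + j + T]
    return m

# Toy input: 4 frames, 2 dimensions, one look-back tap and one look-ahead tap.
h = np.arange(8.0).reshape(4, 2)
m = fsmn_memory(h, a=np.ones((2, 2)), c=np.ones((1, 2)))
```

Because the taps are fixed-length and there is no recurrence, every frame can be computed in parallel, which is the property the article credits for faster training.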

From FSMN to Compact FSMN (cFSMN) and Deep‑FSMN (DFSMN)

Standard FSMN struggles to train very deep structures due to gradient issues. Compact FSMN (cFSMN) introduces low‑rank matrix factorization to reduce parameters. DFSMN further adds skip connections and dilation‑style stride factors, allowing networks with ten cFSMN layers plus two DNN layers to be trained stably.
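
The following is a rough sketch of how those three ideas fit together in one layer: a low-rank projection, a memory block whose taps are spaced by a stride, and an identity skip connection from the previous layer's memory output. The names `strided_memory` and `dfsmn_layer`, the omission of bias terms, and the ReLU activation are assumptions for illustration only.

```python
import numpy as np

def strided_memory(p, a, c, s1=1, s2=1):
    """Sum of look-back taps at t - s1*i and look-ahead taps at t + s2*j."""
    T = p.shape[0]
    m = np.zeros_like(p, dtype=float)
    for i in range(a.shape[0]):
        n = max(T - s1 * i, 0)          # number of frames with a valid past
        m[T - n:] += a[i] * p[:n]
    for j in range(1, c.shape[0] + 1):
        n = max(T - s2 * j, 0)          # number of frames with a valid future
        m[:n] += c[j - 1] * p[T - n:]
    return m

def dfsmn_layer(h, prev_mem, V, a, c, U, s1=1, s2=1):
    """One projection + memory + skip step; returns (next hidden, memory)."""
    p = h @ V                            # low-rank projection to a small dimension
    mem = p + strided_memory(p, a, c, s1, s2)
    if prev_mem is not None:
        mem = mem + prev_mem             # identity skip between memory blocks
    h_next = np.maximum(mem @ U, 0.0)    # expand back to hidden size, ReLU
    return h_next, mem

# Hypothetical sizes: 6 frames, hidden size 8, projection size 4.
rng = np.random.default_rng(0)
h0 = rng.standard_normal((6, 8))
V, U = rng.standard_normal((8, 4)), rng.standard_normal((4, 8))
a, c = rng.standard_normal((3, 4)), rng.standard_normal((2, 4))
h1, m1 = dfsmn_layer(h0, None, V, a, c, U, s1=2, s2=2)
h2, m2 = dfsmn_layer(h1, m1, V, a, c, U, s1=2, s2=2)
```

The skip path adds `m1` directly into `m2`, so gradients can reach lower memory blocks without passing through every nonlinearity. As a sense of scale for the low-rank step, with a hypothetical hidden size of 2048 and projection size of 512, the pair `V` and `U` holds 2 × 2048 × 512 ≈ 2.1 million weights, against about 4.2 million for a full 2048 × 2048 matrix.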

Performance Comparison

On a 2000‑hour English task, DFSMN reduces word error rate (WER) to 9.4%, a relative 14% improvement over BLSTM.

BLSTM: 10.9% WER

cFSMN: 10.8% WER

DFSMN: 9.4% WER

When combined with LFR, DFSMN achieves a 20% relative character error rate (CER) reduction compared with LFR‑LCBLSTM across two product lines.

LFR‑LCBLSTM: 18.92% CER (Product A), 10.21% CER (Product B)

LFR‑DFSMN: 15.00% CER (Product A, 20.72% relative reduction), 8.04% CER (Product B, 21.25% relative reduction)
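
The relative figures in this section follow from the absolute error rates. A small check, where `relative_reduction` is a helper name chosen here for illustration:

```python
def relative_reduction(baseline, improved):
    """Relative error-rate reduction, in percent of the baseline."""
    return 100.0 * (baseline - improved) / baseline

# Error rates as reported in this article.
print(f"WER, BLSTM -> DFSMN: {relative_reduction(10.9, 9.4):.2f}%")
print(f"CER, Product A:      {relative_reduction(18.92, 15.00):.2f}%")
print(f"CER, Product B:      {relative_reduction(10.21, 8.04):.2f}%")
```

The WER comparison works out to 13.76%, which the text rounds to 14%; the two CER comparisons reproduce the 20.72% and 21.25% figures.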

Training Efficiency

Using Alibaba's MaxCompute platform with 8 machines and 16 GPUs, LFR‑DFSMN processes an epoch in 3.4 hours, about three times faster than LFR‑LCBLSTM (10.8 hours). For 20,000 hours of data, convergence typically requires only 3–4 epochs, allowing a full training run in about two days.

LFR‑LCBLSTM epoch time: 10.8 h

LFR‑DFSMN epoch time: 3.4 h

Decoding Latency, Speed, and Model Size

On a test set, LFR‑DFSMN finishes decoding in 142 seconds, compared with 956 seconds for LCBLSTM, 377 seconds for DFSMN, and 339 seconds for LFR‑LCBLSTM. The model is roughly half the size of LFR‑LCBLSTM while decoding about 2.4 times faster (339 s versus 142 s).

LCBLSTM: 956 s

DFSMN: 377 s

LFR‑LCBLSTM: 339 s

LFR‑DFSMN: 142 s
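
Both LFR variants decode faster than their full-frame-rate counterparts, which is consistent with the model having fewer frames to score. The article does not give the LFR configuration, so the sketch below assumes one common recipe: stack three consecutive feature frames and keep every third stacked frame. The function name and the repeat-last-frame padding are likewise assumptions.

```python
import numpy as np

def to_low_frame_rate(feats, stack=3, stride=3):
    """Stack `stack` consecutive frames and keep every `stride`-th result.

    feats: (T, D) frame-level features.
    Returns an array of shape (ceil(T / stride), stack * D).
    The last frame is repeated so the final window is always full.
    """
    T, _ = feats.shape
    tail = np.repeat(feats[-1:], stack - 1, axis=0)
    padded = np.concatenate([feats, tail])
    starts = np.arange(0, T, stride)
    return np.stack([padded[s : s + stack].reshape(-1) for s in starts])

# 7 frames of 2-dimensional features become 3 frames of 6 dimensions.
feats = np.arange(14.0).reshape(7, 2)
lfr = to_low_frame_rate(feats)
```

With these settings the acoustic model runs on one third as many frames, each carrying three frames' worth of input.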

Latency can be tuned by adjusting the look‑ahead order of the memory filters; setting the delay to 5–10 frames incurs only about a 3% relative performance loss.
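
As a rough way to reason about this trade-off, the sketch below assumes that look-ahead accumulates additively across stacked memory blocks, so each layer contributes its look-ahead order times its stride. The helper name, the ten-layer configuration, and the 30 ms frame shift are hypothetical values chosen for illustration, not settings reported in the article.

```python
def lookahead_latency(orders, strides, frame_shift_ms):
    """Total look-ahead of a stack of memory blocks.

    orders[l]  : look-ahead order (number of future taps) of layer l
    strides[l] : look-ahead stride of layer l
    Returns (frames, milliseconds).
    """
    if len(orders) != len(strides):
        raise ValueError("orders and strides must have the same length")
    frames = sum(n * s for n, s in zip(orders, strides))
    return frames, frames * frame_shift_ms

# Hypothetical: 10 memory layers, one future tap each, stride 1, 30 ms per frame.
frames, ms = lookahead_latency([1] * 10, [1] * 10, frame_shift_ms=30)
print(frames, "frames,", ms, "ms")
```

Under these assumptions the stack waits for 10 future frames, the upper end of the 5–10 frame range mentioned above; reducing the order in some layers to zero lowers the delay proportionally.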

