How DFSMN Cuts Speech Synthesis Model Size by 75% While Quadrupling Speed

This paper introduces a Deep Feedforward Sequential Memory Network (DFSMN) for statistical parametric speech synthesis that matches BLSTM quality with only a quarter of the model size and four times faster inference, making it ideal for memory‑constrained, real‑time IoT devices.

Alibaba Cloud Developer

Oct 23, 2018

Research Background

Statistical parametric speech synthesis has advanced significantly with neural networks, but deploying such models on IoT devices (e.g., smart speakers, smart TVs) faces strict memory and real‑time constraints. To address this, we propose a Deep Feedforward Sequential Memory Network (DFSMN) that maintains synthesis quality while drastically reducing computational load.

Deep Feedforward Sequential Memory Network

The compact Feedforward Sequential Memory Network (cFSMN) improves on the standard FSMN by introducing low‑rank matrix factorization, which reduces the parameter count and speeds up both training and inference. Each cFSMN layer applies three steps in sequence: a linear projection, a memory module that computes a weighted sum of past and future frames, and an affine transformation followed by a non‑linear activation.
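The three steps of a cFSMN layer can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the names `matvec`, `memory_block`, and `cfsmn_layer`, the ReLU activation, the identity term added inside the memory module, and the zero‑padding at utterance edges are all assumptions chosen for clarity.

```python
def matvec(W, x, b=None):
    """Multiply a list-of-lists matrix W by vector x, optionally adding bias b."""
    out = [sum(w * v for w, v in zip(row, x)) for row in W]
    return out if b is None else [o + bi for o, bi in zip(out, b)]


def memory_block(P, a, c):
    """Add tap-weighted past and future frames to every projected frame.

    P: T projected frames, each a list of D floats.
    a: look-back taps; a[i] weights frame t - i (a[0] is the current frame).
    c: look-ahead taps; c[j] weights frame t + j + 1.
    Frames outside the utterance are skipped, i.e. treated as zeros.
    """
    T, D = len(P), len(P[0])
    out = []
    for t in range(T):
        m = list(P[t])  # identity term (assumed): keep the frame itself
        for i, tap in enumerate(a):
            if t - i >= 0:
                for k in range(D):
                    m[k] += tap[k] * P[t - i][k]
        for j, tap in enumerate(c, start=1):
            if t + j < T:
                for k in range(D):
                    m[k] += tap[k] * P[t + j][k]
        out.append(m)
    return out


def cfsmn_layer(H, V, a, c, U, b):
    """One cFSMN layer: projection -> memory -> affine transform + ReLU."""
    P = [matvec(V, h) for h in H]   # 1) low-rank linear projection
    M = memory_block(P, a, c)       # 2) memory over neighbouring frames
    return [[max(0.0, z) for z in matvec(U, m, b)] for m in M]  # 3) affine + ReLU


# Toy usage: 2 frames, hidden size 2, projection size 1 (hypothetical values).
H = [[1.0, 0.0], [0.0, 1.0]]
layer_out = cfsmn_layer(H, V=[[1.0, 1.0]], a=[[1.0]], c=[],
                        U=[[1.0], [-1.0]], b=[0.0, 0.0])
```

Because the projection size is smaller than the hidden size, the memory taps operate on fewer dimensions, which is where the low‑rank factorization saves parameters.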

Building on cFSMN, DFSMN adds skip‑connections between adjacent memory modules, allowing gradients to bypass non‑linearities during back‑propagation. This enables deeper networks to converge quickly and capture longer‑range temporal context, which is crucial for high‑quality speech synthesis.
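The skip connection can be written out explicitly. The notation below follows the published DFSMN formulation rather than anything stated in this article, so treat the symbols as assumptions: $p^{\ell}_t$ is the projection output of layer $\ell$ at frame $t$, $\tilde{p}^{\ell}_t$ is the memory output, $a^{\ell}_i$ and $c^{\ell}_j$ are the look‑back and look‑ahead taps, $N_1, N_2$ are the orders, $s_1, s_2$ are the strides, and the skip is assumed to be an identity mapping.

```latex
\tilde{p}^{\ell}_t = \tilde{p}^{\ell-1}_t + p^{\ell}_t
  + \sum_{i=0}^{N_1} a^{\ell}_i \odot p^{\ell}_{t - s_1 i}
  + \sum_{j=1}^{N_2} c^{\ell}_j \odot p^{\ell}_{t + s_2 j}

\frac{\partial \tilde{p}^{\ell}_t}{\partial \tilde{p}^{\ell-1}_t}
  = I + \frac{\partial \bigl(\tilde{p}^{\ell}_t - \tilde{p}^{\ell-1}_t\bigr)}
             {\partial \tilde{p}^{\ell-1}_t}
```

The identity term $I$ in the second line is the point: part of the gradient reaches the lower memory module without passing through any activation function, so it is not attenuated as depth grows.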

DFSMN’s memory module has two hyper‑parameters: order (how many past and future frames each module weights) and stride (the spacing, in frames, between those taps). Larger orders and strides let the model exploit longer context when needed, while smaller settings keep latency low for short utterances.
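How order and stride translate into context can be sketched with a little index arithmetic. The helper names `tap_offsets` and `total_context` are hypothetical, and the configuration in the usage lines (6 memory modules, order 10, stride 2) is an illustrative choice, not one of the models reported in this article.

```python
def tap_offsets(order_back, order_ahead, stride_back=1, stride_ahead=1):
    """Frame offsets, relative to the current frame, read by one memory module."""
    back = [-stride_back * i for i in range(order_back, 0, -1)]
    ahead = [stride_ahead * j for j in range(1, order_ahead + 1)]
    return back + [0] + ahead


def total_context(num_modules, order_back, order_ahead,
                  stride_back=1, stride_ahead=1):
    """Farthest (past, future) frame reachable after stacking identical modules."""
    return (num_modules * order_back * stride_back,
            num_modules * order_ahead * stride_ahead)


# Order 2 with stride 2 reads every other frame on each side.
offsets = tap_offsets(2, 2, stride_back=2, stride_ahead=2)

# Hypothetical stack: 6 modules, order 10 and stride 2 in both directions.
past, future = total_context(6, 10, 10, stride_back=2, stride_ahead=2)
```

Doubling the stride doubles the reach of a module without adding any taps, so context grows at no extra parameter cost; the look‑ahead reach is also the number of future frames the model must wait for, which is why it bounds latency.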

Experiments

We evaluated DFSMN on a Mandarin audiobook corpus read by a male speaker (≈83 h training, 3 h validation). Audio was sampled at 16 kHz; acoustic features (60‑dimensional mel‑cepstral coefficients, log‑F0, 11‑dimensional band aperiodicity (BAP), and a voiced/unvoiced flag) were extracted with the WORLD vocoder. Linguistic features (754 dimensions) served as the network input.

The baseline was a strong BLSTM system (one fully‑connected layer plus three BLSTM layers, each with 2048 units) trained with back‑propagation through time (BPTT). The DFSMN models were trained with standard back‑propagation on two GPUs using block‑wise model‑update filtering (BMUF). All models were trained with a multi‑target, frame‑level MSE loss.
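One plausible reading of a multi‑target, frame‑level MSE loss is a mean squared error computed per acoustic stream and then summed. The sketch below assumes that reading; the function name `multi_target_mse`, the equal stream weighting, and the toy stream layout are assumptions, since the article does not describe how the targets are combined.

```python
def multi_target_mse(pred, target, streams):
    """Per-stream frame-level MSE plus their unweighted sum.

    pred, target: T frames, each a list of floats (all streams concatenated).
    streams: name -> (start, end) column range of that stream.
    """
    losses = {}
    for name, (start, end) in streams.items():
        sq = [(p[k] - y[k]) ** 2
              for p, y in zip(pred, target)
              for k in range(start, end)]
        losses[name] = sum(sq) / len(sq)
    losses["total"] = sum(losses.values())  # equal weighting (assumed)
    return losses


# Toy layout: 2 spectral columns and 1 voicing column (hypothetical).
streams = {"spectrum": (0, 2), "voicing": (2, 3)}
pred = [[1.0, 2.0, 0.0], [1.0, 2.0, 1.0]]
target = [[1.0, 0.0, 0.0], [3.0, 2.0, 0.0]]
losses = multi_target_mse(pred, target, streams)
```

Averaging within each stream before summing keeps a wide stream, such as the 60‑dimensional spectrum, from dominating narrow ones like the voicing flag.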

We constructed a series of DFSMN configurations (models A–I) varying depth, order, and stride. Objective metrics (e.g., MSE) consistently improved as depth and order increased, with model H surpassing the BLSTM baseline. Subjective MOS tests with 40 native Chinese listeners showed that models A–E gradually approached BLSTM naturalness, with model E matching the baseline, while later models did not yield further perceptual gains despite better objective scores.

Conclusion

Our experiments indicate that capturing approximately 120 past and 120 future frames (≈600 ms in each direction) provides sufficient context for acoustic modeling in speech synthesis; longer context yields diminishing returns. Compared with the BLSTM baseline, DFSMN achieves comparable subjective quality with only 25 % of the parameters and four‑fold faster inference, making it well suited to memory‑constrained, real‑time edge devices.
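The headline figures follow from simple arithmetic on the numbers above. The 5 ms frame shift is inferred from the article's own 120‑frame and 600 ms figures, assuming the 600 ms applies to each direction; it is not stated directly.

```latex
\text{frame shift} \approx \frac{600\ \text{ms}}{120\ \text{frames}} = 5\ \text{ms per frame}
\qquad
\text{total window} \approx (120 + 120) \times 5\ \text{ms} = 1.2\ \text{s}

\text{size reduction} = 1 - 0.25 = 75\,\%
```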
