How DFSMN Cuts Speech Synthesis Model Size by 75% and Quadruples Speed
Researchers propose a Deep Feedforward Sequential Memory Network (DFSMN) for speech synthesis that matches BLSTM quality while using only a quarter of the model size and achieving four times faster inference, making it ideal for memory‑constrained, real‑time edge devices.
Research Background
Speech synthesis systems fall into two broad families: concatenative and parametric. Neural networks brought significant quality improvements to parametric systems, but the proliferation of IoT devices (e.g., smart speakers, smart TVs) imposes strict memory and real‑time constraints on deployed models. The proposed Deep Feedforward Sequential Memory Network (DFSMN) is designed to maintain synthesis quality while substantially reducing model size and inference time.
Deep Feedforward Sequential Memory Network
The compact Feedforward Sequential Memory Network (cFSMN) improves on the standard FSMN by applying low‑rank matrix factorization, which reduces the parameter count and speeds up both training and inference. Each cFSMN layer consists of three parts: a linear projection, a memory module that computes a weighted sum of past and future frames, and a final affine transformation followed by a nonlinearity.
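The three parts of a cFSMN layer can be written compactly as follows. This is a sketch in the notation commonly used in the FSMN literature, not notation taken from this article: the symbols $V$, $U$, $b$, $a_i$, $c_j$, $N_1$ (look‑back order), $N_2$ (look‑ahead order), and the nonlinearity $f$ are assumptions introduced here for illustration.

```latex
\begin{align}
p_t^{\ell} &= V^{\ell} h_t^{\ell} + b^{\ell}
  && \text{(linear projection)} \\
\tilde{p}_t^{\ell} &= p_t^{\ell}
  + \sum_{i=0}^{N_1} a_i^{\ell} \odot p_{t-i}^{\ell}
  + \sum_{j=1}^{N_2} c_j^{\ell} \odot p_{t+j}^{\ell}
  && \text{(memory module)} \\
h_t^{\ell+1} &= f\!\left(U^{\ell}\, \tilde{p}_t^{\ell} + b^{\ell+1}\right)
  && \text{(affine transformation and nonlinearity)}
\end{align}
```

Here $\odot$ denotes element‑wise multiplication, so each tap $a_i^{\ell}$ or $c_j^{\ell}$ is a vector with one weight per projection dimension.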
DFSMN extends cFSMN with skip connections between the memory modules of adjacent layers, so that gradients can bypass the nonlinearities during back‑propagation. This lets deeper networks converge quickly and capture longer‑range context. The memory module also gains a stride hyper‑parameter that controls how many frames are skipped when aggregating past or future information. Striding is particularly useful in speech synthesis, where adjacent frames overlap heavily.
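A minimal NumPy sketch of one such memory module may make the roles of stride and the skip connection concrete. It assumes element‑wise (vector) tap coefficients, an identity skip connection, and zero padding beyond the utterance boundaries; the function name `dfsmn_memory` and all parameter names are hypothetical and are not taken from the paper or from any library.

```python
import numpy as np

def dfsmn_memory(p, prev_memory, a, c, stride_back=1, stride_ahead=1):
    """Apply one DFSMN-style memory module to a whole utterance.

    p            : (T, D) projection outputs of the current layer
    prev_memory  : (T, D) memory output of the layer below, or None
    a            : (N1 + 1, D) look-back taps for i = 0..N1
    c            : (N2, D) look-ahead taps for j = 1..N2
    Frames outside the utterance are treated as zeros (an assumption).
    """
    T = p.shape[0]
    out = p.astype(float)              # astype returns a fresh copy
    if prev_memory is not None:
        out += prev_memory             # identity skip connection (assumed)
    for i in range(a.shape[0]):
        s = i * stride_back            # look back i * stride frames
        if s < T:
            out[s:] += a[i] * p[:T - s]
    for j in range(1, c.shape[0] + 1):
        s = j * stride_ahead           # look ahead j * stride frames
        if s < T:
            out[:T - s] += c[j - 1] * p[s:]
    return out

# Toy check: 6 frames, 1 dimension, one look-back and one look-ahead tap,
# both with stride 2, so frame t sees p[t-2], p[t], and p[t+2].
p = np.arange(1.0, 7.0).reshape(6, 1)
a = np.array([[0.0], [1.0]])           # tap i=0 has weight 0, tap i=1 has weight 1
c = np.array([[1.0]])
print(dfsmn_memory(p, None, a, c, stride_back=2, stride_ahead=2)[:, 0])
# prints [ 4.  6.  9. 12.  8. 10.]
```

With stride 2, each tap reaches twice as far back or ahead as it would with stride 1 for the same number of parameters, which is the point of striding when neighbouring frames carry largely redundant information.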
Experiments
The experiments used a corpus of Chinese novels read by a male speaker (≈83 h for training, ≈3 h for validation). Audio was sampled at 16 kHz and analyzed with 25 ms frames at a 5 ms shift. The WORLD vocoder extracted 60‑dimensional mel‑cepstral coefficients, 3‑dimensional log‑F0, 11‑dimensional band aperiodicity (BAP), and a voicing flag. The front‑end produced 754‑dimensional linguistic features as network inputs.
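The framing numbers above imply a few derived quantities that are easy to check. The short calculation below is a sketch based only on the figures in this section; the one assumption is that the voicing flag is a single dimension, which the text does not state explicitly. All variable names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000
FRAME_MS = 25
SHIFT_MS = 5

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 400 samples per analysis window
samples_per_shift = SAMPLE_RATE_HZ * SHIFT_MS // 1000   # 80 samples between frame starts
frames_per_second = 1000 // SHIFT_MS                    # 200 acoustic frames per second

# Assumption: the voicing flag occupies one dimension.
ACOUSTIC_DIMS = {"mel_cepstrum": 60, "log_f0": 3, "bap": 11, "voicing_flag": 1}
acoustic_dim = sum(ACOUSTIC_DIMS.values())              # 75 outputs per frame

# Roughly 83 hours of training audio at 200 frames per second.
approx_train_frames = 83 * 3600 * frames_per_second     # 59,760,000 frames

print(samples_per_frame, samples_per_shift, frames_per_second, acoustic_dim)
# prints 400 80 200 75
```

Consecutive 25 ms windows that start 5 ms apart share 20 ms of signal, which is why adjacent frames are so strongly correlated.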
The baseline was a strong BLSTM model (1 fully‑connected layer followed by 3 BLSTM layers, each with 2048 units) trained with back‑propagation through time (BPTT). The DFSMN models were trained with standard back‑propagation on two GPUs using blockwise model‑update filtering (BMUF). All models were optimized with a multi‑target, frame‑level mean squared error (MSE) criterion.
Each DFSMN model consisted of several DFSMN layers followed by two fully‑connected layers (2048 nodes each). Each DFSMN layer had 2048 hidden nodes and 512 projection nodes. The experiments varied the number of DFSMN layers, the memory‑module order, and the stride. System A had the fewest layers and the smallest order; systems B through I progressively increased depth, order, and stride. Objective metrics improved consistently as the DFSMN models became deeper and higher‑order, and system H surpassed the BLSTM baseline on those metrics.
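The layer sizes above show why the projection layer saves parameters. The following back‑of‑the‑envelope count is a sketch that ignores biases and assumes the 512‑node projection sits between two 2048‑node hidden layers, with one vector of tap weights per memory position; the helper `memory_taps` and the example order of 10 are hypothetical and not values reported in the paper.

```python
HIDDEN = 2048   # hidden nodes per DFSMN layer (from the text)
PROJ = 512      # projection nodes per DFSMN layer (from the text)

# Weights connecting two 2048-node hidden layers directly.
full_rank = HIDDEN * HIDDEN                  # 4,194,304

# Same connection routed through the 512-node projection (low-rank form).
low_rank = HIDDEN * PROJ + PROJ * HIDDEN     # 2,097,152

def memory_taps(order_back, order_ahead, proj=PROJ):
    """Tap weights of one memory module: taps 0..order_back plus 1..order_ahead."""
    return (order_back + 1 + order_ahead) * proj

print(full_rank, low_rank, low_rank / full_rank)
# prints 4194304 2097152 0.5
print(memory_taps(10, 10))   # hypothetical order of 10 in each direction
# prints 10752
```

Under these assumptions the factorized connection needs half the weights of a direct one, and the memory taps add only a small fraction on top, so raising the order is cheap compared with widening a layer.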
Subjective MOS tests (40 native Chinese listeners, 20 utterances per system, each rated by 10 listeners) showed that naturalness improved from system A to E, with system E matching the BLSTM baseline. Later systems did not yield further MOS gains despite better objective scores.
Conclusion
Taken together, the objective and subjective results suggest that capturing 120 frames (≈600 ms) of past and future context is sufficient for acoustic modeling in speech synthesis; additional context continued to improve objective scores but did not improve perceived naturalness. Compared with the BLSTM baseline, the DFSMN system achieves comparable subjective quality with only one‑quarter the model size and four times faster inference, making it well‑suited to memory‑constrained, real‑time edge devices.
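The relation between frames and milliseconds, and the way context accumulates across stacked layers, can be sketched as below. It assumes the 5 ms frame shift from the experiments and that each layer extends the one‑sided context by at most order × stride frames; the function names and the four‑layer configuration are hypothetical and do not correspond to any system in the paper.

```python
SHIFT_MS = 5  # frame shift used in the experiments

def context_ms(frames, shift_ms=SHIFT_MS):
    """Convert a one-sided context length in frames to milliseconds."""
    return frames * shift_ms

def stacked_context_frames(layers):
    """Upper bound on one-sided context for a stack of memory modules.

    layers: iterable of (order, stride) pairs, one per layer.
    """
    return sum(order * stride for order, stride in layers)

print(context_ms(120))
# prints 600

# Hypothetical stack: 4 layers, each with order 5 and stride 2.
frames = stacked_context_frames([(5, 2)] * 4)
print(frames, context_ms(frames))
# prints 40 200
```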
English paper: https://arxiv.org/abs/1802.09194
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
