How Linear Networks Enable Speaker‑Adaptive Speech Synthesis with Minimal Data

This article presents a linear‑network‑based speaker‑adaptation method for text‑to‑speech that achieves synthesis quality comparable to large‑scale training using only a few hundred target‑speaker utterances, and introduces a low‑rank‑plus‑diagonal compression to improve stability with scarce data.

Alibaba Cloud Developer

Nov 27, 2018

Abstract

Speaker‑adaptation algorithms use a small amount of target‑speaker data to build a speaker‑adaptive text‑to‑speech (TTS) system that can synthesize satisfactory speech. This paper proposes a linear‑network‑based speaker‑adaptation algorithm. By learning a specific linear network for each target speaker, the acoustic model becomes speaker‑dependent. With only 200 adaptation sentences, the adapted system achieves performance comparable to training with 1000 sentences.

Research Background

When abundant data are available for a target speaker, a speaker‑dependent acoustic model can be trained, yielding speech that closely resembles the target voice. Most speakers, however, have too little recorded data for this, and a model trained on it alone synthesizes poorly. Speaker‑adaptation algorithms aim to obtain good synthesis quality from limited data, reducing recording and transcription effort.

Algorithm Description

The source acoustic model is a multi‑task DNN‑BLSTM network. Linear networks (LN) are inserted between layers of the source model. Depending on insertion position, an LN is classified as a Linear Input Network (LIN), Linear Hidden Network (LHN), or Linear Output Network (LON). Each LN learns a speaker‑specific linear transformation (matrix and bias). The LN is initialized to the identity matrix with zero bias, so the adapted model starts out identical to the source model. During adaptation, only the LN parameters are updated using the target speaker’s data; all other model parameters stay fixed.
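The adaptation procedure above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `ToySourceModel` is a hypothetical two‑layer stand‑in for the frozen DNN‑BLSTM, the LN is inserted as an LHN, and the squared‑error loss, learning rate, and epoch count are arbitrary choices.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class LinearNetwork:
    """Speaker-specific affine map y = W h + b, initialized as a no-op."""

    def __init__(self, dim):
        self.W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def forward(self, h):
        return [y + b for y, b in zip(matvec(self.W, h), self.b)]

    def sgd_step(self, h, grad_y, lr):
        """One gradient step given dLoss/dy: dW = grad_y outer h, db = grad_y."""
        for i, g in enumerate(grad_y):
            self.b[i] -= lr * g
            for j, hj in enumerate(h):
                self.W[i][j] -= lr * g * hj


class ToySourceModel:
    """Hypothetical stand-in for the frozen source model (not a DNN-BLSTM)."""

    def __init__(self, n_in, n_hid, n_out, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
        self.W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_out)]

    def hidden(self, x):
        return [math.tanh(v) for v in matvec(self.W1, x)]

    def output(self, h):
        return matvec(self.W2, h)


def squared_error(model, lhn, data):
    """Average squared error per (input, target) pair with the LHN inserted."""
    total = 0.0
    for x, target in data:
        y = model.output(lhn.forward(model.hidden(x)))
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, target))
    return total / len(data)


def adapt(model, lhn, data, lr=0.1, epochs=200):
    """Fit only the LHN on (input, target) pairs; model.W1 and model.W2 stay fixed."""
    W2_T = [list(col) for col in zip(*model.W2)]
    for _ in range(epochs):
        for x, target in data:
            h = model.hidden(x)
            y = model.output(lhn.forward(h))
            err = [yi - ti for yi, ti in zip(y, target)]
            grad_z = matvec(W2_T, err)  # gradient of 0.5 * ||err||^2 w.r.t. LHN output
            lhn.sgd_step(h, grad_z, lr)


model = ToySourceModel(n_in=3, n_hid=4, n_out=2)
lhn = LinearNetwork(dim=4)
h = model.hidden([0.2, -0.1, 0.4])
print(lhn.forward(h) == h)  # True: a freshly initialized LN leaves the model unchanged
```

Because the identity initialization makes the LN a no‑op, adaptation starts from exactly the source model's behavior, and only `lhn.W` and `lhn.b` ever move away from it.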

To reduce the parameter count, a low‑rank plus diagonal (LRPD) decomposition is applied to the LN, yielding LRPD‑LN: the full transformation matrix is replaced by the sum of a diagonal matrix and a low‑rank product. LRPD‑LN has only about 18% as many parameters as Full‑LN with negligible performance loss, which improves stability when adaptation data are scarce.
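A rough sketch of the LRPD idea, assuming the usual form W ≈ diag(d) + P·Q with P of size k×c and Q of size c×k. The layer width `k`, the rank `c`, and the choice to start with P at zero and Q at small random values are illustrative assumptions; the source does not give the dimensions behind its 18% figure.

```python
import random


def full_ln_params(k):
    """Full-LN: a k x k matrix plus k biases."""
    return k * k + k


def lrpd_ln_params(k, c):
    """LRPD-LN: k diagonal entries, k x c and c x k factors, plus k biases."""
    return k + 2 * k * c + k


class LRPDLinearNetwork:
    """y = d * h + P (Q h) + b, i.e. W = diag(d) + P Q without ever forming W."""

    def __init__(self, k, c, seed=0):
        rng = random.Random(seed)
        self.d = [1.0] * k                      # diagonal part starts as identity
        self.P = [[0.0] * c for _ in range(k)]  # zero, so P Q = 0 at the start
        # Assumption: Q starts small but non-zero so P can receive gradient.
        self.Q = [[rng.uniform(-0.01, 0.01) for _ in range(k)] for _ in range(c)]
        self.b = [0.0] * k

    def forward(self, h):
        low = [sum(q * v for q, v in zip(row, h)) for row in self.Q]  # Q h, length c
        return [
            d * hi + sum(p * l for p, l in zip(p_row, low)) + b
            for d, hi, p_row, b in zip(self.d, h, self.P, self.b)
        ]


# Hypothetical sizes, chosen only to show how the count scales.
k, c = 256, 16
print(full_ln_params(k), lrpd_ln_params(k, c))  # 65792 8704
```

The Full‑LN count grows with k², while the LRPD‑LN count grows with k·c, so the saving widens as the layer gets wider or the rank gets smaller. If both P and Q started at zero, neither would receive any gradient, which is why one factor is given a non‑zero start in this sketch.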

Experiments

Experiments were conducted on a Chinese dataset with three speakers (5000 utterances each, ~5 h). Adaptation data varied from 50 to 1000 sentences; 200 sentences were used as a development set and 20 as a test set. Objective metrics (mel‑cepstral distortion (MCD), F0 root‑mean‑square error (RMSE), unvoiced/voiced (U/V) error, and MSE) and subjective mean opinion scores (MOS) were evaluated across gender pairings (female‑female, male‑female, female‑male).
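For readers unfamiliar with the objective metrics, the sketch below shows how three of them are commonly computed. It assumes widely used textbook definitions (the source does not say which exact variants were used), time‑aligned frames of equal count, and an F0 value of 0 marking an unvoiced frame.

```python
import math


def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB over time-aligned frames; coefficient 0 (energy) is skipped."""
    const = 10.0 / math.log(10.0)
    per_frame = [
        const * math.sqrt(2.0 * sum((r - s) ** 2 for r, s in zip(rf[1:], sf[1:])))
        for rf, sf in zip(ref, syn)
    ]
    return sum(per_frame) / len(per_frame)


def f0_rmse(ref_f0, syn_f0):
    """RMSE in Hz over frames voiced in both tracks (F0 == 0 marks unvoiced)."""
    diffs = [r - s for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def uv_error(ref_f0, syn_f0):
    """Fraction of frames whose voiced/unvoiced decision differs."""
    mismatches = sum((r > 0) != (s > 0) for r, s in zip(ref_f0, syn_f0))
    return mismatches / len(ref_f0)


# Toy F0 tracks in Hz: only the first frame is voiced in both.
ref_f0 = [100.0, 0.0, 200.0]
syn_f0 = [110.0, 120.0, 0.0]
print(f0_rmse(ref_f0, syn_f0))  # 10.0
```

Lower is better for all three. `f0_rmse` as written requires at least one frame that is voiced in both tracks.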

Results show that the speaker‑dependent system (SD) outperforms output‑layer‑only adaptation (OL). OL+LRPD‑LN consistently yields more stable and higher‑quality speech than OL+Full‑LN, especially with limited adaptation data, where Full‑LN suffers from over‑fitting.

Conclusion

The linear‑network‑based speaker‑adaptation algorithm, combined with LRPD model compression, provides stable and high‑quality speech synthesis. With only 200 adaptation sentences, the system achieves performance comparable to using 1000 sentences, and LRPD further improves stability when data are scarce.
