Gaode Map Custom Voice Pack: End‑to‑End TTS Model Architecture and Deployment
This article explains how Gaode Map leverages lightweight edge TTS models, dual‑autoregressive large‑model data augmentation, and a configurable audio‑processing DAG to enable users to create highly realistic personalized voice packs from just three recorded sentences.
Gaode Map introduces a personalized voice-pack feature that replaces generic navigation prompts with custom voices mimicking family members or loved ones, extending emotional connection into the navigation experience.
The talk traces the evolution of Text-to-Speech (TTS) technology from mechanical devices to modern deep-learning models, highlighting the shift from acoustic-feature generation to token-plus-LLM-based large models that enable natural, personalized, and multimodal speech synthesis.
To support billions of devices, Gaode pairs an ultra-lightweight on-device acoustic model with data augmentation from TTS large models. The edge model is trained in three stages: pre-training on thousands of high-quality recordings with speaker–semantic disentanglement; teacher-forced distillation with a HuBERT feature-map loss to preserve semantic information; and fine-tuning that updates only speaker-related parameters for rapid adaptation.
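Two pieces of that three-stage recipe can be sketched in a few lines: the feature-map loss that aligns student features with HuBERT-style teacher features during distillation, and the stage-3 rule of keeping only speaker-related parameters trainable. This is a minimal illustration in plain Python; the parameter names (`speaker_embed`, `style_adaptor`) are assumptions, not Amap's actual code.

```python
def hubert_feature_map_loss(student_feats, teacher_feats):
    """Mean squared error between student and teacher feature maps.

    Each argument is a list of frames, each frame a list of floats,
    standing in for HuBERT-style hidden features.
    """
    total, count = 0.0, 0
    for s_frame, t_frame in zip(student_feats, teacher_feats):
        for s, t in zip(s_frame, t_frame):
            total += (s - t) ** 2
            count += 1
    return total / count


def speaker_trainable(param_names, speaker_markers=("speaker_embed", "style_adaptor")):
    """Stage-3 fine-tuning: keep only speaker-related parameters trainable,
    freezing everything else for fast adaptation to a new voice."""
    return [n for n in param_names if any(m in n for m in speaker_markers)]
```

In a real framework the second function would drive something like gradient masking or `requires_grad` flags; the point is that adaptation touches only the small speaker-dependent subset of weights.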
The resulting model, together with a source-filter vocoder, totals less than 5 MB, making it one of the smallest fully neural on-device TTS solutions while still delivering superior audio quality.
To mitigate data scarcity during fine-tuning, a dual-autoregressive TTS large model is employed for zero-shot voice cloning and data augmentation, capturing timbre and prosody details more effectively than single-codebook methods. The audio codec uses DAC-style residual vector quantization (RVQ) with eight codebooks at 25 Hz, with codebooks distilled from CN-HuBERT to balance semantic and acoustic fidelity.
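The core RVQ idea is simple: each codebook quantizes the residual left over by the previous one, so eight small codebooks can jointly describe a frame far more precisely than one large codebook. A minimal sketch, with toy two-dimensional codebooks in place of DAC's learned ones:

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(cw):
        return sum((a - b) ** 2 for a, b in zip(cw, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vec):
    """Residual vector quantization: each stage quantizes the residual
    left by the previous stages (eight stages in the codec above).
    Returns the per-stage indices and the final residual error."""
    residual = list(vec)
    indices = []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        # Subtract the chosen codeword; the next stage sees only the error.
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices, residual
```

The token sequence a dual-autoregressive model predicts is exactly these per-stage indices over time; decoding sums the chosen codewords back into acoustic features.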
The voice-pack production pipeline is a configuration-driven audio-processing DAG that orchestrates tasks such as SNR detection, MOS scoring, WER scoring, noise reduction, volume normalization, and data augmentation. Driving the pipeline from configuration enables zero-code deployment, rapid validation, and seamless iterative upgrades of individual audio nodes.
Key advantages of the pipeline are agility (quality standards can be verified quickly), uniformity (consistent results across implementations), and iterability (processing modules can be upgraded without touching the rest of the pipeline).
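A configuration-driven DAG like this boils down to a dependency map plus a registry of node implementations. The sketch below is a toy executor, assuming each node is a function of the audio and its upstream results; node names mirror the article, but the stub functions and the `run_dag` API are illustrative only (a production version would also need cycle detection and error handling).

```python
def run_dag(config, registry, audio):
    """Execute every node once, in dependency order.

    config:   {node_name: [upstream node names]}  -- the "zero-code" part
    registry: {node_name: fn(audio, results) -> result}
    """
    done, results = set(), {}

    def visit(node):
        if node in done:
            return
        for dep in config.get(node, []):
            visit(dep)  # run upstream nodes first
        results[node] = registry[node](audio, results)
        done.add(node)

    for node in config:
        visit(node)
    return results
```

Swapping in a new noise-reduction model then means re-registering one function, and changing the flow (e.g. adding WER scoring after denoising) means editing only the config dict, which is the "non-intrusive upgrade" property described above.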
Users can experience the feature by recording three sentences, after which a personalized voice pack is generated within ten minutes, supporting diverse scenarios such as navigation prompts, electronic-eye (speed-camera) alerts, and AI-driven digital avatars.
Future plans focus on faster recording workflows, richer voice expressiveness, and broader integration of personalized speech services across map scenarios to create a truly living map experience.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.