Gaode Map Custom Voice Pack: End‑to‑End TTS Model Architecture and Deployment
This article explains how Gaode Map leverages lightweight edge TTS models, dual‑autoregressive large‑model data augmentation, and a configurable audio‑processing DAG to enable users to create highly realistic personalized voice packs from just three recorded sentences.
Gaode Map introduces a personalized voice-pack feature that replaces generic navigation prompts with custom voices mimicking family members or loved ones, extending emotional connection into the navigation experience.
The talk traces the evolution of Text-to-Speech (TTS) technology from mechanical devices to modern deep-learning models, highlighting the shift from acoustic-feature generation to token-plus-LLM-based large models that enable natural, personalized, and multimodal speech synthesis.
To support billions of devices, Gaode pairs an ultra-lightweight on-device acoustic model with data augmentation from TTS large models. The edge model is trained in three stages: pre-training on thousands of high-quality recordings with speaker–semantic disentanglement; teacher-forced distillation with a HuBERT feature-map loss to preserve semantic information; and fine-tuning that updates only speaker-related parameters for rapid adaptation.
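Two pieces of that three-stage recipe can be sketched in a few lines: the feature-map loss that aligns student features with HuBERT-style teacher features during distillation, and the stage-3 rule of keeping only speaker-related parameters trainable. This is a minimal illustration in plain Python; the parameter names (`speaker_embed`, `style_adaptor`) are assumptions, not Amap's actual code.

```python
def hubert_feature_map_loss(student_feats, teacher_feats):
    """Mean squared error between student and teacher feature maps.

    Each argument is a list of frames, each frame a list of floats,
    standing in for HuBERT-style hidden features.
    """
    total, count = 0.0, 0
    for s_frame, t_frame in zip(student_feats, teacher_feats):
        for s, t in zip(s_frame, t_frame):
            total += (s - t) ** 2
            count += 1
    return total / count


def speaker_trainable(param_names, speaker_markers=("speaker_embed", "style_adaptor")):
    """Stage-3 fine-tuning: keep only speaker-related parameters trainable,
    freezing everything else for fast adaptation to a new voice."""
    return [n for n in param_names if any(m in n for m in speaker_markers)]
```

In a real framework the second function would drive something like gradient masking or `requires_grad` flags; the point is that adaptation touches only the small speaker-dependent subset of weights.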
The resulting model, together with a source-filter vocoder, totals less than 5 MB, making it one of the smallest fully neural on-device TTS solutions while still delivering superior audio quality.
To mitigate data scarcity during fine-tuning, a dual-autoregressive TTS large model is employed for zero-shot voice cloning and data augmentation, capturing timbre and prosody details more effectively than single-codebook methods. The audio codec uses DAC-style residual vector quantization (RVQ) with eight codebooks at 25 Hz, with codebooks distilled from CN-HuBERT to balance semantic and acoustic fidelity.
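The core RVQ idea is simple: each codebook quantizes the residual left over by the previous one, so eight small codebooks can jointly describe a frame far more precisely than one large codebook. A minimal sketch, with toy two-dimensional codebooks in place of DAC's learned ones:

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(cw):
        return sum((a - b) ** 2 for a, b in zip(cw, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vec):
    """Residual vector quantization: each stage quantizes the residual
    left by the previous stages (eight stages in the codec above).
    Returns the per-stage indices and the final residual error."""
    residual = list(vec)
    indices = []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        # Subtract the chosen codeword; the next stage sees only the error.
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices, residual
```

The token sequence a dual-autoregressive model predicts is exactly these per-stage indices over time; decoding sums the chosen codewords back into acoustic features.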
The voice-pack production pipeline is a configuration-driven audio-processing DAG that orchestrates tasks such as SNR detection, MOS scoring, WER scoring, noise reduction, volume normalization, and data augmentation. Driving the pipeline from configuration enables zero-code deployment, rapid validation, and seamless iterative upgrades of individual audio nodes.
Key advantages of the pipeline are agility (quality standards can be verified quickly), uniformity (consistent results across implementations), and iterability (processing modules can be upgraded without touching the rest of the pipeline).
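A configuration-driven DAG like this boils down to a dependency map plus a registry of node implementations. The sketch below is a toy executor, assuming each node is a function of the audio and its upstream results; node names mirror the article, but the stub functions and the `run_dag` API are illustrative only (a production version would also need cycle detection and error handling).

```python
def run_dag(config, registry, audio):
    """Execute every node once, in dependency order.

    config:   {node_name: [upstream node names]}  -- the "zero-code" part
    registry: {node_name: fn(audio, results) -> result}
    """
    done, results = set(), {}

    def visit(node):
        if node in done:
            return
        for dep in config.get(node, []):
            visit(dep)  # run upstream nodes first
        results[node] = registry[node](audio, results)
        done.add(node)

    for node in config:
        visit(node)
    return results
```

Swapping in a new noise-reduction model then means re-registering one function, and changing the flow (e.g. adding WER scoring after denoising) means editing only the config dict, which is the "non-intrusive upgrade" property described above.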
Users can experience the feature by recording three sentences, after which a personalized voice pack is generated within ten minutes, supporting diverse scenarios such as navigation prompts, electronic-eye (speed-camera) alerts, and AI-driven digital avatars.
Future plans focus on faster recording workflows, richer voice expressiveness, and broader integration of personalized speech services across map scenarios to create a truly living map experience.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.