Can End-to-End Diffusion TTS Beat Traditional Pipelines? Inside LongCat-AudioDiT
LongCat‑AudioDiT introduces a Wav‑VAE plus diffusion Transformer architecture that eliminates intermediate spectrograms, resolves the training‑inference mismatch with a dual‑constraint mechanism, replaces classifier‑free guidance with adaptive projection guidance, and achieves state‑of‑the‑art zero‑shot voice cloning on multiple benchmarks.
Audio generation is moving from a cascade pipeline—predicting intermediate representations such as mel‑spectrograms and then using a neural vocoder—to an end‑to‑end paradigm that directly models waveforms. The cascade approach causes information loss and error accumulation, degrading fine‑grained timbre and speaker identity, which are critical for zero‑shot voice cloning.
LongCat‑AudioDiT Architecture
The model eliminates intermediate representations and operates entirely in a waveform latent space using two components, combined in the sketch below:
Wav‑VAE: a fully convolutional variational auto‑encoder that compresses 24 kHz raw audio to a latent sequence at ~11.7 Hz, achieving >2000× compression.
Diffusion Transformer (DiT): a diffusion‑based generative model that learns conditional flow matching (CFM) in the latent space.
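To make the flow concrete, here is a minimal sketch of the two‑stage pipeline. The `wav_vae`/`dit` interfaces, the `synthesize` helper, and the plain Euler integration are our illustrative assumptions, not the released API:

```python
import torch

def synthesize(wav_vae, dit, text_tokens, prompt_audio, target_len, n_steps=32):
    """Hypothetical end-to-end flow: encode prompt, sample latents, decode."""
    # Encode the speaker prompt into the ~11.7 Hz waveform latent space.
    prompt_latents = wav_vae.encode(prompt_audio)        # (B, T_p, 64)

    # Conditional flow matching: integrate the learned velocity field
    # from pure noise (t = 0) toward data (t = 1) with Euler steps.
    z = torch.randn(prompt_latents.size(0), target_len, 64)
    for i in range(n_steps):
        t = torch.full((z.size(0),), i / n_steps)
        v = dit(z, t, text_tokens, prompt_latents)       # predicted velocity
        z = z + v / n_steps

    # Decode latents straight to a 24 kHz waveform -- no mel, no vocoder.
    return wav_vae.decode(z)
```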
Wav‑VAE Design
Efficient down‑sampling & multi‑scale modeling: hierarchical Oobleck blocks with dilated residual units reduce temporal resolution while preserving long‑range dependencies, compressing 24 kHz audio to ~11.7 Hz.
Non‑parametric shortcut paths: "space‑to‑channel" and "channel‑to‑space" shortcuts provide direct linear gradient routes, stabilizing training under aggressive down‑sampling (a minimal sketch follows this list).
Adversarial multi‑objective training: combines multi‑resolution STFT loss, multi‑scale mel loss, time‑domain L1 loss, KL regularization, and an adversarial STFT discriminator with feature‑matching loss to ensure high‑fidelity reconstruction.
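Here is what a "space‑to‑channel" shortcut can look like for 1‑D audio. The fold‑time‑into‑channels reshape follows the idea above; the channel matching by group averaging and the `shortcut_add` wrapper are our assumptions for illustration, not the paper's exact recipe:

```python
import torch

def space_to_channel(x: torch.Tensor, r: int) -> torch.Tensor:
    """Fold time into channels: (B, C, T) -> (B, C*r, T//r), no parameters."""
    B, C, T = x.shape
    assert T % r == 0
    return x.reshape(B, C, T // r, r).permute(0, 1, 3, 2).reshape(B, C * r, T // r)

def shortcut_add(x: torch.Tensor, branch_out: torch.Tensor, r: int) -> torch.Tensor:
    """Add a direct linear route around a strided conv block that downsamples
    time by r and maps C -> C_out channels."""
    s = space_to_channel(x, r)                      # (B, C*r, T//r)
    C_out = branch_out.size(1)
    # Non-parametric channel matching (assumes C*r is divisible by C_out):
    # average groups of channels down to the branch's channel count.
    s = s.reshape(s.size(0), C_out, -1, s.size(2)).mean(dim=2)
    return branch_out + s                           # gradient flows linearly via s
```

Because the shortcut is a pure reshape plus mean, gradients reach the encoder input through a linear path even when the convolutional branch is hard to optimize.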
Diffusion Transformer (DiT) Enhancements
Text encoder: UMT5 supporting 107 languages; first‑layer token embeddings are added to the final hidden state and layer‑normalized to improve intelligibility.
ConvNeXt V2 sequence module: refines text representations and speeds up text‑to‑audio alignment.
Global AdaLN: injects timestep information through a single shared adaptive layer‑norm rather than per‑block modulation, reducing parameter count (sketched after this list).
QK‑Norm + RoPE: query‑key normalization stabilizes attention, paired with rotary position embeddings.
Long skip connections: add the input directly to the output, consistently improving quality.
Representation alignment (REPA): uses mHuBERT self‑supervised features to guide intermediate DiT layers, accelerating convergence.
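A minimal sketch of the global AdaLN idea, assuming (as in other DiT variants that use it) that one shared modulation head driven by the timestep embedding is reused by every Transformer block; class and method names are illustrative:

```python
import torch
import torch.nn as nn

class GlobalAdaLN(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # One modulation head for the whole network (vs. one per block):
        # produces shift/scale/gate for both the attention and MLP sub-layers.
        self.head = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)

    def compute(self, t_emb: torch.Tensor):
        # Called once per forward pass; all blocks reuse these six chunks.
        return self.head(t_emb).chunk(6, dim=-1)

    def modulate(self, x: torch.Tensor, shift: torch.Tensor, scale: torch.Tensor):
        # x: (B, T, dim); shift/scale: (B, dim) broadcast over time.
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

The saving comes from `head` existing once instead of once per block; each block keeps its own attention and MLP weights but reads the same modulation parameters.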
Resolving Training‑Inference Mismatch
Standard CFM training computes the loss only on masked regions, so the audio prompt itself is never optimized; at inference this mismatch causes speaker and style drift. LongCat‑AudioDiT introduces a dual‑constraint mechanism, sketched in code below:
Prompt latent reset : at each inference step the latent variables of the prompt region are forced to their ground‑truth values, aligning inference trajectories with the training distribution.
Unconditional prediction purification : removes prompt latents from the unconditional velocity field to prevent information leakage.
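A minimal sketch of both constraints inside a CFM sampling loop. The interpolation convention z_t = (1 − t)·noise + t·data, the `dit()` signature, and zeroing as the purification mechanism are our assumptions for illustration:

```python
import torch

def sample_dual_constraint(dit, text, prompt_lat, prompt_mask, z,
                           n_steps=32, guidance_scale=2.0):
    noise = z.clone()
    for i in range(n_steps):
        t = i / n_steps
        # (1) Prompt latent reset: pin the prompt region to its ground-truth
        # point on the training trajectory before every velocity evaluation.
        z = torch.where(prompt_mask, (1 - t) * noise + t * prompt_lat, z)

        tt = torch.full((z.size(0),), t)
        v_cond = dit(z, tt, text, prompt_lat)
        # (2) Unconditional purification: blank the prompt latents in the
        # unconditional branch so they cannot leak into the guidance signal.
        z_unc = torch.where(prompt_mask, torch.zeros_like(z), z)
        v_unc = dit(z_unc, tt, None, None)

        v = v_unc + guidance_scale * (v_cond - v_unc)   # guidance combine
        z = z + v / n_steps                             # Euler step
    return z
```

The "guidance combine" line is exactly where APG (next section) replaces the plain CFG formula.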
Adaptive Projection Guidance (APG)
Classifier‑free guidance (CFG) amplifies the difference between conditional and unconditional predictions, which can over‑saturate the spectrum and degrade naturalness. APG decomposes the guidance signal into orthogonal components, preserving the beneficial part while suppressing the harmful part, thus improving naturalness without sacrificing speaker similarity.
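A minimal sketch of the projection idea, in the spirit of adaptive projection guidance: split the conditional/unconditional difference into components parallel and orthogonal to the conditional prediction, then down‑weight the parallel, saturation‑driving part. Tensor shapes, hyperparameter names, and the exact projection axis are our assumptions:

```python
import torch

def apg_combine(v_cond: torch.Tensor, v_unc: torch.Tensor,
                scale: float = 4.0, eta: float = 0.0) -> torch.Tensor:
    """Guidance on (B, T, D) velocity predictions via orthogonal projection."""
    diff = v_cond - v_unc
    # Unit vector along the conditional prediction, per batch element.
    unit = v_cond / v_cond.flatten(1).norm(dim=1).clamp_min(1e-8).view(-1, 1, 1)
    # Split diff into parallel and orthogonal components w.r.t. that axis.
    parallel = (diff * unit).flatten(1).sum(dim=1).view(-1, 1, 1) * unit
    orthogonal = diff - parallel
    # eta < 1 suppresses the over-saturating parallel part; the orthogonal
    # part keeps the quality and speaker-similarity benefit of guidance.
    return v_cond + (scale - 1) * (orthogonal + eta * parallel)
```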
Empirical Findings
Higher VAE reconstruction quality does not directly translate to better TTS generation: pushing reconstruction accuracy requires inflating the latent dimensionality, which makes the latent distribution harder for the diffusion model to learn. Systematic ablations identified the optimal latent configuration as 64 dimensions at an 11.7 Hz frame rate, balancing fidelity and learnability.
Benchmark Results
On the Seed benchmark, LongCat‑AudioDiT‑3.5B achieves speaker similarity (SIM) scores of 0.818 on Seed‑ZH and 0.797 on Seed‑Hard, surpassing Seed‑DiT, CosyVoice 3.5, and MiniMax‑Speech. Intelligibility scores are competitive:
Chinese CER: 1.09 % (3.5B), 1.18 % (1B)
English WER: 1.50 % (3.5B), 1.78 % (1B)
Chinese hard‑sentence CER: 6.04 % (3.5B) vs. 8.67 % for F5‑TTS
These results were obtained using only ASR‑generated transcripts for pre‑training, without high‑quality human annotations or multi‑stage pipelines.
Open‑Source Release
Paper: https://arxiv.org/abs/2603.29339v1
GitHub: https://github.com/meituan-longcat/LongCat-AudioDiT
HuggingFace: https://huggingface.co/meituan-longcat/LongCat-AudioDiT
LongCat‑AudioDiT demonstrates that a pure waveform‑latent diffusion approach can match or exceed the performance of complex multi‑stage pipelines, offering a new direction for high‑fidelity speech synthesis and multimodal audio generation.
Meituan Technology Team
Over 10,000 engineers powering China's leading lifestyle services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.