Can End-to-End Diffusion TTS Beat Traditional Pipelines? Inside LongCat-AudioDiT

LongCat-AudioDiT introduces a wave‑VAE plus diffusion Transformer architecture that eliminates intermediate spectrograms, resolves the training‑inference mismatch with a dual-constraint mechanism, replaces classifier‑free guidance with adaptive projection guidance, and achieves state‑of‑the‑art zero‑shot voice cloning performance on multiple benchmarks.

Meituan Technology Team

Audio generation is moving from cascade pipelines, which predict an intermediate representation such as a mel‑spectrogram and then render it with a neural vocoder, toward an end‑to‑end paradigm that models waveforms directly. The cascade approach suffers information loss and error accumulation, degrading the fine‑grained timbre and speaker‑identity cues that are critical for zero‑shot voice cloning.

LongCat‑AudioDiT Architecture

The model eliminates intermediate representations and operates entirely in a waveform latent space using two components:

Wav‑VAE: a fully convolutional variational auto‑encoder that compresses 24 kHz raw audio to a latent sequence at ~11.7 Hz, achieving >2000× temporal compression.

Diffusion Transformer (DiT): a diffusion‑based generative model trained with conditional flow matching (CFM) in that latent space.
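The >2000× figure follows directly from the two rates quoted above; a quick sanity check:

```python
# Wav-VAE temporal compression: 24 kHz waveform -> ~11.7 Hz latent sequence.
sample_rate_hz = 24_000   # input audio sample rate
latent_rate_hz = 11.7     # latent frame rate reported for LongCat-AudioDiT

compression = sample_rate_hz / latent_rate_hz
print(f"~{compression:.0f}x temporal compression")  # ~2051x, i.e. >2000x
```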

LongCat‑AudioDiT architecture overview

Wav‑VAE Design

Efficient down‑sampling & multi‑scale modeling: hierarchical Oobleck blocks with dilated residual units reduce temporal resolution while preserving long‑range dependencies, compressing 24 kHz audio to ~11.7 Hz.

Non‑parametric shortcut paths: "space‑to‑channel" and "channel‑to‑space" shortcuts provide direct linear gradient routes, stabilizing training under aggressive down‑sampling.

Adversarial multi‑objective training: combines a multi‑resolution STFT loss, multi‑scale mel loss, time‑domain L1 loss, KL regularization, and an adversarial STFT discriminator with feature‑matching loss to ensure high‑fidelity reconstruction.
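As a sketch of one ingredient of this objective, a multi‑resolution STFT loss compares log‑magnitude spectrograms at several FFT sizes. The FFT sizes, hop convention, and weighting below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def _mag_spec(x, n_fft, hop):
    """Magnitude spectrogram via framed rFFT with a Hann window."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop: i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def multi_resolution_stft_loss(pred, target, fft_sizes=(512, 1024, 2048)):
    """Mean L1 distance between log-magnitude spectrograms at several FFT sizes.

    The resolutions are illustrative; hop is fixed to n_fft // 4 by convention.
    """
    eps = 1e-7
    losses = []
    for n_fft in fft_sizes:
        P = _mag_spec(pred, n_fft, n_fft // 4)
        T = _mag_spec(target, n_fft, n_fft // 4)
        losses.append(np.mean(np.abs(np.log(P + eps) - np.log(T + eps))))
    return float(np.mean(losses))

# Toy check: identical signals give zero loss; a perturbed signal does not.
t = np.linspace(0, 1, 24_000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.random.default_rng(0).normal(size=clean.shape)
print(multi_resolution_stft_loss(clean, clean), multi_resolution_stft_loss(clean, noisy))
```

Comparing magnitudes at several resolutions trades off time and frequency localization, which is why such losses are common in neural vocoder training.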

Diffusion Transformer (DiT) Enhancements

Text encoder: UMT5, supporting 107 languages; first‑layer token embeddings are added to the final hidden state and LayerNorm‑ed to improve intelligibility.

ConvNeXt V2 sequence module: refines text representations and speeds up text‑to‑audio alignment.

Global AdaLN: injects timestep information through a single shared adaptive layer norm instead of per‑block modulation, reducing parameter count.

QK‑Norm + RoPE: query‑key normalization stabilizes attention, combined with rotary positional encoding.

Long skip connections: add the network input directly to its output, consistently improving quality.

Representation alignment (REPA): uses mHuBERT self‑supervised features to guide intermediate DiT layers, accelerating convergence.
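To illustrate the parameter saving behind global AdaLN: rather than each block owning its own timestep‑to‑modulation projection, one shared projection produces the scale/shift parameters consumed by every block. A minimal NumPy sketch; the dimensions and names are hypothetical, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_blocks = 8, 4

# One shared projection from the timestep embedding to (scale, shift).
# In per-block AdaLN this matrix would be replicated n_blocks times.
W_shared = rng.normal(scale=0.02, size=(d_model, 2 * d_model))

def global_adaln(x, t_emb):
    """LayerNorm (no learned affine) modulated by the shared timestep projection."""
    scale, shift = np.split(t_emb @ W_shared, 2, axis=-1)
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + 1e-6) * (1 + scale) + shift

x = rng.normal(size=(3, d_model))    # toy token sequence
t_emb = rng.normal(size=(d_model,))  # toy timestep embedding
out = global_adaln(x, t_emb)
print(out.shape)  # (3, 8)

# Parameter comparison: one shared projection vs. one per block.
shared_params = W_shared.size
per_block_params = n_blocks * W_shared.size
print(shared_params, per_block_params)  # 128 vs 512
```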

Resolving Training‑Inference Mismatch

Standard CFM training penalizes only the masked regions, leaving the audio prompt unoptimized; at inference this mismatch causes speaker‑style drift. LongCat‑AudioDiT introduces a dual‑constraint mechanism:

Prompt latent reset: at each inference step the latent variables of the prompt region are forced to their ground‑truth values, aligning inference trajectories with the training distribution.

Unconditional prediction purification: removes prompt latents from the unconditional velocity field to prevent information leakage.
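The prompt latent reset can be sketched with a toy Euler flow-matching sampler: after every integration step, the prompt frames are overwritten with the noised ground truth at the current time, so the trajectory there always matches the training distribution. The linear-interpolation noising and the stand-in velocity field are illustrative assumptions, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)
T_frames, d = 10, 4
prompt = rng.normal(size=(3, d))        # ground-truth prompt latents (frames 0-2)
prompt_mask = np.zeros(T_frames, bool)
prompt_mask[:3] = True

def velocity(x, t):
    """Toy stand-in for the DiT's learned conditional velocity field."""
    return -0.5 * x  # illustrative only

def sample_with_prompt_reset(steps=20):
    x = rng.normal(size=(T_frames, d))            # start from noise at t = 0
    noise_prompt = rng.normal(size=prompt.shape)  # fixed noise for the prompt path
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        # Prompt latent reset: force prompt frames to the noised ground truth
        # at time t (linear flow-matching interpolation, assumed here).
        x[prompt_mask] = (1 - t) * noise_prompt + t * prompt
        x = x + dt * velocity(x, t)
    x[prompt_mask] = prompt                       # exact ground truth at t = 1
    return x

out = sample_with_prompt_reset()
print(np.allclose(out[:3], prompt))  # True: prompt region stays on ground truth
```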

Adaptive Projection Guidance (APG)

Classifier‑free guidance (CFG) amplifies the difference between conditional and unconditional predictions, which can over‑saturate the spectrum and degrade naturalness. APG decomposes the guidance signal into orthogonal components, preserving the beneficial part while suppressing the harmful part, thus improving naturalness without sacrificing speaker similarity.
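A sketch of the projection idea: the guidance update (conditional minus unconditional prediction) is split into a component parallel to the conditional prediction, which drives over‑saturation, and an orthogonal component, which carries most of the useful steering; the parallel part is down‑weighted. The weighting scheme and names below are illustrative, not the paper's exact formulation:

```python
import numpy as np

def apg_guidance(cond, uncond, scale=2.0, parallel_weight=0.0):
    """Projected guidance: keep the orthogonal component, damp the parallel one.

    parallel_weight=0 drops the saturating parallel part entirely;
    plain CFG corresponds to parallel_weight=1.
    """
    diff = cond - uncond
    unit = cond / (np.linalg.norm(cond) + 1e-8)
    parallel = np.dot(diff, unit) * unit   # component along the conditional
    orthogonal = diff - parallel
    return cond + (scale - 1) * (orthogonal + parallel_weight * parallel)

rng = np.random.default_rng(0)
cond, uncond = rng.normal(size=8), rng.normal(size=8)

cfg = apg_guidance(cond, uncond, scale=2.0, parallel_weight=1.0)
apg = apg_guidance(cond, uncond, scale=2.0, parallel_weight=0.0)
# apg differs from cond only in the direction orthogonal to cond,
# while cfg also pushes along cond itself (the saturating direction).
print(np.linalg.norm(cond), np.linalg.norm(cfg), np.linalg.norm(apg))
```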

Empirical Findings

Higher VAE reconstruction quality does not directly translate to better TTS generation; overly accurate reconstruction inflates latent dimensionality, making diffusion modeling harder. Systematic ablations identified the optimal latent configuration as 64 dimensions with an 11.7 Hz frame rate, balancing fidelity and learnability.

Objective evaluation of VAE latent dimensions

Benchmark Results

On the Seed benchmark, LongCat‑AudioDiT‑3.5B achieves speaker similarity (SIM) scores of 0.818 on Seed‑ZH and 0.797 on Seed‑Hard, surpassing Seed‑DiT, CosyVoice 3.5, and MiniMax‑Speech. Intelligibility scores are competitive:

Chinese CER: 1.09 % (3.5B), 1.18 % (1B)

English WER: 1.50 % (3.5B), 1.78 % (1B)

Chinese hard‑sentence CER: 6.04 % (3.5B) vs. 8.67 % for F5‑TTS

These results were obtained using only ASR‑generated transcripts for pre‑training, without high‑quality human annotations or multi‑stage pipelines.

Open‑Source Release

Paper: https://arxiv.org/abs/2603.29339v1

GitHub: https://github.com/meituan-longcat/LongCat-AudioDiT

HuggingFace: https://huggingface.co/meituan-longcat/LongCat-AudioDiT

LongCat‑AudioDiT demonstrates that a pure waveform‑latent diffusion approach can match or exceed the performance of complex multi‑stage pipelines, offering a new direction for high‑fidelity speech synthesis and multimodal audio generation.

Tags: diffusion model, open-source, AI research, text-to-speech, audio generation, waveform VAE
Written by

Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.