ControlAudio: Script‑Driven, Time‑Precise Text‑to‑Audio Generation (ACL 2026)
ControlAudio, a progressive diffusion framework introduced by Tsinghua researchers, unifies text, timing, and phoneme modeling to enable precise control over when sounds occur and what is spoken, achieving superior alignment and intelligibility while preserving high‑fidelity audio generation.
Research background – Text‑to‑Audio (TTA) has advanced from simple sound‑effect synthesis to high‑fidelity diffusion‑based generation, yet existing systems lack fine‑grained control over event timing and speech clarity.
Problem – Current TTA models can neither accurately schedule sound events nor guarantee intelligible speech content.
Core method (ControlAudio) – The authors propose a three‑part progressive diffusion approach:
Data construction & representation: combine manually annotated and simulated data to create multi‑level samples, and design structured prompts that let a pretrained text encoder jointly encode text, timing, and phoneme information.
Model training: pre‑train a diffusion model on large‑scale text‑audio data, then fine‑tune with timing annotations, and finally incorporate phoneme cues to achieve hierarchical control.
Guided sampling: during inference, first generate the overall temporal structure using text + timing conditions, then progressively inject phoneme information with stronger guidance to refine speech content (a minimal sketch follows this list).
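To make the coarse‑to‑fine idea concrete, here is a minimal Python sketch. The `denoise` function and the toy conditioning vectors are hypothetical stand‑ins, not the authors' model; only the scheduling logic (text + timing established first, phoneme guidance ramped up later) reflects the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(x, cond, strength=0.1):
    """Hypothetical denoiser: nudges the latent toward the condition."""
    return x + strength * (cond - x)

def guided_sample(text_cond, timing_cond, phoneme_cond, steps=50):
    x = rng.standard_normal(text_cond.shape)   # start from pure noise
    coarse = 0.5 * (text_cond + timing_cond)   # text + timing: temporal layout
    for t in range(steps):
        progress = t / (steps - 1)             # 0 -> 1 over the trajectory
        # Early steps: lay down the overall temporal structure.
        x = denoise(x, coarse)
        # Later steps: phoneme guidance ramps up past the midpoint,
        # refining speech content without disturbing the coarse layout.
        w = max(0.0, (progress - 0.5) * 2.0)
        x = denoise(x, phoneme_cond, strength=0.1 * w)
    return x

audio_latent = guided_sample(
    text_cond=rng.standard_normal(8),
    timing_cond=rng.standard_normal(8),
    phoneme_cond=rng.standard_normal(8),
)
print(audio_latent.shape)  # (8,)
```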
Progressive diffusion modeling – Training proceeds in three stages (Text → Text + Timing → Text + Timing + Phoneme), each adding a finer control signal. At inference, a coarse‑to‑fine sampling schedule mirrors this training progression, improving temporal alignment and speech intelligibility.
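The staged curriculum can be summarized in a few lines. This is a schematic of our reading of the paper, not the authors' training code; the stage names and the `run_stage` helper are illustrative placeholders.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    conditions: tuple  # control signals active during this stage

# Three-stage curriculum as described above; stage names are ours.
CURRICULUM = [
    Stage("pretrain_text",    ("text",)),
    Stage("finetune_timing",  ("text", "timing")),
    Stage("finetune_phoneme", ("text", "timing", "phoneme")),
]

def run_stage(model_state, stage):
    # Placeholder update: a real run would fine-tune the diffusion model
    # on samples carrying exactly these conditioning signals.
    model_state["seen_conditions"].append(stage.conditions)
    return model_state

state = {"seen_conditions": []}
for stage in CURRICULUM:
    state = run_stage(state, stage)
print(state["seen_conditions"])
# [('text',), ('text', 'timing'), ('text', 'timing', 'phoneme')]
```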
Dataset construction – Real data are derived from AudioSet‑SL with timestamps and transcriptions, expanding <text, audio> pairs into <text, timing, phoneme, audio> tuples. Synthetic data are generated by statistically modeling speech activity and mixing single‑ or multi‑speaker segments with background audio, yielding over 170,000 training samples. Structured prompts are generated automatically via a Chain‑of‑Thought (CoT) pipeline that parses natural language into (event, time, speech) triples.
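The serialization step of that pipeline is straightforward to picture. Below is a hedged sketch that assembles the structured prompt format shown in the sample prompts further down; the upstream CoT parsing is elided, and `build_structured_prompt` is our illustrative name, not the authors' code.

```python
def build_structured_prompt(caption, events):
    """Serialize (label, intervals) pairs into the structured prompt syntax.

    events: list of (label, [(start_sec, end_sec), ...]) tuples.
    """
    parts = []
    for label, intervals in events:
        spans = "".join(f"<{s:.2f},{e:.2f}>" for s, e in intervals)
        parts.append(f"@{{{label}.&{spans}}}")
    return caption + " " + "".join(parts)

prompt = build_structured_prompt(
    "Music plays, followed by mechanisms, typing, beeps, and an alarm.",
    [("Music",  [(0.0, 10.0)]),
     ("Beeps",  [(1.0, 1.2), (3.0, 3.2), (4.9, 5.1), (6.9, 7.1)]),
     ("Typing", [(1.2, 7.8)]),
     ("Alarm",  [(7.85, 8.5)])],
)
print(prompt)
```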
Experimental results – On the AudioCondition benchmark for time‑controllable audio, ControlAudio significantly improves event‑time alignment while maintaining or surpassing baseline scores on FAD and CLAP metrics. In speech‑focused evaluations, it delivers clearer, more intelligible speech and comparable overall audio quality, demonstrating the ability to model time structure and content within a single framework.
Conclusion and outlook – By rethinking data, training, and sampling together, ControlAudio addresses the fine‑grained control problem in TTA and shows stronger generality and extensibility than prior single‑dimension approaches. The authors anticipate that the "multi‑granularity conditional modeling + progressive generation" paradigm will guide future unified audio, speech, and music generation systems.
Sample prompts
Text Prompt: Music plays, followed by mechanisms, typing, beeps, and an alarm.
Timing Prompt: Music: 0.00‑10.00 s; Beeps: 1.00‑1.20 s, 3.00‑3.20 s, 4.90‑5.10 s, 6.90‑7.10 s; Typing: 1.20‑7.80 s; Alarm: 7.85‑8.50 s.
Structured Prompt: Music plays, followed by mechanisms, typing, beeps, and an alarm. @{Music.&<0.00,10.00>}@{Beeps.&<1.00,1.20><3.00,3.20><4.90,5.10><6.90,7.10>}@{Typing.&<1.20,7.80>}@{Alarm.&<7.85,8.50>}
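For readers who want to work with this format programmatically, here is a small parser sketch (the reverse of the builder shown earlier). The grammar is inferred from this single published example and may not cover the authors' full specification.

```python
import re

# Regexes inferred from the example above; the real grammar may allow
# variations this sketch does not handle.
EVENT_RE = re.compile(r"@\{([^.{}]+)\.&((?:<[\d.]+,[\d.]+>)+)\}")
SPAN_RE = re.compile(r"<([\d.]+),([\d.]+)>")

def parse_structured_prompt(prompt):
    """Recover (label, [(start, end), ...]) pairs from a structured prompt."""
    return [
        (label, [(float(s), float(e)) for s, e in SPAN_RE.findall(spans)])
        for label, spans in EVENT_RE.findall(prompt)
    ]

example = ("Music plays, followed by mechanisms, typing, beeps, and an alarm. "
           "@{Music.&<0.00,10.00>}"
           "@{Beeps.&<1.00,1.20><3.00,3.20><4.90,5.10><6.90,7.10>}"
           "@{Typing.&<1.20,7.80>}@{Alarm.&<7.85,8.50>}")
for label, spans in parse_structured_prompt(example):
    print(label, spans)
```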
