How Make-An-Audio Turns Text Into Realistic Sound Effects
Make-An-Audio, a collaborative text‑to‑audio model from Zhejiang University, Peking University and Volcano Speech, uses a Distill‑then‑Reprogram strategy to generate high‑quality, controllable sound effects from any modality, showcasing impressive demos and promising future AIGC applications.
Make-An-Audio is a new text‑to‑audio generation model capable of producing realistic sound effects from natural‑language descriptions and of accepting any modality (text, audio, image, video) as input.
Recent AIGC trends have highlighted breakthroughs such as generating images, video and 3D models from text; however, audio synthesis has lagged due to scarce paired data and long‑duration waveform modeling. Make‑An‑Audio addresses these challenges with a “Distill‑then‑Reprogram” text‑augmentation strategy that first distills natural language descriptions from a teacher model and then reprograms them with randomly recombined event data to create dynamic training samples.
Key technical components include a self‑supervised pipeline combining audio captioning and audio‑text retrieval for the Distill stage, a latent diffusion model that predicts spectrogram representations, and CLAP‑based contrastive language‑audio pretraining together with pretrained language models (T5, BERT) for robust text conditioning. The model also introduces a CLAP Score for evaluating audio‑text consistency and demonstrates strong zero‑shot generalization on benchmark datasets.
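The idea behind a CLAP-style consistency score is cosine similarity between a text embedding and an audio embedding in the shared contrastive space. This is a minimal sketch of that computation, assuming the embeddings come from pretrained CLAP encoders; the random vectors below merely stand in for real encoder outputs.

```python
import numpy as np

def clap_score(text_emb, audio_emb):
    """Cosine similarity between a text embedding and an audio
    embedding -- the quantity a CLAP Score is built on. Higher
    means the generated audio better matches the prompt."""
    t = text_emb / np.linalg.norm(text_emb)
    a = audio_emb / np.linalg.norm(audio_emb)
    return float(np.dot(t, a))

# Toy vectors standing in for CLAP encoder outputs.
rng = np.random.default_rng(0)
t, a = rng.normal(size=512), rng.normal(size=512)
print(clap_score(t, a))
```

Because both encoders project into the same space, the score needs no reference audio, which is what makes it usable as an automatic audio-text consistency metric.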
Demonstrations show the model generating audio for prompts such as “a speedboat running as wind blows into a microphone” and “fireworks pop and explode”, as well as repairing damaged audio and converting images or video frames into corresponding sound effects. Sample videos have attracted tens of thousands of views on social media.
Make‑An‑Audio embodies the “No Modality Left Behind” principle, enabling high‑quality, controllable audio synthesis from any input modality, and is expected to impact film dubbing, short‑video creation and broader AIGC applications, though occasional mismatches between text and generated audio remain.
Paper: https://arxiv.org/abs/2301.12661
Project page: https://text-to-audio.github.io
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Volcano Engine Developer Services
The Volcano Engine Developer Community connects the platform with developers, offering cutting-edge technical content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.