AI RAP: End-to-End Speech Synthesis for Rap Generation Using Location‑Sensitive Attention and Inference Mask

AI RAP is an end‑to‑end AI service that lets users generate personalized rap with a single click. It combines location‑sensitive attention with an inference mask to reach 100% alignment accuracy on the team's test set, and adds beat‑synchronous timing, multiple character voice timbres, synthesis of up to 40 characters in about one second on a CPU, and a distributed architecture designed to support millions of daily users.

iQIYI Technical Product Team

Background: The popular Chinese rap show "China New Rap" attracted massive attention and left many viewers wanting to create freestyle rap of their own, but there was no easy way for them to do so.

To address this, the team released an AI‑powered service called "AI RAP" that enables ordinary users to generate personalized rap songs with a single click, fulfilling the vision of making entertainment simpler and more fun.

Technical challenges: Compared with normal speech, rap has a fast tempo and blurred phoneme boundaries. Conventional end‑to‑end TTS models can produce continuous audio, but they struggle to locate precise character boundaries, and both their alignment accuracy and their real‑time performance fall short. AI RAP implements several optimizations to overcome these problems.

AI RAP integrates multiple AI techniques. It is the first system to combine location‑sensitive attention with an inference mask, significantly improving alignment correctness. Based on the alignment results, the synthesized speech is automatically matched to the beats of background music, and an optimal stretching strategy is applied to align each word with the corresponding rap tempo, producing a professional‑level flow.
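The beat‑matching step can be illustrated with a small sketch. It assumes the alignment yields a start and end time for each character, and that each character is assigned an equal slice of a beat. The article does not describe the team's actual "optimal stretching strategy", so the proportional rule below and the names `beat_slots` and `stretch_ratios` are hypothetical stand‑ins.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def beat_slots(bpm: float, n_chars: int, chars_per_beat: int = 2,
               start: float = 0.0) -> List[Span]:
    """Assumed layout: every character gets an equal share of a beat."""
    slot = 60.0 / bpm / chars_per_beat
    return [(start + i * slot, start + (i + 1) * slot) for i in range(n_chars)]


def stretch_ratios(char_spans: List[Span], slots: List[Span]) -> List[float]:
    """Ratio > 1 means the character must be lengthened, < 1 shortened."""
    if len(char_spans) != len(slots):
        raise ValueError("need exactly one beat slot per character")
    ratios = []
    for (c0, c1), (b0, b1) in zip(char_spans, slots):
        synthesized = c1 - c0
        if synthesized <= 0:
            raise ValueError("character span must have positive duration")
        ratios.append((b1 - b0) / synthesized)
    return ratios


# Two characters synthesized as 0.2 s and 0.3 s, placed on a 120 BPM track
# with two characters per beat (0.25 s per slot).
slots = beat_slots(bpm=120, n_chars=2)
ratios = stretch_ratios([(0.0, 0.2), (0.2, 0.5)], slots)
```

In this example the first character would be stretched and the second compressed so that both land on their slots. A production system would then hand these ratios to a time‑scale modification step, which is outside the scope of this sketch.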

In the speech synthesis component, Google's Tacotron model restores timbre well, but its attention alignment is often unstable and fails to stay monotonic, which leads to missing or repeated characters. AI RAP further refines the Tacotron architecture, markedly improving alignment accuracy and synthesis quality, and supports multiple voice timbres.
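To make "missing or repeated characters" concrete, the sketch below checks an alignment path for those two failure modes. It assumes the alignment has been reduced to one attended character index per decoder frame. The criterion and the name `alignment_is_clean` are illustrative assumptions, not the team's published evaluation method.

```python
from typing import List


def alignment_is_clean(frame_to_char: List[int], n_chars: int) -> bool:
    """True if the path covers characters 0..n_chars-1 in order.

    A backward step means a character is repeated; a jump of more than
    one means a character is skipped.
    """
    if not frame_to_char:
        return False
    if frame_to_char[0] != 0 or frame_to_char[-1] != n_chars - 1:
        return False
    for prev, cur in zip(frame_to_char, frame_to_char[1:]):
        if cur - prev not in (0, 1):
            return False
    return True


def alignment_accuracy(paths: List[List[int]], lengths: List[int]) -> float:
    """Fraction of utterances whose alignment path is clean."""
    clean = sum(alignment_is_clean(p, n) for p, n in zip(paths, lengths))
    return clean / len(paths)
```

For example, the path `[0, 0, 1, 2, 2, 3]` over four characters passes, while `[0, 1, 3, 3]` fails because character 2 is skipped and `[0, 1, 0, 1, 2]` fails because the model moves backward.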

The system adopts the location‑sensitive attention module from Tacotron‑2 and adds an inference mask that is applied only at inference time. The mask restricts where attention can move so that the alignment stays monotonic, raising alignment accuracy on a test set of 8,000‑plus samples from 80% to 100%.
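One way such a mask can work is sketched below: at each decoder step, attention scores outside a small forward window starting at the previously attended position are dropped before the softmax. The window shape, its default size, and the name `masked_attention_step` are assumptions made for illustration; the article does not specify the exact form of the mask.

```python
import math
from typing import List, Tuple


def masked_attention_step(energies: List[float], prev_pos: int,
                          window: int = 3) -> Tuple[List[float], int]:
    """Softmax over energies restricted to [prev_pos, prev_pos + window].

    Returns the attention weights and the newly attended position.
    Positions outside the window get exactly zero weight, so attention
    can neither move backward nor jump far ahead.
    """
    n = len(energies)
    if not 0 <= prev_pos < n:
        raise ValueError("prev_pos must index an encoder step")
    lo, hi = prev_pos, min(n, prev_pos + window + 1)
    peak = max(energies[lo:hi])  # subtract the peak for numerical stability
    exps = [math.exp(e - peak) if lo <= j < hi else 0.0
            for j, e in enumerate(energies)]
    total = sum(exps)
    weights = [x / total for x in exps]
    new_pos = max(range(n), key=weights.__getitem__)
    return weights, new_pos


# Unmasked, the largest scores sit at index 4 (a skip) and index 0 (a
# repeat); with the mask only indices 1..3 are eligible.
weights, pos = masked_attention_step([5.0, 0.1, 0.2, 3.0, 9.0],
                                     prev_pos=1, window=2)
```

Because the mask touches only the scores at inference time, a scheme like this needs no retraining, which is consistent with the article's description of the mask as an inference‑time addition.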

Attention optimization results (before vs. after) demonstrate a clear improvement in alignment stability and naturalness.

Voice timbre: Trained on a large dataset, AI RAP can generate the voices of IP characters such as "Crayon Shin‑chan" and "Chibi Maruko‑chan" as well as popular comedy personalities, which appeals to users of different ages and lets them have celebrities "sing" their lyrics.

Synthesis speed: The system replaces the iterative Griffin‑Lim algorithm with a proprietary waveform reconstruction algorithm, achieving synthesis of up to 40 characters in about one second on a CPU.

Scalability: With the speed optimizations and a distributed deployment architecture, AI RAP can support millions of daily users, allowing widespread rap creation and sharing.

