Tencent GME Audio Technology Solution for QQ Dance Mobile Game Real-time KTV
Tencent's Game Multimedia Engine (GME) powers QQ Dance's mobile real-time KTV, delivering frame-level synchronization of vocals, accompaniment, and lyrics; ultra-low-latency in-ear monitoring; robust echo cancellation; and high MOS scores even under severe packet loss. It also keeps CPU and data usage low and adds 3D spatial voice rendering for immersive multiplayer gameplay.
On March 14th, Tencent's popular mobile game "QQ Dance" was officially launched on major app stores and quickly rose to #1 on the App Store free games chart. Prior to release, the game had already accumulated over 20 million pre-registrations across all channels.
The mobile version was developed by the original PC game team, maintaining the core experience of the desktop version while introducing innovative gameplay modes such as pinball mode and real-time in-game KTV rooms. However, these innovations brought significant audio technology challenges: aligning vocals, accompaniment, and lyrics during KTV sessions, and ensuring clear voice transmission without echo during multiplayer voice communication.
The real-time KTV gameplay in QQ Dance is far more complex than ordinary audio processing. In a normal KTV flow, the player hears the accompaniment, sings along, and the microphone captures and outputs the voice. In gaming scenarios, however, the system playback module introduces delay, and Android devices in particular span many models with large, varied playback and capture delays; simply following the normal flow would leave vocals, accompaniment, and lyrics obviously out of sync.
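The article does not describe GME's internal method, but the delay problem it names can be sketched simply: measure the device's playback-plus-capture delay, shift the captured vocal earlier by that amount, then mix. A minimal illustration (function and parameter names are assumptions, not GME's API):

```python
def align_and_mix(vocal, accompaniment, delay_ms, sample_rate=48000):
    """Shift the captured vocal earlier by the measured device delay, then mix.

    delay_ms: measured playback delay + capture delay for this device
    (illustrative; real engines measure this per device model).
    """
    offset = int(sample_rate * delay_ms / 1000)
    aligned = vocal[offset:]                        # drop the late-recorded lead-in
    aligned += [0.0] * (len(vocal) - len(aligned))  # pad the tail to keep length
    return [v + a for v, a in zip(aligned, accompaniment)]
```

With an accurate per-device delay estimate, the shifted vocal lands on the same frames as the accompaniment, which is the precondition for keeping lyrics in step as well.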
Tencent's Audio and Video Lab provided the Game Multimedia Engine (GME) as the solution for in-game KTV gameplay and multiplayer real-time voice communication. The results exceeded expectations: vocals, accompaniment, and lyrics are perfectly synchronized during gameplay, with reverb effects applied to the vocals. When players speak while others are singing, the voices remain "isolated" from one another, so each sound source comes through clearly without mixing noise.
Technical experts explained: "We fully considered the impact of network jitter on sound transmission, and through processing the resulting waveform changes, we achieved frame-level precise synchronization between vocals, lyrics, and accompaniment."
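The "processing of waveform changes" caused by jitter is not detailed in the article; a standard building block for this is a jitter buffer that reorders out-of-order voice frames and flags the ones that never arrive so a concealment stage can fill them. A minimal sketch (class and method names are assumptions):

```python
class JitterBuffer:
    """Reorders out-of-order voice frames by sequence number.

    pop() returns the next frame in sequence, or None for a frame that
    never arrived; the caller conceals None frames (e.g. by repeating
    recent audio) so playback timing stays frame-accurate.
    """

    def __init__(self):
        self.pending = {}
        self.next_seq = 0

    def push(self, seq, frame):
        if seq >= self.next_seq:        # discard frames that arrive too late
            self.pending[seq] = frame

    def pop(self):
        frame = self.pending.pop(self.next_seq, None)
        self.next_seq += 1
        return frame
```

Because pop() advances at a fixed frame cadence regardless of arrival order, downstream mixing of vocals, accompaniment, and lyric timing can stay aligned even when the network reorders or drops packets.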
Based on self-developed, high-quality echo cancellation technology, GME ensures players can hear everyone clearly in multiplayer real-time voice scenarios without voice clipping. If a player speaks while others are singing, GME can switch to a radio-style "ducking" effect to improve the experience for both the performer and the audience.
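Ducking itself is a simple idea: while speech is detected, attenuate the music bed by a fixed amount so the voice sits on top. A toy illustration (the attenuation value and names are illustrative, not GME parameters):

```python
def duck(music_frame, speech_active, duck_db=-12.0):
    """Attenuate a music frame while speech is active (radio-style ducking).

    duck_db is an illustrative attenuation; production systems also smooth
    the gain over time (attack/release) to avoid audible pumping.
    """
    gain = 10.0 ** (duck_db / 20.0) if speech_active else 1.0
    return [s * gain for s in music_frame]
```

In practice the `speech_active` flag would come from a voice activity detector running on the echo-cancelled microphone signal.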
GME also provides in-ear monitoring ("ear return") with latency on the order of 30 ms. High-quality audio in KTV scenarios also demands more of network transmission, with stricter requirements for weak-network optimization and jitter resistance, which GME meets effectively.
Beyond real-time KTV, GME also provides technical support for QQ Dance's real-time voice functionality. GME has mature audio processing experience and has supported over 400 products. It covers scenarios such as in-game voice chat, casual games, and audio streaming, matching each with an appropriate audio quality profile and network-impairment resistance technology.
Under loss-free network conditions, real-time voice achieves an average MOS score of 4.38 (out of 5), with average latency below 200 ms. Through advanced packet-loss recovery technology, packet-loss compensation algorithms, and strong network resilience, communication stays smooth and audio quality stays good even with over 50% packet loss and 1000 ms of network jitter.
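GME's actual loss-compensation algorithms are not described in the article. One of the simplest classical concealment strategies, shown here only to make the idea concrete, is to repeat the last good frame with a decaying gain so that a long loss burst fades to silence rather than clicking (the decay factor is illustrative):

```python
def conceal_losses(frames, fade=0.7):
    """Replace lost frames (None) by repeating the last good frame,
    attenuated each repetition so a burst of losses fades out smoothly.

    fade=0.7 is an illustrative decay factor, not a GME parameter.
    """
    out, last = [], None
    for frame in frames:
        if frame is None:
            frame = [s * fade for s in last] if last else []
        out.append(frame)
        last = frame
    return out
```

Modern engines combine concealment like this with forward error correction and retransmission, which is how usable audio survives loss rates above 50%.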
GME has also been optimized for data consumption and CPU usage. In MOBA games, for example, with normal voice communication and good performance, mobile-network mode consumes less than 500 KB per minute with average CPU usage below 10%.
Notably, GME's self-developed 3D real-time voice technology uses HRTF algorithms to model the human ear from the time and spectral differences between the two ears, turning non-directional sound into sound that carries source-direction information and thereby virtualizing a source at any position in space. The technology suits battle royale and card/board game scenarios, where sound localization gives players a better gaming experience.
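A real HRTF renderer convolves audio with measured ear responses; the toy sketch below reproduces only the two cues the paragraph names, interaural time difference (ITD) and interaural level difference (ILD), for a source on the horizontal plane. All constants and names are illustrative, not GME's implementation:

```python
import math

def spatialize(mono, azimuth_deg, sample_rate=48000):
    """Toy binaural panner using ITD + ILD only (not a real HRTF).

    azimuth_deg: 0 = straight ahead, +90 = hard right.
    Returns (left, right) sample lists the same length as `mono`.
    """
    az = math.radians(azimuth_deg)
    # Woodworth ITD approximation: head radius ~8.75 cm, speed of sound 343 m/s
    itd_s = (0.0875 / 343.0) * (abs(az) + math.sin(abs(az)))
    delay = int(round(itd_s * sample_rate))
    far_gain = 1.0 - 0.3 * abs(math.sin(az))   # crude level difference
    near = list(mono)
    far = ([0.0] * delay + [s * far_gain for s in mono])[:len(mono)]
    return (far, near) if azimuth_deg > 0 else (near, far)
```

Even these two cues alone let a listener roughly localize a teammate's voice left or right; full HRTF filtering adds the spectral shaping needed for front/back and elevation perception.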
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.