Tagged articles
47 articles
Machine Heart
Apr 23, 2026 · Artificial Intelligence

UniLS: End-to-End Audio-Driven Framework Eliminates the ‘Poker Face’ in Digital Human Dialogue

UniLS, the first end‑to‑end audio‑driven framework that jointly generates speaking and listening facial motions for digital humans, achieves state‑of‑the‑art speaking accuracy, improves listening naturalness by 44.1%, and runs at over 500 FPS, as shown by extensive quantitative and user studies in the CVPR 2026‑accepted paper.

CVPR 2026 · Speech synthesis · audio-driven animation
0 likes · 9 min read
AI Open-Source Efficiency Guide
Apr 6, 2026 · Artificial Intelligence

VibeVoice vs PersonaPlex vs OmniVoice: A Comprehensive Open‑Source AI Voice Comparison

This article provides a detailed side‑by‑side analysis of three open‑source speech AI projects—Microsoft's VibeVoice, NVIDIA's PersonaPlex, and Xiaomi's OmniVoice—covering their positioning, core models, technical highlights, multilingual support, performance metrics, licensing, and recommended use cases.

AI · Speech synthesis · automatic speech recognition
0 likes · 15 min read
Weekly Large Model Application
Mar 23, 2026 · Artificial Intelligence

Inside Step‑Audio2: End‑to‑End Multimodal Audio LLM Architecture and Deployment

This article dissects Step‑Audio2, an industrial‑grade multimodal large language model that unifies speech understanding, translation, dialogue and audio generation in a single causal LM, detailing its inference pipeline, key implementation tricks, deployment modes, strengths, limitations, and suitable application scenarios.

Python · Speech synthesis · Step-Audio2
0 likes · 10 min read
AI Explorer
Mar 19, 2026 · Artificial Intelligence

Unveiling Hunter Alpha: Xiaomi’s MiMo‑V2‑Pro and Two New Models Revealed

After a week of anonymous dominance on OpenRouter, Xiaomi revealed that the top‑ranking Hunter Alpha and Healer Alpha models are its MiMo‑V2‑Pro and MiMo‑V2‑Omni, respectively, and introduced the MiMo‑V2‑TTS voice model, detailing their large parameter counts, benchmark scores, pricing, multimodal capabilities, and a clever blind‑test launch strategy.

AI Agent · Benchmark · MiMo-V2
0 likes · 11 min read
Weekly Large Model Application
Feb 22, 2026 · Artificial Intelligence

2026 Guide: Pure‑CPU Open‑Source Chinese TTS Models Optimized for Performance

This article reviews the most capable open‑source Chinese text‑to‑speech models that run entirely on CPU in 2026, compares their quantization and speed features, recommends acceleration engines, outlines five hard‑won optimization rules, and provides a concise selection guide for various deployment scenarios.

CPU inference · Chinese TTS · ONNX Runtime
0 likes · 6 min read
Old Zhang's AI Learning
Feb 7, 2026 · Artificial Intelligence

Zero‑Shot Voice Cloning with Emotion and Duration Control: IndexTTS‑2 Runs Locally

IndexTTS‑2, an open‑source zero‑shot TTS system from Bilibili, enables precise duration control, emotion‑timbre separation, and bilingual synthesis, offering a modern uv‑based installation, GPU‑accelerated inference, and benchmark‑leading WER and emotional similarity scores compared with contemporary models.

AI · IndexTTS-2 · Speech synthesis
0 likes · 10 min read
DataFunSummit
Sep 7, 2025 · Artificial Intelligence

How NIO Cut Radio Production Costs by 80% with AI Voice Cloning

This article details NIO's AI‑driven voice‑cloning solution for its in‑car NIO Radio, explaining the business background, pain points of traditional production, the TTS‑VC framework and modular workflow, evaluation metrics, and the resulting cost savings, efficiency gains, and scalability across dozens of cities.

AI · Cost reduction · Speech synthesis
0 likes · 10 min read
Bilibili Tech
Aug 12, 2025 · Artificial Intelligence

How AI Recreates Original Voices in Multilingual Video Dubbing

This article explains the technical challenges and innovative AI solutions behind preserving speaker identity, emotion, and timing while translating video content into multiple languages, covering speech generation modeling, speaker segmentation, adversarial reinforcement learning, proper‑noun adaptation, and audio‑visual alignment techniques.

AI voice cloning · Deep Learning · Speech synthesis
0 likes · 22 min read
Bilibili Tech
Aug 5, 2025 · Artificial Intelligence

How Bilibili’s IndexTTS2 Achieves Real‑Time, Emotion‑Rich Voice Translation

IndexTTS2 introduces a cross‑modal, multi‑language voice translation system that preserves speaker identity, acoustic space, and multi‑source timbre, while tackling challenges like voice personality loss, subtitle cognitive load, localization costs, multi‑speaker diarization, and cultural adaptation through novel time‑coding, adversarial RL, and diffusion‑based lip‑sync techniques.

Multimodal AI · Speech synthesis · adversarial reinforcement learning
0 likes · 20 min read
Cognitive Technology Team
Jul 1, 2025 · Artificial Intelligence

How We Built a Live‑Streaming TTS Engine: From Data Pipelines to AI Voice Generation

This article summarizes hands‑on experience building an intelligent digital‑human system, covering six core modules (LLM content generation, LLM interaction, TTS synthesis, visual driving, audio‑video engineering, and backend services) and detailing data collection, signal processing, ASR annotation, speaker clustering, model optimization (V1‑V4), evaluation metrics, and future research directions.

AI voice · Audio Processing · Digital Human
0 likes · 23 min read
DaTaobao Tech
Jun 27, 2025 · Artificial Intelligence

Building a High‑Quality Live‑Streaming Digital Human: TTS Pipeline, Data Processing, and Model Optimizations

This article details the end‑to‑end workflow for creating intelligent digital humans for live streaming, covering large‑language‑model‑driven content generation, multi‑stage TTS architecture, extensive audio‑signal processing, speaker clustering, front‑end text normalization, back‑end acoustic modeling, and quantitative evaluation of model improvements.

AI · Digital Human · Speech synthesis
0 likes · 22 min read
Full-Stack DevOps & Kubernetes
Jul 29, 2024 · Artificial Intelligence

How to Run Real‑Time Voice Cloning with Python: A Step‑by‑Step Guide

This guide introduces the open‑source Realtime Voice Cloning project, explains its key features, and provides detailed installation and usage instructions—including environment setup, dependency installation, cloning the repository, and running the demo tool—to enable real‑time voice transformation with Python.

AI · Python · Real-time Audio
0 likes · 5 min read
Tencent Cloud Developer
Jun 14, 2024 · Artificial Intelligence

GPT-4o Speech Multimodal Technology: Speech Tokenization, LLM Integration, and Zero-shot TTS

GPT‑4o’s speech multimodal system discretizes audio into semantic and acoustic tokens, integrates these tokens with large language models through multi‑stage instruction tuning, and employs hierarchical zero‑shot text‑to‑speech decoding, enabling low‑latency, streaming, and prompt‑driven voice synthesis for applications like gaming.

AudioLM · GPT-4o · LLM integration
0 likes · 33 min read
php Courses
Sep 1, 2023 · Artificial Intelligence

Integrating Baidu Text-to-Speech API with PHP

This tutorial demonstrates how to obtain Baidu TTS credentials, construct the required signature, send an HTTP request using PHP's cURL library, and save the returned audio data as an MP3 file, providing a complete code example for developers.

Baidu TTS · PHP · Speech synthesis
0 likes · 5 min read
58 Tech
Aug 25, 2023 · Artificial Intelligence

Voice Cloning Technology in AI Sales Assistant

This article introduces the AI sales assistant from 58.com, detailing its background, a few‑shot voice cloning approach using real dialogue data, multi‑accent naturalness optimization, deployment architecture, and future plans, while evaluating performance metrics and discussing challenges in speech synthesis quality and stability.

AI sales assistant · Few‑Shot Learning · Speech synthesis
0 likes · 19 min read
DataFunSummit
Aug 15, 2023 · Artificial Intelligence

AI Sales Assistant: Few‑Shot Voice Cloning and Multi‑Accent Naturalness Optimization

The article presents the AI sales assistant built by 58.com's AI Lab, detailing its background, a few‑shot voice‑cloning pipeline built on real dialogue data, data preprocessing, FastSpeech2‑based acoustic modeling, multi‑accent style transfer, deployment architecture, controllable synthesis parameters, and future research directions.

AI sales assistant · Fastspeech2 · Speech synthesis
0 likes · 20 min read
Programmer DD
Jun 20, 2023 · Artificial Intelligence

Yann LeCun: Today's AI Still Below Dog Level – Inside Meta’s Voicebox, MusicGen & I‑JEPA

Meta’s chief AI scientist Yann LeCun warned that current large language models still fall short of human and even dog intelligence, citing their lack of real‑world understanding, while Meta unveiled three new generative AI models—Voicebox for speech, MusicGen for music, and I‑JEPA for image reasoning—showcasing both progress and remaining limitations.

Computer Vision · Music generation · Speech synthesis
0 likes · 7 min read
Tencent Cloud Developer
Apr 4, 2023 · Artificial Intelligence

Step-by-Step Guide to Building Your Own Realistic AI Image Generation Website with Stable Diffusion

This step‑by‑step tutorial shows how to set up a Stable Diffusion web UI, install the required Python environment and GPU‑enabled PyTorch, add Chinese localization and optional LoRA or Deforum extensions, generate realistic human images, create animated videos, and add speech with D‑ID, all ready for deployment on your own AI website.

Deforum · Git · Python
0 likes · 9 min read
Volcano Engine Developer Services
Feb 14, 2023 · Artificial Intelligence

How Make-An-Audio Turns Text Into Realistic Sound Effects

Make-An-Audio, a collaborative text‑to‑audio model from Zhejiang University, Peking University and Volcano Speech, uses a Distill‑then‑Reprogram strategy to generate high‑quality, controllable sound effects from any modality, showcasing impressive demos and promising future AIGC applications.

AIGC · Deep Learning · Speech synthesis
0 likes · 7 min read
DataFunSummit
Dec 9, 2022 · Artificial Intelligence

Volcano Engine Virtual Digital Human Technology Overview

This article provides a comprehensive overview of Volcano Engine's virtual digital human platform, detailing its definition, AI‑driven and human‑driven classifications, 2D and 3D technical architectures, multi‑modal perception, interaction capabilities, application scenarios, and future development directions.

2D avatar · 3D Avatar · Computer Vision
0 likes · 15 min read
iQIYI Technical Product Team
Aug 26, 2022 · Artificial Intelligence

IQDubbing: AI-Powered Multi-Language, Multi-Voice Dubbing System for Film and TV

iQIYI’s IQDubbing system leverages AI‑driven voice conversion to automatically generate high‑quality, expressive dubbing in dozens of languages and over 50 character voice styles, streamlining multilingual film and TV localization, reducing reliance on scarce actors, and earning positive audience feedback, patents and industry awards.

AI dubbing · Film Production · Speech synthesis
0 likes · 13 min read
Xiaohongshu Tech REDtech
Aug 10, 2022 · Artificial Intelligence

Multi-Stage Multi-Codebook VQ-VAE for High-Performance Neural Text-to-Speech (MSMC‑TTS)

The MSMC‑TTS system, a multi‑stage multi‑codebook VQ‑VAE based neural text‑to‑speech solution, delivers near‑human audio quality (MOS 4.41) with a compact 3.12 MB acoustic model, substantially surpassing Mel‑Spectrogram FastSpeech baselines in naturalness and efficiency.

Compact Representation · Multi-Stage Modeling · Speech synthesis
0 likes · 10 min read
NetEase Smart Enterprise Tech+
Jun 14, 2022 · Artificial Intelligence

How Outbound Call Robots Work: Challenges and Optimizations in Voice Dialogue Systems

This article explains the architecture of outbound call robots, classifies dialogue system types, details pipeline and end‑to‑end task‑oriented designs, highlights technical challenges such as dialects and transcription errors, and presents optimization techniques like ASR correction and script improvement.

AI Optimization · ASR correction · NLU
0 likes · 12 min read
Zuoyebang Tech Team
May 19, 2022 · Artificial Intelligence

How to Achieve High‑Quality TTS with Only Minutes of Data

This article reviews neural speech synthesis, explains why large amounts of high‑quality paired data are essential, and presents a range of low‑resource solutions, including semi‑supervised pre‑training, cross‑language transfer, speaker embedding, and Conformer‑based model upgrades, showing how the Zuoyebang team built a robust TTS system with as little as 7 minutes of recordings per speaker.

Conformer · Fastspeech2 · Speech synthesis
0 likes · 15 min read
DataFunSummit
Apr 14, 2022 · Artificial Intelligence

Advances in Alibaba's Digital Human Technology: Construction, Performance, Interaction, and the MMTK Multimodal Algorithm Library

This article reviews Alibaba's digital‑human (virtual avatar) research over the past few years, covering the product’s evolution, a six‑stage pipeline for building digital humans, solutions to key challenges in realism, multimodal interaction, and the open‑source MMTK algorithm library.

Digital Human · Emotion Modeling · Multimodal AI
0 likes · 12 min read
Test Development Learning Exchange
Oct 17, 2021 · Artificial Intelligence

Using pyttsx3 for Text-to-Speech in Python

This article provides a hands‑on guide to using the pyttsx3 library for offline text‑to‑speech conversion in Python, covering installation, basic playback, voice property adjustments, multilingual support, and conditional speech examples with counters.

Python · Speech synthesis · conditional speech
0 likes · 7 min read
Volcano Engine Developer Services
Sep 14, 2021 · Artificial Intelligence

How ByteDance’s AI Lab is Revolutionizing Intelligent Speech for Content Creation

ByteDance’s AI‑Lab leader Dr Yin Xiang discusses how the company’s intelligent speech technologies—spanning voice synthesis, recognition, and multimodal interaction—have been integrated across its global content platforms since 2017, boosting productivity in short videos, audiobooks, education, and more.

AI · ByteDance · Speech synthesis
0 likes · 13 min read
Kuaishou Tech
May 29, 2021 · Artificial Intelligence

Speaker-Aware Module for Single-Sample Voice Conversion (SAVC)

The paper presents a speaker‑aware module (SAM) that enables high‑quality voice conversion using only a single utterance of the target speaker, addressing the small‑data challenge in speech timbre transfer and achieving state‑of‑the‑art performance on the Aishell‑1 benchmark.

Deep Learning · LPCNet · Speech synthesis
0 likes · 12 min read
iQIYI Technical Product Team
Nov 20, 2020 · Artificial Intelligence

iQIYI M2VoC Multi‑Speaker Multi‑Style Voice Cloning Challenge (ICASSP 2021) Overview

The iQIYI M2VoC Challenge at ICASSP 2021 invites researchers to tackle low‑resource multi‑speaker, multi‑style voice cloning by providing Mandarin datasets, few‑shot and extremely few‑shot tracks with strict data rules, MOS‑based subjective evaluation, and a $9,600 prize pool for top submissions.

AI · Challenge · ICASSP
0 likes · 10 min read
The Dominant Programmer
Nov 17, 2020 · Mobile Development

Offline Android Text‑to‑Speech without Third‑Party SDKs

This guide shows how to create an offline Android app that converts any text to speech using the platform‑provided TextToSpeech class, covering UI layout with EditText and Button, a singleton SpeechUtils helper, language, pitch and rate configuration, and full code snippets for a working demo.

Android · Java · Mobile Development
0 likes · 5 min read
DataFunTalk
Jan 16, 2020 · Artificial Intelligence

Voice Conversion: Fundamentals, Methods, and iQIYI Applications

This article provides a comprehensive overview of voice conversion technology, covering its definition, parallel and non‑parallel data approaches, classic and deep‑learning methods such as DTW, GMM, seq2seq, PPG, VAE, Flow, GAN, and practical applications and challenges in iQIYI’s products.

ASR · Deep Learning · GAN
0 likes · 8 min read
Alibaba Cloud Developer
Jun 20, 2019 · Artificial Intelligence

Unlock Cutting-Edge Voice AI: Highlights from Alibaba’s Speech & Signal Processing eBook

This article introduces Alibaba's new e‑book collection of five ICASSP‑accepted papers that showcase advances in speech recognition, synthesis, and emotion detection, detailing novel models like DFSMN, A‑LSTM, and speaker‑adaptation techniques that dramatically improve speed, size, and accuracy.

AI voice · Deep Learning · Emotion Recognition
0 likes · 6 min read
Tencent Cloud Developer
Feb 26, 2019 · Artificial Intelligence

Tencent Cloud Intelligent Speech Technology: Development, Challenges and Practical Applications

Tencent Cloud's intelligent speech platform combines high‑accuracy ASR, advanced WaveNet‑based TTS, and solutions for noise, far‑field, and dialect challenges, enabling voice input, transcription, and customer‑service bots, with real‑world deployments in finance, museums, hotels, and other industry scenarios.

ASR · Human-Computer Interaction · Speech synthesis
0 likes · 8 min read
Ctrip Technology
Feb 21, 2019 · Artificial Intelligence

Speech Recognition and Synthesis: Principles, Challenges, Optimizations, and Tencent Cloud Use Cases

This article reviews the development roadmap, current industry status, challenges, typical deployment scenarios, and optimization methods for speech recognition (ASR) and speech synthesis (TTS), and shares several Tencent Cloud intelligent voice case studies to illustrate practical applications.

AI · Speech synthesis · cloud computing
0 likes · 9 min read
Alibaba Cloud Developer
Nov 27, 2018 · Artificial Intelligence

How Linear Networks Enable Speaker‑Adaptive Speech Synthesis with Minimal Data

This article presents a linear‑network‑based speaker‑adaptation method for text‑to‑speech that achieves synthesis quality comparable to large‑scale training using only a few hundred target‑speaker utterances, and introduces a low‑rank‑plus‑diagonal compression to improve stability with scarce data.

Speech synthesis · acoustic modeling · artificial intelligence
0 likes · 9 min read
Tencent Cloud Developer
Oct 10, 2018 · Artificial Intelligence

What Are the Real Challenges and Future Trends in Intelligent Voice Technology?

This article examines the current landscape of intelligent voice technology—including speech recognition, synthesis, voiceprint identification, and acoustic event detection—highlighting technical hurdles, evaluation metrics, recent advances such as WaveNet, and a wide range of practical applications from mobile devices to smart hardware and enterprise solutions.

Audio Processing · Speech synthesis · Tencent Cloud
0 likes · 16 min read
iQIYI Technical Product Team
Sep 14, 2018 · Artificial Intelligence

AI RAP: End-to-End Speech Synthesis for Rap Generation Using Location‑Sensitive Attention and Inference Mask

AI RAP is an end‑to‑end AI service that lets users generate personalized rap with a single click by combining location‑sensitive attention and an inference mask to achieve perfect alignment, beat‑synchronous timing, multi‑character voice timbres, sub‑second synthesis, and a scalable architecture supporting millions of daily users.

AI · Attention Mechanism · Audio Processing
0 likes · 5 min read
Liulishuo Tech Team
Sep 3, 2017 · Artificial Intelligence

Report on Interspeech 2017 and SLaTE 2017: Highlights in Speech Recognition, Synthesis, and Speaker Technologies

The article reports on Liulishuo’s participation in Interspeech 2017 and the SLaTE 2017 workshop, summarizing key research papers on noise‑robust ASR, attention‑based models, TensorFlow training, modern TTS systems, speaker identification datasets, and includes a hiring announcement for AI engineers.

AI · Interspeech · Speech synthesis
0 likes · 7 min read
Baidu Tech Salon
Jul 29, 2014 · Artificial Intelligence

Baidu Speech Synthesis: Balancing Trade‑offs and Opening the Platform to Developers

Baidu’s speech synthesis system, developed since 2013 to give machines natural Chinese voices, tackles millions of tonal variations through phonetic compression and optimized acoustic models, balances trade‑offs in data and scalability, and offers a free open platform that lets developers integrate high‑quality text‑to‑speech into apps, advancing equal access to information.

Baidu · Developer Platform · HMM
0 likes · 6 min read