Tagged articles

Speech synthesis

51 articles

Jun 18, 2026 · Artificial Intelligence

8 Must‑Watch Open‑Source TTS Projects for 2026

This article reviews eight open‑source text‑to‑speech systems—from lightweight, CPU‑only models to multilingual, podcast‑focused engines—detailing their architectures, language coverage, benchmark scores, licensing, and practical use‑case recommendations.

AI · Speech synthesis · Text‑to‑Speech

15 min read

Weekly Large Model Application

Jun 16, 2026 · Artificial Intelligence

Building an Open‑Source TTS Evaluation Framework with ZipVoice, OmniVoice & Latest Benchmarks

This guide explains why TTS evaluation requires a three‑metric “iron triangle” (WER/CER, speaker similarity, and naturalness), introduces community benchmarks such as Seed‑TTS‑eval, TTSDS2, TTS Arena and TTSD‑eval, and provides a concrete six‑stage pipeline and best‑practice checklist for reproducible, production‑ready assessment.
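
Two legs of the "iron triangle" are simple to compute once you have an ASR transcript of the synthesized audio and speaker embeddings for the prompt and the output. The sketch below assumes both are already available. The function names are illustrative and are not taken from Seed‑TTS‑eval or any other benchmark. WER/CER is the edit distance divided by the reference length, and speaker similarity is the cosine similarity of two embedding vectors.

```python
from math import sqrt
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (costs 0 on a match)
            )
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated words (reference must be non-empty)."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate, the usual choice for Chinese, which has no word spaces."""
    ref_chars = list(ref.replace(" ", ""))
    return edit_distance(ref_chars, list(hyp.replace(" ", ""))) / len(ref_chars)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Speaker similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))
```

In a real pipeline, apply the same text normalization (case, punctuation) to both the reference and the hypothesis before scoring. Otherwise, formatting differences are counted as recognition errors.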

CI Pipeline · Open-source benchmarks · Seed-TTS-eval

11 min read

Sohu Tech Products

May 13, 2026 · Artificial Intelligence

Three Simple Steps to Make AI‑Cloned Voices Sound Truly Like You

The article reveals that 80% of AI voice‑cloning failures stem from poor recording quality, analyzes three fatal sample defects—noise pollution, high‑frequency loss, and invalid segments—and proposes a three‑step “Extract → Enhance → Select” pipeline using BS‑RoFormer, DeepFilterNet3 and NISQA, boosting similarity from 68% to 89%.

AI · Speech synthesis · Voice Cloning

16 min read

Machine Heart

Apr 23, 2026 · Artificial Intelligence

UniLS: End-to-End Audio-Driven Framework Eliminates the ‘Poker Face’ in Digital Human Dialogue

UniLS, the first end‑to‑end audio‑driven framework that jointly generates speaking and listening facial motions for digital humans, achieves state‑of‑the‑art speaking accuracy, improves listening naturalness by 44.1%, and runs at over 500 FPS, as demonstrated in its CVPR 2026‑accepted paper through extensive quantitative evaluations and user studies.

CVPR 2026 · Real-time AI · Speech synthesis

9 min read

AI Open-Source Efficiency Guide

Apr 6, 2026 · Artificial Intelligence

VibeVoice vs PersonaPlex vs OmniVoice: A Comprehensive Open‑Source AI Voice Comparison

This article provides a detailed side‑by‑side analysis of three open‑source speech AI projects—Microsoft's VibeVoice, NVIDIA's PersonaPlex, and Xiaomi's OmniVoice—covering their positioning, core models, technical highlights, multilingual support, performance metrics, licensing, and recommended use cases.

AI · Automatic Speech Recognition · Speech synthesis

15 min read

Weekly Large Model Application

Mar 23, 2026 · Artificial Intelligence

Inside Step‑Audio2: End‑to‑End Multimodal Audio LLM Architecture and Deployment

This article dissects Step‑Audio2, an industrial‑grade multimodal large language model that unifies speech understanding, translation, dialogue and audio generation in a single causal LM, detailing its inference pipeline, key implementation tricks, deployment modes, strengths, limitations, and suitable application scenarios.

Python · Speech synthesis · Step-Audio2

10 min read

AI Explorer

Mar 19, 2026 · Artificial Intelligence

Unveiling Hunter Alpha: Xiaomi’s MiMo‑V2‑Pro and Two New Models Revealed

After a week of anonymous dominance on OpenRouter, Xiaomi revealed that the top‑ranking Hunter Alpha and Healer Alpha models are its MiMo‑V2‑Pro and MiMo‑V2‑Omni, respectively, and introduced the MiMo‑V2‑TTS voice model. The article details their parameter counts, benchmark scores, pricing, and multimodal capabilities, as well as the clever blind‑test launch strategy.

AI Agent · MiMo-V2 · Multimodal

11 min read

Xiaomi Tech

Mar 18, 2026 · Artificial Intelligence

Xiaomi Unveils MiMo-V2-TTS: Giving Agents a Voice with Soul

Xiaomi introduces MiMo-V2-TTS, a self‑developed speech‑synthesis large model that combines a custom audio tokenizer, multi‑codebook architecture, massive pre‑training on over a hundred million hours of data and multi‑dimensional reinforcement learning to deliver fine‑grained style control, dialect support, role‑play and high‑quality singing, aiming to give AI agents expressive, human‑like voices.

Speech synthesis · audio tokenizer · large model

6 min read

Weekly Large Model Application

Feb 22, 2026 · Artificial Intelligence

2026 Guide: Pure‑CPU Open‑Source Chinese TTS Models Optimized for Performance

This article reviews the most capable open‑source Chinese text‑to‑speech models that run entirely on CPU in 2026, compares their quantization and speed features, recommends acceleration engines, outlines five hard‑won optimization rules, and provides a concise selection guide for various deployment scenarios.
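
Quantization is one of the main levers for CPU‑only inference. The sketch below shows what int8 weight quantization does using a generic symmetric per‑tensor scheme. This is an illustrative assumption, not necessarily the scheme used by any model in the article. Toolchains such as ONNX Runtime provide their own quantization modes, and their details differ.

```python
from typing import List, Sequence, Tuple


def quantize_int8(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor int8 quantization: w ≈ scale * q, with q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # all-zero tensor: any scale works
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: Sequence[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float weights."""
    return [v * scale for v in q]
```

Each weight is stored in one byte instead of four (float32), so weight memory shrinks roughly 4×. Each reconstructed weight differs from the original by at most half a quantization step (scale / 2).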

CPU inference · Chinese TTS · ONNX Runtime

6 min read

Old Zhang's AI Learning

Feb 7, 2026 · Artificial Intelligence

Zero‑Shot Voice Cloning with Emotion and Duration Control: IndexTTS‑2 Runs Locally

IndexTTS‑2, an open‑source zero‑shot TTS system from Bilibili, enables precise duration control, disentangles emotion from speaker timbre, and supports bilingual synthesis. It offers a modern uv‑based installation, GPU‑accelerated inference, and benchmark‑leading WER and emotional‑similarity scores compared with contemporary models.

AI · IndexTTS-2 · Speech synthesis

10 min read

DataFunSummit

Sep 7, 2025 · Artificial Intelligence

How NIO Cut Radio Production Costs by 80% with AI Voice Cloning

This article details NIO's AI‑driven voice‑cloning solution for its in‑car NIO Radio, explaining the business background, pain points of traditional production, the TTS‑VC framework and modular workflow, evaluation metrics, and the resulting cost savings, efficiency gains, and scalability across dozens of cities.

AI · Automotive · Speech synthesis

10 min read

Bilibili Tech

Aug 12, 2025 · Artificial Intelligence

How AI Recreates Original Voices in Multilingual Video Dubbing

This article explains the technical challenges and innovative AI solutions behind preserving speaker identity, emotion, and timing while translating video content into multiple languages, covering speech generation modeling, speaker segmentation, adversarial reinforcement learning, proper‑noun adaptation, and audio‑visual alignment techniques.

AI voice cloning · Speech synthesis · audio-visual alignment

22 min read

Bilibili Tech

Aug 5, 2025 · Artificial Intelligence

How Bilibili’s IndexTTS2 Achieves Real‑Time, Emotion‑Rich Voice Translation

IndexTTS2 introduces a cross‑modal, multi‑language voice translation system that preserves speaker identity, acoustic space, and multi‑source timbre, while tackling challenges like voice personality loss, subtitle cognitive load, localization costs, multi‑speaker diarization, and cultural adaptation through novel time‑coding, adversarial RL, and diffusion‑based lip‑sync techniques.

Multimodal AI · Speech synthesis · adversarial reinforcement learning

20 min read

Cognitive Technology Team

Jul 1, 2025 · Artificial Intelligence

How We Built a Live‑Streaming TTS Engine: From Data Pipelines to AI Voice Generation

This article presents a comprehensive practice summary of building an intelligent digital‑human system, covering six core modules—LLM content generation, LLM interaction, TTS synthesis, visual driving, audio‑video engineering, and backend services—while detailing data collection, signal processing, ASR annotation, speaker clustering, model optimization (V1‑V4), evaluation metrics, and future research directions.
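
Speaker clustering groups collected audio segments by voice before training. The sketch below is a simple greedy, threshold‑based variant: each segment embedding joins the most similar running centroid or starts a new cluster. The 0.75 threshold and the function names are hypothetical, and production pipelines often use agglomerative or spectral clustering instead.

```python
from math import sqrt
from typing import List, Sequence

Vector = List[float]


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def cluster_speakers(embeddings: Sequence[Sequence[float]], threshold: float = 0.75) -> List[int]:
    """Return a cluster label per segment (hypothetical greedy online clustering)."""
    centroids: List[Vector] = []
    counts: List[int] = []
    labels: List[int] = []
    for emb in embeddings:
        best, best_sim = -1, threshold
        for k, centroid in enumerate(centroids):
            sim = _cosine(emb, centroid)
            if sim >= best_sim:  # most similar centroid that clears the threshold
                best, best_sim = k, sim
        if best == -1:  # nothing similar enough: open a new speaker cluster
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:  # update the running mean of the chosen cluster
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1) for c, e in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
    return labels
```

Because the algorithm is greedy, its result depends on segment order. An offline method that compares all segments at once is usually more robust on large corpora.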

AI voice · Audio Processing · LLM

23 min read

DaTaobao Tech

Jun 27, 2025 · Artificial Intelligence

Building a High‑Quality Live‑Streaming Digital Human: TTS Pipeline, Data Processing, and Model Optimizations

This article details the end‑to‑end workflow for creating intelligent digital humans for live streaming, covering large‑language‑model‑driven content generation, multi‑stage TTS architecture, extensive audio‑signal processing, speaker clustering, front‑end text normalization, back‑end acoustic modeling, and quantitative evaluation of model improvements.

AI · Live Streaming · Speech synthesis

22 min read

Python Programming Learning Circle

Mar 20, 2025 · Artificial Intelligence

Building a Python Voice Synthesis System Using Xunfei WebAPI

This tutorial explains how to create a Python-based speech synthesis tool by installing required packages, configuring Xunfei Open Platform credentials, implementing a Tkinter GUI, and using WebSocket communication to convert text into audio with selectable voice profiles.

GUI · Speech synthesis · Text‑to‑Speech

8 min read

Full-Stack DevOps & Kubernetes

Jul 29, 2024 · Artificial Intelligence

How to Run Real‑Time Voice Cloning with Python: A Step‑by‑Step Guide

This guide introduces the open‑source Realtime Voice Cloning project, explains its key features, and provides detailed installation and usage instructions—including environment setup, dependency installation, cloning the repository, and running the demo tool—to enable real‑time voice transformation with Python.

AI · Python · Real-time Audio

5 min read

Tencent Cloud Developer

Jun 14, 2024 · Artificial Intelligence

GPT-4o Speech Multimodal Technology: Speech Tokenization, LLM Integration, and Zero-shot TTS

GPT‑4o’s speech multimodal system discretizes audio into semantic and acoustic tokens, integrates these tokens with large language models through multi‑stage instruction tuning, and employs hierarchical zero‑shot text‑to‑speech decoding, enabling low‑latency, streaming, and prompt‑driven voice synthesis for applications like gaming.

AudioLM · GPT-4o · LLM integration

33 min read

Spring Full-Stack Practical Cases

Apr 29, 2024 · Artificial Intelligence

Build AI-Powered Spring Boot Apps with Alibaba Tongyi: A Hands‑On Guide

This tutorial walks through setting up Spring AI 0.8.1 with Spring Boot 3.1.1, configuring Alibaba Tongyi model access, and implementing chat, streaming, image, and audio generation endpoints using Java code and vector database integrations.

Alibaba AI · Chat · Java

9 min read

php Courses

Sep 1, 2023 · Artificial Intelligence

Integrating Baidu Text-to-Speech API with PHP

This tutorial demonstrates how to obtain Baidu TTS credentials, construct the required signature, send an HTTP request using PHP's cURL library, and save the returned audio data as an MP3 file, providing a complete code example for developers.

API integration · Baidu TTS · PHP

5 min read

58 Tech

Aug 25, 2023 · Artificial Intelligence

Voice Cloning Technology in AI Sales Assistant

This article introduces the AI sales assistant from 58.com, detailing its background, a few‑shot voice cloning approach using real dialogue data, multi‑accent naturalness optimization, deployment architecture, and future plans, while evaluating performance metrics and discussing challenges in speech synthesis quality and stability.

AI sales assistant · Speech synthesis · Text‑to‑Speech

19 min read

DataFunSummit

Aug 15, 2023 · Artificial Intelligence

AI Sales Assistant: Few‑Shot Voice Cloning and Multi‑Accent Naturalness Optimization

The article presents 58 Tongcheng AI Lab's AI sales assistant, detailing its background, a few‑shot voice‑cloning pipeline built on real dialogue data, data preprocessing, FastSpeech2‑based acoustic modeling, multi‑accent style transfer, deployment architecture, controllable synthesis parameters, and future research directions.
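
In FastSpeech‑family acoustic models, one common control knob is the length regulator. It expands phoneme‑level hidden states to frame level according to the predicted durations. The original FastSpeech paper adjusts speaking rate by scaling those durations by a factor α. The sketch below assumes durations measured in frames, and its names are illustrative. It is not 58.com's implementation.

```python
from typing import List, Sequence


def length_regulate(
    phoneme_states: Sequence[Sequence[float]],
    durations: Sequence[float],
    alpha: float = 1.0,
) -> List[List[float]]:
    """Repeat each phoneme's hidden state round(duration * alpha) times.

    alpha < 1 shortens durations (faster speech); alpha > 1 lengthens them (slower).
    """
    frames: List[List[float]] = []
    for state, dur in zip(phoneme_states, durations):
        n_frames = max(0, round(dur * alpha))
        frames.extend(list(state) for _ in range(n_frames))  # independent copies per frame
    return frames
```

For example, with durations of 2 and 3 frames and alpha = 0.5, the two phonemes are expanded to 1 and 2 frames. Python's round uses banker's rounding, so 1.5 rounds to 2.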

AI sales assistant · FastSpeech2 · Speech synthesis

20 min read

Programmer DD

Jun 20, 2023 · Artificial Intelligence

Yann LeCun: Today's AI Still Below Dog Level – Inside Meta’s Voicebox, MusicGen & I‑JEPA

Meta’s chief AI scientist Yann LeCun warned that current large language models still fall short of human and even dog intelligence, citing their lack of real‑world understanding, while Meta unveiled three new generative AI models—Voicebox for speech, MusicGen for music, and I‑JEPA for image reasoning—showcasing both progress and remaining limitations.

Generative AI · Speech synthesis · artificial-intelligence

7 min read

Tencent Cloud Developer

Apr 4, 2023 · Artificial Intelligence

Step-by-Step Guide to Building Your Own Realistic AI Image Generation Website with Stable Diffusion

This step‑by‑step tutorial shows how to set up a Stable Diffusion web UI, install the required Python environment and GPU‑enabled PyTorch, add Chinese localization and optional LoRA or Deforum extensions, generate realistic human images, create animated videos, and add speech with D‑ID, all ready for deployment on your own AI website.

Deforum · Git · Python

9 min read

Volcano Engine Developer Services

Feb 14, 2023 · Artificial Intelligence

How Make-An-Audio Turns Text Into Realistic Sound Effects

Make-An-Audio, a collaborative text‑to‑audio model from Zhejiang University, Peking University and Volcano Speech, uses a Distill‑then‑Reprogram strategy to generate high‑quality, controllable sound effects from any modality, showcasing impressive demos and promising future AIGC applications.

AIGC · Speech synthesis · Text-to-Audio

7 min read

DataFunSummit

Dec 9, 2022 · Artificial Intelligence

Volcano Engine Virtual Digital Human Technology Overview

This article provides a comprehensive overview of Volcano Engine's virtual digital human platform, detailing its definition, AI‑driven and human‑driven classifications, 2D and 3D technical architectures, multi‑modal perception, interaction capabilities, application scenarios, and future development directions.

2D avatar · 3D avatar · Multimodal AI

15 min read

iQIYI Technical Product Team

Aug 26, 2022 · Artificial Intelligence

IQDubbing: AI-Powered Multi-Language, Multi-Voice Dubbing System for Film and TV

iQIYI’s IQDubbing system leverages AI‑driven voice conversion to automatically generate high‑quality, expressive dubbing in dozens of languages and over 50 character voice styles, streamlining multilingual film and TV localization, reducing reliance on scarce actors, and earning positive audience feedback, patents and industry awards.

AI dubbing · Film Production · Speech synthesis

13 min read

Xiaohongshu Tech REDtech

Aug 10, 2022 · Artificial Intelligence

Multi-Stage Multi-Codebook VQ-VAE for High-Performance Neural Text-to-Speech (MSMC‑TTS)

The MSMC‑TTS system, a multi‑stage multi‑codebook VQ‑VAE based neural text‑to‑speech solution, delivers near‑human audio quality (MOS 4.41) with a compact 3.12 MB acoustic model, substantially surpassing Mel‑Spectrogram FastSpeech baselines in naturalness and efficiency.
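
The multi‑codebook idea can be illustrated with a generic split vector quantizer. The feature vector is cut into sub‑vectors, and each sub‑vector is matched against its own small codebook. K codebooks of size M can therefore index M^K combinations while storing only K·M codewords. This is a simplified illustration under those assumptions. MSMC‑TTS's multi‑stage structure and codebook training are not reproduced here.

```python
from typing import List, Sequence

Codebook = Sequence[Sequence[float]]


def _nearest(vec: Sequence[float], codebook: Codebook) -> int:
    """Index of the codeword with the smallest squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def multi_codebook_quantize(vec: Sequence[float], codebooks: List[Codebook]) -> List[int]:
    """Split vec into equal sub-vectors and quantize each with its own codebook."""
    sub_dim = len(vec) // len(codebooks)
    return [
        _nearest(vec[i * sub_dim:(i + 1) * sub_dim], cb)
        for i, cb in enumerate(codebooks)
    ]


def dequantize(indices: List[int], codebooks: List[Codebook]) -> List[float]:
    """Concatenate the selected codewords back into a full-length vector."""
    out: List[float] = []
    for idx, cb in zip(indices, codebooks):
        out.extend(cb[idx])
    return out
```

The acoustic model then only has to predict a short list of discrete indices per frame rather than a full continuous spectrum. That is the kind of compact representation the summary refers to.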

Compact Representation · Multi-Stage Modeling · Speech synthesis

10 min read

NetEase Smart Enterprise Tech+

Jun 14, 2022 · Artificial Intelligence

How Outbound Call Robots Work: Challenges and Optimizations in Voice Dialogue Systems

This article explains the architecture of outbound call robots, classifies dialogue system types, details pipeline and end‑to‑end task‑oriented designs, highlights technical challenges such as dialects and transcription errors, and presents optimization techniques like ASR correction and script improvement.

ASR correction · NLU · Speech synthesis

12 min read

Zuoyebang Tech Team

May 19, 2022 · Artificial Intelligence

How to Achieve High‑Quality TTS with Only Minutes of Data

This article reviews neural speech synthesis, explains why large, high‑quality paired datasets are essential, and presents a range of low‑resource solutions, including semi‑supervised pre‑training, cross‑language transfer, speaker embedding, and Conformer‑based model upgrades. It shows how the Zuoyebang team built a robust TTS system from as little as seven minutes of recordings per speaker.

Conformer · FastSpeech2 · Speech synthesis

15 min read

DataFunSummit

Apr 14, 2022 · Artificial Intelligence

Advances in Alibaba's Digital Human Technology: Construction, Performance, Interaction, and the MMTK Multimodal Algorithm Library

This article reviews Alibaba's digital‑human (virtual avatar) research over the past few years, covering the product’s evolution, a six‑stage pipeline for building digital humans, solutions to key challenges in realism, multimodal interaction, and the open‑source MMTK algorithm library.

Emotion Modeling · Multimodal AI · Speech synthesis

12 min read

Python Programming Learning Circle

Apr 4, 2022 · Artificial Intelligence

Building a Simple Speech Synthesis System with iFlytek WebAPI in Python

This tutorial explains how to create a lightweight speech synthesis tool using iFlytek's WebAPI, covering required environment setup, API credential acquisition, GUI design with Tkinter, and detailed Python code for WebSocket communication, audio handling, and WAV file generation.
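
The final step, saving the synthesized audio as a WAV file, needs only Python's standard library. The sketch below assumes the service returns raw 16 kHz, 16‑bit, mono PCM; check the audio format parameters you actually request. It also assumes the chunks have already been decoded from the WebSocket messages and concatenated. Authentication and message parsing are omitted, and the function name is illustrative.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV (RIFF) container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)  # bytes per sample: 2 means 16-bit audio
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()
```

Typical use: `open("out.wav", "wb").write(pcm_to_wav(b"".join(chunks)))`, where `chunks` is a hypothetical list of the decoded audio chunks received in order.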

Audio Processing · Python · Speech synthesis

8 min read

Test Development Learning Exchange

Oct 17, 2021 · Artificial Intelligence

Using pyttsx3 for Text-to-Speech in Python

This article provides a hands‑on guide to using the pyttsx3 library for offline text‑to‑speech conversion in Python, covering installation, basic playback, voice property adjustments, multilingual support, and conditional speech examples with counters.

Python · Speech synthesis · Text‑to‑Speech

7 min read

Volcano Engine Developer Services

Sep 14, 2021 · Artificial Intelligence

How ByteDance’s AI Lab is Revolutionizing Intelligent Speech for Content Creation

ByteDance’s AI‑Lab leader Dr Yin Xiang discusses how the company’s intelligent speech technologies—spanning voice synthesis, recognition, and multimodal interaction—have been integrated across its global content platforms since 2017, boosting productivity in short videos, audiobooks, education, and more.

AI · ByteDance · Speech synthesis

13 min read

Kuaishou Tech

May 29, 2021 · Artificial Intelligence

Speaker-Aware Module for Single-Sample Voice Conversion (SAVC)

The paper presents a speaker‑aware module (SAM) that enables high‑quality voice conversion using only a single utterance of the target speaker, addressing the small‑data challenge in speech timbre transfer and achieving state‑of‑the‑art performance on the Aishell‑1 benchmark.

LPCNet · Speech synthesis · deep learning

12 min read

Python Crawling & Data Mining

Nov 28, 2020 · Artificial Intelligence

How to Convert Text to Speech in Python with 5 Powerful TTS Libraries

This guide walks you through installing, configuring, and using five Python text‑to‑speech libraries—gTTS, Baidu AIP, pyttsx3, pywin32, and speech—to generate personalized audio files, adjust voice properties, and automate playback.

Python · Speech synthesis · Text‑to‑Speech

5 min read

The Dominant Programmer

Nov 27, 2020 · Backend Development

Using Jacob in Java for Windows Speech Synthesis and Audio File Generation

This guide walks through downloading Jacob's DLL and JAR, configuring the Java environment, setting up an Eclipse project, and writing Java code that leverages the SAPI COM interfaces to synthesize Chinese text into a WAV file on Windows, complete with step‑by‑step screenshots and a full source example.

COM · Jacob · Java

5 min read

iQIYI Technical Product Team

Nov 20, 2020 · Artificial Intelligence

iQIYI M2VoC Multi‑Speaker Multi‑Style Voice Cloning Challenge (ICASSP 2021) Overview

The iQIYI M2VoC Challenge at ICASSP 2021 invites researchers to tackle low‑resource multi‑speaker, multi‑style voice cloning by providing Mandarin datasets, few‑shot and extremely few‑shot tracks with strict data rules, MOS‑based subjective evaluation, and a $9,600 prize pool for top submissions.
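
MOS results are usually reported with a confidence interval. The sketch below assumes 1 to 5 ratings pooled across listeners and uses a normal approximation. It is an illustrative assumption; challenge organizers may follow a different statistical protocol.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos_with_ci(scores: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean opinion score, half-width of a normal-approximation 95% CI)."""
    m = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))  # sample std dev / sqrt(n)
    return m, half_width
```

With only a handful of ratings, a Student's t critical value is more appropriate than the fixed 1.96.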

AI · ICASSP · Speech synthesis

10 min read

The Dominant Programmer

Nov 17, 2020 · Mobile Development

Offline Android Text‑to‑Speech without Third‑Party SDKs

This guide shows how to create an offline Android app that converts any text to speech using the platform‑provided TextToSpeech class, covering UI layout with EditText and Button, a singleton SpeechUtils helper, language, pitch and rate configuration, and full code snippets for a working demo.

Android · Java · Mobile Development

5 min read

DataFunTalk

Mar 10, 2020 · Artificial Intelligence

Interspeech 2019 Highlights: End‑to‑End Speech AI Technologies and Key Paper Summaries

The article reviews Interspeech 2019, summarizing major trends and representative papers in end‑to‑end speech recognition, synthesis, natural language understanding, speaker recognition, and speech translation, while also highlighting best student papers and providing resources for further study.

AI · Interspeech 2019 · Natural Language Understanding

24 min read

DataFunTalk

Jan 16, 2020 · Artificial Intelligence

Voice Conversion: Fundamentals, Methods, and iQIYI Applications

This article provides a comprehensive overview of voice conversion technology, covering its definition, parallel and non‑parallel data approaches, classic and deep‑learning methods such as DTW, GMM, seq2seq, PPG, VAE, Flow, GAN, and practical applications and challenges in iQIYI’s products.
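
With parallel data, source and target recordings of the same sentence must first be aligned frame by frame before a conversion model can learn a mapping between them. Dynamic time warping (DTW) performs that alignment. The sketch below works on generic per‑frame feature vectors; the choice of features, such as spectral features, is an assumption here.

```python
from typing import List, Sequence, Tuple


def dtw_align(
    x: Sequence[Sequence[float]], y: Sequence[Sequence[float]]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (total alignment cost, warping path of (x_index, y_index) pairs)."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between frame i-1 of x and frame j-1 of y.
            d = sum((a - b) ** 2 for a, b in zip(x[i - 1], y[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, preferring the diagonal step on ties.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = {
            (i - 1, j - 1): cost[i - 1][j - 1],
            (i - 1, j): cost[i - 1][j],
            (i, j - 1): cost[i][j - 1],
        }
        i, j = min(steps, key=steps.get)
    path.reverse()
    return cost[n][m], path
```

The resulting path gives paired source and target frames. Classic GMM‑based conversion trains on such pairs.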

ASR · GAN · Speech synthesis

8 min read

Alibaba Cloud Developer

Jun 20, 2019 · Artificial Intelligence

Unlock Cutting-Edge Voice AI: Highlights from Alibaba’s Speech & Signal Processing eBook

This article introduces Alibaba's new e‑book collection of five ICASSP‑accepted papers that showcase advances in speech recognition, synthesis, and emotion detection, detailing novel models like DFSMN, A‑LSTM, and speaker‑adaptation techniques that dramatically improve speed, size, and accuracy.

AI voice · Emotion Recognition · ICASSP

6 min read

Tencent Cloud Developer

Feb 26, 2019 · Artificial Intelligence

Tencent Cloud Intelligent Speech Technology: Development, Challenges and Practical Applications

Tencent Cloud's intelligent speech platform combines high‑accuracy ASR, advanced WaveNet‑based TTS, and solutions for noise, far‑field, and dialect challenges, enabling voice input, transcription, and customer‑service bots, with real‑world deployments in finance, museums, hotels, and other industry scenarios.

ASR · Human-Computer Interaction · Speech synthesis

8 min read

Ctrip Technology

Feb 21, 2019 · Artificial Intelligence

Speech Recognition and Synthesis: Principles, Challenges, Optimizations, and Tencent Cloud Use Cases

This article reviews the development roadmap, current industry status, challenges, typical deployment scenarios, and optimization methods for speech recognition (ASR) and speech synthesis (TTS), and shares several Tencent Cloud intelligent voice case studies to illustrate practical applications.

AI · Cloud Computing · Speech synthesis

9 min read

Alibaba Cloud Developer

Nov 27, 2018 · Artificial Intelligence

How Linear Networks Enable Speaker‑Adaptive Speech Synthesis with Minimal Data

This article presents a linear‑network‑based speaker‑adaptation method for text‑to‑speech that achieves synthesis quality comparable to large‑scale training using only a few hundred target‑speaker utterances, and introduces a low‑rank‑plus‑diagonal compression to improve stability with scarce data.
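
The low‑rank‑plus‑diagonal compression can be read as replacing a full n×n adaptation matrix W with W = diag(d) + U·Vᵀ, where U and V are n×r with r much smaller than n. This reduces the per‑speaker parameters from n² to n + 2nr. The sketch below assumes the adaptation is a square linear layer applied to hidden activations; the symbol names are ours, not the article's.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def low_rank_plus_diag(d: Sequence[float], u: Matrix, v: Matrix, x: Sequence[float]) -> List[float]:
    """Compute (diag(d) + U V^T) x without materializing the n x n matrix.

    d and x have length n; u and v are n x r matrices given as lists of rows.
    """
    n, r = len(x), len(u[0])
    vt_x = [sum(v[i][k] * x[i] for i in range(n)) for k in range(r)]  # V^T x, length r
    return [d[i] * x[i] + sum(u[i][k] * vt_x[k] for k in range(r)) for i in range(n)]


def param_counts(n: int, r: int) -> Tuple[int, int]:
    """Parameters of a full n x n layer vs. diagonal plus rank-r factors."""
    return n * n, n + 2 * n * r
```

For a hypothetical hidden size of 512 with rank 8, `param_counts(512, 8)` gives 262,144 versus 8,704 parameters. Far fewer parameters have to be estimated from a few hundred adaptation utterances.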

Speech synthesis · acoustic modeling · artificial-intelligence

9 min read

Alibaba Cloud Developer

Nov 1, 2018 · Artificial Intelligence

How DFSMN Cuts Speech Synthesis Model Size by 75% and Quadruples Speed

Researchers propose a Deep Feedforward Sequential Memory Network (DFSMN) for speech synthesis that matches BLSTM quality while using only a quarter of the model size and achieving four times faster inference, making it ideal for memory‑constrained, real‑time edge devices.
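
The memory block in FSMN‑family models replaces recurrence with learnable, element‑wise taps over neighboring frames. This sketch roughly follows the published DFSMN formulation:

p̃_t = skip_t + p_t + Σ_{i=0..N1} a_i ⊙ p_{t−s1·i} + Σ_{j=1..N2} c_j ⊙ p_{t+s2·j}

Here skip_t is the output of the previous memory block. Frames outside the sequence are treated as zero; that padding choice and the parameter names are assumptions.

```python
from typing import List, Optional, Sequence

Frames = List[List[float]]


def fsmn_memory_block(
    p: Frames,
    a: Sequence[Sequence[float]],   # look-back taps a_0..a_N1 (a_0 weights the current frame)
    c: Sequence[Sequence[float]],   # look-ahead taps c_1..c_N2
    s1: int = 1,                    # look-back stride
    s2: int = 1,                    # look-ahead stride
    skip: Optional[Frames] = None,  # previous memory block's output (DFSMN skip connection)
) -> Frames:
    """Compute one memory-block output per frame; out-of-range frames contribute nothing."""
    T = len(p)
    out: Frames = []
    for t in range(T):
        m = list(p[t])  # identity term p_t
        for i, a_i in enumerate(a):
            k = t - s1 * i
            if k >= 0:
                m = [mv + w * pv for mv, w, pv in zip(m, a_i, p[k])]
        for j, c_j in enumerate(c, start=1):
            k = t + s2 * j
            if k < T:
                m = [mv + w * pv for mv, w, pv in zip(m, c_j, p[k])]
        if skip is not None:
            m = [mv + sv for mv, sv in zip(m, skip[t])]
        out.append(m)
    return out
```

Each output frame depends only on a fixed window of inputs, so frames can be computed in parallel, unlike a BLSTM. The look‑ahead order N2 (times the stride s2) bounds how many future frames the block must wait for.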

DFSMN · Speech synthesis · deep learning

10 min read

Alibaba Cloud Developer

Oct 23, 2018 · Artificial Intelligence

How DFSMN Cuts Speech Synthesis Model Size by 75% While Quadrupling Speed

This paper introduces a Deep Feedforward Sequential Memory Network (DFSMN) for statistical parametric speech synthesis that matches BLSTM quality with only a quarter of the model size and four times faster inference, making it ideal for memory‑constrained, real‑time IoT devices.

DFSMN · IoT devices · Real-time inference

10 min read

Tencent Cloud Developer

Oct 10, 2018 · Artificial Intelligence

What Are the Real Challenges and Future Trends in Intelligent Voice Technology?

This article examines the current landscape of intelligent voice technology—including speech recognition, synthesis, voiceprint identification, and acoustic event detection—highlighting technical hurdles, evaluation metrics, recent advances such as WaveNet, and a wide range of practical applications from mobile devices to smart hardware and enterprise solutions.

Audio Processing · Speech synthesis · Tencent Cloud

16 min read

iQIYI Technical Product Team

Sep 14, 2018 · Artificial Intelligence

AI RAP: End-to-End Speech Synthesis for Rap Generation Using Location‑Sensitive Attention and Inference Mask

AI RAP is an end‑to‑end AI service that lets users generate personalized rap with a single click by combining location‑sensitive attention and an inference mask to achieve perfect alignment, beat‑synchronous timing, multi‑character voice timbres, sub‑second synthesis, and a scalable architecture supporting millions of daily users.

AI · Attention Mechanism · Audio Processing

5 min read

Liulishuo Tech Team

Sep 3, 2017 · Artificial Intelligence

Report on Interspeech 2017 and SLaTE 2017: Highlights in Speech Recognition, Synthesis, and Speaker Technologies

The article reports on Liulishuo’s participation in Interspeech 2017 and the SLaTE 2017 workshop, summarizing key research papers on noise‑robust ASR, attention‑based models, TensorFlow training, modern TTS systems, speaker identification datasets, and includes a hiring announcement for AI engineers.

AI · Interspeech · Speech synthesis

7 min read

Baidu Tech Salon

Jul 29, 2014 · Artificial Intelligence

Baidu Speech Synthesis: Balancing Trade‑offs and Opening the Platform to Developers

Baidu’s speech synthesis system, developed since 2013 to give machines natural Chinese voices, tackles millions of tonal variations through phonetic compression and optimized acoustic models, balances trade‑offs in data and scalability, and offers a free open platform that lets developers integrate high‑quality text‑to‑speech into apps, advancing equal access to information.

Baidu · Developer Platform · HMM

6 min read