Tagged articles

TTS

39 articles · Page 1 of 1

Jun 23, 2026 · Artificial Intelligence

Inside Artificial Analysis: Independent AI Voice Benchmarks for ASR, TTS, and Speech‑to‑Speech

Artificial Analysis provides an independent, reproducible benchmarking platform for voice AI, offering objective WER scores for ASR, Elo‑based blind‑listening scores for TTS, and three‑dimensional metrics for end‑to‑end speech dialogue, together with detailed methodology, top‑model rankings, and practical guidance for developers to choose the most suitable model and provider for their scenarios.

AI voice evaluation · ASR · Artificial Analysis

0 likes · 14 min read

AI Architecture Path

Jun 21, 2026 · Artificial Intelligence

How Abogen Generates 3,000‑Character Audio in 11 Seconds Offline – 4.8k‑Star GitHub TTS Tool

Abogen is an open‑source, fully offline TTS solution that eliminates cloud‑based costs and privacy risks, converts 3,000 characters to a 3‑minute‑28‑second audio file in just 11 seconds, and automatically produces word‑ or sentence‑level synchronized subtitles for e‑books and short‑video scripts.

Audiobook Generation · Cross-Platform · Kokoro Model

0 likes · 13 min read

JavaGuide

May 11, 2026 · Artificial Intelligence

Running Code Review and Voice Agents with Step Plan and Claude Code

The article walks through using Step Plan’s unified API to integrate Claude Code for automated code review and to build a voice‑agent pipeline that transcribes meeting recordings, generates structured summaries, and produces audio briefs, while discussing setup, costs, model selection, practical demos, and observed limitations.

AI Agent · ASR · Claude Code

0 likes · 24 min read

Xiaomi Tech

May 7, 2026 · Artificial Intelligence

OmniVoice: Open‑Source TTS Model Clones Voices in 600+ Languages with a Single Architecture

OmniVoice, an open‑source TTS system from Xiaomi AI Lab, uses a minimalist bidirectional Transformer and LLM‑enhanced pre‑training to synthesize high‑quality speech in over 600 languages, outperforming commercial systems while offering fine‑grained control and fully public code and models.

Multilingual speech synthesis · OmniVoice · TTS

0 likes · 8 min read

AI Explorer

Apr 12, 2026 · Backend Development

Generate Viral Reddit Videos with a Single Command Using RedditVideoMakerBot

This article introduces RedditVideoMakerBot, an open‑source Python tool that automates fetching hot Reddit posts, creating TTS narration, adding background media, and producing a final video file without manual editing, and provides setup instructions and future feature ideas.

GitHub · Python · Reddit

0 likes · 4 min read

AI Explorer

Apr 11, 2026 · Artificial Intelligence

VoxCPM2: Tokenizer‑Free Multilingual TTS that Creates New Voices from Text

VoxCPM2, an open‑source 2‑billion‑parameter TTS model from OpenBMB, eliminates tokenizers and uses a diffusion‑autoregressive architecture to generate high‑fidelity, controllable speech in 30 languages, supporting voice design from natural‑language prompts and high‑quality voice cloning with just a short reference clip.

AudioVAE · TTS · Voice Cloning

0 likes · 8 min read

Weekly Large Model Application

Mar 30, 2026 · Artificial Intelligence

Inside Kimi-Audio: A Unified Large Audio Model Covering ASR, AQA, TTS and More

Kimi-Audio, a general‑purpose audio foundation model from Moonshot AI, integrates ASR, audio QA, automatic audio captioning, emotion classification and end‑to‑end speech dialogue within a single framework, detailing its mixed‑audio input, MiMo‑Transformer core, efficient synthesis pipeline, architectural strengths, limitations, and suitable application scenarios.

ASR · Audio LLM · BigVGAN

0 likes · 9 min read

Weekly Large Model Application

Mar 17, 2026 · Artificial Intelligence

Essential Features Every Voice Interaction System Must Support

The article provides a comprehensive analysis of core voice interaction system capabilities—including barge‑in, turn‑taking, multi‑turn dialogue, intent recognition, speaker identification, streaming latency, noise robustness, multilingual support, emotion handling, personalization, security, and deployment considerations—highlighting typical scenarios such as smart speakers, in‑car assistants, call centers, and meeting transcription.

ASR · Latency · Multimodal

0 likes · 11 min read

Weekly Large Model Application

Mar 13, 2026 · Artificial Intelligence

Speech Large Models: Why End-to-End Architecture Beats Traditional ASR‑LLM‑TTS Pipelines

The article defines true speech large models as native end‑to‑end systems that directly map audio to audio, compares them with traditional cascade ASR‑LLM‑TTS pipelines across architecture, error control, latency, paralinguistic perception, long‑context handling and deployment, and surveys the leading open‑source and commercial speech LLMs released in March 2026 with a quick selection guide.

AI · ASR · End-to-End

0 likes · 11 min read

Weekly Large Model Application

Feb 20, 2026 · Artificial Intelligence

Intelligent Speech vs. Voice Agent: Key Differences and How They Relate

This article explains the technical distinction between intelligent speech (a toolbox of ASR, TTS, NLU, and NLG technologies) and Voice Agent, an end‑to‑end conversational system built on those tools and large‑model reasoning, illustrating their layered relationship, functional gaps, and typical use cases.

ASR · Dialogue Systems · Large Language Model

0 likes · 7 min read

AI Large Model Application Practice

Nov 24, 2025 · Artificial Intelligence

How to Turn Text into an AI‑Powered PPT Video: A Step‑by‑Step Guide

This article breaks down the end‑to‑end engineering pipeline that converts a knowledge source such as a URL or PDF into a narrated PPT‑style video, detailing six core stages—from knowledge extraction and script generation to image creation, voice synthesis, and final video stitching—while highlighting practical model choices, prompt design, and stability tricks.

LLM · Multimodal · PPT

0 likes · 16 min read

HyperAI Super Neural

Nov 4, 2025 · Artificial Intelligence

On‑Device TTS Breakthrough: NeuTTS‑Air Achieves 3‑Second Audio Cloning with a 0.5B Model

NeuTTS‑Air, an open‑source on‑device text‑to‑speech model built on a 0.5B Qwen LLM and NeuCodec, reaches SOTA among open models, runs entirely on CPU, supports 3‑second voice cloning, and comes with a step‑by‑step tutorial for deployment on edge devices.

NeuCodec · NeuTTS-Air · Qwen

0 likes · 5 min read

Programmer DD

Oct 24, 2025 · Backend Development

How to Seamlessly Integrate MiniMax & CosyVoice TTS into Spring Boot with UnifiedTTS

This guide walks you through building a Spring Boot application, registering a UnifiedTTS API key, configuring MiniMax or CosyVoice models, implementing the service layer, running unit tests, and handling production concerns to achieve high‑quality text‑to‑speech synthesis without changing client code.

CosyVoice · Java · MiniMax

0 likes · 11 min read

360 Zhihui Cloud Developer

Sep 16, 2025 · Artificial Intelligence

How AI Transforms Video Conferencing: From ASR to LLM-Powered Smart Meetings

This article explores how integrating ASR, TTS, and large language models into video conferencing creates an intelligent collaboration hub that boosts efficiency, enhances user experience, expands multilingual scenarios, and provides practical architecture and Python code examples for real‑time smart meetings.

AI · ASR · LLM

0 likes · 11 min read

Huolala Tech

Sep 10, 2025 · Artificial Intelligence

How AI Voice Humanization Cuts Call‑Center Costs: ASR, Smart Interrupt & TTS Deep Dive

This article examines how AI‑driven voice humanization—covering advanced ASR, intelligent interruption, and expressive TTS—addresses high labor costs, efficiency bottlenecks, and inconsistent service quality in inbound and outbound call‑center operations, presenting technical evaluations, optimization strategies, and future research directions.

AI voice · ASR · Humanization

0 likes · 13 min read

ShiZhen AI

Aug 14, 2025 · Artificial Intelligence

How to Auto‑Dub Multi‑Character Novels in Just 5 Minutes

This guide walks you through using the AI 易配音 long‑form audio tool to split a novel into chapters, assign distinct voice tones to each character, fine‑tune volume, speed and pitch, generate and batch‑process audio segments, and finally export the finished audio files.

AI voice synthesis · Automation · TTS

0 likes · 15 min read

DaTaobao Tech

Jul 4, 2025 · Artificial Intelligence

How Taobao Live’s AI Digital Humans Transform E‑Commerce: Architecture, Algorithms, and Engineering Insights

This article details the end‑to‑end design of Taobao Live's AI digital human system, covering six core components such as LLM‑driven content creation, interactive dialogue, TTS voice synthesis, visual synchronization, audio‑video engineering, and a scalable backend, while also discussing product evolution, automation challenges, and future roadmap.

AI · Automation · LLM

0 likes · 19 min read

DaTaobao Tech

Jul 2, 2025 · Artificial Intelligence

How AI Powers 24/7 Digital Human Live Streams: Architecture, Challenges, and Innovations

This article presents a comprehensive overview of the AI‑driven digital‑human live‑streaming solution used by Taobao, detailing six core components—including LLM‑based content generation and interaction, TTS, visual driving, audio‑video engineering, and backend services—while sharing architectural diagrams, cost‑reduction strategies, productization insights, and future directions.

AI · LLM · Live Streaming

0 likes · 8 min read

DaTaobao Tech

Jun 27, 2025 · Artificial Intelligence

Building a High‑Quality Live‑Streaming Digital Human: TTS Pipeline, Data Processing, and Model Optimizations

This article details the end‑to‑end workflow for creating intelligent digital humans for live streaming, covering large‑language‑model‑driven content generation, multi‑stage TTS architecture, extensive audio‑signal processing, speaker clustering, front‑end text normalization, back‑end acoustic modeling, and quantitative evaluation of model improvements.

AI · Live Streaming · Speech synthesis

0 likes · 22 min read

Amap Tech

May 27, 2025 · Artificial Intelligence

Gaode Map Custom Voice Pack: End‑to‑End TTS Model Architecture and Deployment

This article explains how Gaode Map leverages lightweight edge TTS models, dual‑autoregressive large‑model data augmentation, and a configurable audio‑processing DAG to enable users to create highly realistic personalized voice packs from just three recorded sentences.

Data Augmentation · Gaode Maps · TTS

0 likes · 8 min read

ShiZhen AI

May 13, 2025 · Artificial Intelligence

Top Free Text‑to‑Speech Tools for Content Creators

This article reviews five free text‑to‑speech solutions—AI易视频, Google TTS, Natural Reader, Balabolka, and Speech2Go—detailing their features, language support, installation needs, and unique capabilities to help creators choose the right tool for narration, translation, or multi‑character audio production.

AI · TTS · Text‑to‑Speech

0 likes · 7 min read

DaTaobao Tech

Mar 31, 2025 · Artificial Intelligence

AI Audio Generation and Voice Synthesis Practices at Taobao

The article surveys Taobao’s AI‑generated audio pipeline, detailing eight technical papers on image‑to‑video, OpenAI o1, multimodal video, and large‑model voice synthesis, while highlighting advances like VALL‑E, CosyVoice, F5‑TTS, data‑cleaning methods, and e‑commerce applications such as voice‑cloned live streams, multilingual TTS, AI video‑audio integration, and audiobook production.

AI audio · Large Language Model · TTS

0 likes · 11 min read

DataFunTalk

Jun 3, 2024 · Artificial Intelligence

Deploying Speech AI Services Quickly with NVIDIA Riva

This article explains how to use NVIDIA Riva to rapidly deploy speech AI services, covering Riva's overview, Chinese ASR model updates, TTS capabilities, customization options, the Quickstart tool, and a Q&A session that clarifies deployment, model fine‑tuning, and integration with NeMo and Triton.

ASR · GPU Acceleration · NVIDIA Riva

0 likes · 13 min read

AI Large Model Application Practice

Mar 22, 2024 · Artificial Intelligence

How to Build a Real‑Time AI‑Powered 3D Digital Human with Unreal Engine

This guide explains the architecture of an interactive digital‑human system, walks through 3D avatar creation with Unreal Engine, details the AI controller that combines ASR, LLM and TTS, and provides step‑by‑step instructions for deploying the open‑source Fay project.

AI Avatar · ASR · Fay

0 likes · 14 min read

DataFunTalk

Feb 13, 2024 · Artificial Intelligence

An Overview of NVIDIA NeMo: Open‑Source Framework for Speech AI, ASR, TTS, NLP and Large Language Model Training

This article introduces NVIDIA’s open‑source NeMo framework, detailing its PyTorch‑based architecture for Speech AI, ASR and TTS training, NLP and LLM support, GPU‑optimized parallelism, pre‑trained model resources, fine‑tuning techniques, and the accompanying NeMo Aligner and Framework tools.

ASR · NVIDIA NeMo · PyTorch

0 likes · 18 min read

DataFunTalk

Jan 26, 2024 · Artificial Intelligence

Efficient Deployment of Speech AI Models on GPUs

This article explains how to efficiently deploy speech AI models—including ASR and TTS—on GPUs using NVIDIA's Triton Inference Server and TensorRT, covering background challenges, GPU‑based solutions, decoding optimizations, Whisper acceleration with TensorRT‑LLM, streaming TTS improvements, voice‑cloning pipelines, future plans, and a Q&A session.

ASR · GPU · TTS

0 likes · 20 min read

HomeTech

Dec 6, 2023 · Artificial Intelligence

Metaverse-Based Virtual Humans: Technologies and Applications in Intelligent Q&A

This article explores the concept of the metaverse and virtual humans, detailing 3D modeling techniques, NLP-driven language understanding, streaming TTS, VR/AR interaction, AIGC content generation, and the deployment of a large‑model intelligent Q&A system with real‑time facial expression synthesis for virtual anchors.

3D modeling · AIGC · Metaverse

0 likes · 8 min read

DataFunSummit

Aug 15, 2023 · Artificial Intelligence

AI Sales Assistant: Few‑Shot Voice Cloning and Multi‑Accent Naturalness Optimization

The article presents 58 Tongcheng AI Lab's AI sales assistant, detailing its background, a few‑shot voice‑cloning pipeline built on real dialogue data, data preprocessing, FastSpeech2‑based acoustic modeling, multi‑accent style transfer, deployment architecture, controllable synthesis parameters, and future research directions.

AI sales assistant · Fastspeech2 · Speech synthesis

0 likes · 20 min read

DataFunSummit

Apr 18, 2023 · Artificial Intelligence

Best Practices for Deploying Speech AI on GPUs with Triton and TensorRT

This article presents comprehensive best‑practice guidelines for deploying conversational speech AI—including ASR and TTS pipelines—on GPU servers using NVIDIA Triton Inference Server and TensorRT, covering workflow overview, performance optimizations, streaming inference, and real‑world deployment tips.

ASR · Conversational AI · GPU deployment

0 likes · 14 min read

Meituan Technology Team

Mar 9, 2023 · Artificial Intelligence

Implementation and Practice of MRCP in Meituan Voice Interaction

This article details Meituan’s adoption of the Media Resource Control Protocol (MRCP) to standardize ASR and TTS integration, describing its architecture, key components, high‑availability deployment, and measured performance gains such as up to 55% latency reduction and a 15% increase in outbound call success rates.

ASR · MRCP · Meituan

0 likes · 24 min read

DataFunTalk

Jul 7, 2022 · Artificial Intelligence

Huawei Translation’s Achievements and Technical Solutions in IWSLT 2022 Speech Translation Tasks

This article reviews Huawei Translation’s top-ranking results in the IWSLT 2022 speech translation competition across speech‑to‑speech, offline speech‑to‑text, and length‑controlled translation tasks, and details their cascade and end‑to‑end technical approaches, including domain‑controlled ASR, context‑aware MT re‑ranking, and VITS‑based TTS.

ASR · End-to-End · Huawei

0 likes · 13 min read

Yuewen Technology

Oct 15, 2021 · Artificial Intelligence

How Yuedu's TTS Platform Automates High‑Quality Audiobook Production

This article explains how Yuedu's TTS synthesis platform tackles the booming audiobook market by using AI‑driven text preprocessing, role graph construction, content structuring, emotion and effect recognition, and a streamlined post‑processing workflow to efficiently generate multi‑character, emotionally rich audiobooks at scale.

Emotion Recognition · NLP · TTS

0 likes · 13 min read

58 Tech

Dec 28, 2020 · Backend Development

Implementation of SIP‑Based DTMF Signal Capture for Intelligent Voice Robots

This article explains how an intelligent voice robot leverages TTS and SIP to convert server alerts into spoken notifications, detailing the end‑to‑end workflow, DTMF transmission methods, SIP detection techniques, SDP media negotiation, and RTP‑based DTMF parsing to enable reliable key‑press handling.

DTMF · RTP · SIP

0 likes · 8 min read

JD Cloud Developers

Dec 16, 2020 · Artificial Intelligence

How NeuHub Scaled TTS to Billions of Calls: Gateway Architecture Evolution

This article recounts the JD.com 11.11 tech salon presentation where architect Shi Weihang explains how the NeuHub gateway was redesigned to handle over two billion TTS service calls, detailing the challenges, scaling techniques, and architectural decisions that enabled such massive traffic.

AI · JD.com · TTS

0 likes · 1 min read

Alibaba Cloud Developer

Apr 7, 2020 · Artificial Intelligence

How Does Alibaba’s Tmall Genie Achieve Full‑Duplex Natural Dialogue?

This article explains the concept of full‑duplex natural dialogue for Alibaba’s Tmall Genie, illustrates interaction scenarios, and details the technical solution covering device‑side management, speech recognition, language understanding, synthesis, dialogue control, duration handling, and conversation flow.

ASR · Human-Computer Interaction · NLU

0 likes · 8 min read

360 Quality & Efficiency

May 10, 2019 · Artificial Intelligence

Smart Speaker Voice Interaction Platform: Concepts, Processes, and Testing Metrics

This article introduces the architecture of smart speaker voice interaction systems, covering wake‑word activation, automatic speech recognition (ASR), natural language understanding (NLU), skill processing, text‑to‑speech synthesis (TTS), and the key performance and testing metrics for each component.

ASR · NLU · TTS

0 likes · 11 min read

Tencent Cloud Developer

Feb 26, 2019 · Artificial Intelligence

Tencent Cloud Intelligent Speech Technology: Development, Challenges and Practical Applications

Tencent Cloud's intelligent speech platform combines high‑accuracy ASR, advanced WaveNet‑based TTS, and solutions for noise, far‑field, and dialect challenges, enabling voice input, transcription, and customer‑service bots, with real‑world deployments in finance, museums, hotels, and other industry scenarios.

ASR · Human-Computer Interaction · Speech synthesis

0 likes · 8 min read

Tencent Cloud Developer

Sep 30, 2018 · Artificial Intelligence

Smart Speaker Voice Interaction Technology: Recent Advances and Tencent's Research Progress

The article surveys Tencent’s recent advances in smart‑speaker voice interaction, detailing a full technology chain—from front‑end capture, wake‑up and enhancement, through speaker verification and short‑speech voiceprint, to TDNN/LSTM speech recognition, target speaker extraction, and end‑to‑end attention modeling for robust, personalized performance.

Attention Mechanism · TTS · microphone array

0 likes · 18 min read

Tencent TDS Service

Sep 7, 2017 · Mobile Development

How to Implement iOS Voice Payment Alerts: Push Wake‑up, TTS, and Mute Detection

This article explains how to enable voice reminders for payment receipt on iOS by using VoIP push notifications to wake a suspended app, integrating online/offline TTS synthesis, handling audio playback in background, detecting the mute switch, and adjusting system volume thresholds.

Audio Session · TTS · iOS

0 likes · 10 min read
