Tagged articles
36 articles
JavaGuide
May 11, 2026 · Artificial Intelligence

Running Code Review and Voice Agents with Step Plan and Claude Code

The article walks through using Step Plan’s unified API to integrate Claude Code for automated code review and to build a voice‑agent pipeline that transcribes meeting recordings, generates structured summaries, and produces audio briefs, while discussing setup, costs, model selection, practical demos, and observed limitations.

AI Agent · ASR · Claude Code
0 likes · 24 min read
AI Explorer
Apr 12, 2026 · Backend Development

Generate Viral Reddit Videos with a Single Command Using RedditVideoMakerBot

This article introduces RedditVideoMakerBot, an open‑source Python tool that automates fetching hot Reddit posts, creating TTS narration, adding background media, and producing a final video file without manual editing, and provides setup instructions and future feature ideas.

GitHub · Python · Reddit
0 likes · 4 min read
AI Explorer
Apr 11, 2026 · Artificial Intelligence

VoxCPM2: Tokenizer‑Free Multilingual TTS that Creates New Voices from Text

VoxCPM2, an open‑source 2‑billion‑parameter TTS model from OpenBMB, eliminates tokenizers and uses a diffusion‑autoregressive architecture to generate high‑fidelity, controllable speech in 30 languages, supporting voice design from natural‑language prompts and high‑quality voice cloning with just a short reference clip.

AudioVAE · TTS · VoxCPM2
0 likes · 8 min read
Weekly Large Model Application
Mar 30, 2026 · Artificial Intelligence

Inside Kimi-Audio: A Unified Large Audio Model Covering ASR, AQA, TTS and More

Kimi-Audio, a general‑purpose audio foundation model from Moonshot AI, integrates ASR, audio QA, automatic audio captioning, emotion classification and end‑to‑end speech dialogue within a single framework, detailing its mixed‑audio input, MiMo‑Transformer core, efficient synthesis pipeline, architectural strengths, limitations, and suitable application scenarios.

ASR · Audio LLM · BigVGAN
0 likes · 9 min read
Weekly Large Model Application
Mar 17, 2026 · Artificial Intelligence

Essential Features Every Voice Interaction System Must Support

The article provides a comprehensive analysis of core voice interaction system capabilities—including barge‑in, turn‑taking, multi‑turn dialogue, intent recognition, speaker identification, streaming latency, noise robustness, multilingual support, emotion handling, personalization, security, and deployment considerations—highlighting typical scenarios such as smart speakers, in‑car assistants, call centers, and meeting transcription.

ASR · Latency · TTS
0 likes · 11 min read
Weekly Large Model Application
Mar 13, 2026 · Artificial Intelligence

Speech Large Models: Why End-to-End Architecture Beats Traditional ASR‑LLM‑TTS Pipelines

The article defines true speech large models as native end‑to‑end systems that directly map audio to audio, compares them with traditional cascade ASR‑LLM‑TTS pipelines across architecture, error control, latency, paralinguistic perception, long‑context handling and deployment, and surveys the leading open‑source and commercial speech LLMs released in March 2026 with a quick selection guide.

AI · ASR · End-to-End
0 likes · 11 min read
AI Large Model Application Practice
Nov 24, 2025 · Artificial Intelligence

How to Turn Text into an AI‑Powered PPT Video: A Step‑by‑Step Guide

This article breaks down the end‑to‑end engineering pipeline that converts a knowledge source such as a URL or PDF into a narrated PPT‑style video, detailing six core stages—from knowledge extraction and script generation to image creation, voice synthesis, and final video stitching—while highlighting practical model choices, prompt design, and stability tricks.

LLM · PPT · TTS
0 likes · 16 min read
Huolala Tech
Sep 10, 2025 · Artificial Intelligence

How AI Voice Humanization Cuts Call‑Center Costs: ASR, Smart Interrupt & TTS Deep Dive

This article examines how AI‑driven voice humanization—covering advanced ASR, intelligent interruption, and expressive TTS—addresses high labor costs, efficiency bottlenecks, and inconsistent service quality in inbound and outbound call‑center operations, presenting technical evaluations, optimization strategies, and future research directions.

AI voice · ASR · Humanization
0 likes · 13 min read
ShiZhen AI
Aug 14, 2025 · Artificial Intelligence

How to Auto‑Dub Multi‑Character Novels in Just 5 Minutes

This guide walks you through using the AI 易配音 long‑form audio tool to split a novel into chapters, assign distinct voice tones to each character, fine‑tune volume, speed and pitch, generate and batch‑process audio segments, and finally export the finished audio files.

AI voice synthesis · Automation · TTS
0 likes · 15 min read
DaTaobao Tech
Jul 4, 2025 · Artificial Intelligence

How Taobao Live’s AI Digital Humans Transform E‑Commerce: Architecture, Algorithms, and Engineering Insights

This article details the end‑to‑end design of Taobao Live's AI digital human system, covering six core components such as LLM‑driven content creation, interactive dialogue, TTS voice synthesis, visual synchronization, audio‑video engineering, and a scalable backend, while also discussing product evolution, automation challenges, and future roadmap.

AI · Automation · Digital Human
0 likes · 19 min read
DaTaobao Tech
Jul 2, 2025 · Artificial Intelligence

How AI Powers 24/7 Digital Human Live Streams: Architecture, Challenges, and Innovations

This article presents a comprehensive overview of the AI‑driven digital‑human live‑streaming solution used by Taobao, detailing six core components—including LLM‑based content generation and interaction, TTS, visual driving, audio‑video engineering, and backend services—while sharing architectural diagrams, cost‑reduction strategies, productization insights, and future directions.

AI · Digital Human · LLM
0 likes · 8 min read
DaTaobao Tech
Jun 27, 2025 · Artificial Intelligence

Building a High‑Quality Live‑Streaming Digital Human: TTS Pipeline, Data Processing, and Model Optimizations

This article details the end‑to‑end workflow for creating intelligent digital humans for live streaming, covering large‑language‑model‑driven content generation, multi‑stage TTS architecture, extensive audio‑signal processing, speaker clustering, front‑end text normalization, back‑end acoustic modeling, and quantitative evaluation of model improvements.

AI · Digital Human · Speech synthesis
0 likes · 22 min read
Amap Tech
May 27, 2025 · Artificial Intelligence

Gaode Map Custom Voice Pack: End‑to‑End TTS Model Architecture and Deployment

This article explains how Gaode Map leverages lightweight edge TTS models, dual‑autoregressive large‑model data augmentation, and a configurable audio‑processing DAG to enable users to create highly realistic personalized voice packs from just three recorded sentences.

Gaode Maps · TTS · data augmentation
0 likes · 8 min read
ShiZhen AI
May 13, 2025 · Artificial Intelligence

Top Free Text‑to‑Speech Tools for Content Creators

This article reviews five free text‑to‑speech solutions—AI易视频, Google TTS, Natural Reader, Balabolka, and Speech2Go—detailing their features, language support, installation needs, and unique capabilities to help creators choose the right tool for narration, translation, or multi‑character audio production.

AI · TTS · audio generation
0 likes · 7 min read
DaTaobao Tech
Mar 31, 2025 · Artificial Intelligence

AI Audio Generation and Voice Synthesis Practices at Taobao

The article surveys Taobao’s AI‑generated audio pipeline, detailing eight technical papers on image‑to‑video, OpenAI o1, multimodal video, and large‑model voice synthesis, while highlighting advances like VALL‑E, CosyVoice, F5‑TTS, data‑cleaning methods, and e‑commerce applications such as voice‑cloned live streams, multilingual TTS, AI video‑audio integration, and audiobook production.

AI audio · TTS · data cleaning
0 likes · 11 min read
DataFunTalk
Jun 3, 2024 · Artificial Intelligence

Deploying Speech AI Services Quickly with NVIDIA Riva

This article explains how to use NVIDIA Riva to rapidly deploy speech AI services, covering Riva's overview, Chinese ASR model updates, TTS capabilities, customization options, the Quickstart tool, and a Q&A session that clarifies deployment, model fine‑tuning, and integration with NeMo and Triton.

ASR · GPU Acceleration · NVIDIA Riva
0 likes · 13 min read
DataFunTalk
Feb 13, 2024 · Artificial Intelligence

An Overview of NVIDIA NeMo: Open‑Source Framework for Speech AI, ASR, TTS, NLP and Large Language Model Training

This article introduces NVIDIA’s open‑source NeMo framework, detailing its PyTorch‑based architecture for Speech AI, ASR and TTS training, NLP and LLM support, GPU‑optimized parallelism, pre‑trained model resources, fine‑tuning techniques, and the accompanying NeMo Aligner and Framework tools.

ASR · NVIDIA NeMo · PyTorch
0 likes · 18 min read
DataFunTalk
Jan 26, 2024 · Artificial Intelligence

Efficient Deployment of Speech AI Models on GPUs

This article explains how to efficiently deploy speech AI models—including ASR and TTS—on GPUs using NVIDIA's Triton Inference Server and TensorRT, covering background challenges, GPU‑based solutions, decoding optimizations, Whisper acceleration with TensorRT‑LLM, streaming TTS improvements, voice‑cloning pipelines, future plans, and a Q&A session.

ASR · GPU · Inference
0 likes · 20 min read
HomeTech
Dec 6, 2023 · Artificial Intelligence

Metaverse-Based Virtual Humans: Technologies and Applications in Intelligent Q&A

This article explores the concept of the metaverse and virtual humans, detailing 3D modeling techniques, NLP-driven language understanding, streaming TTS, VR/AR interaction, AIGC content generation, and the deployment of a large‑model intelligent Q&A system with real‑time facial expression synthesis for virtual anchors.

3D Modeling · AIGC · Metaverse
0 likes · 8 min read
DataFunSummit
Aug 15, 2023 · Artificial Intelligence

AI Sales Assistant: Few‑Shot Voice Cloning and Multi‑Accent Naturalness Optimization

The article presents 58 Tongcheng AI Lab's AI sales assistant, detailing its background, a few‑shot voice‑cloning pipeline built on real dialogue data, data preprocessing, FastSpeech2‑based acoustic modeling, multi‑accent style transfer, deployment architecture, controllable synthesis parameters, and future research directions.

AI sales assistant · Fastspeech2 · Speech synthesis
0 likes · 20 min read
DataFunSummit
Apr 18, 2023 · Artificial Intelligence

Best Practices for Deploying Speech AI on GPUs with Triton and TensorRT

This article presents comprehensive best‑practice guidelines for deploying conversational speech AI—including ASR and TTS pipelines—on GPU servers using NVIDIA Triton Inference Server and TensorRT, covering workflow overview, performance optimizations, streaming inference, and real‑world deployment tips.

ASR · Conversational AI · GPU deployment
0 likes · 14 min read
Meituan Technology Team
Mar 9, 2023 · Artificial Intelligence

Implementation and Practice of MRCP in Meituan Voice Interaction

This article details Meituan’s adoption of the Media Resource Control Protocol (MRCP) to standardize ASR and TTS integration, describing its architecture, key components, high‑availability deployment, and measured performance gains such as up to 55% latency reduction and a 15% increase in outbound call success rates.

ASR · MRCP · Meituan
0 likes · 24 min read
DataFunTalk
Jul 7, 2022 · Artificial Intelligence

Huawei Translation’s Achievements and Technical Solutions in IWSLT 2022 Speech Translation Tasks

This article reviews Huawei Translation’s top-ranking results in the IWSLT 2022 speech translation competition across speech‑to‑speech, offline speech‑to‑text, and length‑controlled translation tasks, and details their cascade and end‑to‑end technical approaches, including domain‑controlled ASR, context‑aware MT re‑ranking, and VITS‑based TTS.

ASR · End-to-End · Huawei
0 likes · 13 min read
Yuewen Technology
Oct 15, 2021 · Artificial Intelligence

How Yuedu's TTS Platform Automates High‑Quality Audiobook Production

This article explains how Yuedu's TTS synthesis platform tackles the booming audiobook market by using AI‑driven text preprocessing, role graph construction, content structuring, emotion and effect recognition, and a streamlined post‑processing workflow to efficiently generate multi‑character, emotionally rich audio books at scale.

Audio Synthesis · Emotion Recognition · NLP
0 likes · 13 min read
58 Tech
Dec 28, 2020 · Backend Development

Implementation of SIP‑Based DTMF Signal Capture for Intelligent Voice Robots

This article explains how an intelligent voice robot leverages TTS and SIP to convert server alerts into spoken notifications, detailing the end‑to‑end workflow, DTMF transmission methods, SIP detection techniques, SDP media negotiation, and RTP‑based DTMF parsing to enable reliable key‑press handling.

DTMF · RTP · SIP
0 likes · 8 min read
Alibaba Cloud Developer
Apr 7, 2020 · Artificial Intelligence

How Does Alibaba’s Tmall Genie Achieve Full‑Duplex Natural Dialogue?

This article explains the concept of full‑duplex natural dialogue for Alibaba’s Tmall Genie, illustrates interaction scenarios, and details the technical solution covering device‑side management, speech recognition, language understanding, synthesis, dialogue control, duration handling, and conversation flow.

ASR · Human-Computer Interaction · NLU
0 likes · 8 min read
Tencent Cloud Developer
Feb 26, 2019 · Artificial Intelligence

Tencent Cloud Intelligent Speech Technology: Development, Challenges and Practical Applications

Tencent Cloud's intelligent speech platform combines high‑accuracy ASR, advanced WaveNet‑based TTS, and solutions for noise, far‑field, and dialect challenges, enabling voice input, transcription, and customer‑service bots, with real‑world deployments in finance, museums, hotels, and other industry scenarios.

ASR · Human-Computer Interaction · Speech synthesis
0 likes · 8 min read
Tencent Cloud Developer
Sep 30, 2018 · Artificial Intelligence

Smart Speaker Voice Interaction Technology: Recent Advances and Tencent's Research Progress

The article surveys Tencent’s recent advances in smart‑speaker voice interaction, detailing a full technology chain—from front‑end capture, wake‑up and enhancement, through speaker verification and short‑speech voiceprint, to TDNN/LSTM speech recognition, target speaker extraction, and end‑to‑end attention modeling for robust, personalized performance.

Attention Mechanism · TTS · microphone array
0 likes · 18 min read