Tagged articles
51 articles
Geek Labs
May 3, 2026 · Artificial Intelligence

VibeVoice: Microsoft’s Open‑Source Cutting‑Edge Speech AI Models

The article introduces Microsoft’s open‑source VibeVoice project, detailing its long‑audio ASR‑7B and real‑time TTS‑0.5B models, the continuous speech tokenizer and next‑token diffusion techniques, and provides quick‑start instructions for online demos and local deployment via Hugging Face.

Hugging Face · Microsoft · VibeVoice
0 likes · 3 min read
James' Growth Diary
May 2, 2026 · Artificial Intelligence

How to Add Real‑Time Speech Recognition and Streaming TTS to Your AI Agent

This guide walks through choosing a voice‑agent architecture, implementing streaming ASR over WebSocket, triggering TTS sentence by sentence, wiring the three layers together with async generators, and cutting latency to under a second. It also covers common pitfalls such as omitting voice activity detection (VAD) and failing to persist checkpoints.

LangChain · WebSocket · async generators
0 likes · 19 min read
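The sentence‑by‑sentence TTS trigger this guide describes can be sketched as an async generator that buffers streamed LLM text and releases each sentence once it is complete. This is a minimal illustration under stated assumptions, not the guide's own code: it assumes the LLM output arrives as an async stream of text chunks, and `fake_llm_stream` and `speak` are hypothetical stand‑ins for a real LLM client and TTS engine.

```python
import asyncio
import re
from typing import AsyncIterator, List

# Sentence-ending punctuation, including full-width CJK marks. This is a naive rule:
# it will also split on decimal points such as "3.14".
SENTENCE_END = re.compile(r"[.!?。！？]")


async def sentences_from_tokens(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Buffer streamed LLM text and yield each sentence as soon as it is complete."""
    buffer = ""
    async for token in tokens:
        buffer += token
        # A single chunk may close more than one sentence, so drain the buffer in a loop.
        while (match := SENTENCE_END.search(buffer)) is not None:
            sentence, buffer = buffer[: match.end()], buffer[match.end():]
            if sentence.strip():
                yield sentence.strip()
    # Flush any trailing text that never received terminal punctuation.
    if buffer.strip():
        yield buffer.strip()


async def fake_llm_stream(text: str, chunk_size: int = 4) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM: yields text a few characters at a time."""
    for i in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop, like a network read
        yield text[i : i + chunk_size]


async def speak(sentence: str) -> None:
    """Hypothetical TTS hook; a real agent would send the sentence to a streaming TTS engine."""
    print(f"[TTS] {sentence}")


async def main() -> List[str]:
    spoken: List[str] = []
    async for sentence in sentences_from_tokens(fake_llm_stream("Hello there! How can I help? Ask me")):
        await speak(sentence)
        spoken.append(sentence)
    return spoken


if __name__ == "__main__":
    asyncio.run(main())
```

Here `speak` is awaited inline, so synthesis pauses consumption of the LLM stream; a production pipeline would more likely push sentences into an `asyncio.Queue` so that generation and synthesis overlap.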
IT Services Circle
Apr 21, 2026 · Artificial Intelligence

Top 10 Open‑Source AI Projects Transforming Multi‑Agent Development, Coding and More

This article surveys ten notable open‑source AI projects—from a visual multi‑agent IDE and a teammate‑style agent framework to AI‑enhanced coding workflows, a lifelong‑memory layer for Claude Code, a massive Chinese textbook repository, a universal Markdown converter, and a high‑quality TTS model—detailing their motivations, core features, benchmarks, and real‑world usage scenarios.

AI tools · LLM workflows · Markdown conversion
0 likes · 14 min read
Old Zhang's AI Learning
Apr 17, 2026 · Artificial Intelligence

Google Strikes Back: Gemini’s New Features Take on Claude Code

The article reviews Google Gemini’s three‑pronged rollout: a Mac desktop app with global shortcuts and window sharing, the Gemini CLI enhanced with Subagents that keep the main context clean and run expert tasks in parallel, and the Gemini 3.1 Flash TTS model with Audio Tags. It compares each to its competitors and highlights practical use cases and limitations.

AI Coding · Gemini CLI · Google Gemini
0 likes · 12 min read
Meituan Technology Team
Apr 16, 2026 · Artificial Intelligence

Can End-to-End Diffusion TTS Beat Traditional Pipelines? Inside LongCat-AudioDiT

LongCat‑AudioDiT pairs a waveform VAE (wave‑VAE) with a diffusion Transformer, eliminating intermediate spectrograms. It addresses the training‑inference mismatch with dual constraints, replaces classifier‑free guidance with adaptive projection guidance, and reports state‑of‑the‑art zero‑shot voice‑cloning performance on multiple benchmarks.

AI research · audio generation · diffusion model
0 likes · 12 min read
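To make the guidance swap concrete, the sketch below contrasts plain classifier‑free guidance (CFG) with a generic projection‑based guidance of the kind the diffusion literature calls adaptive projected guidance (APG). It is an assumption‑laden illustration, not LongCat‑AudioDiT's implementation: it works on flat Python lists instead of tensors, omits the momentum term some APG formulations use, and the parameter names `eta` and `norm_threshold` are our own.

```python
import math
from typing import List, Optional, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cfg(cond: Sequence[float], uncond: Sequence[float], w: float) -> List[float]:
    """Classifier-free guidance: cond + (w - 1) * (cond - uncond)."""
    return [c + (w - 1.0) * (c - u) for c, u in zip(cond, uncond)]


def apg(
    cond: Sequence[float],
    uncond: Sequence[float],
    w: float,
    eta: float = 0.0,
    norm_threshold: Optional[float] = None,
) -> List[float]:
    """Projection-based guidance (generic form; momentum term omitted).

    The guidance direction (cond - uncond) is split into a component parallel to
    `cond` and one orthogonal to it; the parallel component, which the APG
    literature links to oversaturation at high guidance scales, is scaled by `eta`.
    """
    diff = [c - u for c, u in zip(cond, uncond)]
    # Optional rescaling: cap the L2 norm of the guidance direction.
    if norm_threshold is not None:
        norm = math.sqrt(_dot(diff, diff))
        if norm > norm_threshold:
            diff = [d * norm_threshold / norm for d in diff]
    denom = _dot(cond, cond)
    coef = _dot(diff, cond) / denom if denom > 0 else 0.0
    parallel = [coef * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [c + (w - 1.0) * (o + eta * p) for c, o, p in zip(cond, orthogonal, parallel)]


if __name__ == "__main__":
    cond, uncond = [1.0, 2.0], [0.5, 0.0]
    print("CFG:", cfg(cond, uncond, 3.0))
    print("APG (eta=0):", apg(cond, uncond, 3.0, eta=0.0))
```

With `eta=1.0` and no norm cap, the two components recombine into the full guidance direction and the result reduces to ordinary CFG; lowering `eta` keeps only the part of the guidance that is orthogonal to the conditional prediction.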
SuanNi
Apr 11, 2026 · Artificial Intelligence

Deploy Microsoft VibeVoice TTS for Real‑Time Multi‑Speaker Audio

This guide explains the features of Microsoft’s VibeVoice TTS models, including long‑context synthesis, low‑latency real‑time streaming, and multi‑speaker support, and gives step‑by‑step instructions for deploying the models on a GPU cloud platform using Python.

AI models · Deployment · Multi-speaker
0 likes · 5 min read
AI Open-Source Efficiency Guide
Apr 6, 2026 · Artificial Intelligence

VibeVoice vs PersonaPlex vs OmniVoice: A Comprehensive Open‑Source AI Voice Comparison

This article provides a detailed side‑by‑side analysis of three open‑source speech AI projects—Microsoft's VibeVoice, NVIDIA's PersonaPlex, and Xiaomi's OmniVoice—covering their positioning, core models, technical highlights, multilingual support, performance metrics, licensing, and recommended use cases.

AI · Speech synthesis · automatic speech recognition
0 likes · 15 min read
PaperAgent
Jan 25, 2026 · Industry Insights

Top 10 Chinese Large Models to Watch: Features, Benchmarks, and Download Links

This roundup highlights ten cutting‑edge Chinese AI models—including Qwen3‑TTS, LongCat‑Flash‑Thinking‑2601, GLM‑4.7‑Flash, STEP3‑VL‑10B, Baichuan‑M3, and Youtu‑LLM—detailing their multilingual capabilities, architecture innovations, performance claims, and providing direct repository links for researchers and developers.

AI research · Chinese AI · large language models
0 likes · 7 min read
Old Zhang's AI Learning
Jan 24, 2026 · Artificial Intelligence

Open-Source Qwen3‑TTS: Sub‑100 ms Latency, Runs on 8 GB GPU, and ComfyUI Integration

Qwen3‑TTS, Alibaba's open‑source text‑to‑speech model, offers sub‑100 ms first‑packet latency, voice cloning, natural‑language voice design, and support for ten languages. It can be deployed locally on a GPU with as little as 8 GB of VRAM and integrates with ComfyUI for visual workflow building.

ComfyUI · Low latency · Qwen3-TTS
0 likes · 15 min read
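First‑packet latency is commonly taken to mean the time from issuing a synthesis request until the first audio chunk arrives, as opposed to the time needed to synthesize the whole utterance. The sketch below measures it for any chunk‑yielding stream; `simulated_tts_stream` is a hypothetical stand‑in, and a real benchmark would substitute the streaming call of whichever TTS client is being tested.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def first_packet_latency(start_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, int]:
    """Return (seconds until the first audio chunk, total bytes received) for one request."""
    start = time.perf_counter()
    first_chunk_s = None
    total_bytes = 0
    for chunk in start_stream():
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        total_bytes += len(chunk)
    if first_chunk_s is None:
        raise RuntimeError("stream produced no audio")
    return first_chunk_s, total_bytes


def simulated_tts_stream(delay_s: float, chunks: int, chunk_bytes: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call: waits, then emits silent PCM chunks."""
    time.sleep(delay_s)  # models the time the engine needs before its first packet
    for _ in range(chunks):
        yield b"\x00" * chunk_bytes


if __name__ == "__main__":
    latency, size = first_packet_latency(lambda: simulated_tts_stream(0.05, 3, 320))
    print(f"first packet after {latency * 1000:.1f} ms, {size} bytes total")
```

The stream is started inside the timed region (via the factory callable) so that request setup counts toward the measured latency.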
Old Meng AI Explorer
Jan 8, 2026 · Artificial Intelligence

How Microsoft’s Open‑Source VibeVoice Gives AI Speech Real Emotion

Microsoft’s open‑source VibeVoice model transforms text‑to‑speech by adding fine‑grained emotional control, multi‑scene styles, and support for over 100 languages, offering free commercial use, low‑latency local deployment, and detailed parameter settings that let developers and creators generate expressive, context‑aware audio for videos, audiobooks, chatbots, and more.

AI voice · Deployment · VibeVoice
0 likes · 10 min read
HyperAI Super Neural
Jan 3, 2026 · Artificial Intelligence

Clone a Voice in 5 Seconds with One‑Step Generation: Inside Chatterbox‑Turbo’s High‑Fidelity TTS

Resemble AI’s open‑source Chatterbox‑Turbo cuts TTS generation from ten steps to one, enabling high‑sample‑rate, lossless voice cloning from a 5‑10 second reference clip. It also supports emotion control, paralinguistic tags, and embedded watermarking for real‑time applications in chatbots, games, podcasts, and education.

Chatterbox‑Turbo · Real-time inference · knowledge distillation
0 likes · 7 min read
HyperAI Super Neural
Dec 12, 2025 · Artificial Intelligence

Weekly AI Paper Digest: Attention, Nvidia VLA, TTS, and Graph Neural Networks

This roundup presents five recent AI papers covering hierarchical sparse attention for ultra‑long context, Nvidia's Alpamayo‑R1 VLA model for autonomous driving, the non‑autoregressive F5‑TTS system, LatentMAS for latent‑space multi‑agent collaboration, and Deeper‑GXX that deepens arbitrary graph neural networks, highlighting each method's key innovations and reported performance gains.

Attention Mechanism · autonomous driving · graph neural networks
0 likes · 6 min read
Programmer DD
Oct 19, 2025 · Backend Development

How to Add Free Edge TTS to Your Spring Boot Application in Minutes

This tutorial shows how to integrate UnifiedTTS's free Edge TTS service into a Spring Boot project, covering project setup, API key registration, configuration, request/response models, service implementation, unit testing, and runtime verification with sample code and images.

Edge TTS · Java · Spring Boot
0 likes · 9 min read
HyperAI Super Neural
Oct 8, 2025 · Artificial Intelligence

From WeChat’s AI Podcast Trial to Google, ByteDance and Xiaohongshu: Can AI Podcasts Capture the Emerging AIGC Blue Ocean?

The article examines how breakthroughs in large language models and high‑fidelity TTS are powering AI‑generated podcasts, analyzes the technical advances behind the "human‑like" sound, surveys major players such as Google, ByteDance, Xiaohongshu and startups, and evaluates the market potential of this rapidly expanding AIGC niche.

AI podcast · AIGC · ByteDance
0 likes · 9 min read
Xiaohongshu Tech REDtech
Sep 19, 2025 · Artificial Intelligence

FireRedTTS-2: How the New Open-Source Model Achieves Human‑Like Multi‑Speaker Dialogue Synthesis

FireRedTTS-2, the latest open‑source dialogue TTS model from Xiaohongshu’s audio team, upgrades its speech tokenizer and text‑to‑speech architecture to enable low‑latency, per‑sentence generation, robust multi‑speaker switching, and natural prosody across multiple languages, outperforming rivals in both objective and subjective tests.

AI audio · dialogue synthesis · multilingual
0 likes · 10 min read
DataFunSummit
Sep 7, 2025 · Artificial Intelligence

How NIO Cut Radio Production Costs by 80% with AI Voice Cloning

This article details NIO's AI‑driven voice‑cloning solution for its in‑car NIO Radio, explaining the business background, pain points of traditional production, the TTS‑VC framework and modular workflow, evaluation metrics, and the resulting cost savings, efficiency gains, and scalability across dozens of cities.

AI · Cost reduction · Speech synthesis
0 likes · 10 min read
Bilibili Tech
Jul 11, 2025 · Artificial Intelligence

IndexTTS2: Emotionally Expressive, Duration-Controlled Zero-Shot TTS

IndexTTS2 introduces a novel autoregressive zero‑shot text‑to‑speech model that achieves precise duration control and fine‑grained emotional expression through a universal time‑encoding mechanism, decoupled modeling of voice timbre and emotion, and GPT‑style latent representations, outperforming state‑of‑the‑art baselines across multiple benchmarks.

duration control · emotional synthesis · speech generation
0 likes · 23 min read
Cognitive Technology Team
Jul 1, 2025 · Artificial Intelligence

How We Built a Live‑Streaming TTS Engine: From Data Pipelines to AI Voice Generation

This article presents a comprehensive practice summary of building an intelligent digital‑human system, covering six core modules—LLM content generation, LLM interaction, TTS synthesis, visual driving, audio‑video engineering, and backend services—while detailing data collection, signal processing, ASR annotation, speaker clustering, model optimization (V1‑V4), evaluation metrics, and future research directions.

AI voice · Audio Processing · Digital Human
0 likes · 23 min read
ShiZhen AI
May 13, 2025 · Artificial Intelligence

Top Free Text‑to‑Speech Tools for Content Creators

This article reviews five free text‑to‑speech solutions—AI易视频, Google TTS, Natural Reader, Balabolka, and Speech2Go—detailing their features, language support, installation needs, and unique capabilities to help creators choose the right tool for narration, translation, or multi‑character audio production.

AI · TTS · audio generation
0 likes · 7 min read
DataFunTalk
Mar 21, 2025 · Artificial Intelligence

OpenAI Unveils New STT and TTS Models: gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts – Performance, Pricing, and Demo

OpenAI announced three new speech models—two STT models (gpt-4o-transcribe and its lightweight gpt-4o-mini-transcribe) and one TTS model (gpt-4o-mini-tts)—showcasing strong accuracy on multilingual benchmarks, competitive pricing, and a quick‑start API demo for developers.

AI models · GPT-4o · OpenAI
0 likes · 8 min read
Huolala Tech
Dec 26, 2024 · Artificial Intelligence

How Huolala’s In‑House TTS Overcomes Latency, Naturalness, and Multilingual Limits

This article details Huolala’s self‑developed Text‑to‑Speech system, outlining its architecture, the challenges of latency, naturalness, and language support, and the innovative solutions—including streaming synthesis, emotion modeling, and transfer‑learning‑based multilingual capabilities—that deliver more flexible and realistic voice interactions.

Emotion Modeling · Streaming TTS · Voice Customization
0 likes · 10 min read
System Architect Go
Nov 28, 2024 · Artificial Intelligence

An Overview of Modern AI Audio Technologies: ASR, TTS, and Voice Cloning

This article explains how modern AI advances have transformed audio processing, covering digital audio fundamentals, automatic speech recognition (ASR), text‑to‑speech (TTS), and voice cloning techniques, and provides practical Python code examples using OpenAI Whisper and Hugging Face TTS models.

AI · Audio Processing · speech recognition
0 likes · 7 min read
Huolala Tech
Oct 8, 2024 · Mobile Development

iOS 17 Text‑to‑Speech Crash: Root Cause and Effective Fixes

This article investigates a recurring text‑to‑speech crash on iOS 17 devices, detailing the EXC_BAD_ACCESS error, analyzing stack traces, exploring internal AVAudioEngine and AUAudioUnit_XPC structures, and presenting two remediation strategies—including a hook‑based approach that safely bypasses problematic dealloc and stop calls.

AVAudioEngine · Crash · Hook
0 likes · 16 min read
ShiZhen AI
Jun 5, 2024 · Artificial Intelligence

How the Homegrown Open‑Source ChatTTS Model Scored 20K Stars in One Week

The article introduces ChatTTS, a dialogue‑optimized open‑source text‑to‑speech model trained on over 100,000 hours of Chinese and English data, highlights its fine‑grained prosody control and multi‑speaker support, notes its superior naturalness compared to most open‑source TTS systems, and outlines its current limitations such as poor Arabic numeral handling and slow inference speed.

ChatTTS · Chinese AI · dialogue TTS
0 likes · 2 min read
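A common workaround for weak Arabic‑numeral handling in a TTS model is to normalize numbers into words before synthesis. The sketch below is our own illustration, not part of ChatTTS: it assumes English input and spells out integers below one million, leaving larger numbers untouched; Chinese text would need Chinese numeral readings instead, and decimals, dates, and phone numbers need their own rules.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def _below_thousand(n: int) -> str:
    """Spell out 0 <= n < 1000 in English words."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rest] if rest else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + _below_thousand(rest) if rest else "")


def number_to_words(n: int) -> str:
    """Spell out 0 <= n < 1,000,000 in English words."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only handles 0..999999")
    thousands, rest = divmod(n, 1000)
    if thousands == 0:
        return _below_thousand(rest)
    words = _below_thousand(thousands) + " thousand"
    return words + (" " + _below_thousand(rest) if rest else "")


def normalize_numbers(text: str) -> str:
    """Replace each run of ASCII digits with its English spelling; other text is unchanged."""
    def spell(match: re.Match) -> str:
        value = int(match.group())
        return number_to_words(value) if value < 1_000_000 else match.group()
    return re.sub(r"[0-9]+", spell, text)


if __name__ == "__main__":
    print(normalize_numbers("Call me in 15 minutes"))
```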
php Courses
Sep 1, 2023 · Artificial Intelligence

Integrating Baidu Text-to-Speech API with PHP

This tutorial demonstrates how to obtain Baidu TTS credentials, construct the required signature, send an HTTP request using PHP's cURL library, and save the returned audio data as an MP3 file, providing a complete code example for developers.

Baidu TTS · PHP · Speech synthesis
0 likes · 5 min read
58 Tech
Aug 25, 2023 · Artificial Intelligence

Voice Cloning Technology in AI Sales Assistant

This article introduces the AI sales assistant from 58.com, detailing its background, a few‑shot voice cloning approach using real dialogue data, multi‑accent naturalness optimization, deployment architecture, and future plans, while evaluating performance metrics and discussing challenges in speech synthesis quality and stability.

AI sales assistant · Few‑Shot Learning · Speech synthesis
0 likes · 19 min read
Python Crawling & Data Mining
Oct 18, 2022 · Fundamentals

How to Make Python Speak with a Male Voice Using pyttsx3 and Registry Tweaks

This article walks through troubleshooting the pyttsx3 Python text‑to‑speech library on Windows, explains why only female voices appear by default, shows how to add the missing male “Kangkang” voice via registry edits, and provides complete working code examples for both voice selection and speech synthesis.

Programming tutorial · Windows Registry · pyttsx3
0 likes · 5 min read
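Once the registry edits make the male voice visible, selecting it in pyttsx3 comes down to finding it in the engine's voice list and setting the `voice` property. The sketch below is a hedged illustration rather than the article's code: it assumes the installed voice's name contains "Kangkang" (exact names vary by Windows install), and it keeps the matching logic in a separate helper so it can be checked without a speech backend.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def pick_voice_id(voices: Iterable, keyword: str) -> Optional[str]:
    """Return the id of the first voice whose name contains `keyword` (case-insensitive)."""
    for voice in voices:
        if keyword.lower() in (voice.name or "").lower():
            return voice.id
    return None


def speak_with_voice(text: str, keyword: str = "Kangkang") -> bool:
    """Speak `text` with the first matching installed voice; return False if none matches."""
    import pyttsx3  # imported lazily so pick_voice_id can be used without a TTS backend

    engine = pyttsx3.init()
    voice_id = pick_voice_id(engine.getProperty("voices"), keyword)
    if voice_id is None:
        return False  # voice not visible to the engine; the registry steps are still needed
    engine.setProperty("voice", voice_id)
    engine.say(text)
    engine.runAndWait()
    return True


if __name__ == "__main__":
    # Offline check of the selection logic with fake voice objects (names are illustrative).
    fake = [SimpleNamespace(id="v1", name="Microsoft Huihui Desktop"),
            SimpleNamespace(id="v2", name="Microsoft Kangkang")]
    print(pick_voice_id(fake, "kangkang"))  # v2
```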
Xiaohongshu Tech REDtech
Aug 10, 2022 · Artificial Intelligence

Multi-Stage Multi-Codebook VQ-VAE for High-Performance Neural Text-to-Speech (MSMC‑TTS)

The MSMC‑TTS system, a multi‑stage multi‑codebook VQ‑VAE based neural text‑to‑speech solution, delivers near‑human audio quality (MOS 4.41) with a compact 3.12 MB acoustic model, substantially surpassing Mel‑Spectrogram FastSpeech baselines in naturalness and efficiency.

Compact Representation · Multi-Stage Modeling · Speech synthesis
0 likes · 10 min read
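The "multi‑codebook" idea can be illustrated with a toy quantizer that splits a feature vector into equal sub‑vectors and snaps each one to the nearest entry of its own codebook. This is a simplified sketch under our own assumptions, not MSMC‑TTS's quantizer: as we understand it, the real system learns its codebooks end to end and quantizes at several time resolutions (the "multi‑stage" part), whereas here the codebooks are fixed toy values and only a single stage is shown.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(x: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codeword with the smallest squared Euclidean distance to x."""
    best_idx, best_dist = -1, math.inf
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(x, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def multi_codebook_quantize(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[List[int], List[float]]:
    """Split x into equal sub-vectors, one per codebook, and quantize each independently.

    Returns the chosen code indices and the concatenated quantized vector.
    """
    n = len(codebooks)
    if n == 0 or len(x) % n:
        raise ValueError("vector length must be a multiple of the number of codebooks")
    width = len(x) // n
    indices: List[int] = []
    quantized: List[float] = []
    for i, codebook in enumerate(codebooks):
        sub = x[i * width:(i + 1) * width]
        idx = nearest_code(sub, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


if __name__ == "__main__":
    toy_codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [2.0, 2.0]]]
    print(multi_codebook_quantize([0.9, 1.2, 0.1, -0.3], toy_codebooks))
```

Splitting across several small codebooks lets the number of representable vectors grow multiplicatively (here 2 × 2 combinations) while each lookup stays cheap, which is one reason such representations can stay compact.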
DataFunTalk
Mar 26, 2022 · Artificial Intelligence

Advances in Alibaba's Digital Human (XiaoMi) Technology: Development, Construction, and Interaction

This article reviews Alibaba's XiaoMi digital human technology, covering its evolution since 2019, a six‑stage pipeline for building avatars, methods to enhance emotional, textual, vocal, and motion expressiveness, and approaches for improving long‑term interactive capabilities such as controllable script generation, multimodal QA, sign‑language translation, and intelligent behavior decision, culminating in the release of the MMTK multimodal algorithm library.

Digital Human · Emotion Modeling · Multimodal AI
0 likes · 11 min read
Test Development Learning Exchange
Oct 17, 2021 · Artificial Intelligence

Using pyttsx3 for Text-to-Speech in Python

This article provides a hands‑on guide to using the pyttsx3 library for offline text‑to‑speech conversion in Python, covering installation, basic playback, voice property adjustments, multilingual support, and conditional speech examples with counters.

Python · Speech synthesis · conditional speech
0 likes · 7 min read
MaGe Linux Operations
Nov 26, 2019 · Artificial Intelligence

Create Cute Voiceovers with Baidu TTS and Python

This guide shows how to use Baidu's AI speech synthesis service with Python, covering SDK installation, app creation, obtaining credentials, and sample code that converts text, including daily quotes fetched from an external API, into audio files with customizable voice styles.

API · Baidu AI · Python
0 likes · 5 min read
DataFunTalk
Nov 5, 2019 · Artificial Intelligence

Low-Resource Text-to-Speech: FastSpeech, LightTTS, and LightBERT Overview

This article reviews recent advances in low‑resource text‑to‑speech synthesis, covering the background of TTS, challenges in data‑ and compute‑limited scenarios, and detailed descriptions of FastSpeech, LightTTS, LightBERT, and related lightweight vocoder techniques, along with experimental results and future research directions.

FastSpeech · LightTTS · Low-Resource
0 likes · 20 min read
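A central piece of FastSpeech is its length regulator, which expands phoneme‑level hidden states to frame level by repeating each state according to its predicted duration, with a factor α that stretches or compresses the result to change speaking rate. The sketch below shows only that expansion step on plain Python lists; in the model itself the durations come from a learned duration predictor and the states are tensors.

```python
from typing import List, Sequence, TypeVar

State = TypeVar("State")


def length_regulate(states: Sequence[State], durations: Sequence[int], alpha: float = 1.0) -> List[State]:
    """Repeat each phoneme-level state round(duration * alpha) times to get frame-level states.

    alpha > 1 yields more frames (slower speech); alpha < 1 yields fewer (faster speech).
    Python's round() rounds exact halves to the nearest even integer.
    """
    if len(states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")
    frames: List[State] = []
    for state, duration in zip(states, durations):
        frames.extend([state] * max(0, round(duration * alpha)))
    return frames


if __name__ == "__main__":
    phonemes = [[1.0], [2.0], [3.0]]  # toy one-dimensional "hidden states"
    print(length_regulate(phonemes, [2, 1, 3]))
```

Because every output frame is produced in one pass from known durations, this step is what allows FastSpeech to generate mel frames in parallel rather than autoregressively.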
Liangxu Linux
Sep 3, 2019 · Artificial Intelligence

Clone Any Voice in Seconds with the Real-Time-Voice-Cloning Open‑Source TTS

This guide explains how the Real-Time-Voice-Cloning project uses deep‑learning text‑to‑speech techniques to generate a voice clone from a short audio sample, covering the underlying principle, required dataset, setup steps, demo usage, and ethical considerations.

Deep Learning · Real-Time-Voice-Cloning · text-to-speech
0 likes · 5 min read
21CTO
Dec 9, 2015 · Artificial Intelligence

iFLY Mobile Speech Platform: Enabling Voice Recognition and Synthesis

iFLY’s Mobile Speech Platform (MSP) integrates cloud‑based speech recognition and text‑to‑speech technologies to deliver high‑quality, multi‑channel voice services for Android, iOS and other devices, detailing its four‑layer architecture, core functionalities, and the role of ASR and TTS in modern human‑machine interaction.

Mobile Development · artificial intelligence · cloud architecture
0 likes · 5 min read
Baidu Tech Salon
Jul 29, 2014 · Artificial Intelligence

Baidu Speech Synthesis: Balancing Trade‑offs and Opening the Platform to Developers

Baidu’s speech synthesis system, in development since 2013 to give machines natural Chinese voices, tackles millions of tonal variations through phonetic compression and optimized acoustic models, balances trade‑offs in data and scalability, and offers a free open platform that lets developers integrate high‑quality text‑to‑speech into apps, advancing equal access to information.

Baidu · Developer Platform · HMM
0 likes · 6 min read