Tagged articles
121 articles
Page 1 of 2
Woodpecker Software Testing
May 14, 2026 · Artificial Intelligence

AI Testing in Practice: 3 Real-World Case Studies

The article examines how AI testing has shifted from simple functional checks to evaluating model reliability, fairness, robustness, and explainability, illustrating the shift with three detailed client cases—financial bias audit, automotive voice‑assistant stress testing, and medical‑imaging consistency verification.

AI testing · Aequitas · RAGAS
0 likes · 8 min read
Machine Heart
May 8, 2026 · Artificial Intelligence

Can Qianwen’s Desktop Voice Input Finally Make the Keyboard Obsolete?

The article evaluates Qianwen’s new desktop voice‑input system, showing how it filters filler words, understands screen context, executes AI commands, and generates structured text, PPTs, and Excel reports, positioning voice as a viable replacement for traditional keyboard typing.

AI Assistant · Qianwen · desktop AI
0 likes · 12 min read
Weekly Large Model Application
May 5, 2026 · Artificial Intelligence

What Pretraining Actually Teaches: Listening to All Sounds

The article explains that pretraining for speech models functions like a broad liberal‑arts education, teaching universal acoustic and linguistic patterns through next‑token prediction, joint audio‑text training, and masked or contrastive objectives, while clarifying common misconceptions and highlighting data bias and the need for clean, task‑specific fine‑tuning.

Fine-tuning · audio-text alignment · data bias
0 likes · 6 min read
Weekly Large Model Application
May 5, 2026 · Artificial Intelligence

Why More GPUs and Data Aren’t Enough: Defining Scenarios and Data for Speech Model Training

The article argues that successful speech model training starts with understanding user scenarios, then selecting appropriate data, and finally choosing metrics, detailing six key questions, data sourcing strategies, evaluation criteria, and compliance considerations to avoid the misconception that sheer data volume guarantees performance.

AI training · Model Evaluation · data collection
0 likes · 6 min read
Geek Labs
May 3, 2026 · Artificial Intelligence

VibeVoice: Microsoft’s Open‑Source Cutting‑Edge Speech AI Models

The article introduces Microsoft’s open‑source VibeVoice project, detailing its long‑audio ASR‑7B and real‑time TTS‑0.5B models, the continuous speech tokenizer and next‑token diffusion techniques, and provides quick‑start instructions for online demos and local deployment via Hugging Face.

Hugging Face · Microsoft · VibeVoice
0 likes · 3 min read
James' Growth Diary
May 2, 2026 · Artificial Intelligence

How to Add Real‑Time Speech Recognition and Streaming TTS to Your AI Agent

This guide walks through choosing the right voice‑agent architecture, implementing streaming ASR with WebSocket, triggering sentence‑by‑sentence TTS, wiring the three layers together via async generators, optimizing latency to under a second, and avoiding common pitfalls such as missing VAD and missing checkpoint persistence.

LangChain · WebSocket · async generators
0 likes · 19 min read
Wuming AI
Apr 21, 2026 · Artificial Intelligence

Can AI Voice Input Boost Office Productivity? A Hands‑On Review of Typeless and ShandianShuo

The article examines how AI‑powered voice input can replace keyboard typing in office settings, evaluates environmental constraints, compares two leading tools—Typeless and ShandianShuo—through feature lists, limitations, and real‑world usage scenarios, and concludes with practical advice on choosing the right solution.

AI voice input · Product Comparison · ShandianShuo
0 likes · 7 min read
AI Waka
Mar 26, 2026 · Artificial Intelligence

Building Production‑Ready AI Agents with NVIDIA Nemotron: A Full‑Stack Guide

This guide explains how to assemble NVIDIA's Nemotron Speech, RAG, and Safety models into a low‑latency, secure production AI agent stack, covering performance benchmarks, multimodal retrieval, safety data sets, integration code, and deployment options for cloud, on‑premise, and edge environments.

Content Safety · Edge Computing · Multimodal Retrieval
0 likes · 9 min read
Weekly Large Model Application
Mar 4, 2026 · Artificial Intelligence

Qwen3‑ASR vs FunASR: In‑Depth Technical Comparison

This article provides a detailed side‑by‑side analysis of the open‑source ASR tools FunASR and Qwen3‑ASR, covering team origins, model architectures, language coverage, speed, deployment requirements, and ideal use‑cases so readers can decide which solution fits their projects best.

ASR · FunASR · Paraformer
0 likes · 10 min read
Weekly Large Model Application
Feb 22, 2026 · Artificial Intelligence

2026 Guide to Running Open‑Source ASR on Pure CPU

The 2026 overview details lightweight, heavily quantized open‑source speech‑recognition models and CPU‑specific inference engines, offering concrete tips, model comparisons, and a concise selection guide that enable real‑time, GPU‑free ASR deployment with low latency and high stability.

ASR · CPU inference · Model Selection
0 likes · 4 min read
AI Engineering
Feb 15, 2026 · Artificial Intelligence

Qwen3‑ASR Runs Natively on Apple Silicon via MLX for Full‑Speed Speech Recognition

A developer has re‑implemented the state‑of‑the‑art Qwen3‑ASR model in MLX, enabling native execution on Apple M1‑M4 chips with real‑time factors as low as 0.08, 4‑bit quantization speedups of 4.7×, multilingual support for 52 languages, and features such as word‑level timestamps and streaming transcription.

Apple Silicon · MLX · Qwen3-ASR
0 likes · 5 min read
Old Zhang's AI Learning
Feb 1, 2026 · Artificial Intelligence

Microsoft VibeVoice‑ASR Open‑Source: One‑Shot 60‑Minute Transcription with Speaker ID and Timestamps

Microsoft’s newly open‑sourced VibeVoice‑ASR model can transcribe up to 60‑minute audio in a single pass, preserving global context while providing built‑in speaker diarization and timestamps, supports 50+ languages, offers custom hot‑word injection, and can be deployed via Docker, Gradio, or vLLM for high‑throughput API services.

ASR · Docker · LoRA
0 likes · 9 min read
Woodpecker Software Testing
Jan 27, 2026 · Artificial Intelligence

How to Build a Multimodal AI Assistant with FastAPI, Alibaba Cloud and DashScope

This guide walks through configuring Alibaba Cloud credentials, implementing a FastAPI backend with email function calling, Alibaba OpenSearch, image generation via DashScope, speech recognition, and a responsive HTML/CSS/JavaScript front‑end that supports text chat, image recognition, image synthesis, and voice interaction.

Alibaba Cloud · Dashscope · FastAPI
0 likes · 38 min read
Woodpecker Software Testing
Jan 25, 2026 · Artificial Intelligence

Integrating LLMs with Speech: Whisper, Vosk, and Alibaba Cloud in Python and JavaScript

This tutorial walks through setting up local speech recognition with OpenAI's Whisper and Vosk, leveraging Alibaba Cloud's ASR services, building a WebSocket server/client for real‑time audio streaming, capturing audio in the browser via MediaRecorder or RecordRTC, and performing speech synthesis with pyttsx3 and Alibaba's Sambert model.

Alibaba Cloud · JavaScript · Python
0 likes · 20 min read
AI Waka
Jan 24, 2026 · Artificial Intelligence

Building Production‑Ready AI Agents with NVIDIA’s Nemotron Stack

The article explains how NVIDIA’s Nemotron Stack combines ultra‑fast speech recognition, multimodal retrieval, and advanced safety models into a unified, low‑latency pipeline, offering practical integration code, performance insights, and deployment options for turning experimental AI agents into production‑grade services.

AI agents · Content Safety · Deployment
0 likes · 9 min read
Old Zhang's AI Learning
Jan 23, 2026 · Artificial Intelligence

Open‑Source GLM‑ASR‑Nano‑2512: Chinese Dialect‑Optimized Speech Recognition on Consumer‑Grade GPUs

GLM‑ASR‑Nano‑2512, a 1.5 B‑parameter open‑source speech‑recognition model released in December 2025, delivers state‑of‑the‑art accuracy on Chinese dialects and low‑volume audio, outperforms Whisper V3 on benchmark tests, runs on consumer GPUs, and provides detailed installation and deployment guides for transformers, vLLM and SGLang.

Chinese dialects · GLM-ASR-Nano-2512 · SGLang
0 likes · 11 min read
360 Smart Cloud
Dec 1, 2025 · Artificial Intelligence

How to Build Real‑Time Streaming Speech Recognition with a Large‑Model API (Go & Python)

This guide explains the background of speech‑to‑text technology, introduces the large‑model streaming speech recognition API, walks through obtaining an API key, and provides detailed Go and Python code for establishing a WebSocket connection, sending full‑client and audio‑only requests, and parsing server responses.

AI · Golang · Large Model
0 likes · 12 min read
AntTech
Oct 28, 2025 · Artificial Intelligence

Ming-Flash-Omni-Preview: 103B Open-Source Multimodal Model Excelling in Image, Video, and Speech

Introducing Ming‑Flash‑Omni‑Preview, a 103‑billion‑parameter open‑source multimodal model built on a sparse MoE architecture that delivers state‑of‑the‑art performance in controllable image generation, streaming video understanding, and context‑aware speech recognition, surpassing prior models on GenEval and GEdit benchmarks.

image generation · large language model · multimodal
0 likes · 8 min read
Liangxu Linux
Oct 21, 2025 · Artificial Intelligence

Explore 4 Must‑Try Open‑Source AI Tools: Datasets, Finance Model, Real‑Time Speech, and Agent Toolbox

This article introduces four high‑impact open‑source projects—a curated public dataset collection, the Kronos financial K‑line analysis model, WhisperLiveKit for real‑time speech transcription, and Youtu‑agent for building versatile AI agents—each with descriptions, key features, and GitHub links.

AI models · Datasets · agent toolbox
0 likes · 6 min read
Baidu Maps Tech Team
Jul 31, 2025 · Artificial Intelligence

How Baidu’s AI Voice Assistant Turns Speech into Precise Navigation Commands

This article explains how Baidu Maps’ AI voice assistant converts spoken commands into precise navigation actions by detailing the speech‑to‑text pipeline, intent parsing, template and generative approaches, tool‑calling mechanisms, memory and reflection capabilities, and future directions for intelligent agents.

AI · Intent Parsing · LLM
0 likes · 14 min read
AntTech
Jul 3, 2025 · Artificial Intelligence

How Ant Group’s AI Multimodal Evaluation Transforms Image, Speech, and Video Quality Testing

In a QECon 2025 talk, Ant Group’s AI team detailed a comprehensive multimodal evaluation framework that leverages large‑model metrics, custom pipelines, and benchmark datasets to assess image generation, speech recognition, and video quality, while also contributing to industry standards and academic research.

AI Evaluation · image assessment · large models
0 likes · 16 min read
iQIYI Technical Product Team
Jul 3, 2025 · Artificial Intelligence

Three iQIYI AI Papers Break New Ground at ACL 2025 & INTERSPEECH 2025

iQIYI’s AI research team secured three paper acceptances: two at ACL 2025 (one main‑conference paper and one Findings paper) and one at INTERSPEECH 2025. The papers cover long‑context large language model evaluation, Chinese novel summarization, and efficient Thai speech recognition, with links to each work.

ACL 2025 · AI research · INTERSPEECH 2025
0 likes · 7 min read
System Architect Go
Nov 28, 2024 · Artificial Intelligence

An Overview of Modern AI Audio Technologies: ASR, TTS, and Voice Cloning

This article explains how modern AI advances have transformed audio processing, covering digital audio fundamentals, automatic speech recognition (ASR), text‑to‑speech (TTS), and voice cloning techniques, and provides practical Python code examples using OpenAI Whisper and HuggingFace TTS models.

AI · Audio Processing · speech recognition
0 likes · 7 min read
Huolala Tech
Jul 9, 2024 · Artificial Intelligence

Building an In-Car Voice Assistant: From Wake‑Word to NLP

This article details the end‑to‑end development of an in‑vehicle voice assistant, covering motivation, functional design, technology stack selection, dialogue flow, privacy, third‑party integration, wake‑word detection, on‑device speech recognition, noise filtering, NLP processing, and deployment considerations.

Voice Assistant · in‑car technology · natural language processing
0 likes · 18 min read
Ops Development & AI Practice
Jun 22, 2024 · Artificial Intelligence

Why Transformers Revolutionized AI: From NLP to Vision and Speech

Transformers, introduced in 2017, have reshaped neural networks by leveraging attention mechanisms to outperform RNNs and CNNs across NLP, computer vision, and speech tasks, offering parallel processing, long‑range dependency capture, and versatile applications such as translation, text generation, image classification, and speech recognition.

Attention Mechanism · Computer Vision · Deep Learning
0 likes · 6 min read
Huolala Tech
Nov 23, 2023 · Artificial Intelligence

How HuoLaLa Built a Custom ASR System to Boost Accuracy and Cut Costs

This article details HuoLaLa's development of an in‑house Automatic Speech Recognition system, covering its architecture, VAD optimization, language‑model and hot‑word enhancements, punctuation restoration, task and resource scheduling, and the resulting improvements in accuracy and cost efficiency.

ASR · Language Model · VAD
0 likes · 18 min read
DataFunTalk
Sep 23, 2023 · Artificial Intelligence

Paraformer: An Industrial Non‑Autoregressive End‑to‑End Speech Recognition Model and Its Deployment on ModelScope

This article introduces the Paraformer non‑autoregressive end‑to‑end speech recognition model released by Alibaba DAMO Academy, details its architecture, training strategies, large‑scale performance, and provides step‑by‑step guidance for using and fine‑tuning the model on the ModelScope platform with the FunASR toolkit.

ASR · ModelScope · Paraformer
0 likes · 13 min read
Test Development Learning Exchange
Jul 27, 2023 · Artificial Intelligence

Splitting PDF Files and Recognizing MP3 Audio with Python

This guide explains how to split a PDF into separate files using PyPDF2 and provides two Python approaches for converting MP3 audio to text—one leveraging Google Speech‑Recognition for higher accuracy and another using PocketSphinx for complete transcription—complete with ready‑to‑run code examples.

PDF · PyPDF2 · Python
0 likes · 5 min read
58 Tech
Jul 6, 2023 · Artificial Intelligence

Design and Optimization of a Kaldi‑Based Speech Recognition Backend at 58.com

This article details the evolution from the initial Kaldi‑based speech recognition architecture (version 1.0) to a re‑engineered version 2.0, describing business background, service components, identified shortcomings, and a series of performance, concurrency, GPU, I/O, GC, and dispatch optimizations that dramatically improve resource utilization, latency, and reliability for large‑scale voice processing at 58.com.

AI · Backend Architecture · GPU
0 likes · 15 min read
58 Tech
Jun 21, 2023 · Artificial Intelligence

GPU Hotword Enhancement for WeNet End-to-End Speech Recognition

This article explains the design, implementation, and experimental evaluation of hot‑word augmentation in WeNet's GPU runtime, detailing how character‑ and word‑based language model scoring is extended to boost recognition of rare proper nouns in both streaming and non‑streaming ASR services.

ASR · CTC decoder · GPU
0 likes · 12 min read
php Courses
Jun 17, 2023 · Mobile Development

Implementing Voice Functionality in WeChat Mini Programs

This guide explains how to integrate WeChat Mini Program voice capabilities by importing the recorder and audio APIs, recording audio, uploading for speech recognition, and playing back the result, with example code snippets for each step.

JavaScript · Voice API · WeChat Mini Program
0 likes · 3 min read
21CTO
Jun 10, 2023 · Artificial Intelligence

How Huang Xuedong’s Team Achieved Human-Level Speech Recognition at Microsoft

The article chronicles the career of Chinese AI pioneer Huang Xuedong, detailing his education, rise at Microsoft, leadership of Azure AI, human‑level speech recognition breakthroughs, the engineering feats behind them (including a ten‑network model and the CNTK framework), and his recent move to Zoom.

CNTK · Deep Learning · Microsoft
0 likes · 14 min read
Meituan Technology Team
Apr 13, 2023 · Artificial Intelligence

Peak-First Regularization for Low-Latency Streaming Speech Recognition

The paper presents a low‑latency streaming speech‑recognition solution that reframes latency reduction as a knowledge‑distillation task, using a simple peak‑first regularization term to shift CTC output probabilities leftward and achieve up to 200 ms average latency reduction without harming word error rate.

CTC · Latency Reduction · Peak-First Regularization
0 likes · 21 min read
Bilibili Tech
Feb 28, 2023 · Artificial Intelligence

High‑Quality Automatic Speech Recognition (ASR) Solutions at Bilibili: Data, Model, and Deployment Optimizations

Bilibili’s high‑quality ASR system combines large‑scale filtered business data, semi‑supervised Noisy‑Student training, an end‑to‑end CTC model with lattice‑free MMI decoding, and FP16‑optimized FasterTransformer inference on Triton, delivering top‑ranked accuracy, low latency, and scalable deployment for diverse Chinese‑English video content.

ASR · Bilibili · End-to-End
0 likes · 18 min read
DataFunTalk
Feb 5, 2023 · Artificial Intelligence

A Six‑Year Retrospective on Deep Learning Algorithms and Their Applications

This article reviews the author’s six‑year hands‑on experience with deep learning, covering breakthroughs in speech recognition, computer vision, language modeling, reinforcement learning, privacy protection, model compression, recommendation systems, and future research directions, while summarizing technical lessons and practical insights.

AI · Recommendation Systems · model compression
0 likes · 30 min read
DataFunSummit
Jan 14, 2023 · Artificial Intelligence

Key Transformer Model Papers Across Language, Vision, Speech, and Time‑Series Domains

This article surveys the most influential Transformer‑based research papers—from the original Attention Is All You Need work to recent models such as Autoformer and FEDformer—covering breakthroughs in natural language processing, computer vision, speech recognition, and long‑term series forecasting, and provides download links for each.

AI · Time-Series Forecasting · Transformer
0 likes · 17 min read
58 Tech
Jan 12, 2023 · Artificial Intelligence

Efficient Conformer for End‑to‑End Speech Recognition: Model, Implementation, Streaming Inference, and Experimental Results

This article presents a comprehensive overview of the Efficient Conformer model for large‑scale end‑to‑end speech recognition, detailing its architectural improvements such as progressive downsampling and grouped multi‑head self‑attention, the PyTorch implementation in WeNet, streaming inference handling, experimental CER gains on AISHELL‑1 and production data, and future development plans.

ASR · Efficient Conformer · Model Optimization
0 likes · 16 min read
DataFunTalk
Dec 7, 2022 · Artificial Intelligence

Vivo's Self‑Developed Streaming Speech‑Recognition Inference Engine and KunlunChip High‑Performance Inference Library

The article details vivo's development of a high‑accuracy, high‑performance streaming speech‑recognition inference engine built on the WeNet framework, its optimization techniques such as dynamic batching and memory pooling, collaborative acceleration with KunlunChip's high‑performance inference library, and extensive performance benchmarks demonstrating multi‑batch GPU and XPU gains.

AI inference · Kunlun chip · Performance Optimization
0 likes · 10 min read
58 Tech
Sep 29, 2022 · Artificial Intelligence

End-to-End Speech Recognition Optimization and Deployment at 58.com

58.com’s AI Lab presents a comprehensive overview of its end‑to‑end speech recognition system, detailing data collection, semi‑supervised training, Efficient Conformer architecture, model compression, and deployment strategies that together achieve high accuracy across diverse acoustic conditions and large‑scale production workloads.

AI · Deployment · Efficient Conformer
0 likes · 19 min read
Zuoyebang Tech Team
Sep 23, 2022 · Artificial Intelligence

How AI Powers K‑12 Education: Insights from a Chief Algorithm Expert

In this interview, the chief algorithm expert at Zuoyebang discusses how AI technologies such as NLP, speech recognition, large‑model pre‑training, and knowledge‑graph construction are applied to K‑12 education, covering practical challenges, deployment strategies, and future research directions.

AI · Education Technology · Knowledge Graph
0 likes · 27 min read
Zuoyebang Tech Team
Aug 12, 2022 · Artificial Intelligence

How End-to-End Speech Recognition is Transforming AI Voice Applications

The AISummit AI conference highlighted advances in intelligent voice, with experts from Zuoyebang, ByteDance, Microsoft and others discussing end‑to‑end speech recognition, pronunciation correction, and high‑quality speech synthesis, and exploring how multimodal pre‑trained models will shape the future of voice AI.

AI Conference · end-to-end AI · intelligent voice
0 likes · 6 min read
Zuoyebang Tech Team
Jul 29, 2022 · Artificial Intelligence

Boosting Chinese‑English Code‑Switching Speech Recognition with Language ID and LM Enhancements

This report details a series of experiments on Chinese‑English mixed‑language speech recognition, introducing language‑identification loss and language‑model integration to improve acoustic modeling, reduce mixed error rates, and achieve significant gains over a baseline end‑to‑end ASR system.

Code-Switching · Deep Learning · language identification
0 likes · 16 min read
Zuoyebang Tech Team
Jul 14, 2022 · Artificial Intelligence

Enhancing Speech Keyword Detection Using Prefix Automaton Beam Search

This article presents a method to improve keyword detection in large‑scale speech recognition by integrating a prefix automaton into the beam‑search decoding of seq2seq models, enabling real‑time addition of new terms while reducing computational overhead compared to traditional approaches.

Beam Search · Seq2Seq · keyword detection
0 likes · 12 min read
Zuoyebang Tech Team
Jun 10, 2022 · Artificial Intelligence

How End-to-End Phoneme Recognition Boosts English Pronunciation Detection

This article examines the challenges of English pronunciation teaching in China and presents a practical end-to-end phoneme‑level mispronunciation detection system that leverages CTC models, attention‑based text fusion, and data augmentation to dramatically reduce false alarms and improve diagnostic accuracy.

AI education · end-to-end models · language learning
0 likes · 9 min read
NetEase LeiHuo Testing Center
Apr 15, 2022 · Artificial Intelligence

Practical AI‑Powered Voice Recognition for Game Dialogue Testing: A Step‑by‑Step Case Study

This article presents a detailed case study of using AI speech‑recognition techniques—including acoustic modeling with VGG, pypinyin conversion, feature extraction, and CTC decoding—to automatically verify game dialogue audio against script text, outlining the workflow, challenges, implementation details, and experimental results.

AI · CTC decoding · Python
0 likes · 10 min read
DataFunSummit
Apr 1, 2022 · Artificial Intelligence

Detecting Invalid Queries in Voice Interaction: Non‑Human Interaction and Ambiguous Intent Recognition

This talk presents a comprehensive study of invalid query detection in voice assistants, covering the definition of effective and ineffective queries, challenges of non‑human interaction and ambiguous intent recognition, data collection, model design, experimental results, user‑feedback loops, and future research directions.

invalid query detection · machine learning · natural language understanding
0 likes · 20 min read
DataFunTalk
Mar 20, 2022 · Artificial Intelligence

Detecting Invalid Queries in Voice Interaction: Non‑Human Interaction and Ambiguous Intent Recognition

This talk presents a comprehensive study of invalid query detection in voice assistants, covering the definition and taxonomy of invalid queries, challenges of non‑human interaction and ambiguous intent recognition, data collection and labeling strategies, feature engineering, deep neural network modeling, experimental results, user‑feedback loops, and current performance limits.

AI · dialogue system · invalid query
0 likes · 17 min read
Baidu Geek Talk
Feb 14, 2022 · Artificial Intelligence

AI Sign Language Digital Human: Technology, Challenges, and Development by Baidu Intelligent Cloud

Baidu’s AI‑driven sign‑language digital human combines ultra‑accurate speech recognition, specialized translation, and precise gesture‑generation models—backed by extensive motion‑capture data and expert validation—to deliver 24‑hour, high‑fidelity signing for millions of hearing‑impaired users, showcasing inclusive AI communication.

AI · accessibility · gesture generation
0 likes · 12 min read
DataFunSummit
Jan 16, 2022 · Artificial Intelligence

Multimodal Text and Speech Emotion Analysis: Overview, MSCNN‑SPU Model, and Domain Adaptation

This talk presents an overview of text‑plus‑speech multimodal emotion analysis, covering background, single‑modal text and audio models, the MSCNN‑SPU multimodal architecture, domain‑adaptation techniques, and future directions, with detailed model explanations, experimental results, and practical deployment insights.

Audio Processing · Deep Learning · multimodal emotion analysis
0 likes · 40 min read
Python Programming Learning Circle
Jan 10, 2022 · Artificial Intelligence

Building a Siri‑Like Voice Chatbot with Python

This tutorial explains how to create a Siri‑style conversational robot in Python by configuring the environment, describing the speech‑recognition and chatbot principles, and showing the implementation that uses Baidu speech recognition and the Turing chatbot API.

AI · Chatbot · Python
0 likes · 3 min read
Beike Product & Technology
Dec 23, 2021 · Artificial Intelligence

KeSpeech: A Large-Scale Chinese Mandarin Dialect Speech Benchmark Presented at NeurIPS 2021

KeSpeech, a benchmark jointly released by Beike AI and Tsinghua University at NeurIPS 2021, provides a massive Chinese Mandarin dialect dataset covering 30,000 speakers from 34 cities, supporting speech recognition, speaker verification, dialect identification, and voice conversion tasks, and includes rich multi‑scenario and parallel corpora for advanced research.

AI · NeurIPS · dialect benchmark
0 likes · 5 min read
DataFunTalk
Dec 5, 2021 · Artificial Intelligence

Real‑Time Voice Dialogue: Practices, Challenges, and Duplex Conversation

This article presents a comprehensive overview of real‑time voice dialogue systems, covering the hotline robot architecture, unique challenges of spoken interactions, ASR‑robust SLU models, multimodal emotion detection, oral expression handling, and the design and benefits of duplex (full‑duplex) conversational frameworks.

ASR robustness · SLU · duplex conversation
0 likes · 23 min read
DataFunTalk
Nov 5, 2021 · Artificial Intelligence

End-to-End Entity Extraction for Tmall Genie: Speech2Slot Model and Unsupervised Pre‑Training

This article presents the business background of Tmall Genie’s voice‑driven content‑on‑demand service, critiques the traditional pipeline for entity extraction, and details an end‑to‑end speech‑semantic model—including the Speech2Slot architecture, knowledge‑enhanced encoding, and Phoneme‑BERT unsupervised pre‑training—demonstrating significant performance gains in both generation and classification tasks.

Voice Assistant · end-to-end model · entity extraction
0 likes · 14 min read
DataFunSummit
Nov 3, 2021 · Artificial Intelligence

Innovations and Practices of Entity Extraction in Tmall Genie Voice Assistant

The article presents Tmall Genie’s end‑to‑end speech‑semantic understanding pipeline, detailing the limitations of traditional ASR‑NLU‑IR pipelines, introducing the Speech2Slot model with knowledge‑enhanced encoders, and describing unsupervised phoneme‑based pre‑training (Phoneme‑BERT) that improves entity extraction performance in voice‑driven content playback.

Phoneme-BERT · Tmall Genie · end-to-end model
0 likes · 14 min read
HelloTech
Aug 13, 2021 · Backend Development

Understanding WebSocket Protocol and Its Application in Real‑Time Speech Recognition

The article explains why traditional polling methods fall short for real‑time data, introduces the WebSocket protocol’s full‑duplex handshake and heartbeat mechanisms, and demonstrates how a Java‑based WebSocket service efficiently streams audio to an ASR engine for low‑latency speech recognition.

Java · Spring Boot · WebSocket
0 likes · 12 min read
DataFunTalk
Aug 13, 2021 · Artificial Intelligence

Predictions for Speech Recognition Technology Over the Next Decade: Research and Application Directions

The article, written by a Stanford PhD graduate now at Zoom, forecasts that by 2030 speech recognition will rely heavily on semi‑supervised learning, on‑device models, richer representations, and personalization, while applications such as transcription services and voice assistants will evolve modestly.

AI · Future Trends · Semi-supervised Learning
0 likes · 7 min read
58 Tech
Jul 14, 2021 · Artificial Intelligence

Multi‑Turn Voice Bot Architecture and End‑to‑End Dialogue Jump Strategies at 58.com

This article describes the overall architecture of 58.com’s multi‑turn voice robot, explains rule‑based, intent‑based and text‑matching dialogue jump strategies, introduces an end‑to‑end classification approach using TextCNN, and reports its online performance improvements and future research directions.

AI · dialogue management · end-to-end model
0 likes · 17 min read
Beike Product & Technology
Jul 1, 2021 · Artificial Intelligence

Semantic Data Augmentation and GigaSpeech: Highlights of Two INTERSPEECH 2021 Papers from the Beike Voice Team

The article summarizes two INTERSPEECH 2021 papers from Beike's voice technology team: a grammar‑based semantic data augmentation method that improves end‑to‑end Chinese speech recognition, and GigaSpeech, a 10,000‑hour multi‑domain English speech dataset for robust ASR research.

Chinese · GigaSpeech · Interspeech
0 likes · 7 min read
58 Tech
May 31, 2021 · Artificial Intelligence

Practical Implementation of Voice Activity Detection (VAD) for Streaming and Offline Scenarios at 58.com

This article presents the design, training, deployment, and evaluation of a self‑developed Voice Activity Detection system used in both real‑time streaming dialogues and offline audio analysis at 58.com, detailing algorithm choices, smoothing strategies, engineering challenges, and future research directions.

AI · VAD · Voice Activity Detection
0 likes · 18 min read
Sohu Tech Products
May 12, 2021 · Artificial Intelligence

Zero‑Basis Food Sound Recognition with ASR: Theory, Workflow, and Complete Python Code

This article introduces the fundamentals of automatic speech recognition (ASR) for food‑sound classification, explains key audio representations and modeling approaches, and provides a fully runnable Python implementation using librosa, TensorFlow/Keras, and classic machine‑learning tools to train and predict on the Tianchi competition dataset.

ASR · Audio Classification · CNN
0 likes · 11 min read
Didi Tech
Apr 29, 2021 · Artificial Intelligence

Design and Architecture of DiDi Driver-side Intelligent Voice Assistant "XiaoDi"

The document details DiDi’s driver‑side intelligent voice assistant “XiaoDi,” describing its three‑layer architecture—audio source switching controller, semantic‑parsing core, and business API—along with conflict‑resolution mechanisms, multi‑turn dialogue handling, and a four‑region UI design that together enhance driver safety, convenience, and well‑being.

AI · Driver App · Mobile Development
0 likes · 30 min read
ITPUB
Feb 25, 2021 · Artificial Intelligence

How 58.com Scales Voice Quality Inspection with AI-Powered Architecture

This article details the AI-driven intelligent voice quality inspection system built by 58.com, covering its background, multi‑layer architecture, speech recognition, role and tag identification, backend services, and the resulting efficiency gains for large‑scale call‑center operations.

AI · Deep Learning · call center automation
0 likes · 15 min read
58 Tech
Dec 11, 2020 · Artificial Intelligence

Weighted Finite State Transducers (WFST) in Traditional Speech Recognition: Principles and Optimization

This article explains the role of Weighted Finite State Transducers in conventional HMM‑based speech recognition, covering language models, pronunciation dictionaries, WFST definitions, semiring theory, composition and determinization operations, decoding graph construction (HCLG), lattice rescoring, and practical optimization techniques for real‑world scenarios.

ASR · Language Model · WFST
0 likes · 23 min read
58 Tech
Nov 27, 2020 · Artificial Intelligence

An Overview of Kaldi Chain Model Speech Recognition and Its Relationship with HMM‑DNN and Discriminative Training

This article explains the Kaldi chain model speech‑recognition system, covering HMM‑DNN fundamentals, discriminative (MMI) training, the special single‑state HMM topology, TDNN architecture, training pipelines, and experimental results that demonstrate its performance advantages over traditional GMM‑based approaches.

HMM-DNN · Kaldi · TDNN
0 likes · 19 min read
Sohu Tech Products
Aug 19, 2020 · Artificial Intelligence

ASR Error Correction with BERT, ELECTRA and a Fuzzy‑Phoneme Generator: Techniques from Xiaomi AI

This article describes how Xiaomi's AI team tackles Automatic Speech Recognition (ASR) query errors by analyzing error patterns, employing BERT, ELECTRA and a soft‑masked BERT model, generating synthetic noisy data with a fuzzy‑phoneme generator, and presenting experimental results and future research directions.

ASR · BERT · Deep Learning
0 likes · 18 min read
58 Tech
Aug 19, 2020 · Artificial Intelligence

Speech Recognition in 58.com: Application Scenarios, Data Collection, Kaldi Chain Model Practice, and End‑to‑End Exploration

This article presents a comprehensive overview of how 58.com leverages large‑scale voice data from call‑center, private phone, and micro‑chat platforms, detailing data collection, annotation, Kaldi‑based chain model training, lattice‑free techniques, and end‑to‑end Transformer‑CTC models to improve Chinese speech recognition performance.

ASR · Chinese · Deep Learning
0 likes · 16 min read
58 Tech
Aug 7, 2020 · Artificial Intelligence

Technical Overview of 58.com Intelligent Voice Analysis Platform

The article presents a comprehensive technical overview of 58.com’s intelligent voice analysis platform, detailing its business background, system architecture, speech and NLP technologies, speaker diarization methods, model performance, data labeling workflow, and practical applications in call‑center quality inspection and user profiling.

AI Platform · data labeling · natural language processing
0 likes · 11 min read
58 Tech
Aug 3, 2020 · Artificial Intelligence

Intelligent Voice Quality Inspection System Architecture and Implementation at 58.com

The article details the design and deployment of an AI-powered intelligent voice quality inspection system at 58.com, covering its overall architecture, speech recognition, role identification, tag detection, rechecking platform, and backend infrastructure, and demonstrates its impact on call‑center efficiency and service quality.

AI · Backend Architecture · Deep Learning
0 likes · 12 min read
DataFunTalk
Jul 15, 2020 · Artificial Intelligence

ASR Error Correction with BERT, ELECTRA, and a Fuzzy‑Phoneme Generator: Methods, Experiments, and Future Directions

This article presents a comprehensive overview of automatic speech recognition (ASR) error correction techniques employed by Xiaomi's Xiao‑Ai team, detailing problem definition, related work on BERT and ELECTRA, a custom generator‑discriminator architecture with a fuzzy‑phoneme simulator, experimental results, and prospective research directions.

ASR · BERT · ELECTRA
0 likes · 19 min read
58 Tech
Jun 15, 2020 · Artificial Intelligence

Intelligent Voice Robot Architecture, Core Technologies, and Enterprise Applications

This article presents the engineering architecture of intelligent voice robots, detailing voice preprocessing, intent recognition, slot extraction, dialogue management, and showcases multiple enterprise use cases that improve efficiency and revenue across sales, customer service, and recruitment.

Enterprise Automation · dialogue management · intent classification
0 likes · 14 min read
Didi Tech
May 25, 2020 · Artificial Intelligence

How Didi Harnesses Cutting‑Edge Speech Recognition: From ASR Basics to Transformer Models

This article provides a comprehensive technical overview of modern speech recognition, covering Didi’s driver‑assistant and smart‑customer‑service applications, fundamental ASR concepts, classic GMM‑HMM methods, deep‑learning breakthroughs such as DNN‑HMM, CTC, attention‑based and transformer models, practical training tricks, signal‑processing steps, and multimodal fusion techniques.

ASR · CTC · Deep Learning
0 likes · 16 min read
Didi Tech
Apr 2, 2020 · Artificial Intelligence

Interview: Didi AI’s DELTA – A Unified Framework for NLP and Speech Model Development

In this interview, Didi AI Labs’ Han Kun explains how the DELTA platform unifies TensorFlow‑based NLP and speech models—supporting tasks from text classification to voice emotion recognition—through a modular, easily deployable architecture, accelerating development, powering Didi products, and now open‑sourced for broader AI collaboration.

AI Platform · Delta · NLP
0 likes · 14 min read
Alibaba Cloud Developer
Mar 30, 2020 · Artificial Intelligence

How AI Is Transforming Language, Speech, and Vision: Key Technologies and Future Trends

This article provides a comprehensive overview of AI's rapid evolution, covering deep learning foundations, machine learning components, natural language processing advances, speech recognition breakthroughs, multimodal interaction, computer vision progress, model compression techniques, and the shift from data‑driven to knowledge‑based AI approaches.

machine learning · speech recognition
0 likes · 19 min read
DataFunTalk
Mar 19, 2020 · Artificial Intelligence

Advances in Voice Interaction: 360's Intelligent Dialogue System Architecture and Core Technologies

This article presents a comprehensive overview of 360's voice interaction platform, detailing dialogue system fundamentals, platform architecture, and core technologies such as semantic understanding, dialog management, and question answering, all driven by deep learning and multimodal innovations.

AI · Knowledge Graph · dialogue system
0 likes · 16 min read
TAL Education Technology
Feb 28, 2020 · Artificial Intelligence

TPNN Multi‑GPU Training and Mobile Optimization for Children's Acoustic Speech Recognition Models

This article describes the TPNN deep‑learning platform’s multi‑GPU acceleration, data‑parallel BMUF training, LSTM‑CTC acoustic modeling, and a suite of mobile‑side optimizations—including model pruning, 8‑bit quantization, low‑precision matrix multiplication and mixed‑precision computation—that together achieve over 92% recognition accuracy for children’s English speech on both server and mobile devices.

BMUF · CTC · Deep Learning
0 likes · 15 min read
DataFunTalk
Feb 3, 2020 · Artificial Intelligence

Advances in Speech Recognition: Concepts, Deep Learning Methods, and Didi’s Applications

This article presents a comprehensive overview of modern speech recognition technology, covering basic ASR concepts, classic acoustic and language models, deep‑learning approaches such as DNN‑HMM, CTC, attention‑based and transformer models, multimodal fusion, signal‑processing pipelines, and practical deployment considerations at Didi.

ASR · CTC · Deep Learning
0 likes · 15 min read
21CTO
Jan 31, 2020 · Artificial Intelligence

How Microsoft’s First Chinese AI Fellow Is Driving Speech and Language Breakthroughs

Microsoft appointed its first Chinese Global Technical Fellow, Huang Xuedong, as the company’s Global AI CTO, overseeing Azure’s speech, translation, vision, and language services. The article highlights his achievements, including reaching human‑level word error rates in speech recognition and leading AI research teams.

AI research · Azure · Microsoft
0 likes · 7 min read
iQIYI Technical Product Team
Jan 17, 2020 · Artificial Intelligence

Voice and Language Technologies in Natural Interaction: iQIYI HomeAI Speech Interaction System

The talk introduced iQIYI’s HomeAI platform, which combines user profiling (including voiceprint and age detection) with automatic video semantic extraction to enable natural, multi‑turn voice‑based video search—addressing hot‑content updates, contextual awareness, device environments, and personalized recommendations for screen‑less or accessibility‑focused users.

AI · Context-Aware · entity extraction
0 likes · 19 min read
58 Tech
Nov 18, 2019 · Artificial Intelligence

Comprehensive Solution for Human‑Machine Voice Dialogue Robot at 58.com

This article presents a complete solution for 58.com’s human‑machine voice dialogue robot, detailing its background, overall architecture, intelligent outbound process, core functions such as call service, anti‑spam, status recognition, multi‑turn dialogue management, intent classification, slot extraction, whole‑round intent detection, and various practical application scenarios.

AI · Telephony · dialogue management
0 likes · 13 min read
DataFunTalk
Nov 18, 2019 · Artificial Intelligence

Complete Solution of 58.com Human-Machine Voice Dialogue Robot: Architecture, Core Modules, and Application Scenarios

This article presents the end‑to‑end solution of 58.com’s voice dialogue robot, detailing its overall architecture, intelligent outbound process, core functions such as call dialing, status recognition, dialogue management, intent detection, and showcasing multiple real‑world application scenarios that improve sales, operations, and customer service efficiency.

AI · Intent Detection · Telephony
0 likes · 12 min read
WeChat Backend Team
Sep 3, 2019 · Artificial Intelligence

How Tencent Scaled Massive n‑gram Language Models for Real‑Time Speech Recognition

This article presents a distributed system that efficiently supports large‑scale n‑gram language models for automatic speech recognition by introducing caching, a two‑level distributed index, batch processing, and a cascading fault‑tolerance mechanism, demonstrating robust scalability and low communication overhead in Tencent's WeChat ASR service.

Language Model · N-gram · caching
0 likes · 35 min read
Tencent Cloud Developer
Aug 25, 2019 · Artificial Intelligence

Understanding Intelligent Speech Recognition Technology

Intelligent speech recognition converts spoken audio to text through a pipeline of feature extraction, acoustic modeling, and language modeling. Deep neural networks, especially CNN, LSTM, and hybrid CLDNN architectures, drive its accuracy and enable mobile voice input, call‑center transcription, and legal record keeping. The article also covers Tencent Cloud ASR, for which it cites 97% Mandarin accuracy, speaker separation, and on‑premises deployment.

AI · Language Model · Tencent Cloud
0 likes · 7 min read
58 Tech
Aug 14, 2019 · Artificial Intelligence

Design and Implementation of a Dialogue Management System for Intelligent Voice Robots

This article presents a comprehensive overview of an intelligent voice robot's dialogue management system, detailing its architecture, natural language understanding components, dialogue manager design, strategy handling, and workflow processes to achieve fluent multi‑turn interactions in telephone scenarios.

AI · NLU · conversation system
0 likes · 14 min read
Didi Tech
Aug 2, 2019 · Artificial Intelligence

How Didi’s Open‑Source DELTA Platform Accelerates NLP and Speech Model Development

At ACL 2019, Didi unveiled DELTA, an open‑source TensorFlow‑based training framework that unifies NLP and speech tasks, offers configurable pipelines, benchmark models, and seamless deployment, enabling AI developers to quickly move from research to production while leveraging Didi’s extensive open‑source ecosystem.

AI Platform · Model Training · NLP
0 likes · 6 min read
Alibaba Cloud Developer
Jun 20, 2019 · Artificial Intelligence

Unlock Cutting-Edge Voice AI: Highlights from Alibaba’s Speech & Signal Processing eBook

This article introduces Alibaba's new e‑book collection of five ICASSP‑accepted papers that showcase advances in speech recognition, synthesis, and emotion detection, detailing novel models like DFSMN, A‑LSTM, and speaker‑adaptation techniques that dramatically improve speed, size, and accuracy.

AI voice · Deep Learning · Emotion Recognition
0 likes · 6 min read
DataFunTalk
May 15, 2019 · Artificial Intelligence

AI‑Driven Audio Content Understanding and Safety for Live Streams

This article discusses how AI can automatically understand and secure audio content in live streams. It covers the challenges of manual audio analysis, outlines a four‑step pipeline (audio segmentation, speech‑to‑text, labeling, and synthesis), and describes models such as VAD, ASR, sound classification, text recognition, and behavior detection for live‑stream moderation.

AI · Audio Processing · Content Safety
0 likes · 11 min read