Tagged articles

speech recognition

125 articles · Page 1 of 2

Jun 9, 2026 · Artificial Intelligence

Open-Source ASR That Runs Faster on CPU Than Whisper on GPU

FunASR is an industrial‑grade, open‑source speech‑recognition toolkit that combines VAD, transcription, punctuation, speaker diarization and emotion detection in one call, achieving up to 170× real‑time on GPU and 17× on CPU, outperforming Whisper while supporting 50+ languages and offering OpenAI‑compatible APIs.

ASR · CPU performance · FunASR

0 likes · 13 min read

Weekly Large Model Application

May 28, 2026 · Artificial Intelligence

Open-Source ASR Optimization: Solving Misrecognition of Proper Nouns and Real-Time Lag

This guide analyzes common deployment problems of open‑source speech‑recognition models—misrecognizing proper nouns and lagging behind spoken input—and presents a decision‑tree‑based, five‑layer optimization framework that balances accuracy and speed through concrete techniques such as hot‑word bias, model fine‑tuning, INT8 quantization, and appropriate runtimes.

ASR · Accuracy · Optimization

0 likes · 10 min read

Woodpecker Software Testing

May 14, 2026 · Artificial Intelligence

AI Testing in Practice: 3 Real-World Case Studies

The article examines how AI testing has shifted from simple functional checks to evaluating model reliability, fairness, robustness, and explainability, illustrating the shift with three detailed client cases—financial bias audit, automotive voice‑assistant stress testing, and medical‑imaging consistency verification.

AI testing · Aequitas · RAGAS

0 likes · 8 min read

Machine Heart

May 8, 2026 · Artificial Intelligence

Can Qianwen’s Desktop Voice Input Finally Make the Keyboard Obsolete?

The article evaluates Qianwen’s new desktop voice‑input system, showing how it filters filler words, understands screen context, executes AI commands, and generates structured text, PPTs, and Excel reports, positioning voice as a viable replacement for traditional keyboard typing.

AI assistant · Qianwen · desktop AI

0 likes · 12 min read

Weekly Large Model Application

May 5, 2026 · Artificial Intelligence

What Pretraining Actually Teaches: Listening to All Sounds

The article explains that pretraining for speech models functions like a broad liberal‑arts education, teaching universal acoustic and linguistic patterns through next‑token prediction, joint audio‑text training, and masked‑prediction or contrastive objectives, while clarifying common misconceptions and highlighting data bias and the need for clean, task‑specific fine‑tuning.

audio-text alignment · data bias · fine-tuning

0 likes · 6 min read

Weekly Large Model Application

May 5, 2026 · Artificial Intelligence

Why More GPUs and Data Aren’t Enough: Defining Scenarios and Data for Speech Model Training

The article argues that successful speech model training starts with understanding user scenarios, then selecting appropriate data, and finally choosing metrics, detailing six key questions, data sourcing strategies, evaluation criteria, and compliance considerations to avoid the misconception that sheer data volume guarantees performance.

AI training · data collection · model evaluation

0 likes · 6 min read

Geek Labs

May 3, 2026 · Artificial Intelligence

VibeVoice: Microsoft’s Open‑Source Cutting‑Edge Speech AI Models

The article introduces Microsoft’s open‑source VibeVoice project, detailing its long‑audio ASR‑7B and real‑time TTS‑0.5B models, the continuous speech tokenizer and next‑token diffusion techniques, and provides quick‑start instructions for online demos and local deployment via Hugging Face.

Hugging Face · Microsoft · Text‑to‑Speech

0 likes · 3 min read

James' Growth Diary

May 2, 2026 · Artificial Intelligence

How to Add Real‑Time Speech Recognition and Streaming TTS to Your AI Agent

This guide walks through choosing the right voice‑agent architecture, implementing streaming ASR with WebSocket, triggering sentence‑by‑sentence TTS, wiring the three layers together via async generators, optimizing latency to under a second, and avoiding common pitfalls such as omitting VAD or checkpoint persistence.

LangChain · Text‑to‑Speech · WebSocket

0 likes · 19 min read

Wuming AI

Apr 21, 2026 · Artificial Intelligence

Can AI Voice Input Boost Office Productivity? A Hands‑On Review of Typeless and ShandianShuo

The article examines how AI‑powered voice input can replace keyboard typing in office settings, evaluates environmental constraints, compares two leading tools—Typeless and ShandianShuo—through feature lists, limitations, and real‑world usage scenarios, and concludes with practical advice on choosing the right solution.

AI voice input · ShandianShuo · Typeless

0 likes · 7 min read

Weekly Large Model Application

Apr 16, 2026 · Artificial Intelligence

Deep Dive into Conformer: The Convolution‑Augmented Transformer for Speech Recognition

The Conformer architecture blends global self‑attention with a depthwise separable convolution module in a Macaron‑style block, addressing the strong local time‑frequency structure and long sequence length of speech signals while keeping computational cost manageable for modern ASR systems.

ASR · Conformer · Convolution

0 likes · 11 min read

AI Waka

Mar 26, 2026 · Artificial Intelligence

Building Production‑Ready AI Agents with NVIDIA Nemotron: A Full‑Stack Guide

This guide explains how to assemble NVIDIA's Nemotron Speech, RAG, and Safety models into a low‑latency, secure production AI agent stack, covering performance benchmarks, multimodal retrieval, safety data sets, integration code, and deployment options for cloud, on‑premise, and edge environments.

Content Safety · NVIDIA · Production Deployment

0 likes · 9 min read

Weekly Large Model Application

Mar 4, 2026 · Artificial Intelligence

Qwen3‑ASR vs FunASR: In‑Depth Technical Comparison

This article provides a detailed side‑by‑side analysis of the open‑source ASR tools FunASR and Qwen3‑ASR, covering team origins, model architectures, language coverage, speed, deployment requirements, and ideal use‑cases so readers can decide which solution fits their projects best.

ASR · FunASR · Large Language Model

0 likes · 10 min read

Weekly Large Model Application

Feb 22, 2026 · Artificial Intelligence

2026 Guide to Running Open‑Source ASR on Pure CPU

The 2026 overview details lightweight, heavily quantized open‑source speech‑recognition models and CPU‑specific inference engines, offering concrete tips, model comparisons, and a concise selection guide that enable real‑time, GPU‑free ASR deployment with low latency and high stability.

ASR · CPU inference · Quantization

0 likes · 4 min read

AI Engineering

Feb 15, 2026 · Artificial Intelligence

Qwen3‑ASR Runs Natively on Apple Silicon via MLX for Full‑Speed Speech Recognition

A developer has re‑implemented the state‑of‑the‑art Qwen3‑ASR model in MLX, enabling native execution on Apple M1‑M4 chips with real‑time factors as low as 0.08, 4‑bit quantization speedups of 4.7×, multilingual support for 52 languages, and features such as word‑level timestamps and streaming transcription.

Apple Silicon · MLX · Quantization

0 likes · 5 min read

Old Zhang's AI Learning

Feb 1, 2026 · Artificial Intelligence

Microsoft VibeVoice‑ASR Open‑Source: One‑Shot 60‑Minute Transcription with Speaker ID and Timestamps

Microsoft’s newly open‑sourced VibeVoice‑ASR model transcribes up to 60 minutes of audio in a single pass, preserving global context while providing built‑in speaker diarization and timestamps. It supports 50+ languages, offers custom hot‑word injection, and can be deployed via Docker, Gradio, or vLLM for high‑throughput API services.

ASR · Docker · LoRA

0 likes · 9 min read

Woodpecker Software Testing

Jan 27, 2026 · Artificial Intelligence

How to Build a Multimodal AI Assistant with FastAPI, Alibaba Cloud and DashScope

This guide walks through configuring Alibaba Cloud credentials, implementing a FastAPI backend with email function calling, Alibaba OpenSearch, image generation via DashScope, speech recognition, and a responsive HTML/CSS/JavaScript front‑end that supports text chat, image recognition, image synthesis, and voice interaction.

Alibaba Cloud · DashScope · FastAPI

0 likes · 38 min read

Woodpecker Software Testing

Jan 25, 2026 · Artificial Intelligence

Integrating LLMs with Speech: Whisper, Vosk, and Alibaba Cloud in Python and JavaScript

This tutorial walks through setting up local speech recognition with OpenAI's Whisper and Vosk, leveraging Alibaba Cloud's ASR services, building a WebSocket server/client for real‑time audio streaming, capturing audio in the browser via MediaRecorder or RecordRTC, and performing speech synthesis with pyttsx3 and Alibaba's Sambert model.

Alibaba Cloud · JavaScript · Python

0 likes · 20 min read

AI Waka

Jan 24, 2026 · Artificial Intelligence

Building Production‑Ready AI Agents with NVIDIA’s Nemotron Stack

The article explains how NVIDIA’s Nemotron Stack combines ultra‑fast speech recognition, multimodal retrieval, and advanced safety models into a unified, low‑latency pipeline, offering practical integration code, performance insights, and deployment options for turning experimental AI agents into production‑grade services.

AI Agents · Content Safety · NVIDIA

0 likes · 9 min read

Old Zhang's AI Learning

Jan 23, 2026 · Artificial Intelligence

Open‑Source GLM‑ASR‑Nano‑2512: Chinese Dialect‑Optimized Speech Recognition on Consumer‑Grade GPUs

GLM‑ASR‑Nano‑2512, a 1.5 B‑parameter open‑source speech‑recognition model released in December 2025, delivers state‑of‑the‑art accuracy on Chinese dialects and low‑volume audio, outperforms Whisper V3 on benchmark tests, runs on consumer GPUs, and provides detailed installation and deployment guides for transformers, vLLM and SGLang.

Chinese dialects · GLM-ASR-Nano-2512 · SGLang

0 likes · 11 min read

Xiaomi Tech

Dec 19, 2025 · Artificial Intelligence

AI Evolution Mirrors Biology—Open Source Speeds Progress 1,000× (Daniel Povey)

Daniel Povey compares AI's trial‑and‑error development to biological evolution, argues that open‑source collaboration can make research a thousand times faster, and outlines his dual‑strategy approach and the three breakthroughs of the new Zapformer speech model.

AI · Transformer · Zapformer

0 likes · 12 min read

Xiaomi Tech

Dec 11, 2025 · Artificial Intelligence

Open‑Source AI Evolution: From Zipformer to Zapformer and Smart Automotive Quality

The MEET 2026 conference showcased Daniel Povey’s analogy of AI evolution to biological evolution, Xiaomi’s open‑source AI breakthroughs such as Zipformer and Zapformer, and the company’s multi‑agent automotive quality engine that leverages large‑scale models, data‑driven diagnostics, and open collaboration to accelerate intelligent technology across industries.

Automotive Quality · Model Evolution · artificial-intelligence

0 likes · 12 min read

360 Smart Cloud

Dec 1, 2025 · Artificial Intelligence

How to Build Real‑Time Streaming Speech Recognition with a Large‑Model API (Go & Python)

This guide explains the background of speech‑to‑text technology, introduces the large‑model streaming speech recognition API, walks through obtaining an API key, and provides detailed Go and Python code for establishing a WebSocket connection, sending full‑client and audio‑only requests, and parsing server responses.

AI · Streaming API · golang

0 likes · 12 min read

AntTech

Oct 28, 2025 · Artificial Intelligence

Ming-Flash-Omni-Preview: 103B Open-Source Multimodal Model Excelling in Image, Video, and Speech

Introducing Ming‑Flash‑Omni‑Preview, a 103‑billion‑parameter open‑source multimodal model built on a sparse MoE architecture that delivers state‑of‑the‑art performance in controllable image generation, streaming video understanding, and context‑aware speech recognition, surpassing prior models on GenEval and GEdit benchmarks.

Large Language Model · Multimodal · Sparse MoE

0 likes · 8 min read

Liangxu Linux

Oct 21, 2025 · Artificial Intelligence

Explore 4 Must‑Try Open‑Source AI Tools: Datasets, Finance Model, Real‑Time Speech, and Agent Toolbox

This article introduces four high‑impact open‑source projects—a curated public dataset collection, the Kronos financial K‑line analysis model, WhisperLiveKit for real‑time speech transcription, and Youtu‑agent for building versatile AI agents—each with descriptions, key features, and GitHub links.

AI models · Financial Analysis · agent toolbox

0 likes · 6 min read

Python Programming Learning Circle

Oct 7, 2025 · Artificial Intelligence

Build a Voice-Enabled Chatbot in Python Using Baidu AI and Qingyunke

This tutorial walks through creating a Python program that captures spoken input, converts it to text with Baidu AI, sends the text to the free Qingyunke chatbot API for a response, and finally synthesizes the reply back into speech, complete with code snippets and setup instructions.

Baidu AI · Chatbot · Text‑to‑Speech

0 likes · 9 min read

Python Programming Learning Circle

Sep 22, 2025 · Artificial Intelligence

Build a Voice‑Enabled Chatbot in Python with Baidu AI and Qingyunke

Learn how to create a Python program that captures spoken input, converts it to text using Baidu's speech‑recognition API, sends the text to the free Qingyunke chatbot for intelligent replies, and then synthesizes the response back into speech, with complete code snippets and setup instructions.

Baidu AI · Chatbot · Python

0 likes · 10 min read

Python Programming Learning Circle

Aug 22, 2025 · Artificial Intelligence

Build a Powerful Python Voice Assistant with GPT‑4: Step‑by‑Step Guide

This tutorial walks you through creating a Python voice assistant powered by GPT‑4, covering project setup, virtual environment creation, required package installation, core code for speech recognition, text‑to‑speech, command handling, and optional speech‑rate adjustment.

GPT-4 · Text‑to‑Speech · Voice Assistant

0 likes · 17 min read

Baidu Maps Tech Team

Jul 31, 2025 · Artificial Intelligence

How Baidu’s AI Voice Assistant Turns Speech into Precise Navigation Commands

This article explains how Baidu Map’s AI voice assistant converts spoken commands into precise navigation actions by detailing the speech‑to‑text pipeline, intent parsing, template and generative approaches, tool‑calling mechanisms, memory and reflection capabilities, and future directions for intelligent agents.

AI · Intent Parsing · LLM

0 likes · 14 min read

AntTech

Jul 3, 2025 · Artificial Intelligence

How Ant Group’s AI Multimodal Evaluation Transforms Image, Speech, and Video Quality Testing

In a QECon 2025 talk, Ant Group’s AI team detailed a comprehensive multimodal evaluation framework that leverages large‑model metrics, custom pipelines, and benchmark datasets to assess image generation, speech recognition, and video quality, while also contributing to industry standards and academic research.

AI evaluation · Multimodal · image assessment

0 likes · 16 min read

iQIYI Technical Product Team

Jul 3, 2025 · Artificial Intelligence

Three iQIYI AI Papers Break New Ground at ACL 2025 & INTERSPEECH 2025

iQIYI’s AI research team had three papers accepted: two at ACL 2025 (one main‑conference paper and one Findings paper) and one at INTERSPEECH 2025. The papers cover long‑context large language model evaluation, Chinese novel summarization, and efficient Thai speech recognition, with links to each work.

ACL 2025 · AI research · INTERSPEECH 2025

0 likes · 7 min read

Test Development Learning Exchange

Jan 13, 2025 · Artificial Intelligence

Python Tool for Converting English Videos to Chinese Dubbed Videos with Subtitles

This article provides a comprehensive guide on developing a Python tool to convert English videos into versions with Chinese dubbing and subtitles, covering all steps from audio extraction to final synthesis.

AI tools · FFmpeg · Machine Translation

0 likes · 5 min read

System Architect Go

Nov 28, 2024 · Artificial Intelligence

An Overview of Modern AI Audio Technologies: ASR, TTS, and Voice Cloning

This article explains how modern AI advances have transformed audio processing, covering digital audio fundamentals, automatic speech recognition (ASR), text‑to‑speech (TTS), voice cloning techniques, and provides practical Python code examples using OpenAI Whisper and HuggingFace TTS models.

AI · Audio Processing · Text‑to‑Speech

0 likes · 7 min read

Huolala Tech

Jul 9, 2024 · Artificial Intelligence

Building an In-Car Voice Assistant: From Wake‑Word to NLP

This article details the end‑to‑end development of an in‑vehicle voice assistant, covering motivation, functional design, technology stack selection, dialogue flow, privacy, third‑party integration, wake‑word detection, on‑device speech recognition, noise filtering, NLP processing, and deployment considerations.

Voice Assistant · in‑car technology · natural language processing

0 likes · 18 min read

Ops Development & AI Practice

Jun 22, 2024 · Artificial Intelligence

Why Transformers Revolutionized AI: From NLP to Vision and Speech

Transformers, introduced in 2017, have reshaped neural networks by leveraging attention mechanisms to outperform RNNs and CNNs across NLP, computer vision, and speech tasks, offering parallel processing, long‑range dependency capture, and versatile applications such as translation, text generation, image classification, and speech recognition.

Attention Mechanism · NLP · Transformer

0 likes · 6 min read

Huolala Tech

Nov 23, 2023 · Artificial Intelligence

How HuoLaLa Built a Custom ASR System to Boost Accuracy and Cut Costs

This article details HuoLaLa's development of an in‑house Automatic Speech Recognition system, covering its architecture, VAD optimization, language‑model and hot‑word enhancements, punctuation restoration, task and resource scheduling, and the resulting improvements in accuracy and cost efficiency.

ASR · Language Model · Resource Scheduling

0 likes · 18 min read

DataFunTalk

Sep 23, 2023 · Artificial Intelligence

Paraformer: An Industrial Non‑Autoregressive End‑to‑End Speech Recognition Model and Its Deployment on ModelScope

This article introduces the Paraformer non‑autoregressive end‑to‑end speech recognition model released by Alibaba DAMO Academy, details its architecture, training strategies, large‑scale performance, and provides step‑by‑step guidance for using and fine‑tuning the model on the ModelScope platform with the FunASR toolkit.

ASR · ModelScope · Paraformer

0 likes · 13 min read

Test Development Learning Exchange

Jul 27, 2023 · Artificial Intelligence

Splitting PDF Files and Recognizing MP3 Audio with Python

This guide explains how to split a PDF into separate files using PyPDF2 and provides two Python approaches for converting MP3 audio to text—one leveraging Google Speech‑Recognition for higher accuracy and another using PocketSphinx for complete transcription—complete with ready‑to‑run code examples.

PDF · PyPDF2 · Python

0 likes · 5 min read

58 Tech

Jul 6, 2023 · Artificial Intelligence

Design and Optimization of a Kaldi‑Based Speech Recognition Backend at 58.com

This article details the evolution from the initial Kaldi‑based speech recognition architecture (version 1.0) to a re‑engineered version 2.0, describing business background, service components, identified shortcomings, and a series of performance, concurrency, GPU, I/O, GC, and dispatch optimizations that dramatically improve resource utilization, latency, and reliability for large‑scale voice processing at 58.com.

AI · GPU · Kaldi

0 likes · 15 min read

58 Tech

Jun 21, 2023 · Artificial Intelligence

GPU Hotword Enhancement for WeNet End-to-End Speech Recognition

This article explains the design, implementation, and experimental evaluation of hot‑word augmentation in WeNet's GPU runtime, detailing how character‑ and word‑based language model scoring are extended to boost recognition of rare proper nouns in both streaming and non‑streaming ASR services.

ASR · CTC decoder · GPU

0 likes · 12 min read

php Courses

Jun 17, 2023 · Mobile Development

Implementing Voice Functionality in WeChat Mini Programs

This guide explains how to integrate WeChat Mini Program voice capabilities by importing the recorder and audio APIs, recording audio, uploading for speech recognition, and playing back the result, with example code snippets for each step.

JavaScript · Voice API · WeChat Mini Program

0 likes · 3 min read

DataFunSummit

Jun 15, 2023 · Artificial Intelligence

Paraformer: An Industrial Non‑Autoregressive End‑to‑End Speech Recognition Model

This article introduces the Paraformer model released by Alibaba DAMO Academy on ModelScope, detailing its non‑autoregressive architecture, training strategies, performance on benchmark datasets, and step‑by‑step guidance for fine‑tuning and deploying the model using FunASR and ModelScope pipelines.

ASR · ModelScope · Paraformer

0 likes · 13 min read

21CTO

Jun 10, 2023 · Artificial Intelligence

How Huang Xuedong’s Team Achieved Human-Level Speech Recognition at Microsoft

The article chronicles the career of Chinese AI pioneer Huang Xuedong, detailing his education, rise at Microsoft, leadership of Azure AI, groundbreaking human‑level speech recognition breakthroughs, the engineering feats behind them—including a ten‑network model and the CNTK framework—and his recent move to Zoom.

CNTK · Microsoft · artificial-intelligence

0 likes · 14 min read

Meituan Technology Team

Apr 13, 2023 · Artificial Intelligence

Peak-First Regularization for Low-Latency Streaming Speech Recognition

The paper presents a low‑latency streaming speech‑recognition solution that reframes latency reduction as a knowledge‑distillation task, using a simple peak‑first regularization term to shift CTC output probabilities leftward and achieve up to 200 ms average latency reduction without harming word error rate.

CTC · Latency Reduction · Peak-First Regularization

0 likes · 21 min read

Bilibili Tech

Feb 28, 2023 · Artificial Intelligence

High‑Quality Automatic Speech Recognition (ASR) Solutions at Bilibili: Data, Model, and Deployment Optimizations

Bilibili’s high‑quality ASR system combines large‑scale filtered business data, semi‑supervised Noisy‑Student training, an end‑to‑end CTC model with lattice‑free MMI decoding, and FP16‑optimized FasterTransformer inference on Triton, delivering top‑ranked accuracy, low latency, and scalable deployment for diverse Chinese‑English video content.

ASR · Bilibili · End-to-End

0 likes · 18 min read

DataFunTalk

Feb 5, 2023 · Artificial Intelligence

A Six‑Year Retrospective on Deep Learning Algorithms and Their Applications

This article reviews the author’s six‑year hands‑on experience with deep learning, covering breakthroughs in speech recognition, computer vision, language modeling, reinforcement learning, privacy protection, model compression, recommendation systems, and future research directions, while summarizing technical lessons and practical insights.

AI · Recommendation Systems · model compression

0 likes · 30 min read

DataFunSummit

Jan 14, 2023 · Artificial Intelligence

Key Transformer Model Papers Across Language, Vision, Speech, and Time‑Series Domains

This article surveys the most influential Transformer‑based research papers—from the original Attention Is All You Need work to recent models such as Autoformer and FEDformer—covering breakthroughs in natural language processing, computer vision, speech recognition, and long‑term series forecasting, and provides download links for each.

AI · Language Models · Time-Series Forecasting

0 likes · 17 min read

58 Tech

Jan 12, 2023 · Artificial Intelligence

Efficient Conformer for End‑to‑End Speech Recognition: Model, Implementation, Streaming Inference, and Experimental Results

This article presents a comprehensive overview of the Efficient Conformer model for large‑scale end‑to‑end speech recognition, detailing its architectural improvements such as progressive downsampling and grouped multi‑head self‑attention, the PyTorch implementation in WeNet, streaming inference handling, experimental CER gains on AISHELL‑1 and production data, and future development plans.

ASR · Efficient Conformer · Model Optimization

0 likes · 16 min read

DataFunTalk

Dec 7, 2022 · Artificial Intelligence

Vivo's Self‑Developed Streaming Speech‑Recognition Inference Engine and KunlunChip High‑Performance Inference Library

The article details vivo's development of a high‑accuracy, high‑performance streaming speech‑recognition inference engine built on the wenet framework, its optimization techniques such as dynamic batching and memory pooling, collaborative acceleration with KunlunChip's high‑performance inference library, and extensive performance benchmarks demonstrating multi‑batch GPU and XPU gains.

AI inference · Kunlun chip · Performance Optimization

0 likes · 10 min read

58 Tech

Sep 29, 2022 · Artificial Intelligence

End-to-End Speech Recognition Optimization and Deployment at 58.com

58.com’s AI Lab presents a comprehensive overview of its end‑to‑end speech recognition system, detailing data collection, semi‑supervised training, Efficient Conformer architecture, model compression, and deployment strategies that together achieve high accuracy across diverse acoustic conditions and large‑scale production workloads.

AI · Efficient Conformer · End-to-End

0 likes · 19 min read

Zuoyebang Tech Team

Sep 23, 2022 · Artificial Intelligence

How AI Powers K‑12 Education: Insights from a Chief Algorithm Expert

In this interview, the chief algorithm expert at Zuoyebang discusses how AI technologies such as NLP, speech recognition, large‑model pre‑training, and knowledge‑graph construction are applied to K‑12 education, covering practical challenges, deployment strategies, and future research directions.

AI · Education Technology · Knowledge Graph

0 likes · 27 min read

Zuoyebang Tech Team

Aug 12, 2022 · Artificial Intelligence

How End-to-End Speech Recognition is Transforming AI Voice Applications

The AISummit AI conference highlighted advances in intelligent voice, with experts from ZuoYeBang, ByteDance, Microsoft and others discussing end‑to‑end speech recognition, pronunciation correction, and high‑quality speech synthesis, and exploring how multimodal pre‑trained models will shape the future of voice AI.

AI Conference · end-to-end AI · intelligent voice

0 likes · 6 min read

Zuoyebang Tech Team

Jul 29, 2022 · Artificial Intelligence

Boosting Chinese‑English Code‑Switching Speech Recognition with Language ID and LM Enhancements

This report details a series of experiments on Chinese‑English mixed‑language speech recognition, introducing language‑identification loss and language‑model integration to improve acoustic modeling, reduce mixed error rates, and achieve significant gains over a baseline end‑to‑end ASR system.

Code-Switching · deep learning · language identification

0 likes · 16 min read

Zuoyebang Tech Team

Jul 14, 2022 · Artificial Intelligence

Enhancing Speech Keyword Detection Using Prefix Automaton Beam Search

This article presents a method to improve keyword detection in large‑scale speech recognition by integrating a prefix automaton into the beam‑search decoding of seq2seq models, enabling real‑time addition of new terms while reducing computational overhead compared to traditional approaches.

Beam Search · Seq2Seq · keyword detection

0 likes · 12 min read

Zuoyebang Tech Team

Jun 10, 2022 · Artificial Intelligence

How End-to-End Phoneme Recognition Boosts English Pronunciation Detection

This article examines the challenges of English pronunciation teaching in China and presents a practical end-to-end phoneme‑level mispronunciation detection system that leverages CTC models, attention‑based text fusion, and data augmentation to dramatically reduce false alarms and improve diagnostic accuracy.

AI Education · end-to-end models · language learning

0 likes · 9 min read

Python Programming Learning Circle

Apr 22, 2022 · Artificial Intelligence

Building a Python Voice Chatbot with Baidu AI Speech Recognition and Qingyunke

This tutorial explains how to create a Python voice chatbot by recording audio, converting speech to text with Baidu AI, sending the text to the Qingyunke chatbot API for a response, and finally synthesizing the reply back into speech using pyttsx3.

Baidu AI · Chatbot · Text‑to‑Speech

0 likes · 8 min read

NetEase LeiHuo Testing Center

Apr 15, 2022 · Artificial Intelligence

Practical AI‑Powered Voice Recognition for Game Dialogue Testing: A Step‑by‑Step Case Study

This article presents a detailed case study of using AI speech‑recognition techniques—including acoustic modeling with VGG, pypinyin conversion, feature extraction, and CTC decoding—to automatically verify game dialogue audio against script text, outlining the workflow, challenges, implementation details, and experimental results.

AI · CTC decoding · Python

0 likes · 10 min read

DataFunSummit

Apr 1, 2022 · Artificial Intelligence

Detecting Invalid Queries in Voice Interaction: Non‑Human Interaction and Ambiguous Intent Recognition

This talk presents a comprehensive study of invalid query detection in voice assistants, covering the definition of effective and ineffective queries, challenges of non‑human interaction and ambiguous intent recognition, data collection, model design, experimental results, user‑feedback loops, and future research directions.

Natural Language Understanding · invalid query detection · machine learning

0 likes · 20 min read

DataFunTalk

Mar 20, 2022 · Artificial Intelligence

Detecting Invalid Queries in Voice Interaction: Non‑Human Interaction and Ambiguous Intent Recognition

This talk presents a comprehensive study of invalid query detection in voice assistants, covering the definition and taxonomy of invalid queries, challenges of non‑human interaction and ambiguous intent recognition, data collection and labeling strategies, feature engineering, deep neural network modeling, experimental results, user‑feedback loops, and current performance limits.

AI · dialogue system · invalid query

0 likes · 17 min read

Baidu Geek Talk

Feb 14, 2022 · Artificial Intelligence

AI Sign Language Digital Human: Technology, Challenges, and Development by Baidu Intelligent Cloud

Baidu’s AI‑driven sign‑language digital human combines ultra‑accurate speech recognition, specialized translation, and precise gesture‑generation models—backed by extensive motion‑capture data and expert validation—to deliver 24‑hour, high‑fidelity signing for millions of hearing‑impaired users, showcasing inclusive AI communication.

AI · accessibility · gesture generation

0 likes · 12 min read

DataFunSummit

Jan 16, 2022 · Artificial Intelligence

Multimodal Text and Speech Emotion Analysis: Overview, MSCNN‑SPU Model, and Domain Adaptation

This talk presents an overview of text‑plus‑speech multimodal emotion analysis, covering background, single‑modal text and audio models, the MSCNN‑SPU multimodal architecture, domain‑adaptation techniques, and future directions, with detailed model explanations, experimental results, and practical deployment insights.

Audio Processing · Text Classification · deep learning

0 likes · 40 min read

Python Programming Learning Circle

Jan 10, 2022 · Artificial Intelligence

Building a Siri‑Like Voice Chatbot with Python

This tutorial explains how to create a Siri‑style conversational robot in Python by configuring the environment, describing the speech‑recognition and chatbot principles, and showing the implementation that uses Baidu speech recognition and the Turing chatbot API.

AI · Chatbot · Python

0 likes · 3 min read

Beike Product & Technology

Dec 23, 2021 · Artificial Intelligence

KeSpeech: A Large-Scale Chinese Mandarin Dialect Speech Benchmark Presented at NeurIPS 2021

KeSpeech, a benchmark jointly released by Beike AI and Tsinghua University at NeurIPS 2021, provides a massive Chinese Mandarin dialect dataset covering 30,000 speakers from 34 cities, supporting speech recognition, speaker verification, dialect identification, and voice conversion tasks, and includes rich multi‑scenario and parallel corpora for advanced research.

AI · NeurIPS · dialect benchmark

0 likes · 5 min read

DataFunTalk

Dec 5, 2021 · Artificial Intelligence

Real‑Time Voice Dialogue: Practices, Challenges, and Duplex Conversation

This article presents a comprehensive overview of real‑time voice dialogue systems, covering the hotline robot architecture, unique challenges of spoken interactions, ASR‑robust SLU models, multimodal emotion detection, oral expression handling, and the design and benefits of duplex (full‑duplex) conversational frameworks.

ASR robustness · SLU · duplex conversation

0 likes · 23 min read

DataFunTalk

Nov 5, 2021 · Artificial Intelligence

End-to-End Entity Extraction for Tmall Genie: Speech2Slot Model and Unsupervised Pre‑Training

This article presents the business background of Tmall Genie’s voice‑driven content‑on‑demand service, critiques the traditional pipeline for entity extraction, and details an end‑to‑end speech‑semantic model—including the Speech2Slot architecture, knowledge‑enhanced encoding, and Phoneme‑BERT unsupervised pre‑training—demonstrating significant performance gains in both generation and classification tasks.

Knowledge Integration · Voice Assistant · end-to-end model

0 likes · 14 min read

DataFunSummit

Nov 3, 2021 · Artificial Intelligence

Innovations and Practices of Entity Extraction in Tmall Genie Voice Assistant

The article presents Tmall Genie’s end‑to‑end speech‑semantic understanding pipeline, detailing the limitations of traditional ASR‑NLU‑IR pipelines, introducing the Speech2Slot model with knowledge‑enhanced encoders, and describing unsupervised phoneme‑based pre‑training (Phoneme‑BERT) that improves entity extraction performance in voice‑driven content playback.

Phoneme-BERT · Tmall Genie · end-to-end model

0 likes · 14 min read

HelloTech

Aug 13, 2021 · Backend Development

Understanding WebSocket Protocol and Its Application in Real‑Time Speech Recognition

The article explains why traditional polling methods fall short for real‑time data, introduces the WebSocket protocol’s full‑duplex handshake and heartbeat mechanisms, and demonstrates how a Java‑based WebSocket service efficiently streams audio to an ASR engine for low‑latency speech recognition.

Java · Real‑time communication · Spring Boot

0 likes · 12 min read

DataFunTalk

Aug 13, 2021 · Artificial Intelligence

Predictions for Speech Recognition Technology Over the Next Decade: Research and Application Directions

The article, written by a Stanford PhD graduate now at Zoom, forecasts that by 2030 speech recognition will rely heavily on semi‑supervised learning, on‑device models, richer representations, and personalization, while applications such as transcription services and voice assistants will evolve modestly.

AI · Semi-supervised Learning · future trends

0 likes · 7 min read

58 Tech

Jul 21, 2021 · Artificial Intelligence

Streaming Speech Recognition Engine: Architecture, Workflow, and Optimizations at 58.com

The article details the design, components, real‑time processing flow, and performance optimizations of 58.com’s streaming speech recognition engine, covering its SDK access layer, logical services, data storage, Kaldi‑based decoding, and the practical impact on voice‑driven applications.

AI · Kaldi · architecture

0 likes · 12 min read

58 Tech

Jul 14, 2021 · Artificial Intelligence

Multi‑Turn Voice Bot Architecture and End‑to‑End Dialogue Jump Strategies at 58.com

This article describes the overall architecture of 58.com’s multi‑turn voice robot, explains rule‑based, intent‑based and text‑matching dialogue jump strategies, introduces an end‑to‑end classification approach using TextCNN, and reports its online performance improvements and future research directions.

AI · dialogue management · end-to-end model

0 likes · 17 min read

Beike Product & Technology

Jul 1, 2021 · Artificial Intelligence

Semantic Data Augmentation and GigaSpeech: Highlights of Two INTERSPEECH 2021 Papers from the Beike Voice Team

The article summarizes two INTERSPEECH 2021 papers from Beike's voice technology team, detailing a grammar‑based semantic data augmentation method that improves end‑to‑end Chinese speech recognition and introducing GigaSpeech, a 10,000‑hour multi‑domain English speech dataset for robust ASR research.

Chinese · Data Augmentation · GigaSpeech

0 likes · 7 min read

58 Tech

May 31, 2021 · Artificial Intelligence

Practical Implementation of Voice Activity Detection (VAD) for Streaming and Offline Scenarios at 58.com

This article presents the design, training, deployment, and evaluation of a self‑developed Voice Activity Detection system used in both real‑time streaming dialogues and offline audio analysis at 58.com, detailing algorithm choices, smoothing strategies, engineering challenges, and future research directions.

AI · VAD · Voice Activity Detection

0 likes · 18 min read

Sohu Tech Products

May 12, 2021 · Artificial Intelligence

Zero‑Basis Food Sound Recognition with ASR: Theory, Workflow, and Complete Python Code

This article introduces the fundamentals of automatic speech recognition (ASR) for food‑sound classification, explains key audio representations and modeling approaches, and provides a fully runnable Python implementation using librosa, TensorFlow/Keras, and classic machine‑learning tools to train and predict on the Tianchi competition dataset.

ASR · Audio Classification · CNN

0 likes · 11 min read

Didi Tech

Apr 29, 2021 · Artificial Intelligence

Design and Architecture of DiDi Driver-side Intelligent Voice Assistant "XiaoDi"

The document details DiDi’s driver‑side intelligent voice assistant “XiaoDi,” describing its three‑layer architecture—audio source switching controller, semantic‑parsing core, and business API—along with conflict‑resolution mechanisms, multi‑turn dialogue handling, and a four‑region UI design that together enhance driver safety, convenience, and well‑being.

AI · Driver App · Mobile Development

0 likes · 30 min read

JD Cloud Developers

Mar 8, 2021 · Artificial Intelligence

How AI Voice Synthesis Brings ‘Hi, Mom’ to Life: From Film to Real‑World Tech

The article explores how modern AI technologies such as speech synthesis, natural language understanding, and the FastReID computer‑vision library enable realistic voice recreation and cross‑temporal dialogue, turning the emotional premise of the movie “Hi, Mom” into a tangible technical demonstration.

AI · FastReID · Natural Language Understanding

0 likes · 10 min read

TAL Education Technology

Feb 25, 2021 · Artificial Intelligence

TAL Education Releases 587‑Hour Bilingual Speech Dataset for AI Research

TAL Education (好未来) has released a 587‑hour bilingual Chinese‑English speech dataset recorded in classroom teaching, one of the largest open educational corpora, to address the data scarcity in mixed‑language speech recognition research and support AI model development.

AI · Education Technology · Open Data

0 likes · 5 min read

ITPUB

Feb 25, 2021 · Artificial Intelligence

How 58.com Scales Voice Quality Inspection with AI-Powered Architecture

This article details the AI-driven intelligent voice quality inspection system built by 58.com, covering its background, multi‑layer architecture, speech recognition, role and tag identification, backend services, and the resulting efficiency gains for large‑scale call‑center operations.

AI · call center automation · deep learning

0 likes · 15 min read

58 Tech

Feb 22, 2021 · Artificial Intelligence

Building a Self‑Developed Speech Recognition Engine at 58.com: From Team Formation to Production Deployment

This article details how a three‑person team at 58.com built a self‑developed speech recognition engine in less than a year, covering background, team formation, data annotation, model selection, engineering architecture, performance optimizations, deployment results, and future directions.

ASR · Kaldi · Real-time

0 likes · 25 min read

58 Tech

Dec 21, 2020 · Artificial Intelligence

Voice Robot Sound Classification: Feature Extraction, VGGish Model, and Optimization Experiments

This article describes the end‑to‑end pipeline of a voice robot, covering speech framing, feature extraction (FBank, MFCC), the VGGish embedding network, various model architectures, experimental results on accuracy and recall, and future directions for improving sound‑type classification.

FBank · MFCC · VGGish

0 likes · 11 min read

58 Tech

Dec 11, 2020 · Artificial Intelligence

Weighted Finite State Transducers (WFST) in Traditional Speech Recognition: Principles and Optimization

This article explains the role of Weighted Finite State Transducers in conventional HMM‑based speech recognition, covering language models, pronunciation dictionaries, WFST definitions, semiring theory, composition and determinization operations, decoding graph construction (HCLG), lattice rescoring, and practical optimization techniques for real‑world scenarios.

ASR · Language Model · Optimization

0 likes · 23 min read

58 Tech

Nov 27, 2020 · Artificial Intelligence

An Overview of Kaldi Chain Model Speech Recognition and Its Relationship with HMM‑DNN and Discriminative Training

This article explains the Kaldi chain model speech‑recognition system, covering HMM‑DNN fundamentals, discriminative (MMI) training, the special single‑state HMM topology, TDNN architecture, training pipelines, and experimental results that demonstrate its performance advantages over traditional GMM‑based approaches.

HMM-DNN · Kaldi · TDNN

0 likes · 19 min read

Sohu Tech Products

Aug 19, 2020 · Artificial Intelligence

ASR Error Correction with BERT, ELECTRA and a Fuzzy‑Phoneme Generator: Techniques from Xiaomi AI

This article describes how Xiaomi's AI team tackles Automatic Speech Recognition (ASR) query errors by analyzing error patterns, employing BERT, ELECTRA and a soft‑masked BERT model, generating synthetic noisy data with a fuzzy‑phoneme generator, and presenting experimental results and future research directions.

ASR · BERT · ELECTRA

0 likes · 18 min read

58 Tech

Aug 19, 2020 · Artificial Intelligence

Speech Recognition in 58.com: Application Scenarios, Data Collection, Kaldi Chain Model Practice, and End‑to‑End Exploration

This article presents a comprehensive overview of how 58.com leverages large‑scale voice data from call‑center, private phone, and micro‑chat platforms, detailing data collection, annotation, Kaldi‑based chain model training, lattice‑free techniques, and end‑to‑end Transformer‑CTC models to improve Chinese speech recognition performance.

ASR · Chinese · End-to-End

0 likes · 16 min read

58 Tech

Aug 7, 2020 · Artificial Intelligence

Technical Overview of 58.com Intelligent Voice Analysis Platform

The article presents a comprehensive technical overview of 58.com’s intelligent voice analysis platform, detailing its business background, system architecture, speech and NLP technologies, speaker diarization methods, model performance, data labeling workflow, and practical applications in call‑center quality inspection and user profiling.

AI platform · data labeling · natural language processing

0 likes · 11 min read

58 Tech

Aug 3, 2020 · Artificial Intelligence

Intelligent Voice Quality Inspection System Architecture and Implementation at 58.com

The article details the design and deployment of an AI-powered intelligent voice quality inspection system at 58.com, covering its overall architecture, speech recognition, role identification, tag detection, rechecking platform, and backend infrastructure, and demonstrates its impact on call‑center efficiency and service quality.

AI · backend-architecture · deep learning

0 likes · 12 min read

DataFunTalk

Jul 15, 2020 · Artificial Intelligence

ASR Error Correction with BERT, ELECTRA, and a Fuzzy‑Phoneme Generator: Methods, Experiments, and Future Directions

This article presents a comprehensive overview of automatic speech recognition (ASR) error correction techniques employed by Xiaomi's Xiao‑Ai team, detailing problem definition, related work on BERT and ELECTRA, a custom generator‑discriminator architecture with a fuzzy‑phoneme simulator, experimental results, and prospective research directions.

ASR · BERT · ELECTRA

0 likes · 19 min read

58 Tech

Jun 15, 2020 · Artificial Intelligence

Intelligent Voice Robot Architecture, Core Technologies, and Enterprise Applications

This article presents the engineering architecture of intelligent voice robots, detailing voice preprocessing, intent recognition, slot extraction, dialogue management, and showcases multiple enterprise use cases that improve efficiency and revenue across sales, customer service, and recruitment.

Enterprise Automation · dialogue management · intent classification

0 likes · 14 min read

Didi Tech

May 25, 2020 · Artificial Intelligence

How Didi Harnesses Cutting‑Edge Speech Recognition: From ASR Basics to Transformer Models

This article provides a comprehensive technical overview of modern speech recognition, covering Didi’s driver‑assistant and smart‑customer‑service applications, fundamental ASR concepts, classic GMM‑HMM methods, deep‑learning breakthroughs such as DNN‑HMM, CTC, attention‑based and transformer models, practical training tricks, signal‑processing steps, and multimodal fusion techniques.

ASR · CTC · Multimodal

0 likes · 16 min read

Didi Tech

Apr 2, 2020 · Artificial Intelligence

Interview: Didi AI’s DELTA – A Unified Framework for NLP and Speech Model Development

In this interview, Didi AI Labs’ Han Kun explains how the DELTA platform unifies TensorFlow‑based NLP and speech models—supporting tasks from text classification to voice emotion recognition—through a modular, easily deployable architecture, accelerating development, powering Didi products, and now open‑sourced for broader AI collaboration.

AI platform · Delta · NLP

0 likes · 14 min read

Alibaba Cloud Developer

Mar 30, 2020 · Artificial Intelligence

How AI Is Transforming Language, Speech, and Vision: Key Technologies and Future Trends

This article provides a comprehensive overview of AI's rapid evolution, covering deep learning foundations, machine learning components, natural language processing advances, speech recognition breakthroughs, multimodal interaction, computer vision progress, model compression techniques, and the shift from data‑driven to knowledge‑based AI approaches.

machine learning · speech recognition

0 likes · 19 min read

DataFunTalk

Mar 19, 2020 · Artificial Intelligence

Advances in Voice Interaction: 360's Intelligent Dialogue System Architecture and Core Technologies

This article presents a comprehensive overview of 360's voice interaction platform, detailing dialogue system fundamentals, platform architecture, and core technologies such as semantic understanding, dialog management, and question answering, all driven by deep learning and multimodal innovations.

AI · Knowledge Graph · Natural Language Understanding

0 likes · 16 min read

DataFunTalk

Mar 10, 2020 · Artificial Intelligence

Interspeech 2019 Highlights: End‑to‑End Speech AI Technologies and Key Paper Summaries

The article reviews Interspeech 2019, summarizing major trends and representative papers in end‑to‑end speech recognition, synthesis, natural language understanding, speaker recognition, and speech translation, while also highlighting best student papers and providing resources for further study.

AI · Interspeech 2019 · Natural Language Understanding

0 likes · 24 min read

TAL Education Technology

Feb 28, 2020 · Artificial Intelligence

TPNN Multi‑GPU Training and Mobile Optimization for Children's Acoustic Speech Recognition Models

This article describes the TPNN deep‑learning platform’s multi‑GPU acceleration, data‑parallel BMUF training, LSTM‑CTC acoustic modeling, and a suite of mobile‑side optimizations—including model pruning, 8‑bit quantization, low‑precision matrix multiplication and mixed‑precision computation—that together achieve over 92% recognition accuracy for children’s English speech on both server and mobile devices.

BMUF · CTC · LSTM

0 likes · 15 min read

DataFunTalk

Feb 3, 2020 · Artificial Intelligence

Advances in Speech Recognition: Concepts, Deep Learning Methods, and Didi’s Applications

This article presents a comprehensive overview of modern speech recognition technology, covering basic ASR concepts, classic acoustic and language models, deep‑learning approaches such as DNN‑HMM, CTC, attention‑based and transformer models, multimodal fusion, signal‑processing pipelines, and practical deployment considerations at Didi.

ASR · CTC · Didi

0 likes · 15 min read

21CTO

Jan 31, 2020 · Artificial Intelligence

How Microsoft’s First Chinese AI Fellow Is Driving Speech and Language Breakthroughs

Microsoft appointed Huang Xuedong, its first Chinese Global Technical Fellow, as the company’s Global AI CTO, overseeing Azure’s speech, translation, vision, and language services; the article highlights his achievements, such as reaching human‑level word error rates and leading AI research teams.

AI research · Azure · Microsoft

0 likes · 7 min read

iQIYI Technical Product Team

Jan 17, 2020 · Artificial Intelligence

Voice and Language Technologies in Natural Interaction: iQIYI HomeAI Speech Interaction System

The talk introduced iQIYI’s HomeAI platform, which combines user profiling (including voiceprint and age detection) with automatic video semantic extraction to enable natural, multi‑turn voice‑based video search—addressing hot‑content updates, contextual awareness, device environments, and personalized recommendations for screen‑less or accessibility‑focused users.

AI · Context-Aware · entity extraction

0 likes · 19 min read

58 Tech

Nov 18, 2019 · Artificial Intelligence

Comprehensive Solution for Human‑Machine Voice Dialogue Robot at 58.com

This article presents a complete solution for 58.com’s human‑machine voice dialogue robot, detailing its background, overall architecture, intelligent outbound process, core functions such as call service, anti‑spam, status recognition, multi‑turn dialogue management, intent classification, slot extraction, whole‑round intent detection, and various practical application scenarios.

AI · Telephony · dialogue management

0 likes · 13 min read

DataFunTalk

Nov 18, 2019 · Artificial Intelligence

Complete Solution of 58.com Human-Machine Voice Dialogue Robot: Architecture, Core Modules, and Application Scenarios

This article presents the end‑to‑end solution of 58.com’s voice dialogue robot, detailing its overall architecture, intelligent outbound process, core functions such as call dialing, status recognition, dialogue management, intent detection, and showcasing multiple real‑world application scenarios that improve sales, operations, and customer service efficiency.

AI · Intent Detection · Telephony

0 likes · 12 min read

WeChat Backend Team

Sep 3, 2019 · Artificial Intelligence

How Tencent Scaled Massive n‑gram Language Models for Real‑Time Speech Recognition

This article presents a distributed system that efficiently supports large‑scale n‑gram language models for automatic speech recognition by introducing caching, a two‑level distributed index, batch processing, and a cascading fault‑tolerance mechanism, demonstrating robust scalability and low communication overhead in Tencent's WeChat ASR service.

Caching · Language Model · N-gram

0 likes · 35 min read

Tencent Cloud Developer

Aug 25, 2019 · Artificial Intelligence

Understanding Intelligent Speech Recognition Technology

Intelligent speech recognition converts spoken audio to text through a pipeline of feature extraction, acoustic modeling, and language modeling. Deep neural networks, especially CNN, LSTM, and hybrid CLDNN architectures, drive its accuracy and enable applications such as mobile voice input, call‑center transcription, and legal record keeping. The article also covers Tencent Cloud ASR, cited at 97% Mandarin accuracy with speaker separation and on‑premises deployment.

AI · Language Model · Tencent Cloud

0 likes · 7 min read

58 Tech

Aug 14, 2019 · Artificial Intelligence

Design and Implementation of a Dialogue Management System for Intelligent Voice Robots

This article presents a comprehensive overview of an intelligent voice robot's dialogue management system, detailing its architecture, natural language understanding components, dialogue manager design, strategy handling, and workflow processes to achieve fluent multi‑turn interactions in telephone scenarios.

AI · NLU · conversation system

0 likes · 14 min read
