Tagged articles

ASR

57 articles

Jun 23, 2026 · Artificial Intelligence

Inside Artificial Analysis: Independent AI Voice Benchmarks for ASR, TTS, and Speech‑to‑Speech

Artificial Analysis provides an independent, reproducible benchmarking platform for voice AI, offering objective WER scores for ASR, Elo‑based blind‑listening scores for TTS, and three‑dimensional metrics for end‑to‑end speech dialogue, together with detailed methodology, top‑model rankings, and practical guidance for developers to choose the most suitable model and provider for their scenarios.

AI voice evaluation · ASR · Artificial Analysis

0 likes · 14 min read

Weekly Large Model Application

Jun 16, 2026 · Artificial Intelligence

Building a Reproducible, Scalable ASR Evaluation Framework for 2025‑2026

The article outlines why a unified ASR evaluation pipeline—combining a TestSet Zoo, Model Zoo, and standardized Benchmark Pipeline—is essential for fair cross‑model comparison, describes 2025‑2026 trends such as multi‑track metrics and robustness, and provides a step‑by‑step implementation guide with best‑practice warnings.

ASR · Evaluation · NeMo

0 likes · 9 min read

Old Zhang's AI Learning

Jun 9, 2026 · Artificial Intelligence

Open-Source ASR That Runs Faster on CPU Than Whisper on GPU

FunASR is an industrial‑grade, open‑source speech‑recognition toolkit that combines VAD, transcription, punctuation, speaker diarization and emotion detection in one call, achieving up to 170× real‑time on GPU and 17× on CPU, outperforming Whisper while supporting 50+ languages and offering OpenAI‑compatible APIs.

ASR · CPU performance · FunASR

0 likes · 13 min read

Weekly Large Model Application

May 29, 2026 · Artificial Intelligence

From Direct Transcription to Reasoning ASR and Parallel Decoding: CoT‑ASR vs Whisfusion

ASR is shifting from direct verbatim transcription to two new paradigms—Chain‑of‑Thought reasoning (CoT‑ASR) that cuts WER and entity error rates, and diffusion‑based parallel decoding (Whisfusion) that slashes latency by over eight times—offering complementary routes for smarter, faster speech recognition.

ASR · Chain-of-Thought · CoT-ASR

0 likes · 12 min read

Weekly Large Model Application

May 28, 2026 · Artificial Intelligence

Open-Source ASR Optimization: Solving Misrecognition of Proper Nouns and Real-Time Lag

This guide analyzes common deployment problems of open‑source speech‑recognition models—misrecognizing proper nouns and lagging behind spoken input—and presents a decision‑tree‑based, five‑layer optimization framework that balances accuracy and speed through concrete techniques such as hot‑word bias, model fine‑tuning, INT8 quantization, and appropriate runtimes.

ASR · Accuracy · Optimization

0 likes · 10 min read

JavaGuide

May 11, 2026 · Artificial Intelligence

Running Code Review and Voice Agents with Step Plan and Claude Code

The article walks through using Step Plan’s unified API to integrate Claude Code for automated code review and to build a voice‑agent pipeline that transcribes meeting recordings, generates structured summaries, and produces audio briefs, while discussing setup, costs, model selection, practical demos, and observed limitations.

AI Agent · ASR · Claude Code

0 likes · 24 min read

Weekly Large Model Application

Apr 16, 2026 · Artificial Intelligence

Deep Dive into Conformer: The Convolution‑Augmented Transformer for Speech Recognition

The Conformer architecture blends global self‑attention with a depthwise separable convolution module in a Macaron‑style block, addressing the strong local time‑frequency structure and long sequence length of speech signals while keeping computational cost manageable for modern ASR systems.

ASR · Conformer · Convolution

0 likes · 11 min read

Weekly Large Model Application

Mar 30, 2026 · Artificial Intelligence

Inside Kimi-Audio: A Unified Large Audio Model Covering ASR, AQA, TTS and More

Kimi-Audio, a general‑purpose audio foundation model from Moonshot AI, integrates ASR, audio QA, automatic audio captioning, emotion classification and end‑to‑end speech dialogue within a single framework. The article details its mixed‑audio input, MiMo‑Transformer core, efficient synthesis pipeline, architectural strengths, limitations, and suitable application scenarios.

ASR · Audio LLM · BigVGAN

0 likes · 9 min read

Weekly Large Model Application

Mar 17, 2026 · Artificial Intelligence

Essential Features Every Voice Interaction System Must Support

The article provides a comprehensive analysis of core voice interaction system capabilities—including barge‑in, turn‑taking, multi‑turn dialogue, intent recognition, speaker identification, streaming latency, noise robustness, multilingual support, emotion handling, personalization, security, and deployment considerations—highlighting typical scenarios such as smart speakers, in‑car assistants, call centers, and meeting transcription.

ASR · Latency · Multimodal

0 likes · 11 min read

Weekly Large Model Application

Mar 13, 2026 · Artificial Intelligence

Speech Large Models: Why End-to-End Architecture Beats Traditional ASR‑LLM‑TTS Pipelines

The article defines true speech large models as native end‑to‑end systems that directly map audio to audio, compares them with traditional cascade ASR‑LLM‑TTS pipelines across architecture, error control, latency, paralinguistic perception, long‑context handling and deployment, and surveys the leading open‑source and commercial speech LLMs released in March 2026 with a quick selection guide.

AI · ASR · End-to-End

0 likes · 11 min read

Old Zhang's AI Learning

Mar 7, 2026 · Artificial Intelligence

vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility

The vLLM 0.17.0 release brings FlashAttention 4 integration, a mature Model Runner V2, complete Qwen 3.5 series support, a one‑click performance‑mode flag, Anthropic API compatibility, advanced weight‑offloading, broader hardware support beyond NVIDIA, ASR model integration, and detailed upgrade and installation guidance.

ASR · Anthropic API · FlashAttention

0 likes · 12 min read

Weekly Large Model Application

Mar 4, 2026 · Artificial Intelligence

Qwen3‑ASR vs FunASR: In‑Depth Technical Comparison

This article provides a detailed side‑by‑side analysis of the open‑source ASR tools FunASR and Qwen3‑ASR, covering team origins, model architectures, language coverage, speed, deployment requirements, and ideal use‑cases so readers can decide which solution fits their projects best.

ASR · FunASR · Large Language Model

0 likes · 10 min read

Xiaohongshu Tech REDtech

Mar 4, 2026 · Mobile Development

How Xiaohongshu Delivered Billion‑User Voice & Fireworks Effects with Adaptive Rendering

During the 2026 Chinese New Year, Xiaohongshu built a real‑time dynamic interaction system that combined adaptive scheduling, high‑performance particle rendering, and industrial‑grade ASR to deliver synchronized voice greetings and emoji fireworks to over a billion daily active users across heterogeneous mobile devices.

ASR · Cross-Platform · adaptive scheduling

0 likes · 13 min read

Weekly Large Model Application

Feb 22, 2026 · Artificial Intelligence

2026 Guide to Running Open‑Source ASR on Pure CPU

The 2026 overview details lightweight, heavily quantized open‑source speech‑recognition models and CPU‑specific inference engines, offering concrete tips, model comparisons, and a concise selection guide that enable real‑time, GPU‑free ASR deployment with low latency and high stability.

ASR · CPU inference · Quantization

0 likes · 4 min read

Weekly Large Model Application

Feb 20, 2026 · Artificial Intelligence

Intelligent Speech vs. Voice Agent: Key Differences and How They Relate

This article explains the technical distinction between intelligent speech (a toolbox of ASR, TTS, NLU, and NLG technologies) and Voice Agent, an end‑to‑end conversational system built on those tools and large‑model reasoning, illustrating their layered relationship, functional gaps, and typical use cases.

ASR · Dialogue Systems · Large Language Model

0 likes · 7 min read

Old Zhang's AI Learning

Feb 1, 2026 · Artificial Intelligence

Microsoft VibeVoice‑ASR Open‑Source: One‑Shot 60‑Minute Transcription with Speaker ID and Timestamps

Microsoft’s newly open‑sourced VibeVoice‑ASR model can transcribe up to 60‑minute audio in a single pass, preserving global context while providing built‑in speaker diarization and timestamps, supports 50+ languages, offers custom hot‑word injection, and can be deployed via Docker, Gradio, or vLLM for high‑throughput API services.

ASR · Docker · LoRA

0 likes · 9 min read

Zhuanzhuan Tech

Dec 24, 2025 · Artificial Intelligence

Building an ASR+LLM+Vector Knowledge Base for Precise Video Ad Category Detection

This article presents a layered ASR‑LLM‑vector‑knowledge‑base pipeline that cleans speech transcripts, semantically repairs text, performs hierarchical exact and fuzzy matching, and iteratively refines mappings to accurately identify product categories in video advertisements, while detailing module functions, technical choices, and LLM parameter tuning.

ASR · Knowledge Base · LLM

0 likes · 11 min read

360 Zhihui Cloud Developer

Sep 16, 2025 · Artificial Intelligence

How AI Transforms Video Conferencing: From ASR to LLM-Powered Smart Meetings

This article explores how integrating ASR, TTS, and large language models into video conferencing creates an intelligent collaboration hub that boosts efficiency, enhances user experience, expands multilingual scenarios, and provides practical architecture and Python code examples for real‑time smart meetings.

AI · ASR · LLM

0 likes · 11 min read

Huolala Tech

Sep 10, 2025 · Artificial Intelligence

How AI Voice Humanization Cuts Call‑Center Costs: ASR, Smart Interrupt & TTS Deep Dive

This article examines how AI‑driven voice humanization—covering advanced ASR, intelligent interruption, and expressive TTS—addresses high labor costs, efficiency bottlenecks, and inconsistent service quality in inbound and outbound call‑center operations, presenting technical evaluations, optimization strategies, and future research directions.

AI voice · ASR · Humanization

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Jul 8, 2025 · Artificial Intelligence

How Video Retrieval‑Augmented Generation Transforms Multimodal AI Search

This article explains the end‑to‑end implementation of Video RAG in OpenSearch LLM, covering offline parsing, key‑frame extraction, audio transcription, slice creation, multimodal vectorization, hybrid indexing, and online query processing while addressing challenges like recall performance and long‑video efficiency.

ASR · Key Frame Extraction · LLM

0 likes · 10 min read

Efficient Ops

Oct 28, 2024 · Artificial Intelligence

How AI Powers Real-Time Business Hot‑Word Monitoring in Remote Banking

ICBC's remote‑banking hotline system uses AI, speech recognition and Python keyword extraction to rank inbound business volumes and surface hot‑word trends, delivering early alerts that help prevent risks, resolve customer issues, and support data‑driven decision making across millions of daily transactions.

AI · ASR · Hot Word Detection

0 likes · 4 min read

DataFunTalk

Jun 3, 2024 · Artificial Intelligence

Deploying Speech AI Services Quickly with NVIDIA Riva

This article explains how to use NVIDIA Riva to rapidly deploy speech AI services, covering Riva's overview, Chinese ASR model updates, TTS capabilities, customization options, the Quickstart tool, and a Q&A session that clarifies deployment, model fine‑tuning, and integration with NeMo and Triton.

ASR · GPU Acceleration · NVIDIA Riva

0 likes · 13 min read

AI Large Model Application Practice

Mar 22, 2024 · Artificial Intelligence

How to Build a Real‑Time AI‑Powered 3D Digital Human with Unreal Engine

This guide explains the architecture of an interactive digital‑human system, walks through 3D avatar creation with Unreal Engine, details the AI controller that combines ASR, LLM and TTS, and provides step‑by‑step instructions for deploying the open‑source Fay project.

AI Avatar · ASR · Fay

0 likes · 14 min read

DataFunTalk

Feb 13, 2024 · Artificial Intelligence

An Overview of NVIDIA NeMo: Open‑Source Framework for Speech AI, ASR, TTS, NLP and Large Language Model Training

This article introduces NVIDIA’s open‑source NeMo framework, detailing its PyTorch‑based architecture for Speech AI, ASR and TTS training, NLP and LLM support, GPU‑optimized parallelism, pre‑trained model resources, fine‑tuning techniques, and the accompanying NeMo Aligner and Framework tools.

ASR · NVIDIA NeMo · PyTorch

0 likes · 18 min read

DataFunTalk

Jan 26, 2024 · Artificial Intelligence

Efficient Deployment of Speech AI Models on GPUs

This article explains how to efficiently deploy speech AI models—including ASR and TTS—on GPUs using NVIDIA's Triton Inference Server and TensorRT, covering background challenges, GPU‑based solutions, decoding optimizations, Whisper acceleration with TensorRT‑LLM, streaming TTS improvements, voice‑cloning pipelines, future plans, and a Q&A session.

ASR · GPU · TTS

0 likes · 20 min read

Ctrip Technology

Dec 21, 2023 · Backend Development

Load Balancing ASR Services in Ctrip Call Center: Architecture and Implementation with FreeSWITCH and OpenSIPS

This article details the design, evolution, and best‑practice implementation of load‑balancing for ASR (speech‑recognition) services in Ctrip's massive call‑center, covering component architecture, MRCP integration, challenges with traditional balancers, and two practical solutions using FreeSWITCH distributor and OpenSIPS.

ASR · FreeSWITCH · MRCP

0 likes · 27 min read

Ximalaya Technology Team

Dec 19, 2023 · Cloud Computing

Text-Based Audio Editing in Cloud Editing: Architecture, Features, and Performance Optimizations

The article discusses the architecture of a cloud‑based audio editing tool, focusing on text‑based editing enabled by ASR, a hierarchical DOM (Word, Sentence, Paragraph), the performance challenges of massive numbers of character nodes, and optimizations such as viewport‑based rendering and efficient drag‑select that yield large speed gains for long recordings.

ASR · Performance Optimization · Text Editing

0 likes · 14 min read

Huolala Tech

Nov 23, 2023 · Artificial Intelligence

How HuoLaLa Built a Custom ASR System to Boost Accuracy and Cut Costs

This article details HuoLaLa's development of an in‑house Automatic Speech Recognition system, covering its architecture, VAD optimization, language‑model and hot‑word enhancements, punctuation restoration, task and resource scheduling, and the resulting improvements in accuracy and cost efficiency.

ASR · Language Model · Resource Scheduling

0 likes · 18 min read

Bilibili Tech

Oct 13, 2023 · Artificial Intelligence

Multimodal Video High‑Energy Segment Extraction for Dynamic Video Covers

The authors present a multimodal system that automatically extracts high‑energy video segments for dynamic covers by analyzing subtitles, audio, visual frames, and danmu, employing LLM prompt‑tuning, scene‑cut detection, and aesthetic scoring to reduce manual effort and boost click‑through rates.

ASR · Large Language Model · Multimodal AI

0 likes · 14 min read

DataFunTalk

Sep 23, 2023 · Artificial Intelligence

Paraformer: An Industrial Non‑Autoregressive End‑to‑End Speech Recognition Model and Its Deployment on ModelScope

This article introduces the Paraformer non‑autoregressive end‑to‑end speech recognition model released by Alibaba DAMO Academy, details its architecture, training strategies, large‑scale performance, and provides step‑by‑step guidance for using and fine‑tuning the model on the ModelScope platform with the FunASR toolkit.

ASR · ModelScope · Paraformer

0 likes · 13 min read

DataFunTalk

Sep 19, 2023 · Artificial Intelligence

Simultaneous Speech Translation: Technical Background, System Architecture, and Key Challenges

This article reviews the technical background of simultaneous speech translation, compares offline and real‑time scenarios, details ASR and MT technologies, describes the system architecture and design strategies, and discusses the major challenges and solutions for deploying robust, low‑latency translation services.

ASR · Huawei · Machine Translation

0 likes · 16 min read

58 Tech

Jun 21, 2023 · Artificial Intelligence

GPU Hotword Enhancement for WeNet End-to-End Speech Recognition

This article explains the design, implementation, and experimental evaluation of hot‑word augmentation in WeNet's GPU runtime, detailing how character‑ and word‑based language model scoring are extended to boost recognition of rare proper nouns in both streaming and non‑streaming ASR services.

ASR · CTC decoder · GPU

0 likes · 12 min read

DataFunSummit

Jun 15, 2023 · Artificial Intelligence

Paraformer: An Industrial Non‑Autoregressive End‑to‑End Speech Recognition Model

This article introduces the Paraformer model released by Alibaba DAMO Academy on ModelScope, detailing its non‑autoregressive architecture, training strategies, performance on benchmark datasets, and step‑by‑step guidance for fine‑tuning and deploying the model using FunASR and ModelScope pipelines.

ASR · ModelScope · Paraformer

0 likes · 13 min read

DataFunSummit

May 4, 2023 · Artificial Intelligence

An Overview of NVIDIA NeMo for Speech AI: ASR Training, Chinese Support, and Related Applications

This article provides a comprehensive introduction to NVIDIA's NeMo toolkit for conversational AI, detailing its ASR capabilities, model architectures, training workflow, Chinese language support, deployment options, and additional speech AI features such as VAD and speaker diarization.

ASR · Chinese Speech · Conformer

0 likes · 15 min read

DataFunSummit

Apr 18, 2023 · Artificial Intelligence

Best Practices for Deploying Speech AI on GPUs with Triton and TensorRT

This article presents comprehensive best‑practice guidelines for deploying conversational speech AI—including ASR and TTS pipelines—on GPU servers using NVIDIA Triton Inference Server and TensorRT, covering workflow overview, performance optimizations, streaming inference, and real‑world deployment tips.

ASR · Conversational AI · GPU deployment

0 likes · 14 min read

Meituan Technology Team

Mar 9, 2023 · Artificial Intelligence

Implementation and Practice of MRCP in Meituan Voice Interaction

This article details Meituan’s adoption of the Media Resource Control Protocol (MRCP) to standardize ASR and TTS integration, describing its architecture, key components, high‑availability deployment, and measured performance gains such as up to 55% latency reduction and a 15% increase in outbound call success rates.

ASR · MRCP · Meituan

0 likes · 24 min read

Bilibili Tech

Feb 28, 2023 · Artificial Intelligence

High‑Quality Automatic Speech Recognition (ASR) Solutions at Bilibili: Data, Model, and Deployment Optimizations

Bilibili’s high‑quality ASR system combines large‑scale filtered business data, semi‑supervised Noisy‑Student training, an end‑to‑end CTC model with lattice‑free MMI decoding, and FP16‑optimized FasterTransformer inference on Triton, delivering top‑ranked accuracy, low latency, and scalable deployment for diverse Chinese‑English video content.

ASR · Bilibili · End-to-End

0 likes · 18 min read

58 Tech

Jan 12, 2023 · Artificial Intelligence

Efficient Conformer for End‑to‑End Speech Recognition: Model, Implementation, Streaming Inference, and Experimental Results

This article presents a comprehensive overview of the Efficient Conformer model for large‑scale end‑to‑end speech recognition, detailing its architectural improvements such as progressive downsampling and grouped multi‑head self‑attention, the PyTorch implementation in WeNet, streaming inference handling, experimental CER gains on AISHELL‑1 and production data, and future development plans.

ASR · Efficient Conformer · Model Optimization

0 likes · 16 min read

DataFunTalk

Jul 30, 2022 · Artificial Intelligence

Technical Analysis of Huawei’s Offline Speech‑to‑Text and Length‑Constrained Speech Translation Systems in IWSLT 2022

This article reviews the IWSLT 2022 competition tasks, explains Huawei’s cascade offline speech‑to‑text translation pipeline, details four major technical innovations (ensemble‑based ASR denoising, context‑aware re‑ranking, domain‑controlled training, and length‑control strategies), and presents experimental results that demonstrate Huawei’s leading performance across multiple language directions.

ASR · Huawei · IWSLT

0 likes · 18 min read

DataFunTalk

Jul 7, 2022 · Artificial Intelligence

Huawei Translation’s Achievements and Technical Solutions in IWSLT 2022 Speech Translation Tasks

This article reviews Huawei Translation’s top-ranking results in the IWSLT 2022 speech translation competition across speech‑to‑speech, offline speech‑to‑text, and length‑controlled translation tasks, and details their cascade and end‑to‑end technical approaches, including domain‑controlled ASR, context‑aware MT re‑ranking, and VITS‑based TTS.

ASR · End-to-End · Huawei

0 likes · 13 min read

Code DAO

Dec 10, 2021 · Artificial Intelligence

Deep Learning for Automatic Speech Recognition (ASR): From Mel Spectrograms to CTC Decoding

This article explains the end‑to‑end deep‑learning pipeline for speech‑to‑text, covering audio digitization, preprocessing with librosa, conversion to Mel spectrograms and MFCCs, data augmentation, a CNN‑RNN architecture, CTC loss, decoding strategies and evaluation with word error rate.

ASR · Beam Search · CTC

0 likes · 13 min read

DataFunSummit

Dec 3, 2021 · Artificial Intelligence

Real‑Time Voice Dialogue: Practices, Challenges, and Duplex Conversation

This article presents an in‑depth overview of Alibaba's real‑time voice dialogue system, covering the Hotline XiaoMi robot, the unique challenges of spoken interactions such as colloquialism, multimodality and duplex communication, and the research advances in ASR‑robust SLU, emotion detection, colloquial processing, and duplex conversation modeling.

ASR · Multimodal · SLU

0 likes · 22 min read

Sohu Tech Products

May 12, 2021 · Artificial Intelligence

Zero‑Basis Food Sound Recognition with ASR: Theory, Workflow, and Complete Python Code

This article introduces the fundamentals of automatic speech recognition (ASR) for food‑sound classification, explains key audio representations and modeling approaches, and provides a fully runnable Python implementation using librosa, TensorFlow/Keras, and classic machine‑learning tools to train and predict on the Tianchi competition dataset.

ASR · Audio Classification · CNN

0 likes · 11 min read

58 Tech

Feb 22, 2021 · Artificial Intelligence

Building a Self‑Developed Speech Recognition Engine at 58.com: From Team Formation to Production Deployment

This article details how a three‑person team at 58.com built a self‑developed speech recognition engine in less than a year, covering background, team formation, data annotation, model selection, engineering architecture, performance optimizations, deployment results, and future directions.

ASR · Kaldi · Real-time

0 likes · 25 min read

58 Tech

Dec 11, 2020 · Artificial Intelligence

Weighted Finite State Transducers (WFST) in Traditional Speech Recognition: Principles and Optimization

This article explains the role of Weighted Finite State Transducers in conventional HMM‑based speech recognition, covering language models, pronunciation dictionaries, WFST definitions, semiring theory, composition and determinization operations, decoding graph construction (HCLG), lattice rescoring, and practical optimization techniques for real‑world scenarios.

ASR · Language Model · Optimization

0 likes · 23 min read

Sohu Tech Products

Aug 19, 2020 · Artificial Intelligence

ASR Error Correction with BERT, ELECTRA and a Fuzzy‑Phoneme Generator: Techniques from Xiaomi AI

This article describes how Xiaomi's AI team tackles Automatic Speech Recognition (ASR) query errors by analyzing error patterns, employing BERT, ELECTRA and a soft‑masked BERT model, generating synthetic noisy data with a fuzzy‑phoneme generator, and presenting experimental results and future research directions.

ASR · BERT · ELECTRA

0 likes · 18 min read

58 Tech

Aug 19, 2020 · Artificial Intelligence

Speech Recognition in 58.com: Application Scenarios, Data Collection, Kaldi Chain Model Practice, and End‑to‑End Exploration

This article presents a comprehensive overview of how 58.com leverages large‑scale voice data from call‑center, private phone, and micro‑chat platforms, detailing data collection, annotation, Kaldi‑based chain model training, lattice‑free techniques, and end‑to‑end Transformer‑CTC models to improve Chinese speech recognition performance.

ASR · Chinese · End-to-End

0 likes · 16 min read

DataFunTalk

Jul 15, 2020 · Artificial Intelligence

ASR Error Correction with BERT, ELECTRA, and a Fuzzy‑Phoneme Generator: Methods, Experiments, and Future Directions

This article presents a comprehensive overview of automatic speech recognition (ASR) error correction techniques employed by Xiaomi's Xiao‑Ai team, detailing problem definition, related work on BERT and ELECTRA, a custom generator‑discriminator architecture with a fuzzy‑phoneme simulator, experimental results, and prospective research directions.

ASR · BERT · ELECTRA

0 likes · 19 min read

Didi Tech

May 25, 2020 · Artificial Intelligence

How Didi Harnesses Cutting‑Edge Speech Recognition: From ASR Basics to Transformer Models

This article provides a comprehensive technical overview of modern speech recognition, covering Didi’s driver‑assistant and smart‑customer‑service applications, fundamental ASR concepts, classic GMM‑HMM methods, deep‑learning breakthroughs such as DNN‑HMM, CTC, attention‑based and transformer models, practical training tricks, signal‑processing steps, and multimodal fusion techniques.

ASR · CTC · Multimodal

0 likes · 16 min read

Alibaba Cloud Developer

Apr 7, 2020 · Artificial Intelligence

How Does Alibaba’s Tmall Genie Achieve Full‑Duplex Natural Dialogue?

This article explains the concept of full‑duplex natural dialogue for Alibaba’s Tmall Genie, illustrates interaction scenarios, and details the technical solution covering device‑side management, speech recognition, language understanding, synthesis, dialogue control, duration handling, and conversation flow.

ASR · Human-Computer Interaction · NLU

0 likes · 8 min read

DataFunTalk

Feb 3, 2020 · Artificial Intelligence

Advances in Speech Recognition: Concepts, Deep Learning Methods, and Didi’s Applications

This article presents a comprehensive overview of modern speech recognition technology, covering basic ASR concepts, classic acoustic and language models, deep‑learning approaches such as DNN‑HMM, CTC, attention‑based and transformer models, multimodal fusion, signal‑processing pipelines, and practical deployment considerations at Didi.

ASR · CTC · Didi

0 likes · 15 min read

DataFunTalk

Jan 16, 2020 · Artificial Intelligence

Voice Conversion: Fundamentals, Methods, and iQIYI Applications

This article provides a comprehensive overview of voice conversion technology, covering its definition, parallel and non‑parallel data approaches, classic and deep‑learning methods such as DTW, GMM, seq2seq, PPG, VAE, Flow, GAN, and practical applications and challenges in iQIYI’s products.

ASR · GAN · Speech synthesis

0 likes · 8 min read

360 Quality & Efficiency

May 10, 2019 · Artificial Intelligence

Smart Speaker Voice Interaction Platform: Concepts, Processes, and Testing Metrics

This article introduces the architecture of smart speaker voice interaction systems, covering wake‑word activation, automatic speech recognition (ASR), natural language understanding (NLU), skill processing, text‑to‑speech synthesis (TTS), and the key performance and testing metrics for each component.

ASR · NLU · TTS

0 likes · 11 min read

Hulu Beijing

Apr 22, 2019 · Artificial Intelligence

How Has Speech Recognition Evolved from Traditional Methods to Modern Deep Learning?

This article reviews the fundamentals of automatic speech recognition, compares traditional MFCC‑GMM‑HMM pipelines with modern deep neural network approaches such as DNN‑HMM, LSTM‑CTC, and attention‑based models, and illustrates each evolution step with flowchart diagrams and key references.

ASR · CTC · DNN

0 likes · 11 min read

Tencent Cloud Developer

Feb 26, 2019 · Artificial Intelligence

Tencent Cloud Intelligent Speech Technology: Development, Challenges and Practical Applications

Tencent Cloud's intelligent speech platform combines high‑accuracy ASR, advanced WaveNet‑based TTS, and solutions for noise, far‑field, and dialect challenges, enabling voice input, transcription, and customer‑service bots, with real‑world deployments in finance, museums, hotels, and other industry scenarios.

ASR · Human-Computer Interaction · Speech synthesis

0 likes · 8 min read

DataFunTalk

Jul 26, 2018 · Artificial Intelligence

Natural Language Understanding in the Music Domain: Architecture, Features, and Challenges

The article details the design and implementation of Xiaomi's music‑focused natural language understanding platform, covering its service architecture, intent extraction, knowledge‑base search, slot filling, personalization, and the specific data and modeling challenges encountered.

ASR · Knowledge Base · Music

0 likes · 9 min read

Liulishuo Tech Team

Oct 28, 2016 · Artificial Intelligence

Open‑sourcing kaldi‑ctc: Fast GPU‑Accelerated CTC End‑to‑End Speech Recognition

The article announces the open‑source release of kaldi‑ctc, a GPU‑accelerated CTC‑based end‑to‑end speech recognition toolkit built on Kaldi, warp‑ctc and cuDNN, highlighting its 5‑6× training speedup, a decoding real‑time factor of 0.02, and performance comparisons on the LibriSpeech corpus.

ASR · CTC · GPU

0 likes · 4 min read
