Tagged articles
53 articles
Geek Labs
May 1, 2026 · Artificial Intelligence

ACE-Step UI: An Open-Source, Free Alternative to Suno for AI Music Generation

ACE-Step UI is a completely free, locally run open-source interface for the ACE‑Step 1.5 AI music model, offering professional‑grade song generation, a modern React/Express stack, and a suite of audio tools, making it a viable alternative to Suno.

AI music generation · Audio Processing · Express
0 likes · 7 min read
Data STUDIO
Mar 26, 2026 · Operations

10 Open‑Source Python Tools That Replace Paid SaaS Apps

The article presents ten Python libraries—pikepdf, Playwright, pdf2image + pytesseract, moviepy, pydub + ffmpeg, reportlab, yt‑dlp, watchdog, pyvirtualcam, and rich + textual—each with code samples, runtime requirements, complexity analysis, practical tips, and common pitfalls, showing how they can substitute costly commercial software while offering greater control, privacy, and customization.

Audio Processing · Automation · File Monitoring
0 likes · 19 min read
Data Party THU
Oct 8, 2025 · Artificial Intelligence

Build a Music Genre Classifier from Scratch with KNN and MFCC

This tutorial walks through constructing a complete music‑genre classification project using Python, covering dataset preparation, MFCC feature extraction, K‑Nearest Neighbors implementation, train‑test splitting, model evaluation, and testing on new audio files, all with reproducible code snippets.

Audio Processing · MFCC · Music Genre Classification
0 likes · 14 min read
Data STUDIO
Sep 15, 2025 · Artificial Intelligence

Build a Music Genre Classifier with KNN and MFCC from Scratch

This tutorial walks through building a music‑genre classification system using the GTZAN dataset, extracting MFCC features, implementing a K‑Nearest Neighbors classifier in Python, and achieving roughly 70% accuracy on test data.

Audio Processing · MFCC · Music Genre Classification
0 likes · 14 min read
Baidu Geek Talk
Aug 4, 2025 · Fundamentals

How to Build High‑Performance Audio Post‑Processing with FFmpeg: Bass Boost & Voice Clarity

This article explains the importance of audio post‑processing in modern player architectures, outlines a modular FFmpeg‑based framework, details core techniques such as bass enhancement and voice clarity, provides algorithmic insights and code snippets, and shows how to integrate these filters into a playback pipeline.

Audio Processing · Media Playback · audio filters
0 likes · 17 min read
Baidu App Technology
Jul 29, 2025 · Fundamentals

How to Build High‑Performance Bass Boost and Voice Clarity Filters with FFmpeg

This article explains the architecture, key techniques, and implementation details of audio post‑processing in a media player, covering bass‑enhancement and voice‑clarity filters, frequency‑range design, device constraints, FFmpeg filter chains, and sample code for a high‑performance, low‑latency solution.

Audio Processing · bass boost · digital signal processing
0 likes · 17 min read
Cognitive Technology Team
Jul 1, 2025 · Artificial Intelligence

How We Built a Live‑Streaming TTS Engine: From Data Pipelines to AI Voice Generation

This article summarizes practical experience building an intelligent digital‑human system, covering six core modules (LLM content generation, LLM interaction, TTS synthesis, visual driving, audio‑video engineering, and backend services) and detailing data collection, signal processing, ASR annotation, speaker clustering, model optimization (V1‑V4), evaluation metrics, and future research directions.

AI voice · Audio Processing · Digital Human
0 likes · 23 min read
Programmer DD
May 13, 2025 · Frontend Development

How I Built a Cross‑Platform Audio/Video App in Hours with AI‑Powered CodeBuddy

This article chronicles how a developer transformed the TransDuck audio‑video SaaS tool into a native desktop application using Tauri, Vue, and ffmpeg, while leveraging the AI‑driven CodeBuddy extension to automate project scaffolding, code generation, error fixing, and UI refinement, cutting development time from days to a few hours.

AI-assisted development · Audio Processing · Code Generation
0 likes · 10 min read
System Architect Go
Nov 28, 2024 · Artificial Intelligence

An Overview of Modern AI Audio Technologies: ASR, TTS, and Voice Cloning

This article explains how modern AI advances have transformed audio processing, covering digital audio fundamentals, automatic speech recognition (ASR), text‑to‑speech (TTS), and voice cloning techniques, and providing practical Python code examples using OpenAI Whisper and HuggingFace TTS models.

AI · Audio Processing · speech recognition
0 likes · 7 min read
Rare Earth Juejin Tech Community
Oct 18, 2024 · Fundamentals

Fundamentals of Audio and Video Processing, Compression, and Streaming Protocols

This article provides a comprehensive overview of audio and video fundamentals, including signal conversion, PCM encoding, compression techniques, spatial audio concepts, video encoding standards such as H.264/H.265, streaming protocols, bitrate control, and practical optimization algorithms for both audio and video pipelines.

Audio Processing · Streaming Protocols · Video Encoding
0 likes · 49 min read
Kuaishou Tech
May 31, 2024 · Artificial Intelligence

Innovative Features and Technical Implementation of Huaisen K‑Song Community: Recording, Editing, and Smart Pitch Correction

This article details how Huaisen reshapes the karaoke workflow by introducing innovative features such as clear‑singing pitch‑finding, a comprehensive editing SDK, and intelligent pitch‑correction algorithms, explaining the underlying audio analysis, strategy generation, and system architecture that enhance user experience across recording, editing, and publishing stages.

AI · Audio Processing · Software Engineering
0 likes · 21 min read
Kuaishou Tech
May 22, 2024 · Mobile Development

Technical Deep Dive into the Music Bullet (Danmaku) System in a K‑Song Community

This article provides a comprehensive technical analysis of the music bullet feature in a K‑song community, detailing its core roles, client‑side production and consumption pipelines, real‑time mixing, alignment, volume balancing, precise seeking, performance optimizations, scalability, and sharing mechanisms across iOS and Android platforms.

Android · Audio Processing · iOS
0 likes · 18 min read
21CTO
May 14, 2024 · Artificial Intelligence

What Makes OpenAI’s New GPT‑4o a Game‑Changing Multimodal AI?

OpenAI’s latest flagship model GPT‑4o combines text, audio, image and video processing in a single, faster, cheaper multimodal system that delivers near‑human response times, expanded API access, and new safety measures, reshaping how developers and users interact with AI.

AI model · Audio Processing · GPT-4o
0 likes · 10 min read
DataFunSummit
Feb 4, 2024 · Mobile Development

Advanced Mobile Audio Recording Techniques in Quanmin K‑Song: Low Latency, High Fidelity, and Intelligent Audio Processing

The article details how Quanmin K‑Song has built a comprehensive mobile‑focused audio recording system since 2014, covering low‑latency capture, high‑quality sampling, lyric and vocal‑accompaniment alignment, ear‑return (in‑ear monitoring), pitch shifting, vocal enhancement, 3A processing, and AI‑driven scoring to deliver a professional karaoke experience on smartphones.

AI scoring · Audio Processing · Low latency
0 likes · 14 min read
Tencent Music Tech Team
Feb 4, 2024 · Mobile Development

Technical Guidelines for High-Quality Mobile Recording and Audio Processing in Quanmin K Song

Quanmin K Song’s decade‑long mobile‑recording platform combines 48 kHz/16‑bit dry‑signal capture, sub‑70 ms latency via OpenSL ES/AAudio, real‑time clipping and noise detection, lyric‑ and vocal‑accompaniment alignment, pitch‑shifting, adaptive vocal enhancement, 3A DSP/AI processing, and AI‑driven pitch correction to deliver industry‑leading high‑quality mobile singing experiences.

AI · Audio Processing · Low latency
0 likes · 15 min read
Kuaishou Tech
Dec 28, 2023 · Artificial Intelligence

Kuaishou Audio Team Wins ICASSP 2024 SSI and PLC Challenges with Advanced Speech Enhancement and Packet Loss Concealment

The Kuaishou audio team secured first place in both the ICASSP 2024 Speech Signal Improvement and Audio Deep Packet Loss Concealment challenges by deploying a two‑stage GAN‑based speech enhancement system and a hybrid time‑frequency packet‑loss concealment model that dramatically improve real‑time communication quality.

Audio Processing · GAN · ICASSP 2024
0 likes · 8 min read
Ximalaya Technology Team
Nov 17, 2023 · Cloud Computing

Technical Case Study of Cloud Audio Editing: Challenges, Solutions, and Optimization

The case study details how the Cloud Editing team tackled severe waveform loading delays, zoom lag, and inefficient IndexedDB storage by refactoring the processing pipeline, standardizing multi‑transaction storage, adding monitoring and cleanup tools, and rigorously testing releases, ultimately cutting processing times by over half and dramatically improving user experience.

Audio Processing · IndexedDB · Technical Case Study
0 likes · 9 min read
NetEase Smart Enterprise Tech+
Oct 11, 2023 · Backend Development

Decoupling Audio‑Video Algorithms: AVProcessEngine Reduces RTC SDK Size & Improves Performance

The article explains how NetEase Cloud Communication’s AVProcessEngine framework separates audio‑video algorithms from the NERTC SDK, addressing SDK bloat and performance drops on low‑end devices by using plugin‑based processing, dynamic algorithm adjustment, and unified interfaces.

Audio Processing · Performance Optimization · plugin architecture
0 likes · 11 min read
OPPO Kernel Craftsman
Jul 21, 2023 · Frontend Development

Audio Architecture and Quality Optimization in WebRTC: Devices, 3A Processing, Codec, NetEQ and Scenario‑Based Solutions

The article explains WebRTC’s audio pipeline—from device capture through hardware or software 3A (AEC, ANS, AGC), Opus codec selection, and NetEQ jitter‑buffer handling—detailing how device specifics and scenario‑based configurations (live streaming, meetings, education, watch‑together) affect quality and why pure‑software 3A is emerging as the preferred future solution.

3A · Audio Processing · NetEQ
0 likes · 29 min read
Baidu Geek Talk
Feb 15, 2023 · Artificial Intelligence

PaddlePaddle 2.4 Release: New Sparse, Graph, and Audio APIs

PaddlePaddle 2.4 introduces 167 new APIs—including sparse computing (paddle.sparse), graph learning (paddle.geometric), and audio processing (paddle.audio) modules—enabling efficient sparse model training and inference, graph message‑passing, advanced audio feature extraction, plus fresh loss functions, tensor utilities, and expanded vision transforms.

API Release · Audio Processing · Deep Learning
0 likes · 16 min read
DataFunSummit
Aug 8, 2022 · Artificial Intelligence

Voice Analysis for Financial Risk Control: Feature Extraction, Single-Channel Speech Separation, and Text Tagging

This talk presents the application of voice analysis in financial risk control, covering voice‑based risk feature extraction, single‑channel speech separation techniques, and speech‑text labeling methods, demonstrating how acoustic and textual cues can be leveraged to improve risk detection and model performance.

Audio Processing · machine learning · risk control
0 likes · 12 min read
NetEase Smart Enterprise Tech+
Apr 7, 2022 · Artificial Intelligence

How NetEase Cloud Communication Tackles Voice Reverberation with Adaptive Dual‑Mic Algorithms

This article explains the growing need for speech dereverberation in audio‑video conferencing, outlines the physical causes of reverberation, reviews historical research, and details NetEase Cloud's adaptive dual‑mic signal‑correlation approach, algorithm implementations, performance optimizations, and future directions.

Audio Processing · adaptive algorithms · dual-mic
0 likes · 8 min read
Alibaba Terminal Technology
Mar 1, 2022 · Frontend Development

How Alibaba Built a Web‑Based Short Video Editor: Front‑End Insights

This article details an Alibaba front‑end engineer’s approach to building a web‑based short video editor, covering the motivation, design principles, three‑layer architecture, script protocol, immutable data handling, audio‑video processing with WebCodecs and FFmpeg, rendering pipeline, and challenges of browser implementation.

Audio Processing · WebAssembly · WebCodecs
0 likes · 10 min read
ByteFE
Feb 21, 2022 · Frontend Development

ReolAudio: A Frontend‑Focused Audio Processing Library for Efficient Long‑Audio Editing

ReolAudio is a lightweight, JavaScript‑based library that replaces memory‑heavy AudioBuffer editing with streaming and random‑access decoding, frame‑based data structures, and a high‑performance AudioWorklet player, dramatically improving memory usage, start‑up time, and waveform rendering for long audio projects.

Audio Processing · frame based editing · streaming decoding
0 likes · 33 min read
DataFunSummit
Jan 16, 2022 · Artificial Intelligence

Multimodal Text and Speech Emotion Analysis: Overview, MSCNN‑SPU Model, and Domain Adaptation

This talk presents an overview of text‑plus‑speech multimodal emotion analysis, covering background, single‑modal text and audio models, the MSCNN‑SPU multimodal architecture, domain‑adaptation techniques, and future directions, with detailed model explanations, experimental results, and practical deployment insights.

Audio Processing · Deep Learning · multimodal emotion analysis
0 likes · 40 min read
DataFunTalk
Dec 14, 2021 · Artificial Intelligence

Speech Translation: Enterprise Applications and Research

This article presents an overview of speech translation, discusses its motivations and applications at ByteDance, compares cascade and end‑to‑end modeling approaches, introduces advanced encoder and decoder designs such as LUT, Chimera, and COSTT, outlines progressive multi‑task training and data‑augmentation strategies, and shares experimental results and Q&A.

AI · Audio Processing · end-to-end models
0 likes · 16 min read
Douyu Streaming
Dec 1, 2021 · Mobile Development

How to Get, Build, and Extend WebRTC m79 Source for Windows, Android, and iOS

This guide explains how to obtain the WebRTC m79 source, compile it for Windows, Android, and iOS, walk through the basic signaling and peer‑connection workflow, and implement advanced video‑capture and audio‑volume features with custom C++ extensions, while unifying the codebase across platforms.

Audio Processing · C · Compilation
0 likes · 19 min read
NetEase Smart Enterprise Tech+
Nov 1, 2021 · Artificial Intelligence

How AI is Transforming Real-Time Audio Communication: Challenges and Solutions

This article explores the evolution of AI audio algorithms in real‑time communication, detailing current trends, technical hurdles such as computational complexity and data scarcity, and practical solutions including lightweight models, data augmentation, and hybrid AI‑traditional pipelines, illustrated with real‑world NetEase Cloud IM case studies.

AI · Audio Processing · Voice Activity Detection
0 likes · 18 min read
High Availability Architecture
Oct 21, 2021 · Cloud Computing

Optimizing NetEase Cloud Music Audio/Video Processing Platform with Serverless

This article describes how NetEase Cloud Music leveraged Serverless function computing to redesign its audio/video algorithm processing platform, covering the existing challenges, the selection criteria for Serverless solutions, the implementation details, performance gains, cost savings, and future directions.

Audio Processing · Cloud Functions · NetEase
0 likes · 11 min read
Volcano Engine Developer Services
Oct 20, 2021 · Artificial Intelligence

How ByteDance’s AI Transforms Music Creation and Discovery on TikTok

ByteDance leverages advanced AI models such as SpecTNT, semi‑supervised music tagging transformers, language identification, chord recognition, contrastive representation learning, and source separation to power TikTok’s massive music library, enabling seamless music‑video interaction, smarter recommendations, and new creative tools for creators worldwide.

Audio Processing · Deep Learning · language identification
0 likes · 10 min read
ELab Team
Aug 18, 2021 · Frontend Development

Unlock Powerful Audio Effects with Web Audio API: A Hands‑On Guide

This article introduces the Web Audio API, covering AudioContext, audio nodes, routing graphs, various source types, processing nodes like AnalyserNode and BiquadFilterNode, as well as spatialization with PannerNode and convolution reverb, providing code examples and practical demos for frontend developers.

Audio Processing · AudioContext · JavaScript
0 likes · 16 min read
iQIYI Technical Product Team
Jun 11, 2021 · Artificial Intelligence

iQIYI M2VoC Multi‑Speaker Multi‑Style Voice Cloning Challenge at ICASSP 2021 – Overview and Results

The iQIYI M2VoC competition at ICASSP 2021, the first low‑resource multi‑speaker, multi‑style voice‑cloning challenge, attracted 153 academic and industry teams to tackle few‑shot (100 utterances) and extreme few‑shot (5 utterances) tracks, evaluated by professional listeners, yielding strong innovations and applications while confirming that single‑sample cloning remains unsolved.

AI · Audio Processing · ICASSP2021
0 likes · 7 min read
TAL Education Technology
Jun 3, 2021 · Frontend Development

Exploring Web Live Streaming: From Web 1.0 to 2.1 – Architecture, AI Integration, and Refactoring

This article chronicles the evolution of a web‑based live‑streaming platform from its initial Web 1.0 prototype through successive versions, detailing AI speech detection, RTC integration, extensive refactoring, and the resulting framework that proves web technologies can effectively support live‑streaming scenarios.

AI integration · Audio Processing · JavaScript
0 likes · 6 min read
Kuaishou Tech
May 17, 2021 · Industry Insights

How Kuaishou Delivered Real‑Time Deep‑Learning Voice Conversion on PC

Kuaishou becomes the first company to deploy a deep‑learning‑based real‑time voice‑conversion system on PC clients, delivering stable, natural‑sounding transformed speech with sub‑200 ms latency, and the article analyzes industry methods, technical challenges, and the four‑module architecture that made it possible.

Audio Processing · Deep Learning · Kuaishou
0 likes · 10 min read
Python Programming Learning Circle
Apr 22, 2020 · Artificial Intelligence

Python Audio‑Based Parkinson’s Disease Detection Using Machine Learning

This tutorial demonstrates how to build a Python library that extracts acoustic measurements from healthy and Parkinson’s disease audio recordings, constructs a machine‑learning dataset, trains a logistic‑regression classifier with scikit‑learn, evaluates its accuracy, and provides functions to load and use the trained model in other applications.

Audio Processing · Parkinson's Disease · Parselmouth
0 likes · 12 min read
Tencent Cloud Developer
Mar 19, 2020 · Artificial Intelligence

Real-Time Voice Communication Technologies and AI Enhancements in Tencent Meeting

Shang Shidong outlines Tencent Meeting’s shift from analog PSTN to IP‑based VoIP, using H.323, SIP, RTP/UDP and the Opus codec, while AI‑driven super‑resolution, deep‑learning packet‑loss concealment, advanced noise reduction, and speech‑music classification boost audio quality, complemented by reference‑free MOS assessment and future 5G‑enabled cloud, IoT and WebRTC integration.

AI · Audio Processing · RTP
0 likes · 30 min read
iQIYI Technical Product Team
Jan 9, 2020 · Artificial Intelligence

Voice Conversion (VC): Fundamentals, Progress, and Applications

Voice conversion (VC) technology changes a speaker’s timbre and style while keeping the spoken text unchanged, supporting one‑to‑one, many‑to‑one, and many‑to‑many scenarios for medical assistance and entertainment, using parallel or non‑parallel data through methods such as DTW‑aligned frame mapping, attention‑based neural networks, PPG‑LSTM pipelines, VAEs, normalizing‑flow models, and GANs, with iQIYI focusing on non‑parallel data, prosody preservation, and noise‑robust augmentation.

Audio Processing · Deep Learning · GAN
0 likes · 12 min read
Tencent Cloud Developer
Jul 11, 2019 · Industry Insights

How Real-Time Audio/Video Meets Traditional PSTN: Architecture and Low‑Latency Solutions

This article provides an in‑depth technical analysis of integrating real‑time audio/video (RTC) with legacy PSTN, covering latency sources, protocol and codec differences, adaptation layers, system architecture, and optimization techniques such as jitter buffering, ARQ/FEC, and automatic failover.

Audio Processing · Low latency · PSTN integration
0 likes · 17 min read
DataFunTalk
May 15, 2019 · Artificial Intelligence

AI‑Driven Audio Content Understanding and Safety for Live Streams

This article discusses using AI to automatically understand and secure audio content in live streams. It covers the challenges of manual audio analysis, outlines a four‑step pipeline (audio segmentation, speech‑to‑text, labeling, and synthesis), and describes models such as VAD, ASR, sound classification, text recognition, and behavior detection for live‑stream moderation.

AI · Audio Processing · Content Safety
0 likes · 11 min read
MaGe Linux Operations
Apr 20, 2019 · Artificial Intelligence

Master Python Speech Recognition: Install, Record, and Transcribe Audio

This comprehensive guide walks you through the fundamentals of speech recognition, explains how it works, compares Python packages, shows step‑by‑step installation of SpeechRecognition, demonstrates processing audio files and live microphone input, and offers techniques for handling noise and multilingual transcription.

Audio Processing · Python · SpeechRecognition
0 likes · 16 min read
Tencent Cloud Developer
Oct 10, 2018 · Artificial Intelligence

What Are the Real Challenges and Future Trends in Intelligent Voice Technology?

This article examines the current landscape of intelligent voice technology—including speech recognition, synthesis, voiceprint identification, and acoustic event detection—highlighting technical hurdles, evaluation metrics, recent advances such as WaveNet, and a wide range of practical applications from mobile devices to smart hardware and enterprise solutions.

Audio Processing · Speech synthesis · Tencent Cloud
0 likes · 16 min read
iQIYI Technical Product Team
Sep 14, 2018 · Artificial Intelligence

AI RAP: End-to-End Speech Synthesis for Rap Generation Using Location‑Sensitive Attention and Inference Mask

AI RAP is an end‑to‑end AI service that lets users generate personalized rap with a single click by combining location‑sensitive attention and an inference mask to achieve perfect alignment, beat‑synchronous timing, multi‑character voice timbres, sub‑second synthesis, and a scalable architecture supporting millions of daily users.

AI · Attention Mechanism · Audio Processing
0 likes · 5 min read
MaGe Linux Operations
May 30, 2018 · Artificial Intelligence

Master Python Speech Recognition: From Basics to Real-World Audio Transcription

This comprehensive guide walks you through the fundamentals of speech recognition, explains how Python’s SpeechRecognition library works, shows how to install and use various recognizer packages, process audio files and microphone input, handle noise, and troubleshoot common errors with clear code examples.

Audio Processing · SpeechRecognition · Voice Transcription
0 likes · 18 min read
Xianyu Technology
Apr 20, 2018 · Artificial Intelligence

Client‑Side Voice Recognition with TensorFlow Lite and MFCC Optimization

The paper presents a client‑side speech recognizer that uses a compact TensorFlow Lite Inception‑v3 CNN model combined with an optimized MFCC feature pipeline and ARM‑NEON‑accelerated, multi‑threaded processing, achieving low‑latency, high‑accuracy voice recognition on mobile and embedded devices.

Audio Processing · MFCC · Neural Networks
0 likes · 14 min read
Tencent Music Tech Team
May 5, 2017 · Mobile Development

Understanding iOS Core Audio: Definitions of Sample, Frame, and Packet

The article clarifies Apple’s Core Audio terminology—defining a sample as a single channel value, a frame as simultaneous samples, and a packet as one or more contiguous frames—explains why these terms are often confused across audio, networking, and codec contexts, and demonstrates the definitions with an MP3 parsing example.

Audio Processing · Core Audio · frame
0 likes · 11 min read