Deep Learning for Automatic Speech Recognition (ASR): From Mel Spectrograms to CTC Decoding
This article explains the end‑to‑end deep‑learning pipeline for speech‑to‑text, covering audio digitization, preprocessing with librosa, conversion to Mel spectrograms and MFCCs, data augmentation, a CNN‑RNN architecture, CTC loss, decoding strategies and evaluation with word error rate.
Voice assistants such as Google Home, Amazon Echo, Siri and Cortana have popularized speech‑to‑text technology, which converts spoken audio into written text. This article introduces a deep‑learning based approach to automatic speech recognition (ASR).
Audio Data and Preprocessing
Raw audio is a pressure waveform sampled over time; real recordings typically mix the target speech with background noise. An audio file (e.g., ".wav" or ".mp3") is loaded into a NumPy array in which each element is the amplitude at a specific time point: a 1‑D array for mono audio, or a 2‑D array with two channels for stereo. With a sampling rate of 44.1 kHz, one second of audio yields 44,100 samples per channel.
Because deep‑learning models require inputs of uniform size, the data is standardized: all clips are resampled to the same rate, converted to the same number of channels, and padded or truncated to a fixed duration. When audio quality is poor, noise‑removal algorithms are applied.
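As an illustration, the padding/truncation step can be sketched in a few lines. The 16 kHz rate and 4‑second duration below are hypothetical choices, and `librosa.load` already handles resampling and mono conversion at load time:

```python
import numpy as np

TARGET_SR = 16000            # hypothetical target sampling rate (Hz)
TARGET_LEN = 4 * TARGET_SR   # fix every clip to 4 seconds of samples

def standardize(samples: np.ndarray) -> np.ndarray:
    """Pad with silence or truncate so every clip has exactly TARGET_LEN samples.

    `samples` is a mono waveform, e.g. obtained with
    `y, sr = librosa.load(path, sr=TARGET_SR, mono=True)`,
    which also performs the resampling and channel conversion.
    """
    if len(samples) >= TARGET_LEN:
        return samples[:TARGET_LEN]                          # truncate long clips
    return np.pad(samples, (0, TARGET_LEN - len(samples)))   # zero-pad short ones

short_clip = np.ones(TARGET_SR)      # 1-second clip
long_clip = np.ones(6 * TARGET_SR)   # 6-second clip
assert standardize(short_clip).shape == (TARGET_LEN,)
assert standardize(long_clip).shape == (TARGET_LEN,)
```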
Data augmentation techniques such as small time shifts, pitch or speed changes, and SpecAugment (time and frequency masking) are used to increase diversity and improve generalization.
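A minimal sketch of SpecAugment‑style masking, assuming the spectrogram is a NumPy array of shape (n_mels, n_frames); the mask widths are hypothetical values, and the full method also applies several masks plus time warping:

```python
import numpy as np

rng = np.random.default_rng(0)

def spec_augment(spec: np.ndarray, max_f: int = 8, max_t: int = 20) -> np.ndarray:
    """Zero out one random frequency band and one random time band.

    `spec` has shape (n_mels, n_frames); `max_f` and `max_t` are the
    maximum mask widths (hypothetical values).
    """
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f = rng.integers(0, max_f + 1)           # frequency-mask width
    f0 = rng.integers(0, n_mels - f + 1)     # frequency-mask start
    spec[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_t + 1)           # time-mask width
    t0 = rng.integers(0, n_frames - t + 1)   # time-mask start
    spec[:, t0:t0 + t] = 0.0
    return spec

masked = spec_augment(np.ones((80, 100)))
assert masked.shape == (80, 100)
```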
Feature Extraction
The preprocessed audio is transformed into a Mel spectrogram, which visualizes the frequency content over time on a perceptually motivated (Mel) frequency scale. For human speech, the Mel spectrogram can be compressed further into Mel‑Frequency Cepstral Coefficients (MFCC): a small set of cepstral coefficients that capture the spectral envelope most relevant to the human voice.
Target Labels
The transcription target consists of sentences broken into characters. A vocabulary of characters is built and each character is mapped to an integer ID, forming the label sequence that the model must predict.
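A minimal sketch of such a vocabulary, reserving index 0 for the CTC blank token used later in decoding (the exact character set here is an assumption):

```python
# Index 0 is conventionally reserved for the CTC blank token ("-").
VOCAB = ["-"] + list("abcdefghijklmnopqrstuvwxyz '")
char_to_id = {c: i for i, c in enumerate(VOCAB)}
id_to_char = {i: c for c, i in char_to_id.items()}

def encode(text: str) -> list[int]:
    """Map a transcription to the integer label sequence the model predicts."""
    return [char_to_id[c] for c in text.lower()]

def decode(ids: list[int]) -> str:
    """Map integer IDs back to characters."""
    return "".join(id_to_char[i] for i in ids)

assert decode(encode("hello world")) == "hello world"
```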
Model Architecture
Two common ASR architectures are (1) a CNN followed by an RNN trained with CTC loss (e.g., Baidu’s Deep Speech) and (2) an attention‑based encoder‑decoder sequence‑to‑sequence model (e.g., Google’s Listen, Attend and Spell, LAS). This article adopts the first approach.
The model comprises:
A residual CNN that processes the spectrogram and outputs feature maps.
Several bidirectional LSTM layers that convert the feature maps into a sequence of frames, each representing a time step.
A linear layer with softmax that produces a probability distribution over the character vocabulary for each frame.
An optional linear projection between the CNN and RNN to match dimensions.
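A minimal PyTorch sketch of this stack; the layer sizes, the single residual convolution, and the 29‑character output are illustrative assumptions, not values from the article:

```python
import torch
import torch.nn as nn

class SpeechModel(nn.Module):
    """Residual CNN -> projection -> BiLSTM -> linear + log-softmax."""
    def __init__(self, n_mels: int = 80, n_classes: int = 29, hidden: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.res = nn.Conv2d(32, 32, kernel_size=3, padding=1)  # residual branch
        self.proj = nn.Linear(32 * n_mels, hidden)  # optional CNN->RNN projection
        self.rnn = nn.LSTM(hidden, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)  # 2x: forward + backward states

    def forward(self, spec):                    # spec: (batch, n_mels, time)
        x = spec.unsqueeze(1)                   # add a channel dim
        x = torch.relu(self.conv(x))
        x = x + torch.relu(self.res(x))         # residual connection
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # one vector per frame
        x = self.proj(x)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(dim=-1)   # (batch, time, n_classes)

out = SpeechModel()(torch.randn(2, 80, 100))
assert out.shape == (2, 100, 29)
```

Each output frame is a log‑probability distribution over the character vocabulary, which is exactly the input format CTC expects.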
Alignment Challenge and CTC
Because speech does not align neatly with characters—durations vary, gaps and pauses exist, characters may merge or repeat—explicit alignment is difficult. Connectionist Temporal Classification (CTC) solves this by automatically aligning input frames with output characters.
CTC operates in two modes:
CTC loss during training, where the network maximizes the probability of the correct transcription.
CTC decoding during inference, where the most probable character sequence is derived without a reference text.
Decoding steps:
Select the most probable character (including the blank token) for each frame (greedy selection). Example: "-G-o-ood".
Merge repeated characters that are not separated by blanks (e.g., "oo" → "o").
Remove all blank tokens, yielding the final transcription.
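The three decoding steps collapse into a few lines of pure Python; the character IDs in the example are made up for illustration:

```python
def ctc_greedy_decode(frame_ids: list[int], blank: int = 0) -> list[int]:
    """Collapse a per-frame argmax sequence: merge repeats, then drop blanks."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank:   # keep only changes that are not blank
            out.append(i)
        prev = i
    return out

# With blank=0 and illustrative IDs G=1, o=2, d=3, the frames "-G-o-ood":
frames = [0, 1, 0, 2, 0, 2, 2, 3]
assert ctc_greedy_decode(frames) == [1, 2, 2, 3]   # "Good"
```

Note that the blank between the two "o" frames is what allows the genuine double letter in "Good" to survive the merge step.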
CTC loss is the negative log of the total probability of the target transcription, summed over every valid alignment (path through the per‑frame character distributions) that collapses to the target text. In practice the algorithm restricts attention to the characters that appear in the target, enforces their order, and sums the probabilities of the remaining valid paths efficiently with dynamic programming (the forward algorithm).
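PyTorch's built‑in `nn.CTCLoss` implements this sum; a minimal usage sketch with illustrative sizes (random inputs stand in for model outputs and labels):

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)  # index 0 is the blank token

T, B, C = 50, 2, 29                                # frames, batch, vocab size
log_probs = torch.randn(T, B, C).log_softmax(-1)   # CTC expects (time, batch, classes)
targets = torch.randint(1, C, (B, 10))             # target IDs, never the blank
input_lengths = torch.full((B,), T)                # frames per utterance
target_lengths = torch.full((B,), 10)              # characters per transcript

loss = ctc(log_probs, targets, input_lengths, target_lengths)
assert loss.item() > 0   # negative log-likelihood of the targets
```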
Evaluation Metric
After training, performance is measured with Word Error Rate (WER): the number of word insertions, deletions and substitutions needed to turn the predicted text into the reference, divided by the number of words in the reference and expressed as a percentage.
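WER is a word‑level Levenshtein (edit) distance normalized by the reference length; a self‑contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

assert wer("the cat sat", "the bat sat") == 1 / 3   # one substitution
```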
Language Model and Beam Search
While CTC treats the output as a raw character sequence, integrating an external language model can improve transcription quality by biasing predictions toward more plausible word sequences. Beam Search, a heuristic that keeps multiple candidate paths during decoding, is commonly used to incorporate language model scores and achieve better results than greedy decoding.
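A simplified CTC prefix beam search, without a language model, can be sketched as follows; in a real system a weighted LM score would be added whenever a prefix is extended (shallow fusion). This is a pedagogical sketch, not a production decoder:

```python
import math
from collections import defaultdict

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    xs = [x for x in xs if x > -math.inf]
    if not xs:
        return -math.inf
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_beam_search(log_probs, beam_width=3, blank=0):
    """Prefix beam search over per-frame log-probability rows.

    Each prefix keeps two scores: log-probability of ending in a blank
    (p_b) and of ending in its last character (p_nb). A language model
    would contribute an extra score each time a prefix grows.
    """
    beams = {(): (0.0, -math.inf)}  # empty prefix "ends in blank" with prob 1
    for frame in log_probs:
        nxt = defaultdict(lambda: (-math.inf, -math.inf))
        for prefix, (p_b, p_nb) in beams.items():
            for c, p in enumerate(frame):
                if c == blank:                    # prefix unchanged, ends in blank
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + p, p_nb + p), nb)
                elif prefix and c == prefix[-1]:  # repeat of the last character:
                    b, nb = nxt[prefix]           # merged unless a blank intervened
                    nxt[prefix] = (b, logsumexp(nb, p_nb + p))
                    b2, nb2 = nxt[prefix + (c,)]
                    nxt[prefix + (c,)] = (b2, logsumexp(nb2, p_b + p))
                else:                             # extend with a new character
                    b2, nb2 = nxt[prefix + (c,)]
                    nxt[prefix + (c,)] = (b2, logsumexp(nb2, p_b + p, p_nb + p))
        beams = dict(sorted(nxt.items(),
                            key=lambda kv: -logsumexp(*kv[1]))[:beam_width])
    return max(beams, key=lambda k: logsumexp(*beams[k]))

# Two frames where the blank (0.6) beats character 1 (0.4) in every frame:
frames = [[math.log(0.6), math.log(0.4)]] * 2
assert ctc_beam_search(frames, beam_width=4) == (1,)
```

In this example greedy decoding would pick the blank at both frames and emit an empty string, while the beam search sums the three alignments that collapse to the single character (0.24 + 0.24 + 0.16 = 0.64) and correctly prefers it over the all‑blank path (0.36).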
Conclusion
The article has presented the building blocks and techniques for constructing an ASR system using deep learning, from audio preprocessing and feature extraction to a CNN‑RNN architecture, CTC alignment, decoding strategies, and evaluation.