Open‑sourcing kaldi‑ctc: Fast GPU‑Accelerated CTC End‑to‑End Speech Recognition
The article announces the open‑source release of kaldi‑ctc, a GPU‑accelerated CTC‑based end‑to‑end speech recognition toolkit built on Kaldi, warp‑ctc and cuDNN, highlighting its 5‑6× training speedup, real‑time decoding factor of 0.02, and performance comparisons on the LibriSpeech corpus.
Recently, Liulishuo (English Liulishuo) officially open‑sourced kaldi‑ctc, which can be used to build Connectionist Temporal Classification (CTC) end‑to‑end speech recognition systems based on Kaldi, warp‑ctc and cuDNN.
Both training and decoding are extremely fast: cuDNN‑based LSTM‑RNN training is about 5–6 times faster than the original kaldi/src/nnet/lstm implementation (the open‑source release only supports cuDNN RNNs), and multi‑GPU training is supported. With frame_subsampling_factor set to 3, decoding achieves a real‑time factor of 0.02.
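To make the real‑time‑factor claim concrete, here is a minimal sketch of what RTF measures; the function name is illustrative, not part of kaldi‑ctc:

```python
def real_time_factor(decode_seconds, audio_seconds):
    """RTF = wall-clock time spent decoding / duration of the audio decoded."""
    return decode_seconds / audio_seconds

# An RTF of 0.02 means one hour of audio decodes in about 72 seconds:
print(real_time_factor(72.0, 3600.0))  # -> 0.02
```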
Peak Phenomenon
Fig1. Softmax probabilities of a Cross‑Entropy‑trained RNN
Fig2. Softmax probabilities of a CTC‑trained RNN (blank included)
Fig3. Softmax probabilities of a CTC‑trained RNN (blank not plotted)
RNN models trained with the CTC criterion exhibit a pronounced peak phenomenon (Fig2, Fig3), markedly different from Cross‑Entropy‑trained RNN models (Fig1). On most frames the Softmax probability of the blank symbol is near 1.0, so these frames can be skipped during decoding without searching the decoding network.
Reasons for the huge decoding speed boost in CTC‑ASR
Using monophones (Google has gradually moved to single‑state triphones) or characters as modeling units reduces the number of states.
Skipping frames during decoding: over 80% of frames are skipped outright.
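The frame‑skipping idea above can be sketched as follows. The blank index, threshold value, and function name are illustrative assumptions, not part of the kaldi‑ctc API:

```python
import numpy as np

BLANK = 0               # assumed CTC blank index (illustrative)
BLANK_THRESHOLD = 0.98  # skip frames whose blank probability exceeds this

def frames_to_search(softmax_posteriors):
    """Return indices of frames that must actually be searched during decoding.

    softmax_posteriors: (num_frames, num_symbols) array of per-frame Softmax
    outputs from a CTC-trained RNN. Because of the peak phenomenon, most
    frames are dominated by the blank symbol and can be skipped entirely.
    """
    keep = softmax_posteriors[:, BLANK] < BLANK_THRESHOLD
    return np.nonzero(keep)[0]

# Toy posteriors over 5 frames and 3 symbols (blank, A, B):
post = np.array([
    [0.990, 0.005, 0.005],  # blank peak -> skipped
    [0.010, 0.970, 0.020],  # symbol A peak -> searched
    [0.995, 0.003, 0.002],  # blank peak -> skipped
    [0.020, 0.010, 0.970],  # symbol B peak -> searched
    [0.990, 0.010, 0.000],  # blank peak -> skipped
])
print(frames_to_search(post))  # -> [1 3]
```

With posteriors like these, three of the five frames are never searched, which is where the bulk of the decoding speedup comes from.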
Librispeech example script
During training, an accuracy metric (Unique Phone Sequence) is computed as 1 − PhoneErrorRate, where the hypothesis is the per‑frame argmax of the RNN Softmax output with repeats merged and blanks removed.
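A minimal sketch of that metric, assuming blank index 0 and a standard greedy CTC collapse; the function names are illustrative, not kaldi‑ctc's:

```python
import numpy as np

BLANK = 0  # assumed CTC blank index (illustrative)

def collapse(frame_argmax):
    """Greedy CTC collapse: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for s in frame_argmax:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out

def edit_distance(a, b):
    """Levenshtein distance between two sequences (rolling-array DP)."""
    d = np.arange(len(b) + 1)
    for i, x in enumerate(a, 1):
        prev_diag, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev_diag, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                        prev_diag + (x != y))
    return int(d[-1])

def accuracy(frame_argmax, reference):
    """1 - PhoneErrorRate of the collapsed greedy sequence vs the reference."""
    hyp = collapse(frame_argmax)
    return 1.0 - edit_distance(hyp, reference) / len(reference)

# Per-frame argmax over 8 frames; reference phone sequence [1, 2, 3]:
print(accuracy([0, 1, 1, 0, 2, 0, 3, 3], [1, 2, 3]))  # -> 1.0
```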
On this dataset, the CTC‑ASR system lags noticeably behind the LFMMI‑ASR (chain) system. The CTC‑ASR training data currently contains many OOVs; training‑data quality is known to strongly affect end‑to‑end system performance, so this needs to be fixed.
On larger datasets, the CTC‑ASR system outperforms LFMMI‑ASR.
Liulishuo Tech Team
Help everyone become a global citizen!