
Open‑sourcing kaldi‑ctc: Fast GPU‑Accelerated CTC End‑to‑End Speech Recognition

The article announces the open‑source release of kaldi‑ctc, a GPU‑accelerated CTC‑based end‑to‑end speech recognition toolkit built on Kaldi, warp‑ctc and cuDNN, highlighting its 5‑6× training speedup, real‑time decoding factor of 0.02, and performance comparisons on the LibriSpeech corpus.

Liulishuo Tech Team

Recently, Liulishuo (English Fluency) officially open-sourced kaldi-ctc, which can be used to build Connectionist Temporal Classification (CTC) end-to-end speech recognition systems based on Kaldi, warp-ctc and cuDNN.

Both training and decoding are extremely fast: the cuDNN-based LSTM-RNN trains about 5–6 times faster than the original kaldi/src/nnet/lstm implementation (the open-source release supports only cuDNN RNNs) and supports multi-GPU training. With frame_subsampling_factor set to 3, decoding achieves a real-time factor (RTF) of 0.02.
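To make the RTF figure concrete, here is a minimal illustrative sketch (not taken from the kaldi-ctc codebase): the real-time factor is simply decoding time divided by audio duration, so an RTF of 0.02 means one hour of audio is decoded in about 72 seconds.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent decoding divided by audio duration."""
    return processing_seconds / audio_seconds

# One hour of audio decoded in 72 seconds gives the RTF quoted above.
print(real_time_factor(72.0, 3600.0))  # 0.02
```

With frame_subsampling_factor = 3, only every third frame is evaluated by the network, which is one contributor to this low RTF.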

Peak Phenomenon

Fig1. Cross-Entropy RNN Softmax probabilities

Fig2. CTC-RNN Softmax probabilities (blank shown)

Fig3. CTC-RNN Softmax probabilities (blank not plotted)

RNN models trained with the CTC criterion exhibit a clear peak phenomenon (Fig2, Fig3), markedly different from RNN models trained with Cross-Entropy (Fig1). In most frames the Softmax probability at the blank position is near 1.0, so these frames can be skipped during decoding without searching the network.
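The frame-skipping idea can be sketched as follows. This is an illustrative toy, not kaldi-ctc's actual decoder: blank is assumed to be index 0, and the 0.98 threshold is an arbitrary example. Frames whose posterior mass is almost entirely on blank cannot change the label sequence, so the search only visits the remaining frames.

```python
import numpy as np

BLANK = 0  # assumed blank index for this sketch

def frames_to_search(posteriors: np.ndarray, blank_threshold: float = 0.98):
    """Return indices of frames whose blank probability is below threshold;
    all other (blank-dominated) frames are skipped by the decoder."""
    return [t for t, frame in enumerate(posteriors)
            if frame[BLANK] < blank_threshold]

# Toy posteriors over 5 frames and 3 symbols (blank, "a", "b"):
post = np.array([
    [0.99, 0.005, 0.005],   # blank-dominated -> skipped
    [0.01, 0.98, 0.01],     # peak on "a"     -> searched
    [0.99, 0.005, 0.005],   # skipped
    [0.02, 0.01, 0.97],     # peak on "b"     -> searched
    [0.995, 0.003, 0.002],  # skipped
])
print(frames_to_search(post))  # [1, 3]
```

With peaky CTC posteriors like these, most frames fall below the search threshold, which is why the article can report that over 80% of frames are skipped outright.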

Reasons for the huge decoding speed boost in CTC‑ASR

Using monophones (Google has gradually moved to single-state triphones) or characters as modeling units reduces the number of states.

Skipping blank-dominated frames during decoding; over 80% of frames are skipped outright.

LibriSpeech example script

During training, the accuracy metric (on the unique phone sequence) is computed as 1 − PhoneErrorRate, where the hypothesis is the deduplicated sequence of the highest-probability phone taken from the RNN Softmax output at each frame.
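The metric just described can be sketched in a few lines. This is a hedged illustration of the idea (exact kaldi-ctc details may differ): take the argmax symbol per frame, collapse repeats and drop blanks (greedy CTC decoding), then score 1 − PhoneErrorRate against the reference using edit distance.

```python
BLANK = 0  # assumed blank index for this sketch

def collapse(frame_argmax):
    """Collapse repeated symbols and remove blanks (greedy CTC decoding)."""
    out, prev = [], None
    for s in frame_argmax:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out

def edit_distance(a, b):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

def accuracy(frame_argmax, reference):
    """1 - PhoneErrorRate of the collapsed hypothesis vs. the reference."""
    hyp = collapse(frame_argmax)
    return 1.0 - edit_distance(hyp, reference) / len(reference)

# Per-frame argmax (0 = blank); the hypothesis collapses to [3, 5, 7]:
print(accuracy([0, 3, 3, 0, 5, 0, 0, 7, 7, 0], [3, 5, 7]))  # 1.0
```

The collapse step is what makes the sequence "unique": repeated frame-level outputs of the same phone count once, mirroring how CTC maps frame alignments to label sequences.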

On this dataset, the CTC-ASR system still trails the LFMMI-ASR (chain) system by a noticeable margin. The current CTC-ASR training data contains many out-of-vocabulary words (OOVs); since training-data quality strongly affects end-to-end system performance, this needs to be fixed.

On larger datasets, the CTC‑ASR system outperforms LFMMI‑ASR.

Tags: deep learning, GPU, Speech Recognition, ASR, CTC, Kaldi