How Improved Latency‑Controlled BLSTM Models Boost Online Speech Recognition Efficiency

This article explains how latency‑controlled BLSTM acoustic models were refined to accelerate online speech recognition while preserving accuracy. It covers the training strategy, the computational trade‑offs, and two model variants that decode more than 40% and more than 60% faster, respectively.

Alibaba Cloud Developer

Mar 17, 2017

Improving Latency‑Controlled BLSTM for Online Speech Recognition

Alibaba algorithm expert Kun Cheng presented the paper “Improving Latency‑Controlled BLSTM Acoustic Models for Online Speech Recognition” at ICASSP 2017, aiming to enhance recognition accuracy by employing latency‑controlled BLSTM (LC‑BLSTM) acoustic models.

Unlike a standard BLSTM, which trains and decodes on whole utterances, LC‑BLSTM is updated with a truncated BPTT approach over short segments, each made up of a central chunk of Nc frames followed by a right-hand auxiliary chunk of Nr frames. During training, each update processes one such segment; the auxiliary chunk contributes only to the cell state computation, while errors are back-propagated only on the central chunk. For the forward‑moving network, the cell state from the previous segment initializes the next segment; for the backward‑moving network, the cell state is reset to zero at the start of each segment. This method speeds up convergence and improves performance.
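The segmentation described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the function name `lc_segments` and the tuple layout are hypothetical, and `nc` and `nr` stand for the central and auxiliary chunk lengths in frames.

```python
from typing import List, Tuple


def lc_segments(num_frames: int, nc: int, nr: int) -> List[Tuple[int, int, int]]:
    """Split an utterance into LC-BLSTM-style segments.

    Each tuple is (start, central_end, aux_end) using half-open ranges:
    frames [start, central_end) form the central chunk, and frames
    [central_end, aux_end) form the right-hand auxiliary chunk.
    """
    if nc <= 0 or nr < 0:
        raise ValueError("nc must be positive and nr must be non-negative")
    segments = []
    for start in range(0, num_frames, nc):
        central_end = min(start + nc, num_frames)
        # The auxiliary chunk is truncated at the end of the utterance.
        aux_end = min(central_end + nr, num_frames)
        segments.append((start, central_end, aux_end))
    return segments


# Example: a 100-frame utterance with Nc = 30 and Nr = 30.
for start, central_end, aux_end in lc_segments(100, 30, 30):
    print(f"central [{start}, {central_end})  auxiliary [{central_end}, {aux_end})")
```

The central chunks tile the utterance without overlap, while each auxiliary chunk covers frames that are processed again as part of the next segment's central chunk.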

During decoding, data is handled the same way as in training, but the lengths of the central and auxiliary chunks can each be adjusted as needed, allowing flexibility without sacrificing accuracy. LC‑BLSTM maintains BLSTM‑level recognition performance under acceptable decoding latency, making it suitable for online services.

However, the auxiliary chunk adds substantial computational cost because it is also processed by the BLSTM. For example, with Nc = 30 and Nr = 30, every 30 frames of output require processing 60 frames, so the computation is twice that of a conventional BLSTM.
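The arithmetic behind that example can be written out as a small helper. This is a rough sketch that assumes every frame costs the same whether it sits in the central or the auxiliary chunk, and it ignores any non-recurrent layers; the function name `relative_blstm_cost` is hypothetical.

```python
def relative_blstm_cost(nc: int, nr: int) -> float:
    """Frames processed per frame of output, relative to a whole-utterance BLSTM.

    Each segment runs the BLSTM over nc + nr frames but only the nc
    central frames produce output.
    """
    if nc <= 0 or nr < 0:
        raise ValueError("nc must be positive and nr must be non-negative")
    return (nc + nr) / nc


for nc, nr in [(30, 30), (30, 15), (60, 30)]:
    print(f"Nc={nc}, Nr={nr}: {relative_blstm_cost(nc, nr):.2f}x")
```

Under this assumption, shrinking Nr or growing Nc lowers the overhead, at the price of less right context or higher latency respectively.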

The paper proposes two improved LC‑BLSTM models that retain accuracy while reducing decoding computation, enabling a single server to handle 1.5–2× more concurrent requests.

First improvement: For the forward‑moving LSTM, the computation over the right-hand auxiliary chunk is removed. For the backward‑moving LSTM, the auxiliary chunk is simplified by replacing the LSTM with a fully‑connected layer, whose outputs are averaged to initialize the backward pass over the central chunk. Experiments on the Switchboard dataset show over 40% faster decoding with comparable accuracy.
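A minimal sketch of the backward-direction idea in the first improvement is shown below. It assumes a single fully-connected layer with a tanh activation applied to each auxiliary frame, followed by a mean over frames; the activation, the function names, and the choice of which backward state is initialized are assumptions for illustration, not details confirmed by the article.

```python
import math
from typing import List, Sequence


def fc_forward(frame: Sequence[float],
               weights: Sequence[Sequence[float]],
               bias: Sequence[float]) -> List[float]:
    """One fully-connected layer: tanh(W @ frame + b). The tanh is an assumption."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, frame)) + b)
        for row, b in zip(weights, bias)
    ]


def backward_init_from_aux(aux_frames: Sequence[Sequence[float]],
                           weights: Sequence[Sequence[float]],
                           bias: Sequence[float]) -> List[float]:
    """Average the layer outputs over the auxiliary chunk.

    The result stands in for the state that the backward LSTM would
    otherwise have computed by running over the auxiliary frames.
    """
    if not aux_frames:
        # No right context (end of utterance): fall back to a zero state.
        return [0.0] * len(bias)
    outputs = [fc_forward(frame, weights, bias) for frame in aux_frames]
    return [sum(column) / len(outputs) for column in zip(*outputs)]


# Toy example: 2-dimensional frames, identity weights, zero bias.
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
aux = [[0.5, -0.5], [-0.5, 0.5]]
print(backward_init_from_aux(aux, weights, bias))
```

Because each auxiliary frame is handled independently, this step has no recurrence over the auxiliary chunk, which is where the saving over a full LSTM pass comes from.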

Second improvement: The auxiliary chunk computation is again removed for the forward‑moving LSTM, and for the backward direction the LSTM over the auxiliary chunk is replaced with a simple RNN, which has far lower computational cost. This yields more than 60% faster decoding on Switchboard with only a slight loss in recognition rate.
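To see why a simple RNN is cheaper than an LSTM per frame, the multiply-accumulate counts of the two cells can be compared. The sketch below assumes a textbook LSTM with four affine transforms (input gate, forget gate, output gate, cell candidate) and no peepholes or projection layer, and it ignores biases and element-wise operations; the layer sizes are hypothetical and are not taken from the paper.

```python
def lstm_macs_per_frame(input_dim: int, hidden_dim: int) -> int:
    """Multiply-accumulates for one LSTM step: four affine transforms,
    each mapping the concatenated [input, hidden] vector to hidden_dim units."""
    return 4 * hidden_dim * (input_dim + hidden_dim)


def simple_rnn_macs_per_frame(input_dim: int, hidden_dim: int) -> int:
    """Multiply-accumulates for one simple RNN step: a single affine transform."""
    return hidden_dim * (input_dim + hidden_dim)


input_dim, hidden_dim = 512, 512  # hypothetical sizes for illustration
lstm = lstm_macs_per_frame(input_dim, hidden_dim)
rnn = simple_rnn_macs_per_frame(input_dim, hidden_dim)
print(f"LSTM: {lstm} MACs/frame, simple RNN: {rnn} MACs/frame, ratio {lstm / rnn:.1f}x")
```

This ratio applies only to the frames where the substitution is made, so it does not by itself predict the overall decoding speedup reported above.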

Full paper (PDF) can be downloaded at: http://download.taobaocdn.com/freedom/42562/pdf/p1bbah8vsqfhef711bcs1jqt14k54.pdf
