Boosting Online Speech Recognition with Improved Latency‑Controlled BLSTM Models

This article explains how improved latency‑controlled BLSTM acoustic models can boost online speech‑recognition accuracy while cutting decoding computation, detailing two model refinements that achieve 40‑60% speed gains with minimal loss in recognition performance.

Alibaba Cloud Developer

Alibaba algorithm expert Kun Cheng presented the paper “Improving Latency‑Controlled BLSTM Acoustic Models for Online Speech Recognition” at ICASSP 2017.

The study targets high recognition accuracy in online settings by using a Latency‑Controlled BLSTM (LC‑BLSTM) acoustic model in place of a standard BLSTM, which must see the whole utterance before it can produce output.

Unlike a standard BLSTM, LC‑BLSTM updates the network with truncated BPTT over a central chunk plus an additional chunk to its right. The right chunk is used only to compute cell states, and errors are propagated only on the central chunk. During training, each update processes one short segment: the forward LSTM's cell state is carried over to the next segment, while the backward LSTM's cell state is reset to zero at the start of each segment.
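The chunking described above can be sketched as a simple index generator. This is a minimal illustration, not the paper's code: the function name `lc_chunks` and the parameters `nc` (central chunk length) and `nr` (right chunk length) are names assumed here for illustration.

```python
from typing import Iterator, Tuple


def lc_chunks(num_frames: int, nc: int, nr: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (start, center_end, right_end) frame indices for each chunk.

    Frames [start, center_end) form the central chunk, where outputs are
    produced and errors are propagated. Frames [center_end, right_end)
    form the right chunk, which only supplies future context.
    Hypothetical helper for illustration only.
    """
    for start in range(0, num_frames, nc):
        center_end = min(start + nc, num_frames)
        right_end = min(center_end + nr, num_frames)
        yield start, center_end, right_end


# Example: a 100-frame utterance with 30-frame central and right chunks.
chunks = list(lc_chunks(num_frames=100, nc=30, nr=30))
for start, center_end, right_end in chunks:
    print(f"central [{start}, {center_end})  right [{center_end}, {right_end})")
```

Note that the right chunk of one segment overlaps the central chunk of the next, so those frames are processed twice.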

This approach accelerates convergence and preserves BLSTM accuracy under acceptable decoding latency, making it suitable for online speech‑recognition services.

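However, decoding costs more because every chunk must also process its right chunk, which is often long. With a central chunk of Nc = 30 frames and a right chunk of Nr = 30 frames, each chunk processes 60 frames to emit 30 frames of output, so the computation can be twice that of a traditional BLSTM.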
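The factor of two follows from a rough frame count. The sketch below assumes that both LSTM directions process all Nc + Nr frames of every chunk and ignores shorter chunks at the end of an utterance; `lc_blstm_relative_cost` is a name made up for this example.

```python
def lc_blstm_relative_cost(nc: int, nr: int) -> float:
    """Frames processed per frame of output, relative to a whole-utterance BLSTM.

    Rough estimate only: each chunk processes nc + nr frames but emits
    outputs for just the nc central frames.
    """
    if nc <= 0 or nr < 0:
        raise ValueError("nc must be positive and nr non-negative")
    return (nc + nr) / nc


ratio = lc_blstm_relative_cost(nc=30, nr=30)
print(ratio)  # 2.0
```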

The paper proposes two improved LC‑BLSTM models that reduce decoding computation while maintaining accuracy, allowing a server to handle 1.5–2× more concurrent sessions.

In the first improvement, the forward LSTM skips the right chunk entirely. For the backward LSTM, the pass over the right chunk is replaced by a fully connected layer whose output, averaged over the right‑chunk frames, initializes the backward LSTM's state for the central chunk. Experiments on the Switchboard dataset show a decoding speedup of more than 40 % with comparable accuracy.
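The backward‑state initialization can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's implementation: the tanh activation, the helper names `fc_layer` and `backward_init_state`, and the zero fallback for an empty right chunk are all choices made here.

```python
import math
from typing import List

Vector = List[float]


def fc_layer(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """One fully connected layer; the tanh activation is an assumption."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def backward_init_state(
    right_frames: List[Vector], weights: List[Vector], bias: Vector
) -> Vector:
    """Average the per-frame layer outputs over the right chunk.

    The result stands in for the state the backward LSTM would otherwise
    reach by running over the right chunk frame by frame.
    """
    if not right_frames:
        # Assumed fallback: no future context, start from zeros.
        return [0.0] * len(bias)
    outputs = [fc_layer(frame, weights, bias) for frame in right_frames]
    return [sum(column) / len(outputs) for column in zip(*outputs)]


# Toy example: 2-dimensional features, 2-dimensional state.
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
right_chunk = [[0.5, -0.5], [-0.5, 0.5]]
init_state = backward_init_state(right_chunk, weights, bias)
```

Because each right‑chunk frame passes through the layer independently, the frames can be processed in parallel, unlike a recurrent pass that must step through them in order.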

The second improvement builds on the authors' observation that the backward LSTM contributes less than the forward LSTM. It replaces the backward‑moving LSTM with a simple RNN, which cuts computation further; Switchboard results show a decoding speedup of more than 60 % with only a slight loss in recognition accuracy.
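A rough operation count shows why swapping an LSTM for a simple RNN saves computation. The sketch assumes a standard LSTM with four gate‑sized matrix products and no peepholes or projection layer; the layer sizes are placeholders, not values from the paper, and the function names are made up for this example.

```python
def lstm_macs_per_frame(n_in: int, n_hidden: int) -> int:
    """Multiply-accumulates per frame for a standard LSTM layer.

    Counts the input, forget, and output gates plus the cell candidate,
    each needing an (n_in + n_hidden) x n_hidden matrix product.
    """
    return 4 * n_hidden * (n_in + n_hidden)


def simple_rnn_macs_per_frame(n_in: int, n_hidden: int) -> int:
    """Multiply-accumulates per frame for a simple (Elman) RNN layer."""
    return n_hidden * (n_in + n_hidden)


# Placeholder sizes chosen for illustration only.
n_in, n_hidden = 512, 512
macs_ratio = lstm_macs_per_frame(n_in, n_hidden) / simple_rnn_macs_per_frame(n_in, n_hidden)
print(macs_ratio)  # 4.0
```

This ratio applies only to the recurrent layer of the backward direction; it is not the whole‑model speedup reported in the experiments.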

The full paper is available for free download at the provided link.
