Peak-First Regularization for Low-Latency Streaming Speech Recognition

The paper presents a low‑latency streaming speech‑recognition solution that reframes latency reduction as a knowledge‑distillation task, using a simple peak‑first regularization term to shift CTC output probabilities leftward and achieve up to 200 ms average latency reduction without harming word error rate.

Meituan Technology Team

1. Introduction

Human‑machine voice interaction demands both high accuracy and minimal word‑output latency. Traditional non‑streaming ASR waits for utterance completion, causing long delays, while streaming ASR can return partial results in real time. The Meituan Voice Interaction team targets this latency problem in various business scenarios (customer service, phone marketing, etc.).

1.1 Speech‑Recognition Background

Most production systems use Connectionist Temporal Classification (CTC) models because they map acoustic frames to text without requiring an encoder‑decoder or attention mechanism. A typical DFSMN‑CTC architecture consists of an acoustic encoder and a linear output layer. The model predicts a probability distribution over tokens for each frame; the time gap between the end of a spoken token and the peak of its predicted probability is called output latency (or peak latency).

1.2 Problem and Challenges

Low latency improves user experience, reduces misunderstandings, and frees up computational budget for more complex downstream models. However, CTC outputs contain many admissible decoding paths, some of which incur higher latency.

2. Peak‑First Regularization Method

2.1 CTC Model Basics

CTC inserts a blank token φ to align acoustic frames (typically 10 ms apart) with a shorter text sequence. The loss is computed via a forward‑backward dynamic‑programming algorithm that sums probabilities of all valid paths.
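As a concrete illustration, the forward-backward summation over all valid alignments is available as a built-in loss in PyTorch; the sketch below assumes random frame posteriors and an arbitrary target sequence purely for demonstration.

```python
import torch
import torch.nn as nn

# Minimal CTC loss computation; blank token is index 0 by convention.
ctc_loss = nn.CTCLoss(blank=0, reduction="mean")

T, N, C = 50, 1, 30          # frames, batch size, vocabulary size (incl. blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # (T, N, C) frame posteriors
targets = torch.tensor([[5, 12, 7, 3]])               # token ids (no blanks)
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([4])

# The forward-backward dynamic program that sums the probabilities of all
# valid alignment paths is handled internally by CTCLoss.
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```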

2.2 Peak‑First Regularization Description

Analysis of CTC path spaces shows that low‑latency paths have probability peaks occurring earlier in time. The authors hypothesize that shifting the entire probability distribution leftward will reduce latency. To achieve this, they introduce a regularization term called Peak‑First Regularization (PFR) that applies knowledge distillation between adjacent frames: each frame’s probability distribution is forced to mimic the distribution of the next frame. This encourages the model to move probability mass forward, effectively shifting peaks earlier.

The overall training objective becomes: Loss = CTC_Loss + λ·PFR_Loss where λ balances the original CTC loss and the regularization term to avoid collapse.
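The adjacent-frame distillation can be sketched as a KL-divergence term in which each frame's distribution is pulled toward a detached copy of the next frame's distribution. The function and variable names below (`peak_first_reg`, `lambda_`) are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def peak_first_reg(log_probs: torch.Tensor) -> torch.Tensor:
    """Peak-First Regularization sketch: each frame distills from its
    right-hand (next) neighbor, pulling probability peaks earlier in time.

    log_probs: (T, N, C) per-frame log posteriors from the CTC output layer.
    """
    student = log_probs[:-1]                 # frames 0 .. T-2
    teacher = log_probs[1:].detach().exp()   # frames 1 .. T-1, gradient blocked
    # KL(teacher || student), summed over classes, averaged over the time axis
    return F.kl_div(student, teacher, reduction="batchmean")

# Combined objective; lambda_ trades latency reduction against accuracy:
# total_loss = ctc_loss + lambda_ * peak_first_reg(log_probs)
```

Detaching the teacher is what prevents the trivial solution in which both frames simply agree on an arbitrary distribution; only the earlier frame is moved.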

2.3 Gradient Analysis

When the next frame predicts some token with high probability, the distillation gradient on the current frame is large, pulling that probability peak one frame earlier. When the next frame is dominated by the blank token, the gradient is small and the distribution barely moves. Although each frame learns only from its immediate neighbor (a shift of roughly 40 ms at the encoder's output rate), these shifts compound over many training steps, which explains why the final latency reduction far exceeds 40 ms.
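This gradient behavior can be checked numerically: for KL distillation, the gradient on the current frame's logits is the difference between its distribution and the teacher's, so a sharp next-frame peak produces a large pull while a uniform (blank-like) next frame produces almost none. The helper below is a self-contained illustration, not code from the paper.

```python
import torch
import torch.nn.functional as F

def grad_norm_toward_next(cur_logits, next_logits):
    """Gradient magnitude on the current frame when it distills
    from a (detached) next-frame distribution."""
    cur_logits = cur_logits.clone().requires_grad_(True)
    teacher = next_logits.softmax(-1).detach()
    loss = F.kl_div(cur_logits.log_softmax(-1), teacher, reduction="sum")
    loss.backward()
    return cur_logits.grad.norm().item()

flat = torch.zeros(5)                         # current frame: uninformative
peaked = torch.tensor([0., 0., 8., 0., 0.])   # next frame: sharp token peak
blank = torch.zeros(5)                        # next frame: uniform, blank-like

print(grad_norm_toward_next(flat, peaked))  # large pull toward the peak
print(grad_norm_toward_next(flat, blank))   # near-zero gradient, little shift
```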

3. Related Work

The authors categorize existing latency‑reduction techniques into four groups:

Force Alignment: uses external alignments to penalize delayed paths.

Path Decomposition (e.g., FastEmit): re-weights low-latency paths during RNN-T training.

Minimum Bayes Risk: adds a latency-related risk term to the loss.

Self-Alignment: selects low-latency decoded paths as regularizers but incurs heavy online decoding cost.

Compared with these, PFR requires no external alignments, no complex loss redesign, and works for both streaming and non‑streaming models.

4. Evaluation Metrics

Character Error Rate (CER): the standard edit-distance-based accuracy metric.

Average Peak Latency (APL): the average time difference between the end of a token's acoustic span and the first peak of its predicted probability.

PR50 / PR90: the 50th and 90th percentiles of per-sentence peak latency, measuring tail behavior of the latency distribution.
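Given per-token end times and peak times, these metrics reduce to simple differences and percentiles. The sketch below computes them per token for brevity, whereas PR50/PR90 in the paper are taken over per-sentence latencies; the timestamps are hypothetical.

```python
import numpy as np

def latency_metrics(token_end_ms, peak_ms):
    """Average Peak Latency plus 50th/90th latency percentiles.

    token_end_ms: ground-truth end time of each token's acoustic span (ms).
    peak_ms: time of the first probability peak emitted for that token (ms).
    """
    lat = np.asarray(peak_ms, dtype=float) - np.asarray(token_end_ms, dtype=float)
    return {
        "APL":  lat.mean(),               # average peak latency
        "PR50": np.percentile(lat, 50),   # median latency
        "PR90": np.percentile(lat, 90),   # tail latency
    }

# Hypothetical end/peak timestamps for five tokens
m = latency_metrics([100, 300, 520, 760, 990],
                    [180, 420, 600, 900, 1100])
print(m)
```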

5. Experiments and Analysis

5.1 Experimental Setup

Experiments use the AISHELL‑1 Mandarin dataset. Both a streaming and a non‑streaming Transformer‑based CTC model are built (12 encoder layers, 2‑D convolution front‑end). The streaming model relies on a 510 ms acoustic context.

5.2 Latency Comparison

Results on the test set show that adding PFR consistently reduces all latency metrics. With an appropriate λ, the streaming model’s average latency drops by 101 ms and the non‑streaming model by 149 ms, while CER remains within an acceptable range. Larger λ values further lower latency (up to >200 ms) but eventually increase CER because the model over‑prioritizes low‑latency paths and loses acoustic context.

5.3 Visualization

Probability‑distribution visualizations illustrate that, after applying PFR, peaks move closer to the left edge of the corresponding acoustic spans for both streaming and non‑streaming models. Higher regularization weights produce larger shifts.

6. Conclusion and Outlook

The study demonstrates that treating CTC output latency as a knowledge‑distillation problem and applying peak‑first regularization effectively reduces word‑output latency without complex loss engineering or external alignment data. The method is simple, scalable, and may be extensible to other streaming models such as Transducer architectures.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
