How Tencent Scaled Massive n‑gram Language Models for Real‑Time Speech Recognition

This article presents a distributed system that efficiently serves large-scale n‑gram language models for automatic speech recognition. By combining caching, a two-level distributed index, batch processing, and a cascading fault-tolerance mechanism, the system achieves robust scalability and low communication overhead in Tencent's WeChat ASR service.

WeChat Backend Team

Abstract

n‑gram language models are widely used in NLP tasks such as automatic speech recognition (ASR). Larger models rank candidates better but require far more memory, which can be addressed by distributing the model across nodes. This paper introduces a distributed system with novel optimizations (distributed indexing, batching, and caching) that reduce network traffic, together with a cascading fault-tolerance mechanism that handles network failures. Experiments on nine ASR datasets show the system scales efficiently and robustly, handling up to 100 million queries per minute in Tencent's WeChat ASR.

Keywords

n‑gram language model, distributed computing, speech recognition, WeChat

1. Introduction

Language models assign probabilities to word sequences, crucial for ranking candidates in ASR, machine translation, and information retrieval. While larger n‑gram models improve accuracy, they demand excessive memory, prompting distribution across multiple nodes, which introduces significant communication overhead and new bottlenecks.

2. Background

2.1 Language Model

A language model estimates the probability of a word sequence P(w_1…w_m). Using the Markov assumption, an n‑gram model approximates this probability based on the preceding n‑1 words.
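The Markov assumption can be sketched as follows: the sentence probability factors into conditional probabilities, each conditioned on at most the n−1 preceding words. The probability table below is a hypothetical toy example, not part of the paper.

```python
def ngram_sentence_prob(words, probs, n=3):
    """Approximate P(w_1..w_m) as the product of P(w_i | previous n-1 words)."""
    p = 1.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - n + 1):i])  # at most n-1 preceding words
        p *= probs.get((context, w), 0.0)            # smoothing would replace the 0.0
    return p

# Toy conditional-probability table (hypothetical values).
probs = {
    ((), "i"): 0.2,
    (("i",), "like"): 0.5,
    (("i", "like"), "tea"): 0.4,
}

p = ngram_sentence_prob(["i", "like", "tea"], probs)  # 0.2 * 0.5 * 0.4 ≈ 0.04
```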

2.2 Smoothing

To avoid assigning zero probability to unseen n‑grams, techniques such as back‑off and interpolated Kneser‑Ney smoothing redistribute probability mass from frequent n‑grams to rare and unseen ones.
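Back-off can be sketched in a few lines: if the full n-gram was never seen, multiply the back-off weight of its context by the probability of the shortened n-gram, recursively. The table below maps each n-gram tuple to a (probability, back-off weight) pair; the values and the unseen-word floor are hypothetical.

```python
def backoff_prob(ngram, table):
    """Back-off sketch: table maps n-gram tuple -> (prob, back-off weight)."""
    if ngram in table:
        return table[ngram][0]
    if len(ngram) == 1:
        return 1e-7  # floor for fully unseen words (hypothetical choice)
    context = ngram[:-1]
    bow = table.get(context, (0.0, 1.0))[1]  # unseen context: weight 1.0
    return bow * backoff_prob(ngram[1:], table)

# Hypothetical model entries.
table = {
    ("a",): (0.5, 0.4),
    ("b",): (0.3, 1.0),
    ("a", "b"): (0.2, 1.0),
}
```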

2.3 Training and Inference

Training counts n‑gram frequencies in a large corpus to produce conditional probabilities stored in an ARPA file. Inference retrieves these probabilities, or applies smoothing when an n‑gram is absent.
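For concreteness, a trimmed ARPA fragment looks like this (columns: log10 probability, n‑gram, optional back‑off weight; the entries are illustrative, not from the paper):

```
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0  hello  -0.5
-1.3  world  -0.4
-0.7  </s>

\2-grams:
-0.3  hello world
-0.6  world </s>

\end\
```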

3. Distributed System

The system, called DLM (Distributed Language Model), partitions the ARPA file across server nodes while client nodes cache short n‑grams.

3.1 Caching

Clients cache 1‑gram and 2‑gram statistics locally, eliminating network requests for low‑order n‑grams.
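A minimal sketch of the client-side split, assuming a `remote_lookup` stand-in for the network call (the class and its names are hypothetical):

```python
class NgramClient:
    """Serve 1-gram/2-gram lookups from a local cache; only 3-grams and
    longer go over the network via remote_lookup."""

    def __init__(self, low_order_cache, remote_lookup):
        self.cache = low_order_cache   # {ngram tuple: log-prob}
        self.remote = remote_lookup
        self.network_calls = 0

    def prob(self, ngram):
        if len(ngram) <= 2:            # low-order n-grams never hit the network
            return self.cache.get(ngram, float("-inf"))
        self.network_calls += 1
        return self.remote(ngram)
```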

3.2 Distributed Index

Global Index maps each n‑gram to a specific server based on hash functions of its last two words, ensuring that all statistics needed for a given n‑gram reside on the same server.
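The routing idea can be sketched as hashing only the last two words, so every n-gram sharing that suffix (including the shorter entries needed for back-off) lands on the same server. The use of MD5 here is an assumption for illustration; the paper only specifies hashing, not the hash function.

```python
import hashlib

def server_for(ngram, num_servers):
    """Route an n-gram by a hash of its last two words (hash choice assumed)."""
    key = " ".join(ngram[-2:]).encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_servers
```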

Local Index on each server is a suffix tree storing the probabilities and back‑off weights of its assigned n‑grams.

3.3 Batch Processing

Requests destined for the same server are merged into a single batch message, dramatically reducing the number of network messages.
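The batching step can be sketched as a simple group-by over the routing function: instead of one message per n-gram, each server receives one message carrying all of its pending lookups. The `route` callable is a stand-in for the global index.

```python
from collections import defaultdict

def make_batches(ngrams, num_servers, route):
    """Group pending lookups by destination server: one message per server."""
    batches = defaultdict(list)
    for g in ngrams:
        batches[route(g, num_servers)].append(g)
    return dict(batches)
```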

3.4 Fault Tolerance

A cascading mechanism switches to a smaller n‑gram model when network fault rates exceed thresholds, ensuring robustness without severely degrading ASR accuracy.
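The cascade can be sketched as an ordered list of (fault-rate threshold, model) pairs, from the largest model down to a local fallback; the thresholds and model names below are hypothetical, not the paper's settings.

```python
def pick_model(fault_rate, cascade):
    """Return the first model whose fault-rate threshold covers the current rate.

    cascade: list of (threshold, model), ordered from largest model to smallest.
    """
    for threshold, model in cascade:
        if fault_rate <= threshold:
            return model
    return cascade[-1][1]  # smallest model as last resort

# Hypothetical cascade: distributed large model -> smaller model -> local model.
cascade = [(0.01, "400GB-distributed"), (0.2, "50GB-distributed"), (1.0, "local-2gram")]
```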

4. Experiments

Experiments were conducted on a cluster of Xeon E5‑2670 V3 nodes (128 GB RAM, 10 Gbps network) using a 500 GB ARPA file trimmed to 50 GB–400 GB models. Nine ASR test sets were evaluated.

4.1 Scalability and Efficiency

Throughput scales linearly with the number of servers. The optimizations (caching, global index, batching) reduce network messages by up to 66 % and bring processing down to roughly 0.1 s per second of audio, i.e. a real-time factor of about 0.1.

4.2 Component Ablation

Caching 1‑gram/2‑gram saves 16 % of messages.

Global index reduces messages to one per n‑gram, saving ~66 %.

Batching halves the remaining messages.

4.3 Fault‑Tolerance Evaluation

Using the cascade, WER increases only slightly under high‑fault conditions, confirming the mechanism’s effectiveness.

5. Conclusion

Distributed n‑gram language models enable high‑accuracy NLP applications at scale. By combining caching, a two‑level distributed index, batch processing, and cascading fault tolerance, the proposed system achieves efficient, effective, and robust inference for large models, as demonstrated in Tencent's production WeChat ASR system.
