High‑Quality Automatic Speech Recognition (ASR) Solutions at Bilibili: Data, Model, and Deployment Optimizations
Bilibili’s high‑quality ASR system combines large‑scale filtered business data, semi‑supervised Noisy‑Student training, an end‑to‑end CTC model with lattice‑free MMI decoding, and FP16‑optimized FasterTransformer inference on Triton, delivering top‑ranked accuracy, low latency, and scalable deployment for diverse Chinese‑English video content.
Automatic Speech Recognition (ASR) technology has been deployed at Bilibili for large‑scale business scenarios such as audio‑video content safety review, AI subtitles (C‑side, Bianjian, S12 live), and video understanding (full‑text retrieval).
Bilibili’s ASR engine also achieved the first place in the 2022 SpeechIO benchmark ( https://github.com/SpeechColab/Leaderboard ).
Ranking across all test sets:

| Rank | Vendor | CER |
|------|--------|-----|
| 1 | Bilibili (B站) | 2.82% |
| 2 | Alibaba Cloud (阿里云) | 2.85% |
| 3 | Yitu (依图) | 3.16% |
| 4 | Microsoft (微软) | 3.28% |
| 5 | Tencent (腾讯) | 3.85% |
| 6 | iFlytek (讯飞) | 4.05% |
| 7 | AISpeech (思必驰) | 5.19% |
| 8 | Baidu (百度) | 8.14% |
A high‑quality, cost‑effective ASR engine for industrial production should have the following characteristics:
High accuracy and robustness in target business scenarios.
High performance: low latency, fast speed, and low resource consumption.
High scalability: efficient support for business‑driven customization and rapid updates.
Data cold‑start is a major challenge because the ASR system requires a large and diverse training set (different acoustic environments, domains, and accents). Bilibili faces three specific difficulties:
Cold start: only a tiny amount of open‑source data is initially available, and purchased data has low relevance to the business.
Broad domain coverage: dozens of content categories demand high data diversity.
Mixed Chinese‑English content: many user‑generated videos contain both languages.
Solutions include business-data filtering (cleaning timestamps, aligning sentences, and normalizing numerals) and semi-supervised Noisy Student Training (NST). Approximately 500k raw videos were filtered to produce roughly 40,000 hours of automatically labeled data, which, combined with 15,000 hours of manually labeled data, improved recognition accuracy by about 15%.
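The NST loop above can be sketched in a few lines. This is a toy illustration, not Bilibili's pipeline: the `train` stand-in, the fixed confidence score, and the threshold are all invented for demonstration; the real system uses an ASR model, beam-search confidence, data filtering, and noise/augmentation between iterations.

```python
# Toy sketch of the Noisy Student Training (NST) loop: a teacher pseudo-labels
# unlabeled data, confident labels are kept, and a student retrains on the union.

def train(labeled):
    """Stand-in for supervised training: returns a 'model' that remembers labels."""
    memory = dict(labeled)
    def model(x):
        # Return (hypothesis, confidence); unseen inputs get a low-confidence guess.
        if x in memory:
            return memory[x], 1.0
        return x.upper(), 0.6  # toy pseudo-label with a fixed confidence
    return model

def nst(labeled, unlabeled, iterations=3, threshold=0.5):
    model = train(labeled)  # teacher trained on manually labeled data
    for _ in range(iterations):
        # 1. Teacher pseudo-labels the unlabeled pool.
        pseudo = [(x, *model(x)) for x in unlabeled]
        # 2. Keep only confident pseudo-labels (data filtering).
        kept = [(x, y) for x, y, conf in pseudo if conf >= threshold]
        # 3. Student retrains on manual + filtered pseudo-labels
        #    (with input noise / augmentation in the real recipe).
        model = train(labeled + kept)
    return model

model = nst(labeled=[("ni hao", "hello")], unlabeled=["bilibili"])
```

In practice each iteration also regrows the filtered pool, which is how ~500k raw videos were distilled into ~40,000 hours of usable training data.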
ASR technology evolution can be divided into three stages:
1993‑2009: HMM‑GMM era, slow progress, high word error rates.
2009‑2015: Deep learning rise, HMM‑DNN hybrid models, significant accuracy gains.
2015‑present: End‑to‑end (E2E) models (CTC, AED, RNNT) with large, complex networks, sometimes surpassing human performance.
Comparison of hybrid and end-to-end frameworks:

| | Hybrid framework | End-to-end (E2E) framework |
|---|---|---|
| Representative toolkits | HTK, Kaldi | ESPnet, WeNet, DeepSpeech, K2 |
| Implementation languages | C/C++, Shell | Python, Shell |
| Underlying framework | Developed from scratch | TensorFlow / PyTorch |
Typical CER results (%) on representative datasets:

| Model | Librispeech | GigaSpeech | Aishell-1 | WenetSpeech |
|---|---|---|---|---|
| Hybrid (Kaldi Chain + LM) | 3.06 | 14.84 | 7.43 | 12.83 |
| E2E-AED | 11.8 | 6.6 | 4.72 | — |
| E2E-RNNT | 12.4 | — | — | — |
| Optimized E2E-CTC | 7.1 | 5.8 | — | — |
Based on the analysis, Bilibili adopts an end‑to‑end CTC system with a dynamic decoder to meet high‑throughput, low‑latency, and high‑accuracy requirements across diverse scenarios.
End‑to‑end lattice‑free MMI discriminative training further improves timestamp accuracy and overall CER. Results on Bilibili’s video test set:
| Model | CER (%) |
|---|---|
| CTC baseline | 6.96 |
| Traditional DT | 6.63 |
| E2E LF-MMI DT | 6.13 |
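For context, MMI discriminative training maximizes the posterior of the reference transcript against all competing hypotheses; in the lattice-free variant the denominator is computed over a phone-level graph rather than word lattices. A standard form of the objective (symbols are the conventional ones, not taken from this article) is:

```latex
\mathcal{F}_{\mathrm{MMI}}
= \sum_{u=1}^{U} \log
\frac{P(X_u \mid \mathcal{M}_{W_u})\, P(W_u)}
     {\sum_{W} P(X_u \mid \mathcal{M}_{W})\, P(W)}
```

where $X_u$ is the acoustics of utterance $u$, $W_u$ its reference word sequence, and $\mathcal{M}_W$ the model for word sequence $W$; the denominator sums over all competing sequences.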
The end‑to‑end decoder, based on beam search, consumes only about 1/5 of the resources of a traditional WFST decoder and is 5× faster, while being easier to customize with external language models.
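The beam-search decoder mentioned above is, at its core, CTC prefix beam search. Below is a compact, self-contained sketch of that standard algorithm (not Bilibili's production decoder, which also supports external language-model customization); the vocabulary indexing and `blank=0` convention are assumptions for the example.

```python
import math
from collections import defaultdict

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    if m == -math.inf:
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_beam_search(log_probs, beam=4, blank=0):
    """Prefix beam search over per-frame CTC log-posteriors.

    log_probs: list of frames, each a list of log-probabilities per symbol
    (index `blank` is the CTC blank). Returns the best collapsed label sequence.
    """
    # Each prefix maps to (log p ending in blank, log p ending in non-blank).
    beams = {(): (0.0, -math.inf)}
    for frame in log_probs:
        next_beams = defaultdict(lambda: (-math.inf, -math.inf))
        for prefix, (p_b, p_nb) in beams.items():
            for s, lp in enumerate(frame):
                if s == blank:
                    # Blank extends the prefix without emitting a symbol.
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (logsumexp(nb_b, p_b + lp, p_nb + lp), nb_nb)
                elif prefix and prefix[-1] == s:
                    # Repeated symbol: only a path that passed through blank
                    # can emit it again as a new token.
                    nb_b, nb_nb = next_beams[prefix + (s,)]
                    next_beams[prefix + (s,)] = (nb_b, logsumexp(nb_nb, p_b + lp))
                    # Staying on the same symbol keeps the prefix unchanged.
                    sb_b, sb_nb = next_beams[prefix]
                    next_beams[prefix] = (sb_b, logsumexp(sb_nb, p_nb + lp))
                else:
                    nb_b, nb_nb = next_beams[prefix + (s,)]
                    next_beams[prefix + (s,)] = (
                        nb_b, logsumexp(nb_nb, p_b + lp, p_nb + lp))
        # Prune to the top `beam` prefixes by total probability.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: -logsumexp(*kv[1]))[:beam])
    best = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))
    return list(best[0])
```

An external language model is typically folded in at the point where a prefix is extended by a new symbol, which is why this decoder structure is easy to customize compared with a precompiled WFST graph.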
Model inference is optimized by using FP16 precision, converting the model to FasterTransformer, and deploying with Triton for automatic batch sizing. On a single NVIDIA T4 GPU, throughput increases by 2× and speed improves by 30 % (≈3000 h of audio per hour).
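As a concrete illustration, Triton's `dynamic_batching` block is what provides the automatic batch sizing described above. The fragment below is a minimal `config.pbtxt` sketch; the model name, batch sizes, and queue delay are illustrative values, not Bilibili's actual settings.

```protobuf
name: "asr_encoder"
backend: "fastertransformer"   # FasterTransformer backend for Triton
max_batch_size: 64

# Triton coalesces concurrent requests into larger batches automatically.
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 5000
}

instance_group [
  { count: 1, kind: KIND_GPU }
]
```

Larger preferred batch sizes trade a few milliseconds of queueing delay for much higher GPU utilization, which is where most of the throughput gain on the T4 comes from.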
Summary
This article presents Bilibili’s end‑to‑end ASR solution, covering data cold‑start, semi‑supervised training, model algorithm optimizations, decoder design, and inference deployment. Future work includes hot‑word integration, entity‑level accuracy improvements, and real‑time streaming ASR for games and sports events.
References
A. Baevski, H. Zhou, et al., “wav2vec 2.0: A Framework for Self‑Supervised Learning of Speech Representations.”
A. Baevski, W. Hsu, et al., “data2vec: A General Framework for Self‑supervised Learning in Speech, Vision and Language.”
D. S. Park, Y. Zhang, et al., “Improved Noisy Student Training for Automatic Speech Recognition.”
C. Lüscher, E. Beck, et al., “RWTH ASR Systems for LibriSpeech: Hybrid vs Attention – w/o Data Augmentation.”
R. Prabhavalkar, K. Rao, et al., “A Comparison of Sequence‑to‑Sequence Models for Speech Recognition.”
D. Povey, V. Peddinti, et al., “Purely sequence‑trained neural networks for ASR based on lattice‑free MMI.”
H. Xiang, Z. Ou, “CRF‑Based Single‑Stage Acoustic Modeling with CTC Topology.”
Z. Chen, W. Deng, et al., “Phone Synchronous Decoding with CTC Lattice.”
https://github.com/NVIDIA/FasterTransformer
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.