Paraformer: An Industrial Non‑Autoregressive End‑to‑End Speech Recognition Model and Its Deployment on ModelScope
This article introduces the Paraformer non‑autoregressive end‑to‑end speech recognition model released by Alibaba DAMO Academy, details its architecture, training strategies, large‑scale performance, and provides step‑by‑step guidance for using and fine‑tuning the model on the ModelScope platform with the FunASR toolkit.
The article begins with a brief history of speech AI, highlighting the evolution from early isolated word recognizers to modern deep‑learning based systems and the emergence of industrial‑grade models such as those offered by Alibaba DAMO Academy through the ModelScope community.
ModelScope hosts a comprehensive suite of over 50 speech‑related models, including more than 30 speech‑recognition models covering multiple languages, as well as synthesis, wake‑up, signal‑processing, and spoken‑language models.
Paraformer is presented as a non‑autoregressive end‑to‑end ASR model that addresses three core challenges: accurate length prediction, extraction of encoder hidden representations for the decoder, and enhancing internal dependency modeling to reduce substitution errors.
The Paraformer architecture consists of five components—Encoder, Predictor, Sampler, Decoder, and Loss. The Encoder uses a SAN‑M structure, self‑attention augmented with a memory block for local modeling; the Predictor employs CIF (Continuous Integrate‑and‑Fire) to estimate the output token count and extract acoustic embeddings; during training, the Sampler mixes acoustic embeddings with target token embeddings in a glancing‑style strategy to strengthen context modeling; the Decoder processes these embeddings in parallel; and the training loss combines MAE (on the predicted token count), CE, and MWER‑based objectives.
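The CIF firing rule at the heart of the Predictor can be illustrated with a minimal, self‑contained sketch. Function and variable names here are illustrative, not taken from the FunASR codebase; the real predictor operates on learned per‑frame weights and also aggregates frame hidden vectors into token embeddings, which this toy version omits.

```python
def cif_fire(weights, threshold=1.0):
    """Integrate per-frame weights and emit a "fire" (a token boundary)
    each time the accumulator crosses the threshold.

    Returns the predicted token count and the frame indices where fires occur.
    """
    acc = 0.0
    fires = []
    for i, w in enumerate(weights):
        acc += w
        while acc >= threshold:  # a frame with a large weight may fire more than once
            acc -= threshold
            fires.append(i)
    return len(fires), fires

# Example: seven frames whose weights sum to about 3.0, so CIF predicts
# roughly three output tokens for this utterance segment.
n_tokens, boundaries = cif_fire([0.3, 0.4, 0.5, 0.6, 0.2, 0.5, 0.5])
```

Because the number of fires equals the predicted token count, CIF gives the non‑autoregressive decoder its output length in a single pass, sidestepping the length‑prediction problem that plagues naive parallel decoding.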
Paraformer‑large, an industrial‑scale variant, features 50 encoder layers, 60 decoder layers, and 220 M parameters, achieving up to six‑fold computational reduction and 5‑10× inference speedup while delivering state‑of‑the‑art accuracy on benchmarks such as AISHELL‑1, AISHELL‑2, and WenetSpeech.
Training data combines high‑quality labeled corpora with low‑cost OCR‑ASR cross‑validated data, and advanced strategies like layer‑wise learning rates, random layer dropping, and head pruning further improve robustness.
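The layer‑wise learning‑rate strategy mentioned above can be sketched as follows. This is a generic illustration under the common assumption of exponential decay from top to bottom; the decay factor, grouping, and function name are hypothetical, not the actual FunASR training configuration.

```python
def layerwise_lrs(base_lr, num_layers, decay=0.9):
    """Assign smaller learning rates to lower (earlier) layers.

    The top layer trains at base_lr; layer i trains at
    base_lr * decay ** (top - i), so well-initialized low-level
    representations are perturbed less during fine-tuning.
    """
    top = num_layers - 1
    return [base_lr * (decay ** (top - i)) for i in range(num_layers)]

# Example: a 4-layer stack where the bottom layer gets the smallest rate.
lrs = layerwise_lrs(1e-3, 4)
```

In a real fine‑tuning run these per‑layer rates would be passed to the optimizer as separate parameter groups.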
The article provides a practical guide for accessing Paraformer on ModelScope, experimenting with the online demo, and fine‑tuning the model on private datasets using the open‑source FunASR toolkit, which supplies recipes, data loaders, and support for various audio formats.
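For fine‑tuning on a private dataset, FunASR recipes typically consume Kaldi‑style manifests: a `wav.scp` file mapping utterance IDs to audio paths and a `text` file mapping the same IDs to transcripts. The sketch below writes such a pair; the utterance IDs, audio paths, and transcripts are made up for illustration, and the exact manifest layout expected by a given recipe should be checked against the FunASR documentation.

```python
import os
import tempfile

# Hypothetical utterances: ID -> (audio path, transcript).
utts = {
    "utt001": ("/data/audio/utt001.wav", "今天天气怎么样"),
    "utt002": ("/data/audio/utt002.wav", "打开车窗"),
}

def write_manifests(utts, out_dir):
    """Write Kaldi-style wav.scp and text manifests for an utterance dict."""
    with open(os.path.join(out_dir, "wav.scp"), "w", encoding="utf-8") as f_wav, \
         open(os.path.join(out_dir, "text"), "w", encoding="utf-8") as f_txt:
        for utt_id, (path, transcript) in sorted(utts.items()):
            f_wav.write(f"{utt_id} {path}\n")
            f_txt.write(f"{utt_id} {transcript}\n")

out_dir = tempfile.mkdtemp()
write_manifests(utts, out_dir)
```

Keeping the two files keyed by the same sorted utterance IDs is what lets the data loader pair each waveform with its transcript.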
Deployment options include exporting to ONNX or TorchScript for runtime inference, with on‑CPU tests showing a roughly three‑fold speed increase over the standard pipeline, plus support for gRPC services.
Performance evaluations demonstrate significant character error rate reductions on public and private test sets, as well as improved keyword recall after domain‑specific fine‑tuning.
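Since the evaluations above are reported as character error rates, a self‑contained sketch of how CER is computed may be useful: it is the Levenshtein edit distance between the reference and hypothesis character sequences, normalized by the reference length. This is the standard definition, not a function from any FunASR utility.

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein distance between the reference
    and hypothesis strings, divided by the reference length."""
    m, n = len(ref), len(hyp)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution (or match)
        prev = cur
    return prev[n] / m if m else 0.0
```

For example, deleting one character from a six‑character reference yields a CER of 1/6. Substitution errors, the category Paraformer's glancing‑style sampler targets, dominate this count in most NAR baselines.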
Finally, the article outlines the customization workflow for other speech models (synthesis, denoising, wake‑up) within ModelScope, encouraging community participation and further development.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.