
Paraformer: An Industrial Non‑Autoregressive End‑to‑End Speech Recognition Model

This article introduces the Paraformer model released by Alibaba DAMO Academy on ModelScope, detailing its non‑autoregressive architecture, training strategies, performance on benchmark datasets, and step‑by‑step guidance for fine‑tuning and deploying the model using FunASR and ModelScope pipelines.

DataFunSummit

The article opens with an overview of ModelScope's speech model ecosystem, which offers more than 50 industrial‑grade models covering speech recognition, synthesis, wake‑up, signal processing, and spoken language processing. It also explains the motivation for open‑sourcing these models: fostering AI innovation by putting production‑quality systems in developers' hands.

It then focuses on Paraformer, a non‑autoregressive end‑to‑end speech recognition model, describing the three challenges such models must solve: predicting the output length accurately, extracting the right encoder representations to feed the decoder, and strengthening the modeling of dependencies among output tokens.

The Paraformer architecture consists of five components: an Encoder (a SAN‑M structure with local memory blocks), a Predictor (using CIF, continuous integrate‑and‑fire, to predict the token count and extract token embeddings), a Sampler (sampling token representations), a Decoder (supporting various decoding strategies, including the glancing language model (GLM) based approach used by Paraformer), and a multi‑loss training scheme combining MAE, CE, and MWER losses.
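To make the Predictor concrete, here is a minimal sketch of the CIF firing mechanism in plain Python. The function name, threshold value, and data layout are illustrative assumptions, not FunASR's actual API; real implementations operate on tensors and learn the per‑frame weights.

```python
# Minimal sketch of CIF (continuous integrate-and-fire): per-frame weights
# are accumulated, and each time the accumulator crosses a threshold a token
# "fires", pooling the frames it covered into one token embedding.
# The predicted token count is approximately sum(alphas).

def cif(frames, alphas, threshold=1.0):
    """frames: list of per-frame feature vectors (lists of floats);
    alphas: per-frame weights in [0, 1] predicted by the model."""
    dim = len(frames[0])
    tokens = []
    acc = 0.0                       # weight integrated since the last firing
    cur = [0.0] * dim               # weighted frame sum for the current token
    for frame, a in zip(frames, alphas):
        if acc + a < threshold:     # keep integrating into the current token
            acc += a
            cur = [c + a * f for c, f in zip(cur, frame)]
        else:                       # fire: split this frame's weight
            used = threshold - acc
            cur = [c + used * f for c, f in zip(cur, frame)]
            tokens.append(cur)
            acc = a - used          # leftover weight starts the next token
            cur = [acc * f for f in frame]
    return tokens

# Two frames, each carrying weight 1.0, fire two tokens.
emb = cif([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

Because the token count is just the (rounded) sum of the weights, the MAE loss in the multi‑loss scheme can supervise it directly against the ground‑truth transcript length.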

Paraformer‑large, the flagship model, features 50 encoder layers, 60 decoder layers, and 220 M parameters; 6× down‑sampling of the encoder output cuts computation roughly six‑fold, yielding a 5–10× inference speedup on GPUs while matching the accuracy of the cloud service.

Training data combines high‑quality annotated speech from multiple domains (telephony, live streaming, meetings, etc.) with low‑cost data generated via OCR‑ASR cross‑validation, and employs strategies such as layer‑wise learning rates, random layer and head dropping to improve robustness.
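One of the robustness strategies mentioned above, random layer dropping (also known as stochastic depth), can be sketched in a few lines. The function name, drop probability, and toy layers are illustrative assumptions; FunASR's actual implementation works on transformer blocks with residual connections.

```python
# Hedged sketch of random layer dropping: during training, each layer is
# skipped with probability p_drop, so the model learns not to over-rely on
# any single layer. At inference time all layers run.
import random

def forward_with_layerdrop(x, layers, p_drop=0.1, training=True, rng=None):
    rng = rng or random.Random()
    for layer in layers:
        if training and rng.random() < p_drop:
            continue                # skip this layer for this forward pass
        x = layer(x)
    return x

# Toy example: three "layers" that each add 1.
layers = [lambda v: v + 1] * 3
out = forward_with_layerdrop(0, layers, training=False)  # all layers run -> 3
```

Random attention‑head dropping follows the same pattern, masking a random subset of heads instead of whole layers.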

Benchmark results show Paraformer‑large attaining state‑of‑the‑art performance on AISHELL‑1, AISHELL‑2, and WenetSpeech, and ranking first on the SpeechIO leaderboard, with significant CER reductions on both public and private datasets.

The article also guides users on how to experience and fine‑tune Paraformer via the ModelScope community: selecting the model from the speech‑recognition category, trying the provided demo, preparing data (text annotations and wav.scp), adjusting training parameters (e.g., setting dataset_type="large"), and running the training script.
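The two data files mentioned above follow the Kaldi‑style convention of one utterance per line, keyed by an utterance ID. The IDs, paths, and transcripts below are illustrative placeholders:

```
# wav.scp — "<utt_id> <audio_path>" pairs (paths are illustrative)
utt_001 /data/train/utt_001.wav
utt_002 /data/train/utt_002.wav

# text — matching "<utt_id> <transcript>" annotations
utt_001 欢迎使用语音识别模型
utt_002 这是一条训练样本
```

The utterance IDs must match across the two files so the loader can pair each audio file with its transcript.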

After fine‑tuning, inference can be performed through ModelScope pipelines supporting various audio inputs (wav, pcm, URLs, binary data). Users can combine multiple models (VAD, punctuation, LM) within a single pipeline, enable timestamp output, and export the model to ONNX or TorchScript for runtime deployment.
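A typical inference call through a ModelScope pipeline looks like the following sketch. The model ID and the audio‑input keyword follow ModelScope's documented convention, but both have changed across versions, so verify them against your installed modelscope/funasr release:

```python
# Hedged sketch: running Paraformer inference via a ModelScope pipeline.
# Requires `pip install modelscope funasr` and downloads the model on first use.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch')

# Accepts a local wav/pcm path, a URL, or raw bytes, per the article.
result = inference_pipeline(audio_in='example.wav')
print(result['text'])
```

Swapping in a fine‑tuned checkpoint is a matter of pointing the `model` argument at the local output directory of the training script.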

FunASR, the underlying training framework, offers recipes for both academic and industrial models, supports additional speech tasks (VAD, punctuation, data2vec pre‑training, speaker verification), and provides data loaders for large‑scale datasets and diverse audio formats (including MP3 and raw bytes).

Overall, the article demonstrates how Paraformer bridges the gap between research and production, enabling developers to leverage a high‑performance, open‑source ASR solution for various applications.

Tags: deep learning, Speech Recognition, ASR, ModelScope, non-autoregressive, Paraformer
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
