Unlocking PEGASUS: How EasyNLP Simplifies Text Summarization with Pre‑Training
This article explains the importance of text generation, introduces the PEGASUS model’s gap‑sentence pre‑training for abstractive summarization, and shows how the EasyNLP framework integrates PEGASUS and other Chinese and English summarization models with step‑by‑step installation, data preparation, and training commands.
Text generation is a key research direction in natural language processing with many practical applications, and abstractive summarization is an important sub‑task used for news headline generation, abstract creation, and keyword extraction.
Pre‑trained language models such as BERT, MASS, and UniLM perform well on many NLP tasks, but their token‑ or span‑level masking objectives are not well suited to generative summarization, which requires coarser‑grained semantic understanding at the sentence or paragraph level.
The PEGASUS model addresses this gap by introducing an unsupervised pre‑training task called Gap Sentence Generation (GSG), where several whole sentences are randomly masked in a document and the model learns to reconstruct them. This objective aligns closely with downstream summarization tasks, enabling strong performance after lightweight fine‑tuning.
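The GSG objective above can be sketched in a few lines. The snippet below builds a (source, target) pretraining pair from a list of sentences and a set of gap indices: selected sentences are replaced in the input by a sentence‑level mask token and concatenated to form the decoder target. The `[MASK1]` token string and the whitespace joining are simplifications for illustration, not the exact tokenizer‑level behavior of the released model.

```python
MASK1 = "[MASK1]"  # sentence-level mask token used by GSG in the paper

def make_gsg_example(sentences, gap_indices):
    """Build a (source, target) pair: gap sentences become the
    decoder target; the input keeps [MASK1] placeholders."""
    gaps = set(gap_indices)
    source = " ".join(MASK1 if i in gaps else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(gaps))
    return source, target

src, tgt = make_gsg_example(
    ["It rained all night.", "Roads flooded by morning.", "Schools stayed open."],
    gap_indices=[1],
)
print(src)  # It rained all night. [MASK1] Schools stayed open.
print(tgt)  # Roads flooded by morning.
```

Because the target is a set of whole, content‑bearing sentences rather than scattered sub‑words, the pretraining task already looks like abstractive summarization.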
PEGASUS uses a standard encoder‑decoder transformer architecture. It applies two masking strategies: sub‑word masking (used in BERT, denoted as mask2) and sentence‑level masking (GSG, denoted as mask1). For GSG, three sentence‑selection schemes are proposed: Random, Lead, and Importance‑based (Ind‑Orig), where importance scores are computed via ROUGE between a candidate sentence and the rest of the document. Experiments show the importance‑based scheme yields the best results.
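The importance‑based (Ind‑Orig) scheme can be approximated as follows: score each sentence independently by ROUGE against the rest of the document and keep the top‑scoring ones as gap sentences. This sketch uses a minimal unigram‑overlap ROUGE‑1 F1 rather than a full ROUGE implementation; the `ratio` of masked sentences and the toy document are illustrative values, not the paper's settings.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """ROUGE-1 F1 between two whitespace-tokenized strings."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def select_gap_sentences(sentences, ratio=0.3):
    """Ind-Orig style selection: score each sentence independently
    against the rest of the document, keep the top-scoring ones."""
    k = max(1, int(len(sentences) * ratio))
    scored = []
    for i, sent in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scored.append((rouge1_f(sent, rest), i))
    # take the k best scores, then restore document order
    top = sorted(sorted(scored, reverse=True)[:k], key=lambda t: t[1])
    return [i for _, i in top]

doc = [
    "the storm closed every major road in the county",
    "officials said the storm closed roads and damage will take weeks to repair",
    "a local bakery reopened after its annual holiday break",
    "repair crews said work on the roads has started",
]
print(select_gap_sentences(doc, ratio=0.3))  # [1]
```

Sentence 1 wins because it shares the most vocabulary with the rest of the document, which is exactly the intuition behind masking "summary‑like" sentences.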
EasyNLP is an Alibaba PAI‑based, PyTorch‑powered NLP toolkit that provides a one‑stop experience from training to deployment, supporting a variety of Chinese pre‑trained models and large‑model deployment techniques. It now integrates PEGASUS for text summarization and offers additional Chinese models such as mT5‑based summarizers and Randeng.
Installation
git clone https://github.com/alibaba/EasyNLP.git
cd EasyNLP
# Follow the README for environment setup

Data preparation
Prepare a TSV file with two columns separated by a tab: the first column contains the summary (title) and the second column contains the source text (content).
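A file in this layout can be produced with a short script like the one below. The file name matches the training command later in this article, but the (title, content) pairs here are placeholders, not real training data.

```python
# Minimal sketch: build a two-column TSV matching EasyNLP's expected
# layout (summary/title first, source text second, no header row).
pairs = [
    ("Storm closes county roads",
     "Officials said every major road in the county was closed after the storm."),
    ("Bakery reopens downtown",
     "A local bakery reopened on Monday after its annual holiday break."),
]

with open("cn_train.tsv", "w", encoding="utf-8") as f:
    for title, content in pairs:
        # Tabs or newlines inside a field would break the two-column
        # format, so normalize them to spaces first.
        clean = [s.replace("\t", " ").replace("\n", " ") for s in (title, content)]
        f.write("\t".join(clean) + "\n")
```
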
湖北:“四上企业”复工率已达93.8% 央视网消息:4月1日,记者从湖北省新冠肺炎疫情防控工作新闻发布会上获悉,...
(English gloss — title: "Hubei: work resumption rate of 'above‑designated‑size' enterprises has reached 93.8%"; content: "CCTV news: on April 1, reporters learned at Hubei Province's COVID‑19 prevention‑and‑control press conference that ...")

Training command (Chinese models)
python main.py \
--mode train \
--app_name=sequence_generation \
--worker_gpu=1 \
--tables=./cn_train.tsv,./cn_dev.tsv \
--input_schema=title_tokens:str:1,content_tokens:str:1 \
--first_sequence=content_tokens \
--second_sequence=title_tokens \
--label_name=title_tokens \
--checkpoint_dir=./finetuned_zh_model \
--micro_batch_size=8 \
--sequence_length=512 \
--epoch_num=1 \
--save_checkpoint_steps=150 \
--user_defined_parameters 'pretrain_model_name_or_path=alibaba-pai/mt5-title-generation-zh language=zh copy=false max_encoder_length=512 min_decoder_length=12 max_decoder_length=32 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5'

Prediction command (Chinese models)
python main.py \
--mode=predict \
--app_name=sequence_generation \
--worker_gpu=1 \
--tables=./cn_dev.tsv \
--outputs=./cn.preds.txt \
--input_schema=title:str:1,content:str:1,title_tokens:str:1,content_tokens:str:1,tag:str:1 \
--output_schema=predictions,beams \
--append_cols=content,title,tag \
--first_sequence=content_tokens \
--checkpoint_dir=./finetuned_zh_model \
--micro_batch_size=32 \
--sequence_length=512 \
--user_defined_parameters 'language=zh copy=false max_encoder_length=512 min_decoder_length=12 max_decoder_length=32 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5'

Performance of Chinese models on the news‑title and paper‑abstract datasets:
| Model | News Title (ROUGE-1/2/L) | Paper Abstract (ROUGE-1/2/L) |
| --- | --- | --- |
| hfl/randeng-238M-Summary-Chinese | 59.66/46.26/55.95 | 54.55/39.37/50.69 |
| hfl/randeng-523M-Summary-Chinese | 62.86/49.67/58.89 | 53.83/39.17/49.92 |
| alibaba-pai/mt5-title-generation-zh-275m | 62.35/48.63/58.96 | 54.28/40.26/50.55 |
| alibaba-pai/randeng-238M-Summary-Chinese-tuned | 64.31/51.80/60.97 | 58.83/45.28/55.72 |
| alibaba-pai/randeng-523M-Summary-Chinese-tuned | 64.76/51.65/61.06 | 59.27/45.58/55.92 |
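The ROUGE‑L column above is based on longest‑common‑subsequence overlap. A minimal word‑level re‑implementation is shown below; note that reported numbers typically come from standard ROUGE toolkits (with their own tokenization, and character‑level matching is common for Chinese), so this sketch conveys the metric's shape rather than reproducing the exact scores.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l_f(candidate, reference):
    """ROUGE-L F1: LCS-based precision/recall over word tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

score = rouge_l_f("the storm closed roads",
                  "the storm closed every road")
print(round(score, 4))  # 0.6667
```
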
English model training
wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/generation/en_train.tsv
wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/generation/en_dev.tsv
python main.py \
--mode train \
--app_name=sequence_generation \
--worker_gpu=1 \
--tables=./en_train.tsv,./en_dev.tsv \
--input_schema=title:str:1,content:str:1 \
--first_sequence=content \
--second_sequence=title \
--label_name=title \
--checkpoint_dir=./finetuned_en_model \
--micro_batch_size=1 \
--sequence_length=512 \
--epoch_num=1 \
--save_checkpoint_steps=500 \
--user_defined_parameters 'language=en pretrain_model_name_or_path=alibaba-pai/pegasus-summary-generation-en copy=false max_encoder_length=512 min_decoder_length=64 max_decoder_length=128 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5'

Performance of English models:
| Model | Summarization (ROUGE-1/2/L) |
| --- | --- |
| alibaba-pai/pegasus-summary-generation-en | 37.79/18.69/35.44 |
| hfl/brio-cnndm-uncased | 41.46/23.34/38.91 |
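The decoding parameters used in the commands above (`num_beams=5`, `no_repeat_ngram_size=2`) control beam search and n‑gram repetition blocking. As a rough illustration of the latter, the toy function below computes which next tokens would complete an n‑gram that already appeared in the generated prefix; it is a simplified sketch of the idea, not EasyNLP's actual decoding code.

```python
def banned_next_tokens(generated, n=2):
    """Tokens that would complete an n-gram already present in
    `generated` -- the idea behind no_repeat_ngram_size=n."""
    if len(generated) < n:
        return set()
    prefix = tuple(generated[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        # if an earlier (n-1)-gram matches the current prefix,
        # its following token may not be emitted again
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

# With n=2, after "the storm hit the" the decoder may not emit
# "storm" again, since the bigram "the storm" already occurred.
print(banned_next_tokens(["the", "storm", "hit", "the"], n=2))  # {'storm'}
```

Larger `no_repeat_ngram_size` values block only longer repeats and are therefore less restrictive; `2` is a fairly aggressive setting suited to short titles.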
Future outlook
We plan to integrate more Chinese knowledge‑enhanced pre‑trained models covering various NLU and NLG tasks into EasyNLP, as well as additional state‑of‑the‑art models for multilingual and multimodal applications.
References
Chengyu Wang, Minghui Qiu, Taolin Zhang, et al. EasyNLP: A Comprehensive and Easy‑to‑use Toolkit for Natural Language Processing. arXiv.
Zhang, Jingqing, et al. "PEGASUS: Pre‑training with Extracted Gap‑Sentences for Abstractive Summarization." ICML, 2020.
Xue, Linting, et al. "mT5: A massively multilingual pre‑trained text‑to‑text transformer." arXiv, 2020.
Lewis, Mike, et al. "BART: Denoising Sequence‑to‑Sequence Pre‑training for Natural Language Generation, Translation, and Comprehension." arXiv, 2019.
Song, Kaitao, et al. "MASS: Masked Sequence to Sequence Pre‑training for Language Generation." arXiv, 2019.
Dong, Li, et al. "Unified language model pre‑training for natural language understanding and generation." NeurIPS, 2019.
Liu, Yixin, et al. "BRIO: Bringing Order to Abstractive Summarization." ACL, 2022.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.