MT-BERT: Pre‑training and Fine‑tuning Practices at Meituan‑Dianping
MT-BERT at Meituan-Dianping combines mixed-precision training, domain-adapted continual pre-training, knowledge-graph-aware masking, and model compression to produce fast, accurate BERT models that power fine-grained sentiment analysis, query intent classification, recommendation-reason classification, and other NLP tasks across the platform.
Background – Since 2018, pre-trained language models (ELMo, ULMFiT, GPT, BERT) have become the dominant paradigm in NLP. Pre-training on massive unlabeled corpora learns rich semantic representations, and the resulting model is then fine-tuned on downstream tasks with a small amount of labeled data.
The success of pre‑training in computer vision (ImageNet, ResNet, etc.) inspired similar research in NLP. Word embeddings such as Word2Vec and GloVe are early forms of pre‑training, but they lack contextual information and cannot resolve polysemy.
Context‑aware models like Context2Vec and ELMo use bidirectional LSTMs to capture context. The later “fine‑tuning” approach (GPT, BERT) trains deep Transformers on large corpora and directly adapts them to downstream tasks.
BERT Model Overview – BERT is a deep bidirectional Transformer encoder. It consists of multiple stacked Transformer layers and uses two pre‑training objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). The model’s input representation is the sum of token, segment, and position embeddings, with special tokens [CLS] and [SEP].
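The input representation described above can be sketched in a few lines. This is a toy illustration, not the MT-BERT code: the function names, the 4-dimensional embeddings, and the random lookup tables are all assumptions made for the example.

```python
import random

def build_pair_input(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

def make_table(keys, dim, rng):
    """Toy embedding table: one random vector per key (stands in for learned weights)."""
    return {k: [rng.uniform(-1, 1) for _ in range(dim)] for k in keys}

def embed(tokens, segment_ids, position_ids, tok_emb, seg_emb, pos_emb):
    """Input vector for each position = token + segment + position embedding."""
    return [
        [t + s + p for t, s, p in zip(tok_emb[tok], seg_emb[seg], pos_emb[pos])]
        for tok, seg, pos in zip(tokens, segment_ids, position_ids)
    ]

rng = random.Random(0)
DIM = 4  # hypothetical embedding size, far smaller than a real model's
tokens, segment_ids, position_ids = build_pair_input(
    ["good", "food"], ["slow", "delivery"]
)
tok_emb = make_table(sorted(set(tokens)), DIM, rng)
seg_emb = make_table([0, 1], DIM, rng)
pos_emb = make_table(position_ids, DIM, rng)
vectors = embed(tokens, segment_ids, position_ids, tok_emb, seg_emb, pos_emb)
```

Each entry of `vectors` is what the first Transformer layer would receive for that position; in a trained model the three tables are learned jointly with the rest of the network.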
MT‑BERT Practice at Meituan‑Dianping
The MT‑BERT project follows four stages: (1) mixed‑precision training for speed, (2) domain‑adaptation by continuing pre‑training on Meituan‑Dianping business corpora, (3) incorporation of knowledge‑graph entities during pre‑training, and (4) fine‑tuning on various business scenarios.
Distributed training is performed on the internally developed AFO (AI Framework On Yarn) platform, which schedules thousands of GPU cards via YARN and provides Horovod‑based data‑parallel training. Horovod’s MPI‑based synchronization and NCCL communication ensure high scalability.
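The idea behind data-parallel training can be illustrated without any GPU or Horovod installation. The sketch below is a single-process simulation under assumed names (`grad_mse`, `allreduce_mean`) and a toy one-parameter model; it only shows why averaging per-worker gradients over equal-sized shards reproduces the full-batch gradient, which is the quantity an allreduce step synchronizes.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def allreduce_mean(values):
    """Stand-in for an averaging allreduce: every worker ends up with the mean."""
    return sum(values) / len(values)

# Toy batch, split evenly across two simulated workers.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
num_workers = 2
shard = len(xs) // num_workers

# Each worker computes a gradient on its own shard only.
local_grads = [
    grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(num_workers)
]

# After synchronization, every worker applies the same averaged gradient.
synced_grad = allreduce_mean(local_grads)
full_grad = grad_mse(w, xs, ys)  # what a single large-batch worker would compute
```

Because all workers apply the same `synced_grad`, their model replicas stay identical from step to step, and adding workers increases the effective batch size.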
Mixed-precision training (FP32 + FP16) reduces memory consumption and doubles throughput on NVIDIA V100 GPUs. Experiments show no degradation in model quality on either the Meituan-Dianping benchmarks or public datasets.
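One reason mixed-precision training keeps an FP32 copy alongside FP16 is that small gradients underflow in half precision. The sketch below illustrates this with loss scaling, a standard companion technique; the source does not say which scaling scheme MT-BERT uses, so the scale factor and the example gradient value are assumptions chosen only for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

LOSS_SCALE = 1024.0  # hypothetical scale factor, not taken from the source

true_grad = 1e-8  # a gradient too small for FP16 to represent

# Stored directly in FP16, the gradient is flushed to zero and the update is lost.
naive = to_fp16(true_grad)

# Scaling the loss scales every gradient, lifting it into FP16's representable range.
scaled = to_fp16(true_grad * LOSS_SCALE)

# The gradient is unscaled in higher precision before the weight update
# (a Python float here plays the role of the FP32 master copy).
recovered = scaled / LOSS_SCALE
```

Here `naive` is exactly zero while `recovered` stays within about one percent of the true value, which is the behavior that lets FP16 arithmetic be used without hurting convergence.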
Domain Adaptation – Starting from Google’s Chinese BERT, additional Meituan‑Dianping data are used for continual training (Domain‑aware Continual Training). This yields the MT‑BERT Large model, which outperforms the original BERT on eight benchmark tasks (five Chinese public benchmarks and three internal business benchmarks).
Knowledge Integration – To address BERT’s lack of common‑sense and entity‑level understanding, a Knowledge‑aware Masking strategy is applied. Instead of masking single characters, whole entities from the Meituan “Brain” knowledge graph are masked, forcing the model to learn entity‑level semantics. This improves performance on fine‑grained sentiment analysis.
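Entity-level masking can be sketched as follows. This is a simplified illustration, not the MT-BERT implementation: the function name, the 15% default rate (the rate used by the original BERT), and the example entity span are assumptions, and BERT's rule of sometimes keeping or randomly replacing a selected token is omitted.

```python
import random

MASK = "[MASK]"

def knowledge_aware_mask(chars, entity_spans, mask_prob=0.15, rng=None):
    """Mask whole entities as a unit; mask remaining characters one by one.

    entity_spans is a list of (start, end) index pairs, end exclusive.
    Returns (masked_tokens, labels), where labels holds the original
    character at masked positions and None elsewhere.
    """
    rng = rng or random.Random()
    masked = list(chars)
    labels = [None] * len(chars)
    in_entity = set()

    # One random draw per entity: either every character is masked or none is.
    for start, end in entity_spans:
        in_entity.update(range(start, end))
        if rng.random() < mask_prob:
            for i in range(start, end):
                labels[i] = chars[i]
                masked[i] = MASK

    # Characters outside any entity fall back to character-level masking.
    for i, ch in enumerate(chars):
        if i not in in_entity and rng.random() < mask_prob:
            labels[i] = ch
            masked[i] = MASK
    return masked, labels

chars = list("宫保鸡丁很好吃")  # "Kung Pao chicken is delicious"
entity_spans = [(0, 4)]  # "宫保鸡丁" (Kung Pao chicken), a hypothetical graph entity
masked, labels = knowledge_aware_mask(
    chars, entity_spans, mask_prob=0.5, rng=random.Random(0)
)
```

With character-level masking the model could often guess one hidden character of a dish name from its neighbors; hiding the whole span forces it to predict the entity from the surrounding sentence.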
Model Light‑weighting – Because BERT’s size hinders online deployment, three compression techniques are explored: (1) low‑precision quantization (FP16/INT8), (2) layer pruning (e.g., reducing Transformer depth to 4 layers, resulting in MT‑BERT‑MINI/MBM), and (3) knowledge distillation. The pruned model achieves 2× faster inference (12‑14 ms latency) while maintaining comparable accuracy.
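As an illustration of the first technique, the sketch below shows symmetric per-tensor INT8 quantization of a weight list. The source does not describe which quantization scheme was used, so the scheme, the function names, and the example weights are all assumptions.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.02, -0.5, 0.31, 1.27, -1.0]  # toy values, not real model weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight then needs one byte instead of four, at the cost of an error of at most `scale / 2` per weight; whether that error is acceptable has to be checked on the downstream task.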
Applications in Meituan‑Dianping
Fine‑grained sentiment analysis: a multi‑task MT‑BERT model predicts sentiment across 20 fine‑grained attributes, achieving a significant Macro‑F1 boost.
Query intent classification: the 4-layer MT-BERT-MINI (MBM) model is deployed across 17 business channels, reaching over 95% accuracy; caching is used to reduce the query load (QPS) on the online model.
Recommendation-reason classification: MT-BERT is fine-tuned to classify recommendation reasons by business scene (e.g., food delivery vs. hotel).
Sentence-pair tasks (NLI, STS) are handled by feeding both sentences as a single sequence, delimited by [CLS] and [SEP] tokens, and classifying from the pooled [CLS] output. Sequence labeling tasks (NER, POS tagging, slot filling) instead attach a classifier to each token's output vector.
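The difference between a sentence-level head and a token-level head on top of BERT can be sketched with toy numbers. The hidden states, weights, and function names below are assumptions for illustration; the sketch also uses the raw [CLS] vector directly, whereas BERT's pooled output passes it through an additional dense layer with a tanh activation.

```python
def linear(vec, weight, bias):
    """Plain dense layer: one output score per row of the weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)

def classify_sequence(hidden_states, weight, bias):
    """Sentence or sentence-pair task: one label from the [CLS] position (index 0)."""
    return argmax(linear(hidden_states[0], weight, bias))

def tag_tokens(hidden_states, weight, bias):
    """Sequence labeling: one label per position, from that position's own vector."""
    return [argmax(linear(h, weight, bias)) for h in hidden_states]

# Toy encoder output: 5 positions, hidden size 2 (position 0 is [CLS]).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.5, 0.1]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2-class head
bias = [0.0, 0.0]

pair_label = classify_sequence(hidden_states, weight, bias)
token_labels = tag_tokens(hidden_states, weight, bias)
```

The encoder is shared in both cases; only the head changes, which is why one pre-trained model can serve classification, sentence-pair, and tagging scenarios after fine-tuning.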
Future Outlook
One‑stop MT‑BERT training & inference platform for short‑text classification and sentence‑pair tasks.
Deeper integration of knowledge‑graph information into pre‑training.
Further research on model compression (quantization, pruning, distillation) tailored to specific downstream tasks.
References – The article cites seminal works on word embeddings, Transformers, BERT, RoBERTa, mixed‑precision training, Horovod, and knowledge distillation.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
