
MT-BERT: Domain‑Adapted BERT Pre‑training and Fine‑tuning for Meituan‑Dianping NLP Tasks

This article describes the development of MT‑BERT, a BERT‑based language model pre‑trained on Meituan‑Dianping business data. It covers the distributed mixed‑precision training pipeline, domain adaptation, knowledge‑graph integration, model compression techniques, and the wide range of downstream NLP applications the model supports on the platform.

DataFunTalk

Nov 15, 2019


The NLP community has shifted toward pre‑training language models such as ELMo, GPT and BERT, which learn rich semantic representations from massive unlabeled text and can be fine‑tuned on downstream tasks with limited labeled data.

Google's BERT, a deep bidirectional Transformer encoder, set new records on many NLU benchmarks and inspired a wave of follow‑up research. This shift is often described as NLP's "ImageNet moment".

Meituan‑Dianping, a leading Chinese lifestyle e‑commerce platform, hosts billions of user reviews, a large body of user‑generated content (UGC). To improve its many NLP services (search, recommendation, advertising, etc.), the company built MT‑BERT, a BERT variant further pre‑trained on domain‑specific corpora and fine‑tuned for multiple business scenarios.

MT‑BERT’s architecture follows the standard BERT base/large configurations: each input token is represented by the sum of its token, segment and position embeddings. The pre‑training tasks are Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), although later work such as RoBERTa suggests that NSP can be dropped.
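As a minimal sketch of how the three embeddings combine, the snippet below sums token, segment and position vectors for each input position. The toy tables, the hidden size of 4, the token ids and the function name `embed_inputs` are illustrative assumptions, not MT‑BERT's actual parameters.

```python
import random

HIDDEN = 4  # toy hidden size; BERT-base uses 768


def make_table(rows, seed):
    """Build a toy embedding table of shape (rows, HIDDEN) with random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(rows)]


def embed_inputs(token_ids, segment_ids, token_table, segment_table, position_table):
    """Return one vector per position: token + segment + position embedding."""
    if len(token_ids) != len(segment_ids):
        raise ValueError("token_ids and segment_ids must have the same length")
    outputs = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [
            t + s + p
            for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])
        ]
        outputs.append(vec)
    return outputs


token_table = make_table(rows=10, seed=0)    # toy vocabulary of 10 ids
segment_table = make_table(rows=2, seed=1)   # segment A / segment B
position_table = make_table(rows=8, seed=2)  # toy maximum sequence length of 8

token_ids = [1, 5, 6, 2, 7, 2]    # hypothetical ids for: [CLS] a b [SEP] c [SEP]
segment_ids = [0, 0, 0, 0, 1, 1]  # first sentence = 0, second sentence = 1
vectors = embed_inputs(token_ids, segment_ids, token_table, segment_table, position_table)
```

In BERT itself the summed vector then passes through layer normalization and dropout before the first Transformer layer, which this sketch omits.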

Training MT‑BERT required substantial compute. The team used the internal AFO (AI Framework On Yarn) platform with Horovod for data‑parallel GPU training across thousands of cards. Mixed‑precision (FP16/FP32) training more than doubled throughput without degrading accuracy.
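Mixed‑precision recipes commonly scale the loss because very small gradients underflow to zero in FP16. The following minimal sketch shows that effect using only the standard library. The gradient value and the static scale of 1024 are illustrative assumptions, and this is not the AFO or Horovod code path, which the source does not detail.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 1024.0  # assumed static scale; real trainers often adjust it dynamically
true_grad = 1e-8     # hypothetical tiny gradient, below FP16's smallest subnormal

# Without scaling, the gradient is lost when stored in FP16.
naive_grad = to_fp16(true_grad)

# With loss scaling, the scaled gradient survives in FP16 and is divided
# by the scale in higher precision before the weight update.
scaled_grad_fp16 = to_fp16(true_grad * LOSS_SCALE)
recovered_grad = scaled_grad_fp16 / LOSS_SCALE

print(naive_grad)      # 0.0
print(recovered_grad)  # close to 1e-8
```

The recovered value is not exact, since FP16 still rounds the scaled gradient, but it is far more useful than a gradient that has vanished.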

Domain adaptation was performed by continuing to pre‑train a publicly released Chinese BERT on Meituan‑Dianping text. The resulting MT‑BERT models outperform the original on both generic and business‑specific benchmarks.

To address BERT’s lack of common‑sense and entity‑level knowledge, the authors incorporated knowledge‑graph information from the Meituan Brain using a Knowledge‑aware Masking strategy, masking whole entities rather than individual characters during MLM.
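The idea of masking whole entities instead of individual characters can be sketched as follows. The entity spans are assumed to come from a knowledge‑graph entity linker, which is not shown. The function name, the example sentence and the per‑entity sampling scheme are illustrative assumptions rather than the authors' implementation.

```python
import random

MASK = "[MASK]"


def mask_entities(tokens, entity_spans, mask_prob=0.15, rng=None):
    """Mask whole entity spans so an entity is never partially visible.

    tokens:       list of characters/tokens
    entity_spans: list of (start, end) index pairs, end exclusive,
                  assumed to come from a knowledge-graph entity linker
    Returns (masked_tokens, labels), where labels hold the original token
    at masked positions and None elsewhere.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for start, end in entity_spans:
        # One random draw per entity: either every character is masked or none.
        if rng.random() < mask_prob:
            for i in range(start, end):
                labels[i] = tokens[i]
                masked[i] = MASK
    return masked, labels


tokens = list("全聚德烤鸭很好吃")
entity_spans = [(0, 3), (3, 5)]  # hypothetical linker output: 全聚德, 烤鸭

# mask_prob=1.0 is used only to make the demo deterministic.
masked, labels = mask_entities(tokens, entity_spans, mask_prob=1.0)
print(masked)
```

This sketch always substitutes `[MASK]` and only considers entity spans. A full pipeline would presumably also mask ordinary tokens and apply BERT's mix of mask, random and unchanged replacements.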

Three model lightweighting techniques were explored to meet the latency requirements of online services: mixed‑precision inference, layer pruning (which produced MT‑BERT‑MINI), and knowledge distillation. The pruned 4‑layer MT‑BERT‑MINI achieved inference latency under 15 ms while maintaining comparable accuracy.
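Knowledge distillation trains a small student model to match a larger teacher's output distribution. Below is a minimal sketch of the commonly used temperature‑softened loss. The logits, the temperature of 2.0 and the function names are illustrative assumptions, since the source does not specify the exact loss used for MT‑BERT.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


teacher_logits = [4.0, 1.0, -2.0]  # hypothetical 3-class teacher output
good_student = [3.5, 1.2, -1.8]    # roughly agrees with the teacher
poor_student = [-2.0, 1.0, 4.0]    # ranks the classes in the opposite order

print(distillation_loss(good_student, teacher_logits))
print(distillation_loss(poor_student, teacher_logits))
```

In practice this soft‑label term is usually combined with the ordinary cross‑entropy loss on the ground‑truth labels.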

Fine‑tuned MT‑BERT powers a variety of downstream tasks: single‑sentence classification (fine‑grained sentiment analysis), sentence‑pair classification (query intent detection, query rewriting verification), and sequence labeling (named entity recognition, query component analysis). The system supports over 40 business units and improves metrics such as F1 score and latency.
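The three task families differ mainly in how inputs are packed and which outputs are read. The sketch below assembles BERT‑style inputs for each case. The helper name `build_inputs`, the English example tokens and the `DISH` tag set are hypothetical and chosen only for illustration.

```python
CLS, SEP = "[CLS]", "[SEP]"


def build_inputs(tokens_a, tokens_b=None):
    """Assemble BERT-style tokens and segment ids for one sentence or a pair."""
    tokens = [CLS] + list(tokens_a) + [SEP]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + [SEP]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Single-sentence classification (e.g. sentiment): predict from the [CLS] position.
single_tokens, single_segments = build_inputs(["great", "noodles"])

# Sentence-pair classification (e.g. checking a query rewrite): two segments.
pair_tokens, pair_segments = build_inputs(["cheap", "hotel"], ["budget", "hotel"])

# Sequence labeling (e.g. NER): one tag per token, with "O" on special tokens.
seq_tokens, seq_segments = build_inputs(["spicy", "beef", "noodles"])
seq_labels = ["O", "O", "B-DISH", "I-DISH", "O"]  # hypothetical tags

print(pair_tokens)
print(pair_segments)
```

For classification tasks a classifier head typically reads the final hidden state at `[CLS]`, while sequence labeling reads the hidden state at every token position.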

Future work includes building a one‑stop MT‑BERT training and inference platform, deeper integration of knowledge graphs, and further research on model compression to balance performance and efficiency.

The original article lists references to seminal works on word embeddings, Transformers, BERT, RoBERTa, mixed‑precision training, Horovod, and knowledge distillation.



