Practical Applications of Pretrained Language Models (BERT, GPT, ELMo) in NetEase Yanxuan NLP Tasks
The article reviews the principles of popular pretrained language models, compares their architectures, and details how NetEase Yanxuan applied BERT, GPT and ELMo to classification, matching, sequence labeling and generation tasks, presenting experimental results and deployment insights.
With the release of BERT, pre‑training has become a hot direction in NLP. This article introduces the basic principles and usage of several common language models (ELMo, GPT, BERT) and reports their practical deployment in NetEase Yanxuan’s NLP services such as classification, text matching, sequence labeling and text generation.
Model structures
We selected three representative language models—ELMo, GPT and BERT—and compared them as shown in the table below.
| Language Model | BERT | GPT | ELMo |
| --- | --- | --- | --- |
| Model architecture | Transformer encoder | Transformer decoder | Bi-LSTM |
| Pre-training tasks | Masked LM & next-sentence prediction | Standard language model (predict the next token) | Bidirectional language model (separate forward and backward LMs) |
| Recommended usage | Fine-tuning | Fine-tuning | Feature ensemble |
| Pros / cons | Bidirectional context, strong representation | Unidirectional context only | LSTM is a weaker feature extractor and slower to train |
Transformer, introduced in the 2017 paper “Attention Is All You Need”, replaces RNN/CNN with multi‑head self‑attention and achieves superior performance in machine translation and other tasks. Its scaled dot‑product attention maps a query and a set of (key, value) pairs to an output in four steps: compute the dot products of the query with all keys, scale them by the square root of the key dimension, normalise the scores with softmax, and return the softmax‑weighted sum of the values.
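The four steps above can be sketched in a few lines of numpy (a minimal single-head version, without masking or the learned Q/K/V projections of the full Transformer):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Score each query against all keys, scale by sqrt(d_k),
    softmax-normalise, then take the weighted sum of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key dot products, scaled
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights                      # weighted sum of values
```

With identical queries and keys the softmax weights become uniform, i.e. plain averaging of the values.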
Usage modes
When applying a pretrained language model to a new NLP task, two common patterns are used:
Feature ensemble – obtain token embeddings from the pretrained model and feed them into a downstream model.
Fine‑tuning – keep the same network architecture as pre‑training and continue training on a small labelled dataset.
Empirical studies suggest that for ELMo, feature ensemble usually outperforms fine‑tuning, while for BERT, fine‑tuning is superior on sentence‑pair tasks such as matching.
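The difference between the two usage modes can be sketched with a toy linear "encoder" standing in for the pretrained model (this is an illustrative sketch, not the production training code): in feature ensemble the encoder weights are frozen and only the downstream head is trained, while in fine-tuning the gradient also flows into the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((8, 4)) * 0.1   # stand-in for pretrained encoder weights
x = rng.standard_normal((16, 8))            # toy batch of inputs
y = rng.standard_normal((16, 1))            # toy regression targets

def train_step(x, y, W_enc, W_clf, lr=0.01, tune_encoder=False):
    """One gradient step of a linear head on top of a (toy) linear encoder.
    Feature ensemble: tune_encoder=False, the encoder stays frozen.
    Fine-tuning:      tune_encoder=True, encoder weights are updated too."""
    h = x @ W_enc                            # "contextual" features from the encoder
    err = h @ W_clf - y                      # residual of the downstream head
    grad_clf = h.T @ err / len(x)
    if tune_encoder:
        grad_enc = x.T @ (err @ W_clf.T) / len(x)
        W_enc = W_enc - lr * grad_enc
    return W_enc, W_clf - lr * grad_clf
```

In practice the same distinction is made by freezing or unfreezing the pretrained model's parameters in the training framework.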
Feature representation
Two strategies are used: (1) taking only the top‑layer features, or (2) taking a weighted combination of the features from multiple layers. For BERT, the features from the second‑to‑last layer yield the best sentence‑level similarity, since the top layer is biased toward the pre‑training objectives.
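Both strategies are a few lines of numpy once the per-layer hidden states are in hand (shapes are assumptions for the sketch; in practice they come from the pretrained model's hidden-state outputs):

```python
import numpy as np

def fuse_layers(layer_states, layer_logits):
    """Strategy (2): softmax-weighted combination of per-layer token features.
    layer_states: (num_layers, seq_len, hidden); layer_logits: (num_layers,)."""
    w = np.exp(layer_logits - np.max(layer_logits))
    w = w / w.sum()                                  # softmax over layers
    return np.tensordot(w, layer_states, axes=1)     # (seq_len, hidden)

def second_to_last_embedding(layer_states):
    """Sentence-similarity setup described above: mean-pool the
    second-to-last layer's token features into one vector."""
    return layer_states[-2].mean(axis=0)
```

The layer weights in `fuse_layers` can be fixed or learned jointly with the downstream task.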
Practical experiments
1. Text Classification
| Model | Data size | Test F1 |
| --- | --- | --- |
| ABL (attention Bi‑LSTM) | 150k | 0.9743 |
| BERT | 5k | 0.9612 |
| BERT | 20k | 0.9714 |
| BERT | 150k | 0.9745 |
Results show that BERT brings only a limited improvement for this classification task, since shallow semantic features are often sufficient: with the full 150k examples it barely edges out the attention Bi‑LSTM baseline. Its main advantage is data efficiency, reaching close to the baseline's F1 with only 20k labelled examples.
2. Text Matching
| Method | Precision | Recall | F1 | Latency per query |
| --- | --- | --- | --- | --- |
| Siamese‑LSTM | 0.98 | 0.75 | 0.85 | <30 ms |
| BERT | 0.96 | 0.97 | 0.97 | >50 ms |
BERT outperforms the Siamese network, likely because its pre‑training includes next‑sentence prediction, which captures inter‑sentence relations.
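Concretely, BERT sees both sentences in a single input sequence, so self-attention can relate tokens across the pair, whereas a Siamese encoder processes each sentence in isolation. A minimal sketch of the sentence-pair packing (token strings are illustrative; a real tokenizer maps them to ids):

```python
def pack_sentence_pair(tokens_a, tokens_b):
    """BERT-style sentence-pair input: both sentences in one sequence,
    with segment ids marking which sentence each token belongs to."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids
```

The [CLS] position's final hidden state is then fed to a binary classifier to score whether the pair matches.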
3. Sequence Labeling (NER)
| Method | Precision | Recall | F1 | Latency per query |
| --- | --- | --- | --- | --- |
| Feature ensemble (Bi‑LSTM + CRF) | 0.9686 | 0.8813 | 0.9220 | >100 ms |
| Fine‑tuning (multi‑layer fusion) | 0.9361 | 0.8801 | 0.9072 | <10 ms |
| Fine‑tuning (high‑layer only) | 0.9356 | 0.8368 | 0.8824 | <10 ms |
Feature ensemble yields higher accuracy but at roughly 10× the latency; fine‑tuning, especially with multi‑layer fusion, is better suited to online services.
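Whichever setup produces the per-token tags, the final step of an NER pipeline is the same: turn BIO labels into entity spans. A minimal decoder (tag scheme and helper name are illustrative, not from the original article):

```python
def bio_to_spans(labels):
    """Collect (start, end, type) entity spans from BIO tags, end exclusive."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels + ["O"]):   # "O" sentinel closes a trailing span
        boundary = lab == "O" or lab.startswith("B-") or (
            lab.startswith("I-") and lab[2:] != etype)
        if boundary and start is not None:     # current entity ends here
            spans.append((start, i, etype))
            start, etype = None, None
        if lab.startswith("B-") or (lab.startswith("I-") and start is None):
            start, etype = i, lab[2:]          # a new entity begins
    return spans
```

The CRF layer in the feature-ensemble setup helps precisely by ruling out invalid tag transitions (e.g. `I-LOC` directly after `I-PER`) before this decoding step.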
4. Generative Tasks
| Model | Craftsmanship | Style |
| --- | --- | --- |
| Target (ground truth) | 先染后纺，色牢度高 (dyed before spinning, high colour fastness) | 经典格纹，帅气立领 (classic check pattern, sharp stand‑up collar) |
| BERT generator | 针织工艺，精致细腻 (knitted craftsmanship, fine and delicate) | 经典版型，时尚百搭 (classic cut, stylish and versatile) |
| GPT‑2 | 100%长绒棉，严格品控一家人满意 (100% long‑staple cotton, strict quality control the whole family is happy with) | 学院风格，日系简约 (preppy style, Japanese‑inspired minimalism) |
In the Yanxuan scenario, BERT serves as the encoder of a seq2seq model for copywriting generation, while GPT‑2 is used as a pure generative model.
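The basic loop behind a pure generative model like GPT-2 is greedy autoregressive decoding: feed the prefix, append the most likely next token, repeat until an end token. A minimal sketch, with a stand-in `next_token_logits` function in place of a real model forward pass:

```python
import numpy as np

def greedy_decode(next_token_logits, bos_id, eos_id, max_len=32):
    """Greedy autoregressive decoding. next_token_logits(seq) stands in
    for one forward pass of the language model over the current prefix."""
    seq = [bos_id]
    for _ in range(max_len):
        token = int(np.argmax(next_token_logits(seq)))  # pick the top token
        seq.append(token)
        if token == eos_id:                             # stop at end-of-sequence
            break
    return seq
```

Production systems usually replace the argmax with beam search or sampling to get more diverse copy, but the decoding loop is the same.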
Beyond the above, the pretrained models are also explored for reading comprehension, text summarisation, and other downstream tasks. Model compression techniques such as knowledge distillation and lightweight variants like ALBERT are employed to meet online QPS and latency requirements.
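Knowledge distillation trains a small student model to match a large teacher's output distribution. A sketch of the standard distillation objective (the temperature and blending weight are the usual hyperparameters, not values reported by the article):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Temperature-softened cross-entropy against the teacher's distribution,
    blended with ordinary cross-entropy against the hard labels."""
    soft_targets = softmax(teacher_logits, T)
    log_student = np.log(softmax(student_logits, T))
    soft = -(soft_targets * log_student).sum(axis=-1).mean() * T * T  # scale by T^2
    idx = np.arange(len(labels))
    hard = -np.log(softmax(student_logits)[idx, labels]).mean()
    return alpha * soft + (1 - alpha) * hard
```

The softened teacher distribution carries inter-class similarity information that one-hot labels do not, which is what lets the compressed student retain most of the accuracy at a fraction of the latency.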
Overall, the experiments demonstrate that pretrained language models can significantly improve performance on many NLP tasks in an e‑commerce setting, provided that the appropriate usage mode (feature ensemble vs fine‑tuning) and model optimisation are chosen.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.