How NetEase Yanxuan Leverages BERT, GPT, and ELMo for Real-World NLP Tasks
This article reviews the evolution of language models from bag‑of‑words to BERT, compares ELMo, GPT, and BERT architectures, and details how NetEase Yanxuan applies pre‑trained models to classification, text matching, sequence labeling, and generative tasks in production.
Introduction
Since the release of BERT at the end of 2018, pre‑training has become a dominant trend in natural language processing (NLP). Large‑scale unsupervised corpora combined with a small amount of labeled data now form the standard recipe for building NLP models. This article introduces three representative language models—ELMo, GPT, and BERT—explains their core principles and usage, and shares practical experiences from NetEase Yanxuan, covering text classification, matching, sequence labeling, and generation.
Evolution of Text Representation
Text representation has progressed from simple bag‑of‑words and topic models (LDA) to dense word vectors (word2vec) and finally to contextual language models such as BERT. Early word embeddings are static and cannot resolve polysemy, whereas modern contextual models generate fine‑grained semantic vectors that differ across contexts.
Model Structures
We compare the three models:
ELMo: bi-LSTM encoder; typically used via feature ensemble.
GPT: Transformer decoder; typically fine-tuned.
BERT: Transformer encoder; fine-tuned for most downstream tasks.
Transformer Details
The Transformer follows the standard seq2seq (encoder-decoder) pattern, with multi-head self-attention as its core feature extractor. Scaled dot-product attention maps a query against a set of key-value pairs: it computes query-key similarity scores, scales them by the square root of the key dimension, applies softmax to obtain attention weights, and returns the weighted sum of the values.
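As a minimal illustration of these steps (not code from the original article), the following NumPy sketch computes single-head scaled dot-product attention; multi-head attention simply repeats this over several learned projections.

```python
# Illustrative sketch: single-head scaled dot-product attention in NumPy.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # query-key similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                             # weighted sum of values

# Toy example: 3 query positions attending over 4 key/value positions.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # -> (3, 16)
```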
Comparison of Feature Extraction Methods
RNN, CNN, and Transformer feature extractors differ in per-layer complexity, degree of parallelism, parameter count, and the kinds of patterns they capture best. Self-attention connects any two positions in a constant number of steps, so the maximum path length for long-range dependencies is O(1) and computation parallelizes across the whole sequence, at the cost of per-layer complexity that grows quadratically with sequence length.
Application Modes
Two main ways to use a pre‑trained model:
Feature ensemble: extract embeddings from the frozen model and feed them into a downstream classifier.
Fine-tuning: continue training the entire model on a small labeled dataset to adapt it to a specific task.
Research shows that ELMo benefits more from feature‑ensemble, while BERT fine‑tuning excels on sentence‑matching tasks.
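A minimal sketch of the two modes, assuming the Hugging Face transformers library and a Chinese BERT checkpoint (the article does not name either); the downstream head, label count, and example query are illustrative only.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
classifier = torch.nn.Linear(bert.config.hidden_size, 10)   # hypothetical 10-intent head

FEATURE_ENSEMBLE = True
if FEATURE_ENSEMBLE:
    # Feature ensemble: freeze the pre-trained encoder, train only the classifier.
    for p in bert.parameters():
        p.requires_grad = False
    params = classifier.parameters()
else:
    # Fine-tuning: update every parameter, usually with a small learning rate.
    params = list(bert.parameters()) + list(classifier.parameters())

optimizer = torch.optim.AdamW(params, lr=2e-5)

inputs = tokenizer("这条裤子会不会起球", return_tensors="pt")   # "Will these pants pill?"
cls_vec = bert(**inputs).last_hidden_state[:, 0]             # [CLS] representation
logits = classifier(cls_vec)
```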
Feature Representation Strategies
Use only the top‑most layer.
Weight and combine multiple layers.
Findings: ELMo recommends task‑specific weighted layers; BERT’s top layer works best for sentence‑matching, whereas multi‑layer fusion improves sequence‑labeling performance.
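As an illustration of these two strategies (a sketch, not the production code), the snippet below pulls all hidden layers from a BERT checkpoint and contrasts top-layer-only use with an ELMo-style learned weighting over layers; the model name and input text are assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

inputs = tokenizer("红色针织开衫", return_tensors="pt")   # "red knit cardigan"
hidden_states = bert(**inputs).hidden_states             # tuple: embeddings + 12 layers

# Strategy 1: use only the top-most layer.
top = hidden_states[-1]

# Strategy 2: task-specific weighted sum over all layers (ELMo-style);
# the weights would normally be learned jointly with the downstream task.
layer_weights = torch.nn.Parameter(torch.zeros(len(hidden_states)))
stacked = torch.stack(hidden_states, dim=0)               # (layers, batch, seq, hidden)
fused = (torch.softmax(layer_weights, dim=0)[:, None, None, None] * stacked).sum(0)
```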
Practical Use Cases
Text Classification
We built an intent‑recognition system using an attention‑BiLSTM (ABL) and BERT fine‑tuning. Results on the test set:
ABL (150K samples) – F1 = 0.9743
BERT (5K samples) – F1 = 0.9612
BERT (20K samples) – F1 = 0.9714
BERT (150K samples) – F1 = 0.9745
Conclusion: BERT reaches comparable performance with far fewer labeled examples; once labeled data is plentiful the gap largely closes, since intent classification rarely requires deep semantic features.
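For reference, a hypothetical fine-tuning step with the library's built-in sequence-classification head might look like the following; the model name, label set, and example queries are placeholders rather than Yanxuan's actual configuration.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(
    ["这件衬衫有儿童码吗", "什么时候发货"],   # "Does this shirt come in kids' sizes?", "When will it ship?"
    padding=True, return_tensors="pt",
)
labels = torch.tensor([0, 1])                   # e.g. product-attribute vs. logistics intent

outputs = model(**batch, labels=labels)         # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
```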
Text Representation
For similarity or clustering, we extract token embeddings from the last four BERT layers and combine them. Empirically, weighting the second-to-last layer most heavily yields the best sentence-level similarity.
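A minimal sketch of this representation approach, assuming a Hugging Face BERT checkpoint: mean-pool the token vectors from the second-to-last layer and compare sentences by cosine similarity (the example sentences are illustrative).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

def sentence_vector(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden_states = bert(**inputs).hidden_states
    # Mean-pool token vectors from the second-to-last layer.
    return hidden_states[-2].mean(dim=1).squeeze(0)

a = sentence_vector("这条裤子会缩水吗")      # "Will these pants shrink?"
b = sentence_vector("裤子洗了会不会缩")      # "Will the pants shrink after washing?"
similarity = torch.cosine_similarity(a, b, dim=0)
```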
Text Matching (NLI)
Matching pairs of sentences is a classic NLP task. Model evolution: Siamese‑LSTM → InferNet → Decomposable Attention → ESIM → BERT. In our 60K‑sample dataset, BERT outperformed Siamese‑LSTM (F1 = 0.97 vs 0.85) despite higher latency.
Siamese‑LSTM – Precision 0.98, Recall 0.75, F1 0.85, <30 ms.
BERT – Precision 0.96, Recall 0.97, F1 0.97, >50 ms.
Reasons: BERT’s next‑sentence prediction objective captures inter‑sentence relations, and self‑attention provides fine‑grained token‑to‑token interactions.
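A hedged sketch of BERT sentence-pair matching: both sentences are packed into one input sequence ([CLS] A [SEP] B [SEP]) so self-attention can attend across the pair at the token level; the checkpoint and example pair are assumptions, not the production setup.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

# Pack the two sentences into a single sequence with segment embeddings.
pair = tokenizer(
    "这件外套掉色吗",        # "Does this coat lose color?"
    "外套会不会褪色",        # "Will the coat fade?"
    return_tensors="pt",
)
with torch.no_grad():
    probs = torch.softmax(model(**pair).logits, dim=-1)   # P(no match), P(match)
```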
Sequence Labeling (NER)
We focus on product‑entity NER (e.g., “pants”, “red”). Baseline uses bi‑LSTM + CRF. Comparisons:
Feature‑ensemble (bi‑LSTM + CRF with BERT embeddings) – Precision 0.9686, Recall 0.8813, F1 0.922, >100 ms.
Fine‑tuning with multi‑layer fusion – Precision 0.9361, Recall 0.8801, F1 0.9072, <10 ms.
Fine‑tuning with top‑layer only – Precision 0.9356, Recall 0.8368, F1 0.8824, <10 ms.
Feature‑ensemble yields higher accuracy but higher latency; fine‑tuning meets online latency requirements.
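The top-layer fine-tuning variant can be sketched with a token-classification head as below; the BIO label set, model name, and example query are hypothetical, and the multi-layer fusion variant would instead combine several hidden layers before the classification layer.

```python
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

# Hypothetical BIO label set for product-entity NER.
labels = ["O", "B-PRODUCT", "I-PRODUCT", "B-ATTR", "I-ATTR"]
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(labels)
)

inputs = tokenizer("红色 高腰 牛仔裤", return_tensors="pt")   # "red high-waist jeans"
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, seq_len, num_labels)

pred_ids = logits.argmax(dim=-1).squeeze(0).tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze(0).tolist())
print(list(zip(tokens, [labels[i] for i in pred_ids])))
```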
Generative Tasks
BERT itself is not suited for generation; we use models such as MASS and GPT‑2. Applications include:
Chatbot: a generative chit-chat module trained on external dialogue data.
Couplet bot: generates matching couplets or head-character couplets using a Transformer.
Praise bot: generates encouraging replies based on collected corpora.
(Example outputs appear as images in the original article.)
Copywriting Generation
For product advertising, we generate selling points using pretrained models. Sample outputs for a children’s shirt:
Target: "先染后纺,色牢度高" ("dyed before spinning, high color fastness"), a craft-focused selling point.
BERT generator: "针织工艺,精致细腻" ("knit craftsmanship, fine and delicate").
GPT-2: "100%长绒棉,严格品控一家人满意" ("100% long-staple cotton, strict quality control, the whole family satisfied").
BERT is used as a seq2seq encoder in this scenario.
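One way to realize this, sketched under the assumption of Hugging Face's EncoderDecoderModel (which warm-starts both encoder and decoder from BERT weights); the article does not state its exact generator, and the input text is a placeholder.

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# Warm-start a seq2seq model: BERT as encoder, a BERT-initialized decoder with cross-attention.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-chinese", "bert-base-chinese"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# After fine-tuning on (product attributes -> selling point) pairs, generation would look like:
inputs = tokenizer("儿童衬衫 纯棉 先染后纺", return_tensors="pt")   # "kids' shirt, pure cotton, dyed before spinning"
ids = model.generate(inputs["input_ids"], max_length=20, num_beams=3)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```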
Other Applications
Reading comprehension for dynamic activity‑rule queries.
Text summarization of multi‑turn conversations.
Conclusion
Pre-trained language models have been widely adopted in NetEase Yanxuan's NLP pipelines. Lightweight variants such as ALBERT, together with knowledge distillation and multi-task learning, further improve inference speed and resource efficiency while maintaining strong performance across diverse tasks.