How NetEase Yanxuan Leverages BERT, GPT, and ELMo for Real-World NLP Tasks
This article reviews the evolution of language models from bag‑of‑words to BERT, compares ELMo, GPT, and BERT architectures, and details how NetEase Yanxuan applies pre‑trained models to classification, text matching, sequence labeling, and generative tasks in production.
Introduction
Since the release of BERT at the end of 2018, pre‑training has become a dominant trend in natural language processing (NLP). Large‑scale unsupervised corpora combined with a small amount of labeled data now form the standard recipe for building NLP models. This article introduces three representative language models—ELMo, GPT, and BERT—explains their core principles and usage, and shares practical experiences from NetEase Yanxuan, covering text classification, matching, sequence labeling, and generation.
Evolution of Text Representation
Text representation has progressed from simple bag‑of‑words and topic models (LDA) to dense word vectors (word2vec) and finally to contextual language models such as BERT. Early word embeddings are static and cannot resolve polysemy, whereas modern contextual models generate fine‑grained semantic vectors that differ across contexts.
Model Structures
We compare the three models:
ELMo: bi-LSTM encoder; typically used via feature ensemble.
GPT: Transformer decoder; typically fine-tuned.
BERT: Transformer encoder; fine-tuned for most downstream tasks.
Transformer Details
The Transformer follows the standard seq2seq (encoder-decoder) pattern, with multi-head self-attention as its core feature extractor. Scaled dot-product attention maps a query against a set of key-value pairs: it computes query-key similarity scores, scales them by the square root of the key dimension, applies softmax to obtain attention weights, and returns the weighted sum of the values.
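As a minimal illustration of these steps (not code from the original article), the following NumPy sketch computes single-head scaled dot-product attention; multi-head attention simply repeats this over several learned projections.

```python
# Illustrative sketch: single-head scaled dot-product attention in NumPy.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # query-key similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                             # weighted sum of values

# Toy example: 3 query positions attending over 4 key/value positions.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # -> (3, 16)
```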
Comparison of Feature Extraction Methods
RNN, CNN, and Transformer feature extractors differ in per-layer complexity, degree of parallelism, parameter count, and the kinds of patterns they capture best. Self-attention connects any two positions in a constant number of steps, so the maximum path length for long-range dependencies is O(1) and computation parallelizes across the whole sequence, at the cost of per-layer complexity that grows quadratically with sequence length.
Application Modes
Two main ways to use a pre‑trained model:
Feature ensemble: extract embeddings from the frozen model and feed them into a downstream classifier.
Fine-tuning: continue training the entire model on a small labeled dataset to adapt it to a specific task.
Research shows that ELMo benefits more from feature‑ensemble, while BERT fine‑tuning excels on sentence‑matching tasks.
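A minimal sketch of the two modes, assuming the Hugging Face transformers library and a Chinese BERT checkpoint (the article does not name either); the downstream head, label count, and example query are illustrative only.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
classifier = torch.nn.Linear(bert.config.hidden_size, 10)   # hypothetical 10-intent head

FEATURE_ENSEMBLE = True
if FEATURE_ENSEMBLE:
    # Feature ensemble: freeze the pre-trained encoder, train only the classifier.
    for p in bert.parameters():
        p.requires_grad = False
    params = classifier.parameters()
else:
    # Fine-tuning: update every parameter, usually with a small learning rate.
    params = list(bert.parameters()) + list(classifier.parameters())

optimizer = torch.optim.AdamW(params, lr=2e-5)

inputs = tokenizer("这条裤子会不会起球", return_tensors="pt")   # "Will these pants pill?"
cls_vec = bert(**inputs).last_hidden_state[:, 0]             # [CLS] representation
logits = classifier(cls_vec)
```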
Feature Representation Strategies
Use only the top‑most layer.
Weight and combine multiple layers.
Findings: ELMo recommends task‑specific weighted layers; BERT’s top layer works best for sentence‑matching, whereas multi‑layer fusion improves sequence‑labeling performance.
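As an illustration of these two strategies (a sketch, not the production code), the snippet below pulls all hidden layers from a BERT checkpoint and contrasts top-layer-only use with an ELMo-style learned weighting over layers; the model name and input text are assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

inputs = tokenizer("红色针织开衫", return_tensors="pt")   # "red knit cardigan"
hidden_states = bert(**inputs).hidden_states             # tuple: embeddings + 12 layers

# Strategy 1: use only the top-most layer.
top = hidden_states[-1]

# Strategy 2: task-specific weighted sum over all layers (ELMo-style);
# the weights would normally be learned jointly with the downstream task.
layer_weights = torch.nn.Parameter(torch.zeros(len(hidden_states)))
stacked = torch.stack(hidden_states, dim=0)               # (layers, batch, seq, hidden)
fused = (torch.softmax(layer_weights, dim=0)[:, None, None, None] * stacked).sum(0)
```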
Practical Use Cases
Text Classification
We built an intent‑recognition system using an attention‑BiLSTM (ABL) and BERT fine‑tuning. Results on the test set:
ABL (150K samples) – F1 = 0.9743
BERT (5K samples) – F1 = 0.9612
BERT (20K samples) – F1 = 0.9714
BERT (150K samples) – F1 = 0.9745
Conclusion: BERT reaches comparable performance with far fewer labeled examples; once labeled data is plentiful the gap largely closes, since intent classification rarely requires deep semantic features.
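For reference, a hypothetical fine-tuning step with the library's built-in sequence-classification head might look like the following; the model name, label set, and example queries are placeholders rather than Yanxuan's actual configuration.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(
    ["这件衬衫有儿童码吗", "什么时候发货"],   # "Does this shirt come in kids' sizes?", "When will it ship?"
    padding=True, return_tensors="pt",
)
labels = torch.tensor([0, 1])                   # e.g. product-attribute vs. logistics intent

outputs = model(**batch, labels=labels)         # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
```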
Text Representation
For similarity or clustering, we extract token embeddings from the last four BERT layers and combine them. Empirically, weighting the second-to-last layer most heavily yields the best sentence-level similarity.
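A minimal sketch of this representation approach, assuming a Hugging Face BERT checkpoint: mean-pool the token vectors from the second-to-last layer and compare sentences by cosine similarity (the example sentences are illustrative).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

def sentence_vector(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden_states = bert(**inputs).hidden_states
    # Mean-pool token vectors from the second-to-last layer.
    return hidden_states[-2].mean(dim=1).squeeze(0)

a = sentence_vector("这条裤子会缩水吗")      # "Will these pants shrink?"
b = sentence_vector("裤子洗了会不会缩")      # "Will the pants shrink after washing?"
similarity = torch.cosine_similarity(a, b, dim=0)
```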
Text Matching (NLI)
Matching pairs of sentences is a classic NLP task. Model evolution: Siamese‑LSTM → InferNet → Decomposable Attention → ESIM → BERT. In our 60K‑sample dataset, BERT outperformed Siamese‑LSTM (F1 = 0.97 vs 0.85) despite higher latency.
Siamese‑LSTM – Precision 0.98, Recall 0.75, F1 0.85, <30 ms.
BERT – Precision 0.96, Recall 0.97, F1 0.97, >50 ms.
Reasons: BERT’s next‑sentence prediction objective captures inter‑sentence relations, and self‑attention provides fine‑grained token‑to‑token interactions.
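A hedged sketch of BERT sentence-pair matching: both sentences are packed into one input sequence ([CLS] A [SEP] B [SEP]) so self-attention can attend across the pair at the token level; the checkpoint and example pair are assumptions, not the production setup.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

# Pack the two sentences into a single sequence with segment embeddings.
pair = tokenizer(
    "这件外套掉色吗",        # "Does this coat lose color?"
    "外套会不会褪色",        # "Will the coat fade?"
    return_tensors="pt",
)
with torch.no_grad():
    probs = torch.softmax(model(**pair).logits, dim=-1)   # P(no match), P(match)
```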
Sequence Labeling (NER)
We focus on product‑entity NER (e.g., “pants”, “red”). Baseline uses bi‑LSTM + CRF. Comparisons:
Feature‑ensemble (bi‑LSTM + CRF with BERT embeddings) – Precision 0.9686, Recall 0.8813, F1 0.922, >100 ms.
Fine‑tuning with multi‑layer fusion – Precision 0.9361, Recall 0.8801, F1 0.9072, <10 ms.
Fine‑tuning with top‑layer only – Precision 0.9356, Recall 0.8368, F1 0.8824, <10 ms.
Feature‑ensemble yields higher accuracy but higher latency; fine‑tuning meets online latency requirements.
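The top-layer fine-tuning variant can be sketched with a token-classification head as below; the BIO label set, model name, and example query are hypothetical, and the multi-layer fusion variant would instead combine several hidden layers before the classification layer.

```python
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

# Hypothetical BIO label set for product-entity NER.
labels = ["O", "B-PRODUCT", "I-PRODUCT", "B-ATTR", "I-ATTR"]
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(labels)
)

inputs = tokenizer("红色 高腰 牛仔裤", return_tensors="pt")   # "red high-waist jeans"
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, seq_len, num_labels)

pred_ids = logits.argmax(dim=-1).squeeze(0).tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze(0).tolist())
print(list(zip(tokens, [labels[i] for i in pred_ids])))
```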
Generative Tasks
BERT itself is not suited for generation; we use models such as MASS and GPT‑2. Applications include:
Chatbot: a generative chit-chat module trained on external dialogue data.
Couplet bot: generates matching couplets or head-character couplets using a Transformer.
Praise bot: generates encouraging replies based on collected corpora.
(Example outputs appear as images in the original article.)
Copywriting Generation
For product advertising, we generate selling points using pretrained models. Sample outputs for a children’s shirt:
Target: "先染后纺,色牢度高" ("dyed before spinning, high color fastness"), a craft-focused selling point.
BERT generator: "针织工艺,精致细腻" ("knit craftsmanship, fine and delicate").
GPT-2: "100%长绒棉,严格品控一家人满意" ("100% long-staple cotton, strict quality control, the whole family satisfied").
BERT is used as a seq2seq encoder in this scenario.
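One way to realize this, sketched under the assumption of Hugging Face's EncoderDecoderModel (which warm-starts both encoder and decoder from BERT weights); the article does not state its exact generator, and the input text is a placeholder.

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# Warm-start a seq2seq model: BERT as encoder, a BERT-initialized decoder with cross-attention.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-chinese", "bert-base-chinese"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# After fine-tuning on (product attributes -> selling point) pairs, generation would look like:
inputs = tokenizer("儿童衬衫 纯棉 先染后纺", return_tensors="pt")   # "kids' shirt, pure cotton, dyed before spinning"
ids = model.generate(inputs["input_ids"], max_length=20, num_beams=3)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```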
Other Applications
Reading comprehension for dynamic activity‑rule queries.
Text summarization of multi‑turn conversations.
Conclusion
Pre-trained language models have been widely adopted in NetEase Yanxuan's NLP pipelines. Lightweight variants such as ALBERT, together with knowledge distillation and multi-task learning, further improve inference speed and resource efficiency while maintaining strong performance across diverse tasks.