From Neurons to BERT: Tracing the Evolution of Deep Learning in NLP

This article traces the development of deep learning for natural language processing. It starts with basic neural cells and shallow networks, moves through CNNs, RNNs, LSTMs, TextCNN, ESIM, and ELMo, and ends with the Transformer‑based BERT model: its training objectives, fine‑tuning strategies, and performance comparisons.

Alibaba Cloud Developer

Introduction

The Ant Financial Wealth Dialogue algorithm team reviews the historical development of deep learning models in natural language processing (NLP), from simple neurons to the sophisticated BERT architecture, and discusses future application directions.

Neural Cell

A neural cell (neuron) computes a linear combination of the outputs of the previous layer and then applies a non‑linear activation. The non‑linearity is essential: without it, any stack of linear layers collapses into a single linear layer.
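
The collapse of stacked linear layers can be checked numerically. The sketch below is illustrative only: the function names, the tanh activation, and the toy weights are assumptions, not taken from the original article.

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """One neuron: weighted sum of the inputs, then an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def identity(z):
    """Stand-in 'activation' that leaves the neuron purely linear."""
    return z

# Two stacked linear neurons (scalar case): y = w2 * (w1 * x + b1) + b2
w1, b1, w2, b2 = 0.5, 1.0, -2.0, 0.25
x = 3.0
stacked = neuron([neuron([x], [w1], b1, identity)], [w2], b2, identity)

# The same mapping as ONE linear neuron with merged parameters
merged = neuron([x], [w2 * w1], w2 * b1 + b2, identity)

print(stacked, merged)  # both values are identical
```

With a non‑linear activation such as tanh between the two neurons, no such merged pair of parameters exists in general, which is why depth only adds expressive power when non‑linearities are present.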

Shallow Neural Network

A network with only one hidden layer is called a shallow network.

Deep Neural Network (Multilayer Perceptron)

Networks with two or more hidden layers are considered deep. For many complex functions, a deep network can reach a given accuracy with fewer parameters than a shallow one.

Convolutional Neural Network (CNN)

CNN neurons connect only to a local region of the previous layer, mimicking the receptive fields of visual neurons. A convolution kernel acts as a pattern extractor; multiple kernels extract multiple patterns, forming a convolutional layer.
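
As a rough illustration of a kernel acting as a pattern extractor, the sketch below slides two hand‑written kernels over a one‑dimensional signal. The kernel values and names are hypothetical; in a trained CNN the kernel weights are learned.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation: each output sees only a local window."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

signal = [0, 0, 1, 1, 1, 0, 0]

# Each kernel responds to one local pattern; several kernels form one layer.
kernels = {
    "rising_edge": [-1, 1],
    "falling_edge": [1, -1],
}
feature_maps = {name: conv1d(signal, k) for name, k in kernels.items()}

for name, fmap in feature_maps.items():
    print(name, fmap)
```

Each feature map peaks where its pattern occurs in the input, and the same kernel weights are reused at every position.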

Recurrent Neural Network (RNN)

RNNs capture sequential dependencies by sharing the same weight matrices (V, U, W) across all time steps. The hidden state S is updated at each step from the current input and the previous state, so earlier inputs influence later outputs.
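
A minimal sketch of the recurrence follows. It assumes the common convention that U maps the input to the hidden state, W maps the previous hidden state to the current one, and V maps the hidden state to the output; the article does not define the matrices, and the tanh activation, the missing bias terms, and the toy values are likewise assumptions.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, U, W, V, s0):
    """Run a vanilla RNN; the same U, W, V are reused at every time step."""
    s, outputs = s0, []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]   # S_t = tanh(U x_t + W S_{t-1})
        outputs.append(matvec(V, s))      # o_t = V S_t
    return outputs, s

# Toy sizes (assumed): 1-dimensional input, 2-dimensional hidden state.
U = [[1.0], [0.5]]
W = [[0.0, 0.5], [0.5, 0.0]]
V = [[1.0, -1.0]]
xs = [[1.0], [0.0], [-1.0]]

outputs, final_state = rnn_forward(xs, U, W, V, [0.0, 0.0])
print(outputs, final_state)
```

Feeding the same inputs in reverse order yields a different final state, which is the sense in which the hidden state encodes order.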

Long Short‑Term Memory (LSTM)

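LSTM mitigates the vanishing‑gradient problem of plain RNNs by adding a cell state whose contents are controlled by three gates: the forget gate decides what to discard, the input gate decides what to write, and the output gate decides what to expose to the next layer or time step.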
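
The gating logic can be sketched for a single LSTM unit with scalar state. This is a simplified illustration under stated assumptions: real LSTMs use vectors and weight matrices, and the parameter layout `p` (gate name mapped to input weight, recurrent weight, bias) is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM; p maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = p[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much of the candidate to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # proposed new content
    c = f * c_prev + i * g             # additive cell-state update
    h = o * math.tanh(c)
    return h, c

# Forget gate saturated open, input gate saturated shut: the cell remembers.
remember = {
    "forget": (0.0, 0.0, 50.0),
    "input": (0.0, 0.0, -50.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=remember)
print(h, c)
```

When the forget gate stays near 1, the cell state passes through the step almost unchanged, which is the mechanism that lets gradients survive over many time steps.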

TextCNN

TextCNN applies one‑dimensional convolutions over word sequences to extract n‑gram features; it excels at short‑text classification due to strong shallow feature extraction and fast inference.
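
The n‑gram extraction can be sketched as a convolution over word vectors followed by max‑over‑time pooling. The ReLU activation, the toy 2‑dimensional embeddings, and the hand‑written filters are assumptions for illustration; a real TextCNN learns many filters per width and feeds the pooled features to a classifier.

```python
def textcnn_features(embeddings, filters):
    """embeddings: list of word vectors. filters: list of kernels, each a list
    of `width` weight vectors. Returns one max-pooled feature per filter."""
    features = []
    for kernel in filters:
        width = len(kernel)
        activations = []
        for start in range(len(embeddings) - width + 1):
            window = embeddings[start:start + width]
            z = sum(w * x
                    for w_vec, x_vec in zip(kernel, window)
                    for w, x in zip(w_vec, x_vec))
            activations.append(max(0.0, z))           # ReLU
        # Max-over-time pooling; 0.0 if the text is shorter than the filter.
        features.append(max(activations, default=0.0))
    return features

sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],   # width 2: a bigram detector
    [[0.0, 1.0]],               # width 1: a unigram detector
]
features = textcnn_features(sentence, filters)
print(features)
```

Because pooling keeps only the strongest response of each filter, the feature vector has a fixed length regardless of sentence length.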

Enhanced Sequential Inference Model (ESIM)

ESIM encodes the two input sentences with Bi‑LSTMs, aligns them with cross‑sentence soft attention, builds local inference features from the aligned representations, composes those features with a second Bi‑LSTM, and applies average and max pooling before a softmax classifier.
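
A minimal sketch of the attention and local inference step follows, assuming the soft‑alignment formulation from the ESIM paper, in which each token of one sentence attends over the tokens of the other. The vectors stand in for Bi‑LSTM outputs, and the function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_align(a, b):
    """For each vector in a, return an attention-weighted summary of b."""
    aligned = []
    for a_i in a:
        scores = [sum(x * y for x, y in zip(a_i, b_j)) for b_j in b]
        weights = softmax(scores)
        aligned.append([sum(w * b_j[d] for w, b_j in zip(weights, b))
                        for d in range(len(b[0]))])
    return aligned

def enhance(a, a_aligned):
    """Per position, concatenate [a; a~; a - a~; a * a~]."""
    return [v + t
            + [x - y for x, y in zip(v, t)]
            + [x * y for x, y in zip(v, t)]
            for v, t in zip(a, a_aligned)]

premise = [[1.0, 0.0]]
hypothesis = [[1.0, 0.0], [1.0, 0.0]]
local_inference = enhance(premise, soft_align(premise, hypothesis))
print(local_inference)
```

The difference and element‑wise product terms make agreement and contradiction between aligned tokens explicit before the composition layer sees them.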

ELMo

ELMo generates context‑dependent word embeddings by running the whole sentence through a deep bidirectional LSTM language model, so each token's vector depends on the words around it. The resulting embeddings capture both semantic and syntactic information and handle polysemy better than static embeddings such as word2vec.

Pre‑training and Language Models

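Pre‑training teaches a model general language representations from unlabeled text before it is adapted to a specific task. BERT‑style bidirectional pre‑training combines two objectives: predicting masked tokens from both left and right context (Masked Language Model) and predicting the relationship between a pair of sentences (Next Sentence Prediction).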

BERT

BERT (Bidirectional Encoder Representations from Transformers) adopts the Transformer encoder architecture, pre‑trains on massive corpora with MLM and NSP objectives, and can be fine‑tuned for downstream tasks by using the [CLS] token for classification or token‑level outputs for NER.

Training Objectives

Masked Language Model (MLM): randomly select 15% of the tokens for prediction. Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random word, and 10% are left unchanged.

Next Sentence Prediction (NSP): binary classification of whether the second sentence of a pair actually follows the first in the source text.
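
The MLM corruption rule can be sketched as follows. This is a simplification under stated assumptions: each token is selected independently with probability 0.15, whereas a production pipeline may fix the number of selected tokens per sequence; the function name, the toy vocabulary, and the use of `None` for positions without a label are hypothetical.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never select the special tokens; select others with probability 0.15.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= select_prob:
            continue
        labels[i] = tok                       # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab, random.Random(0))
print(corrupted)
print(labels)
```

Leaving some selected tokens unchanged or randomly replaced means the model cannot rely on [MASK] as the only signal that a position needs predicting, which matters because [MASK] never appears during fine‑tuning.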

Fine‑tuning

Different downstream tasks use different parts of BERT’s output: classification uses the [CLS] embedding, while token‑level tasks (e.g., NER) use each token’s final hidden state.
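
The two kinds of task head can be sketched over a list of final hidden states. The vectors below are stand‑ins for real BERT outputs, and the linear‑plus‑softmax heads and their names are assumptions for illustration, not a specific library's API.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(W, b, v):
    """Affine layer: one output score per row of W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def classify_sequence(hidden_states, W, b):
    """Sentence-level head: use only position 0, the [CLS] vector."""
    return softmax(linear(W, b, hidden_states[0]))

def tag_tokens(hidden_states, W, b):
    """Token-level head (e.g. NER): apply one classifier at every position."""
    return [softmax(linear(W, b, h)) for h in hidden_states]

# Stand-in final hidden states for "[CLS] tok1 tok2" (hidden size 2, assumed).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W, b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]   # toy weights for 2 labels

sentence_probs = classify_sequence(hidden, W, b)
token_probs = tag_tokens(hidden, W, b)
print(sentence_probs)
print(token_probs)
```

In practice each task gets its own head weights, and those weights are trained together with the pre‑trained encoder during fine‑tuning.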

Comparison of CNN, RNN, and Self‑Attention

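CNN captures local patterns and works well for short texts. RNN models sequential dependencies but processes tokens one at a time and suffers from vanishing gradients over long sequences. Self‑Attention (as in Transformers) processes all positions in parallel and connects any two positions directly, which makes long‑range dependencies easier to capture and avoids the long gradient paths of an RNN.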
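
Scaled dot‑product attention, the core of self‑attention, can be sketched for a single head as below. The learned query, key, and value projections and the multi‑head split are omitted for brevity, and the toy matrices are assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to every key at once; no step-by-step recurrence."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, V))
                        for d in range(len(V[0]))])
    return outputs

# Self-attention: queries, keys, and values all come from the same sequence.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(X, X, X)
print(attended)
```

Every output row is computed independently of the others, so the loop over queries can run in parallel, and the first and last positions interact in a single step regardless of the distance between them.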

Summary

In the team's intent classification experiments, BERT outperforms XGBoost, TextCNN, LSTM, and ERNIE. Latency grows linearly with the number of layers, while the accuracy gain from extra layers diminishes for short queries. Multi‑head attention contributes more to accuracy than additional layers, and adding heads has little impact on latency.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
