
NLP Study Notes: Methods for Natural Language Processing Using Pre‑trained Models

This article reviews the evolution of deep learning, its key concepts, model architectures, training strategies, and applications—especially in speech, vision, and natural language processing—highlighting seminal research, comparative analyses, and current challenges for future AI development.

Lisa Notes

Jul 4, 2026


Deep Learning Concept

Deep Learning (DL) was introduced by Hinton et al. in 2006 as a sub‑field of Machine Learning that learns hierarchical feature representations from large‑scale data. Compared with shallow models such as SVM or boosting, DL networks contain many hidden layers, enabling more expressive nonlinear functions and reducing the number of parameters required to represent complex functions.

Key Historical Milestones

2006 – Hinton’s Science paper demonstrated strong feature‑learning ability of multilayer neural networks and proposed layer‑wise unsupervised pre‑training to alleviate optimization difficulties.

2010 – DARPA funded the first large‑scale DL project involving NEC, NYU and Stanford.

2011‑12 – Microsoft and Google applied deep neural networks (DNNs) to speech recognition, lowering word‑error rates by 20‑30 %.

2012 – Krizhevsky et al. reduced ImageNet top‑5 error by 9 % using a deep convolutional neural network (CNN). The same year, Andrew Ng’s team trained a network on a 16 000‑processor cluster that learned to recognize cats from unlabeled images, illustrating DL scalability.

Model Categories

Feed‑forward deep networks (FFDN) : multilayer perceptrons (MLP) and CNNs.

Feedback deep networks (FBDN) : deconvolutional networks that decode representations.

Bidirectional deep networks (BDDN) : combine encoder and decoder layers, e.g., Deep Belief Networks (DBN) and stacked auto‑encoders (SAE).

Convolutional Neural Network Architecture

A typical CNN layer consists of (1) a convolution stage with weight‑shared filters that detect local patterns, (2) a non‑linear activation stage (commonly ReLU, which does not saturate for positive inputs) that lets the network represent non‑linear functions, and (3) an optional pooling stage (max or average) that reduces spatial resolution while preserving salient features. Stacking such layers yields hierarchical representations loosely analogous to the human visual system: low‑level edge detectors feed into higher‑level shape detectors.
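The three stages described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the 6x6 toy image, the hand-set `edge_kernel`, and the helper names are all assumptions made for the example, and the convolution uses no padding and stride 1.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def relu(x):
    """Non-saturating activation: negative responses are set to zero."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    trimmed = x[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

# Toy image (assumption): dark left half, bright right half, i.e. one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hypothetical hand-set filter that responds to dark-to-bright vertical edges.
edge_kernel = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])

features = conv2d_valid(image, edge_kernel)  # stage 1: convolution, shape (5, 5)
activated = relu(features)                   # stage 2: non-linearity
pooled = max_pool(activated, size=2)         # stage 3: pooling, shape (2, 2)
print(pooled)
```

The response is non-zero only in the column where the edge sits, and pooling keeps that response while halving the resolution. In a trained CNN the kernel values are learned rather than set by hand.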

Training Procedure

Unsupervised bottom‑up pre‑training of each layer (using Restricted Boltzmann Machines or auto‑encoders).

Supervised top‑down fine‑tuning of the whole network.

Empirical studies reported that random initialization followed by gradient‑based optimization often converges to poor local minima, especially as depth increases. Layer‑wise unsupervised pre‑training provides better initial parameters, improves generalization, and speeds up convergence (Hinton et al., 2006; Erhan et al.; Glorot et al.). Variants such as regularized deep Fisher mapping, sparse encoding symmetric machines, and adaptive learning‑rate schemes (AdaGrad, RMSProp) further enhance performance on specific tasks.
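The two-step procedure can be sketched as follows. This is a toy illustration under several assumptions: it uses tied-weight sigmoid auto-encoders rather than RBMs, full-batch gradient descent, a synthetic 8-bit dataset, and hypothetical names such as `pretrain_layer`. It shows the mechanics only and does not reproduce any result from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(data, n_hidden, epochs=200, lr=0.5):
    """Train one tied-weight auto-encoder on `data` (unsupervised).
    Returns the encoder weights, encoder bias, and the reconstruction-loss history."""
    n_in = data.shape[1]
    W = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
    b_hid, b_rec = np.zeros(n_hidden), np.zeros(n_in)
    losses = []
    for _ in range(epochs):
        h = sigmoid(data @ W + b_hid)        # encode
        recon = sigmoid(h @ W.T + b_rec)     # decode with the transposed weights
        err = recon - data
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the squared reconstruction error, averaged over samples.
        d_recon = err * recon * (1 - recon) / len(data)
        d_h = (d_recon @ W) * h * (1 - h)
        W -= lr * (data.T @ d_h + d_recon.T @ h)
        b_hid -= lr * d_h.sum(axis=0)
        b_rec -= lr * d_recon.sum(axis=0)
    return W, b_hid, losses

def cross_entropy(p, y):
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Toy task (assumption): 8 random bits; label is 1 when the first four bits hold more ones.
X = (rng.random((200, 8)) > 0.5).astype(float)
y = (X[:, :4].sum(axis=1) > X[:, 4:].sum(axis=1)).astype(float)

# Step 1 (bottom-up, unsupervised): each layer learns to reconstruct the layer below it.
W1, b1, recon_loss1 = pretrain_layer(X, n_hidden=6)
W2, b2, recon_loss2 = pretrain_layer(sigmoid(X @ W1 + b1), n_hidden=4)

# Step 2 (top-down, supervised): add an output unit and back-propagate through all layers.
w_out, b_out, lr = np.zeros(4), 0.0, 0.5
supervised_loss = []
for _ in range(500):
    h1 = sigmoid(X @ W1 + b1)
    h2 = sigmoid(h1 @ W2 + b2)
    p = sigmoid(h2 @ w_out + b_out)
    supervised_loss.append(cross_entropy(p, y))
    d_out = (p - y) / len(X)                          # cross-entropy gradient at the output
    d_h2 = np.outer(d_out, w_out) * h2 * (1 - h2)
    d_h1 = (d_h2 @ W2.T) * h1 * (1 - h1)
    w_out -= lr * (h2.T @ d_out)
    b_out -= lr * d_out.sum()
    W2 -= lr * (h1.T @ d_h2)
    b2 -= lr * d_h2.sum(axis=0)
    W1 -= lr * (X.T @ d_h1)
    b1 -= lr * d_h1.sum(axis=0)

print(recon_loss1[0], recon_loss1[-1], supervised_loss[0], supervised_loss[-1])
```

The key design point is that step 1 never looks at `y`: the labels enter only during fine-tuning, which starts from the pre-trained `W1` and `W2` instead of from random values.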

Representative Model Variants

Deep Belief Networks (DBN) – stack of binary RBMs; continuous‑value RBM variants (mcRBM, mPoT, spike‑and‑slab RBM) improve performance on MNIST and other benchmarks.

Sum‑Product Networks (SPN) – directed acyclic graphs that make partition‑function computation tractable; evaluated on Caltech‑101 and Olivetti datasets, outperforming DBN and nearest‑neighbor baselines.

Rectified linear units (ReLU) – replace saturating sigmoids, yield sparser activations and faster convergence; dropout and DropConnect regularize fully connected layers at the cost of slower training.
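As a rough sketch of how dropout regularizes a fully connected layer, the snippet below implements the "inverted" variant, which rescales the surviving units during training so that nothing needs to change at test time. The function name, the drop probability of 0.5, and the all-ones activations are assumptions chosen for illustration.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob, rescale the rest."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob  # True for units that survive
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 50))          # stand-in for hidden-layer activations

train_out = dropout(hidden, drop_prob=0.5, rng=rng, training=True)
test_out = dropout(hidden, drop_prob=0.5, rng=rng, training=False)

print(np.unique(train_out))           # surviving units are scaled up, dropped units are 0
print(train_out.mean(), test_out.mean())
```

Because a different random mask is drawn on every pass, each update trains a different thinned network, which is one intuition for why training takes longer. DropConnect follows the same idea but masks individual weights rather than whole activations.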

Applications with Quantitative Results

Speech recognition : Microsoft’s CD‑DNN‑HMM cut word‑error rate (WER) by more than 30 % relative on the Switchboard corpus; on the 300 h Switchboard set it reached 18.5 % WER, a 33 % relative reduction.

Machine translation : Cho et al.’s RNN‑enc and Bahdanau et al.’s attention‑based RNNsearch obtained higher BLEU scores than the phrase‑based Moses system on the WMT 2014 English‑French task.

Image classification : Krizhevsky et al. (2012) reported 15.3 % top‑5 error on ImageNet; Zeiler et al. (2013) improved to 11.7 % (11.2 % with pre‑training); GoogLeNet (2014) achieved 6.7 % top‑5 error.

Face recognition : DeepID (4‑layer CNN) reached 97.45 % accuracy on LFW; DeepFace (5‑layer CNN) 97.35 %; DeepID2 (4‑layer CNN with mixed weight sharing) improved to 99.15 %.

Video and action recognition : Karpathy et al.’s 4‑stream CNN attained 63.9 % accuracy on Sports‑1M; 3‑D CNNs (Ji et al., Baccouche et al.) achieved >94 % accuracy on the KTH dataset.
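Two of the metrics quoted above, relative error reduction and top‑5 error, are easy to misread, so here is a small sketch of how they are computed. The function names and the toy score matrix are assumptions. The "implied baseline" is back-of-the-envelope arithmetic on the figures in this article, not a number reported by the original papers.

```python
import numpy as np

def relative_reduction(baseline_error, new_error):
    """Fraction of the baseline error that was removed."""
    return (baseline_error - new_error) / baseline_error

def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# Speech: 18.5 % WER at a 33 % relative reduction implies a baseline of roughly 27.6 %.
implied_baseline_wer = 18.5 / (1.0 - 0.33)

# Vision: relative drop in top-5 error from the 2012 figure (15.3 %) to the 2014 one (6.7 %).
imagenet_relative = relative_reduction(15.3, 6.7)

# Toy classifier scores for 3 samples over 6 classes (made up for illustration).
scores = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                   [0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
                   [0.3, 0.1, 0.6, 0.2, 0.5, 0.4]])
labels = np.array([5, 3, 4])

print(implied_baseline_wer, imagenet_relative)
print(top_k_error(scores, labels, k=2), top_k_error(scores, labels, k=5))
```

A relative reduction is measured against the baseline error, so it is not the same as a drop in percentage points: going from 15.3 % to 6.7 % is 8.6 points but more than half of the original error.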

Advantages of Deep Learning

Higher expressive power for complex functions.

Compact hierarchical representations reduce computational complexity compared with shallow equivalents.

Biologically inspired processing that mirrors the visual cortex.

Learned features can be reused across related tasks.

Improved generalization when depth is appropriately chosen.

Open Problems and Future Directions

Theoretical understanding of optimization landscapes and why deep models are hard to train.

Design of scalable multimodal architectures and efficient parallel training on large clusters.

Leveraging massive unlabeled data through unsupervised or semi‑supervised learning.

Balancing model size, training speed, and regularization.

Integrating deep models with other techniques such as kernel methods or probabilistic graphical models.

References

1. Hinton et al., 2006, Science.

2. Krizhevsky et al., 2012, ImageNet.

3. Cho et al., 2014, RNN‑enc.

4. Bahdanau et al., 2014, RNNsearch.

5. Zeiler et al., 2013, DeconvNet.

Additional works are cited throughout the text.

Code example

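Feed-forward neural networks are among the earliest artificial neural network models. In such a network, information flows in one direction only, from the input units through one or more hidden layers to the output units, and the network contains no closed loops. Typical feed-forward networks include the multilayer perceptron and the convolutional neural network.

The perceptron proposed by F. Rosenblatt is the simplest single-layer feed-forward network, but M. Minsky et al. later proved that a single-layer perceptron cannot solve linearly inseparable problems (such as XOR). This result pushed neural network research into a slump. Interest revived once researchers recognized that multilayer perceptrons can solve linearly inseparable problems and began combining the back-propagation algorithm with neural networks. However, traditional back-propagation converges slowly, needs large amounts of labeled training data, and easily gets trapped in local optima, so multilayer perceptrons did not perform particularly well.

In 1984, the Japanese scholar K. Fukushima et al. proposed the neocognitron, based on the concept of receptive fields; it can be regarded as a special case of the convolutional neural network. The convolutional neural network proposed by Y. LeCun et al. generalizes the neocognitron. A convolutional neural network is a trainable multilayer architecture composed of several single-layer convolutional networks. Each single layer comprises three stages: convolution, nonlinear transformation, and downsampling, where the downsampling stage is not required in every layer. The input and output of each layer are feature maps, each formed by a set of vectors (the raw input signal to the first layer can be viewed as a high-dimensional feature map with high sparsity). For example, when the input is a color image, each feature map is a 2-D array holding one color channel of the input image (for audio input, a feature map is a 1-D vector; for video or stereoscopic images, it is a 3-D array). On the output side, each feature map represents one specific feature extracted at every position of the input image.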
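The XOR limitation of the single-layer perceptron, and the way one hidden layer removes it, can be checked directly. In this sketch the hidden-layer weights are set by hand (an assumption for illustration; nothing is trained), and the grid search over single-layer weights is only a demonstration, not a proof.

```python
import numpy as np

def step(z):
    """Threshold activation used by the classic perceptron."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
target = np.array([0, 1, 1, 0])                      # XOR truth table

# Single layer: try every (w1, w2, b) on a coarse grid; none reproduces XOR.
grid = np.linspace(-2, 2, 21)
single_layer_solves_xor = any(
    np.array_equal(step(X @ np.array([w1, w2]) + b), target)
    for w1 in grid for w2 in grid for b in grid
)

# Two layers with hand-set weights: hidden unit 0 computes OR, hidden unit 1 computes NAND.
W_hidden = np.array([[1.0, -1.0],
                     [1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])
# The output unit computes AND of the two hidden units, which yields XOR.
w_out = np.array([1.0, 1.0])
b_out = -1.5

hidden = step(X @ W_hidden + b_hidden)
xor = step(hidden @ w_out + b_out)

print(single_layer_solves_xor)   # False
print(xor)                       # [0 1 1 0]
```

The hidden layer remaps the four input points so that they become linearly separable for the output unit, which is exactly what a single threshold unit on the raw inputs cannot do.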
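(1) Single-layer convolutional neural network. In the convolution stage, the input signal is observed for specific patterns by extracting different features of the signal. Each observation pattern is called a convolution kernel, a notion that derives from the local receptive field concept proposed by D. H. Hubel et al. on the basis of their studies of cells in the cat visual cortex. Each kernel detects one specific feature at every position of the input feature map, which realizes weight sharing across a single input feature map. To extract different features from the input feature map, different kernels are used. The input to the convolution stage is a 3-D array made up of n1 2-D feature maps of size n2*n3. Each input feature map is denoted xi. The output y of the stage is also a 3-D array, made up of m1 feature maps of size m2*m3. The weights connecting input feature map xi to output feature map yj are denoted wij; they form the trainable convolution kernel (local receptive field), of size k2*k3. The output feature map is yj.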
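The text does not give the formula for yj, so the sketch below assumes the common formulation in which each output map is a bias plus the sum, over all input maps xi, of xi filtered by the kernel wij. It also assumes no padding and stride 1, so that m2 = n2 - k2 + 1 and m3 = n3 - k3 + 1, and it applies the kernels without flipping them, as most deep learning libraries do. The function name and the sizes are chosen for illustration.

```python
import numpy as np

def conv_stage(x, w, b):
    """x: (n1, n2, n3) input maps; w: (m1, n1, k2, k3) kernels; b: (m1,) biases.
    Returns y with shape (m1, n2 - k2 + 1, n3 - k3 + 1)."""
    n1, n2, n3 = x.shape
    m1, _, k2, k3 = w.shape
    m2, m3 = n2 - k2 + 1, n3 - k3 + 1
    y = np.zeros((m1, m2, m3))
    for j in range(m1):                      # one output map per kernel set w[j]
        for r in range(m2):
            for c in range(m3):
                # The same weights w[j] are reused at every position (weight sharing).
                y[j, r, c] = b[j] + np.sum(w[j] * x[:, r:r + k2, c:c + k3])
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))       # n1 = 3 input maps of size 8*8
w = rng.normal(size=(4, 3, 3, 3))    # m1 = 4 output maps, kernels of size 3*3
b = np.zeros(4)

y = conv_stage(x, w, b)

shared_params = w.size + b.size      # parameters with weight sharing
dense_weights = x.size * y.size      # weights of a fully connected layer of the same size
print(y.shape, shared_params, dense_weights)
```

Comparing `shared_params` with `dense_weights` shows why weight sharing matters: the convolution stage here needs about a hundred parameters where a fully connected mapping between the same input and output would need tens of thousands.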
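In the nonlinear stage, the features obtained in the convolution stage are filtered according to some rule. The rule is usually a nonlinear transformation, which avoids the limited expressive power of a purely linear model. The nonlinear stage takes the features extracted by the convolution stage as input and applies a nonlinear mapping R=h(y). Traditional convolutional neural networks use saturating nonlinearities such as sigmoid, tanh, or softsign, whereas recent convolutional neural networks mostly use the non-saturating nonlinearity ReLU (rectified linear units). Under gradient-descent training, ReLU converges faster than the traditional saturating nonlinearities, so the network as a whole also trains much faster than with traditional methods.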
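One way to see the difference between saturating and non-saturating choices of h is to compare their derivatives at large inputs, since those derivatives scale the gradient that flows back through the layer. The sketch below uses the standard closed-form derivatives; the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid_grad(y):
    s = 1.0 / (1.0 + np.exp(-y))
    return s * (1.0 - s)

def tanh_grad(y):
    return 1.0 - np.tanh(y) ** 2

def softsign_grad(y):
    # softsign(y) = y / (1 + |y|), so its derivative is 1 / (1 + |y|)^2
    return 1.0 / (1.0 + np.abs(y)) ** 2

def relu_grad(y):
    # Derivative of max(0, y); the value at exactly 0 is taken to be 0 here.
    return (y > 0).astype(float)

y = np.array([-10.0, -1.0, 1.0, 10.0])
for name, grad in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad),
                   ("softsign", softsign_grad), ("relu", relu_grad)]:
    print(f"{name:8s}", grad(y))
```

For the three saturating functions the derivative shrinks toward zero as the input grows in magnitude, while ReLU keeps a derivative of 1 for every positive input. That constant gradient on the active side is one common explanation for the faster convergence noted above.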
