Getting Started with GPT: How Generative Pre‑Training and Discriminative Fine‑Tuning Work
This article explains GPT's two‑stage learning—unsupervised generative pre‑training on large raw corpora followed by discriminative fine‑tuning on labeled tasks—detailing the underlying Transformer decoder architecture, loss functions, and task‑specific input transformations.
GPT (Generative Pre‑trained Transformer) first performs generative pre‑training on diverse unlabeled text corpora, learning a universal representation that transfers to many downstream language‑understanding tasks with minimal adaptation. After pre‑training, the model is discriminatively fine‑tuned on each labeled target task.
Background
In natural‑language processing, most deep‑learning methods require large amounts of manually annotated data, which limits their use in domains lacking such resources. Because collecting more annotations is time‑consuming and costly, leveraging the linguistic information hidden in raw, unlabeled text offers a valuable alternative.
Challenges
- Extracting information beyond the word level from raw text is difficult.
- No single training objective performs well across all tasks.
Solution Overview
GPT adopts a semi‑supervised approach that combines unsupervised pre‑training with supervised fine‑tuning to address language‑understanding tasks.
Unsupervised Pre‑Training
The model maximizes the standard language‑model likelihood over the unlabeled corpus:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$

where k is the size of the context window and the conditional probability is modeled by a neural network with parameters Θ.
GPT uses a multi‑layer Transformer decoder. Given a context of tokens U = (u₁, u₂, …, uₙ), the tokens are embedded with the token‑embedding matrix Wₑ, position embeddings Wₚ are added to form the input h₀, the sequence passes through n Transformer blocks, and a final softmax produces the next‑token probability distribution:

$$h_0 = U W_e + W_p$$
$$h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \text{for } l \in [1, n]$$
$$P(u) = \mathrm{softmax}(h_n W_e^{\top})$$
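To make the pre‑training stage concrete, here is a minimal sketch in PyTorch. It is not the paper's implementation: the model sizes, vocabulary, and toy batch are illustrative assumptions. It builds a small decoder‑only Transformer with tied input/output embeddings and trains it on the next‑token objective L₁:

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """Toy decoder-only Transformer; all sizes are illustrative, not the paper's."""
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embeddings (We)
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positions (Wp)
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        # With a causal mask, stacked "encoder" layers behave as a decoder-only stack.
        self.blocks = nn.TransformerEncoder(block, n_layers)

    def hidden(self, idx):
        t = idx.size(1)
        pos = torch.arange(t, device=idx.device)
        h = self.tok_emb(idx) + self.pos_emb(pos)          # h0 = U·We + Wp
        causal = nn.Transformer.generate_square_subsequent_mask(t).to(idx.device)
        return self.blocks(h, mask=causal)                 # output of the final block

    def forward(self, idx):
        # Tied output projection: logits over the vocabulary; softmax gives P(u).
        return self.hidden(idx) @ self.tok_emb.weight.T

# Maximizing L1 == minimizing cross-entropy between predictions and shifted targets.
model = MiniGPT(vocab_size=1000)
batch = torch.randint(0, 1000, (8, 33))                    # toy "unlabeled" token batch
logits = model(batch[:, :-1])
loss_l1 = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1))
loss_l1.backward()
```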
Supervised Fine‑Tuning
For a labeled dataset C, each instance consists of a token sequence (x¹, …, xᵐ) and a label y. The pretrained model provides hₗᵐ, the final Transformer block's activation at the last input token; a linear output layer with weights Wᵧ maps this vector to logits, and a softmax yields the label probability:

$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)$$
The fine‑tuning loss L₂ is the corresponding maximum‑likelihood objective:

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)$$
Empirically, including the pre‑training objective L₁ (next‑token prediction) as an auxiliary loss during fine‑tuning improves generalization and speeds up convergence, so the paper optimizes a combined objective L₃:

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

where the weight λ balances the auxiliary language‑modeling term.
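Here is a sketch of the fine‑tuning step under the same toy setup as the pre‑training sketch above. The classification head, the value of λ, and the labeled batch are illustrative assumptions; `model` is the MiniGPT defined earlier:

```python
import torch
import torch.nn as nn

n_classes, lam = 3, 0.5                       # assumed task size and L3 weight λ
clf_head = nn.Linear(128, n_classes)          # Wy (d_model = 128 in the sketch above)

x = torch.randint(0, 1000, (8, 32))           # labeled token sequences (x1..xm)
y = torch.randint(0, n_classes, (8,))         # task labels

h = model.hidden(x)                           # final Transformer block activations
logits_y = clf_head(h[:, -1, :])              # hl^m (last token) -> label logits
loss_l2 = nn.functional.cross_entropy(logits_y, y)        # supervised loss L2

lm_logits = h @ model.tok_emb.weight.T        # reuse the LM head on the labeled text
loss_l1 = nn.functional.cross_entropy(
    lm_logits[:, :-1].reshape(-1, 1000), x[:, 1:].reshape(-1))

loss_l3 = loss_l2 + lam * loss_l1             # L3 = L2 + λ·L1
loss_l3.backward()
```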
Task‑Specific Input Transformations
Before fine‑tuning, structured inputs are converted into a single ordered token sequence that the pretrained model can process, bracketed by randomly initialized start and extract tokens. For textual entailment, the premise and hypothesis are concatenated with a delimiter token between them; for similarity, both orderings of the two texts are encoded and their sequence representations added element‑wise before the output layer; for question answering and multiple choice, the context and question are concatenated with each candidate answer, and a softmax normalizes over the resulting sequences. A sketch of these transformations follows.
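A minimal sketch of these traversal‑style transformations, assuming hypothetical reserved token ids for the start, delimiter, and extract tokens (the paper learns embeddings for such special tokens; the ids here are placeholders):

```python
# Assumed reserved vocabulary ids for the special tokens (placeholders).
START, DELIM, EXTRACT = 997, 998, 999

def entailment_input(premise, hypothesis):
    # Concatenate premise and hypothesis with a delimiter token in between.
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(text_a, text_b):
    # No natural ordering: encode both orderings, then add the two sequence
    # representations element-wise before the linear output layer.
    return (entailment_input(text_a, text_b),
            entailment_input(text_b, text_a))

def multiple_choice_inputs(context, question, answers):
    # One sequence per candidate answer; a softmax over the per-sequence
    # logits normalizes across the candidates.
    return [[START] + context + question + [DELIM] + a + [EXTRACT]
            for a in answers]
```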
Related Work
At the time of writing, the most successful unsupervised models were word‑embedding approaches, which could be broadly applied to improve many NLP tasks.
Network Intelligence Research Center (NIRC)
NIRC is based at the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains (intelligent cloud networking, natural language processing, computer vision, and machine learning systems) and is dedicated to solving real‑world problems, building top‑tier systems, publishing high‑impact papers, and contributing to the rapid advancement of China's network technology.
