Getting Started with GPT: How Generative Pre‑Training and Discriminative Fine‑Tuning Work

This article explains GPT's two‑stage learning—unsupervised generative pre‑training on large raw corpora followed by discriminative fine‑tuning on labeled tasks—detailing the underlying Transformer decoder architecture, loss functions, and task‑specific input transformations.

Network Intelligence Research Center (NIRC)

GPT (Generative Pre‑trained Transformer) first performs generative pre‑training on diverse unlabeled text corpora, learning a universal representation that can be transferred to many downstream language‑understanding tasks with minimal adaptation. After pre‑training, the same model is fine‑tuned discriminatively on each labeled downstream task.

Background

In natural‑language processing, most deep‑learning methods require large amounts of manually annotated data, which limits their use in domains lacking such resources. Collecting more annotations is time‑consuming and costly, so leveraging the linguistic information hidden in raw, unlabeled text offers a valuable alternative.

Challenges

Extracting information beyond the word level from raw text is difficult.

No single objective function performs well across all tasks.

Solution Overview

GPT adopts a semi‑supervised approach that combines unsupervised pre‑training with supervised fine‑tuning to address language‑understanding tasks.

Unsupervised Pre‑Training

The model maximizes the standard language‑model likelihood:

L₁(U) = Σᵢ log P(uᵢ | uᵢ₋ₖ, …, uᵢ₋₁; Θ)

where U = (u₁, …, uₙ) is the unlabeled token corpus, k is the size of the context window, and Θ are the parameters of the neural network.
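
To make the objective concrete, the sketch below shows how maximizing L₁ is equivalent to minimizing a summed next‑token cross‑entropy. This is an illustrative PyTorch fragment, not code from the paper; `logits` and `token_ids` are placeholder names for the per‑position vocabulary scores produced by any next‑token model and the corresponding token ids.

```python
import torch
import torch.nn.functional as F

def language_model_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Negative of L1 for one sequence: logits has shape (T, vocab), token_ids has shape (T,)."""
    predictions = logits[:-1]   # position i predicts token i + 1
    targets = token_ids[1:]     # the tokens being predicted
    # Cross-entropy is the mean negative log-likelihood, so minimizing it maximizes L1.
    return F.cross_entropy(predictions, targets)
```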

GPT uses a multi‑layer Transformer decoder. Given the token sequence U = (u₁, u₂, …, uₙ), the tokens are embedded, learned position embeddings are added to form the input h₀, and the sequence passes through n Transformer blocks before a final softmax produces the next‑token probability distribution:

h₀ = U·Wₑ + Wₚ
hₗ = transformer_block(hₗ₋₁) for l = 1 … n
P(u) = softmax(hₙ·Wₑᵀ)

where Wₑ is the token‑embedding matrix and Wₚ is the learned position‑embedding matrix.
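
The following is a minimal sketch of that forward pass, assuming PyTorch. It substitutes the library's stock TransformerEncoderLayer plus a causal mask for the paper's masked‑self‑attention decoder blocks, and the class name, default sizes, and method names are illustrative rather than the original implementation (the paper uses a 12‑layer, 768‑dimensional, 12‑head model over a BPE vocabulary).

```python
import torch
import torch.nn as nn

class MiniGPTDecoder(nn.Module):
    """Sketch of h0 = U·We + Wp, h_l = transformer_block(h_{l-1}), P(u) = softmax(h_n·We^T)."""

    def __init__(self, vocab_size: int, max_len: int = 512,
                 d_model: int = 768, n_heads: int = 12, n_layers: int = 12):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)   # We
        self.pos_emb = nn.Embedding(max_len, d_model)        # Wp (learned position embeddings)
        block = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        # An encoder layer restricted by a causal mask behaves like the
        # masked-self-attention block of a decoder-only Transformer.
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        """token_ids: (batch, seq) -> hidden states h_n: (batch, seq, d_model)."""
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        h0 = self.token_emb(token_ids) + self.pos_emb(positions)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len).to(token_ids.device)
        return self.blocks(h0, mask=causal)

    def lm_logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # Tie the output projection to the input embedding: softmax(h_n·We^T).
        return hidden @ self.token_emb.weight.T
```

Pre‑training then amounts to applying the language‑model loss above to lm_logits(hidden) over large unlabeled corpora.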

Supervised Fine‑Tuning

For a labeled dataset C, each instance consists of a token sequence (x₁, …, xₘ) and a label y. The pretrained model provides hₗᵐ, the final Transformer block's activation at the last token m; a linear output layer with weights Wᵧ maps this vector to logits, and a softmax yields the label probability:

P(y | x₁, …, xₘ) = softmax(hₗᵐ·Wᵧ)

The fine‑tuning objective L₂ maximizes the log‑likelihood of the correct labels:

L₂(C) = Σ log P(y | x₁, …, xₘ)

where the sum runs over all labeled examples (x₁, …, xₘ, y) in C.
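
A corresponding fine‑tuning sketch, reusing the hypothetical MiniGPTDecoder above, might look as follows; the head, class names, and the assumption that the extraction token sits at the final position are illustrative, not the paper's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPTClassifier(nn.Module):
    """Sketch of P(y | x1 … xm) = softmax(h_l^m · Wy)."""

    def __init__(self, pretrained: "MiniGPTDecoder", d_model: int = 768, n_classes: int = 2):
        super().__init__()
        self.backbone = pretrained                 # pretrained Transformer decoder
        self.head = nn.Linear(d_model, n_classes)  # Wy

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(token_ids)          # (batch, seq, d_model)
        h_last = hidden[:, -1]                     # h_l^m: assumes the <extract> token is last
        return self.head(h_last)                   # logits; softmax gives P(y | x)

def fine_tuning_loss(classifier: GPTClassifier,
                     token_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Cross-entropy on the label logits is the negative of L2, so minimizing it maximizes L2.
    return F.cross_entropy(classifier(token_ids), labels)
```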

Empirically, keeping the language‑modeling objective L₁ (next‑token prediction) as an auxiliary objective during fine‑tuning improves the supervised model's generalization and speeds up convergence, so the paper also defines a combined objective L₃:

L₃(C) = L₂(C) + λ·L₁(C)

where λ weights the auxiliary language‑modeling term (the paper uses λ = 0.5).
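
Under the same assumptions as the earlier sketches, the combined objective can be expressed by simply adding the two losses; the default λ of 0.5 follows the paper, while the function and variable names are again illustrative.

```python
import torch
import torch.nn.functional as F

def combined_loss(classifier: "GPTClassifier", token_ids: torch.Tensor,
                  labels: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Negative of L3 = L2 + lambda * L1, computed on the fine-tuning batch."""
    hidden = classifier.backbone(token_ids)
    clf_loss = F.cross_entropy(classifier.head(hidden[:, -1]), labels)      # -L2
    lm_logits = classifier.backbone.lm_logits(hidden)                       # next-token scores
    lm_loss = F.cross_entropy(lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
                              token_ids[:, 1:].reshape(-1))                 # -L1 on the task text
    return clf_loss + lam * lm_loss
```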

Task‑Specific Input Transformations

Before fine‑tuning, inputs are often transformed to suit the target task, as illustrated in the following figure:

Task‑specific input transformation
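
As a rough illustration of those transformations, the helper functions below pack each task's pieces into a single token sequence with start, delimiter, and extract markers. The literal token strings are made up for readability (the paper uses randomly initialized special embeddings), so treat this as a sketch of the idea rather than the original preprocessing code.

```python
from typing import List

START, DELIM, EXTRACT = "<s>", "$", "<e>"   # illustrative special tokens

def classification_input(text: List[str]) -> List[str]:
    return [START, *text, EXTRACT]

def entailment_input(premise: List[str], hypothesis: List[str]) -> List[str]:
    # Premise and hypothesis are concatenated around a delimiter token.
    return [START, *premise, DELIM, *hypothesis, EXTRACT]

def similarity_inputs(a: List[str], b: List[str]) -> List[List[str]]:
    # Similarity has no natural ordering, so both orderings are encoded
    # and their final hidden states are combined before the output head.
    return [[START, *a, DELIM, *b, EXTRACT],
            [START, *b, DELIM, *a, EXTRACT]]

def multiple_choice_inputs(context: List[str], answers: List[List[str]]) -> List[List[str]]:
    # One sequence per candidate answer; a softmax over the per-sequence
    # scores selects the answer.
    return [[START, *context, DELIM, *answer, EXTRACT] for answer in answers]
```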

Related Work

At the time of writing, the most successful unsupervised models were word‑embedding approaches, which could be broadly applied to improve many NLP tasks.

Related work illustration
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: transformer, fine-tuning, NLP, Unsupervised Learning, GPT, Generative Pre‑Training
Written by

Network Intelligence Research Center (NIRC)

NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
