Getting Started with GPT: How Generative Pre‑Training and Discriminative Fine‑Tuning Work
This article explains GPT's two‑stage learning—unsupervised generative pre‑training on large raw corpora followed by discriminative fine‑tuning on labeled tasks—detailing the underlying Transformer decoder architecture, loss functions, and task‑specific input transformations.
GPT (Generative Pre‑trained Transformer) first performs generative pre‑training on diverse unlabeled text corpora, learning a universal representation that transfers to many downstream language‑understanding tasks with minimal adaptation. After pre‑training, the model is discriminatively fine‑tuned on each labeled target task.
Background
In natural‑language processing, most deep‑learning methods require large amounts of manually annotated data, which limits their use in domains lacking such resources. Because collecting more annotations is time‑consuming and costly, leveraging the linguistic information hidden in raw, unlabeled text offers a valuable alternative.
Challenges
- Extracting information beyond the word level from raw text is difficult.
- No single training objective performs well across all tasks.
Solution Overview
GPT adopts a semi‑supervised approach that combines unsupervised pre‑training with supervised fine‑tuning to address language‑understanding tasks.
Unsupervised Pre‑Training
The model maximizes the standard language‑model likelihood over the unlabeled corpus:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$

where k is the size of the context window and the conditional probability is modeled by a neural network with parameters Θ.
GPT uses a multi‑layer Transformer decoder. Given a context of tokens U = (u₁, u₂, …, uₙ), the tokens are embedded with the token‑embedding matrix Wₑ, position embeddings Wₚ are added to form the input h₀, the sequence passes through n Transformer blocks, and a final softmax produces the next‑token probability distribution:

$$h_0 = U W_e + W_p$$
$$h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \text{for } l \in [1, n]$$
$$P(u) = \mathrm{softmax}(h_n W_e^{\top})$$
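To make the pre‑training stage concrete, here is a minimal sketch in PyTorch. It is not the paper's implementation: the model sizes, vocabulary, and toy batch are illustrative assumptions. It builds a small decoder‑only Transformer with tied input/output embeddings and trains it on the next‑token objective L₁:

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """Toy decoder-only Transformer; all sizes are illustrative, not the paper's."""
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embeddings (We)
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positions (Wp)
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        # With a causal mask, stacked "encoder" layers behave as a decoder-only stack.
        self.blocks = nn.TransformerEncoder(block, n_layers)

    def hidden(self, idx):
        t = idx.size(1)
        pos = torch.arange(t, device=idx.device)
        h = self.tok_emb(idx) + self.pos_emb(pos)          # h0 = U·We + Wp
        causal = nn.Transformer.generate_square_subsequent_mask(t).to(idx.device)
        return self.blocks(h, mask=causal)                 # output of the final block

    def forward(self, idx):
        # Tied output projection: logits over the vocabulary; softmax gives P(u).
        return self.hidden(idx) @ self.tok_emb.weight.T

# Maximizing L1 == minimizing cross-entropy between predictions and shifted targets.
model = MiniGPT(vocab_size=1000)
batch = torch.randint(0, 1000, (8, 33))                    # toy "unlabeled" token batch
logits = model(batch[:, :-1])
loss_l1 = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1))
loss_l1.backward()
```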
Supervised Fine‑Tuning
For a labeled dataset C, each instance consists of a token sequence (x¹, …, xᵐ) and a label y. The pretrained model provides hₗᵐ, the final Transformer block's activation at the last input token; a linear output layer with weights Wᵧ maps this vector to logits, and a softmax yields the label probability:

$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)$$
The fine‑tuning loss L₂ is the corresponding maximum‑likelihood objective:

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)$$
Empirically, including the pre‑training objective L₁ (next‑token prediction) as an auxiliary loss during fine‑tuning improves generalization and speeds up convergence, so the paper optimizes a combined objective L₃:

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

where the weight λ balances the auxiliary language‑modeling term.
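Here is a sketch of the fine‑tuning step under the same toy setup as the pre‑training sketch above. The classification head, the value of λ, and the labeled batch are illustrative assumptions; `model` is the MiniGPT defined earlier:

```python
import torch
import torch.nn as nn

n_classes, lam = 3, 0.5                       # assumed task size and L3 weight λ
clf_head = nn.Linear(128, n_classes)          # Wy (d_model = 128 in the sketch above)

x = torch.randint(0, 1000, (8, 32))           # labeled token sequences (x1..xm)
y = torch.randint(0, n_classes, (8,))         # task labels

h = model.hidden(x)                           # final Transformer block activations
logits_y = clf_head(h[:, -1, :])              # hl^m (last token) -> label logits
loss_l2 = nn.functional.cross_entropy(logits_y, y)        # supervised loss L2

lm_logits = h @ model.tok_emb.weight.T        # reuse the LM head on the labeled text
loss_l1 = nn.functional.cross_entropy(
    lm_logits[:, :-1].reshape(-1, 1000), x[:, 1:].reshape(-1))

loss_l3 = loss_l2 + lam * loss_l1             # L3 = L2 + λ·L1
loss_l3.backward()
```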
Task‑Specific Input Transformations
Before fine‑tuning, structured inputs are converted into a single ordered token sequence that the pretrained model can process, bracketed by randomly initialized start and extract tokens. For textual entailment, the premise and hypothesis are concatenated with a delimiter token between them; for similarity, both orderings of the two texts are encoded and their sequence representations added element‑wise before the output layer; for question answering and multiple choice, the context and question are concatenated with each candidate answer, and a softmax normalizes over the resulting sequences. A sketch of these transformations follows.
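A minimal sketch of these traversal‑style transformations, assuming hypothetical reserved token ids for the start, delimiter, and extract tokens (the paper learns embeddings for such special tokens; the ids here are placeholders):

```python
# Assumed reserved vocabulary ids for the special tokens (placeholders).
START, DELIM, EXTRACT = 997, 998, 999

def entailment_input(premise, hypothesis):
    # Concatenate premise and hypothesis with a delimiter token in between.
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(text_a, text_b):
    # No natural ordering: encode both orderings, then add the two sequence
    # representations element-wise before the linear output layer.
    return (entailment_input(text_a, text_b),
            entailment_input(text_b, text_a))

def multiple_choice_inputs(context, question, answers):
    # One sequence per candidate answer; a softmax over the per-sequence
    # logits normalizes across the candidates.
    return [[START] + context + question + [DELIM] + a + [EXTRACT]
            for a in answers]
```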
Related Work
At the time of writing, the most successful unsupervised models were word‑embedding approaches, which could be broadly applied to improve many NLP tasks.
Network Intelligence Research Center (NIRC)
NIRC is based at the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains (intelligent cloud networking, natural language processing, computer vision, and machine learning systems) and is dedicated to solving real‑world problems, building top‑tier systems, publishing high‑impact papers, and contributing to the rapid advancement of China's network technology.
