How Does ChatGPT Really Work? Inside the RLHF Training Process

This article explains how ChatGPT is trained. It covers the distinction between model capability and consistency (alignment), why next-token and masked-language-model training objectives leave models misaligned with user intent, how OpenAI combines supervised fine-tuning, reward-model training, and PPO reinforcement learning into RLHF to improve alignment, and where the method falls short.

21CTO

Capability vs. Consistency in Large Language Models

Model capability measures how well a model optimizes its objective function, while consistency (usually called alignment) describes whether the model's behavior matches what humans actually want from it. Large language models like GPT-3 are highly capable at predicting word sequences, yet their outputs are often misaligned with user intent. The misalignment shows up in four ways:

Unhelpful responses: failing to follow explicit user instructions.

Hallucinated content: stating false or nonexistent facts.

Lack of interpretability: humans cannot easily tell how the model arrived at a decision or prediction.

Bias and harmful content: reproducing biases and toxicity present in the training data.

How Training Strategies Cause Inconsistency

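Language models are typically pretrained with one of two core objectives: next-token prediction, where the model predicts the word that follows a sequence (the objective used by GPT-family models), and masked language modeling, where some tokens are replaced with a mask and the model predicts the originals. Both objectives teach the model the statistical patterns of language, but neither rewards output that is true or helpful: a fluent but false continuation can be just as probable as a correct one. This gap between the training objective and what users want is the source of the inconsistency, and it is most visible on complex tasks.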
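To make the two objectives concrete, here is a minimal sketch of how training examples could be built from a single sentence. The function names, the `[MASK]` string, the whitespace tokenizer, and the masking probability are all illustrative assumptions, not details of any particular model's pipeline.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumed for illustration)

def next_token_examples(tokens):
    """Next-token prediction: every prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """Masked language modeling: hide a random subset of tokens.

    Returns the corrupted sequence and a {position: original_token} dict
    holding the tokens the model would be asked to recover.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets

# Toy whitespace tokenization; real models use subword tokenizers.
tokens = "the roman empire fell in the fifth century".split()

pairs = next_token_examples(tokens)
corrupted, targets = masked_lm_example(tokens, mask_prob=0.3)

for context, target in pairs[:3]:
    print(context, "->", target)
print(corrupted, targets)
```

In both cases the training signal is only "recover the token that appeared in the text". Nothing in either target says whether the sentence is true or whether it answers what a user asked.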

Reinforcement Learning from Human Feedback (RLHF)

OpenAI improves alignment through three steps:

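Supervised Fine-Tuning (SFT): A small, high-quality dataset (≈12-15k examples) of prompts and desired outputs is collected. The prompts come from annotators and from OpenAI API logs, and annotators write the desired output for each one. A pretrained model is then fine-tuned on these demonstrations to produce the SFT model.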

Reward Model (RM) Training: Annotators rank several SFT outputs for each prompt from best to worst. The rankings form a dataset larger than the SFT set, which is used to train a model that assigns each output a scalar score reflecting human preference.

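Proximal Policy Optimization (PPO): The RM supplies the reward signal for further fine-tuning of the SFT model. PPO limits how far each update can move the policy (a trust-region-style constraint), and a KL penalty against the SFT model discourages the policy from drifting too far and over-optimizing the RM.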
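The second and third steps can be sketched with three small scalar functions. This is a simplified illustration, not OpenAI's implementation: the function names are hypothetical, the coefficient values `beta` and `clip_eps` are illustrative defaults, and the KL term is applied once per sequence here for brevity, whereas it is usually applied per token.

```python
import math

def pairwise_ranking_loss(score_preferred, score_rejected):
    """Reward-model loss for one ranked pair: -log(sigmoid(r_w - r_l)).

    Computed as a numerically stable softplus of -(r_w - r_l).
    The loss is small when the preferred output gets the higher score.
    """
    diff = score_preferred - score_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def kl_penalized_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """Reward used during RL: RM score minus a penalty for drifting from the SFT model.

    logp_policy - logp_sft is a single-sample estimate of the KL divergence
    between the current policy and the SFT model on this output.
    """
    return rm_score - beta * (logp_policy - logp_sft)

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one action (to be maximized).

    The probability ratio is clipped to [1 - eps, 1 + eps], so the policy
    gains nothing from moving far away from the policy that sampled the data.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)

# A ranking of K outputs is turned into K*(K-1)/2 preferred/rejected pairs.
print(pairwise_ranking_loss(2.0, 0.5))   # preferred scored higher: low loss
print(pairwise_ranking_loss(0.5, 2.0))   # preferred scored lower: high loss
print(kl_penalized_reward(1.0, logp_policy=-3.0, logp_sft=-5.0))
print(ppo_clipped_objective(logp_new=0.0, logp_old=-1.0, advantage=1.0))
```

The clipping and the KL penalty serve different purposes. Clipping keeps each individual update small, while the KL penalty anchors the policy to the SFT model over the whole course of training, which is what limits exploitation of flaws in the reward model.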

The resulting model is evaluated on prompts held out from training, with human raters scoring its outputs for helpfulness, truthfulness, and harmlessness. An “alignment tax” is also observed: improving alignment can reduce performance on some zero-shot NLP tasks.

Limitations of the RLHF Approach

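Key drawbacks include: the training data is biased by the preferences of the annotators who produced it; no controlled study compares RLHF against supervised fine-tuning alone, so it is unclear how much of the gain comes from reinforcement learning; human preferences are heterogeneous, which makes a single preference signal an approximation; the reward model's stability under variations of the input prompt is uncertain; and the policy may over-optimize, learning to game the reward model instead of satisfying the preference it stands for.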

References

Training language models to follow instructions with human feedback (https://arxiv.org/pdf/2203.02155.pdf)

Learning to summarize from human feedback (https://arxiv.org/pdf/2009.01325.pdf)

Proximal Policy Optimization Algorithms (https://arxiv.org/pdf/1707.06347.pdf)

Deep reinforcement learning from human preferences (https://arxiv.org/abs/1706.03741)

DeepMind's alternatives, Sparrow (https://arxiv.org/pdf/2209.14375.pdf) and GopherCite (https://arxiv.org/abs/2203.11147)
