How Does ChatGPT Really Work? Inside the RLHF Training Process
This article explains how ChatGPT is trained: the distinction between model capability and consistency, how next‑token and masked‑language‑model objectives lead to inconsistencies, and how OpenAI combines supervised fine‑tuning, reward‑model training, and PPO reinforcement learning (together, RLHF) to improve alignment. It also highlights the method’s limitations.
Capability vs. Consistency in Large Language Models
Model capability measures how well a model optimizes its objective function, while consistency (often called alignment) evaluates whether the model’s behavior matches what humans actually want. Large language models like GPT‑3 excel at predicting word sequences but often produce outputs that are misaligned with user intent. Typical symptoms include:
Unhelpful assistance: not following the user’s explicit instructions.
Hallucinated content: stating false or nonexistent facts as if they were true.
Lack of interpretability: it is hard for humans to understand how the model reached a given output.
Bias and harmful content: reproducing biased or toxic patterns present in the training data.
How Training Strategies Cause Inconsistency
Language models are typically pretrained with one of two core techniques: next‑token prediction, where the model predicts the following word in a sequence (the objective used by GPT‑style models), and masked‑language‑modeling, where some tokens are replaced with a mask and the model predicts them from the surrounding context. These objectives let the model learn the statistical patterns of language, but they do not guarantee that it grasps deeper meaning, which causes inconsistencies on complex tasks.
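To make the two objectives concrete, here is a minimal sketch of how training examples could be derived from one sentence. It is an illustration only: whitespace tokenization, the `[MASK]` string, the masking probability, and the function names are assumptions made for this example, not details from the source.

```python
import random

MASK = "[MASK]"  # placeholder token; the exact string is an assumption


def next_token_examples(tokens):
    """Return (context, target) pairs: each prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """Mask a random subset of tokens.

    Returns the masked sequence and a {position: original_token} dict
    holding the tokens the model would be asked to recover.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "the roman empire fell in 476".split()
pairs = next_token_examples(tokens)
masked, targets = masked_lm_example(tokens, mask_prob=0.3, seed=1)

for context, target in pairs:
    print(" ".join(context), "->", target)
print(" ".join(masked), targets)
```

In both cases the training signal is only "recover the original token". Nothing in the objective checks whether a continuation is true or follows the user's instruction, which is the gap the rest of the article addresses.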
Reinforcement Learning from Human Feedback (RLHF)
OpenAI improves alignment through three steps:
Supervised Fine‑Tuning (SFT): A small, high‑quality dataset (≈12‑15k examples) of prompts and desired outputs is collected, with prompts drawn from annotators and OpenAI API logs, and the pretrained model is fine‑tuned on it.
Reward Model (RM) Training: Annotators rank multiple SFT outputs for each prompt; the rankings form a larger dataset used to train a model that scores outputs according to human preference.
Proximal Policy Optimization (PPO): The RM guides further fine‑tuning of the SFT model. PPO updates the policy while limiting how far each update can move it (a trust‑region‑style constraint), and a KL penalty against the SFT model discourages over‑optimizing the RM.
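The reward-model and PPO steps above can be sketched with a few scalar formulas. This is a simplified illustration, not OpenAI's implementation: it operates on plain floats instead of model outputs, and the function names and the `beta` and `clip_eps` values are assumptions chosen for the example.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(scores_best_to_worst):
    """Reward-model loss for one human ranking.

    Takes RM scores ordered from most to least preferred and averages
    -log(sigmoid(r_preferred - r_rejected)) over every pair.
    """
    pairs = list(combinations(scores_best_to_worst, 2))  # (higher-ranked, lower-ranked)
    # -log(sigmoid(x)) equals log(1 + exp(-x))
    losses = [math.log1p(math.exp(-(r_w - r_l))) for r_w, r_l in pairs]
    return sum(losses) / len(losses)


def kl_penalized_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """RM score minus beta times the sampled log-ratio between policy and SFT model."""
    return rm_score - beta * (logp_policy - logp_sft)


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)


# Hypothetical numbers: an RM that agrees with the human ranking gets a lower loss.
print(pairwise_ranking_loss([2.0, 1.0, 0.0]))
print(pairwise_ranking_loss([0.0, 1.0, 2.0]))
# Drifting away from the SFT model reduces the reward the policy receives.
print(kl_penalized_reward(rm_score=1.0, logp_policy=-0.5, logp_sft=-1.0))
# A large probability ratio is clipped, which limits the size of the update.
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))
```

The clipping term is what keeps each policy update small, while the KL term ties the policy to the SFT model so that it cannot score well simply by exploiting weaknesses in the reward model.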
The model is evaluated on held‑out prompts that were not seen during training; human raters score its outputs for helpfulness, truthfulness, and harmlessness. An “alignment tax” is observed: improving consistency can reduce performance on some zero‑shot NLP tasks.
Limitations of the RLHF Approach
Key drawbacks include data bias introduced by annotator preferences, a lack of controlled studies comparing RLHF to pure supervised fine‑tuning, the fact that human preferences are heterogeneous rather than a single ground truth, uncertainty about how stable the RM is under small variations of the prompt, and potential over‑optimization, where the model learns to game the reward model.
References
Training language models to follow instructions with human feedback (https://arxiv.org/pdf/2203.02155.pdf)
Learning to summarize from Human Feedback (https://arxiv.org/pdf/2009.01325.pdf)
PPO algorithm paper (https://arxiv.org/pdf/1707.06347.pdf)
Deep reinforcement learning from human preferences (https://arxiv.org/abs/1706.03741)
DeepMind Sparrow and GopherCite alternatives (https://arxiv.org/pdf/2209.14375.pdf, https://arxiv.org/abs/2203.11147)
21CTO
