How Does ChatGPT Work? Inside RLHF and Model Consistency
This article explains the inner workings of ChatGPT, detailing its evolution from GPT‑3, the role of reinforcement learning from human feedback (RLHF) in improving consistency, the training pipeline steps, and the limitations and evaluation methods of large language models.
Large Language Model Capabilities and Consistency
Since the release of ChatGPT, many have wondered how it actually works. Although the implementation details have not been fully disclosed, published research makes its basic principles reasonably clear.
At the time of writing, ChatGPT is OpenAI's latest language model and a significant improvement over GPT‑3. Like other large language models, it can generate text in many styles and for many purposes, but with better accuracy, detail, and contextual coherence than its predecessor.
OpenAI fine‑tunes ChatGPT using a combination of supervised learning and reinforcement learning from human feedback (RLHF), which helps reduce unhelpful, untruthful, or biased outputs.
How Training Strategies Cause Inconsistency
GPT‑3 and similar models are trained to predict the next token in a sequence, which amounts to learning a probability distribution over token sequences. This objective can lead to inconsistency, meaning a mismatch between what the model is optimized for and what people actually want from it: a model may achieve low loss (high capability) but still produce outputs that do not align with human expectations.
Examples of inconsistency include giving unhelpful answers that do not follow the user's instructions, fabricating facts, producing output whose reasoning is hard to interpret, and exhibiting harmful bias.
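To make the gap between low loss and human expectations concrete, here is a minimal Python sketch of the next‑token loss (average negative log‑likelihood). The toy distributions and the function name `next_token_loss` are illustrative assumptions, not anything taken from OpenAI's code. Note that the loss only measures how much probability the model assigns to the observed tokens; nothing in it rewards helpfulness or truthfulness.

```python
import math

def next_token_loss(predicted_dists, target_tokens):
    """Average negative log-likelihood of the observed next tokens.

    predicted_dists: one dict per position mapping token -> probability.
    target_tokens:   the token that actually came next at each position.
    """
    total = 0.0
    for dist, target in zip(predicted_dists, target_tokens):
        total += -math.log(dist[target])
    return total / len(target_tokens)

# Hypothetical per-position predictions from a toy model.
dists = [
    {"the": 0.5, "a": 0.5},
    {"cat": 0.25, "dog": 0.75},
]
targets = ["the", "cat"]

loss = next_token_loss(dists, targets)
print(round(loss, 4))  # (-ln 0.5 - ln 0.25) / 2, about 1.0397
```

A model can drive this number down by matching the statistics of its training text, whether or not that text is accurate or useful.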
Training Strategies
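Two core objectives are used to pre‑train language models. GPT‑style models such as GPT‑3 use the first; the second is typical of encoder models such as BERT: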
Next‑token prediction: given a context, the model predicts the most probable next word.
Masked‑language modeling: some tokens are replaced with a mask token, and the model predicts the original word.
These objectives enable the model to learn statistical language patterns but can cause the model to miss deeper semantic understanding, leading to inconsistency on complex tasks.
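The two objectives above differ mainly in how training examples are built from raw text. The following sketch shows one way to construct them in plain Python. The helper names, the `[MASK]` string, and the 15% default masking rate are assumptions chosen for illustration; real pipelines operate on subword token IDs rather than whole words.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumed for this sketch)

def next_token_examples(tokens):
    """Next-token prediction: every prefix is a context, the following token is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_prob=0.15, rng=None):
    """Masked language modeling: hide random tokens and remember the originals.

    Returns the masked sequence and a dict {position: original_token}.
    """
    rng = rng or random.Random()
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels[i] = tok  # the model must recover this token
        else:
            masked.append(tok)
    return masked, labels

sentence = ["the", "cat", "sat", "on", "the", "mat"]

for context, target in next_token_examples(sentence):
    print(context, "->", target)

masked, labels = masked_lm_example(sentence, rng=random.Random(0))
print(masked, labels)
```

In both cases the supervision comes from the text itself, which is why these objectives capture statistical patterns without any signal about what a human reader would consider a good answer.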
From Human Feedback to Reinforcement Learning
The RLHF pipeline consists of three steps:
Supervised fine‑tuning (SFT): a small, high‑quality dataset of prompts and desired outputs is used to train an initial model.
Training a Reward Model (RM): human annotators rank multiple SFT outputs for the same prompt; the rankings form a new dataset to train the RM.
Proximal Policy Optimization (PPO) fine‑tuning: the RM guides further training of the SFT model, optimizing for human‑preferred behavior.
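Steps 2 and 3 can be sketched with a few small functions. This is a simplified illustration under stated assumptions, not OpenAI's implementation: it assumes the annotator rankings are turned into pairwise comparisons, that the reward model is trained with a logistic pairwise loss, and that the policy update uses the standard PPO clipped surrogate with a hypothetical clipping value of 0.2. All function names are invented for this example.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Turn a best-first ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_outputs, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), written as log(1 + exp(-diff)).

    The loss is small when the reward model scores the preferred output higher.
    """
    diff = reward_preferred - reward_rejected
    return math.log1p(math.exp(-diff))

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for a single action.

    ratio:     new_policy_prob / old_policy_prob for the sampled output.
    advantage: how much better the output was than expected (derived from the RM score).
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# An annotator ranked three outputs for one prompt, best first.
pairs = ranking_to_pairs(["A", "B", "C"])
print(pairs)

# The reward model is penalized more when it scores the rejected output higher.
print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))

# Clipping limits how far one update can move the policy away from the old one.
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
```

The clipping in the last function is what keeps PPO updates conservative: once the probability ratio moves outside the allowed band, pushing it further yields no additional gain.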
Step 1: Supervised Fine‑tuning
To collect the data, a list of prompts is compiled and annotators write the expected response for each one, yielding roughly 12‑15k high‑quality examples. The base model is a pretrained model from the GPT‑3.5 series rather than the original GPT‑3 (for example, text‑davinci‑003).
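A demonstration dataset of this kind can be pictured as a list of prompt and response pairs. The sketch below shows one plausible way to turn a pair into a training sequence. The `Demonstration` class, the whitespace tokenizer, and the choice to compute loss only on response tokens are assumptions made for illustration; the source does not describe how OpenAI formats its SFT data.

```python
from dataclasses import dataclass

@dataclass
class Demonstration:
    prompt: str    # instruction taken from the prompt list
    response: str  # answer written by a human annotator

def build_sft_example(demo, tokenize=str.split):
    """Concatenate prompt and response and mark which tokens contribute to the loss.

    Returns (tokens, loss_mask) where loss_mask is 0 for prompt tokens
    and 1 for response tokens (a common convention, assumed here).
    """
    prompt_tokens = tokenize(demo.prompt)
    response_tokens = tokenize(demo.response)
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

demo = Demonstration(
    prompt="Explain RLHF briefly",
    response="It fine-tunes a model with human feedback",
)
tokens, loss_mask = build_sft_example(demo)
print(tokens)
print(loss_mask)
```

Training on such sequences uses the same next‑token objective as pre‑training; what changes is the data, which now consists of responses that humans wrote as examples of desired behavior.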