How Does ChatGPT Work? Inside RLHF and Model Consistency

This article explains the inner workings of ChatGPT, detailing its evolution from GPT‑3, the role of reinforcement learning from human feedback (RLHF) in improving consistency, the training pipeline steps, and the limitations and evaluation methods of large language models.

Open Source Linux

Large Language Model Capabilities and Consistency

Since the release of ChatGPT, many have wondered how it actually works. Although the internal implementation details are not fully disclosed, recent research reveals its basic principles.

ChatGPT is OpenAI's latest language model, improving significantly over GPT‑3. Like other large models, it can generate text in various styles and for different purposes, offering better accuracy, detail, and contextual coherence.

OpenAI fine‑tunes ChatGPT using a combination of supervised learning and reinforcement learning from human feedback (RLHF), which helps reduce unhelpful, distorted, or biased outputs.

How Training Strategies Cause Inconsistency

GPT‑3 and similar models are trained to predict the next token in a sequence, learning a probability distribution over the vocabulary. This objective can lead to inconsistency, often called misalignment: a model may achieve low training loss, and in that sense be highly capable, yet still produce outputs that do not match what people actually want.
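To make the objective concrete, here is a minimal sketch of the next-token loss in plain Python. The three-word vocabulary, the scores, and the function names are made up for illustration; a real model computes the same quantity over tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_step, target_ids):
    """Mean negative log-likelihood of the tokens that actually came next."""
    losses = []
    for logits, target in zip(logits_per_step, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical vocabulary and made-up model scores for two positions.
vocab = ["Paris", "London", "banana"]
logits_per_step = [[2.0, 1.0, -1.0], [0.5, 2.5, 0.0]]
target_ids = [0, 1]  # indices of the tokens observed in the training text

loss = next_token_loss(logits_per_step, target_ids)
print(f"mean next-token loss: {loss:.3f}")
```

The loss only measures how well the model reproduces the statistics of its training text. Nothing in it checks whether a continuation is truthful or helpful, which is why a low loss does not guarantee consistent behavior.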

Examples of inconsistency include giving unhelpful answers that ignore the user's instructions, fabricating facts (hallucination), lacking interpretability, and producing biased or harmful output.

Training Strategies

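Two core objectives are commonly used to pre‑train language models. GPT‑style models use the first; the second is typical of BERT‑style models.

Next‑token prediction: given a context, the model predicts a probability distribution over the next token.

Masked‑language modeling: some tokens are replaced with a mask token, and the model predicts the original tokens.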

These objectives teach the model the statistical patterns of language, but they do not directly reward deeper semantic understanding, which can lead to inconsistency on complex tasks.
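The difference between the two objectives is easiest to see in how training examples are built from the same sentence. The sketch below is illustrative only: the `[MASK]` symbol, the masking probability, and the helper names are assumptions, and real pipelines work on subword tokens rather than whole words.

```python
import random

MASK = "[MASK]"  # placeholder symbol; the exact token varies by model

def next_token_examples(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and remember the originals by position."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()

# Next-token prediction: the model only ever sees the left context.
for context, target in next_token_examples(tokens):
    print(context, "->", target)

# Masked-language modeling: the model sees both sides of each gap.
masked, targets = masked_example(tokens, mask_prob=0.3)
print(masked, targets)
```

In both cases the training signal is "recover the token that was in the text", so the model is rewarded for statistical plausibility rather than for being correct or useful.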

From Human Feedback to Reinforcement Learning

The RLHF pipeline consists of three steps:

Supervised fine‑tuning (SFT): a small, high‑quality dataset of prompts and desired outputs is used to train an initial model.

Training a Reward Model (RM): human annotators rank multiple SFT outputs for the same prompt; the rankings form a new dataset to train the RM.

Proximal Policy Optimization (PPO) fine‑tuning: the RM guides further training of the SFT model, optimizing for human‑preferred behavior.
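Steps 2 and 3 can be sketched in a few lines. The pairwise loss and the KL‑style penalty below follow the formulation published for InstructGPT; this article does not confirm that ChatGPT uses exactly these, so treat them as assumptions. The function names, the toy scores, and the `beta` value are hypothetical.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_outputs, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)); small when the RM agrees with the annotator."""
    diff = reward_preferred - reward_rejected
    return math.log1p(math.exp(-diff))

def penalized_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """RM score minus a penalty for drifting away from the SFT model (illustrative beta)."""
    return rm_score - beta * (logp_policy - logp_sft)

# One prompt, three SFT outputs ranked best-first by an annotator.
ranked = ["answer A", "answer B", "answer C"]
# Made-up scores standing in for a reward model's outputs.
toy_rewards = {"answer A": 1.2, "answer B": 0.3, "answer C": -0.8}

pairs = ranking_to_pairs(ranked)
rm_loss = sum(pairwise_loss(toy_rewards[w], toy_rewards[l]) for w, l in pairs) / len(pairs)
print(f"{len(pairs)} comparison pairs, mean RM loss {rm_loss:.3f}")

# During PPO, the policy would be updated to increase this shaped reward.
print(penalized_reward(rm_score=1.2, logp_policy=-4.0, logp_sft=-5.0))
```

Ranking several outputs at once is efficient for labeling: a ranking of K outputs yields K(K-1)/2 comparison pairs for the reward model from a single annotation.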

Step 1: Supervised Fine‑tuning

Data collection involves creating a list of prompts and having annotators write the expected response for each one, resulting in roughly 12‑15k high‑quality examples. The base model is a pretrained GPT‑3.5 variant (possibly text‑davinci‑003; the exact model is not disclosed).
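A demonstration record and the supervised loss might look like the sketch below. The record fields, the log‑probabilities, and the choice to score only the response tokens are assumptions for illustration; excluding the prompt from the loss is a common convention, not something this article states about OpenAI's setup.

```python
# Hypothetical demonstration record: a prompt plus the annotator-written response.
example = {
    "prompt": "Explain RLHF in one sentence.",
    "response": "RLHF fine-tunes a model using a reward signal learned from human rankings.",
}

def sft_loss(token_logprobs, prompt_len):
    """Mean negative log-likelihood over the response tokens only."""
    response_logprobs = token_logprobs[prompt_len:]
    if not response_logprobs:
        raise ValueError("example has no response tokens")
    return -sum(response_logprobs) / len(response_logprobs)

# Made-up log-probabilities the model assigns to each token of prompt + response.
token_logprobs = [-2.1, -0.7, -1.3, -0.2, -0.4, -0.9]

# The first three entries belong to the prompt and are excluded from the loss.
loss = sft_loss(token_logprobs, prompt_len=3)
print(f"SFT loss on response tokens: {loss:.3f}")
```

This is the same next-token objective used in pre‑training, applied to a much smaller set of curated demonstrations.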

Supervised fine‑tuning diagram

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
