How Does ChatGPT Work? Inside RLHF and Model Consistency

This article explains the inner workings of ChatGPT, detailing its evolution from GPT‑3, the role of reinforcement learning from human feedback (RLHF) in improving consistency, the training pipeline steps, and the limitations and evaluation methods of large language models.

Open Source Linux

Feb 13, 2023

Large Language Model Capabilities and Consistency

Since the release of ChatGPT, many have wondered how it actually works. Although the internal implementation details are not fully disclosed, recent research reveals its basic principles.

ChatGPT is OpenAI's latest language model and a significant improvement over GPT‑3. Like other large language models, it can generate text in a range of styles and for different purposes, but with better accuracy, detail, and contextual coherence.

OpenAI fine‑tuned ChatGPT using a combination of supervised learning and reinforcement learning from human feedback (RLHF), which helps reduce unhelpful, untruthful, or biased outputs.

How Training Strategies Cause Inconsistency

GPT‑3 and similar models are trained to predict the next token in a sequence, which means learning a probability distribution over tokens. This objective can lead to inconsistency (often called misalignment): a model may achieve low loss on its training objective (high capability) and still produce outputs that do not match what humans actually want.
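To make the gap between low loss and human expectations concrete, here is a minimal sketch of the next‑token loss (average negative log‑likelihood). The probabilities are invented for illustration and do not come from any real model; `next_token_loss` is a hypothetical helper name.

```python
import math

def next_token_loss(predicted_dists, target_tokens):
    """Average negative log-likelihood of the true next tokens."""
    total = 0.0
    for dist, target in zip(predicted_dists, target_tokens):
        # A tiny floor avoids log(0) for tokens the model gave no probability.
        total += -math.log(max(dist.get(target, 0.0), 1e-12))
    return total / len(target_tokens)

# Invented distribution for the blank in: "The Roman Empire ___ with the reign of Augustus"
dists = [{"began": 0.5, "ended": 0.4, "banana": 0.1}]

loss_began = next_token_loss(dists, ["began"])
loss_ended = next_token_loss(dists, ["ended"])
loss_banana = next_token_loss(dists, ["banana"])

print(f"began:  {loss_began:.3f}")
print(f"ended:  {loss_ended:.3f}")
print(f"banana: {loss_banana:.3f}")
```

Both "began" and "ended" score a similar, low loss because both are statistically plausible continuations, even though they mean opposite things. The objective rewards plausibility, not correctness.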

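Examples of inconsistency include giving unhelpful answers that ignore the user's explicit instructions, fabricating facts (hallucination), being hard to interpret (it is unclear how the model arrived at a given output), and producing biased or harmful content.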

Training Strategies

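Language models are generally pre‑trained with one of two core objectives (GPT models use the first):

Next‑token prediction: given a context, the model predicts a probability distribution over the next token and is trained to assign high probability to the token that actually follows.

Masked‑language modeling: some tokens are replaced with a mask token, and the model predicts the original token at each masked position.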

These objectives enable the model to learn the statistical patterns of language, but they do not by themselves give it a deeper semantic understanding, which leads to inconsistency on complex tasks.
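The difference between the two objectives is easiest to see in how training examples are built from the same sentence. The sketch below is illustrative only: it uses whitespace tokenization instead of a real subword tokenizer, the function names are hypothetical, and the default 15% masking rate is an assumption borrowed from common BERT‑style setups.

```python
import random

MASK = "[MASK]"

def next_token_examples(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with MASK.

    Returns the masked sequence and a {position: original_token} dict
    holding the targets the model must recover.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()

pairs = next_token_examples(tokens)
for context, target in pairs:
    print(context, "->", target)

masked, targets = masked_lm_example(tokens, mask_prob=0.3, seed=1)
print(masked, targets)
```

In both cases the training signal is just "recover the original token", which is why a model can score well on these objectives without its outputs matching what a user actually wants.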

From Human Feedback to Reinforcement Learning

The RLHF pipeline consists of three steps:

Supervised fine‑tuning (SFT): a small, high‑quality dataset of prompts and desired outputs is used to train an initial model.

Reward model (RM) training: human annotators rank several SFT model outputs for the same prompt; the rankings form a new dataset on which the RM is trained.

Proximal Policy Optimization (PPO) fine‑tuning: the RM guides further training of the SFT model, optimizing for human‑preferred behavior.
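Steps 2 and 3 can be sketched numerically. The code below assumes the formulation described in OpenAI's published InstructGPT work: a ranking is expanded into pairwise comparisons, the reward model is trained with a pairwise logistic loss, and the reward passed to PPO is the RM score minus a penalty for drifting away from the SFT model. The function names and the `beta` value are hypothetical, and the actual ChatGPT implementation may differ.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Expand a best-first ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_outputs, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), written as log(1 + exp(-diff))."""
    diff = reward_preferred - reward_rejected
    return math.log1p(math.exp(-diff))

def shaped_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """RM score minus a penalty on the log-probability ratio policy / SFT."""
    return rm_score - beta * (logp_policy - logp_sft)

# An annotator ranked three outputs for one prompt, best first.
pairs = ranking_to_pairs(["A", "B", "C"])
print(pairs)

# The loss is small when the RM scores the preferred output higher.
print(pairwise_loss(2.0, 0.0))
print(pairwise_loss(0.0, 2.0))

# The policy is penalized when it makes an output much more likely than the SFT model did.
print(shaped_reward(rm_score=1.0, logp_policy=-2.0, logp_sft=-5.0))
```

Ranking K outputs yields K(K-1)/2 comparisons from a single annotation pass, which is one reason rankings are a cheaper way to collect preference data than writing full demonstrations.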

Step 1: Supervised Fine‑tuning

Data collection involves assembling a list of prompts and having annotators write the expected response for each one, which yields roughly 12‑15k high‑quality examples. The base model is a variant from the GPT‑3.5 series (e.g., text‑davinci‑003).
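A demonstration pair is typically turned into a single training sequence, with the loss computed only on the response. The sketch below shows that convention; whether ChatGPT's SFT stage masks prompt tokens in exactly this way is an assumption, and `build_sft_example` and the `<eos>` marker are hypothetical names.

```python
def build_sft_example(prompt_tokens, response_tokens, eos="<eos>"):
    """Concatenate prompt and response into one sequence.

    loss_mask is 1 where the model is trained to predict the token
    (the response and the end marker) and 0 on the prompt.
    """
    tokens = prompt_tokens + response_tokens + [eos]
    loss_mask = [0] * len(prompt_tokens) + [1] * (len(response_tokens) + 1)
    return tokens, loss_mask

prompt = "Explain RLHF in one sentence .".split()
response = "It fine-tunes a model using human preference data .".split()

tokens, loss_mask = build_sft_example(prompt, response)
for tok, m in zip(tokens, loss_mask):
    print(m, tok)
```

The training objective is still next‑token prediction; only the data changes, from raw web text to curated prompt and response pairs.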


