
Why ChatGPT Still Gets It Wrong: Inside RLHF and Model Consistency

ChatGPT, OpenAI’s latest language model, builds on GPT‑3 and adds supervised fine‑tuning and Reinforcement Learning from Human Feedback (RLHF) to improve alignment. Its training methods nevertheless leave consistency problems: unhelpful responses, hallucinations, bias, and limited explainability.

dbaplus Community

Feb 18, 2023


1. Ability vs Consistency

In machine learning, ability measures how well a model optimises its training objective, while consistency (usually called alignment in the literature) measures how closely the model’s behaviour matches the goal its designers actually care about. A classic example of inconsistency: a bird classifier is trained to minimise log‑loss, but the real goal is high classification accuracy. A model can reach a low log‑loss (high ability) and still misclassify sparrows as robins, which exposes the gap between the training objective and the intended outcome. Large language models (LLMs) such as GPT‑3 are trained to predict the next token over massive amounts of internet text; this objective does not guarantee outputs that match human expectations, which leads to reliability problems such as:

Unhelpful responses: the model fails to follow explicit user instructions.

Hallucinating content: the model fabricates facts that are false or nonexistent.

Lack of explainability: users cannot understand why a particular decision was made.

Harmful bias: biased training data can produce discriminatory or unsafe outputs.
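The gap between ability and consistency described above can be made concrete with a small sketch. The labels and predicted probabilities below are made-up illustrative values, not measurements from any real classifier; the only point is that the model with the lower log-loss is not necessarily the one with the higher accuracy.

```python
import math

def log_loss(labels, probs):
    """Mean binary cross-entropy; probs[i] is the predicted probability of class 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(labels)

def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == (y == 1) for y, p in zip(labels, probs))
    return hits / len(labels)

# Hypothetical data: 1 = "sparrow", 0 = "robin".
labels = [1, 1, 1, 0, 0, 0]
# Model A is very confident on five examples and slightly wrong on one.
model_a = [0.99, 0.99, 0.45, 0.01, 0.01, 0.01]
# Model B is barely on the right side of the threshold for every example.
model_b = [0.55, 0.55, 0.55, 0.45, 0.45, 0.45]

for name, probs in (("A", model_a), ("B", model_b)):
    print(f"model {name}: log-loss={log_loss(labels, probs):.3f}, "
          f"accuracy={accuracy(labels, probs):.3f}")
```

In this toy setup model A wins on the training objective (lower log-loss) while model B wins on the goal that matters (higher accuracy).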

2. Training objectives that produce inconsistency

Two core techniques are used to pre‑train language models:

Next‑token prediction: given a token sequence, the model predicts the token that follows. Example: for the prompt "The cat sat on the", the model may predict "mat", "chair" or "floor", each with a probability learned from the training text.

Masked‑language modeling (MLM): some tokens are replaced with a [MASK] token and the model predicts the missing word. Example: for "The [MASK] sat on the" the model may predict "cat" or "dog".
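As a rough illustration of next-token prediction, the sketch below swaps the neural network for a simple bigram count model: it estimates the probability of the next token from how often each token followed the previous one in a tiny corpus. The corpus and the function names `train_bigram` and `next_token_distribution` are assumptions made for this example. A real LLM conditions on the whole preceding sequence, not only on the last token.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_distribution(counts, prev):
    """Relative frequencies of the tokens seen after `prev` (empty if unseen)."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return {tok: n / total for tok, n in following.items()}

# Hypothetical toy corpus.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the chair",
    "the dog sat on the mat",
]
model = train_bigram(corpus)
print(next_token_distribution(model, "the"))
print(next_token_distribution(model, "sat"))
```

Because the bigram model only sees the last token, "the" is followed by "cat" as readily as by "mat"; the prediction reflects counts in the corpus, not an understanding of the sentence.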

These objectives teach the model the statistical structure of language but do not distinguish between critical and trivial errors. For instance, given "The Roman Empire [MASK] with the reign of Augustus." the model might fill in "began" or "ended". Both words are statistically plausible, yet they give the sentence opposite meanings, and the loss penalises a wrong guess the same way regardless of how much it changes the meaning. Consequently, models trained solely on these objectives often struggle with tasks that require deeper understanding. Alignment techniques target this gap, although they can reduce performance on certain benchmarks, a cost known as the “alignment tax”.
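The claim that these objectives do not distinguish between critical and trivial errors can be seen in the per-token cross-entropy loss itself. In the sketch below, the two predicted distributions are invented for illustration: one puts its misplaced probability on a near-synonym ("started"), the other on an antonym ("ended"). The loss only looks at the probability given to the target token, so both mistakes cost exactly the same.

```python
import math

def token_cross_entropy(predicted, target):
    """Loss for one position: negative log-probability assigned to the target token."""
    return -math.log(predicted[target])

target = "began"

# Hypothetical model outputs for "The Roman Empire [MASK] with the reign of Augustus."
trivial_miss = {"began": 0.40, "started": 0.55, "ended": 0.05}   # favours a near-synonym
critical_miss = {"began": 0.40, "started": 0.05, "ended": 0.55}  # favours the opposite meaning

loss_trivial = token_cross_entropy(trivial_miss, target)
loss_critical = token_cross_entropy(critical_miss, target)
print(loss_trivial, loss_critical)
```

Both losses are identical even though the second distribution would produce a factually reversed sentence far more often.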

3. Reinforcement Learning from Human Feedback (RLHF)

RLHF mitigates inconsistency through three main steps:

Supervised fine‑tuning (SFT): a pre‑trained LLM is fine‑tuned on a small, high‑quality dataset (≈12‑15 k prompt‑response pairs) to learn a basic instruction‑following policy.

Reward model (RM) training: human annotators rank multiple SFT outputs for the same prompt, creating a comparison dataset roughly ten times larger than the SFT set. A reward model is trained to predict a scalar preference score for any candidate response.

Proximal Policy Optimization (PPO): the RM provides a reward signal in a bandit‑style reinforcement‑learning loop. PPO updates the policy while constraining changes with a KL‑penalty to keep the fine‑tuned model close to the original SFT behaviour and to avoid over‑optimising the RM.
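The reward-model step can be sketched as follows. A ranking of K responses is expanded into K·(K−1)/2 preferred/rejected pairs, and the reward model is trained so that the preferred response in each pair scores higher, using the pairwise loss −log σ(r_preferred − r_rejected) described in the InstructGPT paper listed in the references. The scores below are plain numbers standing in for a neural reward model, and the helper names are assumptions made for this example.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked):
    """ranked is ordered best to worst; returns (preferred, rejected) pairs."""
    return list(combinations(ranked, 2))

def pairwise_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected)), computed stably."""
    diff = score_preferred - score_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def reward_model_loss(scores, ranked):
    """Mean pairwise loss over all pairs implied by one human ranking."""
    pairs = ranking_to_pairs(ranked)
    return sum(pairwise_loss(scores[a], scores[b]) for a, b in pairs) / len(pairs)

# Hypothetical ranking of four responses to one prompt, best first.
ranked = ["A", "B", "C", "D"]
agrees = {"A": 2.0, "B": 1.0, "C": 0.0, "D": -1.0}     # scores consistent with the ranking
disagrees = {"A": -1.0, "B": 0.0, "C": 1.0, "D": 2.0}  # scores that reverse the ranking

print(len(ranking_to_pairs(ranked)))
print(reward_model_loss(agrees, ranked), reward_model_loss(disagrees, ranked))
```

Scores that agree with the human ranking give a lower loss than scores that reverse it, which is the signal the reward model is trained on.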

PPO improves the policy iteratively: each update is clipped so that the new policy stays close to the previous one, and the KL penalty limits drift away from the SFT model.
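Two ingredients of the PPO step can be written down compactly. The first function subtracts a KL-style penalty from the reward-model score, in the form r − β·(log π_policy(y|x) − log π_SFT(y|x)); the second is the clipped surrogate objective from the PPO paper listed in the references. This is a minimal per-sample sketch under simplifying assumptions: log-probabilities are passed in as plain numbers for a whole response, and the values of `beta` and `epsilon` are arbitrary illustrative choices, not settings taken from any published training run.

```python
import math

def kl_penalised_reward(rm_score, logp_policy, logp_sft, beta):
    """Reward-model score minus a penalty for moving away from the SFT model."""
    return rm_score - beta * (logp_policy - logp_sft)

def clipped_surrogate(logp_new, logp_old, advantage, epsilon=0.2):
    """PPO clipped objective for one sample (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Same reward-model score, but the second response is much more likely under
# the current policy than under the SFT model, so it is penalised.
print(kl_penalised_reward(1.0, logp_policy=-5.0, logp_sft=-5.0, beta=0.1))
print(kl_penalised_reward(1.0, logp_policy=-2.0, logp_sft=-5.0, beta=0.1))

# With a positive advantage, the gain from raising the probability ratio
# above 1 + epsilon is capped.
print(clipped_surrogate(logp_new=math.log(1.5), logp_old=0.0, advantage=1.0))
```

The penalty discourages responses that exploit the reward model by straying far from SFT behaviour, while clipping removes the incentive to take very large policy steps in a single update.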

4. Limitations of the RLHF approach

Human annotator preferences are subjective and may not represent the full user base.

Lack of rigorous control experiments comparing SFT‑only versus RLHF‑enhanced models makes it hard to isolate the true benefit of RLHF.

Comparison data lack a ground truth: annotators often disagree on how to rank the same outputs, which adds variance to the rankings.

Assuming a single static preference distribution ignores heterogeneous human values.

Reward‑model stability to prompt paraphrases is untested; small syntactic changes may cause large score fluctuations.

Over‑optimisation can occur when the policy learns to exploit the RM; KL‑penalty safeguards are required.
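One way to probe the paraphrase-stability concern is to score the same response under several rewordings of a prompt and look at the spread of the scores. The sketch below is only an assumption about how such a check could look: `score_spread` is a hypothetical helper, and `toy_overlap_reward` is a deliberately brittle stand-in for a reward model (it rewards word overlap with the prompt), not a real one.

```python
def score_spread(score_fn, paraphrases, response):
    """Max minus min score of one response across paraphrased prompts."""
    scores = [score_fn(prompt, response) for prompt in paraphrases]
    return max(scores) - min(scores)

def toy_overlap_reward(prompt, response):
    """Brittle toy scorer: share of response words that also appear in the prompt."""
    prompt_words = set(prompt.lower().split())
    response_words = response.lower().split()
    if not response_words:
        return 0.0
    return sum(word in prompt_words for word in response_words) / len(response_words)

# Hypothetical paraphrases of the same request.
paraphrases = [
    "Explain how photosynthesis works",
    "How do plants make food?",
]
response = "Photosynthesis converts light into chemical energy"

print(score_spread(toy_overlap_reward, paraphrases, response))
```

A stable reward model would give a spread close to zero for prompts that mean the same thing; a large spread signals sensitivity to surface wording.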

5. Evaluation methodology

Model performance is assessed on a test set of prompts from held‑out OpenAI customers, whose data do not appear in the training set. Human raters score the outputs on three criteria:

Helpfulness: ability to follow and infer user instructions.

Truthfulness: tendency to avoid fabricating facts on closed‑domain tasks.

Harmlessness: avoidance of disallowed or discriminatory content.

Zero‑shot evaluations on standard NLP benchmarks (question answering, reading comprehension, summarisation) reveal a modest “alignment tax”: RLHF improves alignment but can slightly reduce raw task performance compared with the base GPT‑3 model.
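The alignment tax can be expressed as a per-benchmark difference between the base model's score and the aligned model's score. The helper `alignment_tax` and every number below are placeholders invented for illustration; they are not results from GPT‑3, InstructGPT, or any published evaluation.

```python
def alignment_tax(base_scores, aligned_scores):
    """Per-benchmark drop from base to aligned model; positive means the aligned model scores lower."""
    shared = base_scores.keys() & aligned_scores.keys()
    return {name: base_scores[name] - aligned_scores[name] for name in sorted(shared)}

# Placeholder scores in [0, 1], for illustration only.
base = {"question_answering": 0.70, "reading_comprehension": 0.60, "summarisation": 0.50}
aligned = {"question_answering": 0.68, "reading_comprehension": 0.61, "summarisation": 0.47}

for benchmark, tax in alignment_tax(base, aligned).items():
    print(f"{benchmark}: {tax:+.2f}")
```

Reporting the difference per benchmark, instead of a single average, shows where alignment costs raw performance and where it does not.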

References

Training language models to follow instructions with human feedback – https://arxiv.org/pdf/2203.02155.pdf

Learning to summarize from Human Feedback – https://arxiv.org/pdf/2009.01325.pdf

Proximal Policy Optimization Algorithms (PPO) – https://arxiv.org/pdf/1707.06347.pdf

Deep reinforcement learning from human preferences – https://arxiv.org/abs/1706.03741

Sparrow: A Dialogue Agent that Follows Human Preferences – https://arxiv.org/pdf/2209.14375.pdf

GopherCite: Large Language Models Cite Their Sources – https://arxiv.org/abs/2203.11147
