Why ChatGPT Still Gets It Wrong: Inside RLHF and Model Consistency

ChatGPT builds on OpenAI’s GPT‑3 family of language models and adds supervised fine‑tuning and Reinforcement Learning from Human Feedback (RLHF) to improve alignment. Its training methods still leave consistency problems: unhelpful responses, hallucinations, bias, and limited explainability.

dbaplus Community

1. Ability vs Consistency

In machine learning, ability measures how well a model optimises its training objective, while consistency (usually called alignment in the research literature) measures how closely the model’s behaviour matches the goal its designers actually care about. A classic example of inconsistency is a bird classifier trained to minimise log‑loss when the real goal is high accuracy: it can reach a low log‑loss (high ability) and still misclassify many sparrows as robins, which exposes a gap between the training objective and the intended outcome. Large language models (LLMs) such as GPT‑3 are trained to predict the next token in massive amounts of internet text. This objective does not guarantee outputs that match human expectations, which leads to reliability problems such as the following:

Unhelpful responses: the model fails to follow explicit user instructions.

Hallucinating content: the model fabricates facts that are false or nonexistent.

Lack of explainability: users cannot understand why a particular decision was made.

Harmful bias: biased training data can produce discriminatory or unsafe outputs.
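
The gap between ability and consistency described at the start of this section can be illustrated with a minimal sketch. The labels and probabilities below are invented toy values, not data from any real classifier, and the two "models" are just hard‑coded lists of predicted probabilities. The point is only that a lower log‑loss does not have to mean a higher accuracy.

```python
import math

def log_loss(labels, probs):
    """Mean negative log-likelihood of the true binary labels."""
    total = sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(labels, probs))
    return -total / len(labels)

def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == (y == 1) for y, p in zip(labels, probs))
    return hits / len(labels)

# Hypothetical toy data: 1 = robin, 0 = sparrow.
labels = [1, 1, 1, 0, 0, 0]

# Model A is right on every example, but never confident.
probs_a = [0.55, 0.55, 0.55, 0.45, 0.45, 0.45]
# Model B is very confident on four examples and slightly wrong on two.
probs_b = [0.99, 0.99, 0.45, 0.01, 0.01, 0.55]

for name, probs in (("A", probs_a), ("B", probs_b)):
    print(name, round(log_loss(labels, probs), 3), round(accuracy(labels, probs), 3))
```

In this toy setup Model B wins on the training objective (lower log‑loss) while Model A wins on the goal that matters to the user (accuracy).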

Ability vs Consistency diagram

2. Training objectives that produce inconsistency

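Two core pre‑training objectives are used to train language models. GPT‑style models such as GPT‑3 use the first; the second is typical of BERT‑style models: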

Next‑token prediction: given a token sequence, the model predicts the next word. Example: for the prompt "The cat sat on the" the model may predict "mat", "chair" or "floor" based on statistical likelihood.

Masked‑language modeling (MLM): some tokens are replaced with a [MASK] token and the model predicts the missing word. Example: for "The [MASK] sat on the" the model may predict "cat" or "dog".

These objectives teach the model the statistical structure of language but do not distinguish between critical and trivial errors. For instance, with the prompt "The Roman Empire [MASK] with the reign of Augustus." the model might fill in "began" or "ended": both are statistically plausible, yet they give the sentence very different meanings. More generally, a model trained only to predict the next (or a masked) token does not necessarily learn higher‑level representations of meaning, so it often struggles with tasks that require deeper understanding.
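
As a rough illustration of next‑token prediction, the sketch below builds a tiny count‑based model over a made‑up four‑sentence corpus. Real LLMs use neural networks over subword tokens rather than word counts; the corpus, the two‑word context window, and the function names `train_trigram` and `next_token_distribution` are all assumptions chosen for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the floor",
    "the cat sat on the chair",
    "the cat slept on the mat",
]

def train_trigram(sentences):
    """Count which word follows each two-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            counts[(a, b)][c] += 1
    return counts

def next_token_distribution(counts, prompt):
    """Relative frequency of each candidate next word for the prompt."""
    context = tuple(prompt.lower().split()[-2:])
    seen = counts.get(context, Counter())
    total = sum(seen.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in seen.most_common()}

counts = train_trigram(corpus)
dist = next_token_distribution(counts, "The cat sat on the")
print(dist)
```

The model ranks "mat", "floor" and "chair" purely by how often they followed "on the" in the corpus. Nothing in the objective tells it whether a continuation is true, helpful, or important, which is the gap the section describes.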

Model fine‑tuning pipeline

3. Reinforcement Learning from Human Feedback (RLHF)

RLHF mitigates inconsistency through three main steps:

Supervised fine‑tuning (SFT): a pre‑trained LLM is fine‑tuned on a small, high‑quality dataset (≈12‑15 k prompt‑response pairs) to learn a basic instruction‑following policy.

Reward model (RM) training: human annotators rank multiple SFT outputs for the same prompt, creating a comparison dataset roughly ten times larger than the SFT set. A reward model is trained to predict a scalar preference score for any candidate response.

Proximal Policy Optimization (PPO): the RM provides a reward signal in a bandit‑style reinforcement‑learning loop. PPO updates the policy while constraining changes with a KL‑penalty to keep the fine‑tuned model close to the original SFT behaviour and to avoid over‑optimising the RM.

The PPO step improves the policy iteratively, and the KL penalty keeps each update from drifting far from the SFT model.
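
The reward‑model step can be sketched as follows. The article does not spell out the loss, so the pairwise form used here, -log(sigmoid(r_preferred - r_rejected)), is taken from the first paper in the references. The scores in `scores` are invented stand‑ins for a reward model's outputs, and the helper names are hypothetical.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn one best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), written as log(1 + exp(-diff))."""
    diff = reward_preferred - reward_rejected
    return math.log1p(math.exp(-diff))

# Hypothetical annotator ranking of three responses to one prompt, best first.
ranking = ["A", "B", "C"]
# Hypothetical scalar scores that a reward model might assign to them.
scores = {"A": 2.0, "B": 0.5, "C": -1.0}

pairs = ranking_to_pairs(ranking)
mean_loss = sum(pairwise_loss(scores[w], scores[l]) for w, l in pairs) / len(pairs)
print(pairs, round(mean_loss, 3))
```

A ranking of K responses yields K(K-1)/2 comparison pairs, which is one reason the comparison dataset grows much faster than the number of prompts. The loss shrinks as the reward model scores the preferred response further above the rejected one.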

Reward model training
PPO fine‑tuning

4. Limitations of the RLHF approach

Human annotator preferences are subjective and may not represent the full user base.

Lack of a rigorous control study: reported gains use the SFT model as the baseline, but without an SFT‑only model trained on an equal amount of extra annotation effort it is hard to isolate the true benefit of RLHF.

Comparison data have no ground truth: annotators often disagree on how to rank outputs, which adds variance to the rankings.

Assuming a single static preference distribution ignores heterogeneous human values.

Reward‑model stability to prompt paraphrases is untested; small syntactic changes may cause large score fluctuations.

Over‑optimisation can occur when the policy learns to exploit the RM; KL‑penalty safeguards are required.
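
The KL‑penalty safeguard can be sketched in a few lines. This is a simplified illustration, not the exact training code of any system: it assumes the per‑token log‑probabilities of a sampled response are available from both the current policy and the frozen SFT model, and the function name, the toy numbers, and the coefficient `beta` are hypothetical choices.

```python
def kl_penalised_reward(rm_score, logprobs_policy, logprobs_sft, beta=0.1):
    """Reward-model score minus beta times the summed per-token log-ratio.

    The log-ratio sum(log pi_policy - log pi_sft) over the sampled tokens is a
    single-sample estimate of the KL divergence from the SFT model.
    """
    log_ratio = sum(lp - ls for lp, ls in zip(logprobs_policy, logprobs_sft))
    return rm_score - beta * log_ratio

# Hypothetical per-token log-probabilities for one three-token response.
logprobs_sft = [-1.0, -2.0, -1.5]
logprobs_same = [-1.0, -2.0, -1.5]     # policy has not moved away from SFT
logprobs_drifted = [-0.2, -0.3, -0.1]  # policy now strongly favours these tokens

rm_score = 3.0  # hypothetical reward-model output for this response
print(kl_penalised_reward(rm_score, logprobs_same, logprobs_sft))
print(kl_penalised_reward(rm_score, logprobs_drifted, logprobs_sft))
```

When the policy matches the SFT model the penalty is zero and the reward equals the RM score. The further the policy moves towards outputs the SFT model found unlikely, the more reward it gives up, which limits how far it can go in exploiting quirks of the RM.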

5. Evaluation methodology

Model performance is assessed on a held‑out test set of prompts from external OpenAI customers (unseen during training) using human‑rated metrics:

Helpfulness: ability to follow and infer user instructions.

Truthfulness: tendency to avoid fabricating facts on closed‑domain tasks.

Harmlessness: avoidance of disallowed or discriminatory content.

Zero‑shot evaluations on standard NLP benchmarks (question answering, reading comprehension, summarisation) reveal a modest “alignment tax”: RLHF improves alignment but can slightly reduce raw task performance compared with the base GPT‑3 model.

References

Training language models to follow instructions with human feedback – https://arxiv.org/pdf/2203.02155.pdf

Learning to summarize from Human Feedback – https://arxiv.org/pdf/2009.01325.pdf

Proximal Policy Optimization Algorithms (PPO) – https://arxiv.org/pdf/1707.06347.pdf

Deep reinforcement learning from human preferences – https://arxiv.org/abs/1706.03741

Sparrow: A Dialogue Agent that Follows Human Preferences – https://arxiv.org/pdf/2209.14375.pdf

GopherCite: Large Language Models Cite Their Sources – https://arxiv.org/abs/2203.11147
