Understanding RLHF: How Human Feedback Trains Modern LLMs
This article explains the RLHF (Reinforcement Learning from Human Feedback) pipeline that powers ChatGPT and other large language models, covering the limitations of traditional fine‑tuning, the creation of human‑feedback datasets, reward‑model training, loss design, and the final PPO‑based fine‑tuning step.
Introduction
ChatGPT’s 2022 debut highlighted a breakthrough beyond larger datasets or model architecture changes: Reinforcement Learning from Human Feedback (RLHF). The article focuses on how RLHF overcomes the bottleneck of manually labeled data that traditional fine‑tuning requires.
Pre‑training and Fine‑tuning
LLM development historically consists of two stages: pre‑training, where the model learns general language patterns by predicting tokens over massive unlabeled corpora (next‑token prediction in GPT‑style models), and fine‑tuning, where the model is adapted to downstream tasks (summarization, translation, QA, etc.) using manually annotated datasets. Fine‑tuning is limited by the labor‑intensive nature of data annotation: building millions of QA pairs for large‑scale models is costly and hard to scale.
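To make the fine‑tuning step concrete, here is a minimal sketch of supervised fine‑tuning on one annotated example. The model name ("gpt2") and the QA text are illustrative stand‑ins, not from the article.

```python
# Minimal supervised fine-tuning step on one manually annotated QA pair (illustrative).
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # stand-in for the base LLM
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One labeled example; real fine-tuning needs many such pairs, which is the bottleneck.
text = "Question: What is the capital of France?\nAnswer: Paris."
inputs = tokenizer(text, return_tensors="pt")

# Causal LM loss: the model is trained to predict each next token of the annotated text.
loss = model(**inputs, labels=inputs["input_ids"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```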
RLHF Overview
RLHF replaces exhaustive labeling with a binary preference task: given two model outputs for the same prompt, a human selects the better one. This creates a human‑feedback dataset without requiring numeric scores for each answer.
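A single preference record can then be as small as the sketch below; the field names and example strings are my own, chosen for illustration rather than taken from the article.

```python
# One human-feedback record: the labeler only marks which of two responses is better.
preference_example = {
    "prompt": "Explain RLHF in one sentence.",
    "chosen": "RLHF fine-tunes a language model using human preference comparisons.",
    "rejected": "RLHF is a database indexing technique.",
}
# No numeric score is required; the binary choice alone supervises the reward model.
```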
Generating Responses
Two sampling strategies are described: (1) greedy decoding, which deterministically selects the highest‑probability token at each step, and (2) stochastic sampling from the token probability distribution. The stochastic method yields diverse responses for the same prompt, and pairs of these responses are then labeled by humans for preference.
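A toy sketch of the two strategies over a single next‑token distribution (the logits are made up for illustration):

```python
# Greedy vs. stochastic selection of the next token from a probability distribution.
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])   # toy next-token logits
probs = torch.softmax(logits, dim=-1)

greedy_token = torch.argmax(probs).item()                        # always the same token
sampled_token = torch.multinomial(probs, num_samples=1).item()   # varies run to run

print(greedy_token, sampled_token)
```

Because sampling varies run to run, generating the same prompt twice produces two different responses, which is exactly what the human labeler needs to compare.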
Reward Model
Using the preference dataset, a reward model—architecturally identical to the base LLM except for a final scalar output—is trained to assign higher scores to preferred responses and lower scores to less‑preferred ones. Both the original prompt and the generated response are fed into the reward model.
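A minimal sketch of such a reward model, assuming a Hugging Face backbone ("gpt2" is an illustrative stand‑in) with the language‑modeling head swapped for a scalar value head:

```python
# Sketch of a reward model: a base LM backbone plus a scalar value head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)  # same architecture as the LLM
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)  # scalar output

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Score the final token's hidden state as the reward for the whole sequence.
        return self.value_head(hidden[:, -1, :]).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
rm = RewardModel()
# Prompt and response are concatenated and scored together.
batch = tokenizer("Prompt text\nModel response text", return_tensors="pt")
reward = rm(batch["input_ids"], batch["attention_mask"])   # one scalar per sequence
```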
Loss Function
The loss compares the reward scores of the preferred response (R₊) and the non‑preferred one (R₋); the described behavior corresponds to the standard pairwise objective L = −log σ(R₊ − R₋), where σ is the sigmoid. When R₊ − R₋ is negative, the loss grows roughly linearly with the size of the gap, forcing the model to adjust. When the difference is positive, the loss stays between 0 and ln 2 ≈ 0.69, indicating the model already ranks the pair correctly. This design lets the model learn meaningful scalar reward values from only binary preference labels.
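In code, this pairwise objective looks roughly like the following (the example reward values are arbitrary, chosen only to show the three regimes):

```python
# Pairwise preference loss: push the preferred reward above the rejected one.
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(R+ - R-)): near 0 when R+ >> R-, ln 2 (~0.69) when they are equal,
    # and growing roughly linearly when R- exceeds R+.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

print(preference_loss(torch.tensor([3.0]), torch.tensor([-1.0])))  # small loss
print(preference_loss(torch.tensor([0.0]), torch.tensor([0.0])))   # ~0.693
print(preference_loss(torch.tensor([-2.0]), torch.tensor([2.0])))  # large loss
```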
Training and Inference
After training, the reward model provides scalar feedback for new LLM outputs. These scores are used as a reinforcement signal to update the original LLM’s weights, typically via Proximal Policy Optimization (PPO). During inference, only the fine‑tuned LLM is used, while user prompts continue to be collected for ongoing improvement.
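A heavily simplified sketch of that reinforcement signal is shown below. It uses a REINFORCE‑style update with a KL‑style penalty against a frozen copy of the original model, which is the shaping commonly used in PPO‑based RLHF; full PPO additionally uses ratio clipping, a value function, and minibatching, and is implemented properly in libraries such as TRL. The model name, hyperparameters, and the placeholder reward value are all illustrative assumptions.

```python
# Simplified reinforcement step: reward-model score minus a KL-style penalty toward the
# frozen reference model, applied with a REINFORCE-style update (not full PPO).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")            # model being tuned
reference = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # frozen copy of the original
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
beta = 0.1                                                       # KL penalty weight (assumed)

prompt = tokenizer("Explain RLHF briefly:", return_tensors="pt")
response_ids = policy.generate(**prompt, do_sample=True, max_new_tokens=20,
                               pad_token_id=tokenizer.eos_token_id)

def sequence_logprob(model, ids):
    # Sum of log-probabilities the model assigns to each generated token.
    logits = model(ids).logits[:, :-1]
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, ids[:, 1:, None]).squeeze(-1).sum()

logp_policy = sequence_logprob(policy, response_ids)
with torch.no_grad():
    logp_ref = sequence_logprob(reference, response_ids)
    reward = torch.tensor(1.0)   # placeholder for reward_model(prompt, response)

# Shaped reward: reward-model score minus a penalty for drifting from the reference model.
shaped_reward = reward - beta * (logp_policy.detach() - logp_ref)
loss = -shaped_reward * logp_policy            # policy-gradient objective
loss.backward()
optimizer.step()
optimizer.zero_grad()
```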
Conclusion
The article concludes that RLHF efficiently scales LLM training by coupling a reward model with human preferences, dramatically reducing the need for extensive manual annotation. RLHF is now employed in popular models such as ChatGPT, Claude, Gemini, and Mistral.
