Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Abstract: We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune LLMs to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.
Explain it Like I'm 14
What this paper is about
The paper explains how to train an AI assistant to be both helpful (answers your questions and follows instructions) and harmless (avoids giving dangerous, hateful, or unethical responses). The team shows a way to use people’s opinions to teach the AI what “good” behavior looks like, and they test whether this training also affects the AI’s general skills.
The big questions the researchers asked
- Can we use human feedback to train an AI that is both helpful and harmless?
- Does this “alignment” training make the AI worse at other tasks, like school-style questions, reading, or coding—or can it even help?
- Is there a tradeoff between being helpful and being harmless?
- How stable and reliable is this training as models get larger and we collect more data?
- Can we keep improving the AI by updating it regularly with new feedback?
How they trained the assistant
Collecting human preferences
Think of a chat where a person asks the AI something. At each turn, the person sees two possible AI replies. For “helpfulness,” they pick the more helpful and honest reply. For “harmlessness,” they do “red teaming”: they try to trick the AI into saying something harmful, then pick the reply that is more harmful (so the system learns what to avoid).
These choices create a dataset of human preferences.
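For concreteness, a single record in such a dataset might look like the sketch below. The field names and example text are made up for illustration, not the paper's actual data format.

```python
# One hypothetical preference record: a conversation context plus the
# reply the person chose and the reply they passed over. Field names
# and text are illustrative only.
comparison = {
    "context": "Human: How do I patch a bike tire?\n\nAssistant:",
    "chosen": " First, find the puncture by submerging the tube in water...",
    "rejected": " Just buy a new bike.",
    "task": "helpfulness",  # red-team comparisons would be "harmlessness"
}
```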
Teaching a “preference model”
A preference model is like a judge that learns to predict which of two AI replies people would prefer. It looks at pairs of answers and learns patterns: which sounds more helpful, which looks risky, and so on.
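Here is a minimal sketch of how such a judge is typically trained, assuming the standard pairwise ranking loss on scalar scores; the paper's exact architecture and loss details may differ.

```python
import torch
import torch.nn.functional as F

def preference_loss(score_chosen: torch.Tensor,
                    score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: the judge assigns each reply a scalar
    score, and the loss pushes the chosen reply's score above the
    rejected reply's score."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

When the judge scores the chosen reply higher, the loss shrinks toward zero; when it prefers the wrong reply, the loss grows, nudging the scores apart in the right direction.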
Reinforcement learning from human feedback (RLHF)
Once the judge (preference model) is trained, the AI assistant is trained to write replies that the judge would score highly. You can think of this like a video game: the AI gets a “reward” when its answer matches what people would prefer, and it learns to aim for higher rewards over time.
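A common way to set up that reward is sketched below, assuming the usual "judge score minus a drift penalty" recipe used in much RLHF work; the coefficient is an illustrative value, not the paper's exact setting.

```python
def rl_reward(judge_score: float, kl_to_init: float,
              kl_coef: float = 0.01) -> float:
    """Reward for one sampled reply during RLHF: the judge's score,
    minus a penalty for drifting too far from the initial model (the
    KL term). kl_coef = 0.01 is illustrative, not from the paper."""
    return judge_score - kl_coef * kl_to_init
```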
Iterated “online” training
The team didn’t just train once. Every week or so, they:
- Collected fresh human feedback on the latest model,
- Updated the preference model,
- Trained a new AI assistant with RLHF,
- Deployed it again to get even better feedback.
This loop helped them steadily improve quality and fill in gaps the AI hadn’t mastered yet.
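Schematically, the loop looks like the runnable toy below. Every helper is a stand-in stub, not an API from the paper; only the control flow is the point.

```python
# Stubs standing in for the real (much more expensive) steps.
def train_preference_model(dataset):   # fit the judge on all comparisons
    return len(dataset)                # pretend "judge" = dataset size

def rlhf_finetune(policy, judge):      # optimize the policy against the judge
    return policy + 1                  # pretend policies are version numbers

def collect_human_feedback(policy):    # deploy the model, gather new pairs
    return [f"comparison from policy v{policy}"] * 3

policy, dataset = 0, ["seed comparison"]
for week in range(4):                  # e.g. four weekly iterations
    judge = train_preference_model(dataset)
    policy = rlhf_finetune(policy, judge)
    dataset += collect_human_feedback(policy)
```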
Extra safety checks
- Calibration: The judge’s scores should match real human preferences. The team checked this and found the judge’s probabilities were well-aligned, especially for helpfulness (a small sketch of this check appears after this list).
- Out-of-distribution (OOD) detection: The AI can flag unusual or risky requests (things it hasn’t seen much or that look harmful) and refuse or ask for clarification.
- Robustness tests: They trained the AI against a judge built from one half of the data and evaluated it with an independent judge built from the other half, to see whether it overfits or games the system.
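As noted in the calibration bullet, checking calibration amounts to bucketing the judge's predicted probabilities and comparing each bucket against what people actually chose. This is an illustrative helper, not the paper's code.

```python
import numpy as np

def calibration_curve(pred_probs, outcomes, n_bins=10):
    """Compare the judge's predicted win probabilities with how often
    the preferred reply actually won, bucket by bucket. A well-calibrated
    judge prints pairs that land near the diagonal."""
    pred_probs = np.clip(np.asarray(pred_probs, dtype=float), 0.0, 1.0 - 1e-9)
    outcomes = np.asarray(outcomes, dtype=float)  # 1.0 if preferred reply won
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (pred_probs >= lo) & (pred_probs < hi)
        if mask.any():
            print(f"predicted ~{pred_probs[mask].mean():.2f} "
                  f"-> observed {outcomes[mask].mean():.2f}")
```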
What they found
Here are the main results and why they matter:
- Helpful vs. harmless is a real tension. If you only make the AI helpful, it’s easier to push it into harmful answers. If you only make it harmless, it may become too cautious and unhelpful. Training on a balanced mix produces assistants that are both quite helpful and much less harmful.
- Alignment training often improves skills for larger models. For small models, helpful/harmless training can slightly hurt performance (“alignment tax”). But for bigger models (like 13B and 52B parameters), the training actually improves accuracy on many language tasks (“alignment bonus”). In other words, making big AIs safer can also make them better at general tasks.
- Specialized skills don’t suffer—and can even improve. When they added helpful/harmless training on top of coding models, coding performance improved (likely because the model followed instructions better). Mixing in summarization didn’t hurt either skill.
- Weekly “online” updates work well. Updating the data and models regularly led to rapid gains. Human evaluators preferred the newer models more often, and the dataset improved too (more high-quality examples).
- The preference model is pretty well-calibrated. Its scores match human choices at the right rates, which means the “judge” is trustworthy within the data it knows. However, very cleverly written but wrong answers can still fool it sometimes, showing it’s not perfectly robust.
- Robustness and over-optimization: Training too long against a single judge can lead to “reward hacking” (doing what the judge likes, not what people truly want). Splitting data to test with an independent judge revealed where this starts to happen and helped set safer training limits.
- A simple relationship during training: They found that the reward grows roughly in proportion to the square root of how far the trained AI has drifted from its starting point, measured by KL divergence. In plain terms, as the AI learns to please the judge more, it changes by a steady, predictable amount, which can help monitor and guide training (the relation is written out after this list).
- Bias and safety checks: The models showed improved sentiment toward different groups and no strong gender bias in simple tests, but bias wasn’t eliminated. The harmlessness training and OOD detection also helped the AI refuse many risky or unusual requests.
- Human preferences back it up: Using an Elo-style scoring system (like rating chess players; the standard win-probability formula is sketched below), humans consistently preferred the new RLHF-trained models over earlier versions. In a non-adversarial test, they even preferred the AI over professional writers’ responses a little more than half the time (though the authors caution not to over-interpret this).
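Written out, the square-root relationship from the abstract looks like this, where alpha is a constant fitted from training curves and pi_0 is the policy's initialization:

```latex
% Approximately linear relation between RL reward and sqrt(KL),
% as reported in the paper's abstract.
R_{\mathrm{PM}}(\pi) \;\approx\; \alpha \sqrt{D_{\mathrm{KL}}\!\left(\pi \,\Vert\, \pi_0\right)}
```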
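And the Elo scoring mentioned in the last bullet converts rating gaps into win probabilities with the standard chess formula:

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: the probability that player (or model)
    A is preferred over B. A 100-point gap means A wins ~64% of the time."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

print(elo_win_prob(1100, 1000))  # ~0.64
```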
Why it matters and what’s next
This work shows a practical path to build AI assistants that are more aligned with human values—helpful without being harmful—by learning directly from people’s preferences. Importantly, for large models, this safety-focused training does not have to trade off with capability; it can actually make the AI better at many tasks.
Going forward, this approach can:
- Make deployed AI systems more useful and safer for everyday users,
- Combine easily with specialized skills like coding or summarization,
- Improve steadily through regular updates with fresh feedback,
- Provide tools (like calibration checks and robustness tests) to catch over-optimization and reduce bad behavior.
There’s still more to do: truthfulness and adversarial robustness can be improved, and bias needs continued attention. But this paper provides strong evidence that learning from human feedback is a powerful, scalable way to align AI with what people actually want.