Dec 6, 2025 · Artificial Intelligence

Why Adding Toxic Data Can Make Language Models Safer and More Capable

A recent study finds that deliberately including a moderate amount of toxic content in the pre-training data of a large language model sharpens the model's internal representation of toxicity. Post-training interventions can then detoxify the model more effectively while preserving, or even improving, its general capabilities.

LLM · Toxic Data · Detoxification
