AI Frontier Lectures
Oct 27, 2025 · Artificial Intelligence
How ARGRE Revolutionizes LLM Detoxification with Autoregressive Reward‑Guided Editing
The paper introduces ARGRE, a novel test‑time detoxification framework for large language models that visualizes toxicity trajectories in representation space and uses a lightweight autoregressive reward model to efficiently reduce harmful outputs while preserving generation quality.
ARGRE · LLM safety · NeurIPS 2025
