How ARGRE Revolutionizes LLM Detoxification with Autoregressive Reward‑Guided Editing

The paper introduces ARGRE, a novel test-time detoxification framework for large language models that traces toxicity transition trajectories in representation space and uses a lightweight autoregressive reward model to reduce harmful outputs efficiently while preserving generation quality.


Background

Large language models (LLMs) can generate hateful, discriminatory, or threatening content. Existing detoxification approaches either require large amounts of annotated data and compute (training‑time preference optimization) or intervene only coarsely at test time (representation editing).

ARGRE Overview

Autoregressive Reward‑Guided Representation Editing (ARGRE) addresses these bottlenecks by explicitly modeling a continuous toxicity transition trajectory in the latent representation space and by training a token‑level reward model that guides inference‑time edits.

[Figure: ARGRE framework illustration]

Toxicity Trajectory Exploration

Assuming that semantic concepts are encoded linearly, for each prompt the final-token representations of a toxic continuation (h_tox) and a benign continuation (h_ben) are extracted. The difference vectors Δ = h_ben − h_tox, collected across prompts, are reduced with PCA, and the first principal component gives the dominant non-toxic direction d. Interpolating between h_tox and h_ben along d yields a fine-grained toxicity trajectory, which is then converted into a paired preference dataset.
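A minimal sketch of this step, assuming paired final-token hidden states have already been collected from the model; the array shapes, the sign convention for d, and the number of interpolation points are illustrative assumptions, not details from the paper:

```python
import numpy as np

# Stand-in data: final-token hidden states for N prompt pairs of
# dimension D (real values would come from the model).
rng = np.random.default_rng(0)
h_tox = rng.standard_normal((512, 4096))   # toxic continuations
h_ben = rng.standard_normal((512, 4096))   # benign continuations

# Difference vectors point from toxic toward benign representations.
deltas = h_ben - h_tox                                   # (N, D)

# PCA via SVD on the centered differences; the first right-singular
# vector is the dominant non-toxic direction d.
centered = deltas - deltas.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
d = vt[0]                                                # (D,), unit norm

# Orient d so that, on average, moving along it goes toxic -> benign.
if deltas.mean(axis=0) @ d < 0:
    d = -d

# Interpolate between each pair along d for a fine-grained toxicity
# trajectory (K points per pair; alpha 0 = toxic, 1 = benign).
K = 8
alphas = np.linspace(0.0, 1.0, K)
gaps = deltas @ d                                        # signed extent along d, (N,)
trajectory = (h_tox[:, None, :]
              + (alphas[None, :] * gaps[:, None])[:, :, None] * d)
# trajectory: (N, K, D); adjacent points along K form preference pairs,
# the more-benign point preferred over the more-toxic one.
```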

[Figure: PCA direction extraction]
[Figure: Preference dataset illustration]

Autoregressive Reward Model

A lightweight two‑layer MLP takes a token representation h_t and predicts a scalar reward r_t. The model is trained so that non‑toxic tokens receive higher rewards than toxic ones. The overall sequence reward is the sum of token rewards, enabling precise token‑level guidance.

[Equation: Reward model objective]
[Equation: Reward decomposition]
[Equation: Token-level reward]
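A sketch of what the reward head and a pairwise training objective could look like, assuming token representations from the interpolated trajectory serve as inputs; the hidden width, the ReLU activation, and the Bradley-Terry-style loss are assumptions beyond the paper's "two-layer MLP" and "sum of token rewards" descriptions:

```python
import torch
import torch.nn as nn

class TokenRewardModel(nn.Module):
    """Lightweight two-layer MLP mapping a token representation h_t
    to a scalar reward r_t (hidden width is an assumption)."""
    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)   # token rewards, shape (..., T)

def preference_loss(rm, h_preferred, h_rejected):
    """Pairwise objective: the sequence reward (sum of token rewards)
    of the less toxic sequence should exceed the more toxic one's."""
    r_pref = rm(h_preferred).sum(dim=-1)   # (B,)
    r_rej = rm(h_rejected).sum(dim=-1)     # (B,)
    return -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()

# Illustrative usage with random stand-in representations.
rm = TokenRewardModel(dim=4096)
h_pref = torch.randn(16, 32, 4096)   # batch of preferred token sequences
h_rej = torch.randn(16, 32, 4096)    # paired more-toxic sequences
loss = preference_loss(rm, h_pref, h_rej)
loss.backward()
```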

Adaptive Representation Editing (Inference)

For each generated token representation h_t:

1. Shift h_t along the learned non-toxic direction d to close the gap between its current reward and the average reward of non-toxic tokens.

2. Perform a few steps of lightweight gradient ascent on h_t to further increase the reward model's output.

This two-step strategy avoids costly iterative optimization while preventing the representation from settling into poor local optima (a minimal sketch follows below).
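Under the same assumptions, a minimal sketch of the two-step edit at inference time, reusing the TokenRewardModel from the previous sketch; the finite-difference scaling of the directional shift, the step size, and the number of ascent steps are illustrative choices, not the paper's exact procedure:

```python
import torch

@torch.enable_grad()  # autograd must be on even inside a no_grad generation loop
def edit_representation(h_t, rm, d, r_target, ascent_steps=3, lr=0.05):
    """Two-step edit of one token representation before decoding.

    h_t: (D,) hidden state; rm: token reward model (previous sketch);
    d: (D,) non-toxic direction; r_target: average non-toxic reward.
    """
    # Step 1: directional shift -- move along d far enough to close the
    # gap to the non-toxic average reward, using a finite-difference
    # estimate of the reward's slope along d (the scaling rule is an
    # assumption, not the paper's exact formula).
    with torch.no_grad():
        gap = r_target - rm(h_t)
        eps = 1e-2
        slope = (rm(h_t + eps * d) - rm(h_t)) / eps
        h = h_t + gap / (slope + 1e-6) * d

    # Step 2: a few lightweight gradient-ascent steps on the reward.
    h = h.detach().requires_grad_(True)
    for _ in range(ascent_steps):
        r = rm(h)
        (grad,) = torch.autograd.grad(r, h)
        h = (h + lr * grad).detach().requires_grad_(True)
    return h.detach()

# Illustrative call with stand-in tensors (r_target would in practice be
# estimated from the benign end of the trajectory).
rm = TokenRewardModel(dim=4096)
d_t = torch.randn(4096)
d_t = d_t / d_t.norm()
h_edited = edit_representation(torch.randn(4096), rm, d_t, r_target=1.0)
```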

[Figure: Adaptive editing pipeline]

Experimental Evaluation

ARGRE was evaluated on eight mainstream LLMs, ranging from 355M parameters (GPT-2 Medium) to 30B (LLaMA-30B), on the RealToxicityPrompts benchmark. Toxicity was measured with Detoxify, and fluency with perplexity on WikiText-2.

ARGRE reduced average toxicity by 62.21 % while increasing perplexity by only 0.52, outperforming all baselines.

A simplified version without gradient steps still achieved a 59.63 % reduction.

On LLaMA‑30B, inference time was cut by 47.58 % compared to the strongest baseline.

Zero‑shot downstream task performance was preserved.

[Figure: Effectiveness results]
[Figure: Efficiency results]
[Figure: Zero-shot performance]

Limitations

Requires white‑box access to model representations, limiting applicability to closed‑source LLMs.

Current trajectory exploration is confined to the first principal component; future work will investigate richer directions.

Resources

Paper title: Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

Paper URL: https://arxiv.org/abs/2510.01243

Tags: detoxification, NeurIPS 2025, LLM safety, ARGRE, autoregressive reward, representation editing