How ARGRE Revolutionizes LLM Detoxification with Autoregressive Reward‑Guided Editing

The paper introduces ARGRE, a novel test-time detoxification framework for large language models that traces toxicity transition trajectories in representation space and uses a lightweight autoregressive reward model to reduce harmful outputs efficiently while preserving generation quality.


Background

Large language models (LLMs) can generate hateful, discriminatory, or threatening content. Existing detoxification approaches either require large amounts of annotated data and compute (training‑time preference optimization) or intervene only coarsely at test time (representation editing).

ARGRE Overview

Autoregressive Reward‑Guided Representation Editing (ARGRE) addresses these bottlenecks by explicitly modeling a continuous toxicity transition trajectory in the latent representation space and by training a token‑level reward model that guides inference‑time edits.

[Figure: ARGRE framework illustration]

Toxicity Trajectory Exploration

Assuming that semantic concepts are encoded linearly, for each prompt the final-token representations of a toxic continuation (h_tox) and a benign continuation (h_ben) are extracted. The difference vectors Δ = h_ben − h_tox, collected across prompts, are reduced with PCA, and the first principal component gives the dominant non-toxic direction d. Interpolating between h_tox and h_ben along d yields a fine-grained toxicity trajectory, which is then converted into a paired preference dataset.
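A minimal sketch of this step, assuming paired final-token hidden states have already been collected from the model; the array shapes, the sign convention for d, and the number of interpolation points are illustrative assumptions, not details from the paper:

```python
import numpy as np

# Stand-in data: final-token hidden states for N prompt pairs of
# dimension D (real values would come from the model).
rng = np.random.default_rng(0)
h_tox = rng.standard_normal((512, 4096))   # toxic continuations
h_ben = rng.standard_normal((512, 4096))   # benign continuations

# Difference vectors point from toxic toward benign representations.
deltas = h_ben - h_tox                                   # (N, D)

# PCA via SVD on the centered differences; the first right-singular
# vector is the dominant non-toxic direction d.
centered = deltas - deltas.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
d = vt[0]                                                # (D,), unit norm

# Orient d so that, on average, moving along it goes toxic -> benign.
if deltas.mean(axis=0) @ d < 0:
    d = -d

# Interpolate between each pair along d for a fine-grained toxicity
# trajectory (K points per pair; alpha 0 = toxic, 1 = benign).
K = 8
alphas = np.linspace(0.0, 1.0, K)
gaps = deltas @ d                                        # signed extent along d, (N,)
trajectory = (h_tox[:, None, :]
              + (alphas[None, :] * gaps[:, None])[:, :, None] * d)
# trajectory: (N, K, D); adjacent points along K form preference pairs,
# the more-benign point preferred over the more-toxic one.
```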

[Figure: PCA direction extraction]
[Figure: Preference dataset illustration]

Autoregressive Reward Model

A lightweight two‑layer MLP takes a token representation h_t and predicts a scalar reward r_t. The model is trained so that non‑toxic tokens receive higher rewards than toxic ones. The overall sequence reward is the sum of token rewards, enabling precise token‑level guidance.

[Equation: Reward model objective]
[Equation: Reward decomposition]
[Equation: Token-level reward]
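A sketch of what the reward head and a pairwise training objective could look like, assuming token representations from the interpolated trajectory serve as inputs; the hidden width, the ReLU activation, and the Bradley-Terry-style loss are assumptions beyond the paper's "two-layer MLP" and "sum of token rewards" descriptions:

```python
import torch
import torch.nn as nn

class TokenRewardModel(nn.Module):
    """Lightweight two-layer MLP mapping a token representation h_t
    to a scalar reward r_t (hidden width is an assumption)."""
    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)   # token rewards, shape (..., T)

def preference_loss(rm, h_preferred, h_rejected):
    """Pairwise objective: the sequence reward (sum of token rewards)
    of the less toxic sequence should exceed the more toxic one's."""
    r_pref = rm(h_preferred).sum(dim=-1)   # (B,)
    r_rej = rm(h_rejected).sum(dim=-1)     # (B,)
    return -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()

# Illustrative usage with random stand-in representations.
rm = TokenRewardModel(dim=4096)
h_pref = torch.randn(16, 32, 4096)   # batch of preferred token sequences
h_rej = torch.randn(16, 32, 4096)    # paired more-toxic sequences
loss = preference_loss(rm, h_pref, h_rej)
loss.backward()
```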

Adaptive Representation Editing (Inference)

For each generated token representation h_t:

1. Shift h_t along the learned non-toxic direction d to close the gap between its current reward and the average reward of non-toxic tokens.

2. Perform a few steps of lightweight gradient ascent on h_t to further increase the reward model's output.

This two-step strategy avoids costly iterative optimization while preventing the representation from settling into poor local optima (a minimal sketch follows below).
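Under the same assumptions, a minimal sketch of the two-step edit at inference time, reusing the TokenRewardModel from the previous sketch; the finite-difference scaling of the directional shift, the step size, and the number of ascent steps are illustrative choices, not the paper's exact procedure:

```python
import torch

@torch.enable_grad()  # autograd must be on even inside a no_grad generation loop
def edit_representation(h_t, rm, d, r_target, ascent_steps=3, lr=0.05):
    """Two-step edit of one token representation before decoding.

    h_t: (D,) hidden state; rm: token reward model (previous sketch);
    d: (D,) non-toxic direction; r_target: average non-toxic reward.
    """
    # Step 1: directional shift -- move along d far enough to close the
    # gap to the non-toxic average reward, using a finite-difference
    # estimate of the reward's slope along d (the scaling rule is an
    # assumption, not the paper's exact formula).
    with torch.no_grad():
        gap = r_target - rm(h_t)
        eps = 1e-2
        slope = (rm(h_t + eps * d) - rm(h_t)) / eps
        h = h_t + gap / (slope + 1e-6) * d

    # Step 2: a few lightweight gradient-ascent steps on the reward.
    h = h.detach().requires_grad_(True)
    for _ in range(ascent_steps):
        r = rm(h)
        (grad,) = torch.autograd.grad(r, h)
        h = (h + lr * grad).detach().requires_grad_(True)
    return h.detach()

# Illustrative call with stand-in tensors (r_target would in practice be
# estimated from the benign end of the trajectory).
rm = TokenRewardModel(dim=4096)
d_t = torch.randn(4096)
d_t = d_t / d_t.norm()
h_edited = edit_representation(torch.randn(4096), rm, d_t, r_target=1.0)
```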

[Figure: Adaptive editing pipeline]

Experimental Evaluation

ARGRE was evaluated on eight mainstream LLMs, ranging from 355M parameters (GPT-2 Medium) to 30B (LLaMA-30B), on the RealToxicityPrompts benchmark. Toxicity was measured with Detoxify, and fluency with perplexity on WikiText-2.

ARGRE reduced average toxicity by 62.21 % while increasing perplexity by only 0.52, outperforming all baselines.

A simplified version without gradient steps still achieved a 59.63 % reduction.

On LLaMA‑30B, inference time was cut by 47.58 % compared to the strongest baseline.

Zero‑shot downstream task performance was preserved.

[Figure: Effectiveness results]
[Figure: Efficiency results]
[Figure: Zero-shot performance]

Limitations

Requires white‑box access to model representations, limiting applicability to closed‑source LLMs.

Current trajectory exploration is confined to the first principal component; future work will investigate richer directions.

Resources

Paper title: Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

Paper URL: https://arxiv.org/abs/2510.01243

Tags: detoxification, NeurIPS 2025, LLM safety, ARGRE, autoregressive reward, representation editing