ProSafePrune: One‑Shot Pruning to Eliminate Over‑Refusal in Large Language Models
ProSafePrune, a one‑shot low‑rank parameter‑pruning framework presented at ICLR 2026, precisely removes the over‑harmful encoding that drives over‑refusal in LLMs, dramatically reducing false refusals while preserving safety defenses and slightly improving general‑task performance.
Research Background
Large language models (LLMs) are widely used for content creation and as intelligent assistants, making safety alignment essential. Existing alignment techniques such as supervised fine‑tuning (SFT) and reinforcement learning from human feedback (RLHF) effectively suppress malicious outputs, but they often cause “over‑refusal,” where harmless queries containing risky keywords are mistakenly blocked. This over‑refusal stems from a cognitive bias in the model’s hidden states: pseudo‑harmful instructions project onto both the harmful and harmless subspaces, and excessive safety fine‑tuning amplifies the harmful component, shifting the decision boundary.
Core Finding
Probe experiments reveal that over‑refusal is caused by “over‑harmful encoding.” In LLaMA‑2‑7B, pseudo‑harmful prompts generate strong harmful signals in early layers and retain high harmful encoding in deep layers, leading to a 38.5% over‑refusal rate, whereas LLaMA‑3‑8B shows only 10.5%.
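A minimal sketch of such a layer‑wise probe, assuming per‑layer hidden states have already been cached for harmful, harmless, and pseudo‑harmful prompts (the paper’s exact probe setup is not given in this summary; the function and variable names here are illustrative):

```python
# Hypothetical probe over cached hidden states: each h_* array holds
# (n_prompts, hidden_dim) activations taken from one transformer layer.
import numpy as np
from sklearn.linear_model import LogisticRegression

def harmful_probe_score(h_harmful, h_harmless, h_pseudo):
    """Train a linear probe to separate harmful from harmless prompts,
    then measure how 'harmful' pseudo-harmful prompts look to it."""
    X = np.vstack([h_harmful, h_harmless])
    y = np.concatenate([np.ones(len(h_harmful)), np.zeros(len(h_harmless))])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    # High mean harmful-class probability at a layer indicates strong
    # over-harmful encoding of pseudo-harmful prompts at that layer.
    return probe.predict_proba(h_pseudo)[:, 1].mean()
```

Sweeping this score across layers would surface the pattern described above: strong harmful signals in early layers and persistently high values in deep layers for a model that over‑refuses.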
ProSafePrune Design
The framework consists of three key components that target the identified bias without additional training (minimal code sketches follow the list):
Subspace Extraction: Using singular value decomposition (SVD), the method separates safety, harmful, and pseudo‑harmful subspaces from the activation matrices of each module (Q, K, V, O, FFN) in a given layer, preserving discriminative directions while minimizing information loss.
Overlap Operator: A three‑step operator isolates the pseudo‑harmful directions that overlap with the harmful subspace while excluding components aligned with the safety subspace, ensuring that only over‑harmful elements are pruned.
Mid‑Layer Pruning: t‑SNE visualizations and silhouette‑score analysis identify the middle layers as having the strongest feature separation for safety‑related attributes, so pruning these layers maximizes over‑refusal mitigation with minimal impact on overall performance. Pruning intensity is controlled by a coefficient λ∈[0,1] that scales how strongly the isolated over‑harmful directions are removed from the module weights.
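As a concrete illustration of the first two components, here is a minimal NumPy sketch. It assumes each subspace is taken as the top right‑singular directions of a module’s activation matrix, and that the overlap operator is project‑onto‑harmful, reject‑safety, orthonormalize; the paper’s exact construction may differ:

```python
import numpy as np

def top_subspace(acts, rank):
    """Dominant directions of an activation matrix via SVD.
    acts: (n_samples, hidden_dim) activations of one module (Q/K/V/O/FFN)
    on one prompt class; returns an orthonormal basis (hidden_dim, rank)."""
    _, _, vt = np.linalg.svd(acts - acts.mean(axis=0), full_matrices=False)
    return vt[:rank].T

def overlap_directions(b_pseudo, b_harm, b_safe, tol=1e-6):
    """Three-step overlap operator (one plausible reading): keep the part
    of the pseudo-harmful subspace lying in the harmful subspace, strip
    anything aligned with the safety subspace, then re-orthonormalize."""
    p_harm = b_harm @ b_harm.T          # projector onto harmful subspace
    p_safe = b_safe @ b_safe.T          # projector onto safety subspace
    o = p_harm @ b_pseudo               # step 1: harmful-overlap component
    o -= p_safe @ o                     # step 2: remove safety-aligned part
    q, r = np.linalg.qr(o)              # step 3: orthonormalize survivors
    keep = np.abs(np.diag(r)) > tol     # drop numerically null directions
    return q[:, keep]                   # (hidden_dim, k) pruning basis
```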
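The pruning update itself is shown only as a figure in the original write‑up, so the sketch below assumes a common low‑rank form, W′ = W(I − λVVᵀ), which attenuates by λ the component of the weights lying in the pruned directions; the silhouette‑based layer ranking mirrors the layer‑selection analysis described above:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def prune_weight(w, v_prune, lam=0.5):
    """Hypothetical pruning update: W' = W (I - lam * V V^T).
    w: (out_dim, hidden_dim) module weight; v_prune: (hidden_dim, k)
    orthonormal over-harmful directions; lam in [0, 1] sets intensity."""
    d = w.shape[1]
    return w @ (np.eye(d) - lam * (v_prune @ v_prune.T))

def rank_layers(layer_acts, labels):
    """Score each layer by how well safety-related classes separate there.
    layer_acts: {layer_idx: (n, hidden_dim)} activations; labels: class per
    sample (harmful / harmless / pseudo-harmful). Higher silhouette score
    means cleaner separation; the top-ranked (middle) layers are pruned."""
    scores = {i: silhouette_score(a, labels) for i, a in layer_acts.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Under this assumed form, λ near 1 removes the isolated directions entirely, while smaller values trade compliance gains against safety preservation.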
Experimental Validation
Extensive evaluations on LLaMA‑2/3 and Qwen 2.5/3 models spanning 7B–70B parameters cover three dimensions: over‑refusal, safety defense, and general‑task performance.
Over‑refusal reduction: On OR‑Bench and PHTest, ProSafePrune raises the compliance rate (C.R.) of LLaMA‑2‑7B from 11.0% to 73.0%, surpassing Self‑CD (43.5%) and Surgical (57.5%).
Safety preservation: On AdvBench and JailbreakBench, security scores (S.S.) drop only marginally, confirming that detection of genuinely harmful prompts remains intact.
General‑task gains: The MMLU score improves from 37.1 to 39.6, CommonQA from 49.0 to 53.0, and GSM8K from 23.0 to 25.5.
Ablation studies show that pruning entire layers yields far higher compliance than pruning individual sub‑modules; that removing the full pseudo‑harmful subspace projection (rather than only its overlap with the harmful subspace) improves compliance but harms safety scores; and that middle‑layer pruning outperforms pruning the bottom or top layers.
Method Advantages
No inference overhead: The pruned model is a standalone checkpoint, requiring no extra storage or runtime adjustments.
Fast to apply: On OR‑Bench‑Hard‑1K, the one‑shot pruning procedure completes in 16 minutes, far faster than Self‑CD (43 min) and SCAN (20 min).
Strong generalization: Effectiveness persists on the 32B‑parameter Qwen 3 and the 70B‑parameter LLaMA‑2, with LLaMA‑2‑70B compliance rising from 6.5% to 68.5%.
Conclusion and Outlook
ProSafePrune demonstrates that over‑refusal originates from over‑harmful encoding in the representation space and can be cured by low‑rank, subspace‑aware pruning. The approach achieves three goals simultaneously: safety defenses are unchanged, over‑refusal is dramatically reduced, and general performance is slightly enhanced, offering a new paradigm for safe LLM deployment.