ProSafePrune: One‑Shot Pruning to Eliminate Over‑Refusal in Large Language Models
ProSafePrune, a one‑shot low‑rank parameter‑pruning framework presented at ICLR 2026, precisely removes the over‑harmful encoding that drives over‑refusal in LLMs, dramatically reducing false refusals while preserving safety defenses and slightly improving general‑task performance.
Research Background
Large language models (LLMs) are widely used for content creation and as intelligent assistants, making safety alignment essential. Existing alignment techniques such as supervised fine‑tuning (SFT) and reinforcement learning from human feedback (RLHF) effectively suppress malicious outputs, but they often cause “over‑refusal,” where harmless queries containing risky keywords are mistakenly blocked. This over‑refusal stems from a cognitive bias in the model’s hidden states: pseudo‑harmful instructions project onto both the harmful and harmless subspaces, and excessive safety fine‑tuning amplifies the harmful component, shifting the decision boundary.
Core Finding
Probe experiments reveal that over‑refusal is caused by “over‑harmful encoding.” In LLaMA‑2‑7B, pseudo‑harmful prompts generate strong harmful signals in early layers and retain high harmful encoding in deep layers, leading to a 38.5% over‑refusal rate, whereas LLaMA‑3‑8B shows only 10.5%.
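A minimal sketch of such a layer‑wise probe, assuming per‑layer hidden states have already been cached for harmful, harmless, and pseudo‑harmful prompts (the paper’s exact probe setup is not given in this summary; the function and variable names here are illustrative):

```python
# Hypothetical probe over cached hidden states: each h_* array holds
# (n_prompts, hidden_dim) activations taken from one transformer layer.
import numpy as np
from sklearn.linear_model import LogisticRegression

def harmful_probe_score(h_harmful, h_harmless, h_pseudo):
    """Train a linear probe to separate harmful from harmless prompts,
    then measure how 'harmful' pseudo-harmful prompts look to it."""
    X = np.vstack([h_harmful, h_harmless])
    y = np.concatenate([np.ones(len(h_harmful)), np.zeros(len(h_harmless))])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    # High mean harmful-class probability at a layer indicates strong
    # over-harmful encoding of pseudo-harmful prompts at that layer.
    return probe.predict_proba(h_pseudo)[:, 1].mean()
```

Sweeping this score across layers would surface the pattern described above: strong harmful signals in early layers and persistently high values in deep layers for a model that over‑refuses.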
ProSafePrune Design
The framework consists of three key components that target the identified bias without additional training (minimal code sketches follow the list):
Subspace Extraction: Using singular value decomposition (SVD), the method separates safety, harmful, and pseudo‑harmful subspaces from the activation matrices of each module (Q, K, V, O, FFN) in a given layer, preserving discriminative directions while minimizing information loss.
Overlap Operator: A three‑step operator isolates the pseudo‑harmful directions that overlap with the harmful subspace while excluding components aligned with the safety subspace, ensuring that only over‑harmful elements are pruned.
Mid‑Layer Pruning: t‑SNE visualizations and silhouette‑score analysis identify the middle layers as having the strongest feature separation for safety‑related attributes, so pruning these layers maximizes over‑refusal mitigation with minimal impact on overall performance. Pruning intensity is controlled by a coefficient λ∈[0,1] that scales how strongly the isolated over‑harmful directions are removed from the module weights.
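As a concrete illustration of the first two components, here is a minimal NumPy sketch. It assumes each subspace is taken as the top right‑singular directions of a module’s activation matrix, and that the overlap operator is project‑onto‑harmful, reject‑safety, orthonormalize; the paper’s exact construction may differ:

```python
import numpy as np

def top_subspace(acts, rank):
    """Dominant directions of an activation matrix via SVD.
    acts: (n_samples, hidden_dim) activations of one module (Q/K/V/O/FFN)
    on one prompt class; returns an orthonormal basis (hidden_dim, rank)."""
    _, _, vt = np.linalg.svd(acts - acts.mean(axis=0), full_matrices=False)
    return vt[:rank].T

def overlap_directions(b_pseudo, b_harm, b_safe, tol=1e-6):
    """Three-step overlap operator (one plausible reading): keep the part
    of the pseudo-harmful subspace lying in the harmful subspace, strip
    anything aligned with the safety subspace, then re-orthonormalize."""
    p_harm = b_harm @ b_harm.T          # projector onto harmful subspace
    p_safe = b_safe @ b_safe.T          # projector onto safety subspace
    o = p_harm @ b_pseudo               # step 1: harmful-overlap component
    o -= p_safe @ o                     # step 2: remove safety-aligned part
    q, r = np.linalg.qr(o)              # step 3: orthonormalize survivors
    keep = np.abs(np.diag(r)) > tol     # drop numerically null directions
    return q[:, keep]                   # (hidden_dim, k) pruning basis
```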
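The pruning update itself is shown only as a figure in the original write‑up, so the sketch below assumes a common low‑rank form, W′ = W(I − λVVᵀ), which attenuates by λ the component of the weights lying in the pruned directions; the silhouette‑based layer ranking mirrors the layer‑selection analysis described above:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def prune_weight(w, v_prune, lam=0.5):
    """Hypothetical pruning update: W' = W (I - lam * V V^T).
    w: (out_dim, hidden_dim) module weight; v_prune: (hidden_dim, k)
    orthonormal over-harmful directions; lam in [0, 1] sets intensity."""
    d = w.shape[1]
    return w @ (np.eye(d) - lam * (v_prune @ v_prune.T))

def rank_layers(layer_acts, labels):
    """Score each layer by how well safety-related classes separate there.
    layer_acts: {layer_idx: (n, hidden_dim)} activations; labels: class per
    sample (harmful / harmless / pseudo-harmful). Higher silhouette score
    means cleaner separation; the top-ranked (middle) layers are pruned."""
    scores = {i: silhouette_score(a, labels) for i, a in layer_acts.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Under this assumed form, λ near 1 removes the isolated directions entirely, while smaller values trade compliance gains against safety preservation.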
Experimental Validation
Extensive evaluations on LLaMA‑2/3 and Qwen 2.5/3 models spanning 7B–70B parameters cover three dimensions: over‑refusal, safety defense, and general‑task performance.
Over‑refusal reduction: On OR‑Bench and PHTest, ProSafePrune raises the compliance rate (C.R.) of LLaMA‑2‑7B from 11.0% to 73.0%, surpassing Self‑CD (43.5%) and Surgical (57.5%).
Safety preservation: On AdvBench and JailbreakBench, security scores (S.S.) drop only marginally, confirming that detection of genuinely harmful prompts remains intact.
General‑task gains: The MMLU score improves from 37.1 to 39.6, CommonQA from 49.0 to 53.0, and GSM8K from 23.0 to 25.5.
Ablation studies show that pruning entire layers yields far higher compliance than pruning individual sub‑modules; that removing the full pseudo‑harmful subspace projection (rather than only its overlap with the harmful subspace) improves compliance but harms safety scores; and that middle‑layer pruning outperforms pruning the bottom or top layers.
Method Advantages
No inference overhead: The pruned model is a standalone checkpoint, requiring no extra storage or runtime adjustments.
Fast to apply: On OR‑Bench‑Hard‑1K, the one‑shot pruning procedure completes in 16 minutes, far faster than Self‑CD (43 min) and SCAN (20 min).
Strong generalization: Effectiveness persists on the 32B‑parameter Qwen 3 and the 70B‑parameter LLaMA‑2, with LLaMA‑2‑70B compliance rising from 6.5% to 68.5%.
Conclusion and Outlook
ProSafePrune demonstrates that over‑refusal originates from over‑harmful encoding in the representation space and can be cured by low‑rank, subspace‑aware pruning. The approach achieves three goals simultaneously: safety defenses are unchanged, over‑refusal is dramatically reduced, and general performance is slightly enhanced, offering a new paradigm for safe LLM deployment.