How Researchers Made Large Language Models Forget or Amplify Specific Concepts

A new study from Meta and NYU reveals a two‑step technique—SAMD to locate concept‑specific attention heads and SAMI to scale their influence—enabling precise, low‑cost editing of transformer models for tasks ranging from factual recall to safety control.


Background

Large language models (LLMs) store knowledge in a highly sparse manner: a complex concept is often represented by only a handful of attention heads. The paper From Concepts to Components (Meta & NYU) introduces two complementary techniques for locating and editing these concept-specific modules.

Scalable Attention Module Discovery (SAMD)

SAMD identifies the attention heads that encode a target concept.

Concept vectorisation: Encode the concept as a vector. For concrete words (e.g., “dog”) a word embedding can be used; for abstract notions (e.g., “reasoning”) a representation generated from a chain‑of‑thought (CoT) prompt is employed.

Similarity scoring: For every attention head in the transformer, compute the cosine similarity between the concept vector and the head’s output (the value produced after the attention operation).

Module construction: Select the top‑K heads with the highest similarity (typically 3–10 heads). These heads usually cluster in a few consecutive layers, forming a “concept module”.

The method works for both language models and Vision Transformers (ViT).
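In essence, SAMD reduces to a similarity search over head outputs. Below is a minimal PyTorch sketch of that search, not the authors' code: it assumes per‑head outputs have already been collected into a (num_layers, num_heads, d_model) tensor (e.g., via forward hooks, averaged over a probe set) and that the concept vector lives in the same d_model space. The names top_k_heads, head_outputs, and concept_vector are illustrative.

```python
import torch
import torch.nn.functional as F

def top_k_heads(head_outputs: torch.Tensor,
                concept_vector: torch.Tensor,
                k: int = 5):
    """Rank attention heads by cosine similarity to a concept vector.

    head_outputs:   (num_layers, num_heads, d_model) per-head outputs,
                    assumed to be collected beforehand (e.g. forward hooks,
                    averaged over a probe dataset).
    concept_vector: (d_model,) embedding of the target concept
                    (a word embedding, or a mean CoT-prompt representation).
    Returns a list of (layer, head, similarity) for the top-k heads.
    """
    num_layers, num_heads, d_model = head_outputs.shape
    # Cosine similarity of every head output against the concept vector.
    sims = F.cosine_similarity(
        head_outputs.reshape(-1, d_model),   # (L*H, d_model)
        concept_vector.unsqueeze(0),         # (1, d_model), broadcast
        dim=-1,
    )                                        # (L*H,)
    # Keep the K most similar heads; these form the "concept module".
    values, indices = sims.topk(k)
    return [(int(i) // num_heads, int(i) % num_heads, float(v))
            for i, v in zip(indices, values)]

# Hypothetical usage: 32 layers x 32 heads, hidden size 4096 (Llama-scale).
heads = torch.randn(32, 32, 4096)
concept = torch.randn(4096)
print(top_k_heads(heads, concept, k=5))
```

In practice the discovered heads can then be inspected directly; the clustering of the top‑K heads in a few consecutive layers reported by the paper would show up as repeated layer indices in this list.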

Figure: SAMD concept localisation diagram

Scalar Attention Module Intervention (SAMI)

Given a module discovered by SAMD, SAMI modifies its contribution with a single scalar s applied to the module’s output in the residual stream. A value of s < 1 attenuates the concept (e.g., the model forgets that “dogs bark”), while s > 1 amplifies it, making the concept dominate generation.

The intervention touches only the selected heads; the rest of the model remains unchanged.
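One simple way to realise such an intervention is a forward pre‑hook on the attention output projection that rescales the slice belonging to the selected heads before it is written back to the residual stream. The sketch below is a toy illustration under assumptions the article does not state (heads laid out contiguously along the last dimension, a Llama‑style o_proj module); sami_pre_hook and the module path in the comment are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

def sami_pre_hook(head_indices, num_heads, scale):
    """Build a forward pre-hook for an attention output projection.

    Rescales the slice of the concatenated per-head outputs belonging to
    the selected heads by a single scalar, so the module's contribution to
    the residual stream is attenuated (scale < 1) or amplified (scale > 1).
    Assumes the last dimension is num_heads * head_dim, heads contiguous.
    """
    def hook(module, args):
        (hidden,) = args                     # (batch, seq, num_heads * head_dim)
        head_dim = hidden.shape[-1] // num_heads
        hidden = hidden.clone()
        for h in head_indices:
            hidden[..., h * head_dim:(h + 1) * head_dim] *= scale
        return (hidden,)
    return hook

# Toy demonstration with a stand-in output projection. With a real model
# one would attach the hook to the o_proj of each layer that contains heads
# of the discovered module, e.g. (hypothetical Hugging Face Llama path):
#   model.model.layers[14].self_attn.o_proj.register_forward_pre_hook(
#       sami_pre_hook(head_indices=[7], num_heads=32, scale=0.1))
o_proj = nn.Linear(8 * 16, 8 * 16, bias=False)       # 8 heads, head_dim 16
handle = o_proj.register_forward_pre_hook(
    sami_pre_hook([2, 5], num_heads=8, scale=0.1))
out = o_proj(torch.randn(1, 4, 8 * 16))               # heads 2 and 5 contribute 10x less
handle.remove()
```

Because only one scalar per module is introduced, the intervention adds no trainable parameters and can be toggled or swept (e.g., scale = 0.1 to suppress, scale = 10⁴ to amplify) without touching the model weights.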

Figure: SAMI scaling illustration

Experimental Evaluation

The authors tested the pipeline on four representative scenarios.

1. Simple factual concepts

Concepts such as “dog” and “San Francisco” were edited.

Attenuation (s = 0.1) suppressed the concept, causing the model to hallucinate unrelated entities (e.g., answering “hummingbird” when asked which animal barks).

Amplification (s = 10⁴) produced repetitive outputs that fixated on the target concept.

2. Mathematical reasoning

On the GSM8K benchmark, SAMI was applied to the reasoning module of Llama‑3.1‑8B‑Instruct and Gemma‑7B (base).

Scaling with s = 1.4 raised accuracy from 84.61 % to 85.44 %.

Scaling with s = 1.2 improved the weaker model from 54.36 % to 56.71 %.

Other tasks (Commonsense QA, code generation) showed negligible change, demonstrating targeted improvement.

3. Safety and jailbreak control

The “safety module” in Llama‑2‑Chat‑7B was located (heads in layers 11‑18).

Disabling the module (setting s = 0) increased attack success on HarmBench to 71 %, far above baseline attacks (e.g., GCG at 34.5 %).

Amplifying the safety module caused the model to loop on safety‑related tokens (“safety/saf/cert”).

4. Vision Transformers

On ViT‑B/32, 200 ImageNet classes were each associated with a 3‑head module.

Ablating the “lighter” class module (setting s = 0) yielded 100 % mis‑recognition of lighters while raising the average error on the other classes by only ~15 %.

Figure: Safety module intervention results

Key Findings

Concepts are encoded by a very small set of attention heads (typically 3‑10).

Both localisation (SAMD) and low‑cost editing (SAMI) are model‑agnostic and work for language and vision transformers.

Scalar intervention can improve specialised abilities (e.g., reasoning) without degrading overall performance.

The approach offers a principled alternative to costly full‑model fine‑tuning.

Reference

From Concepts to Components (Meta & NYU), arXiv preprint: https://arxiv.org/abs/2506.17052

Tags: AI safety, sparse attention, model editing, concept control, transformer interpretability