How Anthropic’s Path Tracing Reveals the Inner Workings of Claude 3.5 Haiku

Anthropic’s recent paper introduces a path‑tracing technique that uses cross‑layer transcoders and attribution graphs to produce sparse visualisations of how the Claude 3.5 Haiku large language model arrives at its outputs. The paper reports a Pareto improvement over earlier approaches and lays out a four‑stage reverse‑engineering framework, while acknowledging current limitations.

AI Frontier Lectures

Background

Anthropic introduced path tracing (called circuit tracing in the original paper), a technique that makes the step‑by‑step decision process of large language models (LLMs) observable. The method was demonstrated on a small 18‑layer model and on Claude 3.5 Haiku.

Key Concepts

Path tracing: connects model components that correspond to real‑world concepts and visualises their interactions as a directed graph.

Cross‑layer transcoders: sparse, interpretable modules that replace the original MLP layers. Each transcoder feature reads from the residual stream at one layer, activates sparsely, and contributes to the outputs of that layer’s MLP and of all downstream MLP layers.

Attribution graphs: directed graphs whose nodes represent active features, token embeddings, reconstruction error, and log‑probabilities. Edges encode linear effects between features, so a node’s activation is the sum of its incoming edge contributions.
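The cross‑layer structure described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Anthropic’s implementation: the names `clt_forward`, `encoders`, and `decoders` are hypothetical, the weights are random, and the residual‑stream vectors are supplied as fixed inputs, whereas in a real model each layer’s residual stream depends on the layers before it.

```python
import random

def relu(x):
    return max(x, 0.0)

def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def clt_forward(residuals, encoders, decoders):
    """Toy cross-layer transcoder forward pass (hypothetical sketch).

    residuals[l]         : residual-stream vector read at layer l
    encoders[l]          : (n_features x d_model) encoder for layer l
    decoders[(src, dst)] : (d_model x n_features) decoder from the features
                           of layer src to the MLP output of layer dst >= src
    Returns (features, mlp_outputs), one entry per layer.
    """
    n_layers = len(residuals)
    d_model = len(residuals[0])
    # Each layer's features read only from that layer's residual stream.
    features = [[relu(a) for a in matvec(encoders[l], residuals[l])]
                for l in range(n_layers)]
    mlp_outputs = []
    for dst in range(n_layers):
        out = [0.0] * d_model
        # Features from every layer at or below dst write to dst's MLP output.
        for src in range(dst + 1):
            contrib = matvec(decoders[(src, dst)], features[src])
            out = [o + c for o, c in zip(out, contrib)]
        mlp_outputs.append(out)
    return features, mlp_outputs

# Random toy weights; sizes are arbitrary choices for illustration.
rng = random.Random(0)
n_layers, d_model, n_features = 3, 4, 6
residuals = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_layers)]
encoders = [[[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
            for _ in range(n_layers)]
decoders = {(s, d): [[rng.gauss(0, 1) for _ in range(n_features)]
                     for _ in range(d_model)]
            for s in range(n_layers) for d in range(s, n_layers)}
features, mlp_outputs = clt_forward(residuals, encoders, decoders)
```

The point of the sketch is the double loop: a feature at layer `src` has one decoder per downstream layer, which is what distinguishes a cross‑layer transcoder from a single‑layer one.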

Methodology

The approach consists of two main phases.

Feature extraction: Train a cross‑layer transcoder with an L0 sparsity penalty and a mean‑squared‑error (MSE) reconstruction loss. The sparsity term pushes each feature to be active on only a small subset of tokens, which encourages alignment with human‑readable concepts. When the transcoder replaces the original MLPs, the resulting model reproduces the base model’s output in roughly 50% of cases.

Interaction analysis: For a given prompt, generate an attribution graph by propagating the sparse feature activations through frozen attention patterns and layer‑norm denominators (both are held fixed so that direct effects between features are linear). Then prune the graph, retaining only the nodes and edges that contribute most to the target token’s log‑probability. The result is a sparse, interpretable computation graph.
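The pruning step can be illustrated with a small sketch. It assumes a simplified scoring rule, namely the sum of absolute path weights from each node to the target, and a hypothetical `keep_fraction` threshold; the paper’s actual pruning procedure may differ. Node names and edge weights are made up for illustration.

```python
def node_influence(nodes, edges, target):
    """Total absolute path weight from each node to `target`.

    nodes : node names in topological order (inputs first)
    edges : dict mapping (src, dst) -> direct linear effect of src on dst
    """
    influence = {name: 0.0 for name in nodes}
    influence[target] = 1.0
    # Walk backwards so each node's influence is final before its sources use it.
    for dst in reversed(nodes):
        for (src, d), weight in edges.items():
            if d == dst:
                influence[src] += abs(weight) * influence[dst]
    return influence

def prune(nodes, edges, target, keep_fraction=0.8):
    """Keep the most influential nodes until they cover keep_fraction of the total."""
    influence = node_influence(nodes, edges, target)
    candidates = sorted((n for n in nodes if n != target),
                        key=lambda n: influence[n], reverse=True)
    total = sum(influence[n] for n in candidates)
    kept, running = {target}, 0.0
    for name in candidates:
        if total == 0.0 or running >= keep_fraction * total:
            break
        kept.add(name)
        running += influence[name]
    # An edge survives only if both of its endpoints survive.
    kept_edges = {(s, d): w for (s, d), w in edges.items()
                  if s in kept and d in kept}
    return kept, kept_edges

# Hypothetical toy graph: one embedding node, three features, one logit node.
nodes = ["emb", "f1", "f2", "f3", "logit"]
edges = {
    ("emb", "f1"): 1.0, ("emb", "f2"): 0.2, ("f1", "f3"): 0.5,
    ("f1", "logit"): 1.0, ("f2", "logit"): 0.1, ("f3", "logit"): 2.0,
}
kept_nodes, kept_edges = prune(nodes, edges, "logit")
```

In this toy graph `f2` carries very little weight toward the logit, so it and its edges are dropped while the `emb`, `f1`, `f3` path is retained. Using absolute values ignores cancellation between positive and negative paths, which is one reason to treat this as a sketch only.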

Experimental Evaluation

Anthropic evaluated the pipeline on ten benchmark tasks using Claude 3.5 Haiku. Two types of experiments were performed:

Perturbation experiments: Inject noise along a selected feature direction and measure downstream activation changes. The observed changes qualitatively matched the predicted edge weights in the attribution graphs.

Quantitative trade‑off analysis: Compare sparsity (L0 count), reconstruction error (MSE), and fidelity (percentage of outputs matching the original model). Cross‑layer transcoders achieved a Pareto improvement over neuron‑level analysis and single‑layer transcoders, offering a better balance of sparsity, accuracy, and fidelity.
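To make the term concrete, a Pareto comparison over the three metrics can be sketched as below. The method names and numbers are placeholders invented for illustration; they are not results from the paper.

```python
def dominates(a, b):
    """True if a is at least as good as b on every metric and better on one.

    Each point is (l0, mse, fidelity): lower l0 and mse are better,
    higher fidelity is better.
    """
    at_least_as_good = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
    strictly_better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
    return at_least_as_good and strictly_better

def pareto_front(points):
    """Return the names of all points that no other point dominates."""
    return {
        name for name, p in points.items()
        if not any(dominates(q, p)
                   for other, q in points.items() if other != name)
    }

# Placeholder values only, NOT measurements from the paper.
points = {
    "method_a": (50, 0.30, 0.40),
    "method_b": (40, 0.25, 0.50),
    "method_c": (20, 0.40, 0.45),
}
front = pareto_front(points)
```

Here `method_b` dominates `method_a` on all three axes, while `method_b` and `method_c` trade sparsity against reconstruction error, so both remain on the front. A Pareto improvement in the article’s sense means a method sits in the position of `method_b` relative to the baselines.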

Reverse‑Engineering Framework

The paper formalises a four‑stage reverse‑engineering pipeline:

Component decomposition: Train a sparse cross‑layer transcoder to replace each MLP module.

Feature description: Use activation statistics (e.g., top‑k token contexts) to assign human‑readable labels to each sparse feature.

Interaction analysis: Build attribution graphs for specific prompts, then prune to the most influential nodes/edges.

Validation: Conduct causal intervention experiments (perturb a feature, observe downstream changes) to verify hypotheses derived from the graphs.
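The validation stage compares what the graph predicts with what actually happens under an intervention. A minimal sketch, assuming a toy downstream feature computed as a ReLU over a weighted sum of upstream features (the function names, feature names, and weights are all hypothetical):

```python
def downstream(features, weights, bias=0.0):
    """Toy downstream feature: ReLU of a weighted sum of upstream features."""
    pre = bias + sum(weights[name] * value for name, value in features.items())
    return max(pre, 0.0)

def intervention_check(features, weights, target_feature, delta, bias=0.0):
    """Perturb one upstream feature; return (observed, predicted) change.

    predicted is the linear estimate implied by the edge weight,
    observed is the change actually measured on the toy model.
    """
    baseline = downstream(features, weights, bias)
    perturbed = dict(features)
    perturbed[target_feature] += delta
    observed = downstream(perturbed, weights, bias) - baseline
    predicted = weights[target_feature] * delta
    return observed, predicted

# Hypothetical upstream activations and edge weights.
features = {"a": 4.0, "b": 1.0}
weights = {"a": 0.5, "b": -1.0}

# Small push along "a": the linear prediction holds.
obs_a, pred_a = intervention_check(features, weights, "a", delta=2.0)

# Large push along "b": the ReLU clips, so the observed change is smaller.
obs_b, pred_b = intervention_check(features, weights, "b", delta=3.0)
```

The first intervention gives matching observed and predicted changes; the second does not, because the downstream feature is driven to zero. This is the kind of mismatch that intervention experiments are meant to surface, and why agreement with the graph is reported as qualitative rather than exact.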

Limitations

The method assumes linear direct effects between features, which requires freezing attention patterns and layer‑norm denominators. This simplification omits non‑linear dynamics such as Q‑K pathways and may miss important interactions. Training the transcoders incurs a substantial upfront cost, and the current study only covers a limited set of ten tasks, leaving many model behaviours unexplored.
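Why freezing the layer‑norm denominator yields linearity can be checked numerically. The sketch below uses a plain layer norm without learned scale or shift, which is a simplifying assumption; the function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Standard layer norm (no learned scale/shift); also returns the denominator."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    denom = math.sqrt(var + eps)
    return [(v - mean) / denom for v in x], denom

def frozen_layer_norm(x, denom):
    """Layer norm with the denominator held fixed: a linear map of x."""
    mean = sum(x) / len(x)
    return [(v - mean) / denom for v in x]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def close(u, v, tol=1e-9):
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(u, v))

x = [1.0, 2.0, 4.0]
y = [0.5, -1.0, 3.0]

# Freeze the denominator at the value observed for x + y on this "prompt".
_, denom = layer_norm(add(x, y))

# With a frozen denominator, contributions add up exactly (linearity).
frozen_is_additive = close(
    frozen_layer_norm(add(x, y), denom),
    add(frozen_layer_norm(x, denom), frozen_layer_norm(y, denom)),
)

# With the true layer norm, they do not.
true_is_additive = close(
    layer_norm(add(x, y))[0],
    add(layer_norm(x)[0], layer_norm(y)[0]),
)
```

With the denominator frozen, the contribution of each input component can be attributed separately and summed, which is what the attribution graph relies on. The cost is that any effect a component has through changing the denominator, or through changing attention patterns, is invisible to the graph.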

Future Directions

Potential extensions include:

Developing more computationally efficient transcoders or alternative sparse coding schemes.

Incorporating attention‑path analysis (e.g., Q‑K circuits) to capture non‑linear interactions.

Exploring richer pruning criteria or adaptive thresholds to reduce the number of active features per input.

Scaling the framework to a broader suite of tasks and larger model families.

Conclusion

Path tracing provides a concrete, reproducible way to visualise and analyse the internal computation of LLMs. By replacing MLP layers with sparse cross‑layer transcoders and constructing attribution graphs, researchers can obtain interpretable, sparsified representations of model reasoning while preserving a substantial portion of the original model’s behaviour.
