In April 2025, Anthropic CEO Dario Amodei published “The Urgency of Interpretability,” sounding an alarm that rippled through the AI research community. His message was stark: we’re building systems of unprecedented capability while remaining fundamentally unable to understand how they arrive at their outputs. The timing was deliberate—after years of incremental progress, a technique called Sparse Autoencoders (SAEs) had finally cracked open the black box, revealing millions of interpretable concepts hidden inside large language models.
The results have been striking. OpenAI extracted 16 million distinct patterns from GPT-4. Anthropic discovered that Claude 3 Sonnet contains specific features for concepts like the Golden Gate Bridge, bugs in Python code, and even deceptive behavior. Google DeepMind open-sourced Gemma Scope—a suite of over 400 SAEs containing 30 million learned features. For the first time, we can point to a specific feature and say with confidence: this one activates when the model thinks about the Golden Gate Bridge.
The Superposition Hypothesis: Why Neurons Are Polysemantic
To understand why SAEs matter, we first need to understand the problem they solve. In 2022, Anthropic researchers published “Toy Models of Superposition,” revealing a fundamental property of neural networks: neurons are polysemantic.
A single neuron in an LLM might activate for seemingly unrelated concepts—perhaps responding to both “golden retriever” and “JavaScript async functions.” This isn’t a bug; it’s a mathematical necessity. The superposition hypothesis posits that models represent far more features than they have neurons by encoding features in non-orthogonal directions in activation space.
```mermaid
graph LR
    A[Input: Golden Gate Bridge] --> B[Neuron Layer<br/>4096 dimensions]
    C[Input: Python Bugs] --> B
    D[Input: Deception] --> B
    B --> E[Single Neuron<br/>activates for all three]
    style E fill:#ff6b6b
```
Mathematically, if a model has $d$ dimensions but needs to represent $n \gg d$ features, it compresses them through superposition:
$$\mathbf{h} = \sum_{i=1}^{n} x_i \mathbf{w}_i$$

where $\mathbf{h}$ is the hidden state, $x_i$ is the activation of feature $i$, and $\mathbf{w}_i$ are the feature directions. The key insight is that these directions aren’t orthogonal—features interfere with each other, making individual neurons hard to interpret.
This creates the interpretability crisis: we can’t understand what a model knows by looking at individual neurons because each neuron represents multiple concepts simultaneously.
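A toy numerical sketch makes the interference concrete. The dimensions below are made up for illustration: pack 512 random unit directions into 64 dimensions, activate three features, and try to read one back off with a dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512  # 512 features squeezed into 64 dimensions

# Random unit vectors in high dimensions are nearly, but not exactly, orthogonal
W = rng.standard_normal((n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only 3 of the 512 features are active
x = np.zeros(n)
x[[7, 42, 300]] = 1.0
h = x @ W  # hidden state: a superposition of 3 feature directions

# Reading a feature off with a dot product picks up interference
# from the other active features
readout = W @ h
print(readout[7])   # close to 1 (the true activation), plus interference
print(readout[0])   # small but nonzero, even though feature 0 is inactive
```

The interference terms scale like $1/\sqrt{d}$, which is exactly why individual dimensions look noisy and polysemantic when inspected directly.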
Sparse Autoencoders: Decomposing Superposition
Sparse Autoencoders solve this by learning to decompose polysemantic neurons into monosemantic features. The architecture is elegantly simple:
```mermaid
graph LR
    A[LLM Activation<br/>h ∈ R^d] --> B[Encoder: z = ReLU W_e h + b_e]
    B --> C[Sparse Features<br/>z ∈ R^n where n >> d]
    C --> D[Decoder: ĥ = W_d z + b_d]
    D --> E[Reconstructed<br/>Activation]
    style C fill:#4ecdc4
```
The SAE is trained to reconstruct LLM activations while enforcing sparsity on the hidden layer:
$$\mathcal{L} = \underbrace{\|\mathbf{h} - \hat{\mathbf{h}}\|_2^2}_{\text{reconstruction}} + \lambda \underbrace{\|\mathbf{z}\|_1}_{\text{sparsity penalty}}$$

The sparsity constraint forces the SAE to use as few features as possible to reconstruct each activation. Ideally, each learned feature corresponds to a single interpretable concept.
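The architecture and loss above can be sketched as a small PyTorch module. This is an illustrative minimal version—the class and function names are ours, not from any of the papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE: h (d_model) -> z (n_features) -> reconstructed h."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, h):
        z = F.relu(self.encoder(h))   # sparse feature activations
        h_hat = self.decoder(z)       # reconstruction of the LLM activation
        return z, h_hat

def sae_loss(h, h_hat, z, lambda_sparsity=0.01):
    reconstruction = F.mse_loss(h_hat, h)
    sparsity = lambda_sparsity * z.abs().sum(dim=-1).mean()  # L1 penalty
    return reconstruction + sparsity

sae = SparseAutoencoder(d_model=16, n_features=128)
h = torch.randn(4, 16)
z, h_hat = sae(h)
loss = sae_loss(h, h_hat, z)
```

Note the dimensions: the feature space (128 here) is deliberately much wider than the activation space (16), mirroring the $n \gg d$ setup from the superposition discussion.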
The Training Pipeline
```python
import torch
import torch.nn.functional as F

# Simplified SAE training loop
def train_sae(llm_activations, sae, num_epochs=1000, lambda_sparsity=0.01):
    optimizer = torch.optim.Adam(sae.parameters())
    for epoch in range(num_epochs):
        # LLM activations at a specific layer, shape: [batch, hidden_dim]
        h = llm_activations

        # Encode to sparse features
        z = F.relu(sae.encoder(h))

        # Decode back to activation space
        h_reconstructed = sae.decoder(z)

        # Reconstruction loss plus L1 sparsity penalty
        reconstruction_loss = F.mse_loss(h_reconstructed, h)
        sparsity_loss = lambda_sparsity * torch.mean(torch.abs(z))
        loss = reconstruction_loss + sparsity_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return sae
```
The key hyperparameter is the expansion factor—the ratio of SAE features to LLM hidden dimensions. Anthropic found that expansion factors of 8-32x work well, meaning a 4096-dimensional LLM layer gets expanded to roughly 33,000-131,000 SAE features.
Architecture Variants: ReLU, TopK, and JumpReLU
The original SAE architecture used ReLU activation with L1 regularization. But researchers quickly discovered problems: the L1 penalty causes activation shrinkage, systematically suppressing feature activations below their true values, and training tends to leave dead features that never activate.
TopK Activation (OpenAI)
OpenAI’s “Scaling and Evaluating Sparse Autoencoders” introduced TopK activation, which entirely removes the L1 penalty:
```python
def topk_activation(z, k):
    """Keep only the top-k activations per token, zero out the rest."""
    _, indices = torch.topk(z, k, dim=-1)
    mask = torch.zeros_like(z).scatter_(-1, indices, 1.0)
    return z * mask
```
This guarantees exactly $k$ features activate per token, eliminating activation shrinkage. The sparsity level $k$ becomes a direct hyperparameter rather than an emergent property of L1 regularization.
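A quick sanity check of that guarantee (the `topk_activation` helper is repeated so the snippet runs standalone):

```python
import torch

def topk_activation(z, k):
    # Keep only the top-k activations per token, zero out the rest
    _, indices = torch.topk(z, k, dim=-1)
    mask = torch.zeros_like(z).scatter_(-1, indices, 1.0)
    return z * mask

z = torch.randn(8, 1024)          # pre-activations for 8 tokens
z_sparse = topk_activation(z, k=32)

# Exactly k features survive per token, regardless of the input scale
nonzeros_per_token = (z_sparse != 0).sum(dim=-1)
print(nonzeros_per_token)  # 32 for every token
```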
JumpReLU (Gemma Scope)
Google DeepMind’s Gemma Scope uses JumpReLU activation:
$$\text{JumpReLU}(x, \theta) = \begin{cases} x & \text{if } x > \theta \\ 0 & \text{otherwise} \end{cases}$$

Unlike ReLU, JumpReLU has a learnable threshold $\theta$, trained with straight-through estimators because the hard threshold has zero gradient almost everywhere. This architecture produced state-of-the-art results on the reconstruction-sparsity trade-off.
| Architecture | Sparsity Control | Activation Shrinkage | Training Stability |
|---|---|---|---|
| ReLU + L1 | Indirect (via λ) | Yes | Moderate |
| TopK | Direct (via k) | No | High |
| JumpReLU | Learned (via θ) | Minimal | High |
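The JumpReLU forward pass itself is a one-liner; a minimal sketch that omits the straight-through gradient machinery used to train $\theta$:

```python
import torch

def jumprelu(x, theta):
    """Pass values strictly above the threshold theta; zero out the rest."""
    return x * (x > theta).to(x.dtype)

x = torch.tensor([[-0.5, 0.1, 0.3, 2.0]])
theta = torch.tensor(0.25)  # learnable (often per-feature) in a real SAE
print(jumprelu(x, theta))   # tensor([[0.0000, 0.0000, 0.3000, 2.0000]])
```

Note the contrast with plain ReLU: the small positive activation 0.1 is suppressed, which is how JumpReLU filters out low-magnitude noise without shrinking the activations it keeps.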
What We’ve Found: From Golden Gate Bridges to Deception
The most remarkable aspect of SAE research isn’t the architecture—it’s what the features reveal.
Anthropic’s Claude 3 Sonnet
Anthropic’s “Scaling Monosemanticity” paper trained SAEs on Claude 3 Sonnet and discovered features corresponding to:
- Specific locations: A feature that activates on mentions of the Golden Gate Bridge
- Code concepts: Features for bugs in Python code, SQL injection vulnerabilities, off-by-one errors
- Abstract reasoning: Features for sycophancy, deception, and power-seeking behavior
- Multilingual concepts: Features that activate on the same concept across different languages
The Golden Gate Bridge feature became iconic. When researchers amplified this feature’s activation, Claude would compulsively mention the bridge in unrelated conversations:
User: “What’s your favorite color?” Claude (with Golden Gate feature amplified): “I really love the international orange color of the Golden Gate Bridge! That iconic suspension span…”
OpenAI’s GPT-4
OpenAI scaled this approach dramatically, extracting 16 million features from GPT-4. The features spanned:
- Syntactic patterns: Subject-verb agreement, passive voice construction
- Semantic concepts: Famous people, locations, scientific theories
- Reasoning steps: Intermediate calculations, fact-checking behavior
- Safety-relevant features: Refusal behavior, harmful content detection
Google’s Gemma Scope
In August 2024, Google DeepMind released Gemma Scope—SAEs trained on every layer and sub-layer of Gemma 2 (2B and 9B parameters). The release included:
- 400+ sparse autoencoders
- 30 million learned features
- Full training code and checkpoints
- Interactive visualization dashboards
This wasn’t just a paper—it was infrastructure. Researchers worldwide can now explore Gemma’s internal representations without training their own SAEs.
Feature Steering: Controlling Models Through Interpretability
The most practical application of SAEs isn’t just understanding—it’s control. If we know which features correspond to specific behaviors, we can manipulate those features during inference.
```mermaid
graph LR
    A[Input Token] --> B[LLM Forward Pass]
    B --> C[Layer Activation h]
    C --> D[SAE Encoder: z = f_enc h]
    D --> E[Feature Modification<br/>z' = z + α * v_target]
    E --> F[SAE Decoder: h' = f_dec z']
    F --> G[Continue Forward Pass]
    G --> H[Modified Output]
    style E fill:#ff6b6b
```
This technique, called Sparse Activation Steering (SAS) or Feature Guided Activation Addition (FGAA), works as follows:
```python
def steer_with_feature(model, sae, input_text, feature_id,
                       target_layer, steering_coefficient=10.0):
    # The feature's direction in activation space is its decoder column
    feature_vector = sae.decoder.weight[:, feature_id]  # shape: [hidden_dim]

    def steering_hook(module, input, output):
        # Add the scaled feature direction to the layer's activations
        output[0][:] += steering_coefficient * feature_vector
        return output

    # Register hook on the target layer
    handle = model.layers[target_layer].register_forward_hook(steering_hook)

    # Generate with steering, then clean up the hook
    output = model.generate(input_text)
    handle.remove()
    return output
```
Researchers have demonstrated steering for:
- Reducing sycophancy: Lowering activation of features that cause models to agree with users regardless of truth
- Increasing honesty: Amplifying features associated with truthful responses
- Modifying tone: Steering toward more formal or casual language
- Safety interventions: Reducing activation of harmful content features
Auto-Interpretability: Scaling Feature Labeling
A 16-million-feature dictionary is useless if each feature needs manual interpretation. The solution is auto-interpretability: using LLMs to label SAE features.
The pipeline works in three steps:
1. Finding activating examples: Identify tokens where a feature has high activation
2. Prompting an LLM: Ask a strong model (like GPT-4) to identify what these examples have in common
3. Validating explanations: Test whether the proposed explanation predicts feature activation on new examples
```python
def auto_interpret_feature(sae, feature_id, dataset, llm_explainer,
                           model, tokenizer):
    # Step 1: Find highly activating examples
    activations = []
    for text in dataset:
        tokens = tokenizer(text)
        with torch.no_grad():
            h = model.get_layer_activation(tokens)
            z = sae.encoder(h)
        activations.append(z[:, feature_id])
    top_examples = get_top_activating_examples(activations, k=20)

    # Step 2: Generate explanation
    prompt = f"""
    The following text snippets highly activate a neural network feature:
    {format_examples(top_examples)}
    What concept or pattern do these examples share?
    Provide a concise explanation.
    """
    explanation = llm_explainer(prompt)
    return explanation
```
The “Automatically Interpreting Millions of Features” paper (October 2024) showed that auto-interpretability achieves reasonable agreement with human labels, though challenges remain for abstract or subtle features.
The Tooling Ecosystem: SAELens and Neuronpedia
The interpretability community has built robust infrastructure for SAE research.
SAELens
SAELens is the de facto library for training and analyzing SAEs. It integrates with TransformerLens (for accessing LLM activations) and provides:
```python
from sae_lens import SAE, HookedSAETransformer

# Load a pretrained SAE
sae = SAE.load_from_pretrained("gemma-scope-2b-layer-10")

# Run inference and capture feature activations
model = HookedSAETransformer.from_pretrained("gemma-2b")
logits, cache = model.run_with_cache("The capital of France is")

# Get SAE features for a specific layer
features = sae.encode(cache["resid_post", 10])

# Find most active features
top_features = features.topk(k=10)
print(f"Top features: {top_features.indices}")
print(f"Activations: {top_features.values}")
```
Neuronpedia
Neuronpedia is the visualization platform—a searchable database of millions of SAE features with:
- Feature dashboards: Top activating examples, activation histograms, downstream logit effects
- Search functionality: Find features by concept or by example text
- Steering playground: Test feature steering in real-time
The platform hosts features from Gemma Scope, OpenAI’s GPT-4 SAEs, and community-trained models.
Limitations and Skepticism
The excitement around SAEs has been tempered by important critiques.
DeepMind’s Negative Results
In March 2025, DeepMind’s safety team published “Negative Results for Sparse Autoencoders on Downstream Tasks.” Their findings were sobering:
- Missing concepts: SAEs don’t capture all important model behaviors
- Noisy representations: Small activations are often uninterpretable
- Warped latents: Features can represent distorted versions of concepts
- Limited downstream utility: SAE features didn’t improve performance on safety-relevant tasks
The paper’s conclusion was provocative: researchers should “deprioritize SAE research” in favor of other interpretability approaches.
The L0 Problem
A fundamental challenge is determining the true sparsity level (L0). If an SAE’s learned L0 is lower than the actual sparsity of the underlying model, it may merge distinct features. If it’s higher, features may be unnecessarily split.
Research shows that incorrect L0 leads to incorrect features—meaning the features we find might not match the model’s actual representation.
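Concretely, L0 is just the average number of nonzero features per token. Measuring it for a batch of SAE activations is simple; a minimal sketch, assuming `z` holds post-activation feature values:

```python
import torch

def mean_l0(z, eps=1e-8):
    """Average number of active (nonzero) features per token."""
    return (z.abs() > eps).sum(dim=-1).float().mean().item()

# Toy batch: 4 tokens, 100 features, roughly 90% zeroed out
z = torch.randn(4, 100) * (torch.rand(4, 100) > 0.9)
print(mean_l0(z))  # around 10 active features per token
```

Comparing this measured L0 against independent estimates of the model's true sparsity is one way researchers probe whether an SAE is splitting or merging features.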
Interpretability Illusions
Not all features that appear interpretable are real. The paper “Interpretability Illusions with Sparse Autoencoders” demonstrated that SAEs can find patterns that look meaningful but don’t correspond to actual model computation.
The Path Forward: Dario Amodei’s Vision
In his urgency essay, Amodei outlined what interpretability needs to achieve:
“We need to be able to trace the model’s reasoning process—to understand not just what it outputs, but why. This is essential for AI safety, because many of the most dangerous failure modes—deception, power-seeking, reward hacking—are precisely the ones that won’t be visible in outputs.”
The timeline he proposed is aggressive: meaningful interpretability tools by 2027, just as models reach potentially dangerous capability levels. Whether SAEs are the solution remains uncertain, but they’ve undeniably advanced the field.
Where the Field Is Heading
Current research directions include:
- Cross-layer transcoders: SAE variants that capture multi-layer computations
- Weakly causal crosscoders: Features that reflect causal relationships in model computation
- Specialized SAEs: Training separate SAEs for rare concepts like deception
- End-to-end SAE training: Integrating SAE objectives with model training
- Vision-language SAEs: Extending interpretability to multimodal models
The field has moved from “can we interpret anything?” to “can we interpret everything that matters?” That shift—from possibility to completeness—defines the next phase of interpretability research.
The black box isn’t fully open. But for the first time, we can see inside—and what we’re finding is more structured, more interpretable, and more concerning than anyone expected. The features exist. The tools exist. The question now is whether we can understand them well enough, fast enough, to matter.