In April 2025, Anthropic CEO Dario Amodei published “The Urgency of Interpretability,” sounding an alarm that rippled through the AI research community. His message was stark: we’re building systems of unprecedented capability while remaining fundamentally unable to understand how they arrive at their outputs. The timing was deliberate—after years of incremental progress, a technique called Sparse Autoencoders (SAEs) had finally cracked open the black box, revealing millions of interpretable concepts hidden inside large language models.
The results have been striking. OpenAI extracted 16 million distinct patterns from GPT-4. Anthropic discovered that Claude 3 Sonnet contains specific features for concepts like the Golden Gate Bridge, bugs in Python code, and even deceptive behavior. Google DeepMind open-sourced Gemma Scope—a suite of over 400 SAEs containing 30 million learned features. For the first time, we can point to a specific feature and say with confidence: this one activates when the model thinks about the Golden Gate Bridge.
The Superposition Hypothesis: Why Neurons Are Polysemantic
To understand why SAEs matter, we first need to understand the problem they solve. In 2022, Anthropic researchers published “Toy Models of Superposition,” revealing a fundamental property of neural networks: neurons are polysemantic.
A single neuron in an LLM might activate for seemingly unrelated concepts—perhaps responding to both “golden retriever” and “JavaScript async functions.” This isn’t a bug; it’s a mathematical necessity. The superposition hypothesis posits that models represent far more features than they have neurons by encoding features in non-orthogonal directions in activation space.
```mermaid
graph LR
    A[Input: Golden Gate Bridge] --> B[Neuron Layer<br/>4096 dimensions]
    C[Input: Python Bugs] --> B
    D[Input: Deception] --> B
    B --> E[Single Neuron<br/>activates for all three]
    style E fill:#ff6b6b
```
Mathematically, if a model has $d$ dimensions but needs to represent $n \gg d$ features, it compresses them through superposition:
$$\mathbf{h} = \sum_{i=1}^{n} x_i \mathbf{w}_i$$

where $\mathbf{h}$ is the hidden state, $x_i$ is the activation of feature $i$, and $\mathbf{w}_i$ are the feature directions. The key insight is that these directions aren’t orthogonal—features interfere with each other, making individual neurons hard to interpret.
This creates the interpretability crisis: we can’t understand what a model knows by looking at individual neurons because each neuron represents multiple concepts simultaneously.
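A toy numerical sketch makes the interference concrete. The dimensions below are made up for illustration: pack 512 random unit directions into 64 dimensions, activate three features, and try to read one back off with a dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512  # 512 features squeezed into 64 dimensions

# Random unit vectors in high dimensions are nearly, but not exactly, orthogonal
W = rng.standard_normal((n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only 3 of the 512 features are active
x = np.zeros(n)
x[[7, 42, 300]] = 1.0
h = x @ W  # hidden state: a superposition of 3 feature directions

# Reading a feature off with a dot product picks up interference
# from the other active features
readout = W @ h
print(readout[7])   # close to 1 (the true activation), plus interference
print(readout[0])   # small but nonzero, even though feature 0 is inactive
```

The interference terms scale like $1/\sqrt{d}$, which is exactly why individual dimensions look noisy and polysemantic when inspected directly.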
Sparse Autoencoders: Decomposing Superposition
Sparse Autoencoders solve this by learning to decompose polysemantic neurons into monosemantic features. The architecture is elegantly simple:
```mermaid
graph LR
    A[LLM Activation<br/>h ∈ R^d] --> B[Encoder: z = ReLU W_e h + b_e]
    B --> C[Sparse Features<br/>z ∈ R^n where n >> d]
    C --> D[Decoder: ĥ = W_d z + b_d]
    D --> E[Reconstructed<br/>Activation]
    style C fill:#4ecdc4
```
The SAE is trained to reconstruct LLM activations while enforcing sparsity on the hidden layer:
$$\mathcal{L} = \underbrace{\|\mathbf{h} - \hat{\mathbf{h}}\|_2^2}_{\text{reconstruction}} + \lambda \underbrace{\|\mathbf{z}\|_1}_{\text{sparsity penalty}}$$

The sparsity constraint forces the SAE to use as few features as possible to reconstruct each activation. Ideally, each learned feature corresponds to a single interpretable concept.
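The architecture and loss above can be sketched as a small PyTorch module. This is an illustrative minimal version—the class and function names are ours, not from any of the papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE: h (d_model) -> z (n_features) -> reconstructed h."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, h):
        z = F.relu(self.encoder(h))   # sparse feature activations
        h_hat = self.decoder(z)       # reconstruction of the LLM activation
        return z, h_hat

def sae_loss(h, h_hat, z, lambda_sparsity=0.01):
    reconstruction = F.mse_loss(h_hat, h)
    sparsity = lambda_sparsity * z.abs().sum(dim=-1).mean()  # L1 penalty
    return reconstruction + sparsity

sae = SparseAutoencoder(d_model=16, n_features=128)
h = torch.randn(4, 16)
z, h_hat = sae(h)
loss = sae_loss(h, h_hat, z)
```

Note the dimensions: the feature space (128 here) is deliberately much wider than the activation space (16), mirroring the $n \gg d$ setup from the superposition discussion.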
The Training Pipeline
```python
import torch
import torch.nn.functional as F

# Simplified SAE training loop
def train_sae(llm_activations, sae, num_epochs=1000, lambda_sparsity=0.01):
    optimizer = torch.optim.Adam(sae.parameters())
    for epoch in range(num_epochs):
        # LLM activations at a specific layer, shape: [batch, hidden_dim]
        h = llm_activations

        # Encode to sparse features
        z = F.relu(sae.encoder(h))

        # Decode back to activation space
        h_reconstructed = sae.decoder(z)

        # Reconstruction loss plus L1 sparsity penalty
        reconstruction_loss = F.mse_loss(h_reconstructed, h)
        sparsity_loss = lambda_sparsity * torch.mean(torch.abs(z))
        loss = reconstruction_loss + sparsity_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return sae
```
The key hyperparameter is the expansion factor—the ratio of SAE features to LLM hidden dimensions. Anthropic found that expansion factors of 8-32x work well, meaning a 4096-dimensional LLM layer gets expanded to roughly 33,000-131,000 SAE features.
Architecture Variants: ReLU, TopK, and JumpReLU
The original SAE architecture used ReLU activation with L1 regularization. But researchers quickly discovered problems: the L1 penalty causes activation shrinkage, systematically suppressing feature activations below their true values, and training tends to leave dead features that never activate.
TopK Activation (OpenAI)
OpenAI’s “Scaling and Evaluating Sparse Autoencoders” introduced TopK activation, which entirely removes the L1 penalty:
```python
def topk_activation(z, k):
    """Keep only the top-k activations per token, zero out the rest."""
    _, indices = torch.topk(z, k, dim=-1)
    mask = torch.zeros_like(z).scatter_(-1, indices, 1.0)
    return z * mask
```
This guarantees exactly $k$ features activate per token, eliminating activation shrinkage. The sparsity level $k$ becomes a direct hyperparameter rather than an emergent property of L1 regularization.
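A quick sanity check of that guarantee (the `topk_activation` helper is repeated so the snippet runs standalone):

```python
import torch

def topk_activation(z, k):
    # Keep only the top-k activations per token, zero out the rest
    _, indices = torch.topk(z, k, dim=-1)
    mask = torch.zeros_like(z).scatter_(-1, indices, 1.0)
    return z * mask

z = torch.randn(8, 1024)          # pre-activations for 8 tokens
z_sparse = topk_activation(z, k=32)

# Exactly k features survive per token, regardless of the input scale
nonzeros_per_token = (z_sparse != 0).sum(dim=-1)
print(nonzeros_per_token)  # 32 for every token
```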
JumpReLU (Gemma Scope)
Google DeepMind’s Gemma Scope uses JumpReLU activation:
$$\text{JumpReLU}(x, \theta) = \begin{cases} x & \text{if } x > \theta \\ 0 & \text{otherwise} \end{cases}$$

Unlike ReLU, JumpReLU has a learnable threshold $\theta$, trained with straight-through estimators because the hard threshold has zero gradient almost everywhere. This architecture produced state-of-the-art results on the reconstruction-sparsity trade-off.
| Architecture | Sparsity Control | Activation Shrinkage | Training Stability |
|---|---|---|---|
| ReLU + L1 | Indirect (via λ) | Yes | Moderate |
| TopK | Direct (via k) | No | High |
| JumpReLU | Learned (via θ) | Minimal | High |
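The JumpReLU forward pass itself is a one-liner; a minimal sketch that omits the straight-through gradient machinery used to train $\theta$:

```python
import torch

def jumprelu(x, theta):
    """Pass values strictly above the threshold theta; zero out the rest."""
    return x * (x > theta).to(x.dtype)

x = torch.tensor([[-0.5, 0.1, 0.3, 2.0]])
theta = torch.tensor(0.25)  # learnable (often per-feature) in a real SAE
print(jumprelu(x, theta))   # tensor([[0.0000, 0.0000, 0.3000, 2.0000]])
```

Note the contrast with plain ReLU: the small positive activation 0.1 is suppressed, which is how JumpReLU filters out low-magnitude noise without shrinking the activations it keeps.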
What We’ve Found: From Golden Gate Bridges to Deception
The most remarkable aspect of SAE research isn’t the architecture—it’s what the features reveal.
Anthropic’s Claude 3 Sonnet
Anthropic’s “Scaling Monosemanticity” paper trained SAEs on Claude 3 Sonnet and discovered features corresponding to:
- Specific locations: A feature that activates on mentions of the Golden Gate Bridge
- Code concepts: Features for bugs in Python code, SQL injection vulnerabilities, off-by-one errors
- Abstract reasoning: Features for sycophancy, deception, and power-seeking behavior
- Multilingual concepts: Features that activate on the same concept across different languages
The Golden Gate Bridge feature became iconic. When researchers amplified this feature’s activation, Claude would compulsively mention the bridge in unrelated conversations:
User: “What’s your favorite color?” Claude (with Golden Gate feature amplified): “I really love the international orange color of the Golden Gate Bridge! That iconic suspension span…”
OpenAI’s GPT-4
OpenAI scaled this approach dramatically, extracting 16 million features from GPT-4. The features spanned:
- Syntactic patterns: Subject-verb agreement, passive voice construction
- Semantic concepts: Famous people, locations, scientific theories
- Reasoning steps: Intermediate calculations, fact-checking behavior
- Safety-relevant features: Refusal behavior, harmful content detection
Google’s Gemma Scope
In August 2024, Google DeepMind released Gemma Scope—SAEs trained on every layer and sub-layer of Gemma 2 (2B and 9B parameters). The release included:
- 400+ sparse autoencoders
- 30 million learned features
- Full training code and checkpoints
- Interactive visualization dashboards
This wasn’t just a paper—it was infrastructure. Researchers worldwide can now explore Gemma’s internal representations without training their own SAEs.
Feature Steering: Controlling Models Through Interpretability
The most practical application of SAEs isn’t just understanding—it’s control. If we know which features correspond to specific behaviors, we can manipulate those features during inference.
```mermaid
graph LR
    A[Input Token] --> B[LLM Forward Pass]
    B --> C[Layer Activation h]
    C --> D[SAE Encoder: z = f_enc h]
    D --> E[Feature Modification<br/>z' = z + α * v_target]
    E --> F[SAE Decoder: h' = f_dec z']
    F --> G[Continue Forward Pass]
    G --> H[Modified Output]
    style E fill:#ff6b6b
```
This technique, called Sparse Activation Steering (SAS) or Feature Guided Activation Addition (FGAA), works as follows:
```python
def steer_with_feature(model, sae, input_text, feature_id,
                       target_layer, steering_coefficient=10.0):
    # The feature's direction in activation space is its decoder column
    feature_vector = sae.decoder.weight[:, feature_id]  # shape: [hidden_dim]

    def steering_hook(module, input, output):
        # Add the scaled feature direction to the layer's activations
        output[0][:] += steering_coefficient * feature_vector
        return output

    # Register hook on the target layer
    handle = model.layers[target_layer].register_forward_hook(steering_hook)

    # Generate with steering, then clean up the hook
    output = model.generate(input_text)
    handle.remove()
    return output
```
Researchers have demonstrated steering for:
- Reducing sycophancy: Lowering activation of features that cause models to agree with users regardless of truth
- Increasing honesty: Amplifying features associated with truthful responses
- Modifying tone: Steering toward more formal or casual language
- Safety interventions: Reducing activation of harmful content features
Auto-Interpretability: Scaling Feature Labeling
A 16-million-feature dictionary is useless if each feature needs manual interpretation. The solution is auto-interpretability: using LLMs to label SAE features.
The pipeline works in three steps:
1. Finding activating examples: Identify tokens where a feature has high activation
2. Prompting an LLM: Ask a strong model (like GPT-4) to identify what these examples have in common
3. Validating explanations: Test whether the proposed explanation predicts feature activation on new examples
```python
def auto_interpret_feature(sae, feature_id, dataset, llm_explainer,
                           model, tokenizer):
    # Step 1: Find highly activating examples
    activations = []
    for text in dataset:
        tokens = tokenizer(text)
        with torch.no_grad():
            h = model.get_layer_activation(tokens)
            z = sae.encoder(h)
        activations.append(z[:, feature_id])
    top_examples = get_top_activating_examples(activations, k=20)

    # Step 2: Generate explanation
    prompt = f"""
    The following text snippets highly activate a neural network feature:
    {format_examples(top_examples)}
    What concept or pattern do these examples share?
    Provide a concise explanation.
    """
    explanation = llm_explainer(prompt)
    return explanation
```
The “Automatically Interpreting Millions of Features” paper (October 2024) showed that auto-interpretability achieves reasonable agreement with human labels, though challenges remain for abstract or subtle features.
The Tooling Ecosystem: SAELens and Neuronpedia
The interpretability community has built robust infrastructure for SAE research.
SAELens
SAELens is the de facto library for training and analyzing SAEs. It integrates with TransformerLens (for accessing LLM activations) and provides:
```python
from sae_lens import SAE, HookedSAETransformer

# Load a pretrained SAE
sae = SAE.load_from_pretrained("gemma-scope-2b-layer-10")

# Run inference and capture feature activations
model = HookedSAETransformer.from_pretrained("gemma-2b")
logits, cache = model.run_with_cache("The capital of France is")

# Get SAE features for a specific layer
features = sae.encode(cache["resid_post", 10])

# Find most active features
top_features = features.topk(k=10)
print(f"Top features: {top_features.indices}")
print(f"Activations: {top_features.values}")
```
Neuronpedia
Neuronpedia is the visualization platform—a searchable database of millions of SAE features with:
- Feature dashboards: Top activating examples, activation histograms, downstream logit effects
- Search functionality: Find features by concept or by example text
- Steering playground: Test feature steering in real-time
The platform hosts features from Gemma Scope, OpenAI’s GPT-4 SAEs, and community-trained models.
Limitations and Skepticism
The excitement around SAEs has been tempered by important critiques.
DeepMind’s Negative Results
In March 2025, DeepMind’s safety team published “Negative Results for Sparse Autoencoders on Downstream Tasks.” Their findings were sobering:
- Missing concepts: SAEs don’t capture all important model behaviors
- Noisy representations: Small activations are often uninterpretable
- Warped latents: Features can represent distorted versions of concepts
- Limited downstream utility: SAE features didn’t improve performance on safety-relevant tasks
The paper’s conclusion was provocative: researchers should “deprioritize SAE research” in favor of other interpretability approaches.
The L0 Problem
A fundamental challenge is determining the true sparsity level (L0). If an SAE’s learned L0 is lower than the actual sparsity of the underlying model, it may merge distinct features. If it’s higher, features may be unnecessarily split.
Research shows that incorrect L0 leads to incorrect features—meaning the features we find might not match the model’s actual representation.
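Concretely, L0 is just the average number of nonzero features per token. Measuring it for a batch of SAE activations is simple; a minimal sketch, assuming `z` holds post-activation feature values:

```python
import torch

def mean_l0(z, eps=1e-8):
    """Average number of active (nonzero) features per token."""
    return (z.abs() > eps).sum(dim=-1).float().mean().item()

# Toy batch: 4 tokens, 100 features, roughly 90% zeroed out
z = torch.randn(4, 100) * (torch.rand(4, 100) > 0.9)
print(mean_l0(z))  # around 10 active features per token
```

Comparing this measured L0 against independent estimates of the model's true sparsity is one way researchers probe whether an SAE is splitting or merging features.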
Interpretability Illusions
Not all features that appear interpretable are real. The paper “Interpretability Illusions with Sparse Autoencoders” demonstrated that SAEs can find patterns that look meaningful but don’t correspond to actual model computation.
The Path Forward: Dario Amodei’s Vision
In his urgency essay, Amodei outlined what interpretability needs to achieve:
“We need to be able to trace the model’s reasoning process—to understand not just what it outputs, but why. This is essential for AI safety, because many of the most dangerous failure modes—deception, power-seeking, reward hacking—are precisely the ones that won’t be visible in outputs.”
The timeline he proposed is aggressive: meaningful interpretability tools by 2027, just as models reach potentially dangerous capability levels. Whether SAEs are the solution remains uncertain, but they’ve undeniably advanced the field.
Where the Field Is Heading
Current research directions include:
- Cross-layer transcoders: SAE variants that capture multi-layer computations
- Weakly causal crosscoders: Features that reflect causal relationships in model computation
- Specialized SAEs: Training separate SAEs for rare concepts like deception
- End-to-end SAE training: Integrating SAE objectives with model training
- Vision-language SAEs: Extending interpretability to multimodal models
The field has moved from “can we interpret anything?” to “can we interpret everything that matters?” That shift—from possibility to completeness—defines the next phase of interpretability research.
The black box isn’t fully open. But for the first time, we can see inside—and what we’re finding is more structured, more interpretable, and more concerning than anyone expected. The features exist. The tools exist. The question now is whether we can understand them well enough, fast enough, to matter.