Cracking the Black Box: How Sparse Autoencoders Finally Let Us Read AI's Mind
In April 2025, Anthropic CEO Dario Amodei published “The Urgency of Interpretability,” sounding an alarm that rippled through the AI research community. His message was stark: we are building systems of unprecedented capability while remaining fundamentally unable to understand how they arrive at their outputs. The timing was deliberate. After years of incremental progress, a technique called sparse autoencoders (SAEs) had finally cracked open the black box, revealing millions of interpretable concepts hidden inside large language models. ...