Sparse Autoencoders (SAEs) for Interpretability

Experimental

Safety Mechanism

A technique that decomposes a neural network's internal activations into a large set of sparse, interpretable features, revealing which concepts a model has learned and how they are represented. It addresses the superposition problem: models encode far more concepts than they have neurons, so concepts are stored as overlapping directions that no single neuron cleanly represents.

An SAE is trained on a model's intermediate activations (typically MLP outputs or the residual stream). Architecture: an encoder maps a d-dimensional activation to a much larger M-dimensional feature space (M >> d, typically 4-64x wider), using either a ReLU activation paired with an L1 sparsity penalty or a TopK activation to enforce sparsity; a decoder then reconstructs the original activation from the features. The sparsity constraint means only a handful of the M features are active for any given input. After training, each of the M features can be interpreted by examining which inputs maximally activate it. The result is a dictionary of interpretable features that decomposes the model's internal representations. Libraries: SAELens, TransformerLens.
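
A minimal sketch of this architecture in PyTorch. The expansion factor and L1 coefficient are illustrative assumptions, and `sae_loss` is a hypothetical helper, not part of any library's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """d-dim activation -> M sparse features -> reconstructed activation."""

    def __init__(self, d_model: int, expansion: int = 16):
        super().__init__()
        m = d_model * expansion                # M >> d (here 16x wider)
        self.encoder = nn.Linear(d_model, m)
        self.decoder = nn.Linear(m, d_model)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))            # sparse feature activations
        x_hat = self.decoder(f)                # reconstruction of x
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 penalty that drives f toward sparsity."""
    recon = F.mse_loss(x_hat, x)
    sparsity = l1_coeff * f.abs().sum(dim=-1).mean()
    return recon + sparsity
```

With a TopK activation in place of ReLU + L1, sparsity is enforced directly by keeping only the k largest feature activations, removing the need to tune a penalty coefficient.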

Why Does This Exist?

SAEs are the primary tool for decomposing neural network activations into interpretable features — the most scalable approach to solving the superposition problem. Anthropic's application to Claude 3 Sonnet extracted millions of interpretable features including safety-relevant ones (deception, bias, dangerous knowledge), demonstrating that production-scale models can be partially understood. SAEs are to mechanistic interpretability what the microscope was to biology.

To surgically remove knowledge, you first need to find it. SAEs identify which features encode specific concepts, providing a map of where knowledge lives in the network. If SAE feature #47293 corresponds to "how to synthesize [dangerous substance]", you have a target for intervention. This connects interpretability to unlearning: the ability to decompose a model into features is a prerequisite for selectively removing specific ones.
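
As a hedged sketch of what such an intervention could look like, reusing the `SparseAutoencoder` class above (the feature index is the hypothetical one from the example; real pipelines splice the edited activation back into the forward pass via hooks, e.g. with TransformerLens):

```python
import torch

FEATURE_ID = 47293  # hypothetical index of the targeted feature

@torch.no_grad()
def ablate_feature(sae, activation: torch.Tensor) -> torch.Tensor:
    """Decompose an activation, zero one feature, and reconstruct.

    Feeding the edited activation back into the model's forward pass
    suppresses the concept that feature encodes.
    """
    x_hat, f = sae(activation)
    error = activation - x_hat     # preserve what the SAE fails to capture
    f[..., FEATURE_ID] = 0.0       # remove the targeted feature
    return sae.decoder(f) + error  # edited activation
```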

SAE features can reveal what a model "knows" at the representation level: which concepts are actively represented in the activation space for a given input. If features corresponding to a topic are strongly active, the model likely has relevant parametric knowledge; if they are absent, it may be generating without grounding. This provides a representation-level signal for epistemic state that complements behavioral methods like verbalized confidence.
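
A sketch of reading off such a signal, assuming a hypothetical set of feature indices already identified (e.g. via max-activating examples) as encoding the topic of interest:

```python
import torch

TOPIC_FEATURES = [1024, 2377, 9981]  # hypothetical topic-linked feature indices

@torch.no_grad()
def topic_activation_score(sae, activation: torch.Tensor) -> float:
    """Mean activation of topic features: a crude representation-level
    signal for whether relevant knowledge is active for this input."""
    _, f = sae(activation)
    return f[..., TOPIC_FEATURES].mean().item()

# Near-zero scores on a prompt about the topic suggest the model is
# generating without grounding; compare against behavioral confidence.
```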