Sparse Autoencoders (SAEs) for Interpretability

Experimental

Safety Mechanism

A technique that decomposes a neural network's internal activations into a large set of sparse, interpretable features, revealing which concepts a model has learned and how they are represented. It addresses the superposition problem: models encode far more concepts than they have neurons, so concepts are stored as overlapping directions that no single neuron cleanly represents.

An SAE is trained on a model's intermediate activations (typically MLP outputs or the residual stream). Architecture: an encoder maps a d-dimensional activation to a much larger M-dimensional feature space (M >> d, typically 4-64x wider), using either a ReLU activation paired with an L1 sparsity penalty or a TopK activation to enforce sparsity; a decoder then reconstructs the original activation from the features. The sparsity constraint means only a handful of the M features are active for any given input. After training, each of the M features can be interpreted by examining which inputs maximally activate it. The result is a dictionary of interpretable features that decomposes the model's internal representations. Libraries: SAELens, TransformerLens.
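
A minimal sketch of this architecture in PyTorch. The expansion factor and L1 coefficient are illustrative assumptions, and `sae_loss` is a hypothetical helper, not part of any library's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """d-dim activation -> M sparse features -> reconstructed activation."""

    def __init__(self, d_model: int, expansion: int = 16):
        super().__init__()
        m = d_model * expansion                # M >> d (here 16x wider)
        self.encoder = nn.Linear(d_model, m)
        self.decoder = nn.Linear(m, d_model)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))            # sparse feature activations
        x_hat = self.decoder(f)                # reconstruction of x
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 penalty that drives f toward sparsity."""
    recon = F.mse_loss(x_hat, x)
    sparsity = l1_coeff * f.abs().sum(dim=-1).mean()
    return recon + sparsity
```

With a TopK activation in place of ReLU + L1, sparsity is enforced directly by keeping only the k largest feature activations, removing the need to tune a penalty coefficient.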

Why Does This Exist?

SAEs are the primary tool for decomposing neural network activations into interpretable features — the most scalable approach to solving the superposition problem. Anthropic's application to Claude 3 Sonnet extracted millions of interpretable features including safety-relevant ones (deception, bias, dangerous knowledge), demonstrating that production-scale models can be partially understood. SAEs are to mechanistic interpretability what the microscope was to biology.

To surgically remove knowledge, you first need to find it. SAEs identify which features encode specific concepts, providing a map of where knowledge lives in the network. If SAE feature #47293 corresponds to "how to synthesize [dangerous substance]", you have a target for intervention. This connects interpretability to unlearning: the ability to decompose a model into features is a prerequisite for selectively removing specific ones.
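
As a hedged sketch of what such an intervention could look like, reusing the `SparseAutoencoder` class above (the feature index is the hypothetical one from the example; real pipelines splice the edited activation back into the forward pass via hooks, e.g. with TransformerLens):

```python
import torch

FEATURE_ID = 47293  # hypothetical index of the targeted feature

@torch.no_grad()
def ablate_feature(sae, activation: torch.Tensor) -> torch.Tensor:
    """Decompose an activation, zero one feature, and reconstruct.

    Feeding the edited activation back into the model's forward pass
    suppresses the concept that feature encodes.
    """
    x_hat, f = sae(activation)
    error = activation - x_hat     # preserve what the SAE fails to capture
    f[..., FEATURE_ID] = 0.0       # remove the targeted feature
    return sae.decoder(f) + error  # edited activation
```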

SAE features can reveal what a model "knows" at the representation level: which concepts are actively represented in the activation space for a given input. If features corresponding to a topic are strongly active, the model likely has relevant parametric knowledge; if they are absent, it may be generating without grounding. This provides a representation-level signal for epistemic state that complements behavioral methods like verbalized confidence.
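
A sketch of reading off such a signal, assuming a hypothetical set of feature indices already identified (e.g. via max-activating examples) as encoding the topic of interest:

```python
import torch

TOPIC_FEATURES = [1024, 2377, 9981]  # hypothetical topic-linked feature indices

@torch.no_grad()
def topic_activation_score(sae, activation: torch.Tensor) -> float:
    """Mean activation of topic features: a crude representation-level
    signal for whether relevant knowledge is active for this input."""
    _, f = sae(activation)
    return f[..., TOPIC_FEATURES].mean().item()

# Near-zero scores on a prompt about the topic suggest the model is
# generating without grounding; compare against behavioral confidence.
```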