Neural networks store more features than they have neurons by placing them in superposition — overlapping, non-orthogonal directions. A sparse autoencoder (SAE) can pull those features back apart into a sparse, monosemantic code, and this is now the central tool of mechanistic interpretability work on real models.
This repository implements the method from scratch on a problem where the answer is known. We embed a set of ground-truth features in superposition, train an SAE on the resulting activations, and then measure how faithfully it recovers the true features. The SAE — encoder, decoder, L1 objective, analytic gradients, decoder normalisation, and dead-latent resampling — is ~90 lines of NumPy with no deep-learning dependency.
Companion to superposition-toy-models, which shows how the features end up in superposition in the first place.
sparse features x      →  activations a = G x     →  SAE        →  recovered features
(ground truth, known)     (superposition, d < n)     ReLU + L1     (compared to G)
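The data-generating step of the pipeline can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the repository's SuperpositionData class: the helper name make_superposition_data, the Gaussian draw for G, and the uniform feature magnitudes are all hypothetical choices.

```python
import numpy as np

def make_superposition_data(n_features=32, d_model=16, density=0.05,
                            batch=4096, seed=0):
    """Hypothetical helper: sample sparse features x and activations a = G x."""
    rng = np.random.default_rng(seed)
    # Ground-truth dictionary G: one unit-norm column per feature
    # (Gaussian directions are an assumption of this sketch).
    G = rng.standard_normal((d_model, n_features))
    G /= np.linalg.norm(G, axis=0, keepdims=True)
    # Each feature is active independently with probability `density`;
    # active features get a magnitude in [0, 1) (also an assumption).
    mask = rng.random((batch, n_features)) < density
    x = mask * rng.random((batch, n_features))
    # One activation per row: a_i = G x_i, written for the whole batch as x @ G.T.
    a = x @ G.T
    return x, a, G

x, a, G = make_superposition_data()
print(a.shape, G.shape, (x > 0).mean())
```

Because d_model < n_features, the columns of G cannot all be orthogonal, so every activation vector mixes the features that happen to be active.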
The SAE minimises reconstruction error plus an L1 penalty on its code:
f = ReLU(W_enc (a − b_dec) + b_enc) # sparse latent code
â = W_dec f + b_dec # reconstruction
L = ‖a − â‖² + λ ‖f‖₁
Decoder columns are kept unit-norm (so the penalty cannot be gamed by shrinking features) and silent latents are periodically resampled.
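The objective and the decoder normalisation above can be written out directly. The sketch below is not the code in model.py; the function names, the shape convention (W_enc is n_latents × d_model, W_dec is d_model × n_latents), the batch-mean reduction, and the tied initialisation are assumptions made for illustration.

```python
import numpy as np

def sae_forward(a, W_enc, b_enc, W_dec, b_dec):
    """a: (batch, d_model). Returns the latent code f and the reconstruction a_hat."""
    pre = (a - b_dec) @ W_enc.T + b_enc   # pre-encoder bias subtraction
    f = np.maximum(pre, 0.0)              # ReLU gives a non-negative code
    a_hat = f @ W_dec.T + b_dec
    return f, a_hat

def sae_loss(a, f, a_hat, l1):
    """Batch mean of ||a - a_hat||^2 + l1 * ||f||_1."""
    recon = np.sum((a - a_hat) ** 2, axis=1)
    sparsity = np.sum(np.abs(f), axis=1)
    return float(np.mean(recon + l1 * sparsity))

def normalise_decoder(W_dec, eps=1e-8):
    """Rescale every decoder column to unit L2 norm."""
    return W_dec / (np.linalg.norm(W_dec, axis=0, keepdims=True) + eps)

rng = np.random.default_rng(0)
d_model, n_latents = 16, 64
W_dec = normalise_decoder(rng.standard_normal((d_model, n_latents)))
W_enc = W_dec.T.copy()                    # tied initialisation (an assumption)
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)

a = rng.standard_normal((8, d_model))
f, a_hat = sae_forward(a, W_enc, b_enc, W_dec, b_dec)
print(sae_loss(a, f, a_hat, l1=0.16))
```

Without the normalisation step, scaling a decoder column up by c and its latent down by 1/c leaves the reconstruction unchanged while shrinking the L1 term, which is the loophole the unit-norm constraint closes.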
The SAE recovers the ground-truth dictionary. Matching each learned latent to its closest true feature gives a near-diagonal cosine matrix — the latents are monosemantic. Mean Max Cosine Similarity (MMCS) reaches 0.96.
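As a rough illustration of how such a score can be computed, the sketch below takes MMCS to be the mean, over true features, of the best cosine match among the learned decoder columns. The names cosine_matrix and mmcs_sketch are hypothetical and are not the repository's mmcs function.

```python
import numpy as np

def cosine_matrix(W_dec, G):
    """Cosine similarity between every learned column and every true column."""
    W = W_dec / np.linalg.norm(W_dec, axis=0, keepdims=True)
    T = G / np.linalg.norm(G, axis=0, keepdims=True)
    return W.T @ T                        # shape (n_latents, n_features)

def mmcs_sketch(W_dec, G):
    """For each true feature, take the best-matching latent, then average."""
    return float(cosine_matrix(W_dec, G).max(axis=0).mean())

rng = np.random.default_rng(0)
G = rng.standard_normal((16, 32))

# An idealised dictionary: the true features in shuffled order,
# followed by 32 unused random latents.
perm = rng.permutation(32)
W_perfect = np.concatenate([G[:, perm], rng.standard_normal((16, 32))], axis=1)
W_random = rng.standard_normal((16, 64))

print(mmcs_sketch(W_perfect, G))          # 1.0 up to floating-point rounding
print(mmcs_sketch(W_random, G))           # lower: no column is an exact match
```

Taking the maximum over latents makes the score indifferent to the order in which the SAE happens to learn the features and to any extra, unmatched latents.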
Individual latents are monosemantic. When each recovered latent's activation is conditioned on which true feature is present, each latent fires for exactly one feature.
Sparsity is what buys interpretability. Sweeping the L1 coefficient traces the sparsity–fidelity trade-off; recovery is highest when the code's sparsity is driven toward the true sparsity of the data, and degrades as the code is allowed to be dense.
git clone https://github.com/fatihcx/sparse-autoencoders.git
cd sparse-autoencoders
pip install -r requirements.txt
python -m scripts.run_experiments reference # recovery + monosemanticity figures
python -m scripts.run_experiments pareto # sparsity–fidelity sweep
python tests/test_sae.py # sanity checks

from sae import SuperpositionData, SparseAutoencoder, mmcs, mean_l0
data = SuperpositionData(n_features=32, d_model=16, density=0.05, seed=0)
sae = SparseAutoencoder(d_model=16, n_latents=64, seed=1)
sae.fit(data.data_fn(), l1=0.16, steps=18000)
print("MMCS:", mmcs(sae.W_dec, data.G)) # alignment with ground-truth features
print("L0:", mean_l0(sae, data.sample(4096)[0]))

sae/
  data.py               ground-truth features embedded in superposition
  model.py              sparse autoencoder: analytic gradients, Adam, resampling
  metrics.py            MMCS, L0, reconstruction error, dead fraction, matching
scripts/
  run_experiments.py    recovery, monosemanticity, and Pareto figures
tests/
  test_sae.py           shape, forward, and recovery checks
- MMCS (mean max cosine similarity) is the standard recovery metric: for each true feature, the best cosine match among learned features, averaged.
- The encoder/decoder use a shared b_dec (pre-encoder bias subtraction), as in Anthropic's formulation; its gradient therefore has both an encoder and a decoder term, derived explicitly in model.py.
- The SAE is deliberately overcomplete (n_latents > n_features); unused latents are expected and are why the reverse-direction MMCS is lower than the forward.
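The two-term gradient of the shared b_dec can be checked numerically. The sketch below is a standalone derivation for a single activation vector, not the code in model.py; the function name loss_and_grad_b_dec and the random test weights are assumptions. It applies the chain rule once through the reconstruction (decoder term) and once through the pre-encoder subtraction (encoder term), then compares the sum against central finite differences.

```python
import numpy as np

def loss_and_grad_b_dec(a, W_enc, b_enc, W_dec, b_dec, l1):
    """Loss for one activation vector a and its analytic gradient w.r.t. b_dec."""
    pre = W_enc @ (a - b_dec) + b_enc
    f = np.maximum(pre, 0.0)
    a_hat = W_dec @ f + b_dec
    loss = np.sum((a - a_hat) ** 2) + l1 * np.sum(f)  # f >= 0, so ||f||_1 = sum(f)

    g_hat = 2.0 * (a_hat - a)                    # dL/d a_hat
    g_pre = (W_dec.T @ g_hat + l1) * (pre > 0)   # dL/d pre, gated by the ReLU
    decoder_term = g_hat                         # b_dec is added to the reconstruction
    encoder_term = -W_enc.T @ g_pre              # b_dec is subtracted before the encoder
    return loss, decoder_term + encoder_term

rng = np.random.default_rng(0)
d_model, n_latents, l1 = 16, 64, 0.16
W_enc = 0.1 * rng.standard_normal((n_latents, d_model))
W_dec = rng.standard_normal((d_model, n_latents))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)
b_enc = 0.1 * rng.standard_normal(n_latents)
b_dec = 0.1 * rng.standard_normal(d_model)
a = rng.standard_normal(d_model)

_, analytic = loss_and_grad_b_dec(a, W_enc, b_enc, W_dec, b_dec, l1)

# Central finite differences as an independent check of the analytic gradient.
eps = 1e-6
numeric = np.zeros(d_model)
for i in range(d_model):
    step = np.zeros(d_model)
    step[i] = eps
    loss_plus, _ = loss_and_grad_b_dec(a, W_enc, b_enc, W_dec, b_dec + step, l1)
    loss_minus, _ = loss_and_grad_b_dec(a, W_enc, b_enc, W_dec, b_dec - step, l1)
    numeric[i] = (loss_plus - loss_minus) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

Dropping either term makes the check fail, which is a quick way to confirm that both paths through b_dec matter.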
- Sharkey, L., Braun, D., & Millidge, B. (2022). Taking features out of superposition with sparse autoencoders.
- Bricken, T., Templeton, A., et al. (2023). Towards Monosemanticity: Decomposing Language Models with Dictionary Learning. Transformer Circuits Thread (Anthropic).
- Cunningham, H., Ewart, A., et al. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models.
This is an independent educational reimplementation, not affiliated with the above.
MIT © 2026 Fatih Çelik


