Sparse Autoencoders for Monosemantic Features

Neural networks store more features than they have neurons by placing them in superposition: overlapping, non-orthogonal directions in activation space. A sparse autoencoder (SAE) can pull those features back apart into a sparse, monosemantic code, and it is now a central tool of mechanistic interpretability work on real models.

This repository implements the method from scratch on a problem where the answer is known. We embed a set of ground-truth features in superposition, train an SAE on the resulting activations, and then measure how faithfully it recovers the true features. The SAE — encoder, decoder, L1 objective, analytic gradients, decoder normalisation, and dead-latent resampling — is ~90 lines of NumPy with no deep-learning dependency.

Companion to superposition-toy-models, which shows how the features end up in superposition in the first place.

The pipeline

sparse features  x        →   activations  a = G x        →   SAE          →   recovered features
(ground truth, known)         (superposition, d < n)          ReLU + L1         (compared to G)
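
The first two stages of the pipeline can be sketched as follows. This is a minimal illustration, not the repository's data.py: the helper name make_superposition_data, the random unit-norm choice of G, and the uniform feature magnitudes are all assumptions made for the example. Here d is d_model and n is n_features, so d < n forces the features to share directions.

```python
import numpy as np

def make_superposition_data(n_features=32, d_model=16, density=0.05,
                            n_samples=4096, seed=0):
    """Sample sparse features x and embed them as a = G x (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # G holds one unit-norm direction per feature, shape (d_model, n_features).
    # Random Gaussian directions are an assumption of this sketch.
    G = rng.standard_normal((d_model, n_features))
    G /= np.linalg.norm(G, axis=0, keepdims=True)
    # Each feature is active independently with probability `density`;
    # active features get a uniform magnitude in [0, 1) (also an assumption).
    active = rng.random((n_samples, n_features)) < density
    x = active * rng.random((n_samples, n_features))
    # Row-wise a = G x: activations of shape (n_samples, d_model).
    a = x @ G.T
    return x, a, G

x, a, G = make_superposition_data()
```

With 32 features in 16 dimensions the columns of G cannot all be orthogonal, which is what makes recovering them a non-trivial test.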

The SAE minimises reconstruction error plus an L1 penalty on its code:

f  = ReLU(W_enc (a − b_dec) + b_enc)        # sparse latent code
â  = W_dec f + b_dec                         # reconstruction
L  = ‖a − â‖²  +  λ ‖f‖₁
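
The three lines above translate almost directly into NumPy. The sketch below assumes row-major batches (one activation per row), W_enc of shape (n_latents, d_model) and W_dec of shape (d_model, n_latents); the function name sae_forward and the choice to average the loss over the batch are assumptions, not taken from model.py.

```python
import numpy as np

def sae_forward(a, W_enc, b_enc, W_dec, b_dec, l1):
    """Forward pass and loss for a batch `a` of shape (batch, d_model)."""
    # f = ReLU(W_enc (a - b_dec) + b_enc), computed row-wise.
    f = np.maximum(0.0, (a - b_dec) @ W_enc.T + b_enc)   # (batch, n_latents)
    # a_hat = W_dec f + b_dec
    a_hat = f @ W_dec.T + b_dec                           # (batch, d_model)
    recon = np.sum((a - a_hat) ** 2, axis=1)              # squared error per sample
    sparsity = np.sum(np.abs(f), axis=1)                  # L1 norm of the code per sample
    loss = np.mean(recon + l1 * sparsity)                 # batch mean (assumed)
    return f, a_hat, loss

# Small random example with unit-norm decoder columns.
rng = np.random.default_rng(0)
d_model, n_latents = 16, 64
W_enc = rng.standard_normal((n_latents, d_model)) * 0.1
W_dec = rng.standard_normal((d_model, n_latents))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)
a = rng.standard_normal((8, d_model))
f, a_hat, loss = sae_forward(a, W_enc, b_enc, W_dec, b_dec, l1=0.16)
```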

Decoder columns are kept unit-norm (so the penalty cannot be gamed by shrinking features) and silent latents are periodically resampled.
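
To see why the unit-norm constraint matters, the sketch below shrinks a code by 10x while growing the decoder by 10x: the reconstruction is unchanged but the L1 penalty falls tenfold. The helper normalise_decoder is a hypothetical name for this illustration, and dead-latent resampling is not shown.

```python
import numpy as np

def normalise_decoder(W_dec, eps=1e-8):
    """Rescale each decoder column (one per latent) to unit L2 norm."""
    norms = np.linalg.norm(W_dec, axis=0, keepdims=True)
    return W_dec / np.maximum(norms, eps)

rng = np.random.default_rng(0)
W_dec = rng.standard_normal((16, 64))          # (d_model, n_latents)
f = np.abs(rng.standard_normal((8, 64)))       # a non-negative code

# Without the constraint, the penalty can be gamed by rescaling.
recon_before = f @ W_dec.T
recon_gamed = (f / 10.0) @ (W_dec * 10.0).T    # identical reconstruction
l1_before = np.abs(f).sum()
l1_gamed = np.abs(f / 10.0).sum()              # ten times smaller penalty

# With unit-norm columns the decoder has no free scale left to absorb.
W_unit = normalise_decoder(W_dec)
```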

Results

The SAE recovers the ground-truth dictionary. Matching each learned latent to its closest true feature gives a near-diagonal cosine matrix — the latents are monosemantic. Mean Max Cosine Similarity (MMCS) reaches 0.96.
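
A minimal sketch of how such a score could be computed is given below. It assumes both dictionaries store one feature per column and uses signed cosine similarity; the name mmcs_sketch is hypothetical and is not the repository's mmcs, whose exact conventions are not shown here. The synthetic dictionary W is a permuted copy of G padded with random columns, standing in for a trained decoder.

```python
import numpy as np

def mmcs_sketch(W_learned, G_true):
    """For each true feature, take its best cosine match among learned columns; average."""
    W = W_learned / np.linalg.norm(W_learned, axis=0, keepdims=True)
    G = G_true / np.linalg.norm(G_true, axis=0, keepdims=True)
    cos = G.T @ W                      # (n_true, n_learned) cosine matrix
    return cos.max(axis=1).mean()

rng = np.random.default_rng(0)
G = rng.standard_normal((16, 32))                      # stand-in true dictionary
perm = rng.permutation(32)
extra = rng.standard_normal((16, 32))                  # stand-in unused latents
W = np.concatenate([G[:, perm], extra], axis=1)        # overcomplete, 64 columns

forward = mmcs_sketch(W, G)    # every true feature has an exact copy in W
reverse = mmcs_sketch(G, W)    # the extra columns have no exact match in G
```

Swapping the arguments averages over learned latents instead of true features, so the score drops whenever some latents match nothing.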

Individual latents are monosemantic. When each recovered latent's activation is conditioned on which true feature is present, each latent fires for exactly one feature.
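
One way to run this kind of check is to average each latent's activation over the samples where a given true feature is active. The sketch below is an assumption about how that could look, not the repository's figure code: conditional_activation is a hypothetical helper, and a permuted copy of the true features stands in for the code of a perfectly monosemantic SAE.

```python
import numpy as np

def conditional_activation(x, f):
    """cond[i, j] = mean activation of latent j over samples where true feature i is active."""
    active = (x > 0).astype(float)                  # (n_samples, n_features)
    counts = np.maximum(active.sum(axis=0), 1.0)    # guard against features that never fire
    return (active.T @ f) / counts[:, None]         # (n_features, n_latents)

rng = np.random.default_rng(0)
n_samples, n_features, density = 4096, 32, 0.05
x = (rng.random((n_samples, n_features)) < density) * rng.random((n_samples, n_features))

# Stand-in for an ideal code: latent j copies true feature perm[j].
perm = rng.permutation(n_features)
f = x[:, perm]

cond = conditional_activation(x, f)
best_feature = cond.argmax(axis=0)   # the true feature each latent responds to most
```

For a monosemantic latent, its column of cond has one large entry and near-zero entries elsewhere; the small off-peak values come from features that happen to be co-active.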

Sparsity is what buys interpretability. Sweeping the L1 coefficient traces the sparsity–fidelity trade-off; recovery is highest when the code's sparsity is driven toward the true sparsity of the data, and degrades as the code is allowed to be dense.
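
As a rough guide to what "true sparsity" means for the settings used in this README, n_features = 32 at density = 0.05 gives 32 × 0.05 = 1.6 active features per sample on average, assuming features fire independently. The sketch below checks that figure empirically; mean_l0_sketch is a hypothetical helper and not the repository's mean_l0.

```python
import numpy as np

def mean_l0_sketch(f):
    """Average number of nonzero entries per sample (the L0 of a code)."""
    return (f > 0).sum(axis=1).mean()

n_features, density = 32, 0.05
expected_true_l0 = n_features * density     # mean of independent on/off features

# Empirical check on sampled activity patterns (independence is assumed).
rng = np.random.default_rng(0)
x = (rng.random((20000, n_features)) < density).astype(float)
empirical_true_l0 = mean_l0_sketch(x)
```

Applying the same function to an SAE's latent code gives a number that can be compared against this target while sweeping the L1 coefficient.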

Install & run

git clone https://github.com/fatihcx/sparse-autoencoders.git
cd sparse-autoencoders
pip install -r requirements.txt

python -m scripts.run_experiments reference   # recovery + monosemanticity figures
python -m scripts.run_experiments pareto       # sparsity–fidelity sweep
python tests/test_sae.py                        # sanity checks

Usage

from sae import SuperpositionData, SparseAutoencoder, mmcs, mean_l0

data = SuperpositionData(n_features=32, d_model=16, density=0.05, seed=0)
sae  = SparseAutoencoder(d_model=16, n_latents=64, seed=1)
sae.fit(data.data_fn(), l1=0.16, steps=18000)

print("MMCS:", mmcs(sae.W_dec, data.G))    # alignment with ground-truth features
print("L0:",   mean_l0(sae, data.sample(4096)[0]))

Layout

sae/
  data.py       ground-truth features embedded in superposition
  model.py      sparse autoencoder: analytic gradients, Adam, resampling
  metrics.py    MMCS, L0, reconstruction error, dead fraction, matching
scripts/
  run_experiments.py   recovery, monosemanticity, and Pareto figures
tests/
  test_sae.py          shape, forward, and recovery checks

Notes on method

MMCS (mean max cosine similarity) is the standard recovery metric: for each true feature, the best cosine match among learned features, averaged.
The encoder/decoder use a shared b_dec (pre-encoder bias subtraction), as in Anthropic's formulation; its gradient therefore has both an encoder and a decoder term, derived explicitly in model.py.
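The SAE is deliberately overcomplete (n_latents > n_features), so some latents are expected to go unused. Those latents have no true feature to match, which is why the reverse-direction MMCS (averaged over learned latents rather than true features) is lower than the forward one.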
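
The two-term gradient for b_dec mentioned above can be sketched and checked numerically. This is an independent illustration under stated assumptions, not the derivation in model.py: the loss is summed over the batch, and the function names are hypothetical. b_dec enters once with a plus sign in the reconstruction (decoder term) and once with a minus sign before encoding (encoder term).

```python
import numpy as np

def loss_fn(a, W_enc, b_enc, W_dec, b_dec, l1):
    pre = (a - b_dec) @ W_enc.T + b_enc
    f = np.maximum(0.0, pre)
    a_hat = f @ W_dec.T + b_dec
    return np.sum((a - a_hat) ** 2) + l1 * np.sum(f)    # f >= 0, so |f| = f

def grad_b_dec(a, W_enc, b_enc, W_dec, b_dec, l1):
    pre = (a - b_dec) @ W_enc.T + b_enc
    f = np.maximum(0.0, pre)
    a_hat = f @ W_dec.T + b_dec
    d_a_hat = 2.0 * (a_hat - a)                     # dL/d(a_hat)
    d_pre = (d_a_hat @ W_dec + l1) * (pre > 0)      # back through decoder, L1, ReLU
    decoder_term = d_a_hat.sum(axis=0)              # b_dec added to the reconstruction
    encoder_term = -(d_pre @ W_enc).sum(axis=0)     # b_dec subtracted before encoding
    return decoder_term + encoder_term

rng = np.random.default_rng(0)
d_model, n_latents = 6, 10
W_enc = rng.standard_normal((n_latents, d_model)) * 0.3
W_dec = rng.standard_normal((d_model, n_latents))
b_enc = rng.standard_normal(n_latents) * 0.1
b_dec = rng.standard_normal(d_model) * 0.1
a = rng.standard_normal((4, d_model))
args = (a, W_enc, b_enc, W_dec)

analytic = grad_b_dec(*args, b_dec, 0.16)

# Central finite differences, one coordinate of b_dec at a time.
eps = 1e-6
numeric = np.zeros(d_model)
for j in range(d_model):
    step = np.zeros(d_model)
    step[j] = eps
    numeric[j] = (loss_fn(*args, b_dec + step, 0.16)
                  - loss_fn(*args, b_dec - step, 0.16)) / (2 * eps)
```

Dropping either term makes the analytic gradient disagree with the finite-difference estimate, which is a quick way to catch the omission.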

References

Sharkey, L., Braun, D., & Millidge, B. (2022). Taking features out of superposition with sparse autoencoders.
Bricken, T., Templeton, A., et al. (2023). Towards Monosemanticity: Decomposing Language Models with Dictionary Learning. Transformer Circuits Thread (Anthropic).
Cunningham, H., Ewart, A., et al. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models.

This is an independent educational reimplementation, not affiliated with the above.
