Sparse Autoencoders for Monosemantic Features

Neural networks store more features than they have neurons by placing them in superposition — overlapping, non-orthogonal directions. A sparse autoencoder (SAE) can pull those features back apart into a sparse, monosemantic code, and this is now the central tool of mechanistic interpretability work on real models.

This repository implements the method from scratch on a problem where the answer is known. We embed a set of ground-truth features in superposition, train an SAE on the resulting activations, and then measure how faithfully it recovers the true features. The SAE — encoder, decoder, L1 objective, analytic gradients, decoder normalisation, and dead-latent resampling — is ~90 lines of NumPy with no deep-learning dependency.

Companion to superposition-toy-models, which shows how the features end up in superposition in the first place.


The pipeline

sparse features  x        →   activations  a = G x        →   SAE          →   recovered features
(ground truth, known)         (superposition, d < n)          ReLU + L1         (compared to G)
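
The first two stages of the pipeline can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code in data.py: the function name `make_superposition_data`, the Gaussian unit-norm dictionary, and the uniform magnitudes for active features are all assumptions made for the example.

```python
import numpy as np

def make_superposition_data(n_features=32, d_model=16, density=0.05,
                            n_samples=4096, seed=0):
    """Sample sparse features x and project them to activations a = G x.

    Hypothetical helper for illustration; the repository's data.py may differ.
    """
    rng = np.random.default_rng(seed)
    # Assumed ground-truth dictionary: one unit-norm direction per feature
    # (columns of G). With d_model < n_features they cannot be orthogonal.
    G = rng.standard_normal((d_model, n_features))
    G /= np.linalg.norm(G, axis=0, keepdims=True)
    # Assumed feature distribution: each feature is active independently with
    # probability `density`, with a uniform magnitude in [0, 1) when active.
    mask = rng.random((n_samples, n_features)) < density
    x = mask * rng.random((n_samples, n_features))
    a = x @ G.T  # each row is one activation vector a = G x
    return x, a, G

x, a, G = make_superposition_data()
```

Because `G` has more columns than rows, distinct features share directions in activation space, which is the superposition the SAE is asked to undo.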

The SAE minimises reconstruction error plus an L1 penalty on its code:

f  = ReLU(W_enc (a − b_dec) + b_enc)        # sparse latent code
â  = W_dec f + b_dec                         # reconstruction
L  = ‖a − â‖²  +  λ ‖f‖₁
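
A minimal NumPy sketch of the three lines above, for a batch of activations stored as rows. The function name `sae_forward`, the weight shapes (`W_enc` as `(n_latents, d_model)`, `W_dec` as `(d_model, n_latents)`), and averaging the loss over the batch are assumptions for illustration and may not match model.py.

```python
import numpy as np

def sae_forward(a, W_enc, b_enc, W_dec, b_dec, l1):
    """Return (code f, reconstruction a_hat, mean loss) for a batch of rows a."""
    pre = (a - b_dec) @ W_enc.T + b_enc        # (batch, n_latents)
    f = np.maximum(pre, 0.0)                   # ReLU -> sparse latent code
    a_hat = f @ W_dec.T + b_dec                # (batch, d_model)
    recon = np.sum((a - a_hat) ** 2, axis=1)   # squared error per sample
    sparsity = np.sum(np.abs(f), axis=1)       # L1 norm of the code per sample
    return f, a_hat, float(np.mean(recon + l1 * sparsity))

# Toy call with random weights (sizes borrowed from the usage example).
rng = np.random.default_rng(0)
d_model, n_latents = 16, 64
a = rng.standard_normal((8, d_model))
W_enc = rng.standard_normal((n_latents, d_model)) * 0.1
W_dec = W_enc.T.copy()
b_enc, b_dec = np.zeros(n_latents), np.zeros(d_model)
f, a_hat, loss = sae_forward(a, W_enc, b_enc, W_dec, b_dec, l1=0.16)
```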

Decoder columns are kept at unit norm, so the L1 penalty cannot be gamed by shrinking the latent activations and scaling the decoder up to compensate. Latents that have stopped firing (dead latents) are periodically resampled.
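
The reason for the unit-norm constraint can be seen numerically. The sketch below (with a hypothetical helper `normalize_decoder`, not taken from model.py) shrinks a code by 10x and grows the decoder by 10x: the reconstruction is unchanged while the L1 penalty drops tenfold. Renormalising the columns removes that freedom. Resampling is not shown here.

```python
import numpy as np

def normalize_decoder(W_dec, eps=1e-8):
    """Rescale each decoder column (one per latent) to unit L2 norm."""
    norms = np.linalg.norm(W_dec, axis=0, keepdims=True)
    return W_dec / np.maximum(norms, eps)  # eps guards against all-zero columns

rng = np.random.default_rng(0)
W_dec = rng.standard_normal((16, 64))               # (d_model, n_latents)
f = np.maximum(rng.standard_normal((8, 64)), 0.0)   # a non-negative code

# Without the constraint: shrink the code by 10x, grow the decoder by 10x.
f_small, W_big = f / 10.0, W_dec * 10.0
same_recon = np.allclose(f @ W_dec.T, f_small @ W_big.T)  # reconstruction unchanged
l1_ratio = np.abs(f_small).sum() / np.abs(f).sum()        # penalty is 10x smaller

# Projecting the columns back to unit norm undoes the rescaling.
W_unit = normalize_decoder(W_big)
```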

Results

The SAE recovers the ground-truth dictionary. Matching each learned latent to its closest true feature gives a near-diagonal cosine matrix — the latents are monosemantic. Mean Max Cosine Similarity (MMCS) reaches 0.96.

[Figure: Feature recovery]
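
As a rough sketch of how such a score can be computed, the function below takes the cosine similarity between every true feature and every learned decoder column, keeps the best match per true feature, and averages. The name `mmcs_sketch` and the column-per-feature layout are assumptions for illustration; the repository's own `mmcs` in metrics.py may be implemented differently.

```python
import numpy as np

def mmcs_sketch(W_dec, G):
    """Mean over true features of the best cosine match among learned latents.

    Assumes features are stored as columns and that no column is all zeros.
    """
    W = W_dec / np.linalg.norm(W_dec, axis=0, keepdims=True)
    T = G / np.linalg.norm(G, axis=0, keepdims=True)
    cos = T.T @ W                       # (n_features, n_latents) cosine matrix
    return float(cos.max(axis=1).mean())

# A perfect dictionary scores 1 regardless of column order, column scale,
# or the presence of extra (unused) latents.
rng = np.random.default_rng(0)
G = rng.standard_normal((16, 32))
perm = rng.permutation(32)
W_perfect = np.hstack([3.0 * G[:, perm], rng.standard_normal((16, 32))])
score = mmcs_sketch(W_perfect, G)
```

The score is invariant to permutation and positive rescaling of the learned columns, which is why it suits dictionaries whose latent order is arbitrary.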

Individual latents are monosemantic. When each recovered latent's activation is conditioned on which true feature is present, each latent fires for exactly one feature.

[Figure: Monosemanticity]

Sparsity is what buys interpretability. Sweeping the L1 coefficient traces the sparsity–fidelity trade-off; recovery is highest when the code's sparsity is driven toward the true sparsity of the data, and degrades as the code is allowed to be dense.

[Figure: Sparsity–fidelity trade-off]

Install & run

git clone https://github.com/fatihcx/sparse-autoencoders.git
cd sparse-autoencoders
pip install -r requirements.txt

python -m scripts.run_experiments reference   # recovery + monosemanticity figures
python -m scripts.run_experiments pareto       # sparsity–fidelity sweep
python tests/test_sae.py                        # sanity checks

Usage

from sae import SuperpositionData, SparseAutoencoder, mmcs, mean_l0

data = SuperpositionData(n_features=32, d_model=16, density=0.05, seed=0)
sae  = SparseAutoencoder(d_model=16, n_latents=64, seed=1)
sae.fit(data.data_fn(), l1=0.16, steps=18000)

print("MMCS:", mmcs(sae.W_dec, data.G))    # alignment with ground-truth features
print("L0:",   mean_l0(sae, data.sample(4096)[0]))

Layout

sae/
  data.py       ground-truth features embedded in superposition
  model.py      sparse autoencoder: analytic gradients, Adam, resampling
  metrics.py    MMCS, L0, reconstruction error, dead fraction, matching
scripts/
  run_experiments.py   recovery, monosemanticity, and Pareto figures
tests/
  test_sae.py          shape, forward, and recovery checks

Notes on method

  • MMCS (mean max cosine similarity) is the standard recovery metric: for each true feature, the best cosine match among learned features, averaged.
  • The encoder/decoder use a shared b_dec (pre-encoder bias subtraction), as in Anthropic's formulation; its gradient therefore has both an encoder and a decoder term, derived explicitly in model.py.
  • The SAE is deliberately overcomplete (n_latents > n_features), so unused latents are expected. They are why the reverse-direction MMCS (the best true-feature match for each learned latent, averaged) is lower than the forward MMCS.
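
The two-term gradient for the shared b_dec can be checked numerically. The sketch below is an independent derivation under assumed conventions (rows as samples, loss averaged over the batch, `W_enc` of shape `(n_latents, d_model)`), not a copy of model.py. Writing r = â − a and δ = (2 W_decᵀ r + λ) masked by the active ReLU units, the decoder term is 2r and the encoder term is −W_encᵀ δ; their sum is compared against central finite differences.

```python
import numpy as np

def loss_and_grad_b_dec(a, W_enc, b_enc, W_dec, b_dec, l1):
    """Mean SAE loss over the batch and its analytic gradient w.r.t. b_dec."""
    pre = (a - b_dec) @ W_enc.T + b_enc
    f = np.maximum(pre, 0.0)
    r = f @ W_dec.T + b_dec - a                  # residual a_hat - a
    loss = np.mean(np.sum(r ** 2, axis=1) + l1 * np.sum(f, axis=1))
    # Gradient reaching the pre-activations (zero where the ReLU is off).
    delta = ((2.0 * r) @ W_dec + l1) * (pre > 0)
    decoder_term = 2.0 * r                       # b_dec is added to the output
    encoder_term = -delta @ W_enc                # b_dec is subtracted from the input
    return loss, np.mean(decoder_term + encoder_term, axis=0)

rng = np.random.default_rng(0)
d, n, B = 6, 12, 8
a = rng.standard_normal((B, d))
W_enc, W_dec = rng.standard_normal((n, d)), rng.standard_normal((d, n))
b_enc, b_dec = rng.standard_normal(n), rng.standard_normal(d)

_, grad = loss_and_grad_b_dec(a, W_enc, b_enc, W_dec, b_dec, l1=0.1)

# Central finite differences as an independent check of the analytic gradient.
eps = 1e-6
numeric = np.zeros(d)
for i in range(d):
    step = np.zeros(d)
    step[i] = eps
    lp, _ = loss_and_grad_b_dec(a, W_enc, b_enc, W_dec, b_dec + step, 0.1)
    lm, _ = loss_and_grad_b_dec(a, W_enc, b_enc, W_dec, b_dec - step, 0.1)
    numeric[i] = (lp - lm) / (2 * eps)
max_err = np.max(np.abs(grad - numeric))
```

Dropping either term makes the check fail, which is a quick way to confirm that both paths through b_dec matter.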

References

  • Sharkey, L., Braun, D., & Millidge, B. (2022). Taking features out of superposition with sparse autoencoders.
  • Bricken, T., Templeton, A., et al. (2023). Towards Monosemanticity: Decomposing Language Models with Dictionary Learning. Transformer Circuits Thread (Anthropic).
  • Cunningham, H., Ewart, A., et al. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models.

This is an independent educational reimplementation, not affiliated with the above.

License

MIT © 2026 Fatih Çelik
