This project investigates the phenomenon of attention sinks in Large Language Models (LLMs), a finding with important implications for efficient inference through Key-Value (KV) caching. Attention sinks are specific tokens (typically the initial tokens in a sequence) that consistently receive disproportionately high attention scores, serving as "attention dumps" due to the softmax operation in the attention mechanism.
In the latter part of the study, we explore the hypothesis that while tokens with low semantic content (e.g., ',' or 'the') may seem irrelevant to downstream task performance, they may be critical to the model's correct inner workings, serving as attention sinks.
Attention sink mechanism. Figure from Xiao et al., https://arxiv.org/abs/2309.17453
Sometimes, secondary sinks form at positions of tokens with low semantic meaning (shown as vertical bars).
Recent work by Xiao et al. revealed that removing the initial tokens from the KV cache causes significant performance degradation in LLMs. This occurs because these tokens serve as "attention sinks": positions that absorb attention mass which the softmax normalization requires to be allocated somewhere.
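The softmax constraint at the heart of this explanation can be illustrated in a few lines. The snippet below is a toy example in plain Python (not project code): every query's attention weights must sum to 1, so when no key is genuinely relevant, the surplus probability mass still has to land on some position, in practice often a sink.

```python
import math

# Toy illustration (plain Python, not project code): softmax forces each
# query's attention weights to sum to 1, so "unwanted" probability mass
# must still be assigned to some key position.
def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 0.1, 0.1, 0.1]   # raw query-key logits for one query
weights = softmax(scores)

# The weights always sum to 1; here most of the mass collects on the
# first position, mimicking a sink.
assert abs(sum(weights) - 1.0) < 1e-9
```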
Our research extends beyond simple reproduction to investigate:
- Practical implications: Importance in KV caching for efficient inference
- Theoretical understanding: Root causes and mechanisms behind attention sinks
- Novel insights: Secondary attention sinks, their relationship to token semantics and influence on downstream task performance
- Implementation of Sink Attention for Llama 3.2-1B: From-scratch implementation of sink attention mechanism with positional encoding shifts
- Comprehensive Attention Mechanism Comparison: Evaluation of dense, window, sliding window, and sink attention approaches
- Ablation Studies: Investigation of individual components (sink tokens vs. positional encoding shifts)
- Discovery of Secondary Sinks: Analysis of attention distribution beyond initial tokens
- Semantic Analysis: Investigation of relationship between token semantic meaning and attention sink behavior
- Llama 3.2-1B: Primary model for experiments (131K context window, scaled RoPE)
- Llama 2-7B: Secondary validation model (4K context window, RoPE)
Both models use Rotary Positional Embeddings (RoPE) and Group-Query Attention (GQA), making them compatible with sink attention implementations.
Figure from Xiao et al., https://arxiv.org/abs/2309.17453
- Dense Attention: Infinite KV cache capacity
- Window Attention: Finite cache that evicts the oldest tokens when full
- Sliding Window: No caching, recomputation over fixed window
- Sink Attention: Finite cache preserving initial tokens + positional encoding shifts
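The four cache policies differ mainly in which positions survive once capacity is exceeded. Here is a minimal sketch of that difference (function names are illustrative, not taken from the project code), contrasting a plain window cache with a sink cache:

```python
# Hypothetical sketch (names are illustrative, not from the project code):
# which cached positions survive under each policy once capacity is exceeded.

def window_cache(seq_len, capacity):
    """Window attention: keep only the most recent `capacity` positions."""
    return list(range(max(0, seq_len - capacity), seq_len))

def sink_cache(seq_len, capacity, num_sinks=4):
    """Sink attention: always keep the first `num_sinks` positions, then fill
    the remaining budget with the most recent tokens."""
    if seq_len <= capacity:
        return list(range(seq_len))
    recent = list(range(seq_len - (capacity - num_sinks), seq_len))
    return list(range(num_sinks)) + recent

print(window_cache(10, 6))  # [4, 5, 6, 7, 8, 9] -- the sinks are gone
print(sink_cache(10, 6))    # [0, 1, 2, 3, 8, 9] -- sinks preserved
```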
- Sink attention significantly outperforms window attention when cache capacity is exceeded
- Optimal sink count: 4 sink tokens provide the best performance balance
- Context extension: successfully enables generation beyond the model's training context window
Sink Attention allows infinite sequence decoding...
- Both components crucial: Sink tokens AND positional encoding shifts are necessary for optimal performance
- Position vs. Token ID independence: Neither positional encodings nor specific token IDs determine sink behavior
- Training methodology influence: Attention sinks emerge from pre-training patterns rather than architectural constraints
Replacing the `<bos>` token does not alter the emergence of attention sinks
- Massive activations discovered: Up to 50% of attention is "wasted" across heads and layers
- Token-type correlation: Secondary sinks often occur at low-semantic tokens (punctuation, articles)
- Performance impact: Contrary to hypothesis, removing random tokens hurts performance more than removing high-attention low-semantic tokens
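The "wasted" attention figure can be estimated directly from a model's attention maps. The sketch below (illustrative, not the project's code) computes the mean attention mass that a set of sink positions absorbs across queries:

```python
# Illustrative sketch (not the project's code): estimate the fraction of
# attention "wasted" on sink positions from a per-head attention map.

def sink_attention_fraction(attn, sink_positions):
    """attn: list of rows, each a probability distribution over key positions.
    Returns the mean attention mass falling on `sink_positions`."""
    fractions = [
        sum(row[p] for p in sink_positions if p < len(row)) for row in attn
    ]
    return sum(fractions) / len(fractions)

# Toy 3-query attention map where most mass lands on position 0.
attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.5, 0.2, 0.2, 0.1],
    [0.9, 0.05, 0.03, 0.02],
]
print(sink_attention_fraction(attn, [0]))  # ~0.7 of all attention is "wasted"
```

Averaging this quantity per layer is what produces the layer-wise differences described above.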
Attention maps across all layers display the emergence of attention sinks.
Studying the proportions of "wasted" attention in each layer shows differences across the model.
The hypothesis does not seem to hold in the evaluated models and scenarios.
Evaluation on reasoning tasks shows nuanced effects of token manipulation:
| Condition | Original | Removed 10 Low-Semantic | Removed 10 Random | Added 10 Low-Semantic |
|---|---|---|---|---|
| Overall Accuracy | 0.3675 | 0.3366 | 0.2735 | 0.3373 |
| Humanities | 0.3507 | 0.3282 | 0.2935 | 0.3277 |
| STEM | 0.3203 | 0.2994 | 0.2429 | 0.2892 |
| Social Sciences | 0.3988 | 0.3536 | 0.2873 | 0.3578 |
| Other | 0.4100 | 0.3705 | 0.2607 | 0.3801 |
Key Insight: Random token removal impacts performance more severely than removing low-semantic tokens, challenging initial hypotheses about the importance of secondary attention sinks.
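This insight can be checked arithmetically against the overall-accuracy row of the table:

```python
# Sanity check on the overall-accuracy row above (values copied from the table).
original = 0.3675
removed_low_semantic = 0.3366
removed_random = 0.2735

drop_low = original - removed_low_semantic  # about 0.031
drop_random = original - removed_random     # about 0.094

# Removing random tokens costs roughly three times more accuracy than
# removing high-attention, low-semantic tokens.
assert drop_random > drop_low
```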
The implementation modifies the standard Llama attention mechanism to:
- Store keys without positional encodings in the cache
- Recompute RoPE for the entire sequence during each forward pass
- Maintain initial tokens while evicting middle tokens when cache capacity is exceeded
- Shift positional encodings to simulate continuous sequence positioning
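The positional-shift idea can be summarized in a short sketch (illustrative, not the actual implementation): after eviction, the kept entries are re-encoded with RoPE at their *cache* index rather than their original sequence position, so the model always sees a contiguous range of positions.

```python
# Sketch of the positional-shift idea (illustrative, not the project's
# implementation): cached entries are re-encoded at their cache index,
# not their original sequence position, so RoPE sees a contiguous range.

def cache_positions(original_positions, capacity, num_sinks=4):
    """Return (kept_original_positions, rope_positions_used_for_them)."""
    n = len(original_positions)
    if n <= capacity:
        kept = original_positions
    else:
        kept = (original_positions[:num_sinks]
                + original_positions[-(capacity - num_sinks):])
    # RoPE is applied as if the kept tokens sat at positions 0..len(kept)-1.
    return kept, list(range(len(kept)))

kept, rope_pos = cache_positions(list(range(12)), capacity=8)
print(kept)      # [0, 1, 2, 3, 8, 9, 10, 11]
print(rope_pos)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Storing keys without positional encodings is what makes this re-encoding possible: positions can be reassigned on every forward pass without corrupting the cached values.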
- Custom attention forward pass for Llama architecture
- Integration with transformers library caching system
- Support for variable sink counts and cache sizes
- Evaluation framework for multiple attention mechanisms
├── main.ipynb # Complete experimental notebook
├── requirements.txt # Python dependencies
├── results.txt # MMLU benchmark results
├── img/ # Visualization assets
└── README.md # This file
# Clone the repository
git clone git@github.com:TheRootOf3/understanding-attention-sinks.git
cd understanding-attention-sinks
# Install dependencies
pip install -r requirements.txt

Download the required models:
- Llama 3.2-1B: Available through Hugging Face
- Llama 2-7B: Available through Hugging Face (optional, for validation)
Open and execute main.ipynb in Jupyter Lab/Notebook. The notebook is structured in four main sections:
- Preliminaries: Model loading and KV caching demonstration
- Experiment 1: Attention mechanism comparison and sink count optimization
- Experiment 2: Ablation studies on sink attention components
- Experiment 3: Secondary attention sinks and semantic analysis
- Scaling Studies: Extend analysis to larger models (7B+, 70B+ parameters)
- Vision Transformers: Investigate attention sinks in computer vision models
- Edge Deployment: Optimize sink attention for resource-constrained environments
- Framework Integration: Improve sink attention support in popular ML frameworks
- Mechanistic Understanding: Deeper investigation into why certain tokens become sinks
- Primary Reference: Efficient Streaming Language Models with Attention Sinks - Xiao et al.
- Massive Activations: Massive Activations in Large Language Models - Sun et al.
- Empirical Analysis: When Attention Sink Emerges in Language Models: An Empirical View - Gu et al.
- Attention Mechanics: Attention is Off By One - Evan Miller
If you use this study in your work, please cite:
@misc{understanding-attention-sinks,
title={Understanding Attention Sinks in Large Language Models: An Empirical Investigation},
author={Andrzej Szablewski},
year={2024},
url={https://github.com/TheRootOf3/understanding-attention-sinks}
}

This project is released under the MIT License. See the LICENSE file for details.
For questions or collaboration opportunities, please open an issue or contact as3623@cam.ac.uk.
This project was completed as part of the L46 course requirements of the Cambridge MPhil in ACS, investigating fundamental mechanisms in transformer-based language models.