This project investigates the phenomenon of attention sinks in Large Language Models (LLMs), a finding with important implications for efficient inference through Key-Value (KV) caching. Attention sinks are specific tokens (typically the initial tokens in a sequence) that consistently receive disproportionately high attention scores, serving as "attention dumps" due to the softmax operation in the attention mechanism.
In the latter part of the study, we explore the hypothesis that while tokens with low semantic content (e.g., ',' or 'the') may seem irrelevant to downstream task performance, they may be critical to the model's correct inner workings, serving as attention sinks.
Attention sink mechanism. Figure from Xiao et al., https://arxiv.org/abs/2309.17453
Sometimes, secondary sinks form at positions of tokens with low semantic meaning (shown as vertical bars).
Recent work by Xiao et al. revealed that removing the initial tokens from the KV cache causes significant performance degradation in LLMs. This occurs because these tokens serve as "attention sinks": positions that absorb attention mass which the softmax normalization requires to be allocated somewhere.
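The softmax constraint at the heart of this explanation can be illustrated in a few lines. The snippet below is a toy example in plain Python (not project code): every query's attention weights must sum to 1, so when no key is genuinely relevant, the surplus probability mass still has to land on some position, in practice often a sink.

```python
import math

# Toy illustration (plain Python, not project code): softmax forces each
# query's attention weights to sum to 1, so "unwanted" probability mass
# must still be assigned to some key position.
def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 0.1, 0.1, 0.1]   # raw query-key logits for one query
weights = softmax(scores)

# The weights always sum to 1; here most of the mass collects on the
# first position, mimicking a sink.
assert abs(sum(weights) - 1.0) < 1e-9
```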
Our research extends beyond simple reproduction to investigate:
- Practical implications: Importance in KV caching for efficient inference
- Theoretical understanding: Root causes and mechanisms behind attention sinks
- Novel insights: Secondary attention sinks, their relationship to token semantics and influence on downstream task performance
- Implementation of Sink Attention for Llama 3.2-1B: From-scratch implementation of sink attention mechanism with positional encoding shifts
- Comprehensive Attention Mechanism Comparison: Evaluation of dense, window, sliding window, and sink attention approaches
- Ablation Studies: Investigation of individual components (sink tokens vs. positional encoding shifts)
- Discovery of Secondary Sinks: Analysis of attention distribution beyond initial tokens
- Semantic Analysis: Investigation of relationship between token semantic meaning and attention sink behavior
- Llama 3.2-1B: Primary model for experiments (131K context window, scaled RoPE)
- Llama 2-7B: Secondary validation model (4K context window, RoPE)
Both models use Rotary Positional Embeddings (RoPE) and Group-Query Attention (GQA), making them compatible with sink attention implementations.
Figure from Xiao et al., https://arxiv.org/abs/2309.17453
- Dense Attention: Infinite KV cache capacity
- Window Attention: Finite cache that evicts the oldest tokens when full
- Sliding Window: No caching, recomputation over fixed window
- Sink Attention: Finite cache preserving initial tokens + positional encoding shifts
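The four cache policies differ mainly in which positions survive once capacity is exceeded. Here is a minimal sketch of that difference (function names are illustrative, not taken from the project code), contrasting a plain window cache with a sink cache:

```python
# Hypothetical sketch (names are illustrative, not from the project code):
# which cached positions survive under each policy once capacity is exceeded.

def window_cache(seq_len, capacity):
    """Window attention: keep only the most recent `capacity` positions."""
    return list(range(max(0, seq_len - capacity), seq_len))

def sink_cache(seq_len, capacity, num_sinks=4):
    """Sink attention: always keep the first `num_sinks` positions, then fill
    the remaining budget with the most recent tokens."""
    if seq_len <= capacity:
        return list(range(seq_len))
    recent = list(range(seq_len - (capacity - num_sinks), seq_len))
    return list(range(num_sinks)) + recent

print(window_cache(10, 6))  # [4, 5, 6, 7, 8, 9] -- the sinks are gone
print(sink_cache(10, 6))    # [0, 1, 2, 3, 8, 9] -- sinks preserved
```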
- Sink attention significantly outperforms window attention when cache capacity is exceeded
- Optimal sink count: 4 sink tokens provide the best performance balance
- Context extension: successfully enables generation beyond the model's training context window
Sink Attention allows infinite sequence decoding...
- Both components crucial: Sink tokens AND positional encoding shifts are necessary for optimal performance
- Position vs. Token ID independence: Neither positional encodings nor specific token IDs determine sink behavior
- Training methodology influence: Attention sinks emerge from pre-training patterns rather than architectural constraints
Replacing the `<bos>` token does not alter the emergence of attention sinks
- Massive activations discovered: Up to 50% of attention is "wasted" across heads and layers
- Token-type correlation: Secondary sinks often occur at low-semantic tokens (punctuation, articles)
- Performance impact: Contrary to hypothesis, removing random tokens hurts performance more than removing high-attention low-semantic tokens
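The "wasted" attention figure can be estimated directly from a model's attention maps. The sketch below (illustrative, not the project's code) computes the mean attention mass that a set of sink positions absorbs across queries:

```python
# Illustrative sketch (not the project's code): estimate the fraction of
# attention "wasted" on sink positions from a per-head attention map.

def sink_attention_fraction(attn, sink_positions):
    """attn: list of rows, each a probability distribution over key positions.
    Returns the mean attention mass falling on `sink_positions`."""
    fractions = [
        sum(row[p] for p in sink_positions if p < len(row)) for row in attn
    ]
    return sum(fractions) / len(fractions)

# Toy 3-query attention map where most mass lands on position 0.
attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.5, 0.2, 0.2, 0.1],
    [0.9, 0.05, 0.03, 0.02],
]
print(sink_attention_fraction(attn, [0]))  # ~0.7 of all attention is "wasted"
```

Averaging this quantity per layer is what produces the layer-wise differences described above.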
Attention maps across all layers display the emergence of attention sinks.
Studying the proportions of "wasted" attention in each layer shows differences across the model.
The hypothesis does not seem to hold in the evaluated models and scenarios.
Evaluation on reasoning tasks shows nuanced effects of token manipulation:
| Condition | Original | Removed 10 Low-Semantic | Removed 10 Random | Added 10 Low-Semantic |
|---|---|---|---|---|
| Overall Accuracy | 0.3675 | 0.3366 | 0.2735 | 0.3373 |
| Humanities | 0.3507 | 0.3282 | 0.2935 | 0.3277 |
| STEM | 0.3203 | 0.2994 | 0.2429 | 0.2892 |
| Social Sciences | 0.3988 | 0.3536 | 0.2873 | 0.3578 |
| Other | 0.4100 | 0.3705 | 0.2607 | 0.3801 |
Key Insight: Random token removal impacts performance more severely than removing low-semantic tokens, challenging initial hypotheses about the importance of secondary attention sinks.
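This insight can be checked arithmetically against the overall-accuracy row of the table:

```python
# Sanity check on the overall-accuracy row above (values copied from the table).
original = 0.3675
removed_low_semantic = 0.3366
removed_random = 0.2735

drop_low = original - removed_low_semantic  # about 0.031
drop_random = original - removed_random     # about 0.094

# Removing random tokens costs roughly three times more accuracy than
# removing high-attention, low-semantic tokens.
assert drop_random > drop_low
```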
The implementation modifies the standard Llama attention mechanism to:
- Store keys without positional encodings in the cache
- Recompute RoPE for the entire sequence during each forward pass
- Maintain initial tokens while evicting middle tokens when cache capacity is exceeded
- Shift positional encodings to simulate continuous sequence positioning
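The positional-shift idea can be summarized in a short sketch (illustrative, not the actual implementation): after eviction, the kept entries are re-encoded with RoPE at their *cache* index rather than their original sequence position, so the model always sees a contiguous range of positions.

```python
# Sketch of the positional-shift idea (illustrative, not the project's
# implementation): cached entries are re-encoded at their cache index,
# not their original sequence position, so RoPE sees a contiguous range.

def cache_positions(original_positions, capacity, num_sinks=4):
    """Return (kept_original_positions, rope_positions_used_for_them)."""
    n = len(original_positions)
    if n <= capacity:
        kept = original_positions
    else:
        kept = (original_positions[:num_sinks]
                + original_positions[-(capacity - num_sinks):])
    # RoPE is applied as if the kept tokens sat at positions 0..len(kept)-1.
    return kept, list(range(len(kept)))

kept, rope_pos = cache_positions(list(range(12)), capacity=8)
print(kept)      # [0, 1, 2, 3, 8, 9, 10, 11]
print(rope_pos)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Storing keys without positional encodings is what makes this re-encoding possible: positions can be reassigned on every forward pass without corrupting the cached values.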
- Custom attention forward pass for Llama architecture
- Integration with transformers library caching system
- Support for variable sink counts and cache sizes
- Evaluation framework for multiple attention mechanisms
├── main.ipynb # Complete experimental notebook
├── requirements.txt # Python dependencies
├── results.txt # MMLU benchmark results
├── img/ # Visualization assets
└── README.md # This file
# Clone the repository
git clone git@github.com:TheRootOf3/understanding-attention-sinks.git
cd understanding-attention-sinks
# Install dependencies
pip install -r requirements.txt

Download the required models:
- Llama 3.2-1B: Available through Hugging Face
- Llama 2-7B: Available through Hugging Face (optional, for validation)
Open and execute main.ipynb in Jupyter Lab/Notebook. The notebook is structured in four main sections:
- Preliminaries: Model loading and KV caching demonstration
- Experiment 1: Attention mechanism comparison and sink count optimization
- Experiment 2: Ablation studies on sink attention components
- Experiment 3: Secondary attention sinks and semantic analysis
- Scaling Studies: Extend analysis to larger models (7B+, 70B+ parameters)
- Vision Transformers: Investigate attention sinks in computer vision models
- Edge Deployment: Optimize sink attention for resource-constrained environments
- Framework Integration: Improve sink attention support in popular ML frameworks
- Mechanistic Understanding: Deeper investigation into why certain tokens become sinks
- Primary Reference: Efficient Streaming Language Models with Attention Sinks - Xiao et al.
- Massive Activations: Massive Activations in Large Language Models - Sun et al.
- Empirical Analysis: When Attention Sink Emerges in Language Models: An Empirical View - Gu et al.
- Attention Mechanics: Attention is Off By One - Evan Miller
If you use this study in your work, please cite:
@misc{understanding-attention-sinks,
title={Understanding Attention Sinks in Large Language Models: An Empirical Investigation},
author={Andrzej Szablewski},
year={2024},
url={https://github.com/TheRootOf3/understanding-attention-sinks}
}

This project is released under the MIT License. See the LICENSE file for details.
For questions or collaboration opportunities, please open an issue or contact as3623@cam.ac.uk.
This project was completed as part of the L46 course requirements of the Cambridge MPhil in ACS, investigating fundamental mechanisms in transformer-based language models.