| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: Can Mamba Learn, Unlearn, and Retain Noise? |
| 4 | +date: 2026-02-12 00:00 |
| 5 | +description: "Extending the SLM Noise Study to State Space Models — Mamba 1.4B vs Transformers across 4 noise types" |
| 6 | +tags: ai, mamba, ssm, transformers, noise, unlearning |
| 7 | +categories: ai |
| 8 | +giscus_comments: true |
| 9 | +--- |
| 10 | + |
| 11 | +# Can Mamba Learn, Unlearn, and Retain Noise? Extending the SLM Noise Study to State Space Models |
| 12 | + |
| 13 | +*Part of my research into SSM architectures — documenting experiments extending transformer noise robustness studies to Mamba* |
| 14 | + |
| 15 | +--- |
| 16 | + |
| 17 | +## The Question That Started This |
| 18 | + |
| 19 | +Can a state space model handle noise the same way a transformer does? |
| 20 | + |
| 21 | +I'd been reading [*Can Small Language Models Learn, Unlearn, and Retain Noise Patterns?*](https://arxiv.org/abs/2407.00996) — a paper that puts four instruction-tuned transformers (Olmo, Qwen, Gemma, Phi2) through a three-phase stress test: finetune on clean QA data, train on noisy data, then retrain on clean data. The results are clean — transformers absorb noise, and clean retraining mostly undoes the damage. |
| 22 | + |
| 23 | +But nobody tested SSMs. Mamba processes sequences through a compressed recurrent state instead of attention. No key-value lookups, no position-independent token access. I wanted to know: does that fundamentally change how noise gets absorbed and released? |
| 24 | + |
| 25 | +**I expected Mamba to behave roughly like the transformers, maybe with some quantitative differences.** I was wrong in interesting ways. |
| 26 | + |
| 27 | +--- |
| 28 | + |
| 29 | +## The Setup |
| 30 | + |
| 31 | +### Three-Phase Pipeline |
| 32 | + |
| 33 | +Same protocol as the paper: |
| 34 | + |
| 35 | +1. **Phase 1 (Finetune):** Train on clean SQuAD data to establish baseline QA accuracy |
| 36 | +2. **Phase 2 (Noise Train):** Continue training on noise-corrupted answers |
| 37 | +3. **Phase 3 (Unlearn):** Retrain on clean data — see how much recovers |
| 38 | + |
| 39 | +### Four Noise Types |
| 40 | + |
| 41 | +Each targets a different linguistic level: |
| 42 | + |
| 43 | +- **Charflip:** Random character substitutions ("Paris" → "Pxris") |
| 44 | +- **Wordflip:** Word permutations ("the capital city" → "city the capital") |
| 45 | +- **Transliteration:** Roman → Devanagari script conversion |
| 46 | +- **Counterfactual:** Replace correct answers with plausible wrong ones |
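The two purely mechanical noise types above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper or in these experiments: the function names, the `rate` parameter, and the choice of lowercase replacement letters are all assumptions. Transliteration and counterfactual noise need external resources (a script converter, a source of plausible wrong answers) and are left out.

```python
import random


def charflip(text: str, rate: float, rng: random.Random) -> str:
    """Substitute each alphabetic character with a random lowercase letter
    with probability `rate` (hypothetical parameter, not from the paper)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            # The random letter may coincide with the original character.
            out.append(rng.choice(letters))
        else:
            out.append(ch)
    return "".join(out)


def wordflip(text: str, rng: random.Random) -> str:
    """Return the whitespace-separated words of `text` in shuffled order."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


rng = random.Random(0)  # seeded so the corruption is reproducible
noisy_chars = charflip("Paris", rate=0.2, rng=rng)
noisy_words = wordflip("the capital city", rng=rng)
```

Both functions keep the answer length (in characters or in words) unchanged, so the corrupted answers stay superficially similar to the clean ones.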
| 47 | + |
| 48 | +### Training Config |
| 49 | + |
| 50 | +Matched the paper's Appendix C: LR 3e-6, AdamW (betas 0.9/0.95), cosine schedule, 100 warmup steps, bf16, 5 epochs per phase. Batch size 2 (paper doesn't specify theirs). |
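For readers who want to see what a schedule like this looks like, here is a small sketch of the learning rate at each step. It assumes linear warmup from zero and cosine decay to zero, which the config line above does not spell out, and `total_steps` is a hypothetical value since it depends on dataset size.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 3e-6, warmup_steps: int = 100) -> float:
    """Learning rate under linear warmup followed by cosine decay to zero.

    The warmup shape and the final value of zero are assumptions.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup schedule that has elapsed, capped at 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Hypothetical run length of 1000 optimizer steps.
schedule = [lr_at_step(s, total_steps=1000) for s in range(1001)]
```

The peak of 3e-6 is reached at step 100, after which the rate decays smoothly for the rest of the phase.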
| 51 | + |
| 52 | +--- |
| 53 | + |
| 54 | +## Where My Setup Differs From the Paper |
| 55 | + |
| 56 | +I want to be upfront about this because it affects everything. |
| 57 | + |
| 58 | +**Evaluation method.** The paper uses Gemma-based LLM judging, which is more lenient with paraphrased answers. I use fuzzy string matching, which is stricter, so Mamba's accuracy numbers are likely deflated relative to the paper's transformer numbers. |
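To make the strictness concrete, here is a minimal sketch of a fuzzy matcher built on the standard library's `difflib.SequenceMatcher`. The normalization step and the 0.8 threshold are assumptions chosen for illustration; the actual matcher and threshold used for these results may differ.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def fuzzy_match(prediction: str, reference: str, threshold: float = 0.8) -> bool:
    """Count a prediction as correct if its similarity ratio to the
    reference meets `threshold` (0.8 is a hypothetical choice)."""
    ratio = SequenceMatcher(None, normalize(prediction), normalize(reference)).ratio()
    return ratio >= threshold


exact = fuzzy_match("  PARIS ", "Paris")                      # same answer, different casing
paraphrase = fuzzy_match("the capital of France", "Paris")    # correct idea, different string
```

A matcher like this accepts casing and spacing differences but rejects a paraphrase that an LLM judge would probably accept, which is one way accuracy ends up lower under string matching.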
| 59 | + |
| 60 | +**Base model vs. instruction-tuned.** This is the big one. Mamba-1.4B is a *base* language model. No instruction tuning. The paper's models are all instruction-tuned — they already know how to follow instructions and format answers before the experiment even starts. Mamba has to learn all of that from scratch during Phase 1. This is why Mamba's baseline is **29.4%** while the transformers sit at **72–96%**. |
| 61 | + |
| 62 | +**Architecture.** Mamba 1.4B sits between Olmo 1B and Qwen 1.8B in parameter count, but SSM vs Transformer makes direct size comparison meaningless. Different architectures, different information processing. |
| 63 | + |
| 64 | +--- |
| 65 | + |
| 66 | +## Results |
| 67 | + |
| 68 | +### The Full Picture |
| 69 | + |
| 70 | +| Model | Arch | Charflip P1 | Charflip P2 | Charflip P3 | Wordflip P1 | Wordflip P2 | Wordflip P3 | Translit P1 | Translit P2 | Translit P3 | Counter P1 | Counter P2 | Counter P3 | |
| 71 | +|-------|------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|------------|------------|------------| |
| 72 | +| Olmo 1B | Transformer | 72.2 | 2.7 | 65.2 | 72.2 | 45.0 | 70.4 | 72.2 | 67.6 | 73.1 | 39.8 | 32.6 | 41.4 | |
| 73 | +| Qwen 1.8B | Transformer | 82.3 | 2.5 | 79.7 | 82.3 | 57.7 | 81.9 | 82.3 | 81.1 | 82.7 | 66.5 | 54.2 | 65.7 | |
| 74 | +| Gemma 2B | Transformer | 89.1 | 3.8 | 85.9 | 89.1 | 64.1 | 82.8 | 89.1 | 85.0 | 87.5 | 49.6 | 40.8 | 49.0 | |
| 75 | +| Phi2 2.7B | Transformer | 95.7 | 0.5 | 90.7 | 95.7 | 69.7 | 93.1 | 95.7 | 93.2 | 93.6 | 66.5 | 57.5 | 69.5 | |
| 76 | +| **Mamba 1.4B** | **SSM** | **29.4** | **1.6** | **37.0** | **29.4** | **8.5** | **36.3** | **29.4** | **26.3** | **29.8** | **29.4** | **37.6** | **39.2** | |
| 77 | + |
| 78 | +All values are D_ad_train accuracy (%). |
| 79 | + |
| 80 | +Here's what Mamba's journey looks like in isolation — the three phases across all four noise types: |
| 81 | + |
| 82 | +{% include figure.liquid path="assets/img/blog_embeds/mamba-noise-learning-unlearning-2.png" title="Mamba 1.4B: Accuracy Across Phases" class="img-fluid rounded z-depth-1" %} |
| 83 | + |
| 84 | +*The counterfactual green bar exceeding Phase 1 blue was the first sign something unexpected was happening.* |
| 85 | + |
| 86 | +### Noise Absorption (Relative Accuracy Drop in Phase 2) |
| 87 | + |
| 88 | +How much did accuracy drop when noise was introduced? Higher = more absorbed. |
| 89 | + |
| 90 | +| Model | Charflip | Wordflip | Transliteration | Counterfactual | |
| 91 | +|-------|----------|----------|-----------------|----------------| |
| 92 | +| Olmo 1B | 96.3% | 37.7% | 6.4% | 18.1% | |
| 93 | +| Qwen 1.8B | 97.0% | 29.9% | 1.5% | 18.5% | |
| 94 | +| Gemma 2B | 95.7% | 28.1% | 4.6% | 17.7% | |
| 95 | +| Phi2 2.7B | 99.5% | 27.2% | 2.6% | 13.5% | |
| 96 | +| **Mamba 1.4B** | **94.6%** | **71.1%** | **10.5%** | **-27.9%** | |
| 97 | + |
| 98 | +That negative counterfactual value? Mamba's accuracy *went up* during noise training. More on that below. |
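The absorption numbers can be reproduced from the phase accuracies in the full table. A short sketch using Mamba's row, where the function name `relative_drop` is just a label chosen here:

```python
def relative_drop(p1: float, p2: float) -> float:
    """Relative accuracy drop from Phase 1 to Phase 2, in percent.

    Negative values mean accuracy rose during noise training.
    """
    return (p1 - p2) / p1 * 100.0


mamba_p1 = 29.4
mamba_p2 = {
    "charflip": 1.6,
    "wordflip": 8.5,
    "transliteration": 26.3,
    "counterfactual": 37.6,
}
absorption = {noise: round(relative_drop(mamba_p1, p2), 1)
              for noise, p2 in mamba_p2.items()}
```

Because the drop is divided by the Phase 1 accuracy, a model with a low baseline can show a large relative drop from a small absolute change.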
| 99 | + |
| 100 | +{% include figure.liquid path="assets/img/blog_embeds/mamba-noise-learning-unlearning-3.png" title="Noise Absorption: Relative Accuracy Drop in Phase 2" class="img-fluid rounded z-depth-1" %} |
| 101 | + |
| 102 | +*Mamba's wordflip bar towers over the transformers. The counterfactual bar going negative is unique to Mamba — no transformer saw accuracy increase during noise training.* |
| 103 | + |
| 104 | +### Unlearning Recovery (Phase 3 as % of Phase 1) |
| 105 | + |
| 106 | +How much accuracy was recovered after clean retraining? |
| 107 | + |
| 108 | +| Model | Charflip | Wordflip | Transliteration | Counterfactual | |
| 109 | +|-------|----------|----------|-----------------|----------------| |
| 110 | +| Olmo 1B | 90.3% | 97.5% | 101.2% | 104.0% | |
| 111 | +| Qwen 1.8B | 96.8% | 99.5% | 100.5% | 98.8% | |
| 112 | +| Gemma 2B | 96.4% | 92.9% | 98.2% | 98.8% | |
| 113 | +| Phi2 2.7B | 94.8% | 97.3% | 97.8% | 104.5% | |
| 114 | +| **Mamba 1.4B** | **125.9%** | **123.5%** | **101.4%** | **133.3%** | |
| 115 | + |
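| 116 | +Values >100% mean Phase 3 *exceeded* Phase 1. Mamba does this on all four noise types, by up to 33.3%. The transformers cross 100% only occasionally (Olmo and Qwen on transliteration, Olmo and Phi2 on counterfactual), and never by more than 4.5%. |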
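Recovery is the ratio of Phase 3 to Phase 1 accuracy. As a quick check against the full table, again using Mamba's row (the function name `recovery` is chosen here for illustration):

```python
def recovery(p1: float, p3: float) -> float:
    """Phase 3 accuracy as a percentage of Phase 1 accuracy."""
    return p3 / p1 * 100.0


mamba_p1 = 29.4
mamba_p3 = {
    "charflip": 37.0,
    "wordflip": 36.3,
    "transliteration": 29.8,
    "counterfactual": 39.2,
}
mamba_recovery = {noise: round(recovery(mamba_p1, p3), 1)
                  for noise, p3 in mamba_p3.items()}
overshoots = [noise for noise, value in mamba_recovery.items() if value > 100.0]
```

Subtracting 100 from each value gives the relative gain over the Phase 1 baseline.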
| 117 | + |
| 118 | +{% include figure.liquid path="assets/img/blog_embeds/mamba-noise-learning-unlearning-4.png" title="Unlearning Recovery: Phase 3 Accuracy as % of Phase 1" class="img-fluid rounded z-depth-1" %} |
| 119 | + |
| 120 | +*The dashed 100% baseline tells the story — transformers cluster around it, Mamba shoots past it.* |
| 121 | + |
| 122 | +--- |
| 123 | + |
| 124 | +## What I Found |
| 125 | + |
| 126 | +Before diving into per-noise-type analysis, here's the big picture — Mamba's trajectory (orange) vs the transformer range (gray band) across all phases: |
| 127 | + |
| 128 | +{% include figure.liquid path="assets/img/blog_embeds/mamba-noise-learning-unlearning-5.png" title="Phase Trajectory: Mamba 1.4B vs Transformer Range" class="img-fluid rounded z-depth-1" %} |
| 129 | + |
| 130 | +*The absolute gap is the base-vs-instruction-tuned difference. The interesting part is the shape: Mamba's V-shape on charflip and wordflip is steeper, and on counterfactual it trends upward while transformers dip slightly and recover.* |
| 131 | + |
| 132 | +### Charflip: Same Absorption, Completely Different Failure |
| 133 | + |
| 134 | +I expected Mamba to absorb charflip similarly to transformers. It did: a **94.6%** relative drop, just below the transformer range (95.7–99.5%). |
| 135 | + |
| 136 | +But the *way* it failed was nothing like transformers. |
| 137 | + |
| 138 | +After charflip training, transformers produce char-flipped text that resembles the noisy training data. Mamba instead produced degenerate repetitive outputs — strings like `.sagitare .sagitare .sagitare...` over and over. |
| 139 | + |
| 140 | +**It didn't learn the noise pattern. It lost the ability to generate coherent text entirely.** |
| 141 | + |
| 142 | +This suggests character-level corruption doesn't teach Mamba a new pattern — it destroys the recurrent state's ability to maintain coherent generation. The SSM's compressed hidden state might be more fragile to character-level noise than attention. |
| 143 | + |
| 144 | +### Wordflip: Mamba Is Way More Vulnerable |
| 145 | + |
| 146 | +This was the starkest difference. |
| 147 | + |
| 148 | +``` |
| 149 | +Wordflip absorption (relative accuracy drop): |
| 150 | +Transformers: 27-38% |
| 151 | +Mamba: 71.1% |
| 152 | +``` |
| 153 | + |
| 154 | +**Nearly double the most-affected transformer** (Olmo, 37.7%), and more than 2.5× the least-affected (Phi2, 27.2%). Word-level permutation hits Mamba much harder than any transformer. |
| 155 | + |
| 156 | +This makes architectural sense. Transformers can attend to any token regardless of position — word reordering doesn't fundamentally change what attention can access. SSMs process left-to-right through a compressed state. Word reordering changes what information is available in the hidden state at each step. The sequential compression is more sensitive to word order than parallel attention. |
| 157 | + |
| 158 | +### Transliteration: Nobody Cares |
| 159 | + |
| 160 | +Both architectures mostly shrug this off. |
| 161 | + |
| 162 | +``` |
| 163 | +Transliteration absorption: |
| 164 | +Mamba: 10.5% (29.4 → 26.3) |
| 165 | +Transformers: 1.5-6.4% |
| 166 | +``` |
| 167 | + |
| 168 | +Script changes don't meaningfully disrupt either architecture. The slightly larger Mamba drop could be run-to-run variance, or could reflect the base model's weaker grasp of answer formatting. Either way, there is not much to see here. |
| 169 | + |
| 170 | +### Counterfactual: This One Surprised Me |
| 171 | + |
| 172 | +I expected counterfactual noise to hurt Mamba like it hurts transformers. Instead: |
| 173 | + |
| 174 | +``` |
| 175 | +Mamba counterfactual trajectory: |
| 176 | +Phase 1: 29.4% → Phase 2: 37.6% → Phase 3: 39.2% |
| 177 | + |
| 178 | +Accuracy went UP through the entire experiment. |
| 179 | +``` |
| 180 | + |
| 181 | +**The counterfactual data wasn't noise for Mamba. It was a free tutorial.** |
| 182 | + |
| 183 | +The "wrong" answers are syntactically and semantically well-formed — they're just factually incorrect. For instruction-tuned transformers that already know how to answer questions, this is pure corruption. For a base model still learning how to do QA at all, these well-formed answers teach answer formatting and extraction patterns even though the content is wrong. |
| 184 | + |
| 185 | +This was the result I least expected. It reframes what "noise" even means — it's relative to what the model already knows. |
| 186 | + |
| 187 | +### Unlearning: Mamba Overshoots Every Time |
| 188 | + |
| 189 | +The most consistent pattern across all noise types: |
| 190 | + |
| 191 | +``` |
| 192 | +Phase 3 vs Phase 1 (Mamba): |
| 193 | +Charflip: 37.0% vs 29.4% (+25.9%) |
| 194 | +Wordflip: 36.3% vs 29.4% (+23.5%) |
| 195 | +Transliteration: 29.8% vs 29.4% (+1.4%) |
| 196 | +Counterfactual: 39.2% vs 29.4% (+33.3%) |
| 197 | +``` |
| 198 | + |
| 199 | +**Phase 3 accuracy exceeds Phase 1 on all four noise types**, though only marginally for transliteration (+1.4%). Transformers recover *to* roughly their baseline. Mamba blows past it. |
| 200 | + |
| 201 | +The explanation is probably simple: Mamba's Phase 1 baseline is low (29.4%) because it's a base model doing 5 epochs of finetuning. It's still climbing the learning curve. Phases 2 and 3 each add 5 more epochs. Even if Phase 2 introduces garbage, by the end of Phase 3 Mamba has seen 10 more epochs of QA-format training than it had at the end of Phase 1, the last 5 on clean data. More training = better performance when you're starting from a low baseline. |
| 202 | + |
| 203 | +The noise/unlearn framework assumes a saturated baseline. Instruction-tuned models provide that. Base models don't. |
| 204 | + |
| 205 | +--- |
| 206 | + |
| 207 | +## The Radar Charts |
| 208 | + |
| 209 | +Noise absorption (left) and unlearning recovery (right) across all four noise types. Mamba 1.4B in orange against the paper's four transformers. |
| 210 | + |
| 211 | +{% include figure.liquid path="assets/img/blog_embeds/mamba-noise-learning-unlearning-1.png" title="Mamba 1.4B vs Transformers: Noise Absorption and Unlearning Recovery" class="img-fluid rounded z-depth-1" %} |
| 212 | + |
| 213 | +*Left: Mamba's wordflip spike stands out, far more absorbed than any transformer. The counterfactual dip below zero is unique to Mamba. Right: Mamba's polygon extends well beyond the transformer cluster, reflecting >100% recovery on all four noise types (only marginally for transliteration).* |
| 214 | + |
| 215 | +--- |
| 216 | + |
| 217 | +## What Surprised Me |
| 218 | + |
| 219 | +### 1. The Degenerate Outputs |
| 220 | +I expected Mamba to learn charflip patterns like transformers do. Instead it produced repetitive garbage. The recurrent state appears more fragile to character-level corruption than I assumed. |
| 221 | + |
| 222 | +### 2. Counterfactual As Training Signal |
| 223 | +I didn't anticipate that "wrong" answers could be helpful. In retrospect it's obvious — a base model learning QA benefits from any well-formed answer examples — but I didn't see it coming. |
| 224 | + |
| 225 | +### 3. The Unlearning Overshoot |
| 226 | +I expected Phase 3 to recover *toward* Phase 1, not blow past it. This completely reframes the experiment — for a base model, the three-phase pipeline isn't "learn → damage → repair." It's "learn a little → learn more (with noise) → learn even more (clean)." |
| 227 | + |
| 228 | +### 4. How Misleading Absolute Numbers Are |
| 229 | +Mamba at 29.4% vs Phi2 at 95.7% looks like Mamba is terrible. It's not — it's a base model being compared to instruction-tuned models on an instruction-following task. The interesting story is in relative patterns, not absolute accuracy. |
| 230 | + |
| 231 | +--- |
| 232 | + |
| 233 | +## Open Questions |
| 234 | + |
| 235 | +### 1. What Would Instruction-Tuned Mamba Look Like? |
| 236 | +If Mamba had instruction tuning, would the patterns look more like transformers? Would the unlearning overshoot disappear? Would counterfactual data become actual noise? |
| 237 | + |
| 238 | +### 2. Is the Degenerate Output Problem SSM-Specific? |
| 239 | +Would a base transformer also produce repetitive garbage under charflip, or is this unique to how SSMs maintain hidden state? |
| 240 | + |
| 241 | +### 3. Does Wordflip Vulnerability Scale With Model Size? |
| 242 | +Mamba-1.4B is much more wordflip-vulnerable than transformers. Would a larger Mamba (e.g., Mamba-2.8B) close the gap, or is this an architectural constant? |
| 243 | + |
| 244 | +### 4. What's Happening Inside the State? |
| 245 | +The recurrent hidden state is what makes Mamba different. Probing that state during noisy vs clean generation could reveal *why* word reordering is more disruptive to SSMs — similar to probing transformer attention patterns but for selective state space dynamics. |
| 246 | + |
| 247 | +--- |
| 248 | + |
| 249 | +## Conclusion |
| 250 | + |
| 251 | +Mamba handles noise differently from transformers. Not better, not worse — *differently*. |
| 252 | + |
| 253 | +It's more vulnerable to word-level permutation. It absorbs character noise through degeneration rather than pattern learning. It treats counterfactual data as useful signal rather than corruption. And the "unlearning" framing doesn't quite apply — Mamba's low baseline means the noise→clean cycle is more like extended training than damage repair. |
| 254 | + |
| 255 | +These results come with heavy caveats — base vs instruction-tuned, fuzzy matching vs LLM judging — so they're directional observations, not definitive architecture comparisons. A fairer test would use an instruction-tuned Mamba variant, or apply the same base-model treatment to the transformers. |
| 256 | + |
| 257 | +But even with the caveats, the pattern differences are real. SSMs and transformers don't just differ in speed — they differ in how they absorb and release learned patterns. |
| 258 | + |
| 259 | +--- |
| 260 | + |
| 261 | +*Previous: [Multi-Hop Reasoning in Transformers]({% post_url 2026-01-20-multi-hop-reasoning-transformers %})* |
| 262 | + |
| 263 | +--- |
| 264 | + |
| 265 | +## References |
| 266 | + |
| 267 | +- Original Paper: [Can Small Language Models Learn, Unlearn, and Retain Noise Patterns?](https://arxiv.org/abs/2407.00996) |
| 268 | +- Mamba Paper: [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752) |
| 269 | +- [Full results and data](https://github.com/ARC345/learn-unlearn-mamba/blob/main/mamba/RESULTS.md) |