
Commit c73e791: Add three example entries to knowledge base

4 files changed: 104 additions & 0 deletions

knowledge-base/001-blocked-recursive-sorting.md
# Cache-Friendly Blocked Recursive Sorting

## Problem

Sorting large integer arrays (10M+ elements) in C++. The metric was wall-clock time measured via `std::chrono::high_resolution_clock`. The baseline was `std::sort`, which uses introsort internally.
## What Worked

Replacing the default quicksort partitioning with a blocked approach that processes elements in fixed-size blocks of 64 elements. Instead of scanning left-to-right and right-to-left with pointer swaps, the algorithm fills two buffers with the indices that need swapping, then performs the swaps in bulk. This keeps memory accesses sequential within each block and reduces branch mispredictions, because the buffer-fill loop has no data-dependent branches.

The improvement was ~20% over `std::sort` on uniformly random 64-bit integers (10M elements, averaged over 10 runs).
## Experiment Data

| Variant | Time (ms) | Speedup vs. baseline |
|---------|-----------|----------------------|
| `std::sort` baseline | 842 | 1.00x |
| Blocked partition, block size 32 | 713 | 1.18x |
| Blocked partition, block size 64 | 678 | 1.24x |
| Blocked partition, block size 128 | 691 | 1.22x |

Block size 64 was the sweet spot. Note that 64 eight-byte elements span 512 B (eight 64 B cache lines), so the benefit comes from keeping each block resident in L1 while it is scanned, not from matching a single cache line.
## What Didn't Work

- **Radix sort**: Faster in theory for integers, but the extra memory allocation and multiple passes made it slower for arrays that fit in L2 cache. It only won for 100M+ elements.
- **SIMD-based partitioning**: The overhead of gather/scatter instructions on the test hardware (Zen 3) negated the theoretical throughput gain. It might work better on Intel with AVX-512.
## Code Example

```cpp
// Core idea: buffer the indices that need swapping, then swap in bulk.
// Assumes <algorithm> (std::min) and <utility> (std::swap) are included.
int buf_l[BLOCK], buf_r[BLOCK];
int nl = 0, nr = 0;
while (left + BLOCK <= right - BLOCK) {
    if (nl == 0) {
        for (int i = 0; i < BLOCK; i++) {
            buf_l[nl] = i;                   // record candidate offset
            nl += (arr[left + i] >= pivot);  // keep it only if it must move (branchless)
        }
    }
    if (nr == 0) {
        for (int i = 0; i < BLOCK; i++) {
            buf_r[nr] = i;
            nr += (arr[right - 1 - i] < pivot);
        }
    }
    int swaps = std::min(nl, nr);
    for (int i = 0; i < swaps; i++)
        std::swap(arr[left + buf_l[i]], arr[right - 1 - buf_r[i]]);
    // bookkeeping elided: consume the swapped entries, shift any leftovers
    // to the front of their buffer, and advance left/right past finished blocks...
}
```
## Environment

C++17, GCC 12.2 with `-O3 -march=native`, AMD Ryzen 9 5900X, 64 GB DDR4-3600, Ubuntu 22.04. The array fits entirely in RAM; no disk I/O.
knowledge-base/002-batch-size-tuning-transformer.md
# Gradient Accumulation as Batch Size Proxy for Transformer Training

## Problem

Training a small transformer language model (125M parameters) on a single GPU with limited VRAM (24 GB). The metric was validation bits-per-byte (val_bpb); lower is better. Larger batch sizes are known to stabilize training, but batch sizes above 32 caused OOM on the available hardware.
## What Worked

Using gradient accumulation to simulate large batch sizes without increasing memory usage. Instead of trying to fit batch size 128 into memory, accumulate gradients over 4 forward/backward passes of batch size 32 before performing one optimizer step. This yields the same effective batch size of 128 while staying within VRAM limits.

The key insight: gradient accumulation is not just a memory workaround; it is functionally equivalent to a larger batch (assuming no batch normalization). The model trained with accumulation_steps=4 matched, within noise, the val_bpb of a true batch-size-128 run on hardware with enough VRAM, and val_bpb improved from 1.42 (batch 32) to 1.31 (effective batch 128).
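A minimal sketch of the loop structure (the model, data, and hyperparameters below are toy stand-ins, not the 125M-parameter setup from these experiments): the loss is divided by the number of accumulation steps so the accumulated gradient is an average over the effective batch, matching what a true batch of 128 would produce.

```python
import torch
from torch import nn

# Toy stand-ins for the real model and data; only the loop shape matters here.
model = nn.Linear(64, 64)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
micro_batch, accum_steps = 32, 4           # effective batch = 32 * 4 = 128

opt.zero_grad(set_to_none=True)
for step in range(100):
    x = torch.randn(micro_batch, 64)
    loss = model(x).pow(2).mean()          # placeholder loss
    (loss / accum_steps).backward()        # scale so accumulated grads average, not sum
    if (step + 1) % accum_steps == 0:
        opt.step()                         # one optimizer update per accum_steps micro-batches
        opt.zero_grad(set_to_none=True)
```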
## Experiment Data

| Commit | val_bpb | VRAM (GB) | Status | Hypothesis |
|--------|---------|-----------|--------|------------|
| a1b2c3d | 1.4200 | 22.1 | keep | baseline (batch_size=32, no accumulation) |
| b2c3d4e | 1.3500 | 22.1 | keep | accumulation_steps=2 (effective batch 64) will reduce val_bpb by smoothing gradient noise |
| c3d4e5f | 1.3100 | 22.3 | keep | accumulation_steps=4 (effective batch 128) will further reduce val_bpb |
| d4e5f6g | 1.3150 | 22.3 | discard | accumulation_steps=8 (effective batch 256) will continue the trend |
| e5f6g7h | 1.2950 | 22.3 | keep | accumulation_steps=4 with linear LR scaling (4x base LR), following the linear scaling rule |
## What Didn't Work

- **Effective batch 256** (accumulation_steps=8): No further improvement; the model is small enough that batch 128 already provides sufficient gradient signal. Diminishing returns.
- **Square-root LR scaling**: Scaling the LR by sqrt(accumulation_steps) instead of linearly underperformed. The linear scaling rule (Goyal et al., 2017) worked better at this model size.
## Environment

PyTorch 2.1, single NVIDIA RTX 4090 (24 GB VRAM), 125M-parameter GPT-2-style transformer, OpenWebText dataset, trained for 50k steps.
knowledge-base/003-arena-allocation-json-parsing.md
# Arena Allocation for JSON Parsing

## Problem

Parsing large JSON files (100 MB+) in a Rust service. The metric was throughput in MB/s. The baseline used `serde_json::from_str`, which allocates individually for every string and object in the JSON tree, causing heavy allocator pressure.
## What Worked

Switching from per-node heap allocation to an arena allocator (`bumpalo`). All parsed nodes, strings, and intermediate structures are allocated from a single contiguous memory region that is freed in one shot after processing. This eliminated thousands of individual `malloc`/`free` calls per parse and improved cache locality, because related nodes end up adjacent in memory.

Throughput improved from 380 MB/s to 620 MB/s (~63%) on a 150 MB JSON file with deeply nested objects.
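As a minimal sketch of the pattern (the `Value` type and `parse_demo` function here are illustrative, not the service's actual parser, and `bumpalo`'s optional `collections` feature is assumed), every node below comes from one arena, and the whole tree is freed when the arena drops:

```rust
use bumpalo::Bump;
use bumpalo::collections::{String as BString, Vec as BVec};

// Illustrative arena-backed JSON-like tree. serde_json's own Value heap-allocates
// per node, so a custom tree like this is needed to benefit from the arena.
#[allow(dead_code)]
enum Value<'a> {
    Null,
    Bool(bool),
    Number(f64),
    Str(&'a str),
    Array(BVec<'a, Value<'a>>),
    Object(BVec<'a, (&'a str, Value<'a>)>),
}

fn parse_demo<'a>(bump: &'a Bump) -> Value<'a> {
    // Every allocation below comes from the arena; nothing is freed per node.
    let key = BString::from_str_in("status", bump).into_bump_str();
    let val = Value::Str(BString::from_str_in("ok", bump).into_bump_str());
    let mut obj = BVec::new_in(bump);
    obj.push((key, val));
    Value::Object(obj)
}

fn main() {
    let bump = Bump::new();        // one contiguous region, grown as needed
    let _tree = parse_demo(&bump);
    // `bump` is dropped here: the entire tree is freed in one shot.
}
```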
## Environment

Rust 1.74, `bumpalo` 3.14, AMD EPYC 7763 (server), 256 GB RAM, Linux 6.1. Input files are API response dumps with 3-5 levels of nesting and many small string values.

knowledge-base/INDEX.md

Optimization techniques, experiment results, and lessons learned from AAE sessions.

| # | Title | Problem Domain | Key Technique | File |
|---|-------|---------------|---------------|------|
| 001 | Cache-friendly blocked recursive sorting | Sorting, large arrays | Blocked partitioning with fixed-size swap buffers | [001-blocked-recursive-sorting.md](001-blocked-recursive-sorting.md) |
| 002 | Gradient accumulation as batch size proxy | Transformer training, limited VRAM | Gradient accumulation to simulate large batches | [002-batch-size-tuning-transformer.md](002-batch-size-tuning-transformer.md) |
| 003 | Arena allocation for JSON parsing | Parsing, memory allocation | Arena allocator to eliminate per-node heap allocation | [003-arena-allocation-json-parsing.md](003-arena-allocation-json-parsing.md) |
