
Commit c73e791: Add three example entries to knowledge base

4 files changed: 104 additions & 0 deletions

knowledge-base/001-blocked-recursive-sorting.md
# Cache-Friendly Blocked Recursive Sorting

## Problem

Sorting large integer arrays (10M+ elements) in C++. The metric was wall-clock time measured via `std::chrono::high_resolution_clock`. The baseline was `std::sort`, which uses introsort internally.
## What Worked

Replacing the default quicksort partitioning with a blocked approach that processes elements in fixed-size blocks of 64 elements. Instead of scanning left-to-right and right-to-left with pointer swaps, the algorithm fills two buffers with the indices that need swapping, then performs the swaps in bulk. This keeps memory accesses sequential within each block and reduces branch mispredictions, because the buffer-fill loop has no data-dependent branches.

The improvement was ~20% over `std::sort` on uniformly random 64-bit integers (10M elements, averaged over 10 runs).
## Experiment Data

| Variant | Time (ms) | Speedup vs. baseline |
|---------|-----------|----------------------|
| `std::sort` baseline | 842 | 1.00x |
| Blocked partition, block size 32 | 713 | 1.18x |
| Blocked partition, block size 64 | 678 | 1.24x |
| Blocked partition, block size 128 | 691 | 1.22x |

Block size 64 was the sweet spot. Note that 64 eight-byte elements span 512 B (eight 64 B cache lines), so the benefit comes from keeping each block resident in L1 while it is scanned, not from matching a single cache line.
## What Didn't Work

- **Radix sort**: Faster in theory for integers, but the extra memory allocation and multiple passes made it slower for arrays that fit in L2 cache. It only won for 100M+ elements.
- **SIMD-based partitioning**: The overhead of gather/scatter instructions on the test hardware (Zen 3) negated the theoretical throughput gain. It might work better on Intel with AVX-512.
## Code Example

```cpp
// Core idea: buffer the indices that need swapping, then swap in bulk.
// Assumes <algorithm> (std::min) and <utility> (std::swap) are included.
int buf_l[BLOCK], buf_r[BLOCK];
int nl = 0, nr = 0;
while (left + BLOCK <= right - BLOCK) {
    if (nl == 0) {
        for (int i = 0; i < BLOCK; i++) {
            buf_l[nl] = i;                   // record candidate offset
            nl += (arr[left + i] >= pivot);  // keep it only if it must move (branchless)
        }
    }
    if (nr == 0) {
        for (int i = 0; i < BLOCK; i++) {
            buf_r[nr] = i;
            nr += (arr[right - 1 - i] < pivot);
        }
    }
    int swaps = std::min(nl, nr);
    for (int i = 0; i < swaps; i++)
        std::swap(arr[left + buf_l[i]], arr[right - 1 - buf_r[i]]);
    // bookkeeping elided: consume the swapped entries, shift any leftovers
    // to the front of their buffer, and advance left/right past finished blocks...
}
```
## Environment

C++17, GCC 12.2 with `-O3 -march=native`, AMD Ryzen 9 5900X, 64 GB DDR4-3600, Ubuntu 22.04. The array fits entirely in RAM; no disk I/O.
knowledge-base/002-batch-size-tuning-transformer.md
# Gradient Accumulation as Batch Size Proxy for Transformer Training

## Problem

Training a small transformer language model (125M parameters) on a single GPU with limited VRAM (24 GB). The metric was validation bits-per-byte (val_bpb); lower is better. Larger batch sizes are known to stabilize training, but batch sizes above 32 caused OOM on the available hardware.
## What Worked

Using gradient accumulation to simulate large batch sizes without increasing memory usage. Instead of trying to fit batch size 128 into memory, accumulate gradients over 4 forward/backward passes of batch size 32 before performing one optimizer step. This yields the same effective batch size of 128 while staying within VRAM limits.

The key insight: gradient accumulation is not just a memory workaround; it is functionally equivalent to a larger batch (assuming no batch normalization). The model trained with accumulation_steps=4 matched, within noise, the val_bpb of a true batch-size-128 run on hardware with enough VRAM, and val_bpb improved from 1.42 (batch 32) to 1.31 (effective batch 128).
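A minimal sketch of the loop structure (the model, data, and hyperparameters below are toy stand-ins, not the 125M-parameter setup from these experiments): the loss is divided by the number of accumulation steps so the accumulated gradient is an average over the effective batch, matching what a true batch of 128 would produce.

```python
import torch
from torch import nn

# Toy stand-ins for the real model and data; only the loop shape matters here.
model = nn.Linear(64, 64)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
micro_batch, accum_steps = 32, 4           # effective batch = 32 * 4 = 128

opt.zero_grad(set_to_none=True)
for step in range(100):
    x = torch.randn(micro_batch, 64)
    loss = model(x).pow(2).mean()          # placeholder loss
    (loss / accum_steps).backward()        # scale so accumulated grads average, not sum
    if (step + 1) % accum_steps == 0:
        opt.step()                         # one optimizer update per accum_steps micro-batches
        opt.zero_grad(set_to_none=True)
```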
## Experiment Data

| Commit | val_bpb | VRAM (GB) | Status | Hypothesis |
|--------|---------|-----------|--------|------------|
| a1b2c3d | 1.4200 | 22.1 | keep | baseline (batch_size=32, no accumulation) |
| b2c3d4e | 1.3500 | 22.1 | keep | accumulation_steps=2 (effective batch 64) will reduce val_bpb by smoothing gradient noise |
| c3d4e5f | 1.3100 | 22.3 | keep | accumulation_steps=4 (effective batch 128) will further reduce val_bpb |
| d4e5f6g | 1.3150 | 22.3 | discard | accumulation_steps=8 (effective batch 256) will continue the trend |
| e5f6g7h | 1.2950 | 22.3 | keep | accumulation_steps=4 with linear LR scaling (4x base LR), following the linear scaling rule |
## What Didn't Work

- **Effective batch 256** (accumulation_steps=8): No further improvement; the model is small enough that batch 128 already provides sufficient gradient signal. Diminishing returns.
- **Square-root LR scaling**: Scaling the LR by sqrt(accumulation_steps) instead of linearly underperformed. The linear scaling rule (Goyal et al., 2017) worked better at this model size.
## Environment

PyTorch 2.1, single NVIDIA RTX 4090 (24 GB VRAM), 125M-parameter GPT-2-style transformer, OpenWebText dataset, trained for 50k steps.
knowledge-base/003-arena-allocation-json-parsing.md
# Arena Allocation for JSON Parsing

## Problem

Parsing large JSON files (100 MB+) in a Rust service. The metric was throughput in MB/s. The baseline used `serde_json::from_str`, which allocates individually for every string and object in the JSON tree, causing heavy allocator pressure.
## What Worked

Switching from per-node heap allocation to an arena allocator (`bumpalo`). All parsed nodes, strings, and intermediate structures are allocated from a single contiguous memory region that is freed in one shot after processing. This eliminated thousands of individual `malloc`/`free` calls per parse and improved cache locality, because related nodes end up adjacent in memory.

Throughput improved from 380 MB/s to 620 MB/s (~63%) on a 150 MB JSON file with deeply nested objects.
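As a minimal sketch of the pattern (the `Value` type and `parse_demo` function here are illustrative, not the service's actual parser, and `bumpalo`'s optional `collections` feature is assumed), every node below comes from one arena, and the whole tree is freed when the arena drops:

```rust
use bumpalo::Bump;
use bumpalo::collections::{String as BString, Vec as BVec};

// Illustrative arena-backed JSON-like tree. serde_json's own Value heap-allocates
// per node, so a custom tree like this is needed to benefit from the arena.
#[allow(dead_code)]
enum Value<'a> {
    Null,
    Bool(bool),
    Number(f64),
    Str(&'a str),
    Array(BVec<'a, Value<'a>>),
    Object(BVec<'a, (&'a str, Value<'a>)>),
}

fn parse_demo<'a>(bump: &'a Bump) -> Value<'a> {
    // Every allocation below comes from the arena; nothing is freed per node.
    let key = BString::from_str_in("status", bump).into_bump_str();
    let val = Value::Str(BString::from_str_in("ok", bump).into_bump_str());
    let mut obj = BVec::new_in(bump);
    obj.push((key, val));
    Value::Object(obj)
}

fn main() {
    let bump = Bump::new();        // one contiguous region, grown as needed
    let _tree = parse_demo(&bump);
    // `bump` is dropped here: the entire tree is freed in one shot.
}
```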
## Environment

Rust 1.74, `bumpalo` 3.14, AMD EPYC 7763 (server), 256 GB RAM, Linux 6.1. Input files are API response dumps with 3-5 levels of nesting and many small string values.

knowledge-base/INDEX.md

Optimization techniques, experiment results, and lessons learned from AAE sessions.

| # | Title | Problem Domain | Key Technique | File |
|---|-------|---------------|---------------|------|
| 001 | Cache-friendly blocked recursive sorting | Sorting, large arrays | Blocked partitioning with fixed-size swap buffers | [001-blocked-recursive-sorting.md](001-blocked-recursive-sorting.md) |
| 002 | Gradient accumulation as batch size proxy | Transformer training, limited VRAM | Gradient accumulation to simulate large batches | [002-batch-size-tuning-transformer.md](002-batch-size-tuning-transformer.md) |
| 003 | Arena allocation for JSON parsing | Parsing, memory allocation | Arena allocator to eliminate per-node heap allocation | [003-arena-allocation-json-parsing.md](003-arena-allocation-json-parsing.md) |
