Cactus is a single-file C runtime for parallel recursion on Linux x86-64.
It gives you fork-join parallelism with automatic load balancing via work
stealing — write recursive functions as usual, sprinkle in BEGIN,
FORK, JOIN, and END, and the runtime distributes work across all cores.
Use cases:
- Divide-and-conquer algorithms — parallel mergesort, quicksort, matrix multiply — anywhere you'd write fork(left); solve(right); join().
- Tree-structured computation — game-tree search, recursive backtracking (N-Queens), fractal generation — problems where the branching factor is high and subtrees are independent.
- Scientific computing — recursive decomposition of grids, adaptive mesh refinement, or any algorithm that naturally subdivides a domain.
```sh
make
```

```c
#include <stdio.h>
#include "cactus.h"

cactus int pfib(int n) {
    if (n < 2) return n;
    int x, y;
    BEGIN();
    FORK(&x, pfib, (n - 1));
    y = pfib(n - 2);
    JOIN();
    END();
    return x + y;
}

int main(void) {
    cactus_start(0);                    // 0 = use all available cores
    int answer = pfib(42);
    printf("pfib(42) = %d\n", answer);
    cactus_stop();
    return 0;
}
```

I needed lightweight parallelism for recursive algorithms in C without
pulling in a heavyweight framework. OpenMP's #pragma omp task works
but is opaque, hard to tune, and drags in a large runtime. Cilk is
elegant but requires a custom compiler. Intel TBB is C++ only.
I wanted something that fits in two files, compiles with stock GCC or
Clang, and lets me write parallel recursion that looks almost identical
to the serial version. The cactus decorator, BEGIN/FORK/JOIN/END
macros, and three API calls (cactus_start, cactus_stop,
cactus_workers) are the entire interface.
The name comes from "cactus stack" — the execution model where each parallel branch gets its own stack, forming a tree of stacks that looks like a cactus when drawn sideways.
- Per-worker deques — each worker thread owns a ring-buffer deque. Spawned continuations are pushed to the bottom; the owner pops from the bottom (LIFO, cache-friendly).
- Random-victim work stealing — idle workers pick a random peer and steal from the top of that peer's deque (FIFO, coarse-grained).
- Continuation-passing — on FORK, the parent's continuation (return address + frame pointer) is posted to the deque. The current thread immediately executes the child. If a thief steals the continuation, it resumes the parent on a fresh stack slab.
- Stack slab pooling — 2 MB stack slabs are allocated from a global depot with an MCS spinlock, cached in per-worker stashes. Stolen continuations run on these slabs; when they finish, the slab returns to the pool.
- Direct register manipulation — x86-64 inline assembly saves and restores RSP/RBP for near-zero-overhead context switching between continuations.
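The owner/thief discipline on the deque can be sketched in a few lines. This is a hypothetical single-threaded illustration of the ordering only; the real concurrent version needs atomics and memory fences (Chase-Lev style), and all names here (`deque`, `deq_push`, `deq_pop`, `deq_steal`) are illustrative, not the library's API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative single-threaded sketch: the owner pushes and pops at
   the bottom (LIFO), a thief takes from the top (FIFO). A real
   concurrent deque needs atomics and fences; this only shows the
   ordering discipline. */
#define DEQ_CAP 1024                /* matches the documented 1024 entries */

typedef struct {
    void *ring[DEQ_CAP];
    size_t top, bottom;             /* steal at top, own work at bottom */
} deque;

static int deq_push(deque *d, void *task) {          /* owner only */
    if (d->bottom - d->top == DEQ_CAP) return 0;     /* deque full */
    d->ring[d->bottom++ % DEQ_CAP] = task;
    return 1;
}

static void *deq_pop(deque *d) {                     /* owner: LIFO */
    if (d->bottom == d->top) return NULL;
    return d->ring[--d->bottom % DEQ_CAP];
}

static void *deq_steal(deque *d) {                   /* thief: FIFO */
    if (d->bottom == d->top) return NULL;
    return d->ring[d->top++ % DEQ_CAP];
}
```

The owner always gets the most recently spawned (cache-hot) task, while a thief takes the oldest one, which tends to represent the largest remaining subtree.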
| Design choice | Benefit | Cost |
|---|---|---|
| Per-worker deque (ring buffer) | O(1) push/pop, no allocation | 8 KB per worker |
| Random-victim stealing | simple, provably efficient | occasional wasted steal attempts |
| Continuation-passing (child-first) | the common no-steal path is a normal function call | parent must be resumable on any stack |
| 2 MB mmap'd stack slabs | large enough for deep recursion, huge-page eligible | virtual address space per steal |
| MCS spinlock for depot | fair, no spinlock convoy | one extra pointer chase vs. CAS |
| Compiler-specific paths (GCC/Clang) | each gets optimal codegen | two code paths to maintain |
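For reference, the core of an MCS queue lock looks roughly like this. It is a sketch with C11 atomics, not the library's actual implementation; the point is that each waiter spins on its own node, which is what avoids the convoy:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal MCS queue-lock sketch (C11 atomics). Waiters form a FIFO
   queue of nodes; each spins on its OWN node's flag instead of a
   shared word, so handoff is fair and cache traffic stays local. */
typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
} mcs_node;

typedef struct { mcs_node *_Atomic tail; } mcs_lock;

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    mcs_node *prev = atomic_exchange(&l->tail, me);   /* join queue */
    if (prev) {
        atomic_store(&prev->next, me);                /* link behind prev */
        while (atomic_load(&me->locked))              /* spin on own node */
            ;
    }
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (!succ) {
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;                                   /* no waiter: done */
        while (!(succ = atomic_load(&me->next)))      /* waiter mid-link */
            ;
    }
    atomic_store(&succ->locked, false);               /* hand off lock */
}
```

The "one extra pointer chase" cost in the table above is the `next` link a releaser must follow to hand the lock to its successor.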
```sh
make        # builds libcactus.so + test binaries
make test   # runs all tests with correctness checks
make clean  # removes build artifacts
```

Compiler: GCC or Clang with C11 support.
```sh
# Link your program against the library
gcc -O3 -pthread -I. your_app.c -Lbuild/cc -lcactus -Wl,-rpath,'$ORIGIN' -o your_app
```

```c
cactus_start(0);            // 0 = auto-detect cores
cactus_start(4);            // or specify worker count
int np = cactus_workers();  // query active workers
cactus_stop();              // join all workers, clean up
```

Set CACTUS_NPROCS in the environment to override the worker count
without recompiling.
Annotate a function with the cactus decorator and use the four macros:
```c
cactus int solve(int n) {
    if (n < CUTOFF) return serial_solve(n);
    int a, b;
    BEGIN();                   // save frame for potential steal
    FORK(&a, solve, (n/2));    // spawn child, post continuation
    b = solve(n - n/2);        // execute inline (child-first)
    JOIN();                    // wait for stolen children
    END();                     // clean up frame
    return a + b;
}
```

- `BEGIN()` — snapshot the frame (RSP, RBP, slab) so a thief can resume it.
- `FORK(&ret, fn, (args...))` — spawn `fn(args...)`, store the result in `ret`. Use `FORK(fn, (args...))` for void calls.
- `JOIN()` — block until all forked children complete.
- `END()` — no-op; marks the end of the parallel region for readability.
When the child doesn't return a value, omit the destination pointer:
```c
cactus void pmul(const double *a, int lda, const double *b, int ldb,
                 double *c, int ldc, int sz)
{
    if (sz <= 64) { leaf(a, lda, b, ldb, c, ldc, sz); return; }
    int h = sz / 2;
    // ...
    BEGIN();
    FORK(pmul, (a00, lda, b00, ldb, c00, ldc, h));  // void — no &ret
    FORK(pmul, (a00, lda, b01, ldb, c01, ldc, h));
    FORK(pmul, (a10, lda, b00, ldb, c10, ldc, h));
    pmul(a10, lda, b01, ldb, c11, ldc, h);          // inline last child
    JOIN();
    FORK(pmul, (a01, lda, b10, ldb, c00, ldc, h));  // second wave
    FORK(pmul, (a01, lda, b11, ldb, c01, ldc, h));
    FORK(pmul, (a11, lda, b10, ldb, c10, ldc, h));
    pmul(a11, lda, b11, ldb, c11, ldc, h);
    JOIN();
    END();
}
```

Multiple FORK/JOIN waves in the same BEGIN/END block are fine —
useful when later forks depend on results from earlier ones.
Spawn a variable number of children from a loop. Each iteration
forks one task; JOIN waits for all of them:
```c
cactus int solve(const int *prev, int sz, int row)
{
    if (row == sz) return 1;
    int board[MAXN], counts[MAXN];
    if (prev) memcpy(board, prev, sizeof(int) * row);
    // identify legal columns
    int any = 0;
    for (int c = 0; c < sz; c++) {
        if (!safe(board, row, c)) { counts[c] = 0; continue; }
        counts[c] = -1;
        any++;
    }
    if (!any) return 0;
    // fork one child per legal placement
    BEGIN();
    for (int c = 0; c < sz; c++) {
        if (counts[c] != -1) continue;
        board[row] = c;
        int snap[MAXN];
        memcpy(snap, board, sizeof(int) * (row + 1));
        FORK(&counts[c], solve, (snap, sz, row + 1));
    }
    JOIN();
    END();
    int total = 0;
    for (int c = 0; c < sz; c++) total += counts[c];
    return total;
}
```

A classic pattern: partition in place, fork one half, recurse on the other:
```c
cactus void psort(int *v, size_t count)
{
    if (count < 2) return;
    int piv = v[count / 2];
    int *lo = v, *hi = v + count - 1;
    while (lo <= hi) {
        if (*lo < piv) lo++;
        else if (*hi > piv) hi--;
        else { swap(lo, hi); lo++; hi--; }
    }
    BEGIN();
    FORK(psort, (v, (size_t)(hi - v + 1)));   // left half
    psort(lo, (size_t)(v + count - lo));      // right half inline
    JOIN();
    END();
}
```

The granularity cutoff (switching to serial below some threshold) is the single most important tuning knob. Too low and you pay fork/join overhead on tiny tasks; too high and you lose parallelism.
```c
cactus int pfib(int n) {
    if (n < 20) return serial_fib(n);   // cutoff at 20, not 2
    int x, y;
    BEGIN();
    FORK(&x, pfib, (n - 1));
    y = pfib(n - 2);
    JOIN();
    END();
    return x + y;
}
```

A good starting point: switch to serial when the subproblem takes less than ~10 microseconds. Profile with different thresholds — the optimal cutoff depends on your hardware and problem shape.
Collect results from multiple children into separate variables:
```c
cactus int search(Tree *node) {
    if (!node) return 0;
    int left, right;
    BEGIN();
    FORK(&left, search, (node->left));
    FORK(&right, search, (node->right));
    int mid = process(node);
    JOIN();
    END();
    return left + mid + right;
}
```

Apply a function to every element of an array. Recursive bisection gives you automatic load balancing without needing to know the worker count:
```c
cactus void foreach(int *v, size_t n, void (*fn)(int *))
{
    if (n <= 1024) {
        for (size_t i = 0; i < n; i++) fn(&v[i]);
        return;
    }
    size_t mid = n / 2;
    BEGIN();
    FORK(foreach, (v, mid, fn));
    foreach(v + mid, n - mid, fn);
    JOIN();
    END();
}

// usage
void square(int *x) { *x = (*x) * (*x); }
foreach(arr, 10000000, square);
```

Transform one array into another. Same bisection pattern, different signature:
```c
cactus void pmap(const double *in, double *out, size_t n,
                 double (*fn)(double))
{
    if (n <= 512) {
        for (size_t i = 0; i < n; i++) out[i] = fn(in[i]);
        return;
    }
    size_t mid = n / 2;
    BEGIN();
    FORK(pmap, (in, out, mid, fn));
    pmap(in + mid, out + mid, n - mid, fn);
    JOIN();
    END();
}
```

Sum, min, max — any associative operation. Each half returns a partial result; the parent combines them:
```c
cactus long psum(const int *v, size_t n)
{
    if (n <= 4096) {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += v[i];
        return s;
    }
    size_t mid = n / 2;
    long left, right;
    BEGIN();
    FORK(&left, psum, (v, mid));
    right = psum(v + mid, n - mid);
    JOIN();
    END();
    return left + right;
}
```

Works for any associative op — swap `+` for max, min, `*`, `^`,
or a custom merge.
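As a concrete instance, here is the same skeleton with a max combiner. The stub macros below are serial-elision placeholders so the sketch compiles standalone; with the real runtime you would `#include "cactus.h"` instead and delete them:

```c
#include <stddef.h>

/* Stub macros for standalone compilation (serial elision). With the
   real runtime, remove these and #include "cactus.h". */
#define cactus
#define BEGIN()
#define FORK(retp, fn, args) (*(retp) = fn args)
#define JOIN()
#define END()

cactus long pmax(const long *v, size_t n)
{
    if (n <= 4096) {                  /* serial base case */
        long m = v[0];
        for (size_t i = 1; i < n; i++)
            if (v[i] > m) m = v[i];
        return m;
    }
    size_t mid = n / 2;
    long left, right;
    BEGIN();
    FORK(&left, pmax, (v, mid));      /* max of first half */
    right = pmax(v + mid, n - mid);   /* second half inline */
    JOIN();
    END();
    return left > right ? left : right;
}
```

Only the base case and the final combine line change; the fork/join skeleton is identical to `psum`.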
Search for the first element matching a predicate. Use a shared flag to short-circuit sibling subtrees:
```c
static volatile int found;

cactus int pfind(const int *v, size_t n, int target)
{
    if (found) return -1;   // early exit
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++)
            if (v[i] == target) { found = 1; return (int)i; }
        return -1;
    }
    size_t mid = n / 2;
    int left, right;
    BEGIN();
    FORK(&left, pfind, (v, mid, target));
    right = pfind(v + mid, n - mid, target);
    JOIN();
    END();
    if (left >= 0) return left;
    if (right >= 0) return (int)mid + right;
    return -1;
}
```

Stable sort with O(n) extra space — fork the two halves, merge on join:
```c
cactus void pmergesort(int *v, int *tmp, size_t n)
{
    if (n <= 1024) { isort(v, n); return; }   // insertion sort base
    size_t mid = n / 2;
    BEGIN();
    FORK(pmergesort, (v, tmp, mid));
    pmergesort(v + mid, tmp + mid, n - mid);
    JOIN();
    END();
    merge(v, mid, v + mid, n - mid, tmp);
    memcpy(v, tmp, sizeof(int) * n);
}
```

Walk, transform, or aggregate an entire tree in parallel — each node forks its children:
```c
// parallel tree sum
cactus long tree_sum(Node *t)
{
    if (!t) return 0;
    long left, right;
    BEGIN();
    FORK(&left, tree_sum, (t->left));
    right = tree_sum(t->right);   // inline the last child
    JOIN();
    END();
    return t->val + left + right;
}

// parallel tree map — apply f to every node
cactus void tree_map(Node *t, void (*f)(Node *))
{
    if (!t) return;
    f(t);
    BEGIN();
    FORK(tree_map, (t->left, f));
    tree_map(t->right, f);
    JOIN();
    END();
}

// parallel tree height
cactus int tree_height(Node *t)
{
    if (!t) return 0;
    int lh, rh;
    BEGIN();
    FORK(&lh, tree_height, (t->left));
    rh = tree_height(t->right);
    JOIN();
    END();
    return 1 + (lh > rh ? lh : rh);
}
```

Process quadrants of a 2D image independently — a natural fit for recursive subdivision:
```c
cactus void blur(float *img, float *tmp, int x, int y, int w, int h,
                 int stride)
{
    if (w * h <= 64 * 64) { blur_serial(img, tmp, x, y, w, h, stride); return; }
    int hw = w / 2, hh = h / 2;
    BEGIN();
    FORK(blur, (img, tmp, x, y, hw, hh, stride));
    FORK(blur, (img, tmp, x + hw, y, w - hw, hh, stride));
    FORK(blur, (img, tmp, x, y + hh, hw, h - hh, stride));
    blur(img, tmp, x + hw, y + hh, w - hw, h - hh, stride);
    JOIN();
    END();
}
```

Divide the input, count locally in each half, merge after join. No shared mutable state — each subtree writes to its own buffer:
```c
cactus void phist(const uint8_t *data, size_t n, int *hist)
{
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++) hist[data[i]]++;
        return;
    }
    size_t mid = n / 2;
    int left[256] = {0}, right[256] = {0};
    BEGIN();
    FORK(phist, (data, mid, left));
    phist(data + mid, n - mid, right);
    JOIN();
    END();
    for (int i = 0; i < 256; i++) hist[i] = left[i] + right[i];
}
```

Short-circuit boolean reduce — check if any element satisfies a predicate:
```c
static volatile int bail;

cactus int pany(const int *v, size_t n, int (*pred)(int))
{
    if (bail) return 1;
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++)
            if (pred(v[i])) { bail = 1; return 1; }
        return 0;
    }
    size_t mid = n / 2;
    int left, right;
    BEGIN();
    FORK(&left, pany, (v, mid, pred));
    right = pany(v + mid, n - mid, pred);
    JOIN();
    END();
    return left || right;
}
```

Flip `||` to `&&` and invert the short-circuit logic for `pall`.
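Spelled out, that `pall` variant looks like this. As above, the stub macros are serial-elision placeholders so the sketch compiles standalone (with the real runtime, `#include "cactus.h"` replaces them), and `bail_all` is the inverted flag:

```c
#include <stddef.h>

/* Stub macros for standalone compilation (serial elision). With the
   real runtime, remove these and #include "cactus.h". */
#define cactus
#define BEGIN()
#define FORK(retp, fn, args) (*(retp) = fn args)
#define JOIN()
#define END()

static volatile int bail_all;   /* set once any element fails */

cactus int pall(const int *v, size_t n, int (*pred)(int))
{
    if (bail_all) return 0;     /* a sibling already found a failure */
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++)
            if (!pred(v[i])) { bail_all = 1; return 0; }
        return 1;
    }
    size_t mid = n / 2;
    int left, right;
    BEGIN();
    FORK(&left, pall, (v, mid, pred));
    right = pall(v + mid, n - mid, pred);
    JOIN();
    END();
    return left && right;
}
```

As with `pfind` and `pany`, the flag must be reset before reusing the function, and a plain `volatile int` is only a best-effort hint, not a synchronization primitive.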
Process each BFS layer in parallel — fork one task per discovered node in the frontier:
```c
cactus void process_frontier(Graph *g, int *frontier, int nf,
                             int *next, int *nn, int *visited)
{
    if (nf <= 64) {
        for (int i = 0; i < nf; i++)
            expand(g, frontier[i], next, nn, visited);
        return;
    }
    int mid = nf / 2;
    BEGIN();
    FORK(process_frontier, (g, frontier, mid, next, nn, visited));
    process_frontier(g, frontier + mid, nf - mid, next, nn, visited);
    JOIN();
    END();
}
```

Cactus handles nested parallel regions naturally — a parallel map that calls a parallel reduce inside each element:
```c
cactus double row_norm(const double *mat, int cols, int row)
{
    return sqrt(psum_sq(mat + row * cols, cols));   // psum_sq is itself parallel
}

cactus void all_norms(const double *mat, double *norms, int rows, int cols)
{
    if (rows <= 4) {
        for (int i = 0; i < rows; i++) norms[i] = row_norm(mat, cols, i);
        return;
    }
    int mid = rows / 2;
    BEGIN();
    FORK(all_norms, (mat, norms, mid, cols));
    all_norms(mat + mid * cols, norms + mid, rows - mid, cols);
    JOIN();
    END();
}
```

The runtime's work stealing distributes tasks across cores regardless of nesting depth — no manual flattening needed.
```sh
# limit to 4 workers (useful for benchmarking scaling)
CACTUS_NPROCS=4 ./my_app

# single-threaded (serial execution, still uses the runtime)
CACTUS_NPROCS=1 ./my_app
```

Four built-in benchmarks verify correctness and measure performance:
| Test | What it exercises | Default n |
|---|---|---|
| fib | basic fork-join, deep recursion | 30 |
| nqueens | nested parallel backtracking | 11 |
| matmul | recursive divide-and-conquer with FP | 64 |
| quicksort | parallel partition + sort, 10M elements | 10000000 |
```sh
make test                          # run all
build/cc/fib 42                    # single test with custom n
CACTUS_NPROCS=4 build/cc/nqueens   # limit to 4 workers
```

- x86-64 Linux only — the inline assembly manipulates RSP/RBP directly. No ARM, no macOS, no Windows.
- GCC or Clang required — uses compiler-specific extensions (nested functions on GCC, `__label__` + `disable_tail_calls` on Clang).
- No cancellation — once a task is forked, it runs to completion.
- Fixed deque size — 1024 entries per worker. Deep recursion without a cutoff can overflow the deque.
- Stack slabs are 2 MB each — a stolen continuation gets a fresh slab. Workloads with massive steal counts consume proportional virtual memory.
| Function | Description |
|---|---|
| `int cactus_start(int nprocs)` | Start the runtime with `nprocs` workers (0 = auto) |
| `int cactus_stop(void)` | Shut down all workers and free resources |
| `int cactus_workers(void)` | Return the number of active workers |

| Macro | Description |
|---|---|
| `BEGIN()` | Initialize a frame for the current parallel region |
| `FORK(&ret, fn, (args...))` | Spawn `fn(args...)`, store result in `ret` |
| `FORK(fn, (args...))` | Spawn `fn(args...)` with no return value |
| `JOIN()` | Wait for all forked children to complete |
| `END()` | End the parallel region (no-op, for readability) |

| Decorator | Description |
|---|---|
| `cactus` | Annotate functions that participate in fork-join parallelism |