xtellect/cactus
Cactus


Cactus is a single-file C runtime for parallel recursion on Linux x86-64. It gives you fork-join parallelism with automatic load balancing via work stealing — write recursive functions as usual, sprinkle in BEGIN, FORK, JOIN, and END, and the runtime distributes the work across all cores.

Use cases:

  • Divide-and-conquer algorithms — parallel mergesort, quicksort, matrix multiply — anywhere you'd write fork(left); solve(right); join().
  • Tree-structured computation — game-tree search, recursive backtracking (N-Queens), fractal generation — problems where the branching factor is high and subtrees are independent.
  • Scientific computing — recursive decomposition of grids, adaptive mesh refinement, or any algorithm that naturally subdivides a domain.

Quick start

make
#include <stdio.h>
#include "cactus.h"

cactus int pfib(int n) {
    if (n < 2) return n;
    int x, y;
    BEGIN();
    FORK(&x, pfib, (n - 1));
    y = pfib(n - 2);
    JOIN();
    END();
    return x + y;
}

int main(void) {
    cactus_start(0);          // 0 = use all available cores
    int answer = pfib(42);
    cactus_stop();
    printf("pfib(42) = %d\n", answer);
    return 0;
}

Why I built this

I needed lightweight parallelism for recursive algorithms in C without pulling in a heavyweight framework. OpenMP's #pragma omp task works but is opaque, hard to tune, and drags in a large runtime. Cilk is elegant but requires a custom compiler. Intel TBB is C++ only.

I wanted something that fits in two files, compiles with stock GCC or Clang, and lets me write parallel recursion that looks almost identical to the serial version. The cactus decorator, BEGIN/FORK/JOIN/END macros, and three API calls (cactus_start, cactus_stop, cactus_workers) are the entire interface.

The name comes from "cactus stack" — the execution model where each parallel branch gets its own stack, forming a tree of stacks that looks like a cactus when drawn sideways.

How it works

  • Per-worker deques — each worker thread owns a ring-buffer deque. Spawned continuations are pushed to the bottom; the owner pops from the bottom (LIFO, cache-friendly).
  • Random-victim work stealing — idle workers pick a random peer and steal from the top of that peer's deque (FIFO, coarse-grained).
  • Continuation-passing — on FORK, the parent's continuation (return address + frame pointer) is posted to the deque. The current thread immediately executes the child. If a thief steals the continuation, it resumes the parent on a fresh stack slab.
  • Stack slab pooling — 2 MB stack slabs are allocated from a global depot with an MCS spinlock, cached in per-worker stashes. Stolen continuations run on these slabs; when they finish, the slab returns to the pool.
  • Direct register manipulation — x86-64 inline assembly saves and restores RSP/RBP for near-zero-overhead context switching between continuations.
| Design choice | Benefit | Cost |
|---|---|---|
| Per-worker deque (ring buffer) | O(1) push/pop, no allocation | 8 KB per worker |
| Random-victim stealing | simple, provably efficient | occasional wasted steal attempts |
| Continuation-passing (child-first) | the common no-steal path is a normal function call | parent must be resumable on any stack |
| 2 MB mmap'd stack slabs | large enough for deep recursion, huge-page eligible | virtual address space per steal |
| MCS spinlock for depot | fair, no spinlock convoy | one extra pointer chase vs. CAS |
| Compiler-specific paths (GCC/Clang) | each gets optimal codegen | two code paths to maintain |
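The owner-LIFO / thief-FIFO discipline from the bullets above is easy to see in a stripped-down, single-threaded sketch (illustrative only; the real deque is concurrent and the names here are hypothetical):

```c
#define DEQ_CAP 1024                 /* matches the runtime's fixed size */

typedef struct {
    void *buf[DEQ_CAP];
    int top, bottom;                 /* thieves take from top, owner works at bottom */
} deque;

/* owner pushes a continuation at the bottom */
static int deq_push(deque *d, void *task) {
    if (d->bottom - d->top >= DEQ_CAP) return 0;   /* full */
    d->buf[d->bottom++ % DEQ_CAP] = task;
    return 1;
}

/* owner pops its own most recent task (LIFO, cache-friendly) */
static void *deq_pop(deque *d) {
    if (d->bottom == d->top) return 0;             /* empty */
    return d->buf[--d->bottom % DEQ_CAP];
}

/* thief steals the oldest task (FIFO, coarse-grained) */
static void *deq_steal(deque *d) {
    if (d->bottom == d->top) return 0;             /* nothing to steal */
    return d->buf[d->top++ % DEQ_CAP];
}
```

The real runtime adds the synchronization this sketch omits; the point is only the asymmetry: the owner and the thieves work opposite ends of the same ring buffer.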

Build

make            # builds libcactus.so + test binaries
make test       # runs all tests with correctness checks
make clean      # removes build artifacts

Compiler: GCC or Clang with C11 support.

# Link your program against the library
gcc -O3 -pthread -I. your_app.c -Lbuild/cc -lcactus -Wl,-rpath,'$ORIGIN' -o your_app

Usage

Startup and shutdown

cactus_start(0);           // 0 = auto-detect cores
cactus_start(4);           // or specify worker count
int np = cactus_workers(); // query active workers
cactus_stop();             // join all workers, clean up

Set CACTUS_NPROCS in the environment to override the worker count without recompiling.

Writing parallel functions

Annotate a function with the cactus decorator and use the four macros:

cactus int solve(int n) {
    if (n < CUTOFF) return serial_solve(n);

    int a, b;
    BEGIN();                  // save frame for potential steal
    FORK(&a, solve, (n/2));   // spawn child, post continuation
    b = solve(n - n/2);       // execute inline (child-first)
    JOIN();                   // wait for stolen children
    END();                    // close the parallel region (no-op)
    return a + b;
}
  • BEGIN() — snapshot the frame (RSP, RBP, slab) so a thief can resume it.
  • FORK(&ret, fn, (args...)) — spawn fn(args...), store the result in ret. Use FORK(fn, (args...)) for void calls.
  • JOIN() — block until all forked children complete.
  • END() — no-op; marks the end of the parallel region for readability.

Void fork (no return value)

When the child doesn't return a value, omit the destination pointer:

cactus void pmul(const double *a, int lda, const double *b, int ldb,
                 double *c, int ldc, int sz)
{
    if (sz <= 64) { leaf(a, lda, b, ldb, c, ldc, sz); return; }

    int h = sz / 2;
    // ...
    BEGIN();
    FORK(pmul, (a00, lda, b00, ldb, c00, ldc, h));   // void — no &ret
    FORK(pmul, (a00, lda, b01, ldb, c01, ldc, h));
    FORK(pmul, (a10, lda, b00, ldb, c10, ldc, h));
    pmul(a10, lda, b01, ldb, c11, ldc, h);            // inline last child
    JOIN();

    FORK(pmul, (a01, lda, b10, ldb, c00, ldc, h));    // second wave
    FORK(pmul, (a01, lda, b11, ldb, c01, ldc, h));
    FORK(pmul, (a11, lda, b10, ldb, c10, ldc, h));
    pmul(a11, lda, b11, ldb, c11, ldc, h);
    JOIN();
    END();
}

Multiple FORK/JOIN waves in the same BEGIN/END block are fine — useful when later forks depend on results from earlier ones.

Forking in a loop

Spawn a variable number of children from a loop. Each iteration forks one task; JOIN waits for all of them:

cactus int solve(const int *prev, int sz, int row)
{
    if (row == sz) return 1;

    int board[MAXN], counts[MAXN];
    if (prev) memcpy(board, prev, sizeof(int) * row);

    // identify legal columns
    int any = 0;
    for (int c = 0; c < sz; c++) {
        if (!safe(board, row, c)) { counts[c] = 0; continue; }
        counts[c] = -1;
        any++;
    }
    if (!any) return 0;

    // fork one child per legal placement
    BEGIN();
    for (int c = 0; c < sz; c++) {
        if (counts[c] != -1) continue;
        board[row] = c;
        int snap[MAXN];
        memcpy(snap, board, sizeof(int) * (row + 1));
        FORK(&counts[c], solve, (snap, sz, row + 1));
    }
    JOIN();
    END();

    int total = 0;
    for (int c = 0; c < sz; c++) total += counts[c];
    return total;
}

Parallel quicksort

A classic pattern: partition in place, fork one half, recurse on the other:

cactus void psort(int *v, size_t count)
{
    if (count < 2) return;

    int piv = v[count / 2];
    int *lo = v, *hi = v + count - 1;

    while (lo <= hi) {
        if (*lo < piv)       lo++;
        else if (*hi > piv)  hi--;
        else { swap(lo, hi); lo++; hi--; }
    }

    BEGIN();
    FORK(psort, (v, (size_t)(hi - v + 1)));   // left half
    psort(lo, (size_t)(v + count - lo));       // right half inline
    JOIN();
    END();
}
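The snippet assumes a small swap helper. For reference, here is a serial twin of the same partition-and-recurse scheme, handy for sanity-checking the partition bounds before parallelizing (swap and ssort are illustrative, not part of the library):

```c
#include <stddef.h>

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Serial version of psort's body: identical partition, both halves inline. */
static void ssort(int *v, size_t count)
{
    if (count < 2) return;

    int piv = v[count / 2];
    int *lo = v, *hi = v + count - 1;

    while (lo <= hi) {
        if (*lo < piv)       lo++;
        else if (*hi > piv)  hi--;
        else { swap(lo, hi); lo++; hi--; }
    }

    ssort(v, (size_t)(hi - v + 1));       /* left half  */
    ssort(lo, (size_t)(v + count - lo));  /* right half */
}
```

Elements left between the final hi and lo equal the pivot and are already in place, so both recursive calls shrink strictly — the same property the parallel version relies on.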

Tuning the cutoff

The granularity cutoff (switching to serial below some threshold) is the single most important tuning knob. Too low and you pay fork/join overhead on tiny tasks; too high and you lose parallelism.

cactus int pfib(int n) {
    if (n < 20) return serial_fib(n);   // cutoff at 20, not 2
    int x, y;
    BEGIN();
    FORK(&x, pfib, (n - 1));
    y = pfib(n - 2);
    JOIN();
    END();
    return x + y;
}

A good starting point: switch to serial when the subproblem takes less than ~10 microseconds. Profile with different thresholds — the optimal cutoff depends on your hardware and problem shape.
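One way to measure this: time the serial leaf at candidate sizes with clock_gettime and raise the cutoff until the leaf cost clears roughly 10 microseconds. A sketch (leaf_us and serial_fib are hypothetical helpers, not part of the cactus API):

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

static long serial_fib(int n) {
    return n < 2 ? n : serial_fib(n - 1) + serial_fib(n - 2);
}

/* Time one serial leaf call of size n, in microseconds. */
static double leaf_us(int n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile long r = serial_fib(n);    /* volatile: keep the call alive */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)r;
    return (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```

Print leaf_us(n) for a few candidate n and pick the smallest one whose cost crosses the ~10 µs mark; that becomes the serial cutoff.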

Multiple return values

Collect results from multiple children into separate variables:

cactus int search(Tree *node) {
    if (!node) return 0;
    int left, right;
    BEGIN();
    FORK(&left,  search, (node->left));
    FORK(&right, search, (node->right));
    int mid = process(node);
    JOIN();
    END();
    return left + mid + right;
}

Parallel for-each

Apply a function to every element of an array. Recursive bisection gives you automatic load balancing without needing to know the worker count:

cactus void foreach(int *v, size_t n, void (*fn)(int *))
{
    if (n <= 1024) {
        for (size_t i = 0; i < n; i++) fn(&v[i]);
        return;
    }
    size_t mid = n / 2;
    BEGIN();
    FORK(foreach, (v, mid, fn));
    foreach(v + mid, n - mid, fn);
    JOIN();
    END();
}

// usage
void square(int *x) { *x = (*x) * (*x); }
foreach(arr, 10000000, square);

Parallel map

Transform one array into another. Same bisection pattern, different signature:

cactus void pmap(const double *in, double *out, size_t n,
                 double (*fn)(double))
{
    if (n <= 512) {
        for (size_t i = 0; i < n; i++) out[i] = fn(in[i]);
        return;
    }
    size_t mid = n / 2;
    BEGIN();
    FORK(pmap, (in, out, mid, fn));
    pmap(in + mid, out + mid, n - mid, fn);
    JOIN();
    END();
}

Parallel reduce

Sum, min, max — any associative operation. Each half returns a partial result; the parent combines them:

cactus long psum(const int *v, size_t n)
{
    if (n <= 4096) {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += v[i];
        return s;
    }
    size_t mid = n / 2;
    long left, right;
    BEGIN();
    FORK(&left, psum, (v, mid));
    right = psum(v + mid, n - mid);
    JOIN();
    END();
    return left + right;
}

Works for any associative op — swap + for max, min, *, ^, or a custom merge.
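For instance, max as the combine. Shown here as a serial twin of the same bisection; in the cactus version the first recursive call would be a FORK:

```c
#include <stddef.h>

/* Serial sketch of a parallel max reduce: same split, different combine.
   (Requires n >= 1; pmax here is illustrative, not library code.) */
static int pmax(const int *v, size_t n)
{
    if (n == 1) return v[0];
    size_t mid = n / 2;
    int left  = pmax(v, mid);           /* FORK(&left, pmax, (v, mid)) in cactus */
    int right = pmax(v + mid, n - mid);
    return left > right ? left : right;
}
```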

Parallel find

Search for an element equal to a target value. A shared flag short-circuits sibling subtrees once a match is found; note that with the early exit, the result is some match, not necessarily the first one:

static volatile int found;

cactus int pfind(const int *v, size_t n, int target)
{
    if (found) return -1;                       // early exit
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++)
            if (v[i] == target) { found = 1; return (int)i; }
        return -1;
    }
    size_t mid = n / 2;
    int left, right;
    BEGIN();
    FORK(&left, pfind, (v, mid, target));
    right = pfind(v + mid, n - mid, target);
    JOIN();
    END();
    if (left >= 0) return left;
    if (right >= 0) return (int)mid + right;
    return -1;
}

Parallel merge sort

Stable sort with O(n) extra space — fork the two halves, merge on join:

cactus void pmergesort(int *v, int *tmp, size_t n)
{
    if (n <= 1024) { isort(v, n); return; }   // insertion sort base

    size_t mid = n / 2;
    BEGIN();
    FORK(pmergesort, (v, tmp, mid));
    pmergesort(v + mid, tmp + mid, n - mid);
    JOIN();
    END();
    merge(v, mid, v + mid, n - mid, tmp);
    memcpy(v, tmp, sizeof(int) * n);
}
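The example assumes isort and merge helpers. Plausible minimal versions matching the call sites above (illustrative, not the repository's actual code):

```c
#include <stddef.h>

/* insertion sort for the serial base case */
static void isort(int *v, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = v[i];
        size_t j = i;
        while (j > 0 && v[j - 1] > key) { v[j] = v[j - 1]; j--; }
        v[j] = key;
    }
}

/* merge sorted runs a[0..na) and b[0..nb) into dst; <= keeps the sort stable */
static void merge(const int *a, size_t na, const int *b, size_t nb, int *dst)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) dst[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) dst[k++] = a[i++];
    while (j < nb) dst[k++] = b[j++];
}
```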

Parallel tree operations

Walk, transform, or aggregate an entire tree in parallel — each node forks its children:

// parallel tree sum
cactus long tree_sum(Node *t)
{
    if (!t) return 0;
    long left, right;
    BEGIN();
    FORK(&left, tree_sum, (t->left));
    right = tree_sum(t->right);          // inline the last child
    JOIN();
    END();
    return t->val + left + right;
}

// parallel tree map — apply f to every node
cactus void tree_map(Node *t, void (*f)(Node *))
{
    if (!t) return;
    f(t);
    BEGIN();
    FORK(tree_map, (t->left,  f));
    tree_map(t->right, f);
    JOIN();
    END();
}

// parallel tree height
cactus int tree_height(Node *t)
{
    if (!t) return 0;
    int lh, rh;
    BEGIN();
    FORK(&lh, tree_height, (t->left));
    rh = tree_height(t->right);
    JOIN();
    END();
    return 1 + (lh > rh ? lh : rh);
}

Parallel image processing

Process quadrants of a 2D image independently — a natural fit for recursive subdivision:

cactus void blur(float *img, float *tmp, int x, int y, int w, int h,
                 int stride)
{
    if (w * h <= 64 * 64) { blur_serial(img, tmp, x, y, w, h, stride); return; }

    int hw = w / 2, hh = h / 2;
    BEGIN();
    FORK(blur, (img, tmp, x,      y,      hw,     hh,     stride));
    FORK(blur, (img, tmp, x + hw, y,      w - hw, hh,     stride));
    FORK(blur, (img, tmp, x,      y + hh, hw,     h - hh, stride));
    blur(img, tmp, x + hw, y + hh, w - hw, h - hh, stride);
    JOIN();
    END();
}

Parallel histogram

Divide the input, count locally in each half, merge after join. No shared mutable state — each subtree writes to its own buffer:

cactus void phist(const uint8_t *data, size_t n, int *hist)
{
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++) hist[data[i]]++;
        return;
    }
    size_t mid = n / 2;
    int left[256] = {0}, right[256] = {0};
    BEGIN();
    FORK(phist, (data, mid, left));
    phist(data + mid, n - mid, right);
    JOIN();
    END();
    for (int i = 0; i < 256; i++) hist[i] = left[i] + right[i];
}

Parallel any / all

Short-circuit boolean reduce — check if any element satisfies a predicate:

static volatile int bail;

cactus int pany(const int *v, size_t n, int (*pred)(int))
{
    if (bail) return 1;
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++)
            if (pred(v[i])) { bail = 1; return 1; }
        return 0;
    }
    size_t mid = n / 2;
    int left, right;
    BEGIN();
    FORK(&left, pany, (v, mid, pred));
    right = pany(v + mid, n - mid, pred);
    JOIN();
    END();
    return left || right;
}

Flip || to && and invert the short-circuit logic to get pall.
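A serial twin of the flipped version; in the cactus version the first half would be a FORK and a shared bail flag would trip on the first failure (is_positive is a hypothetical example predicate):

```c
#include <stddef.h>

/* Serial sketch of pall: same bisection, && combine, failure short-circuits. */
static int pall(const int *v, size_t n, int (*pred)(int))
{
    if (n <= 4096) {
        for (size_t i = 0; i < n; i++)
            if (!pred(v[i])) return 0;   /* one failure sinks the whole query */
        return 1;
    }
    size_t mid = n / 2;
    return pall(v, mid, pred) && pall(v + mid, n - mid, pred);
}

/* example predicate (hypothetical) */
static int is_positive(int x) { return x > 0; }
```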

Parallel graph BFS layers

Process each BFS layer in parallel — recursively bisect the frontier and expand its nodes concurrently:

cactus void process_frontier(Graph *g, int *frontier, int nf,
                             int *next, int *nn, int *visited)
{
    if (nf <= 64) {
        for (int i = 0; i < nf; i++)
            expand(g, frontier[i], next, nn, visited);
        return;
    }
    int mid = nf / 2;
    BEGIN();
    FORK(process_frontier, (g, frontier, mid, next, nn, visited));
    process_frontier(g, frontier + mid, nf - mid, next, nn, visited);
    JOIN();
    END();
}

Nested parallelism

Cactus handles nested parallel regions naturally — a parallel map that calls a parallel reduce inside each element:

cactus double row_norm(const double *mat, int cols, int row)
{
    return sqrt(psum_sq(mat + row * cols, cols));   // psum_sq is itself parallel
}

cactus void all_norms(const double *mat, double *norms, int rows, int cols)
{
    if (rows <= 4) {
        for (int i = 0; i < rows; i++) norms[i] = row_norm(mat, cols, i);
        return;
    }
    int mid = rows / 2;
    BEGIN();
    FORK(all_norms, (mat, norms, mid, cols));
    all_norms(mat + mid * cols, norms + mid, rows - mid, cols);
    JOIN();
    END();
}

The runtime's work stealing distributes tasks across cores regardless of nesting depth — no manual flattening needed.

Environment variable control

# limit to 4 workers (useful for benchmarking scaling)
CACTUS_NPROCS=4 ./my_app

# single-threaded (serial execution, still uses the runtime)
CACTUS_NPROCS=1 ./my_app

Tests

Four built-in benchmarks verify correctness and measure performance:

| Test | What it exercises | Default n |
|---|---|---|
| fib | basic fork-join, deep recursion | 30 |
| nqueens | nested parallel backtracking | 11 |
| matmul | recursive divide-and-conquer with FP | 64 |
| quicksort | parallel partition + sort, 10M elements | 10000000 |
make test                         # run all
build/cc/fib 42                   # single test with custom n
CACTUS_NPROCS=4 build/cc/nqueens  # limit to 4 workers

Honest limitations

  • x86-64 Linux only — the inline assembly manipulates RSP/RBP directly. No ARM, no macOS, no Windows.

  • GCC or Clang required — uses compiler-specific extensions (nested functions on GCC, __label__ + disable_tail_calls on Clang).

  • No cancellation — once a task is forked, it runs to completion.

  • Fixed deque size — 1024 entries per worker. Deep recursion without a cutoff can overflow the deque.

  • Stack slabs are 2 MB each — a stolen continuation gets a fresh slab. Workloads with massive steal counts consume proportional virtual memory.

API reference

Runtime lifecycle

| Function | Description |
|---|---|
| `int cactus_start(int nprocs)` | Start the runtime with `nprocs` workers (0 = auto) |
| `int cactus_stop(void)` | Shut down all workers and free resources |
| `int cactus_workers(void)` | Return the number of active workers |

Parallel region macros

| Macro | Description |
|---|---|
| `BEGIN()` | Initialize a frame for the current parallel region |
| `FORK(&ret, fn, (args...))` | Spawn `fn(args...)`, store result in `ret` |
| `FORK(fn, (args...))` | Spawn `fn(args...)` with no return value |
| `JOIN()` | Wait for all forked children to complete |
| `END()` | End the parallel region (no-op, for readability) |

Function decorator

| Decorator | Description |
|---|---|
| `cactus` | Annotate functions that participate in fork-join parallelism |