Skip to content

owizdom/eigenda-fft-c

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BN254 FFT in C

A fast implementation of FFT (Fast Fourier Transform) over the BN254 scalar field, written in C with Go bindings.

What is this?

EigenDA uses FFT to encode data for its distributed storage system. The existing Go implementation works, but it's slow. This C implementation is ~2.5x faster.

Why FFT matters for EigenDA

Raw Data → [FFT] → Reed-Solomon Encoding → [KZG Commitments] → Distributed Storage

FFT is the foundation. Speed up FFT, speed up the entire pipeline.


Project Structure

eigenda-fft-c/
├── include/           # Header files (interfaces)
│   ├── bn254_fr.h     # Field element operations
│   └── fft.h          # FFT operations
├── src/               # Implementation
│   ├── bn254_fr.c     # Field arithmetic
│   └── fft.c          # FFT algorithm
├── go/                # Go bindings
│   ├── fft.go         # CGO wrapper
│   └── fft_test.go    # Tests
├── benchmark/         # Performance testing
│   ├── main.go        # Benchmark runner
│   └── generate_graph.py
├── tests/             # C tests
│   ├── test_fft.c     # Unit tests
│   └── bench_fft.c    # C benchmarks
├── Makefile           # Build commands
└── README.md

File-by-File Breakdown

include/bn254_fr.h - Field Element Interface

This defines what a "number" looks like in our system.

// A 256-bit number stored as 4 x 64-bit pieces
typedef struct {
    uint64_t limbs[4];
} fr_t;

Why 4 limbs? The BN254 field uses 256-bit numbers. A single uint64_t only holds 64 bits. So we need 4 of them: 4 × 64 = 256 bits.

What operations are defined:

  • fr_add() - Add two field elements
  • fr_sub() - Subtract
  • fr_mul() - Multiply (this is the expensive one)
  • fr_inv() - Find multiplicative inverse (a^(-1) such that a × a^(-1) = 1)

include/fft.h - FFT Interface

typedef struct {
    size_t max_width;    // Maximum FFT size (must be power of 2)
    fr_t *roots;         // Precomputed roots of unity
    fr_t *inv_roots;     // Inverse roots (for inverse FFT)
} fft_settings_t;

What are roots of unity? Special numbers where ω^n = 1. They're the "magic sauce" that makes FFT work. We precompute them once and reuse.

Main functions:

  • fft_new(scale) - Create settings for 2^scale elements
  • fft_forward() - Transform time domain → frequency domain
  • fft_inverse() - Transform frequency domain → time domain
  • fft_forward_inplace() - Same but modifies input directly (faster)

src/bn254_fr.c - Field Arithmetic Implementation

This is where the math happens. Let me break down the key parts:

The Prime Modulus

// p = 21888242871839275222246405745257275088548364400416034343698204186575808495617
const fr_t FR_MODULUS = {{
    0x43e1f593f0000001ULL,  // limbs[0] - least significant
    0x2833e84879b97091ULL,  // limbs[1]
    0xb85045b68181585dULL,  // limbs[2]
    0x30644e72e131a029ULL   // limbs[3] - most significant
}};

All arithmetic happens modulo this prime. Any result ≥ p gets wrapped around.

Montgomery Multiplication

Normal modular multiplication: (a × b) mod p requires division. Division is slow.

Montgomery trick: Convert numbers to a special form where we can avoid division.

void fr_to_mont(fr_t *r, const fr_t *a);   // Convert to Montgomery form
void fr_from_mont(fr_t *r, const fr_t *a); // Convert back
void fr_mul(fr_t *r, const fr_t *a, const fr_t *b); // Multiply in Montgomery form

The fr_mul function uses 128-bit intermediate values (__uint128_t) to handle overflow during multiplication.

Modular Inverse

To find a^(-1), we use Fermat's Little Theorem:

a^(-1) = a^(p-2) mod p

This requires ~256 squarings and ~128 multiplications. Expensive, but we don't call it often.


src/fft.c - FFT Implementation

The FFT transforms a list of N numbers using O(N log N) operations instead of O(N²).

Precomputing Roots

fft_settings_t* fft_new(uint8_t max_scale) {
    // Allocate space for 2^max_scale roots
    // Compute: roots[i] = ω^i where ω is primitive root of unity
    // Compute: inv_roots[i] = ω^(-i) for inverse FFT
}

The Butterfly Operation

The core of FFT. For each pair of elements:

// t = root × data[odd_index]
// data[even] = data[even] + t
// data[odd]  = data[even] - t

This "butterfly" pattern is repeated log₂(N) times across the data.

Two Implementations

Recursive (easier to understand):

static void fft_core(...) {
    if (n == 1) return;  // Base case

    // Process even indices
    fft_core(out, in, half, ...);
    // Process odd indices
    fft_core(out + half, in + stride, half, ...);

    // Combine with butterfly
    for (i = 0; i < half; i++) {
        // butterfly operation
    }
}

Iterative in-place (faster):

static void fft_inplace_core(...) {
    // Step 1: Bit-reverse the indices
    // Step 2: Butterfly stages from small to large
    for (len = 2; len <= n; len *= 2) {
        // Process all butterflies of this size
    }
}

The iterative version is faster because it accesses memory sequentially (better cache usage).


go/fft.go - CGO Bindings

This lets Go code call our C functions.

/*
#cgo CFLAGS: -I${SRCDIR}/../include
#cgo LDFLAGS: -L${SRCDIR}/../lib -lfft_bn254
#include "fft.h"
*/
import "C"

// Go wrapper for C function
func (fs *FFTSettings) Forward(in []Fr) ([]Fr, error) {
    ret := C.fft_forward(
        fs.ptr,
        (*C.fr_t)(unsafe.Pointer(&out[0])),
        (*C.fr_t)(unsafe.Pointer(&in[0])),
        C.size_t(n),
    )
    // ...
}

Key insight: Go's []Fr slice has the same memory layout as C's fr_t[] array, so we can pass pointers directly without copying.


tests/test_fft.c - Unit Tests

Tests that verify correctness:

  1. Field operations: Does 3 + 5 = 8? Does 7 × 7^(-1) = 1?
  2. FFT roundtrip: Does inverse(forward(x)) = x?
  3. Large FFT: Does it work with 1024 elements?
// Example test
fr_set_u64(&a, 3);
fr_set_u64(&b, 5);
fr_add(&c, &a, &b);
ASSERT(c.limbs[0] == 8, "3 + 5 = 8");

benchmark/main.go - Performance Testing

Measures how fast the FFT runs:

func benchmarkCFFT(size int) time.Duration {
    // Setup
    fs, _ := cfft.NewFFTSettings(log2)
    input := make([]cfft.Fr, size)

    // Benchmark
    start := time.Now()
    for i := 0; i < 100; i++ {
        fs.Forward(input)
    }
    return time.Since(start) / 100
}

How It All Connects

1. User calls Go function
   ↓
2. Go passes pointer to C via CGO
   ↓
3. C function fft_forward() runs
   ↓
4. For each butterfly: fr_mul() for field multiplication
   ↓
5. Montgomery multiplication avoids slow division
   ↓
6. Result returned to Go

Building & Running

# Build the C library
make

# Run C tests
make test

# Run C benchmarks (pure C, no Go)
make bench

Running the comparison benchmark

This runs our C FFT and compares it against EigenDA's Go FFT results.

cd benchmark
go run main.go    # runs C benchmark, outputs comparison table + comparison.csv

The Go numbers are hardcoded from running EigenDA's own benchmark suite:

cd /path/to/eigenda
go test -bench=BenchmarkFFTFr -run=^$ ./encoding/v2/bench/ -benchtime=5s

Generating the graph

cd benchmark
python3 -m venv venv
source venv/bin/activate
pip install matplotlib numpy
python3 generate_graph.py   # reads comparison.csv, outputs benchmark_graph.png

Performance Results (Apple M1)

Compared against EigenDA's Go FFT (gnark-crypto field arithmetic) at the same sizes EigenDA uses in production.

Size Scale Go (EigenDA) C (ours) Speedup
512 2^9 97μs 57μs 1.7x
16,384 2^14 5.5ms 3.4ms 1.6x
524,288 2^19 269ms 193ms 1.4x
4,194,304 2^22 2.88s 2.30s 1.3x

The speedup is more pronounced at smaller sizes (1.7x) and tapers at larger sizes (1.3x). This is because at large sizes, memory bandwidth becomes the bottleneck rather than compute — both implementations end up waiting on RAM equally.

Where EigenDA uses these sizes:

  • 2^9 — Per-chunk IFFT when generating chunks
  • 2^14 — KZG multiproof generation
  • 2^19 — Client-side blob processing
  • 2^22 — RS encoding for max blob size (16 MiB)

Why It's Faster

  1. Montgomery multiplication — Avoids modular division by converting to a special representation where reduction is just shifts and adds
  2. 128-bit integers — C's __uint128_t compiles to native mul instructions. Go doesn't have this, so gnark-crypto uses assembly stubs
  3. No bounds checking — Every Go array access checks if the index is valid. Across millions of butterfly operations, this adds up
  4. -O3 -march=native — GCC optimizes specifically for the target CPU. Go's compiler is intentionally conservative

License

MIT

About

a fast implementation of FFT (Fast Fourier Transform) over the BN254 scalar field, written in C with Go bindings for EigenDA.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors