A fast implementation of FFT (Fast Fourier Transform) over the BN254 scalar field, written in C with Go bindings.
EigenDA uses FFT to encode data for its distributed storage system. The existing Go implementation works, but it's slow. This C implementation is 1.3-1.7x faster, depending on FFT size (see the benchmarks below).
Raw Data → [FFT] → Reed-Solomon Encoding → [KZG Commitments] → Distributed Storage
FFT is the foundation. Speed up FFT, speed up the entire pipeline.
```
eigenda-fft-c/
├── include/              # Header files (interfaces)
│   ├── bn254_fr.h        # Field element operations
│   └── fft.h             # FFT operations
├── src/                  # Implementation
│   ├── bn254_fr.c        # Field arithmetic
│   └── fft.c             # FFT algorithm
├── go/                   # Go bindings
│   ├── fft.go            # CGO wrapper
│   └── fft_test.go       # Tests
├── benchmark/            # Performance testing
│   ├── main.go           # Benchmark runner
│   └── generate_graph.py
├── tests/                # C tests
│   ├── test_fft.c        # Unit tests
│   └── bench_fft.c       # C benchmarks
├── Makefile              # Build commands
└── README.md
```
This defines what a "number" looks like in our system.
```c
// A 256-bit number stored as 4 x 64-bit pieces
typedef struct {
    uint64_t limbs[4];
} fr_t;
```

Why 4 limbs? The BN254 field uses 256-bit numbers. A single `uint64_t` only holds 64 bits, so we need 4 of them: 4 × 64 = 256 bits.
What operations are defined:
- `fr_add()` - Add two field elements
- `fr_sub()` - Subtract
- `fr_mul()` - Multiply (this is the expensive one)
- `fr_inv()` - Find the multiplicative inverse (a^(-1) such that a × a^(-1) = 1)
```c
typedef struct {
    size_t max_width;   // Maximum FFT size (must be power of 2)
    fr_t *roots;        // Precomputed roots of unity
    fr_t *inv_roots;    // Inverse roots (for inverse FFT)
} fft_settings_t;
```

What are roots of unity? Special numbers ω where ω^n = 1. They're the "magic sauce" that makes FFT work. We precompute them once and reuse them.
Main functions:
- `fft_new(scale)` - Create settings for 2^scale elements
- `fft_forward()` - Transform time domain → frequency domain
- `fft_inverse()` - Transform frequency domain → time domain
- `fft_forward_inplace()` - Same, but modifies the input directly (faster)
This is where the math happens. Let me break down the key parts:
```c
// p = 21888242871839275222246405745257275088548364400416034343698204186575808495617
const fr_t FR_MODULUS = {{
    0x43e1f593f0000001ULL,  // limbs[0] - least significant
    0x2833e84879b97091ULL,  // limbs[1]
    0xb85045b68181585dULL,  // limbs[2]
    0x30644e72e131a029ULL   // limbs[3] - most significant
}};
```

All arithmetic happens modulo this prime. Any result ≥ p gets wrapped around.
Normal modular multiplication: (a × b) mod p requires division. Division is slow.
Montgomery trick: Convert numbers to a special form where we can avoid division.
```c
void fr_to_mont(fr_t *r, const fr_t *a);             // Convert to Montgomery form
void fr_from_mont(fr_t *r, const fr_t *a);           // Convert back
void fr_mul(fr_t *r, const fr_t *a, const fr_t *b);  // Multiply in Montgomery form
```

The `fr_mul` function uses 128-bit intermediate values (`__uint128_t`) to handle overflow during multiplication.
To find a^(-1), we use Fermat's Little Theorem:
a^(-1) = a^(p-2) mod p
This requires ~256 squarings and ~128 multiplications. Expensive, but we don't call it often.
The FFT transforms a list of N numbers using O(N log N) operations instead of O(N²).
```c
fft_settings_t* fft_new(uint8_t max_scale) {
    // Allocate space for 2^max_scale roots
    // Compute: roots[i] = ω^i where ω is a primitive root of unity
    // Compute: inv_roots[i] = ω^(-i) for the inverse FFT
}
```

The core of the FFT. For each pair of elements:
```c
// t = root × data[odd_index]
// u = data[even_index]          (saved before overwriting)
// data[even_index] = u + t
// data[odd_index]  = u - t
```

This "butterfly" pattern is repeated log₂(N) times across the data.
Recursive (easier to understand):

```c
static void fft_core(...) {
    if (n == 1) return;  // Base case
    // Process even indices
    fft_core(out, in, half, ...);
    // Process odd indices
    fft_core(out + half, in + stride, half, ...);
    // Combine with butterfly
    for (i = 0; i < half; i++) {
        // butterfly operation
    }
}
```

Iterative in-place (faster):
```c
static void fft_inplace_core(...) {
    // Step 1: Bit-reverse the indices
    // Step 2: Butterfly stages from small to large
    for (len = 2; len <= n; len *= 2) {
        // Process all butterflies of this size
    }
}
```

The iterative version is faster because it accesses memory sequentially (better cache usage).
This lets Go code call our C functions.
```go
/*
#cgo CFLAGS: -I${SRCDIR}/../include
#cgo LDFLAGS: -L${SRCDIR}/../lib -lfft_bn254
#include "fft.h"
*/
import "C"

// Go wrapper for the C function
func (fs *FFTSettings) Forward(in []Fr) ([]Fr, error) {
    n := len(in)
    out := make([]Fr, n)
    ret := C.fft_forward(
        fs.ptr,
        (*C.fr_t)(unsafe.Pointer(&out[0])),
        (*C.fr_t)(unsafe.Pointer(&in[0])),
        C.size_t(n),
    )
    // ...
}
```

Key insight: Go's `[]Fr` slice has the same memory layout as C's `fr_t[]` array, so we can pass pointers directly without copying.
Tests that verify correctness:
- Field operations: Does 3 + 5 = 8? Does 7 × 7^(-1) = 1?
- FFT roundtrip: Does inverse(forward(x)) = x?
- Large FFT: Does it work with 1024 elements?
```c
// Example test
fr_set_u64(&a, 3);
fr_set_u64(&b, 5);
fr_add(&c, &a, &b);
ASSERT(c.limbs[0] == 8, "3 + 5 = 8");
```

Measures how fast the FFT runs:
```go
func benchmarkCFFT(size int) time.Duration {
    // Setup
    fs, _ := cfft.NewFFTSettings(log2)
    input := make([]cfft.Fr, size)
    // Benchmark
    start := time.Now()
    for i := 0; i < 100; i++ {
        fs.Forward(input)
    }
    return time.Since(start) / 100
}
```

1. User calls Go function
↓
2. Go passes pointer to C via CGO
↓
3. C function fft_forward() runs
↓
4. For each butterfly: fr_mul() for field multiplication
↓
5. Montgomery multiplication avoids slow division
↓
6. Result returned to Go
```bash
# Build the C library
make

# Run C tests
make test

# Run C benchmarks (pure C, no Go)
make bench
```

This runs our C FFT and compares it against EigenDA's Go FFT results:

```bash
cd benchmark
go run main.go   # runs C benchmark, outputs comparison table + comparison.csv
```

The Go numbers are hardcoded from running EigenDA's own benchmark suite:

```bash
cd /path/to/eigenda
go test -bench=BenchmarkFFTFr -run=^$ ./encoding/v2/bench/ -benchtime=5s
```

To regenerate the graph:

```bash
cd benchmark
python3 -m venv venv
source venv/bin/activate
pip install matplotlib numpy
python3 generate_graph.py   # reads comparison.csv, outputs benchmark_graph.png
```

Compared against EigenDA's Go FFT (gnark-crypto field arithmetic) at the same sizes EigenDA uses in production.
| Size | Scale | Go (EigenDA) | C (ours) | Speedup |
|---|---|---|---|---|
| 512 | 2^9 | 97μs | 57μs | 1.7x |
| 16,384 | 2^14 | 5.5ms | 3.4ms | 1.6x |
| 524,288 | 2^19 | 269ms | 193ms | 1.4x |
| 4,194,304 | 2^22 | 2.88s | 2.30s | 1.3x |
The speedup is more pronounced at smaller sizes (1.7x) and tapers at larger sizes (1.3x). This is because at large sizes, memory bandwidth becomes the bottleneck rather than compute — both implementations end up waiting on RAM equally.
Where EigenDA uses these sizes:
- 2^9 — Per-chunk IFFT when generating chunks
- 2^14 — KZG multiproof generation
- 2^19 — Client-side blob processing
- 2^22 — RS encoding for max blob size (16 MiB)
- Montgomery multiplication — Avoids modular division by converting to a special representation where reduction is just shifts and adds
- 128-bit integers — C's `__uint128_t` compiles to native `mul` instructions. Go doesn't have this, so gnark-crypto uses assembly stubs
- No bounds checking — Every Go array access checks if the index is valid. Across millions of butterfly operations, this adds up
- `-O3 -march=native` — GCC optimizes specifically for the target CPU. Go's compiler is intentionally conservative
MIT