A fast implementation of FFT (Fast Fourier Transform) over the BN254 scalar field, written in C with Go bindings.
EigenDA uses FFT to encode data for its distributed storage system. The existing Go implementation works, but it's slow. This C implementation is 1.3-1.7x faster, depending on FFT size (see the benchmarks below).
Raw Data → [FFT] → Reed-Solomon Encoding → [KZG Commitments] → Distributed Storage
FFT is the foundation. Speed up FFT, speed up the entire pipeline.
```
eigenda-fft-c/
├── include/              # Header files (interfaces)
│   ├── bn254_fr.h        # Field element operations
│   └── fft.h             # FFT operations
├── src/                  # Implementation
│   ├── bn254_fr.c        # Field arithmetic
│   └── fft.c             # FFT algorithm
├── go/                   # Go bindings
│   ├── fft.go            # CGO wrapper
│   └── fft_test.go       # Tests
├── benchmark/            # Performance testing
│   ├── main.go           # Benchmark runner
│   └── generate_graph.py
├── tests/                # C tests
│   ├── test_fft.c        # Unit tests
│   └── bench_fft.c       # C benchmarks
├── Makefile              # Build commands
└── README.md
```
This defines what a "number" looks like in our system.
```c
// A 256-bit number stored as 4 x 64-bit pieces
typedef struct {
    uint64_t limbs[4];
} fr_t;
```

Why 4 limbs? The BN254 field uses 256-bit numbers. A single `uint64_t` only holds 64 bits, so we need 4 of them: 4 × 64 = 256 bits.
What operations are defined:
- `fr_add()` - Add two field elements
- `fr_sub()` - Subtract
- `fr_mul()` - Multiply (this is the expensive one)
- `fr_inv()` - Find the multiplicative inverse (a^(-1) such that a × a^(-1) = 1)
```c
typedef struct {
    size_t max_width;   // Maximum FFT size (must be power of 2)
    fr_t *roots;        // Precomputed roots of unity
    fr_t *inv_roots;    // Inverse roots (for inverse FFT)
} fft_settings_t;
```

What are roots of unity? Special numbers ω where ω^n = 1. They're the "magic sauce" that makes FFT work. We precompute them once and reuse them.
Main functions:
- `fft_new(scale)` - Create settings for 2^scale elements
- `fft_forward()` - Transform time domain → frequency domain
- `fft_inverse()` - Transform frequency domain → time domain
- `fft_forward_inplace()` - Same, but modifies the input directly (faster)
This is where the math happens. Let me break down the key parts:
```c
// p = 21888242871839275222246405745257275088548364400416034343698204186575808495617
const fr_t FR_MODULUS = {{
    0x43e1f593f0000001ULL,  // limbs[0] - least significant
    0x2833e84879b97091ULL,  // limbs[1]
    0xb85045b68181585dULL,  // limbs[2]
    0x30644e72e131a029ULL   // limbs[3] - most significant
}};
```

All arithmetic happens modulo this prime. Any result ≥ p gets wrapped around.
Normal modular multiplication: (a × b) mod p requires division. Division is slow.
Montgomery trick: Convert numbers to a special form where we can avoid division.
```c
void fr_to_mont(fr_t *r, const fr_t *a);             // Convert to Montgomery form
void fr_from_mont(fr_t *r, const fr_t *a);           // Convert back
void fr_mul(fr_t *r, const fr_t *a, const fr_t *b);  // Multiply in Montgomery form
```

The `fr_mul` function uses 128-bit intermediate values (`__uint128_t`) to handle overflow during multiplication.
To find a^(-1), we use Fermat's Little Theorem:
a^(-1) = a^(p-2) mod p
This requires ~256 squarings and ~128 multiplications. Expensive, but we don't call it often.
The FFT transforms a list of N numbers using O(N log N) operations instead of O(N²).
```c
fft_settings_t* fft_new(uint8_t max_scale) {
    // Allocate space for 2^max_scale roots
    // Compute: roots[i] = ω^i where ω is a primitive root of unity
    // Compute: inv_roots[i] = ω^(-i) for the inverse FFT
}
```

The core of the FFT. For each pair of elements:
```c
// t = root × data[odd_index]
// u = data[even_index]          (saved before overwriting)
// data[even_index] = u + t
// data[odd_index]  = u - t
```

This "butterfly" pattern is repeated log₂(N) times across the data.
Recursive (easier to understand):

```c
static void fft_core(...) {
    if (n == 1) return;  // Base case
    // Process even indices
    fft_core(out, in, half, ...);
    // Process odd indices
    fft_core(out + half, in + stride, half, ...);
    // Combine with butterfly
    for (i = 0; i < half; i++) {
        // butterfly operation
    }
}
```

Iterative in-place (faster):
```c
static void fft_inplace_core(...) {
    // Step 1: Bit-reverse the indices
    // Step 2: Butterfly stages from small to large
    for (len = 2; len <= n; len *= 2) {
        // Process all butterflies of this size
    }
}
```

The iterative version is faster because it accesses memory sequentially (better cache usage).
This lets Go code call our C functions.
```go
/*
#cgo CFLAGS: -I${SRCDIR}/../include
#cgo LDFLAGS: -L${SRCDIR}/../lib -lfft_bn254
#include "fft.h"
*/
import "C"

// Go wrapper for the C function
func (fs *FFTSettings) Forward(in []Fr) ([]Fr, error) {
    n := len(in)
    out := make([]Fr, n)
    ret := C.fft_forward(
        fs.ptr,
        (*C.fr_t)(unsafe.Pointer(&out[0])),
        (*C.fr_t)(unsafe.Pointer(&in[0])),
        C.size_t(n),
    )
    // ...
}
```

Key insight: Go's `[]Fr` slice has the same memory layout as C's `fr_t[]` array, so we can pass pointers directly without copying.
Tests that verify correctness:
- Field operations: Does 3 + 5 = 8? Does 7 × 7^(-1) = 1?
- FFT roundtrip: Does inverse(forward(x)) = x?
- Large FFT: Does it work with 1024 elements?
```c
// Example test
fr_set_u64(&a, 3);
fr_set_u64(&b, 5);
fr_add(&c, &a, &b);
ASSERT(c.limbs[0] == 8, "3 + 5 = 8");
```

Measures how fast the FFT runs:
```go
func benchmarkCFFT(size int) time.Duration {
    // Setup
    fs, _ := cfft.NewFFTSettings(log2)
    input := make([]cfft.Fr, size)
    // Benchmark
    start := time.Now()
    for i := 0; i < 100; i++ {
        fs.Forward(input)
    }
    return time.Since(start) / 100
}
```

1. User calls Go function
↓
2. Go passes pointer to C via CGO
↓
3. C function fft_forward() runs
↓
4. For each butterfly: fr_mul() for field multiplication
↓
5. Montgomery multiplication avoids slow division
↓
6. Result returned to Go
```bash
# Build the C library
make

# Run C tests
make test

# Run C benchmarks (pure C, no Go)
make bench
```

This runs our C FFT and compares it against EigenDA's Go FFT results:

```bash
cd benchmark
go run main.go   # runs C benchmark, outputs comparison table + comparison.csv
```

The Go numbers are hardcoded from running EigenDA's own benchmark suite:

```bash
cd /path/to/eigenda
go test -bench=BenchmarkFFTFr -run=^$ ./encoding/v2/bench/ -benchtime=5s
```

To regenerate the graph:

```bash
cd benchmark
python3 -m venv venv
source venv/bin/activate
pip install matplotlib numpy
python3 generate_graph.py   # reads comparison.csv, outputs benchmark_graph.png
```

Compared against EigenDA's Go FFT (gnark-crypto field arithmetic) at the same sizes EigenDA uses in production.
| Size | Scale | Go (EigenDA) | C (ours) | Speedup |
|---|---|---|---|---|
| 512 | 2^9 | 97μs | 57μs | 1.7x |
| 16,384 | 2^14 | 5.5ms | 3.4ms | 1.6x |
| 524,288 | 2^19 | 269ms | 193ms | 1.4x |
| 4,194,304 | 2^22 | 2.88s | 2.30s | 1.3x |
The speedup is more pronounced at smaller sizes (1.7x) and tapers at larger sizes (1.3x). This is because at large sizes, memory bandwidth becomes the bottleneck rather than compute — both implementations end up waiting on RAM equally.
Where EigenDA uses these sizes:
- 2^9 — Per-chunk IFFT when generating chunks
- 2^14 — KZG multiproof generation
- 2^19 — Client-side blob processing
- 2^22 — RS encoding for max blob size (16 MiB)
- Montgomery multiplication — Avoids modular division by converting to a special representation where reduction is just shifts and adds
- 128-bit integers — C's `__uint128_t` compiles to native `mul` instructions. Go doesn't have this, so gnark-crypto uses assembly stubs
- No bounds checking — Every Go array access checks if the index is valid. Across millions of butterfly operations, this adds up
- `-O3 -march=native` — GCC optimizes specifically for the target CPU. Go's compiler is intentionally conservative
MIT