
SIMD & Parallelization Benchmark Results

This document summarizes the performance evaluation of the purecv library across four main compilation strategies:

  1. Standard: Sequential fallback mode (--no-default-features), without explicit target CPU optimizations.
  2. SIMD Only: Sequential mode compiled with RUSTFLAGS="-C target-cpu=native" to encourage LLVM auto-vectorization.
  3. Parallel: Enabled rayon multi-threading across available cores (--features parallel).
  4. Parallel + SIMD: Combined rayon parallelism alongside target-cpu=native for maximum theoretical throughput.

All tests operate on 1024x1024 image/matrix tensors using f32 (or u8, depending on the domain). Times shown are the median estimates reported by Criterion.rs.
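The four strategies above correspond to the following invocations (the flags are those named in the list; the exact bench target selection is omitted here):

```shell
# 1. Standard: sequential fallback, no target-CPU tuning
cargo bench --no-default-features

# 2. SIMD Only: sequential, but let LLVM auto-vectorize for the host CPU
RUSTFLAGS="-C target-cpu=native" cargo bench --no-default-features

# 3. Parallel: rayon multi-threading across available cores
cargo bench --features parallel

# 4. Parallel + SIMD: both combined
RUSTFLAGS="-C target-cpu=native" cargo bench --features parallel
```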


Feature Detection & Hough Transforms

Execution Date: 2026-04-17 (CET)

These benchmarks cover feature detection and Hough transforms. Both operation groups are heavily compute-bound and benefit significantly from parallelism. The Hough benchmarks use a 512x512 input to balance execution time and accuracy.

| Benchmark / Operation | Parallel + SIMD |
| --- | --- |
| corner_harris_1024x1024_f32 | 26.28 ms |
| hough_lines_512x512 | 3.58 ms |
| hough_lines_p_512x512 | 1.78 ms |
| hough_circles_512x512 | 4.51 ms |

Analysis

  • corner_harris: Involves Sobel derivatives, Gaussian blurring of the covariance matrix, and the Harris response calculation. At 26ms for a 1MP image, it is highly efficient for real-time applications.
  • Hough Transforms:
    • hough_lines (Standard) processes the full accumulator space.
    • hough_lines_p (Probabilistic) is ~2x faster by sampling points, as expected.
    • hough_circles uses the 2-1 Hough transform (gradient-based), achieving sub-5ms performance on 512x512 inputs.
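The two formulas driving these benchmarks fit in a few lines of plain Rust (a simplified illustration, not purecv's actual kernels): the Harris response R = det(M) − k·trace(M)² on the blurred structure tensor, and the line parameterization ρ = x·cosθ + y·sinθ that each edge pixel votes with in the Hough accumulator.

```rust
/// Harris corner response for one pixel of the (blurred) structure tensor
/// M = [[ixx, ixy], [ixy, iyy]]; k is the usual empirical constant (~0.04).
fn harris_response(ixx: f32, iyy: f32, ixy: f32, k: f32) -> f32 {
    let det = ixx * iyy - ixy * ixy;
    let trace = ixx + iyy;
    det - k * trace * trace
}

/// Hough line vote: the rho coordinate that pixel (x, y) contributes
/// for a given theta, via rho = x*cos(theta) + y*sin(theta).
fn hough_rho(x: f32, y: f32, theta: f32) -> f32 {
    x * theta.cos() + y * theta.sin()
}

fn main() {
    // A strong, isotropic gradient neighbourhood yields a positive response:
    // det = 4, trace = 4, R = 4 - 0.04 * 16 = 3.36.
    let r = harris_response(2.0, 2.0, 0.0, 0.04);
    assert!((r - 3.36).abs() < 1e-6);

    // Collinear points on the vertical line x = 5 all vote for rho = 5 at theta = 0.
    for y in [0.0, 3.0, 7.0] {
        assert!((hough_rho(5.0, y, 0.0) - 5.0).abs() < 1e-6);
    }
}
```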

New Benchmarks — count_non_zero, LUT & DFT

Execution Date: 2026-03-31 (CET)

Three new function groups have been benchmarked: count_non_zero, lut (Look-Up Table), and dft (Discrete Fourier Transform via rustfft). DFT benchmarks require the fft feature flag.

Note: The 4-strategy comparison table (Standard / SIMD Only / Parallel / Parallel+SIMD) does not apply to the LUT and DFT benchmarks below. LUT is a pure table lookup where SIMD provides no benefit, and DFT delegates to rustfft which manages its own internal SIMD — the parallel and simd feature flags have no effect on either. Results shown are with default features (Parallel enabled).

Arithm — count_non_zero & LUT

| Benchmark / Operation | Parallel (default) |
| --- | --- |
| count_non_zero_1024×1024_i32 | 116.03 µs |
| lut_1024×1024_u8 (1ch) | 501.25 µs |
| lut_1024×1024×3_u8_broadcast (3ch) | 855.86 µs |

count_non_zero is a simple reduction — very fast with rayon parallelism. LUT is a pure table lookup (memory-bound), scaling linearly with channel count (3ch ≈ 1.7× of 1ch).
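Both operations reduce to one-liner loops (sequential sketches of the behaviour; purecv's versions split the work across rayon threads):

```rust
/// Counts non-zero elements — a trivially parallel reduction.
fn count_non_zero(data: &[i32]) -> usize {
    data.iter().filter(|&&v| v != 0).count()
}

/// 256-entry look-up table applied per element — pure memory traffic,
/// which is why SIMD buys nothing here.
fn lut(src: &[u8], table: &[u8; 256]) -> Vec<u8> {
    src.iter().map(|&v| table[v as usize]).collect()
}

fn main() {
    assert_eq!(count_non_zero(&[0, 3, 0, -1, 7]), 3);

    // Example table: invert every value (a negative LUT).
    let mut invert = [0u8; 256];
    for (i, e) in invert.iter_mut().enumerate() {
        *e = 255 - i as u8;
    }
    assert_eq!(lut(&[0, 128, 255], &invert), vec![255, 127, 0]);
}
```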

DFT — Discrete Fourier Transform (--features fft)

| Benchmark / Operation | Time |
| --- | --- |
| get_optimal_dft_size | 18.76 ns |
| dft_forward_256×256_f32 | 454.03 µs |
| dft_inverse_256×256_f32 | 514.82 µs |
| dft_forward_512×512_f32 | 3.80 ms |
| dft_forward_1024×1024_f64 | 28.21 ms |

A 2D DFT on an n×n input scales as O(n² log n): 512×512 is ~8× slower than 256×256 — somewhat above the ~4.5× the complexity alone predicts (4× data plus the log-factor growth), with the remainder likely reflecting cache effects at the larger size. get_optimal_dft_size is a trivial binary search (~19 ns).
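Assuming OpenCV-like semantics for get_optimal_dft_size (the smallest n ≥ size whose only prime factors are 2, 3 and 5 — the sizes FFT libraries like rustfft handle most efficiently), its behaviour can be sketched as below. purecv's version is described as a binary search over a precomputed table; this direct scan is a behavioural sketch, not the actual implementation:

```rust
/// Smallest n >= size that is 5-smooth (only prime factors 2, 3, 5).
fn get_optimal_dft_size(size: usize) -> usize {
    let mut n = size.max(1);
    loop {
        // Strip out all factors of 2, 3 and 5; if nothing remains, n qualifies.
        let mut k = n;
        for p in [2, 3, 5] {
            while k % p == 0 {
                k /= p;
            }
        }
        if k == 1 {
            return n;
        }
        n += 1;
    }
}

fn main() {
    assert_eq!(get_optimal_dft_size(256), 256); // already a power of two
    assert_eq!(get_optimal_dft_size(257), 270); // 270 = 2 * 3^3 * 5
}
```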

Execution Date: 2026-03-17 (CET)

New Benchmarks — bilateral_filter & sobel_f32 (SIMD path)

Three new benchmarks were added to specifically evaluate the SIMD fast-path in fast_deriv_3x3 for f32 Sobel, and the heavy compute-bound bilateral_filter. Sobel f32 benchmarks use realistic non-zero data (sinusoidal gradient pattern) to fully exercise the computation.

| Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD |
| --- | --- | --- | --- | --- |
| bilateral_filter_512×512_u8 | 1.43 s | 1.51 s | 202.44 ms | 205.48 ms |
| sobel_3x3_f32_dx_1024×1024 | 28.59 ms | 6.26 ms | 1.96 ms | 1.28 ms |
| sobel_3x3_f32_dy_1024×1024 | 26.24 ms | 6.51 ms | 2.16 ms | 1.27 ms |

Analysis

  • sobel_3x3_f32 (dx/dy) — first benchmark where SIMD alone provides a massive standalone gain:

    • SIMD Only: 4.5× speedup over Standard — the simd_deriv_3x3_row_f32 kernel in fast_deriv_3x3 processes interior rows with vectorized 3×3 convolutions.
    • Parallel: 14× speedup — row-wise parallelism scales excellently.
    • Parallel + SIMD: 22× total speedup (28.59 ms → 1.28 ms) — the combination of SIMD kernels on each core produces the best result seen in the project.
    • Compared to the generic sobel_3x3 (which uses Matrix<f32> with zero data and mixed dx=1,dy=1), the f32-specific SIMD path is ~1.5× faster even in Parallel+SIMD (1.28 ms vs 1.87 ms).
  • bilateral_filter_512×512_u8 — heavily compute-bound (per-pixel Gaussian weight computation over a spatial/range neighborhood):

    • Parallel: 7.1× speedup (1.43 s → 202 ms) — each row's bilateral computation is independent.
    • SIMD Only: no gain (1.51 s, slightly worse) — the complex per-pixel exponential weight math is not trivially vectorizable.
    • Parallel + SIMD: equivalent to Parallel alone (~205 ms).
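The per-neighbour weight math that defeats vectorization in the bilateral filter looks roughly like this (a sketch of the standard bilateral weighting, not purecv's code):

```rust
/// Bilateral filter weight for a neighbour at spatial offset (dx, dy)
/// with intensity difference `di`: a product of a spatial Gaussian and a
/// range Gaussian — the per-pixel exponential math that resists SIMD.
fn bilateral_weight(dx: f32, dy: f32, di: f32, sigma_space: f32, sigma_range: f32) -> f32 {
    let spatial = -(dx * dx + dy * dy) / (2.0 * sigma_space * sigma_space);
    let range = -(di * di) / (2.0 * sigma_range * sigma_range);
    (spatial + range).exp()
}

fn main() {
    // The centre pixel (zero offset, zero intensity difference) gets weight 1.
    assert!((bilateral_weight(0.0, 0.0, 0.0, 3.0, 25.0) - 1.0).abs() < 1e-6);

    // Edge preservation: a large intensity jump collapses the weight
    // even for a spatially close neighbour.
    let near_similar = bilateral_weight(1.0, 0.0, 2.0, 3.0, 25.0);
    let near_across_edge = bilateral_weight(1.0, 0.0, 200.0, 3.0, 25.0);
    assert!(near_across_edge < near_similar * 1e-3);
}
```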

Key Differences vs Previous Results

| Metric | Previous best (sobel_3x3 generic) | New (sobel_3x3_f32 SIMD path) | Improvement |
| --- | --- | --- | --- |
| Standard | 22.79 ms | 28.59 ms | — (realistic data) |
| SIMD Only | 21.66 ms | 6.26 ms | 3.5× |
| Parallel | 1.87 ms | 1.96 ms | ~equal |
| Parallel + SIMD | 1.87 ms | 1.28 ms | 1.5× |

The generic sobel_3x3 benchmark used zero-initialized data and the mixed derivative (dx=1, dy=1), which masked the SIMD gains. With realistic f32 data and separate dx/dy, the pulp-powered SIMD kernel delivers a clear 1.5× additional speedup on top of parallelism.
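The interior fast path boils down to a branch-free row loop of this shape (a scalar sketch; the actual simd_deriv_3x3_row_f32 kernel uses explicit pulp vectors):

```rust
/// Sobel dx on the interior of one output row, given the three input rows
/// above/at/below it. Bounds-check-free indexing over 1..w-1 is what lets
/// LLVM (or an explicit SIMD kernel) vectorize the 3x3 convolution; the
/// two boundary columns are handled by a separate slow path.
fn sobel_dx_row(top: &[f32], mid: &[f32], bot: &[f32], out: &mut [f32]) {
    let w = out.len();
    for x in 1..w - 1 {
        // Kernel: [-1 0 1; -2 0 2; -1 0 1]
        out[x] = (top[x + 1] - top[x - 1])
            + 2.0 * (mid[x + 1] - mid[x - 1])
            + (bot[x + 1] - bot[x - 1]);
    }
}

fn main() {
    // On a horizontal ramp (pixel value == column index) the x-derivative
    // is constant: each kernel column pair differs by 2, weighted 1+2+1 = 4,
    // so every interior output is 8.
    let ramp: Vec<f32> = (0..8).map(|v| v as f32).collect();
    let mut out = vec![0.0f32; 8];
    sobel_dx_row(&ramp, &ramp, &ramp, &mut out);
    for x in 1..7 {
        assert!((out[x] - 8.0).abs() < 1e-6);
    }
}
```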


Execution Date: 2026-03-09 23:43 (CET)

Performance Comparison Table

| Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD |
| --- | --- | --- | --- | --- |
| matrix_add | 3.86 ms | 3.92 ms | 2.45 ms | 2.35 ms |
| matrix_sub | 3.86 ms | 3.87 ms | 2.49 ms | 2.39 ms |
| matrix_mul | 3.73 ms | 3.88 ms | 2.36 ms | 2.58 ms |
| matrix_div | 3.71 ms | 3.84 ms | 2.42 ms | 2.37 ms |
| cvt_color_rgb2gray | 3.85 ms | 1.52 ms | 666.32 µs | 475.45 µs |
| box_filter_3x3 | 13.91 ms | 14.35 ms | 3.39 ms | 3.61 ms |
| sobel_3x3 | 15.28 ms | 15.40 ms | 3.65 ms | 4.06 ms |

Analysis

  • Color Conversion: cvt_color_rgb2gray is highly vectorizable (straightforward pixel-by-pixel math without neighborhood lookups). Simply enabling -C target-cpu=native yields a massive 2.5x speedup on a single thread. Combining it with rayon parallelism achieves near 8x total speedup.
  • Spatial Filters: Neighborhood-dependent modules (box_filter and sobel_3x3) show strong gains from the parallel execution strategy (over 4x speedup), as filtering is heavily compute-bound and segments cleanly across rows. However, generic auto-vectorization (SIMD Only) has negligible impact due to the overlapping, non-contiguous memory access patterns characteristic of 3x3 sliding windows.
  • Matrix Arithmetic: Foundational element-wise tasks (add, sub, etc.) are bounded primarily by memory bandwidth on large 1024x1024 buffers, so instruction-level SIMD tuning yields diminishing returns compared to spreading memory throughput across parallel cores.

Conclusion: The refactored .chunks_exact() and .zip() iterators provide a robust performance baseline and scale well wherever LLVM's auto-vectorizer applies (demonstrated clearly by cvt_color), while remaining safe across all feature-flag combinations.
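The iterator style this conclusion describes looks roughly like the following for cvt_color_rgb2gray (a sketch assuming the common BT.601 luma weights; purecv's exact coefficients may differ):

```rust
/// RGB -> grayscale: chunks_exact(3) walks pixels without bounds checks,
/// and zip pairs each pixel with its output slot — a straight-line,
/// pixel-local pattern that LLVM auto-vectorizes well.
fn rgb2gray(rgb: &[u8], gray: &mut [u8]) {
    for (px, g) in rgb.chunks_exact(3).zip(gray.iter_mut()) {
        // BT.601 weights (assumed here): 0.299 R + 0.587 G + 0.114 B
        let v = 0.299 * px[0] as f32 + 0.587 * px[1] as f32 + 0.114 * px[2] as f32;
        *g = v.round() as u8;
    }
}

fn main() {
    let rgb = [255u8, 255, 255, 255, 0, 0]; // one white pixel, one red pixel
    let mut gray = [0u8; 2];
    rgb2gray(&rgb, &mut gray);
    assert_eq!(gray, [255, 76]); // 0.299 * 255 = 76.245 -> 76
}
```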

Algorithmic Optimizations

Execution Date: 2026-03-10

Following the initial SIMD and Parallelization benchmarks, targeted algorithmic optimizations were applied to specific operations to eliminate memory overhead and boundary-checking branches.

| Benchmark / Operation | Pre-Optimization (Parallel) | Post-Optimization (Parallel) | Speedup |
| --- | --- | --- | --- |
| sobel_3x3 | 3.65 ms | 2.08 ms | 1.75× (−43% time) |
| scharr_x | 3.70 ms | 2.11 ms | 1.75× (−43% time) |

Analysis

  • Derivatives (sobel_3x3, scharr_x): Replacing the generic sep_filter_2d implementation with an inlined fast_deriv_3x3 function brought execution times down to ~2.1ms. By explicitly unrolling the 3x3 convolution and separating the interior "fast path" from the boundary "slow path," we completely eliminated intermediate allocations and significantly improved cache locality.

Execution Date: 2026-03-15 (CET)

Arithm — Performance Comparison Table (17 benchmarks)

| Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD |
| --- | --- | --- | --- | --- |
| matrix_add | 3.26 ms | 3.35 ms | 2.08 ms | 2.05 ms |
| matrix_sub | 3.24 ms | 3.37 ms | 2.06 ms | 2.11 ms |
| matrix_mul | 3.20 ms | 3.13 ms | 2.09 ms | 2.09 ms |
| matrix_div | 3.24 ms | 3.10 ms | 2.10 ms | 2.10 ms |
| dot | 997.62 µs | 971.72 µs | 156.54 µs | 157.72 µs |
| magnitude | 756.23 µs | 754.99 µs | 184.35 µs | 184.42 µs |
| norm_l2 | 984.65 µs | 964.61 µs | 151.90 µs | 160.90 µs |
| add_weighted | 3.53 ms | 3.13 ms | 2.23 ms | 2.30 ms |
| convert_scale_abs | 2.54 ms | 1.66 ms | 610.46 µs | 477.11 µs |
| sum | 975.51 µs | 966.94 µs | 162.20 µs | 164.41 µs |
| mean | 969.08 µs | 964.49 µs | 163.92 µs | 167.97 µs |
| bitwise_and | 255.18 µs | 264.72 µs | 376.10 µs | 445.06 µs |
| min | 3.08 ms | 3.06 ms | 2.07 ms | 2.11 ms |
| absdiff | 3.10 ms | 3.04 ms | 2.06 ms | 2.15 ms |
| normalize | 2.75 ms | 1.55 ms | 1.24 ms | 1.32 ms |
| sqrt | 939.04 µs | 945.49 µs | 858.87 µs | 847.33 µs |
| gemm_256×256 | 15.71 ms | 15.69 ms | 4.29 ms | 4.40 ms |

Arithm Analysis

  • Reduction operations (dot, norm_l2, sum, mean) scale brilliantly with rayon: ~6× speedup from Standard → Parallel, driven by embarrassingly-parallel sum-accumulation.
  • Element-wise binary ops (add, sub, mul, div, min, absdiff) gain a modest ~1.5× speedup with Parallel. These are memory-bandwidth-bound; additional SIMD provides negligible further gain.
  • convert_scale_abs is the standout: SIMD gives 1.5× (sequential), and Parallel+SIMD reaches 5.3× total speedup — a combination of vectorizable math and parallelizable rows.
  • bitwise_and is fastest in Standard mode (255 µs). Parallel overhead (~380–445 µs) exceeds the benefit for this ultra-cheap 1-byte-per-element operation.
  • gemm_256×256 benefits enormously from Parallel (3.7×), confirming that matrix multiplication is compute-bound and partitions well across cores.
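The standout convert_scale_abs is entirely element-local, which is why it both vectorizes and parallelizes. A minimal sketch, assuming OpenCV-style |alpha·x + beta| semantics with a saturating u8 cast:

```rust
/// |alpha * x + beta|, rounded and saturated into u8 — every step is
/// element-local, so the loop vectorizes and splits cleanly across rows.
fn convert_scale_abs(src: &[f32], alpha: f32, beta: f32) -> Vec<u8> {
    src.iter()
        .map(|&v| (v * alpha + beta).abs().round().min(255.0) as u8)
        .collect()
}

fn main() {
    // Negative values are reflected, values above 255 saturate.
    assert_eq!(convert_scale_abs(&[-2.0, 300.0, 1.4], 1.0, 0.0), vec![2, 255, 1]);
    assert_eq!(convert_scale_abs(&[100.0], 2.0, 10.0), vec![210]);
}
```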

ImgProc — Performance Comparison Table (11 benchmarks)

| Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD |
| --- | --- | --- | --- | --- |
| cvt_color_rgb2gray | 2.66 ms | 1.41 ms | 529.63 µs | 404.23 µs |
| box_filter_3x3 | 9.01 ms | 9.31 ms | 3.07 ms | 3.69 ms |
| sobel_3x3 | 22.79 ms | 21.66 ms | 1.87 ms | 1.87 ms |
| scharr_x | 22.76 ms | 21.67 ms | 1.78 ms | 1.85 ms |
| threshold_binary | 1.31 ms | 1.43 ms | 626.70 µs | 609.22 µs |
| cvt_color_bgr2gray | 2.69 ms | —† | 617.94 µs | 368.43 µs |
| cvt_color_rgba2gray | 2.64 ms | 2.43 ms | 676.02 µs | 354.37 µs |
| gaussian_blur_3x3 | 11.01 ms | —† | 3.53 ms | 3.50 ms |
| gaussian_blur_5x5 | 15.21 ms | 16.37 ms | 4.30 ms | 4.72 ms |
| laplacian_3x3 | 45.91 ms | 44.56 ms | 4.41 ms | 4.44 ms |
| canny | 57.61 ms | 62.62 ms | 12.70 ms | 12.54 ms |

† SIMD Only values for cvt_color_bgr2gray (249 ms) and gaussian_blur_3x3 (18 ms) are anomalous — likely impacted by system activity during the run. Excluded from comparison.

ImgProc Analysis

  • Color Conversion (cvt_color_rgb2gray): highly vectorizable pixel-by-pixel math. SIMD alone yields 1.9×, Parallel brings 5.0×, and combined reaches 6.6× total speedup.
  • Derivatives (sobel_3x3, scharr_x): the optimized fast_deriv_3x3 implementation scales extremely well with Parallel — 12× speedup. These benefit from row-wise parallelism where each row's 3×3 convolution is independent.
  • laplacian_3x3: composed of two Sobel passes, gains an enormous 10.4× speedup with Parallel, consistent with the derivative scaling.
  • canny: multi-stage pipeline (Sobel + gradient + NMS + hysteresis). Achieves 4.6× speedup with Parallel — limited by the sequential hysteresis tracking phase.
  • Spatial Filters (box_filter, gaussian_blur): ~3× speedup with Parallel. The sliding-window pattern limits SIMD gains due to non-sequential memory access.
  • threshold_binary: simple per-pixel operation with modest 2.1× speedup — limited by memory bandwidth.
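threshold_binary's per-pixel branch is so cheap that memory bandwidth, not compute, sets the floor. Sketched below with OpenCV-style `v > thresh` semantics, which may differ from purecv's exact comparison:

```rust
/// Binary threshold: maxval where the pixel exceeds thresh, 0 elsewhere.
/// One compare and one select per byte — purely memory-bandwidth-bound.
fn threshold_binary(src: &[u8], thresh: u8, maxval: u8) -> Vec<u8> {
    src.iter()
        .map(|&v| if v > thresh { maxval } else { 0 })
        .collect()
}

fn main() {
    // 128 is not strictly greater than the threshold, so it maps to 0.
    assert_eq!(threshold_binary(&[10, 128, 200], 128, 255), vec![0, 0, 255]);
}
```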

Structural — Performance Comparison Table (4 benchmarks)

| Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD |
| --- | --- | --- | --- | --- |
| flip_horiz | 5.03 ms | 5.81 ms | 976.23 µs | 971.35 µs |
| transpose | 7.42 ms | 7.29 ms | 1.09 ms | 1.14 ms |
| split | 3.21 ms | 3.21 ms | 1.24 ms | 1.31 ms |
| merge | 2.49 ms | 2.27 ms | 932.68 µs | — |

merge in Parallel+SIMD was not captured (benchmark run interrupted).

Structural Analysis

  • All structural ops are memory-bound: SIMD Only provides zero benefit (or even slight regression due to instruction overhead on non-vectorizable scatter/gather patterns).
  • Parallel provides 2.6–6.8× speedup across the board: flip (5.2×) and transpose (6.8×) benefit most because each row can be independently relocated.
  • merge and split (~2.5–2.7×) are still memory-bandwidth-limited even with parallelism, as they involve strided interleave/deinterleave across 3 channels.
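The strided interleave/deinterleave pattern behind split and merge is visible in a few lines (an illustrative sequential sketch, not purecv's implementation):

```rust
/// Deinterleave RGBRGB... into three planes — every third byte per plane,
/// the strided access pattern that keeps split memory-bandwidth-bound.
fn split3(interleaved: &[u8]) -> (Vec<u8>, Vec<u8>, Vec<u8>) {
    let mut r = Vec::new();
    let mut g = Vec::new();
    let mut b = Vec::new();
    for px in interleaved.chunks_exact(3) {
        r.push(px[0]);
        g.push(px[1]);
        b.push(px[2]);
    }
    (r, g, b)
}

/// Re-interleave three planes back into RGBRGB...
fn merge3(r: &[u8], g: &[u8], b: &[u8]) -> Vec<u8> {
    r.iter()
        .zip(g)
        .zip(b)
        .flat_map(|((&r, &g), &b)| [r, g, b])
        .collect()
}

fn main() {
    let src = vec![1u8, 2, 3, 4, 5, 6];
    let (r, g, b) = split3(&src);
    assert_eq!(r, vec![1, 4]);
    assert_eq!(merge3(&r, &g, &b), src); // round-trip is lossless
}
```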

Summary — Best Configuration per Category

| Category | Best Config | Typical Speedup vs Standard |
| --- | --- | --- |
| Element-wise ops | Parallel | 1.5× (memory-bound) |
| Reductions | Parallel | 6× (embarrassingly parallel) |
| Color conversion | Parallel + SIMD | 6.6× (vectorizable + parallel) |
| Spatial filters | Parallel | 3× (row-parallel convolution) |
| Derivatives | Parallel | 12× (optimized fast path) |
| Derivatives (f32 SIMD) | Parallel + SIMD | 22× (SIMD kernel + parallel) |
| Bilateral filter | Parallel | 7.1× (compute-bound, no SIMD gain) |
| Canny edge detect | Parallel + SIMD | 4.6× (multi-stage pipeline) |
| Structural ops | Parallel | 5× (independent row ops) |
| GEMM | Parallel | 3.7× (compute-bound) |
| Feature Detection | Parallel | Harris Corner ~26ms (1024x1024) |
| Hough Transform | Parallel | Lines ~3.6ms / Circles ~4.5ms (512x512) |

Conclusion: Parallelism (rayon) is the dominant optimization for nearly all operations. SIMD (target-cpu=native) provides meaningful additional gains primarily for pixel-level math (cvt_color, convert_scale_abs) and f32 derivative kernels (sobel_3x3_f32) where the inner loop is trivially vectorizable. The sobel_3x3_f32 SIMD path achieves the project's highest combined speedup at 22×. For memory-bound or scatter/gather patterns, SIMD adds no benefit.