This document highlights the performance evaluation of the purecv library across four main compilation strategies:
- **Standard**: Sequential fallback mode (`--no-default-features`), without explicit target CPU optimizations.
- **SIMD Only**: Sequential mode compiled with `RUSTFLAGS="-C target-cpu=native"` to encourage LLVM auto-vectorization.
- **Parallel**: Enables `rayon` multi-threading across available cores (`--features parallel`).
- **Parallel + SIMD**: Combines `rayon` parallelism with `target-cpu=native` for maximum theoretical throughput.
All tests operate on 1024x1024 image/matrix inputs using f32 (or u8, depending on the domain context). Times shown are median execution times as measured by Criterion.rs.
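For reference, the four configurations correspond to standard Cargo invocations along the following lines (a sketch assuming the feature names listed above; exact bench target names may differ):

```shell
# Standard: sequential fallback, no CPU-specific codegen
cargo bench --no-default-features

# SIMD Only: let LLVM auto-vectorize for the host CPU
RUSTFLAGS="-C target-cpu=native" cargo bench --no-default-features

# Parallel: rayon multi-threading
cargo bench --features parallel

# Parallel + SIMD
RUSTFLAGS="-C target-cpu=native" cargo bench --features parallel
```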
Execution Date: 2026-04-17 (CET)
New benchmarks for Feature Detection and Hough Transforms. These operations are heavily compute-bound and benefit significantly from parallelism. Hough benchmarks use a 512x512 input to balance execution time and accuracy.
| Benchmark / Operation | Parallel + SIMD |
|---|---|
| `corner_harris_1024x1024_f32` | 26.28 ms |
| `hough_lines_512x512` | 3.58 ms |
| `hough_lines_p_512x512` | 1.78 ms |
| `hough_circles_512x512` | 4.51 ms |
- `corner_harris`: Involves Sobel derivatives, Gaussian blurring of the covariance matrix, and the Harris response calculation. At ~26 ms for a 1 MP image, it is highly efficient for real-time applications.
- Hough Transforms: `hough_lines` (Standard) processes the full accumulator space. `hough_lines_p` (Probabilistic) is ~2x faster by sampling points, as expected. `hough_circles` uses the 2-1 Hough transform (gradient-based), achieving sub-5ms performance on 512x512 inputs.
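The standard Hough voting loop that dominates `hough_lines` can be sketched as follows. This is a minimal std-only illustration, not the purecv implementation; accumulator resolution and peak extraction are deliberately simplified:

```rust
// Minimal standard Hough line voting: each edge pixel votes for every
// (rho, theta) bin it could lie on; the peak bin is the dominant line.
fn hough_lines(points: &[(usize, usize)], w: usize, h: usize, n_theta: usize) -> (usize, f64) {
    let rho_max = ((w * w + h * h) as f64).sqrt();
    let n_rho = 2 * rho_max.ceil() as usize + 1;
    let mut acc = vec![0u32; n_rho * n_theta];
    for &(x, y) in points {
        for t in 0..n_theta {
            let theta = t as f64 * std::f64::consts::PI / n_theta as f64;
            let rho = x as f64 * theta.cos() + y as f64 * theta.sin();
            let r = (rho + rho_max).round() as usize; // shift so rho >= 0
            acc[r * n_theta + t] += 1;
        }
    }
    // Return the vote count of the peak bin and its theta (radians).
    let (best, votes) = acc.iter().enumerate().max_by_key(|&(_, v)| v).unwrap();
    (*votes as usize, (best % n_theta) as f64 * std::f64::consts::PI / n_theta as f64)
}

fn main() {
    // A perfectly vertical line x = 5: expect the peak at theta = 0 with 20 votes.
    let pts: Vec<(usize, usize)> = (0..20).map(|y| (5, y)).collect();
    let (votes, theta) = hough_lines(&pts, 32, 32, 180);
    assert_eq!(votes, 20);
    assert!(theta.abs() < 1e-12);
}
```

The probabilistic variant cuts this cost by voting with only a random subset of the edge pixels and terminating lines early, which matches the ~2x gap seen in the table.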
Execution Date: 2026-03-31 (CET)
Three new function groups have been benchmarked: `count_non_zero`, `lut` (Look-Up Table), and `dft` (Discrete Fourier Transform via `rustfft`). DFT benchmarks require the `fft` feature flag.
Note: The 4-strategy comparison table (Standard / SIMD Only / Parallel / Parallel+SIMD) does not apply to the LUT and DFT benchmarks below. LUT is a pure table lookup where SIMD provides no benefit, and DFT delegates to `rustfft`, which manages its own internal SIMD; the `parallel` and `simd` feature flags have no effect on either. Results shown are with default features (Parallel enabled).
| Benchmark / Operation | Parallel (default) |
|---|---|
| `count_non_zero_1024×1024_i32` | 116.03 µs |
| `lut_1024×1024_u8` (1ch) | 501.25 µs |
| `lut_1024×1024×3_u8_broadcast` (3ch) | 855.86 µs |
`count_non_zero` is a simple reduction and is very fast with `rayon` parallelism. LUT is a pure table lookup (memory-bound), scaling linearly with channel count (3ch ≈ 1.7× of 1ch).
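Both operations reduce to tight loops over the element buffer; a std-only sketch of the core logic (illustrative, not the purecv signatures):

```rust
// count_non_zero: a simple reduction over the element buffer; with rayon
// this becomes an embarrassingly parallel filter-and-count.
fn count_non_zero(data: &[i32]) -> usize {
    data.iter().filter(|&&v| v != 0).count()
}

// lut: every output byte is one table lookup on the input byte. The work is
// purely memory traffic, which is why SIMD adds nothing here.
fn apply_lut(src: &[u8], table: &[u8; 256]) -> Vec<u8> {
    src.iter().map(|&p| table[p as usize]).collect()
}

fn main() {
    assert_eq!(count_non_zero(&[0, 3, 0, -7, 9]), 3);

    // Inversion LUT: out = 255 - in.
    let mut inv = [0u8; 256];
    for i in 0..256 {
        inv[i] = 255 - i as u8;
    }
    assert_eq!(apply_lut(&[0, 128, 255], &inv), vec![255, 127, 0]);
}
```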
| Benchmark / Operation | Time |
|---|---|
| `get_optimal_dft_size` | 18.76 ns |
| `dft_forward_256×256_f32` | 454.03 µs |
| `dft_inverse_256×256_f32` | 514.82 µs |
| `dft_forward_512×512_f32` | 3.80 ms |
| `dft_forward_1024×1024_f64` | 28.21 ms |
A 2D DFT on an n×n input scales as O(n² log n): 512×512 is ~8× slower than 256×256, somewhat above the ~4.5× predicted by the 4× data increase plus the logarithmic factor, with the remainder likely due to cache effects at the larger size. `get_optimal_dft_size` is a trivial binary search (~19 ns).
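The optimal-size query follows the classic criterion of the smallest size ≥ n whose only prime factors are 2, 3, and 5. A minimal sketch of that criterion (a linear scan for clarity; per the note above, the actual implementation binary-searches, presumably over a precomputed table):

```rust
// Smallest size >= n that is "5-smooth" (only prime factors 2, 3, 5) --
// the standard optimal-DFT-size criterion. Sketch only; purecv's actual
// get_optimal_dft_size binary-searches rather than scanning.
fn get_optimal_dft_size(n: usize) -> usize {
    fn is_smooth(mut x: usize) -> bool {
        for p in [2, 3, 5] {
            while x % p == 0 {
                x /= p;
            }
        }
        x == 1
    }
    (n.max(1)..).find(|&s| is_smooth(s)).unwrap()
}

fn main() {
    assert_eq!(get_optimal_dft_size(509), 512);   // 2^9
    assert_eq!(get_optimal_dft_size(240), 240);   // 2^4 * 3 * 5
    assert_eq!(get_optimal_dft_size(1000), 1000); // 2^3 * 5^3
}
```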
Three new benchmarks were added to specifically evaluate the SIMD fast path in `fast_deriv_3x3` for f32 Sobel, and the heavily compute-bound `bilateral_filter`. The Sobel f32 benchmarks use realistic non-zero data (a sinusoidal gradient pattern) to fully exercise the computation.
| Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD |
|---|---|---|---|---|
| `bilateral_filter_512×512_u8` | 1.43 s | 1.51 s | 202.44 ms | 205.48 ms |
| `sobel_3x3_f32_dx_1024×1024` | 28.59 ms | 6.26 ms | 1.96 ms | 1.28 ms |
| `sobel_3x3_f32_dy_1024×1024` | 26.24 ms | 6.51 ms | 2.16 ms | 1.27 ms |
- `sobel_3x3_f32` (dx/dy) — first benchmark where SIMD alone provides a massive standalone gain:
  - SIMD Only: 4.5× speedup over Standard — the `simd_deriv_3x3_row_f32` kernel in `fast_deriv_3x3` processes interior rows with vectorized 3×3 convolutions.
  - Parallel: 14× speedup — row-wise parallelism scales excellently.
  - Parallel + SIMD: 22× total speedup (28.59 ms → 1.28 ms) — the combination of SIMD kernels on each core produces the best result seen in the project.
  - Compared to the generic `sobel_3x3` (which uses `Matrix<f32>` with zero data and mixed dx=1, dy=1), the f32-specific SIMD path is ~1.5× faster even in Parallel+SIMD (1.28 ms vs 1.87 ms).
- `bilateral_filter_512×512_u8` — heavily compute-bound (per-pixel Gaussian weight computation over a spatial/range neighborhood):
  - Parallel: 7.1× speedup (1.43 s → 202 ms) — each row's bilateral computation is independent.
  - SIMD Only: no gain (1.51 s, slightly worse) — the complex per-pixel exponential weight math is not trivially vectorizable.
  - Parallel + SIMD: equivalent to Parallel alone (~205 ms).
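The per-pixel cost that dominates the bilateral filter is the combined spatial/range Gaussian weight. A std-only sketch of a single interior output pixel (illustrative parameters and layout, not the purecv API) shows why the math resists vectorization:

```rust
// Bilateral filter at one interior pixel: each neighbor is weighted by a
// spatial Gaussian (distance) times a range Gaussian (intensity difference).
// The data-dependent exp() in the range term is what defeats auto-vectorization,
// while the independence of each output pixel is what makes rayon effective.
fn bilateral_pixel(
    img: &[Vec<f64>],
    y: usize,
    x: usize,
    radius: i64,
    sigma_s: f64,
    sigma_r: f64,
) -> f64 {
    let center = img[y][x];
    let (mut num, mut den) = (0.0, 0.0);
    for dy in -radius..=radius {
        for dx in -radius..=radius {
            let v = img[(y as i64 + dy) as usize][(x as i64 + dx) as usize];
            let w_spatial = (-((dx * dx + dy * dy) as f64) / (2.0 * sigma_s * sigma_s)).exp();
            let w_range = (-((v - center) * (v - center)) / (2.0 * sigma_r * sigma_r)).exp();
            num += w_spatial * w_range * v;
            den += w_spatial * w_range;
        }
    }
    num / den
}

fn main() {
    // On a constant image every range weight is 1, so the output equals the input.
    let img = vec![vec![10.0; 5]; 5];
    let out = bilateral_pixel(&img, 2, 2, 1, 1.0, 10.0);
    assert!((out - 10.0).abs() < 1e-12);
}
```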
| Metric | Previous best (`sobel_3x3` generic) | New (`sobel_3x3_f32` SIMD path) | Improvement |
|---|---|---|---|
| Standard | 22.79 ms | 28.59 ms (realistic data) | — |
| SIMD Only | 21.66 ms | 6.26 ms | 3.5× |
| Parallel | 1.87 ms | 1.96 ms | ~equal |
| Parallel + SIMD | 1.87 ms | 1.28 ms | 1.5× |
The generic `sobel_3x3` benchmark used zero-initialized data and the mixed derivative (dx=1, dy=1), which masked the SIMD gains. With realistic f32 data and separate dx/dy, the `pulp`-powered SIMD kernel delivers a clear 1.5× additional speedup on top of parallelism.
Execution Date: 2026-03-09 23:43 (CET)
| Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD |
|---|---|---|---|---|
| `matrix_add` | 3.86 ms | 3.92 ms | 2.45 ms | 2.35 ms |
| `matrix_sub` | 3.86 ms | 3.87 ms | 2.49 ms | 2.39 ms |
| `matrix_mul` | 3.73 ms | 3.88 ms | 2.36 ms | 2.58 ms |
| `matrix_div` | 3.71 ms | 3.84 ms | 2.42 ms | 2.37 ms |
| `cvt_color_rgb2gray` | 3.85 ms | 1.52 ms | 666.32 µs | 475.45 µs |
| `box_filter_3x3` | 13.91 ms | 14.35 ms | 3.39 ms | 3.61 ms |
| `sobel_3x3` | 15.28 ms | 15.40 ms | 3.65 ms | 4.06 ms |
- Color Conversion: `cvt_color_rgb2gray` is highly vectorizable (straightforward pixel-by-pixel math without neighborhood lookups). Simply enabling `-C target-cpu=native` yields a massive 2.5x speedup on a single thread. Combining it with `rayon` parallelism achieves a near-8x total speedup.
- Spatial Filters: Neighborhood-dependent modules (`box_filter` and `sobel_3x3`) showcase tremendous gains from the `parallel` execution strategy (over 4x speedup), as filtering is heavily compute-bound and segments cleanly across row data. However, generic auto-vectorization (SIMD Only) has negligible impact due to the overlapping, strided memory access patterns characteristic of 3x3 sliding windows.
- Matrix Arithmetic: Foundational matrix tasks (`add`, `sub`, etc.) appear bounded primarily by memory bandwidth when allocating and traversing large `1024x1024` vectors, making instruction-level SIMD tuning yield diminishing returns compared to spreading memory throughput over parallel cores.
Conclusion: The refactored `.chunks_exact()` and `.zip()` iterators provide robust performance baselines and scale well whenever LLVM's auto-vectorizer applies (as demonstrated by `cvt_color`), while remaining safe across varying feature flags.
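The iterator pattern the conclusion refers to looks roughly like this for RGB→gray. This is a sketch using standard BT.601 luma weights; purecv's exact coefficients, rounding, and types may differ:

```rust
// RGB -> gray using chunks_exact to expose a fixed-width inner loop that
// LLVM can auto-vectorize, and zip to elide bounds checks on the output.
fn rgb2gray(rgb: &[u8], gray: &mut [u8]) {
    for (px, g) in rgb.chunks_exact(3).zip(gray.iter_mut()) {
        // BT.601 luma weights (assumed here for illustration).
        let y = 0.299 * px[0] as f32 + 0.587 * px[1] as f32 + 0.114 * px[2] as f32;
        *g = y.round() as u8;
    }
}

fn main() {
    // Pure red, green, blue, and mid-gray pixels.
    let rgb = [255, 0, 0, 0, 255, 0, 0, 0, 255, 128, 128, 128];
    let mut gray = [0u8; 4];
    rgb2gray(&rgb, &mut gray);
    assert_eq!(gray, [76, 150, 29, 128]);
}
```

Because the loop body is branch-free arithmetic on a fixed-size chunk, `-C target-cpu=native` lets LLVM vectorize it directly, which is consistent with the 2.5x single-thread gain reported above.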
Execution Date: 2026-03-10
Following the initial SIMD and Parallelization benchmarks, targeted algorithmic optimizations were applied to specific operations to eliminate memory overhead and boundary-checking branches.
| Benchmark / Operation | Pre-Optimization (Parallel) | Post-Optimization (Parallel) | Time Reduction |
|---|---|---|---|
| `sobel_3x3` | 3.65 ms | 2.08 ms | 43% |
| `scharr_x` | 3.70 ms | 2.11 ms | 43% |
- Derivatives (`sobel_3x3`, `scharr_x`): Replacing the generic `sep_filter_2d` implementation with an inlined `fast_deriv_3x3` function brought execution times down to ~2.1 ms. By explicitly unrolling the 3x3 convolution and separating the interior "fast path" from the boundary "slow path", we completely eliminated intermediate allocations and significantly improved cache locality.
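The interior fast path can be sketched like this for one Sobel dx output row (std-only and scalar; the actual `fast_deriv_3x3` additionally dispatches interior rows to a SIMD kernel):

```rust
// Unrolled Sobel dx over one interior output row. `above`, `mid`, `below`
// are three consecutive input rows. Only interior columns are written, so
// the hot loop has no boundary branches and needs no scratch buffer.
fn sobel_dx_row(above: &[f32], mid: &[f32], below: &[f32], out: &mut [f32]) {
    let w = mid.len();
    for x in 1..w - 1 {
        // Kernel [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], fully unrolled.
        out[x] = (above[x + 1] - above[x - 1])
            + 2.0 * (mid[x + 1] - mid[x - 1])
            + (below[x + 1] - below[x - 1]);
    }
}

fn main() {
    // Horizontal ramp: pixel value == column index, so dx == 8 at every
    // interior column (2 + 2*2 + 2).
    let row: Vec<f32> = (0..6).map(|c| c as f32).collect();
    let mut out = vec![0.0f32; 6];
    sobel_dx_row(&row, &row, &row, &mut out);
    assert_eq!(&out[1..5], &[8.0, 8.0, 8.0, 8.0]);
}
```

Boundary columns (and the first/last rows) go through a separate slow path with clamping, so the compiler can keep the interior loop free of conditionals.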
Execution Date: 2026-03-15 (CET)
| Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD |
|---|---|---|---|---|
| `matrix_add` | 3.26 ms | 3.35 ms | 2.08 ms | 2.05 ms |
| `matrix_sub` | 3.24 ms | 3.37 ms | 2.06 ms | 2.11 ms |
| `matrix_mul` | 3.20 ms | 3.13 ms | 2.09 ms | 2.09 ms |
| `matrix_div` | 3.24 ms | 3.10 ms | 2.10 ms | 2.10 ms |
| `dot` | 997.62 µs | 971.72 µs | 156.54 µs | 157.72 µs |
| `magnitude` | 756.23 µs | 754.99 µs | 184.35 µs | 184.42 µs |
| `norm_l2` | 984.65 µs | 964.61 µs | 151.90 µs | 160.90 µs |
| `add_weighted` | 3.53 ms | 3.13 ms | 2.23 ms | 2.30 ms |
| `convert_scale_abs` | 2.54 ms | 1.66 ms | 610.46 µs | 477.11 µs |
| `sum` | 975.51 µs | 966.94 µs | 162.20 µs | 164.41 µs |
| `mean` | 969.08 µs | 964.49 µs | 163.92 µs | 167.97 µs |
| `bitwise_and` | 255.18 µs | 264.72 µs | 376.10 µs | 445.06 µs |
| `min` | 3.08 ms | 3.06 ms | 2.07 ms | 2.11 ms |
| `absdiff` | 3.10 ms | 3.04 ms | 2.06 ms | 2.15 ms |
| `normalize` | 2.75 ms | 1.55 ms | 1.24 ms | 1.32 ms |
| `sqrt` | 939.04 µs | 945.49 µs | 858.87 µs | 847.33 µs |
| `gemm_256×256` | 15.71 ms | 15.69 ms | 4.29 ms | 4.40 ms |
- Reduction operations (`dot`, `norm_l2`, `sum`, `mean`) scale brilliantly with `rayon`: ~6× speedup from Standard → Parallel, driven by embarrassingly-parallel sum accumulation.
- Element-wise binary ops (`add`, `sub`, `mul`, `div`, `min`, `absdiff`) gain a modest ~1.5× speedup with Parallel. These are memory-bandwidth-bound; additional SIMD provides negligible further gain.
- `convert_scale_abs` is the standout: SIMD gives 1.5× (sequential), and Parallel+SIMD reaches 5.3× total speedup — a combination of vectorizable math and parallelizable rows.
- `bitwise_and` is fastest in Standard mode (255 µs). Parallel overhead (~380–445 µs) exceeds the benefit for this ultra-cheap 1-byte-per-element operation.
- `gemm_256×256` benefits enormously from Parallel (3.7×), confirming that matrix multiplication is compute-bound and partitions well across cores.
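The chunked reduction behind the ~6× gains can be sketched as follows. purecv uses `rayon` for this; the sketch substitutes `std::thread::scope` so it stays dependency-free, but the shape (split, reduce per chunk, combine partials) is the same:

```rust
// Chunked parallel sum: split the buffer, reduce each chunk on its own
// thread, then combine the partial sums. rayon's par_iter().sum() performs
// the same reduction with work-stealing instead of fixed chunks.
fn parallel_sum(data: &[f64], n_threads: usize) -> f64 {
    let chunk = data.len().div_ceil(n_threads);
    std::thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().sum::<f64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<f64> = (1..=1000).map(|i| i as f64).collect();
    assert_eq!(parallel_sum(&data, 4), 500500.0);
}
```

The same structure explains the `bitwise_and` regression: for an operation this cheap, the spawn/join (or rayon task-scheduling) overhead exceeds the per-chunk work.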
| Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD |
|---|---|---|---|---|
| `cvt_color_rgb2gray` | 2.66 ms | 1.41 ms | 529.63 µs | 404.23 µs |
| `box_filter_3x3` | 9.01 ms | 9.31 ms | 3.07 ms | 3.69 ms |
| `sobel_3x3` | 22.79 ms | 21.66 ms | 1.87 ms | 1.87 ms |
| `scharr_x` | 22.76 ms | 21.67 ms | 1.78 ms | 1.85 ms |
| `threshold_binary` | 1.31 ms | 1.43 ms | 626.70 µs | 609.22 µs |
| `cvt_color_bgr2gray` | 2.69 ms | † | 617.94 µs | 368.43 µs |
| `cvt_color_rgba2gray` | 2.64 ms | 2.43 ms | 676.02 µs | 354.37 µs |
| `gaussian_blur_3x3` | 11.01 ms | † | 3.53 ms | 3.50 ms |
| `gaussian_blur_5x5` | 15.21 ms | 16.37 ms | 4.30 ms | 4.72 ms |
| `laplacian_3x3` | 45.91 ms | 44.56 ms | 4.41 ms | 4.44 ms |
| `canny` | 57.61 ms | 62.62 ms | 12.70 ms | 12.54 ms |
† SIMD Only values for `cvt_color_bgr2gray` (249 ms) and `gaussian_blur_3x3` (18 ms) are anomalous — likely impacted by system activity during the run. Excluded from comparison.
- Color Conversion (`cvt_color_rgb2gray`): highly vectorizable pixel-by-pixel math. SIMD alone yields 1.9×, Parallel brings 5.0×, and combined reaches 6.6× total speedup.
- Derivatives (`sobel_3x3`, `scharr_x`): the optimized `fast_deriv_3x3` implementation scales extremely well with Parallel — 12× speedup. These benefit from row-wise parallelism where each row's 3×3 convolution is independent.
- `laplacian_3x3`: composed of two Sobel passes, gains an enormous 10.4× speedup with Parallel, consistent with the derivative scaling.
- `canny`: multi-stage pipeline (Sobel + gradient + NMS + hysteresis). Achieves 4.6× speedup with Parallel — limited by the sequential hysteresis-tracking phase.
- Spatial Filters (`box_filter`, `gaussian_blur`): ~3× speedup with Parallel. The sliding-window pattern limits SIMD gains due to non-sequential memory access.
- `threshold_binary`: simple per-pixel operation with a modest 2.1× speedup — limited by memory bandwidth.
| Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD |
|---|---|---|---|---|
| `flip_horiz` | 5.03 ms | 5.81 ms | 976.23 µs | 971.35 µs |
| `transpose` | 7.42 ms | 7.29 ms | 1.09 ms | 1.14 ms |
| `split` | 3.21 ms | 3.21 ms | 1.24 ms | 1.31 ms |
| `merge` | 2.49 ms | 2.27 ms | 932.68 µs | ‡ |
‡ `merge` in Parallel+SIMD was not captured (benchmark run interrupted).
- All structural ops are memory-bound: SIMD Only provides zero benefit (or even a slight regression due to instruction overhead on non-vectorizable scatter/gather patterns).
- Parallel provides 2.6–6.8× speedup across the board: `flip` (5.2×) and `transpose` (6.8×) benefit most because each row can be independently relocated.
- `merge` and `split` (~2.5–2.7×) are still memory-bandwidth-limited even with parallelism, as they involve strided interleave/deinterleave across 3 channels.
| Category | Best Config | Typical Speedup vs Standard |
|---|---|---|
| Element-wise ops | Parallel | 1.5× (memory-bound) |
| Reductions | Parallel | 6× (embarrassingly parallel) |
| Color conversion | Parallel + SIMD | 6.6× (vectorizable + parallel) |
| Spatial filters | Parallel | 3× (row-parallel convolution) |
| Derivatives | Parallel | 12× (optimized fast path) |
| Derivatives (f32 SIMD) | Parallel + SIMD | 22× (SIMD kernel + parallel) |
| Bilateral filter | Parallel | 7.1× (compute-bound, no SIMD gain) |
| Canny edge detect | Parallel + SIMD | 4.6× (multi-stage pipeline) |
| Structural ops | Parallel | 5× (independent row ops) |
| GEMM | Parallel | 3.7× (compute-bound) |
| Feature Detection | Parallel | Harris Corner ~26ms (1024x1024) |
| Hough Transform | Parallel | Lines ~3.6ms / Circles ~4.5ms (512x512) |
Conclusion: Parallelism (`rayon`) is the dominant optimization for nearly all operations. SIMD (`target-cpu=native`) provides meaningful additional gains primarily for pixel-level math (`cvt_color`, `convert_scale_abs`) and f32 derivative kernels (`sobel_3x3_f32`) where the inner loop is trivially vectorizable. The `sobel_3x3_f32` SIMD path achieves the project's highest combined speedup at 22×. For memory-bound or scatter/gather patterns, SIMD adds no benefit.