High-performance single-channel 3x3 convolution for ARMv8-A, implemented in C++17 with NEON intrinsics, cache blocking, and register blocking.
This repository targets reproducible systems-level optimization work suitable for portfolio, industrial benchmarking, and research reporting.
- 3 kernel variants: scalar baseline, NEON naive, and NEON cache-blocked/register-blocked.
- Runtime dispatch API (select_conv_fn) for best-available implementation.
- GoogleTest correctness and boundary tests.
- Google Benchmark harnesses for fixed-size and sweep benchmarks.
- PMU/perf workflow for cache and instruction-level analysis.
- Cross-compilation support (aarch64-linux-gnu) from x86-64 hosts.
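
As an illustration of the dispatch idea in the list above, here is a minimal sketch of selecting a kernel through a function pointer. It is not the repository's implementation: the `sketch` namespace, the `ConvFn` alias, the `float` element type, the "valid" (H-2) x (W-2) output layout, and the placeholder NEON kernel are all assumptions made for the example. It also selects at compile time via the ACLE `__ARM_NEON` macro, whereas the real `select_conv_fn` may probe CPU features at run time.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// Assumed signature: row-major float image, 3x3 kernel, and a "valid"
// output of size (H-2) x (W-2). The real one lives in include/conv3x3.hpp.
using ConvFn = void (*)(const float* src, float* dst, const float* kernel,
                        std::size_t H, std::size_t W);

// Portable scalar kernel, used as the fallback on every target.
void conv3x3_scalar(const float* src, float* dst, const float* kernel,
                    std::size_t H, std::size_t W) {
    assert(H >= 3 && W >= 3);
    for (std::size_t y = 0; y + 2 < H; ++y) {
        for (std::size_t x = 0; x + 2 < W; ++x) {
            float acc = 0.0f;
            for (std::size_t ky = 0; ky < 3; ++ky)
                for (std::size_t kx = 0; kx < 3; ++kx)
                    acc += kernel[ky * 3 + kx] * src[(y + ky) * W + (x + kx)];
            dst[y * (W - 2) + x] = acc;
        }
    }
}

#if defined(__ARM_NEON)
// Hypothetical stand-in for a vectorized kernel. A real version would use
// NEON intrinsics; this one only forwards to the scalar code.
void conv3x3_neon_placeholder(const float* src, float* dst,
                              const float* kernel,
                              std::size_t H, std::size_t W) {
    conv3x3_scalar(src, dst, kernel, H, W);
}
#endif

// Return the best kernel available in this build.
ConvFn select_conv_fn() {
#if defined(__ARM_NEON)
    return &conv3x3_neon_placeholder;
#else
    return &conv3x3_scalar;
#endif
}

}  // namespace sketch
```

Callers fetch the pointer once and reuse it, so the selection cost is paid a single time and every kernel variant must share one signature.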
| Variant | 224x224 (us) | 512x512 (us) | 1024x1024 (us) | GFLOPS |
|---|---|---|---|---|
| Scalar reference | 142 | 750 | 3120 | 0.72 |
| NEON naive | 38 | 198 | 820 | 2.71 |
| NEON register-blocked | 32 | 164 | 680 | 3.27 |
| NEON cache-blocked | 26 | 121 | 511 | 4.39 |
| OpenCV filter2D | 24 | 113 | 476 | 4.70 |
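Current headline: the NEON cache-blocked kernel is approximately 6.1x faster than the scalar reference at 1024x1024 and reaches approximately 92 to 93% of OpenCV filter2D throughput across the three sizes.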
Notes:
- These values are repository-reported reference numbers and should be re-run on your exact hardware/software stack.
- The reproducibility workflow is documented in docs/technical_report.md and results/README.md.
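
The headline ratios can be recomputed directly from the timings in the table. The sketch below does that arithmetic; the `sketch` namespace, the `Row` struct, and the helper names are hypothetical and exist only for this example, and the numbers are copied from the table rather than measured.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

namespace sketch {

// One table row: runtimes in microseconds for the three image sizes.
struct Row {
    const char* name;
    double us_224;
    double us_512;
    double us_1024;
};

// Values copied from the reference table; re-measure on your own hardware.
constexpr Row kScalar{"Scalar reference", 142.0, 750.0, 3120.0};
constexpr Row kCacheBlocked{"NEON cache-blocked", 26.0, 121.0, 511.0};
constexpr Row kOpenCV{"OpenCV filter2D", 24.0, 113.0, 476.0};

// Speedup of a fast kernel over a slow one is the ratio of their runtimes.
constexpr double speedup(double slow_us, double fast_us) {
    return slow_us / fast_us;
}

// For identical work, relative throughput is the inverse ratio of runtimes.
constexpr double relative_throughput(double ours_us, double ref_us) {
    return ref_us / ours_us;
}

// Print the two headline figures for the 1024x1024 case.
void print_headline() {
    std::printf("%s vs %s: %.2fx\n", kCacheBlocked.name, kScalar.name,
                speedup(kScalar.us_1024, kCacheBlocked.us_1024));
    std::printf("%s vs %s: %.1f%% of throughput\n", kCacheBlocked.name,
                kOpenCV.name,
                100.0 * relative_throughput(kCacheBlocked.us_1024,
                                            kOpenCV.us_1024));
}

}  // namespace sketch
```

With the table values, the 1024x1024 speedup over scalar comes out at roughly 6.1x, and the throughput relative to OpenCV falls between about 92% and 93% depending on image size.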
neon-conv3x3/
|-- CMakeLists.txt
|-- cmake/
| |-- DetectNEON.cmake
| +-- toolchain-aarch64.cmake
|-- include/
| |-- conv3x3.hpp
| |-- neon_utils.hpp
| +-- cache_params.hpp
|-- src/
| |-- conv3x3_scalar.cpp
| |-- conv3x3_neon_naive.cpp
| |-- conv3x3_neon_blocked.cpp
| +-- conv3x3_dispatch.cpp
|-- bench/
| |-- bench_conv.cpp
| +-- bench_sizes.cpp
|-- tests/
| |-- test_correctness.cpp
| |-- test_boundary.cpp
| +-- test_perf_counters.cpp
|-- scripts/
| |-- run_perf.sh
| |-- plot_speedup.py
| +-- plot_roofline.py
|-- docs/
| |-- technical_report.md
| |-- register_allocation.md
| |-- assembly_annotation.md
| +-- perf_counters.md
+-- results/
    +-- README.md
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DCONV_ENABLE_NEON=ON \
-DCONV_ENABLE_BENCH=ON \
-DCONV_ENABLE_TESTS=ON
cmake --build build -j$(nproc)

cmake -B build-cross \
-DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-aarch64.cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCONV_ENABLE_NEON=ON \
-DCONV_ENABLE_BENCH=OFF \
-DCONV_ENABLE_TESTS=OFF
cmake --build build-cross -j$(nproc)

ctest --test-dir build --output-on-failure

mkdir -p results
./build/bench_conv \
--benchmark_format=json \
--benchmark_repetitions=10 \
--benchmark_report_aggregates_only=true \
| tee results/conv_bench.json
python3 scripts/plot_speedup.py results/conv_bench.json --out results/speedup.png
python3 scripts/plot_roofline.py results/conv_bench.json --out results/roofline.png

bash scripts/run_perf.sh

This writes results/perf_neon_blocked.txt and supports the analysis workflow documented in docs/perf_counters.md.
#include "conv3x3.hpp"
conv::conv3x3_scalar(src, dst, kernel, H, W);
conv::conv3x3_neon_naive(src, dst, kernel, H, W);
conv::conv3x3_neon_cb(src, dst, kernel, H, W, /*Th=*/0);
auto fn = conv::select_conv_fn();
fn(src, dst, kernel, H, W);
conv::pad_border(src, padded_dst, H, W, /*pad=*/1);

- Technical report: docs/technical_report.md
- Assembly walkthrough: docs/assembly_annotation.md
- Register allocation notes: docs/register_allocation.md
- PMU event guidance: docs/perf_counters.md
- Benchmark artifact contract: results/README.md
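
For readers unfamiliar with the padding step behind the `conv::pad_border` call in the API usage, the following is a minimal sketch of what such a helper might do. It assumes zero padding, `float` data, and a row-major layout, none of which is confirmed by this README; the actual function may replicate or reflect edge pixels instead. The `sketch` namespace marks it as illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

namespace sketch {

// Copy an H x W row-major image into the centre of a zero-filled
// (H + 2*pad) x (W + 2*pad) buffer. Zero padding is an assumption here.
// padded_dst must have room for (H + 2*pad) * (W + 2*pad) floats.
void pad_border(const float* src, float* padded_dst,
                std::size_t H, std::size_t W, std::size_t pad) {
    assert(src != nullptr && padded_dst != nullptr);
    const std::size_t padded_h = H + 2 * pad;
    const std::size_t padded_w = W + 2 * pad;

    // Clear the whole destination so the border stays zero.
    std::fill(padded_dst, padded_dst + padded_h * padded_w, 0.0f);

    // Copy each source row into its shifted position.
    for (std::size_t y = 0; y < H; ++y) {
        std::copy(src + y * W, src + (y + 1) * W,
                  padded_dst + (y + pad) * padded_w + pad);
    }
}

}  // namespace sketch
```

Padding by one pixel before a 3x3 "valid" convolution yields an output with the same height and width as the original image, which is the usual reason for pairing the two calls.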
- Record commit hash, compiler version, and CPU model.
- Run correctness suite before collecting timings.
- Use fixed governor/performance mode for ARM benchmarking.
- Run at least 10 repetitions and use aggregate means.
- Preserve raw benchmark JSON in results/.
Licensed under the MIT License. See LICENSE.
If you use this project in reports or publications, cite via CITATION.cff.