Olajide-Badejo/ARM-Neon-Conv3x3

ARM NEON Conv3x3 Kernel

High-performance single-channel 3x3 convolution for ARMv8-A, implemented in C++17 with NEON intrinsics, cache blocking, and register blocking.

This repository targets reproducible systems-level optimization work suitable for portfolio, industrial benchmarking, and research reporting.

Highlights

  • 3 kernel variants: scalar baseline, NEON naive, and NEON cache-blocked/register-blocked.
  • Runtime dispatch API (select_conv_fn) for best-available implementation.
  • GoogleTest correctness and boundary tests.
  • Google Benchmark harnesses for fixed-size and sweep benchmarks.
  • PMU/perf workflow for cache and instruction-level analysis.
  • Cross-compilation support (aarch64-linux-gnu) from x86-64 hosts.
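
To make the "NEON naive" idea concrete, the sketch below computes one output row four pixels at a time with NEON intrinsics and falls back to scalar code when `__ARM_NEON` is not defined. It illustrates the general technique only: the name `conv3x3_row_sketch`, the `float` element type, and the requirement that each input row is already padded to at least W + 2 readable elements are assumptions, not the repository's actual kernel.

```cpp
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical row kernel. r0, r1, r2 point at three consecutive rows of a
// padded image (at least W + 2 readable floats each); out receives W values.
void conv3x3_row_sketch(const float* r0, const float* r1, const float* r2,
                        const float* k, float* out, std::size_t W) {
    const float* rows[3] = {r0, r1, r2};
    std::size_t x = 0;
#if defined(__ARM_NEON)
    // Vector body: four outputs per iteration, nine multiply-accumulates each.
    // The loads overlap and are unaligned, which is what makes this "naive".
    for (; x + 4 <= W; x += 4) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (int ky = 0; ky < 3; ++ky) {
            for (int kx = 0; kx < 3; ++kx) {
                const float32x4_t v = vld1q_f32(rows[ky] + x + kx);
                acc = vmlaq_n_f32(acc, v, k[ky * 3 + kx]);
            }
        }
        vst1q_f32(out + x, acc);
    }
#endif
    // Scalar tail, and the whole row when NEON is unavailable.
    for (; x < W; ++x) {
        float acc = 0.0f;
        for (int ky = 0; ky < 3; ++ky) {
            for (int kx = 0; kx < 3; ++kx) {
                acc += rows[ky][x + kx] * k[ky * 3 + kx];
            }
        }
        out[x] = acc;
    }
}
```

A register-blocked version would typically keep the nine kernel weights and the loaded row vectors in registers across several output rows instead of reloading them per row.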

Reported Performance (Cortex-A72, 1.5 GHz)

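| Variant               | 224x224 (us) | 512x512 (us) | 1024x1024 (us) | GFLOPS |
|-----------------------|--------------|--------------|----------------|--------|
| Scalar reference      | 142          | 750          | 3120           | 0.72   |
| NEON naive            | 38           | 198          | 820            | 2.71   |
| NEON register-blocked | 32           | 164          | 680            | 3.27   |
| NEON cache-blocked    | 26           | 121          | 511            | 4.39   |
| OpenCV filter2D       | 24           | 113          | 476            | 4.70   |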

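Current headline: the NEON cache-blocked kernel is approximately 6.1x faster than the scalar reference at 1024x1024 (3120 us vs 511 us) and reaches approximately 92-93% of OpenCV filter2D throughput, depending on image size.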
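
The headline ratios can be rechecked directly from the timing columns. The helper names below are hypothetical and the inputs are the reported microsecond values from the table, not new measurements.

```cpp
// Hypothetical helpers for re-deriving the headline ratios from the table.
// Both take wall-clock times in microseconds for the same image size.

// How many times faster the optimized kernel is than the baseline.
constexpr double speedup(double baseline_us, double optimized_us) {
    return baseline_us / optimized_us;
}

// Throughput of a measured kernel relative to a reference kernel
// (1.0 means equal throughput; below 1.0 means the measured kernel is slower).
constexpr double relative_throughput(double reference_us, double measured_us) {
    return reference_us / measured_us;
}

// 1024x1024: scalar 3120 us vs NEON cache-blocked 511 us.
constexpr double kSpeedup1024 = speedup(3120.0, 511.0);
// 1024x1024: OpenCV filter2D 476 us vs NEON cache-blocked 511 us.
constexpr double kVsOpenCv1024 = relative_throughput(476.0, 511.0);
// 224x224: OpenCV filter2D 24 us vs NEON cache-blocked 26 us.
constexpr double kVsOpenCv224 = relative_throughput(24.0, 26.0);
```

With the reported values, the 1024x1024 speedup is 3120/511, a little over 6.1, and the OpenCV ratio is about 0.92 at 224x224 and about 0.93 at 1024x1024.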

Notes:

  • These values are repository-reported reference numbers and should be re-run on your exact hardware/software stack.
  • Reproducibility workflow is documented in docs/technical_report.md and results/README.md.

Repository Layout

neon-conv3x3/
|-- CMakeLists.txt
|-- cmake/
|   |-- DetectNEON.cmake
|   +-- toolchain-aarch64.cmake
|-- include/
|   |-- conv3x3.hpp
|   |-- neon_utils.hpp
|   +-- cache_params.hpp
|-- src/
|   |-- conv3x3_scalar.cpp
|   |-- conv3x3_neon_naive.cpp
|   |-- conv3x3_neon_blocked.cpp
|   +-- conv3x3_dispatch.cpp
|-- bench/
|   |-- bench_conv.cpp
|   +-- bench_sizes.cpp
|-- tests/
|   |-- test_correctness.cpp
|   |-- test_boundary.cpp
|   +-- test_perf_counters.cpp
|-- scripts/
|   |-- run_perf.sh
|   |-- plot_speedup.py
|   +-- plot_roofline.py
|-- docs/
|   |-- technical_report.md
|   |-- register_allocation.md
|   |-- assembly_annotation.md
|   +-- perf_counters.md
+-- results/
    +-- README.md

Build

Native ARM64 (recommended for benchmarking)

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCONV_ENABLE_NEON=ON \
  -DCONV_ENABLE_BENCH=ON \
  -DCONV_ENABLE_TESTS=ON

cmake --build build -j$(nproc)

Cross-compile from x86-64

cmake -B build-cross \
  -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-aarch64.cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DCONV_ENABLE_NEON=ON \
  -DCONV_ENABLE_BENCH=OFF \
  -DCONV_ENABLE_TESTS=OFF

cmake --build build-cross -j$(nproc)

Test

ctest --test-dir build --output-on-failure

Benchmark

mkdir -p results
./build/bench_conv \
  --benchmark_format=json \
  --benchmark_repetitions=10 \
  --benchmark_report_aggregates_only=true \
  | tee results/conv_bench.json

python3 scripts/plot_speedup.py results/conv_bench.json --out results/speedup.png
python3 scripts/plot_roofline.py results/conv_bench.json --out results/roofline.png

PMU / perf Collection

bash scripts/run_perf.sh

This writes results/perf_neon_blocked.txt and supports the analysis workflow documented in docs/perf_counters.md.

API Snapshot

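#include "conv3x3.hpp"

// Call a specific kernel variant directly.
conv::conv3x3_scalar(src, dst, kernel, H, W);
conv::conv3x3_neon_naive(src, dst, kernel, H, W);
// The cache-blocked variant takes an extra Th argument.
conv::conv3x3_neon_cb(src, dst, kernel, H, W, /*Th=*/0);

// Or let the dispatcher pick the best-available implementation.
auto fn = conv::select_conv_fn();
fn(src, dst, kernel, H, W);

// Border-padding helper.
conv::pad_border(src, padded_dst, H, W, /*pad=*/1);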
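
The README does not spell out what `pad_border` writes into the border. As a rough illustration, the sketch below assumes zero padding, `float` elements, and a row-major destination of size (H + 2*pad) x (W + 2*pad); the name `pad_border_sketch` is hypothetical and the real helper may behave differently (for example, by replicating edge pixels).

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical zero-padding helper (not the repository's pad_border).
// padded_dst must hold (H + 2*pad) * (W + 2*pad) floats.
void pad_border_sketch(const float* src, float* padded_dst,
                       std::size_t H, std::size_t W, std::size_t pad) {
    const std::size_t padded_w = W + 2 * pad;
    const std::size_t padded_h = H + 2 * pad;
    // Clear the whole destination, then copy each source row into the interior.
    std::fill(padded_dst, padded_dst + padded_h * padded_w, 0.0f);
    for (std::size_t y = 0; y < H; ++y) {
        std::copy(src + y * W, src + (y + 1) * W,
                  padded_dst + (y + pad) * padded_w + pad);
    }
}
```

Padding once up front lets an inner loop read a full 3x3 window at every output pixel without per-pixel bounds checks.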

Documentation

  • Technical report: docs/technical_report.md
  • Assembly walkthrough: docs/assembly_annotation.md
  • Register allocation notes: docs/register_allocation.md
  • PMU event guidance: docs/perf_counters.md
  • Benchmark artifact contract: results/README.md

Reproducibility Checklist

  1. Record commit hash, compiler version, and CPU model.
  2. Run correctness suite before collecting timings.
  3. Pin the CPU frequency governor to performance mode before benchmarking on ARM.
  4. Run at least 10 repetitions and use aggregate means.
  5. Preserve raw benchmark JSON in results/.
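
For step 4, the aggregate over repetitions can be sketched as a mean plus a sample standard deviation, which also indicates how noisy a run was. The struct and function names are hypothetical; when running `bench_conv` with `--benchmark_report_aggregates_only=true`, Google Benchmark reports its own aggregates and this code is not needed.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of repeated timings (microseconds).
struct Aggregate {
    double mean;
    double stddev;  // sample standard deviation (divides by n - 1)
};

Aggregate aggregate_sketch(const std::vector<double>& samples_us) {
    const std::size_t n = samples_us.size();
    if (n == 0) return {0.0, 0.0};

    double sum = 0.0;
    for (double s : samples_us) sum += s;
    const double mean = sum / static_cast<double>(n);

    double sq = 0.0;
    for (double s : samples_us) sq += (s - mean) * (s - mean);
    const double stddev =
        n > 1 ? std::sqrt(sq / static_cast<double>(n - 1)) : 0.0;

    return {mean, stddev};
}
```

A large standard deviation relative to the mean usually points to frequency scaling or background load, in which case the run is worth repeating.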

License

Licensed under the MIT License. See LICENSE.

Citation

If you use this project in reports or publications, cite via CITATION.cff.
