ARM NEON Conv3x3 Kernel

High-performance single-channel 3x3 convolution for ARMv8-A, implemented in C++17 with NEON intrinsics, cache blocking, and register blocking.

This repository targets reproducible systems-level optimization work suitable for portfolio, industrial benchmarking, and research reporting.

Highlights

3 kernel variants: scalar baseline, NEON naive, and NEON cache-blocked/register-blocked.
Runtime dispatch API (select_conv_fn) for best-available implementation.
GoogleTest correctness and boundary tests.
Google Benchmark harnesses for fixed-size and sweep benchmarks.
PMU/perf workflow for cache and instruction-level analysis.
Cross-compilation support (aarch64-linux-gnu) from x86-64 hosts.
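
To illustrate how the scalar baseline and the NEON naive variant listed above might differ, here is a minimal sketch. It is not the repository's code: the float32 element type, the row-major layout, the pre-padded input of (H + 2) x (W + 2) elements, the no-flip (cross-correlation) convention, and the sketch namespace are all assumptions made for illustration. The NEON path is compiled only on AArch64; elsewhere the function falls back to the scalar loop so the example still builds.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

namespace sketch {  // hypothetical namespace, not the repository's conv::

// One output pixel: 9 multiply-adds over a 3x3 window of the padded input.
inline float tap9(const float* src, const float* k, std::size_t stride,
                  std::size_t y, std::size_t x) {
    float acc = 0.0f;
    for (std::size_t ky = 0; ky < 3; ++ky)
        for (std::size_t kx = 0; kx < 3; ++kx)
            acc += k[ky * 3 + kx] * src[(y + ky) * stride + (x + kx)];
    return acc;
}

// Scalar baseline. Assumes src holds (H + 2) x (W + 2) floats (one-pixel
// border already applied) and dst holds H x W floats.
inline void conv3x3_scalar(const float* src, float* dst, const float* k,
                           std::size_t H, std::size_t W) {
    assert(src && dst && k);
    const std::size_t stride = W + 2;
    for (std::size_t y = 0; y < H; ++y)
        for (std::size_t x = 0; x < W; ++x)
            dst[y * W + x] = tap9(src, k, stride, y, x);
}

// "Naive" NEON: four adjacent output pixels per iteration, one fused
// multiply-add per kernel tap, no register or cache blocking.
inline void conv3x3_neon_naive(const float* src, float* dst, const float* k,
                               std::size_t H, std::size_t W) {
    assert(src && dst && k);
    const std::size_t stride = W + 2;
    for (std::size_t y = 0; y < H; ++y) {
        std::size_t x = 0;
#if defined(SKETCH_HAVE_NEON)
        for (; x + 4 <= W; x += 4) {
            float32x4_t acc = vdupq_n_f32(0.0f);
            for (std::size_t ky = 0; ky < 3; ++ky) {
                const float* row = src + (y + ky) * stride + x;
                for (std::size_t kx = 0; kx < 3; ++kx) {
                    // Load 4 inputs shifted by kx and scale by one weight.
                    acc = vfmaq_f32(acc, vld1q_f32(row + kx),
                                    vdupq_n_f32(k[ky * 3 + kx]));
                }
            }
            vst1q_f32(dst + y * W + x, acc);
        }
#endif
        // Scalar tail (and the whole row on non-NEON hosts).
        for (; x < W; ++x)
            dst[y * W + x] = tap9(src, k, stride, y, x);
    }
}

}  // namespace sketch
```

Because the NEON path uses fused multiply-adds, its results can differ from the scalar loop in the last bits, so any comparison between the two should use a small tolerance rather than exact equality.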

Reported Performance (Cortex-A72, 1.5 GHz)

Variant	224x224 (us)	512x512 (us)	1024x1024 (us)	GFLOPS
Scalar reference	142	750	3120	0.72
NEON naive	38	198	820	2.71
NEON register-blocked	32	164	680	3.27
NEON cache-blocked	26	121	511	4.39
OpenCV `filter2D`	24	113	476	4.70

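Current headline (NEON cache-blocked): approximately 6.1x faster than the scalar reference at 1024x1024 (about 5.5x at 224x224), and 92-93% of OpenCV `filter2D` throughput across the three image sizes.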
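
The headline ratios can be recomputed from the timings in the table. The sketch below does only that arithmetic; the function names and the sketch namespace are hypothetical, and the numbers are copied from the table rather than measured.

```cpp
#include <cassert>
#include <cstdio>

namespace sketch {  // hypothetical helper, not part of the repository

// Ratio of two timings given in microseconds.
inline double time_ratio(double numer_us, double denom_us) {
    assert(numer_us > 0.0 && denom_us > 0.0);
    return numer_us / denom_us;
}

// Prints speedup vs scalar and throughput relative to OpenCV for the
// NEON cache-blocked row, one line per image size.
inline void print_headline() {
    const char* sizes[] = {"224x224", "512x512", "1024x1024"};
    const double scalar_us[] = {142.0, 750.0, 3120.0};
    const double blocked_us[] = {26.0, 121.0, 511.0};
    const double opencv_us[] = {24.0, 113.0, 476.0};
    for (int i = 0; i < 3; ++i) {
        // Speedup = scalar time / blocked time.
        const double speedup = time_ratio(scalar_us[i], blocked_us[i]);
        // Relative throughput = OpenCV time / blocked time.
        const double rel = time_ratio(opencv_us[i], blocked_us[i]);
        std::printf("%-10s %.2fx vs scalar, %.1f%% of OpenCV\n",
                    sizes[i], speedup, 100.0 * rel);
    }
}

}  // namespace sketch
```

Calling sketch::print_headline() shows that the speedup ranges from roughly 5.5x to 6.2x and the throughput relative to OpenCV from about 92% to 93%, depending on the image size.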

Notes:

These values are reference numbers reported by this repository; re-run the benchmarks on your own hardware and software stack before relying on them.
The reproducibility workflow is documented in docs/technical_report.md and results/README.md.

Repository Layout

neon-conv3x3/
|-- CMakeLists.txt
|-- cmake/
|   |-- DetectNEON.cmake
|   +-- toolchain-aarch64.cmake
|-- include/
|   |-- conv3x3.hpp
|   |-- neon_utils.hpp
|   +-- cache_params.hpp
|-- src/
|   |-- conv3x3_scalar.cpp
|   |-- conv3x3_neon_naive.cpp
|   |-- conv3x3_neon_blocked.cpp
|   +-- conv3x3_dispatch.cpp
|-- bench/
|   |-- bench_conv.cpp
|   +-- bench_sizes.cpp
|-- tests/
|   |-- test_correctness.cpp
|   |-- test_boundary.cpp
|   +-- test_perf_counters.cpp
|-- scripts/
|   |-- run_perf.sh
|   |-- plot_speedup.py
|   +-- plot_roofline.py
|-- docs/
|   |-- technical_report.md
|   |-- register_allocation.md
|   |-- assembly_annotation.md
|   +-- perf_counters.md
+-- results/
    +-- README.md

Build

Native ARM64 (recommended for benchmarking)

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCONV_ENABLE_NEON=ON \
  -DCONV_ENABLE_BENCH=ON \
  -DCONV_ENABLE_TESTS=ON

cmake --build build -j$(nproc)

Cross-compile from x86-64

cmake -B build-cross \
  -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-aarch64.cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DCONV_ENABLE_NEON=ON \
  -DCONV_ENABLE_BENCH=OFF \
  -DCONV_ENABLE_TESTS=OFF

cmake --build build-cross -j$(nproc)

Test

ctest --test-dir build --output-on-failure

Benchmark

mkdir -p results
./build/bench_conv \
  --benchmark_format=json \
  --benchmark_repetitions=10 \
  --benchmark_report_aggregates_only=true \
  | tee results/conv_bench.json

python3 scripts/plot_speedup.py results/conv_bench.json --out results/speedup.png
python3 scripts/plot_roofline.py results/conv_bench.json --out results/roofline.png

PMU / perf Collection

bash scripts/run_perf.sh

The script writes its output to results/perf_neon_blocked.txt for use with the analysis workflow documented in docs/perf_counters.md.

API Snapshot

#include "conv3x3.hpp"

conv::conv3x3_scalar(src, dst, kernel, H, W);
conv::conv3x3_neon_naive(src, dst, kernel, H, W);
conv::conv3x3_neon_cb(src, dst, kernel, H, W, /*Th=*/0);

auto fn = conv::select_conv_fn();
fn(src, dst, kernel, H, W);

conv::pad_border(src, padded_dst, H, W, /*pad=*/1);
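
The snapshot above does not show parameter types or the padding policy, so the following is only a sketch of how padding, dispatch, and a kernel call might fit together. Everything in it is an assumption made for illustration: float32 data, row-major layout, zero padding, a kernel that expects a pre-padded input, the ConvFn alias, the cpu_has_neon helper, and the placeholder NEON kernel that simply forwards to the scalar code. The runtime check uses getauxval with HWCAP_ASIMD, which is available on AArch64 Linux; other platforms fall back to the scalar path.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>   // getauxval
#include <asm/hwcap.h>  // HWCAP_ASIMD
#endif

namespace sketch {  // hypothetical namespace, not the repository's conv::

// Assumed kernel signature: src is pre-padded to (H + 2) x (W + 2).
using ConvFn = void (*)(const float* src, float* dst, const float* kernel,
                        std::size_t H, std::size_t W);

// Zero-fills the padded buffer, then copies each row into its interior.
// padded_dst must hold (H + 2 * pad) x (W + 2 * pad) floats.
inline void pad_border(const float* src, float* padded_dst,
                       std::size_t H, std::size_t W, std::size_t pad) {
    assert(src && padded_dst);
    const std::size_t pw = W + 2 * pad;
    std::fill_n(padded_dst, (H + 2 * pad) * pw, 0.0f);
    for (std::size_t y = 0; y < H; ++y)
        std::copy_n(src + y * W, W, padded_dst + (y + pad) * pw + pad);
}

inline void conv3x3_scalar(const float* src, float* dst, const float* kernel,
                           std::size_t H, std::size_t W) {
    const std::size_t stride = W + 2;
    for (std::size_t y = 0; y < H; ++y)
        for (std::size_t x = 0; x < W; ++x) {
            float acc = 0.0f;
            for (std::size_t ky = 0; ky < 3; ++ky)
                for (std::size_t kx = 0; kx < 3; ++kx)
                    acc += kernel[ky * 3 + kx] * src[(y + ky) * stride + x + kx];
            dst[y * W + x] = acc;
        }
}

// Placeholder standing in for a vectorized kernel; it forwards to the
// scalar code so the sketch stays short and runs on any host.
inline void conv3x3_neon_placeholder(const float* src, float* dst,
                                     const float* kernel,
                                     std::size_t H, std::size_t W) {
    conv3x3_scalar(src, dst, kernel, H, W);
}

// Runtime feature check; reports false on non-AArch64 or non-Linux hosts.
inline bool cpu_has_neon() {
#if defined(__aarch64__) && defined(__linux__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
#else
    return false;
#endif
}

// Returns the preferred kernel for the current CPU.
inline ConvFn select_conv_fn() {
    return cpu_has_neon() ? &conv3x3_neon_placeholder : &conv3x3_scalar;
}

}  // namespace sketch
```

Returning a plain function pointer keeps the per-call overhead to one indirect call, and the feature check runs only when the pointer is selected, not on every convolution.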

Documentation

Technical report: docs/technical_report.md
Assembly walkthrough: docs/assembly_annotation.md
Register allocation notes: docs/register_allocation.md
PMU event guidance: docs/perf_counters.md
Benchmark artifact contract: results/README.md

Reproducibility Checklist

Record commit hash, compiler version, and CPU model.
Run correctness suite before collecting timings.
Pin the CPU frequency governor to performance mode for ARM benchmarking.
Run at least 10 repetitions and report the aggregate mean.
Preserve raw benchmark JSON in results/.
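
When raw per-repetition timings are kept, the aggregate mean can be checked by hand. The sketch below is a hypothetical helper, not part of the repository: it computes the mean, the sample standard deviation, and their ratio (coefficient of variation). Reporting the spread alongside the mean is an assumption of this example, not a stated requirement of the checklist.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical helper, not part of the repository

struct Summary {
    double mean;    // arithmetic mean of the timings
    double stddev;  // sample standard deviation (n - 1 denominator)
    double cv;      // stddev / mean, a unitless measure of run-to-run noise
};

// Summarizes per-repetition timings, for example in microseconds.
inline Summary summarize(const std::vector<double>& times) {
    assert(times.size() >= 2);
    const double n = static_cast<double>(times.size());
    double sum = 0.0;
    for (double t : times) sum += t;
    const double mean = sum / n;
    double sq = 0.0;
    for (double t : times) sq += (t - mean) * (t - mean);
    const double stddev = std::sqrt(sq / (n - 1.0));
    return Summary{mean, stddev, stddev / mean};
}

}  // namespace sketch
```

A large coefficient of variation across repetitions usually points to an unstable measurement setup, for example frequency scaling or background load, rather than to the kernel itself.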

License

Licensed under the MIT License. See LICENSE.

Citation

If you use this project in reports or publications, cite via CITATION.cff.
