armlint

armlint examines AArch64 machine code to find suboptimal instruction sequences. For example, building the constant 0x66666666 as

movz w0, #0x6666
movk w0, #0x6666, lsl #16

is two instructions where one would do, because 0x66666666 is encodable as an AArch64 logical (bitmask) immediate:

mov w0, #0x66666666     ; orr w0, wzr, #0x66666666

armlint helps compiler writers and assembly authors generate tighter code, and documents corners of the A64 instruction set.
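The claim that 0x66666666 is a logical immediate can be checked mechanically. The sketch below is not armlint's code; the function names are hypothetical. It applies the architectural rule that a logical immediate is a 2-, 4-, 8-, 16-, 32- or 64-bit element, made of one rotated run of ones, repeated across the register, with all-zeros and all-ones excluded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Count set bits with a portable loop rather than a compiler builtin. */
static unsigned popcount64(uint64_t v)
{
    unsigned n = 0;
    for (; v != 0; v &= v - 1) {
        n++;
    }
    return n;
}

/*
 * Hypothetical helper: true if v, taken as a width-bit value (32 or 64),
 * is encodable as an A64 logical (bitmask) immediate.
 */
static bool is_logical_imm(uint64_t v, unsigned width)
{
    uint64_t full = (width == 64) ? ~UINT64_C(0) : (UINT64_C(1) << width) - 1;
    v &= full;
    if (v == 0 || v == full) {
        return false;   /* all-zeros and all-ones have no encoding */
    }
    /* Find the smallest element size e whose repetition reproduces v. */
    for (unsigned e = 2; e <= width; e *= 2) {
        uint64_t emask = (e == 64) ? ~UINT64_C(0) : (UINT64_C(1) << e) - 1;
        uint64_t elem = v & emask;
        bool repeats = true;
        for (unsigned i = e; i < width; i += e) {
            if (((v >> i) & emask) != elem) {
                repeats = false;
                break;
            }
        }
        if (!repeats) {
            continue;
        }
        /* A single rotated run of ones has exactly two 0/1 boundaries
         * when the element is read as a circular bit string. */
        uint64_t rot = ((elem >> 1) | (elem << (e - 1))) & emask;
        return popcount64(elem ^ rot) == 2;
    }
    return false;
}
```

For 0x66666666 the smallest repeating element is the 4-bit pattern 0110, a single run of two ones, so the check succeeds; 0x12345678 has no such structure and still needs movz + movk.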

Design and limitations

armlint is a peephole analyzer. It decodes each 32-bit A64 instruction directly from the binary and matches it by mask and value, resolving aliases (for example MUL is MADD with a zero accumulator) so that both spellings of a pattern are caught. It then looks for a short window of adjacent instructions that a shorter or cheaper encoding can replace.
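As an illustration of mask-and-value matching with alias resolution, the sketch below recognizes MADD from its fixed opcode bits and then treats it as MUL when the accumulator field Ra names the zero register. The macro and function names are hypothetical and are not taken from armlint's source; the bit positions follow the A64 encoding of MADD.

```c
#include <stdbool.h>
#include <stdint.h>

/* MADD: bits 30..21 are 0011011000 and bit 15 (o0) is 0.
 * Bit 31 (sf) is left out of the mask so both W and X forms match. */
#define MADD_MASK  0x7FE08000u
#define MADD_VALUE 0x1B000000u

/* Match by mask and value: only the fixed opcode bits are compared. */
static bool is_madd(uint32_t insn)
{
    return (insn & MADD_MASK) == MADD_VALUE;
}

/* Ra, the accumulator register, occupies bits 14..10. */
static unsigned field_ra(uint32_t insn)
{
    return (insn >> 10) & 31u;
}

/* MUL Rd, Rn, Rm is the alias of MADD Rd, Rn, Rm, ZR (Ra == 31). */
static bool is_mul_alias(uint32_t insn)
{
    return is_madd(insn) && field_ra(insn) == 31u;
}
```

With these helpers, 0x9B027C20 (mul x0, x1, x2) and 0x9B020C20 (madd x0, x1, x2, x3) both match is_madd, but only the first is reported as the MUL alias, so a check written against either spelling sees the same instruction.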

The overriding rule is soundness: armlint emits a finding only when the rewrite provably preserves the architectural result. For a tool that suggests code changes, a false positive is the worst failure, so it errs toward false negatives -- a missed opportunity is cheaper than a wrong one. Each check documents the exact conditions under which its rewrite is equivalent; the constraints below are the ones they share.

  • Strict adjacency. A producer and its consumer must be consecutive; an unrelated instruction between them suppresses the finding. armlint does not reorder code or look through intervening instructions.
  • Liveness is proved structurally, not analyzed. A producer-into-consumer fold fires only when the consumer overwrites the producer's destination register, proving the intermediate value is dead. There is no general-purpose register liveness pass.
  • MOV-chain folds assume the constant is dead. Folds that absorb a materialized constant -- MUL/MNEG/UDIV by a constant, MOV + ADD/AND/ORR/EOR, MOV #0 -- save an instruction only if the constant register feeds nothing else, which armlint cannot confirm without a liveness pass. The consumer rewrite itself stays valid regardless; if the constant is still live, the MOV has to be kept.
  • Flag liveness uses a bounded forward scan. The branch- and flag-folding checks drop a CMP/TST only after confirming that no later instruction reads N/Z/C/V before they are overwritten, scanning the fall-through path for a limited window. The branch-target path is never followed, so a finding near a taken edge is suppressed rather than risked.
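The adjacency and structural-liveness rules can be made concrete with one fold. The sketch below is a hypothetical helper, not armlint's implementation, and it handles only the narrow case of a 64-bit ADD immediate followed by a 32-bit LDR with zero offset. It fires only when the load both uses the ADD result as its base and overwrites that same register, which is what proves the intermediate address dead.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Try to fold   add xT, xN, #imm ; ldr wT, [xT]   into   ldr wT, [xN, #imm].
 * The two words must be consecutive in the instruction stream.
 * On success, store the folded encoding in *out and return true.
 */
static bool fold_add_ldr(uint32_t add, uint32_t ldr, uint32_t *out)
{
    if ((add & 0xFFC00000u) != 0x91000000u) {
        return false;               /* not ADD Xd, Xn, #imm12 (unshifted) */
    }
    if ((ldr & 0xFFC00000u) != 0xB9400000u) {
        return false;               /* not LDR Wt, [Xn, #uimm] */
    }
    unsigned tmp  = add & 31u;              /* ADD destination */
    unsigned src  = (add >> 5) & 31u;       /* ADD source, the new base */
    unsigned imm  = (add >> 10) & 0xFFFu;   /* byte offset being added */
    unsigned rt   = ldr & 31u;              /* load destination */
    unsigned base = (ldr >> 5) & 31u;       /* load base */
    unsigned off  = (ldr >> 10) & 0xFFFu;   /* load's own scaled offset */

    if (tmp == 31u) {
        return false;   /* 31 is SP for the ADD but WZR for the load */
    }
    if (base != tmp || rt != tmp) {
        return false;   /* the load must consume and overwrite the temp */
    }
    if (off != 0 || (imm & 3u) != 0) {
        return false;   /* 32-bit LDR offsets are multiples of 4 */
    }
    *out = 0xB9400000u | ((imm / 4u) << 10) | (src << 5) | rt;
    return true;
}
```

Given add x8, x8, #0x2c (0x9100B108) and ldr w8, [x8] (0xB9400108), the helper produces 0xB9402D08, which is ldr w8, [x8, #0x2c]. If the load targeted w9 instead, x8 would still hold the sum afterwards and the fold is refused.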

Findings are opportunities, not guaranteed speedups. Some, such as the pre- and post-indexed addressing folds, are code-size and front-end wins that are backend-neutral. Each check's notes say what its rewrite actually saves.
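The bounded forward scan used for flag liveness can be pictured as follows. This is a minimal sketch under stated assumptions: the enum, the function name, and the window size of 8 are all hypothetical (the window length armlint actually uses is not stated here), and the caller is assumed to have already classified each fall-through instruction by its effect on the condition flags.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of one instruction's effect on NZCV. */
enum flag_effect {
    FLAG_NONE,      /* neither reads nor writes the flags */
    FLAG_READ,      /* reads the flags (including read-then-write, e.g. ADCS) */
    FLAG_WRITE,     /* overwrites all flags without reading them */
    FLAG_BARRIER    /* branch, call, return: control leaves the window */
};

/* Assumed window size, chosen only for illustration. */
#define FLAG_SCAN_WINDOW 8

/*
 * Return true only if the flags are provably dead on the fall-through
 * path: a full overwrite is reached before any read. Every other
 * outcome, including running out of window, counts as "maybe live".
 */
static bool flags_dead_after(const enum flag_effect *next, size_t count)
{
    size_t limit = count < FLAG_SCAN_WINDOW ? count : FLAG_SCAN_WINDOW;
    for (size_t i = 0; i < limit; i++) {
        switch (next[i]) {
        case FLAG_WRITE:
            return true;
        case FLAG_READ:
        case FLAG_BARRIER:
            return false;
        case FLAG_NONE:
            break;
        }
    }
    return false;   /* window exhausted: stay conservative */
}
```

Returning false in every uncertain case mirrors the soundness rule above: a scan that cannot prove the flags dead suppresses the finding instead of guessing.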

Implemented analyses

Each row links to its full description -- mechanics, soundness, and what the rewrite saves -- in analyses.md.

| Pattern | Rewrite |
| --- | --- |
| movz + movk (full-width constant) | single bitmask-immediate mov |
| lsl + add/sub/and/orr/eor | add Rd, Rn, Rm, lsl #n |
| sxtw/uxtb/sxtb + add/sub | add Rd, Rn, Wm, sxtw |
| cmp #0 + b.eq/b.ne | cbz/cbnz |
| cmp #0 + b.lt/b.ge/b.mi/b.pl | tbnz/tbz Rn, #(msb) |
| tst #(1<<k) + b.eq/b.ne | tbz/tbnz Rn, #k |
| lsl + lsr/asr | ubfx/sbfx/ubfiz/sbfiz |
| lsr + and #mask | ubfx |
| and #mask + lsr | ubfx |
| and #mask + lsl (or lsr + lsl) | ubfiz (or clearing and) |
| zeroing producer + uxtb/uxth/uxtw/and | drop the zero-extension |
| mov xd, xd | remove (architectural no-op) |
| sign-extending producer + sxtb/sxth/sxtw | drop the sign-extension |
| and/orr/eor/sub/bic/orn/eon with Rs, Rs | mov / zero / all-ones |
| ldr+ldr / str+str (consecutive) | ldp/stp (and ldpsw) |
| and + and/ubfiz + orr (clear/isolate/merge) | bfxil/bfi |
| csel Rd, Rn, Rn, cond | mov Rd, Rn |
| add/sub Rd, Rn, #0 | mov Rd, Rn, or remove |
| adds/subs/ands + cmp #0 + b.eq/b.ne | drop the redundant cmp/tst |
| mov #2^N + mul | lsl, or add Rd, Ra, Ra, lsl #N |
| mov #C + mneg | neg, or shifted neg/sub |
| mov #2^N + udiv | lsr |
| mov #C + add/sub | add/sub Rd, Rn, #C |
| mov #C + and/orr/eor/ands | and/orr/eor/ands Rd, Rn, #C |
| mov #0 + str/add/and use | use wzr/xzr |
| mul + add/sub | madd/msub (or mneg) |
| smull/umull + add/sub | smaddl/umaddl/smsubl/umsubl |
| neg + add/sub | sub/add |
| mvn + and/orr/eor/ands | bic/orn/eon/bics |
| add + ldr [xt] | ldr [xn, xm{, lsl #s}] |
| sxtw + ldr [xn, xt] | ldr [xn, ws, sxtw {#s}] |
| add #a + ldr [xt] | ldr [xn, #a] |
| ldr [xn] + add/sub xn | ldr [xn], #±imm (post-index) |
| add/sub xn + ldr [xn] | ldr [xn, #±imm]! (pre-index) |
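Several rows depend on a simple arithmetic precondition on the constant. As a sketch of two of them, the hypothetical helpers below test whether a constant is a power of two (the mov #2^N + mul and mov #2^N + udiv rows) and whether a mask is a low run of ones (the lsr + and #mask row), and give reference semantics for an unsigned bitfield extract. None of these names come from armlint; analyses.md remains the authority on each check's exact conditions.

```c
#include <stdbool.h>
#include <stdint.h>

/* If c == 2^n, store n and return true; zero is rejected. */
static bool pow2_shift(uint64_t c, unsigned *n)
{
    if (c == 0 || (c & (c - 1)) != 0) {
        return false;
    }
    unsigned s = 0;
    while ((c >> s) != 1) {
        s++;
    }
    *n = s;
    return true;
}

/* If mask == 2^w - 1 with 1 <= w <= 63, store w and return true. */
static bool low_mask_width(uint64_t mask, unsigned *w)
{
    if (mask == 0 || (mask & (mask + 1)) != 0) {
        return false;
    }
    return pow2_shift(mask + 1, w);
}

/* Reference semantics of UBFX Xd, Xn, #lsb, #width, assuming
 * width >= 1 and lsb + width <= 64. */
static uint64_t ubfx64(uint64_t x, unsigned lsb, unsigned width)
{
    uint64_t mask = (width == 64) ? ~UINT64_C(0) : (UINT64_C(1) << width) - 1;
    return (x >> lsb) & mask;
}
```

For unsigned x, multiplying by 8 equals x << 3 and dividing by 8 equals x >> 3, which is why the mul and udiv rows reduce to shifts; likewise (x >> 4) & 0xFF equals ubfx64(x, 4, 8), the single instruction that replaces the lsr + and pair.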

Compilation

armlint depends on Capstone and uses pkg-config to locate it. On macOS:

brew install capstone

On Debian/Ubuntu:

apt install libcapstone-dev pkg-config

Build:

git clone https://github.com/gaul/armlint.git armlint
cd armlint
make all

Two test suites are available. make test runs the unit tests against fabricated byte sequences, exercising the check registry directly. make integration-test runs the snapshot suite under fixtures/: each .s is assembled with clang -arch arm64 and armlint's output is diffed against a checked-in .expected file. The integration suite covers the Mach-O parser and the report formatting, which the unit tests bypass; it skips cleanly on hosts without an arm64 toolchain. After an intentional output change, regenerate the snapshots with make integration-test-regen and review the diff before committing.

Usage

armlint is intended to run inside compiler test suites, which should #include "armlint.h" and link against libarmlint.a. Pass the just-emitted machine code to check_instructions; its return value is the number of opportunities found, which a test can assert is zero:

#include "armlint.h"   // also includes <capstone/capstone.h>

// code/code_len: the AArch64 bytes to check (e.g. a function the
// compiler just emitted); base_addr is the address they load at.
// Returns the opportunity count (0 == clean), or -1 on a decode error.
int lint(const uint8_t *code, size_t code_len, uint64_t base_addr)
{
    csh handle;
    if (cs_open(CS_ARCH_ARM64, CS_MODE_ARM, &handle) != CS_ERR_OK) {
        return -1;
    }
    cs_option(handle, CS_OPT_DETAIL, CS_OPT_ON);

    armlint_summary *summary = armlint_summary_create();
    int findings = check_instructions(
        handle, code, code_len, base_addr, /*verbose=*/true, summary);
    armlint_summary_print(summary);   // optional by-type tally

    armlint_summary_destroy(summary);
    cs_close(&handle);
    return findings;
}
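A test that holds instructions as 32-bit words rather than bytes needs to serialize them first. A64 instruction words are stored little-endian regardless of data endianness, so a small helper like the one below (hypothetical, not part of armlint.h) can build the byte buffer that the function above expects.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write n A64 instruction words to out as little-endian bytes.
 * out must have room for 4 * n bytes. Returns the byte count.
 */
static size_t a64_words_to_bytes(const uint32_t *words, size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        out[4 * i + 0] = (uint8_t)(words[i] & 0xFFu);
        out[4 * i + 1] = (uint8_t)((words[i] >> 8) & 0xFFu);
        out[4 * i + 2] = (uint8_t)((words[i] >> 16) & 0xFFu);
        out[4 * i + 3] = (uint8_t)((words[i] >> 24) & 0xFFu);
    }
    return 4 * n;
}
```

For example, the words 0x528CCCC0 (movz w0, #0x6666) and 0x72ACCCC0 (movk w0, #0x6666, lsl #16) serialize to eight bytes that can be handed to lint(buf, 8, 0); given the movz + movk row in the table of analyses, a nonzero count would be the expected result for that pair.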

The summary is optional -- pass NULL to skip the by-type tally -- and verbose controls whether each opportunity is printed as it is found. armlint can also read arbitrary AArch64 binaries (ELF, thin Mach-O, or universal/fat Mach-O) directly:

./armlint /path/to/aarch64/binary
./armlint /bin/ls

By default armlint prints only a summary: the opportunities grouped by type and sorted by prevalence, so it is clear which to look at first, followed by a total and the number of instructions scanned. A large binary can have hundreds of thousands of opportunities, so the per-opportunity detail is suppressed unless requested:

$ ./armlint /bin/ls
Optimization opportunities by type:
      39  ADD + LDR foldable to pre-indexed LDR
      36  ADD + LDR foldable to immediate-offset LDR
       1  adjacent STRs foldable into STP

76 optimization opportunities in 4153 instructions

Pass -v to also print each opportunity -- its one-line summary plus the offending instructions, as shown below -- ahead of the summary:

$ ./armlint -v /bin/ls
ADD + LDR foldable to immediate-offset LDR at offset: 0x60: -> ldr w8, [x8, #0x2c] (2 instructions)
  add x8, x8, #0x2c
  ldr w8, [x8]
...

The process exits non-zero when any opportunity is found, so armlint can gate a compiler test suite.

License

Copyright (C) 2026 Andrew Gaul

Licensed under the Apache License, Version 2.0
