Appendix: Compile-Time Optimization

Zig's comptime feature executes code at compile time, generating optimized runtime code with zero overhead. Agave uses this extensively for lookup tables, feature detection, and type-specialized dispatch.

comptime Basics

comptime means "computed at compile time". The compiler evaluates the expression during compilation, and the result is baked into the binary.

flowchart LR
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    Source["Source Code\n(comptime expression)"]:::setup
    Compiler["Zig Compiler\n(compile time)"]:::sync
    Value["Constant Value\nbaked into binary"]:::migration
    Binary["Executable Binary\n(.rodata section)"]:::success
    Runtime["Runtime\n(user runs program)"]:::setup
    Result["Instant result\n(no computation)"]:::success

    Source --> Compiler
    Compiler --> Value
    Value --> Binary
    Runtime --> Binary
    Binary --> Result

    subgraph CompilePhase["Compile Phase (your machine, once)"]
        Source
        Compiler
        Value
    end

    subgraph RunPhase["Run Phase (user's machine, many times)"]
        Runtime
        Binary
        Result
    end

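const table_size = 256;  // Comptime-known constant
const doubled = comptime table_size * 2;  // Explicitly evaluated at compile time (512)

// The binary contains the value 512, not the multiplication.
// At container (file) scope every initializer is already evaluated at compile time,
// so the explicit comptime keyword only changes behavior inside function bodies.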
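The same mechanism applies to function calls, not just literal arithmetic. The following is a minimal sketch; `fibonacci`, `fib_10`, `Scratch`, and `fibonacciAtRuntime` are illustrative names, not Agave code. It shows one ordinary function being evaluated by the compiler in one place and called at runtime in another.

```zig
const std = @import("std");

// An ordinary function: nothing in its body is comptime-specific.
fn fibonacci(n: u32) u32 {
    var a: u32 = 0;
    var b: u32 = 1;
    var i: u32 = 0;
    while (i < n) : (i += 1) {
        const next = a + b;
        a = b;
        b = next;
    }
    return a;
}

// Container-level initializers are evaluated by the compiler,
// so the binary stores the result rather than the loop.
pub const fib_10 = fibonacci(10);

// Array lengths must be comptime-known; this type can only be declared
// because fib_10 was computed during compilation.
pub const Scratch = [fib_10]u8;

// The same function is still callable with runtime arguments.
pub fn fibonacciAtRuntime(n: u32) u32 {
    return fibonacci(n);
}
```

Using the result as an array length is a convenient way to confirm that a value really is comptime-known: the declaration would not compile otherwise.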

When to use comptime:

  • Building lookup tables
  • Feature detection based on target platform
  • Type-level computations
  • Format string validation

Lookup Tables

Pre-computing values at compile time eliminates runtime arithmetic.

flowchart TD
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    NaiveInput["8-bit FP8 value\n(e.g. 0xA7)"]:::setup
    NaiveOps["Runtime: extract bits,\nbranch, pow(), multiply\n~30 instructions"]:::danger
    NaiveOut["f32 result"]:::migration

    ComptimeLoop["Compiler: loop 0..256\nfp8e4m3Compute(i)"]:::sync
    LUT["[256]f32 table\nin .rodata\n(1 KB)"]:::migration
    FastInput["8-bit FP8 value\n(e.g. 0xA7)"]:::setup
    LUTLookup["Runtime: array[val]\n1 instruction"]:::sync
    FastOut["f32 result"]:::success

    NaiveInput --> NaiveOps
    NaiveOps --> NaiveOut

    ComptimeLoop --> LUT
    LUT --> LUTLookup
    FastInput --> LUTLookup
    LUTLookup --> FastOut

    subgraph Naive["Naive (runtime per call)"]
        NaiveInput
        NaiveOps
        NaiveOut
    end

    subgraph LUTPath["LUT (comptime table, runtime lookup)"]
        ComptimeLoop
        LUT
        FastInput
        LUTLookup
        FastOut
    end


FP8 E4M3 Dequantization Table

Naive approach (runtime conversion):

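const std = @import("std");

pub fn fp8e4m3ToF32(val: u8) f32 {
    // Extract sign, exponent, mantissa from 8-bit value
    const sign = (val >> 7) & 1;
    const exp = (val >> 3) & 0xF;
    const mant = val & 0x7;

    // Compute float value (E4M3 exponent bias is 7)
    const bias = 7;
    // Explicit f32: a comptime_float result cannot depend on a runtime branch
    const sign_mult: f32 = if (sign == 1) -1.0 else 1.0;

    if (exp == 0) {
        // Subnormal: no implicit leading 1, fixed exponent of 1 - bias
        return sign_mult * (@as(f32, @floatFromInt(mant)) / 8.0) * std.math.pow(f32, 2.0, 1 - bias);
    } else {
        // Normal: implicit leading 1.
        // Note: the OCP E4M3 NaN encoding (exp == 15, mant == 7) is not special-cased here.
        const frac = 1.0 + (@as(f32, @floatFromInt(mant)) / 8.0);
        return sign_mult * frac * std.math.pow(f32, 2.0, @as(f32, @floatFromInt(exp)) - bias);
    }
}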

Cost per call: ~20-30 instructions (bit shifts, branches, floating-point arithmetic, pow() call).

Optimized approach (comptime lookup table):

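// Build 256-entry lookup table at compile time.
// fp8e4m3Compute is the bit-level decoder from the naive version, kept under its own name.
const fp8e4m3_lut: [256]f32 = blk: {
    // 256 decoder calls can exceed the default limit of 1000 comptime branches
    @setEvalBranchQuota(100_000);
    var table: [256]f32 = undefined;
    for (0..256) |i| {
        table[i] = fp8e4m3Compute(@intCast(i));  // Computed once at compile time
    }
    break :blk table;
};

// Runtime dequantization is a single array lookup
pub inline fn fp8e4m3ToF32(val: u8) f32 {
    return fp8e4m3_lut[val];
}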

Cost per call: 1 instruction (load from .rodata section).

Speedup: roughly 20-30× for the dequantization step itself, provided the 1 KB table stays in L1 cache. In a full GEMV, this saves ~5-10% of total time.
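To make the two snippets above concrete, here is a self-contained sketch that builds the table and checks it against the decoder. It makes several assumptions that are not taken from Agave: the OCP E4M3 layout (bias 7, `0x7F`/`0xFF` as NaN, no infinities), and a hypothetical `pow2` helper used in place of `std.math.pow` so the comptime work stays trivially cheap.

```zig
const std = @import("std");

// Integer power of two by repeated multiplication; exact in f32 for the
// small exponents FP8 needs, and cheap to evaluate at compile time.
fn pow2(e: i32) f32 {
    var result: f32 = 1.0;
    var n = e;
    while (n > 0) : (n -= 1) result *= 2.0;
    while (n < 0) : (n += 1) result *= 0.5;
    return result;
}

// Reference decoder (assumed layout: 1 sign, 4 exponent, 3 mantissa bits).
pub fn fp8e4m3Compute(val: u8) f32 {
    const sign: f32 = if ((val & 0x80) != 0) -1.0 else 1.0;
    const exp: i32 = (val >> 3) & 0xF;
    const mant: f32 = @floatFromInt(val & 0x7);

    if (exp == 0) return sign * (mant / 8.0) * pow2(-6); // subnormal
    if (exp == 15 and mant == 7.0) return std.math.nan(f32); // NaN encoding
    return sign * (1.0 + mant / 8.0) * pow2(exp - 7); // normal
}

pub const fp8e4m3_lut: [256]f32 = blk: {
    // 256 calls with inner loops exceed the default 1000-branch budget.
    @setEvalBranchQuota(20_000);
    var table: [256]f32 = undefined;
    for (0..256) |i| {
        table[i] = fp8e4m3Compute(@intCast(i));
    }
    break :blk table;
};

// Runtime path: one indexed load.
pub fn fp8e4m3ToF32(val: u8) f32 {
    return fp8e4m3_lut[val];
}
```

Keeping the decoder as a normal function means it can also be called at runtime in a test to cross-check every table entry, which is a cheap guard against a table and decoder drifting apart.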

comptime Block Syntax

const table = blk: {
    var result: [N]T = undefined;
    // ... compute result ...
    break :blk result;  // Return from comptime block
};

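Key points:

  • blk: is a labeled block
  • break :blk value exits the block and makes value its result
  • At container (file) scope the initializer is always evaluated at compile time; inside a function body, prefix the block with comptime to force this
  • table becomes a compile-time constant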
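Inside a function body the same labeled block would run at runtime unless it is marked `comptime`. A minimal sketch of that case follows; `squareOf` and the squares table are illustrative only.

```zig
const std = @import("std");

pub fn squareOf(n: u4) u16 {
    // Inside a function a labeled block runs at runtime by default;
    // the `comptime` prefix makes the compiler evaluate it instead.
    const squares = comptime blk: {
        var result: [16]u16 = undefined;
        for (&result, 0..) |*slot, i| {
            slot.* = @intCast(i * i);
        }
        break :blk result;
    };
    // Only this indexed load remains at runtime.
    return squares[n];
}
```

Taking the index as `u4` means every possible argument is a valid index into the 16-entry table, so no bounds check can fail.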

IQ4_NL Dequantization Table

IQ4_NL uses a fixed dequantization table (not computed, but verified at comptime):

pub const iq4nl_table: [16]i8 = .{
    -127, -104, -83, -65, -49, -35, -22, -10,
    1, 13, 25, 38, 53, 69, 89, 113,
};

// Illustrative usage (not a real API function — callers use iq4nl_table directly):
// const val = @as(f32, @floatFromInt(iq4nl_table[nibble])) * scale;

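Why a table? IQ4_NL uses non-linear quantization: the step sizes aren't uniform. Values near zero have fine steps (11-12 apart in the table above), while values near the extremes have coarse steps (23-24 apart). For weights concentrated near zero, this typically gives better accuracy than linear Q4.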

comptime verification:

comptime {
    std.debug.assert(iq4nl_table.len == 16);  // 4-bit = 16 values
    for (iq4nl_table, 0..) |v, i| {
        if (i > 0) {
            std.debug.assert(v > iq4nl_table[i - 1]);  // Strictly increasing
        }
    }
}

This runs at compile time. If the table is malformed, compilation fails.
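As a sketch of how such a table is consumed, the function below expands packed 4-bit indices into scaled floats. The packing used here (two indices per byte, low nibble first) and the name `dequantizeNibbles` are assumptions for illustration; the real IQ4_NL block layout is not described in this appendix.

```zig
const std = @import("std");

pub const iq4nl_table: [16]i8 = .{
    -127, -104, -83, -65, -49, -35, -22, -10,
    1,    13,   25,  38,  53,  69,  89,  113,
};

// Hypothetical layout: each byte holds two 4-bit table indices,
// low nibble first.
pub fn dequantizeNibbles(bytes: []const u8, scale: f32, out: []f32) void {
    std.debug.assert(out.len == bytes.len * 2);
    for (bytes, 0..) |byte, i| {
        const lo: u4 = @truncate(byte);
        const hi: u4 = @truncate(byte >> 4);
        out[2 * i] = @as(f32, @floatFromInt(iq4nl_table[lo])) * scale;
        out[2 * i + 1] = @as(f32, @floatFromInt(iq4nl_table[hi])) * scale;
    }
}
```

Because the index type is `u4`, the table lookup cannot go out of range regardless of the input bytes.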

Feature Detection

Zig's builtin module provides platform information at comptime.

flowchart LR
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    BuildCmd["zig build\n-Dtarget=aarch64-macos"]:::setup
    Builtin["builtin.os.tag\nbuiltin.cpu.arch\nbuild_options.*"]:::migration
    MetalBranch["MetalBackend\ncompiled in"]:::sync
    VulkanBranch["VulkanBackend\ncompiled in"]:::sync
    CPUBranch["CpuBackend\ncompiled in"]:::sync
    Binary["macOS Binary\n(Metal only,\nLinux code absent)"]:::success
    LinuxBin["Linux Binary\n(Vulkan only,\nMetal code absent)"]:::success
    CPUBin["Other Binary\n(CPU fallback)"]:::success

    BuildCmd --> Builtin
    Builtin --> MacOS{{"os == .macos?"}}
    MacOS -- yes --> MetalBranch
    MacOS -- no --> Linux{{"os == .linux?"}}
    Linux -- yes --> VulkanBranch
    Linux -- no --> CPUBranch

    MetalBranch --> Binary
    VulkanBranch --> LinuxBin
    CPUBranch --> CPUBin

    subgraph CompileTime["Compile Time: dead code eliminated"]
        MacOS
        Linux
        MetalBranch
        VulkanBranch
        CPUBranch
    end


Target OS Detection

const builtin = @import("builtin");

pub fn initBackend() !Backend {
    if (comptime builtin.os.tag == .macos) {
        return Backend{ .metal = try MetalBackend.init() };
    } else if (comptime builtin.os.tag == .linux) {
        return Backend{ .vulkan = try VulkanBackend.init() };
    } else {
        return Backend{ .cpu = try CpuBackend.init() };
    }
}

Dead code elimination: Because the conditions are comptime-known, the compiler analyzes and emits only the branch for the target platform. When compiling for macOS, the Linux and CPU branches are never emitted, so they add nothing to the binary.
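A self-contained sketch of the same pattern, reduced to an enum so it compiles on any target. `BackendKind`, `defaultBackendKind`, and `requireHostedTarget` are hypothetical names. The second function shows why the untaken branch costs nothing: it contains a `@compileError` that never fires on hosted targets because that branch is never analyzed.

```zig
const std = @import("std");
const builtin = @import("builtin");

pub const BackendKind = enum { metal, vulkan, cpu };

// Picks a backend kind from the compilation target, not the host at runtime.
pub fn defaultBackendKind() BackendKind {
    if (comptime builtin.os.tag == .macos) {
        return .metal;
    } else if (comptime builtin.os.tag == .linux) {
        return .vulkan;
    } else {
        return .cpu;
    }
}

// The untaken side of a comptime-known `if` is not analyzed, so the
// @compileError below only triggers when building for a freestanding target.
pub fn requireHostedTarget() void {
    if (comptime builtin.os.tag == .freestanding) {
        @compileError("this sketch assumes a hosted target");
    }
}
```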

CPU Feature Detection

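const std = @import("std");
const builtin = @import("builtin");

// Check the architecture first: CPU feature sets are defined per architecture
const has_avx2 = builtin.cpu.arch == .x86_64 and
    std.Target.x86.featureSetHas(builtin.cpu.features, .avx2);

pub fn gemv(...) void {
    if (comptime has_avx2) {
        gemvAVX2(...);  // 256-bit SIMD
    } else {
        gemvSSE2(...);  // 128-bit SIMD fallback
    }
}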

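Benefit: No runtime CPU detection overhead. The compiler knows at build time which CPU features are available, based on the target triple and the selected CPU model (-mcpu for direct compiler invocations, -Dcpu with zig build's standard target options). The trade-off is that a binary built for a baseline CPU will not use AVX2 even on hardware that supports it.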
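The sketch below applies the same check to pick a vector width at compile time. It is illustrative rather than Agave's kernel: `lanes` and `dot` are hypothetical names, and the choice of 8 or 4 lanes simply mirrors the 256-bit and 128-bit comments above.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Guard on the architecture first: asking a non-x86 feature set about an
// x86 feature would not be meaningful.
const has_avx2 = builtin.cpu.arch == .x86_64 and
    std.Target.x86.featureSetHas(builtin.cpu.features, .avx2);

// Lane count fixed at compile time (8 x f32 = 256 bits, 4 x f32 = 128 bits).
pub const lanes = if (has_avx2) 8 else 4;

pub fn dot(a: []const f32, b: []const f32) f32 {
    std.debug.assert(a.len == b.len);
    const Vec = @Vector(lanes, f32);
    var acc: Vec = @splat(0.0);
    var i: usize = 0;
    while (i + lanes <= a.len) : (i += lanes) {
        // A slice with comptime-known length becomes an array, which
        // coerces to a vector of the same length.
        const va: Vec = a[i..][0..lanes].*;
        const vb: Vec = b[i..][0..lanes].*;
        acc += va * vb;
    }
    var sum: f32 = @reduce(.Add, acc);
    while (i < a.len) : (i += 1) sum += a[i] * b[i]; // scalar tail
    return sum;
}
```

Selecting a constant rather than a separate function per instruction set keeps a single code path and lets the backend pick the instructions for the chosen width.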

Build Options

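// build.zig
const backend_options = b.addOptions();
backend_options.addOption(bool, "enable_metal", true);
backend_options.addOption(bool, "enable_cuda", false);
// Expose the options to the program as the "build_options" module.
// exe stands for the executable's compile step (name assumed here);
// older Zig versions use exe.addOptions(...) instead.
exe.root_module.addOptions("build_options", backend_options);

// backend.zig
const build_options = @import("build_options");

pub const MetalBackend = if (build_options.enable_metal)
    @import("metal.zig").MetalBackend
else
    NullBackend;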

Effect: If enable_metal=false, the Metal backend is not compiled at all: @import("metal.zig") is never evaluated, which reduces binary size and compile time.

@embedFile for Kernel Source

Shader source code can be embedded directly into the binary at compile time.

flowchart LR
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    MSL1["common.metal\n(MSL source)"]:::setup
    MSL2["elementwise.metal\n(MSL source)"]:::setup
    MSL3["gemv.metal\n(MSL source)"]:::setup
    MSLN["... (5 more .metal files)"]:::setup
    SPV["gemv.spv\n(SPIR-V binary)"]:::setup
    EF["@embedFile\n(compile step)"]:::sync
    EF2["@embedFile\n(compile step)"]:::sync
    Concat["++ concatenation\n(zero-cost, compile time)"]:::sync
    ROData[".rodata section\nin binary\n([]const u8 pointer)"]:::migration
    ROData2[".rodata section\nin binary\n([]const u8 pointer)"]:::migration
    Init["MetalBackend.init()\nnewLibraryWithSource(src)\n(driver compiles to GPU bytecode)"]:::success
    Init2["VulkanBackend.init()\ncreateShaderModule(code)\n(SPIR-V loaded directly)"]:::success

    MSL1 --> EF
    MSL2 --> EF
    MSL3 --> EF
    MSLN --> EF
    SPV  --> EF2

    EF  --> Concat
    Concat --> ROData
    EF2 --> ROData2

    ROData  --> Init
    ROData2 --> Init2

    subgraph SourceFiles["Source Files (on disk, compile time only)"]
        MSL1
        MSL2
        MSL3
        MSLN
        SPV
    end

    subgraph CompileStep["Zig Compiler"]
        EF
        EF2
        Concat
    end

    subgraph Binary["Agave Binary (.rodata — no external files needed)"]
        ROData
        ROData2
    end

    subgraph Runtime["Runtime (zero file I/O)"]
        Init
        Init2
    end

Metal Shader Embedding

// Concatenate all MSL files at compile time
const msl_source = @embedFile("kernels/metal/common.metal") ++
    @embedFile("kernels/metal/elementwise.metal") ++
    @embedFile("kernels/metal/norm.metal") ++
    @embedFile("kernels/metal/rope.metal") ++
    @embedFile("kernels/metal/gemv.metal") ++
    @embedFile("kernels/metal/gemm.metal") ++
    @embedFile("kernels/metal/sdpa.metal") ++
    @embedFile("kernels/metal/deltanet.metal");

pub fn init(allocator: Allocator) !MetalBackend {
    // Compile MSL source at runtime (driver compiles to GPU bytecode)
    const library = device.newLibraryWithSource(msl_source, null, &err);
    // ...
}

Benefits:

  1. Single binary: No need to ship separate .metal files
  2. No file I/O: No std.fs.cwd().openFile() at runtime
  3. Compile-time concatenation: Multiple files merged into one string at zero cost

Alternative (runtime file loading):

// BAD: Runtime file I/O
const file = try std.fs.cwd().openFile("shaders/gemv.metal", .{});
defer file.close();
const source = try file.readToEndAlloc(allocator, 1024 * 1024);
defer allocator.free(source);

Problems:

  • Requires shipping shader files alongside binary
  • File path resolution (where is the binary run from?)
  • Runtime allocation + I/O
  • Error handling (file not found, permission denied)

@embedFile eliminates all of these.

SPIR-V Binary Embedding

Vulkan uses pre-compiled SPIR-V bytecode:

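// Dereference the embedded bytes into a 4-byte-aligned constant:
// SPIR-V is consumed as u32 words, while @embedFile data is only byte-aligned.
const gemv_spirv align(@alignOf(u32)) = @embedFile("kernels/vulkan/gemv.spv").*;

pub fn init() !VulkanBackend {
    const shader_module = vk.createShaderModule(device, .{
        .code_size = gemv_spirv.len,
        .code = @ptrCast(&gemv_spirv),
    });
    // ...
}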

SPIR-V is binary data; @embedFile works with any file type, not just text.

Type-Specialized Functions

Generate different code for each type at compile time.

Generic Dequantization

pub fn dequantize(comptime T: type, quant: []const u8, output: []f32) void {
    switch (T) {
        Q4_0 => dequantizeQ4_0(quant, output),
        Q8_0 => dequantizeQ8_0(quant, output),
        BF16 => dequantizeBF16(quant, output),
        else => @compileError("Unsupported quantization type"),
    }
}

// Usage:
dequantize(Q4_0, quant_data, f32_output);  // Compiles to direct call to dequantizeQ4_0

No runtime dispatch — the switch is resolved at compile time, and only the relevant function is called.
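A runnable reduction of this pattern is sketched below. It switches on plain integer element types instead of Agave's `Q4_0`/`Q8_0`/`BF16` types, and the zero point of 128 for `u8` is an assumption made up for the example.

```zig
const std = @import("std");

// Hypothetical element formats for illustration only.
pub fn dequantize(comptime T: type, quant: []const T, scale: f32, output: []f32) void {
    std.debug.assert(quant.len == output.len);
    for (quant, output) |q, *out| {
        // T is comptime-known, so only the matching prong is analyzed and
        // emitted; no type check survives into the loop body.
        out.* = switch (T) {
            // Signed 8-bit: already centered on zero.
            i8 => @as(f32, @floatFromInt(q)) * scale,
            // Unsigned 8-bit: subtract the assumed zero point of 128.
            u8 => (@as(f32, @floatFromInt(q)) - 128.0) * scale,
            else => @compileError("Unsupported quantization type: " ++ @typeName(T)),
        };
    }
}
```

Calling `dequantize(u16, ...)` would stop the build with the message above, naming the offending type.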

flowchart TD
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    Generic["dequantize(comptime T: type, ...)\ngeneric call site"]:::setup
    Q4["T == Q4_0\n→ dequantizeQ4_0()\nmonomorphized copy"]:::sync
    Q8["T == Q8_0\n→ dequantizeQ8_0()\nmonomorphized copy"]:::sync
    BF["T == BF16\n→ dequantizeBF16()\nmonomorphized copy"]:::sync
    ERR["T == other\n→ @compileError()\nhalts compilation"]:::danger
    BQ4["dequantizeQ4_0\n(direct call, inlined)"]:::success
    BQ8["dequantizeQ8_0\n(direct call, inlined)"]:::success
    BBF["dequantizeBF16\n(direct call, inlined)"]:::success

    subgraph CompileTime["Compiler — resolved at compile time (T is known)"]
        direction LR
        SW{"switch T"}
        Q4
        Q8
        BF
        ERR
        SW --> Q4 & Q8 & BF & ERR
    end

    subgraph Binary["Binary — only called variant present"]
        BQ4
        BQ8
        BBF
    end

    Generic --> SW
    Q4 --> BQ4
    Q8 --> BQ8
    BF --> BBF

Tagged Union Dispatch (inline else)

pub const Backend = union(enum) {
    cpu: *CpuBackend,
    metal: *MetalBackend,
    // ...

    pub fn gemv(self: Backend, ...) void {
        switch (self) {
            inline else => |be| be.gemv(...),  // Expands to separate case per variant
        }
    }
};

What inline else does:

// Expands to:
switch (self) {
    .cpu => |be| be.gemv(...),
    .metal => |be| be.gemv(...),
    .vulkan => |be| be.gemv(...),
    .cuda => |be| be.gemv(...),
    .rocm => |be| be.gemv(...),
    .webgpu => |be| be.gemv(...),
}

Benefit: The compiler sees every concrete call target and can inline each one. There is no function-pointer indirection.
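The following sketch shows the dispatch end to end with two stand-in backends. The `scale` method and the `calls` counters are hypothetical and exist only so the example has observable behavior.

```zig
const std = @import("std");

// Stand-in backends; real ones would launch kernels.
const CpuBackend = struct {
    calls: usize = 0,
    pub fn scale(self: *CpuBackend, x: f32) f32 {
        self.calls += 1;
        return x * 2.0;
    }
};

const MetalBackend = struct {
    calls: usize = 0,
    pub fn scale(self: *MetalBackend, x: f32) f32 {
        self.calls += 1;
        return x * 3.0;
    }
};

pub const Backend = union(enum) {
    cpu: *CpuBackend,
    metal: *MetalBackend,

    pub fn scale(self: Backend, x: f32) f32 {
        return switch (self) {
            // One prong is generated per variant; inside each, `be` has
            // that variant's concrete pointer type, so the call is direct.
            inline else => |be| be.scale(x),
        };
    }
};
```

Adding a variant to the union requires no change to `scale`, as long as the new backend type provides a method with the same name and a compatible signature; otherwise compilation fails at the generated prong.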

flowchart TD
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    Call["backend.gemv(args)\n(call site in model code)"]:::setup
    IE_Tag["read union tag\n(cheap branch)"]:::migration
    IE_CPU["tag == .cpu\nCpuBackend.gemv(args)\n(inlined by compiler)"]:::sync
    IE_Metal["tag == .metal\nMetalBackend.gemv(args)\n(inlined by compiler)"]:::sync
    IE_Vulkan["tag == .vulkan\nVulkanBackend.gemv(args)\n(inlined by compiler)"]:::sync
    VT_Ptr["load vtable pointer\nfrom object header"]:::danger
    VT_Offset["add method offset\n(e.g. +8 bytes for gemv)"]:::danger
    VT_Load["load function pointer\nfrom vtable memory"]:::danger
    VT_Call["indirect call\nvia register\n(branch predictor miss risk)"]:::danger
    Res1["direct kernel code\n(zero indirection)"]:::success
    Res2["kernel code\n(1 indirect branch)"]:::migration

    subgraph InlineElse["inline else dispatch (Zig)"]
        direction TB
        IE_Tag
        IE_CPU
        IE_Metal
        IE_Vulkan
        IE_Tag --> IE_CPU & IE_Metal & IE_Vulkan
    end

    subgraph VTable["vtable dispatch (C++ / runtime)"]
        direction TB
        VT_Ptr
        VT_Offset
        VT_Load
        VT_Call
        VT_Ptr --> VT_Offset --> VT_Load --> VT_Call
    end

    Call --> IE_Tag
    Call --> VT_Ptr

    IE_CPU --> Res1
    IE_Metal --> Res1
    IE_Vulkan --> Res1
    VT_Call --> Res2

Format String Validation

Compile-time format string checking prevents runtime errors.

// GOOD: Format string validated at compile time
std.log.info("Temperature: {d}, Tokens: {d}", .{temp, n_tokens});

// BAD: Too few arguments, rejected at compile time
std.log.info("Temperature: {d}, Tokens: {d}", .{temp});
// compile error from std.fmt: too few arguments

// BAD: Wrong type specifier, rejected at compile time
std.log.info("Temperature: {d}", .{"0.5"});
// compile error from std.fmt: 'd' is not a valid specifier for a string
// (the exact wording of both messages varies by Zig version)

C comparison:

printf("Temperature: %d, Tokens: %d\n", temp);  // Runtime crash or garbage

Zig catches this at compile time.
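The check works because the format string is a `comptime` parameter of the formatting functions. A small sketch using `std.fmt.bufPrint`, which goes through the same validation as `std.log`; the `describe` helper is a hypothetical name.

```zig
const std = @import("std");

pub fn describe(buf: []u8, layer: u32, n_tokens: u32) ![]u8 {
    // The placeholder count and each specifier are checked against the
    // argument tuple while this call is being compiled.
    return std.fmt.bufPrint(buf, "Layer {d}: {d} tokens", .{ layer, n_tokens });
}
```

Only the buffer size remains a runtime concern here: `bufPrint` returns an error if the formatted text does not fit.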

Comptime Assertions

Validate assumptions at compile time.

flowchart TD
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    CA_Eval["evaluate condition\nat compile time"]:::sync
    CA_Silent["(nothing emitted)\nbinary produced normally"]:::success
    CA_Fail["compile error\n'assertion failed'\nbuild stops immediately\nno binary produced"]:::danger
    RA_Eval["evaluate condition\nat runtime"]:::sync
    RA_Silent["execution continues"]:::success
    RA_Fail["@panic / illegal instruction\nprocess crashes\n(only in Debug/ReleaseSafe)"]:::danger
    note1["user never sees bad binary"]:::success
    note2["may ship silently in ReleaseFast"]:::optional

    subgraph ComptimeAssert["comptime { std.debug.assert(cond) }"]
        direction TB
        CA_Eval
        CA_Pass{"condition\ntrue?"}
        CA_Silent
        CA_Fail
        CA_Eval --> CA_Pass
        CA_Pass -- yes --> CA_Silent
        CA_Pass -- no --> CA_Fail
    end

    subgraph RuntimeAssert["std.debug.assert(cond) at runtime"]
        direction TB
        RA_Eval
        RA_Pass{"condition\ntrue?"}
        RA_Silent
        RA_Fail
        RA_Eval --> RA_Pass
        RA_Pass -- yes --> RA_Silent
        RA_Pass -- no --> RA_Fail
    end

    CA_Fail -. "catches bug before\nshipping any binary" .-> note1
    RA_Fail -. "caught only if\ntest covers that path" .-> note2

Array Size Validation

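const quant_block_elems = 32;
const quant_block_bytes = 16;  // Two 4-bit values per byte
const Q4_0_Block = extern struct {
    scale: f16,
    quants: [quant_block_bytes]u8,  // 16 bytes = 32 nibbles
};

comptime {
    std.debug.assert(@sizeOf(Q4_0_Block) == 18);  // 2 + 16 = 18 bytes
    std.debug.assert(quant_block_bytes * 2 == quant_block_elems);  // 16 bytes × 2 nibbles/byte
}

Effect: If you change quant_block_bytes to 15, compilation fails on the second assertion. The @sizeOf check alone would not catch this, because the struct is padded to the 2-byte alignment of f16 and still occupies 18 bytes. This is why the second assertion refers to the same constant that sizes the array.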

Alignment Validation

comptime {
    std.debug.assert(@alignOf(KVCache) == 64);  // Must be cache-line aligned
}

Type Size Checks

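comptime {
    std.debug.assert(@sizeOf(f32) == 4);
    std.debug.assert(@sizeOf(bf16) == 2);  // bf16 is a project-defined type; Zig has no built-in bf16
    std.debug.assert(@sizeOf(V8) == 32);  // 8 × f32
}

Why? Zig guarantees that f32 is 4 bytes, so the first check mainly documents intent. The checks on project-defined types such as bf16 and V8 are the ones that can actually fail, for example if a definition changes or a target lays the type out differently. They then fail at compile time instead of producing silent data corruption at runtime.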
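A failed `std.debug.assert` inside a `comptime` block is reported as a generic unreachable-code error. When a more descriptive message helps, `@compileError` with `std.fmt.comptimePrint` is one alternative. The `expectSize` helper below is a hypothetical sketch, not part of Agave.

```zig
const std = @import("std");

const Q4_0_Block = extern struct {
    scale: f16,
    quants: [16]u8,
};

// Like a comptime assertion, but the failure message names the type
// and reports both the expected and the actual size.
fn expectSize(comptime T: type, comptime expected: usize) void {
    if (@sizeOf(T) != expected) {
        @compileError(std.fmt.comptimePrint(
            "{s}: expected {d} bytes, found {d}",
            .{ @typeName(T), expected, @sizeOf(T) },
        ));
    }
}

comptime {
    expectSize(Q4_0_Block, 18);
}
```

Because both parameters are `comptime`, the `if` is resolved during compilation and the error branch is only reached when the sizes really differ.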

Practical Examples

MXFP4 Lookup Table

// MXFP4 uses E2M1 format (2-bit exponent, 1-bit mantissa)
// 4-bit nibble → 16 possible values stored as a literal constant table
pub fn mxfp4Lookup(nibble: u8) f32 {
    const table: [16]f32 = .{
        0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
        0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0,
    };
    return table[nibble & 0xF];
}

// For the scaled variant (nibble value × block scale), see nvfp4Dequant.
// The mantissa term for E2M1 is 0.5 * mant (not 1.0 * mant):
//   mant=0 → 0.0 addend, mant=1 → 0.5 addend, giving 1.0 and 1.5 for normal values.

Single-level lookup: nibble → base value via literal table (no module-level symbol). For NVFP4 scaled dequantization, nvfp4Dequant combines mxfp4Lookup with a block scale.
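The literal table can be cross-checked against the E2M1 bit layout at compile time. This is a sketch under the assumption that the nibble is laid out as sign (bit 3), exponent (bits 2..1, bias 1), mantissa (bit 0); `e2m1Decode` is a hypothetical helper and the table is repeated here so the block is self-contained.

```zig
const std = @import("std");

const mxfp4_table: [16]f32 = .{
    0.0, 0.5,  1.0,  1.5,  2.0,  3.0,  4.0,  6.0,
    0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0,
};

// Decode one E2M1 nibble from its bit fields.
fn e2m1Decode(nibble: u4) f32 {
    const sign: f32 = if ((nibble & 0x8) != 0) -1.0 else 1.0;
    const exp: u2 = @truncate(nibble >> 1);
    const mant: f32 = @floatFromInt(nibble & 1);
    if (exp == 0) return sign * 0.5 * mant; // subnormal: 0 or 0.5
    const scale: f32 = @floatFromInt(@as(u8, 1) << (exp - 1)); // 2^(exp - 1)
    return sign * (1.0 + 0.5 * mant) * scale;
}

// Compilation fails if any literal disagrees with the decoded value.
// Nibble 8 decodes to -0.0, which compares equal to the table's 0.0.
comptime {
    for (mxfp4_table, 0..) |expected, i| {
        std.debug.assert(e2m1Decode(@intCast(i)) == expected);
    }
}
```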

Quantization Block Sizes

Block byte sizes are defined as named module-level constants in backend.zig:

pub const q4_0_block_bytes: usize = 18;   // 2-byte scale + 16 bytes of nibbles
pub const q8_0_block_bytes: usize = 34;   // 2-byte scale + 32 bytes of i8 values
pub const q4_k_block_bytes: usize = 144;
pub const q6_k_block_bytes: usize = 210;
// ...

Usage: reference the constant directly by name:

const bytes_per_block = backend.q4_0_block_bytes;  // 18

const num_blocks = (total_bytes + backend.q4_0_block_bytes - 1) / backend.q4_0_block_bytes;

Benefit: Named constants are self-documenting, always available at comptime, and require no function call overhead.
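Since the constants are comptime-known, any function built from them can also be evaluated by the compiler. A short sketch follows; `q4_0_block_elems`, `q4_0RowBytes`, and `Row4096` are hypothetical names, and the 32-element block size is taken from the Q4_0 block definition earlier in this appendix.

```zig
const std = @import("std");

pub const q4_0_block_bytes: usize = 18;
pub const q4_0_block_elems: usize = 32; // hypothetical name for elements per block

// Bytes needed to store `n_elems` Q4_0 weights (whole blocks only).
pub fn q4_0RowBytes(n_elems: usize) usize {
    std.debug.assert(n_elems % q4_0_block_elems == 0);
    return (n_elems / q4_0_block_elems) * q4_0_block_bytes;
}

// Usable where a comptime-known size is required, such as an array type.
pub const Row4096 = [q4_0RowBytes(4096)]u8;
```

For 4096 elements this is 128 blocks of 18 bytes, or 2304 bytes.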

Performance Impact

FP8 dequantization (measured on Apple M4):

| Method | Cycles/call | Speedup |
|---|---|---|
| Runtime computation | ~30 cycles | 1× (baseline) |
| Comptime LUT | ~1 cycle | 30× |

Binary size impact:

| Feature | Binary size increase |
|---|---|
| FP8 E4M3 LUT (256 × 4 bytes) | +1 KB |
| MXFP4 LUT (16 × 4 bytes) | +64 bytes |
| IQ4_NL LUT (16 × 1 byte) | +16 bytes |
| Embedded Metal shaders (~50 KB source) | +50 KB |

Trade-off: Small binary size increase for significant runtime speedup.

Common Patterns

Conditional Compilation

// Container-level constants are already comptime-known; no comptime keyword is needed here
const use_simd = builtin.cpu.arch == .x86_64 or builtin.cpu.arch == .aarch64;

pub fn dotProduct(a: []const f32, b: []const f32) f32 {
    if (comptime use_simd) {
        return dotProductSIMD(a, b);
    } else {
        return dotProductScalar(a, b);
    }
}

Type-Generic Containers

pub fn RingBuffer(comptime T: type, comptime size: usize) type {
    return struct {
        data: [size]T,
        head: usize = 0,

        // Contents start undefined; push() fills the slots in order
        pub fn init() @This() {
            return .{ .data = undefined };
        }

        pub fn push(self: *@This(), item: T) void {
            self.data[self.head] = item;
            self.head = (self.head + 1) % size;
        }
    };
}

// Usage:
var conv_state = RingBuffer(f32, 4).init();  // 4-element f32 ring buffer

Each instantiation (RingBuffer(f32, 4), RingBuffer(u32, 8)) generates separate specialized code.

Compile-Time String Manipulation

const kernel_name = "gemv_" ++ dtype_name;  // Comptime string concat (dtype_name must be comptime-known)

pub fn loadKernel(comptime dtype: DType) !Pipeline {
    const name = comptime kernelName(dtype);  // e.g., "gemv_q4_0"
    return library.newFunctionWithName(name);
}

fn kernelName(comptime dtype: DType) []const u8 {
    return "gemv_" ++ @tagName(dtype);  // "gemv_" + "q4_0" → "gemv_q4_0"
}
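A self-contained version of the naming helper is sketched below, together with one way to reach it from a runtime value. The `DType` variants and `kernelNameRuntime` are assumptions for the example; Agave's actual enum is not shown in this appendix.

```zig
const std = @import("std");

// Hypothetical subset of data types.
pub const DType = enum { bf16, q4_0, q8_0 };

// Both operands of ++ are comptime-known, so the joined string is a
// constant in the binary.
pub fn kernelName(comptime dtype: DType) []const u8 {
    return "gemv_" ++ @tagName(dtype);
}

// Bridges a runtime value to the comptime helper: `inline else` generates
// one prong per variant, and `tag` is comptime-known inside each prong.
pub fn kernelNameRuntime(dtype: DType) []const u8 {
    return switch (dtype) {
        inline else => |tag| kernelName(tag),
    };
}
```

The runtime variant costs one switch on the enum tag; no string is built or allocated while the program runs.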

Anti-Patterns

Don't Overuse comptime

BAD: Using comptime for simple runtime values

const temperature = comptime 0.7;  // Pointless — it's already a constant

GOOD: Just use const

const temperature: f32 = 0.7;

Don't Compute Heavy Things at Comptime

BAD: Large nested loops at comptime slow down compilation

const huge_table = comptime blk: {
    var table: [1000000]f32 = undefined;
    for (0..1000000) |i| {
        table[i] = expensiveComputation(i);  // Runs at compile time!
    }
    break :blk table;
};

Effect: Compilation takes minutes instead of seconds.

Better: Use codegen (separate script generates the table, output checked into repo) or load from file at runtime.

Don't Use comptime for Mutable State

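WRONG: This doesn't work

var comptime_counter: usize = 0;  // A container-level var is runtime state

pub fn getNextId() usize {
    comptime {
        comptime_counter += 1;  // Error: runtime state cannot be modified at compile time
        return comptime_counter;
    }
}

comptime is for values known at build time, not for state that changes while the program runs. Mutable variables can be used during compile-time evaluation (the table-building blocks use var table), but that state is local to the evaluation and cannot carry a counter across calls. For a counter, use an ordinary var and update it at runtime.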
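The distinction can be sketched as follows. `Sample`, `countF32Fields`, and `next_id` are hypothetical names; the first function uses a `comptime var` that lives only during compilation, and the second uses ordinary runtime state for the counter.

```zig
const std = @import("std");

const Sample = struct { scale: f32, bias: f32, count: u32 };

pub fn countF32Fields(comptime T: type) usize {
    // `comptime var` is mutable, but only while the compiler evaluates
    // this function; nothing persists between calls or into runtime.
    comptime var n: usize = 0;
    inline for (std.meta.fields(T)) |field| {
        if (field.type == f32) n += 1;
    }
    return n;
}

// Runtime mutable state uses an ordinary container-level `var`.
var next_id: usize = 0;

pub fn getNextId() usize {
    next_id += 1;
    return next_id;
}
```

This sketch is not thread-safe; a shared counter would need an atomic or a lock.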

Best Practices

  1. Use comptime for lookup tables when the table is small (<10 KB) and frequently accessed
  2. Use comptime for feature detection to eliminate dead code
  3. Use @embedFile for resources that ship with the binary
  4. Use comptime assertions to validate invariants
  5. Don't use comptime for runtime configuration — use const or runtime parameters instead

In the code: src/ops/quant.zig (fp8e4m3_lut, iq4nl_table), src/backend/metal.zig (@embedFile for MSL shaders), src/backend/backend.zig (inline else dispatch), build.zig (build_options)

Related: Zig Language Reference — comptime, Chapter 9: CPU SIMD Optimization (uses comptime LUTs)

Next: Appendix: Profiling and Debugging → | Back: Appendix: Mathematical Operations ←