Appendix: Compile-Time Optimization

Zig's comptime feature executes code during compilation, so results reach the binary as constants and the work costs nothing at runtime. Agave uses this extensively for lookup tables, feature detection, and type-specialized dispatch.

comptime Basics

comptime means "computed at compile time". The compiler evaluates the expression during compilation, and the result is baked into the binary.

flowchart LR
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    Source["Source Code\n(comptime expression)"]:::setup
    Compiler["Zig Compiler\n(compile time)"]:::sync
    Value["Constant Value\nbaked into binary"]:::migration
    Binary["Executable Binary\n(.rodata section)"]:::success
    Runtime["Runtime\n(user runs program)"]:::setup
    Result["Instant result\n(no computation)"]:::success

    Source --> Compiler
    Compiler --> Value
    Value --> Binary
    Runtime --> Binary
    Binary --> Result

    subgraph CompilePhase["Compile Phase (your machine, once)"]
        Source
        Compiler
        Value
    end

    subgraph RunPhase["Run Phase (user's machine, many times)"]
        Runtime
        Binary
        Result
    end

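const table_size = 256;  // Regular constant
const doubled = table_size * 2;  // Container-level initializer: already computed at compile time (512)

fn scaled() usize {
    return comptime doubled * 4;  // Inside a function, `comptime` forces compile-time evaluation (2048)
}

// The binary contains the values 512 and 2048, not the multiplications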

When to use comptime:

Building lookup tables
Feature detection based on target platform
Type-level computations
Format string validation
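The basics above can be tried in a few lines. This is a minimal sketch rather than Agave code: sumTo and Scratch are hypothetical names chosen for illustration. It shows a container-level constant computed by an ordinary function, that constant used as an array length, and the comptime keyword forcing a call inside a function body.

```zig
const std = @import("std");

// Hypothetical helper for illustration; an ordinary function with no comptime annotations.
fn sumTo(n: u32) u32 {
    var total: u32 = 0;
    var i: u32 = 1;
    while (i <= n) : (i += 1) total += i;
    return total;
}

// Container-level initializers are evaluated at compile time.
const sum_100 = sumTo(100);

// Only comptime-known values may be used as array lengths.
const Scratch = [sum_100]u8;

pub fn main() void {
    // Inside a function body, `comptime` forces the call to run during compilation.
    const sum_10 = comptime sumTo(10);
    std.debug.print("{d} {d} {d}\n", .{ sum_100, sum_10, @sizeOf(Scratch) });
}
```

The same sumTo function still works with runtime arguments; nothing about its definition ties it to compile time.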

Lookup Tables

Pre-computing values at compile time eliminates runtime arithmetic.

flowchart TD
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    NaiveInput["8-bit FP8 value\n(e.g. 0xA7)"]:::setup
    NaiveOps["Runtime: extract bits,\nbranch, pow(), multiply\n~30 instructions"]:::danger
    NaiveOut["f32 result"]:::migration

    ComptimeLoop["Compiler: loop 0..256\nfp8e4m3Compute(i)"]:::sync
    LUT["[256]f32 table\nin .rodata\n(1 KB)"]:::migration
    FastInput["8-bit FP8 value\n(e.g. 0xA7)"]:::setup
    LUTLookup["Runtime: array[val]\n1 instruction"]:::sync
    FastOut["f32 result"]:::success

    NaiveInput --> NaiveOps
    NaiveOps --> NaiveOut

    ComptimeLoop --> LUT
    LUT --> LUTLookup
    FastInput --> LUTLookup
    LUTLookup --> FastOut

    subgraph Naive["Naive (runtime per call)"]
        NaiveInput
        NaiveOps
        NaiveOut
    end

    subgraph LUTPath["LUT (comptime table, runtime lookup)"]
        ComptimeLoop
        LUT
        FastInput
        LUTLookup
        FastOut
    end

FP8 E4M3 Dequantization Table

Naive approach (runtime conversion):

const std = @import("std");

pub fn fp8e4m3ToF32(val: u8) f32 {
    // Extract sign, exponent, mantissa from 8-bit value
    const sign = (val >> 7) & 1;
    const exp = (val >> 3) & 0xF;
    const mant = val & 0x7;

    // Compute float value (NaN encodings are not special-cased here)
    const bias = 7;
    // The f32 annotation is required: a runtime branch cannot yield a comptime_float
    const sign_mult: f32 = if (sign == 1) -1.0 else 1.0;

    if (exp == 0) {
        // Subnormal: no implicit leading 1, fixed exponent of 1 - bias
        return sign_mult * (@as(f32, @floatFromInt(mant)) / 8.0) * std.math.pow(f32, 2.0, 1 - bias);
    } else {
        // Normal: implicit leading 1, exponent offset by the bias
        const frac = 1.0 + (@as(f32, @floatFromInt(mant)) / 8.0);
        return sign_mult * frac * std.math.pow(f32, 2.0, @as(f32, @floatFromInt(exp)) - bias);
    }
}

Cost per call: ~20-30 instructions (bit shifts, branches, floating-point arithmetic, pow() call).

Optimized approach (comptime lookup table):

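// Build 256-entry lookup table at compile time.
// fp8e4m3Compute is assumed to be the per-value conversion shown above under a different name.
const fp8e4m3_lut: [256]f32 = blk: {
    // 256 iterations of non-trivial work may exceed the default comptime branch quota, so raise it
    @setEvalBranchQuota(100_000);
    var table: [256]f32 = undefined;
    for (0..256) |i| {
        table[i] = fp8e4m3Compute(@intCast(i));  // Computed once at compile time
    }
    break :blk table;
};

// Runtime dequantization is a single array lookup
pub inline fn fp8e4m3ToF32(val: u8) f32 {
    return fp8e4m3_lut[val];
}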

Cost per call: 1 instruction (load from .rodata section).

Speedup: 20-30× faster for the dequantization itself. In a full GEMV, this saves ~5-10% total time.
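To see the whole pattern in one file, here is a self-contained sketch. The body of fp8e4m3Compute below is an assumption made for illustration: it follows the sign / 4-bit exponent (bias 7) / 3-bit mantissa layout described above, builds the power of two by repeated doubling or halving instead of calling pow, and does not special-case NaN encodings. The real helper in Agave may differ.

```zig
const std = @import("std");

// Hypothetical reference decoder; NaN encodings are not special-cased in this sketch.
fn fp8e4m3Compute(val: u8) f32 {
    const sign: f32 = if ((val >> 7) == 1) -1.0 else 1.0;
    const exp: i32 = (val >> 3) & 0xF;
    const mant: f32 = @floatFromInt(val & 0x7);

    // Subnormals have no implicit leading 1 and use exponent 1 - bias.
    const frac: f32 = if (exp == 0) mant / 8.0 else 1.0 + mant / 8.0;
    const e: i32 = (if (exp == 0) 1 else exp) - 7;

    // Exact power of two by repeated doubling or halving.
    var scale: f32 = 1.0;
    var k: i32 = 0;
    while (k < e) : (k += 1) scale *= 2.0;
    while (k > e) : (k -= 1) scale *= 0.5;
    return sign * frac * scale;
}

// The initializer of a container-level const runs at compile time.
const fp8e4m3_lut: [256]f32 = blk: {
    // The nested loops exceed the default limit of 1000 backward branches.
    @setEvalBranchQuota(10_000);
    var table: [256]f32 = undefined;
    for (0..256) |i| {
        table[i] = fp8e4m3Compute(@intCast(i));
    }
    break :blk table;
};

pub fn fp8e4m3ToF32(val: u8) f32 {
    return fp8e4m3_lut[val];
}

pub fn main() void {
    // 0x38: sign 0, exponent 7, mantissa 0 (decodes to 1.0)
    // 0xC0: sign 1, exponent 8, mantissa 0 (decodes to -2.0)
    std.debug.print("{d} {d}\n", .{ fp8e4m3ToF32(0x38), fp8e4m3ToF32(0xC0) });
}
```

Because the decoder is an ordinary function, the same code can be called at runtime in a test to confirm that every table entry matches it.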

comptime Block Syntax

const table = blk: {
    var result: [N]T = undefined;
    // ... compute result ...
    break :blk result;  // Return from comptime block
};

Key points:

blk: is a labeled block
break :blk value returns from the block
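The entire block runs at compile time when it initializes a container-level const (inside a function body, prefix the block with comptime)
table becomes a compile-time constant; result exists only during compilation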

IQ4_NL Dequantization Table

IQ4_NL uses a fixed dequantization table (not computed, but verified at comptime):

pub const iq4nl_table: [16]i8 = .{
    -127, -104, -83, -65, -49, -35, -22, -10,
    1, 13, 25, 38, 53, 69, 89, 113,
};

// Illustrative usage (not a real API function — callers use iq4nl_table directly):
// const val = @as(f32, @floatFromInt(iq4nl_table[nibble])) * scale;

Why a table? IQ4_NL uses non-linear quantization: the step sizes aren't uniform. Values near zero have fine steps (10 to 13 apart), while values of large magnitude have coarse steps (up to 24 apart). This gives better accuracy than linear Q4.

comptime verification:

comptime {
    std.debug.assert(iq4nl_table.len == 16);  // 4-bit = 16 values
    for (iq4nl_table, 0..) |v, i| {
        if (i > 0) {
            std.debug.assert(v > iq4nl_table[i - 1]);  // Strictly increasing
        }
    }
}

This runs at compile time. If the table is malformed, compilation fails.
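A failed std.debug.assert at compile time reports little beyond the failing line. As a sketch of an alternative, the check below uses @compileError with std.fmt.comptimePrint so the build error names the offending index. dequantPair is a hypothetical helper, and its nibble order (low nibble first) is an assumption made for illustration, not Agave's documented layout.

```zig
const std = @import("std");

pub const iq4nl_table: [16]i8 = .{
    -127, -104, -83, -65, -49, -35, -22, -10,
    1,    13,   25,  38,  53,  69,  89,  113,
};

comptime {
    for (1..iq4nl_table.len) |i| {
        if (iq4nl_table[i] <= iq4nl_table[i - 1]) {
            @compileError(std.fmt.comptimePrint(
                "iq4nl_table is not strictly increasing at index {d}",
                .{i},
            ));
        }
    }
}

// Hypothetical helper: decode one packed byte (two 4-bit indices) with a block scale.
// Assumes the low nibble holds the first element.
fn dequantPair(byte: u8, scale: f32) [2]f32 {
    const lo: f32 = @floatFromInt(iq4nl_table[byte & 0xF]);
    const hi: f32 = @floatFromInt(iq4nl_table[byte >> 4]);
    return .{ lo * scale, hi * scale };
}
```

Swapping any two table entries makes the build stop with the index of the first out-of-order value.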

Feature Detection

Zig's builtin module provides platform information at comptime.

flowchart LR
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    BuildCmd["zig build\n-Dtarget=aarch64-macos"]:::setup
    Builtin["builtin.os.tag\nbuiltin.cpu.arch\nbuild_options.*"]:::migration
    MetalBranch["MetalBackend\ncompiled in"]:::sync
    VulkanBranch["VulkanBackend\ncompiled in"]:::sync
    CPUBranch["CpuBackend\ncompiled in"]:::sync
    Binary["macOS Binary\n(Metal only,\nLinux code absent)"]:::success
    LinuxBin["Linux Binary\n(Vulkan only,\nMetal code absent)"]:::success
    CPUBin["Other Binary\n(CPU fallback)"]:::success

    BuildCmd --> Builtin
    Builtin --> MacOS{{"os == .macos?"}}
    MacOS -- yes --> MetalBranch
    MacOS -- no --> Linux{{"os == .linux?"}}
    Linux -- yes --> VulkanBranch
    Linux -- no --> CPUBranch

    MetalBranch --> Binary
    VulkanBranch --> LinuxBin
    CPUBranch --> CPUBin

    subgraph CompileTime["Compile Time: dead code eliminated"]
        MacOS
        Linux
        MetalBranch
        VulkanBranch
        CPUBranch
    end

Target OS Detection

const builtin = @import("builtin");

pub fn initBackend() !Backend {
    if (comptime builtin.os.tag == .macos) {
        return Backend{ .metal = try MetalBackend.init() };
    } else if (comptime builtin.os.tag == .linux) {
        return Backend{ .vulkan = try VulkanBackend.init() };
    } else {
        return Backend{ .cpu = try CpuBackend.init() };
    }
}

Dead code elimination: The compiler generates only the code for the target platform. If compiling for macOS, the Linux and CPU branches are completely removed from the binary.

CPU Feature Detection

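const std = @import("std");
const builtin = @import("builtin");

// featureSetHas accepts the feature as an enum literal; guard on the architecture first,
// because feature indices are only meaningful for the matching CPU family
const has_avx2 = builtin.cpu.arch == .x86_64 and
    std.Target.x86.featureSetHas(builtin.cpu.features, .avx2);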

pub fn gemv(...) void {
    if (comptime has_avx2) {
        gemvAVX2(...);  // 256-bit SIMD
    } else {
        gemvSSE2(...);  // 128-bit SIMD fallback
    }
}

Benefit: No runtime CPU detection overhead. The compiler knows at build time which CPU features are available (based on -mcpu flag or target triple).
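The following sketch shows one way a comptime feature flag can shape generated code without writing two separate kernels. It is an illustration under assumptions, not Agave's gemv: the sum function and the lane counts are hypothetical. std.Target.x86.featureSetHas is the standard-library helper that takes the feature as an enum literal.

```zig
const std = @import("std");
const builtin = @import("builtin");

const has_avx2 = builtin.cpu.arch == .x86_64 and
    std.Target.x86.featureSetHas(builtin.cpu.features, .avx2);

// Pick the vector width once, at compile time.
const lanes = if (has_avx2) 8 else 4;
const Vec = @Vector(lanes, f32);

// Hypothetical kernel: sum a slice using `lanes`-wide vector adds plus a scalar tail.
pub fn sum(values: []const f32) f32 {
    var acc: Vec = @splat(0.0);
    var i: usize = 0;
    while (i + lanes <= values.len) : (i += lanes) {
        const chunk: Vec = values[i..][0..lanes].*;
        acc += chunk;
    }
    var total: f32 = @reduce(.Add, acc);
    while (i < values.len) : (i += 1) total += values[i];
    return total;
}
```

Only one width is ever compiled for a given target, so there is no branch on the feature flag in the emitted loop.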

Build Options

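// build.zig
const backend_options = b.addOptions();
backend_options.addOption(bool, "enable_metal", true);
backend_options.addOption(bool, "enable_cuda", false);
// Expose the options as @import("build_options"); `exe` is assumed to be the executable's compile step
exe.root_module.addOptions("build_options", backend_options);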

// backend.zig
const build_options = @import("build_options");

pub const MetalBackend = if (build_options.enable_metal)
    @import("metal.zig").MetalBackend
else
    NullBackend;

Effect: If enable_metal=false, MetalBackend resolves to NullBackend and nothing in metal.zig is semantically analyzed or emitted, which reduces binary size and compile time.

@embedFile for Kernel Source

Shader source code can be embedded directly into the binary at compile time.

flowchart LR
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    MSL1["common.metal\n(MSL source)"]:::setup
    MSL2["elementwise.metal\n(MSL source)"]:::setup
    MSL3["gemv.metal\n(MSL source)"]:::setup
    MSLN["... (5 more .metal files)"]:::setup
    SPV["gemv.spv\n(SPIR-V binary)"]:::setup
    EF["@embedFile\n(compile step)"]:::sync
    EF2["@embedFile\n(compile step)"]:::sync
    Concat["++ concatenation\n(zero-cost, compile time)"]:::sync
    ROData[".rodata section\nin binary\n([]const u8 pointer)"]:::migration
    ROData2[".rodata section\nin binary\n([]const u8 pointer)"]:::migration
    Init["MetalBackend.init()\nnewLibraryWithSource(src)\n(driver compiles to GPU bytecode)"]:::success
    Init2["VulkanBackend.init()\ncreateShaderModule(code)\n(SPIR-V loaded directly)"]:::success

    MSL1 --> EF
    MSL2 --> EF
    MSL3 --> EF
    MSLN --> EF
    SPV  --> EF2

    EF  --> Concat
    Concat --> ROData
    EF2 --> ROData2

    ROData  --> Init
    ROData2 --> Init2

    subgraph SourceFiles["Source Files (on disk, compile time only)"]
        MSL1
        MSL2
        MSL3
        MSLN
        SPV
    end

    subgraph CompileStep["Zig Compiler"]
        EF
        EF2
        Concat
    end

    subgraph Binary["Agave Binary (.rodata — no external files needed)"]
        ROData
        ROData2
    end

    subgraph Runtime["Runtime (zero file I/O)"]
        Init
        Init2
    end

Metal Shader Embedding

// Concatenate all MSL files at compile time
const msl_source = @embedFile("kernels/metal/common.metal") ++
    @embedFile("kernels/metal/elementwise.metal") ++
    @embedFile("kernels/metal/norm.metal") ++
    @embedFile("kernels/metal/rope.metal") ++
    @embedFile("kernels/metal/gemv.metal") ++
    @embedFile("kernels/metal/gemm.metal") ++
    @embedFile("kernels/metal/sdpa.metal") ++
    @embedFile("kernels/metal/deltanet.metal");

pub fn init(allocator: Allocator) !MetalBackend {
    // Compile MSL source at runtime (driver compiles to GPU bytecode)
    const library = device.newLibraryWithSource(msl_source, null, &err);
    // ...
}

Benefits:

Single binary: No need to ship separate .metal files
No file I/O: No std.fs.cwd().openFile() at runtime
Compile-time concatenation: Multiple files merged into one string at zero cost

Alternative (runtime file loading):

// BAD: Runtime file I/O
const file = try std.fs.cwd().openFile("shaders/gemv.metal", .{});
defer file.close();
const source = try file.readToEndAlloc(allocator, 1024 * 1024);
defer allocator.free(source);

Problems:

Requires shipping shader files alongside binary
File path resolution (where is the binary run from?)
Runtime allocation + I/O
Error handling (file not found, permission denied)

@embedFile eliminates all of these.
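The concatenation step can be sketched without any files on disk. In the example below, two string literals stand in for @embedFile results (both have the type *const [N:0]u8), so the names common_src and gemv_src and their contents are placeholders, not Agave's shader sources.

```zig
const std = @import("std");

// Stand-ins for @embedFile("...") results; a string literal has the same pointer-to-array type.
const common_src = "// common helpers\n";
const gemv_src = "kernel void gemv() {}\n";

// `++` on comptime-known arrays produces one new constant; no runtime copying occurs.
const msl_source = common_src ++ gemv_src;

comptime {
    // The combined length is itself known at compile time.
    std.debug.assert(msl_source.len == common_src.len + gemv_src.len);
}

pub fn main() void {
    std.debug.print("{s}", .{msl_source});
}
```

Swapping the literals for @embedFile calls changes nothing else in the file, which is what makes the pattern convenient for multi-file shader sources.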

SPIR-V Binary Embedding

Vulkan uses pre-compiled SPIR-V bytecode:

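// SPIR-V is a stream of 32-bit words, so copy the embedded bytes into a 4-byte-aligned constant
const gemv_spirv align(@alignOf(u32)) = @embedFile("kernels/vulkan/gemv.spv").*;

pub fn init() !VulkanBackend {
    const shader_module = vk.createShaderModule(device, .{
        .code_size = gemv_spirv.len,
        .code = @ptrCast(&gemv_spirv),  // Safe to view as u32 words thanks to the alignment above
    });
    // ...
}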

SPIR-V is binary data — @embedFile works with any file type, not just text.

Type-Specialized Functions

Generate different code for each type at compile time.

Generic Dequantization

pub fn dequantize(comptime T: type, quant: []const u8, output: []f32) void {
    switch (T) {
        Q4_0 => dequantizeQ4_0(quant, output),
        Q8_0 => dequantizeQ8_0(quant, output),
        BF16 => dequantizeBF16(quant, output),
        else => @compileError("Unsupported quantization type"),
    }
}

// Usage:
dequantize(Q4_0, quant_data, f32_output);  // Compiles to direct call to dequantizeQ4_0

No runtime dispatch — the switch is resolved at compile time, and only the relevant function is called.
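A runnable version of this pattern needs concrete types to switch on. The sketch below uses two hypothetical marker types (Q8Signed and Q8Unsigned) and trivial conversion routines in place of Agave's real quantization formats; only the dispatch mechanism is the point.

```zig
const std = @import("std");

// Hypothetical marker types for illustration; each `struct {}` is a distinct type.
const Q8Signed = struct {};
const Q8Unsigned = struct {};

fn dequantSigned(quant: []const u8, output: []f32) void {
    for (quant, output) |q, *out| {
        const s: i8 = @bitCast(q); // reinterpret the byte as two's complement
        out.* = @floatFromInt(s);
    }
}

fn dequantUnsigned(quant: []const u8, output: []f32) void {
    for (quant, output) |q, *out| out.* = @floatFromInt(q);
}

pub fn dequantize(comptime T: type, quant: []const u8, output: []f32) void {
    // T is comptime-known, so only the matching prong is analyzed and emitted.
    switch (T) {
        Q8Signed => dequantSigned(quant, output),
        Q8Unsigned => dequantUnsigned(quant, output),
        else => @compileError("Unsupported quantization type: " ++ @typeName(T)),
    }
}
```

Calling dequantize with any other type, for example dequantize(u32, ...), stops the build with the message above, including the type name.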

flowchart TD
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    Generic["dequantize(comptime T: type, ...)\ngeneric call site"]:::setup
    Q4["T == Q4_0\n→ dequantizeQ4_0()\nmonomorphized copy"]:::sync
    Q8["T == Q8_0\n→ dequantizeQ8_0()\nmonomorphized copy"]:::sync
    BF["T == BF16\n→ dequantizeBF16()\nmonomorphized copy"]:::sync
    ERR["T == other\n→ @compileError()\nhalts compilation"]:::danger
    BQ4["dequantizeQ4_0\n(direct call, inlined)"]:::success
    BQ8["dequantizeQ8_0\n(direct call, inlined)"]:::success
    BBF["dequantizeBF16\n(direct call, inlined)"]:::success

    subgraph CompileTime["Compiler — resolved at compile time (T is known)"]
        direction LR
        SW{"switch T"}
        Q4
        Q8
        BF
        ERR
        SW --> Q4 & Q8 & BF & ERR
    end

    subgraph Binary["Binary — only called variant present"]
        BQ4
        BQ8
        BBF
    end

    Generic --> SW
    Q4 --> BQ4
    Q8 --> BQ8
    BF --> BBF

Tagged Union Dispatch (inline else)

pub const Backend = union(enum) {
    cpu: *CpuBackend,
    metal: *MetalBackend,
    // ...

    pub fn gemv(self: Backend, ...) void {
        switch (self) {
            inline else => |be| be.gemv(...),  // Expands to separate case per variant
        }
    }
};

What inline else does:

// Expands to:
switch (self) {
    .cpu => |be| be.gemv(...),
    .metal => |be| be.gemv(...),
    .vulkan => |be| be.gemv(...),
    .cuda => |be| be.gemv(...),
    .rocm => |be| be.gemv(...),
    .webgpu => |be| be.gemv(...),
}

Benefit: Compiler sees all calls, can inline them. No function pointer indirection.
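The dispatch can be exercised with two stub backends. This is a minimal sketch: the name method and the empty backend structs are hypothetical stand-ins for the real backend types.

```zig
const std = @import("std");

// Hypothetical stub backends; each provides the same method name.
const CpuBackend = struct {
    pub fn name(_: *const CpuBackend) []const u8 {
        return "cpu";
    }
};

const MetalBackend = struct {
    pub fn name(_: *const MetalBackend) []const u8 {
        return "metal";
    }
};

const Backend = union(enum) {
    cpu: *const CpuBackend,
    metal: *const MetalBackend,

    pub fn name(self: Backend) []const u8 {
        return switch (self) {
            // One prong is generated per variant; `be` has that variant's concrete type.
            inline else => |be| be.name(),
        };
    }
};

pub fn main() void {
    const cpu = CpuBackend{};
    const backend = Backend{ .cpu = &cpu };
    std.debug.print("{s}\n", .{backend.name()});
}
```

If a variant's payload type lacks the method, the build fails at the inline else prong, so adding a backend without implementing the full interface is caught at compile time.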

flowchart TD
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    Call["backend.gemv(args)\n(call site in model code)"]:::setup
    IE_Tag["read union tag\n(cheap branch)"]:::migration
    IE_CPU["tag == .cpu\nCpuBackend.gemv(args)\n(inlined by compiler)"]:::sync
    IE_Metal["tag == .metal\nMetalBackend.gemv(args)\n(inlined by compiler)"]:::sync
    IE_Vulkan["tag == .vulkan\nVulkanBackend.gemv(args)\n(inlined by compiler)"]:::sync
    VT_Ptr["load vtable pointer\nfrom object header"]:::danger
    VT_Offset["add method offset\n(e.g. +8 bytes for gemv)"]:::danger
    VT_Load["load function pointer\nfrom vtable memory"]:::danger
    VT_Call["indirect call\nvia register\n(branch predictor miss risk)"]:::danger
    Res1["direct kernel code\n(zero indirection)"]:::success
    Res2["kernel code\n(1 indirect branch)"]:::migration

    subgraph InlineElse["inline else dispatch (Zig)"]
        direction TB
        IE_Tag
        IE_CPU
        IE_Metal
        IE_Vulkan
        IE_Tag --> IE_CPU & IE_Metal & IE_Vulkan
    end

    subgraph VTable["vtable dispatch (C++ / runtime)"]
        direction TB
        VT_Ptr
        VT_Offset
        VT_Load
        VT_Call
        VT_Ptr --> VT_Offset --> VT_Load --> VT_Call
    end

    Call --> IE_Tag
    Call --> VT_Ptr

    IE_CPU --> Res1
    IE_Metal --> Res1
    IE_Vulkan --> Res1
    VT_Call --> Res2

Format String Validation

Compile-time format string checking prevents runtime errors.

// GOOD: Format string validated at compile time
std.log.info("Temperature: {d}, Tokens: {d}", .{temp, n_tokens});

// BAD: Wrong number of arguments, rejected at compile time
std.log.info("Temperature: {d}, Tokens: {d}", .{temp});
// compile error from std.fmt: too few arguments for the format string

// BAD: Wrong type specifier, rejected at compile time
std.log.info("Temperature: {d}", .{"0.5"});
// compile error from std.fmt: 'd' is not a valid specifier for a string argument

C comparison:

printf("Temperature: %d, Tokens: %d\n", temp);  // Runtime crash or garbage

Zig catches this at compile time.

Comptime Assertions

Validate assumptions at compile time.

flowchart TD
    classDef setup     fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef sync      fill:#dcfce7,stroke:#22c55e,color:#14532d
    classDef migration fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef success   fill:#bbf7d0,stroke:#16a34a,color:#14532d
    classDef danger    fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef optional  fill:#f3e8ff,stroke:#9333ea,color:#581c87

    CA_Eval["evaluate condition\nat compile time"]:::sync
    CA_Silent["(nothing emitted)\nbinary produced normally"]:::success
    CA_Fail["compile error\n'assertion failed'\nbuild stops immediately\nno binary produced"]:::danger
    RA_Eval["evaluate condition\nat runtime"]:::sync
    RA_Silent["execution continues"]:::success
    RA_Fail["@panic / illegal instruction\nprocess crashes\n(only in Debug/ReleaseSafe)"]:::danger
    note1["user never sees bad binary"]:::success
    note2["may ship silently in ReleaseFast"]:::optional

    subgraph ComptimeAssert["comptime { std.debug.assert(cond) }"]
        direction TB
        CA_Eval
        CA_Pass{"condition\ntrue?"}
        CA_Silent
        CA_Fail
        CA_Eval --> CA_Pass
        CA_Pass -- yes --> CA_Silent
        CA_Pass -- no --> CA_Fail
    end

    subgraph RuntimeAssert["std.debug.assert(cond) at runtime"]
        direction TB
        RA_Eval
        RA_Pass{"condition\ntrue?"}
        RA_Silent
        RA_Fail
        RA_Eval --> RA_Pass
        RA_Pass -- yes --> RA_Silent
        RA_Pass -- no --> RA_Fail
    end

    CA_Fail -. "catches bug before\nshipping any binary" .-> note1
    RA_Fail -. "caught only if\ntest covers that path" .-> note2

Array Size Validation

const quant_block_elems = 32;
const Q4_0_Block = extern struct {
    scale: f16,
    quants: [16]u8,  // 16 bytes = 32 nibbles
};

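comptime {
    std.debug.assert(@sizeOf(Q4_0_Block) == 18);  // 2 + 16 = 18 bytes
    // Tie the nibble count to the field itself rather than to a literal
    const quant_bytes = @sizeOf(@TypeOf(@as(Q4_0_Block, undefined).quants));
    std.debug.assert(quant_bytes * 2 == quant_block_elems);  // 16 bytes × 2 nibbles/byte
}

Effect: If you change quants to [15]u8, compilation fails. The size check alone would not catch it (17 bytes pad back up to 18 under the struct's 2-byte alignment), but the nibble-count check does.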

Alignment Validation

comptime {
    std.debug.assert(@alignOf(KVCache) == 64);  // Must be cache-line aligned
}

Type Size Checks

comptime {
    std.debug.assert(@sizeOf(f32) == 4);
    std.debug.assert(@sizeOf(bf16) == 2);
    std.debug.assert(@sizeOf(V8) == 32);  // 8 × f32
}

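Why? Zig fixes f32 at 32 bits, so the first check mainly documents intent. The checks on project-defined types such as bf16 and V8 do the real work: if a definition changes, or a target lays one of them out differently, the build fails instead of producing silent data corruption at runtime.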
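The assertions in this section can be given more informative failure messages. The sketch below is one way to do that under stated assumptions: expectSize and KVSlot are hypothetical names, and the block layout mirrors the Q4_0 example above.

```zig
const std = @import("std");

const quant_block_elems = 32;

const Q4_0_Block = extern struct {
    scale: f16,
    quants: [quant_block_elems / 2]u8, // two 4-bit values per byte
};

// Hypothetical cache-line-aligned record for illustration.
const KVSlot = struct {
    data: [16]f32 align(64),
};

// Hypothetical helper: fail the build with a message naming the type and both sizes.
fn expectSize(comptime T: type, comptime expected: usize) void {
    if (@sizeOf(T) != expected) {
        @compileError(std.fmt.comptimePrint(
            "{s}: expected {d} bytes, found {d}",
            .{ @typeName(T), expected, @sizeOf(T) },
        ));
    }
}

comptime {
    expectSize(Q4_0_Block, 18);
    expectSize(KVSlot, 64);
    std.debug.assert(@alignOf(KVSlot) == 64);
}
```

Deriving the quants length from quant_block_elems, as done here, removes one place where the two numbers could drift apart.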

Practical Examples

MXFP4 Lookup Table

// MXFP4 uses E2M1 format (2-bit exponent, 1-bit mantissa)
// 4-bit nibble → 16 possible values stored as a literal constant table
pub fn mxfp4Lookup(nibble: u8) f32 {
    const table: [16]f32 = .{
        0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
        0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0,
    };
    return table[nibble & 0xF];
}

// For the scaled variant (nibble value × block scale), see nvfp4Dequant.
// The mantissa term for E2M1 is 0.5 * mant (not 1.0 * mant):
//   mant=0 → 0.0 addend, mant=1 → 0.5 addend, giving 1.0 and 1.5 for normal values.

Single-level lookup: nibble → base value via literal table (no module-level symbol). For NVFP4 scaled dequantization, nvfp4Dequant combines mxfp4Lookup with a block scale.

Quantization Block Sizes

Block byte sizes are defined as named module-level constants in backend.zig:

pub const q4_0_block_bytes: usize = 18;   // 2-byte scale + 16 bytes of nibbles
pub const q8_0_block_bytes: usize = 34;   // 2-byte scale + 32 bytes of i8 values
pub const q4_k_block_bytes: usize = 144;
pub const q6_k_block_bytes: usize = 210;
// ...

Usage: reference the constant directly by name:

const bytes_per_block = backend.q4_0_block_bytes;  // 18

const num_blocks = (total_bytes + backend.q4_0_block_bytes - 1) / backend.q4_0_block_bytes;

Benefit: Named constants are self-documenting, always available at comptime, and require no function call overhead.
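As a small sketch of how such constants compose, the helper below converts an element count into a byte count and is then evaluated at compile time to size a buffer. q4_0_block_elems, q4_0Bytes, and RowBuffer are hypothetical names introduced for this example.

```zig
const std = @import("std");

pub const q4_0_block_bytes: usize = 18; // 2-byte scale + 16 bytes of nibbles
pub const q4_0_block_elems: usize = 32; // elements per block (hypothetical name)

// Hypothetical helper: bytes needed to store n_elems values, rounding up to whole blocks.
pub fn q4_0Bytes(n_elems: usize) usize {
    const blocks = (n_elems + q4_0_block_elems - 1) / q4_0_block_elems;
    return blocks * q4_0_block_bytes;
}

// Evaluated at compile time, so the result can size a fixed buffer.
const row_bytes = q4_0Bytes(4096);
const RowBuffer = [row_bytes]u8;

comptime {
    std.debug.assert(row_bytes == 128 * q4_0_block_bytes); // 4096 / 32 = 128 blocks
}
```

The same function remains usable at runtime for tensors whose shape is only known after loading a model file.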

Performance Impact

FP8 dequantization (measured on Apple M4):

Method	Cycles/call	Speedup
Runtime computation	~30 cycles	1×
Comptime LUT	~1 cycle	30×

Binary size impact:

Feature	Binary size increase
FP8 E4M3 LUT (256 × 4 bytes)	+1 KB
MXFP4 LUT (16 × 4 bytes)	+64 bytes
IQ4_NL LUT (16 × 1 byte)	+16 bytes
Embedded Metal shaders (~50 KB source)	+50 KB

Trade-off: Small binary size increase for significant runtime speedup.

Common Patterns

Conditional Compilation

// Container-level initializers are already comptime, so no `comptime` keyword is needed here
const use_simd = builtin.cpu.arch == .x86_64 or builtin.cpu.arch == .aarch64;

pub fn dotProduct(a: []const f32, b: []const f32) f32 {
    if (comptime use_simd) {
        return dotProductSIMD(a, b);
    } else {
        return dotProductScalar(a, b);
    }
}

Type-Generic Containers

pub fn RingBuffer(comptime T: type, comptime size: usize) type {
    return struct {
        data: [size]T,
        head: usize = 0,

        // Slots hold undefined values until push writes them
        pub fn init() @This() {
            return .{ .data = undefined };
        }

        pub fn push(self: *@This(), item: T) void {
            self.data[self.head] = item;
            self.head = (self.head + 1) % size;
        }
    };
}

// Usage:
var conv_state = RingBuffer(f32, 4).init();  // 4-element f32 ring buffer

Each instantiation (RingBuffer(f32, 4), RingBuffer(u32, 8)) generates separate specialized code.
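A complete, runnable variant is sketched below. It differs from the snippet above in one assumed detail: init takes a fill value so that every slot starts defined, which makes the buffer contents safe to inspect in a test.

```zig
const std = @import("std");

pub fn RingBuffer(comptime T: type, comptime size: usize) type {
    return struct {
        const Self = @This();

        data: [size]T,
        head: usize = 0,

        // Hypothetical constructor: every slot starts at `fill`.
        pub fn init(fill: T) Self {
            var self: Self = .{ .data = undefined };
            @memset(&self.data, fill);
            return self;
        }

        pub fn push(self: *Self, item: T) void {
            self.data[self.head] = item;
            self.head = (self.head + 1) % size;
        }
    };
}

pub fn main() void {
    var conv_state = RingBuffer(f32, 4).init(0.0);
    for ([_]f32{ 1, 2, 3, 4, 5 }) |x| conv_state.push(x);
    // The fifth push wrapped around and overwrote slot 0.
    std.debug.print("head={d}\n", .{conv_state.head});
}
```

Because size is a comptime parameter, the modulo in push uses a constant divisor and the buffer needs no heap allocation.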

Compile-Time String Manipulation

const kernel_name = "gemv_" ++ dtype_name;  // Comptime string concat

pub fn loadKernel(comptime dtype: DType) !Pipeline {
    const name = comptime kernelName(dtype);  // e.g., "gemv_q4_0"
    return library.newFunctionWithName(name);
}

fn kernelName(comptime dtype: DType) []const u8 {
    return "gemv_" ++ @tagName(dtype);  // "gemv_" + "q4_0" → "gemv_q4_0"
}
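The naming helper can also be used to precompute every kernel name, so that a runtime dtype value maps to its name by index. This is a sketch under assumptions: the DType enum below is a hypothetical three-member subset, and kernel_names is a name introduced for this example.

```zig
const std = @import("std");

// Hypothetical subset of dtypes for illustration.
const DType = enum { q4_0, q8_0, bf16 };

fn kernelName(comptime dtype: DType) []const u8 {
    return "gemv_" ++ @tagName(dtype);
}

// Build the full name table at compile time, one entry per enum field.
const kernel_names = blk: {
    const values = std.enums.values(DType);
    var names: [values.len][]const u8 = undefined;
    for (values, 0..) |dtype, i| {
        names[i] = kernelName(dtype);
    }
    break :blk names;
};

// Runtime lookup: the enum's integer value indexes the precomputed table.
pub fn kernelNameRuntime(dtype: DType) []const u8 {
    return kernel_names[@intFromEnum(dtype)];
}
```

This keeps the string building at compile time even when the dtype is only discovered at runtime, for instance from a model file header.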

Anti-Patterns

Don't Overuse comptime

BAD: Using comptime for simple runtime values

const temperature = comptime 0.7;  // Pointless — it's already a constant

GOOD: Just use const

const temperature: f32 = 0.7;

Don't Compute Heavy Things at Comptime

BAD: Large nested loops at comptime slow down compilation

const huge_table = comptime blk: {
    var table: [1000000]f32 = undefined;
    for (0..1000000) |i| {
        table[i] = expensiveComputation(i);  // Runs at compile time!
    }
    break :blk table;
};

Effect: Compilation takes minutes instead of seconds.

Better: Use codegen (separate script generates the table, output checked into repo) or load from file at runtime.

Don't Use comptime for Mutable State

WRONG: This doesn't work

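var comptime_counter: usize = 0;  // Legal, but this is an ordinary runtime global

pub fn getNextId() usize {
    comptime {
        comptime_counter += 1;  // Error: a runtime global cannot be mutated at compile time
        return comptime_counter;
    }
}

comptime is for values that are fixed once compilation ends, not for state that changes between calls. A mutable variable is fine as a local inside a single compile-time evaluation, as in the table-building loops above.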

Best Practices

Use comptime for lookup tables when the table is small (<10 KB) and frequently accessed
Use comptime for feature detection to eliminate dead code
Use @embedFile for resources that ship with the binary
Use comptime assertions to validate invariants
Don't use comptime for runtime configuration — use const or runtime parameters instead

In the code: src/ops/quant.zig (fp8e4m3_lut, iq4nl_table), src/backend/metal.zig (@embedFile for MSL shaders), src/backend/backend.zig (inline else dispatch), build.zig (build_options)

Related: Zig Language Reference — comptime, Chapter 9: CPU SIMD Optimization (uses comptime LUTs)

Next: Appendix: Profiling and Debugging → | Back: Appendix: Mathematical Operations ←
