hivellm
diff --git a/.rulebook/tasks/TASKS-INDEX.md b/.rulebook/tasks/TASKS-INDEX.md
Lines changed: 48 additions & 4 deletions
diff --git a/.rulebook/tasks/phase18a_debug-backend-machir/.metadata.json b/.rulebook/tasks/phase18a_debug-backend-machir/.metadata.json
Lines changed: 5 additions & 0 deletions
diff --git a/.rulebook/tasks/phase18a_debug-backend-machir/proposal.md b/.rulebook/tasks/phase18a_debug-backend-machir/proposal.md
Lines changed: 40 additions & 0 deletions
diff --git a/.rulebook/tasks/phase18a_debug-backend-machir/tasks.md b/.rulebook/tasks/phase18a_debug-backend-machir/tasks.md
Lines changed: 31 additions & 0 deletions
diff --git a/.rulebook/tasks/phase18b_x86-encoder/.metadata.json b/.rulebook/tasks/phase18b_x86-encoder/.metadata.json
Lines changed: 5 additions & 0 deletions
diff --git a/.rulebook/tasks/phase18b_x86-encoder/proposal.md b/.rulebook/tasks/phase18b_x86-encoder/proposal.md
Lines changed: 42 additions & 0 deletions
diff --git a/.rulebook/tasks/phase18b_x86-encoder/tasks.md b/.rulebook/tasks/phase18b_x86-encoder/tasks.md
Lines changed: 35 additions & 0 deletions
diff --git a/.rulebook/tasks/phase18c_pe-object-emission/.metadata.json b/.rulebook/tasks/phase18c_pe-object-emission/.metadata.json
Lines changed: 5 additions & 0 deletions
diff --git a/.rulebook/tasks/phase18c_pe-object-emission/proposal.md b/.rulebook/tasks/phase18c_pe-object-emission/proposal.md
Lines changed: 42 additions & 0 deletions
@@ -1,7 +1,7 @@
 # TML Project — Task Index
 
 **Last updated**: 2026-04-05
-**Active tasks**: 46 | **Archived**: 5+
+**Active tasks**: 60 | **Archived**: 5+
 
 ---
 
@@ -196,6 +196,47 @@ Wire everything together, port tooling, execute three-stage bootstrap verificati
 
 **Order**: 17a → 17b → 17c (sequential). 17c = **TML COMPILES ITSELF**
 
+## Phase 18–21 — Custom Native Backend (ERA 2)
+
+Replace LLVM with a custom code generator. Target: the compiler binary drops from 140MB to ~15MB.
+
+| ID | Task | Status | Priority | Progress |
+|----|------|--------|----------|----------|
+| 18a | [MachIR Lowering](phase18a_debug-backend-machir/) | Planned | P1 | 0/20 |
+| 18b | [x86_64 Encoder](phase18b_x86-encoder/) | Planned | P1 | 0/22 |
+| 18c | [PE/COFF Object Emission](phase18c_pe-object-emission/) | Planned | P1 | 0/20 |
+| 19a | [Register Allocator](phase19a_register-allocator/) | Planned | P1 | 0/22 |
+| 20a | [Production x86_64 Backend](phase20a_production-backend-x86/) | Planned | P1 | 0/22 |
+| 20b | [AArch64 Backend](phase20b_aarch64-backend/) | Planned | P1 | 0/21 |
+| 21a | [Debug Info (PDB+DWARF)](phase21a_debug-info-pdb-dwarf/) | Planned | P1 | 0/24 |
+
+**Order**: 18a → 18b+18c → 19a → 20a+20b → 21a. Completion = **LLVM ELIMINATED**
+
+## Phase 22 — Custom Linker (ERA 3)
+
+Replace LLD with tml-link. Target: sub-10ms incremental linking.
+
+| ID | Task | Status | Priority | Progress |
+|----|------|--------|----------|----------|
+| 22a | [PE/COFF Linker (Windows)](phase22a_pe-coff-linker/) | Planned | P2 | 0/22 |
+| 22b | [ELF Linker (Linux)](phase22b_elf-linker/) | Planned | P2 | 0/20 |
+| 22c | [Mach-O Linker (macOS)](phase22c_macho-linker/) | Planned | P2 | 0/18 |
+| 22d | [Incremental Linker](phase22d_incremental-linker/) | Planned | P2 | 0/18 |
+
+**Order**: 22a → 22b → 22c → 22d. Completion = **LLD ELIMINATED**
+
+## Phase 23 — C/C++ Frontend (ERA 4)
+
+TML compiles C and C++ code directly. Complete toolchain independence.
+
+| ID | Task | Status | Priority | Progress |
+|----|------|--------|----------|----------|
+| 23a | [C Preprocessor](phase23a_c-preprocessor/) | Planned | P2 | 0/20 |
+| 23b | [C17 Frontend](phase23b_c-frontend/) | Planned | P2 | 0/24 |
+| 23c | [C++ Subset Frontend](phase23c_cpp-subset-frontend/) | Planned | P2 | 0/22 |
+
+**Order**: 23a → 23b → 23c. Completion = **FULL TOOLCHAIN INDEPENDENCE**
+
 ## Research
 
 | ID | Task | Status | Priority | Progress |
@@ -224,7 +265,10 @@ Wire everything together, port tooling, execute three-stage bootstrap verificati
 ```
 Active now:   Phase 1 (language), Phase 4 (tooling), Phase 8 (DB), Phase 10 (HTTP/build)
 Next:         Phase 12 (self-hosting foundation) — can start immediately
-Then:         Phase 13 (TML frontend) — after Phase 12 complete
-Future:       Phase 14+ (type checker, IR pipeline, codegen, bootstrap)
-Long-term:    Custom backend, custom linker, C/C++ frontend (see independence plan)
+Then:         Phase 13-17 (ERA 1: TML compiles itself) — 25 tasks, 544 items
+Then:         Phase 18-21 (ERA 2: custom backend, eliminate LLVM) — 7 tasks, 151 items
+Then:         Phase 22 (ERA 3: custom linker, eliminate LLD) — 4 tasks, 78 items
+Finally:      Phase 23 (ERA 4: C/C++ frontend, full independence) — 3 tasks, 66 items
+
+TOTAL INDEPENDENCE PLAN: 39 tasks, 839 items across 4 eras
 ```
@@ -0,0 +1,5 @@
+{
+  "status": "pending",
+  "createdAt": "2026-04-06T01:36:58.343Z",
+  "updatedAt": "2026-04-06T01:36:58.343Z"
+}
@@ -0,0 +1,40 @@
+# Proposal: phase18a — MIR → MachIR Lowering
+
+## Why
+
+The TML compiler currently depends entirely on LLVM for code generation. LLVM is a ~500MB binary dependency with complex build requirements, slow compilation of the compiler itself, and an API surface that couples TML tightly to LLVM internals. ERA 2 eliminates this dependency by building a native backend in pure TML. Phase 18a establishes the foundation: a machine-level intermediate representation (MachIR) that sits between the existing MIR and raw bytes.
+
+MachIR is the architectural separation point that makes the rest of ERA 2 possible. Phases 18b (encoding) and 19a (register allocation) operate entirely on MachIR — they never touch MIR. This means the register allocator can be swapped from stack-only (Phase 18, MVP) to linear scan (Phase 19) without changing any lowering logic.
+
+## What Changes
+
+- New TML module `compiler/native/machir.tml` — MachIR data types (VirtualReg, MachInst, MachBlock, MachFunc)
+- New TML module `compiler/native/mir_lower.tml` — MIR → MachIR lowering pass
+- New TML module `compiler/native/stack_alloc.tml` — stack-only register allocator (every VirtualReg → stack slot)
+- New TML module `compiler/native/frame.tml` — stack frame layout, prologue/epilogue emission
+- MachIR is NOT emitted to disk; it is an in-memory structure consumed by phase 18b encoder
+
+## Design Decisions
+
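+**Unlimited virtual registers**: VirtualReg is a U64 counter. The lowering phase never reuses registers, which simplifies correctness: every value gets a fresh VirtualReg and lowering needs no liveness analysis. Phi nodes are the one SSA construct that must still be eliminated (see the phi-node decision below). The allocator (phase 19) handles physical register assignment.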
+
+**Stack-only allocation as Phase 18 MVP**: Every VirtualReg gets its own 8-byte stack slot. This is correct but slow. The tradeoff is acceptable for phase 18 because correctness is the only goal. Linear scan in phase 19 replaces this path without changing MachIR.
+
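+**Phi node destruction via parallel copies**: MIR phi nodes are lowered to parallel-copy sequences inserted at the end of each predecessor block. This matches the standard SSA destruction algorithm: a parallel copy reads all of its sources before writing any destination, so it avoids the swap and lost-copy problems, provided cyclic copies are sequentialized through a temporary.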
+
+## Impact
+
+- Affected specs: docs/specs/native-backend.md (new)
+- Affected code: compiler/src/backend/ (new native/ subdirectory), compiler/src/cli/ (--backend=native flag stub)
+- Breaking change: NO — LLVM backend remains default, native backend is opt-in via --backend=native
+- User benefit: First step toward eliminating the 500MB LLVM dependency; faster compiler builds
+
+## Risk
+
+LOW. MachIR is a pure data structure transformation. It does not touch the parser, type checker, or existing codegen. If MachIR lowering produces wrong output, the only symptom is incorrect machine code — the existing LLVM path is unaffected. Tests verify MachIR structure directly without executing the output.
+
+## Reference
+
+- chibicc: AST → x86_64 assembly text in codegen.c (no separate IR; a compact reference for stack-based code generation)
+- qbe: SSA → machine code lowering in amd64/isel.c
+- TCC: tccgen.c register and stack slot management
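A minimal sketch of how the parallel copies produced by phi lowering could be turned into ordinary moves. It is written in Python for illustration rather than TML, and the `(dst, src)` pair representation, the function name `sequentialize`, and the scratch register name `tmp` are all assumptions, not part of this proposal.

```python
def sequentialize(copies, tmp="tmp"):
    """Order a parallel copy into sequential moves.

    copies: list of (dst, src) pairs; every dst must be distinct.
    tmp:    a scratch register that no copy reads or writes (assumption).
    Returns a list of (dst, src) moves to execute in order.
    """
    # Self-copies are no-ops and can be dropped up front.
    pending = {dst: src for dst, src in copies if dst != src}
    moves = []
    while pending:
        # A destination is safe to overwrite once no pending copy reads it.
        ready = [d for d in pending if d not in pending.values()]
        if ready:
            for d in ready:
                moves.append((d, pending.pop(d)))
            continue
        # Only cycles remain: save one destination in the scratch register
        # and redirect its readers to the scratch copy.
        d = next(iter(pending))
        moves.append((tmp, d))
        pending = {k: (tmp if s == d else s) for k, s in pending.items()}
    return moves


# A two-register swap needs the scratch register exactly once.
print(sequentialize([("a", "b"), ("b", "a")]))
```

With a stack-only allocator the scratch location can simply be one more stack slot, so breaking cycles never requires a free physical register.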
@@ -0,0 +1,31 @@
+## Status: 0/20 items complete
+
+## Phase 1: MachIR Data Types
+- [ ] 1.1 Define `VirtualReg` type (wraps U64 ID, never reused, unlimited count)
+- [ ] 1.2 Define `MachInst` enum (Mov, Add, Sub, Imul, Idiv, Cmp, Jcc, Jmp, Call, Ret, Push, Pop, Lea, Spill, Reload)
+- [ ] 1.3 Define `MachBlock` (ID, list of MachInst, successor block IDs)
+- [ ] 1.4 Define `MachFunc` (name, list of MachBlock, virtual reg count, stack frame size)
+
+## Phase 2: MIR → MachIR Lowering
+- [ ] 2.1 Lower MIR arithmetic (BinOp Add/Sub/Mul/Div) → MachInst with fresh VirtualRegs
+- [ ] 2.2 Lower MIR memory ops (Load, Store, Alloca) → MachInst with stack slot references
+- [ ] 2.3 Lower MIR control flow (Goto, Branch, Switch) → MachBlock edges + Jcc/JMP
+- [ ] 2.4 Lower MIR function calls (CallInst) → MachInst::Call + arg/return VirtualReg assignments
+- [ ] 2.5 Lower MIR comparisons (Eq, Ne, Lt, Le, Gt, Ge) → CMP + Jcc sequence
+- [ ] 2.6 Lower MIR phi nodes → parallel-copy sequences at block predecessors
+
+## Phase 3: Stack-Only Register Allocation
+- [ ] 3.1 Assign each VirtualReg a unique stack slot (8-byte aligned, no sharing)
+- [ ] 3.2 Insert Spill after each instruction that defines a VirtualReg
+- [ ] 3.3 Insert Reload before each instruction that uses a VirtualReg
+- [ ] 3.4 Verify every VirtualReg reference is replaced by a stack slot offset
+
+## Phase 4: Stack Frame Layout
+- [ ] 4.1 Compute total frame size: count stack slots × 8 bytes, align to 16 bytes
+- [ ] 4.2 Emit function prologue (PUSH RBP; MOV RBP, RSP; SUB RSP, frame_size)
+- [ ] 4.3 Emit function epilogue (ADD RSP, frame_size; POP RBP; RET)
+- [ ] 4.4 Encode [RBP - offset] addressing for all stack slot references
+
+## Phase 5: Testing
+- [ ] 5.1 Lower 5 MIR programs (factorial, fib, hello world, struct return, loop) — verify MachIR structure matches expected block/inst count
+- [ ] 5.2 Verify prologue/epilogue generated correctly for each test function — frame size divisible by 16, all VirtualRegs assigned slots
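A rough sketch of tasks 3.1 to 3.3 and 4.1, in Python for illustration rather than TML. The tuple layout `(op, dst, srcs)`, the helper names, and the choice of RAX/RCX as scratch registers are assumptions made for this example only. It reloads each use before the instruction and spills the definition after it.

```python
SCRATCH = ["RAX", "RCX"]  # assumed scratch registers for this sketch


def assign_slots(vregs):
    """Give every virtual register its own 8-byte slot at [RBP - offset]."""
    return {v: 8 * (i + 1) for i, v in enumerate(vregs)}


def frame_size(slot_count):
    """Total frame size in bytes: 8 bytes per slot, rounded up to 16."""
    return (slot_count * 8 + 15) & ~15


def rewrite(insts, slots):
    """Replace virtual registers with scratch registers plus Reload/Spill.

    insts: list of (op, dst, srcs) where dst is a vreg name or None and
           srcs is a tuple of at most len(SCRATCH) vreg names.
    """
    out = []
    for op, dst, srcs in insts:
        regs = []
        for scratch, src in zip(SCRATCH, srcs):
            out.append(("Reload", scratch, slots[src]))  # before the use
            regs.append(scratch)
        out.append((op, "RAX" if dst is not None else None, tuple(regs)))
        if dst is not None:
            out.append(("Spill", slots[dst], "RAX"))     # after the def
    return out


slots = assign_slots(["v0", "v1", "v2"])
for inst in rewrite([("Add", "v2", ("v0", "v1"))], slots):
    print(inst)
print("frame size:", frame_size(len(slots)))
```

Rounding up to 16 keeps RSP aligned at call sites, which the x86_64 calling conventions require.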
@@ -0,0 +1,5 @@
+{
+  "status": "pending",
+  "createdAt": "2026-04-06T01:36:59.299Z",
+  "updatedAt": "2026-04-06T01:36:59.299Z"
+}
@@ -0,0 +1,42 @@
+# Proposal: phase18b — x86_64 Instruction Encoding
+
+## Why
+
+Phase 18a produces MachIR — an in-memory list of abstract machine instructions with virtual registers. Phase 18b converts that list to raw bytes. This is the lowest layer of the native backend: given a MachInst and physical register assignments (or stack slots from phase18a's stack-only allocator), produce the correct sequence of bytes that the x86_64 CPU will execute.
+
+x86_64 instruction encoding is notoriously complex: variable-length instructions (1–15 bytes), REX prefixes for 64-bit operands, ModRM and SIB bytes for memory addressing, RIP-relative addressing for position-independent code. Getting these right is a prerequisite for every subsequent phase. Phase 18c (COFF emission) and Phase 19 (register allocator) both depend on this encoder being correct.
+
+## What Changes
+
+- New TML module `compiler/native/x86_encode.tml` — encoding functions for each instruction class
+- New TML module `compiler/native/x86_emit.tml` — `emit_func(MachFunc) -> Buffer` top-level emitter with two-pass branch patching
+- Helper types: `ModRM`, `SIB`, `REX`, `PhysReg` enum, `MemOperand` (base + displacement)
+- No changes to existing LLVM backend or MIR
+
+## Design Decisions
+
+**Core subset only (Phase 18)**: MOV, ADD, SUB, IMUL, IDIV, NEG, NOT, AND, OR, XOR, SHL, SHR, SAR, CMP, TEST, Jcc, JMP, CALL, RET, PUSH, POP, LEA. SSE/AVX deferred to Phase 20a. This subset is sufficient to compile any TML program that uses only integers and pointers.
+
+**Stack-slot operands only (Phase 18)**: The encoder in phase 18 works with the output of the stack-only allocator from phase 18a. Every operand is either a physical register (RSP, RBP, RAX for IDIV convention) or a [RBP-offset] memory reference. Phase 19 replaces the allocator; the encoder does not change.
+
+**Two-pass branch patching**: Forward references require knowing the target block's byte offset before it is emitted. Pass 1 emits all instructions using placeholder displacements. Pass 2 patches rel8 and rel32 fields once all block offsets are known. rel8 vs rel32 selection: use rel8 if the displacement fits in a signed byte (-128 to 127), otherwise rel32. Switching a branch to the longer form shifts every later offset, so the affected code must be re-emitted (rare for small functions).
+
+**RIP-relative addressing deferred**: Position-independent code normally references global variables through RIP-relative addressing. For Phase 18, all globals are instead accessed via absolute addresses loaded as imm64. RIP-relative addressing for globals is added in Phase 20a.
+
+## Impact
+
+- Affected specs: docs/specs/native-backend.md (encoding reference tables)
+- Affected code: compiler/native/ (new), no changes to existing paths
+- Breaking change: NO — native backend is separate, LLVM path unaffected
+- User benefit: Native backend can emit working x86_64 machine code for integer programs
+
+## Risk
+
+MEDIUM. x86_64 encoding has many edge cases: REX.W is required for most 64-bit ops (PUSH, POP, and near CALL/JMP default to 64-bit without it), RSP/RBP have special ModRM encodings (RSP as a base forces a SIB byte; RBP as a base always needs a displacement), and some opcodes encode the register in the low 3 bits of the opcode byte. Errors produce incorrect bytes that crash at runtime with no error message. The reference test (task 6.1) against known-correct bytes is the primary correctness guard.
+
+## Reference
+
+- Intel SDM Vol 2 (instruction set reference) — encoding fields defined per instruction
+- chibicc codegen.c: emits x86_64 assembly text rather than machine bytes; useful for instruction selection patterns, not for encoding details
+- TCC i386-asm.c and x86_64-gen.c: ModRM/SIB and REX emission in a single-pass compiler (tccasm.c holds the assembler front end)
+- AMD64 ABI Vol 1 §3.2 — calling convention that drives register usage
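A minimal sketch of the two-pass patching scheme, in Python for illustration rather than TML. It uses rel32 everywhere, so no re-emission is needed, and it omits the rel8 shortening described above. The block representation (`(label, items)` with raw `bytes` or `("jmp", label)` / `("jcc", cc, label)` tuples) and the function name `emit` are assumptions for this example.

```python
import struct


def emit(blocks):
    """Emit blocks to bytes, patching branch displacements in a second pass.

    blocks: list of (label, items). Each item is either raw bytes,
            ("jmp", label), or ("jcc", cc, label) with cc in 0..15.
    """
    buf = bytearray()
    offsets = {}   # label -> byte offset of the block start
    fixups = []    # (offset of the rel32 field, target label)

    # Pass 1: emit everything, leaving zeroed rel32 placeholders.
    for label, items in blocks:
        offsets[label] = len(buf)
        for item in items:
            if isinstance(item, bytes):
                buf += item
                continue
            if item[0] == "jmp":
                buf += b"\xE9"                       # JMP rel32
                target = item[1]
            else:
                buf += bytes([0x0F, 0x80 | item[1]])  # Jcc rel32
                target = item[2]
            fixups.append((len(buf), target))
            buf += b"\x00\x00\x00\x00"

    # Pass 2: displacements are relative to the end of the rel32 field,
    # which is also the end of the branch instruction.
    for pos, target in fixups:
        struct.pack_into("<i", buf, pos, offsets[target] - (pos + 4))
    return bytes(buf)


code = emit([
    ("entry", [("jmp", "exit")]),
    ("mid",   [b"\x90"]),                       # NOP
    ("exit",  [b"\xC3", ("jcc", 4, "entry")]),  # RET, then JE entry
])
print(code.hex())
```

Starting with rel32 for every branch keeps pass 1 offsets final; shortening to rel8 can then be layered on as a separate size-optimization step.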
@@ -0,0 +1,35 @@
+## Status: 0/22 items complete
+
+## Phase 1: Encoding Infrastructure
+- [ ] 1.1 Implement `ModRM` byte builder (mod[2], reg[3], rm[3] fields, addressing mode enum)
+- [ ] 1.2 Implement `SIB` byte builder (scale[2], index[3], base[3] fields)
+- [ ] 1.3 Implement `REX` prefix builder (REX.W for 64-bit ops, REX.R/X/B for register extension to R8-R15)
+- [ ] 1.4 Implement immediate encoding helpers (imm8, imm16, imm32, imm64 → little-endian bytes appended to Buffer)
+- [ ] 1.5 Define physical register enum (RAX=0, RCX=1, RDX=2, RBX=3, RSP=4, RBP=5, RSI=6, RDI=7, R8-R15)
+
+## Phase 2: Data Movement
+- [ ] 2.1 Encode `MOV r64, r64` (REX.W 0x89 ModRM) and `MOV r64, imm64` (REX.W 0xB8+rd imm64)
+- [ ] 2.2 Encode `MOV r64, [RBP-disp]` and `MOV [RBP-disp], r64` with disp8 and disp32 forms
+- [ ] 2.3 Encode `LEA r64, [RBP-disp]` (REX.W 0x8D ModRM displacement) for stack address loads
+- [ ] 2.4 Encode `PUSH r64` (0x50+rd, REX.B prefix for R8-R15) and `POP r64` (0x58+rd)
+
+## Phase 3: Arithmetic
+- [ ] 3.1 Encode `ADD r64, r64`, `ADD r64, imm32`, `SUB r64, r64`, `SUB r64, imm32`
+- [ ] 3.2 Encode `IMUL r64, r64` (REX.W 0x0F 0xAF ModRM) and `IDIV r64` (REX.W 0xF7 /7 — dividend in RDX:RAX)
+- [ ] 3.3 Encode `NEG r64`, `NOT r64`, `AND r64, r64`, `OR r64, r64`, `XOR r64, r64`
+- [ ] 3.4 Encode `SHL r64, CL`, `SHL r64, imm8`, `SHR r64, CL`, `SAR r64, CL` (shift group: D3 for CL, C1 for imm8, D1 for shift-by-1)
+
+## Phase 4: Comparison and Branches
+- [ ] 4.1 Encode `CMP r64, r64`, `CMP r64, imm32`, `TEST r64, r64`
+- [ ] 4.2 Encode all Jcc rel8 (short) and rel32 (near) forms: JE/JNE/JL/JLE/JG/JGE/JB/JBE/JA/JAE
+- [ ] 4.3 Encode `JMP rel8`, `JMP rel32`, `JMP r64` and `CALL rel32`, `CALL r64`
+- [ ] 4.4 Encode `RET` (0xC3) and `RET imm16` (0xC2 + imm16 for callee-cleanup conventions)
+
+## Phase 5: MachIR → Bytes Emission
+- [ ] 5.1 Implement `emit_func(MachFunc) -> Buffer` — iterate MachBlocks, emit each MachInst to a growing Buffer
+- [ ] 5.2 Implement two-pass branch patching: pass 1 records block start offsets, pass 2 patches all Jcc/JMP displacements
+- [ ] 5.3 Handle forward references: use 32-bit displacement placeholders (0x00000000), back-patch after all blocks emitted
+
+## Phase 6: Testing
+- [ ] 6.1 Encode 20 known instruction sequences, compare output bytes byte-for-byte against nasm/objdump reference
+- [ ] 6.2 End-to-end: lower factorial MIR → MachIR (phase18a) → x86 bytes → write to executable memory page → call via FFI → verify return value
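A small sketch of the REX and ModRM builders from tasks 1.1, 1.3, 2.1 and 2.2, in Python for illustration rather than TML. The helper names (`rex`, `modrm`, `mov_rr`, `mov_load_rbp`) are hypothetical. The byte values in the usage lines were worked out by hand from the opcode forms listed above and should still be checked against nasm/objdump as task 6.1 requires.

```python
import struct

REGS = {name: i for i, name in enumerate(
    ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI",
     "R8", "R9", "R10", "R11", "R12", "R13", "R14", "R15"])}


def rex(w, reg, rm):
    """REX prefix: 0100WRXB. X (SIB index extension) is unused here."""
    return 0x40 | (w << 3) | ((reg >> 3) << 2) | (rm >> 3)


def modrm(mod, reg, rm):
    """ModRM byte from the low 3 bits of each register number."""
    return (mod << 6) | ((reg & 7) << 3) | (rm & 7)


def mov_rr(dst, src):
    """MOV dst, src as REX.W 89 /r (reg = src, rm = dst)."""
    d, s = REGS[dst], REGS[src]
    return bytes([rex(1, s, d), 0x89, modrm(0b11, s, d)])


def mov_load_rbp(dst, disp):
    """MOV dst, [RBP + disp] as REX.W 8B /r with disp8 or disp32."""
    d, rbp = REGS[dst], REGS["RBP"]
    head = bytes([rex(1, d, rbp), 0x8B])
    if -128 <= disp <= 127:
        return head + bytes([modrm(0b01, d, rbp)]) + struct.pack("<b", disp)
    return head + bytes([modrm(0b10, d, rbp)]) + struct.pack("<i", disp)


print(mov_rr("RAX", "RCX").hex())         # 4889c8
print(mov_load_rbp("R8", -8).hex())       # 4c8b45f8
print(mov_load_rbp("RAX", -256).hex())    # 488b8500ffffff
```

Stack slots sit below RBP, so `disp` is negative; the disp8 form covers the first 16 slots and disp32 covers the rest.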
@@ -0,0 +1,5 @@
+{
+  "status": "pending",
+  "createdAt": "2026-04-06T01:36:59.743Z",
+  "updatedAt": "2026-04-06T01:36:59.743Z"
+}
@@ -0,0 +1,42 @@
+# Proposal: phase18c — PE/COFF Object File Emission
+
+## Why
+
+Phases 18a and 18b produce x86_64 machine code bytes in memory. Phase 18c wraps those bytes in a PE/COFF object file so the existing LLD linker (already embedded in TML) can link them into an executable. This completes the Phase 18 MVP: a working end-to-end native backend path for Windows.
+
+PE/COFF is well-documented (Microsoft PE/COFF specification, version 11.0) and the format is relatively simple for object files (as opposed to executable images). Object files do not need an Optional Header, a PE signature, or an import directory. They need only: a COFF file header, section headers, raw section data (.text, .data, .rdata), a symbol table, relocations, and a string table.
+
+## What Changes
+
+- New TML module `compiler/native/coff_emit.tml` — COFF file header, section header, symbol, relocation structures and writer
+- New TML module `compiler/native/obj_writer.tml` — top-level `write_obj(MachModule) -> Buffer` that orchestrates all COFF components
+- CLI: `--backend=native` flag (stubbed in 18a) now drives the full pipeline on Windows
+- No changes to the LLVM backend
+
+## Design Decisions
+
+**Use LLD for linking (Phase 18)**: The existing LLD linker in `compiler/src/backend/lld_linker.cpp` accepts standard COFF .obj files. Phase 18c targets LLD compatibility. A custom linker (ERA 3) replaces LLD later. This means phase 18c gets a working end-to-end system immediately, without waiting for linker work.
+
+**.pdata section (Windows SEH)**: Windows requires `.pdata` with RUNTIME_FUNCTION entries for any function that modifies RSP (i.e., every non-leaf function). Without .pdata, stack unwinding fails and C++ exceptions / debuggers cannot walk the stack. Phase 18c emits minimal RUNTIME_FUNCTION entries, each pointing to an UNWIND_INFO record (in `.xdata`) whose unwind codes describe the fixed prologue.
+
+**Relocation types**: Two types cover Phase 18's needs. IMAGE_REL_AMD64_REL32 (0x0004) patches CALL rel32 instructions that reference external symbols. IMAGE_REL_AMD64_ADDR64 (0x0001) patches 64-bit absolute addresses for global data. Both are standard and supported by LLD and MSVC link.exe.
+
+**String table for long names**: COFF symbol names are 8 bytes. Names longer than 8 bytes use the string table format: the Name field contains 0x00000000 followed by a 4-byte offset into the string table that follows the symbol table. All TML function names (which include module paths) will likely exceed 8 bytes.
+
+## Impact
+
+- Affected specs: docs/specs/native-backend.md (object file layout section)
+- Affected code: compiler/native/coff_emit.tml (new), compiler/native/obj_writer.tml (new), compiler/src/cli/commands/build.cpp (--backend=native routing)
+- Breaking change: NO — native backend is opt-in, LLVM path unchanged
+- User benefit: `tml build --backend=native` produces working executables on Windows without LLVM
+
+## Risk
+
+MEDIUM. The COFF format has strict byte-level layout requirements. Off-by-one errors in section offsets or symbol table pointers cause LLD to reject the object with cryptic errors. The integration test (task 6.1) is the primary correctness signal. Testing against both LLD and MSVC link.exe (task 6.3) provides confidence in format correctness.
+
+## Reference
+
+- Microsoft PE/COFF Specification v11.0 — authoritative byte-level format definition
+- LLVM lib/MC/WinCOFFObjectWriter.cpp — reference implementation
+- chibicc codegen.c: code generation reference only (chibicc emits assembly text for ELF targets and does not ship a COFF writer)
+- Windows SDK winnt.h — IMAGE_SECTION_HEADER, IMAGE_SYMBOL, IMAGE_RELOCATION definitions
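A minimal sketch of the long-name rule from the string table decision above, in Python for illustration rather than TML. The function names `encode_name` and `finish_strtab` and the sample symbol names are hypothetical. String table offsets are measured from the start of the table, whose first 4 bytes hold its total size, so the first string lands at offset 4.

```python
import struct


def encode_name(name, strtab):
    """Return the 8-byte COFF symbol Name field.

    Short names (up to 8 bytes) are stored inline and zero-padded.
    Longer names are appended to `strtab` (a bytearray holding the
    string data without the 4-byte size prefix) and referenced as
    four zero bytes followed by a 4-byte little-endian offset.
    """
    raw = name.encode("utf-8")
    if len(raw) <= 8:
        return raw.ljust(8, b"\x00")
    offset = 4 + len(strtab)          # skip the size prefix
    strtab.extend(raw + b"\x00")      # strings are NUL-terminated
    return struct.pack("<II", 0, offset)


def finish_strtab(strtab):
    """Prefix the string data with its total size (including the prefix)."""
    return struct.pack("<I", 4 + len(strtab)) + bytes(strtab)


strtab = bytearray()
print(encode_name("main", strtab))              # inline, padded to 8 bytes
print(encode_name("app::utils::run", strtab))   # zeroes + offset 4
print(finish_strtab(strtab))
```

A name of exactly 8 bytes is stored inline with no terminator, so readers must not assume the field is NUL-terminated.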