Commit e9d2c41

Andre Ferreira and claude committed
chore(tasks): Phase 16 codegen + Phase 17 bootstrap — 7 tasks, 158 items, ERA 1 COMPLETE
Phase 16 (ERA 1, Phase 4) — MIR→LLVM IR codegen (~76K LOC C++ → ~49K TML):
- phase16a: Types & declarations — struct layouts, function sigs (25 items)
- phase16b: Instructions — arithmetic, memory, control flow (25 items)
- phase16c: Calls & ABI — method dispatch, sret, Win64/SysV (25 items)
- phase16d: Legacy LLVM codegen — builtins, drop glue, remaining (25 items)

Phase 17 (ERA 1, Phase 5) — Bootstrap & self-hosting:
- phase17a: Query system — demand-driven pipeline (18 items)
- phase17b: CLI & tooling — dispatcher, diagnostics, test runner, formatter (24 items)
- phase17c: Bootstrap verification — Stage 0→1→2, IR-diff, TML COMPILES ITSELF (16 items)

All 7 tasks include proposals (49-98 lines each).
ERA 1 is now FULLY PLANNED: 25 tasks, 544 checklist items.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 09f9850 commit e9d2c41

22 files changed

Lines changed: 945 additions & 1 deletion

.rulebook/tasks/TASKS-INDEX.md

Lines changed: 26 additions & 1 deletion
@@ -1,7 +1,7 @@
11
# TML Project — Task Index
22

33
**Last updated**: 2026-04-05
4-
**Active tasks**: 39 | **Archived**: 5+
4+
**Active tasks**: 46 | **Archived**: 5+
55

66
---
77

@@ -171,6 +171,31 @@ Port HIR, THIR, MIR builder, and 52 MIR optimization passes from C++ to TML.
171171

172172
**Order**: 15a → 15b → 15c → 15d (sequential)
173173

174+
## Phase 16 — Codegen in TML (ERA 1, Phase 4)
175+
176+
Port MIR→LLVM IR text generation (~76K LOC C++) to TML. Largest subsystem — output is text, easy to verify.
177+
178+
| ID | Task | Status | Priority | Progress |
|----|------|--------|----------|----------|
| 16a | [Types & Declarations](phase16a_codegen-types-decls/) | Planned | P0 | 0/25 |
| 16b | [Instructions](phase16b_codegen-instructions/) | Planned | P0 | 0/25 |
| 16c | [Calls & ABI](phase16c_codegen-calls-abi/) | Planned | P0 | 0/25 |
| 16d | [Legacy LLVM Codegen](phase16d_codegen-legacy-llvm/) | Planned | P0 | 0/25 |
184+
185+
**Order**: 16a → 16b → 16c → 16d (sequential, 16a/16b can partially overlap)
186+
187+
## Phase 17 — Bootstrap (ERA 1, Phase 5) 🎯 SELF-HOSTING
188+
189+
Wire everything together, port tooling, execute three-stage bootstrap verification. **ERA 1 COMPLETE when phase17c passes.**
190+
191+
| ID | Task | Status | Priority | Progress |
|----|------|--------|----------|----------|
| 17a | [Query System](phase17a_query-system/) | Planned | P0 | 0/18 |
| 17b | [CLI & Tooling](phase17b_cli-tooling/) | Planned | P0 | 0/24 |
| 17c | [Bootstrap Verification](phase17c_bootstrap-verification/) | Planned | P0 | 0/16 |
196+
197+
**Order**: 17a → 17b → 17c (sequential). 17c = **TML COMPILES ITSELF**
198+
174199
## Research
175200

176201
| ID | Task | Status | Priority | Progress |
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
{
2+
"status": "pending",
3+
"createdAt": "2026-04-06T01:25:21.403Z",
4+
"updatedAt": "2026-04-06T01:25:21.403Z"
5+
}
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
1+
# Proposal: Codegen Types & Declarations — Rewrite in TML
2+
3+
## Why
4+
5+
The MIR codegen subsystem is the final C++ layer standing between an optimized MirModule and the
6+
LLVM IR text that LLVM compiles to native code. Its entry point (`mir_codegen.cpp`, 1,622 LOC) and
7+
type emission layer (`mir_types.cpp`, `llvm_types.cpp`, 1,207 LOC) are the foundation on which all
8+
instruction and call emission depends. Types that are laid out incorrectly corrupt every instruction
9+
that reads or writes a value of that type. Porting the type and declaration layer first establishes
10+
a verified foundation before tackling instructions and calls in phases 16b and 16c.
11+
12+
## What Changes
13+
14+
The C++ type emission code in `compiler/src/codegen/mir_codegen.cpp`, `mir/mir_types.cpp`, and
15+
`llvm/core/llvm_types.cpp` is replaced by a TML implementation in `compiler-tml/src/codegen/`.
16+
Function signature emission from `llvm/decl/func.cpp` (1,351 LOC) and impl/vtable emission from
17+
`llvm/decl/impl.cpp` (1,336 LOC) are also ported here, since they depend only on the type layer.
18+
19+
### Architecture
20+
21+
```
compiler-tml/src/codegen/
  mod.tml         — re-exports Codegen, CodegenConfig, emit_module()
  config.tml      — CodegenConfig: target triple, data layout, opt level
  types.tml       — LlvmType enum: I1..I64, F32/F64, Ptr, Struct, Array, Func, Void
  layout.tml      — LayoutComputer: size/alignment/field-offsets per MirType
  emit_type.tml   — emit_type(MirType) -> Text: MIR type → LLVM IR type string
  emit_func.tml   — emit_func_decl(MirFunc) -> Text: define/declare line + sret/byval
  emit_module.tml — emit_module(MirModule) -> Text: complete LLVM IR file
```
31+
32+
### Key Design Decisions
33+
34+
- **Text output via template literals** — all IR emission uses TML template literals
35+
(`` `define fastcc i64 @{name}({params}) {` ``) rather than string concatenation. This matches
36+
how the C++ code builds IR and keeps emission code readable and diffable.
37+
- **Type layout must be byte-for-byte identical to C++** — the `LayoutComputer` in `layout.tml`
38+
replicates the exact field-padding rules from `llvm_types.cpp`. Any divergence corrupts sret
39+
slot sizes, GEP indices, and struct constructor IR. Tests assert field offsets directly.
40+
- **Opaque pointer model** — the TML codegen targets LLVM 15+ opaque pointers. All pointer types
41+
emit as `"ptr"` regardless of pointee type. This simplifies the type layer significantly
42+
compared to the typed-pointer LLVM IR the legacy codegen sometimes emits.
43+
- **Named struct deduplication** — each struct name is emitted as a `%struct.Name = type { ... }`
44+
definition exactly once at the top of the module. A `HashMap[Str, Bool]` tracks already-emitted
45+
structs to prevent duplicate definitions, which are LLVM IR errors.
46+
- **sret for large return types** — structs larger than 16 bytes use the sret convention: the
47+
caller allocates a stack slot and passes its address as the first argument annotated
48+
`ptr sret(%struct.T) align 8`. The callee writes the result there and returns void. The
49+
`emit_func_decl` function computes this from the layout, matching `func.cpp` exactly.
50+
- **Runtime declarations on demand** — instead of emitting all 500+ runtime function declarations
51+
unconditionally (as the C++ legacy codegen does), the TML emitter tracks which extern functions
52+
the module actually calls and emits only those `declare` lines. This reduces IR file size and
53+
speeds up LLVM parsing.
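
The struct deduplication and on-demand declaration decisions above can be sketched together. This is a minimal illustration in Python rather than TML, and the class name `DeclarationTracker`, its methods, and the runtime function names `tml_alloc` and `tml_free` are all hypothetical; the real emitter may organize this differently.

```python
class DeclarationTracker:
    """Collects struct definitions and extern declarations, each emitted once."""

    def __init__(self, runtime_signatures):
        # Hypothetical table: runtime function name -> full `declare` line.
        self.runtime_signatures = runtime_signatures
        # Mirrors the proposal's HashMap[Str, Bool] of already-emitted structs.
        self.emitted_structs = {}
        self.struct_lines = []
        self.used_externs = []  # insertion-ordered, no duplicates

    def define_struct(self, name, field_types):
        if self.emitted_structs.get(name):
            return  # a second definition would be an LLVM IR error
        self.emitted_structs[name] = True
        fields = ", ".join(field_types)
        self.struct_lines.append(f"%struct.{name} = type {{ {fields} }}")

    def note_call(self, callee):
        # Record only runtime functions the module actually calls.
        if callee in self.runtime_signatures and callee not in self.used_externs:
            self.used_externs.append(callee)

    def render(self):
        declares = [self.runtime_signatures[n] for n in self.used_externs]
        return "\n".join(self.struct_lines + declares) + "\n"


tracker = DeclarationTracker({
    "tml_alloc": "declare ptr @tml_alloc(i64)",   # hypothetical runtime function
    "tml_free": "declare void @tml_free(ptr)",    # hypothetical runtime function
})
tracker.define_struct("Point", ["i64", "i64"])
tracker.define_struct("Point", ["i64", "i64"])  # ignored: already emitted
tracker.note_call("tml_alloc")
print(tracker.render(), end="")
```

Because `tml_free` is never called in this example, no `declare` line is produced for it.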
54+
55+
## Impact
56+
57+
- Affected code: `compiler/src/codegen/mir_codegen.cpp`, `mir/mir_types.cpp`,
58+
`llvm/core/llvm_types.cpp`, `llvm/decl/func.cpp`, `llvm/decl/impl.cpp` (all replaced)
59+
- Affected phases: 16b (instructions call `emit_type`), 16c (calls use sret/byval decisions)
60+
- Breaking change: NO — IR-diff testing ensures identical type strings and function signatures
61+
- User benefit: self-hosting progress; type layout logic is inspectable and modifiable in TML
62+
63+
## Success Criteria
64+
65+
The TML type emitter produces LLVM IR struct definitions and function declaration lines that are
66+
character-identical to C++ codegen output for all stdlib modules. The `LayoutComputer` produces
67+
field offsets that match C++ for all 40+ named struct types in the stdlib. IR-diff on 5 stdlib
68+
modules shows zero differences in the declarations section.
69+
70+
## Dependencies
71+
72+
- **Requires**: phase15d (MirModule with MirType, MirFunc available in TML)
73+
- **Blocks**: phase16b (instructions need `emit_type`), phase16c (calls need sret decisions)
74+
- **Risk**: Medium — type layout errors are silent but fatal; mitigated by per-struct layout
75+
unit tests that assert field offsets before any full-module IR-diff testing begins.
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
1+
# Tasks: Codegen Types & Declarations — Rewrite in TML
2+
3+
**Status**: Planned (0/25)
4+
**Depends on**: phase15d (optimized MirModule available in TML)
5+
**Blocks**: phase16b (instructions need type emission), phase16c (calls need ABI/type layer)
6+
**Duration**: 4–6 weeks
7+
**Risk**: Medium — type layouts must match C++ exactly; layout errors corrupt all downstream IR
8+
**C++ reference**: ~8K LOC → ~5.2K TML
9+
10+
---
11+
12+
## Phase 1: Module & File Structure (3 items)
13+
14+
- [ ] 1.1 Create `compiler-tml/src/codegen/mod.tml`: module root, re-exports `Codegen`, `CodegenConfig`, `emit_module()`
- [ ] 1.2 Create `compiler-tml/src/codegen/types.tml`: `LlvmType` enum with variants `I1`, `I8`, `I16`, `I32`, `I64`, `F32`, `F64`, `Ptr`, `Struct(List[LlvmType])`, `Array(LlvmType, I64)`, `Func(List[LlvmType], LlvmType)`, `Void`
- [ ] 1.3 Create `compiler-tml/src/codegen/config.tml`: `CodegenConfig` struct holding target triple, data layout string, optimization level, release flag
17+
18+
## Phase 2: Type Emission (6 items)
19+
20+
- [ ] 2.1 Create `compiler-tml/src/codegen/emit_type.tml`: `emit_type(t: MirType) -> Text` converting MIR types to LLVM IR type strings
- [ ] 2.2 Implement primitive types: `I64` → `"i64"`, `I32` → `"i32"`, `Bool` → `"i1"`, `F64` → `"double"`, `Unit` → `"{}"`, `Str` → `"ptr"`
- [ ] 2.3 Implement aggregate types: struct → `"%struct.Name"` named reference, tuple → `"{ i64, i64, ... }"` inline, array → `"[N x T]"`
- [ ] 2.4 Implement pointer and reference types: `Ref[T]` → `"ptr"`, `MutRef[T]` → `"ptr"`, raw pointer → `"ptr"` (opaque pointer model, LLVM 15+)
- [ ] 2.5 Implement function pointer types: `func(A, B) -> C` → `"ptr"` in opaque model; emit full signature only in function declarations
- [ ] 2.6 Implement Maybe/Outcome layout: `Maybe[T]` → `{ i32, T_padded }` matching C++ `maybe_layout()` byte-for-byte; `Outcome[T,E]` → `{ i32, union(T,E) }`
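
The mappings in items 2.2 to 2.5 can be sketched as a small recursive function. This is an illustrative Python sketch, not the TML implementation: the tuple encoding of MIR types (`("prim", "I64")`, `("struct", "Name")`, and so on) is an assumption made for the example, and the Maybe/Outcome layouts from 2.6 are left out because their exact padding rules are not given here.

```python
# Primitive mappings taken from item 2.2.
PRIMITIVES = {
    "I64": "i64", "I32": "i32", "Bool": "i1",
    "F64": "double", "Unit": "{}", "Str": "ptr",
}

def emit_type(t):
    """Map a MIR type (hypothetical tuple encoding) to an LLVM IR type string."""
    kind = t[0]
    if kind == "prim":
        return PRIMITIVES[t[1]]
    if kind in ("ref", "mutref", "rawptr", "func"):
        return "ptr"  # opaque pointer model: the pointee type is never spelled out
    if kind == "struct":
        return f"%struct.{t[1]}"  # named reference, defined once elsewhere
    if kind == "tuple":
        elems = [emit_type(e) for e in t[1]]
        return "{ " + ", ".join(elems) + " }" if elems else "{}"
    if kind == "array":
        elem, count = t[1], t[2]
        return f"[{count} x {emit_type(elem)}]"
    raise ValueError(f"unhandled MIR type kind: {kind}")


print(emit_type(("tuple", [("prim", "I64"), ("ref", ("prim", "I32"))])))
print(emit_type(("array", ("struct", "Point"), 4)))
```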
26+
27+
## Phase 3: Struct Layout Computation (4 items)
28+
29+
- [ ] 3.1 Create `compiler-tml/src/codegen/layout.tml`: `LayoutComputer` struct computing size/alignment for each `MirType`
- [ ] 3.2 Implement primitive sizes: I8=1, I16=2, I32=4, I64=8, F32=4, F64=8, Bool=1, pointer=8 (x86_64)
- [ ] 3.3 Implement struct layout: iterate fields, insert padding bytes to meet field alignment, record field offsets; total size rounded up to struct alignment
- [ ] 3.4 Emit named struct type definitions such as `%struct.Foo = type { i64, i32, [4 x i8] }`; emit each struct exactly once, deduplicated by name
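
The layout rule in item 3.3 can be sketched as follows. This Python sketch assumes each primitive's alignment equals its size (natural alignment); the actual rules in `llvm_types.cpp` are not shown in this commit and may differ, so treat the function name `compute_layout` and its return shape as hypothetical.

```python
# Primitive sizes in bytes from item 3.2 (x86_64).
SIZES = {"I8": 1, "I16": 2, "I32": 4, "I64": 8,
         "F32": 4, "F64": 8, "Bool": 1, "Ptr": 8}

def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment

def compute_layout(field_types):
    """Return (field_offsets, total_size, struct_alignment) for primitive fields."""
    offset = 0
    struct_align = 1
    offsets = []
    for ty in field_types:
        size = SIZES[ty]
        align = size  # assumption: natural alignment
        offset = align_up(offset, align)  # padding before the field
        offsets.append(offset)
        offset += size
        struct_align = max(struct_align, align)
    total = align_up(offset, struct_align)  # tail padding
    return offsets, total, struct_align


print(compute_layout(["I64", "I32"]))
print(compute_layout(["I8", "I32", "I8"]))
```

Under these assumptions a struct with an `I64` and an `I32` field occupies 16 bytes, 4 of which are tail padding. That is consistent with the explicit `[4 x i8]` member in the `%struct.Foo` example in item 3.4.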
33+
34+
## Phase 4: Function Signature Emission (5 items)
35+
36+
- [ ] 4.1 Create `compiler-tml/src/codegen/emit_func.tml`: `emit_func_decl(f: MirFunc, cfg: CodegenConfig) -> Text` producing the `define`/`declare` line
- [ ] 4.2 Implement calling convention annotation: `cc` field on MirFunc → `fastcc`, `ccc`, `win64cc` strings prepended to `define`
- [ ] 4.3 Implement sret parameter: if return type is large struct, prepend `ptr sret(%struct.Name) align 8 %sret_slot` as first parameter
- [ ] 4.4 Implement byval parameter: struct args ≤ 16 bytes passed by value → `byval(%struct.Name) align 8` annotation
- [ ] 4.5 Implement function attributes: `nounwind`, `uwtable`, `alwaysinline`, `noinline` emitted from MirFunc attribute set
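
The sret rewrite in item 4.3 could look roughly like this. The sketch is in Python with a hypothetical `emit_func_header` function; it uses the 16-byte threshold stated in the proposal, hardcodes `align 8` as in the task text, and ignores byval and function attributes.

```python
SRET_THRESHOLD_BYTES = 16  # structs larger than this are returned via sret

def emit_func_header(name, cc, params, ret_ty, ret_size):
    """Build a `define` line.

    params is a list of (llvm_type, param_name) pairs; ret_size is the
    return type's size in bytes as computed by the layout step.
    """
    rendered = [f"{ty} %{pname}" for ty, pname in params]
    if ret_ty.startswith("%struct.") and ret_size > SRET_THRESHOLD_BYTES:
        # The caller passes a stack slot; the callee writes there and returns void.
        rendered.insert(0, f"ptr sret({ret_ty}) align 8 %sret_slot")
        ret_ty = "void"
    return f"define {cc} {ret_ty} @{name}({', '.join(rendered)}) {{"


print(emit_func_header("make_big", "fastcc", [("i64", "x")], "%struct.Big", 24))
print(emit_func_header("add", "fastcc", [("i64", "a"), ("i64", "b")], "i64", 8))
```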
41+
42+
## Phase 5: Module-Level Declarations (4 items)
43+
44+
- [ ] 5.1 Create `compiler-tml/src/codegen/emit_module.tml`: `emit_module(m: MirModule, cfg: CodegenConfig) -> Text` producing complete LLVM IR text
- [ ] 5.2 Emit module header: `; ModuleID = 'file.tml'\nsource_filename = "..."\ntarget datalayout = "..."\ntarget triple = "..."\n`
- [ ] 5.3 Emit runtime declarations: `declare` lines for every `@extern("c")` function used in the module; emit only what the module actually uses (not all 500+ runtime functions)
- [ ] 5.4 Emit global constants and string literals: `@str.0 = private unnamed_addr constant [N x i8] c"...\00"` for each unique string in the module
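
Item 5.4 involves two details that are easy to get wrong: `N` must count the trailing NUL byte, and bytes outside printable ASCII (plus `"` and `\`) need `\XX` hex escapes. A minimal Python sketch follows, assuming UTF-8 encoding and first-seen numbering; the helper names are hypothetical.

```python
def escape_llvm_bytes(data):
    """Escape bytes for an LLVM c"..." constant using \\XX hex escapes."""
    out = []
    for b in data:
        printable = 0x20 <= b < 0x7F
        if printable and b not in (0x22, 0x5C):  # not '"' and not '\'
            out.append(chr(b))
        else:
            out.append(f"\\{b:02X}")
    return "".join(out)

def emit_string_globals(strings):
    """Emit one @str.N global per unique string, numbered in first-seen order."""
    indices = {}
    lines = []
    for s in strings:
        if s in indices:
            continue  # each unique string gets exactly one global
        idx = len(indices)
        indices[s] = idx
        data = s.encode("utf-8") + b"\x00"  # length includes the NUL terminator
        lines.append(
            f'@str.{idx} = private unnamed_addr constant '
            f'[{len(data)} x i8] c"{escape_llvm_bytes(data)}"'
        )
    return lines, indices


lines, _ = emit_string_globals(["hi\n", "ok", "hi\n"])
print("\n".join(lines))
```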
48+
49+
## Phase 6: Differential Testing (3 items)
50+
51+
- [ ] 6.1 Create `compiler-tml/tests/codegen/types.test.tml` — unit tests: for each MirType variant, `emit_type(t)` must equal expected LLVM IR string
52+
- [ ] 6.2 Create `compiler-tml/tests/codegen/layout.test.tml` — struct layout tests: compute layout of 10 stdlib structs, assert field offsets match C++ `llvm_types.cpp` output
53+
- [ ] 6.3 IR-diff: compile 5 stdlib modules through TML type/decl emitter → compare struct definitions and function declarations against C++ codegen output line-by-line
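
The line-by-line comparison in item 6.3 can be approximated with Python's standard `difflib`. This is a sketch only: the prefix filter used to isolate struct definitions and function declaration lines is an assumption about how the declarations section would be selected, and both function names are hypothetical.

```python
import difflib

def declaration_lines(ir_text):
    """Keep only struct definitions and define/declare lines."""
    prefixes = ("%struct.", "declare ", "define ")
    return [ln.rstrip() for ln in ir_text.splitlines() if ln.startswith(prefixes)]

def ir_diff(cpp_ir, tml_ir):
    """Return unified-diff lines; an empty list means the declarations match."""
    return list(difflib.unified_diff(
        declaration_lines(cpp_ir),
        declaration_lines(tml_ir),
        fromfile="cpp",
        tofile="tml",
        lineterm="",
    ))


cpp_ir = "%struct.P = type { i64, i64 }\ndefine fastcc i64 @f(i64 %x) {\n  ret i64 %x\n}\n"
tml_ir = cpp_ir.replace("{ i64, i64 }", "{ i64, i32 }")
for line in ir_diff(cpp_ir, tml_ir):
    print(line)
```

Function bodies are filtered out here, so this check exercises only the type and declaration layer that this phase ports.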
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
{
2+
"status": "pending",
3+
"createdAt": "2026-04-06T01:25:21.854Z",
4+
"updatedAt": "2026-04-06T01:25:21.854Z"
5+
}
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
1+
# Proposal: Codegen Instructions — Rewrite in TML
2+
3+
## Why
4+
5+
The instruction emission layer translates each MIR instruction into one or more LLVM IR text lines.
6+
It is the highest-volume code in the codegen subsystem — arithmetic, memory, control flow, and
7+
aggregate operations together account for the majority of all IR output. The C++ implementation is
8+
spread across `compiler/src/codegen/mir/instructions.cpp`, `instructions_misc.cpp`, and five files
9+
in `llvm/expr/` and `llvm/control/` totaling approximately 12K LOC. Porting this layer to TML
10+
completes the bulk of the MIR codegen path and enables IR-diff testing on realistic programs. It
11+
builds directly on the type emission layer from phase16a.
12+
13+
## What Changes
14+
15+
The C++ instruction emission files are replaced by a TML implementation in
16+
`compiler-tml/src/codegen/emit_inst.tml`. The complete MirInst enum (40+ variants) is handled by
17+
a single dispatch function that returns a `Text` fragment for each instruction. Basic block
18+
iteration and function body assembly remain in `emit_func.tml` (phase16a).
19+
20+
### Architecture
21+
22+
```
compiler-tml/src/codegen/
  emit_inst.tml — InstructionEmitter: emit(MirInst) -> Text
                    arithmetic, comparison, bitwise (Phase 2)
                    alloca, load, store, GEP (Phase 3)
                    br, cond_br, switch, ret (Phase 4)
                    extractvalue, insertvalue, phi, select (Phase 5)
                    zext, sext, trunc, ptrtoint, inttoptr, bitcast,
                    fpext, fptrunc, fptosi, sitofp (Phase 6)
```
32+
33+
### Key Design Decisions
34+
35+
- **One Text per instruction**: `emit(inst: MirInst) -> Text` returns the full IR line including
  leading spaces and trailing newline. The caller joins all instruction texts with no separator.
  Template literals make each case readable: `` ` %{reg} = add nsw {ty} %{a}, %{b}\n` ``.
38+
- **nsw on integer arithmetic** — all signed integer arithmetic emits `nsw` (no signed wrap)
39+
flags, matching the C++ default. This enables LLVM to apply algebraic optimizations. The
40+
`nsw` flag is omitted only for explicitly wrapping operations (future intrinsics).
41+
- **Ordered float predicates** — all FCmp uses ordered predicates (`oeq`, `olt`, etc.) matching
42+
the C++ codegen. Unordered predicates are not emitted unless the MIR instruction carries an
43+
explicit `unordered` flag, which no current TML code generates.
44+
- **GEP inbounds** — all GEP instructions emit `inbounds` matching the C++ output. This is safe
45+
because TML's borrow checker guarantees no out-of-bounds access at the TML level. The inbounds
46+
annotation enables LLVM's alias analysis.
47+
- **instruction-by-instruction IR-diff** — the differential testing strategy compares individual
48+
instruction outputs rather than whole-function IR. This lets early failures pinpoint exactly
49+
which MIR instruction variant is emitting wrong text, without requiring full-program compilation.
50+
51+
### Instruction → LLVM IR Mapping (summary)
52+
53+
| MIR Instruction | LLVM IR |
|---|---|
| `Add(nsw, a, b)` | `%r = add nsw i64 %a, %b` |
| `ICmp(Eq, a, b)` | `%r = icmp eq i64 %a, %b` |
| `Alloca(T)` | `%r = alloca T, align A` |
| `Load(T, addr)` | `%r = load T, ptr %addr, align A` |
| `Store(val, addr)` | `store T %val, ptr %addr, align A` |
| `GEP(base, T, [0, N])` | `%r = getelementptr inbounds T, ptr %base, i32 0, i32 N` |
| `Br(bb)` | `br label %bb` |
| `CondBr(c, t, f)` | `br i1 %c, label %t, label %f` |
| `Switch(v, d, cases)` | `switch i64 %v, label %d [ ... ]` |
| `Ret(v)` | `ret i64 %v` |
| `ExtractValue(agg, N)` | `%r = extractvalue { ... } %agg, N` |
| `InsertValue(agg, v, N)` | `%r = insertvalue { ... } %agg, T %v, N` |
| `Phi([(v1,bb1),...])` | `%r = phi T [ %v1, %bb1 ], ...` |
| `Select(c, t, f)` | `%r = select i1 %c, T %t, T %f` |
| `ZExt(v, T)` | `%r = zext i32 %v to T` |
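
A few rows of the mapping can be sketched as a single dispatch function that returns one text fragment per instruction. This is an illustrative Python sketch, not the TML `InstructionEmitter`: the dict encoding of instructions, the `wrapping` key, and the two-space indent are assumptions made for the example.

```python
def emit(inst):
    """Return the full IR line for one instruction, including trailing newline."""
    op = inst["op"]
    if op == "add":
        dst, ty, a, b = inst["dst"], inst["ty"], inst["a"], inst["b"]
        nsw = "" if inst.get("wrapping") else " nsw"  # nsw unless explicitly wrapping
        return f"  %{dst} = add{nsw} {ty} %{a}, %{b}\n"
    if op == "icmp":
        dst, pred, ty = inst["dst"], inst["pred"], inst["ty"]
        return f"  %{dst} = icmp {pred} {ty} %{inst['a']}, %{inst['b']}\n"
    if op == "load":
        dst, ty, addr, align = inst["dst"], inst["ty"], inst["addr"], inst["align"]
        return f"  %{dst} = load {ty}, ptr %{addr}, align {align}\n"
    if op == "store":
        ty, val, addr, align = inst["ty"], inst["val"], inst["addr"], inst["align"]
        return f"  store {ty} %{val}, ptr %{addr}, align {align}\n"
    if op == "gep":
        indices = ", ".join(f"i32 {i}" for i in inst["indices"])
        return f"  %{inst['dst']} = getelementptr inbounds {inst['ty']}, ptr %{inst['base']}, {indices}\n"
    if op == "cond_br":
        return f"  br i1 %{inst['cond']}, label %{inst['then']}, label %{inst['else']}\n"
    if op == "ret":
        return f"  ret {inst['ty']} %{inst['val']}\n"
    raise NotImplementedError(op)


body = "".join(emit(i) for i in [
    {"op": "add", "dst": "r", "ty": "i64", "a": "a", "b": "b"},
    {"op": "ret", "ty": "i64", "val": "r"},
])
print(body, end="")
```

Since every fragment carries its own newline, the caller can join them with an empty separator, as the "one Text per instruction" decision describes.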
70+
71+
## Impact
72+
73+
- Affected code: `compiler/src/codegen/mir/instructions.cpp`, `instructions_misc.cpp`,
74+
`llvm/expr/binary.cpp`, `llvm/expr/binary_ops.cpp`, `llvm/control/when.cpp`,
75+
`llvm/expr/struct_field.cpp`, `llvm/expr/llvm_struct_expr.cpp` (all replaced)
76+
- Affected phases: 16c (calls extend this layer with call/invoke instructions)
77+
- Breaking change: NO — IR-diff testing ensures instruction-identical output
78+
- User benefit: self-hosting progress; every IR instruction inspectable in TML
79+
80+
## Success Criteria
81+
82+
`emit(inst)` produces LLVM IR text that is character-identical to C++ output for all 40+ MIR
83+
instruction variants. IR-diff on 10 stdlib functions shows zero instruction differences.
84+
85+
## Dependencies
86+
87+
- **Requires**: phase16a (emit_type, LayoutComputer, register naming infrastructure)
88+
- **Blocks**: phase16c (call emission extends InstructionEmitter)
89+
- **Risk**: Medium — large number of instruction variants, but each is mechanically straightforward.
90+
The main risk is alignment values diverging from C++ layout rules; mitigated by phase16a layout
91+
tests that verify field offsets before instruction emission begins.
