Commit d2cb708

willamhou, claude, and happy-otter committed
docs: add top-level ARCHITECTURE.md for external contributors
Concise (~230 lines) architecture overview covering privilege model, boot flow, core abstractions, 5 key design decisions with rationale, memory architecture, FF-A/Secure World, and source tree by subsystem. Complements README (features/quick start) and docs/architecture.md (exhaustive internals) without duplicating either.

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
1 parent 8bc71f4 commit d2cb708

1 file changed

Lines changed: 226 additions & 0 deletions

ARCHITECTURE.md

@@ -0,0 +1,226 @@
1+
# Architecture
2+
3+
A bare-metal Type-1 hypervisor for ARM64 in Rust (`no_std`). One codebase, two personalities: an NS-EL2 hypervisor that boots Linux, or an S-EL2 SPMC that manages Secure Partitions alongside Android pKVM. ~7,700 LOC, single external dependency (`fdt`), 34 test suites.
4+
5+
For quick start and features, see [README.md](README.md).
6+
For exhaustive internals, see [docs/architecture.md](docs/architecture.md).
7+
8+
---
9+
10+
## Privilege Model
11+
12+
The hypervisor operates in one of two compile-time modes:
13+
14+
```
15+
NS-EL2 Mode (make run-linux) S-EL2 Mode (make run-spmc)
16+
────────────────────────────── ──────────────────────────────
17+
EL3 │ TF-A BL31 + SPMD
18+
│ (world switch, SMC relay)
19+
────┼─────────────────────────
20+
EL2 │ This hypervisor S-EL2│ This hypervisor (SPMC)
21+
│ (exception handling, │ (FF-A dispatch, SP lifecycle,
22+
│ Stage-2 MMU, GIC virt) │ Secure Stage-2)
23+
────┼───────────────────────── S-EL1│ SP1 Hello, SP2 IRQ, SP3 Relay
24+
EL1 │ Linux / Zephyr guest ────┼─────────────────────────
25+
│ NS-EL2│ pKVM (protected hVHE)
26+
NS-EL1│ Linux / Android guest
27+
```
28+
29+
Both modes build from the same Rust codebase: `#[cfg(feature = "sel2")]` selects the entry point, linker script, and event loop. The two modes share ~70% of the code (MMU, GIC, exception handling, FF-A protocol).
30+
31+
---
32+
33+
## Boot Flow
34+
35+
### NS-EL2 (QEMU `-kernel`)
36+
37+
```
38+
boot.S Save DTB addr (x0→x20), set up stack, clear BSS
39+
40+
rust_main() Parse DTB (fdt crate, zero-copy, before heap)
41+
│ Install exception vectors (VBAR_EL2)
42+
│ Configure HCR_EL2 (trap WFI, SMC, MMIO)
43+
│ Init GICv3 (GICD enable, List Registers)
44+
│ Init FF-A proxy (probe SPMC at EL3)
45+
│ Init heap (BumpAllocator, 16MB at 0x41000000)
46+
47+
├── make run: Run 34 test suites, halt
48+
├── make run-linux: Boot Linux (4 vCPUs, virtio-blk)
49+
└── make run-multi-vm: Boot 2 Linux VMs time-sliced
50+
```
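The "Configure HCR_EL2" step can be sketched as composing a bit mask. The bit positions below come from the Arm architecture definition of `HCR_EL2`; the helper names and the exact set of bits are assumptions for illustration, and the real hypervisor may set more (for example interrupt routing bits).

```rust
// Bit positions from the Arm architecture's HCR_EL2 definition.
const HCR_VM: u64 = 1 << 0; // enable Stage-2 translation (MMIO then faults to EL2)
const HCR_TWI: u64 = 1 << 13; // trap guest WFI to EL2
const HCR_TSC: u64 = 1 << 19; // trap guest SMC to EL2
const HCR_RW: u64 = 1 << 31; // EL1 runs in AArch64 state

/// Hypothetical helper: the value a boot path might write to HCR_EL2.
pub const fn hcr_el2_value() -> u64 {
    HCR_VM | HCR_TWI | HCR_TSC | HCR_RW
}

/// Returns true if the given single-bit mask is set in `hcr`.
pub const fn is_set(hcr: u64, bit: u64) -> bool {
    hcr & bit != 0
}
```

On hardware the value would be written with an `msr hcr_el2, <reg>` instruction followed by an `isb`; that part is omitted here so the sketch builds on any host.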
51+
52+
### S-EL2 (TF-A BL32)
53+
54+
```
55+
BL1 → BL2 → BL31(SPMD) → BL32(us) → BL33(pKVM)
56+
57+
boot_sel2.S Save manifest/HW_CONFIG/core_id from SPMD
58+
59+
rust_main_sel2() Parse SPMC manifest (TOS_FW_CONFIG DTB)
60+
│ Enable S-EL2 Stage-1 MMU (NS=1 for NWd DRAM)
61+
│ Init GIC, Secure Stage-2 for SPs
62+
│ Parse SPKG headers, ERET to SP1 at S-EL1
63+
│ SP1 calls FFA_MSG_WAIT → boot SP2, SP3
64+
│ Register secondary EP (FFA_SECONDARY_EP_REGISTER)
65+
66+
└── FFA_MSG_WAIT → SPMD dispatches NWd requests → loop
67+
```
68+
69+
**Key insight**: DTB parsing uses the `fdt` crate (zero-copy, no allocations),
70+
so it runs *before* the heap is initialized.
71+
72+
---
73+
74+
## Core Abstractions
75+
76+
```
77+
src/
78+
├── vm.rs VM lifecycle, Stage-2 setup, run_smp() scheduler loop
79+
├── vcpu.rs State machine (Uninitialized→Ready→Running→Stopped)
80+
├── scheduler.rs Round-robin vCPU scheduling with block/unblock
81+
├── devices/mod.rs Enum-dispatch MMIO routing (see Design Decisions)
82+
├── ffa/proxy.rs FF-A v1.1 proxy — intercepts guest SMC at NS-EL2
83+
├── spmc_handler.rs S-EL2 SPMC event loop — FF-A dispatch to SPs
84+
├── sp_context.rs Per-SP state, INTID ownership, call stack
85+
├── global.rs Per-VM state arrays, UART RX ring, VSwitch
86+
└── arch/aarch64/
87+
├── exception.S Vector table, context save/restore, enter_guest
88+
└── hypervisor/
89+
├── exception.rs ESR_EL2 decode → exit reason dispatch
90+
└── decode.rs MMIO instruction decode (ISS + raw instruction)
91+
```
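As an illustration of the "round-robin vCPU scheduling with block/unblock" entry above, here is a minimal sketch. The type and method names are hypothetical and do not reflect the actual contents of `scheduler.rs`; `N` is assumed to be at least 1.

```rust
/// Hypothetical round-robin scheduler over N vCPU slots.
pub struct Scheduler<const N: usize> {
    blocked: [bool; N],
    current: usize,
}

impl<const N: usize> Scheduler<N> {
    /// Starts so that the first pick returns vCPU 0.
    pub const fn new() -> Self {
        Self { blocked: [false; N], current: N - 1 }
    }

    /// Marks a vCPU as not runnable (for example after a WFI trap).
    pub fn block(&mut self, id: usize) {
        self.blocked[id] = true;
    }

    /// Marks a vCPU as runnable again (for example when an IRQ is pending).
    pub fn unblock(&mut self, id: usize) {
        self.blocked[id] = false;
    }

    /// Picks the next runnable vCPU after the current one, wrapping around.
    /// Returns None when every vCPU is blocked.
    pub fn pick_next(&mut self) -> Option<usize> {
        for step in 1..=N {
            let id = (self.current + step) % N;
            if !self.blocked[id] {
                self.current = id;
                return Some(id);
            }
        }
        None
    }
}
```

A fixed-size array keeps the structure allocation-free, which fits a `no_std` environment.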
92+
93+
### Exception Handling Flow
94+
95+
```
96+
Guest @ EL1
97+
│ trap (HVC, SMC, MMIO fault, WFI, MSR/MRS)
98+
99+
exception.S ─── save x0-x30, SP_EL1, ELR_EL2, SPSR_EL2
100+
│ (context pointer from TPIDR_EL2)
101+
102+
exception.rs ── read ESR_EL2, extract EC (exception class)
103+
104+
├─ WfiWfe → return to scheduler (block vCPU)
105+
├─ HvcCall → PSCI (CPU_ON/OFF/RESET) or HF_INTERRUPT_GET
106+
├─ SmcCall → FF-A proxy or forward to EL3
107+
├─ DataAbort → HPFAR_EL2 for IPA → DeviceManager MMIO dispatch
108+
├─ SysReg trap → ICC_SGI1R (IPI emulation), timer regs
109+
└─ IRQ → INTID 26 (preemption), 27 (vtimer), 33 (UART)
110+
111+
112+
exception.S ─── advance PC, restore context, ERET back to guest
113+
```
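The ESR_EL2 decode step in the flow above can be sketched as follows. The exception class (EC) field position and values are taken from the Arm architecture; the `ExitReason` variants mirror the names in the diagram, but their exact shape here is an assumption. IRQs arrive through a separate vector entry rather than an EC value, so they are not shown.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ExitReason {
    WfiWfe,
    HvcCall { imm: u16 },
    SmcCall { imm: u16 },
    SysReg,
    DataAbort,
    Unknown(u8),
}

/// Decodes ESR_EL2: EC lives in bits [31:26], the ISS in bits [24:0].
pub fn decode_exit(esr: u64) -> ExitReason {
    let ec = ((esr >> 26) & 0x3f) as u8;
    let iss = esr & 0x01ff_ffff;
    match ec {
        0x01 => ExitReason::WfiWfe, // trapped WFI/WFE
        0x16 => ExitReason::HvcCall { imm: (iss & 0xffff) as u16 }, // HVC from AArch64
        0x17 => ExitReason::SmcCall { imm: (iss & 0xffff) as u16 }, // SMC from AArch64
        0x18 => ExitReason::SysReg, // trapped MSR/MRS
        0x24 => ExitReason::DataAbort, // data abort from a lower EL
        other => ExitReason::Unknown(other),
    }
}
```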
114+
115+
**Critical detail**: For MMIO, `FAR_EL2` holds the guest *virtual* address.
116+
The guest *physical* address (IPA) comes from `HPFAR_EL2`:
117+
`IPA = ((HPFAR_EL2 & 0xFFFFFFFFF0) << 8) | (FAR_EL2 & 0xFFF)`.
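The formula can be written as a pure function. The function name is hypothetical; the mask and shift are the ones given above (the masked `HPFAR_EL2` bits hold the faulting page number starting at bit 4, so shifting left by 8 places it at bit 12).

```rust
/// Reconstructs the faulting IPA from raw HPFAR_EL2 and FAR_EL2 values.
/// The page number comes from HPFAR_EL2, the in-page offset from FAR_EL2.
pub const fn fault_ipa(hpfar_el2: u64, far_el2: u64) -> u64 {
    ((hpfar_el2 & 0x00FF_FFFF_FFF0) << 8) | (far_el2 & 0xFFF)
}
```

For example, a fault on page `0x0900_0000` is reported as `HPFAR_EL2 = 0x9_0000`; combined with a virtual address ending in `0x018`, the function yields `0x0900_0018`.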
118+
119+
---
120+
121+
## Key Design Decisions
122+
123+
### 1. Enum-Dispatch over Trait Objects
124+
125+
```rust
126+
// src/devices/mod.rs
127+
pub enum Device {
128+
Uart(VirtualUart), Gicd(VirtualGicd), Gicr(VirtualGicr),
129+
VirtioBlk(...), VirtioNet(...), Pl031(VirtualPl031),
130+
}
131+
```
132+
133+
**Why**: In `no_std` bare-metal, trait objects (`dyn MmioDevice`) add vtable indirection and prevent inlining on the MMIO hot path. The device set is fixed at compile time — enum dispatch lets the compiler see through match arms and optimize the entire path.
134+
135+
**Trade-off**: Adding a device requires modifying the enum and match blocks. Acceptable with 6 device types and ~1 new type per milestone.
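A reduced sketch of the pattern, with two toy devices standing in for the real ones. Every type, method, and address here is an assumption for illustration and is not the actual `src/devices/mod.rs` API.

```rust
/// Toy UART: remembers the last byte written to offset 0.
pub struct Uart { pub last_tx: Option<u8> }
/// Toy RTC: returns a fixed counter at offset 0.
pub struct Rtc { pub seconds: u32 }

pub enum Device {
    Uart(Uart),
    Rtc(Rtc),
}

impl Device {
    /// (base, size) of the MMIO window; addresses are assumed.
    fn window(&self) -> (u64, u64) {
        match self {
            Device::Uart(_) => (0x0900_0000, 0x1000),
            Device::Rtc(_) => (0x0901_0000, 0x1000),
        }
    }

    fn read(&self, offset: u64) -> u64 {
        match self {
            Device::Rtc(rtc) if offset == 0 => rtc.seconds as u64,
            _ => 0, // unmodelled registers read as zero
        }
    }

    fn write(&mut self, offset: u64, value: u64) {
        match self {
            Device::Uart(uart) if offset == 0 => uart.last_tx = Some(value as u8),
            _ => {} // other writes are ignored in this sketch
        }
    }
}

pub struct DeviceManager { pub devices: [Device; 2] }

impl DeviceManager {
    pub fn new() -> Self {
        Self { devices: [Device::Uart(Uart { last_tx: None }), Device::Rtc(Rtc { seconds: 42 })] }
    }

    /// Finds the device whose window contains `addr`, plus the offset into it.
    fn find(&mut self, addr: u64) -> Option<(&mut Device, u64)> {
        for dev in self.devices.iter_mut() {
            let (base, size) = dev.window();
            if addr >= base && addr - base < size {
                return Some((dev, addr - base));
            }
        }
        None
    }

    /// Returns None when no device claims the address.
    pub fn read(&mut self, addr: u64) -> Option<u64> {
        self.find(addr).map(|(dev, off)| dev.read(off))
    }

    /// Returns false when no device claims the address.
    pub fn write(&mut self, addr: u64, value: u64) -> bool {
        match self.find(addr) {
            Some((dev, off)) => { dev.write(off, value); true }
            None => false,
        }
    }
}
```

Because every `match` is over a closed enum, the compiler knows each callee statically, and adding a variant produces a compile error wherever a match has no catch-all arm.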
136+
137+
### 2. Bump Allocator with Free-List Recycling
138+
139+
**Why**: `no_std` provides no default global allocator, so the hypervisor must supply its own. A bump allocator is the simplest correct choice: allocation just increments a pointer. Free-list recycling (singly linked through the first 8 bytes of each freed page) was added for Stage-2 page table teardown, where pages are allocated and then freed in bulk.
140+
141+
**Trade-off**: Only 4KB pages can be freed. Arbitrary-size allocations are permanent. Fine because 99% of heap usage is page tables.
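The page-recycling idea can be sketched on a host machine with a `Vec<u64>` standing in for the heap region and word indices standing in for pointers. The names below are hypothetical and this is not the real `BumpAllocator`; it only shows the link being stored in the first 8 bytes of a freed page.

```rust
const PAGE_WORDS: usize = 4096 / 8; // one 4KB page, counted in u64 words
const NIL: u64 = u64::MAX; // free-list terminator

pub struct BumpArena {
    mem: Vec<u64>,  // stand-in for the fixed heap region
    next: usize,    // bump index, in words
    free_head: u64, // word index of the first recycled page, or NIL
}

impl BumpArena {
    pub fn new(pages: usize) -> Self {
        Self { mem: vec![0; pages * PAGE_WORDS], next: 0, free_head: NIL }
    }

    /// Returns the word index of a zeroed page, preferring recycled pages.
    pub fn alloc_page(&mut self) -> Option<usize> {
        if self.free_head != NIL {
            let page = self.free_head as usize;
            self.free_head = self.mem[page]; // follow the link stored in the page
            self.mem[page..page + PAGE_WORDS].fill(0);
            return Some(page);
        }
        if self.next + PAGE_WORDS > self.mem.len() {
            return None; // arena exhausted
        }
        let page = self.next;
        self.next += PAGE_WORDS;
        Some(page)
    }

    /// Pushes a page onto the free list, storing the old head in its first word.
    pub fn free_page(&mut self, page: usize) {
        assert!(page % PAGE_WORDS == 0 && page < self.next);
        self.mem[page] = self.free_head;
        self.free_head = page as u64;
    }
}
```

The free list needs no extra memory because a freed page is by definition unused, so its first word can hold the link.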
142+
143+
### 3. Identity Mapping (GPA == HPA)
144+
145+
Stage-2 translation maps every guest physical address to the same host physical address.
146+
147+
**Why**: Simplifies device emulation (MMIO addresses match hardware), avoids IPA→PA translation bugs, and works well for QEMU `virt`. virtio backends use `copy_nonoverlapping` directly between guest buffers and disk images.
148+
149+
**Trade-off**: Cannot overcommit memory, relocate VMs, or deduplicate pages. A production hypervisor would add an IPA→PA layer.
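To make the identity property concrete, the sketch below splits a range into 2MB blocks where alignment allows and 4KB pages elsewhere, always with `ipa == pa`. The `Mapping` type and `identity_map` function are hypothetical; a real mapper would write page table entries instead of returning a `Vec`.

```rust
const PAGE: u64 = 4 << 10; // 4KB
const BLOCK: u64 = 2 << 20; // 2MB

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mapping {
    Block2M { ipa: u64, pa: u64 },
    Page4K { ipa: u64, pa: u64 },
}

/// Plans identity mappings for [start, start + size), preferring 2MB blocks.
pub fn identity_map(start: u64, size: u64) -> Vec<Mapping> {
    assert!(start % PAGE == 0 && size % PAGE == 0, "range must be page-aligned");
    let end = start + size;
    let mut addr = start;
    let mut out = Vec::new();
    while addr < end {
        if addr % BLOCK == 0 && end - addr >= BLOCK {
            out.push(Mapping::Block2M { ipa: addr, pa: addr });
            addr += BLOCK;
        } else {
            out.push(Mapping::Page4K { ipa: addr, pa: addr });
            addr += PAGE;
        }
    }
    out
}
```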
150+
151+
### 4. Compile-Time Feature Flags for Dual Mode
152+
153+
`sel2` and `linux_guest` use different entry points, linker scripts, and main loops — but share MMU, GIC, FF-A, and device code.
154+
155+
**Why**: A runtime mode switch would carry dead code and branch on every hot path. Feature flags let `cfg` eliminate the unused mode, keeping BL32 at ~240KB.
156+
157+
**Trade-off**: Cannot switch modes without recompiling. In practice, NS-EL2 (guest management) and S-EL2 (SP management) are fundamentally different use cases.
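A minimal sketch of how a `cfg` feature removes the unused mode at compile time. Only the `sel2` feature name comes from this document; the `Mode` enum and `active_mode` function are hypothetical.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Mode {
    NsEl2,
    SEl2Spmc,
}

/// Compiled only when the `sel2` feature is enabled.
#[cfg(feature = "sel2")]
pub fn active_mode() -> Mode {
    Mode::SEl2Spmc
}

/// Compiled only when the `sel2` feature is disabled.
#[cfg(not(feature = "sel2"))]
pub fn active_mode() -> Mode {
    Mode::NsEl2
}
```

Exactly one of the two definitions exists in any given build, so callers pay no runtime branch for the mode they do not use.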
158+
159+
### 5. Single External Dependency
160+
161+
Only `fdt` v0.1.5 (zero-copy device tree parsing). Everything else — exceptions, MMU, GICv3, virtio, FF-A, allocator — is hand-written.
162+
163+
**Why**: Bare-metal firmware cannot tolerate surprise `std` dependencies in the dep tree. Every transitive dependency is a build risk. The `fdt` crate is verified `no_std` and does one thing well.
164+
165+
**Trade-off**: ~7,700 LOC to maintain. But every line is auditable, GDB-steppable, and has no hidden behavior.
166+
167+
---
168+
169+
## Memory Architecture
170+
171+
| Layer | Purpose | Implementation |
172+
|-------|---------|----------------|
173+
| EL2 Heap | Page tables, runtime structures | `BumpAllocator` (16MB at 0x41000000) |
174+
| Stage-2 | Guest isolation (GPA→HPA) | `DynamicIdentityMapper` (2MB blocks + 4KB pages) |
175+
| Secure Stage-2 | SP isolation (S-EL2 mode) | `VSTTBR_EL2/VSTCR_EL2` per-SP |
176+
177+
**Page ownership**: Stage-2 PTE software bits [56:55] encode ownership — `Owned(00)`, `SharedOwned(01)`, `SharedBorrowed(10)`, `Donated(11)`. Validated during FF-A memory operations. Compatible with pKVM's page ownership model.
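The two-bit encoding can be sketched as a pair of helpers over a raw 64-bit descriptor. The bit positions and state values are those listed above; the type and function names are assumptions.

```rust
const OWNER_SHIFT: u32 = 55;
const OWNER_MASK: u64 = 0b11 << OWNER_SHIFT; // PTE bits [56:55]

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageOwnership {
    Owned = 0b00,
    SharedOwned = 0b01,
    SharedBorrowed = 0b10,
    Donated = 0b11,
}

/// Reads the ownership state from a Stage-2 descriptor.
pub fn ownership(pte: u64) -> PageOwnership {
    match (pte & OWNER_MASK) >> OWNER_SHIFT {
        0b00 => PageOwnership::Owned,
        0b01 => PageOwnership::SharedOwned,
        0b10 => PageOwnership::SharedBorrowed,
        _ => PageOwnership::Donated,
    }
}

/// Returns `pte` with its ownership bits replaced and all other bits untouched.
pub fn with_ownership(pte: u64, state: PageOwnership) -> u64 {
    (pte & !OWNER_MASK) | ((state as u64) << OWNER_SHIFT)
}
```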
178+
179+
**Heap gap**: The heap lies within the guest's physical range but is left **unmapped** in Stage-2, preventing guest corruption of hypervisor state.
180+
181+
---
182+
183+
## FF-A and Secure World
184+
185+
[FF-A v1.1](https://developer.arm.com/documentation/den0077) is the protocol between Normal World and Secure World:
186+
187+
**NS-EL2 proxy** (`src/ffa/proxy.rs`): Guest SMC calls are trapped via `HCR_EL2.TSC=1`. The proxy handles VERSION/FEATURES/RXTX locally and forwards DIRECT_REQ/MEM_SHARE to the real SPMC via EL3 (or to a stub SPMC for testing).
188+
189+
**S-EL2 SPMC** (`src/spmc_handler.rs`): In this mode the hypervisor *is* the SPMC. It receives requests from the SPMD, dispatches DIRECT_REQ to SPs via ERET, and handles SP-initiated calls (MEM_RETRIEVE, CONSOLE_LOG) in the `handle_sp_exit()` loop.
190+
191+
**SP-to-SP calls**: A `CallStack` tracks the chain of active SPs and provides cycle detection. Recursive `dispatch_to_sp()` handles chain preemption (Blocked→Preempted state transition).
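One way to picture cycle detection is a small fixed-capacity stack of SP IDs that refuses to push an ID already in the chain. This is a sketch under assumptions: the real `CallStack` in `sp_context.rs` may differ in capacity, error handling, and fields.

```rust
pub const MAX_DEPTH: usize = 4; // assumed capacity

#[derive(Debug, PartialEq, Eq)]
pub enum CallError {
    Cycle,   // target SP is already somewhere in the call chain
    TooDeep, // chain would exceed MAX_DEPTH
}

pub struct CallStack {
    ids: [u16; MAX_DEPTH],
    len: usize,
}

impl CallStack {
    pub const fn new() -> Self {
        Self { ids: [0; MAX_DEPTH], len: 0 }
    }

    /// Records a call into `sp_id`, rejecting A→B→A style cycles.
    pub fn push(&mut self, sp_id: u16) -> Result<(), CallError> {
        if self.ids[..self.len].contains(&sp_id) {
            return Err(CallError::Cycle);
        }
        if self.len == MAX_DEPTH {
            return Err(CallError::TooDeep);
        }
        self.ids[self.len] = sp_id;
        self.len += 1;
        Ok(())
    }

    /// Removes the most recent callee when it returns.
    pub fn pop(&mut self) -> Option<u16> {
        if self.len == 0 {
            return None;
        }
        self.len -= 1;
        Some(self.ids[self.len])
    }
}
```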
192+
193+
**Memory sharing lifecycle**:
194+
```
195+
Sender: MEM_SHARE(pages) → handle → PTE bits → SharedOwned
196+
Receiver: MEM_RETRIEVE_REQ(handle) → Stage-2 map → SharedBorrowed
197+
Receiver: MEM_RELINQUISH(handle) → Stage-2 unmap
198+
Sender: MEM_RECLAIM(handle) → restore PTE → Owned
199+
```
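The lifecycle above behaves like a small state machine for a single borrower. The sketch below encodes the four transitions and rejects anything out of order, such as reclaiming while the receiver still holds the mapping. The type names and error string are hypothetical, and multi-borrower shares are not modelled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShareState {
    Owned,     // sender owns the pages exclusively
    Shared,    // MEM_SHARE done, receiver has not mapped the pages
    Retrieved, // receiver has the pages mapped in its Stage-2
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemOp {
    Share,
    RetrieveReq,
    Relinquish,
    Reclaim,
}

/// Applies one FF-A memory operation, or reports that it is out of order.
pub fn step(state: ShareState, op: MemOp) -> Result<ShareState, &'static str> {
    use MemOp as O;
    use ShareState as S;
    match (state, op) {
        (S::Owned, O::Share) => Ok(S::Shared),
        (S::Shared, O::RetrieveReq) => Ok(S::Retrieved),
        (S::Retrieved, O::Relinquish) => Ok(S::Shared),
        (S::Shared, O::Reclaim) => Ok(S::Owned),
        _ => Err("operation not valid in this state"),
    }
}
```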
200+
201+
---
202+
203+
## Source Tree
204+
205+
| Subsystem | Files |
206+
|-----------|-------|
207+
| **Boot** | `arch/aarch64/boot.S`, `boot_sel2.S`, `linker.ld`, `linker_sel2.ld` |
208+
| **Core** | `vm.rs`, `vcpu.rs`, `scheduler.rs`, `global.rs` |
209+
| **Exceptions** | `arch/aarch64/exception.S`, `hypervisor/exception.rs`, `decode.rs` |
210+
| **Memory** | `mm/allocator.rs`, `mm/heap.rs`, `mm/mmu.rs`, `sel2_mmu.rs` |
211+
| **Devices** | `devices/{pl011,gic,pl031,virtio/}` — enum-dispatch in `mod.rs` |
212+
| **FF-A** | `ffa/{proxy,descriptors,stage2_walker,memory,mailbox,smc_forward}.rs` |
213+
| **SPMC** | `spmc_handler.rs`, `sp_context.rs`, `manifest.rs`, `secure_stage2.rs` |
214+
| **Networking** | `vswitch.rs` — L2 virtual switch, MAC learning, inter-VM forwarding |
215+
| **Platform** | `platform.rs` (constants), `dtb.rs` (runtime DTB discovery) |
216+
| **Tests** | `tests/test_*.rs` — 34 suites, ~457 assertions (`make run`) |
217+
218+
---
219+
220+
## Further Reading
221+
222+
- [docs/architecture.md](docs/architecture.md) — Exhaustive internal reference: register layouts, memory maps, every handler
223+
- [docs/GICV3_IMPLEMENTATION.md](docs/GICV3_IMPLEMENTATION.md) — GICv3 trap-and-emulate deep dive
224+
- [docs/RUST_FIRMWARE_CODING_GUIDELINES.md](docs/RUST_FIRMWARE_CODING_GUIDELINES.md) — Bare-metal Rust coding conventions
225+
- [docs/debugging.md](docs/debugging.md) — GDB remote debugging and QEMU tracing
226+
- [CLAUDE.md](CLAUDE.md) — Full module/test tables, build commands, feature flag matrix
