| 1 | +# Architecture |
| 2 | + |
| 3 | +A bare-metal Type-1 hypervisor for ARM64 in Rust (`no_std`). One codebase, two personalities: an NS-EL2 hypervisor that boots Linux, or an S-EL2 SPMC that manages Secure Partitions alongside Android pKVM. ~7,700 LOC, single external dependency (`fdt`), 34 test suites. |
| 4 | + |
| 5 | +For quick start and features, see [README.md](README.md). |
| 6 | +For exhaustive internals, see [docs/architecture.md](docs/architecture.md). |
| 7 | + |
| 8 | +--- |
| 9 | + |
| 10 | +## Privilege Model |
| 11 | + |
| 12 | +The hypervisor operates in one of two compile-time modes: |
| 13 | + |
| 14 | +``` |
| 15 | + NS-EL2 Mode (make run-linux) S-EL2 Mode (make run-spmc) |
| 16 | + ────────────────────────────── ────────────────────────────── |
| 17 | + EL3 │ TF-A BL31 + SPMD |
| 18 | + │ (world switch, SMC relay) |
| 19 | + ────┼───────────────────────── |
| 20 | + EL2 │ This hypervisor S-EL2│ This hypervisor (SPMC) |
| 21 | + │ (exception handling, │ (FF-A dispatch, SP lifecycle, |
| 22 | + │ Stage-2 MMU, GIC virt) │ Secure Stage-2) |
| 23 | + ────┼───────────────────────── S-EL1│ SP1 Hello, SP2 IRQ, SP3 Relay |
| 24 | + EL1 │ Linux / Zephyr guest ────┼───────────────────────── |
| 25 | + │ NS-EL2│ pKVM (protected hVHE) |
| 26 | + NS-EL1│ Linux / Android guest |
| 27 | +``` |
| 28 | + |
| 29 | +Both modes build from the same Rust codebase: `#[cfg(feature = "sel2")]` selects the entry point, linker script, and event loop. The two modes share ~70% of the code (MMU, GIC, exception handling, FF-A protocol). |
| 30 | + |
| 31 | +--- |
| 32 | + |
| 33 | +## Boot Flow |
| 34 | + |
| 35 | +### NS-EL2 (QEMU `-kernel`) |
| 36 | + |
| 37 | +``` |
| 38 | +boot.S Save DTB addr (x0→x20), set up stack, clear BSS |
| 39 | + │ |
| 40 | +rust_main() Parse DTB (fdt crate, zero-copy, before heap) |
| 41 | + │ Install exception vectors (VBAR_EL2) |
| 42 | + │ Configure HCR_EL2 (trap WFI, SMC, MMIO) |
| 43 | + │ Init GICv3 (GICD enable, List Registers) |
| 44 | + │ Init FF-A proxy (probe SPMC at EL3) |
| 45 | + │ Init heap (BumpAllocator, 16MB at 0x41000000) |
| 46 | + │ |
| 47 | + ├── make run: Run 34 test suites, halt |
| 48 | + ├── make run-linux: Boot Linux (4 vCPUs, virtio-blk) |
| 49 | + └── make run-multi-vm: Boot 2 Linux VMs time-sliced |
| 50 | +``` |
| 51 | + |
| 52 | +### S-EL2 (TF-A BL32) |
| 53 | + |
| 54 | +``` |
| 55 | +BL1 → BL2 → BL31(SPMD) → BL32(us) → BL33(pKVM) |
| 56 | + |
| 57 | +boot_sel2.S Save manifest/HW_CONFIG/core_id from SPMD |
| 58 | + │ |
| 59 | +rust_main_sel2() Parse SPMC manifest (TOS_FW_CONFIG DTB) |
| 60 | + │ Enable S-EL2 Stage-1 MMU (NS=1 for NWd DRAM) |
| 61 | + │ Init GIC, Secure Stage-2 for SPs |
| 62 | + │ Parse SPKG headers, ERET to SP1 at S-EL1 |
| 63 | + │ SP1 calls FFA_MSG_WAIT → boot SP2, SP3 |
| 64 | + │ Register secondary EP (FFA_SECONDARY_EP_REGISTER) |
| 65 | + │ |
| 66 | + └── FFA_MSG_WAIT → SPMD dispatches NWd requests → loop |
| 67 | +``` |
| 68 | + |
| 69 | +**Key insight**: DTB parsing uses the `fdt` crate (zero-copy, no allocations), |
| 70 | +so it runs *before* the heap is initialized. |
| 71 | + |
| 72 | +--- |
| 73 | + |
| 74 | +## Core Abstractions |
| 75 | + |
| 76 | +``` |
| 77 | +src/ |
| 78 | +├── vm.rs VM lifecycle, Stage-2 setup, run_smp() scheduler loop |
| 79 | +├── vcpu.rs State machine (Uninitialized→Ready→Running→Stopped) |
| 80 | +├── scheduler.rs Round-robin vCPU scheduling with block/unblock |
| 81 | +├── devices/mod.rs Enum-dispatch MMIO routing (see Design Decisions) |
| 82 | +├── ffa/proxy.rs FF-A v1.1 proxy — intercepts guest SMC at NS-EL2 |
| 83 | +├── spmc_handler.rs S-EL2 SPMC event loop — FF-A dispatch to SPs |
| 84 | +├── sp_context.rs Per-SP state, INTID ownership, call stack |
| 85 | +├── global.rs Per-VM state arrays, UART RX ring, VSwitch |
| 86 | +└── arch/aarch64/ |
| 87 | + ├── exception.S Vector table, context save/restore, enter_guest |
| 88 | + └── hypervisor/ |
| 89 | + ├── exception.rs ESR_EL2 decode → exit reason dispatch |
| 90 | + └── decode.rs MMIO instruction decode (ISS + raw instruction) |
| 91 | +``` |
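
The round-robin policy with block/unblock attributed to `scheduler.rs` above can be illustrated with a small host-side sketch. This is not the project's scheduler: `RoundRobin`, `Slot`, and `MAX_VCPUS` are hypothetical names, and the fixed-size array is an assumption chosen so the sketch needs no heap.

```rust
/// Hypothetical upper bound on vCPUs; the real limit is not stated in this document.
const MAX_VCPUS: usize = 8;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Slot {
    Empty,
    Ready,
    Blocked,
}

/// Round-robin picker over a fixed table of vCPU slots (illustrative only).
pub struct RoundRobin {
    slots: [Slot; MAX_VCPUS],
    last: usize, // index of the vCPU that ran most recently
}

impl RoundRobin {
    pub fn new() -> Self {
        // Start "just before" slot 0 so the first pick begins at index 0.
        Self { slots: [Slot::Empty; MAX_VCPUS], last: MAX_VCPUS - 1 }
    }

    /// Register a vCPU as runnable. Out-of-range ids are ignored.
    pub fn add(&mut self, id: usize) {
        if let Some(slot) = self.slots.get_mut(id) {
            *slot = Slot::Ready;
        }
    }

    /// Mark a runnable vCPU as blocked (for example after a WFI trap).
    pub fn block(&mut self, id: usize) {
        if let Some(slot) = self.slots.get_mut(id) {
            if *slot == Slot::Ready {
                *slot = Slot::Blocked;
            }
        }
    }

    /// Make a blocked vCPU runnable again (for example on a pending interrupt).
    pub fn unblock(&mut self, id: usize) {
        if let Some(slot) = self.slots.get_mut(id) {
            if *slot == Slot::Blocked {
                *slot = Slot::Ready;
            }
        }
    }

    /// Scan forward from the slot after `last`, wrapping once around the table.
    pub fn pick_next(&mut self) -> Option<usize> {
        for step in 1..=MAX_VCPUS {
            let id = (self.last + step) % MAX_VCPUS;
            if self.slots[id] == Slot::Ready {
                self.last = id;
                return Some(id);
            }
        }
        None // every vCPU is blocked or absent
    }
}
```

Starting the scan one past the previously chosen slot is what gives fairness: a vCPU that just ran is considered last on the next pick.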
| 92 | + |
| 93 | +### Exception Handling Flow |
| 94 | + |
| 95 | +``` |
| 96 | +Guest @ EL1 |
| 97 | + │ trap (HVC, SMC, MMIO fault, WFI, MSR/MRS) |
| 98 | + ▼ |
| 99 | +exception.S ─── save x0-x30, SP_EL1, ELR_EL2, SPSR_EL2 |
| 100 | + │ (context pointer from TPIDR_EL2) |
| 101 | + ▼ |
| 102 | +exception.rs ── read ESR_EL2, extract EC (exception class) |
| 103 | + │ |
| 104 | + ├─ WfiWfe → return to scheduler (block vCPU) |
| 105 | + ├─ HvcCall → PSCI (CPU_ON/OFF/RESET) or HF_INTERRUPT_GET |
| 106 | + ├─ SmcCall → FF-A proxy or forward to EL3 |
| 107 | + ├─ DataAbort → HPFAR_EL2 for IPA → DeviceManager MMIO dispatch |
| 108 | + ├─ SysReg trap → ICC_SGI1R (IPI emulation), timer regs |
| 109 | + └─ IRQ → INTID 26 (preemption), 27 (vtimer), 33 (UART) |
| 110 | + │ |
| 111 | + ▼ |
| 112 | +exception.S ─── advance PC, restore context, ERET back to guest |
| 113 | +``` |
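
The ESR_EL2 decode step in the flow above might look roughly like the following. The exception-class values are the architectural encodings (EC in bits [31:26]); the `ExitReason` variants mirror the names in the diagram, but the exact type, its fields, and the `decode_esr` name are assumptions, not the project's code.

```rust
/// Exit reasons named after the branches in the flow diagram (illustrative type).
#[derive(Debug, PartialEq, Eq)]
pub enum ExitReason {
    WfiWfe,
    HvcCall { imm: u16 },
    SmcCall { imm: u16 },
    SysReg,
    DataAbort,
    Other(u8),
}

/// Decode ESR_EL2: EC lives in bits [31:26], the ISS in bits [24:0].
pub fn decode_esr(esr: u64) -> ExitReason {
    let ec = ((esr >> 26) & 0x3f) as u8;
    let iss = (esr & 0x01ff_ffff) as u32;
    match ec {
        0x01 => ExitReason::WfiWfe, // trapped WFI or WFE
        0x16 => ExitReason::HvcCall { imm: (iss & 0xffff) as u16 }, // HVC from AArch64
        0x17 => ExitReason::SmcCall { imm: (iss & 0xffff) as u16 }, // SMC from AArch64
        0x18 => ExitReason::SysReg, // trapped MSR/MRS/system instruction
        0x24 => ExitReason::DataAbort, // data abort from a lower exception level
        other => ExitReason::Other(other),
    }
}
```

IRQs do not appear here because they arrive through a separate vector entry rather than through a synchronous exception class.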
| 114 | + |
| 115 | +**Critical detail**: For MMIO, `FAR_EL2` holds the guest *virtual* address. |
| 116 | +The guest *physical* address (IPA) comes from `HPFAR_EL2`: the mask selects bits [39:4], which hold IPA[47:12], and the low 12 bits of `FAR_EL2` supply the page offset: |
| 117 | +`IPA = ((HPFAR_EL2 & 0xFFFFFFFFF0) << 8) | (FAR_EL2 & 0xFFF)`. |
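
As a worked instance of the formula, the sketch below computes the IPA from the two register values. The function and constant names are hypothetical; the register values would come from `mrs` reads in the real handler, which are omitted so the example runs on a host.

```rust
/// HPFAR_EL2 bits [39:4], which hold IPA[47:12].
const HPFAR_FIPA_MASK: u64 = 0x00FF_FFFF_FFF0;
/// Low 12 bits of the faulting virtual address: the offset within a 4 KB page.
const PAGE_OFFSET_MASK: u64 = 0xFFF;

/// Combine the faulting page (from HPFAR_EL2) with the page offset (from FAR_EL2).
pub fn fault_ipa(hpfar_el2: u64, far_el2: u64) -> u64 {
    // The field starts at bit 4 and represents address bit 12, hence the shift by 8.
    ((hpfar_el2 & HPFAR_FIPA_MASK) << 8) | (far_el2 & PAGE_OFFSET_MASK)
}
```

For example, a guest access to offset `0x18` of a device page at IPA `0x0900_0000` reports `HPFAR_EL2 = 0x9_0000`; whatever virtual address the guest used, only its low 12 bits (`0x018`) contribute, giving `0x0900_0018`.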
| 118 | + |
| 119 | +--- |
| 120 | + |
| 121 | +## Key Design Decisions |
| 122 | + |
| 123 | +### 1. Enum-Dispatch over Trait Objects |
| 124 | + |
| 125 | +```rust |
| 126 | +// src/devices/mod.rs |
| 127 | +pub enum Device { |
| 128 | + Uart(VirtualUart), Gicd(VirtualGicd), Gicr(VirtualGicr), |
| 129 | + VirtioBlk(...), VirtioNet(...), Pl031(VirtualPl031), |
| 130 | +} |
| 131 | +``` |
| 132 | + |
| 133 | +**Why**: In `no_std` bare-metal, trait objects (`dyn MmioDevice`) add vtable indirection and prevent inlining on the MMIO hot path. The device set is fixed at compile time — enum dispatch lets the compiler see through match arms and optimize the entire path. |
| 134 | + |
| 135 | +**Trade-off**: Adding a device requires modifying the enum and match blocks. Acceptable with 6 device types and ~1 new type per milestone. |
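
A reduced sketch of the enum-dispatch pattern follows. The two device types (`Scratch`, `ReadOnlyId`), the 4 KB window, and this `DeviceManager` layout are invented for illustration and stand in for the real device set; only the dispatch shape is the point.

```rust
/// Hypothetical device: a single read/write scratch register.
pub struct Scratch { pub base: u64, pub value: u64 }
/// Hypothetical device: a read-only ID register at offset 0.
pub struct ReadOnlyId { pub base: u64, pub id: u64 }

/// Closed set of devices: dispatch is a `match`, not a vtable call.
pub enum Device {
    Scratch(Scratch),
    Id(ReadOnlyId),
}

/// Assumed size of every device's MMIO window in this sketch.
const WINDOW: u64 = 0x1000;

impl Device {
    fn base(&self) -> u64 {
        match self {
            Device::Scratch(d) => d.base,
            Device::Id(d) => d.base,
        }
    }
    pub fn contains(&self, addr: u64) -> bool {
        let base = self.base();
        addr >= base && addr - base < WINDOW
    }
    pub fn read(&mut self, offset: u64) -> u64 {
        match self {
            Device::Scratch(d) => d.value,
            Device::Id(d) => if offset == 0 { d.id } else { 0 },
        }
    }
    pub fn write(&mut self, _offset: u64, value: u64) {
        match self {
            Device::Scratch(d) => d.value = value,
            Device::Id(_) => {} // writes to a read-only register are ignored
        }
    }
}

/// Fixed-capacity device table, so no heap allocation is needed.
pub struct DeviceManager { devices: [Option<Device>; 4] }

impl DeviceManager {
    pub fn new() -> Self { Self { devices: [None, None, None, None] } }

    /// Returns false when the table is full.
    pub fn register(&mut self, dev: Device) -> bool {
        for slot in self.devices.iter_mut() {
            if slot.is_none() {
                *slot = Some(dev);
                return true;
            }
        }
        false
    }

    /// Returns None when no device claims the address.
    pub fn mmio_read(&mut self, addr: u64) -> Option<u64> {
        for dev in self.devices.iter_mut().flatten() {
            if dev.contains(addr) {
                let offset = addr - dev.base();
                return Some(dev.read(offset));
            }
        }
        None
    }

    /// Returns false when no device claims the address.
    pub fn mmio_write(&mut self, addr: u64, value: u64) -> bool {
        for dev in self.devices.iter_mut().flatten() {
            if dev.contains(addr) {
                let offset = addr - dev.base();
                dev.write(offset, value);
                return true;
            }
        }
        false
    }
}
```

Adding a device in this scheme means touching the enum and each `match`, which is the trade-off described above; in exchange the compiler reports every match that was not updated.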
| 136 | + |
| 137 | +### 2. Bump Allocator with Free-List Recycling |
| 138 | + |
| 139 | +**Why**: `no_std` provides no default global allocator, so the hypervisor must supply its own. A bump allocator is the simplest correct allocator: allocation just increments a pointer. Free-list recycling (singly-linked via the first 8 bytes of freed pages) was added for Stage-2 page table teardown, where pages are allocated then freed in bulk. |
| 140 | + |
| 141 | +**Trade-off**: Only 4KB pages can be freed. Arbitrary-size allocations are permanent. Fine because 99% of heap usage is page tables. |
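
The combination of bump allocation and page recycling can be sketched as below. This is a host-side model, not the project's `BumpAllocator`: `PageArena` is a hypothetical name, it hands out byte offsets into an owned array rather than physical addresses, and it assumes the arena base is page-aligned. The free-list link is stored in the first 8 bytes of each freed page, as described above.

```rust
const PAGE_SIZE: usize = 4096;
/// Sentinel meaning "free list is empty".
const NO_PAGE: u64 = u64::MAX;

/// Bump allocator over an N-byte arena with a free list for 4 KB pages.
pub struct PageArena<const N: usize> {
    mem: [u8; N],
    next: usize,    // bump pointer: first never-allocated byte
    free_head: u64, // offset of the most recently freed page, or NO_PAGE
}

impl<const N: usize> PageArena<N> {
    pub fn new() -> Self {
        Self { mem: [0; N], next: 0, free_head: NO_PAGE }
    }

    /// Allocate one zeroed page, preferring recycled pages over fresh ones.
    pub fn alloc_page(&mut self) -> Option<usize> {
        if self.free_head != NO_PAGE {
            let page = self.free_head as usize;
            // Pop: the next link lives in the first 8 bytes of the freed page.
            let mut link = [0u8; 8];
            link.copy_from_slice(&self.mem[page..page + 8]);
            self.free_head = u64::from_le_bytes(link);
            self.mem[page..page + PAGE_SIZE].fill(0);
            return Some(page);
        }
        let page = self.next.checked_add(PAGE_SIZE - 1)? & !(PAGE_SIZE - 1);
        let end = page.checked_add(PAGE_SIZE)?;
        if end > N {
            return None;
        }
        self.next = end;
        Some(page)
    }

    /// Arbitrary-size bump allocation; these bytes are never returned.
    pub fn alloc_bytes(&mut self, size: usize, align: usize) -> Option<usize> {
        assert!(align.is_power_of_two());
        let start = self.next.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > N {
            return None;
        }
        self.next = end;
        Some(start)
    }

    /// Push a page onto the free list. Double frees are not detected here.
    pub fn free_page(&mut self, page: usize) {
        assert!(page % PAGE_SIZE == 0 && page + PAGE_SIZE <= self.next);
        self.mem[page..page + 8].copy_from_slice(&self.free_head.to_le_bytes());
        self.free_head = page as u64;
    }
}
```

Because the link is stored inside the freed page itself, recycling costs no extra metadata, which is also why only whole pages can be freed.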
| 142 | + |
| 143 | +### 3. Identity Mapping (GPA == HPA) |
| 144 | + |
| 145 | +Stage-2 translation maps every guest physical address to the same host physical address. |
| 146 | + |
| 147 | +**Why**: Simplifies device emulation (MMIO addresses match hardware), avoids IPA→PA translation bugs, and works well for QEMU `virt`. virtio backends use `copy_nonoverlapping` directly between guest buffers and disk images. |
| 148 | + |
| 149 | +**Trade-off**: Cannot overcommit memory, relocate VMs, or deduplicate pages. A production hypervisor would add an IPA→PA layer. |
| 150 | + |
| 151 | +### 4. Compile-Time Feature Flags for Dual Mode |
| 152 | + |
| 153 | +`sel2` and `linux_guest` use different entry points, linker scripts, and main loops — but share MMU, GIC, FF-A, and device code. |
| 154 | + |
| 155 | +**Why**: A runtime mode switch would carry dead code and branch on every hot path. Feature flags let `cfg` eliminate the unused mode, keeping BL32 at ~240KB. |
| 156 | + |
| 157 | +**Trade-off**: Cannot switch modes without recompiling. In practice, NS-EL2 (guest management) and S-EL2 (SP management) are fundamentally different use cases. |
| 158 | + |
| 159 | +### 5. Single External Dependency |
| 160 | + |
| 161 | +Only `fdt` v0.1.5 (zero-copy device tree parsing). Everything else — exceptions, MMU, GICv3, virtio, FF-A, allocator — is hand-written. |
| 162 | + |
| 163 | +**Why**: Bare-metal firmware cannot tolerate surprise `std` dependencies in the dep tree. Every transitive dependency is a build risk. The `fdt` crate is verified `no_std` and does one thing well. |
| 164 | + |
| 165 | +**Trade-off**: ~7,700 LOC to maintain. But every line is auditable, GDB-steppable, and has no hidden behavior. |
| 166 | + |
| 167 | +--- |
| 168 | + |
| 169 | +## Memory Architecture |
| 170 | + |
| 171 | +| Layer | Purpose | Implementation | |
| 172 | +|-------|---------|----------------| |
| 173 | +| EL2 Heap | Page tables, runtime structures | `BumpAllocator` (16MB at 0x41000000) | |
| 174 | +| Stage-2 | Guest isolation (GPA→HPA) | `DynamicIdentityMapper` (2MB blocks + 4KB pages) | |
| 175 | +| Secure Stage-2 | SP isolation (S-EL2 mode) | `VSTTBR_EL2/VSTCR_EL2` per-SP | |
| 176 | + |
| 177 | +**Page ownership**: Stage-2 PTE software bits [56:55] encode ownership — `Owned(00)`, `SharedOwned(01)`, `SharedBorrowed(10)`, `Donated(11)`. Validated during FF-A memory operations. Compatible with pKVM's page ownership model. |
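
One way to encode and decode the two ownership bits is sketched below. It assumes bit 55 is the low bit of the 2-bit field, so `01` sets bit 55 and `10` sets bit 56; the `PageOwnership` type and its method names are illustrative rather than taken from the source.

```rust
/// Position of the low ownership bit in a Stage-2 descriptor (assumed ordering).
const OWNERSHIP_SHIFT: u32 = 55;
const OWNERSHIP_MASK: u64 = 0b11 << OWNERSHIP_SHIFT;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageOwnership {
    Owned = 0b00,
    SharedOwned = 0b01,
    SharedBorrowed = 0b10,
    Donated = 0b11,
}

impl PageOwnership {
    /// Read the ownership state out of a descriptor.
    pub fn from_pte(pte: u64) -> Self {
        match (pte & OWNERSHIP_MASK) >> OWNERSHIP_SHIFT {
            0b00 => Self::Owned,
            0b01 => Self::SharedOwned,
            0b10 => Self::SharedBorrowed,
            _ => Self::Donated,
        }
    }

    /// Return `pte` with its ownership bits replaced; all other bits are preserved.
    pub fn apply(self, pte: u64) -> u64 {
        (pte & !OWNERSHIP_MASK) | ((self as u64) << OWNERSHIP_SHIFT)
    }
}
```

Keeping the state in the descriptor itself means an ownership check during an FF-A memory operation needs only the page-table walk, with no side table to keep in sync.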
| 178 | + |
| 179 | +**Heap gap**: The heap lies within the guest's physical range but is left **unmapped** in Stage-2, preventing guest corruption of hypervisor state. |
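
The heap gap amounts to mapping guest RAM in two pieces around the heap. A minimal sketch, assuming half-open address ranges and a hypothetical helper named `ranges_excluding`:

```rust
/// Half-open physical address range [start, end).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Range {
    pub start: u64,
    pub end: u64,
}

/// Split `ram` into the parts below and above `hole`.
/// Assumes start < end for both inputs.
pub fn ranges_excluding(ram: Range, hole: Range) -> [Option<Range>; 2] {
    let below = if hole.start > ram.start {
        Some(Range { start: ram.start, end: hole.start.min(ram.end) })
    } else {
        None
    };
    let above = if hole.end < ram.end {
        Some(Range { start: hole.end.max(ram.start), end: ram.end })
    } else {
        None
    };
    [below, above]
}
```

With the 16MB heap at `0x41000000` and a guest RAM window assumed here to be `0x40000000..0x80000000`, the pieces to map in Stage-2 are `0x40000000..0x41000000` and `0x42000000..0x80000000`; a guest access inside the gap takes a Stage-2 fault instead of reaching hypervisor memory.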
| 180 | + |
| 181 | +--- |
| 182 | + |
| 183 | +## FF-A and Secure World |
| 184 | + |
| 185 | +[FF-A v1.1](https://developer.arm.com/documentation/den0077) is the protocol between Normal World and Secure World: |
| 186 | + |
| 187 | +**NS-EL2 proxy** (`src/ffa/proxy.rs`): Guest SMC calls are trapped via `HCR_EL2.TSC=1`. The proxy handles VERSION/FEATURES/RXTX locally and forwards DIRECT_REQ/MEM_SHARE to the real SPMC via EL3 (or to a stub SPMC for testing). |
| 188 | + |
| 189 | +**S-EL2 SPMC** (`src/spmc_handler.rs`): *Is* the SPMC. It receives requests from the SPMD, dispatches DIRECT_REQ to SPs via ERET, and handles SP-initiated calls (MEM_RETRIEVE, CONSOLE_LOG) in the `handle_sp_exit()` loop. |
| 190 | + |
| 191 | +**SP-to-SP calls**: `CallStack` with cycle detection. Recursive `dispatch_to_sp()` handles chain preemption (Blocked→Preempted state transition). |
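
Cycle detection on a call stack can be as simple as refusing to push an SP that is already on the stack. The sketch below is an assumption about the general shape, not the project's `CallStack`: the depth limit, the error type, and the method names are hypothetical.

```rust
/// Hypothetical maximum SP-to-SP nesting depth.
const MAX_DEPTH: usize = 4;

#[derive(Debug, PartialEq, Eq)]
pub enum CallError {
    /// The target SP is already somewhere in the current call chain.
    Cycle,
    /// The chain is already MAX_DEPTH calls deep.
    TooDeep,
}

/// Fixed-capacity stack of 16-bit FF-A partition IDs.
pub struct CallStack {
    ids: [u16; MAX_DEPTH],
    depth: usize,
}

impl CallStack {
    pub const fn new() -> Self {
        Self { ids: [0; MAX_DEPTH], depth: 0 }
    }

    /// Record that `sp_id` is being entered; fails if that would form a cycle.
    pub fn push(&mut self, sp_id: u16) -> Result<(), CallError> {
        if self.ids[..self.depth].contains(&sp_id) {
            return Err(CallError::Cycle);
        }
        if self.depth == MAX_DEPTH {
            return Err(CallError::TooDeep);
        }
        self.ids[self.depth] = sp_id;
        self.depth += 1;
        Ok(())
    }

    /// Record that the innermost SP has returned.
    pub fn pop(&mut self) -> Option<u16> {
        if self.depth == 0 {
            return None;
        }
        self.depth -= 1;
        Some(self.ids[self.depth])
    }
}
```

A linear scan is adequate at this depth, and rejecting the call before the ERET prevents a chain such as SP1 → SP2 → SP1 from deadlocking on an SP that is waiting for its own callee.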
| 192 | + |
| 193 | +**Memory sharing lifecycle**: |
| 194 | +``` |
| 195 | +Sender: MEM_SHARE(pages) → handle → PTE bits → SharedOwned |
| 196 | +Receiver: MEM_RETRIEVE_REQ(handle) → Stage-2 map → SharedBorrowed |
| 197 | +Receiver: MEM_RELINQUISH(handle) → Stage-2 unmap |
| 198 | +Sender: MEM_RECLAIM(handle) → restore PTE → Owned |
| 199 | +``` |
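
The four-step lifecycle can be modelled as a small state machine over one share record. This sketch tracks a single sender and a single receiver; `ShareRecord`, `PageState`, and `ShareError` are hypothetical names, and refusing a reclaim while the receiver still holds the mapping is an assumption based on how FF-A treats un-relinquished memory.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageState {
    Owned,
    SharedOwned,
    SharedBorrowed,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ShareError {
    /// The operation is not valid in the record's current state.
    Denied,
}

/// State of one shared region, identified by its handle.
pub struct ShareRecord {
    pub handle: u64,
    /// State of the pages in the sender's Stage-2 tables.
    pub sender: PageState,
    /// Some(..) while the pages are mapped into the receiver, None otherwise.
    pub receiver: Option<PageState>,
}

impl ShareRecord {
    /// MEM_SHARE: the sender keeps ownership but marks the pages as shared.
    pub fn share(handle: u64) -> Self {
        Self { handle, sender: PageState::SharedOwned, receiver: None }
    }

    /// MEM_RETRIEVE_REQ: map the pages into the receiver as borrowed.
    pub fn retrieve(&mut self) -> Result<(), ShareError> {
        if self.sender != PageState::SharedOwned || self.receiver.is_some() {
            return Err(ShareError::Denied);
        }
        self.receiver = Some(PageState::SharedBorrowed);
        Ok(())
    }

    /// MEM_RELINQUISH: unmap the pages from the receiver.
    pub fn relinquish(&mut self) -> Result<(), ShareError> {
        if self.receiver.is_none() {
            return Err(ShareError::Denied);
        }
        self.receiver = None;
        Ok(())
    }

    /// MEM_RECLAIM: only allowed once the receiver has relinquished.
    pub fn reclaim(&mut self) -> Result<(), ShareError> {
        if self.sender != PageState::SharedOwned || self.receiver.is_some() {
            return Err(ShareError::Denied);
        }
        self.sender = PageState::Owned;
        Ok(())
    }
}
```

Ordering is the property worth testing: a reclaim issued between retrieve and relinquish must fail, otherwise the sender would regain exclusive ownership of pages the receiver can still access.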
| 200 | + |
| 201 | +--- |
| 202 | + |
| 203 | +## Source Tree |
| 204 | + |
| 205 | +| Subsystem | Files | |
| 206 | +|-----------|-------| |
| 207 | +| **Boot** | `arch/aarch64/boot.S`, `boot_sel2.S`, `linker.ld`, `linker_sel2.ld` | |
| 208 | +| **Core** | `vm.rs`, `vcpu.rs`, `scheduler.rs`, `global.rs` | |
| 209 | +| **Exceptions** | `arch/aarch64/exception.S`, `hypervisor/exception.rs`, `decode.rs` | |
| 210 | +| **Memory** | `mm/allocator.rs`, `mm/heap.rs`, `mm/mmu.rs`, `sel2_mmu.rs` | |
| 211 | +| **Devices** | `devices/{pl011,gic,pl031,virtio/}` — enum-dispatch in `mod.rs` | |
| 212 | +| **FF-A** | `ffa/{proxy,descriptors,stage2_walker,memory,mailbox,smc_forward}.rs` | |
| 213 | +| **SPMC** | `spmc_handler.rs`, `sp_context.rs`, `manifest.rs`, `secure_stage2.rs` | |
| 214 | +| **Networking** | `vswitch.rs` — L2 virtual switch, MAC learning, inter-VM forwarding | |
| 215 | +| **Platform** | `platform.rs` (constants), `dtb.rs` (runtime DTB discovery) | |
| 216 | +| **Tests** | `tests/test_*.rs` — 34 suites, ~457 assertions (`make run`) | |
| 217 | + |
| 218 | +--- |
| 219 | + |
| 220 | +## Further Reading |
| 221 | + |
| 222 | +- [docs/architecture.md](docs/architecture.md) — Exhaustive internal reference: register layouts, memory maps, every handler |
| 223 | +- [docs/GICV3_IMPLEMENTATION.md](docs/GICV3_IMPLEMENTATION.md) — GICv3 trap-and-emulate deep dive |
| 224 | +- [docs/RUST_FIRMWARE_CODING_GUIDELINES.md](docs/RUST_FIRMWARE_CODING_GUIDELINES.md) — Bare-metal Rust coding conventions |
| 225 | +- [docs/debugging.md](docs/debugging.md) — GDB remote debugging and QEMU tracing |
| 226 | +- [CLAUDE.md](CLAUDE.md) — Full module/test tables, build commands, feature flag matrix |