Commit 06615fc

docs: ImpVMSnapshot design — execution model, OCI layers, base image election
1 parent b14caf6 commit 06615fc

1 file changed

Lines changed: 280 additions & 0 deletions

@@ -0,0 +1,280 @@
# Imp — ImpVMSnapshot Design

> *Full snapshot lifecycle: declarative triggers, cron scheduling, OCI distribution, base image election*

**Goal:** Implement the `ImpVMSnapshot` controller end-to-end — agent-side Firecracker
snapshot capture, operator-side scheduling and retention, OCI/node-local storage, and
base image election for warm pools and migration.

---

## Overview

An `ImpVMSnapshot` captures the complete in-memory and on-disk state of a running
Firecracker VM. Snapshots serve three purposes:

1. **Base images** for `ImpWarmPool` — near-instant VM boot from a known state.
2. **Migration checkpoints** — restore a VM on a different node.
3. **Point-in-time backups** — manual or scheduled.

---

## Architecture

**Agent watches `ImpVMSnapshot` child objects directly.** The agent reconciler gains a
second controller that filters to snapshot executions where the source VM is scheduled on
its node. This is the same pattern as the existing `ImpVMReconciler` and requires no new
operator→agent RPC channel.

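The node filter described above can be sketched as a plain predicate. This is a minimal illustration, not the controller's actual code: the `VM`, `Snapshot`, and `vmIndex` types and the `ownsSnapshot` function are hypothetical stand-ins, and a real agent would resolve the source VM through its cached client rather than a map. The point of the design is visible in the last line: each agent claims only executions whose source VM runs on its own node, so no operator-to-agent channel is needed.

```go
package main

import "fmt"

// VM and Snapshot are hypothetical, trimmed-down stand-ins for the ImpVM and
// ImpVMSnapshot API types; only the fields the filter needs are modelled.
type VM struct {
	Namespace string
	Name      string
	NodeName  string // mirrors spec.nodeName
}

type Snapshot struct {
	SourceVMNamespace string
	SourceVMName      string
}

// vmIndex maps "namespace/name" to a VM, standing in for a cached client read.
type vmIndex map[string]VM

// ownsSnapshot reports whether the agent running on nodeName should reconcile
// snap: the source VM must exist and be scheduled on that node.
func ownsSnapshot(nodeName string, snap Snapshot, vms vmIndex) bool {
	vm, ok := vms[snap.SourceVMNamespace+"/"+snap.SourceVMName]
	if !ok {
		return false // unknown source VM: no agent claims the execution
	}
	return vm.NodeName == nodeName
}

func main() {
	vms := vmIndex{
		"default/ci-runner": {Namespace: "default", Name: "ci-runner", NodeName: "node-a"},
	}
	snap := Snapshot{SourceVMNamespace: "default", SourceVMName: "ci-runner"}
	fmt.Println("node-a owns:", ownsSnapshot("node-a", snap, vms))
	fmt.Println("node-b owns:", ownsSnapshot("node-b", snap, vms))
}
```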

---

## CRD Design

### Updated `SnapshotStorageSpec`

```go
type SnapshotStorageSpec struct {
	// +kubebuilder:validation:Enum=node-local;oci-registry
	Type        string           `json:"type"`
	NodeLocal   *NodeLocalSpec   `json:"nodeLocal,omitempty"`
	OCIRegistry *OCIRegistrySpec `json:"ociRegistry,omitempty"`
}

type NodeLocalSpec struct {
	// Path is the base directory for snapshot artifacts on the node.
	// Can point to a remotely-mounted path (NFS, etc.).
	// +kubebuilder:default="/var/lib/imp/snapshots"
	Path string `json:"path,omitempty"`
}
```

### Updated `ImpVMSnapshotSpec`

```go
type ImpVMSnapshotSpec struct {
	SourceVMName      string `json:"sourceVMName"`
	SourceVMNamespace string `json:"sourceVMNamespace"`

	// Schedule is an optional cron expression for recurring snapshots.
	// +optional
	Schedule string `json:"schedule,omitempty"`

	// Retention is the max number of completed executions to keep.
	// +kubebuilder:default=3
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=10
	Retention int32 `json:"retention,omitempty"`

	Storage SnapshotStorageSpec `json:"storage"`

	// BaseSnapshot pins a specific child execution as the elected base image.
	// Set declaratively or via `kubectl imp elect`.
	// Consumers (ImpWarmPool, ImpVMMigration) use this as their boot source.
	// +optional
	BaseSnapshot string `json:"baseSnapshot,omitempty"`
}
```

### Updated `ImpVMSnapshotStatus`

```go
type ImpVMSnapshotStatus struct {
	Phase           string       `json:"phase,omitempty"`
	Digest          string       `json:"digest,omitempty"`
	SnapshotPath    string       `json:"snapshotPath,omitempty"`
	CompletedAt     *metav1.Time `json:"completedAt,omitempty"`
	NextScheduledAt *metav1.Time `json:"nextScheduledAt,omitempty"`

	// LastExecutionRef is the most recently created child execution.
	// +optional
	LastExecutionRef *corev1.LocalObjectReference `json:"lastExecutionRef,omitempty"`

	// BaseSnapshot mirrors spec.baseSnapshot once the referenced child is
	// validated as Succeeded. Consumers read this field only.
	// +optional
	BaseSnapshot string `json:"baseSnapshot,omitempty"`

	// +optional
	// +listType=map
	// +listMapKey=type
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```

### Child execution objects

Each snapshot execution is itself an `ImpVMSnapshot` object owned by the parent, labelled
`imp.dev/snapshot-parent: <parent-name>`. Each child carries a copy of the parent's spec
and its own status. After creation, only a child's status is ever updated; its spec is
never modified.

Child status gains one additional field:

```go
// TerminatedAt is set by the agent when the execution reaches a terminal state
// (Succeeded or Failed) and all cleanup is complete.
// The operator uses this — not phase alone — to gate new execution creation.
// +optional
TerminatedAt *metav1.Time `json:"terminatedAt,omitempty"`
```


---

## Snapshot Format

Firecracker produces two files per snapshot:

- **State file** — CPU registers, device state, VM metadata (on the order of kilobytes)
- **Memory file** — full RAM dump (≈ VM memory size)

### `node-local` backend

Files are written to `{spec.storage.nodeLocal.path}/{namespace}/{parent-name}/{child-name}/`:

```
/var/lib/imp/snapshots/default/my-snap/my-snap-20260305-0200/
  vm.state
  vm.mem
```

The path is configurable per snapshot and can point at NFS, remote block storage, or any
other POSIX-accessible mount.

142+
### `oci-registry` backend
143+
144+
Files pushed as a two-layer OCI image:
145+
146+
| Layer | Content | Media type |
147+
|---|---|---|
148+
| Layer 1 | `vm.state` (gzipped) | `application/vnd.oci.image.layer.v1.tar+gzip` |
149+
| Layer 2 | `vm.mem` (gzipped) | `application/vnd.oci.image.layer.v1.tar+gzip` |
150+
151+
Tag format: `{repository}:{namespace}-{parent-name}-{timestamp}`
152+
153+
Standard OCI tooling (crane, skopeo) works out of the box. **Spegel** (recommended
154+
cluster setup) distributes layers P2P via Kademlia DHT — nodes that already have a layer
155+
serve it directly to peers, eliminating registry round-trips. Unchanged state-file layers
156+
are automatically deduplicated across snapshots of the same VM.
157+
158+
**Snapshot type:** Full only. Diff snapshots (changed pages only) are on the roadmap.
159+
160+
---
161+
162+
## Reconciliation Flow
163+
164+
### Operator-side `ImpVMSnapshotReconciler` (parent objects)
165+
166+
1. **One-shot** (no schedule): creates a single child execution object. Sets
167+
`status.lastExecutionRef`. Waits.
168+
2. **Scheduled**: parses `spec.schedule`. On each cron tick:
169+
- Checks for any child without a `terminatedAt` — skips tick if found
170+
(`concurrencyPolicy: Forbid` semantics).
171+
- Creates new child (copy of parent spec + `ownerReference` +
172+
`imp.dev/snapshot-parent` label).
173+
- Updates `status.lastExecutionRef` and `status.nextScheduledAt`.
174+
- Prunes oldest children beyond `retention` (sorted by `creationTimestamp`,
175+
deletes from oldest).
176+
3. **BaseSnapshot validation**: when `spec.baseSnapshot` is set, verifies the named
177+
child exists and has `phase=Succeeded`. If valid, mirrors to `status.baseSnapshot`.
178+
If invalid, sets a `BaseSnapshotInvalid` condition.
179+
180+
### Agent-side `ImpVMSnapshotReconciler` (child execution objects)
181+
182+
Filters: `spec.sourceVMNamespace/sourceVMName` must map to a VM with
183+
`spec.nodeName == r.NodeName`.
184+
185+
1. Set child `status.phase = Running`.
186+
2. Look up the source VM's `fcProc` in the driver.
187+
3. Call `d.Driver.Snapshot(ctx, vm)`:
188+
- **Always** registers `defer d.Driver.ResumeVM(vm)` first — VM is never left paused.
189+
- Calls Firecracker `PauseVm`.
190+
- Calls Firecracker `CreateSnapshot` → returns `{StatePath, MemPath}`.
191+
- Resume happens via deferred call.
192+
4. **`node-local`**: move files to configured path. Set `status.snapshotPath`.
193+
5. **`oci-registry`**: push two-layer OCI image via `go-containerregistry`. Set
194+
`status.digest`.
195+
6. On success: set `phase=Succeeded`, `completedAt=now`, `terminatedAt=now`.
196+
7. On any error: ensure cleanup complete (temp files removed, VM confirmed running),
197+
then set `phase=Failed`, `status.message=<reason>`, `terminatedAt=now`.
198+
199+
The `terminatedAt` field is set only after all cleanup is complete — not at the moment
200+
of failure. This is the serialisation gate: the operator will not create the next child
201+
until `terminatedAt` is populated.
202+
203+
---
204+
205+
## Fault Tolerance
206+
207+
| Scenario | Handling |
208+
|---|---|
209+
| VM not Running | Child `Failed/VMNotReady`. Operator requeues 30 s. |
210+
| Firecracker pause fails | Child `Failed/PauseFailed`. VM never paused. |
211+
| `CreateSnapshot` fails after pause | `defer ResumeVM` fires. Child `Failed/SnapshotFailed`. |
212+
| OCI push fails | Child `Failed/PushFailed`. State+memory files kept in temp dir. Operator creates new child on next reconcile (exponential backoff). Temp dir cleaned on retry start. |
213+
| Agent crash mid-snapshot | Child stuck `Running`. After `snapshotTimeout` (default 5 m), operator sets child `Failed/Timeout`. |
214+
| Node goes down | Same timeout applies. Operator detects stale `Running` child, creates replacement on next tick. |
215+
216+
**Core invariant:** `defer d.Driver.ResumeVM(vm)` is registered before the first
217+
Firecracker API call. The VM resumes regardless of what fails after that point.
218+
219+
---
220+
221+
## Serialisation Guarantee
222+
223+
> **At most one active execution per parent at any time.**
224+
225+
The operator checks for any child missing `terminatedAt` before creating the next one.
226+
A failed execution must reach terminal state (agent sets `terminatedAt` after cleanup)
227+
before the next one fires. If the agent crashes before setting `terminatedAt`, the
228+
operator's `snapshotTimeout` forces the child to `Failed/Timeout` with a synthetic
229+
`terminatedAt`, unblocking the schedule.
230+
231+
---
232+
233+
## Base Image Election
234+
235+
A snapshot execution becomes a base image only by **explicit election** — it is never
236+
automatic.
237+
238+
- **Imperative:** `kubectl imp elect <parent> [--execution <child-name>]`
239+
If `--execution` is omitted, the latest `Succeeded` child is elected.
240+
- **Declarative:** patch `spec.baseSnapshot: <child-name>`.
241+
242+
`ImpWarmPool` reads `status.baseSnapshot`. If unset, the pool stays idle — there is no
243+
implicit fallback to latest. This prevents accidental rollout of an unvalidated snapshot
244+
to production CI runners.
245+
246+
Elected children are exempt from retention pruning. The operator never deletes a child
247+
referenced by `status.baseSnapshot`.
248+
249+
---
250+
251+
## ImpWarmPool Integration
252+
253+
`ImpWarmPool.Spec.SnapshotRef` names an `ImpVMSnapshot` parent. The warm pool controller
254+
resolves the boot artifact as follows:
255+
256+
1. If `status.baseSnapshot` is set on the parent → use that child's artifact.
257+
2. Otherwise → pool stays idle, emits `BaseSnapshotNotElected` event.
258+
259+
This means warm pools require an explicit election before they boot any VMs.
260+
261+
---
262+
263+
## Trigger Surfaces
264+
265+
| Surface | Mechanism |
266+
|---|---|
267+
| Declarative (one-shot) | Create `ImpVMSnapshot` resource |
268+
| Declarative (scheduled) | Create `ImpVMSnapshot` with `spec.schedule` |
269+
| Imperative (one-shot) | `kubectl imp snapshot <vm>` — creates resource, streams status |
270+
| Imperative (elect) | `kubectl imp elect <parent> [--execution <child>]` |
271+
272+
---
273+
274+
## Deferred
275+
276+
- **Diff snapshots** — changed-pages-only memory dumps for space efficiency. Requires
277+
dependency-aware retention (cannot prune base of a diff chain).
278+
- **Velero plugin** — direct integration for backup workflows.
279+
- **Pre-pull on agent startup** — agent DaemonSet proactively pulls elected OCI artifacts
280+
to node-local storage ahead of warm pool scheduling.
