Commit d52a089

SecAI-Hub and claude committed
M50: Production operations package — backup/restore, rollback matrix, break-glass, retention policy
Add operator-grade lifecycle tooling:

- secai-backup.sh: full/config/logs/keys backup with age/gpg encryption, internal SHA256 manifest, LUKS header backup, audit logging
- secai-restore.sh: integrity verification, staging extraction, selective restore by category, double-confirmation LUKS header restore, post-restore health check
- Rollback decision matrix: Greenboot auto-rollback triggers + manual rollback criteria table with severity/action/rationale
- 5 break-glass recovery procedures: service token loss, attestation failure, Level 1 panic lockout, signing policy break, Greenboot exhaustion
- Formal data retention policy: 7 data classes with retention periods, disk capacity thresholds (70/80/90/95%), model pruning, archive procedures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent f5d4fba commit d52a089

5 files changed

Lines changed: 1101 additions & 4 deletions

README.md

Lines changed: 4 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -158,7 +158,7 @@ Every model passes through the same fully automatic pipeline:
158158
| **Updates** | Cosign-verified rpm-ostree, staged workflow, greenboot auto-rollback |
159159
| **Supply Chain** | Per-service CycloneDX SBOMs, SLSA3 provenance attestation, cosign-signed checksums |
160160

161-
See [docs/threat-model.md](docs/threat-model.md) for threat classes, residual risks, and security invariants. See [docs/security-status.md](docs/security-status.md) for implementation status of all 49 milestones.
161+
See [docs/threat-model.md](docs/threat-model.md) for threat classes, residual risks, and security invariants. See [docs/security-status.md](docs/security-status.md) for implementation status of all 50 milestones.
162162

163163
### Verify Image Signatures
164164

@@ -241,7 +241,7 @@ All CI jobs are defined in [`.github/workflows/ci.yml`](.github/workflows/ci.yml
241241
| [Threat Model](docs/threat-model.md) | Threat classes, invariants, residual risks |
242242
| [API Reference](docs/api.md) | HTTP API for all services |
243243
| [Policy Schema](docs/policy-schema.md) | Full policy.yaml schema reference |
244-
| [Security Status](docs/security-status.md) | Implementation status of all 49 milestones |
244+
| [Security Status](docs/security-status.md) | Implementation status of all 50 milestones |
245245
| [Test Matrix](docs/test-matrix.md) | Test coverage: 1,141 tests across Go and Python (see [test-counts.json](docs/test-counts.json)) |
246246
| [Compatibility Matrix](docs/compatibility-matrix.md) | GPU, VM, and hardware support |
247247
| [Security Test Matrix](docs/security-test-matrix.md) | Security feature test coverage |
@@ -378,7 +378,7 @@ See [docs/test-matrix.md](docs/test-matrix.md) for full breakdown.
378378
## Roadmap
379379

380380
<details>
381-
<summary>All 49 project milestones (click to expand)</summary>
381+
<summary>All 50 project milestones (click to expand)</summary>
382382

383383
- [x] **Milestone 0** -- Threat model, dataflow, invariants, policy files
384384
- [x] **Milestone 1** -- Bootable OS, encrypted vault, GPU drivers
@@ -430,6 +430,7 @@ See [docs/test-matrix.md](docs/test-matrix.md) for full breakdown.
430430
- [x] **Milestone 47** -- CI enforcement hardening: enforced vulnerability scanning (govulncheck + pip-audit + bandit fail on HIGH/HIGH) with waiver mechanism, mypy type checking for security-sensitive services, pinned reproducible Python CI dependencies, Go 1.23→1.25 (12 stdlib CVE fixes), verification-first bootstrap docs
431431
- [x] **Milestone 48** -- Production hardening: build script fail-closed (fatal errors for 12 required services + binary verification gate), incident store fsync (crash-safe persistence), GPU backend metadata recording, llama-server watchdog (Type=notify + WatchdogSec=30), model catalog externalization (YAML with fallback), circuit breaker for inter-service HTTP calls, post-upgrade model verification in Greenboot, cosign key rotation documentation (full lifecycle)
432432
- [x] **Milestone 49** -- Signed-first install path: bootstrap script configures signing policy before first rebase (eliminates unverified transport), digest-pinned install flow (CI publishes digests in build summary + release assets), first-boot setup wizard (interactive integrity verification + vault + TPM2 + health check), recovery/dev path separated into dedicated doc
433+
- [x] **Milestone 50** -- Production operations package: backup/restore scripts (full/config/logs/keys categories, age/gpg encryption, SHA256 manifest, LUKS header backup/restore), rollback decision matrix (Greenboot auto-rollback + manual criteria), 5 break-glass recovery procedures, formal data retention policy (7 data classes, disk capacity thresholds)
433434

434435
</details>
435436

docs/production-operations.md

Lines changed: 329 additions & 0 deletions
@@ -241,3 +241,332 @@ All Go services handle SIGTERM for clean shutdown:
241241
- Incident recorder flushes persistence file
242242
- Audit log files are closed cleanly
243243
- systemd `TimeoutStopSec=15` enforces hard deadline
244+
245+
## Backup and Restore
246+
247+
### What Is Backed Up
248+
249+
| Category | Paths | Criticality |
250+
|----------|-------|-------------|
251+
| Policy + config | `/etc/secure-ai/policy/*.yaml`, `/etc/secure-ai/config/appliance.yaml`, `/etc/secure-ai/model-catalog.yaml` | High |
252+
| Incidents | `/var/lib/secure-ai/data/incidents.jsonl` | High — audit trail |
253+
| Audit logs | `/var/lib/secure-ai/logs/*.jsonl` | High — hash-chained evidence |
254+
| Registry manifest | `/var/lib/secure-ai/registry/manifest.json` | Medium — model inventory |
255+
| Signing keys | `/var/lib/secure-ai/keys/` (cosign, TPM2) | Critical |
256+
| LUKS header | Vault partition header | Critical — data recovery |
257+
258+
**Note:** Model files (GGUF binaries) are NOT included in backups due to their
259+
size (4–70 GB each). The registry manifest is backed up, so you know exactly
260+
which models to re-download after a restore.
261+
262+
### Creating a Backup
263+
264+
```bash
265+
# Full backup (config + logs + keys + manifest + LUKS header)
266+
sudo secai-backup.sh full
267+
268+
# Config only (policy, appliance config, model catalog)
269+
sudo secai-backup.sh config
270+
271+
# Logs and incidents only
272+
sudo secai-backup.sh logs
273+
274+
# Keys and LUKS header only (most sensitive)
275+
sudo secai-backup.sh keys --encrypt
276+
277+
# Full encrypted backup to external USB
278+
sudo secai-backup.sh full --encrypt --output /media/usb/backups
279+
```
280+
281+
### Verifying a Backup
282+
283+
```bash
284+
sudo secai-backup.sh verify /var/lib/secure-ai/backups/secai-backup-full-20260314-120000.tar.gz
285+
```
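The integrity check behind `verify` can be approximated by hand when the script is unavailable. The following is a minimal sketch, assuming an unencrypted tarball that carries a top-level `SHA256SUMS` file in `sha256sum` format; that file name and layout are hypothetical, and the real `secai-backup.sh` manifest may be organized differently.

```bash
#!/usr/bin/env bash
# Sketch: verify a backup tarball against an internal checksum manifest.
# ASSUMPTION: the archive contains a top-level SHA256SUMS file; the actual
# secai-backup.sh layout is not shown in this document and may differ.
verify_archive() {
    local archive="$1" staging rc
    staging="$(mktemp -d)" || return 1
    # Extract into a throwaway staging directory, never over live paths.
    if ! tar -xzf "$archive" -C "$staging"; then
        rm -rf "$staging"
        return 1
    fi
    # sha256sum -c exits non-zero if any listed file is missing or altered.
    ( cd "$staging" && sha256sum --quiet -c SHA256SUMS )
    rc=$?
    rm -rf "$staging"
    return "$rc"
}
```

Extracting to a staging directory first mirrors the "staging extraction" step described for the restore script: nothing touches live paths until every checksum matches.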
286+
287+
### Restoring from Backup
288+
289+
```bash
290+
# Inspect backup contents before restoring
291+
sudo secai-restore.sh inspect <backup-file>
292+
293+
# Full restore
294+
sudo secai-restore.sh full <backup-file>
295+
296+
# Selective restore (config, logs, or keys)
297+
sudo secai-restore.sh config <backup-file>
298+
sudo secai-restore.sh logs <backup-file>
299+
sudo secai-restore.sh keys <backup-file> # Requires YES confirmation for LUKS header
300+
```
301+
302+
After restore, the script automatically restarts services and runs the health check.
303+
304+
### Backup Schedule
305+
306+
| Environment | Frequency | Retention | Encryption |
307+
|-------------|-----------|-----------|------------|
308+
| Production | Daily (config + logs), weekly (full) | 30 days | Required |
309+
| Development | Weekly (full) | 14 days | Optional |
310+
| Pre-upgrade | Immediately before upgrade | Until next successful upgrade verified | Required |
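
The production row of the schedule can be driven from a single nightly job. The helpers below are a sketch under stated assumptions: Sunday is chosen arbitrarily as the weekly full-backup day, and archives are assumed to be named `secai-backup-*` as in the verify example. Neither function is part of `secai-backup.sh`.

```bash
#!/usr/bin/env bash
# Sketch of helpers for a nightly backup job (production schedule).
# ASSUMPTIONS: Sunday is the weekly "full" day; archive files are named
# secai-backup-*. Both are illustrative choices.

# Map an ISO weekday (1=Mon .. 7=Sun, as printed by `date +%u`) to the
# backup categories to run that night.
backup_kinds_for_day() {
    case "$1" in
        7) echo "full" ;;
        [1-6]) echo "config logs" ;;
        *) return 1 ;;
    esac
}

# Delete archives directly inside directory $1 that are older than $2 days.
prune_old_backups() {
    find "$1" -maxdepth 1 -type f -name 'secai-backup-*' -mtime +"$2" -delete
}
```

A nightly wrapper could then loop over `backup_kinds_for_day "$(date +%u)"`, call `sudo secai-backup.sh <kind> --encrypt` for each category, and finish with `prune_old_backups <dir> 30` to apply the 30-day retention.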
311+
312+
### Backup Storage
313+
314+
- Store backups on a **separate physical device** (USB, NAS, air-gapped machine).
315+
- Never store unencrypted key backups on network-attached storage.
316+
- The LUKS header backup + vault passphrase can decrypt the entire vault — treat as highly sensitive.
317+
- Verify backup integrity periodically: `sudo secai-backup.sh verify <file>`.
318+
319+
## Rollback Decision Matrix
320+
321+
### Automatic Rollback (Greenboot)
322+
323+
Greenboot triggers automatic `rpm-ostree rollback` when health checks fail
324+
after boot. The checks are defined in
325+
`/etc/greenboot/check/required.d/01-secure-ai-health.sh`:
326+
327+
| Check | Failure Condition | Auto-Rollback? |
328+
|-------|-------------------|----------------|
329+
| nftables service | Not active | Yes |
330+
| Registry / Tool Firewall / UI | Enabled but failed to start within 60s | Yes |
331+
| Registry API | `/health` unreachable after 30s | Yes |
332+
| Model integrity | SHA256 mismatch against manifest | Yes |
333+
| nftables rules | `secure_ai` table not loaded | Yes |
334+
| Critical scripts | securectl / verify-boot-chain / canary-check missing | Yes |
335+
336+
Maximum **2 automatic rollback attempts**. After exhaustion, the system halts
337+
on the broken deployment for manual intervention (see Break-Glass Scenario 5 below).
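
Several checks in the table are "unreachable after N seconds" conditions. A retry-until-deadline helper is one way to express that in a required check; the sketch below is illustrative only and is not the contents of `01-secure-ai-health.sh`. The function name `wait_for` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of a greenboot-style probe helper: retry a command until a deadline.
# ASSUMPTION: illustrative only; not taken from 01-secure-ai-health.sh.

# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 once the deadline has passed.
wait_for() {
    local timeout="$1" deadline
    shift
    deadline=$(( $(date +%s) + timeout ))
    while :; do
        "$@" >/dev/null 2>&1 && return 0
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 1
    done
}
```

In a check script this might be used as `wait_for 30 curl -sf http://127.0.0.1:8470/health || exit 1`; a non-zero exit from a script under `required.d` is what marks the boot as failed.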
338+
339+
### Manual Rollback Criteria
340+
341+
| Symptom | Severity | Action | Rationale |
342+
|---------|----------|--------|-----------|
343+
| Inference quality degraded, all services healthy | Low | Fix forward | Not a security regression |
344+
| Single non-critical service failing (e.g., GPU watch) | Low | Fix forward | Other services compensate |
345+
| Policy engine or tool firewall failing | **Critical** | **Rollback** | Security enforcement compromised |
346+
| Attestation stuck in failed state | **Critical** | **Rollback** | Trust root broken |
347+
| Multiple services crash-looping | High | **Rollback** | Systemic regression |
348+
| Disk full preventing log writes | Medium | Fix forward | Clear space, not a code issue |
349+
| Network rules missing or wrong | **Critical** | **Rollback** | Default-deny bypassed |
350+
| Incident recorder down | **Critical** | **Rollback** | Audit trail broken |
351+
| UI unreachable but API services healthy | Low | Fix forward | Non-critical for security |
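
Read by severity alone, the matrix reduces to a simple rule: Critical and High symptoms call for rollback, Medium and Low are fixed forward. The sketch below encodes that rule for use in runbook tooling; the function name and the output strings are hypothetical, and the symptom-specific rationale in the table still takes precedence over this shortcut.

```bash
#!/usr/bin/env bash
# Sketch: map a severity from the manual rollback matrix to its action.
# ASSUMPTION: function name and output labels are illustrative.
rollback_action() {
    local sev
    # Normalize case so "Critical", "CRITICAL" and "critical" all match.
    sev="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
    case "$sev" in
        critical|high) echo "rollback" ;;
        medium|low)    echo "fix-forward" ;;
        *)             echo "unknown"; return 1 ;;
    esac
}
```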
352+
353+
### Rollback Procedure
354+
355+
```bash
356+
# Manual rollback
357+
sudo rpm-ostree rollback
358+
sudo systemctl reboot
359+
360+
# Or via the update verification tool
361+
sudo /usr/libexec/secure-ai/update-verify.sh rollback
362+
```
363+
364+
### Post-Rollback Verification
365+
366+
```bash
367+
# Health check
368+
sudo /usr/libexec/secure-ai/first-boot-check.sh
369+
370+
# Check deployment status
371+
rpm-ostree status
372+
373+
# Review journal for the failed deployment
374+
journalctl -u 'secure-ai-*' --since "1 hour ago" -g 'FAIL|ERROR|panic'
375+
```
376+
377+
## Break-Glass Procedures
378+
379+
These procedures are for exceptional situations where normal operational
380+
tools are unavailable. Each requires physical or console access to the
381+
appliance.
382+
383+
### Scenario 1: Service Token Lost or Corrupted
384+
385+
**Symptoms:** All inter-service calls fail with 401. Services are running but
386+
cannot communicate.
387+
388+
**Diagnosis:**
389+
```bash
390+
ls -la /run/secure-ai/service-token # Missing or empty?
391+
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8470/health  # Registry returns 401?
392+
```
393+
394+
**Recovery:**
395+
```bash
396+
# Generate a new token
397+
openssl rand -hex 32 | sudo tee /run/secure-ai/service-token > /dev/null
398+
sudo chmod 0640 /run/secure-ai/service-token
399+
400+
# Restart all services so they pick up the new token
401+
sudo systemctl restart 'secure-ai-*.service'
402+
403+
# Verify
404+
sudo /usr/libexec/secure-ai/first-boot-check.sh
405+
```
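Before restarting services it can help to confirm that the token file really contains what `openssl rand -hex 32` produces: 64 hexadecimal characters. This is a minimal sketch; `token_ok` is a hypothetical helper, and whether the services enforce this exact format is an assumption.

```bash
#!/usr/bin/env bash
# Sketch: sanity-check a service token file.
# ASSUMPTION: a valid token is exactly the output of `openssl rand -hex 32`
# (64 lowercase hex characters); the services may accept other formats.
token_ok() {
    # -s: file exists and is non-empty.
    [ -s "$1" ] || return 1
    # Strip the trailing newline, then require exactly 64 hex characters.
    tr -d '\n' < "$1" | grep -Eq '^[0-9a-f]{64}$'
}
```

For example, `sudo bash -c 'source ./token-check.sh; token_ok /run/secure-ai/service-token'` (with the function saved to a file of your choosing) returns non-zero for a missing, empty, or truncated token.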
406+
407+
### Scenario 2: Attestation Stuck in Failed State
408+
409+
**Symptoms:** Services frozen in degraded mode. Incident recorder shows
410+
latched attestation_failure or integrity_violation. Normal recovery
411+
ceremony may fail if the incident recorder is unreachable.
412+
413+
**Diagnosis:**
414+
```bash
415+
curl -sf http://127.0.0.1:8505/health # Attestor healthy?
416+
curl -sf http://127.0.0.1:8515/health # Incident recorder healthy?
417+
curl -s http://127.0.0.1:8515/api/v1/stats | python3 -m json.tool
418+
```
419+
420+
**Recovery Option A** (if incident recorder is reachable): Run the
421+
[recovery ceremony](recovery-runbook.md) — acknowledge → re-attest → resolve.
422+
423+
**Recovery Option B** (if incident recorder is unreachable):
424+
```bash
425+
# Stop all services
426+
sudo systemctl stop 'secure-ai-*.service'
427+
428+
# Clear panic state
429+
sudo rm -f /run/secure-ai/panic-state.json
430+
431+
# Regenerate service token
432+
openssl rand -hex 32 | sudo tee /run/secure-ai/service-token > /dev/null
433+
sudo chmod 0640 /run/secure-ai/service-token
434+
435+
# Restart services
436+
sudo systemctl start 'secure-ai-*.service'  # globs match only units loaded in memory; name any others explicitly
437+
438+
# Run the full health check, then complete the recovery ceremony (see recovery-runbook.md)
439+
sudo /usr/libexec/secure-ai/first-boot-check.sh
440+
```
441+
442+
### Scenario 3: System Locked After Level 1 Panic
443+
444+
**Context:** `securectl panic 1` was triggered — vault locked, services stopped,
445+
sessions invalidated. This is fully reversible.
446+
447+
**Recovery:**
448+
```bash
449+
# Unlock the vault (you will be prompted for the passphrase)
450+
sudo cryptsetup open /dev/<vault-partition> secure-ai-vault
451+
sudo mount /dev/mapper/secure-ai-vault /var/lib/secure-ai
452+
453+
# Regenerate service token (invalidated by panic)
454+
openssl rand -hex 32 | sudo tee /run/secure-ai/service-token > /dev/null
455+
sudo chmod 0640 /run/secure-ai/service-token
456+
457+
# Start all services
458+
sudo systemctl start 'secure-ai-*.service'  # globs match only units loaded in memory; name any others explicitly
459+
460+
# Verify
461+
sudo /usr/libexec/secure-ai/first-boot-check.sh
462+
```
463+
464+
Find the vault partition: `grep secure-ai-vault /etc/crypttab`.
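
The `grep` above prints the whole crypttab line. To feed the device straight into `cryptsetup open`, the second field can be extracted instead. A small sketch, with `vault_device` as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: print the source device recorded for secure-ai-vault in crypttab.
# crypttab fields are: name, device, key file, options.
# vault_device [CRYPTTAB_PATH]  (defaults to /etc/crypttab)
vault_device() {
    awk '$1 == "secure-ai-vault" { print $2; found = 1; exit }
         END { exit !found }' "${1:-/etc/crypttab}"
}
```

If the entry is written as `UUID=...` rather than a device path, it may need resolving first, for example with `blkid -U <uuid>`.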
465+
466+
### Scenario 4: Signing Policy Breaks
467+
468+
**Context:** `rpm-ostree upgrade` fails with signature verification errors.
469+
The signing policy (`policy.json` or cosign public key) is corrupted.
470+
471+
**Diagnosis:**
472+
```bash
473+
python3 -m json.tool /etc/containers/policy.json        # Valid JSON?
474+
cat /etc/containers/registries.d/secai-os.yaml # Present?
475+
ls /etc/pki/containers/secai-cosign.pub # Present?
476+
```
477+
478+
**Recovery:** Re-run the bootstrap script in dry-run mode first, then for real:
479+
```bash
480+
curl -sSfL https://raw.githubusercontent.com/SecAI-Hub/SecAI_OS/main/files/scripts/secai-bootstrap.sh \
481+
-o /tmp/secai-bootstrap.sh
482+
sudo bash /tmp/secai-bootstrap.sh --dry-run # verify
483+
sudo bash /tmp/secai-bootstrap.sh # apply
484+
```
485+
486+
See [recovery-bootstrap.md](install/recovery-bootstrap.md) for the full manual
487+
fallback procedure.
488+
489+
### Scenario 5: Greenboot Exhaustion (Max Rollbacks Reached)
490+
491+
**Context:** Greenboot hit `MAX_ROLLBACKS=2`. The system is halted on the
492+
broken deployment. Automatic rollback has stopped to prevent an infinite
493+
reboot loop.
494+
495+
**Recovery** (requires USB boot media):
496+
1. Boot from a Fedora Silverblue USB drive (Live session, not install).
497+
2. Mount the system partition:
498+
```bash
499+
sudo mount /dev/sda3 /mnt # Adjust device as needed
500+
```
501+
3. Reset the rollback counter:
502+
```bash
503+
sudo rm -f /mnt/run/secure-ai/rollback-count
504+
```
505+
4. Roll back to the last known-good deployment:
506+
```bash
507+
sudo chroot /mnt rpm-ostree rollback
508+
```
509+
5. Reboot into the system (remove USB):
510+
```bash
511+
sudo reboot
512+
```
513+
514+
See [recover-failed-update.md](../examples/recover-failed-update.md) for
515+
additional boot loop recovery scenarios.
516+
517+
## Data Retention Policy
518+
519+
### Retention Requirements
520+
521+
| Data Class | Minimum Retention | Maximum Retention | Rotation | Notes |
522+
|------------|-------------------|-------------------|----------|-------|
523+
| Audit logs (`*.jsonl`) | 30 days | 90 days | logrotate (daily, 30 rotations) | Hash-chained; broken chains are snapshotted |
524+
| Incident store | 12 weeks | 6 months | logrotate (weekly, 12 rotations) | Latched incidents retained until resolution |
525+
| Forensic bundles | 1 year | Indefinite | Manual export | Export via `/api/v1/forensic/export` before pruning |
526+
| Backup archives | 30 days (prod) | 90 days | Operator-managed | Encrypted, stored on external media |
527+
| Model files (GGUF) | While promoted | N/A | Manual prune | Quarantined models auto-expire in 30 days |
528+
| LUKS header backup | Indefinite | Indefinite | Manual | Critical for recovery; store offline |
529+
| Panic audit log | 1 year | Indefinite | Not rotated | Emergency event record |
530+
531+
### Disk Capacity Management
532+
533+
When `/var/lib/secure-ai` usage exceeds thresholds (check with
534+
`df -h /var/lib/secure-ai`):
535+
536+
| Usage | Action |
537+
|-------|--------|
538+
| > 70% | Review and prune quarantined models in `/var/lib/secure-ai/quarantine/` |
539+
| > 80% | Archive oldest audit logs to external media, remove unpromoted models from staging |
540+
| > 90% | Emergency: force logrotate (`sudo logrotate -f /etc/logrotate.d/secure-ai`), remove all quarantined models |
541+
| > 95% | **Critical:** Services may fail to write logs. Immediate operator intervention required |
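
The threshold table lends itself to a periodic check. The sketch below maps a usage percentage to the matching row; the level names (`review`, `archive`, `emergency`, `critical`) are labels invented here for illustration, and `disk_usage_pct` relies on the GNU coreutils `df --output` option.

```bash
#!/usr/bin/env bash
# Sketch: classify disk usage against the 70/80/90/95% thresholds.
# ASSUMPTION: the level names are illustrative labels, not appliance terms.
disk_action() {
    local pct="$1"
    if   [ "$pct" -gt 95 ]; then echo "critical"   # operator intervention
    elif [ "$pct" -gt 90 ]; then echo "emergency"  # force logrotate, clear quarantine
    elif [ "$pct" -gt 80 ]; then echo "archive"    # archive logs, clear staging
    elif [ "$pct" -gt 70 ]; then echo "review"     # prune quarantined models
    else echo "ok"
    fi
}

# Current usage of the filesystem holding path $1, as a bare integer.
disk_usage_pct() {
    df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}
```

Combined, `disk_action "$(disk_usage_pct /var/lib/secure-ai)"` prints the current level and could be wired into a timer or monitoring hook.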
542+
543+
### Model Pruning
544+
545+
```bash
546+
# List models by size
547+
du -sh /var/lib/secure-ai/registry/*.gguf 2>/dev/null | sort -rh
548+
549+
# List quarantined models (safe to remove)
550+
ls -lh /var/lib/secure-ai/quarantine/incoming/
551+
ls -lh /var/lib/secure-ai/quarantine/tampered/
552+
553+
# Remove all quarantined models
554+
sudo find /var/lib/secure-ai/quarantine/incoming -mindepth 1 -delete
555+
sudo find /var/lib/secure-ai/quarantine/tampered -mindepth 1 -delete
556+
```
557+
558+
### Archive Procedures
559+
560+
Before allowing logrotate to trim old data:
561+
1. **Export forensic bundle** (preserves incident + audit evidence with HMAC signature):
562+
```bash
563+
curl -sf http://127.0.0.1:8515/api/v1/forensic/export > forensic-$(date +%Y%m%d).json
564+
```
565+
2. **Create a log-only backup** to external media:
566+
```bash
567+
sudo secai-backup.sh logs --encrypt --output /media/usb/archives
568+
```
569+
3. **Verify the archive** before allowing rotation:
570+
```bash
571+
sudo secai-backup.sh verify /media/usb/archives/secai-backup-logs-*.tar.gz
572+
```

docs/security-status.md

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,6 @@
11
# Security Implementation Status
22

3-
This document is split into two sections. The first section covers **Security Assurance Controls** -- all implemented milestones (M0 through M49) that satisfy the M5 security assurance acceptance criteria. Every control listed there is complete and tested. The second section is the **Product Feature Roadmap**, which tracks planned product capabilities (Agent Mode Phases 2 and 3). These are product enhancements, not security assurance requirements; the M5 security posture is fully met without them.
3+
This document is split into two sections. The first section covers **Security Assurance Controls** -- all implemented milestones (M0 through M50) that satisfy the M5 security assurance acceptance criteria. Every control listed there is complete and tested. The second section is the **Product Feature Roadmap**, which tracks planned product capabilities (Agent Mode Phases 2 and 3). These are product enhancements, not security assurance requirements; the M5 security posture is fully met without them.
44

55
Last updated: 2026-03-14
66

@@ -62,6 +62,7 @@ All M5 security assurance criteria are met. The controls below have been impleme
6262
| CI enforcement hardening | Implemented | M47 | Enforced vulnerability scanning: bandit fails CI on HIGH-severity/HIGH-confidence findings, govulncheck fails on unwaived Go vulns, pip-audit fails on unwaived Python vulns. Waiver mechanism (`.github/vuln-waivers.json`) with mandatory expiry dates for reviewed/accepted findings. mypy type checking gate for security-sensitive services (common, agent, quarantine, ui). Pinned reproducible Python CI dependencies (`requirements-ci.txt`). Go 1.23→1.25 upgrade fixing 12 stdlib CVEs (crypto/tls, crypto/x509, encoding/asn1, net/url, os). Flask 3.1.1→3.1.3 (GHSA-68rp-wp8r-4726). Verification-first bootstrap documentation (signed rebase as default quickstart, unverified bootstrap moved to labeled recovery section). |
6363
| Production hardening | Implemented | M48 | Build script fail-closed (all `|| echo WARNING` fallbacks replaced with fatal errors for 12 required services, final binary verification gate), incident store fsync (f.Sync() before close on both incident persistence and audit log writes), GPU backend metadata recording (`/etc/secure-ai/gpu-backend.json` written at build time with backend/version/timestamp), llama-server watchdog (Type=notify wrapper with startup health gate + WatchdogSec=30 continuous monitoring), model catalog externalization (`/etc/secure-ai/model-catalog.yaml` with YAML loading + hardcoded fallback), circuit breaker for Python services (closed→open→half-open state machine protecting inter-service HTTP calls), post-upgrade model verification in Greenboot (SHA256 manifest check closes 15-min integrity gap), cosign key rotation documentation (full lifecycle: generation, rotation schedule, distribution, emergency revocation, HSM migration path). 402 Go + 739 Python tests (1,141 total). |
6464
| Signed-first install path | Implemented | M49 | Signed bootstrap script (`secai-bootstrap.sh`) configures container signing policy (policy.json + registries.d + cosign public key) before first rebase — eliminates unverified transport from production install path. Digest-pinned install flow (CI publishes image digest in build summary and release assets). First-boot setup wizard (interactive verification of image integrity, transport, vault setup, TPM2 sealing, health check). Signing policy files baked into OS image (`/etc/pki/containers/secai-cosign.pub`, `/etc/containers/registries.d/secai-os.yaml`, policy.json merge in build script). Recovery/dev bootstrap path separated into dedicated doc with clear warnings. |
65+
| Production operations package | Implemented | M50 | Backup script (`secai-backup.sh`) with full/config/logs/keys categories, age/gpg encryption, internal SHA256 manifest, LUKS header backup. Restore script (`secai-restore.sh`) with integrity verification, staging extraction, double-confirmation LUKS header restore, post-restore health check. Production operations doc extended with rollback decision matrix (Greenboot auto-rollback triggers + manual criteria), 5 break-glass recovery procedures (token loss, attestation failure, Level 1 panic lockout, signing policy break, Greenboot exhaustion), formal data retention policy (7 data classes with retention periods, disk capacity thresholds at 70/80/90/95%). |
6566

6667
---
6768
