SecAI-Hub
diff --git a/README.md b/README.md
Lines changed: 4 additions & 3 deletions
diff --git a/docs/production-operations.md b/docs/production-operations.md
Lines changed: 329 additions & 0 deletions
diff --git a/docs/security-status.md b/docs/security-status.md
Lines changed: 2 additions & 1 deletion
@@ -158,7 +158,7 @@ Every model passes through the same fully automatic pipeline:
 | **Updates** | Cosign-verified rpm-ostree, staged workflow, greenboot auto-rollback |
 | **Supply Chain** | Per-service CycloneDX SBOMs, SLSA3 provenance attestation, cosign-signed checksums |
 
-See [docs/threat-model.md](docs/threat-model.md) for threat classes, residual risks, and security invariants. See [docs/security-status.md](docs/security-status.md) for implementation status of all 49 milestones.
+See [docs/threat-model.md](docs/threat-model.md) for threat classes, residual risks, and security invariants. See [docs/security-status.md](docs/security-status.md) for implementation status of all 50 milestones.
 
 ### Verify Image Signatures
 
@@ -241,7 +241,7 @@ All CI jobs are defined in [`.github/workflows/ci.yml`](.github/workflows/ci.yml
 | [Threat Model](docs/threat-model.md) | Threat classes, invariants, residual risks |
 | [API Reference](docs/api.md) | HTTP API for all services |
 | [Policy Schema](docs/policy-schema.md) | Full policy.yaml schema reference |
-| [Security Status](docs/security-status.md) | Implementation status of all 49 milestones |
+| [Security Status](docs/security-status.md) | Implementation status of all 50 milestones |
 | [Test Matrix](docs/test-matrix.md) | Test coverage: 1,141 tests across Go and Python (see [test-counts.json](docs/test-counts.json)) |
 | [Compatibility Matrix](docs/compatibility-matrix.md) | GPU, VM, and hardware support |
 | [Security Test Matrix](docs/security-test-matrix.md) | Security feature test coverage |
@@ -378,7 +378,7 @@ See [docs/test-matrix.md](docs/test-matrix.md) for full breakdown.
 ## Roadmap
 
 <details>
-<summary>All 49 project milestones (click to expand)</summary>
+<summary>All 50 project milestones (click to expand)</summary>
 
 - [x] **Milestone 0** -- Threat model, dataflow, invariants, policy files
 - [x] **Milestone 1** -- Bootable OS, encrypted vault, GPU drivers
@@ -430,6 +430,7 @@ See [docs/test-matrix.md](docs/test-matrix.md) for full breakdown.
 - [x] **Milestone 47** -- CI enforcement hardening: enforced vulnerability scanning (govulncheck + pip-audit + bandit fail on HIGH/HIGH) with waiver mechanism, mypy type checking for security-sensitive services, pinned reproducible Python CI dependencies, Go 1.23→1.25 (12 stdlib CVE fixes), verification-first bootstrap docs
 - [x] **Milestone 48** -- Production hardening: build script fail-closed (fatal errors for 12 required services + binary verification gate), incident store fsync (crash-safe persistence), GPU backend metadata recording, llama-server watchdog (Type=notify + WatchdogSec=30), model catalog externalization (YAML with fallback), circuit breaker for inter-service HTTP calls, post-upgrade model verification in Greenboot, cosign key rotation documentation (full lifecycle)
 - [x] **Milestone 49** -- Signed-first install path: bootstrap script configures signing policy before first rebase (eliminates unverified transport), digest-pinned install flow (CI publishes digests in build summary + release assets), first-boot setup wizard (interactive integrity verification + vault + TPM2 + health check), recovery/dev path separated into dedicated doc
+- [x] **Milestone 50** -- Production operations package: backup/restore scripts (full/config/logs/keys categories, age/gpg encryption, SHA256 manifest, LUKS header backup/restore), rollback decision matrix (Greenboot auto-rollback + manual criteria), 5 break-glass recovery procedures, formal data retention policy (7 data classes, disk capacity thresholds)
 
 </details>
 
 
@@ -241,3 +241,332 @@ All Go services handle SIGTERM for clean shutdown:
 - Incident recorder flushes persistence file
 - Audit log files are closed cleanly
 - systemd `TimeoutStopSec=15` enforces hard deadline
+
+## Backup and Restore
+
+### What Is Backed Up
+
+| Category | Paths | Criticality |
+|----------|-------|-------------|
+| Policy + config | `/etc/secure-ai/policy/*.yaml`, `/etc/secure-ai/config/appliance.yaml`, `/etc/secure-ai/model-catalog.yaml` | High |
+| Incidents | `/var/lib/secure-ai/data/incidents.jsonl` | High — audit trail |
+| Audit logs | `/var/lib/secure-ai/logs/*.jsonl` | High — hash-chained evidence |
+| Registry manifest | `/var/lib/secure-ai/registry/manifest.json` | Medium — model inventory |
+| Signing keys | `/var/lib/secure-ai/keys/` (cosign, TPM2) | Critical |
+| LUKS header | Vault partition header | Critical — data recovery |
+
+**Note:** Model files (GGUF binaries) are NOT included in backups due to their
+size (4–70 GB each). The registry manifest is backed up, so you know exactly
+which models to re-download after a restore.
+
+### Creating a Backup
+
+```bash
+# Full backup (config + logs + keys + manifest + LUKS header)
+sudo secai-backup.sh full
+
+# Config only (policy, appliance config, model catalog)
+sudo secai-backup.sh config
+
+# Logs and incidents only
+sudo secai-backup.sh logs
+
+# Keys and LUKS header only (most sensitive)
+sudo secai-backup.sh keys --encrypt
+
+# Full encrypted backup to external USB
+sudo secai-backup.sh full --encrypt --output /media/usb/backups
+```
+
+### Verifying a Backup
+
+```bash
+sudo secai-backup.sh verify /var/lib/secure-ai/backups/secai-backup-full-20260314-120000.tar.gz
+```
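A minimal sketch of how a SHA256 manifest check of this kind can work, run on throwaway data rather than a real archive. The manifest file name `MANIFEST.sha256` and the helper names `make_manifest` and `check_manifest` are hypothetical; the actual layout used by `secai-backup.sh` is not shown in this diff.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Write a manifest of SHA256 digests for every file under a directory.
# MANIFEST.sha256 is a hypothetical name chosen for this sketch.
make_manifest() {
    local dir="$1"
    ( cd "$dir" && find . -type f ! -name MANIFEST.sha256 -print0 \
        | sort -z | xargs -0 -r sha256sum > MANIFEST.sha256 )
}

# Re-hash every listed file; returns non-zero if any digest differs
# or a listed file is missing.
check_manifest() {
    local dir="$1"
    ( cd "$dir" && sha256sum --quiet -c MANIFEST.sha256 ) >/dev/null 2>&1
}

# Demo on temporary files, not on a real backup.
demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT
printf 'default: deny\n' > "$demo_dir/policy.yaml"
printf '{"event":"boot"}\n' > "$demo_dir/audit.jsonl"

make_manifest "$demo_dir"
if check_manifest "$demo_dir"; then echo "intact: OK"; fi

printf 'default: allow\n' > "$demo_dir/policy.yaml"   # simulate tampering
if ! check_manifest "$demo_dir"; then echo "tampered: detected"; fi
```

A manifest stored inside the archive only detects corruption. It does not stop someone who can rewrite both the files and the manifest, which is why the encrypted and off-device storage guidance still matters.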
+
+### Restoring from Backup
+
+```bash
+# Inspect backup contents before restoring
+sudo secai-restore.sh inspect <backup-file>
+
+# Full restore
+sudo secai-restore.sh full <backup-file>
+
+# Selective restore (config, logs, or keys)
+sudo secai-restore.sh config <backup-file>
+sudo secai-restore.sh logs <backup-file>
+sudo secai-restore.sh keys <backup-file>    # Requires YES confirmation for LUKS header
+```
+
+After restore, the script automatically restarts services and runs the health check.
+
+### Backup Schedule
+
+| Environment | Frequency | Retention | Encryption |
+|-------------|-----------|-----------|------------|
+| Production | Daily (config + logs), weekly (full) | 30 days | Required |
+| Development | Weekly (full) | 14 days | Optional |
+| Pre-upgrade | Immediately before upgrade | Until next successful upgrade verified | Required |
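The age-based retention rows above could be enforced with something like the following sketch. The function name `prune_backups` is hypothetical, and the `secai-backup-*.tar.gz*` pattern is inferred from the verify example; encrypted archives may use a different suffix.

```bash
#!/usr/bin/env bash
set -euo pipefail

# List backup archives older than N days; delete them only when the third
# argument is "--delete". The file name pattern is an assumption.
prune_backups() {
    local dir="$1" days="$2" mode="${3:-}"
    if [ "$mode" = "--delete" ]; then
        find "$dir" -maxdepth 1 -type f -name 'secai-backup-*.tar.gz*' \
            -mtime +"$days" -print -delete
    else
        find "$dir" -maxdepth 1 -type f -name 'secai-backup-*.tar.gz*' \
            -mtime +"$days" -print
    fi
}

# Demo in a temporary directory with one stale and one fresh archive.
demo_dir="$(mktemp -d)"
touch -d '45 days ago' "$demo_dir/secai-backup-full-old.tar.gz"
touch "$demo_dir/secai-backup-full-new.tar.gz"

prune_backups "$demo_dir" 30            # dry run: prints the stale archive
prune_backups "$demo_dir" 30 --delete   # removes it
ls "$demo_dir"
```

Pre-upgrade backups are retained until the next upgrade is verified, not for a fixed number of days, so they would need to be kept outside any directory pruned this way.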
+
+### Backup Storage
+
+- Store backups on a **separate physical device** (USB, NAS, air-gapped machine).
+- Never store unencrypted key backups on network-attached storage.
+- The LUKS header backup + vault passphrase can decrypt the entire vault — treat as highly sensitive.
+- Verify backup integrity periodically: `sudo secai-backup.sh verify <file>`.
+
+## Rollback Decision Matrix
+
+### Automatic Rollback (Greenboot)
+
+Greenboot triggers automatic `rpm-ostree rollback` when health checks fail
+after boot. The checks are defined in
+`/etc/greenboot/check/required.d/01-secure-ai-health.sh`:
+
+| Check | Failure Condition | Auto-Rollback? |
+|-------|-------------------|----------------|
+| nftables service | Not active | Yes |
+| Registry / Tool Firewall / UI | Enabled but failed to start within 60s | Yes |
+| Registry API | `/health` unreachable after 30s | Yes |
+| Model integrity | SHA256 mismatch against manifest | Yes |
+| nftables rules | `secure_ai` table not loaded | Yes |
+| Critical scripts | securectl / verify-boot-chain / canary-check missing | Yes |
+
+Greenboot makes at most **2 automatic rollback attempts**. Once they are
+exhausted, the system halts on the broken deployment and waits for manual
+intervention (see Break-Glass Scenario 5 below).
+
+### Manual Rollback Criteria
+
+| Symptom | Severity | Action | Rationale |
+|---------|----------|--------|-----------|
+| Inference quality degraded, all services healthy | Low | Fix forward | Not a security regression |
+| Single non-critical service failing (e.g., GPU watch) | Low | Fix forward | Other services compensate |
+| Policy engine or tool firewall failing | **Critical** | **Rollback** | Security enforcement compromised |
+| Attestation stuck in failed state | **Critical** | **Rollback** | Trust root broken |
+| Multiple services crash-looping | High | **Rollback** | Systemic regression |
+| Disk full preventing log writes | Medium | Fix forward | Clear space, not a code issue |
+| Network rules missing or wrong | **Critical** | **Rollback** | Default-deny bypassed |
+| Incident recorder down | **Critical** | **Rollback** | Audit trail broken |
+| UI unreachable but API services healthy | Low | Fix forward | Non-critical for security |
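For on-call tooling, the matrix above can be encoded as a lookup. This is only a sketch: the symptom keys (`policy-engine-down`, `disk-full`, and so on) and the function name `rollback_decision` are hypothetical labels for the table rows, not identifiers from the codebase.

```bash
#!/usr/bin/env bash

# Map a symptom key to the action from the manual rollback criteria table.
# Returns non-zero for symptoms the table does not cover.
rollback_decision() {
    case "$1" in
        policy-engine-down|tool-firewall-down|attestation-failed|\
        network-rules-wrong|incident-recorder-down|multi-service-crashloop)
            echo "rollback" ;;
        inference-degraded|single-noncritical-down|disk-full|ui-unreachable)
            echo "fix-forward" ;;
        *)
            echo "unknown"
            return 1 ;;
    esac
}

for symptom in incident-recorder-down ui-unreachable; do
    echo "$symptom -> $(rollback_decision "$symptom")"
done
```

Every row that compromises enforcement or the audit trail maps to `rollback`. Anything unlisted returns a non-zero status so a caller escalates to a human instead of guessing.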
+
+### Rollback Procedure
+
+```bash
+# Manual rollback
+sudo rpm-ostree rollback
+sudo systemctl reboot
+
+# Or via the update verification tool
+sudo /usr/libexec/secure-ai/update-verify.sh rollback
+```
+
+### Post-Rollback Verification
+
+```bash
+# Health check
+sudo /usr/libexec/secure-ai/first-boot-check.sh
+
+# Check deployment status
+rpm-ostree status
+
+# Review journal for the failed deployment
+journalctl -u 'secure-ai-*' --since "1 hour ago" -g 'FAIL|ERROR|panic'
+```
+
+## Break-Glass Procedures
+
+These procedures are for exceptional situations where normal operational
+tools are unavailable. Each requires physical or console access to the
+appliance.
+
+### Scenario 1: Service Token Lost or Corrupted
+
+**Symptoms:** All inter-service calls fail with 401. Services are running but
+cannot communicate.
+
+**Diagnosis:**
+```bash
+ls -la /run/secure-ai/service-token   # Missing or empty?
+curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8470/health  # Prints 401?
+```
+
+**Recovery:**
+```bash
+# Generate a new token
+openssl rand -hex 32 | sudo tee /run/secure-ai/service-token > /dev/null
+sudo chmod 0640 /run/secure-ai/service-token
+
+# Restart all services so they pick up the new token
+# (quote the glob so the shell does not expand it against local files)
+sudo systemctl restart 'secure-ai-*.service'
+
+# Verify
+sudo /usr/libexec/secure-ai/first-boot-check.sh
+```
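A small pre-restart sanity check for the regenerated token might look like the sketch below. It assumes the token is exactly the 64 lowercase hex characters produced by `openssl rand -hex 32`; whether the services enforce that format is not stated here. The function name `token_ok` is hypothetical, and the demo generates its token from `/dev/urandom` so it runs without `openssl`.

```bash
#!/usr/bin/env bash

# Succeeds only if the file is non-empty and holds one line of exactly
# 64 lowercase hex characters (assumed token format).
token_ok() {
    local f="$1"
    [ -s "$f" ] || return 1
    [ "$(wc -l < "$f")" -le 1 ] || return 1
    grep -Eqx '[0-9a-f]{64}' "$f"
}

# Demo against a temporary file, not /run/secure-ai/service-token.
demo_token="$(mktemp)"
head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n' > "$demo_token"
echo >> "$demo_token"

if token_ok "$demo_token"; then
    echo "token looks well-formed"
else
    echo "token missing or malformed"
fi
```

On the appliance the same check would be pointed at `/run/secure-ai/service-token` before restarting services.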
+
+### Scenario 2: Attestation Stuck in Failed State
+
+**Symptoms:** Services frozen in degraded mode. Incident recorder shows a
+latched `attestation_failure` or `integrity_violation`. The normal recovery
+ceremony fails if the incident recorder is unreachable.
+
+**Diagnosis:**
+```bash
+curl -sf http://127.0.0.1:8505/health    # Attestor healthy?
+curl -sf http://127.0.0.1:8515/health    # Incident recorder healthy?
+curl -s http://127.0.0.1:8515/api/v1/stats | python3 -m json.tool
+```
+
+**Recovery Option A** (if incident recorder is reachable): Run the
+[recovery ceremony](recovery-runbook.md) — acknowledge → re-attest → resolve.
+
+**Recovery Option B** (if incident recorder is unreachable):
+```bash
+# Stop all services (quote the glob so the shell does not expand it)
+sudo systemctl stop 'secure-ai-*.service'
+
+# Clear panic state
+sudo rm -f /run/secure-ai/panic-state.json
+
+# Regenerate service token
+openssl rand -hex 32 | sudo tee /run/secure-ai/service-token > /dev/null
+sudo chmod 0640 /run/secure-ai/service-token
+
+# Restart services
+sudo systemctl start 'secure-ai-*.service'
+
+# Run the full health check, then the recovery ceremony (recovery-runbook.md)
+sudo /usr/libexec/secure-ai/first-boot-check.sh
+```
+
+### Scenario 3: System Locked After Level 1 Panic
+
+**Context:** `securectl panic 1` was triggered — vault locked, services stopped,
+sessions invalidated. This is fully reversible.
+
+**Recovery:**
+```bash
+# Unlock the vault (you will be prompted for the passphrase)
+sudo cryptsetup open /dev/<vault-partition> secure-ai-vault
+sudo mount /dev/mapper/secure-ai-vault /var/lib/secure-ai
+
+# Regenerate service token (invalidated by panic)
+openssl rand -hex 32 | sudo tee /run/secure-ai/service-token > /dev/null
+sudo chmod 0640 /run/secure-ai/service-token
+
+# Start all services (quote the glob so the shell does not expand it)
+sudo systemctl start 'secure-ai-*.service'
+
+# Verify
+sudo /usr/libexec/secure-ai/first-boot-check.sh
+```
+
+Find the vault partition: `grep secure-ai-vault /etc/crypttab`.
+
+### Scenario 4: Signing Policy Breaks
+
+**Context:** `rpm-ostree upgrade` fails with signature verification errors.
+The signing policy (`policy.json` or cosign public key) is corrupted.
+
+**Diagnosis:**
+```bash
+cat /etc/containers/policy.json | python3 -m json.tool    # Valid JSON?
+cat /etc/containers/registries.d/secai-os.yaml            # Present?
+ls /etc/pki/containers/secai-cosign.pub                   # Present?
+```
+
+**Recovery:** Re-run the bootstrap script in dry-run mode first, then for real:
+```bash
+curl -sSfL https://raw.githubusercontent.com/SecAI-Hub/SecAI_OS/main/files/scripts/secai-bootstrap.sh \
+  -o /tmp/secai-bootstrap.sh
+sudo bash /tmp/secai-bootstrap.sh --dry-run   # verify
+sudo bash /tmp/secai-bootstrap.sh             # apply
+```
+
+See [recovery-bootstrap.md](install/recovery-bootstrap.md) for the full manual
+fallback procedure.
+
+### Scenario 5: Greenboot Exhaustion (Max Rollbacks Reached)
+
+**Context:** Greenboot hit `MAX_ROLLBACKS=2`. The system is halted on the
+broken deployment. Automatic rollback has stopped to prevent an infinite
+reboot loop.
+
+**Recovery** (requires USB boot media):
+1. Boot from a Fedora Silverblue USB drive (Live session, not install).
+2. Mount the system partition:
+   ```bash
+   sudo mount /dev/sda3 /mnt    # Adjust device as needed
+   ```
+3. Reset the rollback counter:
+   ```bash
+   sudo rm -f /mnt/run/secure-ai/rollback-count
+   ```
+4. Roll back to the last known-good deployment:
+   ```bash
+   sudo chroot /mnt rpm-ostree rollback
+   ```
+5. Reboot into the system (remove USB):
+   ```bash
+   sudo reboot
+   ```
+
+See [recover-failed-update.md](../examples/recover-failed-update.md) for
+additional boot loop recovery scenarios.
+
+## Data Retention Policy
+
+### Retention Requirements
+
+| Data Class | Minimum Retention | Maximum Retention | Rotation | Notes |
+|------------|-------------------|-------------------|----------|-------|
+| Audit logs (`*.jsonl`) | 30 days | 90 days | logrotate (daily, 30 rotations) | Hash-chained; broken chains are snapshotted |
+| Incident store | 12 weeks | 6 months | logrotate (weekly, 12 rotations) | Latched incidents retained until resolution |
+| Forensic bundles | 1 year | Indefinite | Manual export | Export via `/api/v1/forensic/export` before pruning |
+| Backup archives | 30 days (prod) | 90 days | Operator-managed | Encrypted, stored on external media |
+| Model files (GGUF) | While promoted | N/A | Manual prune | Quarantined models auto-expire in 30 days |
+| LUKS header backup | Indefinite | Indefinite | Manual | Critical for recovery; store offline |
+| Panic audit log | 1 year | Indefinite | Not rotated | Emergency event record |
+
+### Disk Capacity Management
+
+When `/var/lib/secure-ai` usage exceeds thresholds (check with
+`df -h /var/lib/secure-ai`):
+
+| Usage | Action |
+|-------|--------|
+| > 70% | Review and prune quarantined models in `/var/lib/secure-ai/quarantine/` |
+| > 80% | Archive oldest audit logs to external media, remove unpromoted models from staging |
+| > 90% | Emergency: force logrotate (`sudo logrotate -f /etc/logrotate.d/secure-ai`), remove all quarantined models |
+| > 95% | **Critical:** Services may fail to write logs. Immediate operator intervention required |
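The thresholds above can be turned into a simple monitoring check. This is a sketch: the function names `usage_action` and `current_usage` and the short action labels are hypothetical, and it falls back to `/` when `/var/lib/secure-ai` does not exist so that it runs on any machine.

```bash
#!/usr/bin/env bash

# Map a usage percentage to the action tier from the table above.
usage_action() {
    local pct="$1"
    if   [ "$pct" -gt 95 ]; then echo "critical"
    elif [ "$pct" -gt 90 ]; then echo "emergency-logrotate"
    elif [ "$pct" -gt 80 ]; then echo "archive-logs"
    elif [ "$pct" -gt 70 ]; then echo "prune-quarantine"
    else echo "ok"
    fi
}

# Print the used-space percentage of the filesystem holding a path
# (GNU df; digits only, no % sign).
current_usage() {
    df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

path="/var/lib/secure-ai"
[ -d "$path" ] || path="/"     # fallback so the sketch runs anywhere
pct="$(current_usage "$path")"
echo "$path: ${pct}% -> $(usage_action "$pct")"
```

The comparisons are strict, matching the `>` in the table: exactly 70% still reports `ok`.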
+
+### Model Pruning
+
+```bash
+# List models by size
+du -sh /var/lib/secure-ai/registry/*.gguf 2>/dev/null | sort -rh
+
+# List quarantined models (safe to remove)
+ls -lh /var/lib/secure-ai/quarantine/incoming/
+ls -lh /var/lib/secure-ai/quarantine/tampered/
+
+# Remove all quarantined models
+sudo rm -rf /var/lib/secure-ai/quarantine/incoming/*
+sudo rm -rf /var/lib/secure-ai/quarantine/tampered/*
+```
+
+### Archive Procedures
+
+Before allowing logrotate to trim old data:
+1. **Export forensic bundle** (preserves incident + audit evidence with HMAC signature):
+   ```bash
+   curl -sSf -o "forensic-$(date +%Y%m%d).json" http://127.0.0.1:8515/api/v1/forensic/export
+   ```
+2. **Create a log-only backup** to external media:
+   ```bash
+   sudo secai-backup.sh logs --encrypt --output /media/usb/archives
+   ```
+3. **Verify the archive** before allowing rotation:
+   ```bash
+   sudo secai-backup.sh verify /media/usb/archives/secai-backup-logs-*.tar.gz
+   ```
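Before rotation proceeds, it may be worth confirming that the forensic export from step 1 is non-empty, well-formed JSON. The sketch below does only that; it does not verify the HMAC signature, whose scheme is not described in this document. The function name `export_ok` is hypothetical, and `python3` is assumed to be available, as in the diagnosis commands elsewhere in this doc.

```bash
#!/usr/bin/env bash

# Succeeds only if the file exists, is non-empty, and parses as JSON.
export_ok() {
    local f="$1"
    [ -s "$f" ] || return 1
    python3 -m json.tool "$f" >/dev/null 2>&1
}

# Demo on temporary files standing in for a real export.
good="$(mktemp)"
bad="$(mktemp)"
printf '{"incidents": [], "audit": []}\n' > "$good"
printf '<html>401 Unauthorized</html>\n' > "$bad"

for f in "$good" "$bad"; do
    if export_ok "$f"; then
        echo "$f: valid JSON"
    else
        echo "$f: NOT a usable export"
    fi
done
```

This catches the common failure where an HTTP error page is saved in place of the bundle.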
@@ -1,6 +1,6 @@
 # Security Implementation Status
 
-This document is split into two sections. The first section covers **Security Assurance Controls** -- all implemented milestones (M0 through M49) that satisfy the M5 security assurance acceptance criteria. Every control listed there is complete and tested. The second section is the **Product Feature Roadmap**, which tracks planned product capabilities (Agent Mode Phases 2 and 3). These are product enhancements, not security assurance requirements; the M5 security posture is fully met without them.
+This document is split into two sections. The first section covers **Security Assurance Controls** -- all implemented milestones (M0 through M50) that satisfy the M5 security assurance acceptance criteria. Every control listed there is complete and tested. The second section is the **Product Feature Roadmap**, which tracks planned product capabilities (Agent Mode Phases 2 and 3). These are product enhancements, not security assurance requirements; the M5 security posture is fully met without them.
 
 Last updated: 2026-03-14
 
@@ -62,6 +62,7 @@ All M5 security assurance criteria are met. The controls below have been impleme
 | CI enforcement hardening | Implemented | M47 | Enforced vulnerability scanning: bandit fails CI on HIGH-severity/HIGH-confidence findings, govulncheck fails on unwaived Go vulns, pip-audit fails on unwaived Python vulns. Waiver mechanism (`.github/vuln-waivers.json`) with mandatory expiry dates for reviewed/accepted findings. mypy type checking gate for security-sensitive services (common, agent, quarantine, ui). Pinned reproducible Python CI dependencies (`requirements-ci.txt`). Go 1.23→1.25 upgrade fixing 12 stdlib CVEs (crypto/tls, crypto/x509, encoding/asn1, net/url, os). Flask 3.1.1→3.1.3 (GHSA-68rp-wp8r-4726). Verification-first bootstrap documentation (signed rebase as default quickstart, unverified bootstrap moved to labeled recovery section). |
 | Production hardening | Implemented | M48 | Build script fail-closed (all `|| echo WARNING` fallbacks replaced with fatal errors for 12 required services, final binary verification gate), incident store fsync (f.Sync() before close on both incident persistence and audit log writes), GPU backend metadata recording (`/etc/secure-ai/gpu-backend.json` written at build time with backend/version/timestamp), llama-server watchdog (Type=notify wrapper with startup health gate + WatchdogSec=30 continuous monitoring), model catalog externalization (`/etc/secure-ai/model-catalog.yaml` with YAML loading + hardcoded fallback), circuit breaker for Python services (closed→open→half-open state machine protecting inter-service HTTP calls), post-upgrade model verification in Greenboot (SHA256 manifest check closes 15-min integrity gap), cosign key rotation documentation (full lifecycle: generation, rotation schedule, distribution, emergency revocation, HSM migration path). 402 Go + 739 Python tests (1,141 total). |
 | Signed-first install path | Implemented | M49 | Signed bootstrap script (`secai-bootstrap.sh`) configures container signing policy (policy.json + registries.d + cosign public key) before first rebase — eliminates unverified transport from production install path. Digest-pinned install flow (CI publishes image digest in build summary and release assets). First-boot setup wizard (interactive verification of image integrity, transport, vault setup, TPM2 sealing, health check). Signing policy files baked into OS image (`/etc/pki/containers/secai-cosign.pub`, `/etc/containers/registries.d/secai-os.yaml`, policy.json merge in build script). Recovery/dev bootstrap path separated into dedicated doc with clear warnings. |
+| Production operations package | Implemented | M50 | Backup script (`secai-backup.sh`) with full/config/logs/keys categories, age/gpg encryption, internal SHA256 manifest, LUKS header backup. Restore script (`secai-restore.sh`) with integrity verification, staging extraction, double-confirmation LUKS header restore, post-restore health check. Production operations doc extended with rollback decision matrix (Greenboot auto-rollback triggers + manual criteria), 5 break-glass recovery procedures (token loss, attestation failure, Level 1 panic lockout, signing policy break, Greenboot exhaustion), formal data retention policy (7 data classes with retention periods, disk capacity thresholds at 70/80/90/95%). |
 
 ---