| 1 | +# Production Operations Guide |
| 2 | + |
| 3 | +## First Boot |
| 4 | + |
| 5 | +After imaging and booting the appliance for the first time: |
| 6 | + |
| 7 | +```bash |
| 8 | +sudo /usr/libexec/secure-ai/first-boot-check.sh |
| 9 | +``` |
| 10 | + |
| 11 | +This validates: |
| 12 | +- All core services are running |
| 13 | +- Health endpoints respond |
| 14 | +- Attestation state is verified |
| 15 | +- Integrity monitor baseline is established |
| 16 | +- No open incidents |
| 17 | +- Service token is present |
| 18 | +- No services exposed on public interfaces |
| 19 | + |
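The last item in the list can also be spot-checked by hand. The following is a minimal sketch, not the logic of `first-boot-check.sh` itself: the helper name `public_listeners` is hypothetical, and it assumes `ss -tlnH` prints the local address in the fourth column.

```bash
# Print every listening TCP socket whose local address is not loopback.
# Reads `ss -tlnH` output on stdin; column 4 is assumed to be "Local Address:Port".
public_listeners() {
  awk 'NF >= 4 {
    addr = $4
    sub(/:[^:]*$/, "", addr)     # strip the trailing :port
    gsub(/^\[|\]$/, "", addr)    # strip IPv6 brackets
    if (addr !~ /^127\./ && addr != "::1") print $4
  }'
}

# Usage: ss -tlnH | public_listeners
# Empty output means no TCP listener is bound to a non-loopback address.
```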
| 20 | +## Upgrade Procedure |
| 21 | + |
| 22 | +1. **Pre-upgrade snapshot** (if supported by hardware): |
| 23 | + ```bash |
| 24 | + rpm-ostree status # Record current deployment |
| 25 | + sudo cp /var/lib/secure-ai/data/incidents.jsonl /var/tmp/incidents-backup.jsonl # /var/tmp survives the reboot in step 3; /tmp may not |
| 26 | + ``` |
| 27 | + |
| 28 | +2. **Pull update**: |
| 29 | + ```bash |
| 30 | + rpm-ostree upgrade |
| 31 | + ``` |
| 32 | + |
| 33 | +3. **Reboot and verify**: |
| 34 | + ```bash |
| 35 | + sudo systemctl reboot |
| 36 | + # After reboot: |
| 37 | + sudo /usr/libexec/secure-ai/first-boot-check.sh |
| 38 | + ``` |
| 39 | + |
| 40 | +4. **Rollback if needed**: |
| 41 | + ```bash |
| 42 | + rpm-ostree rollback |
| 43 | + sudo systemctl reboot |
| 44 | + ``` |
| 45 | + |
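To confirm after the reboot that the new deployment is actually the booted one, the booted checksum can be compared before and after the upgrade. This is a sketch under assumptions: the helper name `booted_checksum` is hypothetical, and the `deployments`, `booted`, and `checksum` keys are assumed to be present in the output of `rpm-ostree status --json`.

```bash
# Print the checksum of the booted deployment.
# Reads `rpm-ostree status --json` on stdin; the JSON key names are assumptions.
booted_checksum() {
  python3 -c '
import json, sys
for dep in json.load(sys.stdin).get("deployments", []):
    if dep.get("booted"):
        print(dep.get("checksum", ""))
        break
'
}

# Usage (the file path is illustrative):
#   rpm-ostree status --json | booted_checksum > /var/tmp/pre-upgrade.checksum
#   ...upgrade and reboot...
#   rpm-ostree status --json | booted_checksum | diff - /var/tmp/pre-upgrade.checksum
# A difference reported by diff means a new deployment is booted.
```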
| 46 | +## Log Rotation |
| 47 | + |
| 48 | +Audit logs are rotated automatically via `/etc/logrotate.d/secure-ai`: |
| 49 | +- **Audit JSONL** (`/var/lib/secure-ai/logs/*.jsonl`): daily, 30 days retained, max 50MB per file |
| 50 | +- **Incident store** (`/var/lib/secure-ai/data/*.jsonl`): weekly, 12 weeks retained, max 100MB |
| 51 | + |
| 52 | +To manually trigger rotation: |
| 53 | +```bash |
| 54 | +sudo logrotate -f /etc/logrotate.d/secure-ai |
| 55 | +``` |
| 56 | + |
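For orientation, a logrotate stanza matching the policy described above might look roughly like the following. This is a sketch, not the shipped `/etc/logrotate.d/secure-ai`: mapping "max size" to `maxsize`, and the `missingok` and `notifempty` directives, are assumptions.

```
# Assumed layout; the file installed on the appliance may differ.
/var/lib/secure-ai/logs/*.jsonl {
    daily
    rotate 30
    maxsize 50M
    missingok
    notifempty
}

/var/lib/secure-ai/data/*.jsonl {
    weekly
    rotate 12
    maxsize 100M
    missingok
    notifempty
}
```

With `maxsize`, a file that grows past the limit is rotated on the next logrotate run even if the daily or weekly interval has not yet elapsed.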
| 57 | +## Key Rotation |
| 58 | + |
| 59 | +### Service Token |
| 60 | + |
| 61 | +To rotate the inter-service bearer token at `/run/secure-ai/service-token`: |
| 62 | + |
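| 63 | +1. Generate a new token (the restrictive umask keeps it private while it sits in `/tmp`): |
| 64 | + ```bash |
| 65 | + (umask 077 && openssl rand -hex 32 > /tmp/new-token) |
| 66 | + ``` |
| 67 | + |
| 68 | +2. Replace atomically (stage the file inside `/run/secure-ai` so the final `mv` is a same-filesystem rename): |
| 69 | + ```bash |
| 70 | + sudo install -m 0640 /tmp/new-token /run/secure-ai/service-token.new && rm -f /tmp/new-token |
| 71 | + sudo mv /run/secure-ai/service-token.new /run/secure-ai/service-token |
| 72 | + ``` |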
| 73 | + |
| 74 | +3. Restart all services (they read token at startup): |
| 75 | + ```bash |
| 76 | + sudo systemctl restart 'secure-ai-*.service' |
| 77 | + ``` |
| 78 | + |
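Before restarting services, it can be worth confirming that the replaced file looks the way the steps above intend. A minimal sketch follows; the helper name `token_file_ok` is hypothetical, and it assumes GNU `stat` and a token of exactly 64 lowercase hex characters, which is what `openssl rand -hex 32` emits.

```bash
# Return 0 if FILE holds exactly 64 hex characters and has mode 640;
# otherwise print the reason and return 1. The token itself is never printed.
token_file_ok() {
  local file="$1" token mode
  token=$(tr -d '\n' < "$file") || return 1
  mode=$(stat -c '%a' "$file") || return 1
  if ! printf '%s' "$token" | grep -Eq '^[0-9a-f]{64}$'; then
    echo "unexpected token format"
    return 1
  fi
  if [ "$mode" != "640" ]; then
    echo "unexpected mode: $mode"
    return 1
  fi
}

# Usage (as root): token_file_ok /run/secure-ai/service-token && echo "token looks sane"
```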
| 79 | +### HMAC Signing Key (Attestation Bundles) |
| 80 | + |
| 81 | +1. Generate new key: |
| 82 | + ```bash |
| 83 | + (umask 077 && openssl rand 32 > /tmp/new-hmac-key) |
| 84 | + ``` |
| 85 | + |
| 86 | +2. Replace and restart attestor: |
| 87 | + ```bash |
| 88 | + sudo mv /tmp/new-hmac-key /run/secure-ai/attestation-hmac-key |
| 89 | + sudo chmod 0640 /run/secure-ai/attestation-hmac-key |
| 90 | + sudo systemctl restart secure-ai-runtime-attestor |
| 91 | + ``` |
| 92 | + |
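Unlike the service token, this key is written as raw bytes (`openssl rand 32` without `-hex`), so the file should be exactly 32 bytes long and is not printable text. A small sanity check, with the hypothetical helper name `hmac_key_len_ok`:

```bash
# Return 0 if KEYFILE is exactly 32 bytes, the size `openssl rand 32` writes.
hmac_key_len_ok() {
  local keyfile="$1" size
  size=$(wc -c < "$keyfile") || return 1
  [ "$size" -eq 32 ]
}

# Usage (as root): hmac_key_len_ok /run/secure-ai/attestation-hmac-key || echo "unexpected key size"
```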
| 93 | +### Cosign Signing Key (Image & Release) |
| 94 | + |
| 95 | +Rotate via the GitHub repository secrets. Update `SIGNING_SECRET` in repository settings, then trigger a new build. |
| 96 | + |
| 97 | +## Monitoring |
| 98 | + |
| 99 | +### Service Health |
| 100 | + |
| 101 | +All services expose `/health` on their localhost ports: |
| 102 | + |
| 103 | +| Service | Port | Endpoint | |
| 104 | +|---------|------|----------| |
| 105 | +| Policy Engine | 8500 | `/health` | |
| 106 | +| Registry | 8470 | `/health` | |
| 107 | +| Tool Firewall | 8475 | `/health` | |
| 108 | +| Runtime Attestor | 8505 | `/health` | |
| 109 | +| Integrity Monitor | 8510 | `/health` | |
| 110 | +| Incident Recorder | 8515 | `/health` | |
| 111 | +| MCP Firewall | 8496 | `/health` | |
| 112 | +| GPU Integrity Watch | 8495 | `/health` | |
| 113 | + |
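The table above translates directly into a probe loop. The sketch below is illustrative only: the short service labels and the function name `check_all_health` are made up for this example, and it assumes any 2xx response from `/health` means healthy.

```bash
# label:port pairs copied from the table above (labels are illustrative).
SERVICES="policy-engine:8500 registry:8470 tool-firewall:8475 runtime-attestor:8505
integrity-monitor:8510 incident-recorder:8515 mcp-firewall:8496 gpu-integrity-watch:8495"

# Probe /health on every service, print one line per service, and
# return non-zero if any probe fails.
check_all_health() {
  local entry name port failed=0
  for entry in $SERVICES; do
    name=${entry%%:*}
    port=${entry##*:}
    if curl -fsS --max-time 2 -o /dev/null "http://127.0.0.1:${port}/health"; then
      echo "ok   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}

# Usage: check_all_health || echo "at least one service is unhealthy"
```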
| 114 | +### Incident Dashboard |
| 115 | + |
| 116 | +```bash |
| 117 | +# Open incidents |
| 118 | +curl -s http://127.0.0.1:8515/api/v1/stats | python3 -m json.tool |
| 119 | + |
| 120 | +# Attestation state |
| 121 | +curl -s http://127.0.0.1:8505/api/v1/verify | python3 -m json.tool |
| 122 | + |
| 123 | +# Integrity status |
| 124 | +curl -s http://127.0.0.1:8510/api/v1/status | python3 -m json.tool |
| 125 | +``` |
| 126 | + |
| 127 | +### Journal Logs |
| 128 | + |
| 129 | +```bash |
| 130 | +# All Secure AI services |
| 131 | +journalctl -u 'secure-ai-*' --since "1 hour ago" |
| 132 | + |
| 133 | +# Specific service |
| 134 | +journalctl -u secure-ai-incident-recorder -f |
| 135 | + |
| 136 | +# Security events only |
| 137 | +journalctl -u 'secure-ai-*' -g 'FAIL|DENIED|degraded|violation' --since today |
| 138 | +``` |
| 139 | + |
| 140 | +## Capacity Limits |
| 141 | + |
| 142 | +| Resource | Service | Default Limit | Notes | |
| 143 | +|----------|---------|---------------|-------| |
| 144 | +| Memory | Agent | 512MB | Increase for large context windows | |
| 145 | +| Memory | Registry | 128MB | Scales with model count | |
| 146 | +| Memory | Policy Engine | 128MB | Scales with rule count | |
| 147 | +| CPU | Agent | 50% | Primary workload | |
| 148 | +| CPU | GPU Integrity Watch | 15% | Background monitoring | |
| 149 | +| Incidents | Recorder | 1000 max | Oldest trimmed; persisted to disk | |
| 150 | +| Models | Registry | Unlimited | Bounded by vault size | |
| 151 | + |
| 152 | +## Graceful Shutdown |
| 153 | + |
| 154 | +All Go services handle SIGTERM for clean shutdown: |
| 155 | +- In-flight HTTP requests complete (up to 10s drain) |
| 156 | +- Incident recorder flushes persistence file |
| 157 | +- Audit log files are closed cleanly |
| 158 | +- systemd `TimeoutStopSec=15` enforces hard deadline |
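The relationship between the 10s drain and `TimeoutStopSec=15` is the usual SIGTERM-then-SIGKILL sequence: the service gets a chance to exit on its own, and is killed if it overruns the deadline. The following shell sketch illustrates that sequence only; it is not how systemd or the Go services implement it, the helper names are hypothetical, and it relies on Linux `/proc`.

```bash
# True while PID exists and is not a zombie (Linux /proc only).
is_running() {
  [ -r "/proc/$1/status" ] && ! grep -q '^State:[[:space:]]*Z' "/proc/$1/status" 2>/dev/null
}

# Send SIGTERM, wait up to DEADLINE seconds for the process to exit, then SIGKILL.
# Prints "clean", "killed", or "not-running".
stop_with_deadline() {
  local pid="$1" deadline="$2" waited=0
  kill -TERM "$pid" 2>/dev/null || { echo "not-running"; return 0; }
  while is_running "$pid"; do
    if [ "$waited" -ge "$deadline" ]; then
      kill -KILL "$pid" 2>/dev/null
      echo "killed"
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "clean"
}

# Usage: stop_with_deadline "$pid" 15
```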