
Commit 61d72e1

SecAI-Hub and claude committed
Implement M45: Production readiness hardening
- Incident recorder: file-backed JSONL persistence (survives restarts)
- All 9 Go services: graceful shutdown via SIGTERM/SIGINT signal handling
  with 10s connection draining before hard stop
- mcp-firewall, gpu-integrity-watch: add http.Server with proper
  ReadTimeout/WriteTimeout/IdleTimeout (were using bare ListenAndServe)
- All 12 daemon systemd units: add TimeoutStartSec=30, TimeoutStopSec=15,
  StartLimitInterval=300, StartLimitBurst=5
- First-boot health validation script (first-boot-check.sh)
- Audit log rotation via logrotate (/etc/logrotate.d/secure-ai)
- CI: add dependency vulnerability scanning job (govulncheck + pip-audit)
- CI: shellcheck coverage for first-boot-check.sh and verify-release.sh
- Fix build.yml: lowercase GHCR image reference (OCI registries require it)
- .gitignore: add compiled Go service binaries
- Production operations guide (upgrade, key rotation, monitoring, capacity)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 6cbbf71 commit 61d72e1

29 files changed

Lines changed: 790 additions & 27 deletions

.github/workflows/build.yml

Lines changed: 6 additions & 2 deletions
@@ -37,11 +37,15 @@ jobs:
           pr_event_number: ${{ github.event.number }}
           maximize_build_space: true

+      - name: Set lowercase image ref
+        if: github.event_name != 'pull_request'
+        run: echo "IMAGE_REF=ghcr.io/${GITHUB_REPOSITORY,,}" >> "$GITHUB_ENV"
+
       - name: Generate SBOM
         if: github.event_name != 'pull_request'
         uses: anchore/sbom-action@57aae528053a48a3f6235f2d9461b05fbcb7366d # v0.23.1
         with:
-          image: ghcr.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
+          image: ${{ env.IMAGE_REF }}
           format: cyclonedx-json
           output-file: sbom.cdx.json

@@ -51,6 +55,6 @@ jobs:
           cosign attest --type cyclonedx \
             --predicate sbom.cdx.json \
             --key env://COSIGN_PRIVATE_KEY \
-            ghcr.io/${{ github.repository_owner }}/${{ github.event.repository.name }}
+            "$IMAGE_REF"
         env:
           COSIGN_PRIVATE_KEY: ${{ secrets.SIGNING_SECRET }}

.github/workflows/ci.yml

Lines changed: 43 additions & 1 deletion
@@ -94,7 +94,9 @@ jobs:
           shellcheck -s bash \
             files/system/usr/libexec/secure-ai/*.sh \
             files/scripts/build-services.sh \
-            files/scripts/generate-mok.sh
+            files/scripts/generate-mok.sh \
+            files/scripts/first-boot-check.sh \
+            files/scripts/verify-release.sh

   policy-validate:
     name: Validate YAML configs
@@ -258,3 +260,43 @@ jobs:

       - name: Check test counts for drift
         run: bash .github/scripts/check-test-counts.sh
+
+  dependency-audit:
+    name: Dependency Vulnerability Audit
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+
+      - uses: actions/setup-go@d35c59abb061a4a6fb18e82ac0862c26744d6ab5 # v5.5.0
+        with:
+          go-version: "1.23"
+
+      - uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
+        with:
+          python-version: "3.12"
+
+      - name: Install govulncheck
+        run: go install golang.org/x/vuln/cmd/govulncheck@latest
+
+      - name: Go vulnerability scan
+        run: |
+          echo "=== Go Dependency Vulnerability Scan ==="
+          VULN_ERRORS=0
+          for svc in airlock registry tool-firewall gpu-integrity-watch mcp-firewall \
+                     policy-engine runtime-attestor integrity-monitor incident-recorder; do
+            echo "--- ${svc} ---"
+            cd "services/${svc}"
+            govulncheck ./... || VULN_ERRORS=$((VULN_ERRORS + 1))
+            cd ../..
+          done
+          if [ $VULN_ERRORS -gt 0 ]; then
+            echo "WARNING: $VULN_ERRORS service(s) have known vulnerabilities"
+          fi
+
+      - name: Python dependency audit
+        run: |
+          pip install pip-audit pyyaml flask requests
+          echo "=== Python Dependency Audit ==="
+          pip-audit --strict --desc || echo "WARNING: Python dependencies have known vulnerabilities"

.gitignore

Lines changed: 10 additions & 0 deletions
@@ -7,6 +7,16 @@ cosign.private

 # Go
 services/*/vendor/
+# Compiled Go binaries (match service directory name)
+services/airlock/airlock
+services/registry/registry
+services/tool-firewall/tool-firewall
+services/gpu-integrity-watch/gpu-integrity-watch
+services/mcp-firewall/mcp-firewall
+services/policy-engine/policy-engine
+services/runtime-attestor/runtime-attestor
+services/integrity-monitor/integrity-monitor
+services/incident-recorder/incident-recorder

 # Python
 __pycache__/

README.md

Lines changed: 4 additions & 2 deletions
@@ -151,7 +151,7 @@ Every model passes through the same fully automatic pipeline:
 | **Updates** | Cosign-verified rpm-ostree, staged workflow, greenboot auto-rollback |
 | **Supply Chain** | Per-service CycloneDX SBOMs, SLSA3 provenance attestation, cosign-signed checksums |

-See [docs/threat-model.md](docs/threat-model.md) for threat classes, residual risks, and security invariants. See [docs/security-status.md](docs/security-status.md) for implementation status of all 44 milestones.
+See [docs/threat-model.md](docs/threat-model.md) for threat classes, residual risks, and security invariants. See [docs/security-status.md](docs/security-status.md) for implementation status of all 45 milestones.

 ### Verify Image Signatures

@@ -229,7 +229,7 @@ Each CI job produces specific security evidence:
 | [Threat Model](docs/threat-model.md) | Threat classes, invariants, residual risks |
 | [API Reference](docs/api.md) | HTTP API for all services |
 | [Policy Schema](docs/policy-schema.md) | Full policy.yaml schema reference |
-| [Security Status](docs/security-status.md) | Implementation status of all 44 milestones |
+| [Security Status](docs/security-status.md) | Implementation status of all 45 milestones |
 | [Test Matrix](docs/test-matrix.md) | Test coverage: 1,117 tests across Go and Python (see [test-counts.json](docs/test-counts.json)) |
 | [Compatibility Matrix](docs/compatibility-matrix.md) | GPU, VM, and hardware support |
 | [Security Test Matrix](docs/security-test-matrix.md) | Security feature test coverage |
@@ -258,6 +258,7 @@ Each CI job produces specific security evidence:
 | [Audit Quick Path](docs/audit-quick-path.md) | External auditor step-by-step verification guide |
 | [Recovery Runbook](docs/recovery-runbook.md) | Operator procedures for degradation, containment, and recovery |
 | [Sample Release Bundle](docs/sample-release-bundle.md) | Release artifact structure and verification commands |
+| [Production Operations](docs/production-operations.md) | First-boot checks, upgrades, key rotation, monitoring, capacity |

 ### Install Guides

@@ -408,6 +409,7 @@ See [docs/test-matrix.md](docs/test-matrix.md) for full breakdown.
 - [x] **Milestone 42** -- Enforcement wiring + CI supply chain verification
 - [x] **Milestone 43** -- Stronger isolation: sandbox tightening, adversarial tests, CI security regression, MCP isolation, recovery ceremonies, M5 acceptance suite
 - [x] **Milestone 44** -- Auditability and documentation hardening: test-count drift CI check, CI evidence links and badges, M4/M5 terminology disambiguation, audit quick-path doc, recovery runbook, verify-release script, security/product roadmap split
+- [x] **Milestone 45** -- Production readiness hardening: incident persistence (file-backed), graceful shutdown for all Go services, HTTP timeouts, systemd production hardening, first-boot validation, audit log rotation, CI vulnerability scanning, production operations guide

 </details>

docs/production-operations.md

Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@
+# Production Operations Guide
+
+## First Boot
+
+After imaging and booting the appliance for the first time:
+
+```bash
+sudo /usr/libexec/secure-ai/first-boot-check.sh
+```
+
+This validates:
+- All core services are running
+- Health endpoints respond
+- Attestation state is verified
+- Integrity monitor baseline is established
+- No open incidents
+- Service token is present
+- No services exposed on public interfaces
+
+## Upgrade Procedure
+
+1. **Pre-upgrade snapshot** (if supported by hardware):
+```bash
+rpm-ostree status # Record current deployment
+sudo cp /var/lib/secure-ai/data/incidents.jsonl /tmp/incidents-backup.jsonl
+```
+
+2. **Pull update**:
+```bash
+rpm-ostree upgrade
+```
+
+3. **Reboot and verify**:
+```bash
+sudo systemctl reboot
+# After reboot:
+sudo /usr/libexec/secure-ai/first-boot-check.sh
+```
+
+4. **Rollback if needed**:
+```bash
+rpm-ostree rollback
+sudo systemctl reboot
+```
+
+## Log Rotation
+
+Audit logs are rotated automatically via `/etc/logrotate.d/secure-ai`:
+- **Audit JSONL** (`/var/lib/secure-ai/logs/*.jsonl`): daily, 30 days retained, max 50MB per file
+- **Incident store** (`/var/lib/secure-ai/data/*.jsonl`): weekly, 12 weeks retained, max 100MB
+
+To manually trigger rotation:
+```bash
+sudo logrotate -f /etc/logrotate.d/secure-ai
+```
+
+## Key Rotation
+
+### Service Token
+
+The inter-service bearer token at `/run/secure-ai/service-token`:
+
+1. Generate new token:
+```bash
+openssl rand -hex 32 > /tmp/new-token
+```
+
+2. Replace atomically:
+```bash
+sudo mv /tmp/new-token /run/secure-ai/service-token
+sudo chmod 0640 /run/secure-ai/service-token
+```
+
+3. Restart all services (they read token at startup):
+```bash
+sudo systemctl restart 'secure-ai-*.service'
+```
+
+### HMAC Signing Key (Attestation Bundles)
+
+1. Generate new key:
+```bash
+openssl rand 32 > /tmp/new-hmac-key
+```
+
+2. Replace and restart attestor:
+```bash
+sudo mv /tmp/new-hmac-key /run/secure-ai/attestation-hmac-key
+sudo chmod 0640 /run/secure-ai/attestation-hmac-key
+sudo systemctl restart secure-ai-runtime-attestor
+```
+
+### Cosign Signing Key (Image & Release)
+
+Rotate via the GitHub repository secrets. Update `SIGNING_SECRET` in repository settings, then trigger a new build.
+
+## Monitoring
+
+### Service Health
+
+All services expose `/health` on their localhost ports:
+
+| Service | Port | Endpoint |
+|---------|------|----------|
+| Policy Engine | 8500 | `/health` |
+| Registry | 8470 | `/health` |
+| Tool Firewall | 8475 | `/health` |
+| Runtime Attestor | 8505 | `/health` |
+| Integrity Monitor | 8510 | `/health` |
+| Incident Recorder | 8515 | `/health` |
+| MCP Firewall | 8496 | `/health` |
+| GPU Integrity Watch | 8495 | `/health` |
+
+### Incident Dashboard
+
+```bash
+# Open incidents
+curl -s http://127.0.0.1:8515/api/v1/stats | python3 -m json.tool
+
+# Attestation state
+curl -s http://127.0.0.1:8505/api/v1/verify | python3 -m json.tool
+
+# Integrity status
+curl -s http://127.0.0.1:8510/api/v1/status | python3 -m json.tool
+```
+
+### Journal Logs
+
+```bash
+# All Secure AI services
+journalctl -u 'secure-ai-*' --since "1 hour ago"
+
+# Specific service
+journalctl -u secure-ai-incident-recorder -f
+
+# Security events only
+journalctl -u 'secure-ai-*' -g 'FAIL|DENIED|degraded|violation' --since today
+```
+
+## Capacity Limits
+
+| Resource | Service | Default Limit | Notes |
+|----------|---------|---------------|-------|
+| Memory | Agent | 512MB | Increase for large context windows |
+| Memory | Registry | 128MB | Scales with model count |
+| Memory | Policy Engine | 128MB | Scales with rule count |
+| CPU | Agent | 50% | Primary workload |
+| CPU | GPU Integrity | 15% | Background monitoring |
+| Incidents | Recorder | 1000 max | Oldest trimmed; persisted to disk |
+| Models | Registry | Unlimited | Bounded by vault size |
+
+## Graceful Shutdown
+
+All Go services handle SIGTERM for clean shutdown:
+- In-flight HTTP requests complete (up to 10s drain)
+- Incident recorder flushes persistence file
+- Audit log files are closed cleanly
+- systemd `TimeoutStopSec=15` enforces hard deadline

docs/security-status.md

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ All M5 security assurance criteria are met. The controls below have been impleme
| Enforcement wiring + CI supply chain verification | Implemented | M42 | Integrity monitor -> incident recorder reporting, runtime attestor -> incident recorder reporting, incident recorder -> containment action execution (freeze agent, disable airlock, force vault relock, quarantine model), CI SBOM generation verification via Syft, cosign availability check, release workflow provenance validation |
| Stronger isolation (M5 hardening) | Implemented | M43 | Per-service sandbox tightening (device cgroups, resource limits, namespace isolation), agent execution compartmentalization (step signatures, subprocess isolation, per-step capability re-validation), workspace hard walls (symlink/hardlink/FD-reuse detection), model worker isolation profiles, formal adversarial test suite (prompt injection, policy bypass, containment, GPU tamper), CI security regression gate, MCP-specific isolation (trust tier enforcement, per-tool profiles, session binding, dynamic registration denial), recovery ceremony (ack + re-attestation), latched degraded states, severity escalation rules, forensic bundle export (signed), M5 control matrix doc, supply chain provenance doc, M5 acceptance suite (30 tests) |
| Auditability and documentation hardening | Implemented | M44 | Test-count drift CI check with single source of truth (docs/test-counts.json), CI evidence links and GitHub Actions badges in README, M4/M5 terminology disambiguation (project milestones vs M5 security assurance level), operator verification column in M5 control matrix, external audit quick-path doc, recovery runbook with concrete curl commands, verify-release.sh auditor script, sample release bundle doc, security-status split into assurance controls vs product roadmap |
+| Production readiness hardening | Implemented | M45 | Incident recorder file-backed persistence (survives restarts), graceful shutdown (SIGTERM/SIGINT with connection draining) for all 9 Go services, HTTP server timeouts for mcp-firewall and gpu-integrity-watch, systemd production hardening (TimeoutStartSec, TimeoutStopSec, StartLimitInterval, StartLimitBurst) for all 12 daemon units, first-boot health validation script, audit log rotation via logrotate, CI dependency vulnerability scanning (govulncheck + pip-audit), production operations guide (upgrade, key rotation, capacity limits, monitoring) |

---
