Skip to content

Latest commit

 

History

History
840 lines (677 loc) · 32.3 KB

File metadata and controls

840 lines (677 loc) · 32.3 KB

Distributed Task Observatory - Complete Session Audit

Last Updated: 2025-12-27 Original Session: 2025-12-22 (Conversation ID: c305f5d6-89a1-4d5b-a311-e081142f51ae) Phases Completed: 0-31.6


Executive Summary

The Distributed Task Observatory is a production-grade distributed task processing system demonstrating modern microservice architecture, event-driven design, observability, and polyglot development on Kubernetes.

Final System State

  • 15+ Kubernetes pods running and healthy
  • 5 microservices (Node.js/TypeScript, Python, Go x2, Rust)
  • 4 infrastructure components (RabbitMQ, PostgreSQL, Redis, MongoDB)
  • 3 observability tools (Prometheus, Grafana, Alertmanager)
  • 3 DB admin UIs (pgAdmin, Mongo Express, RedisInsight)
  • 2 user interfaces (Web Terminal via PTY, Rust TUI with launcher)
  • Multi-platform distribution (install scripts, npm shim, release workflow)

Phase Summary

Phase Description Status
Phase 0 Foundation & Contracts ✅ Complete
Phase 1 Infrastructure & Platform ✅ Complete
Phase 2 Core Service Implementation ✅ Complete
Phase 3 Observability Stack ✅ Complete
Phase 4 Aggregation & Read Model ✅ Complete
Phase 5 Interface Layer ✅ Complete
Phase 6 Hardening & Verification ✅ Complete
Phase 7 Testing & Determinism ✅ Complete
Phase 8 Production-Grade Observability ✅ Complete
Phase 9 Message Filtering & Event Sourcing ✅ Complete
Phase 10 Startup Automation ✅ Complete
Phase 11 Version Governance ✅ Complete
Phase 12 Consumer Validation & TUI Enhancements ✅ Complete
Phase 13 Add Task/UI Launcher & Hardening ✅ Complete
Phase 14 Distribution Strategy ✅ Complete
Phase 15 TUI Refactoring & Prerequisites Setup ✅ Complete
Phase 16 TypeScript Migration & Doctor Enhancement ✅ Complete
Phase 17 Testing Optimizations & CI Hardening ✅ Complete
Phase 18 Integration Test Hardening ✅ Complete
Phase 19 Docker Hub Pre-Built Images ✅ Complete
Phase 20 Web Terminal Modernization ✅ Complete
Phase 21 PTY State Preservation ✅ Complete
Phase 22 Server Mode (W11) & In-Container Operation ✅ Complete
Phase 23 I7 Parity Invariant & Compose/K8s Validation ✅ Complete
Phase 24 Integration Budget Adjustment (I4: 180s) ✅ Complete
Phase 25 Visual Test Initial Stabilization ✅ Complete
Phase 26 WebSocket Cleanup (Double-Handle Pattern) ✅ Complete
Phase 27 Ratatui Compatibility (arboard v2.1 pin) ✅ Complete
Phase 28 Container Fidelity & Stale Image Detection ✅ Complete
Phase 29 PTY State Preservation (Waterford Replay) ✅ Complete
Phase 30 WebSocket Proxy Fix (Nginx /ws) ✅ Complete
Phase 31 Visual Test Suite Stabilization ✅ Complete
Phase 31.4 Container Contract & Per-IP Cap Tuning ✅ Complete
Phase 31.5 Visual Test Strategy & Failure Injection ✅ Complete
Phase 31.6 CI Docker Build Context Fix ✅ Complete

Technology Stack

Languages & Frameworks

Service Language Framework
Gateway Node.js Express
Processor Python pika, psycopg2
Metrics Engine Go amqp091-go, go-redis, mongo-driver
Read Model Go net/http, go-redis, lib/pq, mongo-go-driver
Web Terminal Rust/JS web-pty-server (Rust), xterm.js
TUI Rust ratatui 0.24, crossterm, reqwest

Infrastructure

  • Message Bus: RabbitMQ 3.12
  • Relational DB: PostgreSQL 15
  • Cache: Redis 7
  • Document Store: MongoDB
  • Ingress: nginx-ingress

Observability

  • Metrics: Prometheus
  • Dashboards: Grafana
  • Alerting: Alertmanager

Key Features

TUI (Terminal User Interface)

  • Cluster Launcher Mode - Detects if cluster is running, offers one-key launch
  • Animated Loading Splash - Braille spinner with cycling messages
  • Dashboard Mode - Real-time stats, alerts, and job table
  • Add Task (N) - Form with validation, 2s timeout, Gateway submission
  • UI Launcher (U) - Arrow navigation, 9 UIs from centralized registry
  • Graceful Alert Degradation - Bounded retries, no UI freeze
  • Environment-Aware Errors - SSH/headless detection for browser launch
  • Doctor Command - Checks Docker, PowerShell, kubectl, kind

Web Terminal

  • PTY Multiplexer - Runs TUI in pseudo-terminal, streams to browser
  • xterm.js Client - Identical rendering to native terminal
  • Split K8s Deployments - web-ui-http (nginx) + web-pty-ws (PTY broker)
  • Session Reconnect - Single-use tokens, page refresh reconnects
  • Fallback Dashboard - Shows stats when WebSocket unavailable
  • Read-Only Mode - Optional input filtering for mutating operations

Startup Automation

  • One-Click Script - scripts/start-all.ps1
  • TUI Integration - Press 'L' in launcher mode
  • JSON Progress Output - For programmatic tracking
  • Parallel Builds - Docker images built concurrently

Files Created/Modified

Core Services

Path Description
src/services/gateway/index.js Node.js Express API with metrics
src/services/processor/main.py Python RabbitMQ consumer with validation
src/services/metrics-engine/main.go Go event aggregator with MongoDB
src/services/read-model/main.go Go Read Model API with CORS

User Interfaces

Path Description
src/interfaces/web/index.html xterm.js terminal page
src/interfaces/web/terminal.js WebSocket client with auto-reconnect
src/interfaces/web/styles.css Terminal and fallback styles
src/interfaces/web/nginx.conf nginx proxy config (/ws proxy)
src/interfaces/tui/src/main.rs Rust TUI with launcher mode

Web PTY Server (Phase 20)

Path Description
src/services/web-pty-server/src/main.rs WebSocket server, metrics endpoint
src/services/web-pty-server/src/session.rs Session management, reconnect tokens
src/services/web-pty-server/src/protocol.rs Client/server message types
src/services/web-pty-server/src/auth.rs Bearer token authentication
src/services/web-pty-server/src/pty.rs PTY spawning with terminal caps
src/services/web-pty-server/src/config.rs Environment-driven configuration

Distribution (Phase 14)

Path Description
src/interfaces/tui/VERSION Single source of truth for version
src/interfaces/tui/build.rs Build metadata injection
install.sh POSIX install script with checksum verification
install.ps1 Windows install script with checksum verification
packages/npm-shim/* npm package with postinstall binary download
.github/workflows/release.yml Multi-platform release workflow
docker-compose.demo.yml Docker-only demo mode

Scripts

Path Description
scripts/start-all.ps1 One-click cluster startup
scripts/setup-cluster.ps1 Kind cluster creation
scripts/integration-gate.ps1 End-to-end test suite v2
scripts/run-all-tests.ps1 Canonical test entrypoint
scripts/audit-naming-consistency.ps1 Verify no old binary names
scripts/verify-version-sync.ps1 Verify version consistency
scripts/verify-artifact-names.ps1 Verify canonical artifact names
scripts/audit-workflows.ps1 Verify workflow secret isolation
scripts/validate-dockerfile-context.py CI prevention for Docker build context drift

Contracts

Path Description
contracts/schemas/event-envelope.json Event message schema
contracts/schemas/job.json Job domain object schema
contracts/ui-registry.json Centralized UI registry (9 launchable UIs)
contracts/VERSIONS.md Schema versioning documentation

Access Points

Service URL Credentials
Web Terminal http://localhost:8081 -
RabbitMQ http://localhost:15672 guest / guest
Grafana http://localhost:3002 admin / admin
Prometheus http://localhost:9090 -
Gateway API http://localhost:3000 -
Read Model API http://localhost:8080 -
TUI odd-dashboard or cargo run --release -

Quick Start

Using TUI Launcher (Recommended)

# If installed via install scripts or npm:
odd-dashboard

# From source:
cd src/interfaces/tui
cargo run --release
# Press 'L' to launch cluster

Using Script Directly

.\scripts\start-all.ps1

Manual Steps

  1. Clone repository
  2. Install prerequisites (Docker Desktop, kubectl, kind)
  3. Run .\\scripts\\setup-cluster.ps1
  4. Build and load Docker images
  5. Apply Kubernetes manifests
  6. Start port-forwards
  7. Verify with .\\scripts\\integration-gate.ps1

Integration Gate Results (v2)

============================================================
  INTEGRATION GATE v2 - Deterministic Tests
============================================================

>> Pre-flight: Execution Assumptions
[PASS] Kubectl Context
[PASS] Pods Ready
>> Test 1 - Gateway Health (with retry)
[PASS] Gateway Health
>> Test 2 - Read Model Health (with retry)
[PASS] Read Model Health
>> Test 3 - Submit 3 Jobs
[PASS] Job Submission
>> Test 4 - Wait for Processing (30s max)
[PASS] Jobs Processed
>> Test 5 - MongoDB Event Persistence
[PASS] MongoDB Events
>> Test 6 - Stats Aggregation
[PASS] Stats Aggregation
>> Test 7 - Gateway Metrics
[PASS] Gateway Metrics

============================================================
  INTEGRATION GATE RESULTS
============================================================
  Passed: 9
  Failed: 0

  [OK] ALL TESTS PASSED

Artifacts in /audit

File Description
complete-session-audit.md This file - comprehensive audit
task.md Phase-by-phase task checklist
walkthrough.md Implementation walkthrough
session-summary.md High-level summary

Phase 13: Add Task/UI Launcher & Hardening (2025-12-24)

Features Added

  • Add Task - TUI (N key) and Web form with Gateway submission
  • UI Launcher - TUI (U key) and Web modal from contracts/ui-registry.json

Hardening

  • Error enums: RegistryError, SubmitError, BrowserError
  • Input validation: alphanumeric job types, max 50 chars
  • Environment-aware browser launch (SSH/headless detection)
  • Graceful degradation (fallback registry on load failure)
  • Cross-surface invariant tests

Phase 14: Distribution Strategy (2025-12-24)

Binary Rename

  • Name: odd-dashboard
  • Build metadata injection (commit, timestamp, rustc version)

CLI Features

  • --version / -V: Shows version with build metadata
  • --help / -h: Usage information
  • doctor: Checks prerequisites (Docker, PowerShell, kubectl, kind)
  • Support matrix enforcement (fails fast on unsupported platforms)

Install Scripts

Script Platform Features
install.sh Linux/macOS Platform detection, checksum verification
install.ps1 Windows Version resolution, checksum verification

npm Shim (@oddessentials/odd-dashboard)

  • Postinstall binary download with checksum verification
  • Strict failure semantics (exits non-zero if binary missing)
  • Sentinel file for differentiated error messages

Demo Mode (docker-compose.demo.yml)

  • All ports in 13000+ range to avoid conflicts
  • Isolated naming (odto-demo-* containers/volumes/network)
  • Internal services (postgres, mongo, redis, amqp) not exposed

Release Workflow (.github/workflows/release.yml)

  • Triggered on v*.*.* tags
  • 5-platform matrix (Windows x64, macOS x64/arm64, Linux x64/arm64)
  • VERSION file validation
  • SHA256SUMS generation
  • Artifact completeness gate

CI Integration

  • Distribution audit job with 4 PowerShell checks
  • Platform detection unit tests

Commits (feature/distribution-strategy)

  1. d6f983f - Version infrastructure and audit scripts
  2. 91ce8b6 - Binary rename with build metadata
  3. f49662b - CLI features, doctor command, support matrix
  4. 56c4efa - docker-compose.demo.yml
  5. 9d06fe8 - Install scripts
  6. 5dab3d3 - npm shim with strict failure semantics
  7. 9784cc2 - Release workflow
  8. 0a73d52 - CI integration for distribution audits

Test Results (Phase 14)

Component Tests
TUI (Rust) 56 pass (4 new platform tests)
Audit Scripts All passing
Naming Consistency No old references

Phase 15: TUI Refactoring & Prerequisites Setup (2025-12-24)

Architecture Refactoring

Refactored monolithic 2710-line main.rs into 7 clean modules:

File Lines Purpose
main.rs ~1130 Entry point, rendering, event loop
lib.rs 50 Module re-exports
types.rs ~405 Data structures, App state
error.rs ~460 Error types, remediation helpers
doctor.rs ~300 Prerequisite checking, CLI handlers
cluster.rs ~470 Cluster ops, setup script, browser
install.rs ~140 Clipboard, install execution

New Feature: Guided Prerequisites Setup

  • Automatic detection of Docker, PowerShell, kubectl, kind
  • Interactive selection of missing prerequisites
  • Clipboard copy via arboard crate (cross-platform)
  • Status feedback in TUI

Dependencies Added

Crate Purpose
arboard v3 Cross-platform clipboard access

Test Results

Component Tests
TUI (Rust) 49 pass (3 new install tests)
All modules Full coverage

Phase 16: TypeScript Migration & Doctor Enhancement (2025-12-25)

Doctor Command Enhancement

The doctor command now displays OS-specific installation commands for missing prerequisites:

odd-dashboard doctor
====================

[OK] Platform: windows-x86_64 (supported)
[OK] Docker: Docker version 24.0.7
[FAIL] PowerShell Core: not found
[OK] kubectl: Client Version: v1.28.3
[FAIL] kind: not found

Installation Commands (windows):
----------------------------------------
  PowerShell Core: winget install Microsoft.PowerShell
  kind: winget install Kubernetes.kind

Some prerequisites are missing.
Run the commands above, then retry: odd-dashboard doctor

TypeScript Migration

Converted JavaScript test files to TypeScript with strict typing:

File Changes
src/services/gateway/__tests__/index.test.ts Added interfaces for EventEnvelope, EventProducer, OpenApiSpec
src/services/gateway/__tests__/web-smoke.test.ts Added interfaces for Registry, RegistryEntry, JobPayload, ErrorResponse
tests/web-smoke.test.ts Root-level tests migrated to TypeScript

New Configuration Files

File Purpose
tsconfig.json Root TypeScript config with strict mode
vitest.config.ts Root vitest config for tests directory
src/services/gateway/vitest.config.ts Gateway vitest config for tests directory
commitlint.config.cjs Renamed from .js for ESM compatibility

Package Updates

  • Added vitest ^1.6.0 to root devDependencies
  • Added typescript ^5.3.3 to root devDependencies
  • Set "type": "module" in root package.json
  • Added test and typecheck scripts to root

Bazel Updates

Updated src/services/gateway/BUILD.bazel to reference TypeScript test files.

Infrastructure Fix

  • RedisInsight port correction: Updated infra/k8s/redisinsight.yaml to use internal port 5540 (RedisInsight v2) while maintaining external access on port 8001

CHANGELOG Automation

Added @semantic-release/changelog plugin to .releaserc.json for automatic CHANGELOG.md generation on releases.

Test Results

Component Tests
TUI (Rust) 72 pass
Gateway (TypeScript) 17 pass
Root (TypeScript) 10 pass

Commits

c85b856 refactor: migrate JavaScript tests to TypeScript with strict typing
aa55c4f feat(tui): show OS-specific install commands in doctor output
ab9a126 fix(infra): correct RedisInsight port and enable changelog generation

Phase 17: Testing Optimizations & CI Hardening (2025-12-25)

Invariants Documentation

  • Created docs/INVARIANTS.md with formal system guarantees
  • Contract, cross-platform, coverage, integration, and automation invariants

Coverage Enforcement

  • coverage-config.json - Externalized thresholds (single source of truth)
  • scripts/check-coverage.py - Unified validator with self-test capability

CI Optimizations

  • Added dorny/paths-filter@v3 for reliable change detection
  • Conditional schema compatibility checks (fail-closed default)
  • Removed duplicate contracts job (covered by run-all-tests.ps1)
  • Integration gate trigger for contract/service changes

Performance

  • Parallel Go tests on pwsh 7+ with sequential fallback
  • Explicit exit code collection prevents swallowed failures

Phase 17.5: Go Coverage Improvements (2025-12-25)

Coverage Analysis

Analyzed Go services and identified that ~70-80% of code in main.go files handles external connections (RabbitMQ, Redis, MongoDB, PostgreSQL) and infinite processing loops - code that cannot be meaningfully unit-tested without refactoring for dependency injection or integration tests.

Decision

Opted for Option B: Maximize unit test coverage for business logic within current architecture, retain realistic thresholds for infrastructure-heavy services, and document the tradeoff.

New Tests Added

metrics-engine/metrics_engine_test.go (11 new tests)

  • TestEventEnvelopeJSONSerialization - JSON roundtrip
  • TestEventEnvelopeJSONFieldNames - Verify camelCase
  • TestDLQMessageJSONSerialization - JSON roundtrip
  • TestDLQMessageJSONFieldNames - Verify snake_case
  • TestDLQNameConstant - Constant value
  • TestFormatValidationErrorsThreeErrors - Edge case
  • TestFormatValidationErrorsSpecialChars - Edge case
  • TestGetEnvMultipleCalls - Multiple lookups
  • TestEventEnvelopeWithNilPayload - Nil handling
  • TestEventEnvelopeWithComplexPayload - Nested payload

read-model/read_model_test.go (13 new tests)

  • TestCorsMiddlewareAddsHeaders - CORS headers
  • TestCorsMiddlewareOptionsRequest - OPTIONS handling
  • TestCorsMiddlewarePassesThrough - Passthrough
  • TestHealthHandler - Health endpoint
  • TestStatsResponseJSONSerialization - JSON roundtrip
  • TestStatsResponseJSONFieldNames - Field names
  • TestJobJSONSerialization - Job struct
  • TestHealthResponseJSONFieldNames - Field names
  • TestGetEnvMultipleCalls - Multiple lookups
  • TestJobStatusValues - All statuses
  • TestStatsResponseZeroValues - Zero values
  • TestOpenApiVersionDynamic - Dynamic version
  • TestDocsHandlerContainsSwaggerUI - SwaggerUI elements

Coverage Results

Service Before After Threshold
metrics-engine 10.7% 10.7% 10% ✅
metrics-engine/validator 80.4% 80.4% 80% ✅
read-model 8.3% 18.5% 18% ✅

Documentation Updates

  • INVARIANTS.md: Added V2a invariant for validator (80%), updated V3 to 18%, added architecture tradeoff note
  • coverage-config.json: Updated thresholds with explanatory notes

Files Modified

  • src/services/metrics-engine/metrics_engine_test.go - 11 new tests
  • src/services/read-model/read_model_test.go - 13 new tests
  • docs/INVARIANTS.md - New invariant V2a, updated thresholds, architecture note
  • coverage-config.json - Updated thresholds with notes

Phase 18: Integration Test Hardening (2025-12-25)

Objective

Introduce reliable, deterministic integration testing via Docker Compose without weakening fast unit/contract guardrails.

Files Created

File Lines Purpose
scripts/integration-harness.ps1 ~310 Self-contained Docker Compose test runner
scripts/validate-json.mjs ~50 AJV schema validator helper (Node.js)
docker-compose.integration.yml ~145 Per-image healthchecks, isolated network

Files Modified

File Changes
.github/workflows/ci.yml Replaced integration-gate-check with integration-phase job
docs/INVARIANTS.md Added I3-I6 invariants
docs/TESTING.md Added integration harness documentation

Key Design Decisions

Decision Rationale
Docker Compose (not K8s) Self-contained, no cluster dependency
90s wall-clock budget exit 1 on breach (hard fail)
4 canonical proof paths Contract-first (schema validated)
Scoped retries Connection only; disabled after partial success
Node.js schema validator AJV via validate-json.mjs helper
Per-image healthchecks wget/nc/python per container base
Guarded teardown try/catch around compose commands
gate-decision.json Source of truth for trigger logging

New Invariants

ID Invariant Enforcement
I3 Integration harness self-contained Docker Compose only
I4 Runtime <90s wall-clock Harness exits 1 on breach
I5 Artifact capture every run Guarded finally block
I6 Victory gate: 3 green + nightly 📝 Governance-only

Canonical Proof Paths

Path Assertion Schema
P1 Gateway accepts job (201) job.json
P2 Events contain jobId event-envelope.json
P3 Jobs reflect COMPLETED job.json
P4 Metrics counter exposed Regex

Status

Implementation complete. Pending verification via local harness run and CI.


Phase 19: Docker Hub Pre-Built Images (2025-12-25)

Objective

Replace source-based Docker Compose builds with pre-built Docker Hub images for faster, more reliable integration tests. Unblock I3-I5 invariant enforcement in CI.

Dockerfiles Updated

Service Image Size Base Changes
Gateway 322 MB node:20-slim Multi-stage, baked contracts, prod deps only
Processor 491 MB python:3.11-slim Baked contracts
Metrics Engine 22.9 MB distroless/static Optimized binary flags (-s -w)
Read Model 19.5 MB distroless/static Optimized binary flags (-s -w)

CI Integration (.github/workflows/ci.yml)

  • New build-images job with matrix strategy (4 services)
  • Security: Only runs on main branch pushes, never on PRs or forks
  • Dual tagging: :latest + :sha-<commit> for traceability
  • GitHub Actions cache for faster rebuilds
  • Contracts copied into Gateway/Processor contexts for I3 compliance

docker-compose.integration.yml Updates

  • Replaced source-based builds with oddessentials/odto-*:latest images
  • Removed volume mounts and command overrides
  • Reduced start_period from 30s to 10s (no compilation delay)
  • Added note documenting PR behavior (uses last main images)

Re-enabled integration-phase Job

  • Depends on [paths-filter, tests, build-images]
  • Runs when build-images succeeds or is skipped
  • Uses pre-built images for <90s runtime budget

README.md Updates

  • Added 🐳 Docker Hub Images section with image table
  • Usage examples, tagging strategy, CI integration notes

Files Created/Modified

File Changes
src/services/gateway/Dockerfile Multi-stage build, baked contracts
src/services/processor/Dockerfile Baked contracts
src/services/metrics-engine/Dockerfile Optimized binary flags
src/services/read-model/Dockerfile Optimized binary flags
.github/workflows/ci.yml Added build-images job, re-enabled integration-phase
docker-compose.integration.yml Pre-built images, reduced startup times
README.md Docker Hub section
.gitignore Ignore copied contracts directories

Pending Steps

  1. Add DOCKERHUB_USERNAME and DOCKERHUB_TOKEN secrets to GitHub
  2. Push changes to main to trigger image builds
  3. Update INVARIANTS.md to mark I3-I5 as ✅ CI after verification

Phase 20: Web Terminal Modernization (2025-12-25)

BREAKING CHANGE

Replaced the glassmorphic Web Dashboard with an xterm.js-based terminal that mirrors the TUI via WebSocket PTY streaming. This provides 100% visual fidelity with the native TUI.

Architecture: PTY Multiplexer

  • web-pty-ws (Rust): PTY broker running odd-dashboard in pseudo-terminal
  • web-ui-http (nginx): Static files + /ws proxy to PTY server
  • Split K8s deployments: HTTP can roll independently without killing PTY sessions

web-pty-server (Rust)

Module Purpose
main.rs WebSocket server, cleanup task, metrics endpoint
session.rs Session lifecycle, single-use reconnect tokens
protocol.rs Client/server message types, input classification
auth.rs Bearer token validation (never logged)
pty.rs PTY spawning with xterm-256color, UTF-8, truecolor
config.rs Environment-driven configuration

Frontend (xterm.js)

File Purpose
terminal.js WebSocket client, auto-reconnect, resize handling
styles.css Terminal theme matching TUI colors
nginx.conf /ws proxy, /api proxy for fallback stats

Requirements Implemented

Req Feature
R1 Split K8s deployments (PTY survival during HTTP rollouts)
R2 Session model with single-use reconnect tokens
R3 Environment-driven resource limits (logged at startup)
R4 Read-only mode with rate-limited notices
R5 Bearer token auth (never logged, only for /ws)
R6 WebSocket ping/pong keepalive
R7 Terminal capabilities (xterm-256color, UTF-8, truecolor)
R9 Output coalescing with backpressure metrics
R10 Minimal fallback dashboard when WS unavailable
R11 Single access URL (http://localhost:8081)

Test Results

Component Tests
web-pty-server 35 pass (config, session, protocol, auth, pty)

Files Created

  • src/services/web-pty-server/ (6 modules, Dockerfile, README)
  • infra/k8s/web-ui-http.yaml (nginx deployment)
  • infra/k8s/web-pty-ws.yaml (PTY server deployment + secret)
  • src/interfaces/web/terminal.js (xterm.js client)
  • src/interfaces/web/nginx.conf (proxy config)

Files Removed

  • infra/k8s/web-ui.yaml (replaced by split deployments)

Visual Regression Tests (CI Integration)

  • Playwright tests: 6 tests for terminal rendering fidelity
  • CI job: visual-regression triggers on web_terminal path changes
  • Docker Compose: Services spun up automatically for tests
  • Artifacts: playwright-report and snapshots on failure
  • V7 invariant: Visual regression tests enforced in CI

Coverage

  • web-pty-server: 81.12% (35 unit tests)
  • Exceeds V6 threshold of 80%

Phase 31.5: Visual Test Strategy & Server-Side Failure Injection (2025-12-26)

Objective

Implement a tiered visual test strategy and server-side failure injection to enable deterministic fallback UI testing without relying on Playwright's unreliable WebSocket mocking.

Background

  • PTY_PER_IP_CAP=30 in docker-compose.integration.yml is an operational tuning parameter (not an invariant)
  • 6 skipped snapshot tests across 2 describe blocks needed responsible restoration

Tiered Test Strategy

Tier Tests Location Trigger
1 (CI) Bundle Smoke Tests (5) ci.yml Every PR
2 (Nightly) Web Terminal Visual Tests (4) nightly.yml Daily 3 AM UTC
3 (Fallback) Fallback Dashboard (2) nightly.yml Daily, with failure injection

Server-Side Failure Injection

Added TestMode enum to web-pty-server for deterministic testing:

Mode Env Value Query Param Behavior
None (default) - Normal operation
FailConnection fail ?test_mode=fail Reject all connections
DelayConnection delay:N - Delay by N ms

Files Created

File Purpose
.github/workflows/nightly.yml Nightly visual test workflow with serialized workers

Files Modified

File Changes
src/services/web-pty-server/src/config.rs Added TestMode enum, env parsing
src/services/web-pty-server/src/main.rs Added failure injection in handle_connection
src/services/web-pty-server/src/lib.rs Exported TestMode
src/services/web-pty-server/src/*.rs Updated test configs with test_mode field
src/interfaces/web/terminal.js Added test_mode query param passthrough, INTERNAL_ERROR handling
tests/visual/terminal.spec.ts Re-enabled Fallback Dashboard tests with server injection

Test Results

Component Tests
web-pty-server (Rust) 47 pass
Playwright (Tier 1 + 3) 7 pass

Key Design Decisions

  • Server-side injection over Playwright WebSocket mocking (unreliable)
  • Query param override allows per-request failure without server restart
  • Frontend passthrough makes tests self-contained and race-free
  • Nightly-only visual tests until TUI rendering stability proven

Phase 31.6: CI Docker Build Context Fix (2025-12-27)

Objective

Fix a latent bug where CI build-images job used service-local contexts while Dockerfiles require repo root contexts. Add prevention script to catch future drift.

Background

  • Phase 19 introduced CI build-images with service-local contexts
  • Dockerfiles were later updated to repo-relative COPY paths for start-all.ps1 compatibility
  • Bug hadn't surfaced because CI hadn't rebuilt images since the Dockerfile changes

Discovery

During INVARIANTS.md review, context mismatch was identified:

# CI Used (WRONG)           vs  Dockerfile Expects (CORRECT)
context: src/services/gateway   context: .  (repo root)
COPY src/services/gateway/...   # Fails with service-local context!

Fix Applied

1. CI Workflow (ci.yml)

# Before (BROKEN)
- service: gateway
  context: src/services/gateway
  
# After (FIXED)  
- service: gateway
  context: .

2. Prevention Script Created

New scripts/validate-dockerfile-context.py:

  • Conservative approach with explicit service lists
  • Documents assumptions in script header
  • Added to CI as "Dockerfile context parity check"

3. INVARIANTS.md Updated

  • B1/B2 now CI-enforced via validate-dockerfile-context.py
  • Added "Service Context Categories" table
  • Documented new hazard: "CI/Local Context Drift"

Service Context Categories

Category Services Context Reason
Repo Root Required gateway, processor, web-pty-server . Use COPY src/... and COPY contracts/
Service-Local OK metrics-engine, read-model, web-ui Service dir Self-contained, no shared assets

Files Created

File Purpose
scripts/validate-dockerfile-context.py CI prevention script for context/path drift

Files Modified

File Changes
.github/workflows/ci.yml Fixed contexts, removed cp contracts workaround, added validation step, added web-pty-server and web-ui to matrix
docs/INVARIANTS.md Updated B1-B4 with accurate documentation and CI enforcement
README.md Updated Docker Hub images table with web-pty-server and web-ui

Verification

✅ gateway: context '.' (repo root) - correct
✅ processor: context '.' (repo root) - correct
✅ web-pty-server: context '.' (repo root) - correct
✅ metrics-engine: context 'src/services/metrics-engine' (service-local OK)
✅ read-model: context 'src/services/read-model' (service-local OK)
✅ web-ui: context 'src/interfaces/web' (service-local OK)
✅ Docker build with repo root context: PASSED

Session Complete ✓

All 31.6 implementation phases completed successfully.