feat(e2e): run cdc-lifecycle on kubernetes via a Deployment adapter#1384
Conversation
Add KubernetesDeployment, a Deployment-protocol implementation that runs the unified e2e phase drivers unchanged on Kind, and route the cdc-lifecycle scenario through it so it is no longer local-only. The adapter closes the seams the kubernetes side lacked: per-pod `kubectl port-forward` addressing (+ run.py state.json so seed_cache builds through the cluster ncps), CDC enable/lazy toggling via `helm upgrade --set` + rollout, `run_subcommand` execing the shell-less ncps binary directly, and in-cluster DB invariant reads. Because the ncps image is deliberately shell-less, sqlite is read via a `kubectl debug` sidecar that shares the ncps PID namespace and reads the live DB at /proc/1/root; during the drain window (ncps scaled to 0) sqlite is read from a transient pod mounting the released PVC, and migrate-chunks-to-nar runs in a one-shot pod cloned from the resolved pod spec. Routing is gated by scenario name, so the multi-replica ha-s3-postgres-cdc-lifecycle permutation keeps its NCPSTester topology checks. cdc-lifecycle verified live on Kind (41/41 checks: baseline -> eager -> lazy -> drain -> restart -> fsck). staging-contention stays local-only: its adapter seams work and serving is byte-correct cross-pod, but in-flight staging activation is a single-shot timing event that port-forward latency jitter de-synchronizes, so activation cannot be reliably forced on Kind; the pin now documents that and the scenario SKIPs under --mode kubernetes. Adds offline unit coverage for the adapter and the routing (checks.e2e-harness-unit); no ncps production code, schema, or migration is touched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
This change is part of the following stack: Change managed by git-spice. |
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds a KubernetesDeployment adapter to run e2e phase drivers on Kind with per-pod port-forwards, Helm-based CDC toggles, in-cluster DB access (sqlite via debug/reader pod; postgres/mysql via port-forward), integrates adapter routing into the runner, enables ChangesKubernetes Deployment Adapter
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a KubernetesDeployment adapter that implements the Deployment protocol, enabling the cdc-lifecycle and staging-contention phase drivers to run on a Kind cluster. It maps essential seams such as provisioning, per-pod port-forwarding, and in-cluster database access (including a kubectl debug sidecar for sqlite), and adds comprehensive offline unit tests. The review feedback highlights several opportunities to make the adapter more robust, such as clearing stale SQLite WAL/SHM files to prevent query corruption, safely parsing container UIDs and cleaning up temporary files, replacing static sleeps with socket-based readiness probes for database port-forwards, and properly reaping terminated port-forward processes to avoid zombie processes.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nix/e2e-tests/tests/test_runner.py`:
- Around line 38-77: The test currently never wires up
kubernetes_mode.run_kubernetes_scenario so calls["ncps_tester"] stays 0
trivially; before calling runner._run_kubernetes, monkeypatch
kubernetes_mode.run_kubernetes_scenario (via monkeypatch.setattr) to a callable
that increments calls["ncps_tester"] and returns an appropriate rc, so that the
final assertion assert calls["ncps_tester"] == 0 actually verifies the
ncps_tester path was not executed; keep existing _FakeDeployment, calls dict,
and use the same monkeypatch pattern as tests 2 and 3.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e1f1cd52-1eae-464c-ad93-72eb4b1a2c7c
📒 Files selected for processing (12)
nix/e2e-tests/README.mdnix/e2e-tests/config.nixnix/e2e-tests/src/kubernetes_deployment.pynix/e2e-tests/src/runner.pynix/e2e-tests/tests/test_kubernetes_deployment.pynix/e2e-tests/tests/test_runner.pyopenspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/.openspec.yamlopenspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/design.mdopenspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/proposal.mdopenspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/specs/unified-e2e-harness/spec.mdopenspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/tasks.mdopenspec/specs/unified-e2e-harness/spec.md
- _sqlite_via_sidecar: clear stale /tmp/q.db{,-wal,-shm} before copying
the live DB, so a long-lived sidecar's leftover WAL/SHM from a prior
query (whose live counterpart was since checkpointed) cannot corrupt
the next read (gemini PRRT_kwDONV6tZs6IT12F).
- _ensure_sqlite_sidecar: guard runAsUser with .isdigit() and wrap the
--custom temp file in try/finally so it is never leaked on error
(gemini PRRT_kwDONV6tZs6IT12H).
- _db_url: replace the flat sleep(3) with a socket readiness probe
(_wait_port_open) (gemini PRRT_kwDONV6tZs6IT12O).
- _reap(): terminate + wait (kill fallback) for port-forward procs in
_close_forwards and teardown so no zombies are left
(gemini PRRT_kwDONV6tZs6IT12Q, PRRT_kwDONV6tZs6IT12V).
- test_runner: wire kubernetes_mode.run_kubernetes_scenario so the
"routed to adapter, not NCPSTester" assertion is meaningful, not a
trivial 0 == 0 (coderabbit PRRT_kwDONV6tZs6IT4hC).
Adapter unit tests extended (stale-wal clear, forward reaping); offline
suite green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nix/e2e-tests/src/kubernetes_deployment.py (1)
383-389:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail
stop()when the workload never actually reaches zero.If pods are still running after this loop, the method falls through and the harness proceeds as if the drain window is active. That can send
run_subcommand()and sqlite reads down the live-pod path instead of the stopped-workload path, hiding a failed drain.Proposed fix
deadline = time.time() + 120 while time.time() < deadline: if not self._pod_names(): return self._sleep(1) + raise RuntimeError( + f"kubernetes: timed out waiting for {self.release} pods to stop" + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nix/e2e-tests/src/kubernetes_deployment.py` around lines 383 - 389, The stop() method's pod-wait loop (using _pod_names()) currently falls through if pods remain and allows the harness to continue as if the workload drained; change stop() so that after the deadline loop it checks _pod_names() and if any pods still exist it fails explicitly (e.g., raise a RuntimeError or a HarnessDrainError) instead of returning normally, ensuring callers like run_subcommand() and any sqlite read logic cannot continue down the live-pod path when the workload never reached zero.
🧹 Nitpick comments (2)
nix/e2e-tests/tests/test_kubernetes_deployment.py (1)
94-99: ⚡ Quick winRecord
input=in_Recorderso the manifest-building tests can assert the applied spec.
kubectl apply -f -sends the cloned pod manifest on stdin, but this fake runner drops it. That leaves the oneshot/reader-pod tests proving only that an apply happened, not that the PVC/env survived or the probes were stripped.Minimal change
class _Recorder: """Fake runner: records argv lists and returns canned outputs by matcher.""" def __init__(self): self.calls: List[List[str]] = [] + self.inputs = [] self._responses = [] # list of (predicate, (rc, out)) self.default = (0, "") @@ def __call__(self, args, *, timeout=180, check=True, input=None): self.calls.append(list(args)) + self.inputs.append(input) for pred, resp in self._responses: if pred(args): return resp return self.default🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nix/e2e-tests/tests/test_kubernetes_deployment.py` around lines 94 - 99, The fake runner's __call__ in _Recorder is dropping the stdin payload so tests can't assert applied manifests; modify _Recorder.__call__ (and its self.calls storage) to record the input parameter as well (e.g., append a tuple or dict containing args and input) so that tests can inspect the stdin passed to "kubectl apply -f -"; ensure existing predicate/response logic still returns resp/default unchanged while preserving the new recorded input for assertions.nix/e2e-tests/src/kubernetes_deployment.py (1)
56-57: ⚡ Quick winPin the sqlite helper image to an immutable tag or digest.
Using
nouchka/sqlite3:latestmakes this adapter path non-deterministic; an upstream retag can break the Kubernetes e2e flow without any repo change. Please pin the tested image version or digest here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nix/e2e-tests/src/kubernetes_deployment.py` around lines 56 - 57, Replace the mutable tag in the SQLITE_DEBUG_IMAGE constant with an immutable reference: update SQLITE_DEBUG_IMAGE to point to a specific version tag or a content-addressable digest (e.g., nouchka/sqlite3:vX.Y.Z or nouchka/sqlite3@sha256:...) so the e2e Kubernetes behavior is deterministic; ensure the chosen tag/digest is the one used in testing and update any test infra docs if needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@nix/e2e-tests/src/kubernetes_deployment.py`:
- Around line 383-389: The stop() method's pod-wait loop (using _pod_names())
currently falls through if pods remain and allows the harness to continue as if
the workload drained; change stop() so that after the deadline loop it checks
_pod_names() and if any pods still exist it fails explicitly (e.g., raise a
RuntimeError or a HarnessDrainError) instead of returning normally, ensuring
callers like run_subcommand() and any sqlite read logic cannot continue down the
live-pod path when the workload never reached zero.
---
Nitpick comments:
In `@nix/e2e-tests/src/kubernetes_deployment.py`:
- Around line 56-57: Replace the mutable tag in the SQLITE_DEBUG_IMAGE constant
with an immutable reference: update SQLITE_DEBUG_IMAGE to point to a specific
version tag or a content-addressable digest (e.g., nouchka/sqlite3:vX.Y.Z or
nouchka/sqlite3@sha256:...) so the e2e Kubernetes behavior is deterministic;
ensure the chosen tag/digest is the one used in testing and update any test
infra docs if needed.
In `@nix/e2e-tests/tests/test_kubernetes_deployment.py`:
- Around line 94-99: The fake runner's __call__ in _Recorder is dropping the
stdin payload so tests can't assert applied manifests; modify _Recorder.__call__
(and its self.calls storage) to record the input parameter as well (e.g., append
a tuple or dict containing args and input) so that tests can inspect the stdin
passed to "kubectl apply -f -"; ensure existing predicate/response logic still
returns resp/default unchanged while preserving the new recorded input for
assertions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 72b50fc8-7f1e-44e6-bf3e-67fb65f504b3
📒 Files selected for processing (3)
nix/e2e-tests/src/kubernetes_deployment.pynix/e2e-tests/tests/test_kubernetes_deployment.pynix/e2e-tests/tests/test_runner.py
🚧 Files skipped from review as they are similar to previous changes (1)
- nix/e2e-tests/tests/test_runner.py
Summary
Adds
KubernetesDeployment— aDeployment-protocol implementation that runs the unified e2e phase drivers unchanged on Kind — and routes thecdc-lifecyclescenario through it, so it is no longerlocal-only. The runner gates on scenario name, so the multi-replicaha-s3-postgres-cdc-lifecyclepermutation keeps itsNCPSTestertopology checks.Seams the kubernetes side previously lacked, now implemented:
kubectl port-forward pod/<n>per replica, plus a run.py-shapedstate.jsonsoseed_cachebuilds through the cluster ncps.helm upgrade --set config.cdc.enabled=… --set config.cdc.lazyChunkingEnabled=…+kubectl rollout restart(the previous tester could only disable CDC).run_subcommand— execs the shell-less ncps binary directly; during drain (ncps scaled to 0) runsmigrate-chunks-to-nar/fsckin a one-shot pod cloned from the resolved pod spec (preserves the StatefulSet PVC).db()— postgres/mysql via port-forward; sqlite via akubectl debugsidecar sharing the ncps PID namespace, reading the live DB at/proc/1/root(the image is shell-less). During the drain window sqlite is read from a transient pod mounting the released PVC.Verified live on Kind
nix run .#e2e -- --mode kubernetes --scenario cdc-lifecycle→ PASS, 41/41 checks (baseline → eager → lazy → drain → restart → fsck).Scope decision:
staging-contentionstays local-onlyIts adapter seams all work and serving is byte-correct cross-pod, but in-flight staging activation is a single-shot timing event that
kubectl port-forwardlatency jitter de-synchronizes (the lock-holder caches the NAR before cross-pod waiters contend), so activation can't be reliably forced on Kind. The pin now documents that and the scenario SKIPs under--mode kubernetes. The adapter is ready to lift it later if the race is made deterministic.No ncps production code, schema, or migration is touched. Offline unit coverage for the adapter + routing runs in
checks.e2e-harness-unit.Test plan
task fmt/task lint(Go: 0 issues)task test(Go unit, race) greenchecks.e2e-harness-unitflake check (34 offline adapter/runner tests)openspec validate --strictcdc-lifecycle [kubernetes]41/41 PASS--mode kubernetes --allexercises it (post-merge; staging-contention SKIPs)