feat(e2e): run cdc-lifecycle on kubernetes via a Deployment adapter by kalbasit · Pull Request #1384 · kalbasit/ncps

kalbasit · 2026-06-09T22:48:53Z

Summary

Adds KubernetesDeployment — a Deployment-protocol implementation that runs the unified e2e phase drivers unchanged on Kind — and routes the cdc-lifecycle scenario through it, so it is no longer local-only. The runner gates on scenario name, so the multi-replica ha-s3-postgres-cdc-lifecycle permutation keeps its NCPSTester topology checks.
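
The name-based gate described above can be pictured with a small sketch. It is illustrative only: `ADAPTER_SCENARIOS` and `route_for` are hypothetical names, not taken from `runner.py`, and the sketch assumes the gate is an exact match on the scenario name.

```python
from typing import Literal

# Assumption: the set of scenario names that have a phase driver and are
# routed to the adapter. The real runner may derive this differently.
ADAPTER_SCENARIOS = frozenset({"cdc-lifecycle"})

Route = Literal["adapter", "ncps_tester"]


def route_for(scenario: str) -> Route:
    """Pick the execution path for a scenario under --mode kubernetes.

    The match is on the exact name, so a permutation whose name merely
    contains "cdc-lifecycle" is not captured by the adapter.
    """
    return "adapter" if scenario in ADAPTER_SCENARIOS else "ncps_tester"


if __name__ == "__main__":
    for name in ("cdc-lifecycle", "ha-s3-postgres-cdc-lifecycle"):
        print(name, "->", route_for(name))
```

With an exact match, `ha-s3-postgres-cdc-lifecycle` falls through to the NCPSTester path, which is the behavior the summary describes.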

Seams the kubernetes side previously lacked, now implemented:

- Per-pod addressing: one `kubectl port-forward pod/<n>` per replica, plus a run.py-shaped state.json so seed_cache builds through the cluster ncps.
- CDC enable/lazy toggle: `helm upgrade --set config.cdc.enabled=… --set config.cdc.lazyChunkingEnabled=…` + `kubectl rollout restart` (the previous tester could only disable CDC).
- run_subcommand: execs the shell-less ncps binary directly; during drain (ncps scaled to 0) runs migrate-chunks-to-nar/fsck in a one-shot pod cloned from the resolved pod spec (preserves the StatefulSet PVC).
- In-cluster db(): postgres/mysql via port-forward; sqlite via a `kubectl debug` sidecar sharing the ncps PID namespace, reading the live DB at /proc/1/root (the image is shell-less). During the drain window sqlite is read from a transient pod mounting the released PVC.
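
As a rough illustration of the CDC toggle seam, the sketch below only builds the argv lists for the two commands named above. The function names, the release/chart/namespace values, and the `true`/`false` rendering of the flags are assumptions; how the adapter carries over the rest of the Helm values is not shown here.

```python
from typing import List


def _bool(value: bool) -> str:
    # Assumption: the chart accepts lowercase true/false for these keys.
    return "true" if value else "false"


def helm_cdc_upgrade_argv(
    release: str, chart: str, namespace: str, *, enabled: bool, lazy: bool
) -> List[str]:
    """argv for a `helm upgrade` that flips the two CDC settings."""
    return [
        "helm", "upgrade", release, chart,
        "--namespace", namespace,
        "--set", f"config.cdc.enabled={_bool(enabled)}",
        "--set", f"config.cdc.lazyChunkingEnabled={_bool(lazy)}",
    ]


def rollout_restart_argv(workload: str, namespace: str) -> List[str]:
    """argv for `kubectl rollout restart`, e.g. workload="statefulset/ncps"."""
    return ["kubectl", "rollout", "restart", workload, "--namespace", namespace]


if __name__ == "__main__":
    # Hypothetical release, chart path, and namespace.
    print(helm_cdc_upgrade_argv("ncps", "./charts/ncps", "e2e", enabled=True, lazy=True))
    print(rollout_restart_argv("statefulset/ncps", "e2e"))
```

Keeping the argv construction pure makes it easy to assert on offline, with the actual process runner injected separately.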

Verified live on Kind

nix run .#e2e -- --mode kubernetes --scenario cdc-lifecycle → PASS, 41/41 checks (baseline → eager → lazy → drain → restart → fsck).

Scope decision: `staging-contention` stays local-only

Its adapter seams all work and serving is byte-correct cross-pod, but in-flight staging activation is a single-shot timing event that kubectl port-forward latency jitter de-synchronizes (the lock-holder caches the NAR before cross-pod waiters contend), so activation can't be reliably forced on Kind. The pin now documents that and the scenario SKIPs under --mode kubernetes. The adapter is ready to lift it later if the race is made deterministic.

No ncps production code, schema, or migration is touched. Offline unit coverage for the adapter + routing runs in checks.e2e-harness-unit.

Test plan

- task fmt / task lint (Go: 0 issues)
- task test (Go unit, race) green
- checks.e2e-harness-unit flake check (34 offline adapter/runner tests)
- openspec validate --strict
- Live Kind: cdc-lifecycle [kubernetes] 41/41 PASS
- Nightly --mode kubernetes --all exercises it (post-merge; staging-contention SKIPs)

Add KubernetesDeployment, a Deployment-protocol implementation that runs the unified e2e phase drivers unchanged on Kind, and route the cdc-lifecycle scenario through it so it is no longer local-only.

The adapter closes the seams the kubernetes side lacked: per-pod `kubectl port-forward` addressing (+ run.py state.json so seed_cache builds through the cluster ncps), CDC enable/lazy toggling via `helm upgrade --set` + rollout, `run_subcommand` execing the shell-less ncps binary directly, and in-cluster DB invariant reads. Because the ncps image is deliberately shell-less, sqlite is read via a `kubectl debug` sidecar that shares the ncps PID namespace and reads the live DB at /proc/1/root; during the drain window (ncps scaled to 0) sqlite is read from a transient pod mounting the released PVC, and migrate-chunks-to-nar runs in a one-shot pod cloned from the resolved pod spec.

Routing is gated by scenario name, so the multi-replica ha-s3-postgres-cdc-lifecycle permutation keeps its NCPSTester topology checks.

cdc-lifecycle verified live on Kind (41/41 checks: baseline -> eager -> lazy -> drain -> restart -> fsck).

staging-contention stays local-only: its adapter seams work and serving is byte-correct cross-pod, but in-flight staging activation is a single-shot timing event that port-forward latency jitter de-synchronizes, so activation cannot be reliably forced on Kind; the pin now documents that and the scenario SKIPs under --mode kubernetes.

Adds offline unit coverage for the adapter and the routing (checks.e2e-harness-unit); no ncps production code, schema, or migration is touched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

kalbasit · 2026-06-09T22:48:54Z

This change is part of the following stack:

feat(e2e): run cdc-lifecycle on kubernetes via a Deployment adapter #1384 ◀

Change managed by git-spice.

coderabbitai · 2026-06-09T22:49:05Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- cdc-lifecycle scenarios can now run in Kubernetes mode via a Kubernetes deployment adapter (broader e2e coverage).
Tests
- Added unit tests covering the Kubernetes adapter and scenario routing to validate behavior without a real cluster.
Documentation
- Updated README, spec, and design/proposal docs to clarify mode semantics, adapter behavior, and that staging-contention remains local-only (skipped in Kubernetes).

Walkthrough

Adds a KubernetesDeployment adapter to run e2e phase drivers on Kind with per-pod port-forwards, Helm-based CDC toggles, in-cluster DB access (sqlite via debug/reader pod; postgres/mysql via port-forward), integrates adapter routing into the runner, enables cdc-lifecycle for Kubernetes, keeps staging-contention local-only, and adds offline unit tests and spec/docs updates.
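
To make the per-pod addressing concrete, here is a minimal sketch of how one port-forward per replica and the matching local URLs might be derived. The helper names, the base local port, and the remote port `8501` are assumptions for illustration and are not confirmed by this PR.

```python
from typing import Dict, List, Sequence


def port_forward_argv(pod: str, namespace: str, local_port: int, remote_port: int) -> List[str]:
    """argv for a `kubectl port-forward` targeting one specific pod."""
    return [
        "kubectl", "port-forward", f"pod/{pod}",
        f"{local_port}:{remote_port}",
        "--namespace", namespace,
    ]


def replica_urls(pods: Sequence[str], base_port: int) -> Dict[str, str]:
    """Give each replica its own local port so tests can address pods individually."""
    return {pod: f"http://127.0.0.1:{base_port + i}" for i, pod in enumerate(pods)}


if __name__ == "__main__":
    pods = ["ncps-0", "ncps-1"]  # hypothetical pod names
    for i, pod in enumerate(pods):
        # 18501 and 8501 are placeholder ports, not values from the adapter.
        print(port_forward_argv(pod, "e2e", 18501 + i, 8501))
    print(replica_urls(pods, 18501))
```

Forwarding to `pod/<name>` rather than to a Service is what lets a scenario talk to a particular replica instead of whichever backend the Service picks.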

Changes

Kubernetes Deployment Adapter

| Layer / File(s) | Summary |
| --- | --- |
| Kubernetes Deployment Adapter Implementation `nix/e2e-tests/src/kubernetes_deployment.py` | New `KubernetesDeployment` class implementing the `Deployment` protocol for Kind. Manages cluster provisioning, per-replica port-forwarding, Helm-based CDC lifecycle (enable/disable, rollout restart), kubectl exec subcommand execution and oneshot pod fallback, in-cluster DB access (postgres/mysql via port-forward; sqlite via `kubectl debug` sidecar or reader pod), `state.json` writing, and teardown. |
| Runner Integration and Scenario Routing `nix/e2e-tests/src/runner.py` | Updates `_run_kubernetes` to route selected phase-driver scenarios through the `KubernetesDeployment` adapter with explicit provision/phase/teardown handling and leaves other permutations on the existing `kubernetes_mode.run_kubernetes_scenario` (NCPSTester) path. |
| Scenario Configuration and Pinning `nix/e2e-tests/config.nix`, `nix/e2e-tests/README.md` | Enables `cdc-lifecycle` for both `local` and `kubernetes` modes by updating the harness-only `modes` key; documents why `staging-contention` remains `local`-only (port-forward timing jitter) in config comments and README. |
| Kubernetes Deployment Adapter Unit Tests `nix/e2e-tests/tests/test_kubernetes_deployment.py` | Offline unit tests with injected fakes validating adapter behavior: namespace/release derivation, locker selection, provision order and port-forward spawning, state writer contents, replica URL formatting, CDC Helm `--set` flags and rollout restart, stop/scale behavior, run_subcommand exec and oneshot cloning, sqlite debug vs reader pod queries, postgres port-forward URL building, logs, teardown, and protocol surface. |
| Runner Scenario Routing Tests `nix/e2e-tests/tests/test_runner.py` | Unit tests asserting adapter routing for phase-driver scenarios and correct delegation to NCPSTester for plain permutations and HA `cdc-lifecycle` permutations. |
| Design, Specification, and Documentation `nix/e2e-tests/README.md`, `openspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/*`, `openspec/specs/unified-e2e-harness/spec.md` | README and archived design/proposal/spec documents describe adapter seams (provisioning, per-pod addressing, CDC toggles, DB access) and decisions. Spec updates formalize `--mode local |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

kalbasit/ncps#1348: Related Kubernetes-mode routing and cdc-lifecycle k8s test logic; overlaps on runner/adapter integration.
kalbasit/ncps#1381: Related updates to the e2e Kubernetes execution flow and runner routing foundations.

Suggested labels

merge-stack

Poem

🐰 Hopped the Kind fence, nose to the cluster,

Port-forwards humming, Helm flags flutter,
SQLite peeks through /proc/1/root,
Phase drivers march unchanged, adapters salute,
A tiny rabbit stamps tests and docs with a mutter.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.48% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main change: adding a Deployment adapter that enables the cdc-lifecycle scenario to run on Kubernetes, shifting it from local-only execution. |
| Description check | ✅ Passed | The description is well-detailed and directly related to the changeset, explaining the new KubernetesDeployment adapter, the seams implemented, and the reasoning for keeping staging-contention local-only. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


gemini-code-assist

Code Review

This pull request introduces a KubernetesDeployment adapter that implements the Deployment protocol, enabling the cdc-lifecycle and staging-contention phase drivers to run on a Kind cluster. It maps essential seams such as provisioning, per-pod port-forwarding, and in-cluster database access (including a kubectl debug sidecar for sqlite), and adds comprehensive offline unit tests. The review feedback highlights several opportunities to make the adapter more robust, such as clearing stale SQLite WAL/SHM files to prevent query corruption, safely parsing container UIDs and cleaning up temporary files, replacing static sleeps with socket-based readiness probes for database port-forwards, and properly reaping terminated port-forward processes to avoid zombie processes.
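
The WAL/SHM concern raised in this review can be illustrated on a local filesystem. The sketch below is not the adapter's code (which runs inside the sidecar); `snapshot_sqlite` is a hypothetical helper showing the idea of removing leftover scratch files before copying the live database and whichever companion files currently exist.

```python
import shutil
from pathlib import Path

# A sqlite database in WAL mode may have these companion files next to it.
SUFFIXES = ("", "-wal", "-shm")


def snapshot_sqlite(live_db: Path, scratch_db: Path) -> None:
    """Copy a live sqlite DB to a scratch path without inheriting stale files.

    If an earlier snapshot left scratch "-wal"/"-shm" files behind and the live
    DB has since been checkpointed (so it no longer has them), those leftovers
    would be paired with the freshly copied main file. Clearing first avoids that.
    """
    for suffix in SUFFIXES:
        Path(str(scratch_db) + suffix).unlink(missing_ok=True)
    for suffix in SUFFIXES:
        src = Path(str(live_db) + suffix)
        if src.exists():
            shutil.copyfile(src, Path(str(scratch_db) + suffix))


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as d:
        live, scratch = Path(d) / "live.db", Path(d) / "q.db"
        live.write_bytes(b"main")
        Path(str(scratch) + "-wal").write_bytes(b"stale")  # leftover from a prior read
        snapshot_sqlite(live, scratch)
        print(sorted(p.name for p in Path(d).iterdir()))
```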


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nix/e2e-tests/tests/test_runner.py`:
- Around line 38-77: The test currently never wires up
kubernetes_mode.run_kubernetes_scenario so calls["ncps_tester"] stays 0
trivially; before calling runner._run_kubernetes, monkeypatch
kubernetes_mode.run_kubernetes_scenario (via monkeypatch.setattr) to a callable
that increments calls["ncps_tester"] and returns an appropriate rc, so that the
final assertion assert calls["ncps_tester"] == 0 actually verifies the
ncps_tester path was not executed; keep existing _FakeDeployment, calls dict,
and use the same monkeypatch pattern as tests 2 and 3.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e1f1cd52-1eae-464c-ad93-72eb4b1a2c7c

📥 Commits

Reviewing files that changed from the base of the PR and between 18638fd and 33da656.

📒 Files selected for processing (12)

nix/e2e-tests/README.md
nix/e2e-tests/config.nix
nix/e2e-tests/src/kubernetes_deployment.py
nix/e2e-tests/src/runner.py
nix/e2e-tests/tests/test_kubernetes_deployment.py
nix/e2e-tests/tests/test_runner.py
openspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/.openspec.yaml
openspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/design.md
openspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/proposal.md
openspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/specs/unified-e2e-harness/spec.md
openspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/tasks.md
openspec/specs/unified-e2e-harness/spec.md

- _sqlite_via_sidecar: clear stale /tmp/q.db{,-wal,-shm} before copying the live DB, so a long-lived sidecar's leftover WAL/SHM from a prior query (whose live counterpart was since checkpointed) cannot corrupt the next read (gemini PRRT_kwDONV6tZs6IT12F).
- _ensure_sqlite_sidecar: guard runAsUser with .isdigit() and wrap the --custom temp file in try/finally so it is never leaked on error (gemini PRRT_kwDONV6tZs6IT12H).
- _db_url: replace the flat sleep(3) with a socket readiness probe (_wait_port_open) (gemini PRRT_kwDONV6tZs6IT12O).
- _reap(): terminate + wait (kill fallback) for port-forward procs in _close_forwards and teardown so no zombies are left (gemini PRRT_kwDONV6tZs6IT12Q, PRRT_kwDONV6tZs6IT12V).
- test_runner: wire kubernetes_mode.run_kubernetes_scenario so the "routed to adapter, not NCPSTester" assertion is meaningful, not a trivial 0 == 0 (coderabbit PRRT_kwDONV6tZs6IT4hC).

Adapter unit tests extended (stale-wal clear, forward reaping); offline suite green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
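
Two of the fixes in this commit, the socket readiness probe and the terminate-then-kill reaping, follow common patterns that can be sketched with the standard library alone. The functions below are an approximation under assumed names and timeouts; they are not the bodies of `_wait_port_open` or `_reap` from the adapter.

```python
import socket
import subprocess
import time


def wait_port_open(host: str, port: int, timeout: float = 30.0, interval: float = 0.2) -> bool:
    """Poll until a TCP connect succeeds, instead of sleeping a fixed time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


def reap(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """Stop a child process and collect its exit status so no zombie remains."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            proc.kill()  # fallback when the process ignores terminate()
            proc.wait()
    else:
        proc.wait()  # already exited; wait() still collects the status
    return proc.returncode
```

A caller would typically spawn the port-forward, call `wait_port_open("127.0.0.1", local_port)` before building a DB URL, and call `reap(proc)` from teardown.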

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nix/e2e-tests/src/kubernetes_deployment.py (1)
383-389: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail stop() when the workload never actually reaches zero.

If pods are still running after this loop, the method falls through and the harness proceeds as if the drain window is active. That can send run_subcommand() and sqlite reads down the live-pod path instead of the stopped-workload path, hiding a failed drain.
Proposed fix
     deadline = time.time() + 120
     while time.time() < deadline:
         if not self._pod_names():
             return
         self._sleep(1)
+    raise RuntimeError(
+        f"kubernetes: timed out waiting for {self.release} pods to stop"
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nix/e2e-tests/src/kubernetes_deployment.py` around lines 383 - 389, The
stop() method's pod-wait loop (using _pod_names()) currently falls through if
pods remain and allows the harness to continue as if the workload drained;
change stop() so that after the deadline loop it checks _pod_names() and if any
pods still exist it fails explicitly (e.g., raise a RuntimeError or a
HarnessDrainError) instead of returning normally, ensuring callers like
run_subcommand() and any sqlite read logic cannot continue down the live-pod
path when the workload never reached zero.
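
A standalone version of the suggested behavior, with the clock and sleep injected so it can be exercised offline, might look like the sketch below. `wait_until_stopped` and its parameters are hypothetical; only the timeout message mirrors the proposed fix.

```python
import time
from typing import Callable, Sequence


def wait_until_stopped(
    pod_names: Callable[[], Sequence[str]],
    release: str,
    *,
    timeout: float = 120.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Return once no pods remain; raise if the deadline passes first."""
    deadline = clock() + timeout
    while clock() < deadline:
        if not pod_names():
            return
        sleep(1)
    # Falling through silently would let callers assume the drain window is active.
    raise RuntimeError(f"kubernetes: timed out waiting for {release} pods to stop")


if __name__ == "__main__":
    now = [0.0]
    try:
        wait_until_stopped(
            lambda: ["ncps-0"],  # pods never go away in this demo
            "ncps",
            timeout=3,
            clock=lambda: now[0],
            sleep=lambda s: now.__setitem__(0, now[0] + s),
        )
    except RuntimeError as exc:
        print(exc)
```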

🧹 Nitpick comments (2)

nix/e2e-tests/tests/test_kubernetes_deployment.py (1)

94-99: ⚡ Quick win

Record input= in _Recorder so the manifest-building tests can assert the applied spec.

kubectl apply -f - sends the cloned pod manifest on stdin, but this fake runner drops it. That leaves the oneshot/reader-pod tests proving only that an apply happened, not that the PVC/env survived or the probes were stripped.

Minimal change

 class _Recorder:
     """Fake runner: records argv lists and returns canned outputs by matcher."""
 
     def __init__(self):
         self.calls: List[List[str]] = []
+        self.inputs = []
         self._responses = []  # list of (predicate, (rc, out))
         self.default = (0, "")
@@
     def __call__(self, args, *, timeout=180, check=True, input=None):
         self.calls.append(list(args))
+        self.inputs.append(input)
         for pred, resp in self._responses:
             if pred(args):
                 return resp
         return self.default

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nix/e2e-tests/tests/test_kubernetes_deployment.py` around lines 94 - 99, The
fake runner's __call__ in _Recorder is dropping the stdin payload so tests can't
assert applied manifests; modify _Recorder.__call__ (and its self.calls storage)
to record the input parameter as well (e.g., append a tuple or dict containing
args and input) so that tests can inspect the stdin passed to "kubectl apply -f
-"; ensure existing predicate/response logic still returns resp/default
unchanged while preserving the new recorded input for assertions.
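
For illustration, a self-contained fake runner that keeps the stdin payload alongside each argv could look like this. It is a sketch in the spirit of the suggestion, not the repository's `_Recorder`: the `on` and `stdin_for` helpers and the example manifest are assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Response = Tuple[int, str]


class Recorder:
    """Fake runner: records argv and stdin, returns canned outputs by matcher."""

    def __init__(self) -> None:
        self.calls: List[List[str]] = []
        self.inputs: List[Optional[str]] = []  # parallel to self.calls
        self._responses: List[Tuple[Callable[[Sequence[str]], bool], Response]] = []
        self.default: Response = (0, "")

    def on(self, pred: Callable[[Sequence[str]], bool], resp: Response) -> None:
        self._responses.append((pred, resp))

    def __call__(self, args, *, timeout=180, check=True, input=None) -> Response:
        self.calls.append(list(args))
        self.inputs.append(input)
        for pred, resp in self._responses:
            if pred(args):
                return resp
        return self.default

    def stdin_for(self, *needles: str) -> List[Optional[str]]:
        """stdin payloads of every recorded call whose argv contains all needles."""
        return [
            inp for call, inp in zip(self.calls, self.inputs)
            if all(n in call for n in needles)
        ]


if __name__ == "__main__":
    rec = Recorder()
    rec(["kubectl", "apply", "-f", "-"], input='{"kind": "Pod"}')
    print(rec.stdin_for("apply", "-"))
```

Keeping `inputs` as a list parallel to `calls` leaves existing assertions on `calls` untouched while letting manifest tests inspect what was piped to `kubectl apply -f -`.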

nix/e2e-tests/src/kubernetes_deployment.py (1)

56-57: ⚡ Quick win

Pin the sqlite helper image to an immutable tag or digest.

Using nouchka/sqlite3:latest makes this adapter path non-deterministic; an upstream retag can break the Kubernetes e2e flow without any repo change. Please pin the tested image version or digest here.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nix/e2e-tests/src/kubernetes_deployment.py` around lines 56 - 57, Replace the
mutable tag in the SQLITE_DEBUG_IMAGE constant with an immutable reference:
update SQLITE_DEBUG_IMAGE to point to a specific version tag or a
content-addressable digest (e.g., nouchka/sqlite3:vX.Y.Z or
nouchka/sqlite3@sha256:...) so the e2e Kubernetes behavior is deterministic;
ensure the chosen tag/digest is the one used in testing and update any test
infra docs if needed.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 72b50fc8-7f1e-44e6-bf3e-67fb65f504b3

📥 Commits

Reviewing files that changed from the base of the PR and between 33da656 and d886391.

📒 Files selected for processing (3)

nix/e2e-tests/src/kubernetes_deployment.py
nix/e2e-tests/tests/test_kubernetes_deployment.py
nix/e2e-tests/tests/test_runner.py

🚧 Files skipped from review as they are similar to previous changes (1)

nix/e2e-tests/tests/test_runner.py

dosubot Bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files), enhancement (New feature or request), and go (Pull requests that update go code) labels Jun 9, 2026

github-advanced-security AI found potential problems Jun 9, 2026

gemini-code-assist Bot reviewed Jun 9, 2026

coderabbitai Bot requested changes Jun 9, 2026

Comment thread nix/e2e-tests/tests/test_runner.py

coderabbitai Bot approved these changes Jun 9, 2026

kalbasit enabled auto-merge (squash) June 9, 2026 23:14

github-advanced-security AI found potential problems Jun 9, 2026

Comment thread nix/e2e-tests/src/kubernetes_deployment.py Dismissed

coderabbitai Bot reviewed Jun 9, 2026

kalbasit merged commit 364bccd into main Jun 9, 2026
39 checks passed

kalbasit deleted the user/wnasreddine/full-e2e-on-both-modes branch June 9, 2026 23:21
