Skip to content

feat(e2e): run cdc-lifecycle on kubernetes via a Deployment adapter#1384

Merged
kalbasit merged 2 commits into
mainfrom
user/wnasreddine/full-e2e-on-both-modes
Jun 9, 2026
Merged

feat(e2e): run cdc-lifecycle on kubernetes via a Deployment adapter#1384
kalbasit merged 2 commits into
mainfrom
user/wnasreddine/full-e2e-on-both-modes

Conversation

@kalbasit

@kalbasit kalbasit commented Jun 9, 2026

Copy link
Copy Markdown
Owner

Summary

Adds KubernetesDeployment — a Deployment-protocol implementation that runs the unified e2e phase drivers unchanged on Kind — and routes the cdc-lifecycle scenario through it, so it is no longer local-only. The runner gates on scenario name, so the multi-replica ha-s3-postgres-cdc-lifecycle permutation keeps its NCPSTester topology checks.

Seams the kubernetes side previously lacked, now implemented:

  • Per-pod addressing — one kubectl port-forward pod/<n> per replica, plus a run.py-shaped state.json so seed_cache builds through the cluster ncps.
  • CDC enable/lazy togglehelm upgrade --set config.cdc.enabled=… --set config.cdc.lazyChunkingEnabled=… + kubectl rollout restart (the previous tester could only disable CDC).
  • run_subcommand — execs the shell-less ncps binary directly; during drain (ncps scaled to 0) runs migrate-chunks-to-nar/fsck in a one-shot pod cloned from the resolved pod spec (preserves the StatefulSet PVC).
  • In-cluster db() — postgres/mysql via port-forward; sqlite via a kubectl debug sidecar sharing the ncps PID namespace, reading the live DB at /proc/1/root (the image is shell-less). During the drain window sqlite is read from a transient pod mounting the released PVC.

Verified live on Kind

nix run .#e2e -- --mode kubernetes --scenario cdc-lifecyclePASS, 41/41 checks (baseline → eager → lazy → drain → restart → fsck).

Scope decision: staging-contention stays local-only

Its adapter seams all work and serving is byte-correct cross-pod, but in-flight staging activation is a single-shot timing event that kubectl port-forward latency jitter de-synchronizes (the lock-holder caches the NAR before cross-pod waiters contend), so activation can't be reliably forced on Kind. The pin now documents that and the scenario SKIPs under --mode kubernetes. The adapter is ready to lift it later if the race is made deterministic.

No ncps production code, schema, or migration is touched. Offline unit coverage for the adapter + routing runs in checks.e2e-harness-unit.

Test plan

  • task fmt / task lint (Go: 0 issues)
  • task test (Go unit, race) green
  • checks.e2e-harness-unit flake check (34 offline adapter/runner tests)
  • openspec validate --strict
  • Live Kind: cdc-lifecycle [kubernetes] 41/41 PASS
  • Nightly --mode kubernetes --all exercises it (post-merge; staging-contention SKIPs)

Add KubernetesDeployment, a Deployment-protocol implementation that runs
the unified e2e phase drivers unchanged on Kind, and route the
cdc-lifecycle scenario through it so it is no longer local-only.

The adapter closes the seams the kubernetes side lacked: per-pod
`kubectl port-forward` addressing (+ run.py state.json so seed_cache
builds through the cluster ncps), CDC enable/lazy toggling via
`helm upgrade --set` + rollout, `run_subcommand` execing the shell-less
ncps binary directly, and in-cluster DB invariant reads. Because the
ncps image is deliberately shell-less, sqlite is read via a `kubectl
debug` sidecar that shares the ncps PID namespace and reads the live DB
at /proc/1/root; during the drain window (ncps scaled to 0) sqlite is
read from a transient pod mounting the released PVC, and
migrate-chunks-to-nar runs in a one-shot pod cloned from the resolved
pod spec. Routing is gated by scenario name, so the multi-replica
ha-s3-postgres-cdc-lifecycle permutation keeps its NCPSTester topology
checks.

cdc-lifecycle verified live on Kind (41/41 checks: baseline -> eager ->
lazy -> drain -> restart -> fsck). staging-contention stays local-only:
its adapter seams work and serving is byte-correct cross-pod, but
in-flight staging activation is a single-shot timing event that
port-forward latency jitter de-synchronizes, so activation cannot be
reliably forced on Kind; the pin now documents that and the scenario
SKIPs under --mode kubernetes.

Adds offline unit coverage for the adapter and the routing
(checks.e2e-harness-unit); no ncps production code, schema, or migration
is touched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kalbasit

kalbasit commented Jun 9, 2026

Copy link
Copy Markdown
Owner Author

This change is part of the following stack:

Change managed by git-spice.

@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. enhancement New feature or request go Pull requests that update go code labels Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • cdc-lifecycle scenarios can now run in Kubernetes mode via a Kubernetes deployment adapter (broader e2e coverage).
  • Tests

    • Added unit tests covering the Kubernetes adapter and scenario routing to validate behavior without a real cluster.
  • Documentation

    • Updated README, spec, and design/proposal docs to clarify mode semantics, adapter behavior, and that staging-contention remains local-only (skipped in Kubernetes).

Walkthrough

Adds a KubernetesDeployment adapter to run e2e phase drivers on Kind with per-pod port-forwards, Helm-based CDC toggles, in-cluster DB access (sqlite via debug/reader pod; postgres/mysql via port-forward), integrates adapter routing into the runner, enables cdc-lifecycle for Kubernetes, keeps staging-contention local-only, and adds offline unit tests and spec/docs updates.

Changes

Kubernetes Deployment Adapter

Layer / File(s) Summary
Kubernetes Deployment Adapter Implementation
nix/e2e-tests/src/kubernetes_deployment.py
New KubernetesDeployment class implementing the Deployment protocol for Kind. Manages cluster provisioning, per-replica port-forwarding, Helm-based CDC lifecycle (enable/disable, rollout restart), kubectl exec subcommand execution and oneshot pod fallback, in-cluster DB access (postgres/mysql via port-forward; sqlite via kubectl debug sidecar or reader pod), state.json writing, and teardown.
Runner Integration and Scenario Routing
nix/e2e-tests/src/runner.py
Updates _run_kubernetes to route selected phase-driver scenarios through the KubernetesDeployment adapter with explicit provision/phase/teardown handling and leaves other permutations on the existing kubernetes_mode.run_kubernetes_scenario (NCPSTester) path.
Scenario Configuration and Pinning
nix/e2e-tests/config.nix, nix/e2e-tests/README.md
Enables cdc-lifecycle for both local and kubernetes modes by updating the harness-only modes key; documents why staging-contention remains local-only (port-forward timing jitter) in config comments and README.
Kubernetes Deployment Adapter Unit Tests
nix/e2e-tests/tests/test_kubernetes_deployment.py
Offline unit tests with injected fakes validating adapter behavior: namespace/release derivation, locker selection, provision order and port-forward spawning, state writer contents, replica URL formatting, CDC Helm --set flags and rollout restart, stop/scale behavior, run_subcommand exec and oneshot cloning, sqlite debug vs reader pod queries, postgres port-forward URL building, logs, teardown, and protocol surface.
Runner Scenario Routing Tests
nix/e2e-tests/tests/test_runner.py
Unit tests asserting adapter routing for phase-driver scenarios and correct delegation to NCPSTester for plain permutations and HA cdc-lifecycle permutations.
Design, Specification, and Documentation
nix/e2e-tests/README.md, openspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/*, openspec/specs/unified-e2e-harness/spec.md
README and archived design/proposal/spec documents describe adapter seams (provisioning, per-pod addressing, CDC toggles, DB access) and decisions. Spec updates formalize `--mode local

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • kalbasit/ncps#1348: Related Kubernetes-mode routing and cdc-lifecycle k8s test logic; overlaps on runner/adapter integration.
  • kalbasit/ncps#1381: Related updates to the e2e Kubernetes execution flow and runner routing foundations.

Suggested labels

merge-stack

Poem

🐰 Hopped the Kind fence, nose to the cluster,

Port-forwards humming, Helm flags flutter,
SQLite peeks through /proc/1/root,
Phase drivers march unchanged, adapters salute,
A tiny rabbit stamps tests and docs with a mutter.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.48% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: adding a Deployment adapter that enables the cdc-lifecycle scenario to run on Kubernetes, shifting it from local-only execution.
Description check ✅ Passed The description is well-detailed and directly related to the changeset, explaining the new KubernetesDeployment adapter, the seams implemented, and the reasoning for keeping staging-contention local-only.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread nix/e2e-tests/src/kubernetes_deployment.py Dismissed
Comment thread nix/e2e-tests/src/kubernetes_deployment.py Dismissed
Comment thread nix/e2e-tests/src/kubernetes_deployment.py Dismissed
Comment thread nix/e2e-tests/src/kubernetes_deployment.py Dismissed
Comment thread nix/e2e-tests/src/kubernetes_deployment.py Dismissed
Comment thread nix/e2e-tests/tests/test_kubernetes_deployment.py Dismissed
Comment thread nix/e2e-tests/tests/test_kubernetes_deployment.py Dismissed

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a KubernetesDeployment adapter that implements the Deployment protocol, enabling the cdc-lifecycle and staging-contention phase drivers to run on a Kind cluster. It maps essential seams such as provisioning, per-pod port-forwarding, and in-cluster database access (including a kubectl debug sidecar for sqlite), and adds comprehensive offline unit tests. The review feedback highlights several opportunities to make the adapter more robust, such as clearing stale SQLite WAL/SHM files to prevent query corruption, safely parsing container UIDs and cleaning up temporary files, replacing static sleeps with socket-based readiness probes for database port-forwards, and properly reaping terminated port-forward processes to avoid zombie processes.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread nix/e2e-tests/src/kubernetes_deployment.py
Comment thread nix/e2e-tests/src/kubernetes_deployment.py Outdated
Comment thread nix/e2e-tests/src/kubernetes_deployment.py Outdated
Comment thread nix/e2e-tests/src/kubernetes_deployment.py
Comment thread nix/e2e-tests/src/kubernetes_deployment.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nix/e2e-tests/tests/test_runner.py`:
- Around line 38-77: The test currently never wires up
kubernetes_mode.run_kubernetes_scenario so calls["ncps_tester"] stays 0
trivially; before calling runner._run_kubernetes, monkeypatch
kubernetes_mode.run_kubernetes_scenario (via monkeypatch.setattr) to a callable
that increments calls["ncps_tester"] and returns an appropriate rc, so that the
final assertion assert calls["ncps_tester"] == 0 actually verifies the
ncps_tester path was not executed; keep existing _FakeDeployment, calls dict,
and use the same monkeypatch pattern as tests 2 and 3.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e1f1cd52-1eae-464c-ad93-72eb4b1a2c7c

📥 Commits

Reviewing files that changed from the base of the PR and between 18638fd and 33da656.

📒 Files selected for processing (12)
  • nix/e2e-tests/README.md
  • nix/e2e-tests/config.nix
  • nix/e2e-tests/src/kubernetes_deployment.py
  • nix/e2e-tests/src/runner.py
  • nix/e2e-tests/tests/test_kubernetes_deployment.py
  • nix/e2e-tests/tests/test_runner.py
  • openspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/.openspec.yaml
  • openspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/design.md
  • openspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/proposal.md
  • openspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/specs/unified-e2e-harness/spec.md
  • openspec/changes/archive/2026-06-09-e2e-kubernetes-deployment-adapter/tasks.md
  • openspec/specs/unified-e2e-harness/spec.md

Comment thread nix/e2e-tests/tests/test_runner.py
- _sqlite_via_sidecar: clear stale /tmp/q.db{,-wal,-shm} before copying
  the live DB, so a long-lived sidecar's leftover WAL/SHM from a prior
  query (whose live counterpart was since checkpointed) cannot corrupt
  the next read (gemini PRRT_kwDONV6tZs6IT12F).
- _ensure_sqlite_sidecar: guard runAsUser with .isdigit() and wrap the
  --custom temp file in try/finally so it is never leaked on error
  (gemini PRRT_kwDONV6tZs6IT12H).
- _db_url: replace the flat sleep(3) with a socket readiness probe
  (_wait_port_open) (gemini PRRT_kwDONV6tZs6IT12O).
- _reap(): terminate + wait (kill fallback) for port-forward procs in
  _close_forwards and teardown so no zombies are left
  (gemini PRRT_kwDONV6tZs6IT12Q, PRRT_kwDONV6tZs6IT12V).
- test_runner: wire kubernetes_mode.run_kubernetes_scenario so the
  "routed to adapter, not NCPSTester" assertion is meaningful, not a
  trivial 0 == 0 (coderabbit PRRT_kwDONV6tZs6IT4hC).

Adapter unit tests extended (stale-wal clear, forward reaping); offline
suite green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kalbasit kalbasit enabled auto-merge (squash) June 9, 2026 23:14
Comment thread nix/e2e-tests/src/kubernetes_deployment.py Dismissed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nix/e2e-tests/src/kubernetes_deployment.py (1)

383-389: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail stop() when the workload never actually reaches zero.

If pods are still running after this loop, the method falls through and the harness proceeds as if the drain window is active. That can send run_subcommand() and sqlite reads down the live-pod path instead of the stopped-workload path, hiding a failed drain.

Proposed fix
     deadline = time.time() + 120
     while time.time() < deadline:
         if not self._pod_names():
             return
         self._sleep(1)
+    raise RuntimeError(
+        f"kubernetes: timed out waiting for {self.release} pods to stop"
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nix/e2e-tests/src/kubernetes_deployment.py` around lines 383 - 389, The
stop() method's pod-wait loop (using _pod_names()) currently falls through if
pods remain and allows the harness to continue as if the workload drained;
change stop() so that after the deadline loop it checks _pod_names() and if any
pods still exist it fails explicitly (e.g., raise a RuntimeError or a
HarnessDrainError) instead of returning normally, ensuring callers like
run_subcommand() and any sqlite read logic cannot continue down the live-pod
path when the workload never reached zero.
🧹 Nitpick comments (2)
nix/e2e-tests/tests/test_kubernetes_deployment.py (1)

94-99: ⚡ Quick win

Record input= in _Recorder so the manifest-building tests can assert the applied spec.

kubectl apply -f - sends the cloned pod manifest on stdin, but this fake runner drops it. That leaves the oneshot/reader-pod tests proving only that an apply happened, not that the PVC/env survived or the probes were stripped.

Minimal change
 class _Recorder:
     """Fake runner: records argv lists and returns canned outputs by matcher."""
 
     def __init__(self):
         self.calls: List[List[str]] = []
+        self.inputs = []
         self._responses = []  # list of (predicate, (rc, out))
         self.default = (0, "")
@@
     def __call__(self, args, *, timeout=180, check=True, input=None):
         self.calls.append(list(args))
+        self.inputs.append(input)
         for pred, resp in self._responses:
             if pred(args):
                 return resp
         return self.default
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nix/e2e-tests/tests/test_kubernetes_deployment.py` around lines 94 - 99, The
fake runner's __call__ in _Recorder is dropping the stdin payload so tests can't
assert applied manifests; modify _Recorder.__call__ (and its self.calls storage)
to record the input parameter as well (e.g., append a tuple or dict containing
args and input) so that tests can inspect the stdin passed to "kubectl apply -f
-"; ensure existing predicate/response logic still returns resp/default
unchanged while preserving the new recorded input for assertions.
nix/e2e-tests/src/kubernetes_deployment.py (1)

56-57: ⚡ Quick win

Pin the sqlite helper image to an immutable tag or digest.

Using nouchka/sqlite3:latest makes this adapter path non-deterministic; an upstream retag can break the Kubernetes e2e flow without any repo change. Please pin the tested image version or digest here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nix/e2e-tests/src/kubernetes_deployment.py` around lines 56 - 57, Replace the
mutable tag in the SQLITE_DEBUG_IMAGE constant with an immutable reference:
update SQLITE_DEBUG_IMAGE to point to a specific version tag or a
content-addressable digest (e.g., nouchka/sqlite3:vX.Y.Z or
nouchka/sqlite3@sha256:...) so the e2e Kubernetes behavior is deterministic;
ensure the chosen tag/digest is the one used in testing and update any test
infra docs if needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@nix/e2e-tests/src/kubernetes_deployment.py`:
- Around line 383-389: The stop() method's pod-wait loop (using _pod_names())
currently falls through if pods remain and allows the harness to continue as if
the workload drained; change stop() so that after the deadline loop it checks
_pod_names() and if any pods still exist it fails explicitly (e.g., raise a
RuntimeError or a HarnessDrainError) instead of returning normally, ensuring
callers like run_subcommand() and any sqlite read logic cannot continue down the
live-pod path when the workload never reached zero.

---

Nitpick comments:
In `@nix/e2e-tests/src/kubernetes_deployment.py`:
- Around line 56-57: Replace the mutable tag in the SQLITE_DEBUG_IMAGE constant
with an immutable reference: update SQLITE_DEBUG_IMAGE to point to a specific
version tag or a content-addressable digest (e.g., nouchka/sqlite3:vX.Y.Z or
nouchka/sqlite3@sha256:...) so the e2e Kubernetes behavior is deterministic;
ensure the chosen tag/digest is the one used in testing and update any test
infra docs if needed.

In `@nix/e2e-tests/tests/test_kubernetes_deployment.py`:
- Around line 94-99: The fake runner's __call__ in _Recorder is dropping the
stdin payload so tests can't assert applied manifests; modify _Recorder.__call__
(and its self.calls storage) to record the input parameter as well (e.g., append
a tuple or dict containing args and input) so that tests can inspect the stdin
passed to "kubectl apply -f -"; ensure existing predicate/response logic still
returns resp/default unchanged while preserving the new recorded input for
assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 72b50fc8-7f1e-44e6-bf3e-67fb65f504b3

📥 Commits

Reviewing files that changed from the base of the PR and between 33da656 and d886391.

📒 Files selected for processing (3)
  • nix/e2e-tests/src/kubernetes_deployment.py
  • nix/e2e-tests/tests/test_kubernetes_deployment.py
  • nix/e2e-tests/tests/test_runner.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nix/e2e-tests/tests/test_runner.py

@kalbasit kalbasit merged commit 364bccd into main Jun 9, 2026
39 checks passed
@kalbasit kalbasit deleted the user/wnasreddine/full-e2e-on-both-modes branch June 9, 2026 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request go Pull requests that update go code size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants