Skip to content

Enhancement: Add a ticker to check job completion as a fallback#347

Open
feichashao wants to merge 2 commits into
openshift:mainfrom
feichashao:pod_watch
Open

Enhancement: Add a ticker to check job completion as a fallback#347
feichashao wants to merge 2 commits into
openshift:mainfrom
feichashao:pod_watch

Conversation

@feichashao
Copy link
Copy Markdown

@feichashao feichashao commented May 14, 2026

Enhancement

What does this PR do? / Related Issues / Jira

Assisted by Claude Code.

This PR adds a ticker to check the job completion.

When using osdctl network verify-egress with pod mode, it hangs when waiting for job completion:

osdctl network verify-egress -C xxxxxx --reason "xxxxxx"

2026/05/14 14:17:07 Cluster is HCP - forcing pod mode.
2026/05/14 14:17:07 Preparing to run pod-based network verification in namespace openshift-network-diagnostics.
2026/05/14 14:17:10 Pod mode using elevated backplane credentials (backplane-cluster-admin) for cluster: xxxxxx
2026/05/14 14:17:10 Pod mode initialized with namespace: openshift-network-diagnostics
2026/05/14 14:17:10 Detected AWS region from OCM: us-east-2
Using egress URL list from https://api.github.com/repos/openshift/osd-network-verifier/contents/pkg/data/egress_lists/aws-hcp.yaml?ref=main at SHA 1cb7c15e951908f44d5ff3b3589ec763cea917c9
^C

(hangs here)

From the cluster, I can see the job has already completed.

It could be some issues that lead to the watch broken, and the WaitForJobCompletion keeps in a loop.

This PR adds a ticker to get the job status every 10s, as a fallback.

After this change, the verify-egress can succeed:

osdctl network verify-egress -C xxxxxxx --reason "xxxxxx"
2026/05/14 15:36:45 Cluster is HCP - forcing pod mode.
2026/05/14 15:36:45 Preparing to run pod-based network verification in namespace openshift-network-diagnostics.
2026/05/14 15:36:49 Pod mode using elevated backplane credentials (backplane-cluster-admin) for cluster: xxxxxx
2026/05/14 15:36:49 Pod mode initialized with namespace: openshift-network-diagnostics
2026/05/14 15:36:49 Detected AWS region from OCM: us-east-2
Using egress URL list from https://api.github.com/repos/openshift/osd-network-verifier/contents/pkg/data/egress_lists/aws-hcp.yaml?ref=main at SHA 1cb7c15e951908f44d5ff3b3589ec763cea917c9
Summary:
All tests passed!

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have tested the functionality against gcp / aws, it doesn't cause any regression
  • I have added execution results to the PR's readme

Reviewer's Checklist

  • (This needs to be done after technical review) I've run the branch on my local, verified that the functionality is ok

How to test this PR locally / Special Instructions

Logs

Summary by CodeRabbit

  • Bug Fixes
    • Improved job completion monitoring by combining event watches with periodic polling and more robust handling of watch interruptions, reducing false positives/negatives and improving reliability of status detection.
  • Tests
    • Added unit tests covering various job completion and failure scenarios, including timeout behavior, to validate the waiting logic.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

Walkthrough

WaitForJobCompletion now polls job status every 10s via a time.Ticker as a fallback alongside the existing Kubernetes watch. Both watch events and ticker-driven fetches use a new checkJobCompletion helper that returns nil on success, an error on failure, or a package sentinel error (errJobIncomplete) to indicate "keep waiting."

Changes

Job Completion Detection with Polling Fallback

Layer / File(s) Summary
Sentinel and completion-check helper
pkg/clients/kube/kube.go (lines 107–109, 119–157)
Add package sentinel errJobIncomplete and checkJobCompletion(job) which centralizes detection of completion/failure conditions (JobComplete, JobFailed, JobSuccessCriteriaMet, JobFailureTarget) and returns sentinel when still waiting.
Polling-driven wait loop with watch integration
pkg/clients/kube/kube.go (lines 119–161)
WaitForJobCompletion now creates time.NewTicker(10 * time.Second) and selects over ctx.Done, ticker.C and the watch channel; ticker ticks re-fetch the Job (ignoring transient Get errors) and both paths use checkJobCompletion; watch closure handled via ok check.
Table-driven tests for scenarios and timeouts
pkg/clients/kube/kube_test.go (new test function, imports)
Add TestClient_WaitForJobCompletion (imports strings, time), exercising Job condition variants and timeout behavior using a fake Kubernetes client and per-case expectations for success or specific error substrings.

Sequence Diagram

sequenceDiagram
  participant Caller as Client
  participant Watch as Kube Watch
  participant API as Kubernetes API
  participant Ticker as Poll Ticker

  Caller->>Watch: Start watch for Job events
  Caller->>Ticker: Start 10s ticker
  alt Watch event arrives
    Watch->>Caller: Job event
    Caller->>Caller: checkJobCompletion(event.Job)
    alt complete
      Caller-->>Caller: return nil
    else failed
      Caller-->>Caller: return error
    else incomplete
      Caller-->>Caller: continue waiting
    end
  else Ticker tick
    Ticker->>API: Get Job
    API->>Caller: Job object
    Caller->>Caller: checkJobCompletion(job)
    alt complete
      Caller-->>Caller: return nil
    else failed
      Caller-->>Caller: return error
    else incomplete
      Caller-->>Caller: continue waiting
    end
  end
  Caller->>Caller: ctx.Done() -> return ctx error
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding a ticker (periodic polling fallback) to check job completion in the WaitForJobCompletion function.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Code uses standard Go testing, not Ginkgo. Test names are stable and descriptive. Check not applicable.
Test Structure And Quality ✅ Passed Test meets all quality requirements: proper setup/cleanup, timeouts with defer, meaningful assertion messages, single responsibility per test case, and follows codebase patterns consistently.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests were added in this PR. The new test TestClient_WaitForJobCompletion is a standard Go unit test using the testing package, not a Ginkgo e2e test. The custom check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests added. New test is a standard Go unit test, not Ginkgo. SNO check applies only to Ginkgo e2e tests.
Topology-Aware Scheduling Compatibility ✅ Passed This PR modifies only a client utility library for Job polling. No scheduling constraints, deployment manifests, operator code, or topology-dependent logic are introduced. Check does not apply.
Ote Binary Stdout Contract ✅ Passed PR modifies only library code and unit tests. No process-level code (main, init, TestMain, BeforeSuite) modified. No stdout writes, klog issues, or fmt.Print calls detected.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The test added is a standard Go unit test, not a Ginkgo e2e test. The custom check targets Ginkgo e2e tests. No IPv4 assumptions or external connectivity requirements found.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from geowa4 and joshbranham May 14, 2026 05:51
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 14, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign geowa4 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/clients/kube/kube.go`:
- Around line 151-158: The watch receive from watcher.ResultChan() can return a
closed channel causing a nil event and panic when accessing event.Object; change
the single-value receive to a two-value receive (e.g., event, ok :=
<-watcher.ResultChan()) and if !ok treat the watch as closed (stop using the
watcher, break/continue to fall back to the existing ticker-based polling) so
checkJobCompletion(job) is only called when event != nil and the type assertion
to *batchv1.Job succeeds; ensure you still compare the returned error to
errJobIncomplete and return other errors as before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3938f4d0-679d-4c55-8350-2f959af8840f

📥 Commits

Reviewing files that changed from the base of the PR and between 6137bdf and 0df5b51.

📒 Files selected for processing (2)
  • pkg/clients/kube/kube.go
  • pkg/clients/kube/kube_test.go

Comment thread pkg/clients/kube/kube.go Outdated
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 14, 2026

Codecov Report

❌ Patch coverage is 57.14286% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.40%. Comparing base (6137bdf) to head (6603cf4).

Files with missing lines Patch % Lines
pkg/clients/kube/kube.go 57.14% 8 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #347      +/-   ##
==========================================
+ Coverage   33.51%   34.40%   +0.89%     
==========================================
  Files          30       30              
  Lines        2256     2270      +14     
==========================================
+ Hits          756      781      +25     
+ Misses       1460     1447      -13     
- Partials       40       42       +2     
Files with missing lines Coverage Δ
pkg/clients/kube/kube.go 55.17% <57.14%> (+23.66%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 14, 2026

@feichashao: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants