MGMT-22370: Add exponential backoff to agent image pull #10337
yoavsc0302 wants to merge 1 commit into
When the agent's container image pull fails, systemd retries every 3 seconds forever (RestartSec=3, StartLimitInterval=0), flooding the registry. In staging, one host made 1836 attempts in 1.5 hours.
Add a wrapper script (agent-pull-image) that handles image pulling with exponential backoff (5s → 10s → 20s → 40s → 80s → 160s → 300s cap). The script skips the pull entirely if the image is already cached locally, and never stops retrying.
This reduces registry load from ~1200 to ~48 attempts/hour per host (~96% reduction).
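The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the agent-pull-image script from this PR: the names `next_delay`, `pull_with_backoff`, `INITIAL_DELAY` and `MAX_DELAY` are hypothetical, and the real script may check the cache and log differently. It assumes `podman image exists` is an acceptable way to detect a locally cached image.

```shell
#!/bin/sh
# Hypothetical sketch of a pull wrapper with capped exponential backoff.
# Not the script from this PR; names and structure are assumptions.

INITIAL_DELAY=5   # first wait, in seconds
MAX_DELAY=300     # cap on the wait, in seconds

# next_delay CURRENT: print the delay that follows CURRENT (doubled, capped).
next_delay() {
    doubled=$(( $1 * 2 ))
    if [ "$doubled" -gt "$MAX_DELAY" ]; then
        echo "$MAX_DELAY"
    else
        echo "$doubled"
    fi
}

# pull_with_backoff IMAGE: return immediately if IMAGE is already cached,
# otherwise retry "podman pull" until it succeeds, never giving up.
pull_with_backoff() {
    image=$1
    delay=$INITIAL_DELAY
    if podman image exists "$image"; then
        return 0
    fi
    until podman pull "$image"; do
        echo "pull of $image failed, retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$(next_delay "$delay")
    done
}

# Run only when an image argument is given, so the file can also be sourced.
if [ -n "${1:-}" ]; then
    pull_with_backoff "$1"
fi
```

Doubling from 5 seconds and capping at 300 seconds reproduces the 5s → 10s → 20s → 40s → 80s → 160s → 300s schedule quoted in the description. The observed attempts per hour will also depend on how long each pull takes and on any timeout that restarts the unit.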
@yoavsc0302: This pull request references MGMT-22370, which is a valid Jira issue.
Warning: The referenced Jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.
Walkthrough
This PR adds a container image pull mechanism with retry logic to the agent discovery and startup process. A new shell script with exponential backoff is defined, passed through the ignition template system, provisioned to disk, and invoked as a systemd unit pre-start step.
Changes
Agent image pull and service startup
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: yoavsc0302
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #10337 +/- ##
==========================================
- Coverage 44.32% 44.32% -0.01%
==========================================
Files 417 417
Lines 72762 72763 +1
==========================================
- Hits 32253 32252 -1
- Misses 37589 37591 +2
Partials 2920 2920
@yoavsc0302: The following tests failed, say
podman images | grep $IMAGE || podman rmi --force $1 || true
`

const agentPullImage = `#!/usr/bin/sh
Would it be a problem if the global timeout is reached while we are pulling the image?
Could it corrupt the local container storage?
Maybe the right approach would be to move the podman pull + podman cp into the ExecStart step.
It would be nice to have another opinion on this.
Regarding whether the global timeout reached while pulling the image could corrupt the local container storage, let's assume for now it could. This concern exists in today's upstream behavior as well iiuc. This PR hasn't introduced it, since the pull already happens during the 10-minute timeout window in upstream (via the podman run in ExecStartPre).
Now, both in upstream and in this PR, when the 10-minute timeout occurs, systemd restarts the full sequence from ExecStartPre 1 (https://github.com/openshift/assisted-service/blob/master/internal/ignition/templates/agent.service#L14), which runs the BZ1964591 cleanup script (https://github.com/openshift/assisted-service/blob/master/internal/ignition/discovery.go#L95).
Isn't this the exact purpose of this script? To clean up the image in case it got corrupted, so the next pull starts fresh?