MGMT-22370: Add exponential backoff to agent image pull by yoavsc0302 · Pull Request #10337 · openshift/assisted-service

yoavsc0302 · 2026-05-17T10:55:37Z

When the agent's container image pull fails, systemd retries every 3 seconds forever (RestartSec=3, StartLimitInterval=0), flooding the registry. In staging, one host made 1836 attempts in 1.5 hours.

Add a wrapper script (agent-pull-image) that handles image pulling with exponential backoff (5s → 10s → 20s → 40s → 80s → 160s → 300s cap). The script skips the pull entirely if the image is already cached locally, and never stops retrying.
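
The actual `agent-pull-image` script is not reproduced in this description. A minimal sketch of the behavior described above might look like the following. It assumes `podman image exists` for the cache check, and the function names `next_delay` and `pull_with_backoff` are hypothetical, not taken from the PR.

```shell
#!/bin/sh
# Hypothetical sketch of the described behavior; not the script from the PR.

# next_delay CURRENT CAP: double the delay, never exceeding CAP seconds.
next_delay() {
    doubled=$(( $1 * 2 ))
    if [ "$doubled" -gt "$2" ]; then
        echo "$2"
    else
        echo "$doubled"
    fi
}

# pull_with_backoff IMAGE: skip the pull if IMAGE is already in local
# storage; otherwise retry `podman pull` forever, waiting 5s, 10s, 20s, ...
# (capped at 300s) between failed attempts.
pull_with_backoff() {
    image=$1
    delay=5
    max_delay=300

    if podman image exists "$image"; then
        echo "Image $image already present, skipping pull"
        return 0
    fi

    until podman pull "$image"; do
        echo "Pull of $image failed, retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$(next_delay "$delay" "$max_delay")
    done
}
```

A script built this way would end with a call such as `pull_with_backoff "$1"`, so that the systemd unit can pass the image reference as the first argument. That argument convention is an assumption.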

This reduces registry load from ~1200 to ~48 attempts/hour per host (~96% reduction).
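
The figures above can be cross-checked with integer shell arithmetic. The ~48 attempts/hour value is taken from the description as given rather than derived here, and the variable names are illustrative.

```shell
#!/bin/sh
# One retry every 3 seconds (RestartSec=3) gives the nominal hourly rate.
before_per_hour=$(( 3600 / 3 ))

# Observed in staging: 1836 attempts in 1.5 hours
# (numerator and denominator scaled by 10 to stay in integers).
observed_per_hour=$(( 1836 * 10 / 15 ))

# Figure quoted in the PR description for the backoff script.
after_per_hour=48

# Relative reduction in percent.
reduction_pct=$(( (before_per_hour - after_per_hour) * 100 / before_per_hour ))

echo "nominal before:  ${before_per_hour}/hour"
echo "observed before: ${observed_per_hour}/hour"
echo "reduction:       ${reduction_pct}%"
```

The nominal rate ignores the time each failed attempt itself takes, so it is only a rough reference point for the observed staging rate.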

List all the issues related to this PR

What environments does this code impact?

Automation (CI, tools, etc)
Cloud
Operator Managed Deployments
None

How was this code tested?

assisted-test-infra environment
dev-scripts environment
Reviewer's test appreciated
Waiting for CI to do a full test run
Manual (Elaborate on how it was tested)
No tests needed

Checklist

Title and description added to both, commit and PR.
Relevant issues have been associated (see CONTRIBUTING guide)
This change does not require a documentation update (docstring, docs, README, etc)
Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

Are the title and description (in both PR and commit) meaningful and clear?
Is there a bug required (and linked) for this change?
Should this PR be backported?

Summary by CodeRabbit

Release Notes

New Features
- Implemented automatic retry logic with exponential backoff for container image pulls to improve handling of transient network failures.
Bug Fixes
- Agent service now explicitly pulls required container images before startup to ensure availability and prevent initialization failures.

openshift-ci-robot · 2026-05-17T10:55:41Z

coderabbitai · 2026-05-17T10:55:56Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2126dd2b-c363-4f91-8739-5572ebbdc7cf

📥 Commits

Reviewing files that changed from the base of the PR and between 963e8a7 and 25ca5d6.

📒 Files selected for processing (3)

internal/ignition/discovery.go
internal/ignition/templates/agent.service
internal/ignition/templates/discovery.ign

Walkthrough

This PR adds a container image pull mechanism with retry logic to the agent discovery and startup process. A new shell script with exponential backoff is defined, passed through the ignition template system, provisioned to disk, and invoked as a systemd unit pre-start step.

Changes

Agent image pull and service startup

| Layer / File(s) | Summary |
|---|---|
| Image pull retry script definition and parameter setup `internal/ignition/discovery.go` | Introduces the `agentPullImage` shell script constant that checks whether a container image exists and retries `podman pull` with exponential backoff. The script is added to the discovery ignition parameters as `AGENT_PULL_IMAGE` and URL-escaped for template consumption. |
| Script provisioning and agent service integration `internal/ignition/templates/discovery.ign`, `internal/ignition/templates/agent.service` | Writes the `agent-pull-image` script to disk during Ignition provisioning and updates the agent systemd unit to call the script as an `ExecStartPre` step before the existing podman run/copy operation. |

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Ipv6 And Disconnected Network Test Compatibility | ⚠️ Warning | Tests contain hardcoded IPv4 (10.10.1.1, 192.168.126.11), external registries (quay.io, registry.redhat.com), IPv4-only CIDR configs incompatible with IPv6-only disconnected clusters. | Detect IP family dynamically. Replace IPv4 addresses with IPv6 equivalents. Add [Skipped:Disconnected] for external registry tests. Validate with IPv6 CI job. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding exponential backoff to agent image pull operations, which directly matches the core objective of the changeset. |
| Description check | ✅ Passed | The description includes a clear problem statement, explains the solution with specific backoff values, documents the performance impact (~96% reduction), marks the appropriate checklist items, and references the related Jira issue (MGMT-22370). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | Test files use only static, descriptive test names. Zero instances of fmt.Sprintf or string concatenation in test titles. Complies with stable and deterministic naming requirements. |
| Test Structure And Quality | ✅ Passed | PR contains no Ginkgo test code modifications. Changes are limited to production files (ignition discovery configuration, systemd templates). Custom check for test quality is not applicable. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added. Changes are shell script constants and systemd template modifications. Custom check applies only when tests are added, so not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. Changes are limited to Ignition configuration files and shell script constants for container image pulling. The SNO compatibility check does not apply. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Changes are limited to Ignition templates and systemd configuration for bare metal provisioning. No Kubernetes pod scheduling constraints introduced. |
| Ote Binary Stdout Contract | ✅ Passed | Check not applicable. PR modifies assisted-service ignition provisioning code, not an OTE binary. No process-level stdout writes. Logging uses logrus to stderr. |

openshift-ci · 2026-05-17T10:56:07Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yoavsc0302

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [yoavsc0302]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov · 2026-05-17T11:26:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.32%. Comparing base (963e8a7) to head (25ca5d6).

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10337      +/-   ##
==========================================
- Coverage   44.32%   44.32%   -0.01%     
==========================================
  Files         417      417              
  Lines       72762    72763       +1     
==========================================
- Hits        32253    32252       -1     
- Misses      37589    37591       +2     
  Partials     2920     2920

| Files with missing lines | Coverage Δ |
|---|---|
| internal/ignition/discovery.go | `75.09% <100.00%> (+0.09%)` ⬆️ |

... and 3 files with indirect coverage changes

openshift-ci · 2026-05-17T12:57:42Z

@yoavsc0302: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/edge-e2e-metal-assisted-4-22 | `25ca5d6` | link | true | `/test edge-e2e-metal-assisted-4-22` |
| ci/prow/edge-e2e-metal-assisted-5-0 | `25ca5d6` | link | true | `/test edge-e2e-metal-assisted-5-0` |
| ci/prow/e2e-agent-compact-ipv4-iso-no-registry | `25ca5d6` | link | false | `/test e2e-agent-compact-ipv4-iso-no-registry` |
| ci/prow/e2e-agent-compact-ipv4 | `25ca5d6` | link | true | `/test e2e-agent-compact-ipv4` |
| ci/prow/edge-e2e-ai-operator-ztp | `25ca5d6` | link | true | `/test edge-e2e-ai-operator-ztp` |

pastequo · 2026-05-18T09:35:07Z

 podman images | grep $IMAGE || podman rmi --force $1 || true
 `

+const agentPullImage = `#!/usr/bin/sh


Would it be a problem if the global timeout is reached while we are pulling the image?
Could it corrupt the local container storage?

Maybe the right approach would be to move the podman pull + podman cp into the ExecStart step.

It would be nice to have another opinion on this.

yoavsc0302 · 2026-05-19

As for whether reaching the global timeout mid-pull could corrupt the local container storage: let's assume for now that it could. If I understand correctly, this concern already exists in today's upstream behavior. This PR does not introduce it, since upstream already pulls the image inside the 10-minute timeout window (via the podman run in ExecStartPre).

In both upstream and this PR, when the 10-minute timeout occurs, systemd restarts the full sequence from ExecStartPre 1 (https://github.com/openshift/assisted-service/blob/master/internal/ignition/templates/agent.service#L14), which runs the BZ1964591 cleanup script (https://github.com/openshift/assisted-service/blob/master/internal/ignition/discovery.go#L95).
Isn't that exactly the purpose of this script: to clean up the image in case it got corrupted, so that the next pull starts fresh?
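
To get a feel for how the backoff schedule interacts with the 10-minute timeout discussed above, the sketch below counts how many pull attempts would start within one timeout window. It assumes that each failed pull returns immediately and that the delay starts again at 5s after a restart. `attempts_in_window` is a hypothetical helper, not part of the PR.

```shell
#!/bin/sh
# attempts_in_window WINDOW INITIAL_DELAY CAP
# Counts attempts that start before WINDOW seconds have elapsed, with the
# delay doubling after each failure up to CAP. Failed pulls are assumed to
# take no time, which overstates the real attempt count.
attempts_in_window() {
    window=$1
    delay=$2
    cap=$3
    elapsed=0
    count=0
    while [ "$elapsed" -lt "$window" ]; do
        count=$(( count + 1 ))
        elapsed=$(( elapsed + delay ))
        delay=$(( delay * 2 ))
        if [ "$delay" -gt "$cap" ]; then
            delay=$cap
        fi
    done
    echo "$count"
}

# 600s window, 5s initial delay, 300s cap: attempts start at
# 0, 5, 15, 35, 75, 155 and 315 seconds.
attempts_in_window 600 5 300   # prints 7
```

Under these assumptions, the last sleep in the window (300s, starting at 315s) would still be in progress when the timeout fires.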

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 17, 2026

openshift-ci Bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 17, 2026

openshift-ci Bot requested review from gamli75 and linoyaslan May 17, 2026 10:56

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 17, 2026

pastequo reviewed May 18, 2026

yoavsc0302 wants to merge 1 commit into openshift:master from yoavsc0302:MGMT-22370/image-pull-backoff

yoavsc0302 May 19, 2026

3 participants
