ci: retry SonarCloud scan once on transient failure #21604

Merged: taratorio merged 1 commit into main from yperbasis/sonar-scan-retry on Jun 3, 2026

@yperbasis
Member

Problem

Two merge-queue evictions in the last 3 weeks were caused by the SonarCloud scan failing to download the scanner CLI from binaries.sonarsource.com, not by anything in the queued code.

In the merge queue the sonar job fast-cancels the whole CI Gate run on failure, so a CDN blip cancels ~40 min of green sibling jobs and github-merge-queue removes the PR with failed_checks. Per CI-GUIDELINES.md, merge-queue checks must have no false positives, and a transient CDN failure is one.

Fix

Give the scan one spaced retry:

  • the first attempt runs with continue-on-error: true
  • if it failed, wait 90s and run the action again
  • if the retry also fails, the job fails as before — a persistent outage still blocks correctly

Cache-warming-only runs are unaffected: the scan is skipped, so its outcome is skipped and the retry steps skip too. A continue-on-error step reports conclusion: success to the jobs API, so ci-gate's root-cause detection won't flag a run recovered by the retry; a double failure is attributed to SonarCloud scan (retry).
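
The retry described above can be sketched as three workflow steps. This is a minimal illustration, not the merged diff: the step ids, step names, and the action version tag are assumptions, and the condition that skips the scan on cache-warming-only runs is left out.

```yaml
# Sketch only: step ids, names, and the action ref are assumptions.
steps:
  - name: SonarCloud scan
    id: scan
    # A failure here does not fail the job: the step's conclusion becomes
    # "success" while steps.scan.outcome stays "failure".
    continue-on-error: true
    uses: SonarSource/sonarqube-scan-action@v5   # assumed tag; pin as the repo does
    env:
      SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}

  - name: Wait before retrying
    # False when the first attempt succeeded or was skipped.
    if: steps.scan.outcome == 'failure'
    run: sleep 90

  - name: SonarCloud scan (retry)
    # No continue-on-error: a second failure fails the job as before.
    if: steps.scan.outcome == 'failure'
    uses: SonarSource/sonarqube-scan-action@v5   # assumed tag; pin as the repo does
    env:
      SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
```

The conditions read `outcome` rather than `conclusion` because `outcome` is the step result before `continue-on-error` is applied. A skipped first attempt has outcome `skipped`, so both follow-up steps are skipped as well.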

Alternatives considered

  • Pre-seeding the runner tool cache: impossible — the action's tc.find() lookup can never match SonarSource's 4-segment version string (semver.clean("8.1.0.6389") is null), so the action's internal tool-cache path is dead code on any runner.
  • Mirroring the scanner zip via scannerBinariesUrl: removes the CDN dependency entirely but adds hosting and per-upgrade maintenance, and the GPG keyserver dependency remains. Can revisit if 403s persist despite the retry.

No tests: CI workflow YAML change (TDD not applicable); validated with actionlint and make lint.

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the SonarCloud GitHub Actions workflow to reduce merge-queue false positives caused by transient external download/service failures during the Sonar scan.

Changes:

  • Runs the first SonarCloud scan attempt with continue-on-error: true and captures its outcome via a step id.
  • If the first attempt fails, waits 90 seconds and retries the SonarCloud scan once.
  • Keeps existing behavior for persistent failures (the job still fails if the retry fails) and leaves cache-warming-only runs unaffected.

@yperbasis yperbasis requested a review from taratorio June 3, 2026 12:51
@taratorio taratorio enabled auto-merge June 3, 2026 12:53
@taratorio taratorio added this pull request to the merge queue Jun 3, 2026
Merged via the queue into main with commit 830763a Jun 3, 2026
90 checks passed
@taratorio taratorio deleted the yperbasis/sonar-scan-retry branch June 3, 2026 14:53
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request Jun 5, 2026
… jar (erigontech#21632)

## Problem

The sonar job pulls two artifacts from `scanner.sonarcloud.io` on every
scan: a JRE ("JRE provisioning") and the scanner-engine jar. That CDN
intermittently 403s GitHub-runner IPs, and the blocks outlive the spaced
retry added in erigontech#21604:

- [CI Gate merge-queue
run](https://github.com/erigontech/erigon/actions/runs/26994431020/job/79661154435):
scan and retry both hit `HTTP 403 Forbidden` on the JRE tarball, failing
the gate and bouncing erigontech#21562 out of the merge queue — after the coverage
suite had already passed.
- [CI Gate run on this
PR](https://github.com/erigontech/erigon/actions/runs/27003084716/job/79687976307):
with JRE provisioning eliminated, scan and retry both 403'd on the
engine jar instead.

The 403s are IP-scoped blocking, not artifact availability: the exact
jar URL that failed serves 200 from outside the runners (published Jun
1, still on the CDN), and `api.sonarcloud.io` answered fine in the same
failing run — only the `scanner.sonarcloud.io` host blocks, and for
longer than the 90s retry spacing, so a same-runner retry cannot ride it
out.

## Fix

Remove both per-scan dependencies on that host.

**JRE**: skip provisioning and point the scanner at the JDK already
baked into the runner image, via the scanner's documented switches:

- `SONAR_SCANNER_SKIP_JRE_PROVISIONING=true`
- `SONAR_SCANNER_JAVA_EXE_PATH=$JAVA_HOME_21_X64/bin/java`

The ubuntu-24.04 image ships Temurin 21 — the same major version Sonar
provisions (the failing artifact was `OpenJDK21U-jre_...21.0.9`). The
env vars go through `$GITHUB_ENV`, so both the scan and the retry step
inherit them. `cleanup-space` in setup-erigon does not remove the
preinstalled JDKs.
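
A minimal sketch of a step that exports the two variables, assuming the `[ -x ... ]` guard mentioned under the failure modes below. The step name and the extra non-empty check on `JAVA_HOME_21_X64` are assumptions; the variable names and the path come from the description above.

```yaml
# Sketch only: the step name is an assumption.
steps:
  - name: Use the runner's JDK for the Sonar scanner
    shell: bash
    run: |
      # Opt out of JRE provisioning only when the preinstalled JDK exists.
      # Otherwise both variables stay unset and the scanner downloads a JRE.
      if [ -n "${JAVA_HOME_21_X64:-}" ] && [ -x "$JAVA_HOME_21_X64/bin/java" ]; then
        {
          echo "SONAR_SCANNER_SKIP_JRE_PROVISIONING=true"
          echo "SONAR_SCANNER_JAVA_EXE_PATH=$JAVA_HOME_21_X64/bin/java"
        } >> "$GITHUB_ENV"
      fi
```

Writing to `$GITHUB_ENV` makes the variables visible to every later step in the same job, which is why one step can serve both the scan and its retry.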

**Engine jar**: seed it into the actions cache from cache-warming push
runs; PR and merge-queue scans restore it. The seed step queries
`api.sonarcloud.io/analysis/engine` (the host that stays reachable;
returns `{filename, sha256, downloadUrl}`), downloads the jar with
retries, sha256-verifies it, and saves it in the scanner's
content-addressed download-cache layout —
`~/.sonar/cache/<sha256>/<filename>`, cache key
`sonar-scanner-engine|<sha256>`. At scan time the bootstrapper asks the
API for the prescribed sha and, finding it in the local cache, never
contacts the CDN.
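
The seeding flow might look roughly like the steps below. This is a sketch under stated assumptions, not the merged workflow: the step ids and names, the curl retry settings, and the action version are hypothetical, the response fields are taken from the description above rather than from API documentation, and the request is sent without credentials because the description does not say whether any are needed.

```yaml
# Sketch only: ids, names, retry settings, and the action ref are assumptions.
steps:
  - name: Look up the current scanner engine
    id: engine
    continue-on-error: true
    shell: bash
    run: |
      set -euo pipefail
      meta="$(curl -fsS --retry 3 https://api.sonarcloud.io/analysis/engine)"
      # jq -e exits non-zero on a missing field, which fails the step.
      filename="$(jq -er .filename <<<"$meta")"
      sha256="$(jq -er .sha256 <<<"$meta")"
      url="$(jq -er .downloadUrl <<<"$meta")"
      {
        echo "filename=$filename"
        echo "sha256=$sha256"
        echo "url=$url"
      } >> "$GITHUB_OUTPUT"

  - name: Download and verify the engine jar
    id: download
    if: steps.engine.outcome == 'success'
    continue-on-error: true
    shell: bash
    env:
      FILENAME: ${{ steps.engine.outputs.filename }}
      SHA256: ${{ steps.engine.outputs.sha256 }}
      URL: ${{ steps.engine.outputs.url }}
    run: |
      set -euo pipefail
      # Content-addressed layout: ~/.sonar/cache/<sha256>/<filename>
      dir="$HOME/.sonar/cache/$SHA256"
      mkdir -p "$dir"
      curl -fsSL --retry 5 --retry-delay 10 --retry-all-errors \
        -o "$dir/$FILENAME" "$URL"
      # Refuse to cache a jar whose digest differs from the prescribed one.
      echo "$SHA256  $dir/$FILENAME" | sha256sum --check --status

  - name: Save the engine jar to the actions cache
    if: steps.download.outcome == 'success'
    uses: actions/cache/save@v4
    with:
      path: ~/.sonar/cache/${{ steps.engine.outputs.sha256 }}
      key: sonar-scanner-engine|${{ steps.engine.outputs.sha256 }}
```

Keying the cache on the sha256 means a rotated engine produces a new key instead of overwriting the old entry. A scan run would presumably restore the same path and key before the scan step so that the bootstrapper finds the jar locally.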

Verified end-to-end with scanner CLI 8.1.0.6389 against a cache seeded
exactly as the workflow does it: the debug log shows the metadata call,
zero requests to `scanner.sonarcloud.io`, and the engine launched
straight from the cached jar.

Cache-warming runs on every push to main/release, and SonarCloud rotates
engines every few days (12.37 published Jun 1, 12.38 on Jun 5), so seeds
refresh within hours of a rotation.

## Failure modes considered

- Runner image drops Temurin 21: the `[ -x ... ]` guard leaves the env
vars unset and the scanner falls back to downloading, i.e. current
behavior.
- SonarCloud raises its minimum JRE above 21: the scan fails
deterministically with a version error (historically preceded by months
of deprecation warnings in the scan log); fix is bumping the env var to
`JAVA_HOME_25_X64`, which the image already ships.
- Engine cache miss (version rotated since the last base-branch push, or
cache evicted): the scanner falls back to the direct download plus the
existing retry — today's behavior, never worse.
- Seeding fails (the CDN 403s the cache-warming runner too, or the
bootstrap API contract changes): the lookup and download steps are
`continue-on-error`, no cache is saved, and the next push to the branch
retries; scans fall back as above.

The scanner CLI zip and GPG key still come per run from
`binaries.sonarsource.com` and the keyserver; those hosts have not been
the ones failing, and the existing retry covers them.

Note: the first engine seed only materializes once this merges
(cache-warming triggers on push to main), so this PR's own sonar runs
still use the fallback download path.