Skip to content

cl: cap req/resp response body size#21562

Merged
taratorio merged 27 commits into
mainfrom
yperbasis/caplin-reqresp-oom-limit
Jun 5, 2026
Merged

cl: cap req/resp response body size#21562
taratorio merged 27 commits into
mainfrom
yperbasis/caplin-reqresp-oom-limit

Conversation

@yperbasis
Copy link
Copy Markdown
Member

@yperbasis yperbasis commented Jun 1, 2026

What

Caplin's req/resp handler (cl/sentinel/httpreqresp) buffered a peer's entire response into memory via an uncapped io.Copy, bounded only by a 10s read deadline. A peer that floods the response stream could drive the victim's heap to many GiB. The cheapest trigger is unauthenticated and automatic: on every inbound connection the victim sends a Status request back to the peer (discovery.onConnectionhandshake.ValidatePeer), built with http.NewRequest (background context), so the handler runs the full copy with no caller-side timeout.

Fix

Stream the response through a per-topic size cap instead of buffering it all up front (the reply never needs Content-Length): a compliant response passes through untouched, and one that exceeds its cap surfaces an explicit ErrResponseTooLarge read error rather than being silently truncated and accepted. The cap is an erroring maxBytesReader (net/http.MaxBytesReader without the ResponseWriter coupling), not io.LimitReader, so a valid prefix followed by junk is rejected, not accepted. Sizes per topic:

  • Single-object protocols (status, ping, metadata, goodbye, light-client singles) → 1 MiB, tens-of-× over the largest such object. Fail-closed: any protocol not in the multi-chunk registry gets this cap.
  • Multi-chunk protocols (by_range / by_root / by_head: blocks, blobs, data columns, exec payloads) → a per-request byte budget = chunkCount × max-on-wire-size-of-one-chunk, computed by the typed cl/rpc builders (which know both the requested item count and the per-protocol chunk size) and plumbed through a new max_response_bytes field on RequestData / RequestDataWithPeer and the Reqresp-Max-Response-Bytes header. The handler clamps that budget to an absolute on-wire MAX_REQUEST_BLOCKS × MAX_CHUNK_SIZE ceiling, so a miscomputed or overflowed budget can't make it stream more than a spec-maximal response. A multi-chunk request that arrives without a budget falls back to the tight single-object cap (not the ceiling), so a forgotten budget fails loud with ErrResponseTooLarge rather than risking an OOM.

Per-chunk sizing: blocks and execution payloads can fill a chunk; a blob sidecar is a fixed ~129 KiB; a data-column sidecar scales with the fork's max blobs (clparams.MaxBlobsPerBlockUpperBound, which covers the BPO schedule). Because the cap is on on-wire (snappy-framed) bytes and blob/column payloads are incompressible, each budget is wrapped in snappy.MaxEncodedLen + framing (saturating to MaxUint64 rather than overflowing) so it never truncates a compliant response.

Protocol classification and the wire-size helper live in cl/sentinel/communication (AllProtocols, MaxWireResponseBytes) as a single source of truth shared by the cl/rpc budget and the httpreqresp backstop. A unit test cross-checks the registry against the by_range / _by_root / _by_head naming convention, so a mis-classified protocol fails the build rather than silently getting the wrong cap.

Because the response is streamed, an over-cap flood is no longer a 413 (by the time the overflow is seen, the 200 status and the peer's response code have already been committed) but an ErrResponseTooLarge from the body read. handshake.ValidatePeer already fails closed on any body read/decode error; service.requestPeer drops the peer on it, preserving the peer removal the 413 path used to give.

The Do bridge is also hardened: a handler panic is recovered and returned as a 500 instead of crashing the process, and a streamed body is closed even if the request context is cancelled before the handler publishes it, so the stream isn't leaked.

The read deadline (and sync timing) is unchanged — the budget rides its own header. A single connection can no longer OOM the node: the worst-case 128-peer Status flood drops from unbounded to ~256 MiB, and a blob by_range flood from multiple GiB to tens-to-hundreds of MiB.

Test

Flood/cap tests drive the real NewRequestHandler receive path over two libp2p hosts:

  • TestResponseBodyRejectedOnFlood — a compliant Status response streams through in full (200); a 40 MiB Status flood is bounded at the cap and surfaces ErrResponseTooLarge, not truncated and accepted.
  • TestMultiChunkResponseCappedWithoutBudget — an unbudgeted 5 MiB by_range response is bounded at the single-object cap (not the ceiling) and surfaces ErrResponseTooLarge.
  • TestMultiChunkResponseRejectedOverByteBudget — a response within the 8 MiB budget passes through in full (200); a 40 MiB flood over it surfaces ErrResponseTooLarge.
  • TestMaxBytesReader — the capping reader streams bytes up to its limit unchanged, then returns ErrResponseTooLarge (rather than truncating at EOF) for anything over it.
  • TestDoCopiesHandlerWriteBufferDo copies the handler's write buffer rather than retaining it, so a caller that reuses its buffer (e.g. http.Error via fmt) can't corrupt the response.
  • TestDoClosesStreamingBodyOnContextCancel — if Do returns early on a cancelled request context, the streaming body the handler publishes afterwards is still closed, so the libp2p stream isn't leaked.
  • TestDoRecoversHandlerPanic — a handler panic is recovered and surfaced as a 500 rather than crashing the process.
  • TestMaxResponseBodySize — iterates communication.AllProtocols: single-object protocols and unbudgeted multi-chunk requests get the tight cap; a budget below the ceiling is honored, above it is clamped.
  • communication.TestResponseShapeConvention — cross-checks every protocol's multi-chunk flag against the by_range / _by_root / _by_head naming convention, so a new mis-classified protocol fails the build.
  • communication.TestMaxWireResponseBytes — the on-wire budget saturates to MaxUint64 on overflow instead of wrapping to a smaller (truncating) value.
  • communication.TestMaxWireResponseBytesBoundsStreamEncoding — the budget upper-bounds the real snappy-stream framed size for items up to MAX_CHUNK_SIZE.
  • communication.TestIsMultiChunkProtocol — known protocols classify correctly; unknown ones fail closed to single-object.
  • clparams.TestMaxBlobsPerBlockUpperBound — the fork/BPO max-blobs helper.

Fixes ethereum-bounty/erigon#13

Caplin's req/resp handler buffered a peer's entire response into an
in-memory httptest.ResponseRecorder via an uncapped io.Copy, bounded
only by a read deadline (10s, since no caller sets REQRESP-EXPECTED-CHUNKS).
A peer can flood the response stream and drive the victim's heap to many
GiB. The cheapest trigger is the Status handshake the victim sends back
to every inbound connection (discovery.onConnection -> ValidatePeer), so
it is remotely and unauthenticatedly reachable.

Wrap the response copy in an io.LimitReader sized per topic: single-chunk
protocols (status, ping, metadata, goodbye, light-client singles) are
capped at one MAX_CHUNK_SIZE (15 MiB), and by_range / by_root protocols
at MAX_REQUEST_BLOCKS chunks. Legitimate responses sit far below these
ceilings, so a flooding peer is bounded instead of unbounded; the Status
handshake vector in particular is tightly capped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis requested a review from domiwei as a code owner June 1, 2026 13:35
@yperbasis yperbasis added this to the 3.5.0 milestone Jun 1, 2026
@yperbasis yperbasis added the Caplin Caplin: Consensus Layer, Beacon API label Jun 1, 2026
@yperbasis yperbasis requested a review from Copilot June 1, 2026 13:45
The previous commit capped single-chunk responses at one MAX_CHUNK_SIZE
(15 MiB). Single-object protocols (status, ping, metadata, goodbye,
light-client singles) are at most tens of KiB, so 15 MiB left a large
amplification: a 128-peer Status-handshake flood could still force
~3.8 GiB of transient heap. Cap single-object responses at 1 MiB (tens of
x margin over the largest such object), cutting the residual to ~256 MiB.
The by_range / by_root ceilings are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens Caplin’s cl/sentinel/httpreqresp req/resp HTTP bridge against unbounded memory growth by capping how much response data is buffered/read from a libp2p stream (especially impactful for the automatic Status handshake path).

Changes:

  • Add per-topic maximum response body sizing (maxResponseBodySize) with a tight cap for single-chunk protocols and a larger cap for _by_range/_by_root multi-chunk protocols.
  • Apply the cap by wrapping the stream reader with io.LimitReader in the response io.Copy.
  • Add tests validating the topic→ceiling mapping and that a flooding peer cannot cause buffering beyond maxChunkSize for a single-chunk (Status) response.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
cl/sentinel/httpreqresp/server.go Introduces max response sizing logic and enforces it via io.LimitReader during response copy.
cl/sentinel/httpreqresp/server_test.go Adds tests for the sizing logic and a libp2p-based flood regression test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

beacon_blocks_by_head is multi-chunk (up to MAX_REQUEST_BLOCKS_DENEB
blocks), but its topic contains neither _by_range nor _by_root, so
maxResponseBodySize classified it as a single-object response and would
cap it at 1 MiB. Match _by_head alongside _by_range / _by_root so the
ceiling reflects the protocol's real shape.

Also fix a stale flood-test comment that still referenced the
pre-tightening maxChunkSize cap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread cl/sentinel/httpreqresp/server.go Outdated
Address review follow-ups on the response-body size cap:

- Reuse clparams.MaxChunkSize (spec MAX_CHUNK_SIZE) instead of a duplicated
  15 MiB literal so the two can't drift.
- Make TestMaxResponseBodySize exhaustive over the protocol constants in
  communication/topics.go. The cap classifies by substring, so a new
  multi-chunk protocol whose topic lacks _by_range/_by_root/_by_head would be
  silently capped at the single-object ceiling and truncated; this test now
  fails if such a protocol is added without updating maxResponseBodySize.
- Add TestMultiChunkResponseNotCappedAtSingleObject: a legitimate by_range
  response larger than the single-object cap must pass through the capped
  receive path uncapped. The handlers tests only exercise the serving side
  via direct streams, not httpreqresp's NewRequestHandler.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment thread cl/sentinel/httpreqresp/server.go Outdated
Comment thread cl/sentinel/httpreqresp/server_test.go Outdated
Comment thread cl/sentinel/httpreqresp/server_test.go
Comment thread cl/sentinel/httpreqresp/server_test.go Outdated
@yperbasis yperbasis marked this pull request as draft June 2, 2026 10:02
…ount

The multi-chunk response cap was a flat MAX_REQUEST_BLOCKS x MAX_CHUNK_SIZE
(~15 GiB) for every by_range/by_root/by_head topic, regardless of how many
items the request actually asked for. Plumb the caller's expected response
chunk count through RequestData (new max_response_chunks field) and the
Reqresp-Max-Response-Chunks header so the handler caps at
expectedChunks x MAX_CHUNK_SIZE, clamped to the old ceiling. A request for
N blocks is now bounded at N x MAX_CHUNK_SIZE instead of the full 15 GiB.

The cap never truncates a compliant response: the responder returns at most
the requested number of chunks and each chunk is at most MAX_CHUNK_SIZE, so
the bound is an upper bound on the real response. An absent or zero count
falls back to the old ceiling, so any path that doesn't set it (e.g.
column-by-root via RequestDataWithPeer) is unchanged and safe. The read
deadline is untouched (the count rides a separate header, not
REQRESP-EXPECTED-CHUNKS).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis changed the title cl/sentinel/httpreqresp: cap req/resp response body size cl/sentinel, cl/rpc: cap req/resp response body size Jun 2, 2026
@yperbasis yperbasis changed the title cl/sentinel, cl/rpc: cap req/resp response body size cl: cap req/resp response body size Jun 2, 2026
yperbasis and others added 2 commits June 2, 2026 12:22
- fetchPeerResponse now bounds victim.Connect and the request with a 30s
  timeout context, so a handler regression fails fast instead of hanging
  until the Go test timeout.
- Write exactly payloadSize bytes (cap the final write to the remainder)
  so the helper is correct for sizes that aren't a multiple of 1 MiB.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The prior commit bounded multi-chunk responses by requested chunk count x
MAX_CHUNK_SIZE. That stays tight for blocks (which can legitimately approach
MAX_CHUNK_SIZE) but is ~100x too loose for blob/column sidecars — a blob
sidecar is a fixed ~129 KiB, far below the 15 MiB per-chunk max — so a blob
by_range request still permitted a multi-GB cap.

Replace the plumbed chunk count with a per-protocol byte budget:
- blocks / execution payloads: count x MAX_CHUNK_SIZE (items can be large)
- blob sidecars:   numSidecars x fixed BlobSidecar SSZ size
- data-column sidecars: numSidecars x (MaxBlobsPerBlockUpperBound cells + slack),
  via a new clparams helper so the bound tracks the fork/BPO blob limit

Each budget is wrapped in snappy.MaxEncodedLen + framing: the cap is on the
on-wire (snappy-framed) bytes and blob/column payloads are incompressible, so a
raw-size budget would truncate. RequestData.max_response_chunks becomes
max_response_bytes; the field is also added to RequestDataWithPeer so
column-by-root is bounded. A blob by_range flood now caps at tens-to-hundreds
of MiB instead of multiple GiB.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • node/gointerfaces/sentinelproto/sentinel.pb.go: Language not supported

Comment thread cl/sentinel/service/service.go
Comment thread cl/sentinel/httpreqresp/server.go Outdated
…eiling

maxResponseBodySize compared the caller's on-wire byte budget against a ceiling
computed from MAX_CHUNK_SIZE (the uncompressed per-chunk limit), so a large but
valid multi-chunk budget could be clamped below its intended on-wire value and
truncate the response. The budget is computed by cl/rpc with snappy-aware sizing,
is not peer-controllable and is already bounded, so use it as-is; the
MAX_REQUEST_BLOCKS × MAX_CHUNK_SIZE ceiling now applies only as the fallback when
no budget is provided.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • node/gointerfaces/sentinelproto/sentinel.pb.go: Language not supported

Comment thread cl/sentinel/httpreqresp/server.go
Copy link
Copy Markdown
Member

domiwei commented Jun 4, 2026

Thanks for working on this. The Status/single-object flood case looks much safer now: the handler caps the read, returns 413 on overflow, and avoids setting REQRESP-RESPONSE-CODE on rejected responses.

I think the main remaining issue is that multi-chunk responses are still fully buffered. NewRequestHandler enforces the cap with:

respBody, err := io.ReadAll(io.LimitReader(stream, limit+1))

For multi-chunk protocols, limit is derived from the request’s wire budget. That is useful as a stream guard, but while this path still materializes the whole body, it also becomes the heap allocation bound. For common block-range requests, a malicious peer can still force hundreds of MiB to over 1 GiB of buffering, and then service.requestPeer reads the response body again into ResponseData.Data.

I’d suggest changing this path to streaming decode instead of full-response buffering:

  • keep max_response_bytes as a byte-budget guard;
  • enforce it with a counting/limited reader while reading from the peer stream;
  • decode req/resp chunks incrementally from an io.Reader;
  • reject as soon as the byte budget, chunk count, or per-chunk size is exceeded;
  • avoid building a second full copy in service.requestPeer.

That would make max_response_bytes a wire-budget limit rather than a potential heap allocation size. The memory peak would be closer to one chunk plus the decoded object, which is the right shape for multi-chunk req/resp.

Two smaller points:

  1. service.requestPeer should probably create the internal request with the caller context:
httpReq, err := http.NewRequestWithContext(ctx, "GET", "http://service.internal/", bytes.NewBuffer(req.Data))

cl/rpc wraps these calls in a 2s timeout, but http.NewRequest gives httpreqresp.Do a background context, so cancellation does not propagate into the bridge.

  1. For non-2xx responses, the handler currently sets CONTENT-ENCODING: snappy/stream before it may call http.Error, whose body is plain text. Since service.requestPeer reads error bodies as text this works today, but it would be cleaner to set the snappy response headers only on the success path.

yperbasis added 2 commits June 4, 2026 10:54
- Document the Reqresp-Max-Response-Bytes and REQRESP-FALLBACK-BODY request
  headers in Do's header list, which had drifted from the code.
- Replace the dead non-captureWriter io.Copy fallback with an explicit 500.
  Do always supplies a captureWriter and the streaming hand-off can't carry a
  body through an arbitrary ResponseWriter, so a different writer is a wiring
  error, not a degraded-but-working path; the old fallback also called
  http.Error after the stream had started, which can't change a committed status.
- Add TestDoClosesStreamingBodyOnContextCancel: when Do returns early on a
  cancelled request context, the streaming body the handler publishes afterwards
  owns a live stream, so Do must still drain and close it. Covers the otherwise
  unexercised cancel branch.
…cess only

Two review follow-ups:

- Do now recovers a panic from handler.ServeHTTP in the goroutine and publishes
  a 500, so a handler panic can't crash the process or leave Do blocked forever
  on a background-context request. The handler's deferred stream.Close still runs
  during the unwind, so no stream is leaked. Covered by TestDoRecoversHandlerPanic.

- NewRequestHandler checks the captureWriter precondition right after reading the
  response code, so the CONTENT-ENCODING: snappy/stream and the other response
  headers are set only on the success path, not ahead of an error response.
@yperbasis
Copy link
Copy Markdown
Member Author

Thanks @domiwei — I think you caught the revision just before the streaming refactor I pushed earlier today (89083db): the handler no longer does io.ReadAll(io.LimitReader(stream, limit+1)). It now hands the libp2p stream back as the response body wrapped in a capping reader (maxBytesReader), so the handler itself doesn't buffer the response at all — the cap is enforced incrementally as the body is read, and an over-cap response surfaces as a read error (ErrResponseTooLarge) rather than a 413. (That's the same point @anacrolix raised.)

On the remaining points:

  1. Full materialization — you're right that the response is still materialized once, but that's now in service.requestPeer's io.ReadAll(resp.Body) (into ResponseData.Data), not in the handler. It's bounded by the per-request byte budget, so a peer can't exceed what we asked for, but for a large block-range request that bound is still a big single allocation. Dropping the peak to ~one chunk + the decoded object means decoding chunks incrementally across the sentinel→cl/rpc boundary — which is a single gRPC bytes response (ResponseData.Data) today, so it would need that contract to become streaming. Worthwhile, but well beyond this OOM fix (which takes the buffering from unbounded to budget-bounded). Filed as a follow-up: Caplin req/resp: stream multi-chunk responses instead of materializing the full body in requestPeer #21616.

  2. http.NewRequestWithContext — this came up in an earlier round and was left intentionally: cl/rpc wraps these calls in a 2s timeout, so propagating the context into Do would cancel in-flight req/resp at 2s instead of at the stream read deadline, changing sync timing.

  3. snappy headers on error responses — fixed in 3d0652f: the response headers (incl. CONTENT-ENCODING: snappy/stream) are now set only on the success path. Post-streaming there's also no longer an http.Error/413 path after them.

No behavior change:
- streamBody embeds io.Reader/io.Closer instead of hand-written Read/Close
  forwarders.
- Do publishes the response from a single site in the deferred func; the
  panic-recovery path swaps the writer to a 500 rather than duplicating the
  channel send.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • node/gointerfaces/sentinelproto/sentinel.pb.go: Language not supported

Comment thread cl/sentinel/httpreqresp/server.go Outdated
The ctx-cancel branch only closes the late-published response body (which
releases the libp2p stream); it does not drain it. Match the comment.
@yperbasis yperbasis marked this pull request as ready for review June 4, 2026 09:36
yperbasis and others added 2 commits June 4, 2026 12:39
The flood/budget tests invoked NewRequestHandler directly, but production
reaches it through a chi.Router (sentinel.go). Route the e2e helper through
chi and add a focused guard so that middleware wrapping the ResponseWriter —
which would 500 every req/resp via the *captureWriter type assertion — fails
the suite instead of silently breaking peer validation.
Copy link
Copy Markdown
Member

@domiwei domiwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now lgtm

@anacrolix
Copy link
Copy Markdown
Contributor

I can't tell if the 413 is returned if the response body exceeds the limit but response header was already returned. There's 2 ways this can be detected by http: the response body doesn't match the advertised length, or the response body fails some other check specific to the content. The first is preferable, but requires you to know the response body length in advance, which can require two passes, or buffering. There is a helper to buffer with a max size in memory before switching to writing to a temporary file.

It's probably overkill at this point, I'm satisfied with not buffering unnecessarily if people are satisfied the size limits are respected.

@anacrolix anacrolix added this pull request to the merge queue Jun 5, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 5, 2026
@anacrolix
Copy link
Copy Markdown
Contributor

False negative in MQ Sonar 😭

@taratorio taratorio added this pull request to the merge queue Jun 5, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 5, 2026
@taratorio
Copy link
Copy Markdown
Member

False negative in MQ Sonar 😭

@anacrolix fix - #21632

@taratorio taratorio added this pull request to the merge queue Jun 5, 2026
Merged via the queue into main with commit 2fdc4ac Jun 5, 2026
89 checks passed
@taratorio taratorio deleted the yperbasis/caplin-reqresp-oom-limit branch June 5, 2026 05:37
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request Jun 5, 2026
… jar (erigontech#21632)

## Problem

The sonar job pulls two artifacts from `scanner.sonarcloud.io` on every
scan: a JRE ("JRE provisioning") and the scanner-engine jar. That CDN
intermittently 403s GitHub-runner IPs, and the blocks outlive the spaced
retry added in erigontech#21604:

- [CI Gate merge-queue
run](https://github.com/erigontech/erigon/actions/runs/26994431020/job/79661154435):
scan and retry both hit `HTTP 403 Forbidden` on the JRE tarball, failing
the gate and bouncing erigontech#21562 out of the merge queue — after the coverage
suite had already passed.
- [CI Gate run on this
PR](https://github.com/erigontech/erigon/actions/runs/27003084716/job/79687976307):
with JRE provisioning eliminated, scan and retry both 403'd on the
engine jar instead.

The 403s are IP-scoped blocking, not artifact availability: the exact
jar URL that failed serves 200 from outside the runners (published Jun
1, still on the CDN), and `api.sonarcloud.io` answered fine in the same
failing run — only the `scanner.sonarcloud.io` host blocks, and for
longer than the 90s retry spacing, so a same-runner retry cannot ride it
out.

## Fix

Remove both per-scan dependencies on that host.

**JRE**: skip provisioning and point the scanner at the JDK already
baked into the runner image, via the scanner's documented switches:

- `SONAR_SCANNER_SKIP_JRE_PROVISIONING=true`
- `SONAR_SCANNER_JAVA_EXE_PATH=$JAVA_HOME_21_X64/bin/java`

The ubuntu-24.04 image ships Temurin 21 — the same major version Sonar
provisions (the failing artifact was `OpenJDK21U-jre_...21.0.9`). The
env vars go through `$GITHUB_ENV`, so both the scan and the retry step
inherit them. `cleanup-space` in setup-erigon does not remove the
preinstalled JDKs.

**Engine jar**: seed it into the actions cache from cache-warming push
runs; PR and merge-queue scans restore it. The seed step queries
`api.sonarcloud.io/analysis/engine` (the host that stays reachable;
returns `{filename, sha256, downloadUrl}`), downloads the jar with
retries, sha256-verifies it, and saves it in the scanner's
content-addressed download-cache layout —
`~/.sonar/cache/<sha256>/<filename>`, cache key
`sonar-scanner-engine|<sha256>`. At scan time the bootstrapper asks the
API for the prescribed sha and, finding it in the local cache, never
contacts the CDN.

Verified end-to-end with scanner CLI 8.1.0.6389 against a cache seeded
exactly as the workflow does it: the debug log shows the metadata call,
zero requests to `scanner.sonarcloud.io`, and the engine launched
straight from the cached jar.

Cache-warming runs on every push to main/release, and SonarCloud rotates
engines every few days (12.37 published Jun 1, 12.38 on Jun 5), so seeds
refresh within hours of a rotation.

## Failure modes considered

- Runner image drops Temurin 21: the `[ -x ... ]` guard leaves the env
vars unset and the scanner falls back to downloading, i.e. current
behavior.
- SonarCloud raises its minimum JRE above 21: the scan fails
deterministically with a version error (historically preceded by months
of deprecation warnings in the scan log); fix is bumping the env var to
`JAVA_HOME_25_X64`, which the image already ships.
- Engine cache miss (version rotated since the last base-branch push, or
cache evicted): the scanner falls back to the direct download plus the
existing retry — today's behavior, never worse.
- Seeding fails (the CDN 403s the cache-warming runner too, or the
bootstrap API contract changes): the lookup and download steps are
`continue-on-error`, no cache is saved, and the next push to the branch
retries; scans fall back as above.

The scanner CLI zip and GPG key still come per run from
`binaries.sonarsource.com` and the keyserver; those hosts have not been
the ones failing, and the existing retry covers them.

Note: the first engine seed only materializes once this merges
(cache-warming triggers on push to main), so this PR's own sonar runs
still use the fallback download path.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Caplin Caplin: Consensus Layer, Beacon API Networking

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants