Orca - ORigin CAche #124 (Draft)

plombardi89 wants to merge 67 commits into main from phlombar/origincache

Conversation

plombardi89 (Collaborator) commented May 5, 2026

Orca origin cache: implementation, hardening, and dev tooling

Summary

Adds the Orca origin cache as a complete v1: a horizontally-scaled
HTTP cache that fronts S3 / Azure Blob origins with chunked,
rendezvous-hashed, in-DC caching. Spans the design doc, the runtime
binary (cmd/orca), the in-process integration suite
(internal/orca/inttest), Kubernetes deployment manifests, and a
local dev harness with a Go-based seeding tool.

The PR is large because it combines:

  1. The initial implementation of the cache.
  2. Three rounds of review-driven hardening (about 35 commits of bug
    fixes, correctness tightening, and observability work).
  3. The local dev harness (hack/orca/, hack/cmd/orcaseed) and its
    docs.

This description walks through what the system does, the major
components, and the high-leverage areas reviewers should focus on.


What Orca is

A read-only origin cache. Clients speak a minimal S3-compatible
surface (GET, HEAD, LIST) to Orca; Orca fans out to a configured
origin (AWS S3 or Azure Blob Storage), splits objects into fixed-size
chunks (default 8 MiB), commits them to an in-DC S3-compatible
cachestore (atomic via If-None-Match: *), and serves subsequent
reads from cache.

A cluster of Orca replicas coordinates via rendezvous-hashed
coordinator selection plus a per-replica singleflight: at any given
time, exactly one replica is doing the origin GET for any given
chunk, and all concurrent client requests for that chunk collapse
onto it. The coordinator either fills locally or streams from a peer
via an internal RPC (GET /internal/fill).

No disk spool. No metadata DB. Chunk identity is encoded as a
deterministic SHA-256 over (origin_id, bucket, key, etag, chunk_size)
so the cachestore key set is itself the source of truth.
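As a rough illustration of that determinism (field order, separators, and the placement of the chunk index below are assumptions for the sketch, not the shipped encoding in internal/orca/chunk):

```go
import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Path sketches how a cachestore key can be derived purely from the
// identifying tuple: a new etag or a different chunk size yields a
// different path, so stale chunks can never alias fresh ones.
func Path(originID, bucket, key, etag string, chunkSize, index int64) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s\x00%s\x00%s\x00%s\x00%d",
		originID, bucket, key, etag, chunkSize)))
	return fmt.Sprintf("%s/%d", hex.EncodeToString(sum[:]), index)
}
```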


Major components

Runtime packages

  • internal/orca/server/ - HTTP edge (port 8443) + internal-fill listener (port 8444) + ops listener (port 8442 with /healthz and /readyz).
  • internal/orca/fetch/ - the per-replica fetch coordinator: HeadObject metadata, GetChunk orchestration, singleflight, pre-header origin retry, fill streaming.
  • internal/orca/cluster/ - peer discovery (DNS-backed in prod, static in tests), rendezvous hashing, internal-RPC client.
  • internal/orca/metadata/ - per-replica metadata cache with TTL, negative-TTL, and singleflight HEAD.
  • internal/orca/chunkcatalog/ - presence-only LRU recording chunks known to be in the cachestore.
  • internal/orca/cachestore/s3/ - S3-compatible cachestore driver (LocalStack in dev, VAST or similar in prod); atomic commit via PutObject + If-None-Match; boot-time versioning gate + self-test.
  • internal/orca/origin/awss3/ - AWS S3 (and S3-compatible) origin driver.
  • internal/orca/origin/azureblob/ - Azure Blob Storage origin driver (Block Blobs only).
  • internal/orca/chunk/ - chunk-key encoding (SHA-256 path computation, range-to-index math).
  • internal/orca/config/ - YAML config + env-var fallbacks + Logging.Level + ORCA_LOG_LEVEL env override.
  • internal/orca/app/ - lifecycle wiring (Start / Wait / Shutdown), three HTTP listeners, dependency injection seams for tests.

Tests

  • Unit tests in every package, covering driver branches, range
    math, error mapping, config defaults, manifest YAML validity.
  • Integration suite (internal/orca/inttest, build tag
    integrationtest, target make orca-inttest): 7 tests against
    real LocalStack + Azurite containers via testcontainers-go, with
    N in-process orca instances wired to those backends. Covers
    cold/warm GET, cross-chunk ranges, 64-chunk multi-chunk, rendezvous
    routing, singleflight collapse with bounded origin-call counts,
    membership-disagreement 409 fallback, and azureblob end-to-end.

Deployment

  • deploy/orca/*.yaml.tmpl - production manifests (Namespace, RBAC,
    ConfigMap, Service x2 - headless peer Service + edge Service,
    Deployment with anti-affinity).
  • deploy/orca/dev/*.yaml.tmpl - dev-only manifests (LocalStack,
    bucket-init Job, Azurite, container-init Job).
  • hack/cmd/render-manifests - Go template renderer driven by
    repeatable --set flags.

Dev harness

  • hack/orca/Makefile - orchestration for a 4-node Kind cluster
    (1 control-plane + 3 workers) with extraPortMappings for NodePort
    30100 -> host loopback.
  • hack/orca/quickstart.md - end-to-end recipe.
  • hack/orca/dev-harness.md - long-form reference (origin modes,
    switching, troubleshooting).
  • hack/cmd/orcaseed/ - Go tool for populating the dev-cluster
    origin: generate (synthetic N x size random blobs with optional
    per-blob deterministic seeding), upload (file from disk),
    list, delete.

Local dev-cluster workflow

For reviewers who want to bring this up themselves:

cp hack/orca/.env.example hack/orca/.env
$EDITOR hack/orca/.env  # set ORIGIN_DRIVER=azureblob LOG_LEVEL=debug
make orca-up
make -C hack/orca seed-generate SEED_ARGS='--size 10MiB --count 5'
make -C hack/orca port-forward  # in another terminal
curl -v http://localhost:8443/orca-test/synth-0 -o /dev/null
make -C hack/orca logs | jq 'select(.chunk.bucket=="orca-test")'
make orca-down

Full walkthrough in hack/orca/quickstart.md.

Comment thread design/orca/design.md Outdated
breaker** (s10.2). Sustained `ErrAuth` (default 3 consecutive) flips
`/readyz` to NotReady so load balancers drain the replica.

### 8.7 Metadata-layer singleflight
Contributor:

So if I'm reading this right, the doc proposes consistent hashing of the file/blob keyspace across Orca replicas, plus in-memory pipelining to batch concurrent requests for the same blocks.

Any concerns about hotspots if we're only running a few replicas? Since we have shared storage under the cache, it's feasible to avoid partitioning (at the cost of another network roundtrip for locking).

Contributor:

Worth highlighting that partitioning here is narrower than full-keyspace consistent hashing: rendezvous hashing only picks the miss-fill coordinator per ChunkKey (s1 row "Cluster", s8.3). On cache hits every replica reads the chunk directly from the shared CacheStore, so the steady-state hot-key path is not partitioned and doesn't hotspot a single replica. Coordination only matters during the (singleflighted) miss window for a given chunk.

Given that, and given your follow-up on the eviction thread re: keeping access counters in memory, I think we're aligned on sticking with rendezvous + a small replica count for v1. A shared-storage lock to fully avoid partitioning is captured implicitly by the deferred cluster-wide singleflight/limiter items in s15.2 / s15.5; happy to add an explicit "shared-storage lock" alternative there if you'd like.
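For readers who have not seen rendezvous hashing before, a minimal sketch of the idea (illustrative names, not the internal/orca/cluster code): every replica scores each peer against the chunk key and the highest score wins, so all replicas independently agree on the miss-fill coordinator with no coordination traffic.

```go
import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

type Peer struct {
	IP   string
	Port int
}

// score is a deterministic pseudo-random weight for (peer, chunkKey).
func score(p Peer, chunkKey string) uint64 {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d|%s", p.IP, p.Port, chunkKey)))
	return binary.BigEndian.Uint64(sum[:8])
}

// Coordinator returns the peer every replica would independently pick
// as the miss-fill coordinator for this chunk.
func Coordinator(peers []Peer, chunkKey string) Peer {
	best, bestScore := peers[0], score(peers[0], chunkKey)
	for _, p := range peers[1:] {
		if s := score(p, chunkKey); s > bestScore {
			best, bestScore = p, s
		}
	}
	return best
}
```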

Comment thread design/orca/design.md Outdated
access-frequency-driven eviction loop. This is the recommended
posixfs path when external sweep tooling is impractical.

### 13.2 Active eviction (opt-in, access-frequency)
Contributor:

Do we definitely need eviction? I assume the remote dataset might be larger than in-region storage, but worth confirming

Collaborator Author:

I think @jwilder has mentioned that the origin data set is going to be bigger than what the origin cache data store can hold, at least some of the time.

It feels like we need some kind of eviction, but maybe we need more clarity on the requirement.

Contributor:

Cool, this approach checks out then. Being able to keep the counters in memory might justify sticking with consistent hashing even if it comes at the cost of some hotspotting across replicas. So I guess that mostly addresses both of my remaining comments. 🙃

As long as we run fewer, "larger" replicas of this service I think we're good to go 👍

Add the orca origin cache binary (cmd/orca, internal/orca) that fronts
S3 / Azure Blob origins with a chunked, rendezvous-hashed, in-DC cache
backed by an S3-compatible store (LocalStack in dev, VAST or similar in
production). Includes:

- Per-replica fetch coordinator with cluster-wide singleflight collapse
  via rendezvous-hash coordinator selection plus an /internal/fill RPC
  for cross-replica fan-in.
- cluster.PeerSource abstraction (DNS-backed in production, mutable
  StaticPeerSource for tests) with Peer.Port to support multiple
  replicas sharing an IP under test.
- internal/orca/app factory exposing Start/Shutdown plus options for
  injecting alternate origin / cachestore / peer-source / internal
  handler wrap (used by tests).
- Integration suite (internal/orca/inttest, build tag integrationtest)
  driven by testcontainers-go: 7 scenarios against real LocalStack +
  Azurite covering cold/warm GET, ranged GET with chunk-boundary edge
  cases, 64-chunk multi-chunk GET, rendezvous routing, singleflight
  collapse with 64 <= origin GetRanges <= 76 bound, and a real
  membership-disagreement fallback test.
- Unit tests covering driver branches (cachestore versioning gate,
  azureblob blob-type gate, server error mapping + handler XML +
  range parser + path split + headers, chunk arithmetic + path
  determinism, config env-var fallback, manifest YAML validity).
- Deploy manifest templates (deploy/orca/) defaulting to the
  unbounded-kube namespace, and an extracted reusable
  hack/cmd/render-manifests/render package consumed by both the CLI
  and the manifest validity test.

Adds make orca-inttest target and a parallel CI job. Docs for the
dev harness and integration suite are intentionally excluded from
this commit.
Remove all references to design-doc sections (e.g. design.md s8.3,
Scope A+B, Option D) from orca source comments. The design doc
sections will change independently of the code; comments need to
stand on their own.

Where a section reference was redundant (the surrounding text already
explained the concept), the reference is dropped. Where a section
reference carried load (it was the only rationale), it is replaced
with the actual rationale inline. Jargon like 'Scope A+B' and
'Option D' is replaced with concrete descriptions of what the code
actually does.

12 files changed, 22 individual comment edits. Comment-only changes;
no behavior change.
Remove dead code and unexport identifiers that have no callers outside
their own package. Every exported identifier now has at least one
external user.

Deleted:
- cluster.Mu (unused sync.Mutex scaffold)
- origin.ErrThrottle (never returned or matched)
- chunkcatalog.(*Catalog).Len() (claimed test helper, no test uses it)
- OriginETagChangedError.Got (never set; only formatted in Error())
- cluster.NewDNSPeerSource (orphan wrapper around internal newDNSPeerSource)
- cluster.WithResolver and app.WithResolver (no external callers;
  WithPeerSource is the actual test seam)

Unexported:
- metadata.(*Cache).Lookup -> lookup (only called by LookupOrFetch
  in the same file)
- cluster.(*Cluster).Self -> self (only called by Coordinator in the
  same file)

Updated server_test.go to drop the Got field reference in the
OriginETagChangedError struct literal used by TestWriteOriginError.

No behavior change.
plombardi89 (Collaborator Author):

@copilot perform a review of this PR

Record a code-review pass over internal/orca and cmd/orca, plus an
adversarially-reviewed remediation plan organized into prerequisite
plumbing, must-fix correctness bugs, should-fix concerns, and
cleanup tiers. Subsequent commits address the items in order.
Swap the defer order so c.sf.Delete(k) runs before close(sfe.done).
A second caller arriving after Delete creates a fresh entry instead
of silently replaying our (possibly transient-error) result.

Without the fix:
  1. Leader L finishes fetch with a transient error.
  2. recordResult returns early (transient errors are not cached).
  3. defer closes done, then Delete runs.
  4. Between close(done) and Delete, caller B does LoadOrStore and
     gets L's stale entry. B's once.Do is a no-op; B waits on the
     already-closed done and silently returns L's transient error
     without calling fetch.

After: B finds no entry, creates a fresh one, runs fetch. The
benign side effect is a brief overlap window where a new fetch can
start while the previous leader's done is closing; cluster-wide
dedup mitigates the duplicated-fetch risk.

Adds metadata_test.go covering positive caching, ErrNotFound
negative caching, transient-error non-caching (the regression test),
and concurrent-joiner collapse via a synchronisation gate.
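A simplified sketch of the fixed ordering (names assumed; the shipped code also carries the once.Do and recordResult machinery): deferred calls run last-in-first-out, so deferring Delete after the close makes Delete execute first.

```go
import "sync"

type entry struct {
	done chan struct{}
	res  any
	err  error
}

type Cache struct {
	sf    sync.Map // key -> *entry
	fetch func(k string) (any, error)
}

func (c *Cache) lookupOrFetch(k string) (any, error) {
	e := &entry{done: make(chan struct{})}
	if actual, loaded := c.sf.LoadOrStore(k, e); loaded {
		prev := actual.(*entry)
		<-prev.done // join the in-flight fetch
		return prev.res, prev.err
	}
	defer close(e.done)  // runs second: wake joiners
	defer c.sf.Delete(k) // runs first: late callers get a fresh entry, not a stale result
	e.res, e.err = c.fetch(k)
	return e.res, e.err
}
```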
Previously cluster.refresh wiped a known-good multi-peer snapshot
with [Self] on any discovery error, causing every chunk's rendezvous
coordinator to become Self for at least one refresh interval (5 s in
prod). That broke cluster-wide dedup until the next successful
refresh.

Now refresh distinguishes:
  - discovery error with a previous snapshot AND under the staleness
    ceiling (maxStalePeerRefreshes = 5 consecutive errors): retain
    the previous snapshot, log + bump consecutiveRefreshErrors.
  - discovery error without a previous snapshot (bootstrap), OR
    after the staleness ceiling: fall back to [Self]. The ceiling
    bounds how long we route to dead peers if discovery is
    permanently broken; cluster.FillFromPeer's
    ErrPeerNotCoordinator fallback absorbs brief stale-routing.
  - discovery success with zero peers (legitimate 'I'm alone'):
    fall back to [Self], do not bump the error counter.

Adds cluster_test.go covering all three branches via a
fakePeerSource that can be toggled between healthy and erroring.
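The three branches boil down to a small decision function; a sketch under assumed names, with peers shown as plain addresses for brevity:

```go
// nextPeerSet decides what snapshot to store after one discovery attempt.
// prev is the last good snapshot, errs the consecutive-error count.
func nextPeerSet(discovered []string, discoveryErr error, prev []string, errs int, self string) ([]string, int) {
	const maxStalePeerRefreshes = 5
	switch {
	case discoveryErr != nil && len(prev) > 0 && errs < maxStalePeerRefreshes:
		return prev, errs + 1 // blip: keep routing to the known-good peer set
	case discoveryErr != nil:
		return []string{self}, errs + 1 // bootstrap failure or staleness ceiling hit
	case len(discovered) == 0:
		return []string{self}, 0 // legitimately alone; not an error
	default:
		return discovered, 0
	}
}
```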
Bundle of low-risk cleanups from the orca review:

- B6 cluster.DecodeChunkKey rejects chunk_size <= 0 and index < 0.
- Q1 cluster.IsCoordinator drops the dead IP+Port fallback; the
  Self flag is the single source of truth (stamped by every
  PeerSource and by the empty-peer-set fallback).
- Q2 chunkcatalog and metadata drop dead-defensive type-assertion
  error returns. The lists are private; we control every value
  inserted, so direct assertions are safe.
- Q3 rename skipCacheSelfTst -> skipCacheSelfTest in app.options.
- Q4 delete the _ = cachestore.ErrNotFound / _ = context.Canceled
  import guards in server.go and the now-unused cachestore import.
- Q6 fetch.fetchWithRetry checks ctx.Err() at the top of each loop
  iteration so a pre-cancelled context returns immediately instead
  of issuing a wasted GetRange.
- Q7 cluster.Close(ctx) is now ctx-aware; callers can cap the wait
  on an in-flight DNS lookup. app.Shutdown threads its ctx through.
- Q8 app.WithEdgeListener / WithInternalListener are explicitly
  marked TEST-ONLY in doc comments to deter production use.
- Q9 origin.go gets a comment block documenting which external
  conditions map to ErrNotFound / ErrAuth across drivers.
- C3 inttest CountingInternalHandlerWrap's responseWriter forwards
  Flush() to the embedded http.ResponseWriter so wrapping a handler
  that flushes (currently only the edge handler) does not silently
  degrade to buffered responses.
- M2 app.Wait drains any pending listener error from errCh after
  ctx.Done so a shutdown-time listener failure shows up in logs
  rather than being silently dropped.

All comment-and-API tightening; no functional behavior change.
cachestore/s3.PutChunk used to do io.ReadAll on the input reader
even when it was already an io.ReadSeeker (which is what the AWS
SDK actually needs for unsigned-payload retries). The caller in
fetch.runFill passes a bytes.NewReader(buf.Bytes()) - that's a
seekable reader sitting on top of the chunk buffer. Reading it all
again into a fresh byte slice doubled the chunk's memory footprint
during the put.

Now PutChunk type-asserts io.ReadSeeker; if present it passes the
reader straight to the SDK. The io.ReadAll fallback handles
non-seekable readers, where we still need to materialise the body
for the SDK's rewind contract.

With 8 MiB chunks and target_per_replica=64 concurrent fills, this
removes ~512 MiB of duplicate buffering at saturation.
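Sketch of the seekable fast path (function shape assumed): the production bytes.Reader passes through untouched, and only genuinely non-seekable readers get buffered for the SDK's rewind contract.

```go
import (
	"bytes"
	"io"
)

func putBody(r io.Reader) (io.Reader, error) {
	if rs, ok := r.(io.ReadSeeker); ok {
		return rs, nil // no second copy of the chunk buffer
	}
	b, err := io.ReadAll(r) // fallback: materialise so the SDK can retry
	if err != nil {
		return nil, err
	}
	return bytes.NewReader(b), nil
}
```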
…P0 + B1)

The fetch coordinator did not know the authoritative object size,
so it could not (a) clamp cachestore reads to the per-chunk length
on the tail chunk, (b) detect a short-body response from origin
during a fill, or (c) set Content-Length on the internal-fill
response. All three issues converge on the same prerequisite: pass
info.Size from the edge handler down through the cluster RPC.

Changes:

- cluster.encodeChunkKey/DecodeChunkKey carry an object_size query
  parameter on the internal-fill RPC. DecodeChunkKey returns
  (chunk.Key, int64, error); object_size is validated >= 0.
- cluster.FillFromPeer takes objectSize int64.
- fetch.Coordinator.GetChunk, FillForPeer, fillLocal, and runFill
  all take objectSize int64.
- fetch adds an expectedChunkLen helper that returns
  min(ChunkSize, objectSize - off) for the requested chunk, or 0
  for chunks past the end.
- runFill now requests expectedChunkLen bytes from origin (not
  k.ChunkSize), validates io.Copy's result against expectedChunkLen
  before commit, and fails the fill with a retryable-looking error
  on mismatch. This is B1 from the review: a flaky origin returning
  short bytes is rejected instead of permanently poisoning the
  catalog with the short length.
- The cachestore-hit fast path in GetChunk and FillForPeer clamps
  the read length to expectedChunkLen so even a too-large
  cachestore stat does not over-read past the object end.
- server.edgeFetchAPI.GetChunk and server.internalFetchAPI.FillForPeer
  signatures gain the objectSize parameter; edge handler passes
  info.Size, internal handler passes the decoded object_size.
- server_test.go fakeEdgeAPI updated to match.

When objectSize is 0 (unknown - the cluster-RPC path with old peers
during a rolling upgrade) the validation is a no-op and behavior
matches the pre-change shape. Production callers always provide
info.Size from origin Head.
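The length math itself is small; a sketch of the helper (exact signature assumed), where off is the chunk's byte offset within the object:

```go
func expectedChunkLen(chunkSize, objectSize, off int64) int64 {
	if off >= objectSize {
		return 0 // chunk lies past the end of the object
	}
	if rest := objectSize - off; rest < chunkSize {
		return rest // tail chunk
	}
	return chunkSize
}
```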
The internal-fill handler now sets Content-Length to
chunk.Key.ExpectedLen(objectSize) on successful responses. This lets
net/http on the requesting peer surface mid-stream truncations as
io.ErrUnexpectedEOF instead of treating a short connection-close as
a clean EOF. cluster.FillFromPeer additionally wraps the body in a
defense-in-depth validator that re-checks the byte count, making the
contract explicit at the call site.

Extracts the per-chunk expected length into a chunk.Key.ExpectedLen
method so both producer and consumer compute it identically.
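A sketch of that defense-in-depth wrapper (names assumed): it counts bytes and turns a short-but-cleanly-closed stream into io.ErrUnexpectedEOF.

```go
import "io"

type validatingReader struct {
	r    io.Reader
	want int64 // expected chunk length
	got  int64 // bytes seen so far
}

func (v *validatingReader) Read(p []byte) (int, error) {
	n, err := v.r.Read(p)
	v.got += int64(n)
	if err == io.EOF && v.got < v.want {
		return n, io.ErrUnexpectedEOF // peer closed early
	}
	return n, err
}
```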
Previously the edge handler called WriteHeader(200) before any chunk
was fetched, so a first-chunk failure (origin 404, auth error, or a
cachestore body whose first Read failed) could only be surfaced by
aborting the connection mid-stream. Clients saw a half-written 200
that they had to parse as a transport error.

The handler now fetches the first chunk and peeks one byte before
committing any status. Failures up to that point flow through
writeOriginError as proper S3-style XML responses. Subsequent chunk
failures (chunks 1..N) remain mid-stream aborts since headers are
already on the wire.
The azureblob driver always rejects PageBlob and AppendBlob, and the
cachestore/s3 driver always runs the bucket-versioning gate. The
matching YAML knobs (origin.azureblob.enforce_block_blob_only,
cachestore.s3.require_unversioned_bucket) were previously
force-true via 'if !X { X = true }' but the shape implied operators
could disable them; in reality the safety guarantees orca depends on
(immutable chunked cache contract, no-clobber atomic commit) are
broken without them.

Removes both fields from config.Azureblob and config.CachestoreS3,
makes the enforcement and gate unconditional in their respective
drivers, and strips the keys from the deployed configmap template.
Existing YAML files that set either field will fail to parse on next
deploy, which is the intended clean-break signal.
A GET on a zero-byte object previously computed rangeEnd = info.Size
- 1 = -1 and fell into the rangeStart > rangeEnd guard, returning a
spurious 416 for what is a successful empty-body fetch under any
sane S3-compatible reading. Add an explicit size==0 short-circuit
that writes 200 + Content-Length: 0 before the range-parsing flow.
A Range request against a zero-byte object remains a 416 per
RFC 7233.
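Sketch of the short-circuit (handler shape assumed):

```go
import "net/http"

func handleZeroByteObject(w http.ResponseWriter, r *http.Request) {
	if r.Header.Get("Range") != "" {
		// A Range request against an empty object stays 416 per RFC 7233.
		w.WriteHeader(http.StatusRequestedRangeNotSatisfiable)
		return
	}
	w.Header().Set("Content-Length", "0")
	w.WriteHeader(http.StatusOK) // successful empty-body fetch
}
```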
The cluster's internal-RPC HTTP client previously carried
Client.Timeout: 60s, a request-total wall clock that aborted any
internal-fill body stream taking longer than 60s. An 8 MiB chunk on
a degraded inter-pod link can exceed that bound; the abort was
indistinguishable on the requester side from the legitimate
mid-stream truncation case we plumbed Content-Length validation
against. Drop the wall timeout and rely on the caller's ctx (the
edge request ctx for client-driven fills, the 5-minute detached
fill ctx in fetch.runFill for leader-side ones) as the sole
deadline.

Adds WithHTTPClient as a test seam (production never sets it). Adds
direct white-box assertion that Client.Timeout is zero and a
behavioural test confirming ctx deadlines still terminate slow peer
responses.
The azureblob driver strips entity-tag quotes when reading the ETag
on Head (orca's internal representation is unquoted) and previously
passed that unquoted value straight back through azcore.ETag for
If-Match on GetRange. RFC 7232 requires entity-tags to be quoted-strings;
Azure tolerated unquoted values in practice but the contract was
inconsistent and diverged from the awss3 driver, which already
re-wraps in quotes (awss3.go).

Re-wrap on egress. Add httptest-backed regression tests asserting
the inbound If-Match header is quoted when an etag is supplied and
absent when the etag is empty.
The driver previously mapped backend errors by matching service
error codes and by substring of err.Error(). Three concrete bugs:

- isPreconditionFailed matched 'InvalidArgument' and
  'ConditionalRequestConflict' (neither is the AWS S3 contract for
  If-None-Match: * conflicts) and fell back to
  strings.Contains(err.Error(), '412'), which fires false-positive
  on any error message containing '412' in any context.
- mapErr detected 5xx via strings.Contains(err.Error(),
  'StatusCode: 5') against the SDK's err.Error() format - a
  presentation detail with no stability contract.
- The driver carried a vestigial _ = http.StatusOK no-op to keep
  the net/http import alive.

Switch both predicates to *awshttp.ResponseError-based inspection
of the underlying HTTP status code: 412 for precondition, 4xx (401
/ 403) for auth, 5xx for transient. The 'AccessDenied' / similar
APIError-code branch is retained because the SDK surfaces these as
typed service errors carrying stable codes. Verified against the
orca integration suite: LocalStack 3.8 returns 412 for the
SelfTestAtomicCommit duplicate-put path, so dropping the legacy
matches is safe.
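Sketch of the status-based predicates (helper names assumed; errors.As against the SDK's typed response error rather than substring matching):

```go
import (
	"errors"
	"net/http"

	awshttp "github.com/aws/aws-sdk-go-v2/aws/transport/http"
)

// httpStatusOf extracts the underlying HTTP status from an SDK error, if any.
func httpStatusOf(err error) (int, bool) {
	var re *awshttp.ResponseError
	if errors.As(err, &re) {
		return re.HTTPStatusCode(), true
	}
	return 0, false
}

func isPreconditionFailed(err error) bool {
	status, ok := httpStatusOf(err)
	return ok && status == http.StatusPreconditionFailed // 412 from If-None-Match: *
}

func isTransient(err error) bool {
	status, ok := httpStatusOf(err)
	return ok && status >= 500
}
```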
Adds slog.LogAttrs debug emissions at every origin operation for
diagnostic visibility into upstream calls:

- Head: request + response (size, short etag) + the three error
  mapping branches (not_found, auth, generic).
- GetRange: request (bucket, key, short etag, off, n) + response
  + etag-changed / not-found / auth branches.
- List: request (bucket, prefix, marker, max) + response (count,
  truncated) + auth error.

Adds an origin.ETagShort helper that truncates entity-tags to the
first 8 characters for log-line readability. ETags are not secrets
but their full Azure / AWS form is long enough to make grep output
noisy; the prefix is sufficient for matching one fill against
another while keeping the log handler's output narrow. Object keys
and bucket names are logged in full because they are part of the
operator's diagnostic context.
EdgeHandler:
- ServeHTTP entry: edge_request with method, path, bucket, key,
  range, and source remote-addr.
- handleHead exit: edge_head_response with size and truncated etag.
- handleGet: edge_get_plan (range/first/last chunk indices),
  edge_get_empty_object short-circuit trace, edge_get_chunk_next
  per chunk-stream transition, edge_get_complete on success.

InternalHandler:
- internal_fill_request on every accepted RPC with chunk attrs,
  object_size, and source remote-addr.
- internal_fill_not_coordinator trace for the 409 fallback path
  (operator visibility into membership-disagreement scenarios).
- internal_fill_complete on success with delivered bytes.

Migrates the 6 existing Warn callsites in server.go and the 5
existing Info / 5 existing Warn callsites in app.go to
slog.LogAttrs so they share the same attribute taxonomy as the new
Debug emissions and get zero-cost attribute evaluation when
filtered out.

Adds a debugLoggerTo test helper and a representative emission
test asserting the edge_request + edge_head_response trace pair.
Cluster.refresh now drives peer-set storage through a storePeerSet
helper that emits:

- Debug 'peer_set_refreshed' per refresh cycle with the new count
  and a 'reason' attribute (discovery_ok / empty_discovery_self_only
  / self_only_fallback).
- Info 'peer_set_initial' the first time a snapshot is stored.
- Info 'peer_set_changed' on every subsequent transition where the
  rendered (ip, port) set differs from the previous snapshot, with
  added/removed lists for diagnostic visibility.

Stable refreshes (no membership delta) emit only the per-cycle
debug line. This is how operators distinguish 'discovery healthy
but quiet' from 'membership churn' in the logs.

Adds Coordinator debug emission with chosen-peer IP, is_self bool,
and rendezvous score - the per-chunk routing decision visible in
the structured log when level=debug.

FillFromPeer emits request + response (or not-coordinator) debug
traces so peer-fill round-trips are observable.

Migrates the one existing Warn callsite in refresh's
retain-snapshot branch to LogAttrs.
Documents the debug-logging work that landed in the second-pass
remediation:

- The cfg.Logging.Level knob and ORCA_LOG_LEVEL env override.
- The HandlerOptions.AddSource: true convention and what it
  replaces (per-package With() tagging).
- Logger injection through every orca package that previously
  lacked one.
- LogAttrs-everywhere convention with zero-cost-when-filtered
  guarantee on hot paths (notably chunkcatalog.Lookup).
- The cross-package 'chunk' attribute group taxonomy.
- The sensitive-data audit (no keys, no signed URLs, ETag
  truncation via origin.ETagShort).
- The operator workflow for enabling debug logging.
- Deferred follow-ups: per-request correlation IDs, Prometheus
  metrics, runtime log-level switching.
The Azurite Service in the dev manifest was ClusterIP, so reaching
it from the host required a kubectl port-forward. The forthcoming
hack/cmd/orcaseed tool runs outside the cluster and populates the
origin container directly; switching to NodePort with a stable
high-port (default 30100) lets the seeder talk to
http://localhost:30100/devstoreaccount1/ once the dev cluster is up.

The fixed port sits in the Kubernetes NodePort range
(30000-32767). Two concurrent dev clusters on the same host would
collide; the renderer accepts an AzuriteNodePort override (wired
through hack/orca/Makefile via AZURITE_NODE_PORT). The Kind config
on the operator side also needs an extraPortMapping for 30100 to
expose the NodePort to the host loopback (Kind does not route
NodePorts to the host without explicit mappings).
Adds a small Go binary used by the Orca dev harness to populate the
in-cluster Azurite origin container with synthetic or operator-
supplied content. Four subcommands sharing the same connection
flags (default endpoint targets the dev harness's NodePort 30100):

  generate  N synthetic random-byte blobs of size S each (with
            --seed for reproducible content; per-blob ceiling 1 GiB
            unless --force; warns if size*count > 1 GiB).
  upload    a single file from disk (--file, optional --name).
  list      blobs currently in the container, optional --prefix.
  delete    blobs matching --prefix (default all), interactive
            confirmation unless --yes.

The well-known Azurite dev account-key is the default --account-key
value; it is a public Microsoft-documented constant baked into
Azurite, not a secret.

Tests cover the human-readable size parser (every accepted suffix +
the error paths) and a deterministic-seed end-to-end generate run
against an httptest-backed fake Azurite that captures the uploaded
bodies and asserts byte-identical output across two runs of the
same --seed.
Captures the minimal recipe for bringing up the Orca dev cluster in
Azurite mode with debug logging enabled from t=0, seeding the
origin with hack/cmd/orcaseed, exercising the cache, and watching
the per-chunk structured-log trace.

Walks through the eight-step path: .env setup, make orca-up,
seed-generate / seed-upload, port-forward, curl, jq-filtered log
tailing, iterate, tear down. Plus a cheat-sheet of the helper
targets and a pointer to the in-process integration suite for
folks who don't need Kind.

Complements the longer dev-harness.md reference; this file is the
quickstart path.
Brings the Orca dev-harness tooling into git as a coherent unit
alongside the already-tracked quickstart.md, and applies the
cleanup-pass findings identified during the consistency review.

Files newly tracked (previously local-only):

  Makefile              - dev-harness orchestration (kind, render,
                          deploy, seed, reset, etc.).
  kind-config.yaml      - 1 control-plane + 3 workers; first worker
                          forwards NodePort 30100 to host loopback
                          for orcaseed -> Azurite reachability.
  kind-create.sh        - idempotent kind cluster bring-up.
  kind-load.sh          - sideload the orca image into kind nodes.
  down.sh               - kind cluster teardown.
  deploy-credentials.sh - build the orca-credentials Secret from .env.
  dev-harness.md        - long-form reference for the dev harness.
  inttest.md            - companion docs for the in-process
                          integration suite.
  .gitignore            - excludes .env + rendered-dev/ output.

Cleanup applied in this commit:

- F1: seed-azure.sh deleted. Its functionality is fully covered by
  hack/cmd/orcaseed; the Makefile's seed-azure target now invokes
  'orcaseed upload' against real Azure with credentials sourced
  from .env. Drops a 76-line shell script plus the dependency on
  the mcr.microsoft.com/azure-cli container image.
- F5: dev-harness.md trimmed where it duplicated quickstart.md
  (Seed sample data / Exercise the cache / Logging now point at
  the quickstart). dev-harness.md shrank from 381 to 333 lines.
  Reference material (origin-mode table, troubleshooting,
  'NOT covered' list, switching-to-real-Azure) stays.
- F6: removed the stale 'sample-get.sh' reference.
- F8: .PHONY in Makefile lists deploy-azurite-maybe; deploy-orca
  recipe carries a comment explaining the Service-before-Deployment
  apply ordering (headless DNS must exist before pods start so the
  initial cluster.refresh sees the full peer set).

Skipped:

- F2: kind-create / kind-load / down kept as standalone scripts
  rather than inlined into the Makefile (recipe readability wins
  over file-count reduction).
- F7: orcaseed remains azblob-only; the awss3 LocalStack path is
  rarely interactively driven and the kubectl-run + amazon/aws-cli
  workaround is documented in dev-harness.md.
The 'object_size unknown' sentinel value (object_size == 0 in the
internal-fill wire format, with a runFill fallback substituting
k.ChunkSize for the request length and skipping the body-length
validation) was dead code: production callers always plumb info.Size
from a prior Head, so the sentinel was never legitimately set. Its
existence created three reachable foot-guns:

- cachestore/s3.GetChunk(n=0) produced a malformed S3 Range header
  'bytes=0--1', returning a 400 instead of an empty body.
- runFill validation was skipped, allowing an adversarial peer to
  commit wrong-sized blobs to the cachestore by sending
  object_size=0 over the internal-fill RPC.
- cluster.FillFromPeer's validatingReader was bypassed because the
  internal-fill handler omitted Content-Length when expectedLen=0,
  so a short stream from the peer was indistinguishable from a
  clean EOF.

Tighten the wire format and propagate the strictness inward:

- cluster.DecodeChunkKey: object_size must be > 0 (was >= 0).
- fetch.runFill: drop the requestLen==0 fallback; always pass the
  authoritative expectedLen to fetchWithRetry and always run the
  body-length validation.
- cachestore/s3.GetChunk: reject n <= 0 and off < 0.
- cachestore/s3.PutChunk: reject size <= 0.

Adds regression tests at the wire boundary and at the cachestore
driver entry points. No production caller is affected (all callers
already pass positive sizes); a peer running pre-fix code that sent
object_size=0 would be rejected at decode time, which is the
intended hardening.
The Catalog stored cachestore.Info for every recorded chunk but the
only caller (fetch.lookupOrStat) discarded it: the catalog-hit path
used k.ExpectedLen(objectSize) for the GetChunk call, never reading
catalog.info.Size. The defensive value of the stored Info was
illusory: chunk.Path encodes (origin_id, bucket, key, etag,
chunk_size), so under the design contract a path hit implies the
cachestore contains bytes for this exact version of this chunk.
Catalog-stored Info couldn't detect mutations to the cachestore
that happened after Record (those are self-healing via
ErrNotFound -> Forget), and contract violations within Path
encoding are the wrong layer to defend.

Slim the catalog to a presence-only LRU:

- Catalog.Lookup returns bool, not (Info, bool).
- Catalog.Record takes only the chunk key; the prior Info argument
  is gone.
- entry struct drops the info field.

Updates fetch.lookupOrStat and fetch.runFill (commit-success and
commit-lost branches) for the new API. Net delta is a smaller
catalog struct, one less alloc per Record, and clearer intent in
the package docstring.
The previous ordering closed f.done in a deferred function that ran
AFTER PutChunk returned, so joiners had to wait for both the origin
fetch AND the cachestore commit before seeing bytes. The runFill
comment described 'commit-after-serve' but the code was actually
commit-before-serve - the latency cost was hidden in the joiner
path on every cold fill.

Promote close(f.done) to an explicit call right after the
body-length validation succeeds and f.bodyBuf is assigned. Use a
sync.Once-wrapped release function so the deferred safety net path
(which catches panics) and the explicit success-path call are both
safe to invoke; close(f.done) fires exactly once either way.

Correctness is preserved: the bytes.Buffer is fully populated and
length-validated before release; the underlying byte slice is no
longer mutated after io.Copy returns, so joiner reads of
f.bodyBuf.Bytes() are safe to overlap with the PutChunk RPC's read
of the same slice via bytes.NewReader.

Latency improvement: joiners now return as soon as the origin
delivered bytes; the commit RTT is removed from the joiner path.

Regression test exercises a slow PutChunk (mock cachestore that
blocks PutChunk until signaled) and asserts that fillLocal returns
while PutChunk is still pending - which fails under the old
ordering and passes under the new.
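A condensed sketch of the release ordering (names and the put callback are assumed; the real logic lives in fetch.runFill):

```go
import (
	"bytes"
	"context"
	"fmt"
	"io"
	"sync"
)

func fillAndRelease(ctx context.Context, originBody io.Reader, expectedLen int64,
	buf *bytes.Buffer, done chan struct{},
	put func(context.Context, io.Reader, int64) error) error {

	var once sync.Once
	release := func() { once.Do(func() { close(done) }) }
	defer release() // safety net: joiners are woken even on an error or panic

	n, err := io.Copy(buf, originBody)
	if err != nil || n != expectedLen {
		return fmt.Errorf("short origin body: got %d, want %d", n, expectedLen)
	}
	release() // joiners read buf.Bytes() now; the commit RTT is off their path

	// bytes.NewReader only reads the slice, so overlapping the put with
	// joiner reads of buf.Bytes() is safe.
	return put(ctx, bytes.NewReader(buf.Bytes()), n)
}
```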
chunk.Path encodes the ETag in its SHA-256 hash input. A stable
cache key requires the origin to supply one. If the origin returns
an empty ETag - some S3-compatible backends with specific bucket
policies, or custom origins that do not follow the AWS/Azure
contract - then two different versions of the same (bucket, key)
would alias to the same chunk.Path and orca would serve stale
bytes silently on the next read after a mutation.

Catch the misconfiguration immediately at Head time. New sentinel
origin.MissingETagError. fetch.HeadObject wraps the underlying
origin.Head and rejects empty-ETag responses. metadata.recordResult
caches the negative result under NegativeTTL so we do not re-Head
a permanently-misconfigured origin on every request.
server.writeOriginError maps MissingETagError to 502 with a
descriptive 'OriginMissingETag' body so the operator sees the
misconfiguration in the response, not just the logs.

Coordinator-level enforcement (single policy point); the drivers
themselves remain unchanged - they return whatever the upstream
gave us. The coordinator decides whether the response is
acceptable, and the metadata cache prevents repeated re-fetches.

Trade-off: misconfigured origins that worked today now fail loud.
That is the intended hardening - empty-ETag origins are corruption
hazards under orca's caching model.
The previous implementation used a shared *math/rand.Rand source
serialised through a sync.Mutex when --seed was set. Concurrent
upload goroutines interleaved reads from that single source in
lock-acquisition order, which is non-deterministic. The first test
worked only because it used --concurrency 1; under any --concurrency
> 1 (including the default 4), two runs with the same --seed
produced different per-blob bytes.

Derive each blob's stream from (userSeed + blobIndex). Each blob
has its own independent math/rand source, no shared state, no
mutex. Determinism becomes a pure function of (--seed, blobIndex)
and is invariant under upload-completion ordering.

Updates the test to use --concurrency 4 (so the regression is
caught) and adds a second test asserting per-blob streams differ
within the same run (so --prefix-0 and --prefix-1 are not byte-
identical under the same seed).
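Sketch of the per-blob derivation (flag and parameter names assumed): the byte stream depends only on (seed, blobIndex), never on goroutine scheduling.

```go
import "math/rand"

func blobBytes(seed int64, blobIndex, size int) []byte {
	rng := rand.New(rand.NewSource(seed + int64(blobIndex))) // independent per-blob source
	b := make([]byte, size)
	rng.Read(b) // math/rand Read always fills the slice
	return b
}
```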
The previous fallback to the well-known Azurite dev key applied
whenever AZURE_STORAGE_KEY was empty, regardless of
AZURE_STORAGE_ACCOUNT. An operator with
AZURE_STORAGE_ACCOUNT=mycompanyacct and a missing AZURE_STORAGE_KEY
got the Azurite dev key silently injected into the orca-credentials
Secret, and Orca pods spent a startup window 401'ing against the
real account before the operator figured out why.

Tighten the gate: the Azurite dev-key fallback applies only when
AZURE_STORAGE_ACCOUNT is empty OR equals 'devstoreaccount1' (the
canonical Azurite account name). For any other account, a missing
AZURE_STORAGE_KEY is a hard error with a clear message naming the
account, so the misconfiguration surfaces immediately.
http.Client.Timeout was deliberately removed earlier so caller ctx
could bound long-running fill body streams (8 MiB chunks on slow
links can exceed any hardcoded value). The trade-off was: a stuck
TCP SYN or stalled TLS handshake against a half-failed peer would
hang until the caller's ctx fired - which is the full 5-minute
fill ctx for leader-side fills. Over time those stuck connects can
saturate the origin semaphore with cancelled work.

Bound the connection-establishment surface independently of the
body-read deadline:

- Transport.DialContext with a 10s connect timeout (via net.Dialer).
- Transport.TLSHandshakeTimeout: 10 seconds.
- Transport.ExpectContinueTimeout: 1 second.

Body reads still rely on the caller's ctx, preserving the
arbitrary-size-on-slow-link contract. Connect-level latency is now
bounded regardless of caller ctx, so stuck-syn cancelled work no
longer hangs the originSem.

Test asserts DialContext is non-nil and TLSHandshakeTimeout is
bounded; existing TestNewHTTPClient_NoWallTimeout still locks the
body-read contract.
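A sketch of the resulting client (standard net/http fields; the constructor name is assumed): connect and handshake are bounded, body reads stay governed by the caller's ctx.

```go
import (
	"net"
	"net/http"
	"time"
)

func newInternalHTTPClient() *http.Client {
	return &http.Client{
		// Deliberately no Client.Timeout: a wall clock here would abort
		// long-running fill body streams.
		Transport: &http.Transport{
			DialContext:           (&net.Dialer{Timeout: 10 * time.Second}).DialContext,
			TLSHandshakeTimeout:   10 * time.Second,
			ExpectContinueTimeout: 1 * time.Second,
		},
	}
}
```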
Only the non-seekable path validated len(buf) == size. The
seekable-reader path (production: bytes.NewReader(buf.Bytes()) from
runFill, or os.File from orcaseed) trusted the caller's claimed
size and handed it straight to S3 with ContentLength=size. A buggy
caller passing a 100-byte reader with size=1024 (or vice versa)
either got rejected by S3 (ContentLength mismatch) or uploaded a
truncated / overlong blob, depending on backend behaviour.

Add a seek-and-check probe at the driver entry point: Seek(0, End)
to discover the actual length, compare to the declared size, then
Seek(0, Start) to rewind for the upload. The probe is cheap on
both bytes.Reader and *os.File (the production callers), and
catches the size-mismatch case before any RPC.

Regression test passes a bytes.Reader with the wrong length and
asserts PutChunk errors out before reaching S3.
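Sketch of the probe (function name assumed): two cheap Seeks on bytes.Reader or *os.File, and a mismatch fails before any RPC.

```go
import (
	"fmt"
	"io"
)

func checkDeclaredSize(rs io.ReadSeeker, declared int64) error {
	actual, err := rs.Seek(0, io.SeekEnd) // discover the real length
	if err != nil {
		return err
	}
	if actual != declared {
		return fmt.Errorf("declared size %d does not match reader length %d", declared, actual)
	}
	_, err = rs.Seek(0, io.SeekStart) // rewind for the upload
	return err
}
```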
The ctx-cancel branch already drained any concurrently-arriving
listener errors before returning nil. The errCh-arrived-first
branch returned the first error and left the rest in the channel;
subsequent errors (e.g. a multi-listener crash where two listeners
errored within the same tick) sat in the buffered channel until
the goroutines exited at Shutdown time, by which point nothing
read errCh and the errors were silently dropped.

Extract the drain into a small helper and call it on both Wait
return paths. The first-error case still returns the first error
as before; additional errors are now logged at Warn so operators
see the full picture of a multi-listener failure.

Test pre-buffers three errors and cancels ctx, asserting all three
appear in the captured log output.
- M-1: tighten chunk.IndexRange docstring. The previous comment
  noted the negative-end clamp but didn't lay out the full input
  contract (chunkSize > 0 guaranteed by config validation;
  empty-objects short-circuit upstream and never reach IndexRange;
  the clamp is a guard against an arithmetic bug, not a supported
  empty-range encoding).
- M-7: orcaseed delete confirmation prompt now distinguishes
  stdin-closed (EOF) from generic read errors. The message points
  the operator at --yes for non-interactive contexts instead of
  the opaque 'read confirmation: EOF'.
- Append a 'Third-pass findings and remediation' section to
  design/orca/review.md summarising the ten commits in this pass
  with one-line descriptions per finding, plus the deferred-items
  list with rationale.
Three rounds of review-driven hardening landed substantial
behavioural changes since the design docs were last touched, and
several features that had been described as v1 scope were
explicitly deferred. The result was a design surface that no
longer mapped onto the shipped code.

- Rewrite design.md from scratch around what shipped. Drops the
  spool / tee / posixfs / localfs / circuit-breaker / LIST-cache /
  prefetch / active-eviction / bounded-freshness / max_response_bytes
  / metrics surfaces that were never implemented; redraws the
  retained mermaid diagrams to match the in-memory bytes.Buffer +
  sync.Once-wrapped release + commit-after-serve flow; adds a
  Deferred / future work section consolidating the not-yet-shipped
  design ideas (including the design-intent items from the prior
  review passes).
- Surgical edits to brief.md to remove the same deferred-feature
  references, redraw Diagrams A and B, and rewrite the top-risks
  list around what an operator would actually hit today.
- Delete plan.md (1554 lines): the phased-delivery narrative no
  longer maps onto shipped code; the repo-layout section was 50%+
  stale; the approval-checklist semantics make no sense for a doc
  describing shipped code.
- Delete review.md (753 lines): its design-intent deferred items
  are now consolidated into design.md s15; the audit-trail value
  lives in git history.
- Update hack/orca/dev-harness.md cross-link to point at
  design.md's Deferred / future work section instead of plan.md.
- Fix a stale parameter docstring on chunkcatalog.Record: the
  comment referenced a removed 'info' argument; the catalog has
  been presence-only since the third review pass.

Verified by re-reading the new design.md against
internal/orca/{fetch,cluster,cachestore,server,app,chunk,
chunkcatalog,metadata,origin,config}/, internal/orca/inttest/,
and deploy/orca/. make and make orca-inttest are green.
GitHub's mermaid renderer treats ';' as a statement separator in
sequence diagrams, not as message text. Three messages in the
cold-miss local-coordinator diagram contained semicolons,
producing 'expected arrow, got NEWLINE' parse errors and a
non-rendering diagram:

  R->>R: ChunkCatalog miss; Stat miss
  R->>R: Peek(1); commit headers
  SF->>CS: Stat(k); Record on success

Replaced ';' with ', then' or ',' in each case so the message
text reads naturally and the parser sees a single statement per
line.

Also declared the missing 'Cat as ChunkCatalog' participant so it
renders next to its peers instead of being auto-created at the
right edge.

Confirmed no other mermaid block in design.md or brief.md
contains semicolons inside message text.
The Internal interfaces section duplicated 30 lines of Go that
mirrored internal/orca/origin/origin.go and
internal/orca/cachestore/cachestore.go byte-for-byte, plus ~30
lines of bullets enumerating specific method signatures. Both
were guaranteed to rot on any refactor.

- Delete the section in full.
- Renumber design.md sections 8-15 down to 7-14 (and matching
  subsections / TOC).
- Insert a 'Named seams' table at the top of the new s7 Stampede
  protection (former s8) as a navigation aid: seam name -> file
  path -> one-line role.
- Update all in-doc and cross-doc anchors / section-number
  references in design.md and brief.md to match the renumbered
  structure (TOC, decisions table, terminology glossary, request
  flow, stampede protection internal cross-refs, concurrency
  section, bounded staleness, create-after-404, eviction,
  deferred / future work).
- Rewrite the brief.md s4 framing that opened with 'formal Go
  interfaces in design.md s7' to point readers at internal/orca/
  source files directly.

Load-bearing semantics that previously hid in section 7 -- atomic
no-clobber commit, If-Match identity contract, typed-sentinel
error routing -- were already documented in s2 (Decisions),
s7.7 (Failure handling), and s9.2 (Typed cachestore errors); no
information loss. Verified every cross-reference still resolves
against an existing heading by grepping the unique anchor URLs in
each doc against the heading-derived anchors.
The 'Azure adapter: Block Blob only' section was unnecessary
detail given how much of its content is covered elsewhere:

- The Block Blob constraint is already stated in s2 Decisions
  ('Azure constraint | Block Blobs only').
- UnsupportedBlobTypeError handling is in s2 Decisions and the
  HTTP error-code mapping in s6.3.
- Negative caching of unsupported-blob-type responses is
  documented in s10 Create-after-404 and negative-cache
  lifecycle.
- If-Match quoting is a driver-internal implementation detail
  documented in the source.

- Delete the section in full (~20 lines).
- Renumber design.md sections 9-14 down to 8-13 (and matching
  subsections / TOC).
- Update every cross-reference in design.md and brief.md to
  match the renumbered structure.

Verified every anchor URL still resolves against an existing
heading; no broken links.
The two docs leaned on nominalized noun phrases, Latinate
vocabulary, and dense compound-noun pile-ups that made them
tedious to parse. A reader had to translate phrases like
'asymmetric defaults reflect asymmetric operational reality'
into the underlying idea before reading on.

This pass rewrites the prose throughout both docs to be readable
by someone who is not steeped in the codebase:

- Active voice over passive.
- Verbs over nominalized abstractions.
- Plain words over Latinate equivalents.
- Shorter sentences (split at ~25 words).
- Hedging like 'load-bearing', 'essentially', 'intentionally',
  'operational reality' removed or replaced with specifics.
- The point of each paragraph leads; the implementation walk
  follows.
- Each diagram now has a one-sentence caption describing what
  it shows.

Preserved without change:
- All section / subsection numbers and heading text (and
  therefore anchor URLs).
- All nine mermaid diagrams.
- All code identifiers, file-path references, and config-key
  names.
- The named technical concepts (singleflight, rendezvous
  hashing, pre-header retry, commit-after-serve, the typed
  sentinel errors) - these stay as grep handles, but each
  first-use now carries a plain-language gloss.

Verified every cross-reference still resolves: the unique
anchor URLs in design.md (17) and brief.md (15) all match real
headings post-rewrite.
s8.2 Typed cachestore errors, s8.3 Range/sizes/edge cases, and
s8.4 Readiness probe added detail that already lived elsewhere:

- s8.2's four-sentinel-error list duplicated content in s2 (the
  Atomic-commit decisions row), s6.3 (the HTTP error-code
  mapping table), s7.2 (commit-after-serve handling), and s7.7
  (failure handling without re-stampede).
- s8.3's bullets all duplicated content in s5 (chunk model +
  ExpectedLen) and s6 (range parse, zero-byte short-circuit).
  The PutChunk seekable-vs-non-seekable length probe was driver
  implementation detail.
- s8.4's /readyz gate conditions and /healthz semantics were
  already stated in s4 (the Ops listener bullet); the /readyz
  predicate behavior under broken DNS is in s12 (Horizontal
  scale).

With those gone, s8 has only the atomic-commit content. Rename
s8 to 'Atomic commit' to match what's left, drop the now-
redundant '### 8.1 Atomic commit' subheading, and update the
one in-doc cross-reference plus the two cross-references in
brief.md.

Verified anchor URLs in both docs still resolve.
The brief had grown to 373 lines and was no longer fast to
read. Reshape it as a one-screen orientation that lands the
reader on design.md.

- Drop section 4 (Components). design.md s7's Named seams
  table is a better reference; the brief doesn't need to
  enumerate components.
- Drop section 7 (Cold-miss request + Diagram B). The same
  flow is in design.md Diagram 5; Diagram A is enough as an
  entry point for the brief.
- Compress section 5 (Five mechanisms): each mechanism is now
  one short paragraph, not a multi-paragraph walk-through.
  Detail lives in design.md.
- Convert section 8 (Top risks) from numbered paragraphs to a
  compact 4-column table: Risk, What goes wrong, Bound, Detail.
- Strip section 9 (Where to go next) to bare links; the
  destination section titles speak for themselves.
- Tighten the Problem, Goals/Non-goals, System-at-a-glance,
  and Backing-store-options sections.

Sections 5-9 renumber to 4-7 to fill the gaps left by the two
deleted sections.

About 198 lines now, down from 373 (47% cut).
Audit pass on internal/orca and cmd/orca surfaced a handful of
swallowed errors and one dead-coded security configuration:

- server.handleList: log XML encode failures at warn instead of
  silently dropping them, matching the mid-stream treatment in the
  GET path.
- fetch.runFill: log Stat errors that follow ErrCommitLost so
  cachestore flapping is observable; catalog still stays unrecorded.
- cachestore/s3.randHex: surface crypto/rand failures to the
  selftest caller instead of falling back to a time-based suffix
  that can collide on parallel boots.
- cluster.newHTTPClient: refuse to start when
  cluster.internal_tls.enabled=true. The TLS configuration was
  previously discarded with '_ = cfg', which would have silently
  downgraded production deployments to system-trust-store HTTPS
  instead of the configured CA / client cert.
- cmd/orca/orca.serve: propagate App.Shutdown errors to the
  process exit code so failed-shutdown signals reach kubelet
  probes and init systems.
Large-blob GETs previously paid one strictly-sequential cachestore
GetObject per chunk; at 8 MiB chunks a 700 GB object required
~90,000 serial round trips even when fully cached. This commit adds
two compounding throughput improvements:

- Dynamic chunk size via a tier ladder. chunk.SizeFor selects the
  effective chunk size from a base value plus an ascending-threshold
  list of {MinObjectSize, ChunkSize}. The base covers small objects;
  each tier overrides for objects at or above its MinObjectSize.
  Tiers are strictly-ascending and rejected at config validate time
  if unsorted or out of bounds, so overlap and ambiguity are
  structurally impossible. Default ladder: 8 MiB base, 64 MiB above
  1 GiB, 128 MiB above 10 GiB. Backward compatible: empty tiers
  preserves the prior global-only chunk size.

- Server-side parallel read-ahead. EdgeHandler.handleGet now spawns
  a producer that runs up to chunking.readahead chunk fetches in
  flight while the consumer streams the current chunk, with strict
  ordered delivery via a per-job result channel. Cold-fill memory
  remains bounded by the existing originSem; the readahead depth
  bounds extra warm-path cachestore body buffers. Default depth 8;
  readahead is a *int so the YAML can distinguish omitted (defaults
  applied) from explicit 0 (disabled), restoring the prior strictly-
  sequential path.

Cross-replica safety: the internal-fill RPC already carries
chunk_size per request, so rolling deploys with different tier
policies on different replicas remain correct.

Worked example for a 700 GB warm GET against the new defaults:
chunk count drops from ~90,000 to ~5,600 (16x fewer requests), with
up to 8 of those in flight in parallel.
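Sketch of the tier selection (type and field names from this commit; exact shapes assumed). With the default ladder, a 700 GB object lands on 128 MiB chunks, which is where the ~5,600-chunk figure comes from.

```go
type Tier struct {
	MinObjectSize int64 // tier applies to objects at or above this size
	ChunkSize     int64
}

// SizeFor picks the effective chunk size: the base for small objects,
// otherwise the last tier whose MinObjectSize the object reaches.
// Tiers are validated elsewhere to be strictly ascending.
func SizeFor(base int64, tiers []Tier, objectSize int64) int64 {
	size := base
	for _, t := range tiers {
		if objectSize < t.MinObjectSize {
			break // ascending tiers: no later tier can match
		}
		size = t.ChunkSize
	}
	return size
}
```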
App.Wait's single select between <-ctx.Done() and <-a.errCh was
non-deterministic when both channels were simultaneously ready: Go
randomizes among ready cases, so Wait returned nil ~50% of the time
and the head-of-queue listener error the other 50%. This contradicts
the function's documented contract ('returns nil if ctx was
canceled') and surfaced as a flaky TestApp_Wait_DrainsErrChOnCtxCancel
that pre-fills errCh and pre-cancels ctx.

Add a non-blocking pre-check that takes the shutdown branch
deterministically when ctx is already canceled at entry. Buffered
listener errors are still drained to the Warn log via drainErrCh;
only their effect on Wait's return value is suppressed in this
specific overlap. The main select remains unchanged for the
listener-error-during-normal-operation path.

Verified: TestApp_Wait_DrainsErrChOnCtxCancel passes 100/100 runs
after this change (previously ~50/50).
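Sketch of the pre-check (App fields assumed): an already-canceled ctx takes the shutdown branch before the randomized select can pick the error branch.

```go
import "context"

type App struct{ errCh chan error }

func (a *App) drainErrCh() { /* logs any buffered listener errors at Warn */ }

func (a *App) Wait(ctx context.Context) error {
	select {
	case <-ctx.Done(): // deterministic: shutdown wins if both are ready at entry
		a.drainErrCh()
		return nil
	default:
	}
	select {
	case <-ctx.Done():
		a.drainErrCh()
		return nil
	case err := <-a.errCh:
		a.drainErrCh()
		return err
	}
}
```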