Orca - ORigin CAche #124
Conversation
> breaker** (s10.2). Sustained `ErrAuth` (default 3 consecutive) flips
> `/readyz` to NotReady so load balancers drain the replica.
>
> ### 8.7 Metadata-layer singleflight
So if I'm reading this right, the doc proposes consistent hashing of the file/blob keyspace across Orca replicas, plus in-memory pipelining to batch concurrent requests for the same blocks.
Any concerns about hotspots if we're only running a few replicas? Since we have shared storage under the cache, it's feasible to avoid partitioning (at the cost of another network round trip for locking).
Worth highlighting that partitioning here is narrower than full-keyspace consistent hashing: rendezvous hashing only picks the miss-fill coordinator per ChunkKey (s1 row "Cluster", s8.3). On cache hits every replica reads the chunk directly from the shared CacheStore, so the steady-state hot-key path is not partitioned and doesn't hotspot a single replica. Coordination only matters during the (singleflighted) miss window for a given chunk.
Given that, and given your follow-up on the eviction thread re: keeping access counters in memory, I think we're aligned on sticking with rendezvous + a small replica count for v1. A shared-storage lock to fully avoid partitioning is captured implicitly by the deferred cluster-wide singleflight/limiter items in s15.2 / s15.5; happy to add an explicit "shared-storage lock" alternative there if you'd like.
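The coordinator-selection scheme under discussion can be sketched as follows. This is a minimal illustration of rendezvous (highest-random-weight) hashing, not Orca's actual implementation; the peer-ID format and the FNV hash choice are assumptions for the sketch:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// coordinator picks the miss-fill coordinator for one chunk key via
// rendezvous hashing: every replica scores hash(peer, key) and the
// highest score wins. Each replica computes the same answer
// independently, so no extra round trip is needed to agree on a
// leader, and only the miss window for a chunk is partitioned.
func coordinator(peers []string, chunkKey string) string {
	var best string
	var bestScore uint64
	for _, p := range peers {
		h := fnv.New64a()
		h.Write([]byte(p))
		h.Write([]byte{0}) // separator so ("ab","c") != ("a","bc")
		h.Write([]byte(chunkKey))
		if s := h.Sum64(); best == "" || s > bestScore {
			best, bestScore = p, s
		}
	}
	return best
}

func main() {
	peers := []string{"10.0.0.1:9090", "10.0.0.2:9090", "10.0.0.3:9090"}
	fmt.Println(coordinator(peers, "bucket/key/etag/chunk-0"))
}
```

Because the score is a per-peer maximum, the answer is independent of peer ordering, and removing one peer reassigns only that peer's keys.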
> access-frequency-driven eviction loop. This is the recommended
> posixfs path when external sweep tooling is impractical.
>
> ### 13.2 Active eviction (opt-in, access-frequency)
Do we definitely need eviction? I assume the remote dataset might be larger than in-region storage, but worth confirming
I think @jwilder has mentioned that the origin data set is going to be bigger than what the origin cache data store can hold, at least some of the time.
It feels like we need some kind of eviction, but maybe we need more clarity on the requirement.
Cool, this approach checks out then. Being able to keep the counters in memory might justify sticking with consistent hashing even if it comes at the cost of some hotspotting across replicas. So I guess that mostly addresses both of my remaining comments. 🙃
As long as we run fewer, "larger" replicas of this service I think we're good to go 👍
…rred optimization
Add the orca origin cache binary (cmd/orca, internal/orca) that fronts S3 / Azure Blob origins with a chunked, rendezvous-hashed, in-DC cache backed by an S3-compatible store (LocalStack in dev, VAST or similar in production).

Includes:
- Per-replica fetch coordinator with cluster-wide singleflight collapse via rendezvous-hash coordinator selection plus an /internal/fill RPC for cross-replica fan-in.
- cluster.PeerSource abstraction (DNS-backed in production, mutable StaticPeerSource for tests) with Peer.Port to support multiple replicas sharing an IP under test.
- internal/orca/app factory exposing Start/Shutdown plus options for injecting alternate origin / cachestore / peer-source / internal handler wrap (used by tests).
- Integration suite (internal/orca/inttest, build tag integrationtest) driven by testcontainers-go: 7 scenarios against real LocalStack + Azurite covering cold/warm GET, ranged GET with chunk-boundary edge cases, 64-chunk multi-chunk GET, rendezvous routing, singleflight collapse with a 64 <= origin GetRanges <= 76 bound, and a real membership-disagreement fallback test.
- Unit tests covering driver branches (cachestore versioning gate, azureblob blob-type gate, server error mapping + handler XML + range parser + path split + headers, chunk arithmetic + path determinism, config env-var fallback, manifest YAML validity).
- Deploy manifest templates (deploy/orca/) defaulting to the unbounded-kube namespace, and an extracted reusable hack/cmd/render-manifests/render package consumed by both the CLI and the manifest validity test.

Adds a make orca-inttest target and a parallel CI job. Docs for the dev harness and integration suite are intentionally excluded from this commit.
```go
in := &s3.ListObjectsV2Input{
	Bucket:  aws.String(b),
	Prefix:  aws.String(prefix),
	MaxKeys: aws.Int32(int32(maxResults)),
}
```
```go
cc := a.client.ServiceClient().NewContainerClient(cName)
max := int32(maxResults)
```
Remove all references to design-doc sections (e.g. design.md s8.3, Scope A+B, Option D) from orca source comments. The design doc sections will change independently of the code; comments need to stand on their own. Where a section reference was redundant (the surrounding text already explained the concept), the reference is dropped. Where a section reference carried load (it was the only rationale), it is replaced with the actual rationale inline. Jargon like 'Scope A+B' and 'Option D' is replaced with concrete descriptions of what the code actually does. 12 files changed, 22 individual comment edits. Comment-only changes; no behavior change.
Remove dead code and unexport identifiers that have no callers outside their own package. Every exported identifier now has at least one external user.

Deleted:
- cluster.Mu (unused sync.Mutex scaffold)
- origin.ErrThrottle (never returned or matched)
- chunkcatalog.(*Catalog).Len() (claimed test helper, no test uses it)
- OriginETagChangedError.Got (never set; only formatted in Error())
- cluster.NewDNSPeerSource (orphan wrapper around internal newDNSPeerSource)
- cluster.WithResolver and app.WithResolver (no external callers; WithPeerSource is the actual test seam)

Unexported:
- metadata.(*Cache).Lookup -> lookup (only called by LookupOrFetch in the same file)
- cluster.(*Cluster).Self -> self (only called by Coordinator in the same file)

Updated server_test.go to drop the Got field reference in the OriginETagChangedError struct literal used by TestWriteOriginError. No behavior change.
|
@copilot perform a review of this PR
Record a code-review pass over internal/orca and cmd/orca, plus an adversarially-reviewed remediation plan organized into prerequisite plumbing, must-fix correctness bugs, should-fix concerns, and cleanup tiers. Subsequent commits address the items in order.
Swap the defer order so c.sf.Delete(k) runs before close(sfe.done).
A second caller arriving after Delete creates a fresh entry instead
of silently replaying our (possibly transient-error) result.
Without the fix:
1. Leader L finishes fetch with a transient error.
2. recordResult returns early (transient errors are not cached).
3. defer closes done, then Delete runs.
4. Between close(done) and Delete, caller B does LoadOrStore and
gets L's stale entry. B's once.Do is a no-op; B waits on the
already-closed done and silently returns L's transient error
without calling fetch.
After: B finds no entry, creates a fresh one, runs fetch. The
benign side effect is a brief overlap window where a new fetch can
start while the previous leader's done is closing; cluster-wide
dedup mitigates the duplicated-fetch risk.
Adds metadata_test.go covering positive caching, ErrNotFound
negative caching, transient-error non-caching (the regression test),
and concurrent-joiner collapse via a synchronisation gate.
Previously cluster.refresh wiped a known-good multi-peer snapshot
with [Self] on any discovery error, causing every chunk's rendezvous
coordinator to become Self for at least one refresh interval (5 s in
prod). That broke cluster-wide dedup until the next successful
refresh.
Now refresh distinguishes:
- discovery error with a previous snapshot AND under the staleness
ceiling (maxStalePeerRefreshes = 5 consecutive errors): retain
the previous snapshot, log + bump consecutiveRefreshErrors.
- discovery error without a previous snapshot (bootstrap), OR
after the staleness ceiling: fall back to [Self]. The ceiling
bounds how long we route to dead peers if discovery is
permanently broken; cluster.FillFromPeer's
ErrPeerNotCoordinator fallback absorbs brief stale-routing.
- discovery success with zero peers (legitimate 'I'm alone'):
fall back to [Self], do not bump the error counter.
Adds cluster_test.go covering all three branches via a
fakePeerSource that can be toggled between healthy and erroring.
Bundle of low-risk cleanups from the orca review: - B6 cluster.DecodeChunkKey rejects chunk_size <= 0 and index < 0. - Q1 cluster.IsCoordinator drops the dead IP+Port fallback; the Self flag is the single source of truth (stamped by every PeerSource and by the empty-peer-set fallback). - Q2 chunkcatalog and metadata drop dead-defensive type-assertion error returns. The lists are private; we control every value inserted, so direct assertions are safe. - Q3 rename skipCacheSelfTst -> skipCacheSelfTest in app.options. - Q4 delete the _ = cachestore.ErrNotFound / _ = context.Canceled import guards in server.go and the now-unused cachestore import. - Q6 fetch.fetchWithRetry checks ctx.Err() at the top of each loop iteration so a pre-cancelled context returns immediately instead of issuing a wasted GetRange. - Q7 cluster.Close(ctx) is now ctx-aware; callers can cap the wait on an in-flight DNS lookup. app.Shutdown threads its ctx through. - Q8 app.WithEdgeListener / WithInternalListener are explicitly marked TEST-ONLY in doc comments to deter production use. - Q9 origin.go gets a comment block documenting which external conditions map to ErrNotFound / ErrAuth across drivers. - C3 inttest CountingInternalHandlerWrap's responseWriter forwards Flush() to the embedded http.ResponseWriter so wrapping a handler that flushes (currently only the edge handler) does not silently degrade to buffered responses. - M2 app.Wait drains any pending listener error from errCh after ctx.Done so a shutdown-time listener failure shows up in logs rather than being silently dropped. All comment-and-API tightening; no functional behavior change.
cachestore/s3.PutChunk used to do io.ReadAll on the input reader even when it was already an io.ReadSeeker (which is what the AWS SDK actually needs for unsigned-payload retries). The caller in fetch.runFill passes a bytes.NewReader(buf.Bytes()) - that's a seekable reader sitting on top of the chunk buffer. Reading it all again into a fresh byte slice doubled the chunk's memory footprint during the put. Now PutChunk type-asserts io.ReadSeeker; if present it passes the reader straight to the SDK. The io.ReadAll fallback handles non-seekable readers, where we still need to materialise the body for the SDK's rewind contract. With 8 MiB chunks and target_per_replica=64 concurrent fills, this removes ~512 MiB of duplicate buffering at saturation.
…P0 + B1)

The fetch coordinator did not know the authoritative object size, so it could not (a) clamp cachestore reads to the per-chunk length on the tail chunk, (b) detect a short-body response from origin during a fill, or (c) set Content-Length on the internal-fill response. All three issues converge on the same prerequisite: pass info.Size from the edge handler down through the cluster RPC.

Changes:
- cluster.encodeChunkKey/DecodeChunkKey carry an object_size query parameter on the internal-fill RPC. DecodeChunkKey returns (chunk.Key, int64, error); object_size is validated >= 0.
- cluster.FillFromPeer takes objectSize int64.
- fetch.Coordinator.GetChunk, FillForPeer, fillLocal, and runFill all take objectSize int64.
- fetch adds an expectedChunkLen helper that returns min(ChunkSize, objectSize - off) for the requested chunk, or 0 for chunks past the end.
- runFill now requests expectedChunkLen bytes from origin (not k.ChunkSize), validates io.Copy's result against expectedChunkLen before commit, and fails the fill with a retryable-looking error on mismatch. This is B1 from the review: a flaky origin returning short bytes is rejected instead of permanently poisoning the catalog with the short length.
- The cachestore-hit fast path in GetChunk and FillForPeer clamps the read length to expectedChunkLen so even a too-large cachestore stat does not over-read past the object end.
- server.edgeFetchAPI.GetChunk and server.internalFetchAPI.FillForPeer signatures gain the objectSize parameter; the edge handler passes info.Size, the internal handler passes the decoded object_size.
- server_test.go fakeEdgeAPI updated to match.

When objectSize is 0 (unknown - the cluster-RPC path with old peers during a rolling upgrade) the validation is a no-op and behavior matches the pre-change shape. Production callers always provide info.Size from origin Head.
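The chunk-length arithmetic is small enough to sketch directly. This mirrors the described helper (min of the chunk size and the remaining bytes, 0 past the end) under an assumed signature, not necessarily the real one:

```go
package main

import "fmt"

// expectedChunkLen: the byte count chunk index idx should hold for an
// object of objectSize bytes split into chunkSize-byte chunks. Tail
// chunks are clamped; chunks past the end are 0.
func expectedChunkLen(chunkSize, objectSize, idx int64) int64 {
	off := idx * chunkSize
	if off >= objectSize {
		return 0
	}
	if rem := objectSize - off; rem < chunkSize {
		return rem // tail chunk
	}
	return chunkSize
}

func main() {
	const mib = int64(1) << 20
	// 20 MiB object in 8 MiB chunks: 8 MiB, 8 MiB, 4 MiB, then 0.
	for idx := int64(0); idx < 4; idx++ {
		fmt.Println(idx, expectedChunkLen(8*mib, 20*mib, idx)/mib)
	}
}
```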
The internal-fill handler now sets Content-Length to chunk.Key.ExpectedLen(objectSize) on successful responses. This lets net/http on the requesting peer surface mid-stream truncations as io.ErrUnexpectedEOF instead of treating a short connection-close as a clean EOF. cluster.FillFromPeer additionally wraps the body in a defense-in-depth validator that re-checks the byte count, making the contract explicit at the call site. Extracts the per-chunk expected length into a chunk.Key.ExpectedLen method so both producer and consumer compute it identically.
Previously the edge handler called WriteHeader(200) before any chunk was fetched, so a first-chunk failure (origin 404, auth error, or a cachestore body whose first Read failed) could only be surfaced by aborting the connection mid-stream. Clients saw a half-written 200 that they had to parse as a transport error. The handler now fetches the first chunk and peeks one byte before committing any status. Failures up to that point flow through writeOriginError as proper S3-style XML responses. Subsequent chunk failures (chunks 1..N) remain mid-stream aborts since headers are already on the wire.
The azureblob driver always rejects PageBlob and AppendBlob, and the
cachestore/s3 driver always runs the bucket-versioning gate. The
matching YAML knobs (origin.azureblob.enforce_block_blob_only,
cachestore.s3.require_unversioned_bucket) were previously
force-true via 'if !X { X = true }' but the shape implied operators
could disable them; in reality the safety guarantees orca depends on
(immutable chunked cache contract, no-clobber atomic commit) are
broken without them.
Removes both fields from config.Azureblob and config.CachestoreS3,
makes the enforcement and gate unconditional in their respective
drivers, and strips the keys from the deployed configmap template.
Existing YAML files that set either field will fail to parse on next
deploy, which is the intended clean-break signal.
A GET on a zero-byte object previously computed rangeEnd = info.Size - 1 = -1 and fell into the rangeStart > rangeEnd guard, returning a spurious 416 for what is a successful empty-body fetch under any sane S3-compatible reading. Add an explicit size==0 short-circuit that writes 200 + Content-Length: 0 before the range-parsing flow. A Range request against a zero-byte object remains a 416 per RFC 7233.
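The status decision reduces to a small table. A sketch of the short-circuit under an invented helper name (the real check sits inline before the range-parsing flow):

```go
package main

import (
	"fmt"
	"net/http"
)

// zeroByteGetStatus: a plain GET of a zero-byte object is 200 with an
// empty body, while an explicit Range request against it stays 416 per
// RFC 7233. The buggy path computed rangeEnd = size-1 = -1 and tripped
// the start > end guard even with no Range header present.
func zeroByteGetStatus(size int64, rangeHeader string) (status int, handled bool) {
	if size != 0 {
		return 0, false // fall through to normal range parsing
	}
	if rangeHeader != "" {
		return http.StatusRequestedRangeNotSatisfiable, true // 416
	}
	return http.StatusOK, true // 200, Content-Length: 0
}

func main() {
	s, _ := zeroByteGetStatus(0, "")
	fmt.Println(s) // prints 200
	s, _ = zeroByteGetStatus(0, "bytes=0-0")
	fmt.Println(s) // prints 416
}
```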
The cluster's internal-RPC HTTP client previously carried Client.Timeout: 60s, a request-total wall clock that aborted any internal-fill body stream taking longer than 60s. An 8 MiB chunk on a degraded inter-pod link can exceed that bound; the abort was indistinguishable on the requester side from the legitimate mid-stream truncation case we plumbed Content-Length validation against. Drop the wall timeout and rely on the caller's ctx (the edge request ctx for client-driven fills, the 5-minute detached fill ctx in fetch.runFill for leader-side ones) as the sole deadline. Adds WithHTTPClient as a test seam (production never sets it). Adds direct white-box assertion that Client.Timeout is zero and a behavioural test confirming ctx deadlines still terminate slow peer responses.
The azureblob driver strips entity-tag quotes when reading the ETag on Head (orca's internal representation is unquoted) and then passed that unquoted value straight back through azcore.ETag for If-Match on GetRange. RFC 7232 requires entity-tags to be quoted-strings; Azure tolerated unquoted values in practice but the contract was inconsistent and diverged from the awss3 driver, which already re-wraps in quotes (awss3.go). Re-wrap on egress. Add httptest-backed regression tests asserting the inbound If-Match header is quoted when an etag is supplied and absent when the etag is empty.
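The egress re-wrap is a one-liner worth pinning down. A sketch under an invented helper name (the drivers each implement their own variant):

```go
package main

import (
	"fmt"
	"strings"
)

// quoteETag: orca stores entity-tags unquoted internally, but RFC 7232
// requires the wire form (If-Match, If-None-Match) to be a
// quoted-string. An empty etag stays empty so no conditional header is
// emitted at all.
func quoteETag(etag string) string {
	if etag == "" {
		return ""
	}
	if len(etag) >= 2 && strings.HasPrefix(etag, `"`) && strings.HasSuffix(etag, `"`) {
		return etag // already quoted
	}
	return `"` + etag + `"`
}

func main() {
	fmt.Println(quoteETag("0x8DCABC12345"))
}
```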
The driver previously mapped backend errors by matching service error codes and by substring of err.Error(). Three concrete bugs:
- isPreconditionFailed matched 'InvalidArgument' and 'ConditionalRequestConflict' (neither is the AWS S3 contract for If-None-Match: * conflicts) and fell back to strings.Contains(err.Error(), '412'), which fires false-positive on any error message containing '412' in any context.
- mapErr detected 5xx via strings.Contains(err.Error(), 'StatusCode: 5') against the SDK's err.Error() format - a presentation detail with no stability contract.
- The driver carried a vestigial _ = http.StatusOK no-op to keep the net/http import alive.

Switch both predicates to *awshttp.ResponseError-based inspection of the underlying HTTP status code: 412 for precondition, 4xx (401 / 403) for auth, 5xx for transient. The 'AccessDenied' / similar APIError-code branch is retained because the SDK surfaces these as typed service errors carrying stable codes.

Verified against the orca integration suite: LocalStack 3.8 returns 412 for the SelfTestAtomicCommit duplicate-put path, so dropping the legacy matches is safe.
Adds slog.LogAttrs debug emissions at every origin operation for diagnostic visibility into upstream calls:
- Head: request + response (size, short etag) + the three error mapping branches (not_found, auth, generic).
- GetRange: request (bucket, key, short etag, off, n) + response + etag-changed / not-found / auth branches.
- List: request (bucket, prefix, marker, max) + response (count, truncated) + auth error.

Adds an origin.ETagShort helper that truncates entity-tags to the first 8 characters for log-line readability. ETags are not secrets, but their full Azure / AWS form is long enough to make grep output noisy; the prefix is sufficient for matching one fill against another while keeping the log handler's output narrow. Object keys and bucket names are logged in full because they are part of the operator's diagnostic context.
EdgeHandler:
- ServeHTTP entry: edge_request with method, path, bucket, key, range, and source remote-addr.
- handleHead exit: edge_head_response with size and truncated etag.
- handleGet: edge_get_plan (range/first/last chunk indices), edge_get_empty_object short-circuit trace, edge_get_chunk_next per chunk-stream transition, edge_get_complete on success.

InternalHandler:
- internal_fill_request on every accepted RPC with chunk attrs, object_size, and source remote-addr.
- internal_fill_not_coordinator trace for the 409 fallback path (operator visibility into membership-disagreement scenarios).
- internal_fill_complete on success with delivered bytes.

Migrates the 6 existing Warn callsites in server.go and the 5 existing Info / 5 existing Warn callsites in app.go to slog.LogAttrs so they share the same attribute taxonomy as the new Debug emissions and get zero-cost attribute evaluation when filtered out. Adds a debugLoggerTo test helper and a representative emission test asserting the edge_request + edge_head_response trace pair.
Cluster.refresh now drives peer-set storage through a storePeerSet helper that emits:
- Debug 'peer_set_refreshed' per refresh cycle with the new count and a 'reason' attribute (discovery_ok / empty_discovery_self_only / self_only_fallback).
- Info 'peer_set_initial' the first time a snapshot is stored.
- Info 'peer_set_changed' on every subsequent transition where the rendered (ip, port) set differs from the previous snapshot, with added/removed lists for diagnostic visibility.

Stable refreshes (no membership delta) emit only the per-cycle debug line. This is how operators distinguish 'discovery healthy but quiet' from 'membership churn' in the logs.

Adds a Coordinator debug emission with chosen-peer IP, is_self bool, and rendezvous score - the per-chunk routing decision visible in the structured log when level=debug. FillFromPeer emits request + response (or not-coordinator) debug traces so peer-fill round-trips are observable. Migrates the one existing Warn callsite in refresh's retain-snapshot branch to LogAttrs.
Documents the debug-logging work that landed in the second-pass remediation:
- The cfg.Logging.Level knob and ORCA_LOG_LEVEL env override.
- The HandlerOptions.AddSource: true convention and what it replaces (per-package With() tagging).
- Logger injection through every orca package that previously lacked one.
- The LogAttrs-everywhere convention with zero-cost-when-filtered guarantee on hot paths (notably chunkcatalog.Lookup).
- The cross-package 'chunk' attribute group taxonomy.
- The sensitive-data audit (no keys, no signed URLs, ETag truncation via origin.ETagShort).
- The operator workflow for enabling debug logging.
- Deferred follow-ups: per-request correlation IDs, Prometheus metrics, runtime log-level switching.
The Azurite Service in the dev manifest was ClusterIP, so reaching it from the host required a kubectl port-forward. The forthcoming hack/cmd/orcaseed tool runs outside the cluster and populates the origin container directly; switching to NodePort with a stable high-port (default 30100) lets the seeder talk to http://localhost:30100/devstoreaccount1/ once the dev cluster is up. The fixed port sits in the Kubernetes NodePort range (30000-32767). Two concurrent dev clusters on the same host would collide; the renderer accepts an AzuriteNodePort override (wired through hack/orca/Makefile via AZURITE_NODE_PORT). The Kind config on the operator side also needs an extraPortMapping for 30100 to expose the NodePort to the host loopback (Kind does not route NodePorts to the host without explicit mappings).
Adds a small Go binary used by the Orca dev harness to populate the
in-cluster Azurite origin container with synthetic or operator-
supplied content. Four subcommands sharing the same connection
flags (default endpoint targets the dev harness's NodePort 30100):
generate N synthetic random-byte blobs of size S each (with
--seed for reproducible content; per-blob ceiling 1 GiB
unless --force; warns if size*count > 1 GiB).
upload a single file from disk (--file, optional --name).
list blobs currently in the container, optional --prefix.
delete blobs matching --prefix (default all), interactive
confirmation unless --yes.
The well-known Azurite dev account-key is the default --account-key
value; it is a public Microsoft-documented constant baked into
Azurite, not a secret.
Tests cover the human-readable size parser (every accepted suffix +
the error paths) and a deterministic-seed end-to-end generate run
against an httptest-backed fake Azurite that captures the uploaded
bodies and asserts byte-identical output across two runs of the
same --seed.
Captures the minimal recipe for bringing up the Orca dev cluster in Azurite mode with debug logging enabled from t=0, seeding the origin with hack/cmd/orcaseed, exercising the cache, and watching the per-chunk structured-log trace. Walks through the eight-step path: .env setup, make orca-up, seed-generate / seed-upload, port-forward, curl, jq-filtered log tailing, iterate, tear down. Plus a cheat-sheet of the helper targets and a pointer to the in-process integration suite for folks who don't need Kind. Complements the longer dev-harness.md reference; this file is the quickstart path.
Brings the Orca dev-harness tooling into git as a coherent unit
alongside the already-tracked quickstart.md, and applies the
cleanup-pass findings identified during the consistency review.
Files newly tracked (previously local-only):
Makefile - dev-harness orchestration (kind, render,
deploy, seed, reset, etc.).
kind-config.yaml - 1 control-plane + 3 workers; first worker
forwards NodePort 30100 to host loopback
for orcaseed -> Azurite reachability.
kind-create.sh - idempotent kind cluster bring-up.
kind-load.sh - sideload the orca image into kind nodes.
down.sh - kind cluster teardown.
deploy-credentials.sh - build the orca-credentials Secret from .env.
dev-harness.md - long-form reference for the dev harness.
inttest.md - companion docs for the in-process
integration suite.
.gitignore - excludes .env + rendered-dev/ output.
Cleanup applied in this commit:
- F1: seed-azure.sh deleted. Its functionality is fully covered by
hack/cmd/orcaseed; the Makefile's seed-azure target now invokes
'orcaseed upload' against real Azure with credentials sourced
from .env. Drops a 76-line shell script plus the dependency on
the mcr.microsoft.com/azure-cli container image.
- F5: dev-harness.md trimmed where it duplicated quickstart.md
(Seed sample data / Exercise the cache / Logging now point at
the quickstart). dev-harness.md shrank from 381 to 333 lines.
Reference material (origin-mode table, troubleshooting,
'NOT covered' list, switching-to-real-Azure) stays.
- F6: removed the stale 'sample-get.sh' reference.
- F8: .PHONY in Makefile lists deploy-azurite-maybe; deploy-orca
recipe carries a comment explaining the Service-before-Deployment
apply ordering (headless DNS must exist before pods start so the
initial cluster.refresh sees the full peer set).
Skipped:
- F2: kind-create / kind-load / down kept as standalone scripts
rather than inlined into the Makefile (recipe readability wins
over file-count reduction).
- F7: orcaseed remains azblob-only; the awss3 LocalStack path is
rarely interactively driven and the kubectl-run + amazon/aws-cli
workaround is documented in dev-harness.md.
The 'object_size unknown' sentinel value (object_size == 0 in the internal-fill wire format, with a runFill fallback substituting k.ChunkSize for the request length and skipping the body-length validation) was dead code: production callers always plumb info.Size from a prior Head, so the sentinel was never legitimately set. Its existence created three reachable foot-guns:
- cachestore/s3.GetChunk(n=0) produced a malformed S3 Range header 'bytes=0--1', returning a 400 instead of an empty body.
- runFill validation was skipped, allowing an adversarial peer to commit wrong-sized blobs to the cachestore by sending object_size=0 over the internal-fill RPC.
- cluster.FillFromPeer's validatingReader was bypassed because the internal-fill handler omitted Content-Length when expectedLen=0, so a short stream from the peer was indistinguishable from a clean EOF.

Tighten the wire format and propagate the strictness inward:
- cluster.DecodeChunkKey: object_size must be > 0 (was >= 0).
- fetch.runFill: drop the requestLen==0 fallback; always pass the authoritative expectedLen to fetchWithRetry and always run the body-length validation.
- cachestore/s3.GetChunk: reject n <= 0 and off < 0.
- cachestore/s3.PutChunk: reject size <= 0.

Adds regression tests at the wire boundary and at the cachestore driver entry points. No production caller is affected (all callers already pass positive sizes); a peer running pre-fix code that sent object_size=0 would be rejected at decode time, which is the intended hardening.
The Catalog stored cachestore.Info for every recorded chunk, but the only caller (fetch.lookupOrStat) discarded it: the catalog-hit path used k.ExpectedLen(objectSize) for the GetChunk call, never reading catalog.info.Size. The defensive value of the stored Info was illusory: chunk.Path encodes (origin_id, bucket, key, etag, chunk_size), so under the design contract a path hit implies the cachestore contains bytes for this exact version of this chunk. Catalog-stored Info couldn't detect mutations to the cachestore that happened after Record (those are self-healing via ErrNotFound -> Forget), and contract violations within Path encoding are the wrong layer to defend.

Slim the catalog to a presence-only LRU:
- Catalog.Lookup returns bool, not (Info, bool).
- Catalog.Record takes only the chunk key; the prior Info argument is gone.
- The entry struct drops the info field.

Updates fetch.lookupOrStat and fetch.runFill (commit-success and commit-lost branches) for the new API. Net delta is a smaller catalog struct, one less alloc per Record, and clearer intent in the package docstring.
The previous ordering closed f.done in a deferred function that ran AFTER PutChunk returned, so joiners had to wait for both the origin fetch AND the cachestore commit before seeing bytes. The runFill comment described 'commit-after-serve' but the code was actually commit-before-serve - the latency cost was hidden in the joiner path on every cold fill.

Promote close(f.done) to an explicit call right after the body-length validation succeeds and f.bodyBuf is assigned. Use a sync.Once-wrapped release function so the deferred safety-net path (which catches panics) and the explicit success-path call are both safe to invoke; close(f.done) fires exactly once either way.

Correctness is preserved: the bytes.Buffer is fully populated and length-validated before release; the underlying byte slice is no longer mutated after io.Copy returns, so joiner reads of f.bodyBuf.Bytes() are safe to overlap with the PutChunk RPC's read of the same slice via bytes.NewReader.

Latency improvement: joiners now return as soon as the origin delivered bytes; the commit RTT is removed from the joiner path. A regression test exercises a slow PutChunk (a mock cachestore that blocks PutChunk until signaled) and asserts that fillLocal returns while PutChunk is still pending - which fails under the old ordering and passes under the new.
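The sync.Once release pattern can be sketched in isolation. The names (fill, runFill) echo the description above but this is an invented miniature, not the real fetch code:

```go
package main

import (
	"fmt"
	"sync"
)

// fill: done is closed via a sync.Once-wrapped release, called
// explicitly right after the body is validated, with the deferred call
// kept as a panic safety net. Joiners stop waiting as soon as bytes
// are ready; the commit RTT is off their path.
type fill struct {
	done    chan struct{}
	body    []byte
	release func()
}

func newFill() *fill {
	f := &fill{done: make(chan struct{})}
	var once sync.Once
	f.release = func() { once.Do(func() { close(f.done) }) }
	return f
}

func runFill(f *fill, fetch func() []byte, commit func([]byte)) {
	defer f.release() // safety net: fires even on panic or early return
	f.body = fetch()  // origin fetch + length validation would go here
	f.release()       // release joiners BEFORE the commit
	commit(f.body)    // cachestore PutChunk no longer blocks joiners
}

func main() {
	f := newFill()
	releasedBeforeCommit := false
	runFill(f,
		func() []byte { return []byte("chunk") },
		func([]byte) {
			select {
			case <-f.done: // already closed when commit starts?
				releasedBeforeCommit = true
			default:
			}
		})
	fmt.Println(releasedBeforeCommit) // prints true
}
```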
chunk.Path encodes the ETag in its SHA-256 hash input. A stable cache key requires the origin to supply one. If the origin returns an empty ETag - some S3-compatible backends with specific bucket policies, or custom origins that do not follow the AWS/Azure contract - then two different versions of the same (bucket, key) would alias to the same chunk.Path and orca would serve stale bytes silently on the next read after a mutation. Catch the misconfiguration immediately at Head time.

New sentinel origin.MissingETagError. fetch.HeadObject wraps the underlying origin.Head and rejects empty-ETag responses. metadata.recordResult caches the negative result under NegativeTTL so we do not re-Head a permanently-misconfigured origin on every request. server.writeOriginError maps MissingETagError to 502 with a descriptive 'OriginMissingETag' body so the operator sees the misconfiguration in the response, not just the logs.

Coordinator-level enforcement (a single policy point); the drivers themselves remain unchanged - they return whatever the upstream gave us. The coordinator decides whether the response is acceptable, and the metadata cache prevents repeated re-fetches.

Trade-off: misconfigured origins that worked today now fail loud. That is the intended hardening - empty-ETag origins are corruption hazards under orca's caching model.
The previous implementation used a shared *math/rand.Rand source serialised through a sync.Mutex when --seed was set. Concurrent upload goroutines interleaved reads from that single source in lock-acquisition order, which is non-deterministic. The first test worked only because it used --concurrency 1; under any --concurrency > 1 (including the default 4), two runs with the same --seed produced different per-blob bytes. Derive each blob's stream from (userSeed + blobIndex). Each blob has its own independent math/rand source, no shared state, no mutex. Determinism becomes a pure function of (--seed, blobIndex) and is invariant under upload-completion ordering. Updates the test to use --concurrency 4 (so the regression is caught) and adds a second test asserting per-blob streams differ within the same run (so --prefix-0 and --prefix-1 are not byte- identical under the same seed).
The previous fallback to the well-known Azurite dev key applied whenever AZURE_STORAGE_KEY was empty, regardless of AZURE_STORAGE_ACCOUNT. An operator with AZURE_STORAGE_ACCOUNT=mycompanyacct and a missing AZURE_STORAGE_KEY got the Azurite dev key silently injected into the orca-credentials Secret, and Orca pods spent a startup window 401'ing against the real account before the operator figured out why. Tighten the gate: the Azurite dev-key fallback applies only when AZURE_STORAGE_ACCOUNT is empty OR equals 'devstoreaccount1' (the canonical Azurite account name). For any other account, a missing AZURE_STORAGE_KEY is a hard error with a clear message naming the account, so the misconfiguration surfaces immediately.
http.Client.Timeout was deliberately removed earlier so the caller's ctx could bound long-running fill body streams (8 MiB chunks on slow links can exceed any hardcoded value). The trade-off: a stuck TCP SYN or stalled TLS handshake against a half-failed peer would hang until the caller's ctx fired - the full 5-minute fill ctx for leader-side fills. Over time those stuck connects can saturate the origin semaphore with cancelled work.

Bound the connection-establishment surface independently of the body-read deadline:
- Transport.DialContext with a 10s connect timeout (via net.Dialer).
- Transport.TLSHandshakeTimeout: 10 seconds.
- Transport.ExpectContinueTimeout: 1 second.

Body reads still rely on the caller's ctx, preserving the arbitrary-size-on-slow-link contract. Connect-level latency is now bounded regardless of caller ctx, so stuck-SYN cancelled work no longer hangs the originSem.

Test asserts DialContext is non-nil and TLSHandshakeTimeout is bounded; existing TestNewHTTPClient_NoWallTimeout still locks the body-read contract.
Only the non-seekable path validated len(buf) == size. The seekable-reader path (production: bytes.NewReader(buf.Bytes()) from runFill, or os.File from orcaseed) trusted the caller's claimed size and handed it straight to S3 with ContentLength=size. A buggy caller passing a 100-byte reader with size=1024 (or vice versa) either got rejected by S3 (ContentLength mismatch) or uploaded a truncated / overlong blob, depending on backend behaviour.

Add a seek-and-check probe at the driver entry point: Seek(0, End) to discover the actual length, compare it to the declared size, then Seek(0, Start) to rewind for the upload. The probe is cheap on both bytes.Reader and *os.File (the production callers) and catches the size mismatch before any RPC.

Regression test passes a bytes.Reader with the wrong length and asserts PutChunk errors out before reaching S3.
The ctx-cancel branch already drained any concurrently-arriving listener errors before returning nil. The errCh-arrived-first branch returned the first error and left the rest in the channel; subsequent errors (e.g. a multi-listener crash where two listeners errored within the same tick) sat in the buffered channel until the goroutines exited at Shutdown time, by which point nothing read errCh and the errors were silently dropped.

Extract the drain into a small helper and call it on both Wait return paths. The first-error case still returns the first error as before; additional errors are now logged at Warn so operators see the full picture of a multi-listener failure.

Test pre-buffers three errors and cancels ctx, asserting all three appear in the captured log output.
- M-1: tighten the chunk.IndexRange docstring. The previous comment noted the negative-end clamp but didn't lay out the full input contract: chunkSize > 0 is guaranteed by config validation; empty objects short-circuit upstream and never reach IndexRange; the clamp is a guard against an arithmetic bug, not a supported empty-range encoding.
- M-7: the orcaseed delete confirmation prompt now distinguishes stdin-closed (EOF) from generic read errors. The message points the operator at --yes for non-interactive contexts instead of the opaque 'read confirmation: EOF'.
- Append a 'Third-pass findings and remediation' section to design/orca/review.md summarising the ten commits in this pass, with one-line descriptions per finding plus the deferred-items list with rationale.
Three rounds of review-driven hardening landed substantial
behavioural changes since the design docs were last touched, plus
several broad features that had been described as v1 scope were
explicitly deferred. The result was a design surface that no
longer mapped onto the shipped code.
- Rewrite design.md from scratch around what shipped. Drops the
spool / tee / posixfs / localfs / circuit-breaker / LIST-cache /
prefetch / active-eviction / bounded-freshness / max_response_bytes
/ metrics surfaces that were never implemented; redraws the
retained mermaid diagrams to match the in-memory bytes.Buffer +
sync.Once-wrapped release + commit-after-serve flow; adds a
Deferred / future work section consolidating the not-yet-shipped
design ideas (including the design-intent items from the prior
review passes).
- Surgical edits to brief.md to remove the same deferred-feature
references, redraw Diagrams A and B, and rewrite the top-risks
list around what an operator would actually hit today.
- Delete plan.md (1554 lines): the phased-delivery narrative no
longer maps onto shipped code; the repo-layout section was 50%+
stale; the approval-checklist semantics make no sense for a doc
describing shipped code.
- Delete review.md (753 lines): its design-intent deferred items
are now consolidated into design.md s15; the audit-trail value
lives in git history.
- Update hack/orca/dev-harness.md cross-link to point at
design.md's Deferred / future work section instead of plan.md.
- Fix a stale parameter docstring on chunkcatalog.Record: the
comment referenced a removed 'info' argument; the catalog has
been presence-only since the third review pass.
Verified by re-reading the new design.md against
internal/orca/{fetch,cluster,cachestore,server,app,chunk,
chunkcatalog,metadata,origin,config}/, internal/orca/inttest/,
and deploy/orca/. make and make orca-inttest are green.
GitHub's mermaid renderer treats ';' as a statement separator in sequence diagrams, not as message text. Three messages in the cold-miss local-coordinator diagram contained semicolons, producing 'expected arrow, got NEWLINE' parse errors and a non-rendering diagram:

    R->>R: ChunkCatalog miss; Stat miss
    R->>R: Peek(1); commit headers
    SF->>CS: Stat(k); Record on success

Replaced ';' with ', then' or ',' in each case so the message text reads naturally and the parser sees a single statement per line. Also declared the missing 'Cat as ChunkCatalog' participant so it renders next to its peers instead of being auto-created at the right edge.

Confirmed no other mermaid block in design.md or brief.md contains semicolons inside message text.
The Internal interfaces section duplicated 30 lines of Go that mirrored internal/orca/origin/origin.go and internal/orca/cachestore/cachestore.go byte-for-byte, plus ~30 lines of bullets enumerating specific method signatures. Both were guaranteed to rot on any refactor.

- Delete the section in full.
- Renumber design.md sections 8-15 down to 7-14 (and matching subsections / TOC).
- Insert a 'Named seams' table at the top of the new s7 Stampede protection (former s8) as a navigation aid: seam name -> file path -> one-line role.
- Update all in-doc and cross-doc anchors / section-number references in design.md and brief.md to match the renumbered structure (TOC, decisions table, terminology glossary, request flow, stampede-protection internal cross-refs, concurrency section, bounded staleness, create-after-404, eviction, deferred / future work).
- Rewrite the brief.md s4 framing that opened with 'formal Go interfaces in design.md s7' to point readers at internal/orca/ source files directly.

Load-bearing semantics that previously hid in section 7 - atomic no-clobber commit, If-Match identity contract, typed-sentinel error routing - were already documented in s2 (Decisions), s7.7 (Failure handling), and s9.2 (Typed cachestore errors); no information loss.

Verified every cross-reference still resolves against an existing heading by grepping the unique anchor URLs in each doc against the heading-derived anchors.
The 'Azure adapter: Block Blob only' section was unnecessary
detail given how much of its content is covered elsewhere:
- The Block Blob constraint is already stated in s2 Decisions
('Azure constraint | Block Blobs only').
- UnsupportedBlobTypeError handling is in s2 Decisions and the
HTTP error-code mapping in s6.3.
- Negative caching of unsupported-blob-type responses is
documented in s10 Create-after-404 and negative-cache
lifecycle.
- If-Match quoting is a driver-internal implementation detail
documented in the source.
- Delete the section in full (~20 lines).
- Renumber design.md sections 9-14 down to 8-13 (and matching
subsections / TOC).
- Update every cross-reference in design.md and brief.md to
match the renumbered structure.
Verified every anchor URL still resolves against an existing
heading; no broken links.
The two docs leaned on nominalized noun phrases, Latinate vocabulary, and dense compound-noun pile-ups that made them tedious to parse. A reader had to translate phrases like 'asymmetric defaults reflect asymmetric operational reality' into the underlying idea before reading on. This pass rewrites the prose throughout both docs to be readable by someone who is not steeped in the codebase:

- Active voice over passive.
- Verbs over nominalized abstractions.
- Plain words over Latinate equivalents.
- Shorter sentences (split at ~25 words).
- Hedging like 'load-bearing', 'essentially', 'intentionally', 'operational reality' removed or replaced with specifics.
- The point of each paragraph leads; the implementation walk follows.
- Each diagram now has a one-sentence caption describing what it shows.

Preserved without change:

- All section / subsection numbers and heading text (and therefore anchor URLs).
- All nine mermaid diagrams.
- All code identifiers, file-path references, and config-key names.
- The named technical concepts (singleflight, rendezvous hashing, pre-header retry, commit-after-serve, the typed sentinel errors) - these stay as grep handles, but each first use now carries a plain-language gloss.

Verified every cross-reference still resolves: the unique anchor URLs in design.md (17) and brief.md (15) all match real headings post-rewrite.
s8.2 Typed cachestore errors, s8.3 Range/sizes/edge cases, and s8.4 Readiness probe added detail that already lived elsewhere:

- s8.2's four-sentinel-error list duplicated content in s2 (the Atomic-commit decisions row), s6.3 (the HTTP error-code mapping table), s7.2 (commit-after-serve handling), and s7.7 (failure handling without re-stampede).
- s8.3's bullets all duplicated content in s5 (chunk model + ExpectedLen) and s6 (range parse, zero-byte short-circuit). The PutChunk seekable-vs-non-seekable length probe was driver implementation detail.
- s8.4's /readyz gate conditions and /healthz semantics were already stated in s4 (the Ops listener bullet); the /readyz predicate behavior under broken DNS is in s12 (Horizontal scale).

With those gone, s8 has only the atomic-commit content. Rename s8 to 'Atomic commit' to match what's left, drop the now-redundant '### 8.1 Atomic commit' subheading, and update the one in-doc cross-reference plus the two cross-references in brief.md.

Verified anchor URLs in both docs still resolve.
The brief had grown to 373 lines and was no longer fast to read. Reshape it as a one-screen orientation that lands the reader on design.md.

- Drop section 4 (Components). design.md s7's Named seams table is a better reference; the brief doesn't need to enumerate components.
- Drop section 7 (Cold-miss request + Diagram B). The same flow is in design.md Diagram 5; Diagram A is enough as an entry point for the brief.
- Compress section 5 (Five mechanisms): each mechanism is now one short paragraph, not a multi-paragraph walk-through. Detail lives in design.md.
- Convert section 8 (Top risks) from numbered paragraphs to a compact 4-column table: Risk, What goes wrong, Bound, Detail.
- Strip section 9 (Where to go next) to bare links; the destination section titles speak for themselves.
- Tighten the Problem, Goals/Non-goals, System-at-a-glance, and Backing-store-options sections.

Sections 5-9 renumber to 4-7 to fill the gaps left by the two deleted sections. About 198 lines now, down from 373 (a 47% cut).
Audit pass on internal/orca and cmd/orca surfaced a handful of swallowed errors and one dead-coded security configuration:

- server.handleList: log XML encode failures at warn instead of silently dropping them, matching the mid-stream treatment in the GET path.
- fetch.runFill: log Stat errors that follow ErrCommitLost so cachestore flapping is observable; the catalog still stays unrecorded.
- cachestore/s3.randHex: surface crypto/rand failures to the selftest caller instead of falling back to a time-based suffix that can collide on parallel boots.
- cluster.newHTTPClient: refuse to start when cluster.internal_tls.enabled=true. The TLS configuration was previously discarded with '_ = cfg', which would have silently downgraded production deployments to system-trust-store HTTPS instead of the configured CA / client cert.
- cmd/orca/orca.serve: propagate App.Shutdown errors to the process exit code so failed-shutdown signals reach kubelet probes and init systems.
Large-blob GETs previously paid one strictly-sequential cachestore
GetObject per chunk; at 8 MiB chunks a 700 GB object required
~90,000 serial round trips even when fully cached. This commit adds
two compounding throughput improvements:
- Dynamic chunk size via a tier ladder. chunk.SizeFor selects the
effective chunk size from a base value plus an ascending-threshold
list of {MinObjectSize, ChunkSize}. The base covers small objects;
each tier overrides for objects at or above its MinObjectSize.
Tiers are strictly-ascending and rejected at config validate time
if unsorted or out of bounds, so overlap and ambiguity are
structurally impossible. Default ladder: 8 MiB base, 64 MiB above
1 GiB, 128 MiB above 10 GiB. Backward compatible: empty tiers
preserves the prior global-only chunk size.
- Server-side parallel read-ahead. EdgeHandler.handleGet now spawns
a producer that runs up to chunking.readahead chunk fetches in
flight while the consumer streams the current chunk, with strict
ordered delivery via a per-job result channel. Cold-fill memory
remains bounded by the existing originSem; the readahead depth
bounds extra warm-path cachestore body buffers. Default depth 8;
readahead is a *int so the YAML can distinguish omitted (defaults
applied) from explicit 0 (disabled), restoring the prior strictly-
sequential path.
Cross-replica safety: the internal-fill RPC already carries
chunk_size per request, so rolling deploys with different tier
policies on different replicas remain correct.
Worked example for a 700 GB warm GET against the new defaults:
chunk count drops from ~90,000 to ~5,600 (16x fewer requests), with
up to 8 of those in flight in parallel.
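The tier-ladder selection can be sketched as a short walk over the ascending thresholds. `Tier` field names come from the commit text; `sizeFor`'s exact signature is a guess, and the defaults below match the stated ladder (8 MiB base, 64 MiB above 1 GiB, 128 MiB above 10 GiB):

```go
package main

import "fmt"

// Tier pairs a minimum object size with the chunk size used for
// objects at or above it.
type Tier struct {
	MinObjectSize int64
	ChunkSize     int64
}

// sizeFor walks the ascending tier ladder and returns the chunk size
// of the highest tier the object qualifies for, falling back to the
// base size. Tiers are validated strictly ascending at config time,
// so the first non-qualifying tier ends the walk.
func sizeFor(objectSize, base int64, tiers []Tier) int64 {
	size := base
	for _, t := range tiers {
		if objectSize >= t.MinObjectSize {
			size = t.ChunkSize
		} else {
			break
		}
	}
	return size
}

func main() {
	const MiB, GiB = int64(1) << 20, int64(1) << 30
	tiers := []Tier{{1 * GiB, 64 * MiB}, {10 * GiB, 128 * MiB}}
	fmt.Println(sizeFor(500*MiB, 8*MiB, tiers) / MiB) // 8
	fmt.Println(sizeFor(2*GiB, 8*MiB, tiers) / MiB)   // 64
	fmt.Println(sizeFor(700*GiB, 8*MiB, tiers) / MiB) // 128
}
```

An empty `tiers` slice leaves `size` at `base`, which is the stated backward-compatible behaviour.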
App.Wait's single select between <-ctx.Done() and <-a.errCh was
non-deterministic when both channels were simultaneously ready: Go
randomizes among ready cases, so Wait returned nil ~50% of the time
and the head-of-queue listener error the other 50%. This contradicts
the function's documented contract ('returns nil if ctx was
canceled') and surfaced as a flaky TestApp_Wait_DrainsErrChOnCtxCancel
that pre-fills errCh and pre-cancels ctx.
Add a non-blocking pre-check that takes the shutdown branch
deterministically when ctx is already canceled at entry. Buffered
listener errors are still drained to the Warn log via drainErrCh;
only their effect on Wait's return value is suppressed in this
specific overlap. The main select remains unchanged for the
listener-error-during-normal-operation path.
Verified: TestApp_Wait_DrainsErrChOnCtxCancel passes 100/100 runs
after this change (previously ~50/50).
Orca origin cache: implementation, hardening, and dev tooling
Summary
Adds the Orca origin cache as a complete v1: a horizontally-scaled
HTTP cache that fronts S3 / Azure Blob origins with chunked,
rendezvous-hashed, in-DC caching. Spans the design doc, the runtime
binary (`cmd/orca`), the in-process integration suite
(`internal/orca/inttest`), Kubernetes deployment manifests, and a
local dev harness with a Go-based seeding tool.
The PR is large because it combines:
- the v1 implementation itself;
- review passes (fixes, correctness tightening, and observability work);
- the dev harness (`hack/orca/`, `hack/cmd/orcaseed`) and its docs.
This description walks through what the system does, the major
components, and the high-leverage areas reviewers should focus on.
What Orca is
A read-only origin cache. Clients speak a minimal S3-compatible
surface (`GET`, `HEAD`, `LIST`) to Orca; Orca fans out to a configured
origin (AWS S3 or Azure Blob Storage), splits objects into fixed-size
chunks (default 8 MiB), commits them to an in-DC S3-compatible
cachestore (atomic via `If-None-Match: *`), and serves subsequent
reads from cache.
A cluster of Orca replicas coordinates via rendezvous-hashed
coordinator selection plus a per-replica singleflight: at any
given time, exactly one replica is doing the origin GET for any given
chunk, and all concurrent client requests for that chunk collapse
onto it. The coordinator either fills locally or streams from a peer
via an internal RPC (`GET /internal/fill`).

No disk spool. No metadata DB. Chunk identity is encoded as a
deterministic SHA-256 over (origin_id, bucket, key, etag, chunk_size)
so the cachestore key set is itself the source of truth.
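The identity scheme above can be sketched as a hash over the identity fields. `chunkPath` is illustrative, not orca's actual encoding: the delimiter choice, the inclusion of a chunk index, and the hex key layout are assumptions here:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// chunkPath sketches the deterministic identity scheme: hash the
// identity fields (plus a chunk index, assumed here) into a
// cachestore key. The exact field encoding in orca may differ.
func chunkPath(originID, bucket, key, etag string, chunkSize int64, index int) string {
	h := sha256.New()
	// A NUL delimiter between fields prevents ("ab","c") and
	// ("a","bc") from colliding.
	fmt.Fprintf(h, "%s\x00%s\x00%s\x00%s\x00%d\x00%d",
		originID, bucket, key, etag, chunkSize, index)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := chunkPath("aws", "b", "k", `"v1"`, 8<<20, 0)
	b := chunkPath("aws", "b", "k", `"v2"`, 8<<20, 0)
	// A new ETag (i.e. a new object version) yields a different chunk
	// key, so a mutated object can never alias onto stale cached bytes.
	fmt.Println(a != b, len(a))
}
```

Because the key is a pure function of the inputs, any replica can recompute it independently, which is what lets the cachestore key set act as the source of truth with no metadata DB.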
Major components
Runtime packages
- `internal/orca/server/` (including `/healthz` and `/readyz`)
- `internal/orca/fetch/`
- `internal/orca/cluster/`
- `internal/orca/metadata/`
- `internal/orca/chunkcatalog/`
- `internal/orca/cachestore/s3/`
- `internal/orca/origin/awss3/`
- `internal/orca/origin/azureblob/`
- `internal/orca/chunk/`
- `internal/orca/config/` (including `Logging.Level` + `ORCA_LOG_LEVEL` env override)
- `internal/orca/app/`

Tests
- Unit tests: math, error mapping, config defaults, manifest YAML validity.
- Integration tests (`internal/orca/inttest`, build tag `integrationtest`, target `make orca-inttest`): 7 tests against real LocalStack + Azurite containers via testcontainers-go, with N in-process orca instances wired to those backends. Covers cold/warm GET, cross-chunk ranges, 64-chunk multi-chunk, rendezvous routing, singleflight collapse with bounded origin-call counts, membership-disagreement 409 fallback, and azureblob end-to-end.
Deployment
- `deploy/orca/*.yaml.tmpl` - production manifests (Namespace, RBAC, ConfigMap, Service x2: headless peer Service + edge Service, Deployment with anti-affinity).
- `deploy/orca/dev/*.yaml.tmpl` - dev-only manifests (LocalStack, bucket-init Job, Azurite, container-init Job).
- `hack/cmd/render-manifests` - Go template renderer driven by repeatable `--set` flags.

Dev harness
- `hack/orca/Makefile` - orchestration for a 4-node Kind cluster (1 control-plane + 3 workers) with extraPortMappings for NodePort 30100 -> host loopback.
- `hack/orca/quickstart.md` - end-to-end recipe.
- `hack/orca/dev-harness.md` - long-form reference (origin modes, switching, troubleshooting).
- `hack/cmd/orcaseed/` - Go tool for populating the dev-cluster origin: `generate` (synthetic N x size random blobs with optional per-blob deterministic seeding), `upload` (file from disk), `list`, `delete`.

Local dev-cluster workflow

For reviewers who want to bring this up themselves: full walkthrough in `hack/orca/quickstart.md`.