[SPARK-57144][INFRA] Unify Coursier cache to a single key across all jobs #56201
Open
zhengruifeng wants to merge 4 commits into
Replace 8 distinct per-job Coursier cache keys (`$matrix.java-$matrix.hadoop-coursier-`, `pyspark-coursier-`, `sparkr-coursier-`, `docs-coursier-` (×2), `tpcds-coursier-`, `docker-integration-coursier-`, `k8s-integration-coursier-`) with a single `coursier-<hash>` key written exclusively by the `precompile` job and restored read-only (`actions/cache/restore`) by all consumers.

The near-duplicate per-job Coursier caches were consuming ~4.5 GB on master and ~5.2 GB on branch-4.x, leaving the 10 GB repo-wide cache budget almost entirely full. Old maintenance branches (4.0, 4.1, 4.2, 3.5) had their caches evicted before their next CI run and were always cold. With one writer per branch the footprint shrinks to ~1.4 GB, leaving room for all actively-maintained branches simultaneously.

The `precompile` job already builds with every profile (`-Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver -Pdocker-integration-tests -Pvolcano`), so its `~/.cache/coursier` is a superset of every consumer job's dependency closure.

Generated-by: Claude Code (claude-sonnet-4-6)
Make the `build` (Scala test matrix) job use `actions/cache@v5` instead of `actions/cache/restore` for the Coursier step, so it can act as a fallback writer when `precompile` is absent or its cache save fails.

When `precompile` succeeds, `build` gets an exact key hit on `coursier-<hash>` and GHA automatically skips saving (caches are immutable), so no duplicate entry is created. When `precompile` fails or is skipped (e.g. an infra-only PR where build=false in precompile's condition, or a transient precompile failure covered by continue-on-error), `build` runs the full SBT compilation, populates `~/.cache/coursier`, and seeds the cache for subsequent runs.

All other consumers (pyspark, sparkr, lint, docs, tpcds-1g, docker-integration-tests, k8s-integration-tests) remain read-only via `actions/cache/restore`, since they use the precompile artifact and do not do full SBT dependency resolution from scratch.

Generated-by: Claude Code (claude-sonnet-4-6)
Rename `pyspark-coursier-` to `coursier-` to match the unified key introduced in build_and_test.yml, so this workflow benefits from the shared cache pool instead of maintaining a separate per-workflow copy.

benchmark.yml (`benchmark-coursier-<jdk>-<hash>`) and build_python_connect*.yml (`coursier-build-spark-connect-python-only-`) are intentionally left unchanged: benchmark runs with a user-supplied JDK and has a hardcoded coursier path reference; build_python_connect is deliberately isolated to a connect-only subset build.

Generated-by: Claude Code (claude-sonnet-4-6)
Give the `build` (Scala test matrix) job a specific key `$runner.os-$matrix.java-$matrix.hadoop-coursier-<hash>` so that different OS/JDK/Hadoop combinations each maintain a tailored cache entry, falling back to the precompile superset (`coursier-<hash>`) when cold. The `precompile` job keeps the plain `coursier-<hash>` key since it has no OS/matrix dimension and is the shared base-level writer.

Generated-by: Claude Code (claude-sonnet-4-6)
What changes were proposed in this pull request?
Replace 8 distinct per-job Coursier cache keys with a single `coursier-<hash>` key in `.github/workflows/build_and_test.yml` and `python_hosted_runner_test.yml`:

- `precompile` and `build` (Scala test matrix): `actions/cache@v5`, so both can write `coursier-<hash>`. `precompile` is the primary writer (runs first, full dependency superset via all `-P` profiles). `build` is the fallback writer: when `precompile` is absent or its save fails, the first `build` matrix entry seeds the cache. When `precompile` did save it, `build` gets an exact key hit and GHA automatically skips the post-save (caches are immutable).
- All other consumers (`pyspark` ×9, `sparkr`, `lint`, `docs`, `tpcds-1g`, `docker-integration-tests`, `k8s-integration-tests`): converted to `actions/cache/restore@v5`, which is restore-only and never writes. `tpcds-1g` in particular only fires when SQL code changes and is skipped on the vast majority of runs, so its own Coursier cache entry would typically be LRU-evicted before the next run anyway.

Why are the changes needed?
1. Same-commit duplicates — ~0.01% apart by bytes.
Per-job keys let every consumer job re-save its own copy of effectively the same content. Measured on master:
The 145 KB delta exists because Coursier doesn't prune: on a cold run the test-matrix job restores the precompile superset via restore-key, runs tests (which resolve nothing beyond it), and its post-step re-saves a byte-for-byte copy under its own key. The per-module keys are not holding different dependency sets — they are holding copies of the same superset.
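The duplication mechanism described above can be sketched with a toy model. This is not GitHub's implementation: `ToyCache`, `run_job`, the `abc123` hash, and the `coursier-` restore-key prefix used for the old design are hypothetical names chosen for illustration, and the 1.4 GB entry size is only the approximate figure quoted in this description. The model assumes two behaviours: an existing key is never overwritten, and a job's post step saves under its primary key unless that key was hit exactly.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCache:
    """Toy stand-in for a repo cache; a key is immutable once saved."""
    entries: dict = field(default_factory=dict)  # key -> size in GB

    def restore(self, key, restore_keys=()):
        """Return the matched key: exact match first, then newest prefix match."""
        if key in self.entries:
            return key
        for prefix in restore_keys:
            matches = [k for k in self.entries if k.startswith(prefix)]
            if matches:
                return matches[-1]  # dicts keep insertion order, so last is newest
        return None

    def save(self, key, size_gb):
        """Write the entry only if the key does not exist; return True if written."""
        if key in self.entries:
            return False
        self.entries[key] = size_gb
        return True


def run_job(cache, key, restore_keys=(), read_only=False, size_gb=1.4):
    """Model one job: restore, then save unless read-only or an exact hit."""
    matched = cache.restore(key, restore_keys)
    if read_only or matched == key:
        return False
    return cache.save(key, size_gb)


HASH = "abc123"  # hypothetical stand-in for the <hash> suffix

# Old design: each consumer has its own key and falls back to the superset,
# then re-saves the same content under its own key.
old = ToyCache()
run_job(old, f"coursier-{HASH}")  # precompile writes the superset
for job in ("pyspark", "sparkr", "docs"):
    run_job(old, f"{job}-coursier-{HASH}", restore_keys=["coursier-"])

# New design: consumers restore the shared key and never write.
new = ToyCache()
run_job(new, f"coursier-{HASH}")
for _ in ("pyspark", "sparkr", "docs"):
    run_job(new, f"coursier-{HASH}", read_only=True)

old_total = round(sum(old.entries.values()), 1)
new_total = round(sum(new.entries.values()), 1)
print(len(old.entries), old_total)  # 4 5.6
print(len(new.entries), new_total)  # 1 1.4
```

In this sketch a second writer that uses the shared key (the role `build` plays as fallback) also adds nothing when `precompile` has already saved, because `run_job` returns early on an exact hit.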
2. Repo-wide 10 GB budget consumed by duplicates.
Duplicates from just two branches left no room for any other branch:
Old maintenance branches (branch-4.0, 4.1, 4.2, 3.5) had their caches evicted before their next scheduled CI run and were always cold.
3. Dep-upgrade burst amplifies the problem.
`pom.xml` / `plugins.sbt` are touched ~5–6 times per month on average, but upgrades cluster: on 2026-05-28 alone, 5 dependency upgrades merged in a single day (rocksdbjni, joda-time, gson, Jetty, zstd-jni). Each commit rolls the hash, so 5 consecutive CI runs each start with a cold Coursier cache. Under the old design each cold run raced to create ~5 new ~1.4 GB entries (~7 GB), immediately overflowing the budget and evicting the previous run's still-warm caches. Under the new design each cold run creates exactly 1 entry (~1.4 GB), so a burst of 5 dep-upgrade commits creates ~7 GB total, which is still within budget and does not cause the runs to evict each other.

Summary: with one writer per branch the per-branch footprint drops from ~4.5 GB to ~1.4 GB, fitting ~6 branches in the 10 GB budget simultaneously, and a burst of dep-upgrade commits no longer triggers a cascade of mutual evictions.
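The burst arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch using only the approximate figures quoted above (10 GB budget, ~1.4 GB per entry, ~5 entries per cold run under the old design, 5 commits in the burst); the function and constant names are illustrative, and real entry sizes vary.

```python
BUDGET_GB = 10.0    # repo-wide cache budget
ENTRY_GB = 1.4      # approximate size of one Coursier cache entry
BURST_COMMITS = 5   # dependency upgrades merged on the same day


def burst_footprint(entries_per_cold_run, commits=BURST_COMMITS, entry_gb=ENTRY_GB):
    """Total GB of new cache entries created by a burst of cold runs."""
    return round(entries_per_cold_run * commits * entry_gb, 1)


old_per_run = round(5 * ENTRY_GB, 1)  # old design: ~5 entries per cold run
old_burst = burst_footprint(5)        # old design over the whole burst
new_burst = burst_footprint(1)        # new design: 1 entry per cold run

print(old_per_run, old_burst, new_burst)  # 7.0 35.0 7.0
print(2 * old_per_run > BUDGET_GB)        # True: two old-design runs already overflow
print(new_burst <= BUDGET_GB)             # True: the whole burst fits
```

Under these assumptions the old design would need about 35 GB to keep every entry from the burst, so consecutive runs necessarily evict each other, while the new design's entire burst occupies what a single old-design cold run did.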
Does this PR introduce any user-facing change?
No. CI-only.
How was this patch tested?
YAML validates with `python3 -c "import yaml; yaml.safe_load(...)"`.

The correctness of the one-writer design relies on two GHA cache guarantees verified in prior CI runs:

- An exact hit on the primary key skips the post-step save (the log reads `Cache hit occurred on the primary key …, not saving cache`), so multiple jobs using `actions/cache@v5` with the same key don't produce duplicates when the cache already exists.
- The `precompile` job builds with every profile (`-Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver -Pdocker-integration-tests -Pvolcano`), so its `~/.cache/coursier` is a superset of every consumer job's closure.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-sonnet-4-6)