build: rebalance unit test shards to reduce CI critical path (~29m --> ~23m) and use fewer workers (16 --> 10)#38287

Merged
feanil merged 4 commits intomasterfrom
feanil/collect-test-timings
Apr 14, 2026

@feanil feanil commented Apr 4, 2026

Summary

Rebalances the unit test shard configuration to reduce the CI critical path from ~29 minutes down to ~23 minutes, measured across 3 consistent runs.

What changed

  • Replaced 16 uneven shards with 10 balanced shards targeting ~19–23 min each
  • New shard layout: 5 LMS shards, 2 shared-with-LMS shards, 1 shared-with-CMS shard, and 2 CMS shards
    • `cms-1`: small CMS apps (api, cms_user_tasks, course_creators, envs, lib, etc.) — ~5–6 min
    • `cms-2`: `contentstore/` only — ~19–20 min (split out to avoid OOM on runners when combined with other CMS apps)
  • Also includes test isolation fixes (submitted separately as #38347, "fix: test isolation fixes for certificates and openedx-events test mixins", so they can be merged independently):
    • Moved `CourseFactory()` calls from `setUp` to `setUpClass` in three certificate test classes to avoid exhausting MongoDB connections across test methods
    • Updated to openedx-events 11.1.1 which adds `tearDownClass` to `OpenEdxEventsTestMixin`, preventing events from being left globally disabled between test classes
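
A minimal sketch of the two isolation patterns described above, using only the standard library's `unittest`. The names `make_course`, `EVENTS_ENABLED`, and `EventsToggleMixin` are hypothetical stand-ins for `CourseFactory()`, the global event state, and `OpenEdxEventsTestMixin`; this is not the code from either repository.

```python
import unittest

# Hypothetical stand-ins: a log of "expensive" objects created, and a
# piece of global state that a test mixin switches off while it runs.
CREATED = []
EVENTS_ENABLED = {"value": True}


def make_course():
    """Pretend to build an expensive fixture and record that we did."""
    course = object()
    CREATED.append(course)
    return course


class EventsToggleMixin:
    """Disable the global flag for one test class, then restore it."""

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls._saved_flag = EVENTS_ENABLED["value"]
        EVENTS_ENABLED["value"] = False

    @classmethod
    def tearDownClass(cls):
        # Without this step the flag would stay off for later test classes.
        EVENTS_ENABLED["value"] = cls._saved_flag
        super().tearDownClass()


class CertificateLikeTests(EventsToggleMixin, unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # Built once per class instead of once per test method.
        cls.course = make_course()

    def test_one(self):
        self.assertIs(self.course, CREATED[-1])
        self.assertFalse(EVENTS_ENABLED["value"])

    def test_two(self):
        self.assertIs(self.course, CREATED[-1])


def run_demo():
    """Run the class in-process and return the unittest result object."""
    CREATED.clear()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CertificateLikeTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Calling `run_demo()` runs both test methods against a single shared fixture and leaves the global flag in its original state afterwards.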

Timing data — 3 runs on the new config (all passed)

| Shard | Run 1 | Run 2 | Run 3 | Avg |
| --- | --- | --- | --- | --- |
| shared-with-cms-1 | 22m48s | 22m33s | 23m09s | ~22m50s |
| shared-with-lms-1 | 21m57s | 22m49s | 23m01s | ~22m36s |
| shared-with-lms-2 | 20m06s | 20m54s | 22m05s | ~21m02s |
| lms-4 | 22m28s | 21m07s | 21m27s | ~21m41s |
| lms-1 | 21m18s | 19m23s | 22m25s | ~21m02s |
| lms-5 | 19m50s | 21m11s | 19m22s | ~20m08s |
| cms-2 | 19m31s | 18m13s | 20m19s | ~19m21s |
| lms-2 | 19m16s | 19m19s | 18m43s | ~19m06s |
| lms-3 | 18m43s | 18m48s | 18m52s | ~18m48s |
| cms-1 | 5m40s | 5m27s | 5m32s | ~5m33s |
| Critical path | ~23m | ~23m | ~23m | ~23m |
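
As a rough check of the figures above, the sketch below recomputes the per-run critical path (the slowest shard in each run) and a per-shard average from the reported durations. The helper names are illustrative and not part of the PR; only the timing values come from the table.

```python
import re

# Durations copied from the timing table (Run 1, Run 2, Run 3).
TIMINGS = {
    "shared-with-cms-1": ["22m48s", "22m33s", "23m09s"],
    "shared-with-lms-1": ["21m57s", "22m49s", "23m01s"],
    "shared-with-lms-2": ["20m06s", "20m54s", "22m05s"],
    "lms-4": ["22m28s", "21m07s", "21m27s"],
    "lms-1": ["21m18s", "19m23s", "22m25s"],
    "lms-5": ["19m50s", "21m11s", "19m22s"],
    "cms-2": ["19m31s", "18m13s", "20m19s"],
    "lms-2": ["19m16s", "19m19s", "18m43s"],
    "lms-3": ["18m43s", "18m48s", "18m52s"],
    "cms-1": ["5m40s", "5m27s", "5m32s"],
}


def to_seconds(text):
    """Convert a string such as '22m48s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


def fmt(seconds):
    """Format seconds back into the 'XmYYs' style used in the table."""
    return f"{seconds // 60}m{seconds % 60:02d}s"


def critical_path(timings):
    """Return the slowest shard time (in seconds) for each run."""
    per_shard = [[to_seconds(t) for t in runs] for runs in timings.values()]
    return [max(run) for run in zip(*per_shard)]


def average(timings, shard):
    """Return the mean duration of one shard, rounded to whole seconds."""
    seconds = [to_seconds(t) for t in timings[shard]]
    return round(sum(seconds) / len(seconds))
```

Shards run in parallel, so the wall-clock time of a run is bounded by its slowest shard; that is why the critical path is a maximum rather than a sum.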

For comparison, the old config's critical path (visible in the #38347 run on unmodified master) was ~29m on lms-4.

Future Issues: #38355

@feanil feanil force-pushed the feanil/collect-test-timings branch 2 times, most recently from afef03c to c17633d Compare April 4, 2026 18:45

feanil commented Apr 6, 2026

The fix needs to be made upstream in openedx/openedx-events#559. I'm waiting for that to be merged and released before coming back to this PR.

@feanil feanil force-pushed the feanil/collect-test-timings branch 2 times, most recently from 676809c to a89e30c Compare April 10, 2026 21:11
@feanil feanil force-pushed the feanil/collect-test-timings branch 2 times, most recently from 19c5643 to 1422307 Compare April 11, 2026 15:53
@feanil feanil changed the title build: collect per-test timing data via --report-log build: rebalance unit test shards to reduce CI critical path (~29m → ~23m) Apr 11, 2026
@feanil feanil marked this pull request as ready for review April 11, 2026 17:07
@feanil feanil requested review from a team, farhan, irtazaakram and salman2013 as code owners April 11, 2026 17:07
@feanil feanil requested a review from a team April 11, 2026 17:07
@feanil feanil changed the title build: rebalance unit test shards to reduce CI critical path (~29m → ~23m) build: rebalance unit test shards to reduce CI critical path (~29m → ~23m) and use fewer workers (16 -> 10) Apr 11, 2026
@feanil feanil changed the title build: rebalance unit test shards to reduce CI critical path (~29m → ~23m) and use fewer workers (16 -> 10) build: rebalance unit test shards to reduce CI critical path (~29m --> ~23m) and use fewer workers (16 --> 10) Apr 11, 2026

@kdmccormick kdmccormick left a comment

Nice

]
},
"openedx-1-with-cms": {
"shared-with-cms-1": {
Member

It was nice for debugging that for each X-with-lms shard there was a corresponding X-with-cms, especially for those tests which would pass in one system and fail in the other. Would you be willing to change it so that we have parallel shared-with-lms-[1,2] and shared-with-cms-[1,2] shards? It would only add one additional shard and I don't think it'd increase the critical test time.

If not, then could you simplify shared-with-cms-1 definition into just the paths xmodule/, common/, and openedx/?

Contributor Author

I split them for convenience for now. I didn't want to collapse the tests because that will make it harder to re-balance them in the future since it would require more lookups to do the rebalancing.

Member

Did you catch this feedback?

> It was nice for debugging that for each X-with-lms shard there was a corresponding X-with-cms, especially for those tests which would pass in one system and fail in the other. Would you be willing to change it so that we have parallel shared-with-lms-[1,2] and shared-with-cms-[1,2] shards? It would only add one additional shard and I don't think it'd increase the critical test time.

Member

Never mind, just saw your commit

Contributor Author

TIL that we apparently don't run all of the openedx apps under CMS, only some of them, and if we try to run all of them there are issues.

These folders are run under LMS but not CMS:

openedx/core/djangoapps/course_live/
openedx/core/djangoapps/notifications/
openedx/core/djangolib/
openedx/core/tests/
openedx/features/
openedx/testing/

What do you think about landing this as is? I think it could be further improved and there's more to investigate but I don't want this to be blocked on existing issues.
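
One way to spot gaps like the ones listed above is to compare the path lists of the LMS-side and CMS-side shards. The sketch below is illustrative only: the function name is hypothetical, and the assumption that `xmodule/` and `common/` are the paths run under both systems is made for the example rather than taken from the real shard config.

```python
def uncovered(lms_paths, cms_paths):
    """Return the LMS paths that no CMS path equals or contains."""

    def covered(path):
        for cms_path in cms_paths:
            parent = cms_path.rstrip("/") + "/"
            if path == cms_path or path.startswith(parent):
                return True
        return False

    return [path for path in lms_paths if not covered(path)]


# Illustrative inputs: the six folders from the comment plus two shared paths.
LMS_PATHS = [
    "xmodule/",
    "common/",
    "openedx/core/djangoapps/course_live/",
    "openedx/core/djangoapps/notifications/",
    "openedx/core/djangolib/",
    "openedx/core/tests/",
    "openedx/features/",
    "openedx/testing/",
]
CMS_PATHS = ["xmodule/", "common/"]

LMS_ONLY = uncovered(LMS_PATHS, CMS_PATHS)
```

With these inputs `LMS_ONLY` contains exactly the six `openedx/` folders, because a broader CMS path such as `openedx/` would count as covering everything beneath it.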

Member

Wow that's crazy

Yeah, in that case, totally OK with there being just one CMS shard

I wrote a followup issue, do you mind linking it here with a TODO comment? #38355

Otherwise, LGTM ✅

Member

Ah, I guess you can't add a comment here, it's JSON

Comment thread .github/workflows/unit-tests.yml
feanil added a commit that referenced this pull request Apr 13, 2026
Per #38287 (comment)

Having the shared-with... tests correspond between the LMS and CMS makes
it quicker to spot tests that fail in one context but not the
other. Since neither of these is the longest run, we pay a
bit more in overhead for this, but it's still an improvement over what we
had.
@feanil feanil requested a review from kdmccormick April 13, 2026 16:17

@kdmccormick kdmccormick left a comment

🚀

@feanil feanil force-pushed the feanil/collect-test-timings branch from dcbee9b to 1422307 Compare April 13, 2026 19:44

feanil commented Apr 13, 2026

@kdmccormick take a look at my latest comment in the conversation above. The extra commit revealed issues. I could still split shared-cms into 2 shards by removing the highlighted test folders, but I'm not sure it's valuable. What do you think?

Run pytest with extra reporting enabled to generate files with per-test
durations. The file is uploaded as a CI artifact so timing data can be
downloaded and used to drive optimal shard rebalancing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
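
A minimal sketch of how the per-test timing file might be turned into per-directory totals. It assumes the JSON Lines layout written by pytest-reportlog, where each test phase is a record with `"$report_type": "TestReport"`, a `nodeid`, and a `duration` in seconds. The function name and the sample records are hypothetical.

```python
import json
from collections import defaultdict


def durations_by_prefix(lines, depth=3):
    """Sum test durations (seconds) per leading path prefix."""
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip session and collection records; keep per-test reports only.
        if record.get("$report_type") != "TestReport":
            continue
        file_path = record["nodeid"].split("::", 1)[0]
        prefix = "/".join(file_path.split("/")[:depth])
        # Setup, call, and teardown phases all count towards the total.
        totals[prefix] += float(record.get("duration", 0.0))
    return dict(totals)


# Hypothetical sample records, not taken from a real CI run.
SAMPLE = [
    '{"$report_type": "SessionStart"}',
    '{"$report_type": "TestReport", "nodeid": "lms/djangoapps/discussion/rest_api/tests/test_api.py::test_a", "when": "setup", "duration": 0.5}',
    '{"$report_type": "TestReport", "nodeid": "lms/djangoapps/discussion/rest_api/tests/test_api.py::test_a", "when": "call", "duration": 1.5}',
    '{"$report_type": "TestReport", "nodeid": "cms/djangoapps/contentstore/tests/test_x.py::test_b", "when": "call", "duration": 0.25}',
]

TOTALS = durations_by_prefix(SAMPLE)
```

Because the function accepts any iterable of lines, an open file handle for a downloaded artifact can be passed in directly.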
@feanil feanil force-pushed the feanil/collect-test-timings branch from 1422307 to 62ba894 Compare April 14, 2026 14:52
- "openedx-2-with-cms"
- "shared-with-lms-1"
- "shared-with-lms-2"
- "shared-with-cms-1"

@kdmccormick kdmccormick Apr 14, 2026

Suggested change
- "shared-with-cms-1"
# Note: The shared-with-cms-1 shard is currently a subset of both
# shared-with-lms-1 and shared-with-lms-2. Some shared tests are
# not run -with-cms at all.
# https://github.com/openedx/openedx-platform/issues/38355
- "shared-with-cms-1"


@kdmccormick kdmccormick left a comment

Just a comment suggestion

feanil and others added 3 commits April 14, 2026 11:05
Redistribute test paths across 9 shards (down from 16) using a greedy
bin-packing optimiser driven by real per-test timing data from
pytest-reportlog. Predicted critical path: ~18.7m (down from ~29m).

Key changes:
- Rename shard groups to reflect semantic meaning: lms-*, shared-with-lms-*,
  shared-with-cms-*, cms-* (openedx/common/xmodule paths explicitly separated
  from lms-only and cms-only paths)
- Split lms/djangoapps/discussion/ into its 4 subdirectories so the heavy
  rest_api/ shard (15.7m) can be distributed across bins independently
- Remove outdated comment referencing unit-tests-gh-hosted.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
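
The commit message above does not show the optimiser itself. As an illustration only, here is one common greedy heuristic for this kind of problem (longest-processing-time-first): sort paths by duration, then always hand the next path to the least-loaded shard. The function name and the sample durations are hypothetical, and the real optimiser may differ.

```python
import heapq


def pack_shards(durations, shard_count):
    """Greedily assign paths to shards, longest first.

    Returns (shards, loads): the paths in each shard and each shard's total.
    """
    # Heap of (current_load, shard_index); the least-loaded shard pops first.
    heap = [(0, index) for index in range(shard_count)]
    heapq.heapify(heap)
    shards = [[] for _ in range(shard_count)]
    loads = [0] * shard_count

    # Sort by descending duration; break ties by name for determinism.
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for path, minutes in ordered:
        load, index = heapq.heappop(heap)
        shards[index].append(path)
        loads[index] = load + minutes
        heapq.heappush(heap, (loads[index], index))
    return shards, loads


# Hypothetical per-path durations in minutes.
EXAMPLE = {"a/": 8, "b/": 7, "c/": 6, "d/": 5, "e/": 4}
SHARDS, LOADS = pack_shards(EXAMPLE, 2)
```

This heuristic is fast but not guaranteed to be optimal: for the example input it yields loads of 17 and 13, while a 15/15 split exists. Splitting a heavy directory into its subdirectories, as the commit does for `lms/djangoapps/discussion/`, gives any such packer smaller items to place.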
contentstore/ is large enough that the cms-1 runner was being killed
mid-run in CI (OOM or runner-level timeout). Splitting it into its own
shard keeps each job under the ~20-25 min target.

No changes needed to gha_unit_tests_collector.py — it already classifies
any shard whose first path starts with "cms/" as a CMS shard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The --report-log flag adds overhead (writing a JSONL file for every
test) that's only useful for rebalancing work. Skip it entirely on
PR runs by conditionally setting the flag via an env var; also gate
the upload step on master so artifacts aren't created unnecessarily.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@feanil feanil force-pushed the feanil/collect-test-timings branch from b445ed2 to 4f1e0c8 Compare April 14, 2026 15:05
@feanil feanil enabled auto-merge (rebase) April 14, 2026 15:08
@feanil feanil merged commit 01dc3c8 into master Apr 14, 2026
54 of 60 checks passed
@feanil feanil deleted the feanil/collect-test-timings branch April 14, 2026 18:29