Conversation
afef03c to
c17633d
Compare
|
The fix needs to be made upstream: openedx/openedx-events#559 waiting for that to be merged and released before coming back to this PR. |
676809c to
a89e30c
Compare
19c5643 to
1422307
Compare
| ] | ||
| }, | ||
| "openedx-1-with-cms": { | ||
| "shared-with-cms-1": { |
There was a problem hiding this comment.
It was nice for debugging that for each X-with-lms shard there was a corresponding X-with-cms, especially for those tests which would pass in one system and fail in the other. Would you be willing to change it so that we have parallel shared-with-lms-[1,2] and shared-with-cms-[1,2] shards? It would only add one additional shard and I don't think it'd increase the critical test time.
If not, then could you simplify shared-with-cms-1 definition into just the paths xmodule/, common/, and openedx/?
There was a problem hiding this comment.
I split them for convenience for now. I didn't want to collapse the tests because that will make it harder to re-balance them in the future since it would require more lookups to do the rebalancing.
There was a problem hiding this comment.
Did you catch this feedback?
It was nice for debugging that for each X-with-lms shard there was a corresponding X-with-cms, especially for those tests which would pass in one system and fail in the other. Would you be willing to change it so that we have parallel shared-with-lms-[1,2] and shared-with-cms-[1,2] shards? It would only add one additional shard and I don't think it'd increase the critical test time.
There was a problem hiding this comment.
Nevermind, just saw your commit
There was a problem hiding this comment.
TIL, that apparently we don't run all of the openedx apps under CMS, just some of them and if we try to run all of them there are issues:
- https://github.com/openedx/openedx-platform/actions/runs/24354051533/job/71116066612?pr=38287
- https://github.com/openedx/openedx-platform/actions/runs/24354051533/job/71116066418?pr=38287
These folders are run under LMS but not CMS:
openedx/core/djangoapps/course_live/
openedx/core/djangoapps/notifications/
openedx/core/djangolib/
openedx/core/tests/
openedx/features/
openedx/testing/
What do you think about landing this as is? I think it could be further improved and there's more to investigate but I don't want this to be blocked on existing issues.
There was a problem hiding this comment.
Wow that's crazy
Yeah, in that case, totally OK with there being just one CMS shard
I wrote a followup issue, do you mind linking it here with a TODO comment? #38355
Otherwise, LGTM ✅
There was a problem hiding this comment.
Ah, I guess you can't add a comment here's, it's JSON
Per #38287 (comment) Having the shared-with... tests correspond between the LMS and CMS makes it easier to spot tastes that are failing in one context but not the other more quickly. Since neither of these is the longest run, we pay a bit more in overhead for this but it's still an improvement over what we had.
dcbee9b to
1422307
Compare
|
@kdmccormick take a look at my latest comment in the conversation above, the extra commit revealed issues, I could still split shared-cms to 2 shards removing those highlighted test files but not sure if it's valuable. What do you think? |
Run pytest with extra reporting enabled to generate files with per-test durations. The file is uploaded as a CI artifact so timing data can be downloaded and used to drive optimal shard rebalancing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1422307 to
62ba894
Compare
| - "openedx-2-with-cms" | ||
| - "shared-with-lms-1" | ||
| - "shared-with-lms-2" | ||
| - "shared-with-cms-1" |
There was a problem hiding this comment.
| - "shared-with-cms-1" | |
| # Note: The shared-with-cms-1 shard is currently a subset of both | |
| # shared-with-lms-1 and shared-with-lms-2. Some shared tests are | |
| # not run -with-cms at all. | |
| # https://github.com/openedx/openedx-platform/issues/38355 | |
| - "shared-with-cms-1" |
kdmccormick
left a comment
There was a problem hiding this comment.
Just a comment suggestion
Redistribute test paths across 9 shards (down from 16) using a greedy bin-packing optimiser driven by real per-test timing data from pytest-reportlog. Predicted critical path: ~18.7m (down from ~29m). Key changes: - Rename shard groups to reflect semantic meaning: lms-*, shared-with-lms-*, shared-with-cms-*, cms-* (openedx/common/xmodule paths explicitly separated from lms-only and cms-only paths) - Split lms/djangoapps/discussion/ into its 4 subdirectories so the heavy rest_api/ shard (15.7m) can be distributed across bins independently - Remove outdated comment referencing unit-tests-gh-hosted.yml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
contentstore/ is large enough that the cms-1 runner was being killed mid-run in CI (OOM or runner-level timeout). Splitting it into its own shard keeps each job under the ~20-25 min target. No changes needed to gha_unit_tests_collector.py — it already classifies any shard whose first path starts with "cms/" as a CMS shard. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The --report-log flag adds overhead (writing a JSONL file for every test) that's only useful for rebalancing work. Skip it entirely on PR runs by conditionally setting the flag via an env var; also gate the upload step on master so artifacts aren't created unnecessarily. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
b445ed2 to
4f1e0c8
Compare
Summary
Rebalances the unit test shard configuration to reduce the CI critical path from ~29 minutes down to ~23 minutes, measured across 3 consistent runs.
What changed
Timing data — 3 runs on the new config (all passed)
shared-with-cms-1shared-with-lms-1shared-with-lms-2lms-4lms-1lms-5cms-2lms-2lms-3cms-1For comparison, the old config's critical path (visible in the #38347 run on unmodified master) was ~29m on
lms-4.Future Issues: #38355