Overhaul test infrastructure #389

Merged
lkdvos merged 50 commits into main from ld-testing on Apr 15, 2026

@lkdvos
Member

lkdvos commented Mar 24, 2026

Summary

  • Moves test dependencies into a dedicated test/Project.toml, separating them from the main package manifest
  • Replaces the monolithic test runner with ParallelTestRunner.jl, running each test file in its own worker process
  • Adds --fast mode that skips AD test groups and reduces sector/scalar-type coverage for quick iteration
  • Adds test/README.md documenting how to run tests, available groups, fast mode, and how to add new test files
  • Updates CI to auto-discover test groups as matrix jobs; draft PRs run a reduced matrix (ubuntu + Julia 1 only) while ready PRs run the full matrix
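
As a rough illustration of how a --fast switch of this kind could be wired up, here is a minimal sketch. It is not the code from this PR: the function names (`parse_test_args`, `select_groups`, `reduce_coverage`) and the group names other than `chainrules` are hypothetical, and it uses only Base Julia rather than the ParallelTestRunner.jl API.

```julia
# Illustrative sketch only: all names below are assumptions, not taken from the PR.
const AD_GROUPS = ("chainrules",)  # groups treated as AD tests in this sketch

"""Split command-line arguments into a `fast` flag and the remaining positional arguments."""
function parse_test_args(args::AbstractVector{<:AbstractString})
    fast = "--fast" in args
    positional = filter(a -> !startswith(a, "--"), args)
    return fast, positional
end

"""Select which groups to run: AD groups are dropped in fast mode."""
function select_groups(all_groups, fast::Bool)
    return fast ? filter(g -> !(g in AD_GROUPS), all_groups) : collect(all_groups)
end

"""Keep only the first `n` entries (e.g. of sectors or scalar types) in fast mode."""
reduce_coverage(items, fast::Bool; n::Int = 1) = fast ? first(items, n) : collect(items)

# Example usage with hypothetical group names.
fast, rest = parse_test_args(["--fast", "tensors"])
groups = select_groups(["spaces", "tensors", "chainrules"], fast)
scalartypes = reduce_coverage([Float64, ComplexF64], fast)
```

Keeping the filtering in small pure functions like these means the reduction logic can be checked without launching any worker processes.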

Comment thread .github/workflows/CI.yml Outdated
kshyatt
kshyatt previously approved these changes Mar 25, 2026
Member

@kshyatt kshyatt left a comment

Apart from a minor comment, LGTM. This will be very helpful for the Enzyme PR.

@github-actions
Contributor

github-actions bot commented Mar 25, 2026

Your PR no longer requires formatting changes. Thank you for your contribution!

@kshyatt
Member

kshyatt commented Mar 26, 2026

Ope, GPU failures look related...

@kshyatt kshyatt self-requested a review March 26, 2026 09:58
@lkdvos
Member Author

lkdvos commented Mar 26, 2026

Needs #390 first, but should then be ready.

@codecov

codecov bot commented Mar 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines                  | Coverage Δ
ext/TensorKitChainRulesCoreExt/linalg.jl  | 100.00% <100.00%> (ø)
src/fusiontrees/braiding_manipulations.jl | 95.08% <100.00%> (+7.37%) ⬆️
src/spaces/productspace.jl                | 88.46% <100.00%> (-0.15%) ⬇️

... and 2 files with indirect coverage changes

@lkdvos lkdvos marked this pull request as ready for review March 27, 2026 13:23
@lkdvos lkdvos force-pushed the ld-testing branch 2 times, most recently from ab125b5 to e3ce090 Compare March 30, 2026 12:56
Comment thread test/setup.jl
Comment thread test/setup.jl Outdated
Comment thread test/cuda/factorizations.jl Outdated
Comment thread test/cuda/tensors.jl Outdated
Comment thread test/chainrules/tensoroperations.jl Outdated
Comment thread test/chainrules/tensoroperations.jl Outdated
@Jutho
Member

Jutho commented Mar 30, 2026

This generally looks good; I left a few small comments and questions. But clearly, this is too much change for a detailed review. Is there a convenient way to review such a code reorganization, i.e. to distinguish what has merely moved to other files from what are actual changes? I could probably ask some agent, but I don't feel like doing that.

@lkdvos
Member Author

lkdvos commented Mar 30, 2026

I don't think there is, but I did try not to alter anything except the organization of the tests.
To summarize:

  • Split the tests over separate files/folders to allow some parallelization
  • Reworked the GitHub Actions implementation
  • Added a --fast filter that just reduces some of the tests

In principle there is no reason to review the actual contents of the test files, since these are unchanged, which also explains some of the things you commented on.
I'll address those in the meantime.

@kshyatt
Member

kshyatt commented Apr 8, 2026

Buildkite now succeeding!!!

Comment thread test/cuda/tensors.jl
- @tensor t′[1 2 3; 4 5] := t1[1; 4] * t2[2 3; 5]
+ CUDA.@allowscalar begin
+     @tensor t′[1 2 3; 4 5] := t1[1; 4] * t2[2 3; 5]
+ end
Member

Why does this need an allowscalar? (just to understand what is still missing).

Member

It's missing the changes in the BraidingTensor PR: a BraidingTensor arises in this contraction for the Irrep[CU₁] case. Once we get the BraidingTensor changes in, this can be removed.

@Jutho
Member

Jutho commented Apr 8, 2026

> Buildkite now succeeding!!!

That is great news. The chainrules tests still take quite long and time out on Windows. The tensor contraction AD test is repeated 5 times with random contraction patterns, so that cost can easily be cut down by lowering the number of repetitions.

Also thanks to @borisdevos; your commits contain useful improvements.

@Jutho
Member

Jutho commented Apr 9, 2026

This is the output of a local run of chainrules/tensoroperations.jl:

---------------------------------------
Auto-diff with symmetry: Trivial
---------------------------------------
Test Summary:                                          | Pass  Total     Time
ChainRules for tensor operations with symmetry Trivial | 1434   1434  1m30.8s
  scalartype Float64                                   |  717    717    45.9s
  scalartype ComplexF64                                |  717    717    44.9s
---------------------------------------
Auto-diff with symmetry: Irrep[ℤ₂]
---------------------------------------
Test Summary:                                            | Pass  Total     Time
ChainRules for tensor operations with symmetry Irrep[ℤ₂] | 1582   1582  1m26.8s
  scalartype Float64                                     |  811    811    44.9s
  scalartype ComplexF64                                  |  771    771    41.9s
---------------------------------------
Auto-diff with symmetry: Irrep[CU₁]
---------------------------------------
Test Summary:                                             | Pass  Total      Time
ChainRules for tensor operations with symmetry Irrep[CU₁] | 1746   1746  31m46.5s
  scalartype Float64                                      |  885    885  11m54.9s
  scalartype ComplexF64                                   |  861    861  19m51.5s
---------------------------------------
Auto-diff with symmetry: (FermionParity ⊠ Irrep[SU₂] ⊠ Irrep[U₁])
---------------------------------------
Test Summary:                                                                           | Pass  Total       Time
ChainRules for tensor operations with symmetry (FermionParity ⊠ Irrep[SU₂] ⊠ Irrep[U₁]) | 1806   1806  202m42.4s
  scalartype Float64                                                                    |  905    905    2m55.2s
  scalartype ComplexF64                                                                 |  901    901  199m47.2s

Clearly the last two symmetry types are the culprits, but I would need to add more detailed timing statements to know the true origin. I checked that tensors with spaces related to V1 * V2 * V3 * V4 * V5 are not too big for any of the chosen spaces, but the chainrules tensor contraction tests create tensors with a random set of spaces, which might produce tensors that are much bigger.

What is very strange is that for Irrep[CU₁] both the real and complex tests take a considerable amount of time, whereas for (FermionParity ⊠ Irrep[SU₂] ⊠ Irrep[U₁]) it is only the complex case that takes a truly outrageous amount of time.
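
One possible way to add such timing statements is sketched below. It is an illustration under assumptions, not code from the test suite: the helper `timed!`, the labels, and the plain matrix operations standing in for tensor contractions are all hypothetical. Only Base Julia and the Test standard library are used.

```julia
using Test

"""Run `f()` and add the elapsed wall time (in seconds) to `timings[label]`."""
function timed!(f, timings::Dict{String, Float64}, label::AbstractString)
    t0 = time()
    result = f()
    timings[label] = get(timings, label, 0.0) + (time() - t0)
    return result
end

timings = Dict{String, Float64}()

@testset "timed sketch" begin
    for T in (Float64, ComplexF64)
        # Stand-in for creating the random input tensors.
        A = timed!(timings, "setup $T") do
            rand(T, 50, 50)
        end
        # Stand-in for the operation under test.
        B = timed!(timings, "contract $T") do
            A * A'
        end
        @test size(B) == (50, 50)
    end
end

# Report the slowest steps first.
for (label, t) in sort!(collect(timings); by = last, rev = true)
    println(rpad(label, 24), round(t; digits = 3), " s")
end
```

Accumulating per-label totals like this separates the cost of constructing inputs from the cost of the operation itself, which is the distinction the per-scalartype summaries above cannot show.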

Comment thread src/fusiontrees/braiding_manipulations.jl Outdated
Comment thread src/fusiontrees/braiding_manipulations.jl Outdated
@Jutho
Member

Jutho commented Apr 15, 2026

It's been very hard to get the tests to pass again, since I started making small changes in the spaces etc. There was also some issue with random numbers that accidentally caused a very small singular value in Float32 precision etc.

I think everything is now working, except that one Windows test run failed due to some (unrelated?) error in compiling the CUDA package, and that one of the actual Buildkite CUDA runs also failed. I don't know if this failure is deterministic and related to some actual change here; I fail to see how that could be the case, as I didn't change anything in the CUDA tests in the last few commits.

If we can confirm that the failure is unrelated, then I would be ok with having this merged, even though I will try to further improve the tests in follow-up PRs.

@kshyatt
Member

kshyatt commented Apr 15, 2026

CUDA failure seems to have resolved itself btw

@lkdvos
Member Author

lkdvos commented Apr 15, 2026

Still not sure what is going on with the CUDA tests, but will merge for now and deal with fallout later. Thanks everyone for the help on this one!

@lkdvos lkdvos merged commit ec7af8f into main Apr 15, 2026
52 of 61 checks passed
@lkdvos lkdvos deleted the ld-testing branch April 15, 2026 18:33
4 participants