[CI] Run test_scheduler on github-hosted with extra disk #279
Merged
test_scheduler compiles resnet18 and EncoderBlock and launches each model twice, so the per-kernel RISC-V ELFs, objdump disassembly dumps and gem5 m5out directories accumulate within a single run. On the small github-hosted root volume (~14G) this overflows during the RISC-V final link step (ld: final link failed: No space left on device).

Free the preinstalled tool caches before the run and redirect the PyTorchSim outputs/ and /tmp artifacts onto the larger /mnt scratch disk (~70G) so the accumulated artifacts no longer fill the root volume.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
840a049 to b9dad83
Problem
`test_scheduler` fails on the github-hosted runner with `ld: final link failed: No space left on device` during the RISC-V final link step.

The test compiles `resnet18` (channels_last) and `EncoderBlock(768, 12)` and `launch_model`s each model twice. Every unique compiled kernel leaves artifacts behind within the single run:

- per-kernel RISC-V ELFs
- `binary.dump` from `objdump -d` (full disassembly incl. libc)
- `m5out` dirs and spike arg dumps

`resnet18` alone has 20+ conv/bn/relu kernels, so these accumulate past the ~14G github-hosted root volume and the RISC-V link step runs out of space. Single-op tests stay small enough to pass.
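To see how close a run gets to the limit, a diagnostic step along these lines could be added to the job. This is a sketch and not part of this PR: the step name is hypothetical, and it assumes the job runs the test through Docker on a runner that exposes `/mnt`.

```yaml
# Hypothetical diagnostic step; not part of this PR.
- name: Report disk usage (diagnostic)
  if: always()   # run even when test_scheduler fails
  run: |
    # Free space on the root volume and on the /mnt scratch disk
    df -h / /mnt || true
    # Space held by Docker images, containers and volumes
    docker system df || true
```

Running it both before and after the test step would show how much of the root volume a single `test_scheduler` run consumes.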
Fix
Scope changes to the `test_scheduler` job only:

- The `jlumbroso/free-disk-space` step removes preinstalled tool caches (~25G recovered).
- The `/mnt` scratch disk (~70G) is mounted over the container's `outputs/` and `/tmp` so accumulated artifacts no longer land on root.

No test semantics or simulator behavior change.
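The two changes could look roughly like the following workflow steps. This is a minimal sketch under stated assumptions, not the actual diff: the step names, the action ref `@main`, the image name `example/pytorchsim:latest`, the in-container path `/workspace/PyTorchSim/outputs` and the test command are all hypothetical placeholders. Only the `jlumbroso/free-disk-space` action, its `tool-cache` input and the `/mnt`, `outputs/` and `/tmp` locations come from the description above.

```yaml
# Sketch only; names, paths and the test command are placeholders.
- name: Free disk space
  uses: jlumbroso/free-disk-space@main
  with:
    tool-cache: true   # also remove the preinstalled tool caches

- name: Prepare scratch directories on /mnt
  run: |
    # Create host directories on the larger scratch disk
    sudo mkdir -p /mnt/outputs /mnt/tmp
    # World-writable with sticky bit, like a regular /tmp
    sudo chmod 1777 /mnt/outputs /mnt/tmp

- name: Run test_scheduler
  run: |
    docker run --rm \
      -v /mnt/outputs:/workspace/PyTorchSim/outputs \
      -v /mnt/tmp:/tmp \
      example/pytorchsim:latest \
      python tests/test_scheduler.py
```

Bind-mounting host directories from `/mnt` keeps the ELFs, disassembly dumps and `m5out` directories off the container's writable layer, which otherwise lives on the root volume.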
🤖 Generated with Claude Code