fix: reduce Docker layers, add auto CI trigger, fix fake ops import #363
shijieliu merged 7 commits into NVIDIA:main
Greptile Summary
This PR consolidates Dockerfile
Confidence Score: 5/5
Safe to merge: all functional changes are correct; the only finding is a P2 description/code mismatch. No P0 or P1 findings. The Dockerfile layer consolidation is functionally equivalent to the original. The Python fake-ops import ordering is correct and uses the established
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Dev as Developer
participant GH as GitHub PR
participant BCI as Blossom CI (GH Action)
participant GL as GitLab Pipeline
Dev->>GH: Comment /build [devel|nightly]
GH->>BCI: issue_comment event (created)
BCI->>BCI: Authorization: actor in allowlist?
BCI->>BCI: Authorization: startsWith(body, '/build')?
BCI->>BCI: Vulnerability scan (checkout + blossom-action)
BCI->>GL: START-CI-JOB (blossom-ci, passes flags)
GL-->>GH: Pipeline status reported back
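The two authorization steps in the diagram can be sketched in Python as follows. This is only an illustration of the gating logic, not the workflow's actual implementation: the `CommentEvent` type, the `should_start_ci` function, and the contents of `ALLOWED_ACTORS` are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Hypothetical allowlist; the real list of authorized actors is not shown in this PR page.
ALLOWED_ACTORS = frozenset({"alice", "bob"})


@dataclass(frozen=True)
class CommentEvent:
    """Minimal stand-in for the fields of an issue_comment event used by the gate."""
    action: str  # e.g. "created"
    actor: str   # login of the commenter
    body: str    # raw comment text


def should_start_ci(event: CommentEvent, allowed=ALLOWED_ACTORS) -> bool:
    """Return True only if both authorization checks from the diagram pass."""
    if event.action != "created":
        return False
    if event.actor not in allowed:
        return False
    # Prefix match, analogous to startsWith(body, '/build') in the diagram.
    return event.body.startswith("/build")
```

Note that a plain prefix match also accepts bodies such as `/buildx`; whether that matters depends on how the downstream job interprets the comment.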
Reviews (13): Last reviewed commit: "ci: remove pull_request_target trigger, ..."
Force-pushed: b816ab3 to df8ec8a
Aggressively merge RUN instructions in the Dockerfile to reduce total
layer count from ~126 to ~119. The inference image was hitting the
overlay2 128-layer limit ("failed to register layer: max depth
exceeded") on CI nodes.
devel stage: 8 RUN + 1 COPY -> 4 RUN + 1 COPY (-4 layers)
build stage: 4 RUN + 1 COPY -> 1 RUN + 1 COPY (-3 layers)
FBGEMM and TorchRec kept as separate layers for build cache efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
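To make the layer arithmetic in the commit message concrete, here is a rough Python sketch that counts layer-producing instructions (RUN, COPY, ADD) in Dockerfile text. It is an approximation under stated assumptions: it ignores base-image layers and multi-stage boundaries, so it does not reproduce the ~126 and ~119 totals quoted above. The function name and the sample Dockerfiles are made up for illustration.

```python
import re

# Dockerfile instructions that add a filesystem layer.
LAYER_INSTRUCTIONS = ("RUN", "COPY", "ADD")
OVERLAY2_MAX_LAYERS = 128  # limit quoted in the commit message


def count_layer_instructions(dockerfile_text: str) -> int:
    """Count RUN/COPY/ADD instructions, treating a continued RUN as one."""
    # Join backslash line continuations so a multi-line RUN is counted once.
    joined = re.sub(r"\\[ \t]*\n", " ", dockerfile_text)
    count = 0
    for line in joined.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        keyword = stripped.split(None, 1)[0].upper()
        if keyword in LAYER_INSTRUCTIONS:
            count += 1
    return count


# Hypothetical before/after snippets: two RUN steps merged into one.
BEFORE = """FROM base
RUN apt-get update
RUN apt-get install -y git
COPY . /src
RUN make
"""

AFTER = """FROM base
RUN apt-get update && \\
    apt-get install -y git
COPY . /src
RUN make
"""

saved = count_layer_instructions(BEFORE) - count_layer_instructions(AFTER)
```

Merging instructions with `&&` trades cache granularity for fewer layers, which is presumably why the commit keeps FBGEMM and TorchRec as separate layers.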
Force-pushed: df8ec8a to a105838
/build
/ci
Force-pushed: 0279dc3 to a105838
❔ Pipeline #48650175 -- canceling
The module hstu.hstu_ops_gpu does not exist as a Python module. The C++ source hstu_ops_gpu.cpp compiles into hstu/fbgemm_gpu_experimental_hstu.so, not a separate hstu_ops_gpu submodule. This import was incorrectly added in PR NVIDIA#327 and causes ModuleNotFoundError in CI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
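The failure mode described in this commit message (importing a dotted name that has no corresponding Python module) can be detected ahead of time with the standard library. The sketch below is not what the commit does, since the commit removes the bad import; `module_available` is a hypothetical helper shown only to illustrate the check.

```python
import importlib
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` resolves to an importable module."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name cannot be imported.
        return False


# Guarded import: only attempted when the module actually resolves.
hstu_ops_gpu = None
if module_available("hstu.hstu_ops_gpu"):
    hstu_ops_gpu = importlib.import_module("hstu.hstu_ops_gpu")
```

A guard like this hides the error rather than fixing it, so it suits optional dependencies better than a module that is expected to exist.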
❌ Pipeline #48656842 -- failed
Result: 10/14 jobs passed
Update the submodule pointer from 04df536 to 65bad42, which adds fake tensor implementations for torch.export (hstu_ops_gpu.py). This was missing since PR NVIDIA#340 accidentally reverted the submodule pointer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
❌ Pipeline #48669643 -- failed
Result: 8/14 jobs passed
❌ Pipeline #48680124 -- failed
Result: 10/14 jobs passed
❌ Pipeline #48697202 -- failed
Result: 5/14 jobs passed
❌ Pipeline #48700740 -- failed
Result: 11/14 jobs passed
Summary
- Merge RUN instructions in docker/Dockerfile to reduce total layer count, fixing the overlay2 128-layer limit (max depth exceeded) on CI nodes. Saves 12 layers (~119 total).
- Add pull_request_target trigger to blossom-ci.yml so CI runs automatically on PR open/update (no need to manually comment /build)
- /build command: support /build devel and /build nightly flags to pass BUILD_DEVEL=1 / NIGHTLY_TEST=1 to GitLab pipeline (companion change in GitLab MR !125)
Dockerfile Changes
blossom-ci.yml Changes
- pull_request_target: [opened, synchronize] trigger
- if condition: startsWith(comment.body, '/build') to support /build devel, /build nightly, /build devel nightly
/build Flag Support
- /build
- /build devel: BUILD_DEVEL=1
- /build nightly: NIGHTLY_TEST=1
- /build devel nightly
CI
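The flag handling described above could look roughly like this in Python. This is a sketch under assumptions, not the workflow's code: `parse_build_comment` and `FLAG_TO_VARIABLE` are hypothetical names, unknown flags are assumed to be ignored, and `/build devel nightly` is assumed to set both variables.

```python
from typing import Dict, Optional

# Flag-to-variable mapping taken from the PR description.
FLAG_TO_VARIABLE = {
    "devel": ("BUILD_DEVEL", "1"),
    "nightly": ("NIGHTLY_TEST", "1"),
}


def parse_build_comment(body: str) -> Optional[Dict[str, str]]:
    """Map a /build comment to pipeline variables, or None if it is not a /build command."""
    tokens = body.strip().split()
    if not tokens or tokens[0] != "/build":
        return None
    variables: Dict[str, str] = {}
    for flag in tokens[1:]:
        # Assumption: flags that are not recognized are silently ignored.
        if flag in FLAG_TO_VARIABLE:
            key, value = FLAG_TO_VARIABLE[flag]
            variables[key] = value
    return variables
```

Unlike a startsWith prefix check, this sketch requires the first whitespace-separated token to be exactly `/build`, so a comment such as `/buildx` would be rejected.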
Test plan
- inference_build + inference_test_1gpu pass (no more max depth exceeded)
- train_build + unit tests unaffected
- /build devel triggers pipeline with BUILD_DEVEL=1 (requires GitLab MR !125 merged first)
🤖 Generated with Claude Code