Bump CI retention days #591

Draft
matthiasdiener wants to merge 5 commits into dev from mdiener/bump-ci-retention

@matthiasdiener
Contributor

@matthiasdiener matthiasdiener commented May 20, 2026

Description

To make it faster and easier to rerun intermittently failing tests.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

To make it faster and easier to rerun intermittent failing tests
Collaborator

@ipanfilo ipanfilo left a comment

The retention was set to 1 day because of the artifact size and the limited storage. It has to be confirmed with the enterprise admins that we have enough space to store 7*N*750 MB of data. Even after CK_JIT adoption it will be 7*N*320 MB, which will probably be fine for 3 days, compared to the original (before QoLA) 1 GB package size.
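The estimate above is a product of three factors: retention days, a count N, and the per-artifact size. A minimal sketch of that arithmetic follows, assuming N means the number of wheel artifacts uploaded per day (the comment does not define it); the function name `estimate_mb` and the value N=10 are hypothetical and only for illustration.

```shell
# Storage estimate in MB: retention days * artifacts per day (assumed meaning of N) * size per artifact.
estimate_mb() {
  local days="$1" n="$2" size_mb="$3"
  echo $(( days * n * size_mb ))
}

# N=10 is a made-up value, not a measured upload rate.
estimate_mb 7 10 750   # current wheel size quoted in the comment
estimate_mb 7 10 320   # size quoted for after CK_JIT adoption
```

Substituting the real number of artifact-producing runs per day would give the figure to check against the storage limit.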

@matthiasdiener
Contributor Author

Understood. My impression was that since this is a public repo with artifact storage on GitHub's servers, no quota applies (see e.g. https://docs.github.com/en/actions/concepts/billing-and-usage#about-billing-for-github-actions). But it would be good to double-check with the admins.

@matthiasdiener matthiasdiener marked this pull request as draft May 21, 2026 18:27
@wenchenvincent
Collaborator

wenchenvincent commented May 26, 2026

I think Actions are free for public repos, but the storage might not be. But yeah, better to check with the admins.

@ipanfilo
Collaborator

It is not about the price but about storage size limits: https://docs.github.com/en/actions/reference/limits#storage-limits-for-all-github-hosted-runners

@matthiasdiener
Contributor Author

Here is what is currently in the artifact storage, sorted by size (approx. 9 GB total):

$ gh api --paginate repos/ROCm/TransformerEngine/actions/artifacts \
  --jq '.artifacts[] | select(.expired == false) | "\(.name)\t\(.size_in_bytes / 1024 / 1024 | round)MB\t\(.workflow_run.head_branch)\t\(.created_at)"' \
  | sort -t$'\t' -k2 -rn | head -20

te-rocm-wheels  721MB   mdiener/cktile-grouped-gemm-padding     2026-05-26T21:40:30Z
te-rocm-wheels  709MB   mdiener/tests-rebuild   2026-05-26T19:56:35Z
te-rocm-wheels  709MB   mdiener/prodgemm-test   2026-05-26T20:19:32Z
te-rocm-wheels  709MB   mdiener/prodgemm-test   2026-05-26T16:32:26Z
te-rocm-wheels  709MB   ipanfilo/wheel_build_action     2026-04-28T04:58:09Z
te-rocm-wheels  709MB   ipanfilo/pytorch_tests_timeout  2026-05-27T03:13:03Z
te-rocm-wheels  709MB   ipanfilo/pytorch_tests_timeout  2026-05-26T20:50:01Z
te-rocm-wheels  709MB   amartin/ck-fp8-tuning   2026-05-27T15:52:12Z
te-rocm-wheels  709MB   amartin/ck-fp8-tuning   2026-05-27T15:52:10Z
te-rocm-wheels  709MB   amartin/ck-fp8-tuning   2026-05-27T04:48:05Z
te-rocm-wheels  709MB   amartin/ck-fp8-tuning   2026-05-27T01:35:23Z
te-rocm-wheels  709MB   amartin/ck-fp8-tuning   2026-05-27T00:21:43Z
logs-sgpu-mi35x 2MB     ipanfilo/pytorch_tests_timeout  2026-05-27T05:36:41Z
logs-sgpu-mi35x 2MB     ipanfilo/pytorch_tests_timeout  2026-05-26T17:06:40Z
logs-sgpu-mi30x 2MB     mdiener/cktile-grouped-gemm-padding     2026-05-27T00:04:01Z
logs-sgpu-mi30x 2MB     mdiener/cktile-grouped-gemm-padding     2026-05-26T18:25:29Z
logs-sgpu-mi30x 2MB     ipanfilo/pytorch_tests_timeout  2026-05-27T06:08:30Z
logs-sgpu-mi30x 2MB     ipanfilo/pytorch_tests_timeout  2026-05-26T23:06:43Z
logs-sgpu-mi30x 2MB     ipanfilo/pytorch_tests_timeout  2026-05-26T17:10:13Z
logs-sgpu-mi30x 2MB     dev     2026-05-23T16:42:10Z
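
The approximate total can be reproduced by summing `size_in_bytes` over the unexpired artifacts. The following is a rough sketch, not the command actually used for the figure above; `sum_mib` is a hypothetical helper name, and the commented `gh api` call reuses the same endpoint and filter as the listing.

```shell
# Sum byte counts read one per line from stdin and print the total in whole MiB (truncated).
sum_mib() {
  awk '{ total += $1 } END { printf "%d\n", total / 1024 / 1024 }'
}

# Assumed usage (requires an authenticated gh CLI):
# gh api --paginate repos/ROCm/TransformerEngine/actions/artifacts \
#   --jq '.artifacts[] | select(.expired == false) | .size_in_bytes' | sum_mib
```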

@matthiasdiener
Contributor Author

With the changes in d62a755, older wheel artifacts from the same branch are automatically deleted before uploading new ones, so each branch keeps at most one copy of the wheels (~709 MB). Total storage consumption should remain comparable to the current usage.
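
For readers without the diff at hand, the cleanup described above could look roughly like the sketch below. This is not the code from d62a755; the function names are hypothetical, and the repo, artifact name and branch are passed in as arguments. It relies only on the REST endpoints for listing repository artifacts (with the `name` query parameter) and deleting an artifact by id. Deleting artifacts needs a token with write access to Actions.

```shell
# Read "id<TAB>branch" lines on stdin; print the ids whose branch equals $1.
filter_ids_for_branch() {
  local branch="$1"
  awk -F '\t' -v b="$branch" '$2 == b { print $1 }'
}

# List unexpired artifacts named $2 in repo $1 as "id<TAB>branch" lines.
list_artifacts() {
  local repo="$1" name="$2"
  gh api --paginate "repos/${repo}/actions/artifacts?name=${name}" \
    --jq '.artifacts[] | select(.expired == false) | "\(.id)\t\(.workflow_run.head_branch)"'
}

# Delete every artifact named $2 that was produced on branch $3.
delete_stale_artifacts() {
  local repo="$1" name="$2" branch="$3" id
  list_artifacts "$repo" "$name" | filter_ids_for_branch "$branch" |
    while read -r id; do
      gh api --method DELETE "repos/${repo}/actions/artifacts/${id}"
    done
}

# Assumed usage, run before the upload step:
# delete_stale_artifacts ROCm/TransformerEngine te-rocm-wheels "$BRANCH"
```

Keeping the branch filter as a separate function makes it possible to check the selection logic without touching the API.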

@matthiasdiener matthiasdiener requested a review from ipanfilo May 28, 2026 17:52
env:
  GH_TOKEN: ${{ github.token }}
run: |
  BRANCH="${{ github.head_ref || github.ref_name }}"
Collaborator

Has this been tested in all trigger scenarios? I'm not confident that github.head_ref and github.ref_name always contain what we need.
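
As background for this question: per the GitHub Actions contexts reference, `github.head_ref` is only set for `pull_request` and `pull_request_target` events, while `github.ref_name` is the short name of the triggering ref (for a pull request that is `<number>/merge`, for a tag push the tag name). The sketch below only emulates the `||` fallback in plain shell for a few triggers; `resolve_branch` and the tag name are hypothetical, and it does not replace testing the workflow itself.

```shell
# Emulates ${{ github.head_ref || github.ref_name }}:
# use head_ref when it is non-empty, otherwise fall back to ref_name.
resolve_branch() {
  local head_ref="$1" ref_name="$2"
  printf '%s\n' "${head_ref:-$ref_name}"
}

resolve_branch "mdiener/bump-ci-retention" "591/merge"  # pull_request: prints the source branch
resolve_branch "" "dev"                                 # push or workflow_dispatch on dev: prints dev
resolve_branch "" "v1.0"                                # tag push (made-up tag): prints the tag, not a branch
```

The tag case shows one scenario where the resolved value is not a branch name.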

types: [ labeled, synchronize, reopened ]

permissions:
  actions: write
Collaborator

Why is permission elevation needed for this action?

group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

permissions:
Collaborator

The same as before: why does it need permission elevation?

Labels

ci-level 1 CI test level 1

5 participants