Handle BucketAlreadyOwnedByYou race condition in S3 bucket initialization #7325

Merged
hanwen-cluster merged 3 commits into aws:develop from hanwen-cluster:developapr2
Apr 14, 2026

Conversation

@hanwen-cluster
Contributor

Description of changes

The default S3 bucket name is deterministic per account+region, so concurrent cluster creation calls (e.g. parallel integration tests) can race: multiple callers get a 404 from head_bucket, then all attempt create_bucket. The first succeeds; the rest fail with BucketAlreadyOwnedByYou, which was previously unhandled and surfaced as "Unable to initialize s3 bucket."

Treat BucketAlreadyOwnedByYou as a success condition since it confirms the bucket exists and is owned by the caller, which is the desired end state. Log at info level and proceed normally.
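
The handling described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: `FakeClientError`, `ensure_bucket`, and `RacingClient` are hypothetical names, and the fake client stands in for the real S3 calls so the race can be reproduced without AWS access. Only the `error_code` check and the log messages mirror the snippet under review.

```python
import logging

LOGGER = logging.getLogger(__name__)


class FakeClientError(Exception):
    """Hypothetical stand-in for the client error type; it only carries an error code."""

    def __init__(self, error_code):
        super().__init__(error_code)
        self.error_code = error_code


def ensure_bucket(client, name):
    """Create the bucket if it is missing and tolerate losing a creation race."""
    try:
        client.head_bucket(name)
        return "exists"
    except FakeClientError as error:
        # Anything other than "not found" is a real failure.
        if error.error_code != "404":
            raise
    try:
        client.create_bucket(name)
        return "created"
    except FakeClientError as error:
        if error.error_code == "BucketAlreadyOwnedByYou":
            # Another caller created the bucket between head_bucket and create_bucket.
            # The bucket exists and is owned by the caller, which is the desired end state.
            LOGGER.info("S3 bucket %s was concurrently created by another process. Proceeding.", name)
            return "already_owned"
        LOGGER.error("Unable to create S3 bucket %s.", name)
        raise


class RacingClient:
    """Hypothetical fake client that simulates a caller losing the race."""

    def head_bucket(self, name):
        raise FakeClientError("404")

    def create_bucket(self, name):
        raise FakeClientError("BucketAlreadyOwnedByYou")


if __name__ == "__main__":
    print(ensure_bucket(RacingClient(), "demo-bucket"))
```

Any error code other than `BucketAlreadyOwnedByYou` is still logged at error level and re-raised, so only the race itself is treated as success.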

Tests

  • Tests will run after PR is merged

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hanwen-cluster hanwen-cluster requested review from a team as code owners April 2, 2026 20:09
@hanwen-cluster hanwen-cluster added the skip-changelog-update Disables the check that enforces changelog updates in PRs label Apr 2, 2026
@hanwen-cluster hanwen-cluster enabled auto-merge (rebase) April 2, 2026 20:23
Contributor

@gmarciani gmarciani left a comment

[Docs] This bug fix deserves an entry in the changelog because it prevents cluster creation failures.

Comment thread cli/src/pcluster/models/s3_bucket.py Outdated
LOGGER.error("Unable to create S3 bucket %s.", bucket.name)
raise error
if error.error_code == "BucketAlreadyOwnedByYou":
LOGGER.info("S3 bucket %s was concurrently created by another process. Proceeding.", bucket.name)
Contributor

[minor] What about logging "Bucket ${BUCKET_NAME} is already owned by you" instead? The fact that the bucket was created by a concurrent process is a low-level detail we observed in our integ tests, but this is a customer-facing log line.

Contributor Author

This detail is needed because the error wouldn't appear if the bucket had existed long before cluster creation. The error is only triggered when the race condition happens.

Contributor

This message is part of the product, not just the integ test infra, so the messaging should be aligned with what the CX would see, which may or may not be a multi-threaded cluster creation process.

@hanwen-cluster hanwen-cluster removed the skip-changelog-update Disables the check that enforces changelog updates in PRs label Apr 9, 2026
@codecov

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 90.07%. Comparing base (0a83ff2) to head (d298487).
⚠️ Report is 10 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| cli/src/pcluster/models/s3_bucket.py | 75.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7325      +/-   ##
===========================================
- Coverage    90.08%   90.07%   -0.01%     
===========================================
  Files          182      182              
  Lines        16730    16732       +2     
===========================================
+ Hits         15071    15072       +1     
- Misses        1659     1660       +1     

| Flag | Coverage Δ | |
|---|---|---|
| unittests | 90.07% <75.00%> (-0.01%) | ⬇️ |

Comment thread CHANGELOG.md Outdated
------

**BUG FIXES**
- Fix sporadic S3 bucket creation failure when multiple create-cluster commands are running simultaneously in the same region.
Contributor

Do we have a name for the specific bucket, like "default bucket" or "do-not-delete bucket"? If so, I suggest we specify that in the changelog.

Contributor Author

Done

@hanwen-cluster hanwen-cluster merged commit eec8372 into aws:develop Apr 14, 2026
24 checks passed