fix(pipelines): align tidb CI handling and fix hardcoded usages#4626
Conversation
There was a problem hiding this comment.
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary:
This PR removes the ee-bigdisk nodeSelector and corresponding tolerations from the pingcap-inc/tidb release-8.5 pod-pull build and unit test pipeline manifests. The intent is to align these manifests with the existing pingcap/tidb release-8.5 scheduling configuration by no longer restricting pods to nodes labeled with ee-bigdisk=true. The changes are straightforward YAML removals and appear to be limited to only the relevant pipeline manifests. Overall, the PR is clear and concise with no apparent errors.
Code Improvements
-
Consistency Verification:
- Files:
pipelines/pingcap-inc/tidb/release-8.5/pod-pull_build.yaml(lines ~69-79)
pipelines/pingcap-inc/tidb/release-8.5/pod-pull_unit_test.yaml(lines ~71-81) - Issue:
The PR description mentions matching the configuration to the existingpingcap/tidbrelease-8.5 scheduling, but there is no evidence in the diff or PR that the target manifests have been verified to confirm this exact alignment. It’s important to ensure that the removal here does not unintentionally schedule these pods on undesired nodes. - Suggested Solution:
Add a comment or note in the PR or codebase referencing the source configuration that these changes are being aligned with. Possibly include a link or a snippet of thepingcap/tidbmanifests for reviewer clarity.
- Files:
-
Error Handling / Validation:
- Files: Same as above.
- Issue:
Removing nodeSelector and tolerations changes pod scheduling behavior, which could cause pods to land on unsuitable nodes if cluster labels or taints change in the future. - Suggested Solution:
Consider adding a validation step in the CI pipeline or a pre-submit hook that verifies pods are scheduled on appropriate nodes (e.g., nodes with sufficient disk or resources), or document the expected node labeling standards clearly.
Best Practices
-
Documentation:
- Files: Both YAML manifests, near the removed sections.
- Issue:
The removed nodeSelector and tolerations are a critical part of pod scheduling; removing them can affect cluster resource usage. There are no comments explaining why these were removed or what the new expected scheduling behavior is. - Suggested Solution:
Add a comment block above thespec:section or near the scheduling configuration explaining the rationale for removal, for example:# Removed ee-bigdisk nodeSelector and tolerations to align with pingcap/tidb release-8.5 scheduling. # Pods will now be scheduled on any node matching the default selectors without requiring the ee-bigdisk label.
-
Testing Coverage:
- Files: N/A (pipeline YAMLs)
- Issue:
Changes to scheduling constraints might affect pod placement and test results, but no mention is made of testing these changes. - Suggested Solution:
Ensure that test runs triggered by these updated pipelines are monitored for any unexpected pod scheduling failures or resource issues. If possible, add automated checks or alerts for pod evictions or scheduling delays.
No critical issues or security concerns are present given the limited scope of the changes. The main improvements focus on clarifying intent, ensuring alignment is verified, and documenting the impact for future maintainers.
There was a problem hiding this comment.
Code Review
This pull request removes the ee-bigdisk node selectors and tolerations from the build and unit test pod configurations for TiDB release 8.5. The reviewer expressed concerns that removing these constraints while maintaining high ephemeral storage limits (up to 300Gi) could lead to scheduling instability or pod evictions on standard nodes that cannot accommodate such limits. It is recommended to verify these changes through empirical testing to ensure CI stability.
I am having trouble creating individual review comments. Click here to see my feedback.
pipelines/pingcap-inc/tidb/release-8.5/pod-pull_build.yaml (72-78)
Removing the ee-bigdisk node selector and tolerations while maintaining a high ephemeral-storage limit (150Gi) may lead to scheduling instability. While removing storage constraints can be an intentional performance improvement, stability is prioritized in CI. Please verify this change with empirical testing to ensure pods are not evicted when disk usage exceeds the 50Gi request on standard nodes.
References
- Parallelism and resource settings in CI jobs should be determined by empirical testing for stability and performance rather than strictly adhering to theoretical limits.
- Removing persistent volumes for caching can be an intentional change to improve build performance, even though it seems counterintuitive.
- In CI configurations, stability and behavioral consistency may be prioritized over refactoring, especially when the change is perceived as risky.
pipelines/pingcap-inc/tidb/release-8.5/pod-pull_unit_test.yaml (74-80)
Removing the ee-bigdisk constraints is risky given the 300Gi ephemeral-storage limit. Although such removals can be intentional performance optimizations, they may lead to non-deterministic failures if pods are scheduled on nodes unable to accommodate the limit. Please verify this change through empirical testing to maintain CI stability.
References
- Parallelism and resource settings in CI jobs should be determined by empirical testing for stability and performance rather than strictly adhering to theoretical limits.
- Removing persistent volumes for caching can be an intentional change to improve build performance, even though it seems counterintuitive.
- In CI configurations, stability and behavioral consistency may be prioritized over refactoring, especially when the change is perceived as risky.
|
/approve |
|
/hold |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: wuhuizuo The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
- Increase CPU/memory resources and reduce ephemeral storage requests - Replace emptyDir volumes with ephemeral PVCs using hyperdisk-rwo - Use checkoutRefsWithCacheLock instead of manual cache + retry logic - Add bazel-repository-cache volume mount for unit tests - Remove ci-nvme-high-performance node selector from check2 and br_test - Strip bazel cache mirrors and local repository cache references - Increase workspace volume size to 300Gi for unit tests
replace hardcoded dir name with `REFS.repo`
|
/unhold |
|
@wuhuizuo: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
pingcap-inc/tidbrelease-8.5pull_buildandpull_unit_testwith the correspondingpingcap/tidbhandlingcheckoutRefsWithCacheLock(..., withSubmodule = true)pingcap-inc/tidbpodspingcap/tidbpipeline scripts and pod configsTesting