Skip to content

fix(pipelines): align tidb CI handling and fix hardcoded usages#4626

Merged
ti-chi-bot[bot] merged 3 commits into
mainfrom
agent/coder/ecaf4137
May 25, 2026
Merged

fix(pipelines): align tidb CI handling and fix hardcoded usages#4626
ti-chi-bot[bot] merged 3 commits into
mainfrom
agent/coder/ecaf4137

Conversation

@wuhuizuo
Copy link
Copy Markdown
Contributor

@wuhuizuo wuhuizuo commented May 22, 2026

Summary

  • align pingcap-inc/tidb release-8.5 pull_build and pull_unit_test with the corresponding pingcap/tidb handling
  • keep private-repo checkout compatible by using the repository credential flow with checkoutRefsWithCacheLock(..., withSubmodule = true)
  • remove outdated bigdisk scheduling settings from the affected pingcap-inc/tidb pods
  • fix hardcoded repo and workspace usages in related pingcap/tidb pipeline scripts and pod configs

Testing

  • not run (PR metadata update only)

Copy link
Copy Markdown

@ti-chi-bot ti-chi-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already done a preliminary review for you, and I hope to help you do a better job.

Summary:
This PR removes the ee-bigdisk nodeSelector and corresponding tolerations from the pingcap-inc/tidb release-8.5 pod-pull build and unit test pipeline manifests. The intent is to align these manifests with the existing pingcap/tidb release-8.5 scheduling configuration by no longer restricting pods to nodes labeled with ee-bigdisk=true. The changes are straightforward YAML removals and appear to be limited to only the relevant pipeline manifests. Overall, the PR is clear and concise with no apparent errors.


Code Improvements

  • Consistency Verification:

    • Files:
      pipelines/pingcap-inc/tidb/release-8.5/pod-pull_build.yaml (lines ~69-79)
      pipelines/pingcap-inc/tidb/release-8.5/pod-pull_unit_test.yaml (lines ~71-81)
    • Issue:
      The PR description mentions matching the configuration to the existing pingcap/tidb release-8.5 scheduling, but there is no evidence in the diff or PR that the target manifests have been verified to confirm this exact alignment. It’s important to ensure that the removal here does not unintentionally schedule these pods on undesired nodes.
    • Suggested Solution:
      Add a comment or note in the PR or codebase referencing the source configuration that these changes are being aligned with. Possibly include a link or a snippet of the pingcap/tidb manifests for reviewer clarity.
  • Error Handling / Validation:

    • Files: Same as above.
    • Issue:
      Removing nodeSelector and tolerations changes pod scheduling behavior, which could cause pods to land on unsuitable nodes if cluster labels or taints change in the future.
    • Suggested Solution:
      Consider adding a validation step in the CI pipeline or a pre-submit hook that verifies pods are scheduled on appropriate nodes (e.g., nodes with sufficient disk or resources), or document the expected node labeling standards clearly.

Best Practices

  • Documentation:

    • Files: Both YAML manifests, near the removed sections.
    • Issue:
      The removed nodeSelector and tolerations are a critical part of pod scheduling; removing them can affect cluster resource usage. There are no comments explaining why these were removed or what the new expected scheduling behavior is.
    • Suggested Solution:
      Add a comment block above the spec: section or near the scheduling configuration explaining the rationale for removal, for example:
      # Removed ee-bigdisk nodeSelector and tolerations to align with pingcap/tidb release-8.5 scheduling.
      # Pods will now be scheduled on any node matching the default selectors without requiring the ee-bigdisk label.
  • Testing Coverage:

    • Files: N/A (pipeline YAMLs)
    • Issue:
      Changes to scheduling constraints might affect pod placement and test results, but no mention is made of testing these changes.
    • Suggested Solution:
      Ensure that test runs triggered by these updated pipelines are monitored for any unexpected pod scheduling failures or resource issues. If possible, add automated checks or alerts for pod evictions or scheduling delays.

No critical issues or security concerns are present given the limited scope of the changes. The main improvements focus on clarifying intent, ensuring alignment is verified, and documenting the impact for future maintainers.

@ti-chi-bot ti-chi-bot Bot added the size/S label May 22, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the ee-bigdisk node selectors and tolerations from the build and unit test pod configurations for TiDB release 8.5. The reviewer expressed concerns that removing these constraints while maintaining high ephemeral storage limits (up to 300Gi) could lead to scheduling instability or pod evictions on standard nodes that cannot accommodate such limits. It is recommended to verify these changes through empirical testing to ensure CI stability.

I am having trouble creating individual review comments. Click here to see my feedback.

pipelines/pingcap-inc/tidb/release-8.5/pod-pull_build.yaml (72-78)

medium

Removing the ee-bigdisk node selector and tolerations while maintaining a high ephemeral-storage limit (150Gi) may lead to scheduling instability. While removing storage constraints can be an intentional performance improvement, stability is prioritized in CI. Please verify this change with empirical testing to ensure pods are not evicted when disk usage exceeds the 50Gi request on standard nodes.

References
  1. Parallelism and resource settings in CI jobs should be determined by empirical testing for stability and performance rather than strictly adhering to theoretical limits.
  2. Removing persistent volumes for caching can be an intentional change to improve build performance, even though it seems counterintuitive.
  3. In CI configurations, stability and behavioral consistency may be prioritized over refactoring, especially when the change is perceived as risky.

pipelines/pingcap-inc/tidb/release-8.5/pod-pull_unit_test.yaml (74-80)

medium

Removing the ee-bigdisk constraints is risky given the 300Gi ephemeral-storage limit. Although such removals can be intentional performance optimizations, they may lead to non-deterministic failures if pods are scheduled on nodes unable to accommodate the limit. Please verify this change through empirical testing to maintain CI stability.

References
  1. Parallelism and resource settings in CI jobs should be determined by empirical testing for stability and performance rather than strictly adhering to theoretical limits.
  2. Removing persistent volumes for caching can be an intentional change to improve build performance, even though it seems counterintuitive.
  3. In CI configurations, stability and behavioral consistency may be prioritized over refactoring, especially when the change is perceived as risky.

@wuhuizuo
Copy link
Copy Markdown
Contributor Author

/approve

@ti-chi-bot ti-chi-bot Bot added the approved label May 22, 2026
@wuhuizuo
Copy link
Copy Markdown
Contributor Author

/hold

@wuhuizuo
Copy link
Copy Markdown
Contributor Author

/approve

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 22, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wuhuizuo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wuhuizuo added 2 commits May 25, 2026 17:46
- Increase CPU/memory resources and reduce ephemeral storage requests
- Replace emptyDir volumes with ephemeral PVCs using hyperdisk-rwo
- Use checkoutRefsWithCacheLock instead of manual cache + retry logic
- Add bazel-repository-cache volume mount for unit tests
- Remove ci-nvme-high-performance node selector from check2 and br_test
- Strip bazel cache mirrors and local repository cache references
- Increase workspace volume size to 300Gi for unit tests
replace hardcoded dir name with `REFS.repo`
@ti-chi-bot ti-chi-bot Bot added size/XL and removed size/S labels May 25, 2026
@ti-chi-bot ti-chi-bot changed the title ci(pipelines): remove bigdisk nodeSelector from pingcap-inc/tidb release-8.5 pods fix(pipelines): align tidb CI handling and fix hardcoded usages May 25, 2026
@wuhuizuo wuhuizuo marked this pull request as ready for review May 25, 2026 09:54
@wuhuizuo
Copy link
Copy Markdown
Contributor Author

/unhold

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 25, 2026

@wuhuizuo: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-replay-jenkins-pipelines 7959255 link false /test pull-replay-jenkins-pipelines

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ti-chi-bot ti-chi-bot Bot merged commit 5ce24f9 into main May 25, 2026
7 of 8 checks passed
@ti-chi-bot ti-chi-bot Bot deleted the agent/coder/ecaf4137 branch May 25, 2026 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

1 participant