Add Kueue default config for Dynamic Slicing #5693

Open. FIoannides wants to merge 2 commits into GoogleCloudPlatform:develop from FIoannides:feature/dynamic-slicing-kueue
FIoannides commented May 20, 2026

Introduce enable_dynamic_slicing_for_tpus, adding:

  • two more default Kueue configs,
  • controller overrides defaults,
  • validations.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances GKE TPU workload management by adding support for Kueue dynamic slicing. It updates the infrastructure configuration so that users can enable dynamic slicing through new variables, introduces dedicated configuration templates, and adds validation logic to ensure that the required cluster settings are configured correctly for this feature.

Highlights

  • Dynamic Slicing Support: Added support for GKE TPU dynamic slicing in Kueue configurations, allowing for more flexible resource allocation.
  • Configuration Templates: Introduced new Terraform templates for Kueue configuration to support dynamic slicing, both independently and in combination with Pathways.
  • Validation Logic: Implemented Terraform preconditions to ensure correct configuration of accelerator topology mode, slice controller, and machine types when dynamic slicing is enabled.
github-actions bot added the external (PR from external contributor) label on May 20, 2026

gemini-code-assist bot left a comment

Code Review

This pull request introduces support for TPU dynamic slicing in GKE by adding new Kueue configuration templates and updating the kubectl-apply module with selection logic and validation rules. Review feedback identifies a need to broaden the machine_type validation regex to include ct types, suggests refactoring nested ternary expressions into a map lookup for improved maintainability, and recommends removing hardcoded TPU partition labels in the templates to support varied topologies.

    error_message = "When enable_dynamic_slicing_for_tpus is true, enable_slice_controller must be true."
  }
  precondition {
    condition = coalesce(var.kueue.enable_dynamic_slicing_for_tpus, false) == false || length(regexall("^tpu", coalesce(var.kueue.machine_type, ""))) > 0

Severity: high

The regular expression ^tpu to validate the machine_type is too restrictive. It will incorrectly fail for valid TPU machine types that start with ct, such as ct6e-standard-4t used in the gke-tpu-v6e example. To support all TPU machine types, please broaden the regex.

      condition     = coalesce(var.kueue.enable_dynamic_slicing_for_tpus, false) == false || length(regexall("^(tpu|ct)", coalesce(var.kueue.machine_type, ""))) > 0
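
The effect of the broadened pattern can be checked in isolation. The following is a minimal sketch for a scratch Terraform configuration, not code from this PR; the sample machine type names and the `looks_like_tpu` local are illustrative assumptions rather than an authoritative list.

```hcl
# Sketch only: the sample names below are assumptions used to exercise the
# pattern; they are not taken from the PR.
locals {
  sample_machine_types = [
    "tpu7x-standard-4t",
    "ct6e-standard-4t",
    "ct5p-hightpu-4t",
    "n2-standard-8",
  ]

  # regexall returns a list of matches, so a non-empty list means the
  # machine type starts with one of the accepted prefixes.
  looks_like_tpu = {
    for mt in local.sample_machine_types :
    mt => length(regexall("^(tpu|ct)", mt)) > 0
  }
}

output "looks_like_tpu" {
  value = local.looks_like_tpu
}
```

With these inputs, the three names that begin with `tpu` or `ct` should map to true and `n2-standard-8` to false, which is the behaviour the suggested precondition relies on.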

Comment on lines +25 to +29
kueue_default_config_template = (local.enable_pathways && local.enable_slicing) ? "${path.module}/kueue/kueue-configuration-dynamic-slicing-pathways.yaml.tftpl" : (
  local.enable_slicing ? "${path.module}/kueue/kueue-configuration-dynamic-slicing.yaml.tftpl" : (
    local.enable_pathways ? "${path.module}/kueue/kueue-configuration-pathways.yaml.tftpl" : ""
  )
)

Severity: medium

For better readability and maintainability, consider refactoring the nested ternary expression into a lookup on a map. This approach makes the logic clearer and simplifies adding more configuration options in the future.

  kueue_default_config_template = lookup({
    "true-true"  = "${path.module}/kueue/kueue-configuration-dynamic-slicing-pathways.yaml.tftpl",
    "false-true" = "${path.module}/kueue/kueue-configuration-dynamic-slicing.yaml.tftpl",
    "true-false" = "${path.module}/kueue/kueue-configuration-pathways.yaml.tftpl",
  }, "${local.enable_pathways}-${local.enable_slicing}", "")
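
The suggestion depends on booleans rendering as "true" or "false" inside a string template, and on `lookup` returning its third argument when the key is missing. A self-contained sketch of that behaviour follows; the hard-coded flag values and the `template_key` and `template_by_key` names are hypothetical stand-ins, and the file names are shortened by omitting `path.module`.

```hcl
# Stand-in values; in the module these locals are derived from the Kueue settings.
locals {
  enable_pathways = false
  enable_slicing  = true

  # Booleans are converted to the strings "true" and "false" in templates,
  # so this key evaluates to "false-true".
  template_key = "${local.enable_pathways}-${local.enable_slicing}"

  template_by_key = {
    "true-true"  = "kueue-configuration-dynamic-slicing-pathways.yaml.tftpl"
    "false-true" = "kueue-configuration-dynamic-slicing.yaml.tftpl"
    "true-false" = "kueue-configuration-pathways.yaml.tftpl"
  }

  # "false-false" has no entry, so lookup falls back to the empty string,
  # matching the final branch of the nested ternary.
  kueue_default_config_template = lookup(local.template_by_key, local.template_key, "")
}

output "kueue_default_config_template" {
  value = local.kueue_default_config_template
}
```

One trade-off of this design is that the key order (pathways first, slicing second) is implicit in the string, so a comment next to the map is worth keeping.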

spec:
  levels:
  - nodeLabel: cloud.google.com/gce-topology-block
  - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id

Severity: medium

The nodeLabel cloud.google.com/gke-tpu-partition-4x4x4-id is hardcoded, which is specific to a 4x4x4 TPU topology. This will cause issues for users with different TPU topologies. For a default configuration, it's better to be more generic. Consider removing this topology level to make the configuration applicable to a wider range of TPU setups.

WAI (working as intended), as cubes are always 4x4x4

spec:
  levels:
  - nodeLabel: cloud.google.com/gce-topology-block
  - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id

Severity: medium

The nodeLabel cloud.google.com/gke-tpu-partition-4x4x4-id is hardcoded, which is specific to a 4x4x4 TPU topology. This will cause issues for users with different TPU topologies. For a default configuration, it's better to be more generic. Consider removing this topology level to make the configuration applicable to a wider range of TPU setups.

WAI (working as intended), as cubes are always 4x4x4

gemini-code-assist bot left a comment

Code Review

This pull request introduces support for TPU Dynamic Slicing across GKE blueprints and the kubectl-apply management module, adding new Kueue configuration templates and validation logic. Feedback identifies that some blueprint settings reference non-existent module outputs, which should be resolved by defining them in the source modules to support automatic wiring. Additionally, the regex for TPU machine type validation is too restrictive and needs expansion, while hardcoded node labels in the templates should be parameterized for reusability. A redundant try() function in the Terraform configuration was also noted for removal.

Comment on lines +306 to +308
accelerator_topology_mode: $(gke-tpu-7x-pool.accelerator_topology_mode)
machine_type: $(gke-tpu-7x-pool.machine_type)
enable_slice_controller: $(gke-tpu-7x-cluster.enable_slice_controller)

Severity: high

These settings reference non-existent outputs from the gke-tpu-7x-pool and gke-tpu-7x-cluster modules. To resolve this, ensure the source modules define these as outputs. This allows the Cluster Toolkit to leverage automatic variable wiring by matching output names with variable names, which is preferred over manual wiring like $(vars.machine_type). Additionally, the enable_slice_controller setting should only be enabled when the 'Dynamic Super Slicing' feature is explicitly needed.

References
  1. In the Cluster Toolkit framework, leverage automatic variable wiring by matching output names from a dependency module with top-level variable names in the consuming module to avoid redundant manual wiring.
  2. The enable_slice_controller setting is not required to be enabled by default and should only be enabled when the 'Dynamic Super Slicing' feature is explicitly needed.
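
For the blueprint references above to resolve, the pool module would need outputs with matching names. The following is a minimal sketch of what such an `outputs.tf` addition might look like, assuming the module already accepts `machine_type` and `accelerator_topology_mode` as inputs; the variable declarations and descriptions are hypothetical and included only to keep the fragment self-contained.

```hcl
# Hypothetical inputs, declared here only so the sketch stands on its own.
variable "machine_type" {
  description = "Machine type of the TPU node pool."
  type        = string
}

variable "accelerator_topology_mode" {
  description = "Accelerator topology mode of the TPU node pool."
  type        = string
  default     = null
}

# Outputs named after the settings the blueprint references, so that
# $(gke-tpu-7x-pool.machine_type) and similar expressions can resolve.
output "machine_type" {
  description = "Machine type of the TPU node pool."
  value       = var.machine_type
}

output "accelerator_topology_mode" {
  description = "Accelerator topology mode of the TPU node pool."
  value       = var.accelerator_topology_mode
}
```

Whether the toolkit then wires these values automatically into the consuming module's settings is not confirmed in this thread, so the explicit references in the blueprint may still be needed.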

Comment on lines +306 to +308
accelerator_topology_mode: $(gke-tpu-v6e-pool.accelerator_topology_mode)
machine_type: $(gke-tpu-v6e-pool.machine_type)
enable_slice_controller: $(gke-tpu-v6e-cluster.enable_slice_controller)

Severity: high

These settings reference non-existent outputs from the gke-tpu-v6e-pool and gke-tpu-v6e-cluster modules. Ensure these modules define the required outputs to leverage automatic variable wiring. This avoids manual wiring and ensures consistency across the toolkit. Also, verify if enable_slice_controller is explicitly needed for this specific blueprint.

References
  1. In the Cluster Toolkit framework, leverage automatic variable wiring by matching output names from a dependency module with top-level variable names in the consuming module to avoid redundant manual wiring.
  2. The enable_slice_controller setting is not required to be enabled by default and should only be enabled when the 'Dynamic Super Slicing' feature is explicitly needed.

    error_message = "When enable_dynamic_slicing_for_tpus is true, enable_slice_controller must be true."
  }
  precondition {
    condition = coalesce(var.kueue.enable_dynamic_slicing_for_tpus, false) == false || length(regexall("^tpu", coalesce(var.kueue.machine_type, ""))) > 0

Severity: high

The regex ^tpu is too restrictive and will fail for TPU v6e (ct6e-...) and other generations (v4-..., v5-...). Expand the regex to include other valid TPU machine type prefixes.

      condition     = coalesce(var.kueue.enable_dynamic_slicing_for_tpus, false) == false || length(regexall("^(tpu|ct6e|v)", coalesce(var.kueue.machine_type, ""))) > 0

spec:
  levels:
  - nodeLabel: cloud.google.com/gce-topology-block
  - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id

Severity: medium

The node label cloud.google.com/gke-tpu-partition-4x4x4-id is specific to a 4x4x4 topology. Hardcoding this in a default template limits its reusability. Consider parameterizing this label via config_template_vars.

WAI (working as intended), as cubes are always 4x4x4

spec:
  levels:
  - nodeLabel: cloud.google.com/gce-topology-block
  - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id

Severity: medium

The node label cloud.google.com/gke-tpu-partition-4x4x4-id is specific to a 4x4x4 topology. Hardcoding this in a default template limits its reusability. Consider parameterizing this label via config_template_vars.

WAI (working as intended), as cubes are always 4x4x4

Comment thread modules/management/kubectl-apply/main.tf Outdated
FIoannides force-pushed the feature/dynamic-slicing-kueue branch from 8237060 to 48a7548 on May 20, 2026 13:34
Comment thread examples/gke-tpu-7x/gke-tpu-7x-advanced.yaml
Comment thread modules/management/kubectl-apply/variables.tf
Comment thread modules/management/kubectl-apply/main.tf Outdated
Comment thread modules/management/kubectl-apply/main.tf Outdated
Comment thread modules/management/kubectl-apply/variables.tf Outdated
Comment thread modules/management/kubectl-apply/variables.tf
Comment thread modules/management/kubectl-apply/variables.tf Outdated
Comment thread examples/gke-tpu-v6e/gke-tpu-v6e-advanced.yaml
FIoannides force-pushed the feature/dynamic-slicing-kueue branch 6 times, most recently from 1bbae7d to 7a8e804, on May 20, 2026 15:50
FIoannides force-pushed the feature/dynamic-slicing-kueue branch from 7a8e804 to b646e94 on May 20, 2026 15:55
Comment thread modules/management/kubectl-apply/variables.tf
FIoannides marked this pull request as ready for review on May 21, 2026 10:17
FIoannides requested a review from a team as a code owner on May 21, 2026 10:17