Add Kueue default config for Dynamic Slicing#5693
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the GKE TPU workload management by integrating support for Kueue dynamic slicing. It updates the infrastructure configuration to allow users to enable dynamic slicing via new variables, introduces dedicated configuration templates, and adds robust validation logic to ensure that the required cluster settings are correctly configured for this feature. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces support for TPU dynamic slicing in GKE by adding new Kueue configuration templates and updating the kubectl-apply module with selection logic and validation rules. Review feedback identifies a need to broaden the machine_type validation regex to include ct types, suggests refactoring nested ternary expressions into a map lookup for improved maintainability, and recommends removing hardcoded TPU partition labels in the templates to support varied topologies.
| error_message = "When enable_dynamic_slicing_for_tpus is true, enable_slice_controller must be true." | ||
| } | ||
| precondition { | ||
| condition = coalesce(var.kueue.enable_dynamic_slicing_for_tpus, false) == false || length(regexall("^tpu", coalesce(var.kueue.machine_type, ""))) > 0 |
There was a problem hiding this comment.
The regular expression ^tpu to validate the machine_type is too restrictive. It will incorrectly fail for valid TPU machine types that start with ct, such as ct6e-standard-4t used in the gke-tpu-v6e example. To support all TPU machine types, please broaden the regex.
condition = coalesce(var.kueue.enable_dynamic_slicing_for_tpus, false) == false || length(regexall("^(tpu|ct)", coalesce(var.kueue.machine_type, ""))) > 0
| kueue_default_config_template = (local.enable_pathways && local.enable_slicing) ? "${path.module}/kueue/kueue-configuration-dynamic-slicing-pathways.yaml.tftpl" : ( | ||
| local.enable_slicing ? "${path.module}/kueue/kueue-configuration-dynamic-slicing.yaml.tftpl" : ( | ||
| local.enable_pathways ? "${path.module}/kueue/kueue-configuration-pathways.yaml.tftpl" : "" | ||
| ) | ||
| ) |
There was a problem hiding this comment.
For better readability and maintainability, consider refactoring the nested ternary expression into a lookup on a map. This approach makes the logic clearer and simplifies adding more configuration options in the future.
kueue_default_config_template = lookup({
"true-true" = "${path.module}/kueue/kueue-configuration-dynamic-slicing-pathways.yaml.tftpl",
"false-true" = "${path.module}/kueue/kueue-configuration-dynamic-slicing.yaml.tftpl",
"true-false" = "${path.module}/kueue/kueue-configuration-pathways.yaml.tftpl",
}, "${local.enable_pathways}-${local.enable_slicing}", "")
| spec: | ||
| levels: | ||
| - nodeLabel: cloud.google.com/gce-topology-block | ||
| - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id |
There was a problem hiding this comment.
The nodeLabel cloud.google.com/gke-tpu-partition-4x4x4-id is hardcoded, which is specific to a 4x4x4 TPU topology. This will cause issues for users with different TPU topologies. For a default configuration, it's better to be more generic. Consider removing this topology level to make the configuration applicable to a wider range of TPU setups.
There was a problem hiding this comment.
WAI, as cubes are always 4x4x4
| spec: | ||
| levels: | ||
| - nodeLabel: cloud.google.com/gce-topology-block | ||
| - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id |
There was a problem hiding this comment.
The nodeLabel cloud.google.com/gke-tpu-partition-4x4x4-id is hardcoded, which is specific to a 4x4x4 TPU topology. This will cause issues for users with different TPU topologies. For a default configuration, it's better to be more generic. Consider removing this topology level to make the configuration applicable to a wider range of TPU setups.
There was a problem hiding this comment.
WAI, as cubes are always 4x4x4
There was a problem hiding this comment.
Code Review
This pull request introduces support for TPU Dynamic Slicing across GKE blueprints and the kubectl-apply management module, adding new Kueue configuration templates and validation logic. Feedback identifies that some blueprint settings reference non-existent module outputs, which should be resolved by defining them in the source modules to support automatic wiring. Additionally, the regex for TPU machine type validation is too restrictive and needs expansion, while hardcoded node labels in the templates should be parameterized for reusability. A redundant try() function in the Terraform configuration was also noted for removal.
| accelerator_topology_mode: $(gke-tpu-7x-pool.accelerator_topology_mode) | ||
| machine_type: $(gke-tpu-7x-pool.machine_type) | ||
| enable_slice_controller: $(gke-tpu-7x-cluster.enable_slice_controller) |
There was a problem hiding this comment.
These settings reference non-existent outputs from the gke-tpu-7x-pool and gke-tpu-7x-cluster modules. To resolve this, ensure the source modules define these as outputs. This allows the Cluster Toolkit to leverage automatic variable wiring by matching output names with variable names, which is preferred over manual wiring like $(vars.machine_type). Additionally, the enable_slice_controller setting should only be enabled when the 'Dynamic Super Slicing' feature is explicitly needed.
References
- In the Cluster Toolkit framework, leverage automatic variable wiring by matching output names from a dependency module with top-level variable names in the consuming module to avoid redundant manual wiring.
- The enable_slice_controller setting is not required to be enabled by default and should only be enabled when the 'Dynamic Super Slicing' feature is explicitly needed.
| accelerator_topology_mode: $(gke-tpu-v6e-pool.accelerator_topology_mode) | ||
| machine_type: $(gke-tpu-v6e-pool.machine_type) | ||
| enable_slice_controller: $(gke-tpu-v6e-cluster.enable_slice_controller) |
There was a problem hiding this comment.
These settings reference non-existent outputs from the gke-tpu-v6e-pool and gke-tpu-v6e-cluster modules. Ensure these modules define the required outputs to leverage automatic variable wiring. This avoids manual wiring and ensures consistency across the toolkit. Also, verify if enable_slice_controller is explicitly needed for this specific blueprint.
References
- In the Cluster Toolkit framework, leverage automatic variable wiring by matching output names from a dependency module with top-level variable names in the consuming module to avoid redundant manual wiring.
- The enable_slice_controller setting is not required to be enabled by default and should only be enabled when the 'Dynamic Super Slicing' feature is explicitly needed.
| error_message = "When enable_dynamic_slicing_for_tpus is true, enable_slice_controller must be true." | ||
| } | ||
| precondition { | ||
| condition = coalesce(var.kueue.enable_dynamic_slicing_for_tpus, false) == false || length(regexall("^tpu", coalesce(var.kueue.machine_type, ""))) > 0 |
There was a problem hiding this comment.
The regex ^tpu is too restrictive and will fail for TPU v6e (ct6e-...) and other generations (v4-..., v5-...). Expand the regex to include other valid TPU machine type prefixes.
condition = coalesce(var.kueue.enable_dynamic_slicing_for_tpus, false) == false || length(regexall("^(tpu|ct6e|v)", coalesce(var.kueue.machine_type, ""))) > 0
| spec: | ||
| levels: | ||
| - nodeLabel: cloud.google.com/gce-topology-block | ||
| - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id |
There was a problem hiding this comment.
WAI, as cubes are always 4x4x4
| spec: | ||
| levels: | ||
| - nodeLabel: cloud.google.com/gce-topology-block | ||
| - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id |
There was a problem hiding this comment.
WAI, as cubes are always 4x4x4
8237060 to
48a7548
Compare
1bbae7d to
7a8e804
Compare
7a8e804 to
b646e94
Compare
| spec: | ||
| levels: | ||
| - nodeLabel: cloud.google.com/gce-topology-block | ||
| - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id |
There was a problem hiding this comment.
WAI, as cubes are always 4x4x4
| spec: | ||
| levels: | ||
| - nodeLabel: cloud.google.com/gce-topology-block | ||
| - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id |
There was a problem hiding this comment.
WAI, as cubes are always 4x4x4
| spec: | ||
| levels: | ||
| - nodeLabel: cloud.google.com/gce-topology-block | ||
| - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id |
There was a problem hiding this comment.
WAI, as cubes are always 4x4x4
| spec: | ||
| levels: | ||
| - nodeLabel: cloud.google.com/gce-topology-block | ||
| - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id |
There was a problem hiding this comment.
WAI, as cubes are always 4x4x4
Introduce enable_dynamic_slicing_for_tpus, adding:
the two more default Kueue configs,
Controller overrides defaults,
Validations.