Commit 837cf86

Pathwaysjob migration (#1108)
* Migrate PathwaysJob CRD to JobSet
  This commit switches the XPK workload generation template for pathways workloads from using the PathwaysJob CRD directly over to a standard JobSet object.
  - Updates PW_WORKLOAD_CREATE_YAML to use JobSet API
  - Modifies component YAML generation in pathways.py to output Pod containers rather than custom PathwaysJob components
  - Adds unit tests to workload_test.py to verify correct layout of pathways jobsets
* Fix args.worker_image attribute error in Pathways worker container string formatting
* Ensure Pathways JobSet parity with legacy PathwaysJob controller output
  * Convert proxy and RM sidecars to initContainers with restartPolicy: Always.
  * Ensure all container ports specify protocol: TCP.
  * Enforce restartPolicy: OnFailure for the worker template.
  * Inject required environment variables (JAX_PLATFORMS, JAX_BACKEND_TARGET, XCLOUD_ENVIRONMENT) natively into the primary user workload container.
  * Add completionMode: Indexed to both head and worker replicated jobs.
  * Set successPolicy, startupPolicy, and suspend fields to match legacy generated JobSet.
* Update pathways instance_type formatting in JobSet migration
  * Added get_pathways_instance_type to core/pathways.py
  * Format the gke_accelerator natively for the RM sidecar (e.g. tpuv6e:2x2)
* Update backoffLimit for Pathways worker to match legacy controller scaling
  * Set worker backoffLimit to args.max_slice_restarts * 4 to replicate the legacy controller's logic.
* Remove PathwaysJob CRD installation from cluster creation
  * Since XPK now deploys Pathways workloads using native JobSet API, we no longer need to install the PathwaysJob CRD when creating or adapting a cluster.
* Refactor append_custom_pathways_flags to remove magic number offsetting
  * Changed prev_indentation to base_indentation to specify exactly how many spaces to prepend.
  * Removed the convoluted len(indentation) - 2 calls in favor of passing base_indentation=16 directly.
* Refactor env variable injection to use regex
  * Instead of manually checking for multiple variants of trailing whitespace after the 'env:' key, use the re module to robustly identify and inject the required Pathways environment variables.
* Remove unnecessary else block from env injection
  * Since docker_container.py generates the container YAML template using `env: {env}\n`, the key will always be followed by an optional space and a newline, whether {env} is empty or not. Thus, the defensive else block was unreachable and unnecessary.
* Move regex import to top of pathways.py
  * Moved the inline 'import re' from get_user_workload_for_pathways to the module level.
* Simplify environment variable injection logic
  * Removed regex pattern matching in favor of a direct string replacement on 'env:' since 'env: {env}' always appears in the template even if {env} evaluates to an empty string. This ensures our injected block always gets prepended correctly before any user variables.
* Remove redundant 'env:' from env_injection string
  * Since we always append the variables directly after the 'env:' key from the main container template, we can just define the injected block without the 'env:' header entirely and avoid having to strip it out.
* Add return type hint to get_user_workload_container
  * Solves the 'Returning Any from function declared to return str' type checker error in pathways.py by explicitly defining the tuple return type of get_user_workload_container.
* Refactor test_workload_create_pathways_jobset_yaml assertions
  * Replaced hardcoded strings and numbers (like 'test-pw-workload', 'test-docker', and 2) with formatted variables linked directly to the mocked args (e.g. {args.workload}, {args.num_slices}) to make the unit test more robust and maintainable.
* Expand test_workload_create_pathways_jobset_yaml assertions
  * Added explicit assertions for the newly migrated JobSet fields including the coordinator block, network dns configurations, restart strategies, completion modes, and strict verification of the dynamically scaled backoffLimit calculation.
* Apply auto-formatting to pathways and workload files
* Fix inline worker_backoff_limit calculation inside format string
* Bypass CRD check during Pathways workload creation
* Pass system characteristics to dynamically calculate google.com/tpu limits
* Linter fix
* Fix unit tests
* fix linter
* Address PR feedback for Pathways JobSet migration
  * Fixed get_pathways_instance_type to explicitly map GCE machine types to TPU versions.
  * Fixed headless mode to properly inject proxy and RM into 'containers' instead of 'initContainers'.
  * Fixed worker backoffLimit calculation to accurately scale by vms_per_slice.
  * Respected --elastic-slices CLI argument by passing it to pathways-proxy.
  * Added deprecation warning for the legacy colocate_head_with_workers deployment mode.
  * Cleaned up dead targetReplicatedJob tracking and conditionally disabled successPolicy for headless workloads.
  * Updated unit tests to mock and assert the new conditional formats and fields correctly.
* update goldens
* fix linter
* Handle unknown pathways machine types and fix tests
* Refactor Pathways TPU version handling to use SystemCharacteristics directly
  * Added an optional field directly to .
  * Populated the for all supported TPU generations (v4, v5e, v5p, v6e, tpu7, tpu7x) directly in their system definitions in .
  * Removed the hardcoded fallback mapping and its associated exception from entirely, fetching the version dynamically from the system configuration instead.
* Enforce pathways_tpu_version as a mandatory keyword argument for TPU configurations
* Fix unit tests after changing get_tpu_system_characteristics_map signature
* some formatting changes
* golden updates
* cleanup
* Readme update
* Fix YAML indentation for pathways_head_containers injection
* Goldens update
* fix the indentation problem
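The env injection described in the commit message (a direct string replacement on the `env:` key, with the injected block defined without its own `env:` header) might look roughly like the sketch below. This is an illustration under stated assumptions, not the code from pathways.py: the function name `inject_pathways_env`, the indentation, and the variable values are all hypothetical. Only the three variable names come from the commit message.

```python
# Hypothetical values: the commit message names these variables but does not
# state what they are set to.
PATHWAYS_ENV = {
    "JAX_PLATFORMS": "proxy",
    "JAX_BACKEND_TARGET": "grpc://localhost:29000",
    "XCLOUD_ENVIRONMENT": "GCP",
}


def inject_pathways_env(container_yaml: str, indent: int = 2) -> str:
    """Insert the Pathways variables directly after the first 'env:' key.

    Assumes the container template always contains 'env:' (the commit message
    says it is rendered from 'env: {env}'), so no fallback branch is needed.
    """
    pad = " " * indent
    # The injected block has no 'env:' header of its own; it is appended to
    # the existing key so it lands before any user-supplied variables.
    injected = "".join(
        f'\n{pad}- name: {name}\n{pad}  value: "{value}"'
        for name, value in PATHWAYS_ENV.items()
    )
    # count=1 limits the replacement to the first 'env:' occurrence.
    return container_yaml.replace("env:", "env:" + injected, 1)


if __name__ == "__main__":
    template = 'name: main\nenv: \n  - name: USER_VAR\n    value: "1"\n'
    print(inject_pathways_env(template))
```

Because the replacement targets the key itself and not the whitespace after it, the same call works whether the user supplied variables or `{env}` rendered to an empty string.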
1 parent a72a752 commit 837cf86

25 files changed

Lines changed: 646 additions & 271 deletions

README.md

Lines changed: 0 additions & 1 deletion
@@ -86,7 +86,6 @@ XPK also supports the following [Google Cloud Storage solutions](./docs/usage/st
 | [JobSet](https://github.com/kubernetes-sigs/jobset) | Workload creation |
 | [Docker](https://docs.docker.com/engine/install/) | Building workload container |
 | [CoreDNS](https://github.com/coredns/deployment/tree/master/kubernetes) | Cluster set up |
-| [PathwaysJob](https://github.com/google/pathways-job) | Running Pathways workloads |

 # Privacy notice

recipes/Basic_cluster_create.md

Lines changed: 2 additions & 5 deletions
@@ -45,11 +45,11 @@ kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-
 [XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
 gcloud beta container clusters describe golden-cluster --location us-central1 --project golden-project --format="value(currentMasterVersion)"
 [XPK] Creating 1 node pool or pools of tpu7x-8
-We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2)
+We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
 [XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
 gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --location=us-central1 --format="csv[no-heading](name)"
 [XPK] Creating 1 node pool or pools of tpu7x-8
-Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2)
+Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
 [XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
 gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --location=us-central1 --format="value(locations)"
 [XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
@@ -170,9 +170,6 @@ spec:
 [XPK] Try 1: Updating jobset Controller Manager resources
 [XPK] Task: `Updating jobset Controller Manager resources` is implemented by the following command not running since it is a dry run.
 kubectl apply -f 1b31e624e490f9c8c4ef4e369f08d3fa467990af5a261e4405bd045265d70e95
-[XPK] Try 1: Install PathwaysJob on golden-cluster
-[XPK] Task: `Install PathwaysJob on golden-cluster` is implemented by the following command not running since it is a dry run.
-kubectl apply --server-side -f https://github.com/google/pathways-job/releases/download/v0.1.4/install.yaml
 [XPK] Enabling Kueue on the cluster
 [XPK] Task: `Get kueue version on server` is implemented by the following command not running since it is a dry run.
 kubectl get deployment kueue-controller-manager -n kueue-system -o jsonpath='{.spec.template.spec.containers[0].image}'
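
The new `pathways_tpu_version` value in the golden output above is what the commit message says feeds the RM sidecar's instance type (its example is `tpuv6e:2x2`). A minimal sketch of that lookup follows, assuming the instance type is simply the version and topology joined by a colon. The class `SystemCharacteristicsSketch` is a trimmed-down stand-in invented for this example, and the signature of `get_pathways_instance_type` shown here is a guess, not the one in core/pathways.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SystemCharacteristicsSketch:
    """Hypothetical stand-in carrying only the fields this sketch needs."""

    topology: str
    device_type: str
    pathways_tpu_version: Optional[str] = None


def get_pathways_instance_type(system: SystemCharacteristicsSketch) -> str:
    """Build '<tpu version>:<topology>' from the system definition.

    Assumed format, inferred from the 'tpuv6e:2x2' example in the commit
    message; the version is read from the system record, not from a
    hardcoded machine-type mapping.
    """
    if system.pathways_tpu_version is None:
        raise ValueError(
            f"No Pathways TPU version defined for {system.device_type}"
        )
    return f"{system.pathways_tpu_version}:{system.topology}"


if __name__ == "__main__":
    tpu7x_8 = SystemCharacteristicsSketch(
        topology="2x2x1",
        device_type="tpu7x-8",
        pathways_tpu_version="tpu7x",
    )
    print(get_pathways_instance_type(tpu7x_8))
```

Keeping the version on the system record means a new TPU generation only needs a new system definition, with no separate mapping to update.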

recipes/Cluster_create_RayCluster.md

Lines changed: 2 additions & 5 deletions
@@ -47,11 +47,11 @@ kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-
 [XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
 gcloud beta container clusters describe golden-cluster --location us-central1 --project golden-project --format="value(currentMasterVersion)"
 [XPK] Creating 1 node pool or pools of tpu7x-8
-We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2)
+We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
 [XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
 gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --location=us-central1 --format="csv[no-heading](name)"
 [XPK] Creating 1 node pool or pools of tpu7x-8
-Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2)
+Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
 [XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
 gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --location=us-central1 --format="value(locations)"
 [XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
@@ -173,9 +173,6 @@ spec:
 [XPK] Try 1: Updating jobset Controller Manager resources
 [XPK] Task: `Updating jobset Controller Manager resources` is implemented by the following command not running since it is a dry run.
 kubectl apply -f 1b31e624e490f9c8c4ef4e369f08d3fa467990af5a261e4405bd045265d70e95
-[XPK] Try 1: Install PathwaysJob on golden-cluster
-[XPK] Task: `Install PathwaysJob on golden-cluster` is implemented by the following command not running since it is a dry run.
-kubectl apply --server-side -f https://github.com/google/pathways-job/releases/download/v0.1.4/install.yaml
 [XPK] Enabling Kueue on the cluster
 [XPK] Task: `Get kueue version on server` is implemented by the following command not running since it is a dry run.
 kubectl get deployment kueue-controller-manager -n kueue-system -o jsonpath='{.spec.template.spec.containers[0].image}'

recipes/Cluster_create_for_multi-host_nodepool.md

Lines changed: 2 additions & 5 deletions
@@ -45,11 +45,11 @@ kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-
 [XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
 gcloud beta container clusters describe golden-cluster --location us-central1 --project golden-project --format="value(currentMasterVersion)"
 [XPK] Creating 1 node pool or pools of tpu7x-16
-We assume that the underlying system is: SystemCharacteristics(topology='2x2x2', vms_per_slice=2, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-16', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=True, gpu_config=None, parallel_containers=2)
+We assume that the underlying system is: SystemCharacteristics(topology='2x2x2', vms_per_slice=2, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-16', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=True, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
 [XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
 gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --location=us-central1 --format="csv[no-heading](name)"
 [XPK] Creating 1 node pool or pools of tpu7x-16
-Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x2', vms_per_slice=2, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-16', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=True, gpu_config=None, parallel_containers=2)
+Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x2', vms_per_slice=2, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-16', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=True, gpu_config=None, parallel_containers=2, pathways_tpu_version='tpu7x')
 [XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
 gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --location=us-central1 --format="value(locations)"
 [XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
@@ -172,9 +172,6 @@ spec:
 [XPK] Try 1: Updating jobset Controller Manager resources
 [XPK] Task: `Updating jobset Controller Manager resources` is implemented by the following command not running since it is a dry run.
 kubectl apply -f 1b31e624e490f9c8c4ef4e369f08d3fa467990af5a261e4405bd045265d70e95
-[XPK] Try 1: Install PathwaysJob on golden-cluster
-[XPK] Task: `Install PathwaysJob on golden-cluster` is implemented by the following command not running since it is a dry run.
-kubectl apply --server-side -f https://github.com/google/pathways-job/releases/download/v0.1.4/install.yaml
 [XPK] Enabling Kueue on the cluster
 [XPK] Task: `Get kueue version on server` is implemented by the following command not running since it is a dry run.
 kubectl get deployment kueue-controller-manager -n kueue-system -o jsonpath='{.spec.template.spec.containers[0].image}'

recipes/Cluster_create_for_single-host_nodepool.md

Lines changed: 2 additions & 5 deletions
@@ -45,11 +45,11 @@ kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-
 [XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
 gcloud beta container clusters describe golden-cluster --location us-central1 --project golden-project --format="value(currentMasterVersion)"
 [XPK] Creating 1 node pool or pools of v4-8
-We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v4-podslice', gce_machine_type='ct4p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v4-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1)
+We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v4-podslice', gce_machine_type='ct4p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v4-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1, pathways_tpu_version='tpuv4')
 [XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
 gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --location=us-central1 --format="csv[no-heading](name)"
 [XPK] Creating 1 node pool or pools of v4-8
-Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v4-podslice', gce_machine_type='ct4p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v4-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1)
+Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v4-podslice', gce_machine_type='ct4p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v4-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1, pathways_tpu_version='tpuv4')
 [XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
 gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --location=us-central1 --format="value(locations)"
 [XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
@@ -170,9 +170,6 @@ spec:
 [XPK] Try 1: Updating jobset Controller Manager resources
 [XPK] Task: `Updating jobset Controller Manager resources` is implemented by the following command not running since it is a dry run.
 kubectl apply -f 1b31e624e490f9c8c4ef4e369f08d3fa467990af5a261e4405bd045265d70e95
-[XPK] Try 1: Install PathwaysJob on golden-cluster
-[XPK] Task: `Install PathwaysJob on golden-cluster` is implemented by the following command not running since it is a dry run.
-kubectl apply --server-side -f https://github.com/google/pathways-job/releases/download/v0.1.4/install.yaml
 [XPK] Enabling Kueue on the cluster
 [XPK] Task: `Get kueue version on server` is implemented by the following command not running since it is a dry run.
 kubectl get deployment kueue-controller-manager -n kueue-system -o jsonpath='{.spec.template.spec.containers[0].image}'

recipes/Cluster_create_private.md

Lines changed: 2 additions & 5 deletions
@@ -51,11 +51,11 @@ kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-
 [XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
 gcloud beta container clusters describe golden-cluster-private --location us-central1 --project golden-project --format="value(currentMasterVersion)"
 [XPK] Creating 1 node pool or pools of v5p-8
-We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v5p-slice', gce_machine_type='ct5p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v5p-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1)
+We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v5p-slice', gce_machine_type='ct5p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v5p-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1, pathways_tpu_version='tpuv5')
 [XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
 gcloud beta container node-pools list --cluster golden-cluster-private --project=golden-project --location=us-central1 --format="csv[no-heading](name)"
 [XPK] Creating 1 node pool or pools of v5p-8
-Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v5p-slice', gce_machine_type='ct5p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v5p-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1)
+Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu-v5p-slice', gce_machine_type='ct5p-hightpu-4t', chips_per_vm=4, accelerator_type=TPU, device_type='v5p-8', supports_sub_slicing=False, supports_super_slicing=False, supports_accelerator_network_profile=False, docker_platform=<DockerPlatform.AMD: 'linux/amd64'>, requires_workload_policy=False, gpu_config=None, parallel_containers=1, pathways_tpu_version='tpuv5')
 [XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
 gcloud beta container node-pools describe 0 --cluster golden-cluster-private --project=golden-project --location=us-central1 --format="value(locations)"
 [XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
@@ -178,9 +178,6 @@ spec:
 [XPK] Try 1: Updating jobset Controller Manager resources
 [XPK] Task: `Updating jobset Controller Manager resources` is implemented by the following command not running since it is a dry run.
 kubectl apply -f 1b31e624e490f9c8c4ef4e369f08d3fa467990af5a261e4405bd045265d70e95
-[XPK] Try 1: Install PathwaysJob on golden-cluster-private
-[XPK] Task: `Install PathwaysJob on golden-cluster-private` is implemented by the following command not running since it is a dry run.
-kubectl apply --server-side -f https://github.com/google/pathways-job/releases/download/v0.1.4/install.yaml
 [XPK] Enabling Kueue on the cluster
 [XPK] Task: `Get kueue version on server` is implemented by the following command not running since it is a dry run.
 kubectl get deployment kueue-controller-manager -n kueue-system -o jsonpath='{.spec.template.spec.containers[0].image}'
