Commit 837cf86
Pathwaysjob migration (#1108)
* Migrate PathwaysJob CRD to JobSet
  This commit switches the XPK workload generation template for Pathways workloads from using the PathwaysJob CRD directly over to a standard JobSet object.
  - Updates PW_WORKLOAD_CREATE_YAML to use JobSet API
  - Modifies component YAML generation in pathways.py to output Pod containers rather than custom PathwaysJob components
  - Adds unit tests to workload_test.py to verify correct layout of Pathways JobSets
* Fix args.worker_image attribute error in Pathways worker container string formatting
* Ensure Pathways JobSet parity with legacy PathwaysJob controller output
  * Convert proxy and RM sidecars to initContainers with restartPolicy: Always.
  * Ensure all container ports specify protocol: TCP.
  * Enforce restartPolicy: OnFailure for the worker template.
  * Inject required environment variables (JAX_PLATFORMS, JAX_BACKEND_TARGET, XCLOUD_ENVIRONMENT) natively into the primary user workload container.
  * Add completionMode: Indexed to both head and worker replicated jobs.
  * Set successPolicy, startupPolicy, and suspend fields to match legacy generated JobSet.
* Update Pathways instance_type formatting in JobSet migration
  * Added get_pathways_instance_type to core/pathways.py
  * Format the gke_accelerator natively for the RM sidecar (e.g. tpuv6e:2x2)
* Update backoffLimit for Pathways worker to match legacy controller scaling
  * Set worker backoffLimit to args.max_slice_restarts * 4 to replicate the legacy controller's logic.
* Remove PathwaysJob CRD installation from cluster creation
  * Since XPK now deploys Pathways workloads using native JobSet API, we no longer need to install the PathwaysJob CRD when creating or adapting a cluster.
* Refactor append_custom_pathways_flags to remove magic number offsetting
  * Changed prev_indentation to base_indentation to specify exactly how many spaces to prepend.
  * Removed the convoluted len(indentation) - 2 calls in favor of passing base_indentation=16 directly.
* Refactor env variable injection to use regex
  * Instead of manually checking for multiple variants of trailing whitespace after the 'env:' key, use the re module to robustly identify and inject the required Pathways environment variables.
* Remove unnecessary else block from env injection
  * Since docker_container.py generates the container YAML template using `env: {env}\n`, the key will always be followed by an optional space and a newline, whether {env} is empty or not. Thus, the defensive else block was unreachable and unnecessary.
* Move regex import to top of pathways.py
  * Moved the inline 'import re' from get_user_workload_for_pathways to the module level.
* Simplify environment variable injection logic
  * Removed regex pattern matching in favor of a direct string replacement on 'env:' since 'env: {env}' always appears in the template even if {env} evaluates to an empty string. This ensures our injected block always gets prepended correctly before any user variables.
* Remove redundant 'env:' from env_injection string
  * Since we always append the variables directly after the 'env:' key from the main container template, we can just define the injected block without the 'env:' header entirely and avoid having to strip it out.
* Add return type hint to get_user_workload_container
  * Solves the 'Returning Any from function declared to return str' type checker error in pathways.py by explicitly defining the tuple return type of get_user_workload_container.
* Refactor test_workload_create_pathways_jobset_yaml assertions
  * Replaced hardcoded strings and numbers (like 'test-pw-workload', 'test-docker', and 2) with formatted variables linked directly to the mocked args (e.g. {args.workload}, {args.num_slices}) to make the unit test more robust and maintainable.
* Expand test_workload_create_pathways_jobset_yaml assertions
  * Added explicit assertions for the newly migrated JobSet fields including the coordinator block, network DNS configurations, restart strategies, completion modes, and strict verification of the dynamically scaled backoffLimit calculation.
* Apply auto-formatting to pathways and workload files
* Fix inline worker_backoff_limit calculation inside format string
* Bypass CRD check during Pathways workload creation
* Pass system characteristics to dynamically calculate google.com/tpu limits
* Linter fix
* Fix unit tests
* fix linter
* Address PR feedback for Pathways JobSet migration
  * Fixed get_pathways_instance_type to explicitly map GCE machine types to TPU versions.
  * Fixed headless mode to properly inject proxy and RM into 'containers' instead of 'initContainers'.
  * Fixed worker backoffLimit calculation to accurately scale by vms_per_slice.
  * Respected --elastic-slices CLI argument by passing it to pathways-proxy.
  * Added deprecation warning for the legacy colocate_head_with_workers deployment mode.
  * Cleaned up dead targetReplicatedJob tracking and conditionally disabled successPolicy for headless workloads.
  * Updated unit tests to mock and assert the new conditional formats and fields correctly.
* update goldens
* fix linter
* Handle unknown pathways machine types and fix tests
* Refactor Pathways TPU version handling to use SystemCharacteristics directly
  * Added an optional Pathways TPU version field directly to SystemCharacteristics.
  * Populated the field for all supported TPU generations (v4, v5e, v5p, v6e, tpu7, tpu7x) directly in their system definitions.
  * Removed the hardcoded fallback mapping and its associated exception entirely, fetching the version dynamically from the system configuration instead.
* Enforce pathways_tpu_version as a mandatory keyword argument for TPU configurations
* Fix unit tests after changing get_tpu_system_characteristics_map signature
* some formatting changes
* golden updates
* cleanup
* Readme update
* Fix YAML indentation for pathways_head_containers injection
* Goldens update
* fix the indentation problem
1 parent a72a752 commit 837cf86
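
The env-injection entries in the commit message settle on a plain string replacement on the 'env:' key, with the injected block defined without its own 'env:' header. A minimal sketch of that idea, not the code from pathways.py: the helper name, the `indent` parameter, and the variable values are hypothetical, and the template string only imitates the `env: {env}\n` shape the message describes.

```python
def inject_env_after_key(
    container_yaml: str, env_vars: dict[str, str], indent: int
) -> str:
  """Insert env entries right after the first 'env:' key (hypothetical helper)."""
  pad = " " * indent
  # The injected block carries no 'env:' header of its own.
  block = "".join(
      f'\n{pad}- name: {name}\n{pad}  value: "{value}"'
      for name, value in env_vars.items()
  )
  # Replace only the first occurrence, so user-supplied entries stay untouched
  # and end up after the injected ones.
  return container_yaml.replace("env:", "env:" + block, 1)


# Imitation of a rendered container template with one user variable.
template = 'name: main\nenv: \n  - name: USER_VAR\n    value: "1"\n'
# Placeholder values: the real values are not stated in the commit message.
injected = inject_env_after_key(
    template,
    {"JAX_PLATFORMS": "placeholder", "XCLOUD_ENVIRONMENT": "placeholder"},
    indent=2,
)
print(injected)
```

Because the replacement targets the key rather than its trailing whitespace, the same call works whether or not the template carries user variables after 'env:'.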
25 files changed
Lines changed: 646 additions & 271 deletions
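
The commit message names several worker-side JobSet fields: completionMode: Indexed, restartPolicy: OnFailure, and a backoffLimit that is later scaled by vms_per_slice. The sketch below shows where those fields sit in a replicatedJobs entry, written as a plain Python dict. It is illustrative only: the function name, the choice of one Job per slice and one pod per VM, and the exact formula `max_slice_restarts * vms_per_slice` are assumptions, not values confirmed by the commit.

```python
def worker_replicated_job(
    num_slices: int, vms_per_slice: int, max_slice_restarts: int
) -> dict:
  """Build a worker replicatedJobs entry as a plain dict (illustrative only)."""
  return {
      "name": "worker",
      "replicas": num_slices,  # assumption: one Job per slice
      "template": {
          "spec": {
              "completionMode": "Indexed",
              "completions": vms_per_slice,  # assumption: one pod per VM
              "parallelism": vms_per_slice,
              # Assumption: the restart budget scales with the VM count.
              "backoffLimit": max_slice_restarts * vms_per_slice,
              "template": {"spec": {"restartPolicy": "OnFailure"}},
          }
      },
  }


job = worker_replicated_job(num_slices=2, vms_per_slice=4, max_slice_restarts=3)
print(job["template"]["spec"]["backoffLimit"])  # 12
```

With four VMs per slice, this assumed formula gives the same result as the earlier fixed multiplier of 4 mentioned in the message.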
File tree
- recipes
- src/xpk
- commands
- core
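
The message gives tpuv6e:2x2 as an example of the instance type passed to the RM sidecar, and later says the TPU version is read from the system definition instead of a hardcoded mapping. A minimal sketch of that formatting, under stated assumptions: `SystemSketch` is a hypothetical stand-in for the real SystemCharacteristics type, the version is assumed to be stored as a string such as "v6e", and raising ValueError for a missing version is an illustrative choice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SystemSketch:
  """Hypothetical stand-in for a system definition; not the xpk type."""

  topology: str
  pathways_tpu_version: Optional[str] = None


def pathways_instance_type(system: SystemSketch) -> str:
  """Format '<tpu version>:<topology>' for the RM sidecar (assumed layout)."""
  if system.pathways_tpu_version is None:
    # Illustrative handling for systems that carry no Pathways TPU version.
    raise ValueError("system has no Pathways TPU version")
  return f"tpu{system.pathways_tpu_version}:{system.topology}"


print(pathways_instance_type(SystemSketch("2x2", pathways_tpu_version="v6e")))
```

Keeping the version on the system definition means the formatter needs no lookup table of machine types, which matches the direction the later commits in the message describe.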