Add functionality to select number of servers to sample from #567
kmontemayor2-sc wants to merge 11 commits into main from
Conversation
/all_test
GiGL Automation @ 18:29:14 UTC: 🔄 @ 18:39:15 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 18:29:15 UTC: 🔄 @ 19:58:56 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 18:29:17 UTC: 🔄 @ 19:53:27 UTC: ❌ Workflow failed.
GiGL Automation @ 18:29:19 UTC: 🔄 @ 18:36:22 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 18:29:20 UTC: 🔄 @ 19:49:32 UTC: ✅ Workflow completed successfully.
/all_test
GiGL Automation @ 19:12:09 UTC: 🔄 @ 20:43:16 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 19:12:10 UTC: 🔄 @ 20:26:28 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 19:12:12 UTC: 🔄 @ 19:20:02 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 19:12:13 UTC: 🔄 @ 20:37:09 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 19:12:13 UTC: 🔄 @ 19:19:46 UTC: ✅ Workflow completed successfully.
mkolodner-sc left a comment
Thanks Kyle! Did an initial pass here -- it seems like a lot of the APIs are becoming more complex now with mutually exclusive rank/world_sizes and shard_idx/num_shards. Do you think we can simplify this a bit by making the naming here a bit more generic, allowing us to have some split_idx and num_splits field?
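The generic naming the reviewer suggests could look like the following minimal sketch. The helper name `resolve_split` and its defaulting behavior are hypothetical, not from the PR; the point is that a single `split_idx`/`num_splits` pair replaces the mutually exclusive `rank`/`world_size` and `shard_idx`/`num_shards` pairs:

```python
from typing import Optional


def resolve_split(
    split_idx: Optional[int] = None,
    num_splits: Optional[int] = None,
) -> tuple[int, int]:
    """Validate and default a generic (split_idx, num_splits) pair.

    Hypothetical sketch: one naming scheme stands in for both the
    rank/world_size pair and the shard_idx/num_shards pair, so callers
    only ever pass one kind of split specification.
    """
    # Both must be given together, or neither.
    if (split_idx is None) != (num_splits is None):
        raise ValueError("split_idx and num_splits must be set together")
    if split_idx is None:
        return 0, 1  # default: caller owns the whole (unsplit) range
    if not 0 <= split_idx < num_splits:
        raise ValueError(
            f"split_idx {split_idx} out of range for {num_splits} splits"
        )
    return split_idx, num_splits
```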
    num_storage_nodes: int,
    num_assigned_storage_ranks: int,
) -> tuple[dict[int, list[int]], dict[int, list[int]], dict[int, tuple[int, int]]]:
    """Plan storage-rank assignments and local shard ownership for one compute rank."""
Can we add more detail on the docstring about how this is being done?
Comment has been updated, is this a bit better?
Replace mutually exclusive rank/world_size and shard_index/num_shards params with a single split_idx/num_splits pair. Use torch.tensor_split server-side instead of custom _slice_nodes_for_shard. Expand docstrings for _plan_storage_rank_shards_for_compute_rank and num_assigned_storage_ranks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
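The switch from a custom `_slice_nodes_for_shard` to `torch.tensor_split` can be illustrated as follows. The node-ID values here are made up for illustration; the relevant behavior is that `tensor_split` accepts sizes that do not divide evenly, spreading the remainder over the leading chunks:

```python
import torch

# Server-side sharding via torch.tensor_split: split a node-ID tensor
# into num_splits near-equal chunks, then serve the split_idx-th chunk.
node_ids = torch.arange(10)  # illustrative node IDs
num_splits = 3
shards = torch.tensor_split(node_ids, num_splits)

# 10 nodes over 3 splits: the first 10 % 3 = 1 chunk gets an extra
# element, so chunk sizes are [4, 3, 3] and every node appears exactly once.
split_idx = 1
my_shard = shards[split_idx]
```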
/all_test
GiGL Automation @ 16:15:36 UTC: 🔄 @ 17:38:53 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 16:15:36 UTC: 🔄 @ 17:39:55 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 16:15:36 UTC: 🔄 @ 17:47:11 UTC: ❌ Workflow failed.
GiGL Automation @ 16:15:36 UTC: 🔄 @ 16:24:51 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 16:15:37 UTC: 🔄 @ 16:23:02 UTC: ✅ Workflow completed successfully.
…um_splits Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t, StorageRankShardAssignment] Drop the intermediate compute_rank_to_storage_ranks and storage_rank_to_compute_ranks mappings from the return value — callers only need the assigned storage ranks (dict keys) and shard info (dict values) for the current rank. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ventions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/all_test
GiGL Automation @ 18:02:33 UTC: 🔄 @ 19:16:03 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 18:02:33 UTC: 🔄 @ 20:35:22 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 18:02:33 UTC: 🔄 @ 18:10:17 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 18:02:37 UTC: 🔄 @ 19:12:04 UTC: ❌ Workflow failed.
GiGL Automation @ 18:02:38 UTC: 🔄 @ 18:11:45 UTC: ✅ Workflow completed successfully.
mkolodner-sc left a comment
Thanks Kyle! Generally LGTM with a few small comments. One question -- do you have profiling available for the performance delta with this change?
    )

    compute_rank_to_storage_ranks: dict[int, list[int]] = {}
    for compute_rank in range(world_size):
nit: consider adding more comments to this code fn to help readability
    for server_rank in range(self.cluster_info.num_storage_nodes):
        requests: list[FetchNodesRequest] = []
        if num_assigned_storage_ranks is None:
Seems like a lot of similar code between these two conditionals -- do we need both conditional blocks, or can we unify this? Ditto for ABLP.
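One way to unify the two branches, sketched under assumed names (the `FetchNodesRequest` fields and `build_fetch_requests` helper here are hypothetical, not the PR's actual signatures): resolve the list of target server ranks first, then build requests in a single loop instead of duplicating the request-building code per branch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchNodesRequest:  # hypothetical stand-in for the real request type
    server_rank: int
    split_idx: int
    num_splits: int


def build_fetch_requests(
    num_storage_nodes: int,
    assigned_storage_ranks: Optional[list[int]] = None,
) -> list[FetchNodesRequest]:
    """Resolve the target servers once, then build requests in one loop."""
    # None means "fan out to every storage server"; otherwise only the
    # assigned subset is contacted.
    server_ranks = (
        assigned_storage_ranks
        if assigned_storage_ranks is not None
        else list(range(num_storage_nodes))
    )
    return [
        FetchNodesRequest(server_rank=r, split_idx=i, num_splits=len(server_ranks))
        for i, r in enumerate(server_ranks)
    ]
```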
        rank=rank,
        world_size=world_size,
        split=split,
        node_type=node_type,
nit: order here is different than below, ditto for ABLP
    world_size: Optional[int] = None,
    split: Optional[Literal["train", "val", "test"]] = None,
    node_type: Optional[NodeType] = None,
    num_assigned_storage_ranks: Optional[int] = None,
Do you think majority of use cases will want to be setting this to None? If not, perhaps we should set the default to utilize this.
    world_size: Optional[int] = None,
    anchor_node_type: Optional[NodeType] = None,
    supervision_edge_type: Optional[EdgeType] = None,
    num_assigned_storage_ranks: Optional[int] = None,
qq -- how does this value interplay between random and ABLP? Is this required/should this be the same for both ABLP and random sampling?
Scope of work done
We do this so that graph store mode can sample from select servers, since we don't want a full cross-cluster fanout, which is quite noisy and causes problems. In practice this is probably going to be < 4 servers, but TBD on topology.
Where is the documentation for this feature?: N/A
Did you add automated tests or write a test plan?
Updated Changelog.md? NO
Ready for code review?: NO
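The "sample from select servers" idea above can be sketched as a small assignment helper. This is an illustrative round-robin scheme under assumed names (`assign_storage_ranks` is not from the PR): each compute rank contacts only `num_assigned_storage_ranks` servers instead of all `num_storage_nodes`, with start offsets staggered so load spreads across the cluster.

```python
def assign_storage_ranks(
    compute_rank: int,
    num_storage_nodes: int,
    num_assigned_storage_ranks: int,
) -> list[int]:
    """Pick which storage servers this compute rank fans out to.

    Hypothetical sketch: a contiguous window of num_assigned_storage_ranks
    servers, offset per compute rank and wrapped modulo the server count,
    avoids every rank hammering the same servers (or all servers at once).
    """
    start = (compute_rank * num_assigned_storage_ranks) % num_storage_nodes
    return [
        (start + i) % num_storage_nodes
        for i in range(num_assigned_storage_ranks)
    ]
```

With 8 storage servers and 2 assigned per compute rank, ranks 0..3 cover disjoint server pairs, so the cluster sees even load with no cross-cluster fanout.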