Skip to content

Add functionallity to select number of servers to sample from#567

Draft
kmontemayor2-sc wants to merge 11 commits intomainfrom
kmonte/select-num-servers
Draft

Add functionallity to select number of servers to sample from#567
kmontemayor2-sc wants to merge 11 commits intomainfrom
kmonte/select-num-servers

Conversation

@kmontemayor2-sc
Copy link
Copy Markdown
Collaborator

@kmontemayor2-sc kmontemayor2-sc commented Mar 31, 2026

Scope of work done

We do this so that we can have graph store mode sample from select servers, as we don't want to full cross-cluster fanout which is quite noisy and causes problems. In practice this is probably going to be like, < 4 but tbd on topology .

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

@kmontemayor2-sc
Copy link
Copy Markdown
Collaborator Author

/all_test

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

GiGL Automation

@ 18:29:14UTC : 🔄 Scala Unit Test started.

@ 18:39:15UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

GiGL Automation

@ 18:29:15UTC : 🔄 E2E Test started.

@ 19:58:56UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

GiGL Automation

@ 18:29:17UTC : 🔄 Integration Test started.

@ 19:53:27UTC : ❌ Workflow failed.
Please check the logs for more details.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

GiGL Automation

@ 18:29:19UTC : 🔄 Lint Test started.

@ 18:36:22UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

GiGL Automation

@ 18:29:20UTC : 🔄 Python Unit Test started.

@ 19:49:32UTC : ✅ Workflow completed successfully.

@kmontemayor2-sc
Copy link
Copy Markdown
Collaborator Author

/all_test

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

GiGL Automation

@ 19:12:09UTC : 🔄 Integration Test started.

@ 20:43:16UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

GiGL Automation

@ 19:12:10UTC : 🔄 Python Unit Test started.

@ 20:26:28UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

GiGL Automation

@ 19:12:12UTC : 🔄 Scala Unit Test started.

@ 19:20:02UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

GiGL Automation

@ 19:12:13UTC : 🔄 E2E Test started.

@ 20:37:09UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

GiGL Automation

@ 19:12:13UTC : 🔄 Lint Test started.

@ 19:19:46UTC : ✅ Workflow completed successfully.

Copy link
Copy Markdown
Collaborator

@mkolodner-sc mkolodner-sc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kyle! Did an initial pass here -- it seems like a lot of the APIs are becoming more complex now with mutually exclusive rank/world_sizes and shard_idx/num_shards. Do you think we can simplify this a bit by making the naming here a bit more generic, allowing us to have some split_idx and num_splits field?

num_storage_nodes: int,
num_assigned_storage_ranks: int,
) -> tuple[dict[int, list[int]], dict[int, list[int]], dict[int, tuple[int, int]]]:
"""Plan storage-rank assignments and local shard ownership for one compute rank."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add more detail on the docstring about how this is being done?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment has been updated, is this a bit better?

kmonte and others added 2 commits April 1, 2026 00:09
Replace mutually exclusive rank/world_size and shard_index/num_shards
params with a single split_idx/num_splits pair. Use torch.tensor_split
server-side instead of custom _slice_nodes_for_shard. Expand docstrings
for _plan_storage_rank_shards_for_compute_rank and
num_assigned_storage_ranks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kmontemayor2-sc
Copy link
Copy Markdown
Collaborator Author

/all_test

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

GiGL Automation

@ 16:15:36UTC : 🔄 E2E Test started.

@ 17:38:53UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

GiGL Automation

@ 16:15:36UTC : 🔄 Python Unit Test started.

@ 17:39:55UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

GiGL Automation

@ 16:15:36UTC : 🔄 Integration Test started.

@ 17:47:11UTC : ❌ Workflow failed.
Please check the logs for more details.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

GiGL Automation

@ 16:15:36UTC : 🔄 Scala Unit Test started.

@ 16:24:51UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

GiGL Automation

@ 16:15:37UTC : 🔄 Lint Test started.

@ 16:23:02UTC : ✅ Workflow completed successfully.

kmonte and others added 6 commits April 1, 2026 16:42
…um_splits

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t, StorageRankShardAssignment]

Drop the intermediate compute_rank_to_storage_ranks and storage_rank_to_compute_ranks
mappings from the return value — callers only need the assigned storage ranks (dict keys)
and shard info (dict values) for the current rank.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ventions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kmontemayor2-sc
Copy link
Copy Markdown
Collaborator Author

/all_test

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

GiGL Automation

@ 18:02:33UTC : 🔄 Integration Test started.

@ 19:16:03UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

GiGL Automation

@ 18:02:33UTC : 🔄 E2E Test started.

@ 20:35:22UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

GiGL Automation

@ 18:02:33UTC : 🔄 Lint Test started.

@ 18:10:17UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

GiGL Automation

@ 18:02:37UTC : 🔄 Python Unit Test started.

@ 19:12:04UTC : ❌ Workflow failed.
Please check the logs for more details.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

GiGL Automation

@ 18:02:38UTC : 🔄 Scala Unit Test started.

@ 18:11:45UTC : ✅ Workflow completed successfully.

Copy link
Copy Markdown
Collaborator

@mkolodner-sc mkolodner-sc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kyle! Generally LGTM with a few small comments. One question -- do you have profiling available for the performance delta with this change?

)

compute_rank_to_storage_ranks: dict[int, list[int]] = {}
for compute_rank in range(world_size):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider adding more comments to this code fn to help readability


for server_rank in range(self.cluster_info.num_storage_nodes):
requests: list[FetchNodesRequest] = []
if num_assigned_storage_ranks is None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a lot of similar code between these two conditionals -- do we these two conditional blocks, or can we unify this? Ditto for ABLP

Comment on lines +299 to +302
rank=rank,
world_size=world_size,
split=split,
node_type=node_type,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: order here is different than below, ditto for ABLP

world_size: Optional[int] = None,
split: Optional[Literal["train", "val", "test"]] = None,
node_type: Optional[NodeType] = None,
num_assigned_storage_ranks: Optional[int] = None,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think majority of use cases will want to be setting this to None? If not, perhaps we should set the default to utilize this.

world_size: Optional[int] = None,
anchor_node_type: Optional[NodeType] = None,
supervision_edge_type: Optional[EdgeType] = None,
num_assigned_storage_ranks: Optional[int] = None,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq -- how does this value interplay between random and ABLP? Is this required/should this be the same for both ABLP and random sampling?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants