Add functionality to select number of servers to sample from #567
kmontemayor2-sc wants to merge 11 commits into main from
Conversation
/all_test
GiGL Automation @ 18:29:14 UTC: 🔄 @ 18:39:15 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 18:29:15 UTC: 🔄 @ 19:58:56 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 18:29:17 UTC: 🔄 @ 19:53:27 UTC: ❌ Workflow failed.
GiGL Automation @ 18:29:19 UTC: 🔄 @ 18:36:22 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 18:29:20 UTC: 🔄 @ 19:49:32 UTC: ✅ Workflow completed successfully.
/all_test
GiGL Automation @ 19:12:09 UTC: 🔄 @ 20:43:16 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 19:12:10 UTC: 🔄 @ 20:26:28 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 19:12:12 UTC: 🔄 @ 19:20:02 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 19:12:13 UTC: 🔄 @ 20:37:09 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 19:12:13 UTC: 🔄 @ 19:19:46 UTC: ✅ Workflow completed successfully.
mkolodner-sc left a comment
Thanks Kyle! Did an initial pass here -- it seems like a lot of the APIs are becoming more complex now with mutually exclusive rank/world_sizes and shard_idx/num_shards. Do you think we can simplify this a bit by making the naming here a bit more generic, allowing us to have some split_idx and num_splits field?
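The generic naming the reviewer suggests could look like the following minimal sketch. The helper name `resolve_split` and its defaulting behavior are hypothetical, not from the PR; the point is that a single `split_idx`/`num_splits` pair replaces the mutually exclusive `rank`/`world_size` and `shard_idx`/`num_shards` pairs:

```python
from typing import Optional


def resolve_split(
    split_idx: Optional[int] = None,
    num_splits: Optional[int] = None,
) -> tuple[int, int]:
    """Validate and default a generic (split_idx, num_splits) pair.

    Hypothetical sketch: one naming scheme stands in for both the
    rank/world_size pair and the shard_idx/num_shards pair, so callers
    only ever pass one kind of split specification.
    """
    # Both must be given together, or neither.
    if (split_idx is None) != (num_splits is None):
        raise ValueError("split_idx and num_splits must be set together")
    if split_idx is None:
        return 0, 1  # default: caller owns the whole (unsplit) range
    if not 0 <= split_idx < num_splits:
        raise ValueError(
            f"split_idx {split_idx} out of range for {num_splits} splits"
        )
    return split_idx, num_splits
```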
    num_storage_nodes: int,
    num_assigned_storage_ranks: int,
) -> tuple[dict[int, list[int]], dict[int, list[int]], dict[int, tuple[int, int]]]:
    """Plan storage-rank assignments and local shard ownership for one compute rank."""
Can we add more detail on the docstring about how this is being done?
Comment has been updated, is this a bit better?
Replace mutually exclusive rank/world_size and shard_index/num_shards params with a single split_idx/num_splits pair. Use torch.tensor_split server-side instead of custom _slice_nodes_for_shard. Expand docstrings for _plan_storage_rank_shards_for_compute_rank and num_assigned_storage_ranks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
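The switch from a custom `_slice_nodes_for_shard` to `torch.tensor_split` can be illustrated as follows. The node-ID values here are made up for illustration; the relevant behavior is that `tensor_split` accepts sizes that do not divide evenly, spreading the remainder over the leading chunks:

```python
import torch

# Server-side sharding via torch.tensor_split: split a node-ID tensor
# into num_splits near-equal chunks, then serve the split_idx-th chunk.
node_ids = torch.arange(10)  # illustrative node IDs
num_splits = 3
shards = torch.tensor_split(node_ids, num_splits)

# 10 nodes over 3 splits: the first 10 % 3 = 1 chunk gets an extra
# element, so chunk sizes are [4, 3, 3] and every node appears exactly once.
split_idx = 1
my_shard = shards[split_idx]
```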
/all_test
GiGL Automation @ 16:15:36 UTC: 🔄 @ 17:38:53 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 16:15:36 UTC: 🔄 @ 17:39:55 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 16:15:36 UTC: 🔄 @ 17:47:11 UTC: ❌ Workflow failed.
GiGL Automation @ 16:15:36 UTC: 🔄 @ 16:24:51 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 16:15:37 UTC: 🔄 @ 16:23:02 UTC: ✅ Workflow completed successfully.
…um_splits Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t, StorageRankShardAssignment] Drop the intermediate compute_rank_to_storage_ranks and storage_rank_to_compute_ranks mappings from the return value — callers only need the assigned storage ranks (dict keys) and shard info (dict values) for the current rank. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ventions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/all_test
GiGL Automation @ 18:02:33 UTC: 🔄 @ 19:16:03 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 18:02:33 UTC: 🔄 @ 20:35:22 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 18:02:33 UTC: 🔄 @ 18:10:17 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 18:02:37 UTC: 🔄 @ 19:12:04 UTC: ❌ Workflow failed.
GiGL Automation @ 18:02:38 UTC: 🔄 @ 18:11:45 UTC: ✅ Workflow completed successfully.
mkolodner-sc left a comment
Thanks Kyle! Generally LGTM with a few small comments. One question -- do you have profiling available for the performance delta with this change?
    )

    compute_rank_to_storage_ranks: dict[int, list[int]] = {}
    for compute_rank in range(world_size):
nit: consider adding more comments to this code fn to help readability
    for server_rank in range(self.cluster_info.num_storage_nodes):
        requests: list[FetchNodesRequest] = []
        if num_assigned_storage_ranks is None:
Seems like a lot of similar code between these two conditionals -- do we need both conditional blocks, or can we unify this? Ditto for ABLP.
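One way to unify the two branches, sketched under assumed names (the `FetchNodesRequest` fields and `build_fetch_requests` helper here are hypothetical, not the PR's actual signatures): resolve the list of target server ranks first, then build requests in a single loop instead of duplicating the request-building code per branch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchNodesRequest:  # hypothetical stand-in for the real request type
    server_rank: int
    split_idx: int
    num_splits: int


def build_fetch_requests(
    num_storage_nodes: int,
    assigned_storage_ranks: Optional[list[int]] = None,
) -> list[FetchNodesRequest]:
    """Resolve the target servers once, then build requests in one loop."""
    # None means "fan out to every storage server"; otherwise only the
    # assigned subset is contacted.
    server_ranks = (
        assigned_storage_ranks
        if assigned_storage_ranks is not None
        else list(range(num_storage_nodes))
    )
    return [
        FetchNodesRequest(server_rank=r, split_idx=i, num_splits=len(server_ranks))
        for i, r in enumerate(server_ranks)
    ]
```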
        rank=rank,
        world_size=world_size,
        split=split,
        node_type=node_type,
nit: order here is different than below, ditto for ABLP
    world_size: Optional[int] = None,
    split: Optional[Literal["train", "val", "test"]] = None,
    node_type: Optional[NodeType] = None,
    num_assigned_storage_ranks: Optional[int] = None,
Do you think majority of use cases will want to be setting this to None? If not, perhaps we should set the default to utilize this.
    world_size: Optional[int] = None,
    anchor_node_type: Optional[NodeType] = None,
    supervision_edge_type: Optional[EdgeType] = None,
    num_assigned_storage_ranks: Optional[int] = None,
qq -- how does this value interplay between random and ABLP? Is this required/should this be the same for both ABLP and random sampling?
Scope of work done
We do this so that graph store mode can sample from select servers, since we don't want a full cross-cluster fanout, which is quite noisy and causes problems. In practice this is probably going to be < 4 servers, but TBD on topology.
Where is the documentation for this feature?: N/A
Did you add automated tests or write a test plan?
Updated Changelog.md? NO
Ready for code review?: NO
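The "sample from select servers" idea above can be sketched as a small assignment helper. This is an illustrative round-robin scheme under assumed names (`assign_storage_ranks` is not from the PR): each compute rank contacts only `num_assigned_storage_ranks` servers instead of all `num_storage_nodes`, with start offsets staggered so load spreads across the cluster.

```python
def assign_storage_ranks(
    compute_rank: int,
    num_storage_nodes: int,
    num_assigned_storage_ranks: int,
) -> list[int]:
    """Pick which storage servers this compute rank fans out to.

    Hypothetical sketch: a contiguous window of num_assigned_storage_ranks
    servers, offset per compute rank and wrapped modulo the server count,
    avoids every rank hammering the same servers (or all servers at once).
    """
    start = (compute_rank * num_assigned_storage_ranks) % num_storage_nodes
    return [
        (start + i) % num_storage_nodes
        for i in range(num_assigned_storage_ranks)
    ]
```

With 8 storage servers and 2 assigned per compute rank, ranks 0..3 cover disjoint server pairs, so the cluster sees even load with no cross-cluster fanout.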