Skip to content

fix(dynamicemb): traverse nn.Module children in check_emb_collection_modules#355

Open
JacoCheung wants to merge 4 commits intoNVIDIA:mainfrom
JacoCheung:fix/check-emb-submodules
Open

fix(dynamicemb): traverse nn.Module children in check_emb_collection_modules#355
JacoCheung wants to merge 4 commits intoNVIDIA:mainfrom
JacoCheung:fix/check-emb-submodules

Conversation

@JacoCheung
Copy link
Copy Markdown
Collaborator

@JacoCheung JacoCheung commented Apr 10, 2026

Summary

  • check_emb_collection_modules only followed TorchRec's private attributes (_lookups, _emb_modules, _emb_module) but never recursed into standard nn.Module.children(), so it could not discover BatchedDynamicEmbeddingTablesV2 behind wrapper layers (DMP / DDP / Float16Module).
  • Added children() traversal and a visited set to prevent re-entry on circular references and duplicate ret_list entries.
  • Normalized the return-value contract: all branches now return implicit None; callers mutate ret_list in place (addresses greptile P2).
  • Backward compatible — the new visited parameter defaults to None, so all existing callers work unchanged.
  • Dropped the duplicate examples-side helper commons.utils.dynamicemb_utils.find_dynamicemb_modules; both call sites (dynamicemb_cache_stats, pretrain_gr_ranking's FILL_DYNAMICEMB_TABLES path) now use the public dynamicemb.dump_load.get_dynamic_emb_module.
  • Rewrote check_counter_table_checkpoint in test_embedding_dump_load.py against the fused MultiTableKVCounter API. The old helper iterated _admission_counter as a list of per-table KVCounters, but since the fusion refactor ([Feature] dynamicemb table fusion and expansion #343, commit 97b80bf) it is a single counter object. The test silently no-op'd on branches where get_dynamic_emb_module couldn't traverse DMP wrappers, and raised TypeError: 'MultiTableKVCounter' object is not iterable on branches where it could. Now per-logical-table: export (keys, frequency) via cnt.table_._batched_export_keys_scores([freq_name], device, table_id), look up via cnt.table_.lookup(keys, table_ids, score_arg), assert founds.all() and torch.equal(frequencies, score_out); handle the None (no-admission) case symmetrically.

Closes #353

Test plan

  • New regression test assert_get_dynamic_emb_module_finds_submodules in test_embedding_dump_load.py asserts both traversal paths (via find_sharded_modules vs directly on the DMP wrapper) return the same set of BatchedDynamicEmbeddingTablesV2 modules
  • dynamicemb_test_load_dump_8gpus (which exercises check_counter_table_checkpoint) — the rewritten helper now actively verifies dump/load counter round-trip instead of silently iterating zero tables
  • Existing unit tests in test_embedding_dump_load.py, test_alignment.py, test_lfu_scores.py, test_embedding_admission.py pass without changes
  • Verify get_dynamic_emb_module finds modules when called on a DMP-wrapped model (not just a pre-extracted ShardedEmbeddingCollection)

🤖 Generated with Claude Code

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 10, 2026

Greptile Summary

This PR fixes check_emb_collection_modules in dump_load.py to recurse through wrapper layers (DMP/DDP/Float16Module) via nn.Module.children(), supplemented by a visited set to prevent revisits. It also rewrites the broken check_counter_table_checkpoint test against the post-fusion MultiTableKVCounter API, drops the now-redundant commons.utils.dynamicemb_utils helper, and updates all callers to use the public get_dynamic_emb_module.

Confidence Score: 5/5

Safe to merge — the traversal fix is correct, the visited-set prevents duplicates, all callers are updated, and the test rewrite actively verifies the counter round-trip.

All findings are P2 or lower. Prior P1 (inconsistent return values) was addressed in this PR. The core fix, test additions, and helper consolidation are logically sound with no remaining blocking issues.

No files require special attention.

Important Files Changed

Filename Overview
corelib/dynamicemb/dynamicemb/dump_load.py Adds visited set and children() traversal to check_emb_collection_modules; return contract is now consistently implicit None. Logic is correct; visited prevents duplicates from double-reachable children.
corelib/dynamicemb/test/unit_tests/test_embedding_dump_load.py Rewrites check_counter_table_checkpoint against the fused MultiTableKVCounter API; adds assert_get_dynamic_emb_module_finds_submodules regression test. Both paths are logically sound.
examples/commons/utils/dynamicemb_utils.py File deleted — duck-typed helper superseded by the improved get_dynamic_emb_module.
examples/commons/utils/dynamicemb_cache_stats.py Swaps find_dynamicemb_modules for get_dynamic_emb_module; BatchedDynamicEmbeddingTablesV2 exposes all attributes (table_names, cache, set_record_cache_metrics) the hook relies on, so behaviorally identical.
examples/hstu/training/pretrain_gr_ranking.py Updates FILL_DYNAMICEMB_TABLES path to use get_dynamic_emb_module; fill_tables attribute check ensures safe fallback for modules without the method.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["get_dynamic_emb_module(model)"] --> B["check_emb_collection_modules(module, ret_list, visited)"]
    B --> C{"id in visited?"}
    C -->|yes| D["return"]
    C -->|no| E["visited.add(id)"]
    E --> F{"BatchedDynamicEmbeddingTablesV2?"}
    F -->|yes| G["ret_list.append(module) then return"]
    F -->|no| H{"nn.Module?"}
    H -->|yes| I["Traverse _lookups, _emb_modules, _emb_module"]
    I --> J["Recurse into each item"]
    H -->|yes| K["module.children() - DMP/DDP/Float16Module"]
    K --> L["Recurse into each child"]
    J --> B
    L --> B
Loading

Reviews (10): Last reviewed commit: "fix(dynamicemb/test): rewrite check_coun..." | Re-trigger Greptile

Comment thread corelib/dynamicemb/dynamicemb/dump_load.py Outdated
@JacoCheung JacoCheung requested review from jiashuy and shijieliu April 10, 2026 07:53
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 10, 2026

Tip:

Greploop — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.

Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.

@JacoCheung JacoCheung force-pushed the fix/check-emb-submodules branch from 5aae19f to 14cb873 Compare April 10, 2026 08:25
Comment thread corelib/dynamicemb/dynamicemb/dump_load.py
Copy link
Copy Markdown
Collaborator

@jiashuy jiashuy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JacoCheung JacoCheung force-pushed the fix/check-emb-submodules branch from 7b79801 to ad6f688 Compare April 14, 2026 10:19
@JacoCheung
Copy link
Copy Markdown
Collaborator Author

/build

4 similar comments
@JacoCheung
Copy link
Copy Markdown
Collaborator Author

/build

@JacoCheung
Copy link
Copy Markdown
Collaborator Author

/build

@JacoCheung
Copy link
Copy Markdown
Collaborator Author

/build

@JacoCheung
Copy link
Copy Markdown
Collaborator Author

/build

@JacoCheung
Copy link
Copy Markdown
Collaborator Author

JacoCheung commented Apr 15, 2026

Pipeline #48604221 -- failed

Job Status Log
pre_check ✅ success view
train_build ✅ success view
inference_build ✅ success view
tritonserver_build ✅ success view
build_whl ❌ failed view
dynamicemb_test_fwd_bwd_8gpus ✅ success view
dynamicemb_test_load_dump_8gpus ❌ failed view
unit_test_1gpu_a100 ✅ success view
unit_test_1gpu_h100 ✅ success view
unit_test_4gpu ❌ failed view
unit_test_tp_4gpu ❌ failed view
L20_unit_test_1gpu ✅ success view
inference_unit_test_1gpu ❌ failed view
inference_test_1gpu ❌ failed view

Result: 8/14 jobs passed

View full pipeline

@JacoCheung
Copy link
Copy Markdown
Collaborator Author

/build

@shijieliu shijieliu force-pushed the fix/check-emb-submodules branch from ad6f688 to 4356c1a Compare April 17, 2026 01:53
@JacoCheung
Copy link
Copy Markdown
Collaborator Author

/build

@JacoCheung
Copy link
Copy Markdown
Collaborator Author

JacoCheung commented Apr 17, 2026

Pipeline #48747674 -- canceling

Job Status Log
pre_check ✅ success view
train_build ❔ canceling view
inference_build ❔ canceling view
tritonserver_build ❔ canceling view
build_whl 🚫 canceled view
dynamicemb_test_fwd_bwd_8gpus 🚫 canceled view
dynamicemb_test_load_dump_8gpus 🚫 canceled view
unit_test_1gpu_a100 🚫 canceled view
unit_test_1gpu_h100 🚫 canceled view
unit_test_4gpu 🚫 canceled view
unit_test_tp_4gpu 🚫 canceled view
L20_unit_test_1gpu 🚫 canceled view
inference_unit_test_1gpu 🚫 canceled view
inference_test_1gpu 🚫 canceled view

View full pipeline

@JacoCheung
Copy link
Copy Markdown
Collaborator Author

/build

@JacoCheung
Copy link
Copy Markdown
Collaborator Author

JacoCheung commented Apr 17, 2026

Pipeline #48748868 -- failed

Job Status Log
pre_check ✅ success view
train_build ✅ success view
inference_build ✅ success view
tritonserver_build ✅ success view
build_whl ✅ success view
dynamicemb_test_fwd_bwd_8gpus ✅ success view
dynamicemb_test_load_dump_8gpus ❌ failed view
unit_test_1gpu_a100 ❌ failed view
unit_test_1gpu_h100 ✅ success view
unit_test_4gpu ❌ failed view
unit_test_tp_4gpu ❌ failed view
L20_unit_test_1gpu ✅ success view
inference_unit_test_1gpu ✅ success view
inference_test_1gpu ✅ success view

Result: 10/14 jobs passed

View full pipeline

JacoCheung added a commit to JacoCheung/recsys-examples that referenced this pull request Apr 17, 2026
…d MultiTableKVCounter

After the admission-counter fusion refactor (NVIDIA#343), `_admission_counter`
became a single MultiTableKVCounter (or None) instead of a list of
per-table KVCounters. The verification helper still iterated it as a
list, so the dump/load counter check was silently no-op on any branch
whose `get_dynamic_emb_module` could traverse into DMP wrappers and
raised `TypeError: 'MultiTableKVCounter' object is not iterable` on any
branch that could. PR NVIDIA#355's `children()` traversal surfaced the latter.

Rewrite the check against the current API:
- Iterate logical tables via `range(len(table._table_names))`.
- Export one table_id's (keys, frequency) with
  `cnt.table_._batched_export_keys_scores([freq_name], device, table_id)`.
- Look those keys up in the peer counter via
  `cnt_y.table_.lookup(keys, table_ids, score_arg)` and assert both
  `founds.all()` and that the returned score_out matches the exported
  frequency.
- Handle the None (no-admission) case symmetrically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JacoCheung
Copy link
Copy Markdown
Collaborator Author

/build

JacoCheung and others added 4 commits April 17, 2026 00:39
…modules

The function only checked direct attributes, missing EmbeddingCollection
wrapped inside DistributedModelParallel or other nn.Module containers.
Now recursively walks module.children() to find the embedding submodule.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
check_emb_collection_modules now traverses nn.Module.children() and
guards against cycles, so the examples-side duplicate
find_dynamicemb_modules is redundant. Route both call sites (cache stats
hook + FILL_DYNAMICEMB_TABLES fill path) through dynamicemb's public
get_dynamic_emb_module and delete commons/utils/dynamicemb_utils.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d MultiTableKVCounter

After the admission-counter fusion refactor (NVIDIA#343), `_admission_counter`
became a single MultiTableKVCounter (or None) instead of a list of
per-table KVCounters. The verification helper still iterated it as a
list, so the dump/load counter check was silently no-op on any branch
whose `get_dynamic_emb_module` could traverse into DMP wrappers and
raised `TypeError: 'MultiTableKVCounter' object is not iterable` on any
branch that could. PR NVIDIA#355's `children()` traversal surfaced the latter.

Rewrite the check against the current API:
- Iterate logical tables via `range(len(table._table_names))`.
- Export one table_id's (keys, frequency) with
  `cnt.table_._batched_export_keys_scores([freq_name], device, table_id)`.
- Look those keys up in the peer counter via
  `cnt_y.table_.lookup(keys, table_ids, score_arg)` and assert both
  `founds.all()` and that the returned score_out matches the exported
  frequency.
- Handle the None (no-admission) case symmetrically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JacoCheung
Copy link
Copy Markdown
Collaborator Author

JacoCheung commented Apr 17, 2026

Pipeline #48764104 -- canceling

Job Status Log
pre_check ❔ canceling view
train_build 🚫 canceled view
inference_build 🚫 canceled view
tritonserver_build 🚫 canceled view
build_whl 🚫 canceled view
dynamicemb_test_fwd_bwd_8gpus 🚫 canceled view
dynamicemb_test_load_dump_8gpus 🚫 canceled view
unit_test_1gpu_a100 🚫 canceled view
unit_test_1gpu_h100 🚫 canceled view
unit_test_4gpu 🚫 canceled view
unit_test_tp_4gpu 🚫 canceled view
L20_unit_test_1gpu 🚫 canceled view
inference_unit_test_1gpu 🚫 canceled view
inference_test_1gpu 🚫 canceled view

View full pipeline

@JacoCheung JacoCheung force-pushed the fix/check-emb-submodules branch from 02b4a92 to b096feb Compare April 17, 2026 07:41
@JacoCheung
Copy link
Copy Markdown
Collaborator Author

/build

@JacoCheung
Copy link
Copy Markdown
Collaborator Author

JacoCheung commented Apr 17, 2026

Pipeline #48764351 -- failed

Job Status Log
pre_check ✅ success view
train_build ✅ success view
inference_build ✅ success view
tritonserver_build ✅ success view
build_whl ✅ success view
dynamicemb_test_fwd_bwd_8gpus ✅ success view
dynamicemb_test_load_dump_8gpus ✅ success view
unit_test_1gpu_a100 ✅ success view
unit_test_1gpu_h100 ❌ failed view
unit_test_4gpu ✅ success view
unit_test_tp_4gpu ❌ failed view
L20_unit_test_1gpu ✅ success view
inference_unit_test_1gpu ✅ success view
inference_test_1gpu ✅ success view

Result: 12/14 jobs passed

View full pipeline

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] check_emb_collection_modules search sub-modules

2 participants