This document explains the full workflow:
- Build and install the required dependencies:
  - `inference_emb_ops.so` for dynamicemb custom ops
  - NVEmbedding package and shared libraries
- Run the Python export pipeline to generate:
  - exported model → `model.pt2`, `metadata.json`, `weights/{nve_layer_module_name}.nve`
  - test cases → tensors `values.pt`, `lengths.pt`, `num_candidates.pt`, `ref_logits.pt` for multiple batches
- Build the C++ E2E inference demos.
- Run the C++ inference and compare output numerically.
Placeholders used throughout this document:

- `{RECSYS_DIR}`: root of `recsys-dynmicemb-alex`
- `{NVE_DIR}`: root of `trt-recsys`
- `{RECSYS_INFERENCE_DIR}`: `{RECSYS_DIR}/examples/hstu/inference`
- `{CPP_INFERENCE_BUILD_DIR}`: `{RECSYS_INFERENCE_DIR}/cpp_inference/build`
- `{CPP_INFERENCE_LIB_DIR}`: `{RECSYS_INFERENCE_DIR}/cpp_inference/lib`
- Linux with CUDA GPU available
- PyTorch 2.11 + DynamicEmb + NVEmbedding
Build `inference_emb_ops.so` from the repository root:
```bash
cd {RECSYS_DIR}/corelib/dynamicemb
mkdir -p torch_binding_build && cd torch_binding_build
cmake .. && make -j
```

Expected output:

- `{RECSYS_DIR}/corelib/dynamicemb/torch_binding_build/inference_emb_ops.so`
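A quick sanity check for the build artifact, as a minimal Python sketch (the op namespace the library registers is not documented here, so this only verifies that the shared object loads cleanly):

```python
import os
import torch

# Build artifact from the CMake step above.
lib = "{RECSYS_DIR}/corelib/dynamicemb/torch_binding_build/inference_emb_ops.so"
assert os.path.exists(lib), f"missing build artifact: {lib}"

# Registers the library's custom ops with the PyTorch dispatcher; this raises
# if symbols cannot be resolved (e.g. a libtorch version mismatch).
torch.ops.load_library(lib)
print("inference_emb_ops.so loaded OK")
```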
Command block (paths use the placeholders defined above):
```bash
cd {NVE_DIR}  # at the repository root dir
git submodule update --init --recursive
git clone https://github.com/NVIDIA/NVTX.git third_party/NVTX
CPLUS_INCLUDE_PATH=$(realpath ./third_party/NVTX/c/include/):${CPLUS_INCLUDE_PATH} pip install .
```

Expected output:

- Output libraries: `{NVE_DIR}/build/lib/libnve-common.so` and `{NVE_DIR}/build/lib/libnve_torch_ops.so`
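The same smoke test applies to the NVEmbedding outputs; a minimal sketch, assuming `libnve_torch_ops.so` links against libtorch (hence the `import torch` first) and against `libnve-common.so` (hence loading that one with `RTLD_GLOBAL`):

```python
import ctypes
import torch  # load libtorch first so the NVE ops library can resolve its symbols

# Expose libnve-common.so symbols globally, then register the torch ops.
ctypes.CDLL("{NVE_DIR}/build/lib/libnve-common.so", mode=ctypes.RTLD_GLOBAL)
torch.ops.load_library("{NVE_DIR}/build/lib/libnve_torch_ops.so")
print("NVEmbedding libraries loaded OK")
```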
Run the Python export pipeline from the repository root:
```bash
cd {RECSYS_DIR}/examples/hstu/
export DYNAMICEMB_OPS_LIB_DIR=$(realpath ../../corelib/dynamicemb/torch_binding_build/)
python3 ./inference/export_inference_gr_ranking.py \
    --gin_config_file ./inference/configs/kuairand_1k_inference_ranking.gin \
    --checkpoint_dir ci_checkpoint/fused_kuairand_1k_ckpt_v2
```
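To inspect the exported artifacts before moving on to C++, you can open them from Python. This is a sketch under two assumptions: the export writes into `{RECSYS_INFERENCE_DIR}/hstu_gr_ranking_model` (the directory later passed to the C++ binary), and `model.pt2` is an AOTInductor package; if the script instead used `torch.export.save`, load it with `torch.export.load`.

```python
import json

from torch._inductor import aoti_load_package

export_dir = "{RECSYS_INFERENCE_DIR}/hstu_gr_ranking_model"  # assumed output location

# metadata.json describes the exported model; dump it to see what the export wrote.
with open(f"{export_dir}/metadata.json") as f:
    print(json.dumps(json.load(f), indent=2))

# Load the compiled .pt2 package; it behaves like a callable taking the
# flattened tensor inputs described at the end of this document.
model = aoti_load_package(f"{export_dir}/model.pt2")
```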
Then build the C++ inference demos:

```bash
cd {RECSYS_INFERENCE_DIR}/cpp_inference
CMAKE_PREFIX_PATH="$(python -c 'import os, torch; print(os.path.join(os.path.dirname(torch.__file__), "share", "cmake"))')" cmake -S . -B build
cmake --build build --config Release -j
```

Expected output:
- Output library: `libhstu_cuda_ops_runtime.so`
- C++ inference executable: `inference_hstu_gr_ranking_exported_model`
```bash
cd {RECSYS_INFERENCE_DIR}/cpp_inference
./build/inference_hstu_gr_ranking_exported_model \
    {RECSYS_INFERENCE_DIR}/hstu_gr_ranking_model \
    {RECSYS_INFERENCE_DIR}/export_test_dump
```

FBGEMM shared libraries are also loaded by the C++ inference; their paths are hard-coded.
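The numerical comparison can also be reproduced on the Python side. A minimal sketch, assuming each dumped `.pt` file holds a list with one tensor per batch (the exact layout is defined by the export script) and that `model.pt2` is an AOTInductor package as above:

```python
import torch
from torch._inductor import aoti_load_package

dump_dir = "{RECSYS_INFERENCE_DIR}/export_test_dump"
model = aoti_load_package("{RECSYS_INFERENCE_DIR}/hstu_gr_ranking_model/model.pt2")

# One entry per batch in each file (assumed layout).
values = torch.load(f"{dump_dir}/values.pt")
lengths = torch.load(f"{dump_dir}/lengths.pt")
num_candidates = torch.load(f"{dump_dir}/num_candidates.pt")
ref_logits = torch.load(f"{dump_dir}/ref_logits.pt")

# Replay each batch in the flattened input order described below and compare
# against the reference logits dumped at export time.
for i, (v, l, n, ref) in enumerate(zip(values, lengths, num_candidates, ref_logits)):
    logits = model(v.cuda(), l.cuda(), n.cuda())
    torch.testing.assert_close(logits.cpu(), ref, rtol=1e-3, atol=1e-3)
    print(f"batch {i}: max abs diff {(logits.cpu() - ref).abs().max().item():.3e}")
```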
The current export in `export_inference_gr_ranking.py` uses pytree inputs (`HSTUBatch`), so the C++ runtime sees a flattened tensor list rather than the Python container objects. The current flattened input order is:

- `batch.features.values()`
- `batch.features.lengths()`
- `batch.num_candidates`
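To see why the order comes out this way, you can flatten a stand-in batch yourself. A toy sketch with `torch.utils._pytree` (the real `HSTUBatch` registration lives in the repo; the dataclass below is a simplified stand-in):

```python
from dataclasses import dataclass

import torch
import torch.utils._pytree as pytree

@dataclass
class ToyBatch:  # simplified stand-in for the real HSTUBatch
    values: torch.Tensor          # plays the role of batch.features.values()
    lengths: torch.Tensor         # plays the role of batch.features.lengths()
    num_candidates: torch.Tensor

# Teach pytree how to decompose the container into leaf tensors; the order
# returned here is exactly the flattened input order the exported model sees.
pytree.register_pytree_node(
    ToyBatch,
    lambda b: ([b.values, b.lengths, b.num_candidates], None),  # flatten
    lambda leaves, ctx: ToyBatch(*leaves),                      # unflatten
)

batch = ToyBatch(torch.arange(6), torch.tensor([2, 4]), torch.tensor([1, 1]))
leaves, spec = pytree.tree_flatten(batch)
print([t.tolist() for t in leaves])  # flat tensor list: values, lengths, num_candidates
```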