NVIDIA RecSys Examples is a collection of optimized recommender models and components.
The project includes:
- Examples for large-scale HSTU ranking and retrieval training through TorchRec and Megatron-Core integration
- HSTU inference with paged KV cache, Triton Inference Server integration, CUDA graph usage, and C++ deployment with AOTInductor (guide)
- Examples for a semantic-ID-based retrieval model through TorchRec and Megatron-Core integration
- DynamicEmb for model-parallel dynamic embedding tables with zero-collision hashing, eviction, admission control, table fusion, and TorchRec integration (documentation)
- [2026/4/14] 🎉v26.03 released!
- We added Torch export and AOTInductor packaging for end-to-end HSTU C++ inference. See the HSTU inference overview and the C++ inference guide.
- We improved DynamicEmb with table fusion and expansion, relaxed embedding-table alignment (no longer power-of-two), and capacity sizing aligned to bucket_capacity. See DynamicEmb.
- We added an HSTU end-to-end training benchmark suite with progressive optimizations. See the HSTU training benchmark and E2E benchmark notes.
- We published HSTU inference benchmark results on B200 in the HSTU inference benchmark.
- We migrated HSTU attention to fbgemm_gpu_hstu, removed the legacy compatibility layer, and improved the training stack (fewer device-to-host syncs in jagged tensor handling, balancer tuning, and debug logging). See HSTU training setup.
- [2026/2/13] 🎉v26.01 released!
- We optimized the HSTU KVCacheManager, moving Python-based KV cache management to an optimized C++ implementation with asynchronous onload/offload operations and compression support. Benchmarks show that onload and offload latency can be fully hidden during HSTU inference.
- We introduced an HSTU training optimization with workload-balanced batch shuffling for data-parallel training.
- We added caching and prefetching support for EmbeddingBagCollection.
- [2026/1/13] 🎉v25.12 released!
- Added Triton Inference Server support for HSTU inference. Follow the HSTU inference TritonServer example to try it out.
- We introduced our first semantic-id retrieval model example. Follow the semantic-id retrieval (sid_gr) documentation to run it.
- [2025/12/10] 🎉v25.11 released!
- DynamicEmb supports embedding admission, which decides whether a new feature ID is allowed to create or update an embedding entry in the dynamic embedding table. By controlling admission, the system prevents very rare or noisy IDs from consuming parameters and optimizer state that bring little training benefit.
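To make the admission idea concrete, here is a minimal, illustrative sketch of a frequency-threshold admission policy in plain Python. This is not the DynamicEmb API; the class and method names (`FrequencyAdmissionPolicy`, `observe`, `threshold`) are invented for illustration only.

```python
from collections import Counter

class FrequencyAdmissionPolicy:
    """Illustrative admission policy (hypothetical, not the DynamicEmb API):
    a feature ID may only create an embedding entry after it has been seen
    at least `threshold` times, so very rare or noisy IDs never consume
    parameter memory or optimizer state."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.counts = Counter()  # occurrence count per not-yet-admitted ID
        self.admitted = set()    # IDs that own an embedding entry

    def observe(self, feature_id: int) -> bool:
        """Return True if this ID may create/update an embedding entry."""
        if feature_id in self.admitted:
            return True
        self.counts[feature_id] += 1
        if self.counts[feature_id] >= self.threshold:
            self.admitted.add(feature_id)
            del self.counts[feature_id]  # stop tracking once admitted
            return True
        return False

policy = FrequencyAdmissionPolicy(threshold=2)
print(policy.observe(42))  # False: first sighting, below threshold
print(policy.observe(42))  # True: second sighting crosses the threshold
print(policy.observe(7))   # False: a rare ID stays out of the table
```

A production policy would additionally bound the tracking state itself (e.g. with a count-min sketch or decayed counters); the sketch above only shows the admit/reject decision.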
- [2025/11/11] 🎉v25.10 released!
- The HSTU training example supports sequence parallelism.
- DynamicEmb supports LRU score checkpointing and gradient clipping.
- Decoupled the scaling sequence length from the maximum sequence length limit in HSTU attention, and extended HSTU training support to the SM89 GPU architecture.
- [2025/10/20] 🎉v25.09 released!
- Integrated prefetching and caching into the HSTU training example.
- DynamicEmb now supports distributed embedding dumping and memory scaling.
- Added kernel fusion in the HSTU block for inference, including KVCache fixes.
- HSTU attention now supports FP8 quantization.
- [2025/9/8] 🎉v25.08 released!
- Added cache support for DynamicEmb, enabling seamless hot-embedding migration between cache and storage.
- Released an end-to-end HSTU inference example, demonstrating precision aligned with training.
- Enabled evaluation mode support for DynamicEmb.
- [2025/8/1] 🎉v25.07 released!
- Released the HSTU inference benchmark, including a paged KV cache HSTU kernel, a KV cache manager based on TensorRT-LLM, CUDA graphs, and other optimizations.
- Added support for Tensor Parallelism in the HSTU layer.
- [2025/7/4] 🎉v25.06 released!
- Improved DynamicEmb lookup module performance and added LFU eviction support.
- Added pipeline support for the HSTU example, recompute support for the HSTU layer, and custom CUDA ops for jagged tensor concatenation.
- [2025/5/29] 🎉v25.05 released!
- Enhanced DynamicEmb functionality, including support for EmbeddingBagCollection, truncated normal initialization, and initial_accumulator_value for Adagrad.
- Fused operations such as LayerNorm and dropout in the HSTU layer, yielding about a 1.2x end-to-end speedup.
- Fixed convergence issues on the KuaiRand dataset.
For more detailed release notes, please refer to our releases.
Supported examples:
- HSTU recommender examples
- HSTU inference — KV cache, Triton Inference Server, C++ AOTInductor
- SID-based generative recommender examples
Please see our contributing guidelines for details on how to contribute to this project.
- NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design
- NVIDIA recsys-examples: Efficient Practices for Large-Scale Training and Inference of Generative Recommender Systems (Part 1)
Join our community channels to ask questions, provide feedback, and interact with other users and developers:
- GitHub Issues: For bug reports and feature requests
- NVIDIA Developer Forums
If you use RecSys Examples in your research, please cite:
```bibtex
@Manual{,
  title  = {RecSys Examples: A collection of recommender system implementations},
  author = {NVIDIA Corporation},
  year   = {2024},
  url    = {https://github.com/NVIDIA/recsys-examples},
}
```
For more citation information and referenced papers, see CITATION.md.
This project is licensed under the Apache License - see the LICENSE file for details.