NVIDIA RecSys Examples is a collection of optimized recommender models and components.
The project includes:
- Examples for large-scale HSTU ranking and retrieval training through TorchRec and Megatron-Core integration
- HSTU inference with paged KV cache, Triton Inference Server integration, CUDA graph usage, and C++ deployment with AOTInductor (guide)
- Examples for a semantic-ID-based retrieval model through TorchRec and Megatron-Core integration
- DynamicEmb for model-parallel dynamic embedding tables with zero-collision hashing, eviction, admission control, table fusion, and TorchRec integration (documentation)
- [2026/4/14] 🎉v26.03 released!
- We added Torch export and AOTInductor packaging for end-to-end HSTU C++ inference. See the HSTU inference overview and the C++ inference guide.
- We improved DynamicEmb with table fusion and expansion, relaxed embedding-table alignment (no longer power-of-two), and capacity sizing aligned to bucket_capacity. See DynamicEmb.
- We added an HSTU end-to-end training benchmark suite with progressive optimizations. See the HSTU training benchmark and E2E benchmark notes.
- We published HSTU inference benchmark results on B200 in the HSTU inference benchmark.
- We migrated HSTU attention to fbgemm_gpu_hstu, removed the legacy compatibility layer, and improved the training stack (fewer device-to-host syncs in jagged tensor handling, balancer tuning, and debug logging). See HSTU training setup.
- [2026/2/13] 🎉v26.01 released!
- We optimized the HSTU KVCacheManager, moving Python-based KV cache management to an optimized C++ implementation with asynchronous onload/offload operations and compression support. Benchmarks show that onload and offload latency can be fully hidden during HSTU inference.
- We introduced an HSTU training optimization with workload-balanced batch shuffling for data-parallel training.
- We added caching and prefetching support for EmbeddingBagCollection.
- [2026/1/13] 🎉v25.12 released!
- Added Triton Inference Server support for HSTU inference. Follow the HSTU inference TritonServer example to try it out.
- We introduced our first semantic-id retrieval model example. Follow the semantic-id retrieval (sid_gr) documentation to run it.
- [2025/12/10] 🎉v25.11 released!
- DynamicEmb supports embedding admission, which decides whether a new feature ID is allowed to create or update an embedding entry in the dynamic embedding table. By controlling admission, the system prevents very rare or noisy IDs from consuming parameters and optimizer state that bring little training benefit.
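To make the admission idea concrete, here is a minimal, illustrative sketch of a frequency-threshold admission policy in plain Python. This is not the DynamicEmb API; the class and method names (`FrequencyAdmissionPolicy`, `observe`, `threshold`) are invented for illustration only.

```python
from collections import Counter

class FrequencyAdmissionPolicy:
    """Illustrative admission policy (hypothetical, not the DynamicEmb API):
    a feature ID may only create an embedding entry after it has been seen
    at least `threshold` times, so very rare or noisy IDs never consume
    parameter memory or optimizer state."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.counts = Counter()  # occurrence count per not-yet-admitted ID
        self.admitted = set()    # IDs that own an embedding entry

    def observe(self, feature_id: int) -> bool:
        """Return True if this ID may create/update an embedding entry."""
        if feature_id in self.admitted:
            return True
        self.counts[feature_id] += 1
        if self.counts[feature_id] >= self.threshold:
            self.admitted.add(feature_id)
            del self.counts[feature_id]  # stop tracking once admitted
            return True
        return False

policy = FrequencyAdmissionPolicy(threshold=2)
print(policy.observe(42))  # False: first sighting, below threshold
print(policy.observe(42))  # True: second sighting crosses the threshold
print(policy.observe(7))   # False: a rare ID stays out of the table
```

A production policy would additionally bound the tracking state itself (e.g. with a count-min sketch or decayed counters); the sketch above only shows the admit/reject decision.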
- [2025/11/11] 🎉v25.10 released!
- The HSTU training example supports sequence parallelism.
- DynamicEmb supports LRU score checkpointing and gradient clipping.
- Decoupled the scaling sequence length from the maximum sequence length limit in HSTU attention, and extended HSTU training support to the SM89 GPU architecture.
- [2025/10/20] 🎉v25.09 released!
- Integrated prefetching and caching into the HSTU training example.
- DynamicEmb now supports distributed embedding dumping and memory scaling.
- Added kernel fusion in the HSTU block for inference, including KVCache fixes.
- HSTU attention now supports FP8 quantization.
- [2025/9/8] 🎉v25.08 released!
- Added cache support for DynamicEmb, enabling seamless hot-embedding migration between cache and storage.
- Released an end-to-end HSTU inference example, demonstrating precision aligned with training.
- Enabled evaluation mode support for DynamicEmb.
- [2025/8/1] 🎉v25.07 released!
- Released the HSTU inference benchmark, including a paged KV cache HSTU kernel, a KV cache manager based on TensorRT-LLM, CUDA graphs, and other optimizations.
- Added support for Tensor Parallelism in the HSTU layer.
- [2025/7/4] 🎉v25.06 released!
- Improved DynamicEmb lookup module performance and added LFU eviction support.
- Added pipeline support for the HSTU example, recompute support for the HSTU layer, and custom CUDA ops for jagged tensor concatenation.
- [2025/5/29] 🎉v25.05 released!
- Enhanced DynamicEmb functionality, including support for EmbeddingBagCollection, truncated normal initialization, and initial_accumulator_value for Adagrad.
- Fused operations such as LayerNorm and dropout in the HSTU layer, yielding about a 1.2x end-to-end speedup.
- Fixed convergence issues on the KuaiRand dataset.
For more detailed release notes, please refer to our releases.
Supported examples:
- HSTU recommender examples
- HSTU inference — KV cache, Triton Inference Server, C++ AOTInductor
- SID-based generative recommender examples
Please see our contributing guidelines for details on how to contribute to this project.
- NVIDIA Platform Delivers Lowest Token Cost Enabled by Extreme Co-Design
- NVIDIA recsys-examples: Efficient Practices for Large-Scale Training and Inference of Generative Recommender Systems (Part 1)
Join our community channels to ask questions, provide feedback, and interact with other users and developers:
- GitHub Issues: For bug reports and feature requests
- NVIDIA Developer Forums
If you use RecSys Examples in your research, please cite:
```bibtex
@Manual{,
  title  = {RecSys Examples: A collection of recommender system implementations},
  author = {NVIDIA Corporation},
  year   = {2024},
  url    = {https://github.com/NVIDIA/recsys-examples},
}
```
For more citation information and referenced papers, see CITATION.md.
This project is licensed under the Apache License - see the LICENSE file for details.