What's Changed
Features & Enhancements
- Add Torch export for HSTU model by @jensenhwa in #327
- [Feature] dynamicemb table fusion and expansion by @jiashuy in #343
- feat(benchmark): HSTU E2E training benchmark suite with progressive optimizations by @JacoCheung in #340
- Add HSTU inference benchmark results on B200 by @geoffreyQiu in #338
- Relax alignment requirements (remove power-of-2 constraint) in dynamicemb by @jiashuy in #312
- perf: avoid D2H sync in _Split2DJaggedFunction by precomputing split lengths by @JacoCheung in #318
- refactor: migrate to fbgemm_gpu_hstu, remove legacy HSTU compat layer by @JacoCheung in #321
- Optimize balancer and set up debug logger by @JacoCheung in #308
- fix: align DynamicEmb capacity to bucket_capacity instead of DEMB_TABLE_ALIGN_SIZE by @JacoCheung in #329
Bug Fixes
- fix missing import by @gameofdimension in #320
- refactor: remove redundant apply_optimizer_in_backward in sharding.py by @ShaobinChen-AH in #330
- error handling for empty kv list by @gameofdimension in #331
- Fix docker, cmake and imports after torch export support by @geoffreyQiu in #358
- Make table_ptrs_dev persistent by @jiashuy in #356
- Create DynamicEmbStorage when zero local hbm; reset _prefetch_outstanding_keys only in reset_cache_states by @jiashuy in #354
- Fix empty batch hang fundamentally by @jiashuy in #349
- [bugfix] fix hang issue when fed empty batch by @gameofdimension in #342
- Fix optimizer states dim (ckpt) of rowwise adagrad by @jiashuy in #305
- Refactor test for alignment; add get_sharded_table_capacity by @jiashuy in #348
Misc
- fix(pipeline): drain eval pipeline naturally to prevent batch leak by @JacoCheung in #314
- Fix NVE dependency by @geoffreyQiu in #323
- refactor: move HSTU build to devel stage by @shijieliu in #325
- Upgrade to Torch 2.11 with CUDA 13.1 by @geoffreyQiu in #347
- Update HSTU inference README file by @geoffreyQiu in #360
New Contributors
- @jensenhwa made their first contribution in #327
Full Changelog: v26.01...v26.03