We provide a set of benchmarks for HSTU inference, covering both end-to-end inference and the paged HSTU inference layer.
```shell
~$ cd recsys-examples/examples/hstu
~$ export PYTHONPATH=${PYTHONPATH}:$(realpath ../)
~$ python3 ./inference/benchmark/inference_benchmark.py
~$ python3 ./inference/benchmark/paged_hstu_with_kvcache_benchmark.py
```

Here we present the benchmark results of the HSTU layers with KV cache on a single L20 GPU.
HSTU Setup for benchmark:
| Parameter | Value |
|---|---|
| Number of HSTU layers | 8 |
| Hidden Dim Size | 1024 |
| Number of Heads | 4 |
| Head Dim Size | 256 |
| Max Batch Size | 16 |
| Max Per Sequence Length | 4096 |
| Per Sequence Targets Number | 256 |
Notice: this config applies to all of the following benchmarks unless otherwise specified.
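The table above can be captured as a small Python config object for reference. This is a minimal sketch; the field names are illustrative and are not necessarily the parameter names used by the benchmark scripts.

```python
from dataclasses import dataclass


@dataclass
class HSTUBenchmarkConfig:
    """Benchmark setup from the table above (field names are illustrative)."""
    num_layers: int = 8           # Number of HSTU layers
    hidden_dim: int = 1024        # Hidden dim size
    num_heads: int = 4            # Number of attention heads
    head_dim: int = 256           # Per-head dim size
    max_batch_size: int = 16      # Max batch size
    max_seq_len: int = 4096       # Max per-sequence length
    num_targets_per_seq: int = 256  # Per-sequence targets number


cfg = HSTUBenchmarkConfig()
# Sanity check: for this setup, hidden dim = num_heads * head_dim.
assert cfg.hidden_dim == cfg.num_heads * cfg.head_dim
```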
Here we benchmarked with a synthetic input dataset:
- Each user's input sequence length ranges from 256 to 4096 tokens in increments of 256.
- Each input request has 256 item candidates for ranking.
- Data is generated for 1, 2, 4, and 8 users to benchmark different batch sizes.
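The dataset shape described above can be sketched as follows. This is a hedged illustration of the sweep, not the actual data-generation code from the benchmark scripts.

```python
def synthetic_sequence_lengths(step: int = 256, max_len: int = 4096) -> list[int]:
    """Sequence lengths swept by the benchmark: 256, 512, ..., 4096."""
    return list(range(step, max_len + 1, step))


NUM_CANDIDATES = 256        # ranking candidates appended to each request
BATCH_SIZES = [1, 2, 4, 8]  # number of users per batch in the sweep

lengths = synthetic_sequence_lengths()
print(len(lengths), lengths[0], lengths[-1])  # 16 sequence lengths, 256 .. 4096
```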
We can achieve a 1.3x ~ 2.6x inference speedup (with batch sizes ranging from 1 to 8) after applying the KV cache and CUDA graph optimizations.
Performance results:
Note:
- The baseline performance is based on our implementation without KVCache support and CUDA Graph optimization.
- The end-to-end performance includes the embedding part, which utilizes both the native `EmbeddingCollection` from TorchRec and `DynamicEmbedding`.
- The number of input sequences in the synthetic dataset increases with the batch size. All test cases have 16 batches in total.
- In the test cases with KVCache enabled, KV cache preparation and onloading/offloading are included in the time measurement, but they are hidden as asynchronous operations.
- Performance Results on L20:
When the input sequence has 4096 tokens, of which 3968 tokens have cached KV data, the HSTU block achieves a speedup of 3x ~ 20x without candidate items, and a speedup of 3x ~ 8x with an extra 256 candidates per sequence (for batch size = 1 and 8, respectively).
Notice: in the L20 benchmark, the KV cache page size is 32 tokens and the head dimension of the HSTU attention kernel is 256.
- Performance Results on B200 (using HSTU Blackwell CuteDSL kernel):
When the input sequence has 4096 tokens, of which 3968 tokens have cached KV data, the HSTU block achieves a speedup of 3.3x ~ 7.7x with an extra 256 candidates per sequence (for batch size = 1 and 8, respectively).
Notice: in the B200 benchmark, the KV cache page size is 128 tokens and the head dimension of the HSTU attention kernel is 128. This is limited by the current kernel implementations.
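The page sizes above determine how many KV cache pages the 3968 cached tokens occupy on each GPU. A small worked example (the helper name is ours, not part of the benchmark code):

```python
def pages_needed(num_tokens: int, page_size: int) -> int:
    """Ceiling division: number of KV cache pages required to hold num_tokens."""
    return -(-num_tokens // page_size)


# 3968 cached tokens from the benchmark above:
print(pages_needed(3968, 32))   # L20: 32-token pages -> 124 pages
print(pages_needed(3968, 128))  # B200: 128-token pages -> 31 pages
```

Larger pages mean fewer page-table entries to manage per sequence, at the cost of coarser-grained cache allocation.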




