Hi, I have been experimenting with FluxRT on an RTX 3090 and wanted to share the methodology and benchmark findings before opening any PRs.
The main result I would highlight is not just raw FPS. The 4-bit path cuts peak CUDA memory enough to make the live 24GB RTX 3090 setup much more practical, while still keeping comparable or better throughput after the runtime optimizations described below.
My local fork has diverged quite a bit from upstream, so I kept upstream as the reference implementation and compared three paths separately:
- upstream BF16
- upstream Quanto int8
- my local RTX 3090 optimized 4-bit path
Methodology
Benchmark setup was kept matched across runs:
- GPU: RTX 3090 24GB
- resolution: 256x448
- steps: 2
- same prompt and seed
- RIFE x2 interpolation enabled
- OpenCV preview disabled during benchmark
- upstream scripts launched with `PYTHONPATH=src` to ensure the upstream worktree package was used
I also downloaded and tested the upstream int8 weights from aydin99/FLUX.2-klein-4B-int8.
Since optimum-quanto requires a newer PyTorch stack than my local benchmark environment, I tested int8 in a separate conda environment to avoid affecting the local optimized 4-bit path.
The local optimized path includes:
- BitsAndBytes 4-bit / FP16 compute
- RTX 3090-oriented config
- VAE TensorRT decoder-only path
- compact/sparse single-block optimizations
- benchmark/profiling improvements
Results
Memory headroom
For the local A/B runs, the largest gain is peak VRAM reduction:
| Mode | Peak allocated | Peak reserved |
| --- | --- | --- |
| BF16 local path | 21363.62 MB | 21404 MB |
| 8-bit local path | 14332.73 MB | 14800 MB |
| 4-bit local path | 10292.01 MB | 10762 MB |
| 4-bit delta vs BF16 | -11071.61 MB | -10642 MB |
So the 4-bit path recovers about 10.6-11.1 GB of CUDA memory headroom compared with the local BF16 baseline. On a 24GB RTX 3090, that is the part that changes the operating envelope: less OOM pressure, more room for RIFE, reference-image features, caches, and future TensorRT experiments.
One caveat: the upstream benchmark script does not currently report CUDA memory, so I am not presenting this as a direct upstream BF16/int8 memory comparison. The memory numbers above come from the local BF16/8-bit/4-bit A/B profiler. I can port the same memory reporting to the upstream benchmark path if that would be useful.
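As a rough idea of what porting the memory reporting could look like, here is a minimal sketch built on PyTorch's peak-memory counters. The helper names `bytes_to_mb` and `measure_peak_cuda_memory` are hypothetical and not taken from either codebase, and the 1024 * 1024 divisor is an assumption about how the MB values above were derived.

```python
def bytes_to_mb(num_bytes: int) -> float:
    """Convert bytes to MB, assuming 1024 * 1024 bytes per MB."""
    return num_bytes / (1024 * 1024)


def measure_peak_cuda_memory(run_benchmark):
    """Run `run_benchmark()` and return (peak_allocated_mb, peak_reserved_mb).

    Returns None when PyTorch or a CUDA device is not available, so the
    sketch can be imported on machines without a GPU.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    # Clear earlier peaks so the numbers reflect this run only.
    torch.cuda.reset_peak_memory_stats()
    run_benchmark()
    # Wait for queued kernels to finish before reading the counters.
    torch.cuda.synchronize()
    return (
        bytes_to_mb(torch.cuda.max_memory_allocated()),
        bytes_to_mb(torch.cuda.max_memory_reserved()),
    )
```

Wrapping the existing benchmark loop in a callable and passing it to `measure_peak_cuda_memory` would give one allocated/reserved pair per mode, which is the shape of the table above.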
Throughput
Base FPS here means the benchmark's averaged per-frame FPS before RIFE interpolation.
Interpolated FPS includes RIFE x2, matching the live output path.
The upstream benchmark labels its final column as `FPS`, but it averages per-frame FPS after applying the `interpolation_exp=1` RIFE x2 multiplier. For upstream runs, base FPS below is computed as `reported averaged FPS / 2`. This matches what upstream would report if it averaged per-frame base FPS directly, because halving every per-frame value halves the average. It is not computed as `1 / average processing time`. The local benchmark reports both base and interpolated FPS directly.
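To make the normalization concrete, here is a small sketch with made-up frame times (not benchmark data). It shows that halving the averaged interpolated FPS gives the same number as averaging per-frame base FPS, and that both differ from `1 / average processing time`.

```python
from statistics import mean


def per_frame_fps(frame_times_s, multiplier=1):
    """Per-frame FPS values, optionally scaled by an interpolation multiplier."""
    return [multiplier / t for t in frame_times_s]


# Illustrative per-frame processing times in seconds (invented values).
frame_times_s = [0.20, 0.25, 0.40]

# Averaged per-frame FPS after the RIFE x2 multiplier, as described above.
interp_avg = mean(per_frame_fps(frame_times_s, multiplier=2))

# Base FPS recovered by halving the reported average.
base_from_reported = interp_avg / 2

# Base FPS averaged directly per frame.
base_direct = mean(per_frame_fps(frame_times_s))

# A different quantity: FPS derived from the mean processing time.
fps_from_mean_time = 1 / mean(frame_times_s)

print(f"base from reported avg / 2: {base_from_reported:.3f}")
print(f"base averaged directly:     {base_direct:.3f}")
print(f"1 / average processing time: {fps_from_mean_time:.3f}")
```

The mean of per-frame FPS is never lower than `1 / average processing time` and is strictly higher whenever frame times vary, so the two definitions should not be mixed when comparing runs.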
| Dynamic area | Upstream BF16 base / interp | Upstream int8 base / interp | Optimized 4-bit base / interp |
| --- | --- | --- | --- |
| 0% | 4.33 / 8.65 | 4.15 / 8.29 | 7.24 / 14.48 |
| 10% | 3.72 / 7.43 | 3.15 / 6.29 | 5.16 / 10.32 |
| 25% | 3.24 / 6.47 | 3.07 / 6.13 | 4.71 / 9.42 |
| 50% | 3.07 / 6.14 | 2.92 / 5.83 | 4.32 / 8.64 |
| 75% | 3.18 / 6.35 | 2.72 / 5.44 | 4.02 / 8.04 |
| 90% | 2.84 / 5.68 | 2.64 / 5.28 | 3.66 / 7.33 |
| 100% | 2.22 / 4.44 | 2.46 / 4.92 | 3.02 / 6.05 |
Average base FPS across tested dynamic areas:
- upstream BF16: ~3.23
- upstream int8: ~3.02
- optimized 4-bit path: ~4.59
Average interpolated FPS across tested dynamic areas:
- upstream BF16: ~6.45
- upstream int8: ~6.03
- optimized 4-bit path: ~9.18
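The averages above can be reproduced from the throughput table with a few lines. This is only a convenience sketch. The numbers are copied from the table, and the `results` layout and `averages` helper are my own naming.

```python
from statistics import mean

# (base FPS, interpolated FPS) per dynamic area: 0, 10, 25, 50, 75, 90, 100 percent.
results = {
    "upstream BF16": [(4.33, 8.65), (3.72, 7.43), (3.24, 6.47), (3.07, 6.14),
                      (3.18, 6.35), (2.84, 5.68), (2.22, 4.44)],
    "upstream int8": [(4.15, 8.29), (3.15, 6.29), (3.07, 6.13), (2.92, 5.83),
                      (2.72, 5.44), (2.64, 5.28), (2.46, 4.92)],
    "optimized 4-bit": [(7.24, 14.48), (5.16, 10.32), (4.71, 9.42), (4.32, 8.64),
                        (4.02, 8.04), (3.66, 7.33), (3.02, 6.05)],
}


def averages(rows):
    """Return (average base FPS, average interpolated FPS), rounded to 2 places."""
    base = mean(b for b, _ in rows)
    interp = mean(i for _, i in rows)
    return round(base, 2), round(interp, 2)


for name, rows in results.items():
    base_avg, interp_avg = averages(rows)
    print(f"{name}: base ~{base_avg}, interp ~{interp_avg}")
```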
Findings
- The most useful result for RTX 3090 support is the memory reduction: the 4-bit local path reduced peak allocated CUDA memory from ~21.36 GB to ~10.29 GB, and peak reserved memory from ~21.40 GB to ~10.76 GB.
- This makes the 24GB setup less fragile and leaves headroom for the rest of the live pipeline instead of running close to the card limit.
- The optimized 4-bit path was faster than both upstream BF16 and upstream int8 across all tested dynamic-area levels.
- Compared with the best upstream result per dynamic area, the speedup ranged from ~1.23x to ~1.67x.
- Compared directly with upstream int8, the optimized 4-bit path ranged from ~1.23x to ~1.75x faster.
- Upstream int8 was not a general performance win on my RTX 3090: it was slower than BF16 from 0% to 90% dynamic area, and only faster at 100%.
- Even at 100%, where int8 beat BF16, the optimized 4-bit path was still faster.
Important caveat: I would not interpret this as "4-bit is always faster than int8". The tested path combines 4-bit loading with other RTX 3090-specific runtime optimizations, so the fair claim is that this optimized 4-bit path performs better on my 3090 under the benchmark conditions above.
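For anyone who wants to check the speedup ranges quoted in the findings, here is a short sketch that recomputes them from the interpolated FPS columns of the throughput table. The variable names are mine. The base columns give nearly identical ratios, apart from small rounding differences such as ~1.74x instead of ~1.75x for the largest int8 ratio.

```python
# Interpolated FPS per dynamic area, copied from the throughput table.
areas = [0, 10, 25, 50, 75, 90, 100]
bf16 = [8.65, 7.43, 6.47, 6.14, 6.35, 5.68, 4.44]
int8 = [8.29, 6.29, 6.13, 5.83, 5.44, 5.28, 4.92]
opt4 = [14.48, 10.32, 9.42, 8.64, 8.04, 7.33, 6.05]

# Speedup of the optimized 4-bit path over the best upstream result per area.
vs_best = [o / max(b, i) for o, b, i in zip(opt4, bf16, int8)]

# Speedup of the optimized 4-bit path over upstream int8 only.
vs_int8 = [o / i for o, i in zip(opt4, int8)]

# Dynamic areas where upstream int8 beat upstream BF16.
int8_wins = [a for a, b, i in zip(areas, bf16, int8) if i > b]

print(f"vs best upstream: {min(vs_best):.2f}x to {max(vs_best):.2f}x")
print(f"vs upstream int8: {min(vs_int8):.2f}x to {max(vs_int8):.2f}x")
print(f"int8 faster than BF16 at: {int8_wins}")
```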
Quantization and TensorRT caveat
The optimized path uses BitsAndBytes as a practical PyTorch runtime path:
- 4-bit NF4 weight storage
- FP16 compute
- quantized transformer and text encoder loading from the BF16 checkpoint
This was very useful for reducing VRAM pressure on the RTX 3090 and made the rest of the optimization path practical.
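For readers unfamiliar with BitsAndBytes, the three points above roughly correspond to a quantization config like the sketch below. This is an illustration, not the fork's actual loading code. The keyword names are those of the `BitsAndBytesConfig` class in `transformers` (`diffusers` ships a class with the same keywords), but which library the fork uses, and the `build_bnb_config` helper itself, are assumptions.

```python
# Keyword arguments for 4-bit NF4 weight storage with FP16 compute.
# The compute dtype is given as a string here; a torch.dtype also works.
NF4_FP16_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "float16",
}


def build_bnb_config():
    """Build a BitsAndBytesConfig if transformers is importable, else None.

    Hypothetical helper. Constructing the config may also require the
    bitsandbytes package to be installed.
    """
    try:
        from transformers import BitsAndBytesConfig
    except ImportError:
        return None
    return BitsAndBytesConfig(**NF4_FP16_KWARGS)
```

A config like this is typically passed as `quantization_config=...` to a `from_pretrained` call, so the BF16 checkpoint is quantized at load time rather than stored in 4-bit form.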
However, I would not present BitsAndBytes as the final TensorRT path. In local TensorRT/ONNX probes, the 4-bit BitsAndBytes projection modules did not export cleanly. So the current conclusion is:
- BitsAndBytes is useful for the current PyTorch RTX 3090 runtime.
- A full TensorRT transformer path would likely need a separate design, such as an FP16 TensorRT path, a TensorRT-native INT8/INT4 path, or custom/plugin handling for quantized linear layers and FluxRT's sparse/cache behavior.
Suggested next step
Since this is more than one isolated change, I did not want to open one large PR immediately.
Would you prefer that I split this into smaller PRs, for example:
- benchmark/config improvements first,
- RTX 3090 config and reproduction docs,
- runtime optimization PRs split by area?
I attached the benchmark logs, public configs, and a technical inventory of the local fork changes used for the comparison.
benchmark_summary.md
local_4bit_optimized_benchmark_clean.txt
local_changes_technical_inventory.md
stream_processor_3090_4bit_config_public.json
upstream_3090_compare_config_public.json
upstream_bf16_benchmark_clean.txt
upstream_int8_benchmark_clean.txt