[Discussion] RTX 3090 optimized 4-bit path benchmark comparison vs upstream BF16 and int8


Hi, I have been experimenting with FluxRT on an RTX 3090 and wanted to share the methodology and benchmark findings before opening any PRs.

The main result I would highlight is memory, not raw FPS. The 4-bit path cuts peak CUDA memory by roughly 10.6-11.1 GB relative to the local BF16 baseline, which makes the live setup on a 24GB RTX 3090 much more practical. With the runtime optimizations described below, it was also faster than both upstream paths at every tested dynamic-area level.

My local fork has diverged quite a bit from upstream, so I kept upstream as the reference implementation and compared three paths separately:

1. upstream BF16
2. upstream Quanto int8
3. my local RTX 3090 optimized 4-bit path

## Methodology

The benchmark setup was matched across runs:

- GPU: RTX 3090 24GB
- resolution: 256x448
- steps: 2
- same prompt and seed
- RIFE x2 interpolation enabled
- OpenCV preview disabled during benchmark
- upstream scripts launched with `PYTHONPATH=src` to ensure the upstream worktree package was used

I also downloaded and tested the upstream int8 weights from `aydin99/FLUX.2-klein-4B-int8`.

Since `optimum-quanto` requires a newer PyTorch stack than my local benchmark environment, I tested int8 in a separate conda environment to avoid affecting the local optimized 4-bit path.

The local optimized path includes:

- BitsAndBytes 4-bit / FP16 compute
- RTX 3090-oriented config
- VAE TensorRT decoder-only path
- compact/sparse single-block optimizations
- benchmark/profiling improvements

## Results

### Memory headroom

For the local A/B runs, the largest gain is peak VRAM reduction:

| Mode | Peak allocated | Peak reserved |
| --- | ---: | ---: |
| BF16 local path | 21363.62 MB | 21404 MB |
| 8-bit local path | 14332.73 MB | 14800 MB |
| 4-bit local path | 10292.01 MB | 10762 MB |
| 4-bit delta vs BF16 | -11071.61 MB | -10642 MB |

So the 4-bit path recovers about 10.6-11.1 GB of CUDA memory headroom compared with the local BF16 baseline. On a 24GB RTX 3090, that is the part that changes the operating envelope: less OOM pressure, more room for RIFE, reference-image features, caches, and future TensorRT experiments.

One caveat: the upstream benchmark script does not currently report CUDA memory, so I am not presenting this as a direct upstream BF16/int8 memory comparison. The memory numbers above come from the local BF16/8-bit/4-bit A/B profiler. I can port the same memory reporting to the upstream benchmark path if that would be useful.
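
For reference, here is a minimal sketch of what such memory reporting could look like around a benchmark run. It assumes PyTorch's built-in CUDA peak-memory counters; `report_peak_cuda_memory` and `run_benchmark` are hypothetical names, the MB conversion (1 MB = 1024 * 1024 bytes) is an assumption, and this is not a copy of the local profiler.

```python
def bytes_to_mb(num_bytes: int) -> float:
    """Convert bytes to MB, assuming 1 MB = 1024 * 1024 bytes."""
    return num_bytes / (1024 ** 2)


def report_peak_cuda_memory(run_benchmark) -> dict:
    """Run `run_benchmark()` once and return peak CUDA memory in MB.

    `run_benchmark` is a hypothetical zero-argument callable that executes
    the workload to be measured.
    """
    import torch  # imported lazily so bytes_to_mb works without PyTorch

    # Clear the peak counters so earlier allocations are not counted.
    torch.cuda.reset_peak_memory_stats()
    run_benchmark()
    # Wait for queued GPU work to finish before reading the counters.
    torch.cuda.synchronize()
    return {
        "peak_allocated_mb": bytes_to_mb(torch.cuda.max_memory_allocated()),
        "peak_reserved_mb": bytes_to_mb(torch.cuda.max_memory_reserved()),
    }
```

"Allocated" counts memory held by live tensors, while "reserved" also includes memory cached by PyTorch's allocator, which is why the reserved column in the table above is the larger of the two.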

### Throughput

Base FPS here means the benchmark's averaged per-frame FPS before RIFE interpolation.
Interpolated FPS includes RIFE x2, matching the live output path.

The upstream benchmark labels its final column as `FPS`, but it averages per-frame FPS after applying the `interpolation_exp=1` RIFE x2 multiplier. For upstream runs, base FPS below is therefore computed as `reported averaged FPS / 2`, which matches what upstream would report if it averaged per-frame base FPS directly. It is not computed as `1 / average processing time`; the two differ whenever frame times vary, because the mean of per-frame FPS is never lower than the reciprocal of the mean frame time. The local benchmark reports both base and interpolated FPS directly.
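
To make the distinction between the two averaging conventions concrete, here is a small sketch. The frame times are made-up illustration values, not benchmark data, and the function names are hypothetical.

```python
from statistics import mean


def mean_per_frame_fps(frame_times_s, interp_multiplier=1):
    """Average of per-frame FPS values (the convention used in the tables)."""
    return mean(interp_multiplier / t for t in frame_times_s)


def fps_from_mean_time(frame_times_s, interp_multiplier=1):
    """FPS derived from the average processing time (not used here)."""
    return interp_multiplier / mean(frame_times_s)


# Hypothetical per-frame processing times in seconds.
frame_times = [0.20, 0.25, 0.50]

reported_interp = mean_per_frame_fps(frame_times, interp_multiplier=2)
base_from_report = reported_interp / 2           # how upstream base FPS is derived
base_direct = mean_per_frame_fps(frame_times)    # averaging base FPS directly
base_from_time = fps_from_mean_time(frame_times)

print(f"reported interpolated FPS: {reported_interp:.2f}")
print(f"base FPS (reported / 2):   {base_from_report:.2f}")
print(f"base FPS (direct average): {base_direct:.2f}")
print(f"1 / mean frame time:       {base_from_time:.2f}")
```

Halving the reported figure recovers the directly averaged base FPS exactly, because the x2 multiplier is a constant factor inside the mean. The `1 / mean frame time` figure comes out lower for these sample times, so mixing the two conventions across columns would bias the comparison.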

| Dynamic area | Upstream BF16 base / interp | Upstream int8 base / interp | Optimized 4-bit base / interp |
| ---: | ---: | ---: | ---: |
| 0% | 4.33 / 8.65 | 4.15 / 8.29 | 7.24 / 14.48 |
| 10% | 3.72 / 7.43 | 3.15 / 6.29 | 5.16 / 10.32 |
| 25% | 3.24 / 6.47 | 3.07 / 6.13 | 4.71 / 9.42 |
| 50% | 3.07 / 6.14 | 2.92 / 5.83 | 4.32 / 8.64 |
| 75% | 3.18 / 6.35 | 2.72 / 5.44 | 4.02 / 8.04 |
| 90% | 2.84 / 5.68 | 2.64 / 5.28 | 3.66 / 7.33 |
| 100% | 2.22 / 4.44 | 2.46 / 4.92 | 3.02 / 6.05 |

Average base FPS across tested dynamic areas:

- upstream BF16: ~3.23
- upstream int8: ~3.02
- optimized 4-bit path: ~4.59

Average interpolated FPS across tested dynamic areas:

- upstream BF16: ~6.45
- upstream int8: ~6.03
- optimized 4-bit path: ~9.18

## Findings

- The most useful result for RTX 3090 support is the memory reduction: the 4-bit local path reduced peak allocated CUDA memory from ~21.36 GB to ~10.29 GB, and peak reserved memory from ~21.40 GB to ~10.76 GB.
- This makes the 24GB setup less fragile and leaves headroom for the rest of the live pipeline instead of running close to the card limit.
- The optimized 4-bit path was faster than both upstream BF16 and upstream int8 across all tested dynamic-area levels.
- Compared with the best upstream result per dynamic area, the speedup ranged from ~1.23x to ~1.67x.
- Compared directly with upstream int8, the optimized 4-bit path ranged from ~1.23x to ~1.75x faster.
- Upstream int8 was not a general performance win on my RTX 3090: it was slower than BF16 from 0% to 90% dynamic area, and only faster at 100%.
- Even at 100%, where int8 beat BF16, the optimized 4-bit path was still faster.
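
The speedup ranges above can be rechecked from the base FPS values in the throughput table with a short script. This is a sketch using the rounded table values, so the last digit can differ slightly from figures computed on unrounded logs (for example, the top of the int8 range comes out near 1.74-1.75x).

```python
# Base FPS per dynamic area (%), copied from the throughput table:
# (upstream BF16, upstream int8, optimized 4-bit)
BASE_FPS = {
    0: (4.33, 4.15, 7.24),
    10: (3.72, 3.15, 5.16),
    25: (3.24, 3.07, 4.71),
    50: (3.07, 2.92, 4.32),
    75: (3.18, 2.72, 4.02),
    90: (2.84, 2.64, 3.66),
    100: (2.22, 2.46, 3.02),
}


def speedup_range(pick_reference):
    """Min and max of 4-bit FPS divided by the chosen upstream reference."""
    ratios = [
        fourbit / pick_reference(bf16, int8)
        for bf16, int8, fourbit in BASE_FPS.values()
    ]
    return min(ratios), max(ratios)


vs_best_upstream = speedup_range(max)                     # best of BF16 / int8
vs_int8 = speedup_range(lambda bf16, int8: int8)          # int8 only
int8_beats_bf16 = [
    area for area, (bf16, int8, _) in BASE_FPS.items() if int8 > bf16
]

print("vs best upstream: %.2fx to %.2fx" % vs_best_upstream)
print("vs upstream int8: %.2fx to %.2fx" % vs_int8)
print("dynamic areas where int8 beats BF16:", int8_beats_bf16)
```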

Important caveat: I would not interpret this as "4-bit is always faster than int8". The tested path combines 4-bit loading with other RTX 3090-specific runtime optimizations, so the fair claim is that this optimized 4-bit path performs better on my 3090 under the benchmark conditions above.

## Quantization and TensorRT caveat

The optimized path relies on BitsAndBytes as a practical quantization option for the PyTorch runtime:

- 4-bit NF4 weight storage
- FP16 compute
- transformer and text encoder quantized while loading from the BF16 checkpoint

This was very useful for reducing VRAM pressure on the RTX 3090 and made the rest of the optimization path practical.
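
As an illustration of the settings listed above, a 4-bit NF4 / FP16-compute configuration could be expressed roughly as follows. This is a sketch that assumes the `BitsAndBytesConfig` class from `transformers` (`diffusers` exposes a class of the same name with the same 4-bit options); the helper names are hypothetical, and the exact loading code in the fork is not reproduced here.

```python
def nf4_fp16_options() -> dict:
    """4-bit NF4 weight storage with FP16 compute, as plain keyword arguments."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": "float16",  # resolved to torch.float16 below
    }


def build_quantization_config():
    """Build a BitsAndBytesConfig from the options above.

    Requires torch and transformers; actually loading a model with the
    resulting config additionally requires the bitsandbytes package.
    """
    import torch
    from transformers import BitsAndBytesConfig

    options = nf4_fp16_options()
    # Turn the dtype name into the actual torch dtype object.
    options["bnb_4bit_compute_dtype"] = getattr(
        torch, options["bnb_4bit_compute_dtype"]
    )
    return BitsAndBytesConfig(**options)
```

The resulting object would typically be passed as `quantization_config=` to a model's `from_pretrained` call, so the BF16 checkpoint is quantized while it is loaded rather than stored as a separate 4-bit checkpoint.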

However, I would not present BitsAndBytes as the final TensorRT path. In local TensorRT/ONNX probes, the 4-bit BitsAndBytes projection modules did not export cleanly. So the current conclusion is:

- BitsAndBytes is useful for the current PyTorch RTX 3090 runtime.
- A full TensorRT transformer path would likely need a separate design, such as an FP16 TensorRT path, a TensorRT-native INT8/INT4 path, or custom/plugin handling for quantized linear layers and FluxRT's sparse/cache behavior.

## Suggested next step

Since this is more than one isolated change, I did not want to open one large PR immediately.

Would you prefer that I split this into smaller PRs, for example:

1. benchmark/config improvements first,
2. RTX 3090 config and reproduction docs,
3. runtime optimization PRs split by area?

I have attached the benchmark logs, public configs, and a technical inventory of the local fork changes used for the comparison.

[benchmark_summary.md](https://github.com/user-attachments/files/27569344/benchmark_summary.md)
[local_4bit_optimized_benchmark_clean.txt](https://github.com/user-attachments/files/27569343/local_4bit_optimized_benchmark_clean.txt)
[local_changes_technical_inventory.md](https://github.com/user-attachments/files/27569346/local_changes_technical_inventory.md)
[stream_processor_3090_4bit_config_public.json](https://github.com/user-attachments/files/27569347/stream_processor_3090_4bit_config_public.json)
[upstream_3090_compare_config_public.json](https://github.com/user-attachments/files/27569345/upstream_3090_compare_config_public.json)
[upstream_bf16_benchmark_clean.txt](https://github.com/user-attachments/files/27569348/upstream_bf16_benchmark_clean.txt)
[upstream_int8_benchmark_clean.txt](https://github.com/user-attachments/files/27569349/upstream_int8_benchmark_clean.txt)
