FSR

One-step rectified flow in a frozen VAE decoder feature space for x2 super-resolution.

FSR stands for Feature-Space Rectified Flow Super-Resolution: a compact name for the core idea tested here, moving decoder features directly with a learned rectified-flow vector field.

This repository packages an overnight experiment that asks a narrow question:

Can we learn a vector field inside an intermediate VAE decoder feature space that transports bicubic-upsampled LR decoder features toward HR decoder features, then use that field in one step at inference?

The short answer from this run is yes. At the selected decoder cut f3 (decoder.up_blocks.1), the learned one-step vector field clearly improves over feature bicubic, reaches the same rough x2 quality range as our local LUA benchmark, and is well above the current LSRNA x2 snapshot in this workspace. Four-step Euler is kept only as a diagnostic check of the learned field, not as the headline comparison.

This is not a diffusion-UNet SR method. It does not use a pretrained denoising UNet, does not call scheduler.add_noise(), and does not feed decoder features into SDXL, SD2.1, or FLUX denoisers. The learned module is a lightweight vector field trained directly in decoder feature space.

What Problem Is This Solving?

Many latent SR systems upscale the latent tensor itself or apply a deterministic feature projector. Here we test a different object: a transport field between two frozen VAE decoder feature distributions.

Given a high-resolution image crop x_HR, we construct a low-resolution image x_LR by bicubic downsampling. A frozen FLUX VAE encoder E maps both images into latent space, and the frozen decoder D is split at an internal cut k:

D = D_>k o D_<=k

For a selected decoder feature cut:

f_H = D_<=k(E(x_HR))
f_L = D_<=k(E(x_LR))
f_B = Bicubic(f_L, spatial_size=f_H)

The model learns a vector field v_theta that moves f_B toward f_H. The pixel target is the VAE reconstruction x_H_rec = D_>k(f_H), not raw HR. That keeps the objective on the frozen VAE decoder manifold.
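The decoder split can be illustrated with a toy sketch. The stages below are hypothetical scalar stand-ins, not the real FLUX VAE blocks, and `compose` and `split_decoder` are illustrative helper names that do not come from this repository. The only point is that running the head D_<=k and then the tail D_>k reproduces the full decoder, so a feature taken at the cut can be edited and then decoded by the frozen tail.

```python
from functools import reduce


def compose(stages):
    """Chain callables left to right: compose([a, b])(x) == b(a(x))."""
    return lambda x: reduce(lambda acc, stage: stage(acc), stages, x)


def split_decoder(stages, k):
    """Split an ordered stage list into D_<=k (first k stages) and D_>k (the rest)."""
    return compose(stages[:k]), compose(stages[k:])


# Hypothetical stand-ins for decoder stages; real stages are conv/upsampling blocks.
toy_stages = [
    lambda z: z + 1.0,  # stand-in for conv_in
    lambda h: h * 2.0,  # stand-in for up_blocks.0
    lambda h: h - 3.0,  # stand-in for up_blocks.1
    lambda h: h * 0.5,  # stand-in for the remaining decoder tail
]

full_decoder = compose(toy_stages)
head, tail = split_decoder(toy_stages, k=3)  # cut after the third stage

z = 4.0
feature = head(z)  # plays the role of f = D_<=k(z)
assert tail(feature) == full_decoder(z)
```

In a real PyTorch decoder the same split would more likely be done with a forward hook or by calling the sub-modules in order; that detail is not shown in this README.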

Rectified-Flow Formulation

For each feature pair (f_B, f_H), define:

f_0 = f_B
f_1 = f_H
sigma(t) = sigma_max * t * (1 - t)
f_t = (1 - t) * f_0 + t * f_1 + sigma(t) * eps

The model is trained with flow matching:

v_theta(f_t, t, cond=f_0) ~= f_1 - f_0
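As a minimal NumPy sketch of the interpolation and target above: the function names are illustrative, the mean-squared-error form of the matching loss is an assumption (the README only says "flow matching"), and the toy shapes do not match the real f3 features. The default sigma_max = 0.03 is the value listed for the overnight run.

```python
import numpy as np


def make_flow_pair(f0, f1, t, sigma_max=0.03, rng=None):
    """Return (f_t, target) for one rectified-flow training sample."""
    rng = np.random.default_rng() if rng is None else rng
    sigma_t = sigma_max * t * (1.0 - t)  # noise scale vanishes at t=0 and t=1
    eps = rng.standard_normal(f0.shape)
    f_t = (1.0 - t) * f0 + t * f1 + sigma_t * eps
    target = f1 - f0  # constant velocity along the straight path
    return f_t, target


def flow_matching_loss(v_pred, target):
    """Assumed loss form: mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - target) ** 2))


rng = np.random.default_rng(0)
f_b = rng.standard_normal((1, 4, 8, 8))  # toy stand-in for f_B
f_h = rng.standard_normal((1, 4, 8, 8))  # toy stand-in for f_H

f_t, target = make_flow_pair(f_b, f_h, t=0.5, rng=rng)
print(f_t.shape, flow_matching_loss(np.zeros_like(target), target))
```

Because sigma(t) is zero at both endpoints, the noisy path still starts exactly at f_B and ends exactly at f_H.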

At inference, the main path is one-step Euler:

f_hat = f_B + v_theta(f_B, t=0, cond=f_B)
x_hat = D_>k(f_hat)

We also recorded two-step/four-step Euler and stochastic one-step variants for diagnostics, but the reported method is the one-step path above.
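The one-step and multi-step paths differ only in the number of Euler steps. The sketch below assumes a callable with the signature `vector_field(f, t, cond)`, which mirrors the notation above but is not necessarily the signature used in the training script. It uses an oracle field on scalars so the result can be checked by hand.

```python
def euler_integrate(vector_field, f_b, num_steps=1):
    """Integrate df/dt = v(f, t, cond) from t=0 to t=1 with fixed-step Euler."""
    f = f_b
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        # The condition is always the starting feature f_B.
        f = f + dt * vector_field(f, t, f_b)
    return f


# Oracle field for a toy scalar pair: the exact straight-line velocity f_1 - f_0.
f0, f1 = 2.0, 5.0
oracle = lambda f, t, cond: f1 - f0

one_step = euler_integrate(oracle, f0, num_steps=1)
four_step = euler_integrate(oracle, f0, num_steps=4)
print(one_step, four_step)
```

With num_steps=1 the loop reduces to f_hat = f_B + v_theta(f_B, t=0, cond=f_B). When the field is perfectly straight, extra steps change nothing, which is why multi-step Euler is a useful diagnostic of how straight the learned field is.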

Method Summary

Frozen VAE: black-forest-labs/FLUX.1-dev, vae subfolder.
Decoder split candidates: f1 to f5.
Selected cut: f3, mapped to decoder.up_blocks.1.
Model: lightweight residual convolutional feature vector field with sinusoidal time embedding and optional gate.
Main signal: rectified-flow vector matching over random t.
Auxiliary signals: endpoint feature loss, FFT loss, one-step feature loss, decoded RGB loss, low-frequency anchor, high-frequency loss, drift control, and gate regularization.
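For readers unfamiliar with the sinusoidal time embedding mentioned in the model line, here is a standard form in plain Python. The embedding width, the `max_period` constant, and the function name are assumptions for illustration; the exact variant used by the training script is not documented in this README.

```python
import math


def sinusoidal_time_embedding(t, dim=8, max_period=10000.0):
    """Map scalar t to sin/cos features at geometrically spaced frequencies.

    dim must be even: the first half holds sines, the second half cosines.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * w) for w in freqs] + [math.cos(t * w) for w in freqs]


emb_t0 = sinusoidal_time_embedding(0.0)
emb_t1 = sinusoidal_time_embedding(1.0)
print(len(emb_t0), emb_t0)
```

In a convolutional vector field, such an embedding is typically passed through a small MLP and added to or used to modulate the feature maps in each residual block.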

The core training script is:

scripts/train_feature_rectified_flow_sr.py

Data Used

Training:

DIV2K train HR
Local path used in this run: /home/juhwan/Documents/sr/BasicSR/datasets/DIV2K/DIV2K_train_HR
Scale: x2
HR crop size: 512
LR construction: bicubic downsampling from HR crop

Validation/evaluation:

Set5
Set14
B100
Urban100
Manga109
FLUX179 generated images

Metrics in this repository are reported in two ways:

Primary experiment metrics compare to the FLUX VAE reconstruction target x_H_rec, because the model is explicitly trained to move along the frozen decoder feature manifold.
Cross-method metrics compare to raw HR so the result can be read beside local LUA and LSRNA x2 benchmark summaries. Those numbers are useful context, but are not a strict leaderboard because the evaluators and VAE backbones differ.
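The two protocols use the same metric with a different reference image. The sketch below shows this for PSNR on toy arrays in the [0, 1] range. The noise levels and array names are made up for illustration, and the repository's evaluator may differ in color space and border cropping.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


rng = np.random.default_rng(0)
x_hr = rng.random((16, 16, 3))  # toy stand-in for the raw HR crop
# Toy stand-ins for the VAE reconstruction and the model output.
x_h_rec = np.clip(x_hr + 0.02 * rng.standard_normal(x_hr.shape), 0.0, 1.0)
x_hat = np.clip(x_h_rec + 0.05 * rng.standard_normal(x_hr.shape), 0.0, 1.0)

primary = psnr(x_hat, x_h_rec)  # primary protocol: against the VAE reconstruction
cross = psnr(x_hat, x_hr)       # cross-method protocol: against raw HR
print(f"vs x_H_rec: {primary:.2f} dB, vs raw HR: {cross:.2f} dB")
```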

Cut Probe

The experiment first probed decoder cuts f1 to f5 on 32 DIV2K images. The selection score preferred a cut that had a non-trivial feature/high-frequency gap, stable decoded pixels, moderate decoder sensitivity, and feasible runtime.

Cut	Decoder Stage	Score	RGB L1	HF Error	FFT Gap	Sensitivity	Probe VRAM
f3	up_blocks.1	0.4707	0.09968	0.07354	2.13634	0.14372	1.083 GiB
f4	up_blocks.2	0.4315	0.06940	0.04755	3.77468	0.03831	1.721 GiB
f2	up_blocks.0	0.0014	0.12887	0.08451	0.63098	0.30649	0.864 GiB
f1	conv_in	-0.5271	0.17867	0.08729	0.52488	0.42026	0.802 GiB
f5	conv_act	-0.7000	0.07063	0.04542	0.03181	0.48919	0.955 GiB

f3 was selected. f4 was the second-best probe cut, but the default training configuration ran out of memory (OOM) at HR size 512 because of the large f4 feature tensor, 1 x 256 x 512 x 512. A reduced f4 run would need smaller hidden width, fewer blocks, lower HR size, lower pixel-loss frequency, or feature tiling.
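A quick size calculation shows why f4 is expensive. The shape is the f4 feature tensor quoted above and bf16 uses 2 bytes per value; `tensor_bytes` is an illustrative helper name. The hidden-activation shape assumes the vector field keeps the full spatial resolution of the feature with 128 hidden channels, which is an assumption about the architecture rather than something stated here.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element):
    """Storage for one dense tensor of the given shape."""
    return prod(shape) * bytes_per_element


BF16_BYTES = 2

f4_shape = (1, 256, 512, 512)           # f4 feature tensor at HR size 512
hidden_shape = (1, 128, 512, 512)       # assumed hidden activation at the same resolution

f4_mib = tensor_bytes(f4_shape, BF16_BYTES) / 2**20
hidden_mib = tensor_bytes(hidden_shape, BF16_BYTES) / 2**20
print(f"one f4 feature: {f4_mib:.0f} MiB, one hidden activation: {hidden_mib:.0f} MiB")
```

A single tensor is small next to 24 GB, but training keeps many such activations alive for the backward pass, on top of the frozen decoder tail used for pixel losses, so the total grows quickly.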

Overnight Run

Hardware:

GPU: NVIDIA RTX 3090 24 GB
Precision: bf16
Batch size: 1
Gradient accumulation: 8
Hidden channels: 128
Blocks: 8
Gate: enabled
sigma_max: 0.03

Training budget and progress:

Phase	Steps	Wall-clock
f3 warm-up	0 -> 2000	3.26 h
f3 main resume	2000 -> 6063	6.72 h
Total f3 updates	6063 optimizer steps	~9.98 h

The final main run stopped by time budget at step 6063, saved checkpoints, ran final validation, and synced W&B.

W&B run:

https://wandb.ai/standard_juhwan/feature-rectified-flow-sr/runs/sc1t0349

Training Cost In Context

This was deliberately a small hypothesis test: one selected f3 cut, x2 only, and no large multi-scale pretraining. Even so, it reached the useful regime in 6063 optimizer steps on one RTX 3090.

Method	Scope	Hardware	Wall-clock	GPU-hours	Steps / iters	Reported data
FSR	this repo, x2 f3	1x RTX 3090 24GB	9.98 h	9.98	6063	DIV2K crops, about 48.5K crop presentations
LSRNA LSR module	paper v1 arbitrary-scale LSR	1x V100-SXM2	26 h	26	200K	4.7M LR-HR latent pairs
LUA latent upscaler	paper x2/x4 multi-scale adapter	8x H100 80GB	34.1 h	272.8	375K	3.8M OpenImages latent pairs

The comparison is not apples-to-apples: LUA trains a multi-scale x2/x4 adapter, and LSRNA trains an arbitrary-scale LSR module for use with RNA and a guided denoising stage. The useful point is narrower: this decoder-feature RF prototype was much cheaper to validate as a one-step x2 transport hypothesis.

Sources: LUA arXiv 2511.10629 reports 3.8M pairs, three 125K-step stages, and 8x H100 training; the LSRNA CVPR 2025 paper/supplement reports 4.7M LR-HR latent pairs and the v1 200K-iteration, 26-hour V100 LSR training setting.
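The derived numbers in the cost table follow from simple arithmetic on the reported settings. The sketch below assumes that every accumulated micro-batch presents one new crop, which is how the "about 48.5K crop presentations" figure is read here.

```python
# Values copied from the Overnight Run section and the cost table.
optimizer_steps = 6063
batch_size = 1
grad_accumulation = 8

# Assumes one fresh crop per micro-batch.
crop_presentations = optimizer_steps * batch_size * grad_accumulation

fsr_hours = 3.26 + 6.72  # warm-up plus resumed main run

# GPU-hours = number of GPUs x wall-clock hours.
gpu_hours = {
    "FSR": 1 * fsr_hours,
    "LSRNA LSR module": 1 * 26.0,
    "LUA latent upscaler": 8 * 34.1,
}

print(crop_presentations)
for name, hours in gpu_hours.items():
    print(f"{name}: {hours:.2f} GPU-hours")
```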

Results

The main result is the one-step feature-space rectified flow. Four-step Euler is not used as a headline baseline; it is saved separately in results/tables/diagnostic_four_step.csv.

Primary VAE-Target Result

These values measure the experiment on its intended target, the FLUX VAE reconstruction x_H_rec.

Dataset	Method	PSNR	SSIM	RGB L1	LPIPS
Set5	feature bicubic	24.606	0.6974	0.07898	0.15612
Set5	RF one-step	28.478	0.8301	0.04984	0.07966
Set14	feature bicubic	22.713	0.5895	0.09785	0.19393
Set14	RF one-step	26.161	0.7321	0.06857	0.09724
B100	feature bicubic	22.465	0.5442	0.10204	0.23221
B100	RF one-step	25.516	0.6834	0.07373	0.14454
Urban100	feature bicubic	20.022	0.5595	0.12862	-
Urban100	RF one-step	24.114	0.7488	0.08163	-
Manga109	feature bicubic	21.679	0.6991	0.09224	0.07181
Manga109	RF one-step	26.942	0.8508	0.05601	0.02019
FLUX179	feature bicubic	27.209	0.8106	0.04595	0.07732
FLUX179	RF one-step	30.964	0.8853	0.03075	0.02918

Observations:

RF one-step improves over feature bicubic on every benchmark, by about 3.1 to 5.3 dB PSNR, with higher SSIM and lower RGB L1 throughout.
The gain is consistent across all local SR validation sets and the generated FLUX179 set.
This supports the core hypothesis that a random-time rectified-flow objective can learn a useful one-step transport field at the f3 decoder cut.

Contextual Comparison With LUA and LSRNA

The table below uses raw-HR RGB metrics so the result can be read beside the local LUA/LSRNA benchmark files in this workspace. Interpret this as contextual: our RF row comes from this f3 experiment's raw-HR logs, LUA is a FLUX VAE x2 benchmark with crop_border=2, and LSRNA is an SDXL VAE x2 benchmark.

Dataset	Ours RF 1-step RGB PSNR/SSIM	LUA x2 RGB PSNR/SSIM	LSRNA x2 RGB PSNR/SSIM
Set5	28.026 / 0.8138	27.988 / 0.8297	15.772 / 0.3903
Set14	25.566 / 0.7058	26.085 / 0.7406	15.116 / 0.3744
B100	25.284 / 0.6742	25.850 / 0.7142	15.325 / 0.3709
Urban100	23.764 / 0.7381	24.985 / 0.7861	14.253 / 0.3965
Manga109	26.549 / 0.8382	27.468 / 0.8647	15.385 / 0.5344

Takeaway:

Against LUA, RF one-step is essentially tied on Set5 RGB PSNR, and trails by about 0.5 to 1.2 dB on Set14/B100/Urban100/Manga109.
Against this LSRNA snapshot, RF one-step is much stronger on all listed datasets, by about 9.5 to 12.3 dB RGB PSNR.
A strict paper-style comparison should re-run RF, LUA, and LSRNA through one shared evaluator with the same crop border, color space, VAE backbone, and output saving path.
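The per-dataset gaps in the takeaway can be recomputed directly from the raw-HR table. The values below are copied from that table; the dictionary layout is only for illustration.

```python
# (RF one-step, LUA x2, LSRNA x2) raw-HR RGB PSNR in dB, copied from the table above.
rgb_psnr = {
    "Set5": (28.026, 27.988, 15.772),
    "Set14": (25.566, 26.085, 15.116),
    "B100": (25.284, 25.850, 15.325),
    "Urban100": (23.764, 24.985, 14.253),
    "Manga109": (26.549, 27.468, 15.385),
}

# Positive values mean RF one-step is ahead.
gap_vs_lua = {name: rf - lua for name, (rf, lua, _) in rgb_psnr.items()}
gap_vs_lsrna = {name: rf - lsrna for name, (rf, _, lsrna) in rgb_psnr.items()}

for name in rgb_psnr:
    print(f"{name}: vs LUA {gap_vs_lua[name]:+.3f} dB, vs LSRNA {gap_vs_lsrna[name]:+.3f} dB")
```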

Base Preservation and Detail

To test the actual desired behavior, we added a post-hoc evaluator that measures whether the upsampled result still downscales back to the LR/base image while adding controlled high-frequency content.
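The idea behind the base-preservation check can be sketched as follows. This is not the code in scripts/evaluate_base_detail_rf.py: the 2x2 box filter used for downscaling, the gradient-based high-frequency proxy, and all function names are assumptions chosen to keep the example short.

```python
import numpy as np


def box_downsample_x2(img):
    """Average non-overlapping 2x2 blocks of an (H, W, C) image with even H and W."""
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))


def base_l1(sr, base):
    """Mean absolute error between the downscaled SR output and the base image."""
    return float(np.mean(np.abs(box_downsample_x2(sr) - base)))


def gradient_energy(img):
    """Hypothetical high-frequency proxy: mean absolute finite difference."""
    dy = np.abs(np.diff(img, axis=0)).mean()
    dx = np.abs(np.diff(img, axis=1)).mean()
    return float(dx + dy)


rng = np.random.default_rng(0)
base = rng.random((8, 8, 3))  # toy base image
nearest_x2 = base.repeat(2, axis=0).repeat(2, axis=1)  # trivially base-preserving upscale
detailed_x2 = np.clip(nearest_x2 + 0.05 * rng.standard_normal(nearest_x2.shape), 0.0, 1.0)

print("base L1:", base_l1(nearest_x2, base), base_l1(detailed_x2, base))
print("HF gain:", gradient_energy(detailed_x2) / gradient_energy(nearest_x2))
```

A good result keeps base L1 low while the high-frequency ratio rises above 1; a method that changes the image globally raises both.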

Macro average across Set5/Set14/B100/Urban100/Manga109:

Method	Raw RGB PSNR	Raw RGB SSIM	Base L1 RGB	Base Grad L1
RF one-step	25.837	0.7540	0.0300	0.0767
LUA x2	26.475	0.7871	0.0237	0.0155
LSRNA x2	15.170	0.4133	0.1123	0.0671

For the generated FLUX179 images, we also ran RF in LR-only mode from a 1024px base to a 2048px output, matching the existing LUA/LSRNA generated visual comparison setup. On all 179 generated images:

Method	Base PSNR RGB	Base SSIM RGB	Base L1 RGB	HF Gain vs Base
feature bicubic	31.380	0.9048	0.01533	1.758
RF one-step	34.245	0.9460	0.01360	1.156

On the shared 5-image generated visual subset:

Method	Base PSNR RGB	Base SSIM RGB	Base L1 RGB	HF Gain vs Bicubic
bicubic x2	41.937	0.9875	0.00426	1.000
RF one-step	33.784	0.9153	0.01378	1.285
LUA x2	34.279	0.9185	0.01307	0.992
LSRNA x2	9.673	0.3717	0.25925	1.609

This is the most relevant signal for the intended behavior: RF is close to LUA in base preservation on generated x2 samples (base L1 0.01378 vs 0.01307), while increasing high-frequency energy more than LUA (HF gain 1.285 vs 0.992). LSRNA has strong high-frequency change but poor base preservation in this generated visual subset.

Paper-Style OpenImages x2 Metrics

We re-ran the generated-image distribution evaluation in the same style as the LUA paper table: FID, pFID, KID, pKID, CLIP, and runtime. This is now the headline OpenImages distribution result, replacing the earlier 5-image visual diagnostic.

Protocol:

Generated set: all 179 saved FLUX 1024 latent/prompt records.
Target setting: x2, 1024 -> 2048.
Real reference: cached OpenImages HR Inception features, 150 full images and 2400 patches.
Generated patches: 16 patches per generated image, 2864 total patches.
Feature extractor: torchvision InceptionV3 ImageNet weights, final FC replaced by identity.
CLIP: openai/clip-vit-base-patch32, image-text cosine against the saved FLUX prompt.

Resolution	Method	FID ↓	pFID ↓	KID ↓	pKID ↓	CLIP ↑	Time (s) ↓
2048x2048	bicubic x2	309.00	113.12	0.06830	0.03735	0.3455	0.000
2048x2048	RF f3 one-step	308.86	105.70	0.06792	0.03386	0.3453	1.31
2048x2048	LUA x2	309.20	120.61	0.06860	0.04369	0.3459	0.88

Interpretation:

RF one-step is best on patch distribution metrics (pFID, pKID), which are the most sensitive to local texture/detail at the target resolution.
Full-image FID and CLIP are effectively tied across the three methods.
LUA is faster in this local timing because it starts from the saved FLUX latent, while RF starts from the decoded 1024 RGB base and re-enters the FLUX VAE feature path.
The runtime is the x2 stage only, not full text-to-image generation time.

This still is not the exact LUA paper table: our run is FLUX-latent x2 only, not SDXL 1024/2048/4096 generation. A full matched LSRNA row is also not listed because this workspace only has 5 saved LSRNA generated x2 samples. Those saved LSRNA samples took about 109 s/image, so generating the full 179-image matched set would take roughly 5.4 hours before metric extraction.

Visuals

Representative images are committed under assets/. The Set5 butterfly grid is a diagnostic artifact from the run and includes extra columns such as four-step Euler and feature-delta maps. The main distribution comparison in this README is the paper-style x2 table above against LUA. LSRNA is kept as a 5-image base/detail visual diagnostic because a full matched LSRNA output set is not available locally.

Representative base/detail crop:

For img_0000003 from the generated FLUX visual subset, RF one-step preserves the base slightly better than LUA on this sample while adding more local high-frequency energy: RF has base L1 0.0158 and HF gain 1.19x; LUA has base L1 0.0190 and HF gain 1.08x; LSRNA reaches HF gain 1.62x but drifts far from the base (base L1 0.1752). This is the behavior we wanted to isolate: detail creation inside the latent/decoder-feature path without losing the generated base.

Generated FLUX x2 comparison for the same img_0000003 sample:

Urban100 sample outputs are included as separate files, not as a huge panel.

assets/urban100_samples/
  img001_vae_target.png
  img001_feature_bicubic.png
  img001_rf_1step.png

The full local Urban100 export from the run was stored outside this repo at:

runs/feature_rectified_flow_x2_f3_resume_bench_wandb/train_main/benchmarks/Urban100_final_step_6063/

Runtime

The learned vector field is not the main runtime bottleneck; the frozen FLUX VAE decoder tail accounts for about half of the total time.

Method	Input -> Output	Total	Front/Encode	Vector/Model	Tail/Decode	Peak
FSR	512 -> 1024	300.8 ms	56.3 ms	87.3 ms	157.4 ms	3.37 GiB
FSR	1024 -> 2048	1.22 s	240.5 ms	347.4 ms	634.8 ms	12.96 GiB
LUA x2	512 -> 1024	421.1 ms	31.3 ms	140.6 ms	249.3 ms	2.79 GiB
LUA x2	1024 -> 2048	1.93 s	132.5 ms	605.6 ms	1192.2 ms	9.94 GiB

For x2, this f3 one-step RF path is faster than the measured LUA x2 full pipeline on the same machine (300.8 ms vs 421.1 ms at 512 -> 1024, and 1.22 s vs 1.93 s at 1024 -> 2048). x4 is not directly compared here because this RF experiment is x2. A separate x4 or tiled inference study is needed for fair 1024 -> 4096 claims.
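The stage breakdown can be turned into shares and ratios with a few lines. The timings are copied from the runtime table, with the two totals given in seconds converted to milliseconds; the helper names are illustrative.

```python
# Stage timings in milliseconds, copied from the runtime table above.
timings_ms = {
    ("FSR", "512 -> 1024"): {"front": 56.3, "model": 87.3, "tail": 157.4, "total": 300.8},
    ("FSR", "1024 -> 2048"): {"front": 240.5, "model": 347.4, "tail": 634.8, "total": 1220.0},
    ("LUA x2", "512 -> 1024"): {"front": 31.3, "model": 140.6, "tail": 249.3, "total": 421.1},
    ("LUA x2", "1024 -> 2048"): {"front": 132.5, "model": 605.6, "tail": 1192.2, "total": 1930.0},
}


def stage_shares(row):
    """Fraction of total time spent in each stage."""
    return {stage: row[stage] / row["total"] for stage in ("front", "model", "tail")}


def time_ratio(slow_row, fast_row):
    """How many times longer slow_row takes than fast_row."""
    return slow_row["total"] / fast_row["total"]


for key, row in timings_ms.items():
    shares = stage_shares(row)
    print(key, {stage: f"{share:.1%}" for stage, share in shares.items()})

for size in ("512 -> 1024", "1024 -> 2048"):
    ratio = time_ratio(timings_ms[("LUA x2", size)], timings_ms[("FSR", size)])
    print(size, f"LUA/FSR total time ratio: {ratio:.2f}")
```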

Reproducing

Install dependencies:

pip install -r requirements.txt

Run the overnight f1-f5 auto-probe plus training:

bash configs/train_f3_x2_overnight.sh

The actual resumed main run used:

bash configs/resume_f3_main.sh

Post-hoc base/detail evaluation:

python scripts/evaluate_base_detail_rf.py \
  --checkpoint runs/feature_rectified_flow_x2_f3_resume_bench_wandb/train_main/checkpoints/last.pt \
  --output_dir runs/feature_rectified_flow_x2_f3_base_detail_eval \
  --enable_gate \
  --paired_roots Set5=/path/to/Set5 Set14=/path/to/Set14 B100=/path/to/B100 Urban100=/path/to/Urban100 Manga109=/path/to/Manga109 \
  --generated_root /path/to/flux_random_1024_merged_179/images

Rebuild README figures and the training-cost table:

python scripts/make_representative_figures.py

Compute the paper-style OpenImages x2 metrics:

python scripts/evaluate_x2_paper_style_openimages.py \
  --output_dir results/paper_style_openimages_x2_full \
  --methods bicubic_x2 RF_f3_one_step_x2 LUA_x2_to_2048 \
  --save_images 5

The older 5-image visual diagnostic can still be rebuilt with:

python scripts/evaluate_openimages_visual_subset_metrics.py

The main training script writes:

probe/probe_metrics.csv
probe/probe_summary.json
probe/probe_visual_grid.png
train_main/summary.json
train_main/benchmark_log.csv
train_main/validation/*/comparison_grid.png
train_main/benchmarks/*_metrics.csv
train_main/checkpoints/*.pt

Checkpoints are intentionally ignored by git. Put them under checkpoints/ or runs/ locally if you want to resume.

Repository Contents

scripts/train_feature_rectified_flow_sr.py  # main experiment script
scripts/evaluate_base_detail_rf.py          # post-hoc base/detail evaluator
scripts/evaluate_x2_paper_style_openimages.py  # paper-style FID/pFID/KID/pKID/CLIP
scripts/evaluate_openimages_visual_subset_metrics.py  # FID/KID diagnostic
scripts/make_representative_figures.py      # README figures and training-cost chart
configs/                                # runnable command templates
docs/                                   # formulation and experiment notes
assets/                                 # representative visual outputs
results/raw/                            # copied raw summaries and CSVs
results/tables/                         # compact human-readable tables

Limitations

This is an exploratory overnight experiment, not a SOTA SR model.
The primary metrics are measured against a VAE reconstruction target, so they should not be mixed with classic raw-HR SR leaderboards without explanation.
The model was trained at x2 and f3 only.
f4 looked promising in probing but OOMed under the default 512/hidden-128 setting.
x4 and tiled 4096-output inference remain future work.

Suggested Next Steps

Train a reduced f4 variant: hr_size=384, hidden_channels=64/96, num_blocks=4, pixel_loss_every=4.
Add tiled f3/f4 inference for 4096 outputs.
Re-run x2 RF, LUA, and LSRNA under a fixed raw-HR benchmark protocol.
Add a stricter one-step consistency or distillation term only if diagnostic multi-step sampling starts to beat one-step clearly.
Save model cards/checkpoints through Git LFS or Hugging Face Hub if this is shared publicly.
