standard-jh/FSR

FSR

One-step rectified flow in a frozen VAE decoder feature space for x2 super-resolution.

FSR stands for Feature-Space Rectified Flow Super-Resolution: a compact name for the core idea tested here, moving decoder features directly with a learned rectified-flow vector field.

This repository packages an overnight experiment that asks a narrow question:

Can we learn a vector field inside an intermediate VAE decoder feature space that transports bicubic-upsampled LR decoder features toward HR decoder features, then use that field in one step at inference?

The short answer from this run is yes. At the selected decoder cut f3 (decoder.up_blocks.1), the learned one-step vector field clearly improves over feature bicubic, reaches roughly the same x2 quality range as our local LUA benchmark, and is well above the current LSRNA x2 snapshot in this workspace. Four-step Euler is kept only as a diagnostic check of the learned field, not as the headline comparison.

This is not a diffusion-UNet SR method. It does not use a pretrained denoising UNet, does not call scheduler.add_noise(), and does not feed decoder features into SDXL, SD2.1, or FLUX denoisers. The learned module is a lightweight vector field trained directly in decoder feature space.

What Problem Is This Solving?

Many latent SR systems upscale the latent tensor itself or apply a deterministic feature projector. Here we test a different object: a transport field between two frozen VAE decoder feature distributions.

Given a high-resolution image crop x_HR, we construct a low-resolution image x_LR by bicubic downsampling. A frozen FLUX VAE encoder E maps both images into latent space, and the frozen decoder D is split at an internal cut k:

D = D_>k o D_<=k

For a selected decoder feature cut:

f_H = D_<=k(E(x_HR))
f_L = D_<=k(E(x_LR))
f_B = Bicubic(f_L, spatial_size=f_H)

The model learns a vector field v_theta that moves f_B toward f_H. The pixel target is the VAE reconstruction x_H_rec = D_>k(f_H), not raw HR. That keeps the objective on the frozen VAE decoder manifold.

Rectified-Flow Formulation

For each feature pair (f_B, f_H), define:

f_0 = f_B
f_1 = f_H
sigma(t) = sigma_max * t * (1 - t)
f_t = (1 - t) * f_0 + t * f_1 + sigma(t) * eps

The model is trained with flow matching:

v_theta(f_t, t, cond=f_0) ~= f_1 - f_0
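
The training pair defined above can be sketched as follows. This is a minimal illustration, not the repository's training code: NumPy arrays stand in for decoder features, the function names are hypothetical, and the plain mean-squared-error loss is an assumption (the source only states that v_theta should approximate f_1 - f_0). The default sigma_max of 0.03 is the value listed under Overnight Run.

```python
import numpy as np

def rf_training_pair(f0, f1, t, sigma_max=0.03, rng=None):
    """Build the noisy interpolant f_t and the regression target f1 - f0."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(f0.shape)
    sigma_t = sigma_max * t * (1.0 - t)      # vanishes at t = 0 and t = 1
    f_t = (1.0 - t) * f0 + t * f1 + sigma_t * eps
    target = f1 - f0                         # constant velocity of the straight path
    return f_t, target

def flow_matching_loss(v_pred, target):
    """Mean squared error between predicted and target velocity (assumed loss form)."""
    return float(np.mean((v_pred - target) ** 2))

rng = np.random.default_rng(0)
f_B = rng.standard_normal((4, 8, 8))   # toy stand-in for bicubic-upsampled LR features
f_H = rng.standard_normal((4, 8, 8))   # toy stand-in for HR features
t = float(rng.uniform(0.0, 1.0))       # random time, as in the main training signal
f_t, target = rf_training_pair(f_B, f_H, t, rng=rng)
loss_if_perfect = flow_matching_loss(f_H - f_B, target)
```

Because sigma(t) is zero at both endpoints, the interpolant starts exactly at f_B and ends exactly at f_H; the noise only perturbs intermediate times.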

At inference, the main path is one-step Euler:

f_hat = f_B + v_theta(f_B, t=0, cond=f_B)
x_hat = D_>k(f_hat)

We also recorded two-step/four-step Euler and stochastic one-step variants for diagnostics, but the reported method is the one-step path above.
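
The one-step path and the multi-step diagnostics differ only in the number of Euler steps. The sketch below shows that relationship under toy assumptions: euler_sample and ideal_field are hypothetical names, and the "ideal" field is an oracle that returns the true displacement, which a trained v_theta only approximates. In the real pipeline the result would then be decoded with D_>k.

```python
import numpy as np

def euler_sample(v_theta, f_B, num_steps=1):
    """Integrate df/dt = v_theta(f, t, cond) from t = 0 to t = 1 with Euler steps."""
    f = f_B.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        f = f + dt * v_theta(f, t, f_B)   # cond is always the starting feature f_B
    return f

# Toy check with an oracle field that returns the constant displacement f_H - f_B.
rng = np.random.default_rng(0)
f_B = rng.standard_normal((4, 8, 8))
f_H = rng.standard_normal((4, 8, 8))
ideal_field = lambda f, t, cond: f_H - cond

f_hat_1 = euler_sample(ideal_field, f_B, num_steps=1)   # the reported one-step path
f_hat_4 = euler_sample(ideal_field, f_B, num_steps=4)   # diagnostic four-step path
```

With a perfectly straight field, one step and four steps land on the same endpoint, which is why multi-step sampling is only informative when the learned field deviates from a constant displacement.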

Method Summary

  • Frozen VAE: black-forest-labs/FLUX.1-dev, vae subfolder.
  • Decoder split candidates: f1 to f5.
  • Selected cut: f3, mapped to decoder.up_blocks.1.
  • Model: lightweight residual convolutional feature vector field with sinusoidal time embedding and optional gate.
  • Main signal: rectified-flow vector matching over random t.
  • Auxiliary signals: endpoint feature loss, FFT loss, one-step feature loss, decoded RGB loss, low-frequency anchor, high-frequency loss, drift control, and gate regularization.
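
The sinusoidal time embedding named in the model bullet can be sketched as below. The source does not give its exact form, so this uses the common sin/cos layout with geometrically spaced frequencies; the function name, dim=128 (chosen here to match the 128 hidden channels) and max_period are all assumptions.

```python
import math
import numpy as np

def sinusoidal_time_embedding(t, dim=128, max_period=10000.0):
    """Map a scalar time t to a dim-sized vector of sines and cosines."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-math.log(max_period) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb_start = sinusoidal_time_embedding(0.0)   # t = 0, the time used by one-step inference
emb_mid = sinusoidal_time_embedding(0.5)
```

Some implementations rescale t (for example by 1000) before embedding so that times in [0, 1] cover more of the sine period; whether this repository does so is not stated.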

The core training script is:

scripts/train_feature_rectified_flow_sr.py

Data Used

Training:

  • DIV2K train HR
  • Local path used in this run: /home/juhwan/Documents/sr/BasicSR/datasets/DIV2K/DIV2K_train_HR
  • Scale: x2
  • HR crop size: 512
  • LR construction: bicubic downsampling from HR crop

Validation/evaluation:

  • Set5
  • Set14
  • B100
  • Urban100
  • Manga109
  • FLUX179 generated images

Metrics in this repository are reported in two ways:

  • Primary experiment metrics compare to the FLUX VAE reconstruction target x_H_rec, because the model is explicitly trained to move along the frozen decoder feature manifold.
  • Cross-method metrics compare to raw HR so the result can be read beside local LUA and LSRNA x2 benchmark summaries. Those numbers are useful context, but are not a strict leaderboard because the evaluators and VAE backbones differ.

Cut Probe

The experiment first probed decoder cuts f1 to f5 on 32 DIV2K images. The selection score preferred a cut that had a non-trivial feature/high-frequency gap, stable decoded pixels, moderate decoder sensitivity, and feasible runtime.

| Cut | Decoder Stage | Score | RGB L1 | HF Error | FFT Gap | Sensitivity | Probe VRAM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| f3 | up_blocks.1 | 0.4707 | 0.09968 | 0.07354 | 2.13634 | 0.14372 | 1.083 GiB |
| f4 | up_blocks.2 | 0.4315 | 0.06940 | 0.04755 | 3.77468 | 0.03831 | 1.721 GiB |
| f2 | up_blocks.0 | 0.0014 | 0.12887 | 0.08451 | 0.63098 | 0.30649 | 0.864 GiB |
| f1 | conv_in | -0.5271 | 0.17867 | 0.08729 | 0.52488 | 0.42026 | 0.802 GiB |
| f5 | conv_act | -0.7000 | 0.07063 | 0.04542 | 0.03181 | 0.48919 | 0.955 GiB |

f3 was selected. f4 was the second-best probe cut, but the default training configuration ran out of memory (OOM) at an HR crop size of 512 because of the large 1 x 256 x 512 x 512 feature tensor. A reduced f4 run would need a smaller hidden width, fewer blocks, a lower HR size, a lower pixel-loss frequency, or feature tiling.
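
To put the f4 tensor size in numbers, the arithmetic below computes the footprint of a single bf16 copy. The helper name is hypothetical, and the second shape assumes the hidden-128 vector field runs at the same spatial resolution as the f4 feature, which the source does not state explicitly.

```python
def tensor_mib(shape, bytes_per_element=2):
    """Size in MiB of one dense tensor; 2 bytes per element corresponds to bf16."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_element / 2**20

f4_shape = (1, 256, 512, 512)                    # f4 feature tensor quoted above
f4_mib = tensor_mib(f4_shape)                    # one bf16 copy of the feature
f4_hidden_mib = tensor_mib((1, 128, 512, 512))   # one hidden-128 activation (assumed resolution)
```

One copy is 128 MiB, which is small next to 24 GB. The pressure presumably comes from training, where activations for every block of the vector field and the decoder tail are kept for backpropagation, so the total is many multiples of this figure.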

Overnight Run

Hardware:

  • GPU: NVIDIA RTX 3090 24 GB
  • Precision: bf16
  • Batch size: 1
  • Gradient accumulation: 8
  • Hidden channels: 128
  • Blocks: 8
  • Gate: enabled
  • sigma_max: 0.03

Training budget and progress:

| Phase | Steps | Wall-clock |
| --- | --- | --- |
| f3 warm-up | 0 -> 2000 | 3.26 h |
| f3 main resume | 2000 -> 6063 | 6.72 h |
| Total f3 updates | 6063 optimizer steps | ~9.98 h |

The final main run stopped by time budget at step 6063, saved checkpoints, ran final validation, and synced W&B.

W&B run:

https://wandb.ai/standard_juhwan/feature-rectified-flow-sr/runs/sc1t0349

Training Cost In Context

This was deliberately a small hypothesis test: one selected f3 cut, x2 only, and no large multi-scale pretraining. Even so, it reached roughly the quality range of the local LUA x2 benchmark (see Results) in 6063 optimizer steps on one RTX 3090.

Training cost comparison

| Method | Scope | Hardware | Wall-clock | GPU-hours | Steps / iters | Reported data |
| --- | --- | --- | --- | --- | --- | --- |
| FSR | this repo, x2 f3 | 1x RTX 3090 24GB | 9.98 h | 9.98 | 6063 | DIV2K crops, about 48.5K crop presentations |
| LSRNA LSR module | paper v1 arbitrary-scale LSR | 1x V100-SXM2 | 26 h | 26 | 200K | 4.7M LR-HR latent pairs |
| LUA latent upscaler | paper x2/x4 multi-scale adapter | 8x H100 80GB | 34.1 h | 272.8 | 375K | 3.8M OpenImages latent pairs |

The comparison is not apples-to-apples: LUA trains a multi-scale x2/x4 adapter, and LSRNA trains an arbitrary-scale LSR module for use with RNA and a guided denoising stage. The useful point is narrower: this decoder-feature RF prototype was much cheaper to validate as a one-step x2 transport hypothesis.
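
The FSR and LUA cost figures in the table can be reproduced from the run settings. The sketch below assumes a crop presentation means one training crop consumed, so each optimizer step counts batch size times gradient accumulation; the helper names are hypothetical.

```python
def crop_presentations(optimizer_steps, batch_size, grad_accum):
    """Training crops seen: each optimizer step consumes batch_size * grad_accum crops."""
    return optimizer_steps * batch_size * grad_accum

def gpu_hours(num_gpus, wall_clock_hours):
    """GPU-hours as device count times wall-clock time."""
    return num_gpus * wall_clock_hours

fsr_crops = crop_presentations(6063, batch_size=1, grad_accum=8)
fsr_gpu_h = gpu_hours(1, 3.26 + 6.72)   # warm-up plus resumed main run
lua_gpu_h = gpu_hours(8, 34.1)
lsrna_gpu_h = gpu_hours(1, 26.0)
```

This gives 48,504 crop presentations and 9.98 GPU-hours for FSR, against 272.8 for LUA and 26 for LSRNA, or roughly 27x and 2.6x the FSR budget. GPU-hours on an RTX 3090, a V100, and an H100 are not equivalent units of compute, so the ratios are indicative only.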

Sources: LUA arXiv 2511.10629 reports 3.8M pairs, three 125K-step stages, and 8x H100 training; the LSRNA CVPR 2025 paper/supplement reports 4.7M LR-HR latent pairs and the v1 200K-iteration, 26-hour V100 LSR training setting.

Results

The main result is the one-step feature-space rectified flow. Four-step Euler is not used as a headline baseline; it is saved separately in results/tables/diagnostic_four_step.csv.

Primary VAE-Target Result

These values measure the experiment on its intended target, the FLUX VAE reconstruction x_H_rec.

| Dataset | Method | PSNR | SSIM | RGB L1 | LPIPS |
| --- | --- | --- | --- | --- | --- |
| Set5 | feature bicubic | 24.606 | 0.6974 | 0.07898 | 0.15612 |
| Set5 | RF one-step | 28.478 | 0.8301 | 0.04984 | 0.07966 |
| Set14 | feature bicubic | 22.713 | 0.5895 | 0.09785 | 0.19393 |
| Set14 | RF one-step | 26.161 | 0.7321 | 0.06857 | 0.09724 |
| B100 | feature bicubic | 22.465 | 0.5442 | 0.10204 | 0.23221 |
| B100 | RF one-step | 25.516 | 0.6834 | 0.07373 | 0.14454 |
| Urban100 | feature bicubic | 20.022 | 0.5595 | 0.12862 | - |
| Urban100 | RF one-step | 24.114 | 0.7488 | 0.08163 | - |
| Manga109 | feature bicubic | 21.679 | 0.6991 | 0.09224 | 0.07181 |
| Manga109 | RF one-step | 26.942 | 0.8508 | 0.05601 | 0.02019 |
| FLUX179 | feature bicubic | 27.209 | 0.8106 | 0.04595 | 0.07732 |
| FLUX179 | RF one-step | 30.964 | 0.8853 | 0.03075 | 0.02918 |

Observations:

  • RF one-step improves strongly over feature bicubic on every benchmark: the PSNR gain ranges from about 3.1 dB (B100) to about 5.3 dB (Manga109).
  • SSIM, RGB L1, and LPIPS (where reported) also improve on every dataset, including the generated FLUX179 set.
  • This supports the core hypothesis that a random-time rectified-flow objective can learn a useful one-step transport field at the f3 decoder cut.
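
The per-dataset PSNR gains behind these observations can be recomputed directly from the table. The values below are copied from the VAE-target table; the variable names are only illustrative.

```python
# (feature bicubic PSNR, RF one-step PSNR) in dB, copied from the table above.
vae_target_psnr = {
    "Set5": (24.606, 28.478),
    "Set14": (22.713, 26.161),
    "B100": (22.465, 25.516),
    "Urban100": (20.022, 24.114),
    "Manga109": (21.679, 26.942),
    "FLUX179": (27.209, 30.964),
}

psnr_gain = {name: round(rf - bic, 3) for name, (bic, rf) in vae_target_psnr.items()}
smallest = min(psnr_gain, key=psnr_gain.get)   # dataset with the smallest gain
largest = max(psnr_gain, key=psnr_gain.get)    # dataset with the largest gain
```

The smallest gain is on B100 (3.051 dB) and the largest on Manga109 (5.263 dB). Since PSNR is logarithmic in mean squared error, a gain of about 3 dB corresponds to roughly halving the MSE against the VAE target.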

Contextual Comparison With LUA and LSRNA

The table below uses raw-HR RGB metrics so the result can be read beside the local LUA/LSRNA benchmark files in this workspace. Interpret this as contextual: our RF row comes from this f3 experiment's raw-HR logs, LUA is a FLUX VAE x2 benchmark with crop_border=2, and LSRNA is an SDXL VAE x2 benchmark.

| Dataset | Ours RF 1-step RGB PSNR/SSIM | LUA x2 RGB PSNR/SSIM | LSRNA x2 RGB PSNR/SSIM |
| --- | --- | --- | --- |
| Set5 | 28.026 / 0.8138 | 27.988 / 0.8297 | 15.772 / 0.3903 |
| Set14 | 25.566 / 0.7058 | 26.085 / 0.7406 | 15.116 / 0.3744 |
| B100 | 25.284 / 0.6742 | 25.850 / 0.7142 | 15.325 / 0.3709 |
| Urban100 | 23.764 / 0.7381 | 24.985 / 0.7861 | 14.253 / 0.3965 |
| Manga109 | 26.549 / 0.8382 | 27.468 / 0.8647 | 15.385 / 0.5344 |

Takeaway:

  • Against LUA, RF one-step is essentially tied on Set5 RGB PSNR (28.026 vs 27.988 dB) and trails by about 0.5 to 1.2 dB on Set14/B100/Urban100/Manga109. LUA has the higher SSIM on all five datasets.
  • Against this LSRNA snapshot, RF one-step is much stronger on all listed datasets.
  • A strict paper-style comparison should re-run RF, LUA, and LSRNA through one shared evaluator with the same crop border, color space, VAE backbone, and output saving path.

Base Preservation and Detail

To test the behavior we actually want, we added a post-hoc evaluator that measures whether the upsampled result still downscales back to the LR/base image while adding controlled high-frequency content.
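
The idea behind this evaluator can be illustrated with a toy version. The source does not define the exact metrics, so everything here is an assumption: a 2x2 box average stands in for the downscaling, base L1 is the mean absolute difference after downscaling, and a finite-difference energy ratio stands in for HF gain. The real definitions live in scripts/evaluate_base_detail_rf.py.

```python
import numpy as np

def box_downscale_x2(img):
    """Average each 2x2 block; a simple stand-in for the evaluator's downscaling."""
    h, w = img.shape[:2]
    return img[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def base_l1(sr, base):
    """Mean absolute difference between the downscaled SR output and the base image."""
    return float(np.mean(np.abs(box_downscale_x2(sr) - base)))

def gradient_energy(img):
    """Mean absolute finite difference, used here as a crude high-frequency measure."""
    return float(np.mean(np.abs(np.diff(img, axis=0))) + np.mean(np.abs(np.diff(img, axis=1))))

rng = np.random.default_rng(0)
base = rng.uniform(0.0, 1.0, size=(16, 16, 3))
nearest = base.repeat(2, axis=0).repeat(2, axis=1)   # x2 upsample that adds no detail
# Checkerboard detail that sums to zero in every 2x2 block, so downscaling removes it.
checker = 0.05 * np.where((np.indices((32, 32)).sum(axis=0) % 2) == 0, 1.0, -1.0)
detailed = nearest + checker[..., None]

hf_gain = gradient_energy(detailed) / gradient_energy(nearest)
```

In this toy case the detailed image has a base L1 of essentially zero and an HF gain above 1, which is the combination the evaluator is looking for: added detail that does not change the downscaled image.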

Macro average across Set5/Set14/B100/Urban100/Manga109:

| Method | Raw RGB PSNR | Raw RGB SSIM | Base L1 RGB | Base Grad L1 |
| --- | --- | --- | --- | --- |
| RF one-step | 25.837 | 0.7540 | 0.0300 | 0.0767 |
| LUA x2 | 26.475 | 0.7871 | 0.0237 | 0.0155 |
| LSRNA x2 | 15.170 | 0.4133 | 0.1123 | 0.0671 |
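
The raw-RGB PSNR column is consistent with an unweighted mean of the five per-dataset values in the contextual comparison table. The check below copies those values; the dictionary layout is only illustrative.

```python
# Raw-HR RGB PSNR per dataset, in the order Set5, Set14, B100, Urban100, Manga109.
raw_hr_psnr = {
    "RF one-step": [28.026, 25.566, 25.284, 23.764, 26.549],
    "LUA x2": [27.988, 26.085, 25.850, 24.985, 27.468],
    "LSRNA x2": [15.772, 15.116, 15.325, 14.253, 15.385],
}

# Macro average: every dataset counts once, regardless of how many images it holds.
macro_psnr = {method: sum(vals) / len(vals) for method, vals in raw_hr_psnr.items()}
```

The results agree with the table to within rounding of the per-dataset inputs. A macro average gives Set5, with 5 images, the same weight as Urban100, with 100, so it differs from a per-image average over the pooled benchmarks.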

For the generated FLUX179 images, we also ran RF in LR-only mode from a 1024px base to a 2048px output, matching the existing LUA/LSRNA generated visual comparison setup. On all 179 generated images:

| Method | Base PSNR RGB | Base SSIM RGB | Base L1 RGB | HF Gain vs Base |
| --- | --- | --- | --- | --- |
| feature bicubic | 31.380 | 0.9048 | 0.01533 | 1.758 |
| RF one-step | 34.245 | 0.9460 | 0.01360 | 1.156 |

On the shared 5-image generated visual subset:

| Method | Base PSNR RGB | Base SSIM RGB | Base L1 RGB | HF Gain vs Bicubic |
| --- | --- | --- | --- | --- |
| bicubic x2 | 41.937 | 0.9875 | 0.00426 | 1.000 |
| RF one-step | 33.784 | 0.9153 | 0.01378 | 1.285 |
| LUA x2 | 34.279 | 0.9185 | 0.01307 | 0.992 |
| LSRNA x2 | 9.673 | 0.3717 | 0.25925 | 1.609 |

This is the most relevant base/detail signal: on the generated x2 samples RF is close to LUA in base preservation (base PSNR 33.784 vs 34.279 dB) while increasing high-frequency energy more than LUA (HF gain 1.285 vs 0.992). LSRNA shows a strong high-frequency change (HF gain 1.609) but poor base preservation (base PSNR 9.673 dB) in this generated visual subset.

Paper-Style OpenImages x2 Metrics

We re-ran the generated-image distribution evaluation in the same style as the LUA paper table: FID, pFID, KID, pKID, CLIP, and runtime. This is now the headline OpenImages distribution result, replacing the earlier 5-image visual diagnostic.

Protocol:

  • Generated set: all 179 saved FLUX 1024 latent/prompt records.
  • Target setting: x2, 1024 -> 2048.
  • Real reference: cached OpenImages HR Inception features, 150 full images and 2400 patches.
  • Generated patches: 16 patches per generated image, 2864 total patches.
  • Feature extractor: torchvision InceptionV3 ImageNet weights, final FC replaced by identity.
  • CLIP: openai/clip-vit-base-patch32, image-text cosine against the saved FLUX prompt.

| Resolution | Method | FID ↓ | pFID ↓ | KID ↓ | pKID ↓ | CLIP ↑ | Time (s) ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2048x2048 | bicubic x2 | 309.00 | 113.12 | 0.06830 | 0.03735 | 0.3455 | 0.000 |
| 2048x2048 | RF f3 one-step | 308.86 | 105.70 | 0.06792 | 0.03386 | 0.3453 | 1.31 |
| 2048x2048 | LUA x2 | 309.20 | 120.61 | 0.06860 | 0.04369 | 0.3459 | 0.88 |

Interpretation:

  • RF one-step is best on patch distribution metrics (pFID, pKID), which are the most sensitive to local texture/detail at the target resolution.
  • Full-image FID and CLIP are effectively tied across the three methods.
  • LUA is faster in this local timing because it starts from the saved FLUX latent, while RF starts from the decoded 1024 RGB base and re-enters the FLUX VAE feature path.
  • The runtime is the x2 stage only, not full text-to-image generation time.
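
For readers unfamiliar with the FID-style columns, the Frechet distance between two sets of Inception features can be sketched as below. This is a generic NumPy illustration on toy features, not the code in scripts/evaluate_x2_paper_style_openimages.py; the function name is hypothetical, and treating pFID as the same distance on patch features is an assumption based on the protocol above. KID uses a different, kernel-based estimator and is not covered here.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (N, D) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues of cov_a @ cov_b.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 8))                 # toy stand-in for reference features
shift = np.full(8, 0.5)
fd_same = frechet_distance(real, real)               # identical sets
fd_shifted = frechet_distance(real, real + shift)    # same covariance, mean moved by shift
```

Identical sets give a distance of zero, and a pure mean shift gives the squared length of the shift. With only 179 generated images against 2048-dimensional Inception features the covariance estimate is rank-deficient, which is one reason absolute FID values on small sets should be read as relative rather than absolute.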

This still is not the exact LUA paper table: our run is FLUX-latent x2 only, not SDXL 1024/2048/4096 generation. A full matched LSRNA row is also not listed because this workspace only has 5 saved LSRNA generated x2 samples. Those saved LSRNA samples took about 109 s/image, so generating the full 179-image matched set would take roughly 5.4 hours before metric extraction.

Visuals

Representative images are committed under assets/. The Set5 butterfly grid is a diagnostic artifact from the run and includes extra columns such as four-step Euler and feature-delta maps. The main distribution comparison in this README is the paper-style x2 table above against LUA. LSRNA is kept as a 5-image base/detail visual diagnostic because a full matched LSRNA output set is not available locally.

Representative base/detail crop:

[Figure: representative base/detail crop]

For img_0000003 from the generated FLUX visual subset, RF one-step preserves the base at least as well as LUA while adding more local high-frequency energy: RF has base L1 0.0158 and HF gain 1.19x; LUA has base L1 0.0190 and HF gain 1.08x; LSRNA reaches HF gain 1.62x but drifts far from the base (base L1 0.1752). This is the behavior we wanted to isolate: detail creation inside the latent/decoder-feature path without losing the generated base.

Generated FLUX x2 comparison for the same img_0000003 sample:

[Figure: generated FLUX x2 comparison]

Urban100 sample outputs are included as separate files, not as a huge panel.

assets/urban100_samples/
  img001_vae_target.png
  img001_feature_bicubic.png
  img001_rf_1step.png

The full local Urban100 export from the run was stored outside this repo at:

runs/feature_rectified_flow_x2_f3_resume_bench_wandb/train_main/benchmarks/Urban100_final_step_6063/

Runtime

The learned vector field is not the main runtime bottleneck; the frozen FLUX VAE decoder tail is larger.

| Method | Input -> Output | Total | Front/Encode | Vector/Model | Tail/Decode | Peak memory |
| --- | --- | --- | --- | --- | --- | --- |
| FSR | 512 -> 1024 | 300.8 ms | 56.3 ms | 87.3 ms | 157.4 ms | 3.37 GiB |
| FSR | 1024 -> 2048 | 1.22 s | 240.5 ms | 347.4 ms | 634.8 ms | 12.96 GiB |
| LUA x2 | 512 -> 1024 | 421.1 ms | 31.3 ms | 140.6 ms | 249.3 ms | 2.79 GiB |
| LUA x2 | 1024 -> 2048 | 1.93 s | 132.5 ms | 605.6 ms | 1192.2 ms | 9.94 GiB |

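For x2, this f3 one-step RF path is faster than the measured LUA x2 full pipeline on the same machine (300.8 ms vs 421.1 ms at 512 -> 1024, and 1.22 s vs 1.93 s at 1024 -> 2048), at the cost of a higher peak memory in both settings. This timing includes a front/encode stage for both methods, unlike the OpenImages table above, where LUA starts from a saved FLUX latent. x4 is not directly compared here because this RF experiment is x2. A separate x4 or tiled inference study is needed for fair 1024 -> 4096 claims.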
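
The stage breakdown can be turned into shares and a speedup with a few lines of arithmetic. The timings are copied from the runtime table; the dictionary keys and helper name are only illustrative.

```python
# Stage timings in milliseconds, copied from the runtime table above.
timings_ms = {
    ("FSR", "512 -> 1024"): {"front": 56.3, "model": 87.3, "tail": 157.4},
    ("FSR", "1024 -> 2048"): {"front": 240.5, "model": 347.4, "tail": 634.8},
    ("LUA x2", "512 -> 1024"): {"front": 31.3, "model": 140.6, "tail": 249.3},
    ("LUA x2", "1024 -> 2048"): {"front": 132.5, "model": 605.6, "tail": 1192.2},
}

def stage_share(stages, name):
    """Fraction of the summed stage time spent in one stage."""
    return stages[name] / sum(stages.values())

fsr_2048 = timings_ms[("FSR", "1024 -> 2048")]
fsr_tail_share = stage_share(fsr_2048, "tail")
fsr_model_share = stage_share(fsr_2048, "model")
speedup_2048 = sum(timings_ms[("LUA x2", "1024 -> 2048")].values()) / sum(fsr_2048.values())
```

At 1024 -> 2048 the decoder tail accounts for about 52% of the FSR time and the vector field for about 28%, and the summed LUA stages take roughly 1.6x as long as the summed FSR stages.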

Reproducing

Install dependencies:

pip install -r requirements.txt

Run the overnight f1-f5 auto-probe plus training:

bash configs/train_f3_x2_overnight.sh

The actual resumed main run used:

bash configs/resume_f3_main.sh

Post-hoc base/detail evaluation:

python scripts/evaluate_base_detail_rf.py \
  --checkpoint runs/feature_rectified_flow_x2_f3_resume_bench_wandb/train_main/checkpoints/last.pt \
  --output_dir runs/feature_rectified_flow_x2_f3_base_detail_eval \
  --enable_gate \
  --paired_roots Set5=/path/to/Set5 Set14=/path/to/Set14 B100=/path/to/B100 Urban100=/path/to/Urban100 Manga109=/path/to/Manga109 \
  --generated_root /path/to/flux_random_1024_merged_179/images

Rebuild README figures and the training-cost table:

python scripts/make_representative_figures.py

Compute the paper-style OpenImages x2 metrics:

python scripts/evaluate_x2_paper_style_openimages.py \
  --output_dir results/paper_style_openimages_x2_full \
  --methods bicubic_x2 RF_f3_one_step_x2 LUA_x2_to_2048 \
  --save_images 5

The older 5-image visual diagnostic can still be rebuilt with:

python scripts/evaluate_openimages_visual_subset_metrics.py

The main training script writes:

probe/probe_metrics.csv
probe/probe_summary.json
probe/probe_visual_grid.png
train_main/summary.json
train_main/benchmark_log.csv
train_main/validation/*/comparison_grid.png
train_main/benchmarks/*_metrics.csv
train_main/checkpoints/*.pt

Checkpoints are intentionally ignored by git. Put them under checkpoints/ or runs/ locally if you want to resume.

Repository Contents

scripts/train_feature_rectified_flow_sr.py  # main experiment script
scripts/evaluate_base_detail_rf.py  # post-hoc base/detail evaluator
scripts/evaluate_x2_paper_style_openimages.py  # paper-style FID/pFID/KID/pKID/CLIP
scripts/evaluate_openimages_visual_subset_metrics.py  # FID/KID diagnostic
scripts/make_representative_figures.py  # README figures and training-cost chart
configs/  # runnable command templates
docs/  # formulation and experiment notes
assets/  # representative visual outputs
results/raw/  # copied raw summaries and CSVs
results/tables/  # compact human-readable tables

Limitations

  • This is an exploratory overnight experiment, not a SOTA SR model.
  • The primary metrics are measured against a VAE reconstruction target, so they should not be mixed with classic raw-HR SR leaderboards without explanation.
  • The model was trained at x2 and f3 only.
  • f4 looked promising in probing but OOMed under the default 512/hidden-128 setting.
  • x4 and tiled 4096-output inference remain future work.

Suggested Next Steps

  1. Train a reduced f4 variant: hr_size=384, hidden_channels=64/96, num_blocks=4, pixel_loss_every=4.
  2. Add tiled f3/f4 inference for 4096 outputs.
  3. Re-run x2 RF, LUA, and LSRNA under a fixed raw-HR benchmark protocol.
  4. Add a stricter one-step consistency or distillation term only if diagnostic multi-step sampling starts to beat one-step clearly.
  5. Save model cards/checkpoints through Git LFS or Hugging Face Hub if this is shared publicly.
