Add image generation track (#16) #17

Open
ethanphan3993 wants to merge 2 commits into main from feature/image-generation-track

ethanphan3993 commented May 18, 2026

Closes #16.

What this delivers

A working /images track for local diffusion / flow-matching models, parallel to the text-LLM track. Per the issue, it can't share controls with the text track: model formats (safetensors vs GGUF), runtimes (Draw Things/ComfyUI/Mochi vs Ollama/LM Studio/MLX), benchmarks (GenEval / imagen-arena ELO / Emu-Edit vs HumanEval/IFEval/MMLU), and the cost model (compute-bound vs bandwidth-bound) are all different.

| Layer | What |
|---|---|
| Catalog | 20 hand-curated models with cited per-source benchmark scores (backend/data/image_aliases.yaml). Each entry carries a real Hugging Face repo ID and ComfyUI subfolder hint. |
| Use cases | image_generation (text-to-image) and image_editing (img2img / inpainting / instruct-edit). Editing-only filter requires supports_editing. |
| Harnesses | Draw Things, Mochi Diffusion, ComfyUI, InvokeAI, Stable Diffusion WebUI Forge, Diffusers, each with format/family requirements and a real install-command template. |
| Cost model | New compute-bound model: time_per_image = default_steps × scaled_time_per_step + overhead. Empirical M3 Max / M4 Max reference times in the catalog scaled by chip FP16 TFLOPS. VRAM-fit picker chooses the highest-quality quant (FP16 / Q8 / Q4) that fits. |
| Scoring | Same 3-axis weighting as text: 0.55 × use_case + 0.30 × hardware + 0.15 × harness. Storage rubric and combine weights extracted to backend/services/hardware_fit.py; memory thresholds stay track-specific (image is stricter past 0.85 because diffusion runtimes hang rather than degrade). |
| API | GET /api/images/{use-cases,harnesses,catalog} and POST /api/images/recommend. 15 s hardware-snapshot cache matching /api/recommend. |
| Frontend | New /images route with two views: ranked picks for the user's Mac, and browse-all (family + capability filter). Install rows show real, copy-pasteable commands and a Hugging Face link. |
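
For reviewers who want the cost-model row in concrete terms, here is a minimal sketch of the formula. It assumes the reference step time scales inversely with FP16 TFLOPS, which is one plausible reading of "scaled by chip FP16 TFLOPS"; the `ReferenceTiming` class, its field names, and the numbers in the usage note are hypothetical and not taken from the catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceTiming:
    """Timing measured on a reference chip (hypothetical structure)."""
    ref_tflops: float         # FP16 TFLOPS of the chip the timing was measured on
    ref_time_per_step: float  # seconds per denoising step on that chip
    default_steps: int        # sampler steps the model normally runs with
    overhead_s: float         # fixed per-image cost (text encoding, VAE decode, ...)


def estimate_time_per_image(timing: ReferenceTiming, chip_tflops: float) -> float:
    """time_per_image = default_steps * scaled_time_per_step + overhead."""
    if chip_tflops <= 0:
        raise ValueError("chip_tflops must be positive")
    # Assumption: step time is inversely proportional to FP16 throughput.
    scaled_time_per_step = timing.ref_time_per_step * (timing.ref_tflops / chip_tflops)
    return timing.default_steps * scaled_time_per_step + timing.overhead_s
```

With made-up inputs (a 4-step model measured at 1.0 s/step on a 20 TFLOPS reference chip, 2 s overhead), a 10 TFLOPS chip comes out at 4 × 2.0 + 2 = 10 s.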

Catalog

FLUX.1 dev/schnell/Kontext · SD 3.5 Large/Medium/Large-Turbo · SDXL Base/Turbo · Stable Cascade · SD 1.5 · HiDream-I1 dev/fast + HiDream-E1 · AuraFlow v0.3 · OmniGen v1 · Sana 1.6B · Lumina-Next-T2I · PixArt-Σ · Kolors · InstructPix2Pix.

Each entry cites its source (Black Forest Labs / Stability / NVIDIA model cards, GenEval paper, imagen-leaderboard.dev with snapshot date, technical reports). Refresh is currently a manual quarterly task — a fetcher isn't included because imagen-leaderboard.dev is fragmented enough that hand-curation is more reliable than scraping.

End-to-end smoke (Apple M5 Pro, 64 GB)

```
chip: Apple M5 Pro (10.0 TFLOPS)
#1 FLUX.1 [schnell] fit=8.21 time=~12s vram=24.0 GB
#2 Stable Diffusion 3.5 Large fit=8.00 time=~40s vram=16.0 GB
#3 FLUX.1 [dev] fit=7.78 time=~66s vram=24.0 GB
#4 SDXL 1.0 Base fit=7.76 time=~18s vram= 7.0 GB
#5 HiDream-I1 [fast] fit=7.68 time=~28s vram=32.0 GB
```

Schnell ranks above dev because its 4-step distillation needs far fewer sampling steps, which cuts time per image (~12s vs ~66s); SDXL holds a top-5 spot on its low VRAM and fast step time despite a lower GenEval score. The editing use case correctly filters to Kontext / OmniGen / InstructPix2Pix / HiDream-E1. The Mochi Diffusion harness correctly drops FLUX and SD3 (CoreML doesn't support them yet).
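
The two filters described above can be sketched roughly as follows. This is an illustration only: the `Model` and `Harness` classes, the `unsupported_families` field, and the family labels are assumptions, not the PR's actual data model (only `supports_editing` and the use-case names come from the description).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    family: str             # illustrative family label, e.g. "flux" or "sd15"
    supports_editing: bool


@dataclass(frozen=True)
class Harness:
    name: str
    unsupported_families: frozenset[str]  # hypothetical way to express family requirements


def eligible(models: list[Model], use_case: str, harness: Harness) -> list[Model]:
    """Keep models that satisfy both the use-case and the harness filter."""
    kept = []
    for model in models:
        # Editing-only filter: image_editing requires supports_editing.
        if use_case == "image_editing" and not model.supports_editing:
            continue
        # Harness filter: drop families the runtime cannot load.
        if model.family in harness.unsupported_families:
            continue
        kept.append(model)
    return kept


CATALOG = [
    Model("FLUX.1 [dev]", "flux", supports_editing=False),
    Model("FLUX.1 Kontext", "flux", supports_editing=True),
    Model("SD 1.5", "sd15", supports_editing=False),
    Model("InstructPix2Pix", "sd15", supports_editing=True),
]
MOCHI = Harness("Mochi Diffusion", frozenset({"flux", "sd3"}))
COMFYUI = Harness("ComfyUI", frozenset())
```

With this toy catalog, `eligible(CATALOG, "image_editing", COMFYUI)` keeps only Kontext and InstructPix2Pix, and `eligible(CATALOG, "image_generation", MOCHI)` drops both FLUX entries.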

Install commands resolve to real, copy-pasteable strings

```
ComfyUI: huggingface-cli download black-forest-labs/FLUX.1-dev
--local-dir ComfyUI/models/unet/flux-1-dev/
Diffusers: from diffusers import DiffusionPipeline;
DiffusionPipeline.from_pretrained('black-forest-labs/FLUX.1-dev')
InvokeAI: Model Manager → Add via URL →
https://huggingface.co/black-forest-labs/FLUX.1-dev
```

Local-dir paths use slugs (flux-1-dev) not display names (FLUX.1 [dev]) so the commands are shell-safe.
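
A rough sketch of how such a substitution could work is below. The `{hf_id}`, `{url}`, and `{folder}` placeholders are named in the commit message; the `{slug}` placeholder, the `render_install_command` function, and the slug pattern are assumptions made for illustration and may not match the recommender's real code.

```python
import re

# Assumption: a shell-safe slug is lowercase alphanumerics plus '.', '_' and '-'.
SAFE_SLUG = re.compile(r"[a-z0-9][a-z0-9._-]*")


def render_install_command(template: str, *, hf_id: str, canonical_id: str, folder: str) -> str:
    """Fill a harness template; fail loudly instead of leaking a raw placeholder."""
    if not SAFE_SLUG.fullmatch(canonical_id):
        raise ValueError(f"canonical_id is not shell-safe: {canonical_id!r}")
    values = {
        "hf_id": hf_id,
        "slug": canonical_id,
        "folder": folder,
        "url": f"https://huggingface.co/{hf_id}",
    }
    # str.format_map raises KeyError on an unknown placeholder, so a typo in a
    # template cannot silently survive into the rendered command.
    return template.format_map(values)


COMFYUI_TEMPLATE = "huggingface-cli download {hf_id} --local-dir ComfyUI/models/{folder}/{slug}/"

command = render_install_command(
    COMFYUI_TEMPLATE,
    hf_id="black-forest-labs/FLUX.1-dev",
    canonical_id="flux-1-dev",
    folder="unet",
)
```

Passing the display name `FLUX.1 [dev]` as `canonical_id` raises `ValueError` here, which is the failure mode the slug rule is meant to prevent.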

Test plan

  • `pytest backend/tests/` — 43 passed (25 existing + 18 new image tests covering catalog load, score normalization including FID inversion, TFLOPS scaling, quantization picking, harness filters, install command substitution, shell-safe local-dirs, and ComfyUI subfolder routing)
  • `tsc --noEmit` — clean
  • `vite build` — clean (255 KB → 74 KB gzipped)
  • Live API smoke (image_generation / image_editing, drawthings/mochi/comfyui harness filters, install command substitution end-to-end)
  • Manual frontend pass: visit `/images`, switch between "Ranked picks" and "Browse all 20 models", expand a card to see provenance, copy a ComfyUI install command

🤖 Generated with Claude Code

Ethan Phan and others added 2 commits May 18, 2026 23:10
Introduces a separate /images surface for local diffusion / flow-matching
models. The text-LLM track was unfit to host this: different model formats
(safetensors vs GGUF), different runtimes, different benchmarks (GenEval /
imagen-arena ELO / Emu-Edit), and a fundamentally different cost model
(compute-bound time-per-image vs bandwidth-bound TPS).

What ships:

  - Curated catalog of ~20 models in backend/data/image_aliases.yaml,
    each with cited sources for benchmark scores: FLUX.1 dev/schnell/Kontext,
    Stable Diffusion 3.5 Large/Medium/Turbo, SDXL Base/Turbo, Stable
    Cascade, SD 1.5, HiDream-I1 dev/fast and HiDream-E1, AuraFlow v0.3,
    OmniGen v1, Sana 1.6B, Lumina-Next-T2I, PixArt-Sigma, Kolors,
    InstructPix2Pix.

  - Image use cases (image_generation, image_editing) and image harnesses
    (Draw Things, Mochi Diffusion, ComfyUI, InvokeAI, Forge, Diffusers)
    in their own YAMLs. Editing-only filter requires supports_editing.

  - backend/services/images/ — self-contained module with catalog loader,
    compute-bound hardware-fit recommender, and 3-axis scoring matching
    the text track's structure (0.55 use_case + 0.30 hardware + 0.15 harness).
    The hardware model uses chip FP16 TFLOPS to scale empirical M3 Max /
    M4 Max reference times to other Apple Silicon chips, plus a VRAM-fit
    check at the picked quantization (FP16 / Q8 / Q4).

  - backend/routers/images.py — GET /api/images/{use-cases,harnesses,catalog}
    plus POST /api/images/recommend. Same 15s hardware-snapshot cache
    pattern as the text /api/recommend.

  - frontend/src/pages/Images.tsx — new /images route with use-case +
    harness picker, ranked cards showing time-per-image, VRAM at chosen
    quant, license, install hint per harness, and benchmark provenance
    on expand. Nav entry next to Home.

  - 14 new pytest cases in backend/tests/test_image_recommender.py
    covering catalog load, score normalization (including FID inversion),
    TFLOPS scaling (M1 should be ~5x slower than M3 Max for FLUX),
    quantization picking, harness filtering (Mochi rejects FLUX/SD3),
    use-case filtering (editing keeps only edit-capable models), and
    install command resolution. All 39 backend tests pass; frontend
    type-checks and builds cleanly.
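
The VRAM-fit check at the picked quantization could look something like the sketch below. It assumes each catalog entry exposes a per-quant VRAM figure in GB; the `pick_quantization` name, the dictionary keys, and the `max_ratio` parameter are hypothetical.

```python
from typing import Optional

# Highest quality first, matching the FP16 / Q8 / Q4 order in the description.
QUANT_ORDER = ("fp16", "q8", "q4")


def pick_quantization(
    vram_gb_by_quant: dict[str, float],
    available_gb: float,
    max_ratio: float = 1.0,
) -> Optional[str]:
    """Return the highest-quality quant whose VRAM need fits, or None if none fit."""
    budget = available_gb * max_ratio
    for quant in QUANT_ORDER:
        need = vram_gb_by_quant.get(quant)
        if need is not None and need <= budget:
            return quant
    return None
```

For example, a model listed at 24 / 12 / 7 GB for FP16 / Q8 / Q4 would resolve to `"q8"` with 16 GB available and to `None` with 4 GB.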

What's deliberately out of scope for this PR (planned as follow-ups so
this lands in a reviewable size):

  - Live data fetcher for imagen-leaderboard.dev. Catalog is currently
    refreshed manually from cited sources; the leaderboard is incomplete
    enough that a fetcher would mostly replicate hand-curation.
  - Generalizing the text recommender's score_hardware to accept the
    workload-specific cost model. Image track has its own score_hardware
    today; consolidating can come once both have proven shapes.
  - SQLite persistence for image catalog. ~20 entries doesn't justify a
    schema migration; YAML is fine until we add a fetcher.

Per the issue, /images is a separate route rather than a use-case under
the text recommender — the mental models (memory-bandwidth vs compute,
GGUF vs safetensors, TPS vs seconds-per-image) are too different to share
controls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rdware-fit primitives

Three follow-ups to the initial image track in 14ffd06, addressing
functional gaps I'd left rather than time-budget cuts.

  1. Install commands now actually work.

     The harness templates had {url} / {hf_id} / {folder} placeholders
     that the recommender never substituted, so users got copy-pasteable
     strings like 'drawthings://import?url={url}' — useless. Each catalog
     entry now carries a real Hugging Face repo ID (hf_id) and a
     comfyui_folder hint (unet for FLUX, checkpoints for everything else).
     The recommender substitutes these into harness templates so:

       - Diffusers: from diffusers import DiffusionPipeline;
                    DiffusionPipeline.from_pretrained('black-forest-labs/FLUX.1-dev')
       - ComfyUI:   huggingface-cli download black-forest-labs/FLUX.1-dev
                    --local-dir ComfyUI/models/unet/flux-1-dev/
       - InvokeAI:  Model Manager → Add via URL →
                    https://huggingface.co/black-forest-labs/FLUX.1-dev
       - Forge:     huggingface-cli download ... --local-dir
                    stable-diffusion-webui-forge/models/Stable-diffusion/
       - Mochi:     CoreML conversion command targeting the HF repo
       - Drawthings: 'search the in-app catalog' (Drawthings has its own
                    curated download flow, not HF-based)

     Local-dir paths use canonical_id ('flux-1-dev') instead of
     display_name ('FLUX.1 [dev]') so they're shell-safe — no spaces or
     square brackets. install_options also expose a download_url pointing
     at the model's HF page when applicable.

     Two new tests pin this:
       - test_install_commands_have_no_unsubstituted_placeholders
         (asserts no {...} survives substitution for any (model, harness) pair)
       - test_install_command_local_dirs_are_shell_safe
         (asserts no whitespace or [ ] in --local-dir tokens)

  2. /images now has a full-catalog browse view.

     Previously only the top-10 ranked picks were visible; the catalog
     endpoint existed but wasn't surfaced. /images now has a 'Ranked picks
     for your Mac' / 'Browse all 20 models' toggle. The catalog grid shows
     each model's family, architecture, params, FP16/Q8/Q4 VRAM, capability
     (gen/edit), compatible harnesses, license, and a Hugging Face link.
     Family + capability filters built in. The text /browse stays focused
     on the SQLite text catalog — the data shapes are different enough
     that mixing them would have been more cost than benefit.

  3. score_hardware shares its rubric across both tracks.

     Storage scoring (cost of failed download) and combine weights
     (0.50 mem + 0.40 speed + 0.10 storage) are identical between text
     and image, and were duplicated. Extracted to
     backend/services/hardware_fit.py with a `bucket()` helper for the
     interpolated thresholds. Memory thresholds remain track-specific
     because the physics differs: text gracefully degrades when memory
     is tight while diffusion runtimes hang, so the image track is
     stricter at ratio > 0.85. Comment in each call site explains why.
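
To make item 3 concrete, here is one way a shared `bucket()` helper and the combine weights might fit together. Only the name `bucket()`, the 0.50 / 0.40 / 0.10 weights, and the 0.85 ratio come from the text above; the signature, the piecewise-linear interpolation, and every threshold value are assumptions.

```python
from bisect import bisect_right


def bucket(value: float, points: list[tuple[float, float]]) -> float:
    """Piecewise-linear score over (threshold, score) points sorted by threshold."""
    xs = [x for x, _ in points]
    if value <= xs[0]:
        return points[0][1]
    if value >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, value)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (value - x0) / (x1 - x0)


# Hypothetical thresholds on memory ratio (needed / available). The image track
# falls off faster past 0.85 because diffusion runtimes hang rather than degrade.
TEXT_MEMORY_POINTS = [(0.50, 10.0), (0.85, 7.0), (1.00, 3.0)]
IMAGE_MEMORY_POINTS = [(0.50, 10.0), (0.85, 6.0), (1.00, 0.0)]


def combine_hardware(mem: float, speed: float, storage: float) -> float:
    """Shared combine weights: 0.50 mem + 0.40 speed + 0.10 storage."""
    return 0.50 * mem + 0.40 * speed + 0.10 * storage
```

Each track would pass its own points list to `bucket()` for the memory score, while `combine_hardware` stays common to both.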

Tests: 43 pass (was 39); frontend tsc + vite build clean (255 KB → 74 KB
gzipped JS, +4 KB for the catalog grid).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Image generation track — recommend local diffusion models
