willbnu/Qwen-3.5-16G-Vram-Local

qwen-llm


Config-driven presets, launchers, and benchmark artifacts for running Qwen3.5 GGUF models on a single 16 GB NVIDIA GPU with llama.cpp.

This repo is deliberately narrow:

  • one server at a time
  • checked-in JSON artifacts behind every benchmark claim
  • public docs aligned to the shipped config/servers.yaml
  • current release guidance grounded in the tested RTX 5080 16 GB path

If you want a fast answer instead of reading the whole repo:

  • use Qwen3.5-35B-A3B-Q3_K_S.gguf for the strongest shipped 35B path
  • use Qwen3.5-9B-UD-Q4_K_XL.gguf for the strongest shipped 9B daily multimodal path
  • use Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic.i1-Q3_K_M.gguf for the strongest local 27B benchmark path on this machine

Quick Links


Recommended Presets

| Model | Size | Best For | Context | Estimated VRAM | Caveat |
| --- | --- | --- | --- | --- | --- |
| Qwen3.5-35B-A3B-Q3_K_S.gguf | 35B | Strongest shipped 35B daily path | 120K daily / 256K max | 15.3 GB | 256K is validated, but the fit-safe daily path is still the 120K coding preset |
| Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic.i1-Q3_K_M.gguf | 27B | Strongest local 27B benchmark path | 32K benchmark / 128K+ lab | 14.1 GB | Not the shipped quality route; the shipped 27B quality path remains quality_ud_q3kxl on the tested SM120 setup |
| Qwen3.5-9B-UD-Q4_K_XL.gguf | 9B | Strongest shipped 9B daily multimodal path | 256K | 10.6 GB | Lower absolute capability than the 27B and 35B paths |

Release Scope

| Status | What it means here |
| --- | --- |
| Shipped | Present in config/servers.yaml, reflected in launcher helpers, and described in public docs |
| Artifact-backed | Recommendation or speed/context claim is tied to a checked-in JSON result |
| Experimental | Kept in the repo for local comparison or follow-up work, but not part of the public default story |

Current release stance:

  • Best 35B daily recommendation: Qwen3.5-35B-A3B-Q3_K_S.gguf
  • Best practical 9B daily recommendation: Qwen3.5-9B-UD-Q4_K_XL.gguf
  • Shipped/default 27B quality route on the tested SM120 path: Qwen3.5-27B-UD-Q3_K_XL.gguf
  • Strongest local 27B benchmark path on this machine: Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic.i1-Q3_K_M.gguf

27B Comparison

The current 27B story in this repo has two layers:

  • the shipped/default quality route is quality_ud_q3kxl on the tested native SM120 / RTX 5080 path
  • the strongest local benchmark-only 27B result is Heretic Q3_K_M on the dedicated comparison preset

That is an environment-specific result, not a universal ranking claim.

Most relevant artifacts:

What those artifacts support:

  • UD-Q3_K_XL was unstable on the default runtime path but stable on the native SM120 retest.
  • IQ4_XS also passed cleanly and remained slightly faster.
  • Both UD-Q3_K_XL and IQ4_XS reached 262144 effective context in the same SM120 sweep.
  • UD-Q3_K_XL used about 514 MiB less visible CUDA buffer memory at that 256K ceiling.

What they do not support:

  • a universal claim that UD-Q3_K_XL beats IQ4_XS everywhere
  • a hardware-agnostic recommendation for non-SM120 GPUs
  • a multimodal quality ranking beyond the checked-in local runs
  • a claim that the shipped quality route and the strongest benchmark-only 27B path are the same thing

Quick Start

1. Install llama.cpp

Place a CUDA build in ./llama-bin/, or build the native SM120 binary if you are on RTX 5080 / 5090.

2. Download the model files

The repo keeps separate local projector filenames so 35B, 27B, and 9B do not overwrite each other in ./models/unsloth-gguf/.

Use the helper:


powershell -ExecutionPolicy Bypass -File scripts/windows/download_model.ps1 -Model all

Or download manually and rename the projector locally:

huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  Qwen3.5-35B-A3B-Q3_K_S.gguf \
  mmproj-F16.gguf \
  --local-dir ./models/unsloth-gguf/
mv ./models/unsloth-gguf/mmproj-F16.gguf ./models/unsloth-gguf/mmproj-35B-F16.gguf

huggingface-cli download unsloth/Qwen3.5-9B-GGUF \
  Qwen3.5-9B-UD-Q4_K_XL.gguf \
  mmproj-F16.gguf \
  --local-dir ./models/unsloth-gguf/
mv ./models/unsloth-gguf/mmproj-F16.gguf ./models/unsloth-gguf/mmproj-9B-F16.gguf

huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
  Qwen3.5-27B-UD-Q3_K_XL.gguf \
  mmproj-F16.gguf \
  --local-dir ./models/unsloth-gguf/
mv ./models/unsloth-gguf/mmproj-F16.gguf ./models/unsloth-gguf/mmproj-27B-F16.gguf

Optional 27B fallback:

  • quality_vision still uses Qwen3.5-27B-Q3_K_S.gguf
  • quality_iq4xs still uses Qwen3.5-27B-IQ4_XS.gguf

3. Start one server

Windows helper:

start_servers_speed.bat coding
start_servers_speed.bat vision
start_servers_speed.bat quality

Cross-platform Python manager:

python server_manager.py start --server coding
python server_manager.py start --server fast_vision
python server_manager.py start --profile quality
python server_manager.py stop

Shell helper:

./scripts/start_servers.sh coding
./scripts/start_servers.sh vision
./scripts/start_servers.sh quality

4. Verify the server

curl http://127.0.0.1:8002/health
curl http://127.0.0.1:8002/v1/models
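For scripting, the same health check can be wrapped in a tiny Python helper. This is a sketch only; it assumes the default port 8002 shown in the curl examples above and the standard llama.cpp server /health endpoint, which returns HTTP 200 when the model is loaded.

```python
import urllib.request
import urllib.error

def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the llama.cpp server answers /health with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: treat as not ready.
        return False

if __name__ == "__main__":
    print(is_healthy("http://127.0.0.1:8002"))
```

A launcher can poll this in a loop before sending the first request instead of sleeping for a fixed interval.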

Benchmark Index

Public benchmark guidance lives in:

If a narrative summary and a JSON artifact disagree, prefer the JSON artifact.

Benchmark V2 — Surface-Isolated Test Suite

The repo now includes a full surface-isolated benchmark suite for reproducible, artifact-backed validation:

# Run a full benchmark bundle on a server preset
python -m tests.benchmark_bundle --server quality_heretic_q3km_benchmark --bundle-profile smoke

# Run individual lanes
python -m tests.perplexity_gate --server quality_heretic_q3km_benchmark --output results/quality_long.json
python -m tests.context_scaling_gate --server quality_heretic_q3km_benchmark --output results/context_sweep.json
python -m tests.long_context_retrieval --server quality_heretic_q3km_benchmark --output results/retrieval.json
python -m tests.llama_native_bench --server quality_heretic_q3km_benchmark --output results/native_bench.json

# Download benchmark corpora (wikitext-2/103)
python scripts/fetch_quality_long_corpora.py

Each bundle produces a directory under results/benchmark_bundles/ with:

  • bundle.json — master artifact with provenance, system info, and lane statuses
  • Per-lane JSON files with full metrics, logs, and error traces
  • Server logs in logs/ for debugging
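Because bundle.json is the master artifact, downstream tooling can triage a run without parsing logs. The sketch below is illustrative only: the field names (`provenance`, `lanes`, `status`) are assumptions standing in for the real checked-in schema, not a description of it.

```python
import json
import os
import tempfile

# Hypothetical bundle.json shape; real field names live in the repo's
# checked-in artifacts and may differ.
sample = {
    "provenance": {"server": "quality_heretic_q3km_benchmark"},
    "lanes": {
        "quality_long": {"status": "pass"},
        "context_sweep": {"status": "partial"},
        "retrieval": {"status": "investigating"},
    },
}

def failing_lanes(bundle: dict) -> list:
    """Return lane names whose status is anything other than 'pass'."""
    return [name for name, lane in bundle["lanes"].items()
            if lane.get("status") != "pass"]

# Round-trip through disk the way a real bundle would be read.
path = os.path.join(tempfile.mkdtemp(), "bundle.json")
with open(path, "w") as f:
    json.dump(sample, f)
with open(path) as f:
    loaded = json.load(f)

print(failing_lanes(loaded))  # → ['context_sweep', 'retrieval']
```

The same pattern extends to comparing two bundles, which is how "prefer the JSON artifact" becomes checkable rather than a convention.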

Current control results (Heretic Q3_K_M, RTX 5080, CUDA 13.2, build 8559):

  • quality_long: PPL 4.35 (wikitext-2), 9.03 (wikitext-103) ✅
  • context_sweep: partial (SSM model limitations) ⚠️
  • retrieval: under investigation (SSM model limitations) 🔍

Bundle evolution evidence: results/benchmark_bundles/ — 9 versions documenting the full development path from v1 through v8.

TurboQuant Lab

  • TurboQuant is lab-only in this repo. It does not change the shipped defaults.
  • Current state:
    • turbo3 is a meaningful 128K+ lab win on the Heretic Q3_K_M benchmark path
    • turbo4 is endpoint-recovered but still non-promoted
    • 32K TurboQuant rows are informational only, not a general recommendation
  • External community benchmarks, for context only:
    • TheTom/turboquant_plus commit 0935dca adds README-only community hardware notes for RTX 3090 CUDA and M1 Max Metal. The external signal is useful for fit and hardware context, but it is not local proof and does not change this repo's checked-in ranking.
  • Current cycle state lives in:

Context Guidance

  • coding now ships at 262144 for maximum validated local context, but 64K to 120K remains the safer daily operating point when speed and headroom matter more than the ceiling.
  • fast_vision ships at 262144 and remains the practical 9B daily preset on this machine.
  • quality_ud_q3kxl ships at 98304 and is chosen for the tested SM120 path, not because it universally beats IQ4_XS.
  • True 1M 9B runs exist only in a local patched text-only path and are not part of this release line.
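The reason 64K to 120K is the safer daily point is that KV-cache VRAM grows linearly with context. A back-of-the-envelope estimator, assuming plain full attention and an unquantized F16 cache; the layer counts and head dimensions below are placeholders, not the actual Qwen3.5 architecture, and SSM layers or quantized caches (as used in this repo) change the numbers substantially:

```python
def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bytes_per_elt: float = 2.0) -> float:
    """Rough full-attention KV-cache size: one K and one V tensor per layer.

    All model dimensions are illustrative placeholders, not real
    Qwen3.5 values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt / 1024**3

# Example: a hypothetical 48-layer model with 4 KV heads of dim 128 at F16.
for ctx in (65536, 131072, 262144):
    print(f"{ctx:>7} ctx -> {kv_cache_gib(ctx, 48, 4, 128):.1f} GiB")
```

Doubling the context doubles the cache, which is why the 256K ceiling can validate while leaving little headroom for batch growth or the vision projector.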

What This Repo Is Not

  • not a leaderboard of every public Qwen quant on Hugging Face
  • not a promise that every 16 GB GPU will match the same speeds or fit points
  • not a proof that text-only benchmark winners are automatically the best vision models

Terminal Chat and API Helper

Terminal chat:

python chat.py
python chat.py --port 8003

Python helper:

from qwen_api import api_35b, api_9b_vision, SamplingMode

response = api_35b.chat(
    prompt="Write a Python function to reverse a list.",
    mode=SamplingMode.THINKING_CODING,
)

vision = api_9b_vision.vision(
    prompt="Describe this image.",
    image_path="example.png",
)

Project Layout

config/                  Canonical server settings (YAML presets)
config/templates/        Chat template overrides (Jinja2)
docs/                    Technical notes and workflow docs
results/                 Checked-in benchmark artifacts
results/benchmark_bundles/  Surface-isolated benchmark bundles (v1-v8+)
tests/                   Benchmark and validation scripts
tests/data/              Benchmark corpora (wikitext manifests)
scripts/windows/         Windows helpers (download, fetch, start)
scripts/fetch_quality_long_corpora.py  Cross-platform corpus fetcher
server_manager.py        Config-driven launcher and process manager
qwen_api.py              Minimal Python API helper
chat.py                  Terminal chat client with image support

Development Workflow

This repo is maintained with two local worktrees:

  • one development worktree for day-to-day changes
  • one clean main worktree for release validation and pushes

Phase-gated research cycles use docs/QWEN_SWARM_WORKFLOW.md.

The release rule is simple:

  1. Develop and review in the development worktree.
  2. Replay only releasable commits into the clean main worktree.
  3. Run release checks there.
  4. Push and tag only from the clean main worktree.