Config-driven presets, launchers, and benchmark artifacts for running Qwen3.5 GGUF models on a single 16 GB NVIDIA GPU with llama.cpp.
This repo is deliberately narrow:
- one server at a time
- checked-in JSON artifacts behind every benchmark claim
- public docs aligned to the shipped `config/servers.yaml`
- current release guidance grounded in the tested RTX 5080 16 GB path
If you want a fast answer instead of reading the whole repo:
- use `Qwen3.5-35B-A3B-Q3_K_S.gguf` for the strongest shipped 35B path
- use `Qwen3.5-9B-UD-Q4_K_XL.gguf` for the strongest shipped 9B daily multimodal path
- use `Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic.i1-Q3_K_M.gguf` for the strongest local 27B benchmark path on this machine
- Recommended presets
- 27B comparison
- Quick start
- Benchmark index
- Swarm workflow
| Model | Size | Best For | Context | Estimated VRAM | Caveat |
|---|---|---|---|---|---|
| `Qwen3.5-35B-A3B-Q3_K_S.gguf` | 35B | Strongest shipped 35B daily path | 120K daily / 256K max | 15.3 GB | 256K is validated, but the fit-safe daily path is still the 120K coding preset |
| `Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic.i1-Q3_K_M.gguf` | 27B | Strongest local 27B benchmark path | 32K benchmark / 128K+ lab | 14.1 GB | Not the shipped quality route; the shipped 27B quality path remains `quality_ud_q3kxl` on the tested SM120 setup |
| `Qwen3.5-9B-UD-Q4_K_XL.gguf` | 9B | Strongest shipped 9B daily multimodal path | 256K | 10.6 GB | Lower absolute capability than the 27B and 35B paths |

| Status | What it means here |
|---|---|
| Shipped | Present in `config/servers.yaml`, reflected in launcher helpers, and described in public docs |
| Artifact-backed | Recommendation or speed/context claim is tied to a checked-in JSON result |
| Experimental | Kept in the repo for local comparison or follow-up work, but not part of the public default story |
Current release stance:
- Best 35B daily recommendation: `Qwen3.5-35B-A3B-Q3_K_S.gguf`
- Best practical 9B daily recommendation: `Qwen3.5-9B-UD-Q4_K_XL.gguf`
- Shipped/default 27B quality route on the tested SM120 path: `Qwen3.5-27B-UD-Q3_K_XL.gguf`
- Strongest local 27B benchmark path on this machine: `Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic.i1-Q3_K_M.gguf`
The current 27B story in this repo has two layers:
- the shipped/default quality route is `quality_ud_q3kxl` on the tested native SM120 / RTX 5080 path
- the strongest local benchmark-only 27B result is Heretic `Q3_K_M` on the dedicated comparison preset

That is an environment-specific result, not a universal ranking claim.
Most relevant artifacts:
- results/deterministic_mini_eval_20260322_220723.json
- results/deterministic_mini_eval_20260322_222113.json
- results/context_compare_27b_ud_vs_iq4xs_text_vision_sm120_20260323_000227.json
What those artifacts support:
- `UD-Q3_K_XL` was unstable on the default runtime path but stable on the native SM120 retest. `IQ4_XS` also passed cleanly and remained slightly faster.
- Both `UD-Q3_K_XL` and `IQ4_XS` reached `262144` effective context in the same SM120 sweep. `UD-Q3_K_XL` used about 514 MiB less visible CUDA buffer memory at that 256K ceiling.
What they do not support:
- a universal claim that `UD-Q3_K_XL` beats `IQ4_XS` everywhere
- a hardware-agnostic recommendation for non-SM120 GPUs
- a multimodal quality ranking beyond the checked-in local runs
- a claim that the shipped `quality` route and the strongest benchmark-only 27B path are the same thing
Place a CUDA build in ./llama-bin/, or build the native SM120 binary if you are on RTX 5080 / 5090.
- Build guide: docs/RTX5080-NATIVE-BUILD.md
- Releases: ggml-org/llama.cpp releases
The repo keeps separate local projector filenames so 35B, 27B, and 9B do not overwrite each other in ./models/unsloth-gguf/.
Use the helper:
```shell
powershell -ExecutionPolicy Bypass -File scripts/windows/download_model.ps1 -Model all
```

Or download manually and rename the projector locally:
```shell
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  Qwen3.5-35B-A3B-Q3_K_S.gguf \
  mmproj-F16.gguf \
  --local-dir ./models/unsloth-gguf/
mv ./models/unsloth-gguf/mmproj-F16.gguf ./models/unsloth-gguf/mmproj-35B-F16.gguf

huggingface-cli download unsloth/Qwen3.5-9B-GGUF \
  Qwen3.5-9B-UD-Q4_K_XL.gguf \
  mmproj-F16.gguf \
  --local-dir ./models/unsloth-gguf/
mv ./models/unsloth-gguf/mmproj-F16.gguf ./models/unsloth-gguf/mmproj-9B-F16.gguf

huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
  Qwen3.5-27B-UD-Q3_K_XL.gguf \
  mmproj-F16.gguf \
  --local-dir ./models/unsloth-gguf/
mv ./models/unsloth-gguf/mmproj-F16.gguf ./models/unsloth-gguf/mmproj-27B-F16.gguf
```

Optional 27B fallback:
- `quality_vision` still uses `Qwen3.5-27B-Q3_K_S.gguf`
- `quality_iq4xs` still uses `Qwen3.5-27B-IQ4_XS.gguf`
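After the downloads and renames, it is easy to end up with a projector missing or left under its original name. The check below is a minimal sketch, not a repo script: the filename list simply mirrors the commands above, and `missing_files` is a hypothetical helper name.

```python
from pathlib import Path

# Expected contents of ./models/unsloth-gguf/ after the downloads and
# projector renames above; the list mirrors those commands.
EXPECTED = [
    "Qwen3.5-35B-A3B-Q3_K_S.gguf",
    "mmproj-35B-F16.gguf",
    "Qwen3.5-9B-UD-Q4_K_XL.gguf",
    "mmproj-9B-F16.gguf",
    "Qwen3.5-27B-UD-Q3_K_XL.gguf",
    "mmproj-27B-F16.gguf",
]

def missing_files(model_dir="./models/unsloth-gguf"):
    """Return the expected filenames that are not present yet."""
    root = Path(model_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]

if __name__ == "__main__":
    for name in missing_files():
        print(f"missing: {name}")
```

An empty result means all three model/projector pairs are in place and the launchers should find them.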
Windows helper:

```shell
start_servers_speed.bat coding
start_servers_speed.bat vision
start_servers_speed.bat quality
```

Cross-platform Python manager:

```shell
python server_manager.py start --server coding
python server_manager.py start --server fast_vision
python server_manager.py start --profile quality
python server_manager.py stop
```

Shell helper:

```shell
./scripts/start_servers.sh coding
./scripts/start_servers.sh vision
./scripts/start_servers.sh quality
```

Health check:

```shell
curl http://127.0.0.1:8002/health
```
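The manual curl checks can also be scripted; a minimal stdlib-only sketch is below. Port 8002 is taken from the curl examples; llama.cpp's `llama-server` answers `/health` with HTTP 503 while the model is still loading and 200 once it is ready.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url="http://127.0.0.1:8002", deadline_s=120.0):
    """Poll /health until the server reports ready or the deadline passes."""
    stop = time.monotonic() + deadline_s
    while time.monotonic() < stop:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:  # 200 = model loaded, ready to serve
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet (connection refused) or still loading (503)
        time.sleep(1.0)
    return False
```

Useful before kicking off a benchmark lane so the first request does not race the model load.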
```shell
curl http://127.0.0.1:8002/v1/models
```

Public benchmark guidance lives in:
- results/BENCHMARK_RESULTS.md
- docs/BENCHMARKING.md
- docs/PERFORMANCE_MATRIX.md
- docs/ROI_TIMELINE.md
- docs/QWEN_REPO_INTELLIGENCE.md
If a narrative summary and a JSON artifact disagree, prefer the JSON artifact.
The repo now includes a full surface-isolated benchmark suite for reproducible, artifact-backed validation:
```shell
# Run a full benchmark bundle on a server preset
python -m tests.benchmark_bundle --server quality_heretic_q3km_benchmark --bundle-profile smoke

# Run individual lanes
python -m tests.perplexity_gate --server quality_heretic_q3km_benchmark --output results/quality_long.json
python -m tests.context_scaling_gate --server quality_heretic_q3km_benchmark --output results/context_sweep.json
python -m tests.long_context_retrieval --server quality_heretic_q3km_benchmark --output results/retrieval.json
python -m tests.llama_native_bench --server quality_heretic_q3km_benchmark --output results/native_bench.json

# Download benchmark corpora (wikitext-2/103)
python scripts/fetch_quality_long_corpora.py
```

Each bundle produces a directory under results/benchmark_bundles/ with:
- `bundle.json`: master artifact with provenance, system info, and lane statuses
- per-lane JSON files with full metrics, logs, and error traces
- server logs in `logs/` for debugging
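The master artifact can be summarized programmatically. The sketch below is illustrative only: the `lanes` / `name` / `status` field names are assumptions, so check a real `bundle.json` under results/benchmark_bundles/ for the actual schema before relying on them.

```python
import json
from pathlib import Path

def lane_statuses(bundle_dir):
    """Map lane name -> status from a bundle's master artifact.

    Field names here are illustrative; consult a checked-in
    bundle.json for the real schema.
    """
    bundle = json.loads((Path(bundle_dir) / "bundle.json").read_text())
    return {lane["name"]: lane["status"] for lane in bundle.get("lanes", [])}

def failed_lanes(bundle_dir):
    """Lane names whose status is anything other than a pass."""
    return [name for name, status in lane_statuses(bundle_dir).items()
            if status != "pass"]
```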
Current control results (Heretic Q3_K_M, RTX 5080, CUDA 13.2, build 8559):
- `quality_long`: PPL 4.35 (wikitext-2), 9.03 (wikitext-103) ✅
- `context_sweep`: partial (SSM model limitations) ⚠️
- `retrieval`: under investigation (SSM model limitations) 🔍

Bundle evolution evidence: results/benchmark_bundles/ contains 9 versions documenting the full development path from v1 through v8.
- TurboQuant is lab-only in this repo. It does not change the shipped defaults.
- Current state:
  - `turbo3` is a meaningful `128K+` lab win on the Heretic `Q3_K_M` benchmark path
  - `turbo4` is endpoint-recovered but still non-promoted
  - `32K` TurboQuant rows are informational only, not a general recommendation
- External community benchmarks, for context only:
- TheTom/turboquant_plus commit 0935dca adds README-only community hardware notes for RTX 3090 CUDA and M1 Max Metal. The external signal is useful for fit and hardware context, but it is not local proof and does not change this repo's checked-in ranking.
- Current cycle state:
  - `coding` now ships at `262144` for maximum validated local context, but `64K` to `120K` remains the safer daily operating point when speed and headroom matter more than the ceiling.
  - `fast_vision` ships at `262144` and remains the practical 9B daily preset on this machine.
  - `quality_ud_q3kxl` ships at `98304` and is chosen for the tested SM120 path, not because it universally beats `IQ4_XS`.
  - True 1M 9B runs exist only in a local patched text-only path and are not part of this release line.
What this repo is not:
- not a leaderboard of every public Qwen quant on Hugging Face
- not a promise that every 16 GB GPU will match the same speeds or fit points
- not a proof that text-only benchmark winners are automatically the best vision models
Terminal chat:

```shell
python chat.py
python chat.py --port 8003
```

Python helper:

```python
from qwen_api import api_35b, api_9b_vision, SamplingMode

response = api_35b.chat(
    prompt="Write a Python function to reverse a list.",
    mode=SamplingMode.THINKING_CODING,
)

vision = api_9b_vision.vision(
    prompt="Describe this image.",
    image_path="example.png",
)
```

Repository layout:

```
config/                                Canonical server settings (YAML presets)
config/templates/                      Chat template overrides (Jinja2)
docs/                                  Technical notes and workflow docs
results/                               Checked-in benchmark artifacts
results/benchmark_bundles/             Surface-isolated benchmark bundles (v1-v8+)
tests/                                 Benchmark and validation scripts
tests/data/                            Benchmark corpora (wikitext manifests)
scripts/windows/                       Windows helpers (download, fetch, start)
scripts/fetch_quality_long_corpora.py  Cross-platform corpus fetcher
server_manager.py                      Config-driven launcher and process manager
qwen_api.py                            Minimal Python API helper
chat.py                                Terminal chat client with image support
```
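As a lower-level alternative to the `qwen_api` helper, the same servers expose llama.cpp's OpenAI-compatible HTTP API, so any stdlib HTTP client works. A minimal sketch, assuming port 8002 as in the health-check examples:

```python
import json
import urllib.request

def chat_once(prompt, base_url="http://127.0.0.1:8002", timeout_s=120):
    """Send one chat turn to the server's OpenAI-compatible endpoint."""
    payload = json.dumps(
        {"messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)  # OpenAI-style response envelope
    return body["choices"][0]["message"]["content"]
```

Sampling parameters are deliberately omitted here so the preset's server-side defaults apply.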
TurboQuant lab harness:
- Set `TURBOQUANT_LLAMA` to a TurboQuant-enabled `llama-server.exe` (see `lab/README.md` for the build instructions).
- Run the lab harness (baseline + turbo variants):

```powershell
python eval/run_cycle1_live.py `
  --server quality_heretic_q3km `
  --track baseline `
  --cache-type-k iq4_nl `
  --cache-type-v iq4_nl `
  --output results/turboquant_heretic_baseline_<timestamp>.json

python eval/run_cycle1_live.py `
  --server quality_heretic_q3km `
  --track turbo `
  --cache-type-k turbo3 `
  --cache-type-v turbo3 `
  --output results/turboquant_heretic_track_pending.json
```

- The artifact follows the schema documented in protocols/artifact_schema.md.
- Logs are written to `artifacts/turboquant/` and are ignored by the release repo.
- See `results/BENCHMARK_RESULTS.md` for artifact links.
This repo is maintained with two local worktrees:
- one development worktree for day-to-day changes
- one clean `main` worktree for release validation and pushes
Phase-gated research cycles use docs/QWEN_SWARM_WORKFLOW.md.

The release rule is simple:
- Develop and review in the development worktree.
- Replay only releasable commits into the clean `main` worktree.
- Run release checks there.
- Push and tag only from the clean `main` worktree.