willbnu/Qwen-3.5-16G-Vram-Local

qwen-llm


Config-driven presets, launchers, and benchmark artifacts for running Qwen3.5 GGUF models on a single 16 GB NVIDIA GPU with llama.cpp.

This repo is deliberately narrow:

  • one server at a time
  • checked-in JSON artifacts behind every benchmark claim
  • public docs aligned to the shipped config/servers.yaml
  • current release guidance grounded in the tested RTX 5080 16 GB path

If you want a fast answer instead of reading the whole repo:

  • use Qwen3.5-35B-A3B-Q3_K_S.gguf for the strongest shipped 35B path
  • use Qwen3.5-9B-UD-Q4_K_XL.gguf for the strongest shipped 9B daily multimodal path
  • use Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic.i1-Q3_K_M.gguf for the strongest local 27B benchmark path on this machine

Quick Links


Recommended Presets

| Model | Size | Best For | Context | Estimated VRAM | Caveat |
| --- | --- | --- | --- | --- | --- |
| Qwen3.5-35B-A3B-Q3_K_S.gguf | 35B | Strongest shipped 35B daily path | 120K daily / 256K max | 15.3 GB | 256K is validated, but the fit-safe daily path is still the 120K coding preset |
| Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic.i1-Q3_K_M.gguf | 27B | Strongest local 27B benchmark path | 32K benchmark / 128K+ lab | 14.1 GB | Not the shipped quality route; the shipped 27B quality path remains quality_ud_q3kxl on the tested SM120 setup |
| Qwen3.5-9B-UD-Q4_K_XL.gguf | 9B | Strongest shipped 9B daily multimodal path | 256K | 10.6 GB | Lower absolute capability than the 27B and 35B paths |

Release Scope

| Status | What it means here |
| --- | --- |
| Shipped | Present in config/servers.yaml, reflected in launcher helpers, and described in public docs |
| Artifact-backed | Recommendation or speed/context claim is tied to a checked-in JSON result |
| Experimental | Kept in the repo for local comparison or follow-up work, but not part of the public default story |

Current release stance:

  • Best 35B daily recommendation: Qwen3.5-35B-A3B-Q3_K_S.gguf
  • Best practical 9B daily recommendation: Qwen3.5-9B-UD-Q4_K_XL.gguf
  • Shipped/default 27B quality route on the tested SM120 path: Qwen3.5-27B-UD-Q3_K_XL.gguf
  • Strongest local 27B benchmark path on this machine: Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic.i1-Q3_K_M.gguf

27B Comparison

The current 27B story in this repo has two layers:

  • the shipped/default quality route is quality_ud_q3kxl on the tested native SM120 / RTX 5080 path
  • the strongest local benchmark-only 27B result is Heretic Q3_K_M on the dedicated comparison preset

That is an environment-specific result, not a universal ranking claim.

Most relevant artifacts:

What those artifacts support:

  • UD-Q3_K_XL was unstable on the default runtime path but stable on the native SM120 retest.
  • IQ4_XS also passed cleanly and remained slightly faster.
  • Both UD-Q3_K_XL and IQ4_XS reached 262144 effective context in the same SM120 sweep.
  • UD-Q3_K_XL used about 514 MiB less visible CUDA buffer memory at that 256K ceiling.

What they do not support:

  • a universal claim that UD-Q3_K_XL beats IQ4_XS everywhere
  • a hardware-agnostic recommendation for non-SM120 GPUs
  • a multimodal quality ranking beyond the checked-in local runs
  • a claim that the shipped quality route and the strongest benchmark-only 27B path are the same thing

Quick Start

1. Install llama.cpp

Place a CUDA build in ./llama-bin/, or build the native SM120 binary if you are on RTX 5080 / 5090.

2. Download the model files

The repo keeps separate local projector filenames so 35B, 27B, and 9B do not overwrite each other in ./models/unsloth-gguf/.

Use the helper:


powershell -ExecutionPolicy Bypass -File scripts/windows/download_model.ps1 -Model all

Or download manually and rename the projector locally:

huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  Qwen3.5-35B-A3B-Q3_K_S.gguf \
  mmproj-F16.gguf \
  --local-dir ./models/unsloth-gguf/
mv ./models/unsloth-gguf/mmproj-F16.gguf ./models/unsloth-gguf/mmproj-35B-F16.gguf

huggingface-cli download unsloth/Qwen3.5-9B-GGUF \
  Qwen3.5-9B-UD-Q4_K_XL.gguf \
  mmproj-F16.gguf \
  --local-dir ./models/unsloth-gguf/
mv ./models/unsloth-gguf/mmproj-F16.gguf ./models/unsloth-gguf/mmproj-9B-F16.gguf

huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
  Qwen3.5-27B-UD-Q3_K_XL.gguf \
  mmproj-F16.gguf \
  --local-dir ./models/unsloth-gguf/
mv ./models/unsloth-gguf/mmproj-F16.gguf ./models/unsloth-gguf/mmproj-27B-F16.gguf

Optional 27B fallback:

  • quality_vision still uses Qwen3.5-27B-Q3_K_S.gguf
  • quality_iq4xs still uses Qwen3.5-27B-IQ4_XS.gguf

3. Start one server

Windows helper:

start_servers_speed.bat coding
start_servers_speed.bat vision
start_servers_speed.bat quality

Cross-platform Python manager:

python server_manager.py start --server coding
python server_manager.py start --server fast_vision
python server_manager.py start --profile quality
python server_manager.py stop

Shell helper:

./scripts/start_servers.sh coding
./scripts/start_servers.sh vision
./scripts/start_servers.sh quality

4. Verify the server

curl http://127.0.0.1:8002/health
curl http://127.0.0.1:8002/v1/models
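For scripting, the same health check can be wrapped in a tiny Python helper. This is a sketch only; it assumes the default port 8002 shown in the curl examples above and the standard llama.cpp server /health endpoint, which returns HTTP 200 when the model is loaded.

```python
import urllib.request
import urllib.error

def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the llama.cpp server answers /health with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: treat as not ready.
        return False

if __name__ == "__main__":
    print(is_healthy("http://127.0.0.1:8002"))
```

A launcher can poll this in a loop before sending the first request instead of sleeping for a fixed interval.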

Benchmark Index

Public benchmark guidance lives in:

If a narrative summary and a JSON artifact disagree, prefer the JSON artifact.

Benchmark V2 — Surface-Isolated Test Suite

The repo now includes a full surface-isolated benchmark suite for reproducible, artifact-backed validation:

# Run a full benchmark bundle on a server preset
python -m tests.benchmark_bundle --server quality_heretic_q3km_benchmark --bundle-profile smoke

# Run individual lanes
python -m tests.perplexity_gate --server quality_heretic_q3km_benchmark --output results/quality_long.json
python -m tests.context_scaling_gate --server quality_heretic_q3km_benchmark --output results/context_sweep.json
python -m tests.long_context_retrieval --server quality_heretic_q3km_benchmark --output results/retrieval.json
python -m tests.llama_native_bench --server quality_heretic_q3km_benchmark --output results/native_bench.json

# Download benchmark corpora (wikitext-2/103)
python scripts/fetch_quality_long_corpora.py

Each bundle produces a directory under results/benchmark_bundles/ with:

  • bundle.json — master artifact with provenance, system info, and lane statuses
  • Per-lane JSON files with full metrics, logs, and error traces
  • Server logs in logs/ for debugging
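Because bundle.json is the master artifact, downstream tooling can triage a run without parsing logs. The sketch below is illustrative only: the field names (`provenance`, `lanes`, `status`) are assumptions standing in for the real checked-in schema, not a description of it.

```python
import json
import os
import tempfile

# Hypothetical bundle.json shape; real field names live in the repo's
# checked-in artifacts and may differ.
sample = {
    "provenance": {"server": "quality_heretic_q3km_benchmark"},
    "lanes": {
        "quality_long": {"status": "pass"},
        "context_sweep": {"status": "partial"},
        "retrieval": {"status": "investigating"},
    },
}

def failing_lanes(bundle: dict) -> list:
    """Return lane names whose status is anything other than 'pass'."""
    return [name for name, lane in bundle["lanes"].items()
            if lane.get("status") != "pass"]

# Round-trip through disk the way a real bundle would be read.
path = os.path.join(tempfile.mkdtemp(), "bundle.json")
with open(path, "w") as f:
    json.dump(sample, f)
with open(path) as f:
    loaded = json.load(f)

print(failing_lanes(loaded))  # → ['context_sweep', 'retrieval']
```

The same pattern extends to comparing two bundles, which is how "prefer the JSON artifact" becomes checkable rather than a convention.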

Current control results (Heretic Q3_K_M, RTX 5080, CUDA 13.2, build 8559):

  • quality_long: PPL 4.35 (wikitext-2), 9.03 (wikitext-103) ✅
  • context_sweep: partial (SSM model limitations) ⚠️
  • retrieval: under investigation (SSM model limitations) 🔍

Bundle evolution evidence: results/benchmark_bundles/ — 9 versions documenting the full development path from v1 through v8.

TurboQuant Lab

  • TurboQuant is lab-only in this repo. It does not change the shipped defaults.
  • Current state:
    • turbo3 is a meaningful 128K+ lab win on the Heretic Q3_K_M benchmark path
    • turbo4 is endpoint-recovered but still non-promoted
    • 32K TurboQuant rows are informational only, not a general recommendation
  • External community benchmarks, for context only:
    • TheTom/turboquant_plus commit 0935dca adds README-only community hardware notes for RTX 3090 CUDA and M1 Max Metal. The external signal is useful for fit and hardware context, but it is not local proof and does not change this repo's checked-in ranking.
  • Current cycle state lives in:

Context Guidance

  • coding now ships at 262144 for maximum validated local context, but 64K to 120K remains the safer daily operating point when speed and headroom matter more than the ceiling.
  • fast_vision ships at 262144 and remains the practical 9B daily preset on this machine.
  • quality_ud_q3kxl ships at 98304 and is chosen for the tested SM120 path, not because it universally beats IQ4_XS.
  • True 1M 9B runs exist only in a local patched text-only path and are not part of this release line.
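The reason 64K to 120K is the safer daily point is that KV-cache VRAM grows linearly with context. A back-of-the-envelope estimator, assuming plain full attention and an unquantized F16 cache; the layer counts and head dimensions below are placeholders, not the actual Qwen3.5 architecture, and SSM layers or quantized caches (as used in this repo) change the numbers substantially:

```python
def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bytes_per_elt: float = 2.0) -> float:
    """Rough full-attention KV-cache size: one K and one V tensor per layer.

    All model dimensions are illustrative placeholders, not real
    Qwen3.5 values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt / 1024**3

# Example: a hypothetical 48-layer model with 4 KV heads of dim 128 at F16.
for ctx in (65536, 131072, 262144):
    print(f"{ctx:>7} ctx -> {kv_cache_gib(ctx, 48, 4, 128):.1f} GiB")
```

Doubling the context doubles the cache, which is why the 256K ceiling can validate while leaving little headroom for batch growth or the vision projector.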

What This Repo Is Not

  • not a leaderboard of every public Qwen quant on Hugging Face
  • not a promise that every 16 GB GPU will match the same speeds or fit points
  • not a proof that text-only benchmark winners are automatically the best vision models

Terminal Chat and API Helper

Terminal chat:

python chat.py
python chat.py --port 8003

Python helper:

from qwen_api import api_35b, api_9b_vision, SamplingMode

response = api_35b.chat(
    prompt="Write a Python function to reverse a list.",
    mode=SamplingMode.THINKING_CODING,
)

vision = api_9b_vision.vision(
    prompt="Describe this image.",
    image_path="example.png",
)

Project Layout

config/                  Canonical server settings (YAML presets)
config/templates/        Chat template overrides (Jinja2)
docs/                    Technical notes and workflow docs
results/                 Checked-in benchmark artifacts
results/benchmark_bundles/  Surface-isolated benchmark bundles (v1-v8+)
tests/                   Benchmark and validation scripts
tests/data/              Benchmark corpora (wikitext manifests)
scripts/windows/         Windows helpers (download, fetch, start)
scripts/fetch_quality_long_corpora.py  Cross-platform corpus fetcher
server_manager.py        Config-driven launcher and process manager
qwen_api.py              Minimal Python API helper
chat.py                  Terminal chat client with image support

Development Workflow

This repo is maintained with two local worktrees:

  • one development worktree for day-to-day changes
  • one clean main worktree for release validation and pushes

Phase-gated research cycles use docs/QWEN_SWARM_WORKFLOW.md.

The release rule is simple:

  1. Develop and review in the development worktree.
  2. Replay only releasable commits into the clean main worktree.
  3. Run release checks there.
  4. Push and tag only from the clean main worktree.