# Quantization Benchmark Suite

This repository now provides a reproducible benchmark that compares FP32,
post-training ternary quantization (PTQ), and quantization-aware training (QAT)
through a small Fashion-MNIST classifier. The script is located at
`scripts/ternary_quantization_benchmark.py` and is designed to log accuracy,
latency, and storage so you can understand the benefits of moving from float32
weights to ternary-trained representations.

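A ternary-trained representation keeps each weight as one of three values, {−s, 0, +s}. A minimal pure-Python sketch of threshold-based ternarization (an assumed scheme for illustration; the script's exact rule may differ):

```python
def ternarize(weights, threshold=0.45):
    """Map float weights to {-scale, 0.0, +scale}.

    Assumed scheme for illustration: a weight survives if its magnitude
    exceeds threshold * mean(|w|); survivors share one scale equal to the
    mean magnitude of the surviving weights.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold * mean_abs
    kept = [abs(w) for w in weights if abs(w) > delta]
    scale = sum(kept) / len(kept) if kept else 0.0
    return [scale if w > delta else -scale if w < -delta else 0.0
            for w in weights]
```

With `threshold=0.45`, `ternarize([0.9, -0.8, 0.05, -0.1])` keeps only the two large weights, yielding approximately `[0.85, -0.85, 0.0, 0.0]` — two distinct values plus zero, which is what makes the 2-bit storage formats below possible.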
## Benchmark matrix

…

Expected output:

- Console summary with size, accuracy/loss, and images/s for baseline/PTQ/QAT.
- `benchmarks/vit_cifar10_baseline.json` with stage metrics and model metadata.

### Fast-mode recipes (quick baselines)

Use these when you want a low-latency run to confirm the pipeline without
waiting for full PTQ/QAT loops.

ViT size + accuracy baseline (skip throughput, minimal eval):

```bash
python scripts/vit_ptq_qat_benchmark.py \
  --model-id google/vit-base-patch16-224 \
  --device cpu \
  --threshold 0.45 \
  --batch-size 16 \
  --max-train-samples 256 \
  --max-eval-samples 128 \
  --eval-batches 1 \
  --max-eval-batches 1 \
  --skip-throughput \
  --json-output benchmarks/vit_cifar10_quick.json
```

Observed output (CPU; from a size-only variant of the command above, using
`--max-eval-batches 0` together with `--skip-throughput`):

- baseline size: 0.32 GiB
- PTQ size: 0.03 GiB
- accuracy/loss/images_per_s: 0.0 (skipped)

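From the two sizes above, the quick run's compression ratio is easy to check:

```python
# Sizes reported by the size-only ViT run above (rounded to two decimals,
# so the ratio is approximate).
baseline_gib = 0.32
ptq_gib = 0.03
print(f"compression: {baseline_gib / ptq_gib:.1f}x")  # -> compression: 10.7x
```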
Phi-3 baseline PPL only (skip latency + PTQ PPL/QAT):

```bash
python scripts/phi3_ptq_qat_benchmark.py \
  --model-id microsoft/Phi-3-mini-4k-instruct \
  --device cpu \
  --dtype float32 \
  --max-eval-tokens 512 \
  --eval-texts 16 \
  --max-new-tokens 16 \
  --skip-latency \
  --skip-ptq-ppl \
  --json-output benchmarks/phi3_baseline_ppl.json
```

Status: PTQ PPL and a short QAT run are still pending (CPU-only PTQ conversion
exceeded two hours locally). To resume on a GPU:

```bash
python scripts/phi3_ptq_qat_benchmark.py \
  --model-id microsoft/Phi-3-mini-4k-instruct \
  --device auto \
  --dtype bfloat16 \
  --threshold 0.45 \
  --max-eval-tokens 128 \
  --eval-texts 2 \
  --max-new-tokens 0 \
  --skip-latency \
  --run-qat \
  --qat-steps 5 \
  --train-split 'train[:10]' \
  --json-output benchmarks/phi3_ptq_qat_fast.json
```

Note: PTQ still runs on CPU (the t81.torch fallback), so keep enough host RAM
available.

### 4) GGUF export + load check

…

Use this CSV to plot accuracy vs. storage or compare latency across the three modes.

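A stdlib-only sketch of that post-processing step (the column names `mode`, `accuracy_pct`, and `size_mib` are hypothetical placeholders; substitute the header your run actually writes):

```python
import csv
import io

# Hypothetical rows standing in for the benchmark CSV.
sample = io.StringIO(
    "mode,accuracy_pct,size_mib\n"
    "fp32,91.2,4.30\n"
    "ptq,89.5,0.28\n"
    "qat,90.8,0.28\n"
)

# Collect (mode, size, accuracy) tuples, e.g. for an accuracy-vs-storage plot.
points = [
    (row["mode"], float(row["size_mib"]), float(row["accuracy_pct"]))
    for row in csv.DictReader(sample)
]
for mode, size_mib, acc in points:
    print(f"{mode}: {acc:.1f}% at {size_mib:.2f} MiB")
```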
## JSON artifact schema (ViT + Phi-3)

The ViT and Phi-3 scripts emit JSON when you pass `--json-output`. These files
are intended to be committed alongside baseline numbers.

ViT JSON keys (from `scripts/vit_ptq_qat_benchmark.py`):

```json
{
  "model_id": "google/vit-base-patch16-224",
  "dataset": "cifar10",
  "device": "cpu",
  "threshold": 0.45,
  "baseline": {"size_gib": 0.00, "accuracy": 0.0, "loss": 0.0, "images_per_s": 0.0},
  "ptq": {"size_gib": 0.00, "accuracy": 0.0, "loss": 0.0, "images_per_s": 0.0},
  "qat": null
}
```

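Since these artifacts are meant to be committed, a tiny shape check helps catch schema drift before a file lands in `benchmarks/`. A sketch assuming exactly the keys above (the embedded sizes reuse the quick-run numbers):

```python
import json

# Sample artifact mirroring the documented ViT schema; sizes come from
# the quick size-only run described earlier.
artifact = json.loads("""{
  "model_id": "google/vit-base-patch16-224",
  "dataset": "cifar10",
  "device": "cpu",
  "threshold": 0.45,
  "baseline": {"size_gib": 0.32, "accuracy": 0.0, "loss": 0.0, "images_per_s": 0.0},
  "ptq": {"size_gib": 0.03, "accuracy": 0.0, "loss": 0.0, "images_per_s": 0.0},
  "qat": null
}""")

stage_keys = {"size_gib", "accuracy", "loss", "images_per_s"}
for stage in ("baseline", "ptq"):
    missing = stage_keys - set(artifact[stage])
    assert not missing, f"{stage} is missing {missing}"
print("artifact shape OK")
```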
Phi-3 JSON keys (from `scripts/phi3_ptq_qat_benchmark.py`):

```json
{
  "model_id": "microsoft/Phi-3-mini-4k-instruct",
  "dataset": "wikitext-2-raw-v1",
  "device": "cpu",
  "dtype": "float32",
  "threshold": 0.45,
  "max_eval_tokens": 1024,
  "eval_texts": 32,
  "max_new_tokens": 64,
  "skip_latency": true,
  "skip_ptq_ppl": false,
  "run_qat": false,
  "qat_steps": 5,
  "train_split": "train[:1%]",
  "learning_rate": 5e-5,
  "compression_ratio": 0.0,
  "baseline": {"size_gib": 0.00, "ppl": 0.0, "tok_s": 0.0},
  "ptq": {"size_gib": 0.00, "ppl": null, "tok_s": 0.0},
  "qat": null
}
```

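Consumers of this artifact should expect `null` in two places: a whole stage that was not run (`"qat": null` when QAT is disabled) and, presumably, a skipped metric (`"ppl": null` under `ptq` after `--skip-ptq-ppl`). A small reader sketch that handles both (the numeric values are hypothetical):

```python
import json

# Hypothetical Phi-3 artifact; all numbers are illustrative only.
record = json.loads("""{
  "baseline": {"size_gib": 7.20, "ppl": 9.80, "tok_s": 3.1},
  "ptq": {"size_gib": 0.45, "ppl": null, "tok_s": 0.0},
  "qat": null
}""")

for stage in ("baseline", "ptq", "qat"):
    data = record[stage]
    if data is None:
        print(f"{stage}: not run")      # whole stage skipped
        continue
    ppl = "skipped" if data["ppl"] is None else f"{data['ppl']:.2f}"
    print(f"{stage}: {data['size_gib']:.2f} GiB, ppl={ppl}")
```

This prints one line per stage, with `skipped`/`not run` standing in for the `null`s, so downstream tooling never trips on a missing metric.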
## Diagrams

View the [benchmark comparison diagram](docs/diagrams/benchmarks.mermaid.md) for a quick latency/storage summary that highlights the 15–22× wins.