This repository now provides a reproducible benchmark that compares FP32, post-training ternary quantization (PTQ), and quantization-aware training (QAT) using a small Fashion-MNIST classifier. The script lives at `scripts/ternary_quantization_benchmark.py` and logs accuracy, latency, and storage so you can see the benefits of moving from float32 weights to ternary-trained representations.

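For intuition, "ternary" here means each float32 weight is mapped to {-1, 0, +1}, with small weights zeroed out by a threshold. The sketch below is a minimal NumPy illustration, not the script's exact rule; it assumes the cutoff is `threshold * mean(|W|)` and uses a simple per-tensor scale.

```python
import numpy as np

def ternarize(weights: np.ndarray, threshold: float = 0.45):
    """Illustrative ternary PTQ: weights -> {-1, 0, +1} plus one scale."""
    cutoff = threshold * np.abs(weights).mean()
    ternary = np.zeros_like(weights, dtype=np.int8)
    ternary[weights > cutoff] = 1
    ternary[weights < -cutoff] = -1
    # One scale per tensor so that scale * ternary approximates the originals.
    nonzero = ternary != 0
    scale = np.abs(weights[nonzero]).mean() if nonzero.any() else 1.0
    return ternary, np.float32(scale)

w = np.random.randn(10, 784).astype(np.float32)
t, s = ternarize(w)
print(t.dtype, float(s), np.unique(t))  # int8, per-tensor scale, [-1  0  1]
```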
## Benchmark matrix

Use these baseline runs to keep results comparable across machines. The “Expected output” snippets show the fields or files to look for, not exact numbers.

### 1) Fashion-MNIST FP32/PTQ/QAT

```bash
python scripts/ternary_quantization_benchmark.py \
  --data-dir ~/data/fashion-mnist \
  --batch-size 128 \
  --fp32-epochs 3 \
  --qat-epochs 3 \
  --threshold 0.45 \
  --device cpu \
  --output benchmarks/fashion_mnist_quantization.csv
```

Expected output:
- `benchmarks/fashion_mnist_quantization.csv` with `mode`, `accuracy`, `loss`, `latency_s`, `bytes`.

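To eyeball the three rows side by side, something like this works (assuming pandas is installed; the column names are the ones listed above):

```python
import pandas as pd

# Compare FP32, PTQ, and QAT on accuracy, latency, and on-disk size.
df = pd.read_csv("benchmarks/fashion_mnist_quantization.csv")
df["size_mib"] = df["bytes"] / (1024 * 1024)
print(df[["mode", "accuracy", "loss", "latency_s", "size_mib"]].to_string(index=False))
```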
### 2) GPT-2 PTQ/QAT micro-run (sanity check)

```bash
python scripts/phi3_ptq_qat_benchmark.py \
  --model-id gpt2 \
  --device cpu \
  --dtype float32 \
  --max-eval-tokens 32 \
  --eval-texts 8 \
  --max-new-tokens 4 \
  --run-qat \
  --qat-steps 5 \
  --train-split 'train[:20]'
```

Expected output:
- Console summary with size, compression ratio, perplexity, and tok/s for baseline/PTQ/QAT.

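The summary quantities relate in the usual way: compression ratio is baseline bytes divided by quantized bytes, and perplexity is the exponential of the mean per-token negative log-likelihood. A tiny worked example with hypothetical numbers (not benchmark output):

```python
import math

baseline_bytes = 500 * 1024**2  # hypothetical FP32 checkpoint size
ternary_bytes = 60 * 1024**2    # hypothetical packed ternary size
mean_nll = 3.2                  # hypothetical mean negative log-likelihood per token

print(f"compression ratio: {baseline_bytes / ternary_bytes:.1f}x")  # ~8.3x
print(f"perplexity: {math.exp(mean_nll):.1f}")                      # ~24.5
```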
### 3) ViT CIFAR-10 PTQ/QAT baseline (quick)

```bash
python scripts/vit_ptq_qat_benchmark.py \
  --model-id google/vit-base-patch16-224 \
  --device cpu \
  --threshold 0.45 \
  --batch-size 32 \
  --max-train-samples 2048 \
  --max-eval-samples 512 \
  --eval-batches 16 \
  --run-qat \
  --qat-steps 50 \
  --json-output benchmarks/vit_cifar10_baseline.json
```

Expected output:
- Console summary with size, accuracy/loss, and images/s for baseline/PTQ/QAT.
- `benchmarks/vit_cifar10_baseline.json` with stage metrics and model metadata.

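To inspect the JSON report without assuming its exact schema:

```python
import json

# Dump the report and list its top-level keys; the exact schema is defined
# by vit_ptq_qat_benchmark.py, so don't hard-code field names here.
with open("benchmarks/vit_cifar10_baseline.json") as f:
    report = json.load(f)

print(sorted(report))
print(json.dumps(report, indent=2))
```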
### 4) GGUF export + load check

```bash
t81 convert microsoft/Phi-3-mini-4k-instruct phi3-t81 --threshold 0.45 --force-cpu-device-map
t81 gguf phi3-tq1.gguf --from-t81 phi3-t81 --quant TQ1_0 --validate
python scripts/gguf_benchmark.py --gguf phi3-tq1.gguf --llama-cli /path/to/llama-cli --n-predict 128
```

Expected output:
- `t81 gguf` prints validation success.
- `gguf_benchmark.py` prints size, peak RSS, and eval ms/token.

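As an optional cross-check beyond `--validate`, the `gguf` Python package (`pip install gguf`), if you have it, can enumerate the tensors in the exported file and their quantization types. This assumes that package's reader API and is not part of this repository's tooling:

```python
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("phi3-tq1.gguf")
# Count tensors by quantization type; TQ1_0 should dominate in a ternary export.
print(Counter(t.tensor_type.name for t in reader.tensors))
```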
Phi-3 GGUF baseline (TQ1_0, CPU-only, llama.cpp build 7340):

```bash
python scripts/gguf_benchmark.py \
  --gguf phi3-tq1-fixed12.gguf \
  --llama-cli /opt/homebrew/bin/llama-cli \
  --n-predict 64 --extra --device none --n-gpu-layers 0
```

Observed output:
- size: 1481.96 MiB
- peak RSS: 2260.02 MiB
- prompt: 54.35 ms/token (18.4 tok/s)
- eval: 56.22 ms/token (17.79 tok/s)

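The tok/s figures are simply the reciprocal of ms/token, e.g. 1000 / 56.22 ≈ 17.79.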
### 5) GEMM throughput (CPU)

```bash
python - <<'PY'
import numpy as np
import t81lib

m, n, k = 1024, 1024, 1024
weights = np.random.randn(m, k).astype(np.float32)
packed = t81lib.pack_dense_matrix(weights, threshold=0.45)
rhs = np.random.randn(k, n).astype(np.float32)
out = np.zeros((m, n), dtype=np.float32)
# Multiply the packed ternary weights by the dense right-hand side.
t81lib.gemm_ternary(packed, rhs, out, m, n, k)
print("gemm_ternary OK", out.shape)
PY
```

Expected output:
- Console prints `gemm_ternary OK (1024, 1024)`.

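The heredoc above only verifies that the call runs; for an actual throughput number, wrap the same calls in a timing loop. A rough sketch, reusing the `t81lib` calls shown above and counting GEMM work as 2·m·n·k FLOPs:

```python
import time

import numpy as np
import t81lib

m, n, k = 1024, 1024, 1024
weights = np.random.randn(m, k).astype(np.float32)
packed = t81lib.pack_dense_matrix(weights, threshold=0.45)
rhs = np.random.randn(k, n).astype(np.float32)
out = np.zeros((m, n), dtype=np.float32)

# Warm up once, then time a few repetitions and keep the best run.
t81lib.gemm_ternary(packed, rhs, out, m, n, k)
times = []
for _ in range(5):
    start = time.perf_counter()
    t81lib.gemm_ternary(packed, rhs, out, m, n, k)
    times.append(time.perf_counter() - start)

best = min(times)
print(f"best of 5: {best * 1e3:.2f} ms, {2 * m * n * k / best / 1e9:.1f} GFLOP/s")
```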
## Script overview

1. The benchmark builds a `TinyClassifier` (a single `nn.Linear` head on flattened Fashion-MNIST images) and trains it in FP32 for a few epochs.