
Commit dde6137
benchmarks-phi3
1 parent a95bde3

16 files changed

Lines changed: 1536 additions & 313 deletions

BENCHMARKS.md

Lines changed: 105 additions & 0 deletions
@@ -3,6 +3,111 @@
This repository now provides a reproducible benchmark that compares FP32,
post-training ternary quantization (PTQ), and quantization-aware training (QAT) through a small Fashion-MNIST classifier. The script is located at `scripts/ternary_quantization_benchmark.py` and is designed to log accuracy, latency, and storage so you can understand the benefits of moving from float32 weights to ternary-trained representations.

## Benchmark matrix

Use these baseline runs to keep results comparable across machines. The “Expected output” snippets show the fields or files to look for, not exact numbers.

### 1) Fashion-MNIST FP32/PTQ/QAT

```bash
python scripts/ternary_quantization_benchmark.py \
  --data-dir ~/data/fashion-mnist \
  --batch-size 128 \
  --fp32-epochs 3 \
  --qat-epochs 3 \
  --threshold 0.45 \
  --device cpu \
  --output benchmarks/fashion_mnist_quantization.csv
```

Expected output:
- `benchmarks/fashion_mnist_quantization.csv` with `mode`, `accuracy`, `loss`, `latency_s`, `bytes`.
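
To spot-check a run without opening the CSV by hand, a minimal sketch (standard library only, reading just the columns listed above) is:

```bash
python - <<'PY'
# Print per-mode metrics from the benchmark CSV produced above.
import csv

with open("benchmarks/fashion_mnist_quantization.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["mode"], row["accuracy"], row["loss"], row["latency_s"], row["bytes"])
PY
```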

### 2) GPT-2 PTQ/QAT micro-run (sanity check)

```bash
python scripts/phi3_ptq_qat_benchmark.py \
  --model-id gpt2 \
  --device cpu \
  --dtype float32 \
  --max-eval-tokens 32 \
  --eval-texts 8 \
  --max-new-tokens 4 \
  --run-qat \
  --qat-steps 5 \
  --train-split 'train[:20]'
```

Expected output:
- Console summary with size, compression ratio, perplexity, and tok/s for baseline/PTQ/QAT.
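
The micro-run above is only a wiring check. A full run against the script's target model would in principle reuse the same flags with larger budgets; the invocation below is illustrative only (the model id is the one used in section 4, and the token/step counts are untested placeholders):

```bash
python scripts/phi3_ptq_qat_benchmark.py \
  --model-id microsoft/Phi-3-mini-4k-instruct \
  --device cpu \
  --dtype float32 \
  --max-eval-tokens 512 \
  --eval-texts 64 \
  --max-new-tokens 64 \
  --run-qat \
  --qat-steps 200 \
  --train-split 'train[:2000]'
```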
43+
44+
### 3) ViT CIFAR-10 PTQ/QAT baseline (quick)
45+
46+
```bash
47+
python scripts/vit_ptq_qat_benchmark.py \
48+
--model-id google/vit-base-patch16-224 \
49+
--device cpu \
50+
--threshold 0.45 \
51+
--batch-size 32 \
52+
--max-train-samples 2048 \
53+
--max-eval-samples 512 \
54+
--eval-batches 16 \
55+
--run-qat \
56+
--qat-steps 50 \
57+
--json-output benchmarks/vit_cifar10_baseline.json
58+
```
59+
60+
Expected output:
61+
- Console summary with size, accuracy/loss, and images/s for baseline/PTQ/QAT.
62+
- `benchmarks/vit_cifar10_baseline.json` with stage metrics and model metadata.
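
To eyeball the JSON artifact without assuming its exact schema, a minimal sketch is:

```bash
python - <<'PY'
# List the top-level entries of the artifact; the schema is whatever the script wrote.
import json

with open("benchmarks/vit_cifar10_baseline.json") as f:
    report = json.load(f)
for key, value in report.items():
    print(key, "->", type(value).__name__)
PY
```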

### 4) GGUF export + load check

```bash
t81 convert microsoft/Phi-3-mini-4k-instruct phi3-t81 --threshold 0.45 --force-cpu-device-map
t81 gguf phi3-tq1.gguf --from-t81 phi3-t81 --quant TQ1_0 --validate
python scripts/gguf_benchmark.py --gguf phi3-tq1.gguf --llama-cli /path/to/llama-cli --n-predict 128
```

Expected output:
- `t81 gguf` prints validation success.
- `gguf_benchmark.py` prints size, peak RSS, and eval ms/token.

Phi-3 GGUF baseline (TQ1_0, CPU-only, llama.cpp build 7340):

```bash
python scripts/gguf_benchmark.py \
  --gguf phi3-tq1-fixed12.gguf \
  --llama-cli /opt/homebrew/bin/llama-cli \
  --n-predict 64 --extra --device none --n-gpu-layers 0
```

Observed output:
- size: 1481.96 MiB
- peak RSS: 2260.02 MiB
- prompt: 54.35 ms/token (18.4 tok/s)
- eval: 56.22 ms/token (17.79 tok/s)

### 5) GEMM throughput (CPU)

```bash
python - <<'PY'
import numpy as np
import t81lib

m, n, k = 1024, 1024, 1024
# Pack the float32 weight matrix into the ternary representation.
weights = np.random.randn(m, k).astype(np.float32)
packed = t81lib.pack_dense_matrix(weights, threshold=0.45)
# Dense right-hand side and output buffer for the ternary GEMM.
rhs = np.random.randn(k, n).astype(np.float32)
out = np.zeros((m, n), dtype=np.float32)
t81lib.gemm_ternary(packed, rhs, out, m, n, k)
print("gemm_ternary OK", out.shape)
PY
```

Expected output:
- Console prints `gemm_ternary OK (1024, 1024)`.
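
The check above only confirms that the kernel runs. To get an actual throughput figure, a minimal sketch (same setup and same call as above, timed with `time.perf_counter`; 2·m·n·k is the usual GEMM operation count) is:

```bash
python - <<'PY'
import time

import numpy as np
import t81lib

m, n, k = 1024, 1024, 1024
weights = np.random.randn(m, k).astype(np.float32)
packed = t81lib.pack_dense_matrix(weights, threshold=0.45)
rhs = np.random.randn(k, n).astype(np.float32)
out = np.zeros((m, n), dtype=np.float32)

# Warm up once, then average a few repetitions of the call used above.
t81lib.gemm_ternary(packed, rhs, out, m, n, k)
reps = 5
start = time.perf_counter()
for _ in range(reps):
    t81lib.gemm_ternary(packed, rhs, out, m, n, k)
elapsed = (time.perf_counter() - start) / reps
print(f"{elapsed * 1e3:.1f} ms/iter, {2 * m * n * k / elapsed / 1e9:.2f} GOP/s")
PY
```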
## Script overview

1. The benchmark builds a `TinyClassifier` (a single `nn.Linear` head on flattened Fashion-MNIST images) and trains it in FP32 for a few epochs.

README.md

Lines changed: 1 addition & 0 deletions
@@ -225,6 +225,7 @@ See [docs/api-overview.md](docs/api-overview.md) for the full surface described
## Benchmarks

See [BENCHMARKS.md](BENCHMARKS.md) for the Fashion-MNIST FP32/PTQ/QAT comparison.
Phi-3 GGUF baseline (TQ1_0, CPU-only, llama.cpp): size 1481.96 MiB, peak RSS 2260.02 MiB, prompt 54.35 ms/token (18.4 tok/s), eval 56.22 ms/token (17.79 tok/s).

```bash
cmake -S . -B build -DT81LIB_BUILD_BENCH=ON -DT81LIB_USE_GOOGLE_BENCH=ON

docs/ROADMAP.md

Lines changed: 15 additions & 1 deletion
@@ -62,7 +62,20 @@ Recent work has delivered parts of this roadmap:

* **Recommendation 1** — quickstart matrix + common workflows added to `DEVELOPMENT.md`.
* **Recommendation 2** — CI matrix expanded for OS/build types and SIMD guards; Python tests standardized on Linux.
* **Recommendation 3** — Python entry-points table added to `docs/python-api.md` and `docs/python-cookbook.md`, with links from `docs/index.md`. **In progress (benchmark visibility added in `README.md`, `BENCHMARKS.md`, and the Phi-3 notebook).**
* **GGUF compatibility** — Phi-3 export validated (`phi3-tq1-fixed12.gguf`); QKV split experiment reverted for llama.cpp parity.

### Status timeline (recent highlights)

* Python entry-point discoverability refreshed (docs landing page + cookbook + API entry table).
* Phi-3 GGUF export validated with llama.cpp baseline metrics captured for reference.
* CLI documentation updated to call out Phi-3 GGUF compatibility expectations.

### High-impact next priorities (effort vs. impact)

1. **Recommendation 4 — Standardized QAT benchmark (high effort, high impact)**: define a reproducible suite (e.g., Phi-3 Mini fine-tune on OpenAssistant/oasst1 or a small ViT on CIFAR-10) that compares FP16 → PTQ → QAT. Capture perplexity/accuracy, model size, and tok/s on CPU (llama.cpp) plus GPU (when bindings are ready), and publish baselines in `BENCHMARKS.md` with JSON artifacts.
2. **Recommendation 5 — GPU tensor metadata + dispatcher hardening (medium effort, high impact)**: stabilize the `TensorMetadata` ABI for safe `device_ptr` handling, add broadcasting/contiguous fallbacks, and certify CUDA/ROCm kernels in CI with latency/accuracy parity against CPU.
3. **Polish & community**: add issue templates + “good first issue” labels, ship pre-built wheels (including CUDA-enabled variants), and publish the Phi-3 TQ1_0 GGUF for community testing.

Remaining items are listed below with the next steps still required.

@@ -91,6 +104,7 @@ Remaining items are listed below with the next steps still required.
* **Why**: Python users currently discover helpers across `t81lib`, `t81`, and CLI docs.
* **Benefits**: Easier discoverability, faster adoption, clearer path from C++ bindings to Torch wrappers.
* **Effort**: Low-Medium.
* **Status**: In progress (benchmark visibility added in `README.md`, `BENCHMARKS.md`, and the Phi-3 notebook).
* **Implementation**:
  1. Expand MkDocs coverage by generating the Python API reference via mkdocstrings and ensuring key modules are linked from `docs/index.md`. **Done (entry-point links + extra directives).**
  2. Keep the “Python Cookbook” up to date with end-to-end recipes (bindings + `t81.torch` + CLI), and add a short "choose your entry point" table. **Done (entry points table).**

docs/references/cli-usage.md

Lines changed: 2 additions & 0 deletions
@@ -57,6 +57,8 @@ Pass `--validate` when you want the fresh GGUF bundle checked by both the Python

Use the same `--threshold`, `--device-map`, `--torch-dtype`, and `--force-cpu-device-map` knobs as `t81 convert` because `t81 gguf` delegates to that CLI internally.

Phi-3 GGUF compatibility: the exporter now splits fused `qkv_proj`/`gate_up_proj` weights and skips ternary quantization for non-block-aligned matrices so `llama.cpp` can load Phi-3 bundles without patching. If you have an older GGUF, re-export with the latest `t81 gguf`.
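
For example, an older bundle can be regenerated with the same convert-then-export flow used in the benchmark docs (model id, paths, and threshold below are just the values from those examples):

```bash
t81 convert microsoft/Phi-3-mini-4k-instruct phi3-t81 --threshold 0.45 --force-cpu-device-map
t81 gguf phi3-tq1.gguf --from-t81 phi3-t81 --quant TQ1_0 --validate
```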
Use `--profile compression-first` to force the compression-first profile (TQ1_0 + default threshold) and stamp profile metadata into the bundle. Use `--profile tq1_1-draft` with `T81_ENABLE_TQ1_1=1` to write the experimental TQ1_1 payloads.

### Compression-first wedge (FP16 to ternary GGUF)
