
Commit dde6137
benchmarks-phi3
1 parent a95bde3

16 files changed

Lines changed: 1536 additions & 313 deletions

BENCHMARKS.md

Lines changed: 105 additions & 0 deletions
@@ -3,6 +3,111 @@
This repository now provides a reproducible benchmark that compares FP32,
post-training ternary quantization (PTQ), and quantization-aware training (QAT) through a small Fashion-MNIST classifier. The script is located at `scripts/ternary_quantization_benchmark.py` and is designed to log accuracy, latency, and storage so you can understand the benefits of moving from float32 weights to ternary-trained representations.

## Benchmark matrix

Use these baseline runs to keep results comparable across machines. The “Expected output” snippets show the fields or files to look for, not exact numbers.

### 1) Fashion-MNIST FP32/PTQ/QAT

```bash
python scripts/ternary_quantization_benchmark.py \
  --data-dir ~/data/fashion-mnist \
  --batch-size 128 \
  --fp32-epochs 3 \
  --qat-epochs 3 \
  --threshold 0.45 \
  --device cpu \
  --output benchmarks/fashion_mnist_quantization.csv
```

Expected output:
- `benchmarks/fashion_mnist_quantization.csv` with `mode`, `accuracy`, `loss`, `latency_s`, `bytes`.
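
To spot-check a run without opening the CSV by hand, a minimal sketch (standard library only, reading just the columns listed above) is:

```bash
python - <<'PY'
# Print per-mode metrics from the benchmark CSV produced above.
import csv

with open("benchmarks/fashion_mnist_quantization.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["mode"], row["accuracy"], row["loss"], row["latency_s"], row["bytes"])
PY
```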

### 2) GPT-2 PTQ/QAT micro-run (sanity check)

```bash
python scripts/phi3_ptq_qat_benchmark.py \
  --model-id gpt2 \
  --device cpu \
  --dtype float32 \
  --max-eval-tokens 32 \
  --eval-texts 8 \
  --max-new-tokens 4 \
  --run-qat \
  --qat-steps 5 \
  --train-split 'train[:20]'
```

Expected output:
- Console summary with size, compression ratio, perplexity, and tok/s for baseline/PTQ/QAT.
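
The micro-run above is only a wiring check. A full run against the script's target model would in principle reuse the same flags with larger budgets; the invocation below is illustrative only (the model id is the one used in section 4, and the token/step counts are untested placeholders):

```bash
python scripts/phi3_ptq_qat_benchmark.py \
  --model-id microsoft/Phi-3-mini-4k-instruct \
  --device cpu \
  --dtype float32 \
  --max-eval-tokens 512 \
  --eval-texts 64 \
  --max-new-tokens 64 \
  --run-qat \
  --qat-steps 200 \
  --train-split 'train[:2000]'
```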
43+
44+
### 3) ViT CIFAR-10 PTQ/QAT baseline (quick)
45+
46+
```bash
47+
python scripts/vit_ptq_qat_benchmark.py \
48+
--model-id google/vit-base-patch16-224 \
49+
--device cpu \
50+
--threshold 0.45 \
51+
--batch-size 32 \
52+
--max-train-samples 2048 \
53+
--max-eval-samples 512 \
54+
--eval-batches 16 \
55+
--run-qat \
56+
--qat-steps 50 \
57+
--json-output benchmarks/vit_cifar10_baseline.json
58+
```
59+
60+
Expected output:
61+
- Console summary with size, accuracy/loss, and images/s for baseline/PTQ/QAT.
62+
- `benchmarks/vit_cifar10_baseline.json` with stage metrics and model metadata.
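
To eyeball the JSON artifact without assuming its exact schema, a minimal sketch is:

```bash
python - <<'PY'
# List the top-level entries of the artifact; the schema is whatever the script wrote.
import json

with open("benchmarks/vit_cifar10_baseline.json") as f:
    report = json.load(f)
for key, value in report.items():
    print(key, "->", type(value).__name__)
PY
```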

### 4) GGUF export + load check

```bash
t81 convert microsoft/Phi-3-mini-4k-instruct phi3-t81 --threshold 0.45 --force-cpu-device-map
t81 gguf phi3-tq1.gguf --from-t81 phi3-t81 --quant TQ1_0 --validate
python scripts/gguf_benchmark.py --gguf phi3-tq1.gguf --llama-cli /path/to/llama-cli --n-predict 128
```

Expected output:
- `t81 gguf` prints validation success.
- `gguf_benchmark.py` prints size, peak RSS, and eval ms/token.

Phi-3 GGUF baseline (TQ1_0, CPU-only, llama.cpp build 7340):

```bash
python scripts/gguf_benchmark.py \
  --gguf phi3-tq1-fixed12.gguf \
  --llama-cli /opt/homebrew/bin/llama-cli \
  --n-predict 64 --extra --device none --n-gpu-layers 0
```

Observed output:
- size: 1481.96 MiB
- peak RSS: 2260.02 MiB
- prompt: 54.35 ms/token (18.4 tok/s)
- eval: 56.22 ms/token (17.79 tok/s)

### 5) GEMM throughput (CPU)

```bash
python - <<'PY'
import numpy as np
import t81lib

m, n, k = 1024, 1024, 1024
# Pack the float32 weight matrix into the ternary representation.
weights = np.random.randn(m, k).astype(np.float32)
packed = t81lib.pack_dense_matrix(weights, threshold=0.45)
# Dense right-hand side and output buffer for the ternary GEMM.
rhs = np.random.randn(k, n).astype(np.float32)
out = np.zeros((m, n), dtype=np.float32)
t81lib.gemm_ternary(packed, rhs, out, m, n, k)
print("gemm_ternary OK", out.shape)
PY
```

Expected output:
- Console prints `gemm_ternary OK (1024, 1024)`.
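
The check above only confirms that the kernel runs. To get an actual throughput figure, a minimal sketch (same setup and same call as above, timed with `time.perf_counter`; 2·m·n·k is the usual GEMM operation count) is:

```bash
python - <<'PY'
import time

import numpy as np
import t81lib

m, n, k = 1024, 1024, 1024
weights = np.random.randn(m, k).astype(np.float32)
packed = t81lib.pack_dense_matrix(weights, threshold=0.45)
rhs = np.random.randn(k, n).astype(np.float32)
out = np.zeros((m, n), dtype=np.float32)

# Warm up once, then average a few repetitions of the call used above.
t81lib.gemm_ternary(packed, rhs, out, m, n, k)
reps = 5
start = time.perf_counter()
for _ in range(reps):
    t81lib.gemm_ternary(packed, rhs, out, m, n, k)
elapsed = (time.perf_counter() - start) / reps
print(f"{elapsed * 1e3:.1f} ms/iter, {2 * m * n * k / elapsed / 1e9:.2f} GOP/s")
PY
```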
## Script overview

1. The benchmark builds a `TinyClassifier` (a single `nn.Linear` head on flattened Fashion-MNIST images) and trains it in FP32 for a few epochs.

README.md

Lines changed: 1 addition & 0 deletions
@@ -225,6 +225,7 @@ See [docs/api-overview.md](docs/api-overview.md) for the full surface described
## Benchmarks

See [BENCHMARKS.md](BENCHMARKS.md) for the Fashion-MNIST FP32/PTQ/QAT comparison.
Phi-3 GGUF baseline (TQ1_0, CPU-only, llama.cpp): size 1481.96 MiB, peak RSS 2260.02 MiB, prompt 54.35 ms/token (18.4 tok/s), eval 56.22 ms/token (17.79 tok/s).

```bash
cmake -S . -B build -DT81LIB_BUILD_BENCH=ON -DT81LIB_USE_GOOGLE_BENCH=ON

docs/ROADMAP.md

Lines changed: 15 additions & 1 deletion
@@ -62,7 +62,20 @@ Recent work has delivered parts of this roadmap:

* **Recommendation 1** — quickstart matrix + common workflows added to `DEVELOPMENT.md`.
* **Recommendation 2** — CI matrix expanded for OS/build types and SIMD guards; Python tests standardized on Linux.
* **Recommendation 3** — Python entry-points table added to `docs/python-api.md` and `docs/python-cookbook.md`, with links from `docs/index.md`. **In progress (benchmark visibility added in `README.md`, `BENCHMARKS.md`, and the Phi-3 notebook).**
* **GGUF compatibility** — Phi-3 export validated (`phi3-tq1-fixed12.gguf`); QKV split experiment reverted for llama.cpp parity.

### Status timeline (recent highlights)

* Python entry-point discoverability refreshed (docs landing page + cookbook + API entry table).
* Phi-3 GGUF export validated with llama.cpp baseline metrics captured for reference.
* CLI documentation updated to call out Phi-3 GGUF compatibility expectations.

### High-impact next priorities (effort vs. impact)

1. **Recommendation 4 — Standardized QAT benchmark (high effort, high impact)**: define a reproducible suite (e.g., Phi-3 Mini fine-tune on OpenAssistant/oasst1 or a small ViT on CIFAR-10) that compares FP16 → PTQ → QAT. Capture perplexity/accuracy, model size, and tok/s on CPU (llama.cpp) plus GPU (when bindings are ready), and publish baselines in `BENCHMARKS.md` with JSON artifacts.
2. **Recommendation 5 — GPU tensor metadata + dispatcher hardening (medium effort, high impact)**: stabilize the `TensorMetadata` ABI for safe `device_ptr` handling, add broadcasting/contiguous fallbacks, and certify CUDA/ROCm kernels in CI with latency/accuracy parity against CPU.
3. **Polish & community**: add issue templates + “good first issue” labels, ship pre-built wheels (including CUDA-enabled variants), and publish the Phi-3 TQ1_0 GGUF for community testing.

Remaining items are listed below with the next steps still required.

@@ -91,6 +104,7 @@ Remaining items are listed below with the next steps still required.
* **Why**: Python users currently discover helpers across `t81lib`, `t81`, and CLI docs.
* **Benefits**: Easier discoverability, faster adoption, clearer path from C++ bindings to Torch wrappers.
* **Effort**: Low-Medium.
* **Status**: In progress (benchmark visibility added in `README.md`, `BENCHMARKS.md`, and the Phi-3 notebook).
* **Implementation**:
  1. Expand MkDocs coverage by generating the Python API reference via mkdocstrings and ensuring key modules are linked from `docs/index.md`. **Done (entry-point links + extra directives).**
  2. Keep the “Python Cookbook” up to date with end-to-end recipes (bindings + `t81.torch` + CLI), and add a short "choose your entry point" table. **Done (entry points table).**

docs/references/cli-usage.md

Lines changed: 2 additions & 0 deletions
@@ -57,6 +57,8 @@ Pass `--validate` when you want the fresh GGUF bundle checked by both the Python

Use the same `--threshold`, `--device-map`, `--torch-dtype`, and `--force-cpu-device-map` knobs as `t81 convert` because `t81 gguf` delegates to that CLI internally.

Phi-3 GGUF compatibility: the exporter now splits fused `qkv_proj`/`gate_up_proj` weights and skips ternary quantization for non-block-aligned matrices so `llama.cpp` can load Phi-3 bundles without patching. If you have an older GGUF, re-export with the latest `t81 gguf`.
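
For example, an older bundle can be regenerated with the same convert-then-export flow used in the benchmark docs (model id, paths, and threshold below are just the values from those examples):

```bash
t81 convert microsoft/Phi-3-mini-4k-instruct phi3-t81 --threshold 0.45 --force-cpu-device-map
t81 gguf phi3-tq1.gguf --from-t81 phi3-t81 --quant TQ1_0 --validate
```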
Use `--profile compression-first` to force the compression-first profile (TQ1_0 + default threshold) and stamp profile metadata into the bundle. Use `--profile tq1_1-draft` with `T81_ENABLE_TQ1_1=1` to write the experimental TQ1_1 payloads.

### Compression-first wedge (FP16 to ternary GGUF)
