Readme fix

t81dev · t81dev · commit b7c59f6e6ad0 · 2025-12-12T11:07:49.000-05:00
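
The AGENTS.md entry in this diff notes that the `t81` package stays import-safe when the compiled extension is unavailable, and the updated README snippet calls `t81lib.BigInt`. A minimal sketch of guarding that snippet on the caller's side, assuming only that `t81lib` may or may not be importable; `load_optional` is a hypothetical helper for illustration and is not part of the repository:

```python
import importlib


def load_optional(name):
    """Return the imported module, or None when it cannot be imported.

    Hypothetical helper for illustration; not part of t81lib.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


t81lib = load_optional("t81lib")

if t81lib is not None:
    # Same expression as the updated README example in this commit.
    print(t81lib.BigInt(3) * t81lib.BigInt(7))
else:
    print("t81lib extension not available; skipping BigInt demo")
```

This keeps example scripts usable on machines where the extension has not been built yet.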
diff --git a/AGENTS.md b/AGENTS.md
@@ -56,5 +56,5 @@ This file helps AI agents discover and understand how to work with this reposito
 - Expanded `python/t81/__init__.py` so the higher-level `t81` package re-exports the compiled binding helpers (`t81lib`, `BigInt`, `Limb`, `gemm_ternary`, etc.) while staying import-safe when the extension is unavailable.
 - Added `scripts/ternary_quantization_benchmark.py` plus `BENCHMARKS.md` so contributors can reproduce a Fashion-MNIST FP32/PTQ/QAT benchmark and log accuracy/latency/storage for each mode; README now links the benchmark doc.
 - Rewrote `pyproject.toml` with valid TOML sections so editable installs (and `pip install -e '.[torch]'`) can parse the metadata cleanly before building the extension.
-- Restructured `README.md` into a onboarding-focused front door and added companion docs (`docs/use-cases.md`, `docs/hardware.md`, `docs/api-overview.md`, `docs/python-install.md`, `docs/torch.md`, `examples/README.md`) so heavy reference material lives outside the visitor-facing overview.
+- Restructured `README.md` into an onboarding-focused front door and added companion docs (`docs/use-cases.md`, `docs/hardware.md`, `docs/api-overview.md`, `docs/python-install.md`, `docs/torch.md`, `docs/gpu.md`, `examples/README.md`) so heavy reference material lives outside the visitor-facing overview.
 - Added optional CUDA/ROCm toggles plus a GPU dispatcher sketch (`include/t81/linalg/gemm_gpu.hpp`, `src/linalg/{gemm_cuda.cu,gemm_dispatch.cpp,gemm_rocm.cpp}`) so future teams can wire the new `where`/`clamp`/`lerp`/`addcmul` helpers into GPU kernels, introduced `t81::TensorMetadata` + Python helpers (`python/bindings.cpp`) that extract metadata from NumPy/Torch tensors, and expanded `tests/python/test_gpu_ops.py` to cover the metadata-backed bindings on both CPU and GPU paths.
diff --git a/README.md b/README.md
@@ -18,15 +18,16 @@ AI workflows.
 #include <t81/t81lib.hpp>
 
 int main() {
-  t81::Int sum = t81::Int{1} + t81::Int{2};
-  return sum == 3 ? 0 : 1;
+  using t81::Int;
+  Int sum = Int::from_int(1) + Int::from_int(2);
+  return (sum == Int::from_int(3)) ? 0 : 1;
 }
 ```
 
 ```python
 import t81lib
 
-print(t81lib.Float.from_string("1.5") + t81lib.Float.from_string("1.5"))
+print(t81lib.BigInt(3) * t81lib.BigInt(7))
 ```
 
 ## Who is this for?
@@ -97,9 +98,9 @@ target_link_libraries(... t81::t81lib)
 
 `pip install .[torch]` unlocks the `t81lib`/`t81` namespace, NumPy quantization helpers, and the `t81.torch`/`t81.nn` layers that mix ternary weights with FP32/BF16 biases. Jump deeper via [docs/python-api.md](docs/python-api.md), [docs/python-cookbook.md](docs/python-cookbook.md), and [docs/torch.md](docs/torch.md).
 
-## GPU backends & tensor metadata
+## GPU backends
 
-Enable CUDA/ROCm through the optional `-DUSE_CUDA=ON` and `-DUSE_ROCM=ON` flags during CMake configuration so the Python bindings link against the new GPU kernels (`python/CMakeLists.txt`). Once enabled, `t81lib.where`, `t81lib.clamp`, `t81lib.lerp`, and `t81lib.addcmul` accept either NumPy buffers or PyTorch tensors and route the work directly to CUDA/HIP kernels via the lightweight [`t81::TensorMetadata`](include/t81/tensor_metadata.hpp) ABI. The metadata struct carries device/dtype/shape/stride info plus raw `data_ptr`, letting the dispatcher avoid host copies and keep outputs on-device. When torch is installed, `t81lib` automatically wraps GPU tensors; when only NumPy is available it falls back to CPU buffers. Consult [docs/torch.md](docs/torch.md) and `python/bindings.cpp` for the extraction helpers and lifetime semantics.
+Optional CUDA/ROCm backends can be enabled with `-DUSE_CUDA=ON` / `-DUSE_ROCM=ON` so the Python bindings link against the GPU kernels. `t81lib` exposes a compact `TensorMetadata` ABI that carries device, dtype, shape, and stride info, allowing `where`, `clamp`, `lerp`, and `addcmul` to work directly on NumPy arrays or Torch tensors. See [docs/gpu.md](docs/gpu.md) for build flags, device routing, and tensor metadata details.
 
 ## CLI helpers
 
@@ -130,7 +131,7 @@ See [docs/api-overview.md](docs/api-overview.md) for the full surface described
 
 ## Stability & compatibility
 
-- Supported toolchains: recent Clang/GCC/MSVC or `pip install`’s compatible CPython builds; CMake config defaults to host SIMD if available (AVX2/AVX-512, NEON) while falling back to portable kernels.
+- Supported toolchains: C++20-capable Clang/GCC/MSVC (or `pip install`’s compatible CPython builds) with CMake ≥ 3.22; the build auto-detects AVX2/AVX-512/NEON and falls back to portable kernels when those SIMD targets are unavailable.
 - We track the ABI/API surface via `include/t81/t81lib.hpp`; expect the core headers to evolve until we reach a stable v1 release and consult [CHANGELOG.md](CHANGELOG.md) for migration notes.
 
 ## Docs & resources
diff --git a/docs/gpu.md b/docs/gpu.md
@@ -0,0 +1,5 @@
+# GPU backends & tensor metadata
+
+CUDA/ROCm kernels can be built when you configure with `-DUSE_CUDA=ON` or `-DUSE_ROCM=ON` (see `python/CMakeLists.txt`). The bindings expose `t81lib.where`, `t81lib.clamp`, `t81lib.lerp`, and `t81lib.addcmul`, which accept either NumPy buffers or PyTorch tensors and dispatch directly to the GPU kernels.
+
+Dispatch relies on `t81::TensorMetadata` (`include/t81/tensor_metadata.hpp`): a lightweight struct that carries device tags, dtype codes, shape, strides, and `data_ptr` so the dispatcher can call the right CUDA/HIP kernel without copies. When torch is available, `t81lib` automatically wraps tensors; without torch it gracefully falls back to CPU buffers. Review `python/bindings.cpp` for the extraction helpers and lifetime management.
diff --git a/docs/index.md b/docs/index.md
@@ -38,6 +38,7 @@ to understand the balanced ternary engine without digging through specs immediat
   from command-line workflows.
 - **Use cases & demos** — [`docs/use-cases.md`](use-cases.md) and [`examples/README.md`](../examples/README.md) capture the canonical scripts, notebooks, and research stories.
 - **Hardware simulation** — [`docs/hardware.md`](hardware.md) details `t81.hardware.TernaryEmulator`, fuzzy helpers, and the visualizer notebook.
+- **GPU backends** — [`docs/gpu.md`](gpu.md) explains the CUDA/ROCm build flags and tensor metadata routing.
 - **API overview** — [`docs/api-overview.md`](api-overview.md) summarizes the numeric containers and helpers exposed via `<t81/t81lib.hpp>`.
 - **Tests & benchmarks** — [`tests/`](../tests/) documents the unit/property coverage while [`bench/`](../bench/) shows throughput patterns.