Commit fe0bd37: Update readme.md
Parent: 86583e8
1 file changed: README.md (142 additions, 8 deletions)

Inference on CPU for a 1.58-bit LLM decoding step. Click the image to view the o

[![RSR vs Baseline](assets/rsr_baseline_compare.webp)](https://drive.google.com/file/d/1ub-MITJUepmfBLkyUZFb50hbJsuhgwCH/view?usp=sharing)

## Usage 🛠️

### Installation 📦

**Prerequisites:** Python >= 3.10, a C compiler for CPU kernels, and optionally CUDA for GPU support.

```bash
cd RSR-Core
pip install -e .
```
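
To sanity-check the prerequisites, a few generic toolchain probes (standard commands, nothing project-specific):

```bash
python --version   # should report Python 3.10 or newer
cc --version       # any working C compiler, used to build the CPU kernels
nvcc --version     # optional: only needed for CUDA/GPU support
```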

### Prepare a model (once) 🧱

Run `integrations/hf/model_prep.py` once per model to preprocess the ternary weights and save the RSR metadata needed for inference.

```bash
python -m integrations.hf.model_prep \
  --model microsoft/bitnet-b1.58-2B-4T \
  --output ./preprocessed_model \
  --device cpu \
  --trust-remote-code \
  --best-k-json benchmarking/bit_1_58/reports/best_k_cpu.json
```

```text
CLI args for integrations/hf/model_prep.py:
  --model, -m          HuggingFace model ID or local path (required)
  --output, -o         Output directory for the preprocessed model (required)
  --k                  Block height for RSR decomposition
                       (default: from best_k_{device}.json)
  --version            RSR multiplier version to use (default: adaptive)
  --device             Device for model loading: cpu or cuda (default: cpu)
  --trust-remote-code  Allow remote code when loading HuggingFace models
  --best-k-json        Optional path to a per-layer best-k JSON file
                       Default: benchmarking/bit_1_58/reports/best_k_{device}.json
```

### Run model inference 🤖

Use `integrations/hf/model_infer.py` to run generation from a preprocessed model directory. The default backend is `rsr`.

```bash
python -m integrations.hf.model_infer \
  --model-dir ./preprocessed_model \
  --backend rsr \
  --device cpu \
  --prompt "Write the numbers from one to ten in words." \
  --max-new-tokens 64 \
  --stream
```

```text
CLI args for integrations/hf/model_infer.py:
  --model-dir          Directory with rsr_config.json and safetensors artifacts
                       (default: integrations/hf)
  --backend            Inference backend: rsr or hf (default: rsr)
  --tokenizer          Optional tokenizer source
                       Default: rsr_config.json:model_name
  --device             Target device; auto-detected from model-dir suffix
                       (_cpu / _cuda) if omitted
  --dtype              Optional dtype cast: float32, float16, or bfloat16
  --prompt             Prompt text to generate from (required)
  --max-new-tokens     Maximum number of tokens to generate (default: 64)
  --no-chat-template   Tokenize the raw prompt directly
  --stream             Stream decoded output as tokens are generated
```
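
Since `--backend` also accepts `hf`, the same command can be rerun against the stock HuggingFace path for a quick output comparison, using only the flags documented above:

```bash
python -m integrations.hf.model_infer \
  --model-dir ./preprocessed_model \
  --backend hf \
  --device cpu \
  --prompt "Write the numbers from one to ten in words." \
  --max-new-tokens 64
```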

#### Benchmark on your machine ⏱️

Use the scripts under `benchmarking/` to reproduce the kernel-level matvec benchmarks and the end-to-end LLM inference numbers on your own machine.

**Find the best `k` for ternary RSR**

```bash
python -m benchmarking.bit_1_58.bench_best_k \
  --device cpu \
  --shapes 2560x2560 4096x14336 \
  --k-values 2 4 6 8 10 12 \
  --warmup 10 \
  --repeats 30
```

```text
CLI args for benchmarking/bit_1_58/bench_best_k.py:
  --device     Target device: cpu or cuda (required)
  --shapes     Optional list of matrix shapes in NxM format
               Default: all known preprocessed model shapes
  --k-values   Optional list of k values to test
               Default: 2 4 6 8 10 12
  --warmup     Warmup iterations before timing (default: 10)
  --repeats    Timed iterations per shape/k (default: 30)
```

This writes `benchmarking/bit_1_58/reports/best_k_{device}.csv` and `benchmarking/bit_1_58/reports/best_k_{device}.json`.
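
The JSON report can then be fed back into model preparation through `model_prep.py --best-k-json`, as in the preparation example above.
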
**Benchmark matrix-vector multiplication**

The shape benchmark scripts take no CLI arguments. Configure them by editing the constants at the top of each script: `SHAPES`, `K_VALUES`, `METHODS`, `REPEATS`, and `WARMUP`.
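
As a rough illustration, such a configuration block might look like the sketch below; the constant names come from this README, while the values and method names are assumptions rather than the scripts' actual defaults.

```python
# Hypothetical configuration constants at the top of a bench_shapes_*.py
# script. Only the names are documented here; every value is illustrative.
SHAPES = [(2560, 2560), (4096, 14336)]  # (rows, cols) of weight matrices to time
K_VALUES = [2, 4, 6, 8, 10, 12]         # RSR block heights to sweep
METHODS = ["rsr", "baseline"]           # kernels to compare (assumed names)
REPEATS = 30                            # timed iterations per shape/k pair
WARMUP = 10                             # untimed warmup iterations beforehand
```

Then run the scripts: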

```bash
python benchmarking/bit_1/bench_shapes_cpu.py
python benchmarking/bit_1/bench_shapes_cuda.py
python benchmarking/bit_1_58/bench_shapes_cpu.py
python benchmarking/bit_1_58/bench_shapes_cuda.py
```

Reports are written to:
`benchmarking/bit_1/reports/results_shapes_{device}.csv`
`benchmarking/bit_1_58/reports/results_shapes_{device}.csv`

**Benchmark end-to-end LLM inference**

Pass either a single preprocessed model directory or a parent directory that contains multiple `*_cpu` or `*_cuda` model directories.
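
For example, a parent directory could be laid out as below (names are illustrative, inferred from the `_cpu`/`_cuda` suffix convention and the artifacts listed for `model_infer.py`):

```text
integrations/hf/preprocessed/
├── bitnet-b1.58-2B-4T_cpu/     # one preprocessed model per device
│   ├── rsr_config.json
│   └── *.safetensors
└── bitnet-b1.58-2B-4T_cuda/
    ├── rsr_config.json
    └── *.safetensors
```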

```bash
python -m benchmarking.llms.bench_inference \
  --model-dir integrations/hf/preprocessed \
  --device cpu \
  --prompt "Write the numbers from one to two hundred in words separated by commas only:" \
  --max-new-tokens 64 \
  --warmup 1 \
  --repeats 3 \
  --backends rsr hf_float32 hf_bfloat16
```

```text
CLI args for benchmarking/llms/bench_inference.py:
  --model-dir          Single preprocessed model directory or parent directory
                       containing multiple preprocessed models (required)
  --prompt             Prompt text to generate from
                       Default: "Write the numbers from one to two hundred in
                       words separated by commas only:"
  --max-new-tokens     Maximum number of generated tokens (default: 64)
  --warmup             Warmup generations before timing (default: 1)
  --repeats            Timed generations per backend/model (default: 3)
  --no-chat-template   Tokenize the raw prompt directly
  --device             Target device and model suffix filter: cpu or cuda
                       (required)
  --backends           Optional backend list:
                       rsr, hf_float32, hf_bfloat16, hf_float16
                       Default: rsr + the standard HF dtypes for the device
```

## Benchmark Results 📊

### Matrix-Vector Multiplication 🧮

#### CPU 🖥️

|:---:|:---:|
| ![1-bit CUDA](assets/cuda_bit_1.png) | ![1.58-bit CUDA](assets/cuda_bit_1_58.png) |

### Ternary (1.58-bit) LLMs 🤖

Speedup is computed against the HuggingFace `bfloat16` baseline for the same model.
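For the model below, that works out to 57.1 / 41.6 ≈ 1.4x.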
| bitnet-b1.58-2B-4T | 41.6 | **57.1** | **1.4x** |

## Updates 📝

* [03/25/2026] Added support for the HuggingFace models interface.

## Project Structure 🗂️