@@ -13,8 +13,8 @@ Inference on CPU for a 1.58-bit LLM decoding step. Click the image to view the o

[![RSR vs Baseline](assets/rsr_baseline_compare.webp)](https://drive.google.com/file/d/1ub-MITJUepmfBLkyUZFb50hbJsuhgwCH/view?usp=sharing)

-## Installation 🛠️
-
+## Usage 🛠️
+### Installation 📦
**Prerequisites:** Python >= 3.10, a C compiler for CPU kernels, and optionally CUDA for GPU support.

```bash
@@ -23,9 +23,146 @@ cd RSR-Core
pip install -e .
```

+### Prepare a model (once) 🧱
+Run `integrations/hf/model_prep.py` once per model to preprocess the ternary
+weights and save the RSR metadata needed for inference.
+
+```bash
+python -m integrations.hf.model_prep \
+  --model microsoft/bitnet-b1.58-2B-4T \
+  --output ./preprocessed_model \
+  --device cpu \
+  --trust-remote-code \
+  --best-k-json benchmarking/bit_1_58/reports/best_k_cpu.json
+```
+
+```text
+CLI args for integrations/hf/model_prep.py:
+  --model, -m          HuggingFace model ID or local path (required)
+  --output, -o         Output directory for the preprocessed model (required)
+  --k                  Block height for RSR decomposition (default: from best_k_{device}.json)
+  --version            RSR multiplier version to use (default: adaptive)
+  --device             Device for model loading: cpu or cuda (default: cpu)
+  --trust-remote-code  Allow remote code when loading HuggingFace models
+  --best-k-json        Optional path to a per-layer best-k JSON file
+                       Default: benchmarking/bit_1_58/reports/best_k_{device}.json
+```
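+
+Per the inference CLI below, the output directory holds `rsr_config.json`
+plus safetensors artifacts. A minimal sketch for peeking at that metadata,
+assuming `rsr_config.json` is plain JSON with the `model_name` field
+mentioned under `--tokenizer`:
+
+```python
+# Sketch: inspect the metadata written by model_prep.py.
+# Assumes rsr_config.json is plain JSON; other keys are not documented here.
+import json
+from pathlib import Path
+
+cfg = json.loads(Path("preprocessed_model/rsr_config.json").read_text())
+print(cfg.get("model_name"))  # tokenizer fallback used by model_infer.py
+```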
+
+### Run model inference 🤖
+Use `integrations/hf/model_infer.py` to run generation from a preprocessed
+model directory. The default backend is `rsr`.
+
+```bash
+python -m integrations.hf.model_infer \
+  --model-dir ./preprocessed_model \
+  --backend rsr \
+  --device cpu \
+  --prompt "Write the numbers from one to ten in words." \
+  --max-new-tokens 64 \
+  --stream
+```
+
+```text
+CLI args for integrations/hf/model_infer.py:
+  --model-dir         Directory with rsr_config.json and safetensors artifacts
+                      (default: integrations/hf)
+  --backend           Inference backend: rsr or hf (default: rsr)
+  --tokenizer         Optional tokenizer source
+                      Default: rsr_config.json:model_name
+  --device            Target device; auto-detected from model-dir suffix
+                      (_cpu / _cuda) if omitted
+  --dtype             Optional dtype cast: float32, float16, or bfloat16
+  --prompt            Prompt text to generate from (required)
+  --max-new-tokens    Maximum number of tokens to generate (default: 64)
+  --no-chat-template  Tokenize the raw prompt directly
+  --stream            Stream decoded output as tokens are generated
+```
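+
+To sanity-check RSR generations against the plain HuggingFace path, the same
+CLI can be driven from Python. A minimal sketch using only the flags
+documented above:
+
+```python
+# Sketch: run the documented CLI once per backend and compare outputs by eye.
+import subprocess
+import sys
+
+for backend in ("rsr", "hf"):  # both values are documented for --backend
+    subprocess.run(
+        [
+            sys.executable, "-m", "integrations.hf.model_infer",
+            "--model-dir", "./preprocessed_model",
+            "--backend", backend,
+            "--device", "cpu",
+            "--prompt", "Write the numbers from one to ten in words.",
+            "--max-new-tokens", "64",
+        ],
+        check=True,
+    )
+```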
+
+### Benchmark on your machine ⏱️
+Use the scripts under `benchmarking/` to reproduce the kernel-level matvec
+benchmarks and the end-to-end LLM inference numbers on your machine.
+
+**Find the best `k` for ternary RSR**
+
+```bash
+python -m benchmarking.bit_1_58.bench_best_k \
+  --device cpu \
+  --shapes 2560x2560 4096x14336 \
+  --k-values 2 4 6 8 10 12 \
+  --warmup 10 \
+  --repeats 30
+```
+
+```text
+CLI args for benchmarking/bit_1_58/bench_best_k.py:
+  --device    Target device: cpu or cuda (required)
+  --shapes    Optional list of matrix shapes in NxM format
+              Default: all known preprocessed model shapes
+  --k-values  Optional list of k values to test
+              Default: 2 4 6 8 10 12
+  --warmup    Warmup iterations before timing (default: 10)
+  --repeats   Timed iterations per shape/k (default: 30)
+```
+
+This writes:
+`benchmarking/bit_1_58/reports/best_k_{device}.csv` and
+`benchmarking/bit_1_58/reports/best_k_{device}.json`
+
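+The JSON report is what `--best-k-json` in `model_prep.py` consumes. A minimal
+sketch for loading it, assuming it is plain JSON (the exact schema is not
+documented here):
+
+```python
+# Sketch: load the per-shape best-k report written by bench_best_k.
+import json
+
+with open("benchmarking/bit_1_58/reports/best_k_cpu.json") as f:
+    best_k = json.load(f)
+print(best_k)  # winning k per benchmarked shape (schema assumed, not documented)
+```
+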
+**Benchmark matrix-vector multiplication**
+
+The shape benchmark scripts take no CLI arguments; configure them by editing
+the constants at the top of each script (`SHAPES`, `K_VALUES`, `METHODS`,
+`REPEATS`, and `WARMUP`), as sketched below.
+
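+For illustration, an edited configuration header might look like this; the
+constant names come from the scripts, but the values (and the entries in
+`METHODS`) are assumptions rather than the shipped defaults:
+
+```python
+# Sketch: constants at the top of a bench_shapes_*.py script (example values).
+SHAPES = [(2560, 2560), (4096, 14336)]  # (rows, cols) matvec shapes to time
+K_VALUES = [2, 4, 6, 8, 10, 12]         # RSR block heights to sweep
+METHODS = ["rsr", "baseline"]           # multiplier implementations to compare
+REPEATS = 30                            # timed iterations per configuration
+WARMUP = 10                             # warmup iterations before timing
+```
+
+Then run the scripts directly:
+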
+```bash
+python benchmarking/bit_1/bench_shapes_cpu.py
+python benchmarking/bit_1/bench_shapes_cuda.py
+python benchmarking/bit_1_58/bench_shapes_cpu.py
+python benchmarking/bit_1_58/bench_shapes_cuda.py
+```
+
+Reports are written to:
+`benchmarking/bit_1/reports/results_shapes_{device}.csv`
+`benchmarking/bit_1_58/reports/results_shapes_{device}.csv`
+
+**Benchmark end-to-end LLM inference**
+
+Pass either a single preprocessed model directory or a parent directory that
+contains multiple `*_cpu` or `*_cuda` model directories.
+
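+The parent-directory mode relies on this suffix convention. A minimal sketch
+of the selection, assuming only the documented `*_cpu` naming rule and the
+directory used in the example below:
+
+```python
+# Sketch: list the preprocessed model dirs a --device cpu run would consider.
+from pathlib import Path
+
+parent = Path("integrations/hf/preprocessed")
+cpu_models = sorted(p for p in parent.glob("*_cpu") if p.is_dir())
+print([p.name for p in cpu_models])
+```
+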
+```bash
+python -m benchmarking.llms.bench_inference \
+  --model-dir integrations/hf/preprocessed \
+  --device cpu \
+  --prompt "Write the numbers from one to two hundred in words separated by commas only:" \
+  --max-new-tokens 64 \
+  --warmup 1 \
+  --repeats 3 \
+  --backends rsr hf_float32 hf_bfloat16
+```
+
+```text
+CLI args for benchmarking/llms/bench_inference.py:
+  --model-dir         Single preprocessed model directory or parent directory
+                      containing multiple preprocessed models (required)
+  --prompt            Prompt text to generate from
+                      Default: "Write the numbers from one to two hundred in
+                      words separated by commas only:"
+  --max-new-tokens    Maximum number of generated tokens (default: 64)
+  --warmup            Warmup generations before timing (default: 1)
+  --repeats           Timed generations per backend/model (default: 3)
+  --no-chat-template  Tokenize the raw prompt directly
+  --device            Target device and model suffix filter: cpu or cuda
+                      (required)
+  --backends          Optional backend list:
+                      rsr, hf_float32, hf_bfloat16, hf_float16
+                      Default: rsr + the standard HF dtypes for the device
+```
+
## Benchmark Results 📊

-### Matrix-Vector Multiplication
+### Matrix-Vector Multiplication 🧮

#### CPU 🖥️

@@ -39,7 +176,7 @@ pip install -e .
| :---: | :---: |
| ![1-bit CUDA](assets/cuda_bit_1.png) | ![1.58-bit CUDA](assets/cuda_bit_1_58.png) |

-### Ternary (1.58bit) LLMs
+### Ternary (1.58-bit) LLMs 🤖

Speedup is computed against the HuggingFace `bfloat16` baseline for the same model.

@@ -62,10 +199,7 @@ Speedup is computed against the HuggingFace `bfloat16` baseline for the same mod
| bitnet-b1.58-2B-4T | 41.6 | **57.1** | **1.4x** |

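+Reading the bitnet-b1.58-2B-4T row above, for example: 57.1 / 41.6 ≈ 1.37,
+which is the reported **1.4x**.
+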
## Updates 📝
-
-<!--
-- Add project updates here.
--->
+* [03/25/2026] Added support for the HuggingFace models interface.

## Project Structure 🗂️
