Inference on CPU for a 1.58-bit LLM decoding step. Click the image to view the original high-quality video. `HF` denotes the Hugging Face baseline running `bfloat16` on PyTorch.
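The "1.58-bit" name comes from log2(3) ≈ 1.58: each weight takes one of only three values, {-1, 0, +1}. As a rough illustration of how such ternary weights are obtained (a sketch of the common absmean scheme, not necessarily the exact preprocessing this repo uses):

```python
import numpy as np

def quantize_ternary(w: np.ndarray):
    """Round weights to {-1, 0, +1} with a per-tensor absmean scale.

    Illustrative 1.58-bit (ternary) quantization; the repo's actual
    preprocessing may differ.
    """
    scale = np.mean(np.abs(w)) + 1e-8          # absmean scale, eps avoids /0
    q = np.clip(np.round(w / scale), -1, 1)    # ternary codes in {-1, 0, +1}
    return q.astype(np.int8), float(scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_ternary(w)
w_hat = q * s                                  # dequantized approximation of w
```

Because every quantized weight is -1, 0, or +1, a matrix-vector product with `q` needs only additions and subtractions, which is what makes specialized ternary matmul kernels worthwhile.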
```

> [!NOTE]
> `k` might be hardware-dependent, so run the `best_k` benchmark on the same machine
> and device you plan to use for inference, then reuse the generated JSON.

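The workflow in the note can be sketched as follows: time the kernel for several candidate `k` values on the target device, then cache the winner in a JSON file for later runs. This is a hypothetical stand-in with a toy blocked ternary matmul; the repo's actual `best_k` benchmark and file naming may differ.

```python
import json
import platform
import time

import numpy as np

def ternary_matmul(w: np.ndarray, x: np.ndarray, k: int) -> np.ndarray:
    """Toy blocked matmul whose speed depends on the block size k.

    w: (m, n) ternary weights in {-1, 0, +1}; x: (n,) activations.
    """
    m, n = w.shape
    out = np.zeros(m, dtype=x.dtype)
    for start in range(0, n, k):               # process columns in blocks of k
        block = slice(start, start + k)
        out += w[:, block] @ x[block]
    return out

def benchmark_best_k(candidates, m=256, n=256, repeats=5):
    """Return the fastest k on *this* machine (results vary by hardware)."""
    rng = np.random.default_rng(0)
    w = rng.integers(-1, 2, size=(m, n)).astype(np.float32)
    x = rng.standard_normal(n).astype(np.float32)
    timings = {}
    for k in candidates:
        t0 = time.perf_counter()
        for _ in range(repeats):
            ternary_matmul(w, x, k)
        timings[k] = (time.perf_counter() - t0) / repeats
    return min(timings, key=timings.get)

best = benchmark_best_k([16, 32, 64, 128])
# Cache the result per device so later runs can reuse it instead of re-benchmarking.
with open(f"best_k_{platform.machine()}.json", "w") as f:
    json.dump({"best_k": best}, f)
```

The point of caching per device is that the optimal block size depends on cache sizes and SIMD width, so a value tuned on one machine should not be assumed to transfer to another.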
### Run model inference 🤖

Use `integrations/hf/model_infer.py` to run generation from a preprocessed model.
│   └── frontend/        # React dashboard
└── tests/               # Unit and integration tests
```

## Citation 📝

If you use this repository in your research or project, please cite our work:

```bibtex
@inproceedings{dehghankarefficient,
  title={An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks},
  author={Dehghankar, Mohsen and Erfanian, Mahdi and Asudeh, Abolfazl},
  booktitle={Forty-second International Conference on Machine Learning},
}
```