UIC-InDeXLab
diff --git a/.gitignore b/.gitignore
Lines changed: 5 additions & 2 deletions
diff --git a/README.md b/README.md
Lines changed: 39 additions & 21 deletions
diff --git a/assets/cpu_bit_1.png b/assets/cpu_bit_1.png
71.8 KB
diff --git a/assets/cpu_bit_1_58.png b/assets/cpu_bit_1_58.png
65.9 KB
diff --git a/assets/cuda_bit_1.png b/assets/cuda_bit_1.png
61.5 KB
diff --git a/assets/cuda_bit_1_58.png b/assets/cuda_bit_1_58.png
72 KB
diff --git a/assets/plot_shapes_cpu.png b/assets/plot_shapes_cpu.png
65.9 KB
diff --git a/assets/plot_shapes_cuda.png b/assets/plot_shapes_cuda.png
72 KB
diff --git a/benchmarking/bit_1/bench_shapes_cpu.py b/benchmarking/bit_1/bench_shapes_cpu.py
Lines changed: 212 additions & 0 deletions
@@ -65,11 +65,14 @@ env/
 
 .key
 
-benchmarking/**/reports/
+benchmarking/**/reports/**
+!benchmarking/**/reports/
+!benchmarking/**/reports/**/
+!benchmarking/**/reports/**/*.json
 
 out.log
 
 integrations/**/*.json
 integrations/**/*.safetensors
 
-.claude
+.claude
@@ -8,38 +8,56 @@ Reference: [UIC-InDeXLab/RSR](https://github.com/UIC-InDeXLab/RSR)
 
 ```
 RSR-core/
-├── poc/            # Python proof-of-concept implementation
-├── kernels/
-│   ├── cpu/        # CPU kernels (C/C++)
-│   └── cuda/       # CUDA GPU kernels
-├── tests/          # Unit and integration tests
-└── benchmarks/     # Performance benchmarks
+├── multiplier/             # Python wrappers for kernels
+│   ├── bit_1/              # 1-bit (binary) multipliers (CPU/CUDA)
+│   └── bit_1_58/           # 1.58-bit (ternary) multipliers (CPU/CUDA)
+├── kernels/                # Low-level C/CUDA kernel source
+│   ├── bit_1/
+│   │   ├── cpu/            #   C kernels
+│   │   └── cuda/           #   CUDA kernels (.cu)
+│   └── bit_1_58/
+│       ├── cpu/            #   C kernels
+│       └── cuda/           #   CUDA kernels (.cu)
+├── integrations/           # Model integrations
+│   └── hf/                 #   Hugging Face integration
+├── benchmarking/           # Benchmarking scripts & results
+└── tests/                  # Unit and integration tests
 ```
 
 ## Benchmark Results
 
 ### Matrix-Vector Multiplication
 
-#### CPU:
+#### CPU 🖥️
+
+| 1-bit | 1.58-bit |
+|:---:|:---:|
+| ![1-bit CPU](assets/cpu_bit_1.png) | ![1.58-bit CPU](assets/cpu_bit_1_58.png) |
+
+#### CUDA ⚡
 
-#### CUDA:
+| 1-bit | 1.58-bit |
+|:---:|:---:|
+| ![1-bit CUDA](assets/cuda_bit_1.png) | ![1.58-bit CUDA](assets/cuda_bit_1_58.png) |
 
 ### Ternary (1.58bit) LLMs
 
-Speedup is computed from `Avg Time` against the `HF bfloat16` baseline for the same model.
+Speedup is computed from average time against the Hugging Face `bfloat16` baseline for the same model.
 
 #### CPU 🖥️
-| Model | HF Time | RSR (ours) Time | HF Tok/s | RSR (ours) Tok/s | Speedup vs HF |
-| --- | ---: | ---: | ---: | ---: | ---: |
-| Falcon3-10B-Instruct-1.58bit | 351.215s | **5.663s** | 0.2 | **11.3** | **62.0x** |
-| Llama3-8B-1.58-100B-tokens | 261.557s | **4.862s** | 0.2 | **13.4** | **53.8x** |
-| bitnet-b1.58-2B-4T-bf16 | 31.446s | **2.258s** | 2.1 | **28.8** | **13.9x** |
-| bitnet-b1.58-2B-4T | 4.582s | **2.221s** | 14.2 | **29.3** | **2.1x** |
+
+| Model | HF Tok/s | RSR Tok/s | Speedup |
+| :--- | ---: | ---: | ---: |
+| Falcon3-10B-Instruct-1.58bit | 0.2 | **11.3** | **62.0x** |
+| Llama3-8B-1.58-100B-tokens | 0.2 | **13.4** | **53.8x** |
+| bitnet-b1.58-2B-4T-bf16 | 2.1 | **28.8** | **13.9x** |
+| bitnet-b1.58-2B-4T | 14.2 | **29.3** | **2.1x** |
 
 #### CUDA ⚡
-| Model | HF Time | RSR (ours) Time | HF Tok/s | RSR (ours) Tok/s | Speedup vs HF |
-| --- | ---: | ---: | ---: | ---: | ---: |
-| Falcon3-10B-Instruct-1.58bit | 2.536s | **1.351s** | 25.2 | **47.4** | **1.9x** |
-| Llama3-8B-1.58-100B-tokens | 2.035s | **1.097s** | 31.9 | **59.3** | **1.9x** |
-| bitnet-b1.58-2B-4T-bf16 | 1.966s | **1.133s** | 33.1 | **57.4** | **1.7x** |
-| bitnet-b1.58-2B-4T | 1.563s | **1.139s** | 41.6 | **57.1** | **1.4x** |
+
+| Model | HF Tok/s | RSR Tok/s | Speedup |
+| :--- | ---: | ---: | ---: |
+| Falcon3-10B-Instruct-1.58bit | 25.2 | **47.4** | **1.9x** |
+| Llama3-8B-1.58-100B-tokens | 31.9 | **59.3** | **1.9x** |
+| bitnet-b1.58-2B-4T-bf16 | 33.1 | **57.4** | **1.7x** |
+| bitnet-b1.58-2B-4T | 41.6 | **57.1** | **1.4x** |
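A minimal sketch of the speedup arithmetic, assuming speedup is the baseline time divided by the RSR time. The two times are the Falcon3-10B CPU values from the table rows this change removes, and the `speedup` helper is hypothetical, not part of the repository:

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Falcon3-10B-Instruct-1.58bit on CPU, times taken from the removed table rows.
hf_seconds = 351.215
rsr_seconds = 5.663

print(f"{speedup(hf_seconds, rsr_seconds):.1f}x")  # prints 62.0x
```

Because the Tok/s columns are rounded to one decimal place, dividing them (11.3 / 0.2) does not reproduce the 62.0x figure; the time-based ratio does.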
@@ -0,0 +1,212 @@
+"""
+Benchmark CPU binary (1-bit) multipliers on a given list of matrix shapes.
+
+Edit SHAPES and K_VALUES below to configure the benchmark.
+Timing: median inference latency over REPEATS timed runs after WARMUP untimed calls (preprocessing excluded).
+"""
+
+import csv
+import importlib
+import inspect
+import os
+import sys
+import time
+from pathlib import Path
+
+import numpy as np
+import torch
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
+
+# ---------------------------------------------------------------------------
+# Configure here
+# ---------------------------------------------------------------------------
+
+SHAPES = [
+    (1024, 1024),
+    (2048, 2048),
+    (4096, 4096),
+    (8192, 8192),
+    (16384, 16384),
+    (32768, 32768),
+]
+
+K_VALUES = [2, 4, 6, 8, 10]
+
+# Limit to these method labels; empty list = all discovered methods
+METHODS = ["BitNet", "RSR", "pytorch"]
+
+REPEATS = 30
+WARMUP = 10
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def random_binary_matrix(rows, cols):
+    return torch.randint(0, 2, (rows, cols), dtype=torch.float32)
+
+
+def bench(multiplier, v, warmup=WARMUP, repeats=REPEATS):
+    for _ in range(warmup):
+        multiplier(v)
+    times = []
+    for _ in range(repeats):
+        t0 = time.perf_counter()
+        multiplier(v)
+        t1 = time.perf_counter()
+        times.append(t1 - t0)
+    return np.median(times)
+
+
+def fmt(t):
+    if t is None or np.isnan(t):
+        return "N/A"
+    return f"{t * 1e3:.3f}ms"
+
+
+# ---------------------------------------------------------------------------
+# Version discovery
+# ---------------------------------------------------------------------------
+
+_LABEL_MAP = {
+    "bitnet": "BitNet",
+    "tmac": "T-MAC",
+    "rsr_cpp": "v1",
+    "rsr_cpp_v2_4": "v2.4",
+    "rsr_cpp_v4_2": "v4.2",
+    "rsr_adaptive": "adaptive",
+    "rsr_cpp_nonsquare": "RSR",
+}
+
+_EXCLUDE = {"__init__", "base"}
+
+
+def _stem_to_label(stem):
+    if stem in _LABEL_MAP:
+        return _LABEL_MAP[stem]
+    if stem.startswith("rsr_cpp_v"):
+        return "v" + stem[len("rsr_cpp_v") :].replace("_", ".")
+    return stem
+
+
+def discover_versions():
+    versions = []
+
+    from multiplier.bit_1.pytorch import PytorchBF16Multiplier
+
+    versions.append(("pytorch", PytorchBF16Multiplier, False))
+
+    cpu_dir = Path(__file__).resolve().parents[2] / "multiplier" / "bit_1" / "cpu"
+    for p in sorted(cpu_dir.glob("*.py")):
+        if p.stem in _EXCLUDE or p.stem.startswith("_"):
+            continue
+        full = f"multiplier.bit_1.cpu.{p.stem}"
+        label = _stem_to_label(p.stem)
+        try:
+            mod = importlib.import_module(full)
+        except Exception as e:
+            print(f"  [skip {p.stem}: {e}]")
+            continue
+        cls = next(
+            (
+                obj
+                for _, obj in inspect.getmembers(mod, inspect.isclass)
+                if obj.__module__ == full and obj.__name__.endswith("Multiplier")
+            ),
+            None,
+        )
+        if cls is None:
+            continue
+        needs_k = "k" in inspect.signature(cls.__init__).parameters
+        versions.append((label, cls, needs_k))
+
+    return versions
+
+
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+
+
+def main():
+    versions = discover_versions()
+    if METHODS:
+        versions = [(lbl, cls, nk) for lbl, cls, nk in versions if lbl in METHODS]
+
+    baselines = [(lbl, cls) for lbl, cls, nk in versions if not nk]
+    rsr_vers = [(lbl, cls) for lbl, cls, nk in versions if nk]
+    all_labels = [lbl for lbl, _ in baselines] + [lbl for lbl, _ in rsr_vers]
+
+    reports_dir = Path(__file__).parent / "reports"
+    reports_dir.mkdir(parents=True, exist_ok=True)
+    csv_path = reports_dir / "results_shapes_cpu.csv"
+    csv_file = open(csv_path, "w", newline="")
+    writer = csv.writer(csv_file)
+    writer.writerow(["rows", "cols", "k"] + all_labels)
+
+    col_w = 12
+
+    for rows, cols in SHAPES:
+        print(f"\n{'='*80}")
+        print(f"  shape = ({rows}, {cols})")
+        print(f"{'='*80}")
+
+        M = random_binary_matrix(rows, cols)
+        v = torch.randn(cols, dtype=torch.float32)
+
+        base_times = []
+        for lbl, cls in baselines:
+            try:
+                m = cls(M)
+                t = bench(m, v)
+            except Exception as e:
+                print(f"  [error {lbl}: {e}]")
+                t = float("nan")
+            base_times.append(t)
+
+        header = f"  {'k':>4}  " + "  ".join(f"{c:>{col_w}}" for c in all_labels)
+        print(f"\n  [Inference — median over {REPEATS} runs]")
+        print(header)
+        print("  " + "-" * (len(header) - 2))
+
+        for k in K_VALUES:
+            rsr_times = []
+            for lbl, cls in rsr_vers:
+                if rows % k != 0:  # skip k values that do not divide the row count
+                    rsr_times.append(float("nan"))
+                    continue
+                try:
+                    m = cls(M, k)
+                    rsr_times.append(bench(m, v))
+                except Exception as e:
+                    print(f"  [error {lbl} k={k}: {e}]")
+                    rsr_times.append(float("nan"))
+
+            all_times = base_times + rsr_times
+            valid = [t for t in all_times if not np.isnan(t)]
+            best = min(valid) if valid else None
+
+            cells = []
+            for t in all_times:
+                s = fmt(t)
+                if best is not None and not np.isnan(t) and abs(t - best) < 1e-9:
+                    s = f"*{s}*"
+                cells.append(s.rjust(col_w))
+
+            print(f"  {k:>4}  " + "  ".join(cells))
+            writer.writerow(
+                [rows, cols, k]
+                + ["" if np.isnan(t) else round(t * 1e3, 6) for t in all_times]
+            )
+            csv_file.flush()
+
+        print()
+
+    csv_file.close()
+    print(f"Results saved to {csv_path}")
+
+
+if __name__ == "__main__":
+    main()
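
The warmup-then-median timing used by `bench` can be sketched without PyTorch or NumPy. This is an illustrative stand-in, not the script's code: `median_latency` and `sum_of_squares` are hypothetical names, and `statistics.median` takes the place of `np.median`.

```python
import statistics
import time


def median_latency(fn, arg, warmup=3, repeats=9):
    """Call fn(arg) `warmup` times untimed, then return the median of `repeats` timed calls."""
    for _ in range(warmup):
        fn(arg)  # untimed: lets caches and lazy initialisation settle
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(arg)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)


def sum_of_squares(values):
    """Hypothetical stand-in workload for a multiplier call."""
    return sum(x * x for x in values)


latency = median_latency(sum_of_squares, list(range(1000)))
print(f"{latency * 1e3:.3f}ms")
```

The median is less sensitive than the mean to occasional slow runs caused by scheduling or cache effects, which is presumably why the script reports it.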