Clarify performance methodology and add industry validation roadmap
- Update performance claims to explicitly state 'vs. modeled baseline' comparison
- Add methodology note clarifying baseline is theoretical, not measured hardware
- Add Future Work section outlining FPGA integration, MLPerf benchmarks, and industry comparisons
- Add Related Work section citing attention accelerators and distinguishing Garuda's approach
- Update references with attention mechanism and accelerator papers
- Note: Industry comparisons pending FPGA/ASIC implementation
**README.md** (56 additions, 4 deletions)

@@ -13,7 +13,7 @@

**Garuda** is a CVXIF coprocessor that extends RISC-V with custom INT8 multiply-accumulate (MAC) instructions for efficient neural network inference. Unlike throughput-oriented accelerators that require batching, Garuda optimizes for **batch-1 tail latency** (p99), making it ideal for real-time transformer inference, voice assistants, and local LLM attention workloads.
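
For reference, the operation these MAC instructions target is a small INT8 dot product with 32-bit accumulation (the attention q·k step). A minimal scalar C version is sketched below; the vector length and operand layout are placeholders for illustration, not the RTL's supported shapes.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar INT8 dot product with 32-bit accumulation: the attention
 * q·k operation that Garuda's fused MAC instructions are designed to
 * run inside the coprocessor instead of one CPU instruction per
 * element. Vector length is a placeholder for illustration.          */
static int32_t int8_dot(const int8_t *q, const int8_t *k, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)q[i] * (int32_t)k[i];
    return acc;
}
```
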
**Key advantage**: Achieves **7.5-9× latency reduction** vs. modeled baseline (CPU-style SIMD_DOT with dispatch jitter) for attention dot products (p99: 307→34 cycles) while maintaining competitive throughput for larger dense layers. *Industry comparisons pending FPGA/ASIC implementation.*

*Measured via `tb_attention_microkernel_latency.sv` (1000 trials, Icarus simulation). Baseline models CPU-style SIMD_DOT loop with dispatch jitter (0-12 cycle random bubbles); microkernel uses deterministic internal loop.*

**Methodology note**: The baseline is a *modeled* CPU-style implementation (theoretical dispatch overhead), not measured hardware. Industry comparisons against real RISC-V CPUs (CVA6, Rocket, BOOM) and other accelerators require FPGA/ASIC implementation and will be published as future work.

**Why this matters**: Lower tail latency (p99) is critical for real-time applications. Garuda's microkernel engine eliminates dispatch overhead by running the dot-product loop internally, achieving deterministic, predictable latency.
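
To make the comparison above concrete, the C sketch below mirrors the testbench's latency model for illustration only: 1000 trials, a baseline loop that can absorb a random 0-12 cycle dispatch bubble on every iteration, a microkernel with a fixed internal-loop cost, and p99 read from the sorted samples. `VEC_LEN` and the cycle constants are placeholders, not the RTL's measured values.

```c
#include <stdio.h>
#include <stdlib.h>

#define TRIALS   1000  /* matches the 1000-trial testbench run        */
#define VEC_LEN  16    /* placeholder attention dot-product length    */

/* Placeholder cycle costs; illustrative only, not the RTL's numbers. */
#define BASE_ITER_CYCLES     4
#define UKERNEL_FIXED_CYCLES 30

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Baseline model: a CPU-style SIMD_DOT loop where every iteration can
 * absorb a random 0-12 cycle dispatch bubble.                         */
static int baseline_latency(void)
{
    int cycles = 0;
    for (int i = 0; i < VEC_LEN; i++)
        cycles += BASE_ITER_CYCLES + (rand() % 13);
    return cycles;
}

/* Microkernel model: one dispatch, then a deterministic internal loop,
 * so every trial costs the same number of cycles.                     */
static int microkernel_latency(void)
{
    return UKERNEL_FIXED_CYCLES;
}

int main(void)
{
    int base[TRIALS], ukern[TRIALS];

    for (int t = 0; t < TRIALS; t++) {
        base[t]  = baseline_latency();
        ukern[t] = microkernel_latency();
    }
    qsort(base,  TRIALS, sizeof base[0],  cmp_int);
    qsort(ukern, TRIALS, sizeof ukern[0], cmp_int);

    /* p99: the 990th of 1000 sorted samples (index 989). */
    printf("baseline    p99: %d cycles\n", base[TRIALS * 99 / 100 - 1]);
    printf("microkernel p99: %d cycles\n", ukern[TRIALS * 99 / 100 - 1]);
    return 0;
}
```

Because the microkernel's modeled cost has no per-iteration jitter term, its p99 equals its fixed cost, which is why the tail narrows so sharply in this model.
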
---

## Future Work

The following enhancements are planned to validate Garuda's performance against industry standards:

- **FPGA Integration with CVA6 CPU**: Full integration of Garuda with CVA6 on FPGA (e.g., Xilinx Zynq-7000) to measure end-to-end latency with real CPU interaction, cache coherence, and memory hierarchy.
- **MLPerf Inference Benchmarks**: Submission to the MLPerf Inference benchmark suite to compare against published accelerator results (TPU, Tensor Cores, etc.) using standardized workloads and metrics.
- **Comparison Against Published Accelerator Numbers**: Quantitative comparison against cycle-accurate models and published results from:
  - Google TPU (attention microkernel latency)
  - NVIDIA Tensor Cores (small-batch dot products)
  - Apple Neural Engine (transformer inference)
  - Open-source RISC-V accelerators (Snitch, NVDLA on RISC-V)
- **ASIC Implementation**: Synthesis to an ASIC process node (e.g., TSMC 28nm or 7nm) for area, power, and timing analysis with target-specific optimizations.
- **Full Transformer End-to-End Benchmark**: Measure complete attention layer latency (Q·K·V with softmax) and compare against software baselines on real hardware.

---

## Related Work

Garuda builds upon research in low-latency neural network accelerators and attention mechanism optimization. Related approaches include:

**Attention Accelerators:**

- **Spatial Accelerators**: Systolic arrays (TPU-style) excel at large-batch GEMMs but suffer from underutilization for small-batch attention queries. Garuda addresses this with fused microkernel instructions that eliminate dispatch overhead.
- **In-Memory Computing**: PIM (processing-in-memory) accelerators reduce memory-bandwidth pressure but introduce latency from memory access patterns. Garuda's operand staging provides cache-hot access for low-latency loops.
- **Multi-Head Attention Optimization**: Prior work optimizes multi-head parallelism; Garuda focuses on single-head tail latency for real-time applications where head-level parallelism may be limited.

**RISC-V Accelerator Extensions:**

- **CVXIF Coprocessors**: The standard CVXIF interface enables modular accelerator design; Garuda extends this with attention-specific fused operations (`ATT_DOT_*` instructions). See the sketch after this list for a hypothetical C-level invocation.
- **Register Renaming**: Multi-issue support via a rename table enables 4-wide instruction issue, similar to modern out-of-order CPUs, but tailored for deterministic coprocessor execution.
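
On the software side, a CVXIF coprocessor instruction is ordinarily emitted as a custom-opcode RISC-V instruction; the sketch below shows one common pattern using the GNU assembler's `.insn` directive from C inline assembly. The `att_dot_step` name, opcode, and funct fields are hypothetical placeholders for illustration, not Garuda's actual `ATT_DOT_*` encodings.

```c
#include <stdint.h>

/* Hypothetical wrapper for one ATT_DOT-style R-type instruction.
 * Opcode 0x0b is the RISC-V CUSTOM-0 space; the funct3/funct7
 * values below are placeholders, not Garuda's real encodings.    */
static inline int32_t att_dot_step(int32_t rs1_val, int32_t rs2_val)
{
    int32_t rd_val;
    __asm__ volatile(
        ".insn r 0x0b, 0x0, 0x00, %0, %1, %2"
        : "=r"(rd_val)
        : "r"(rs1_val), "r"(rs2_val));
    return rd_val;
}
```
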
**What Makes Garuda Different:**

- **Batch-1 Tail Latency Focus**: Unlike throughput-oriented accelerators, Garuda explicitly optimizes p99 latency for small-batch workloads (batch=1), making it suitable for real-time inference.
- **Deterministic Execution**: Internal loop execution within the microkernel engine eliminates variance from CPU dispatch jitter, providing predictable latency for real-time applications.

**Key Publications:**

- Attention mechanism acceleration: "Attention Is All You Need" (Vaswani et al., 2017) — foundation for transformer architectures
- Low-latency inference: "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" (Howard et al., 2017) — mobile-optimized architectures
- Systolic arrays: "In-Datacenter Performance Analysis of a Tensor Processing Unit" (Jouppi et al., 2017) — TPU architecture and trade-offs
- RISC-V extensions: "The RISC-V Instruction Set Manual" — custom instruction extensions via CVXIF

---
## Contributing
Contributions are welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on:

@@ -422,6 +466,14 @@

- [Quantization and Training of Neural Networks](https://arxiv.org/abs/1712.05877)
- [Survey of Quantization Methods](https://arxiv.org/abs/2103.13630)

**Attention Mechanisms & Accelerators:**

- [Attention Is All You Need (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762) — Transformer architecture foundation
- [In-Datacenter Performance Analysis of a Tensor Processing Unit (Jouppi et al., 2017)](https://arxiv.org/abs/1704.04760) — TPU architecture and systolic arrays
- [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (Howard et al., 2017)](https://arxiv.org/abs/1704.04861) — Mobile-optimized architectures

**Benchmarks:**

- [MLPerf Inference Benchmark](https://mlcommons.org/en/inference-edge-11/) — Standardized ML inference benchmarks