
Commit 9d7a10a

Clarify performance methodology and add industry validation roadmap
- Update performance claims to explicitly state 'vs. modeled baseline' comparison
- Add methodology note clarifying baseline is theoretical, not measured hardware
- Add Future Work section outlining FPGA integration, MLPerf benchmarks, and industry comparisons
- Add Related Work section citing attention accelerators and distinguishing Garuda's approach
- Update references with attention mechanism and accelerator papers
- Note: Industry comparisons pending FPGA/ASIC implementation
1 parent 1bdd89f commit 9d7a10a

1 file changed: README.md (56 additions, 4 deletions)
@@ -13,7 +13,7 @@
 
 **Garuda** is a CVXIF coprocessor that extends RISC-V with custom INT8 multiply-accumulate (MAC) instructions for efficient neural network inference. Unlike throughput-oriented accelerators that require batching, Garuda optimizes for **batch-1 tail latency** (p99), making it ideal for real-time transformer inference, voice assistants, and local LLM attention workloads.
 
-**Key advantage**: Achieves **7.5-9× latency reduction** vs baseline CPU-style loops (p99: 307→34 cycles) for attention dot products while maintaining competitive throughput for larger dense layers.
+**Key advantage**: Achieves **7.5-9× latency reduction** vs. modeled baseline (CPU-style SIMD_DOT with dispatch jitter) for attention dot products (p99: 307→34 cycles) while maintaining competitive throughput for larger dense layers. *Industry comparisons pending FPGA/ASIC implementation.*
 
 ### Quick Start
 
@@ -35,7 +35,7 @@ vvp sim_test.vvp
 
 ## Why Garuda?
 
-- **Low tail latency**: 7.5-9× faster p99 latency for batch-1 attention microkernels (307→34 cycles)
+- **Low tail latency**: 7.5-9× faster p99 latency vs. modeled baseline for batch-1 attention microkernels (307→34 cycles)
 - **Standard integration**: CVXIF protocol — no CPU modifications required
 - **High throughput**: SIMD_DOT instruction provides 4× speedup vs scalar operations
 - **On-SoC optimized**: Designed for cache-coherent integration next to RISC-V CPU cores
@@ -49,13 +49,15 @@ vvp sim_test.vvp
 
 **Workload**: Q·K dot product (K=128 INT8 elements = 32 words × 4 INT8/word) — single-head attention score computation
 
-| Metric | Baseline (CPU-style) | Garuda Microkernel | Improvement |
+| Metric | Baseline (Modeled CPU-style) | Garuda Microkernel | Improvement |
 |---|---:|---:|---:|
 | p50 latency | 256 cycles | 34 cycles | **7.5×** |
 | p95 latency | 291 cycles | 34 cycles | **8.6×** |
 | p99 latency | 307 cycles | 34 cycles | **9.0×** |
 
-*Measured via `tb_attention_microkernel_latency.sv` (1000 trials, Icarus simulation). Baseline models CPU-style loop with dispatch jitter; microkernel uses deterministic internal loop.*
+*Measured via `tb_attention_microkernel_latency.sv` (1000 trials, Icarus simulation). Baseline models a CPU-style SIMD_DOT loop with dispatch jitter (0-12 cycle random bubbles); the microkernel uses a deterministic internal loop.*
+
+**Methodology note**: The baseline is a *modeled* CPU-style implementation (theoretical dispatch overhead), not measured hardware. Industry comparisons against real RISC-V CPUs (CVA6, Rocket, BOOM) and other accelerators require FPGA/ASIC implementation and will be published as future work.
 
 **Why this matters**: Lower tail latency (p99) is critical for real-time applications. Garuda's microkernel engine eliminates dispatch overhead by running the dot-product loop internally, achieving deterministic, predictable latency.
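
*Editor's sketch (not part of this commit)*: a minimal C reference of the benchmarked workload, assuming only the packing stated above (32 words × 4 INT8 lanes = 128 elements). It models the arithmetic, not SIMD_DOT or `ATT_DOT_*` timing; the function and variable names are hypothetical.

```c
#include <stdint.h>

/* Hedged sketch: plain-C reference of the Q·K INT8 dot product benchmarked above.
 * Packing per the README: 32 words x 4 INT8 lanes = 128 elements, 32-bit accumulate.
 * The microkernel runs this whole loop internally as one fused operation; the
 * modeled baseline issues one SIMD_DOT per word and pays dispatch jitter per issue. */
static int32_t qk_dot_int8(const uint32_t q_words[32], const uint32_t k_words[32])
{
    int32_t acc = 0;
    for (int w = 0; w < 32; w++) {
        for (int lane = 0; lane < 4; lane++) {
            int8_t q = (int8_t)((q_words[w] >> (8 * lane)) & 0xFFu);
            int8_t k = (int8_t)((k_words[w] >> (8 * lane)) & 0xFFu);
            acc += (int32_t)q * (int32_t)k;  /* one INT8 MAC */
        }
    }
    return acc;
}
```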

@@ -399,6 +401,48 @@ yosys -p "synth_xilinx -top int8_mac_unit -flatten; write_json output.json" \
 
 ---
 
+## Future Work
+
+The following enhancements are planned to validate Garuda's performance against industry standards:
+
+- **FPGA Integration with CVA6 CPU**: Full integration of Garuda with CVA6 on an FPGA (e.g., Xilinx Zynq-7000) to measure end-to-end latency with real CPU interaction, cache coherence, and memory hierarchy.
+- **MLPerf Inference Benchmarks**: Submission to the MLPerf Inference benchmark suite to compare against published accelerator results (TPU, Tensor Cores, etc.) using standardized workloads and metrics.
+- **Comparison Against Published Accelerator Numbers**: Quantitative comparison against cycle-accurate models and published results from:
+  - Google TPU (attention microkernel latency)
+  - NVIDIA Tensor Cores (small-batch dot products)
+  - Apple Neural Engine (transformer inference)
+  - Open-source RISC-V accelerators (Snitch, NVDLA on RISC-V)
+- **ASIC Implementation**: Synthesis to an ASIC process node (e.g., TSMC 28nm or 7nm) for area, power, and timing analysis with target-specific optimizations.
+- **Full Transformer End-to-End Benchmark**: Measure complete attention-layer latency (Q·K·V with softmax) and compare against software baselines on real hardware.
+
+---
+
+## Related Work
+
+Garuda builds upon research in low-latency neural network accelerators and attention mechanism optimization. Related approaches include:
+
+**Attention Accelerators:**
+- **Spatial Accelerators**: Systolic arrays (TPU-style) excel at large-batch GEMMs but suffer from underutilization for small-batch attention queries. Garuda addresses this with fused microkernel instructions that eliminate dispatch overhead.
+- **In-Memory Computing**: PIM (Processing-In-Memory) accelerators reduce memory bandwidth but introduce latency from their memory access patterns. Garuda's operand staging provides cache-hot access for low-latency loops.
+- **Multi-Head Attention Optimization**: Prior work optimizes multi-head parallelism; Garuda focuses on single-head tail latency for real-time applications where head-level parallelism may be limited.
+
+**RISC-V Accelerator Extensions:**
+- **CVXIF Coprocessors**: The standard CVXIF interface enables modular accelerator design; Garuda extends it with attention-specific fused operations (`ATT_DOT_*` instructions).
+- **Register Renaming**: Multi-issue support via a rename table enables 4-wide instruction issue, similar to modern out-of-order CPUs, but tailored for deterministic coprocessor execution.
+
+**What Makes Garuda Different:**
+- **Batch-1 Tail Latency Focus**: Unlike throughput-oriented accelerators, Garuda explicitly optimizes p99 latency for small-batch workloads (batch=1), making it suitable for real-time inference.
+- **Fused Attention Microkernels**: Custom instructions (`ATT_DOT_SETUP`, `ATT_DOT_RUN`, `ATT_DOT_RUN_SCALE`, `ATT_DOT_RUN_CLIP`) combine loop control, scaling, and clipping into single-instruction operations, minimizing dispatch overhead.
+- **Deterministic Execution**: Internal loop execution within the microkernel engine eliminates variance from CPU dispatch jitter, providing predictable latency for real-time applications.
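
*Editor's sketch (not part of this commit)*: the deterministic-execution point above can be illustrated in a few lines of C. Per-issue dispatch jitter compounds across a 32-instruction baseline loop, while a fused microkernel with an internal loop takes a fixed cycle count. The 0-12 cycle jitter range and 1000-trial count come from the methodology note earlier in this diff; the 2-cycle base issue cost and the flat 34-cycle microkernel figure are illustrative assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

#define TRIALS 1000   /* trial count from the README methodology */
#define WORDS  32     /* 32 SIMD_DOT issues for a 128-element dot product */

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(void)
{
    int baseline[TRIALS];
    srand(1);
    for (int t = 0; t < TRIALS; t++) {
        int cycles = 0;
        for (int w = 0; w < WORDS; w++)
            cycles += 2 + rand() % 13;  /* assumed 2-cycle issue + 0-12 cycle random bubble */
        baseline[t] = cycles;
    }
    qsort(baseline, TRIALS, sizeof(int), cmp_int);
    printf("modeled baseline p50=%d p99=%d cycles; fused microkernel=34 cycles (fixed)\n",
           baseline[TRIALS / 2], baseline[(TRIALS * 99) / 100]);
    return 0;
}
```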
+
+**Key Publications:**
+- Attention mechanism acceleration: "Attention Is All You Need" (Vaswani et al., 2017) — foundation for transformer architectures
+- Low-latency inference: "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" (Howard et al., 2017) — mobile-optimized architectures
+- Systolic arrays: "In-Datacenter Performance Analysis of a Tensor Processing Unit" (Jouppi et al., 2017) — TPU architecture and trade-offs
+- RISC-V extensions: "The RISC-V Instruction Set Manual" — custom instruction extensions via CVXIF
+
+---
+
 ## Contributing
 
 Contributions are welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on:
@@ -422,6 +466,14 @@ Contributions are welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines
 - [Quantization and Training of Neural Networks](https://arxiv.org/abs/1712.05877)
 - [Survey of Quantization Methods](https://arxiv.org/abs/2103.13630)
 
+**Attention Mechanisms & Accelerators:**
+- [Attention Is All You Need (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762) — Transformer architecture foundation
+- [In-Datacenter Performance Analysis of a Tensor Processing Unit (Jouppi et al., 2017)](https://arxiv.org/abs/1704.04760) — TPU architecture and systolic arrays
+- [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (Howard et al., 2017)](https://arxiv.org/abs/1704.04861) — Mobile-optimized architectures
+
+**Benchmarks:**
+- [MLPerf Inference Benchmark](https://mlcommons.org/en/inference-edge-11/) — Standardized ML inference benchmarks
+
 ---
 
 ## Use Cases
