133 changes: 133 additions & 0 deletions docs/README_bench.md
@@ -0,0 +1,133 @@
# iris.bench - Unified Benchmarking Harness

A standardized benchmarking infrastructure for Iris that reduces code duplication and provides consistent performance measurement across examples and benchmarks.

## Quick Start

```python
import iris
from iris.bench import benchmark

# Simple decorator-based benchmarking
@benchmark(name="my_kernel", warmup=5, iters=50)
def run_kernel():
    kernel[grid](buffer, size)

result = run_kernel()
result.print_summary()
```

> **Collaborator** (review comment on `def run_kernel():`): @copilot the decorator way is the only way we need. Remove everything else. Also, it is safe to assume that the bench harness will construct the iris instance and pass it to the user benchmark function. When using the decorator, the user will also need to annotate the parts of the code that are pre-setup (e.g. tensor allocation), per-run preamble (e.g. resetting flags), and the code to actually benchmark (the kernel launch).

## Features

- ✅ **Automatic warmup and timing** - No more manual warmup loops
- ✅ **Rich statistics** - mean, median, p50, p99, min, max
- ✅ **Parameter sweeps** - Easy iteration over configurations
- ✅ **Multi-GPU support** - Built-in barrier synchronization
- ✅ **JSON export** - Structured results for CI/CD integration
- ✅ **Utility functions** - `torch_dtype_from_str`, `compute_bandwidth_gbps`

## What Problem Does This Solve?

Before `iris.bench`, every benchmark had ~100 lines of duplicated code for:
- Argument parsing (datatype, warmup, iterations)
- Dtype string-to-torch conversion
- Manual warmup loops
- Timing and synchronization
- Result formatting and printing
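
For context, a condensed sketch of that hand-rolled pattern is shown below (the matmul workload and sizes are only illustrative, not taken from the real benchmarks):

```python
import time
import torch

# Illustrative pre-iris.bench pattern: manual warmup, timing, and summary.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
warmup, iters = 5, 50

for _ in range(warmup):               # manual warmup loop
    _ = a @ a
torch.cuda.synchronize()

times_ms = []
for _ in range(iters):                # manual timing loop
    start = time.perf_counter()
    _ = a @ a
    torch.cuda.synchronize()          # wait for the GPU before reading the clock
    times_ms.append((time.perf_counter() - start) * 1e3)

times_ms.sort()
print(f"mean {sum(times_ms) / len(times_ms):.3f} ms, p50 {times_ms[len(times_ms) // 2]:.3f} ms")
```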

This led to:
- 🔴 Copy-pasted code across 20+ benchmark files
- 🔴 Inconsistent measurement patterns
- 🔴 No standardized statistics (p50, p99)
- 🔴 Hard to maintain and extend

With `iris.bench`:
- ✅ ~50% less code per benchmark
- ✅ Standardized API across all benchmarks
- ✅ Easy to add new benchmarks
- ✅ CI-ready JSON export

## Examples

### Example 1: Simple Benchmark
```python
from iris.bench import BenchmarkRunner

runner = BenchmarkRunner(name="test", barrier_fn=shmem.barrier)

def operation():
    kernel[grid](buffer)

result = runner.run(fn=operation, warmup=5, iters=50)
result.print_summary()
```
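
If `BenchmarkResult` also exposes its statistics programmatically, the same result could feed a simple CI gate. The attribute names below (`mean_ms`, `p99_ms`) are assumptions rather than a documented API:

```python
# Sketch only: mean_ms / p99_ms are assumed attribute names, not the documented BenchmarkResult API.
result = runner.run(fn=operation, warmup=5, iters=50)
if result.mean_ms > 2.0:  # hypothetical regression threshold in milliseconds
    raise SystemExit(f"regression: mean {result.mean_ms:.3f} ms, p99 {result.p99_ms:.3f} ms")
result.print_summary()
```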

### Example 2: Parameter Sweep
```python
import torch
from iris.bench import BenchmarkRunner, torch_dtype_from_str

runner = BenchmarkRunner(name="dtype_sweep")

for dtype_str in ["fp16", "fp32"]:
    for size in [1024, 2048]:
        dtype = torch_dtype_from_str(dtype_str)

        def op():
            tensor = torch.zeros(size, size, dtype=dtype, device="cuda")
            result = tensor @ tensor

        runner.run(fn=op, warmup=5, iters=20,
                   params={"size": size, "dtype": dtype_str})

runner.save_json("results.json")
```
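
On the CI side, the exported file can be consumed with nothing more than the standard library. The exact schema written by `save_json` is not documented here, so the `results`, `params`, and `mean_ms` keys below are assumptions:

```python
import json

# Sketch of a CI consumer; the JSON layout ("results", "params", "mean_ms") is assumed.
with open("results.json") as f:
    data = json.load(f)

for entry in data.get("results", []):
    params = entry.get("params", {})
    print(f"{params.get('dtype')} @ {params.get('size')}: {entry.get('mean_ms', 'n/a')} ms")
```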

## Documentation

- 📖 [Full API Documentation](bench_harness.md)
- 📖 [Migration Guide](bench_migration_example.md)
- 💻 [Complete Examples](../examples/benchmark/bench_harness_example.py)

## Testing

```bash
# Run basic tests (no GPU required)
python3 tests/unittests/test_bench_basic.py

# Run full test suite (requires GPU)
pytest tests/unittests/test_bench.py
```

## API Overview

### BenchmarkResult
Stores benchmark results with automatic statistics computation.

### BenchmarkRunner
Main class for running benchmarks with parameter sweeps.

### @benchmark
Decorator for simple function benchmarking.

### Utilities
- `torch_dtype_from_str(dtype_str)` - Convert string to torch.dtype
- `compute_bandwidth_gbps(bytes, time_ms)` - Calculate bandwidth
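
A short sketch combining the two utilities, assuming `compute_bandwidth_gbps` takes a byte count and a time in milliseconds and returns GB/s, as the signature above suggests (the copy workload and the 1.5 ms timing are only placeholders):

```python
import torch
from iris.bench import torch_dtype_from_str, compute_bandwidth_gbps

dtype = torch_dtype_from_str("fp16")          # expected to map to torch.float16
n = 1 << 26                                   # 64 Mi elements

src = torch.randn(n, dtype=dtype, device="cuda")
dst = torch.empty_like(src)
dst.copy_(src)                                # the operation whose bandwidth we report

num_bytes = 2 * n * src.element_size()        # one read of src + one write of dst
time_ms = 1.5                                 # placeholder; normally measured by the harness
print(f"copy bandwidth: {compute_bandwidth_gbps(num_bytes, time_ms):.1f} GB/s")
```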

## Integration

The harness is designed to work alongside existing `iris.do_bench` usage:
- `BenchmarkRunner` internally uses `iris.do_bench`
- All existing barrier functions work with `barrier_fn` parameter
- Gradual migration path - old benchmarks continue to work

## Contributing

When adding new benchmarks:
1. Use `iris.bench` for all new code
2. Consider migrating related legacy benchmarks when you touch them
3. Export results to JSON for CI integration
4. Follow examples in `examples/benchmark/`

## License

MIT License - Copyright (c) 2025-2026 Advanced Micro Devices, Inc.