Nvbench compare improvements by oleksandr-pavlyk · Pull Request #383 · NVIDIA/nvbench

oleksandr-pavlyk · 2026-06-02T17:39:48Z

First change: prefer robust statistics

- Update nvbench_compare to parse GPU timing summaries into a richer structured form and prefer the robust "median + relative IQR" summaries when available, falling back to the older "mean + relative stdev" summaries otherwise.
- Keep unavailable noise distinct from encoded infinite noise.
- Report improvements separately from regressions.
- Fix plotting behavior around missing noise and plot ordering.
- Change the CLI exit behavior so completed comparisons do not use the regression count as the process status.
- Add focused tests in python/test/test_nvbench_compare.py for robust-summary preference, unavailable noise, non-finite centers, plot-axis handling, and exit-code behavior.
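To illustrate why "median + relative IQR" is the more robust choice, here is a minimal sketch that computes both summaries from raw samples. The function names and the sample data are hypothetical; nvbench_compare itself reads precomputed summaries from the JSON files rather than raw samples.

```python
import statistics


def robust_summary(samples):
    """Return (median, relative IQR) of the samples. Hypothetical helper."""
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    median = statistics.median(samples)
    return median, (q3 - q1) / median


def mean_summary(samples):
    """Return (mean, relative standard deviation) of the samples. Hypothetical helper."""
    mean = statistics.fmean(samples)
    return mean, statistics.stdev(samples) / mean


# Made-up timings in milliseconds; the last one is a single large outlier.
samples = [1.00, 1.01, 0.99, 1.02, 0.98, 1.00, 1.01, 5.00]

median, rel_iqr = robust_summary(samples)
mean, rel_stdev = mean_summary(samples)
print(f"median={median:.3f} ms, relative IQR={rel_iqr:.2%}")
print(f"mean={mean:.3f} ms, relative stdev={rel_stdev:.2%}")
```

A single outlier shifts the mean and inflates the relative standard deviation, while the median and the relative IQR stay close to the bulk of the measurements.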

Second change: improvement to CLI

Add more precise matching and filtering semantics to nvbench_compare:
- ordered --benchmark/--axis handling
- benchmark-scoped axis filters
- explicit --reference-devices and --compare-devices
- position-based device pairing
- occurrence-based matching for duplicate benchmark states within each filtered device section.
Make selected cross-device comparisons possible while keeping unfiltered device metadata mismatches fatal.
Expand tests for
- duplicate-state matching,
- filter-before-match behavior,
- device filter parsing/validation,
- explicit cross-device pairing,
- benchmark-scoped axis filtering

Notes

Handling of duplicates with the same axis_values

This change fixes comparison for datasets where the same benchmark is repeated multiple times.

Steps to reproduce the claim

Create datasets

./build/bin/nvbench.example.cpp20.axes -b copy_sweep_grid_shape -a "BlockSize[pow2]=[8,8,8,8]" -a NumBlocks=64 --cold-warmup-runs 100 --no-batch --stopping-criterion entropy --jsonbin /tmp/run1.json
./build/bin/nvbench.example.cpp20.axes -b copy_sweep_grid_shape -a "BlockSize[pow2]=[8,8,8,8]" -a NumBlocks=64 --cold-warmup-runs 100 --no-batch --stopping-criterion entropy --jsonbin /tmp/run2.json

Comparison using `nvbench_compare` from main

(py313) opavlyk@NV-22T4X34:~/repos/nvbench$ nvbench-compare /tmp/run1.json /tmp/run2.json
['/tmp/run1.json', '/tmp/run2.json']
# copy_sweep_grid_shape

## [0] NVIDIA RTX 3500 Ada Generation Laptop GPU

|  BlockSize  |  NumBlocks  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
|     2^8     |     64      |   1.830 ms |       2.85% |   1.827 ms |       3.13% |  -2.767 us |  -0.15% |   SAME   |
|     2^8     |     64      |   1.830 ms |       2.85% |   1.785 ms |       1.16% | -44.272 us |  -2.42% |   FAST   |
|     2^8     |     64      |   1.830 ms |       2.85% |   1.787 ms |       1.66% | -42.921 us |  -2.35% |   FAST   |
|     2^8     |     64      |   1.830 ms |       2.85% |   1.783 ms |       1.15% | -46.344 us |  -2.53% |   FAST   |

# Summary

- Total Matches: 4
  - Pass    (diff <= min_noise): 1
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  3

Comparison using `nvbench_compare` from this branch

(py313) opavlyk@NV-22T4X34:~/repos/nvbench$ nvbench-compare /tmp/run1.json /tmp/run2.json
# copy_sweep_grid_shape

## [0] NVIDIA RTX 3500 Ada Generation Laptop GPU

|  BlockSize  |  NumBlocks  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|-------------|-------------|------------|-------------|------------|-------------|-----------|---------|----------|
|     2^8     |     64      |   1.821 ms |       4.84% |   1.818 ms |       4.90% | -3.072 us |  -0.17% |   SAME   |
|     2^8     |     64      |   1.776 ms |       2.02% |   1.776 ms |       2.02% |  0.000 us |   0.00% |   SAME   |
|     2^8     |     64      |   1.775 ms |       0.81% |   1.776 ms |       2.13% |  1.024 us |   0.06% |   SAME   |
|     2^8     |     64      |   1.777 ms |       1.96% |   1.776 ms |       0.92% | -1.024 us |  -0.06% |   SAME   |

# Summary

- Total Matches: 4
  - Pass        (abs(%Diff) <= max_noise): 4
  - Improvement (abs(%Diff) > max_noise, %Diff < 0): 0
  - Regression  (abs(%Diff) > max_noise, %Diff > 0): 0
  - Unknown     (infinite or unavailable noise): 0

Notes

Notice that on main, the timing of the first run from the reference dataset is listed against every run from the compare dataset.
On this branch, thanks to the improved handling of duplicates, the timing from run k of the reference dataset is matched against the timing from run k of the compare dataset.
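The occurrence-based matching described here can be sketched as follows. This is an illustrative reconstruction, not the code from the PR: the dictionary layout, the state names, and the `match_by_occurrence` helper are all assumptions.

```python
from collections import defaultdict


def match_by_occurrence(ref_states, cmp_states):
    """Pair the k-th compare state with the k-th reference state of the same name.

    Hypothetical helper; raises if the occurrence counts differ.
    """
    ref_by_name = defaultdict(list)
    for state in ref_states:
        ref_by_name[state["name"]].append(state)

    seen = defaultdict(int)  # how many times each name has been seen in cmp
    pairs = []
    for cmp_state in cmp_states:
        name = cmp_state["name"]
        k = seen[name]
        seen[name] += 1
        candidates = ref_by_name.get(name, [])
        if k >= len(candidates):
            raise ValueError(f"no reference occurrence {k} for state {name!r}")
        pairs.append((candidates[k], cmp_state))

    # Report reference states that were never matched instead of dropping them.
    for name, candidates in ref_by_name.items():
        if seen[name] != len(candidates):
            raise ValueError(f"occurrence count mismatch for state {name!r}")
    return pairs


# Four repeated states with the same axis values (times in ms, illustrative).
name = "BlockSize=2^8 NumBlocks=64"
ref = [{"name": name, "time": t} for t in (1.821, 1.776, 1.775, 1.777)]
cmp_ = [{"name": name, "time": t} for t in (1.818, 1.776, 1.776, 1.776)]

pairs = match_by_occurrence(ref, cmp_)
for r, c in pairs:
    print(r["time"], "vs", c["time"])
```

Matching against a per-name counter rather than the first entry is what keeps run k of the reference aligned with run k of the compare dataset.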

Comparing different devices

It is possible to compare datasets that have different sets of devices.

Reference may have data collected on B200, RTX 5090 and RTX A6000, while compare may have RTX 5090 and B300. To compare such datasets, use --reference-devices and --compare-devices to select subsets of devices from the reference dataset and the compare dataset, respectively.

The selected sets must have the same cardinality (number of devices); --ignore-devices is implied when devices are selected explicitly. Both --reference-devices and --compare-devices default to all. With the defaults, devices must match unless --ignore-devices is used. Even when --ignore-devices is specified, the number of devices in the reference and compare datasets must be equal.
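A minimal sketch of the device selection and position-based pairing described above. The helper names and the device-id assignment in the example are assumptions made for illustration; they are not the functions from nvbench_compare.py.

```python
def parse_device_filter(text):
    """Parse 'all', 'ID', or 'ID,ID,...' into None (meaning all) or a list of ints.

    Hypothetical helper; order and duplicates are preserved.
    """
    if text == "all":
        return None
    ids = []
    for token in text.split(","):
        token = token.strip()
        if not (token.isascii() and token.isdigit()):
            raise ValueError(f"invalid device id: {token!r}")
        ids.append(int(token))
    return ids


def pair_devices(ref_ids, cmp_ids, ref_filter=None, cmp_filter=None):
    """Pair selected reference and compare device ids by position."""
    ref_sel = list(ref_ids) if ref_filter is None else list(ref_filter)
    cmp_sel = list(cmp_ids) if cmp_filter is None else list(cmp_filter)
    for label, selected, available in (
        ("reference", ref_sel, ref_ids),
        ("compare", cmp_sel, cmp_ids),
    ):
        missing = [d for d in selected if d not in available]
        if missing:
            raise ValueError(f"unknown {label} device ids: {missing}")
    if len(ref_sel) != len(cmp_sel):
        raise ValueError("selected device counts must match")
    return list(zip(ref_sel, cmp_sel))


# Assumed ids: reference has 0=B200, 1=RTX 5090, 2=RTX A6000;
# compare has 0=RTX 5090, 1=B300.
pairs = pair_devices(
    [0, 1, 2],
    [0, 1],
    ref_filter=parse_device_filter("1"),
    cmp_filter=parse_device_filter("0"),
)
print(pairs)  # the RTX 5090 of each dataset, paired by position
```

Because pairing is positional, passing the ids in a different order (or repeating an id) changes which devices are compared against each other.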

`nvbench_compare` uses the same scoping of benchmark/axis options processing as NVBench benchmarks do

Permit filtering of relevant axis values per benchmark
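One way to preserve the relative order of --benchmark and --axis is a shared custom argparse action that records every occurrence in a single list. The sketch below is an assumption about how such scoping could be built; `OrderedFilterAction`, `build_parser`, and `build_filter_plan` are illustrative names, not the PR's implementation.

```python
import argparse


class OrderedFilterAction(argparse.Action):
    """Append (kind, value) to one shared list so CLI order is preserved."""

    def __init__(self, option_strings, dest, kind, **kwargs):
        super().__init__(option_strings, dest, **kwargs)
        self.kind = kind

    def __call__(self, parser, namespace, values, option_string=None):
        actions = getattr(namespace, self.dest, None)
        if actions is None:
            actions = []
            setattr(namespace, self.dest, actions)
        actions.append((self.kind, values))


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("-a", "--axis", dest="filter_actions",
                        action=OrderedFilterAction, kind="axis")
    parser.add_argument("-b", "--benchmark", dest="filter_actions",
                        action=OrderedFilterAction, kind="benchmark")
    return parser


def build_filter_plan(actions):
    """Split ordered actions into global axis filters and per-benchmark scopes."""
    global_axes, scopes = [], []
    for kind, value in actions or []:
        if kind == "benchmark":
            scopes.append((value, []))       # each -b opens a new scope
        elif scopes:
            scopes[-1][1].append(value)      # -a after -b: most recent benchmark
        else:
            global_axes.append(value)        # -a before any -b: global filter
    return global_axes, scopes


args = build_parser().parse_args(
    ["-a", "T=I32", "-b", "copy", "-a", "N=64", "-b", "copy", "-a", "N=128"]
)
global_axes, scopes = build_filter_plan(args.filter_actions)
print(global_axes)
print(scopes)
```

In this sketch a repeated -b for the same benchmark simply produces a second scope, which a caller could treat as an alternative filter for that benchmark.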

Teach nvbench_compare to parse GPU timing summaries into structured values and prefer the robust median/IQR summaries when both compared measurements provide them. Fall back to the existing mean/stdev summaries when robust summaries are not available. Classify comparisons with the larger available relative noise estimate instead of the smaller one, keep unavailable noise distinct from encoded infinite noise, and report improvements separately from regressions. Keep the process exit code as success for completed comparisons; regression counts are reported in the summary instead of being used as the process status. Make plotting tolerate unavailable noise by leaving gaps in confidence bands, sort plotted series by the plotted axis, and avoid reusing pyplot state across plot calls. Add focused Python tests for robust-summary preference, unavailable-noise classification, non-finite timing centers, plot-along handling when the selected axis is absent, and the exit-code contract.
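The classification rule summarized above can be sketched roughly as follows, using the category names from the new summary output. How the script treats a comparison where only one side has a noise estimate is an assumption here (the sketch uses the larger of whichever estimates are available); `classify` is a hypothetical helper.

```python
import math


def classify(ref_center, cmp_center, ref_noise, cmp_noise):
    """Classify one comparison; noise values are relative (0.02 means 2%).

    None stands for unavailable noise, math.inf for encoded infinite noise.
    """
    available = [n for n in (ref_noise, cmp_noise) if n is not None]
    if not available:
        return "unknown"                      # unavailable noise
    max_noise = max(available)
    if not math.isfinite(max_noise):
        return "unknown"                      # encoded infinite noise
    frac_diff = (cmp_center - ref_center) / ref_center
    if abs(frac_diff) <= max_noise:
        return "pass"
    return "improvement" if frac_diff < 0 else "regression"


# Second row of the table produced by main: 1.830 ms (2.85%) vs 1.785 ms (1.16%).
# Thresholding on the larger noise (2.85%) classifies the difference as a pass.
print(classify(1.830, 1.785, 0.0285, 0.0116))
```

Using the larger noise estimate makes the check more conservative: a difference has to exceed the noisier of the two measurements before it is reported as an improvement or a regression.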

Teach nvbench_compare to keep the order of --benchmark and --axis arguments so axis filters can apply either globally or to the most recent benchmark. Build a filter plan from the ordered CLI arguments and apply the same plan to table output and plotting labels.

Add explicit --reference-devices and --compare-devices filters. The filters accept all, a single device id, or a comma-separated list of ids; ordered lists and duplicates are preserved so selected reference and compare devices can be paired by position. Device-section mismatches remain fatal for unfiltered all-vs-all comparisons, but become warnings when the user explicitly selects devices and the selected device counts match.

Match duplicate benchmark states by occurrence within each filtered device section instead of matching only by state name across the whole benchmark. This keeps repeated axis values and filtered duplicate states aligned between the reference and compare inputs, and reports mismatched occurrence counts instead of silently dropping extra states.

Add Python tests for duplicate-state matching, axis filtering before matching, device filter parsing and validation, explicit cross-device pairing, and benchmark-scoped axis filters.

Original commit messages folded into this change:

Tweaks for nvbench_compare

1. When JSON files contain multiple entries with the same name and axis values, make sure that the script compares corresponding entries. Previous logic would extract the first entry from ref data, and would compare measurements for each state in cmp against the first entry from ref. The change introduces a counter to track which nth entry is being processed for a particular axis value, and retrieves the corresponding entry in ref.

Scope occurrence matching by device.

Device pairing in nvbench_compare.py is strictly index-based under --ignore-devices, so reused IDs in a different order no longer pair against the wrong reference device.

Require devices in ref and cmp to have the same cardinality

Handle mismatch when number of duplicates in ref data is not same as in cmp data

Use pytest monkeypatch fixture to pretend third-party package dependencies are available during test run for nvbench_compare without introducing test-time dependency

Added the happy-path test and fixed its direct-call setup by initializing the device globals that main() normally populates.

Fix to filter-before-matching.

- compare_benches() now pairs devices by selected position instead of taking a device id.
- For each device pair, compare_benches() now builds:
  - ref_device_states: matching reference device and axis filters
  - cmp_device_states: matching compare device and axis filters
- State occurrence counts and duplicate occurrence matching now operate only on those filtered per-device lists.
- Removed the later matches_axis_filters() skip inside the compare-state loop because filtering now happens before matching.

Added a regression test where ref/cmp have duplicate state names in opposite order, and --axis keeps only one of them. The test verifies the kept compare state is matched against the kept reference state, not the first unfiltered occurrence.

Introduce device filtering in nvbench_compare

- --reference-devices all|ID|ID,ID,...
- --compare-devices all|ID|ID,ID,...
- Integer lists preserve order and duplicates.
- Requested IDs are validated against the file-level device list.
- Filtered reference/compare device counts must match before comparison.
- compare_benches() pairs selected reference and compare devices by position.
- Each benchmark validates that requested device IDs are present in its own devices list.

Implemented benchmark-scoped --axis handling.

- --axis and --benchmark now share an ordered argparse action, so their relative CLI order is preserved.
- -a before any -b becomes a global axis filter.
- -a after -b <name> applies to that most recent benchmark only.
- Repeated -b entries are treated as separate filter scopes and combined as alternatives for that benchmark.
- Device filtering remains global and is applied independently.

Allow non-matching devices for explicit device selection

Now the device-section equality check remains fatal only for unfiltered all-vs-all comparisons. If either --reference-devices or --compare-devices is explicit, mismatched selected device metadata is printed as a warning, but comparison proceeds after the selected device counts have been validated.

Fix for resolve_benchmark_device_ids, add comments

The return value of resolve_benchmark_device_ids now always owns its list.

Use monkeypatch class in set_test_devices helper

Stricter device id validation

Test for device id validation

oleksandr-pavlyk · 2026-06-02T17:40:44Z

@coderabbitai full review

coderabbitai · 2026-06-02T17:40:50Z

✅ Actions performed

Full review triggered.

coderabbitai · 2026-06-02T17:46:13Z

📝 Walkthrough

Summary by CodeRabbit

Release Notes

New Features
- Added device-specific filtering options for benchmark comparisons (reference and compare device selection).
- Enhanced GPU-time comparison model with robust median/IQR-based metrics and confidence-band visualization.
- Improved handling of missing or invalid noise data in benchmark results.
Tests
- Added comprehensive test suite for benchmark comparison functionality.

Walkthrough

The PR refactors nvbench_compare.py to introduce structured benchmark and device filtering, replace mean/stdev time extraction with robust median/IQR-based summaries, enforce stricter device pairing with duplicate-state occurrence matching, and add log-scale plotting with per-point confidence bands. A comprehensive test suite validates comparison behavior, filtering mechanics, device resolution, and plotting integration across multiple scenarios.

Changes

GPU time comparison refactoring with structured filtering and robust time estimation

| Layer / File(s) | Summary |
|-----------------|---------|
| Data model and type infrastructure `python/scripts/nvbench_compare.py`, `python/test/test_nvbench_compare.py` | Added frozen dataclasses for GPU time summaries, filter plans, and filter scopes; changed Emoji from StrEnum to str-based Enum; introduced GPU timing tag constants and device filter parsing; test fixtures monkeypatch dependencies and provide builder helpers for benchmark/state construction. |
| GPU time summary extraction and robust dispersion computation `python/scripts/nvbench_compare.py`, `python/test/test_nvbench_compare.py` | Refactored time extraction from mean/stdev to median/IQR-based summaries with helper functions to locate tags, validate types, and select between robust vs. mean-based relative dispersion; tests verify non-finite center skipping and summary preference when median/IQR tags are available. |
| Structured filtering plan builder and axis filter matching `python/scripts/nvbench_compare.py`, `python/test/test_nvbench_compare.py` | Added benchmark filter plan builder that transforms ordered filter actions into per-benchmark axis-filter scopes; implemented group-based axis filter matching; tests validate duplicate-state matching after axis filtering and axis filter scoping to most recent benchmark. |
| Device resolution, validation, and duplicate state pairing `python/scripts/nvbench_compare.py`, `python/test/test_nvbench_compare.py` | Enforces strict device ID resolution, validates matching device counts, filters states by device and axis-filter-group, groups duplicate states by occurrence order, and raises on mismatch; tests cover device filter parsing (accepting `all`, duplicates, rejecting invalid), position-based pairing with separate ref/cmp filters, and matching duplicate occurrences. |
| Core comparison logic with new time models and threshold application `python/scripts/nvbench_compare.py`, `python/test/test_nvbench_compare.py` | Builds common time estimates from parsed GPU summaries, validates finite positive times, derives noise from relative dispersion, classifies results using new noise rules, and applies thresholds with matching axis filters; tests verify comparison result classification and unknown-status handling for missing/null noise. |
| Output formatting and plotting improvements `python/scripts/nvbench_compare.py`, `python/test/test_nvbench_compare.py` | Updated axis-value display with power-of-two exponent notation and type-specific formatters; implemented format_percentage for None/NaN/infinity; added log-scale plotting with per-point confidence bands; test verifies axis filtering respects plot_along selection. |
| CLI argument parsing and main execution flow `python/scripts/nvbench_compare.py`, `python/test/test_nvbench_compare.py` | Added --reference-devices and --compare-devices options; switched --axis and --benchmark to OrderedBenchmarkFilterAction; updated main() to select devices, validate matches, report mismatches, conditionally abort, and pass filter-plan/device-filter parameters to compare_benches; integration test verifies main() returns exit code 0 with regression detection. |


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

python/scripts/nvbench_compare.py (1)

534-547: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Negative durations are always formatted in microseconds. Unit selection is based on the signed value, so values like -0.01 print as -10000.000 us instead of -10.000 ms. Choose the unit from abs(seconds) and keep the original sign only in the formatted result.
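The suggested fix can be sketched as below. The unit thresholds and the three-decimal format are assumptions modeled on the tables in this PR, and `format_duration` is a hypothetical name rather than the function at lines 534-547.

```python
def format_duration(seconds):
    """Format a signed duration; the unit is chosen from the magnitude only."""
    magnitude = abs(seconds)
    if magnitude >= 1.0:
        scale, unit = 1.0, "s"
    elif magnitude >= 1e-3:
        scale, unit = 1e3, "ms"
    else:
        scale, unit = 1e6, "us"
    # The sign is preserved because the original signed value is scaled.
    return f"{seconds * scale:0.3f} {unit}"


print(format_duration(-0.01))
print(format_duration(1.83e-3))
print(format_duration(-44.272e-6))
```

Selecting the unit from `abs(seconds)` means a negative difference is printed in the same unit as a positive difference of the same size.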

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e6f909dd-05a1-460e-a704-459561ab177d

📥 Commits

Reviewing files that changed from the base of the PR and between ee4b9f0 and 1d13b49.

📒 Files selected for processing (2)

python/scripts/nvbench_compare.py
python/test/test_nvbench_compare.py

coderabbitai · 2026-06-02T17:46:16Z

+def compute_common_time_estimates(ref_summary, cmp_summary):
+    if has_robust_estimate(ref_summary) and has_robust_estimate(cmp_summary):
+        return (
+            TimeEstimate(
+                center=ref_summary.median,
+                relative_dispersion=select_relative_dispersion(
+                    ref_summary.interquartile_range_relative,
+                    ref_summary.interquartile_range,
+                    ref_summary.median,
+                ),
+            ),
+            TimeEstimate(
+                center=cmp_summary.median,
+                relative_dispersion=select_relative_dispersion(
+                    cmp_summary.interquartile_range_relative,
+                    cmp_summary.interquartile_range,
+                    cmp_summary.median,
+                ),
+            ),
+        )
+
+    if has_mean_estimate(ref_summary) and has_mean_estimate(cmp_summary):
+        return (
+            TimeEstimate(
+                center=ref_summary.mean,
+                relative_dispersion=select_relative_dispersion(
+                    ref_summary.stdev_relative, ref_summary.stdev, ref_summary.mean
+                ),
+            ),
+            TimeEstimate(
+                center=cmp_summary.mean,
+                relative_dispersion=select_relative_dispersion(
+                    cmp_summary.stdev_relative, cmp_summary.stdev, cmp_summary.mean
+                ),
+            ),
+        )
+
+    return (
+        TimeEstimate(
+            center=ref_summary.mean,
+            relative_dispersion=compute_relative_dispersion(
+                ref_summary.stdev, ref_summary.mean
+            ),
+        ),
+        TimeEstimate(
+            center=cmp_summary.mean,
+            relative_dispersion=compute_relative_dispersion(
+                cmp_summary.stdev, cmp_summary.mean
+            ),
+        ),
+    )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Mixed robust/non-robust inputs still discard the available robust estimate. If only one side has median/IQR, this falls back to mean for both sides, and can even skip a comparison when the robust side has no mean summary. Select the best estimate per side independently, then compare those two estimates, and add a mixed-summary regression test.
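A minimal sketch of the per-side selection the comment asks for. The `Summary` and `Estimate` dataclasses and the `best_estimate` helper are simplified, hypothetical stand-ins for the PR's types, and the sketch ignores the absolute-dispersion fallback handled by select_relative_dispersion in the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Summary:
    """Simplified stand-in for a parsed GPU time summary (hypothetical)."""
    median: Optional[float] = None
    iqr_relative: Optional[float] = None
    mean: Optional[float] = None
    stdev_relative: Optional[float] = None


@dataclass(frozen=True)
class Estimate:
    center: float
    relative_dispersion: Optional[float]
    kind: str  # "robust" or "mean"


def best_estimate(summary):
    """Pick the best estimate for ONE side, independent of the other side."""
    if summary.median is not None:
        return Estimate(summary.median, summary.iqr_relative, "robust")
    if summary.mean is not None:
        return Estimate(summary.mean, summary.stdev_relative, "mean")
    return None  # nothing usable; the caller would skip the comparison


# Mixed inputs: the reference only has robust data, the compare side only mean data.
ref = Summary(median=1.776e-3, iqr_relative=0.0202)
cmp_ = Summary(mean=1.785e-3, stdev_relative=0.0116)

ref_est, cmp_est = best_estimate(ref), best_estimate(cmp_)
print(ref_est.kind, cmp_est.kind)
```

Note the trade-off: selecting per side keeps the comparison alive for mixed inputs, but it then compares a median against a mean, which is presumably why the current code insists on a common estimator for both sides.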

coderabbitai · 2026-06-02T17:46:16Z

        "-a",
        "--axis",
-        action="append",
-        default=[],
-        help="Filter on axis value, e.g. -a Elements{io}=2^20 (can repeat)",
+        dest="filter_actions",
+        action=OrderedBenchmarkFilterAction,
+        help=(
+            "Filter on axis value, e.g. -a Elements{io}=2^20. Applies to the "
+            "most recent --benchmark, or all benchmarks if specified before any "
+            "--benchmark arguments."
+        ),


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

suggestion: The --axis example does not match what parse_axis_filters accepts. Elements{io}=2^20 is treated as a literal string today; powers of two are only expanded when the axis name uses the [pow2] suffix. Update the help text to show either the raw numeric value or the [pow2] form.
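To make the distinction concrete, here is a small sketch of the behavior the comment describes: a value is expanded as a power of two only when the axis name carries the [pow2] suffix. `parse_axis_filter` is a hypothetical, simplified helper and not the actual parse_axis_filters from the script.

```python
POW2_SUFFIX = "[pow2]"


def parse_axis_filter(text):
    """Split 'Name=value'; expand the value only for 'Name[pow2]=exponent'."""
    name, sep, raw = text.partition("=")
    if not sep or not name:
        raise ValueError(f"expected NAME=VALUE, got {text!r}")
    if name.endswith(POW2_SUFFIX):
        # The value is an exponent: Elements{io}[pow2]=20 means 2**20.
        return name[: -len(POW2_SUFFIX)], str(2 ** int(raw))
    # Without the suffix the value is kept as a literal string.
    return name, raw


print(parse_axis_filter("Elements{io}[pow2]=20"))
print(parse_axis_filter("Elements{io}=2^20"))
```

Under this reading, `Elements{io}=2^20` would only match a state whose axis value is literally the string `2^20`, which is why the help text example is misleading.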

bernhardmgruber · 2026-06-03T08:55:39Z

@oleksandr-pavlyk can you show an example of how a comparison of a CUB benchmark would look like now?

And are all uses for nvbench_compare.py in the CUB docs still correct then? https://nvidia.github.io/cccl/unstable/cub/benchmarking.html

oleksandr-pavlyk · 2026-06-03T12:10:58Z

> @oleksandr-pavlyk can you show an example of how a comparison of a CUB benchmark would look like now?
>
> And are all uses for nvbench_compare.py in the CUB docs still correct then? https://nvidia.github.io/cccl/unstable/cub/benchmarking.html

All documented usage is still valid. But it is a great point that updating this document should be kept in mind.

usage: nvbench_compare [reference.json compare.json | reference_dir/ compare_dir/]

options:
  -h, --help            show this help message and exit
  --ignore-devices      Ignore differences in the device sections and compare anyway
  --threshold-diff THRESHOLD
                        only show benchmarks where percentage diff is >= THRESHOLD
  --plot-along PLOT_ALONG
                        plot results
  --plot                plot comparison summary
  --dark                Use dark theme (black background, white text)
  --no-color            Use emoji instead of ANSI color codes (useful for GitHub issues/PRs)
  --reference-devices REFERENCE_DEVICES
                        Reference devices to compare: all, a non-negative integer id, or comma-separated ids
  --compare-devices COMPARE_DEVICES
                        Compare devices to compare: all, a non-negative integer id, or comma-separated ids
  -a, --axis FILTER_ACTIONS
                        Filter on axis value, e.g. -a 'Elements{io}[pow2]=20'. Applies to the most recent --benchmark, or all benchmarks if specified before any --benchmark arguments.
  -b, --benchmark FILTER_ACTIONS
                        Filter by benchmark name (can repeat)

What changed in this PR is that --axis now applies to the most recent benchmark, as it does in NVBench benchmarks themselves, and the script now supports filtering by device IDs in the JSON data.

This allows comparison of JSON datasets not only from different single devices, but also of datasets measured on different numbers of devices. For example, on a machine with two GPUs:

 # benchmarks for all devices on the machine
 ./path/to/nvbench-instrumented-benchmarks --jsonbin run1.json
 # benchmarks only for the CUDA_DEVICE 0
 ./path/to/nvbench-instrumented-benchmarks -d 0 --jsonbin run2.json
 nvbench-compare run1.json run2.json --compare-devices 0 --reference-devices 0

oleksandr-pavlyk added 2 commits June 2, 2026 11:47

coderabbitai Bot reviewed Jun 2, 2026

oleksandr-pavlyk wants to merge 2 commits into NVIDIA:main from oleksandr-pavlyk:nvbench-compare-improvements

2 participants
