Nvbench compare improvements#383
Conversation
Teach nvbench_compare to parse GPU timing summaries into structured values and prefer the robust median/IQR summaries when both compared measurements provide them. Fall back to the existing mean/stdev summaries when robust summaries are not available. Classify comparisons with the larger available relative noise estimate instead of the smaller one, keep unavailable noise distinct from encoded infinite noise, and report improvements separately from regressions. Keep the process exit code as success for completed comparisons; regression counts are reported in the summary instead of being used as the process status. Make plotting tolerate unavailable noise by leaving gaps in confidence bands, sort plotted series by the plotted axis, and avoid reusing pyplot state across plot calls. Add focused Python tests for robust-summary preference, unavailable-noise classification, non-finite timing centers, plot-along handling when the selected axis is absent, and the exit-code contract.
Teach nvbench_compare to keep the order of --benchmark and --axis arguments so
axis filters can apply either globally or to the most recent benchmark. Build a
filter plan from the ordered CLI arguments and apply the same plan to table
output and plotting labels.
Add explicit --reference-devices and --compare-devices filters. The filters
accept all, a single device id, or a comma-separated list of ids; ordered lists
and duplicates are preserved so selected reference and compare devices can be
paired by position. Device-section mismatches remain fatal for unfiltered
all-vs-all comparisons, but become warnings when the user explicitly selects
devices and the selected device counts match.
Match duplicate benchmark states by occurrence within each filtered device
section instead of matching only by state name across the whole benchmark. This
keeps repeated axis values and filtered duplicate states aligned between the
reference and compare inputs, and reports mismatched occurrence counts instead
of silently dropping extra states.
Add Python tests for duplicate-state matching, axis filtering before matching,
device filter parsing and validation, explicit cross-device pairing, and
benchmark-scoped axis filters.
Original commit messages folded into this change:
Tweaks for nvbench_compare
1. When JSON files contain multiple entries with the same name and axis values,
make sure that scripts compares corresponding entries.
Previous logic would extract the first entry from ref data, and would compare
measurements for each state in cmp against the first entry from ref. The
change introduces a counter to know which nth entry we process for a
particular axis value, and retrieve corresponding entry in ref.
Scope occurrence matching by device.
Device pairing in nvbench_compare.py is strictly index-based under
--ignore-devices, reused IDs in a different order no longer pair against the
wrong reference device.
Require devices in ref and cmp to have the same cardinality
Handle mismatch when number of duplicates in ref data is not same as in cmp data
Use pytest monkeypatch fixture to pretend third-party package dependencies are
available during test run for nvbench_compare without introducing test-time
dependency
Added the happy-path test and fixed its direct-call setup by initializing the
device globals that main() normally populates.
Fix to filter-before-matching.
- compare_benches() now pairs devices by selected position instead of taking a
device id.
- For each device pair, compare_benches() now builds:
- ref_device_states: matching reference device and axis filters
- cmp_device_states: matching compare device and axis filters
- State occurrence counts and duplicate occurrence matching now operate only
on those filtered per-device lists.
- Removed the later matches_axis_filters() skip inside the compare-state loop
because filtering now happens before matching.
Added a regression test where ref/cmp have duplicate state names in opposite
order, and --axis keeps only one of them. The test verifies the kept compare
state is matched against the kept reference state, not the first unfiltered
occurrence.
Introduce device filtering in nvbench_compare
- --reference-devices all|ID|ID,ID,...
- --compare-devices all|ID|ID,ID,...
- Integer lists preserve order and duplicates.
- Requested IDs are validated against the file-level device list.
- Filtered reference/compare device counts must match before comparison.
- compare_benches() pairs selected reference and compare devices by position.
- Each benchmark validates that requested device IDs are present in its own
devices list.
Implemented benchmark-scoped --axis handling.
- --axis and --benchmark now share an ordered argparse action, so their
relative CLI order is preserved.
- -a before any -b becomes a global axis filter.
- -a after -b <name> applies to that most recent benchmark only.
- Repeated -b entries are treated as separate filter scopes and combined as
alternatives for that benchmark.
- Device filtering remains global and is applied independently.
Allow non-matching devices for explicit device selection
Now the device-section equality check remains fatal only for unfiltered
all-vs-all comparisons. If either --reference-devices or --compare-devices is
explicit, mismatched selected device metadata is printed as a warning, but
comparison proceeds after the selected device counts have been validated.
Fix for resolve_benchmark_device_ids, add comments
The return value of resolve_benchmark_device_ids now always owns its list.
Use monkeypatch class in set_test_devices helper
Stricted device id validation
Test for device id validation
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
📝 WalkthroughSummary by CodeRabbitRelease Notes
WalkthroughThe PR refactors ChangesGPU time comparison refactoring with structured filtering and robust time estimation
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/scripts/nvbench_compare.py (1)
534-547:⚠️ Potential issue | 🟠 Major | ⚡ Quick winimportant: Negative durations are always formatted in microseconds. Unit selection is based on the signed value, so values like
-0.01print as-10000.000 usinstead of-10.000 ms. Choose the unit fromabs(seconds)and keep the original sign only in the formatted result.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e6f909dd-05a1-460e-a704-459561ab177d
📒 Files selected for processing (2)
python/scripts/nvbench_compare.pypython/test/test_nvbench_compare.py
| def compute_common_time_estimates(ref_summary, cmp_summary): | ||
| if has_robust_estimate(ref_summary) and has_robust_estimate(cmp_summary): | ||
| return ( | ||
| TimeEstimate( | ||
| center=ref_summary.median, | ||
| relative_dispersion=select_relative_dispersion( | ||
| ref_summary.interquartile_range_relative, | ||
| ref_summary.interquartile_range, | ||
| ref_summary.median, | ||
| ), | ||
| ), | ||
| TimeEstimate( | ||
| center=cmp_summary.median, | ||
| relative_dispersion=select_relative_dispersion( | ||
| cmp_summary.interquartile_range_relative, | ||
| cmp_summary.interquartile_range, | ||
| cmp_summary.median, | ||
| ), | ||
| ), | ||
| ) | ||
|
|
||
| if has_mean_estimate(ref_summary) and has_mean_estimate(cmp_summary): | ||
| return ( | ||
| TimeEstimate( | ||
| center=ref_summary.mean, | ||
| relative_dispersion=select_relative_dispersion( | ||
| ref_summary.stdev_relative, ref_summary.stdev, ref_summary.mean | ||
| ), | ||
| ), | ||
| TimeEstimate( | ||
| center=cmp_summary.mean, | ||
| relative_dispersion=select_relative_dispersion( | ||
| cmp_summary.stdev_relative, cmp_summary.stdev, cmp_summary.mean | ||
| ), | ||
| ), | ||
| ) | ||
|
|
||
| return ( | ||
| TimeEstimate( | ||
| center=ref_summary.mean, | ||
| relative_dispersion=compute_relative_dispersion( | ||
| ref_summary.stdev, ref_summary.mean | ||
| ), | ||
| ), | ||
| TimeEstimate( | ||
| center=cmp_summary.mean, | ||
| relative_dispersion=compute_relative_dispersion( | ||
| cmp_summary.stdev, cmp_summary.mean | ||
| ), | ||
| ), | ||
| ) |
There was a problem hiding this comment.
important: Mixed robust/non-robust inputs still discard the available robust estimate. If only one side has median/IQR, this falls back to mean for both sides, and can even skip a comparison when the robust side has no mean summary. Select the best estimate per side independently, then compare those two estimates, and add a mixed-summary regression test.
| "-a", | ||
| "--axis", | ||
| action="append", | ||
| default=[], | ||
| help="Filter on axis value, e.g. -a Elements{io}=2^20 (can repeat)", | ||
| dest="filter_actions", | ||
| action=OrderedBenchmarkFilterAction, | ||
| help=( | ||
| "Filter on axis value, e.g. -a Elements{io}=2^20. Applies to the " | ||
| "most recent --benchmark, or all benchmarks if specified before any " | ||
| "--benchmark arguments." | ||
| ), |
There was a problem hiding this comment.
suggestion: The --axis example does not match what parse_axis_filters accepts. Elements{io}=2^20 is treated as a literal string today; powers of two are only expanded when the axis name uses the [pow2] suffix. Update the help text to show either the raw numeric value or the [pow2] form.
|
@oleksandr-pavlyk can you show an example of how a comparison of a CUB benchmark would look like now? And are all uses for |
All documented usage is valid. But it is a great point that updating this document should kept in mind. What changed in this PR is This allows for comparison of JSON datasets not only from different single devices, but of JSON datasets with different number of devices measured on. So on a machine with two GPUs |
First change: prefer robust statistics
nvbench_compareto parse GPU timing summaries into a richer structured form and prefer robust "median + relative IQR" when available, falling back to the older "mean + relative stdev" summaries otherwisepython/test/test_nvbench_compare.pyfor robust-summary preference, unavailable noise, non-finite centers, plot-axis handling, and exit-code behavior.Second change: improvement to CLI
nvbench_compare:--benchmark/--axishandling--reference-devicesand--compare-devicesNotes
Handling of duplicates with the same axis_values
Steps to reproduce the claim
Create datasets
Comparison using
nvbench_comparefrom mainComparison using
nvbench_comparefrom this branchNotes
kfrom reference dataset is matched against timing from matching runkin compare dataset.Comparing different devices
Reference may have data collected on B200, RTX 5090 and RTX A6000, while compare may have RTX 5090 and B300. To compare such datasets one needs to use
--reference-devicesand--compare-devicesto indicate subsets of devices from reference dataset and devices from compare dataset.The cardinality of sets (number of devices) must match. Using
--ignore-devicesis implied in such a case. Default values of--reference-devicesand--compare-devicesisall. Devices must match, unless--ignore-devicesis used. Even if--ignore-devicesis specified, cardinalities of sets of devices in reference and compare datasets must be equal.nvbench_compareuses the same scoping of benchmark/axis options processing as NVBench benchmarks do