Add IdleTimeAnalyser: classify GPU idle gaps by root cause #620
Draft
ajassani wants to merge 5 commits into
Classifies each GPU idle interval along two axes:
- drain_type: why the GPU queue became empty (sync_drain vs starved)
- cpu_during_gap: what the CPU was doing (LAUNCH_ANOMALY, LAUNCH_OVERHEAD_ONLY, RUNTIME_DOMINATED, CPU_DOMINATED, CPU_UNTRACED)

Includes:
- classify_idle_time.py: core classifier with OverlapIndex for fast queries, self-time based dominant op detection, augmented trace generation, Excel reports
- idle_time_extension.py: integration hook for generate_perf_report_pytorch
- docs/idle_time_classification.md: column documentation for all three sheets (idle_overview, idle_summary, idle_intervals)
- tests/test_classify_idle_time.py: 43 unit tests

Made-with: Cursor
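As a rough illustration of what an "idle interval" means here, the sketch below computes the gaps between consecutive GPU events from a list of (start, end) timestamps. It is not the classifier from this PR: the function name `find_idle_gaps` and the tuple-based event representation are assumptions made for the example, and it does not attempt to reproduce the OverlapIndex or any classification logic.

```python
from typing import List, Tuple

# Hypothetical representation: (start_us, end_us) of one GPU event
# (kernel, memcpy, or memset). The real trace format is not shown here.
Interval = Tuple[float, float]


def find_idle_gaps(gpu_events: List[Interval]) -> List[Interval]:
    """Return the idle gaps between GPU events.

    Events are sorted by start time and overlapping events are treated
    as one busy region, so a gap is only reported when no event is running.
    """
    gaps: List[Interval] = []
    busy_until = None  # end of the busy region seen so far
    for start, end in sorted(gpu_events):
        if busy_until is not None and start > busy_until:
            gaps.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps
```

For example, events `(0, 10)`, `(5, 12)`, and `(20, 25)` overlap into one busy region ending at 12, leaving a single idle gap from 12 to 20.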
Previously, any non-prequeued interval with launch_to_exec > 10µs was classified as LAUNCH_ANOMALY regardless of gap duration. This caused intervals like idle#136 in ddp_resnet18 (858µs gap, 10.2µs launch_to_exec) to be labeled LAUNCH_ANOMALY when the real bottleneck was CPU/runtime overhead consuming 99% of the gap.

Now non-prequeued LAUNCH_ANOMALY requires both:
- launch_to_exec > 10µs (absolute threshold)
- launch_to_exec > 25% of gap duration (significance check)

39 intervals across 3 traces reclassify (all had l2e/dur < 2%). Zero impact on MI325X Llama/Mistral prequeued anomalies.

Made-with: Cursor
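The two-part rule described in this commit can be sketched as follows. This is a minimal illustration, assuming the thresholds stated above (10µs and 25%). The function and constant names are hypothetical and are not taken from classify_idle_time.py, and the prequeued path is deliberately left out.

```python
# Thresholds as stated in the commit message; the names are hypothetical.
LAUNCH_TO_EXEC_ABS_US = 10.0     # absolute threshold, in microseconds
LAUNCH_TO_EXEC_FRACTION = 0.25   # launch_to_exec must exceed 25% of the gap


def is_launch_anomaly_non_prequeued(launch_to_exec_us: float,
                                    gap_duration_us: float) -> bool:
    """Apply both checks for a non-prequeued interval."""
    if gap_duration_us <= 0:
        return False  # degenerate gap: nothing to classify
    exceeds_absolute = launch_to_exec_us > LAUNCH_TO_EXEC_ABS_US
    is_significant = launch_to_exec_us > LAUNCH_TO_EXEC_FRACTION * gap_duration_us
    return exceeds_absolute and is_significant
```

With the idle#136 numbers quoted above (858µs gap, 10.2µs launch_to_exec), the absolute check passes but 10.2µs is far below 25% of 858µs, so the interval is no longer treated as a launch anomaly.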
Promote idle time classification from an external extension to a first-class TraceLens feature:
- New TraceLens/IdleTimeAnalyser/ package (classify.py, report.py, __init__.py) wrapping the classification and DataFrame logic in a reusable IdleTimeAnalyser class.
- Built-in --enable_idle_analysis and --enable_augmented_trace flags on generate_perf_report_pytorch, adding idle_overview, idle_summary, and idle_intervals sheets without an extension file.
- User-facing guide (docs/idle_time_guide.md) with background concepts, quick start, report reading walkthrough, and overhead benchmarks.
- Updated docs/generate_perf_report.md and docs/idle_time_classification.md to reference the new flags.
- 7 new tests for the wrapper and report module (50 total, all passing). Added test_classify_idle_time.py to CI workflow.
- Deprecated idle_time_extension.py with a notice pointing to the built-in flag.

Made-with: Cursor
cudaLaunchKernelExC, a CUDA extended launch API variant, was falling through to OTHER_RUNTIME instead of LAUNCH_STALL, misclassifying 21 intervals across H100 traces. It is now included in LAUNCH_NAMES.

Made-with: Cursor
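The fall-through can be pictured with a simple name lookup. The sketch below assumes the runtime sub-type is chosen by set membership. Only cudaLaunchKernelExC is confirmed by the PR summary as a member of LAUNCH_NAMES; the other set entry and the helper name `runtime_subtype` are assumptions for illustration, and other sub-types such as MEMORY_ALLOC are omitted.

```python
# Illustrative contents only. The PR summary confirms cudaLaunchKernelExC;
# cudaLaunchKernel is assumed here as a typical launch API name.
LAUNCH_NAMES = frozenset({
    "cudaLaunchKernel",
    "cudaLaunchKernelExC",
})


def runtime_subtype(runtime_op_name: str) -> str:
    """Map a runtime op name to a sub-type (hypothetical helper).

    Any name missing from LAUNCH_NAMES falls through to OTHER_RUNTIME,
    which is how an unlisted launch variant ends up misclassified.
    """
    if runtime_op_name in LAUNCH_NAMES:
        return "LAUNCH_STALL"
    return "OTHER_RUNTIME"
```

Under this scheme, adding a launch variant to the set is the whole fix, which matches the one-line nature of the change described above.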
Add a DDP resnet18 trace (no CUDA memory pool) that exercises the MEMORY_ALLOC runtime sub-type, the only classification branch not covered by existing repo traces. Also add a regression test asserting FSDP rank0 covers LAUNCH_ANOMALY, CPU_DOMINATED, LAUNCH_OVERHEAD_ONLY, and CPU_UNTRACED categories. Made-with: Cursor
This was referenced May 5, 2026
Summary
Adds an `IdleTimeAnalyser` module that classifies every gap between consecutive GPU events (kernels, memcpy, memset) by root cause. Instead of reporting GPU idle time as a single number, each idle interval is classified along two axes so engineers get actionable information for optimization:

- `drain_type`: `sync_drain` (a sync event drained the queue while the CPU was blocked) or `no_sync` (the CPU could not submit work fast enough)
- `cpu_during_gap`: `LAUNCH_ANOMALY`, `RUNTIME_DOMINATED`, or `CPU_DOMINATED`

What is new

- `TraceLens/IdleTimeAnalyser/` module (`classify.py`, `report.py`)
- `TraceLens/Reporting/classify_idle_time.py` and `idle_time_extension.py` for perf-report integration
- `cudaLaunchKernelExC` added to `LAUNCH_NAMES`
- `LAUNCH_ANOMALY` now requires `launch_to_exec > 25%` of gap for non-prequeued cases (reduces false positives)
- `tests/test_classify_idle_time.py` including a DDP resnet18 trace fixture
- `docs/idle_time_guide.md`, `docs/idle_time_classification.md`

Test plan

- `main` to clean up unrelated diffs