Add IdleTimeAnalyser: classify GPU idle gaps by root cause by ajassani · Pull Request #620 · AMD-AGI/TraceLens

ajassani · 2026-05-04T22:59:57Z

Summary

Adds an IdleTimeAnalyser module that classifies every gap between consecutive GPU events (kernels, memcpy, memset) by root cause. Instead of reporting GPU idle time as a single number, each idle interval is classified along two axes so engineers get actionable information for optimization:

drain_type — sync_drain (a sync event drained the queue while the CPU was blocked) or no_sync (the CPU could not submit work fast enough)
cpu_during_gap — LAUNCH_ANOMALY, RUNTIME_DOMINATED, or CPU_DOMINATED
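
As a rough illustration of what "gap between consecutive GPU events" means, the sketch below extracts idle intervals from a list of GPU events. It is not the module's code: the `GpuEvent` type, the `find_idle_gaps` helper, and the `min_gap_us` parameter are hypothetical names, and timestamps are assumed to be in microseconds as in PyTorch profiler traces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuEvent:
    """Hypothetical record for a kernel, memcpy, or memset event."""
    name: str
    ts: float   # start time in microseconds (assumed unit)
    dur: float  # duration in microseconds (assumed unit)


def find_idle_gaps(events, min_gap_us=0.0):
    """Return (gap_start, gap_end) pairs where no GPU event is running."""
    gaps = []
    busy_until = None  # latest end time seen so far
    for ev in sorted(events, key=lambda e: e.ts):
        # A gap exists only if the next event starts after everything
        # seen so far has finished (events may overlap across streams).
        if busy_until is not None and ev.ts - busy_until > min_gap_us:
            gaps.append((busy_until, ev.ts))
        end = ev.ts + ev.dur
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


events = [
    GpuEvent("kernel_a", 0.0, 10.0),
    GpuEvent("memcpy", 5.0, 10.0),    # overlaps kernel_a, ends at 15
    GpuEvent("kernel_b", 20.0, 5.0),  # starts 5 µs after the GPU went idle
    GpuEvent("kernel_c", 25.0, 1.0),  # back-to-back, no gap
]
gaps = find_idle_gaps(events)
```

Tracking the running maximum end time, rather than the previous event's end, keeps overlapping events on different streams from producing spurious gaps.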

What is new

TraceLens/IdleTimeAnalyser/ module (classify.py, report.py)
TraceLens/Reporting/classify_idle_time.py and idle_time_extension.py for perf-report integration
cudaLaunchKernelExC added to LAUNCH_NAMES
LAUNCH_ANOMALY now requires launch_to_exec > 25% of gap for non-prequeued cases (reduces false positives)
Regression tests in tests/test_classify_idle_time.py including a DDP resnet18 trace fixture
Docs: docs/idle_time_guide.md, docs/idle_time_classification.md

Test plan

Rebase onto latest main to clean up unrelated diffs
Validate classification on additional traces beyond the DDP resnet18 fixture
Address review feedback

Classifies each GPU idle interval along two axes:
- drain_type: why the GPU queue became empty (sync_drain vs starved)
- cpu_during_gap: what the CPU was doing (LAUNCH_ANOMALY, LAUNCH_OVERHEAD_ONLY, RUNTIME_DOMINATED, CPU_DOMINATED, CPU_UNTRACED)

Includes:
- classify_idle_time.py: core classifier with OverlapIndex for fast queries, self-time based dominant op detection, augmented trace generation, Excel reports
- idle_time_extension.py: integration hook for generate_perf_report_pytorch
- docs/idle_time_classification.md: column documentation for all three sheets (idle_overview, idle_summary, idle_intervals)
- tests/test_classify_idle_time.py: 43 unit tests

Made-with: Cursor
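
The commit mentions an OverlapIndex for fast queries but does not show how it works. One plausible way to answer "which CPU events overlap this gap?" without scanning every event is sketched below. The `SimpleOverlapIndex` class and its `query` method are hypothetical and are not the PR's implementation.

```python
from bisect import bisect_left, bisect_right


class SimpleOverlapIndex:
    """Hypothetical index over (start, end, name) events, sorted by start."""

    def __init__(self, events):
        self.events = sorted(events, key=lambda e: e[0])
        self.starts = [e[0] for e in self.events]
        # prefix_max_end[i] is the largest end among events[0..i];
        # it is non-decreasing, so it can be binary-searched.
        self.prefix_max_end = []
        running = float("-inf")
        for _, end, _ in self.events:
            running = max(running, end)
            self.prefix_max_end.append(running)

    def query(self, a, b):
        """Return names of events overlapping the half-open interval [a, b)."""
        hi = bisect_left(self.starts, b)           # events with start < b
        lo = bisect_right(self.prefix_max_end, a)  # skip prefix that ended by a
        return [name for _, end, name in self.events[lo:hi] if end > a]


index = SimpleOverlapIndex([
    (0, 5, "aten::add"),
    (2, 3, "cudaLaunchKernel"),
    (10, 20, "aten::mm"),
    (15, 16, "cudaMemcpyAsync"),
])
overlapping = index.query(4, 11)
```

The two binary searches narrow the candidate range before the final filter, which helps when a trace holds many events and each gap touches only a few of them.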

Previously, any non-prequeued interval with launch_to_exec > 10µs was classified as LAUNCH_ANOMALY regardless of gap duration. This caused intervals like idle#136 in ddp_resnet18 (858µs gap, 10.2µs launch_to_exec) to be labeled LAUNCH_ANOMALY when the real bottleneck was CPU/runtime overhead consuming 99% of the gap.

Now non-prequeued LAUNCH_ANOMALY requires both:
- launch_to_exec > 10µs (absolute threshold)
- launch_to_exec > 25% of gap duration (significance check)

39 intervals across 3 traces reclassify (all had l2e/dur < 2%). Zero impact on MI325X Llama/Mistral prequeued anomalies.

Made-with: Cursor
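
The two-part rule for non-prequeued intervals can be written as a small predicate. This is a sketch under assumptions: the function name and constant names are hypothetical, and only the 10µs and 25% thresholds come from the commit message.

```python
ABS_THRESHOLD_US = 10.0  # absolute threshold from the commit message
REL_THRESHOLD = 0.25     # launch_to_exec must exceed 25% of the gap


def is_nonprequeued_launch_anomaly(gap_us, launch_to_exec_us):
    """Hypothetical check: both the absolute and relative thresholds must hold."""
    if gap_us <= 0:
        return False
    return (launch_to_exec_us > ABS_THRESHOLD_US
            and launch_to_exec_us > REL_THRESHOLD * gap_us)


# idle#136 from the commit message: 858 µs gap, 10.2 µs launch_to_exec.
# 10.2 / 858 is about 1.2% of the gap, so the significance check fails.
idle_136 = is_nonprequeued_launch_anomaly(858.0, 10.2)

# A short gap where launch latency is a large share of the idle time.
short_gap = is_nonprequeued_launch_anomaly(30.0, 12.0)
```

Under the old rule idle#136 passed on the absolute threshold alone. The relative check rejects it because launch latency accounts for only about 1.2% of the gap.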

Promote idle time classification from an external extension to a first-class TraceLens feature:
- New TraceLens/IdleTimeAnalyser/ package (classify.py, report.py, __init__.py) wrapping the classification and DataFrame logic in a reusable IdleTimeAnalyser class.
- Built-in --enable_idle_analysis and --enable_augmented_trace flags on generate_perf_report_pytorch, adding idle_overview, idle_summary, and idle_intervals sheets without an extension file.
- User-facing guide (docs/idle_time_guide.md) with background concepts, quick start, report reading walkthrough, and overhead benchmarks.
- Updated docs/generate_perf_report.md and docs/idle_time_classification.md to reference the new flags.
- 7 new tests for the wrapper and report module (50 total, all passing). Added test_classify_idle_time.py to CI workflow.
- Deprecated idle_time_extension.py with a notice pointing to the built-in flag.

Made-with: Cursor

This CUDA extended launch API variant was falling through to OTHER_RUNTIME instead of LAUNCH_STALL, misclassifying 21 intervals across H100 traces. Made-with: Cursor
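
The fall-through described here is the usual failure mode of a name-based lookup: any runtime call missing from the set lands in the default bucket. A minimal sketch, assuming LAUNCH_NAMES is a set of runtime API names. The other set members and the `runtime_subtype` helper are illustrative, not taken from the PR.

```python
# Assumed contents; only cudaLaunchKernelExC is confirmed by this PR.
LAUNCH_NAMES = {
    "cudaLaunchKernel",
    "hipLaunchKernel",
    "cudaLaunchKernelExC",  # the variant added by this commit
}


def runtime_subtype(api_name):
    """Hypothetical helper: launch calls map to LAUNCH_STALL, the rest fall through."""
    return "LAUNCH_STALL" if api_name in LAUNCH_NAMES else "OTHER_RUNTIME"


with_fix = runtime_subtype("cudaLaunchKernelExC")
other = runtime_subtype("cudaStreamSynchronize")
```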

Add a DDP resnet18 trace (no CUDA memory pool) that exercises the MEMORY_ALLOC runtime sub-type, the only classification branch not covered by existing repo traces. Also add a regression test asserting FSDP rank0 covers LAUNCH_ANOMALY, CPU_DOMINATED, LAUNCH_OVERHEAD_ONLY, and CPU_UNTRACED categories. Made-with: Cursor

ajassani added 5 commits April 23, 2026 21:15

Add cudaLaunchKernelExC to LAUNCH_NAMES

ce7f43e


This was referenced May 5, 2026

Diagnose runtime bugs [medium term] #21

Open

CPU Idle and Device Synch Analysis #532

Open

ajassani wants to merge 5 commits into main from ajassani/idle-time-classification
