Add IdleTimeAnalyser: classify GPU idle gaps by root cause #620

Draft: ajassani wants to merge 5 commits into main from ajassani/idle-time-classification

ajassani (Collaborator) commented May 4, 2026

Summary

Adds an IdleTimeAnalyser module that classifies every gap between consecutive GPU events (kernels, memcpy, memset) by root cause. Instead of reporting GPU idle time as a single number, each idle interval is classified along two axes so engineers get actionable information for optimization:

  • drain_type: sync_drain (a sync event drained the queue while the CPU was blocked) or no_sync (the CPU could not submit work fast enough)
  • cpu_during_gap: LAUNCH_ANOMALY, RUNTIME_DOMINATED, or CPU_DOMINATED
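As a rough illustration of what "gap between consecutive GPU events" means here, the sketch below derives idle intervals from a list of (start, end) timestamps. It is not the IdleTimeAnalyser code: the function name `idle_gaps` and the tuple layout are assumptions made for this example, and it assumes events from all streams are merged onto a single timeline.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_us, end_us); layout is an assumption


def idle_gaps(events: List[Interval]) -> List[Interval]:
    """Return the intervals during which no GPU event is running.

    Hypothetical helper for illustration only. Events may overlap
    (e.g. multiple streams), so we track the latest end time seen so far.
    """
    gaps: List[Interval] = []
    busy_until = None
    for start, end in sorted(events):
        # A gap exists only if the next event starts after all earlier ones ended.
        if busy_until is not None and start > busy_until:
            gaps.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


# Two kernels overlap (15-20 and 18-30), so only two gaps are reported.
print(idle_gaps([(0, 10), (15, 20), (18, 30), (40, 45)]))
```

Each returned interval would then be given a drain_type and a cpu_during_gap label by the classifier.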

What is new

  • TraceLens/IdleTimeAnalyser/ module (classify.py, report.py)
  • TraceLens/Reporting/classify_idle_time.py and idle_time_extension.py for perf-report integration
  • cudaLaunchKernelExC added to LAUNCH_NAMES
  • LAUNCH_ANOMALY now requires launch_to_exec > 25% of gap for non-prequeued cases (reduces false positives)
  • Regression tests in tests/test_classify_idle_time.py including a DDP resnet18 trace fixture
  • Docs: docs/idle_time_guide.md, docs/idle_time_classification.md

Test plan

  • Rebase onto latest main to clean up unrelated diffs
  • Validate classification on additional traces beyond the DDP resnet18 fixture
  • Address review feedback

Classifies each GPU idle interval along two axes:
- drain_type: why the GPU queue became empty (sync_drain vs starved)
- cpu_during_gap: what the CPU was doing (LAUNCH_ANOMALY, LAUNCH_OVERHEAD_ONLY,
  RUNTIME_DOMINATED, CPU_DOMINATED, CPU_UNTRACED)

Includes:
- classify_idle_time.py: core classifier with OverlapIndex for fast queries,
  self-time based dominant op detection, augmented trace generation, Excel reports
- idle_time_extension.py: integration hook for generate_perf_report_pytorch
- docs/idle_time_classification.md: column documentation for all three sheets
  (idle_overview, idle_summary, idle_intervals)
- tests/test_classify_idle_time.py: 43 unit tests

Made-with: Cursor
Previously, any non-prequeued interval with launch_to_exec > 10µs was
classified as LAUNCH_ANOMALY regardless of gap duration. This caused
intervals like idle#136 in ddp_resnet18 (858µs gap, 10.2µs launch_to_exec)
to be labeled LAUNCH_ANOMALY when the real bottleneck was CPU/runtime
overhead consuming 99% of the gap.

Now non-prequeued LAUNCH_ANOMALY requires both:
  - launch_to_exec > 10µs (absolute threshold)
  - launch_to_exec > 25% of gap duration (significance check)

39 intervals across 3 traces are reclassified (all had launch_to_exec / gap duration < 2%).
Zero impact on MI325X Llama/Mistral prequeued anomalies.
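
The two-part check for non-prequeued intervals can be sketched as follows. This is a minimal illustration of the rule as described in this commit message, not the code in classify.py; the function and constant names are hypothetical.

```python
# Thresholds taken from the commit message; the names are assumptions.
ABS_THRESHOLD_US = 10.0  # launch_to_exec must exceed 10 µs
GAP_FRACTION = 0.25      # and must exceed 25% of the gap duration


def is_launch_anomaly_non_prequeued(launch_to_exec_us: float, gap_us: float) -> bool:
    """Apply both the absolute threshold and the significance check."""
    if gap_us <= 0:
        return False  # degenerate gap: nothing to classify
    return (
        launch_to_exec_us > ABS_THRESHOLD_US
        and launch_to_exec_us > GAP_FRACTION * gap_us
    )


# idle#136 from ddp_resnet18: 10.2 µs is only about 1.2% of the 858 µs gap,
# so it fails the significance check.
print(is_launch_anomaly_non_prequeued(10.2, 858.0))
```

Under the old rule only the absolute threshold applied, so the same interval would have been labeled LAUNCH_ANOMALY.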

Made-with: Cursor
Promote idle time classification from an external extension to a
first-class TraceLens feature:

- New TraceLens/IdleTimeAnalyser/ package (classify.py, report.py,
  __init__.py) wrapping the classification and DataFrame logic in a
  reusable IdleTimeAnalyser class.
- Built-in --enable_idle_analysis and --enable_augmented_trace flags
  on generate_perf_report_pytorch, adding idle_overview,
  idle_summary, and idle_intervals sheets without an extension file.
- User-facing guide (docs/idle_time_guide.md) with background
  concepts, quick start, report reading walkthrough, and overhead
  benchmarks.
- Updated docs/generate_perf_report.md and
  docs/idle_time_classification.md to reference the new flags.
- 7 new tests for the wrapper and report module (50 total, all
  passing). Added test_classify_idle_time.py to CI workflow.
- Deprecated idle_time_extension.py with a notice pointing to the
  built-in flag.

Made-with: Cursor
The CUDA extended launch API variant cudaLaunchKernelExC was falling through to
OTHER_RUNTIME instead of LAUNCH_STALL, misclassifying 21 intervals
across H100 traces.
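
A minimal sketch of the kind of name lookup involved, assuming the runtime sub-type is decided by set membership. The function name `runtime_subtype` is hypothetical, and the real LAUNCH_NAMES in TraceLens may contain more entries than the two shown here.

```python
# Assumed subset of launch API names; only cudaLaunchKernelExC is confirmed
# by this PR, cudaLaunchKernel is included as the standard CUDA launch call.
LAUNCH_NAMES = frozenset({"cudaLaunchKernel", "cudaLaunchKernelExC"})


def runtime_subtype(api_name: str) -> str:
    """Map a runtime API name to a sub-type label (illustrative only)."""
    return "LAUNCH_STALL" if api_name in LAUNCH_NAMES else "OTHER_RUNTIME"


print(runtime_subtype("cudaLaunchKernelExC"))
```

Any launch variant missing from the set silently lands in OTHER_RUNTIME, which is the failure mode this commit fixes.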

Made-with: Cursor
Add a DDP resnet18 trace (no CUDA memory pool) that exercises the
MEMORY_ALLOC runtime sub-type, the only classification branch not
covered by existing repo traces. Also add a regression test asserting
FSDP rank0 covers LAUNCH_ANOMALY, CPU_DOMINATED, LAUNCH_OVERHEAD_ONLY,
and CPU_UNTRACED categories.

Made-with: Cursor