Fix flaky colltrace_graph_watchdog_test by removing unreliable log assertion #2716

Open
siyengar wants to merge 1 commit into meta-pytorch:main from siyengar:export-D106600065

@siyengar
Contributor

Summary: The testDriverCheckCrashedWithWatchdog() method searched NCCL_DEBUG_FILE for the string "watchdog timeout", but the WatchdogPlugin emits that message through folly XLOG(FATAL), which writes to stderr rather than to NCCL's debug file. As a result, the assertion failed about 97% of the time. The exit code check (every worker crashed with a non-zero exit code) already validates the watchdog behavior, so the log assertion is removed.

Differential Revision: D106600065
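
To illustrate the reasoning in the summary, here is a minimal Python sketch, not the actual test code. It assumes a hypothetical worker that appends ordinary output to a debug file, writes its fatal message to stderr, and exits non-zero. The names `WORKER_SOURCE`, `run_workers`, `all_workers_crashed`, and `message_in_file` are invented for illustration and do not come from the repository.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

FATAL_MESSAGE = "watchdog timeout"

# Hypothetical worker: ordinary output goes to the debug file (argv[1]),
# but the fatal message goes to stderr, and the process exits non-zero.
WORKER_SOURCE = r"""
import sys
with open(sys.argv[1], "a") as debug_file:
    debug_file.write("worker started\n")
sys.stderr.write("watchdog timeout\n")
sys.exit(1)
"""


def run_workers(num_workers, debug_file):
    """Run each hypothetical worker to completion and collect the results."""
    return [
        subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE, str(debug_file)],
            capture_output=True,
            text=True,
        )
        for _ in range(num_workers)
    ]


def all_workers_crashed(results):
    """Exit code check: every worker must have exited with a non-zero code."""
    return all(result.returncode != 0 for result in results)


def message_in_file(path, message):
    """Log check: look for the message in the debug file."""
    return message in Path(path).read_text()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        debug_file = Path(tmp) / "debug.log"
        results = run_workers(2, debug_file)
        print("all workers crashed:", all_workers_crashed(results))
        print("message in debug file:", message_in_file(debug_file, FATAL_MESSAGE))
        print("message in stderr:", all(FATAL_MESSAGE in r.stderr for r in results))
```

In this sketch the debug file never contains the fatal message, so an assertion on it cannot pass even though every worker crashed as expected. The sketch does not model whatever made the real assertion pass occasionally; it only shows why the exit code check is the dependable signal.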

meta-cla Bot added the CLA Signed label May 28, 2026
@meta-codesync
Contributor

meta-codesync Bot commented May 28, 2026

@siyengar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106600065.

Labels

CLA Signed, fb-exported, meta-exported