3 changes: 2 additions & 1 deletion gigl/distributed/utils/networking.py
@@ -233,7 +233,7 @@ def wait_for_readiness_signal(
readiness_uri: Uri,
timeout: float = 3600.0,
poll_interval: float = 10.0,
- log_every_n_attempts: int = 10,
+ log_every_n_attempts: int = 60,
) -> None:
"""Poll for a readiness sentinel file before initiating RPC connections.

@@ -244,6 +244,7 @@ def wait_for_readiness_signal(
Supports both GcsUri (production) and LocalUri (testing).
timeout: Maximum time in seconds to wait for the signal. Defaults to 3600.
poll_interval: Time in seconds between poll attempts. Defaults to 10.
+ log_every_n_attempts: Number of attempts between log messages. Defaults to 60.
Collaborator

Suggested change
- log_every_n_attempts: Number of attempts between log messages. Defaults to 60.
+ log_every_n_attempts: Number of attempts between log messages. Defaults to 60, i.e. with poll_interval set to 10 and log_every_n_attempts set to 60, we log every 600 seconds (every 10 minutes).

Isn't 10 minutes too long? I usually consider something to be hanging if there are no logs for more than 2-4 minutes.
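
For reference, the cadence in question is simply poll_interval multiplied by log_every_n_attempts. A minimal sketch of that arithmetic follows; both helper names are hypothetical and are not part of gigl.

```python
def log_interval_seconds(poll_interval: float, log_every_n_attempts: int) -> float:
    """Seconds between log messages for a given polling configuration."""
    return poll_interval * log_every_n_attempts


def attempts_for_interval(poll_interval: float, target_seconds: float) -> int:
    """Hypothetical helper: attempts needed to log roughly every target_seconds."""
    # Never return fewer than 1, so at least every attempt is logged.
    return max(1, round(target_seconds / poll_interval))


if __name__ == "__main__":
    # Previous default (10 attempts) vs. the default proposed in this change (60).
    print(log_interval_seconds(10.0, 10))
    print(log_interval_seconds(10.0, 60))
    # Attempts needed for a 5-minute cadence at a 10-second poll interval.
    print(attempts_for_interval(10.0, 300.0))
```

With the default poll_interval of 10 seconds, a 5-minute cadence would correspond to log_every_n_attempts=30, and a 2-4 minute cadence to a value between 12 and 24.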

Collaborator Author

We log updates every 10 minutes. I think every compute rank (e.g. num GPUs) dumping a log line every minute is probably too frequent and clogs up the logs.

I guess we can make it 5 minutes? We could also update the log message here to state when to expect the next update.
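
As a rough illustration of that idea, here is a sketch of a polling loop that logs every N attempts and states when the next update is due. It is not the actual gigl implementation: wait_for_file is a hypothetical stand-in that checks a local path instead of a Uri, the default of 30 attempts is the 5-minute value discussed above, and the sleep and clock parameters are assumptions added so the loop can be exercised without real waiting.

```python
import logging
import os
import time
from typing import Callable

logger = logging.getLogger(__name__)


def wait_for_file(
    path: str,
    timeout: float = 3600.0,
    poll_interval: float = 10.0,
    log_every_n_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll until `path` exists; return the number of attempts made.

    Hypothetical local-file stand-in for wait_for_readiness_signal.
    """
    start = clock()
    attempt = 0
    while True:
        attempt += 1
        if os.path.exists(path):
            return attempt
        elapsed = clock() - start
        if elapsed >= timeout:
            raise TimeoutError(f"{path} not found within {timeout:.0f}s")
        if attempt % log_every_n_attempts == 0:
            # Tell the reader when to expect the next message, so a quiet
            # log is not mistaken for a hang.
            next_update = poll_interval * log_every_n_attempts
            logger.info(
                "Still waiting for %s after %.0fs (attempt %d); next update in ~%.0fs.",
                path, elapsed, attempt, next_update,
            )
        sleep(poll_interval)
```

Stating the next expected update in the message itself lets a longer interval keep per-rank log volume down while still making a real hang distinguishable from a quiet wait.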


Raises:
TimeoutError: If the readiness signal is not found within the timeout.