fix(kubeflow): tail only first and last node logs#538
Open
ko3n1g wants to merge 20 commits into
Open
Conversation
KubeflowExecutor.fetch_logs followed every replica and forwarded all ranks to the caller, so at scale the aggregate output overran CI/runner job-log size limits (a 16-node x 8-GPU run exceeded GitLab's 128MB cap). Now it still tails every rank (kubectl logs -l <jobset> --prefix --max-log-requests num_nodes) and writes the complete multi-rank output to <job_dir>/log-allranks_0.out, but forwards only global rank 0 (node 0, [default0]) and the last global rank (node num_nodes-1, [default nproc_per_node-1]) to stdout. Downstream log validation that globs log*.out still sees every rank via the on-disk file. Signed-off-by: oliver könig <okoenig@nvidia.com>
7495a29 to
aa3a357
Compare
…tory streaming fetch_logs identified the last global rank's pod by parsing the pod name and tailed only the last `--tail <lines>` window, so on (re)attach the last rank's mid-run canonical "iteration | lm loss | ..." line (print_rank_last) was dropped — on K8s the job log showed only rank 0's "Step Time" line. Resolve the first/last pod from the authoritative batch.kubernetes.io/job-completion-index label (== torchrun PET_NODE_RANK), mapped from the --prefix pod name and refreshed on every (re)connect (gang restarts spawn new pod names), and stream each pod's full history (--tail=-1) so no mid-run line is missed. All ranks are still written to log-allranks_0.out; only global rank 0 and the true last global rank are forwarded to stdout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
…d heuristic Kubeflow Trainer sets torchrun PET_NODE_RANK statically from the JobSet batch.kubernetes.io/job-completion-index, so global_rank = completion_index * nproc_per_node + local_rank. Compute that explicitly and forward only global rank 0 and world_size-1 to stdout (all ranks still go to log-allranks_0.out). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
When a TrainJob with the target name already exists, launch() raised and aborted. On CI the name is derived from the experiment id (commit SHA), so a 409 is a stale leftover from a prior attempt the launcher declared FAILED after a slow pod start. That blocked setup_experiment's 'attempt N of M' retry — every retry re-collided. Now launch() deletes the stale job (cancel(wait=True)) and recreates, so the retry can actually recover. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
e5306b5 to
636ec99
Compare
torchx calls scheduler.log_iter(app_id, role_name, k=...) once per replica (k = 0..num_nodes-1). The Kubeflow log_iter ignored k and re-ran fetch_logs — which tails the entire jobset via the jobset-name selector — for every replica, producing N independent tail streams (each with its own dedup state) and N-fold-duplicating every console line (prefixed <role>/<k>). At 16 nodes that's 16x the log volume, which also overruns the CI job-log limit on long runs. Stream only for k == 0; that single tail already covers all ranks (and writes log-allranks_0.out once). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
…dup) Revert the all-ranks sliding-window dedup back to forwarding only global rank 0 (setup/config) and the last global rank (print_rank_last per-step loss), like a SLURM job log. The last rank is resolved at stream time from each pod's batch.kubernetes.io/job-completion-index label (== torchrun --node-rank $PET_NODE_RANK), so global_rank = completion_index * nproc_per_node + local_rank is deterministic without any topology enforcement. The full per-rank stream is still captured in log-allranks_0.out. Combined with the per-replica log_iter guard, this stops the N-fold duplication and yields a clean two-rank console. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
The c10d rendezvous assigns torch ranks by join order, not by JobSet completion-index, so torch's world_size-1 (print_rank_last's loss line) does NOT land on the highest completion-index. Verified on a live 16-node job: the loss prints on completion-index 9 (= num_nodes//2 + 1), local rank nproc-1 — not index 15. Forward exactly (index 0, local 0) and (index num_nodes//2 + 1, local nproc-1) so the console shows rank 0 setup + the per-step loss/throughput. Full per-rank capture remains in log-allranks_0.out. A deterministic completion-index->rank mapping (topology/static rank ordering) would let us compute this rather than match the observed slot. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
fetch_logs ran a single 'kubectl logs -l -f --max-log-requests <num_nodes>'.
That follow only attaches to pods present at start, never re-attaches to a
container that restarts, and --max-log-requests == pod count has no headroom —
so a gang/NCCL-init restart that transiently doubled the matching-pod count
errored the whole command ('maximum allowed concurrency') and silently dropped
pods. Observed: a 16-node job streamed only node-0-0; the loss rank (node-0-9)
never appeared even though it was emitting per-step loss.
- --max-log-requests = max(num_nodes*2, 8): headroom for restart-transient pods.
- Periodically re-attach (threading.Timer terminates the follow every 120s) so
pods that (re)started after the initial attach are picked up.
- Resume reconnects with --since-time (via --timestamps), tracking the max
RFC3339 stamp, so re-attaching never replays already-emitted history; only the
first attach uses --tail=-1. The kubectl timestamp is stripped from each line.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
… heuristic The console forwarded rank 0 + the loss rank using completion-index with an empirical 'num_nodes//2+1' slot for world_size-1. That's fragile: the c10d rendezvous assigns torch ranks by join order, not JobSet completion-index, so the loss rank lands on an unpredictable pod (observed: completion-index 9 was actually GROUP_RANK 15 = RANK 63 = world_size-1). Read the ground truth instead: torchrun exports GROUP_RANK into every worker's /proc/<pid>/environ, so 'kubectl exec <pod> -- ' reading it tells us exactly which pod holds GROUP_RANK 0 (RANK 0, local 0) and GROUP_RANK num_nodes-1 (RANK world_size-1, local nproc-1). Resolve the pod->GROUP_RANK map once the workers exist, cache it, and re-resolve when the rank-0/last pod is no longer covered (gang restart reshuffles ranks). Until workers come up (empty map), fall back to the completion-index-0 pod so early setup still streams. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
'kubectl logs -l ... -f' multiplexes every pod into one stream in ARRIVAL order, not timestamp order. Because the console forwards two pods (rank 0 and the last rank), their lines could interleave wrong — e.g. two rank-0 'Step Time' lines bunching before the last rank's 'iteration N' line, or a step time landing under the next iteration. Add a small reorder buffer on the forwarded (yielded) subset only: each line already carries the kubelet --timestamps value (parsed to epoch via the new _ts_epoch), so hold lines until they are older than reorder_hold_s (2s) and emit sorted by timestamp. The window comfortably absorbs cross-node clock skew + flush jitter while keeping the console near-live. The buffer is drained in order after each proc ends (re-attach) — outside finally, since yielding during generator close is unsafe. The full all-ranks debug file is untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
…verrides metadata) The executor's existing 'annotations' land on the TrainJob object. GKE multi-network attach (networking.gke.io/interfaces, for GPUDirect-RDMA/gIB) is read off the trainer POD, not the TrainJob — add pod_annotations (and pod_labels) that flow into podTemplateOverrides[].metadata, which the Kubeflow Trainer v2 CRD supports. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
On first attach the GROUP_RANK pod map is empty until the torchrun workers finish rendezvous, so _forward_to_stdout fell back to rank-0-only and the last rank's early per-step loss/throughput lines (replayed via --tail=-1) were written to log-allranks but never forwarded to stdout — the CI log silently dropped the beginning of the run until a re-attach ~120s later, by which point --since-time skips the replayed history. Poll on the first attach until both rank 0 and the last rank resolve before forwarding, capped at 600s (then fall back). The wait is gated on a non-empty pod list, so it is a no-op when pods can't be listed (no kubectl / unit tests) and engages only for real runs. Signed-off-by: oliver könig <okoenig@nvidia.com>
…mpletion-index The first-attach barrier capped the wait at 600s and then forwarded with the completion-index heuristic, which streams the wrong rank. A job can legitimately sit Pending (starved for nodes) far longer than 600s, so it would time out and mis-forward. Drop the timeout/fallback: keep polling while the job is alive and stop only when it reaches a terminal state. --tail=-1 on first attach replays history, so waiting loses nothing. Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…nk-0/last log forwarding The base kubeflow rewrite changed behavior the tests still asserted the old way: TrainJob names are now <base>-<uuid6>; a 409 cancels the stale job and recreates (idempotent) rather than raising; and fetch_logs writes every rank to <job_dir>/log-allranks_0.out while forwarding only rank-0 + the last rank to stdout. Set job_dir, patch status/time.sleep to avoid the retry-loop hang, and assert the all-ranks file + uuid-suffixed names. Signed-off-by: oliver könig <okoenig@nvidia.com>
…eload Raises codecov/patch on the diff from 58% to ~98% by exercising the previously-untested branches: GROUP_RANK resolution via worker environ (incl. the first-attach resolve barrier), rank-0/last-rank forwarding + reorder buffer, the completion-index fallback, pod-template labels/ annotations, stale kube-client reload, and the status() connection-error retry path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
…ry PVC paths package()/pull_results() already bridge launcher↔PVC via a throw-away data-mover pod, but only for the per-job code_dir. Downloading results (or persisting any auxiliary cross-run state) from another path on the volume had no public API. Add copy_to_workspace(local, remote) and copy_from_workspace(remote, local) that run the same data-mover against an arbitrary path under workdir_pvc_path, and refactor package()/pull_results() to delegate to them (behavior unchanged). Tests cover the happy path, the no-PVC no-op, and pod teardown on copy error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background & motivation
A multi-node Kubeflow
TrainJobruns one pod per node. The CI console should show rank 0 (setup/config + per-step throughput) and the last global rank (Megatron'sprint_rank_lastloss/step), while every rank is persisted for golden-value parsing. Plainkubectl logs -lresolves its pod set once and never re-attaches, so restarts or late-scheduled pods silently drop output.What changed
fetch_logs: tail every rank tolog-allranks_0.out; forward only rank 0 + the true last global rank to stdout.GROUP_RANKfrom the worker's environ (c10d assigns ranks by join order, not the JobSet completion-index). Resolve rank-0 and the last rank before forwarding, and never fall back to the completion-index heuristic — keep waiting while the job is alive (Pending / mid-rendezvous), stop only on a terminal state.--since-timeto pick up restarted/late pods; timestamp-ordered forwarding with a small reorder buffer; idempotent launch on 409; per-jobcode_dir; kube-client reload across cert rotation.labels/annotationson the TrainJob and its pods (podTemplateOverrides[].metadata).Streaming decision
flowchart TD S[fetch_logs stream] --> R{rank-0 and last GROUP_RANK resolved?} R -->|no, job still alive| W[wait + re-list pods] --> R R -->|yes| A[kubectl logs -l -f --tail=-1] A --> D[all ranks -> log-allranks_0.out] A --> F[forward rank-0 + last rank -> stdout]Technical notes
--tail=-1), so waiting for ranks to resolve loses nothing.NVIDIA-NeMo/Megatron-Bridge#4061and the CI indl/joc/nemo-ci!2422.