fix(kubeflow): tail only first and last node logs by ko3n1g · Pull Request #538 · NVIDIA-NeMo/Run

ko3n1g · 2026-05-30T10:30:58Z

Background & motivation

A multi-node Kubeflow TrainJob runs one pod per node. The CI console should show rank 0 (setup/config + per-step throughput) and the last global rank (Megatron's print_rank_last loss/step), while every rank is persisted for golden-value parsing. Plain kubectl logs -l resolves its pod set once and never re-attaches, so restarts or late-scheduled pods silently drop output.

What changed

fetch_logs: tail every rank to log-allranks_0.out; forward only rank 0 + the true last global rank to stdout.
Rank resolution: map pod → GROUP_RANK from the worker's environ (c10d assigns ranks by join order, not the JobSet completion-index). Resolve rank-0 and the last rank before forwarding, and never fall back to the completion-index heuristic — keep waiting while the job is alive (Pending / mid-rendezvous), stop only on a terminal state.
Robust streaming: periodic re-attach via --since-time to pick up restarted/late pods; timestamp-ordered forwarding with a small reorder buffer; idempotent launch on 409; per-job code_dir; kube-client reload across cert rotation.
Pod-template metadata: support labels / annotations on the TrainJob and its pods (podTemplateOverrides[].metadata).
Tests: updated the kubeflow unit tests for the new behavior (uuid-suffixed TrainJob names, idempotent-409 recreate, all-ranks log write + rank-0/last forwarding); all green, ruff-clean.

Streaming decision

flowchart TD
  S[fetch_logs stream] --> R{rank-0 and last GROUP_RANK resolved?}
  R -->|no, job still alive| W[wait + re-list pods] --> R
  R -->|yes| A[kubectl logs -l -f --tail=-1]
  A --> D[all ranks -> log-allranks_0.out]
  A --> F[forward rank-0 + last rank -> stdout]

Technical notes

First attach replays history (--tail=-1), so waiting for ranks to resolve loses nothing.
Used by NVIDIA-NeMo/Megatron-Bridge#4061 and the CI in dl/joc/nemo-ci!2422.

KubeflowExecutor.fetch_logs followed every replica and forwarded all ranks to the caller, so at scale the aggregate output overran CI/runner job-log size limits (a 16-node x 8-GPU run exceeded GitLab's 128MB cap). Now it still tails every rank (kubectl logs -l <jobset> --prefix --max-log-requests num_nodes) and writes the complete multi-rank output to <job_dir>/log-allranks_0.out, but forwards only global rank 0 (node 0, [default0]) and the last global rank (node num_nodes-1, [default nproc_per_node-1]) to stdout. Downstream log validation that globs log*.out still sees every rank via the on-disk file. Signed-off-by: oliver könig <okoenig@nvidia.com>

…tory streaming fetch_logs identified the last global rank's pod by parsing the pod name and tailed only the last `--tail <lines>` window, so on (re)attach the last rank's mid-run canonical "iteration | lm loss | ..." line (print_rank_last) was dropped — on K8s the job log showed only rank 0's "Step Time" line. Resolve the first/last pod from the authoritative batch.kubernetes.io/job-completion-index label (== torchrun PET_NODE_RANK), mapped from the --prefix pod name and refreshed on every (re)connect (gang restarts spawn new pod names), and stream each pod's full history (--tail=-1) so no mid-run line is missed. All ranks are still written to log-allranks_0.out; only global rank 0 and the true last global rank are forwarded to stdout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

…d heuristic Kubeflow Trainer sets torchrun PET_NODE_RANK statically from the JobSet batch.kubernetes.io/job-completion-index, so global_rank = completion_index * nproc_per_node + local_rank. Compute that explicitly and forward only global rank 0 and world_size-1 to stdout (all ranks still go to log-allranks_0.out). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

When a TrainJob with the target name already exists, launch() raised and aborted. On CI the name is derived from the experiment id (commit SHA), so a 409 is a stale leftover from a prior attempt the launcher declared FAILED after a slow pod start. That blocked setup_experiment's 'attempt N of M' retry — every retry re-collided. Now launch() deletes the stale job (cancel(wait=True)) and recreates, so the retry can actually recover. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

torchx calls scheduler.log_iter(app_id, role_name, k=...) once per replica (k = 0..num_nodes-1). The Kubeflow log_iter ignored k and re-ran fetch_logs — which tails the entire jobset via the jobset-name selector — for every replica, producing N independent tail streams (each with its own dedup state) and N-fold-duplicating every console line (prefixed <role>/<k>). At 16 nodes that's 16x the log volume, which also overruns the CI job-log limit on long runs. Stream only for k == 0; that single tail already covers all ranks (and writes log-allranks_0.out once). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

…dup) Revert the all-ranks sliding-window dedup back to forwarding only global rank 0 (setup/config) and the last global rank (print_rank_last per-step loss), like a SLURM job log. The last rank is resolved at stream time from each pod's batch.kubernetes.io/job-completion-index label (== torchrun --node-rank $PET_NODE_RANK), so global_rank = completion_index * nproc_per_node + local_rank is deterministic without any topology enforcement. The full per-rank stream is still captured in log-allranks_0.out. Combined with the per-replica log_iter guard, this stops the N-fold duplication and yields a clean two-rank console. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

The c10d rendezvous assigns torch ranks by join order, not by JobSet completion-index, so torch's world_size-1 (print_rank_last's loss line) does NOT land on the highest completion-index. Verified on a live 16-node job: the loss prints on completion-index 9 (= num_nodes//2 + 1), local rank nproc-1 — not index 15. Forward exactly (index 0, local 0) and (index num_nodes//2 + 1, local nproc-1) so the console shows rank 0 setup + the per-step loss/throughput. Full per-rank capture remains in log-allranks_0.out. A deterministic completion-index->rank mapping (topology/static rank ordering) would let us compute this rather than match the observed slot. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

fetch_logs ran a single 'kubectl logs -l -f --max-log-requests <num_nodes>'. That follow only attaches to pods present at start, never re-attaches to a container that restarts, and --max-log-requests == pod count has no headroom — so a gang/NCCL-init restart that transiently doubled the matching-pod count errored the whole command ('maximum allowed concurrency') and silently dropped pods. Observed: a 16-node job streamed only node-0-0; the loss rank (node-0-9) never appeared even though it was emitting per-step loss. - --max-log-requests = max(num_nodes*2, 8): headroom for restart-transient pods. - Periodically re-attach (threading.Timer terminates the follow every 120s) so pods that (re)started after the initial attach are picked up. - Resume reconnects with --since-time (via --timestamps), tracking the max RFC3339 stamp, so re-attaching never replays already-emitted history; only the first attach uses --tail=-1. The kubectl timestamp is stripped from each line. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

… heuristic The console forwarded rank 0 + the loss rank using completion-index with an empirical 'num_nodes//2+1' slot for world_size-1. That's fragile: the c10d rendezvous assigns torch ranks by join order, not JobSet completion-index, so the loss rank lands on an unpredictable pod (observed: completion-index 9 was actually GROUP_RANK 15 = RANK 63 = world_size-1). Read the ground truth instead: torchrun exports GROUP_RANK into every worker's /proc/<pid>/environ, so 'kubectl exec <pod> -- ' reading it tells us exactly which pod holds GROUP_RANK 0 (RANK 0, local 0) and GROUP_RANK num_nodes-1 (RANK world_size-1, local nproc-1). Resolve the pod->GROUP_RANK map once the workers exist, cache it, and re-resolve when the rank-0/last pod is no longer covered (gang restart reshuffles ranks). Until workers come up (empty map), fall back to the completion-index-0 pod so early setup still streams. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

'kubectl logs -l ... -f' multiplexes every pod into one stream in ARRIVAL order, not timestamp order. Because the console forwards two pods (rank 0 and the last rank), their lines could interleave wrong — e.g. two rank-0 'Step Time' lines bunching before the last rank's 'iteration N' line, or a step time landing under the next iteration. Add a small reorder buffer on the forwarded (yielded) subset only: each line already carries the kubelet --timestamps value (parsed to epoch via the new _ts_epoch), so hold lines until they are older than reorder_hold_s (2s) and emit sorted by timestamp. The window comfortably absorbs cross-node clock skew + flush jitter while keeping the console near-live. The buffer is drained in order after each proc ends (re-attach) — outside finally, since yielding during generator close is unsafe. The full all-ranks debug file is untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

…verrides metadata) The executor's existing 'annotations' land on the TrainJob object. GKE multi-network attach (networking.gke.io/interfaces, for GPUDirect-RDMA/gIB) is read off the trainer POD, not the TrainJob — add pod_annotations (and pod_labels) that flow into podTemplateOverrides[].metadata, which the Kubeflow Trainer v2 CRD supports. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

On first attach the GROUP_RANK pod map is empty until the torchrun workers finish rendezvous, so _forward_to_stdout fell back to rank-0-only and the last rank's early per-step loss/throughput lines (replayed via --tail=-1) were written to log-allranks but never forwarded to stdout — the CI log silently dropped the beginning of the run until a re-attach ~120s later, by which point --since-time skips the replayed history. Poll on the first attach until both rank 0 and the last rank resolve before forwarding, capped at 600s (then fall back). The wait is gated on a non-empty pod list, so it is a no-op when pods can't be listed (no kubectl / unit tests) and engages only for real runs. Signed-off-by: oliver könig <okoenig@nvidia.com>

…mpletion-index The first-attach barrier capped the wait at 600s and then forwarded with the completion-index heuristic, which streams the wrong rank. A job can legitimately sit Pending (starved for nodes) far longer than 600s, so it would time out and mis-forward. Drop the timeout/fallback: keep polling while the job is alive and stop only when it reaches a terminal state. --tail=-1 on first attach replays history, so waiting loses nothing. Signed-off-by: oliver könig <okoenig@nvidia.com>

Signed-off-by: oliver könig <okoenig@nvidia.com>

…nk-0/last log forwarding The base kubeflow rewrite changed behavior the tests still asserted the old way: TrainJob names are now <base>-<uuid6>; a 409 cancels the stale job and recreates (idempotent) rather than raising; and fetch_logs writes every rank to <job_dir>/log-allranks_0.out while forwarding only rank-0 + the last rank to stdout. Set job_dir, patch status/time.sleep to avoid the retry-loop hang, and assert the all-ranks file + uuid-suffixed names. Signed-off-by: oliver könig <okoenig@nvidia.com>

…eload Raises codecov/patch on the diff from 58% to ~98% by exercising the previously-untested branches: GROUP_RANK resolution via worker environ (incl. the first-attach resolve barrier), rank-0/last-rank forwarding + reorder buffer, the completion-index fallback, pod-template labels/ annotations, stale kube-client reload, and the status() connection-error retry path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

…ry PVC paths package()/pull_results() already bridge launcher↔PVC via a throw-away data-mover pod, but only for the per-job code_dir. Downloading results (or persisting any auxiliary cross-run state) from another path on the volume had no public API. Add copy_to_workspace(local, remote) and copy_from_workspace(remote, local) that run the same data-mover against an arbitrary path under workdir_pvc_path, and refactor package()/pull_results() to delegate to them (behavior unchanged). Tests cover the happy path, the no-PVC no-op, and pod teardown on copy error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g temporarily deployed to public May 30, 2026 10:31 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 10:32 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 10:33 — with GitHub Actions Inactive

ko3n1g force-pushed the ko3n1g/fix/kubeflow-tail-first-last-rank branch from 7495a29 to aa3a357 Compare May 30, 2026 11:05

ko3n1g temporarily deployed to public May 30, 2026 11:06 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 11:07 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 11:08 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 12:02 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 12:03 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 12:04 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 12:30 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 12:32 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 12:33 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 13:42 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 13:44 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 13:45 — with GitHub Actions Inactive

ko3n1g force-pushed the ko3n1g/fix/kubeflow-tail-first-last-rank branch from e5306b5 to 636ec99 Compare May 30, 2026 23:29

ko3n1g temporarily deployed to public May 30, 2026 23:29 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 23:31 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 30, 2026 23:32 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 31, 2026 00:09 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 31, 2026 00:11 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 31, 2026 00:12 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 31, 2026 00:22 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 31, 2026 00:23 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 31, 2026 00:24 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public May 31, 2026 00:30 — with GitHub Actions Inactive

ko3n1g and others added 10 commits May 31, 2026 09:44

style(kubeflow): ruff-format kubeflow.py

2870f12

Signed-off-by: oliver könig <okoenig@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kubeflow): tail only first and last node logs#538

fix(kubeflow): tail only first and last node logs#538
ko3n1g wants to merge 20 commits into
mainfrom
ko3n1g/fix/kubeflow-tail-first-last-rank

ko3n1g commented May 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ko3n1g commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background & motivation

What changed

Streaming decision

Technical notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ko3n1g commented May 30, 2026 •

edited

Loading