Skip to content

fix(kubeflow): tail only first and last node logs#538

Open
ko3n1g wants to merge 20 commits into
mainfrom
ko3n1g/fix/kubeflow-tail-first-last-rank
Open

fix(kubeflow): tail only first and last node logs#538
ko3n1g wants to merge 20 commits into
mainfrom
ko3n1g/fix/kubeflow-tail-first-last-rank

Conversation

@ko3n1g
Copy link
Copy Markdown
Contributor

@ko3n1g ko3n1g commented May 30, 2026

Background & motivation

A multi-node Kubeflow TrainJob runs one pod per node. The CI console should show rank 0 (setup/config + per-step throughput) and the last global rank (Megatron's print_rank_last loss/step), while every rank is persisted for golden-value parsing. Plain kubectl logs -l resolves its pod set once and never re-attaches, so restarts or late-scheduled pods silently drop output.

What changed

  • fetch_logs: tail every rank to log-allranks_0.out; forward only rank 0 + the true last global rank to stdout.
  • Rank resolution: map pod → GROUP_RANK from the worker's environ (c10d assigns ranks by join order, not the JobSet completion-index). Resolve rank-0 and the last rank before forwarding, and never fall back to the completion-index heuristic — keep waiting while the job is alive (Pending / mid-rendezvous), stop only on a terminal state.
  • Robust streaming: periodic re-attach via --since-time to pick up restarted/late pods; timestamp-ordered forwarding with a small reorder buffer; idempotent launch on 409; per-job code_dir; kube-client reload across cert rotation.
  • Pod-template metadata: support labels / annotations on the TrainJob and its pods (podTemplateOverrides[].metadata).
  • Tests: updated the kubeflow unit tests for the new behavior (uuid-suffixed TrainJob names, idempotent-409 recreate, all-ranks log write + rank-0/last forwarding); all green, ruff-clean.

Streaming decision

flowchart TD
  S[fetch_logs stream] --> R{rank-0 and last GROUP_RANK resolved?}
  R -->|no, job still alive| W[wait + re-list pods] --> R
  R -->|yes| A[kubectl logs -l -f --tail=-1]
  A --> D[all ranks -> log-allranks_0.out]
  A --> F[forward rank-0 + last rank -> stdout]
Loading

Technical notes

  • First attach replays history (--tail=-1), so waiting for ranks to resolve loses nothing.
  • Used by NVIDIA-NeMo/Megatron-Bridge#4061 and the CI in dl/joc/nemo-ci!2422.

KubeflowExecutor.fetch_logs followed every replica and forwarded all ranks to
the caller, so at scale the aggregate output overran CI/runner job-log size
limits (a 16-node x 8-GPU run exceeded GitLab's 128MB cap). Now it still tails
every rank (kubectl logs -l <jobset> --prefix --max-log-requests num_nodes) and
writes the complete multi-rank output to <job_dir>/log-allranks_0.out, but
forwards only global rank 0 (node 0, [default0]) and the last global rank
(node num_nodes-1, [default nproc_per_node-1]) to stdout. Downstream log
validation that globs log*.out still sees every rank via the on-disk file.

Signed-off-by: oliver könig <okoenig@nvidia.com>
…tory streaming

fetch_logs identified the last global rank's pod by parsing the pod name and
tailed only the last `--tail <lines>` window, so on (re)attach the last rank's
mid-run canonical "iteration | lm loss | ..." line (print_rank_last) was
dropped — on K8s the job log showed only rank 0's "Step Time" line.

Resolve the first/last pod from the authoritative
batch.kubernetes.io/job-completion-index label (== torchrun PET_NODE_RANK),
mapped from the --prefix pod name and refreshed on every (re)connect (gang
restarts spawn new pod names), and stream each pod's full history (--tail=-1)
so no mid-run line is missed. All ranks are still written to
log-allranks_0.out; only global rank 0 and the true last global rank are
forwarded to stdout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…d heuristic

Kubeflow Trainer sets torchrun PET_NODE_RANK statically from the JobSet
batch.kubernetes.io/job-completion-index, so global_rank = completion_index *
nproc_per_node + local_rank. Compute that explicitly and forward only global
rank 0 and world_size-1 to stdout (all ranks still go to log-allranks_0.out).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
When a TrainJob with the target name already exists, launch() raised and aborted. On CI the name is derived from the experiment id (commit SHA), so a 409 is a stale leftover from a prior attempt the launcher declared FAILED after a slow pod start. That blocked setup_experiment's 'attempt N of M' retry — every retry re-collided. Now launch() deletes the stale job (cancel(wait=True)) and recreates, so the retry can actually recover.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
torchx calls scheduler.log_iter(app_id, role_name, k=...) once per replica
(k = 0..num_nodes-1). The Kubeflow log_iter ignored k and re-ran
fetch_logs — which tails the entire jobset via the jobset-name selector — for
every replica, producing N independent tail streams (each with its own dedup
state) and N-fold-duplicating every console line (prefixed <role>/<k>). At 16
nodes that's 16x the log volume, which also overruns the CI job-log limit on
long runs. Stream only for k == 0; that single tail already covers all ranks
(and writes log-allranks_0.out once).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…dup)

Revert the all-ranks sliding-window dedup back to forwarding only global rank 0
(setup/config) and the last global rank (print_rank_last per-step loss), like a
SLURM job log. The last rank is resolved at stream time from each pod's
batch.kubernetes.io/job-completion-index label (== torchrun --node-rank
$PET_NODE_RANK), so global_rank = completion_index * nproc_per_node +
local_rank is deterministic without any topology enforcement. The full per-rank
stream is still captured in log-allranks_0.out. Combined with the per-replica
log_iter guard, this stops the N-fold duplication and yields a clean two-rank
console.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
The c10d rendezvous assigns torch ranks by join order, not by JobSet
completion-index, so torch's world_size-1 (print_rank_last's loss line) does
NOT land on the highest completion-index. Verified on a live 16-node job:
the loss prints on completion-index 9 (= num_nodes//2 + 1), local rank
nproc-1 — not index 15. Forward exactly (index 0, local 0) and
(index num_nodes//2 + 1, local nproc-1) so the console shows rank 0 setup +
the per-step loss/throughput. Full per-rank capture remains in
log-allranks_0.out. A deterministic completion-index->rank mapping
(topology/static rank ordering) would let us compute this rather than match
the observed slot.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
ko3n1g and others added 10 commits May 31, 2026 09:44
fetch_logs ran a single 'kubectl logs -l -f --max-log-requests <num_nodes>'.
That follow only attaches to pods present at start, never re-attaches to a
container that restarts, and --max-log-requests == pod count has no headroom —
so a gang/NCCL-init restart that transiently doubled the matching-pod count
errored the whole command ('maximum allowed concurrency') and silently dropped
pods. Observed: a 16-node job streamed only node-0-0; the loss rank (node-0-9)
never appeared even though it was emitting per-step loss.

- --max-log-requests = max(num_nodes*2, 8): headroom for restart-transient pods.
- Periodically re-attach (threading.Timer terminates the follow every 120s) so
  pods that (re)started after the initial attach are picked up.
- Resume reconnects with --since-time (via --timestamps), tracking the max
  RFC3339 stamp, so re-attaching never replays already-emitted history; only the
  first attach uses --tail=-1. The kubectl timestamp is stripped from each line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
… heuristic

The console forwarded rank 0 + the loss rank using completion-index with an
empirical 'num_nodes//2+1' slot for world_size-1. That's fragile: the c10d
rendezvous assigns torch ranks by join order, not JobSet completion-index, so
the loss rank lands on an unpredictable pod (observed: completion-index 9 was
actually GROUP_RANK 15 = RANK 63 = world_size-1).

Read the ground truth instead: torchrun exports GROUP_RANK into every worker's
/proc/<pid>/environ, so 'kubectl exec <pod> -- ' reading it tells us exactly
which pod holds GROUP_RANK 0 (RANK 0, local 0) and GROUP_RANK num_nodes-1
(RANK world_size-1, local nproc-1). Resolve the pod->GROUP_RANK map once the
workers exist, cache it, and re-resolve when the rank-0/last pod is no longer
covered (gang restart reshuffles ranks). Until workers come up (empty map),
fall back to the completion-index-0 pod so early setup still streams.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
'kubectl logs -l ... -f' multiplexes every pod into one stream in ARRIVAL
order, not timestamp order. Because the console forwards two pods (rank 0 and
the last rank), their lines could interleave wrong — e.g. two rank-0 'Step
Time' lines bunching before the last rank's 'iteration N' line, or a step time
landing under the next iteration.

Add a small reorder buffer on the forwarded (yielded) subset only: each line
already carries the kubelet --timestamps value (parsed to epoch via the new
_ts_epoch), so hold lines until they are older than reorder_hold_s (2s) and
emit sorted by timestamp. The window comfortably absorbs cross-node clock skew
+ flush jitter while keeping the console near-live. The buffer is drained in
order after each proc ends (re-attach) — outside finally, since yielding during
generator close is unsafe. The full all-ranks debug file is untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…verrides metadata)

The executor's existing 'annotations' land on the TrainJob object. GKE multi-network
attach (networking.gke.io/interfaces, for GPUDirect-RDMA/gIB) is read off the trainer
POD, not the TrainJob — add pod_annotations (and pod_labels) that flow into
podTemplateOverrides[].metadata, which the Kubeflow Trainer v2 CRD supports.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
On first attach the GROUP_RANK pod map is empty until the torchrun workers
finish rendezvous, so _forward_to_stdout fell back to rank-0-only and the
last rank's early per-step loss/throughput lines (replayed via --tail=-1)
were written to log-allranks but never forwarded to stdout — the CI log
silently dropped the beginning of the run until a re-attach ~120s later,
by which point --since-time skips the replayed history.

Poll on the first attach until both rank 0 and the last rank resolve before
forwarding, capped at 600s (then fall back). The wait is gated on a
non-empty pod list, so it is a no-op when pods can't be listed (no kubectl
/ unit tests) and engages only for real runs.

Signed-off-by: oliver könig <okoenig@nvidia.com>
…mpletion-index

The first-attach barrier capped the wait at 600s and then forwarded with the
completion-index heuristic, which streams the wrong rank. A job can legitimately
sit Pending (starved for nodes) far longer than 600s, so it would time out and
mis-forward. Drop the timeout/fallback: keep polling while the job is alive and
stop only when it reaches a terminal state. --tail=-1 on first attach replays
history, so waiting loses nothing.

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…nk-0/last log forwarding

The base kubeflow rewrite changed behavior the tests still asserted the old way:
TrainJob names are now <base>-<uuid6>; a 409 cancels the stale job and recreates
(idempotent) rather than raising; and fetch_logs writes every rank to
<job_dir>/log-allranks_0.out while forwarding only rank-0 + the last rank to
stdout. Set job_dir, patch status/time.sleep to avoid the retry-loop hang, and
assert the all-ranks file + uuid-suffixed names.

Signed-off-by: oliver könig <okoenig@nvidia.com>
…eload

Raises codecov/patch on the diff from 58% to ~98% by exercising the
previously-untested branches: GROUP_RANK resolution via worker environ
(incl. the first-attach resolve barrier), rank-0/last-rank forwarding +
reorder buffer, the completion-index fallback, pod-template labels/
annotations, stale kube-client reload, and the status() connection-error
retry path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…ry PVC paths

package()/pull_results() already bridge launcher↔PVC via a throw-away data-mover
pod, but only for the per-job code_dir. Downloading results (or persisting any
auxiliary cross-run state) from another path on the volume had no public API.

Add copy_to_workspace(local, remote) and copy_from_workspace(remote, local) that
run the same data-mover against an arbitrary path under workdir_pvc_path, and
refactor package()/pull_results() to delegate to them (behavior unchanged). Tests
cover the happy path, the no-PVC no-op, and pod teardown on copy error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant