
Add debug diagnostics when LLMISVC fails to become Ready #1342

Merged
threcc merged 3 commits into opendatahub-io:main from threcc:improve-llmisvc-not-ready
Apr 7, 2026

Conversation

@threcc
Contributor

@threcc threcc commented Apr 3, 2026

Pull Request

Summary

Structured diagnostic output enables automated triage by AI debugging agents and reduces time-to-root-cause for CI failures.

Changes

  • Log LLMInferenceService conditions and pod statuses when wait_for_llmisvc times out
  • Replace a bare TimeoutExpiredError (which only showed "Timed Out: Ns") with actionable diagnostic output

Example output

============================================================
  LLMISVC llmisvc-tinyllama-oci-cpu timed out in llmd-smoke
============================================================

 Conditions:
  * HTTPRoutesReady: True
  * InferencePoolReady: True
  * MainWorkloadReady: False reason=MinimumReplicasUnavailable message=Deployment does not have minimum availability.
  * PresetsCombined: True
  * Ready: False reason=MinimumReplicasUnavailable message=Deployment does not have minimum availability.
  * RouterReady: False reason=MinimumReplicasUnavailable message=Deployment does not have minimum availability.
  * SchedulerWorkloadReady: False reason=MinimumReplicasUnavailable message=Deployment does not have minimum availability.
  * WorkloadsReady: False reason=MinimumReplicasUnavailable message=Deployment does not have minimum availability.

 Pods:
  * pod=llmisvc-tinyllama-oci-cpu-kserve-8887f4b67-lp5t6 phase=Running restarts=0
  * pod=llmisvc-tinyllama-oci-cpu-kserve-router-scheduler-59c445cdh9gb9 phase=Running restarts=1 | CrashLoopBackOff: back-off 10s restarting failed container=tokenizer pod=llmisvc-tinyllama-oci-cpu-kserve-router-scheduler-59c445cdh9gb9_llmd-smoke(3c94c519-0aad-4fec-9935-bd0ebbbe82db)
============================================================
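
For reviewers, a minimal sketch of the flow (the helper name _log_llmisvc_debug_info and the timeout_sampler import follow this PR's description; the body and signatures below are illustrative, not the exact merged code):

import logging

from timeout_sampler import TimeoutExpiredError

LOGGER = logging.getLogger(__name__)


def _log_llmisvc_debug_info(llmisvc) -> None:
    # Illustrative stub: the real helper renders the Conditions, Pods and
    # namespace Warning-event sections shown in the example output above.
    LOGGER.error("LLMISVC %s timed out in %s", llmisvc.name, llmisvc.namespace)


def wait_for_llmisvc(llmisvc, timeout: int) -> None:
    """Wait for the LLMInferenceService to become Ready, dumping diagnostics on timeout."""
    try:
        # Assumption: the underlying wait raises TimeoutExpiredError on expiry.
        llmisvc.wait_for_condition(condition="Ready", status="True", timeout=timeout)
    except TimeoutExpiredError:
        _log_llmisvc_debug_info(llmisvc=llmisvc)
        raise  # keep the original failure so the test still reports the timeout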

Please review and indicate how it has been tested

  • Locally
    • Verified with lowered timeout (30s) — conditions and pods logged correctly
    • Verified with normal timeout — test passes, no debug output emitted
  • Jenkins

Summary by CodeRabbit

Release Notes

Bug Fixes

  • Enhanced timeout error diagnostics with improved logging. When service operations time out, the system now captures and logs comprehensive debugging information, including service conditions, pod statuses, container states, and restart counts. This detailed diagnostic context helps users quickly identify and resolve the underlying cause of timeout failures.

@threcc threcc requested a review from mwaykole as a code owner April 3, 2026 10:41
@github-actions

github-actions Bot commented Apr 3, 2026

The following are automatically added/executed:

  • PR size label.
  • Run pre-commit
  • Run tox
  • Add PR author as the PR assignee
  • Build image based on the PR

Available user actions:

  • To mark a PR as WIP, add /wip in a comment. To remove it, comment /wip cancel on the PR.
  • To block merging of a PR, add /hold in a comment. To unblock merging, comment /hold cancel.
  • To mark a PR as approved, add /lgtm in a comment. To remove approval, add /lgtm cancel.
    The lgtm label is removed on each new commit push.
  • To mark a PR as verified, comment /verified on the PR; to un-verify, comment /verified cancel.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target_branch_name> on the PR. If <target_branch_name> is valid
    and the current PR is merged, a cherry-picked PR will be created and linked to the current PR.
  • To build and push an image to quay, add /build-push-pr-image in a comment. This creates an image tagged
    pr-<pr_number> in the quay repository; the tag is deleted when the PR is merged or closed.
Supported labels

{'/hold', '/wip', '/cherry-pick', '/build-push-pr-image', '/lgtm', '/verified'}

@github-actions github-actions Bot added the size/m label Apr 3, 2026
@threcc threcc force-pushed the improve-llmisvc-not-ready branch from 401aab1 to cc843fb on April 3, 2026 10:43
@coderabbitai
Contributor

coderabbitai Bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Two factual edits: appended a newline to the trailing border in LLMISvcConfig.describe, and extended wait_for_llmisvc to catch TimeoutExpiredError, gather and log LLMISvc .status.conditions, related pod statuses, and recent namespace Warning events before re-raising the timeout. Potential information exposure risk noted (CWE-200, CWE-532).

Changes

Cohort / File(s) | Summary

Config Formatting — tests/model_serving/model_server/llmd/llmd_configs/config_base.py
  Appended "\n" to the trailing border entry returned by LLMISvcConfig.describe, altering the final string's trailing newline formatting.

Error Handling & Debug Logging — tests/model_serving/model_server/llmd/utils.py
  wait_for_llmisvc now catches TimeoutExpiredError and calls _log_llmisvc_debug_info before re-raising. Added helpers: _debug_info_conditions, _debug_info_pod_statuses, and _log_llmisvc_debug_info to collect/format LLMISvc conditions, per-pod/container statuses, summed restarts, and recent namespace Warning events; diagnostic collection failures are caught and logged. Also imported TimeoutExpiredError from timeout_sampler.
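
As a rough illustration of the helper shape described above (names follow the walkthrough; dict-style .get() access to status mirrors the review diffs and is an assumption, not the merged code), a conditions formatter could look like:

def _debug_info_conditions(llmisvc) -> str:
    # Sketch: render .status.conditions as the bulleted "Conditions:" block
    # shown in the PR's example output.
    conditions = (llmisvc.instance.status or {}).get("conditions") or []
    lines = []
    for condition in sorted(conditions, key=lambda cond: cond.get("type", "")):
        line = f"  * {condition.get('type', 'Unknown')}: {condition.get('status', 'Unknown')}"
        if condition.get("status") != "True":
            line += f" reason={condition.get('reason', '')} message={condition.get('message', '')}"
        lines.append(line)
    return "\n".join(lines)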

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title 'Add debug diagnostics when LLMISVC fails to become Ready' accurately and specifically describes the main change: adding diagnostic output when LLMISVC timeout occurs.
Description check | ✅ Passed | The PR description covers the summary, changes, example output, and testing verification. However, it lacks the template sections for Related Issues (Fixes, JIRA) and Additional Requirements (test image, marker/component updates).


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/model_serving/model_server/llmd/utils.py (2)

58-63: Include terminated container states in the pod section.

This only inspects state.waiting. Timeouts caused by OOMKilled or Error often surface in state.terminated or lastState.terminated, so the current output can drop the actual cause and show only restart counts.

Remediation
     for container_status in container_statuses:
-        waiting = (container_status.get("state") or {}).get("waiting")
-        if waiting:
-            reason = waiting.get("reason", "Unknown")
-            message = waiting.get("message", "")
+        state = container_status.get("state") or {}
+        details = (
+            state.get("waiting")
+            or state.get("terminated")
+            or (container_status.get("lastState") or {}).get("terminated")
+        )
+        if details:
+            reason = details.get("reason", "Unknown")
+            message = details.get("message", "")
             parts.append(f"{container_status['name']}: {reason}" + (f" - {message}" if message else ""))

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/utils.py` around lines 58 - 63, The
loop over container_statuses only checks state.waiting and thus misses root
causes in state.terminated and lastState.terminated; update the logic that
builds parts (the block referencing container_statuses and
container_status['name']) to inspect in order: state.waiting, state.terminated,
then lastState.terminated (each accessed as (container_status.get("state") or
{}).get(...) and container_status.get("lastState", {})), extract reason and
message from terminated objects (use default reason "Unknown" and empty message
if missing), and append the same formatted string (e.g., "name: reason" with
optional " - message") so terminated reasons like OOMKilled or Error are
included in the pod section.

111-121: Add an automated regression test for the timeout branch.

This branch now has observable behavior: collect diagnostics, then re-raise TimeoutExpiredError. A small pytest test that forces llmisvc.wait_for_condition() to time out and asserts _log_llmisvc_debug_info() was called would keep this path from regressing silently.

As per coding guidelines, "Code should follow python, pytest best practices".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/utils.py` around lines 111 - 121, Add a
pytest that forces the timeout branch in wait_for_llmisvc: create a mock
LLMInferenceService where LLMInferenceService.wait_for_condition raises
TimeoutExpiredError, patch or mock the module-level _log_llmisvc_debug_info to
capture calls, call wait_for_llmisvc(mock_llmisvc, timeout=1) and assert it
re-raises TimeoutExpiredError and that _log_llmisvc_debug_info was called with
the same llmisvc; use monkeypatch or unittest.mock to set the side_effect for
wait_for_condition and to spy on _log_llmisvc_debug_info so the diagnostic-call
behavior is covered and cannot regress.
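
A sketch of such a test, assuming wait_for_llmisvc and _log_llmisvc_debug_info are importable from tests/model_serving/model_server/llmd/utils.py as described (call signatures are illustrative):

from unittest import mock

import pytest
from timeout_sampler import TimeoutExpiredError

from tests.model_serving.model_server.llmd import utils


def test_wait_for_llmisvc_logs_debug_info_on_timeout():
    llmisvc = mock.MagicMock(name="llmisvc")
    llmisvc.wait_for_condition.side_effect = TimeoutExpiredError("Ready condition not met")

    with mock.patch.object(utils, "_log_llmisvc_debug_info") as log_debug:
        with pytest.raises(TimeoutExpiredError):
            utils.wait_for_llmisvc(llmisvc=llmisvc, timeout=1)

    # The diagnostics helper must run exactly once and receive the same resource,
    # however the argument is named.
    log_debug.assert_called_once()
    assert llmisvc in (log_debug.call_args.args + tuple(log_debug.call_args.kwargs.values()))
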
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/model_serving/model_server/llmd/utils.py`:
- Around line 83-108: The current _log_llmisvc_debug_info function wraps all
debug collection in one try/except so any failure (missing status, malformed
pod, Event lookup) hides all sections; change it to collect and render each
section independently by try/except around each call: call
_debug_info_conditions(llmisvc) inside its own try/except and fallback to a
clear placeholder like "(failed to collect conditions: <error>)", fetch pods via
Pod.get and build pod_section by handling exceptions per-pod (try
_debug_info_pod_status(pod) and on error include a placeholder for that pod),
and call _debug_info_namespace_events(llmisvc.client, ns) in its own try/except;
log each caught exception with LOGGER.exception or LOGGER.error including
context (e.g., "Failed to collect pods" or "Failed to collect events") but
proceed to assemble and emit the final LOGGER.error message containing whatever
sections were successfully collected.
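
One way to realize this per-section isolation (a sketch: the section labels follow this comment and the _debug_info_* helper names follow the walkthrough; this is not the merged code):

import logging

LOGGER = logging.getLogger(__name__)


def _log_llmisvc_debug_info(llmisvc) -> None:
    # Each section is collected under its own try/except so one failure
    # (missing status, malformed pod, event lookup error) cannot hide the rest.
    sections = []
    for label, collect in (
        ("Conditions", lambda: _debug_info_conditions(llmisvc)),
        ("Pods", lambda: _debug_info_pod_statuses(llmisvc)),
        ("Namespace Warning events", lambda: _debug_info_namespace_events(llmisvc)),
    ):
        try:
            sections.append(f"\n {label}:\n{collect()}")
        except Exception as exc:  # noqa: BLE001 - diagnostics must never mask the timeout
            LOGGER.exception("Failed to collect %s for %s", label, llmisvc.name)
            sections.append(f"\n {label}:\n  (failed to collect: {exc})")
    LOGGER.error("LLMISVC %s timed out in %s%s", llmisvc.name, llmisvc.namespace, "".join(sections))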

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 50906fc5-35fc-4169-ad0c-3d9169864022

📥 Commits

Reviewing files that changed from the base of the PR and between 3c4789f and 20313b7.

📒 Files selected for processing (2)
  • tests/model_serving/model_server/llmd/llmd_configs/config_base.py
  • tests/model_serving/model_server/llmd/utils.py

@threcc threcc marked this pull request as draft April 3, 2026 11:01
@threcc threcc force-pushed the improve-llmisvc-not-ready branch from 9dd2afb to d20bdab on April 3, 2026 11:11
Signed-off-by: threcc <trecchiu@redhat.com>
@threcc threcc force-pushed the improve-llmisvc-not-ready branch from d20bdab to c962eb7 on April 3, 2026 13:35
@threcc threcc marked this pull request as ready for review April 3, 2026 13:37
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
tests/model_serving/model_server/llmd/utils.py (1)

67-94: ⚠️ Potential issue | 🟠 Major

Harden debug collection so one bad pod does not erase pod diagnostics.

_debug_info_pod_statuses still aborts the entire Pods section on a single malformed pod object, and section-level fallback drops root-cause context. Keep pod-level isolation and include exception text/context in logs.

Suggested fix
 def _debug_info_pod_statuses(llmisvc: LLMInferenceService) -> str:
@@
     lines = []
     for pod in pods:
-        phase = pod.instance.status.phase
-        all_statuses = (pod.instance.status.get("initContainerStatuses") or []) + (
-            pod.instance.status.get("containerStatuses") or []
-        )
-        restarts = sum(container_status.get("restartCount", 0) for container_status in all_statuses)
-        parts = [f"* pod={pod.name} phase={phase} restarts={restarts}"]
+        try:
+            status = pod.instance.status or {}
+            phase = status.get("phase", "Unknown")
+            all_statuses = (status.get("initContainerStatuses") or []) + (status.get("containerStatuses") or [])
+            restarts = sum(container_status.get("restartCount", 0) for container_status in all_statuses)
+            parts = [f"* pod={pod.name} phase={phase} restarts={restarts}"]

-        for container_status in all_statuses:
-            state = container_status.get("state") or {}
-            waiting = state.get("waiting")
-            if waiting:
-                # Container is currently waiting (e.g. CrashLoopBackOff, ImagePullBackOff)
-                reason = waiting.get("reason", "Unknown")
-                message = waiting.get("message", "")
-                parts.append(f"{reason}" + (f": {message}" if message else ""))
-            elif container_status.get("restartCount", 0) > 0:
-                # Container is running but has restarted — show why it last crashed
-                terminated = (container_status.get("lastState") or {}).get("terminated")
-                if terminated:
-                    parts.append(
-                        f" {container_status['name']}: last terminated"
-                        f" reason={terminated.get('reason', 'Unknown')}"
-                        f" exitCode={terminated.get('exitCode', '?')}"
-                    )
+            for container_status in all_statuses:
+                state = container_status.get("state") or {}
+                waiting = state.get("waiting")
+                if waiting:
+                    reason = waiting.get("reason", "Unknown")
+                    message = waiting.get("message", "")
+                    parts.append(f"{reason}" + (f": {message}" if message else ""))
+                elif container_status.get("restartCount", 0) > 0:
+                    terminated = (container_status.get("lastState") or {}).get("terminated")
+                    if terminated:
+                        parts.append(
+                            f" {container_status.get('name', 'unknown')}: last terminated"
+                            f" reason={terminated.get('reason', 'Unknown')}"
+                            f" exitCode={terminated.get('exitCode', '?')}"
+                        )

-        lines.append("  " + " | ".join(parts))
+            lines.append("  " + " | ".join(parts))
+        except Exception as exc:  # noqa: BLE001
+            LOGGER.exception("Failed to collect pod diagnostic details", pod=pod.name, namespace=llmisvc.namespace)
+            lines.append(f"  * pod={pod.name} (failed to collect details: {exc})")
     return "\n".join(lines)
@@
-        except Exception:  # noqa: BLE001
-            sections.append(f"\n {label}:\n  (failed to collect)")
+        except Exception as exc:  # noqa: BLE001
+            LOGGER.exception("Failed to collect timeout diagnostics section", section=label, llmisvc=name, namespace=ns)
+            sections.append(f"\n {label}:\n  (failed to collect: {exc})")

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

Also applies to: 106-114

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/utils.py` around lines 67 - 94, The
pod-processing loop ("for pod in pods") can raise on a malformed pod and
currently aborts building the whole Pods section; wrap the per-pod processing
(the body that computes phase, all_statuses, restarts, parts and iterates
container_status) in a try/except that catches Exception and appends a
diagnostic line for that pod (using pod.name or an index) containing the
exception message and optionally traceback text so one bad pod doesn't wipe out
diagnostics; continue to the next pod after recording the error. Ensure the
fallback remains at section-level but preserve per-pod context by recording the
exception on the same lines list used by the function (referencing variables
pods, pod, parts, lines and container_status).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/model_serving/model_server/llmd/utils.py`:
- Around line 101-109: The timeout diagnostic output currently builds sections
with only "Conditions" and "Pods"; add the missing namespace warning events
section by inserting a third tuple into the for-list such as ("Namespace Warning
events", lambda: _debug_info_namespace_warning_events(llmisvc)) so the loop
calls the function that returns namespace Warning events (use the actual helper
name if different) alongside _debug_info_conditions(llmisvc) and
_debug_info_pod_statuses(llmisvc), ensuring the final assembled sections include
namespace warning events for triage.
- Around line 42-47: The logged condition.reason and condition.message are
untrusted and must be sanitized and length-bounded before concatenation to
prevent log forging/leakage; implement a helper like
sanitize_status_field(value) that strips or escapes control characters (e.g.,
remove \r, \n, other non-printables), masks or removes known sensitive patterns,
and truncates to a safe max length (e.g., 200 chars), then call it when building
the line in the block that constructs "line" (use condition['reason'] and
condition['message']), replacing direct use with sanitized values; apply the
same helper to the other similar places noted (lines around the other two blocks
reported).
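
A sketch of such a sanitizer (the name sanitize_status_field and the 200-character cap come from this suggestion; no such helper exists in the repo yet):

import re

_MAX_FIELD_LEN = 200  # cap suggested in the review comment
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")  # covers \r, \n and other control characters


def sanitize_status_field(value: str | None) -> str:
    """Make an untrusted condition reason/message safe to embed in a single log line."""
    if not value:
        return ""
    cleaned = _CONTROL_CHARS.sub(" ", str(value)).strip()
    if len(cleaned) > _MAX_FIELD_LEN:
        cleaned = cleaned[:_MAX_FIELD_LEN] + "... (truncated)"
    return cleaned


# Example use when building a condition line:
#   line += f" reason={sanitize_status_field(condition.get('reason'))}"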

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 22ae7b4d-fb21-4251-a1f1-cffb8d362112

📥 Commits

Reviewing files that changed from the base of the PR and between 445295a and 561be3f.

📒 Files selected for processing (2)
  • tests/model_serving/model_server/llmd/llmd_configs/config_base.py
  • tests/model_serving/model_server/llmd/utils.py
✅ Files skipped from review due to trivial changes (1)
  • tests/model_serving/model_server/llmd/llmd_configs/config_base.py

@threcc threcc merged commit 8088110 into opendatahub-io:main Apr 7, 2026
9 checks passed
@github-actions

github-actions Bot commented Apr 7, 2026

Status of building tag latest: success.
Status of pushing tag latest to image registry: success.

@threcc threcc deleted the improve-llmisvc-not-ready branch April 9, 2026 14:37
