
Add debug diagnostics when LLMISVC fails to become Ready #1342

Merged
threcc merged 3 commits into opendatahub-io:main from threcc:improve-llmisvc-not-ready
Apr 7, 2026

Conversation

@threcc
Contributor

@threcc threcc commented Apr 3, 2026

Pull Request

Summary

Structured diagnostic output enables automated triage by AI debugging agents and reduces time-to-root-cause for CI failures.

Changes

  • Log LLMInferenceService conditions and pod statuses when wait_for_llmisvc times out
  • Replace a bare TimeoutExpiredError (which only showed "Timed Out: Ns") with actionable diagnostic output

Example output

============================================================
  LLMISVC llmisvc-tinyllama-oci-cpu timed out in llmd-smoke
============================================================

 Conditions:
  * HTTPRoutesReady: True
  * InferencePoolReady: True
  * MainWorkloadReady: False reason=MinimumReplicasUnavailable message=Deployment does not have minimum availability.
  * PresetsCombined: True
  * Ready: False reason=MinimumReplicasUnavailable message=Deployment does not have minimum availability.
  * RouterReady: False reason=MinimumReplicasUnavailable message=Deployment does not have minimum availability.
  * SchedulerWorkloadReady: False reason=MinimumReplicasUnavailable message=Deployment does not have minimum availability.
  * WorkloadsReady: False reason=MinimumReplicasUnavailable message=Deployment does not have minimum availability.

 Pods:
  * pod=llmisvc-tinyllama-oci-cpu-kserve-8887f4b67-lp5t6 phase=Running restarts=0
  * pod=llmisvc-tinyllama-oci-cpu-kserve-router-scheduler-59c445cdh9gb9 phase=Running restarts=1 | CrashLoopBackOff: back-off 10s restarting failed container=tokenizer pod=llmisvc-tinyllama-oci-cpu-kserve-router-scheduler-59c445cdh9gb9_llmd-smoke(3c94c519-0aad-4fec-9935-bd0ebbbe82db)
============================================================
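
For reviewers, a minimal sketch of the flow (the helper name _log_llmisvc_debug_info and the timeout_sampler import follow this PR's description; the body and signatures below are illustrative, not the exact merged code):

import logging

from timeout_sampler import TimeoutExpiredError

LOGGER = logging.getLogger(__name__)


def _log_llmisvc_debug_info(llmisvc) -> None:
    # Illustrative stub: the real helper renders the Conditions, Pods and
    # namespace Warning-event sections shown in the example output above.
    LOGGER.error("LLMISVC %s timed out in %s", llmisvc.name, llmisvc.namespace)


def wait_for_llmisvc(llmisvc, timeout: int) -> None:
    """Wait for the LLMInferenceService to become Ready, dumping diagnostics on timeout."""
    try:
        # Assumption: the underlying wait raises TimeoutExpiredError on expiry.
        llmisvc.wait_for_condition(condition="Ready", status="True", timeout=timeout)
    except TimeoutExpiredError:
        _log_llmisvc_debug_info(llmisvc=llmisvc)
        raise  # keep the original failure so the test still reports the timeout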

Please review and indicate how it has been tested

  • Locally
    • Verified with lowered timeout (30s) — conditions and pods logged correctly
    • Verified with normal timeout — test passes, no debug output emitted
  • Jenkins

Summary by CodeRabbit

Release Notes

Bug Fixes

  • Enhanced timeout error diagnostics with improved logging. When service operations time out, the system now captures and logs comprehensive debugging information, including service conditions, pod statuses, container states, and restart counts. This detailed diagnostic context helps users quickly identify and resolve the underlying cause of timeout failures.

@threcc threcc requested a review from mwaykole as a code owner April 3, 2026 10:41
@github-actions

github-actions Bot commented Apr 3, 2026

The following are automatically added/executed:

  • PR size label.
  • Run pre-commit
  • Run tox
  • Add PR author as the PR assignee
  • Build image based on the PR

Available user actions:

  • To mark a PR as WIP, add /wip in a comment. To remove it, comment /wip cancel on the PR.
  • To block merging of a PR, add /hold in a comment. To unblock merging, comment /hold cancel.
  • To mark a PR as approved, add /lgtm in a comment. To remove approval, add /lgtm cancel.
    The lgtm label is removed on each new commit push.
  • To mark a PR as verified, comment /verified on the PR; to un-verify, comment /verified cancel.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target_branch_name> on the PR. If <target_branch_name> is valid
    and the current PR is merged, a cherry-picked PR will be created and linked to the current PR.
  • To build and push an image to quay, add /build-push-pr-image in a comment. This creates an image tagged
    pr-<pr_number> in the quay repository; the tag is deleted when the PR is merged or closed.
Supported labels

{'/hold', '/wip', '/cherry-pick', '/build-push-pr-image', '/lgtm', '/verified'}

@github-actions github-actions Bot added the size/m label Apr 3, 2026
@threcc threcc force-pushed the improve-llmisvc-not-ready branch from 401aab1 to cc843fb on April 3, 2026 10:43
@coderabbitai
Contributor

coderabbitai Bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Two factual edits: appended a newline to the trailing border in LLMISvcConfig.describe, and extended wait_for_llmisvc to catch TimeoutExpiredError, gather and log LLMISvc .status.conditions, related pod statuses, and recent namespace Warning events before re-raising the timeout. Potential information exposure risk noted (CWE-200, CWE-532).

Changes

Cohort / File(s) | Summary

Config Formatting — tests/model_serving/model_server/llmd/llmd_configs/config_base.py
  Appended "\n" to the trailing border entry returned by LLMISvcConfig.describe, altering the final string's trailing newline formatting.

Error Handling & Debug Logging — tests/model_serving/model_server/llmd/utils.py
  wait_for_llmisvc now catches TimeoutExpiredError and calls _log_llmisvc_debug_info before re-raising. Added helpers: _debug_info_conditions, _debug_info_pod_statuses, and _log_llmisvc_debug_info to collect/format LLMISvc conditions, per-pod/container statuses, summed restarts, and recent namespace Warning events; diagnostic collection failures are caught and logged. Also imported TimeoutExpiredError from timeout_sampler.
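
As a rough illustration of the helper shape described above (names follow the walkthrough; dict-style .get() access to status mirrors the review diffs and is an assumption, not the merged code), a conditions formatter could look like:

def _debug_info_conditions(llmisvc) -> str:
    # Sketch: render .status.conditions as the bulleted "Conditions:" block
    # shown in the PR's example output.
    conditions = (llmisvc.instance.status or {}).get("conditions") or []
    lines = []
    for condition in sorted(conditions, key=lambda cond: cond.get("type", "")):
        line = f"  * {condition.get('type', 'Unknown')}: {condition.get('status', 'Unknown')}"
        if condition.get("status") != "True":
            line += f" reason={condition.get('reason', '')} message={condition.get('message', '')}"
        lines.append(line)
    return "\n".join(lines)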

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title 'Add debug diagnostics when LLMISVC fails to become Ready' accurately and specifically describes the main change: adding diagnostic output when LLMISVC timeout occurs.
Description check | ✅ Passed | The PR description covers the summary, changes, example output, and testing verification. However, it lacks the template sections for Related Issues (Fixes, JIRA) and Additional Requirements (test image, marker/component updates).


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/model_serving/model_server/llmd/utils.py (2)

58-63: Include terminated container states in the pod section.

This only inspects state.waiting. Timeouts caused by OOMKilled or Error often surface in state.terminated or lastState.terminated, so the current output can drop the actual cause and show only restart counts.

Remediation
     for container_status in container_statuses:
-        waiting = (container_status.get("state") or {}).get("waiting")
-        if waiting:
-            reason = waiting.get("reason", "Unknown")
-            message = waiting.get("message", "")
+        state = container_status.get("state") or {}
+        details = (
+            state.get("waiting")
+            or state.get("terminated")
+            or (container_status.get("lastState") or {}).get("terminated")
+        )
+        if details:
+            reason = details.get("reason", "Unknown")
+            message = details.get("message", "")
             parts.append(f"{container_status['name']}: {reason}" + (f" - {message}" if message else ""))

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/utils.py` around lines 58 - 63, The
loop over container_statuses only checks state.waiting and thus misses root
causes in state.terminated and lastState.terminated; update the logic that
builds parts (the block referencing container_statuses and
container_status['name']) to inspect in order: state.waiting, state.terminated,
then lastState.terminated (each accessed as (container_status.get("state") or
{}).get(...) and container_status.get("lastState", {})), extract reason and
message from terminated objects (use default reason "Unknown" and empty message
if missing), and append the same formatted string (e.g., "name: reason" with
optional " - message") so terminated reasons like OOMKilled or Error are
included in the pod section.

111-121: Add an automated regression test for the timeout branch.

This branch now has observable behavior: collect diagnostics, then re-raise TimeoutExpiredError. A small pytest test that forces llmisvc.wait_for_condition() to time out and asserts _log_llmisvc_debug_info() was called would keep this path from regressing silently.

As per coding guidelines, "Code should follow python, pytest best practices".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/utils.py` around lines 111 - 121, Add a
pytest that forces the timeout branch in wait_for_llmisvc: create a mock
LLMInferenceService where LLMInferenceService.wait_for_condition raises
TimeoutExpiredError, patch or mock the module-level _log_llmisvc_debug_info to
capture calls, call wait_for_llmisvc(mock_llmisvc, timeout=1) and assert it
re-raises TimeoutExpiredError and that _log_llmisvc_debug_info was called with
the same llmisvc; use monkeypatch or unittest.mock to set the side_effect for
wait_for_condition and to spy on _log_llmisvc_debug_info so the diagnostic-call
behavior is covered and cannot regress.
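
A sketch of such a test, assuming wait_for_llmisvc and _log_llmisvc_debug_info are importable from tests/model_serving/model_server/llmd/utils.py as described (call signatures are illustrative):

from unittest import mock

import pytest
from timeout_sampler import TimeoutExpiredError

from tests.model_serving.model_server.llmd import utils


def test_wait_for_llmisvc_logs_debug_info_on_timeout():
    llmisvc = mock.MagicMock(name="llmisvc")
    llmisvc.wait_for_condition.side_effect = TimeoutExpiredError("Ready condition not met")

    with mock.patch.object(utils, "_log_llmisvc_debug_info") as log_debug:
        with pytest.raises(TimeoutExpiredError):
            utils.wait_for_llmisvc(llmisvc=llmisvc, timeout=1)

    # The diagnostics helper must run exactly once and receive the same resource,
    # however the argument is named.
    log_debug.assert_called_once()
    assert llmisvc in (log_debug.call_args.args + tuple(log_debug.call_args.kwargs.values()))
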
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/model_serving/model_server/llmd/utils.py`:
- Around line 83-108: The current _log_llmisvc_debug_info function wraps all
debug collection in one try/except so any failure (missing status, malformed
pod, Event lookup) hides all sections; change it to collect and render each
section independently by try/except around each call: call
_debug_info_conditions(llmisvc) inside its own try/except and fallback to a
clear placeholder like "(failed to collect conditions: <error>)", fetch pods via
Pod.get and build pod_section by handling exceptions per-pod (try
_debug_info_pod_status(pod) and on error include a placeholder for that pod),
and call _debug_info_namespace_events(llmisvc.client, ns) in its own try/except;
log each caught exception with LOGGER.exception or LOGGER.error including
context (e.g., "Failed to collect pods" or "Failed to collect events") but
proceed to assemble and emit the final LOGGER.error message containing whatever
sections were successfully collected.
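
One way to realize this per-section isolation (a sketch: the section labels follow this comment and the _debug_info_* helper names follow the walkthrough; this is not the merged code):

import logging

LOGGER = logging.getLogger(__name__)


def _log_llmisvc_debug_info(llmisvc) -> None:
    # Each section is collected under its own try/except so one failure
    # (missing status, malformed pod, event lookup error) cannot hide the rest.
    sections = []
    for label, collect in (
        ("Conditions", lambda: _debug_info_conditions(llmisvc)),
        ("Pods", lambda: _debug_info_pod_statuses(llmisvc)),
        ("Namespace Warning events", lambda: _debug_info_namespace_events(llmisvc)),
    ):
        try:
            sections.append(f"\n {label}:\n{collect()}")
        except Exception as exc:  # noqa: BLE001 - diagnostics must never mask the timeout
            LOGGER.exception("Failed to collect %s for %s", label, llmisvc.name)
            sections.append(f"\n {label}:\n  (failed to collect: {exc})")
    LOGGER.error("LLMISVC %s timed out in %s%s", llmisvc.name, llmisvc.namespace, "".join(sections))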

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 50906fc5-35fc-4169-ad0c-3d9169864022

📥 Commits

Reviewing files that changed from the base of the PR and between 3c4789f and 20313b7.

📒 Files selected for processing (2)
  • tests/model_serving/model_server/llmd/llmd_configs/config_base.py
  • tests/model_serving/model_server/llmd/utils.py

@threcc threcc marked this pull request as draft April 3, 2026 11:01
@threcc threcc force-pushed the improve-llmisvc-not-ready branch from 9dd2afb to d20bdab on April 3, 2026 11:11
Signed-off-by: threcc <trecchiu@redhat.com>
@threcc threcc force-pushed the improve-llmisvc-not-ready branch from d20bdab to c962eb7 on April 3, 2026 13:35
@threcc threcc marked this pull request as ready for review April 3, 2026 13:37
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
tests/model_serving/model_server/llmd/utils.py (1)

67-94: ⚠️ Potential issue | 🟠 Major

Harden debug collection so one bad pod does not erase pod diagnostics.

_debug_info_pod_statuses still aborts the entire Pods section on a single malformed pod object, and section-level fallback drops root-cause context. Keep pod-level isolation and include exception text/context in logs.

Suggested fix
 def _debug_info_pod_statuses(llmisvc: LLMInferenceService) -> str:
@@
     lines = []
     for pod in pods:
-        phase = pod.instance.status.phase
-        all_statuses = (pod.instance.status.get("initContainerStatuses") or []) + (
-            pod.instance.status.get("containerStatuses") or []
-        )
-        restarts = sum(container_status.get("restartCount", 0) for container_status in all_statuses)
-        parts = [f"* pod={pod.name} phase={phase} restarts={restarts}"]
+        try:
+            status = pod.instance.status or {}
+            phase = status.get("phase", "Unknown")
+            all_statuses = (status.get("initContainerStatuses") or []) + (status.get("containerStatuses") or [])
+            restarts = sum(container_status.get("restartCount", 0) for container_status in all_statuses)
+            parts = [f"* pod={pod.name} phase={phase} restarts={restarts}"]

-        for container_status in all_statuses:
-            state = container_status.get("state") or {}
-            waiting = state.get("waiting")
-            if waiting:
-                # Container is currently waiting (e.g. CrashLoopBackOff, ImagePullBackOff)
-                reason = waiting.get("reason", "Unknown")
-                message = waiting.get("message", "")
-                parts.append(f"{reason}" + (f": {message}" if message else ""))
-            elif container_status.get("restartCount", 0) > 0:
-                # Container is running but has restarted — show why it last crashed
-                terminated = (container_status.get("lastState") or {}).get("terminated")
-                if terminated:
-                    parts.append(
-                        f" {container_status['name']}: last terminated"
-                        f" reason={terminated.get('reason', 'Unknown')}"
-                        f" exitCode={terminated.get('exitCode', '?')}"
-                    )
+            for container_status in all_statuses:
+                state = container_status.get("state") or {}
+                waiting = state.get("waiting")
+                if waiting:
+                    reason = waiting.get("reason", "Unknown")
+                    message = waiting.get("message", "")
+                    parts.append(f"{reason}" + (f": {message}" if message else ""))
+                elif container_status.get("restartCount", 0) > 0:
+                    terminated = (container_status.get("lastState") or {}).get("terminated")
+                    if terminated:
+                        parts.append(
+                            f" {container_status.get('name', 'unknown')}: last terminated"
+                            f" reason={terminated.get('reason', 'Unknown')}"
+                            f" exitCode={terminated.get('exitCode', '?')}"
+                        )

-        lines.append("  " + " | ".join(parts))
+            lines.append("  " + " | ".join(parts))
+        except Exception as exc:  # noqa: BLE001
+            LOGGER.exception("Failed to collect pod diagnostic details", pod=pod.name, namespace=llmisvc.namespace)
+            lines.append(f"  * pod={pod.name} (failed to collect details: {exc})")
     return "\n".join(lines)
@@
-        except Exception:  # noqa: BLE001
-            sections.append(f"\n {label}:\n  (failed to collect)")
+        except Exception as exc:  # noqa: BLE001
+            LOGGER.exception("Failed to collect timeout diagnostics section", section=label, llmisvc=name, namespace=ns)
+            sections.append(f"\n {label}:\n  (failed to collect: {exc})")

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

Also applies to: 106-114

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/utils.py` around lines 67 - 94, The
pod-processing loop ("for pod in pods") can raise on a malformed pod and
currently aborts building the whole Pods section; wrap the per-pod processing
(the body that computes phase, all_statuses, restarts, parts and iterates
container_status) in a try/except that catches Exception and appends a
diagnostic line for that pod (using pod.name or an index) containing the
exception message and optionally traceback text so one bad pod doesn't wipe out
diagnostics; continue to the next pod after recording the error. Ensure the
fallback remains at section-level but preserve per-pod context by recording the
exception on the same lines list used by the function (referencing variables
pods, pod, parts, lines and container_status).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/model_serving/model_server/llmd/utils.py`:
- Around line 101-109: The timeout diagnostic output currently builds sections
with only "Conditions" and "Pods"; add the missing namespace warning events
section by inserting a third tuple into the for-list such as ("Namespace Warning
events", lambda: _debug_info_namespace_warning_events(llmisvc)) so the loop
calls the function that returns namespace Warning events (use the actual helper
name if different) alongside _debug_info_conditions(llmisvc) and
_debug_info_pod_statuses(llmisvc), ensuring the final assembled sections include
namespace warning events for triage.
- Around line 42-47: The logged condition.reason and condition.message are
untrusted and must be sanitized and length-bounded before concatenation to
prevent log forging/leakage; implement a helper like
sanitize_status_field(value) that strips or escapes control characters (e.g.,
remove \r, \n, other non-printables), masks or removes known sensitive patterns,
and truncates to a safe max length (e.g., 200 chars), then call it when building
the line in the block that constructs "line" (use condition['reason'] and
condition['message']), replacing direct use with sanitized values; apply the
same helper to the other similar places noted (lines around the other two blocks
reported).
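
A sketch of such a sanitizer (the name sanitize_status_field and the 200-character cap come from this suggestion; no such helper exists in the repo yet):

import re

_MAX_FIELD_LEN = 200  # cap suggested in the review comment
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")  # covers \r, \n and other control characters


def sanitize_status_field(value: str | None) -> str:
    """Make an untrusted condition reason/message safe to embed in a single log line."""
    if not value:
        return ""
    cleaned = _CONTROL_CHARS.sub(" ", str(value)).strip()
    if len(cleaned) > _MAX_FIELD_LEN:
        cleaned = cleaned[:_MAX_FIELD_LEN] + "... (truncated)"
    return cleaned


# Example use when building a condition line:
#   line += f" reason={sanitize_status_field(condition.get('reason'))}"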

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 22ae7b4d-fb21-4251-a1f1-cffb8d362112

📥 Commits

Reviewing files that changed from the base of the PR and between 445295a and 561be3f.

📒 Files selected for processing (2)
  • tests/model_serving/model_server/llmd/llmd_configs/config_base.py
  • tests/model_serving/model_server/llmd/utils.py
✅ Files skipped from review due to trivial changes (1)
  • tests/model_serving/model_server/llmd/llmd_configs/config_base.py

@threcc threcc merged commit 8088110 into opendatahub-io:main Apr 7, 2026
9 checks passed
@github-actions

github-actions Bot commented Apr 7, 2026

Status of building tag latest: success.
Status of pushing tag latest to image registry: success.

@threcc threcc deleted the improve-llmisvc-not-ready branch April 9, 2026 14:37
