Add debug diagnostics when LLMISVC fails to become Ready #1342
threcc merged 3 commits into opendatahub-io:main
Conversation
401aab1 to cc843fb
📝 Walkthrough
Two factual edits: appended a newline to the trailing border in
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/model_serving/model_server/llmd/utils.py (2)
58-63: Include terminated container states in the pod section.
This only inspects `state.waiting`. Timeouts caused by `OOMKilled` or `Error` often surface in `state.terminated` or `lastState.terminated`, so the current output can drop the actual cause and show only restart counts.
Remediation
```diff
 for container_status in container_statuses:
-    waiting = (container_status.get("state") or {}).get("waiting")
-    if waiting:
-        reason = waiting.get("reason", "Unknown")
-        message = waiting.get("message", "")
+    state = container_status.get("state") or {}
+    details = (
+        state.get("waiting")
+        or state.get("terminated")
+        or (container_status.get("lastState") or {}).get("terminated")
+    )
+    if details:
+        reason = details.get("reason", "Unknown")
+        message = details.get("message", "")
         parts.append(f"{container_status['name']}: {reason}" + (f" - {message}" if message else ""))
```
As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/model_serving/model_server/llmd/utils.py` around lines 58 - 63, The loop over container_statuses only checks state.waiting and thus misses root causes in state.terminated and lastState.terminated; update the logic that builds parts (the block referencing container_statuses and container_status['name']) to inspect in order: state.waiting, state.terminated, then lastState.terminated (each accessed as (container_status.get("state") or {}).get(...) and container_status.get("lastState", {})), extract reason and message from terminated objects (use default reason "Unknown" and empty message if missing), and append the same formatted string (e.g., "name: reason" with optional " - message") so terminated reasons like OOMKilled or Error are included in the pod section.
111-121: Add an automated regression test for the timeout branch.
This branch now has observable behavior: collect diagnostics, then re-raise `TimeoutExpiredError`. A small pytest test that forces `llmisvc.wait_for_condition()` to time out and asserts `_log_llmisvc_debug_info()` was called would keep this path from regressing silently (a sketch follows the AI prompt below).
As per coding guidelines, "Code should follow python, pytest best practices".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/model_serving/model_server/llmd/utils.py` around lines 111 - 121, Add a pytest that forces the timeout branch in wait_for_llmisvc: create a mock LLMInferenceService where LLMInferenceService.wait_for_condition raises TimeoutExpiredError, patch or mock the module-level _log_llmisvc_debug_info to capture calls, call wait_for_llmisvc(mock_llmisvc, timeout=1) and assert it re-raises TimeoutExpiredError and that _log_llmisvc_debug_info was called with the same llmisvc; use monkeypatch or unittest.mock to set the side_effect for wait_for_condition and to spy on _log_llmisvc_debug_info so the diagnostic-call behavior is covered and cannot regress.
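A minimal sketch of such a regression test, assuming the module imports as `tests.model_serving.model_server.llmd.utils`, that `TimeoutExpiredError` lives in `timeout_sampler`, and that `wait_for_llmisvc(llmisvc, timeout=1)` matches the real signature (the call shape is taken from the prompt above):

```python
from unittest import mock

import pytest
from timeout_sampler import TimeoutExpiredError  # assumed import path

from tests.model_serving.model_server.llmd import utils


def test_wait_for_llmisvc_logs_debug_info_on_timeout(monkeypatch):
    llmisvc = mock.MagicMock(name="llmisvc")
    # Force the timeout branch: the underlying condition wait raises.
    llmisvc.wait_for_condition.side_effect = TimeoutExpiredError("LLMISVC never became Ready")

    debug_spy = mock.MagicMock()
    monkeypatch.setattr(utils, "_log_llmisvc_debug_info", debug_spy)

    # The wrapper must re-raise TimeoutExpiredError after collecting diagnostics.
    with pytest.raises(TimeoutExpiredError):
        utils.wait_for_llmisvc(llmisvc, timeout=1)  # signature assumed from the prompt

    # Diagnostics were collected exactly once, for the same resource.
    debug_spy.assert_called_once()
    assert llmisvc in debug_spy.call_args.args or llmisvc in debug_spy.call_args.kwargs.values()
```

Patching the module-level `_log_llmisvc_debug_info` with `monkeypatch` keeps the test hermetic; no cluster access is needed to exercise the timeout branch.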
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/model_serving/model_server/llmd/utils.py`:
- Around line 83-108: The current _log_llmisvc_debug_info function wraps all
debug collection in one try/except so any failure (missing status, malformed
pod, Event lookup) hides all sections; change it to collect and render each
section independently by try/except around each call: call
_debug_info_conditions(llmisvc) inside its own try/except and fallback to a
clear placeholder like "(failed to collect conditions: <error>)", fetch pods via
Pod.get and build pod_section by handling exceptions per-pod (try
_debug_info_pod_status(pod) and on error include a placeholder for that pod),
and call _debug_info_namespace_events(llmisvc.client, ns) in its own try/except;
log each caught exception with LOGGER.exception or LOGGER.error including
context (e.g., "Failed to collect pods" or "Failed to collect events") but
proceed to assemble and emit the final LOGGER.error message containing whatever
sections were successfully collected.
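As a reference point, a rough sketch of the per-section isolation described in that prompt; the helper names and `Pod.get` come from the comment itself, while the import path, the lookup arguments, and the final message layout are assumptions rather than the repository's actual code:

```python
from ocp_resources.pod import Pod  # assumed import path


def _log_llmisvc_debug_info(llmisvc) -> None:
    # Sketch only: each section is collected independently so one failure
    # cannot hide the others. LOGGER and the _debug_info_* helpers are the
    # module's existing names per the review comment.
    ns = llmisvc.namespace

    try:
        conditions = _debug_info_conditions(llmisvc)
    except Exception as exc:  # noqa: BLE001
        LOGGER.exception("Failed to collect conditions")
        conditions = f"(failed to collect conditions: {exc})"

    pod_lines: list[str] = []
    try:
        pods = list(Pod.get(dyn_client=llmisvc.client, namespace=ns))  # lookup args assumed
    except Exception as exc:  # noqa: BLE001
        LOGGER.exception("Failed to collect pods")
        pods = []
        pod_lines.append(f"(failed to list pods: {exc})")
    for pod in pods:
        try:
            pod_lines.append(_debug_info_pod_status(pod))
        except Exception as exc:  # noqa: BLE001
            pod_lines.append(f"pod={pod.name} (failed to collect: {exc})")

    try:
        events = _debug_info_namespace_events(llmisvc.client, ns)
    except Exception as exc:  # noqa: BLE001
        LOGGER.exception("Failed to collect events")
        events = f"(failed to collect events: {exc})"

    LOGGER.error(
        "\n".join(
            [
                f"LLMInferenceService {llmisvc.name} in namespace {ns} did not become Ready.",
                "Conditions:",
                conditions,
                "Pods:",
                *pod_lines,
                "Events:",
                events,
            ]
        )
    )
```

Each section degrades to a placeholder string on failure, so a malformed pod or a failed event lookup still leaves the remaining diagnostics in the log.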
---
Nitpick comments:
In `@tests/model_serving/model_server/llmd/utils.py`:
- Around line 58-63: The loop over container_statuses only checks state.waiting
and thus misses root causes in state.terminated and lastState.terminated; update
the logic that builds parts (the block referencing container_statuses and
container_status['name']) to inspect in order: state.waiting, state.terminated,
then lastState.terminated (each accessed as (container_status.get("state") or
{}).get(...) and container_status.get("lastState", {})), extract reason and
message from terminated objects (use default reason "Unknown" and empty message
if missing), and append the same formatted string (e.g., "name: reason" with
optional " - message") so terminated reasons like OOMKilled or Error are
included in the pod section.
- Around line 111-121: Add a pytest that forces the timeout branch in
wait_for_llmisvc: create a mock LLMInferenceService where
LLMInferenceService.wait_for_condition raises TimeoutExpiredError, patch or mock
the module-level _log_llmisvc_debug_info to capture calls, call
wait_for_llmisvc(mock_llmisvc, timeout=1) and assert it re-raises
TimeoutExpiredError and that _log_llmisvc_debug_info was called with the same
llmisvc; use monkeypatch or unittest.mock to set the side_effect for
wait_for_condition and to spy on _log_llmisvc_debug_info so the diagnostic-call
behavior is covered and cannot regress.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 50906fc5-35fc-4169-ad0c-3d9169864022
📒 Files selected for processing (2)
- tests/model_serving/model_server/llmd/llmd_configs/config_base.py
- tests/model_serving/model_server/llmd/utils.py
9dd2afb to d20bdab
Signed-off-by: threcc <trecchiu@redhat.com>
d20bdab to c962eb7
Actionable comments posted: 2
♻️ Duplicate comments (1)
tests/model_serving/model_server/llmd/utils.py (1)
67-94: ⚠️ Potential issue | 🟠 Major
Harden debug collection so one bad pod does not erase pod diagnostics.
`_debug_info_pod_statuses` still aborts the entire Pods section on a single malformed pod object, and section-level fallback drops root-cause context. Keep pod-level isolation and include exception text/context in logs.
Suggested fix
```diff
 def _debug_info_pod_statuses(llmisvc: LLMInferenceService) -> str:
 @@
     lines = []
     for pod in pods:
-        phase = pod.instance.status.phase
-        all_statuses = (pod.instance.status.get("initContainerStatuses") or []) + (
-            pod.instance.status.get("containerStatuses") or []
-        )
-        restarts = sum(container_status.get("restartCount", 0) for container_status in all_statuses)
-        parts = [f"* pod={pod.name} phase={phase} restarts={restarts}"]
+        try:
+            status = pod.instance.status or {}
+            phase = status.get("phase", "Unknown")
+            all_statuses = (status.get("initContainerStatuses") or []) + (status.get("containerStatuses") or [])
+            restarts = sum(container_status.get("restartCount", 0) for container_status in all_statuses)
+            parts = [f"* pod={pod.name} phase={phase} restarts={restarts}"]

-        for container_status in all_statuses:
-            state = container_status.get("state") or {}
-            waiting = state.get("waiting")
-            if waiting:
-                # Container is currently waiting (e.g. CrashLoopBackOff, ImagePullBackOff)
-                reason = waiting.get("reason", "Unknown")
-                message = waiting.get("message", "")
-                parts.append(f"{reason}" + (f": {message}" if message else ""))
-            elif container_status.get("restartCount", 0) > 0:
-                # Container is running but has restarted — show why it last crashed
-                terminated = (container_status.get("lastState") or {}).get("terminated")
-                if terminated:
-                    parts.append(
-                        f" {container_status['name']}: last terminated"
-                        f" reason={terminated.get('reason', 'Unknown')}"
-                        f" exitCode={terminated.get('exitCode', '?')}"
-                    )
+            for container_status in all_statuses:
+                state = container_status.get("state") or {}
+                waiting = state.get("waiting")
+                if waiting:
+                    reason = waiting.get("reason", "Unknown")
+                    message = waiting.get("message", "")
+                    parts.append(f"{reason}" + (f": {message}" if message else ""))
+                elif container_status.get("restartCount", 0) > 0:
+                    terminated = (container_status.get("lastState") or {}).get("terminated")
+                    if terminated:
+                        parts.append(
+                            f" {container_status.get('name', 'unknown')}: last terminated"
+                            f" reason={terminated.get('reason', 'Unknown')}"
+                            f" exitCode={terminated.get('exitCode', '?')}"
+                        )

-        lines.append(" " + " | ".join(parts))
+            lines.append(" " + " | ".join(parts))
+        except Exception as exc:  # noqa: BLE001
+            LOGGER.exception("Failed to collect pod diagnostic details", pod=pod.name, namespace=llmisvc.namespace)
+            lines.append(f" * pod={pod.name} (failed to collect details: {exc})")

     return "\n".join(lines)
 @@
-    except Exception:  # noqa: BLE001
-        sections.append(f"\n {label}:\n (failed to collect)")
+    except Exception as exc:  # noqa: BLE001
+        LOGGER.exception("Failed to collect timeout diagnostics section", section=label, llmisvc=name, namespace=ns)
+        sections.append(f"\n {label}:\n (failed to collect: {exc})")
```
As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".
Also applies to: 106-114
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/model_serving/model_server/llmd/utils.py` around lines 67 - 94, The pod-processing loop ("for pod in pods") can raise on a malformed pod and currently aborts building the whole Pods section; wrap the per-pod processing (the body that computes phase, all_statuses, restarts, parts and iterates container_status) in a try/except that catches Exception and appends a diagnostic line for that pod (using pod.name or an index) containing the exception message and optionally traceback text so one bad pod doesn't wipe out diagnostics; continue to the next pod after recording the error. Ensure the fallback remains at section-level but preserve per-pod context by recording the exception on the same lines list used by the function (referencing variables pods, pod, parts, lines and container_status).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/model_serving/model_server/llmd/utils.py`:
- Around line 101-109: The timeout diagnostic output currently builds sections
with only "Conditions" and "Pods"; add the missing namespace warning events
section by inserting a third tuple into the for-list such as ("Namespace Warning
events", lambda: _debug_info_namespace_warning_events(llmisvc)) so the loop
calls the function that returns namespace Warning events (use the actual helper
name if different) alongside _debug_info_conditions(llmisvc) and
_debug_info_pod_statuses(llmisvc), ensuring the final assembled sections include
namespace warning events for triage.
- Around line 42-47: The logged condition.reason and condition.message are
untrusted and must be sanitized and length-bounded before concatenation to
prevent log forging/leakage; implement a helper like
sanitize_status_field(value) that strips or escapes control characters (e.g.,
remove \r, \n, other non-printables), masks or removes known sensitive patterns,
and truncates to a safe max length (e.g., 200 chars), then call it when building
the line in the block that constructs "line" (use condition['reason'] and
condition['message']), replacing direct use with sanitized values; apply the
same helper to the other similar places noted (lines around the other two blocks
reported).
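For reference, a minimal sketch of the sanitizer that last prompt describes; the name `sanitize_status_field`, the 200-character cap, and the condition-dict call site follow the prompt's wording rather than existing repository code, and the suggested masking of known sensitive patterns is left out here:

```python
import re

_MAX_STATUS_FIELD_LEN = 200  # cap suggested in the review comment


def sanitize_status_field(value: str | None) -> str:
    """Bound and clean untrusted status text (reasons/messages) before logging."""
    if not value:
        return ""
    # Replace CR/LF and anything else outside printable ASCII so a crafted
    # status message cannot forge additional log lines.
    cleaned = re.sub(r"[^\x20-\x7E]", " ", str(value))
    if len(cleaned) > _MAX_STATUS_FIELD_LEN:
        cleaned = cleaned[:_MAX_STATUS_FIELD_LEN] + "...(truncated)"
    return cleaned


# Example call site when formatting a condition line (dict shape assumed):
# line = (
#     f"{condition.get('type')}={condition.get('status')}"
#     f" reason={sanitize_status_field(condition.get('reason'))}"
#     f" message={sanitize_status_field(condition.get('message'))}"
# )
```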
---
Duplicate comments:
In `@tests/model_serving/model_server/llmd/utils.py`:
- Around line 67-94: The pod-processing loop ("for pod in pods") can raise on a
malformed pod and currently aborts building the whole Pods section; wrap the
per-pod processing (the body that computes phase, all_statuses, restarts, parts
and iterates container_status) in a try/except that catches Exception and
appends a diagnostic line for that pod (using pod.name or an index) containing
the exception message and optionally traceback text so one bad pod doesn't wipe
out diagnostics; continue to the next pod after recording the error. Ensure the
fallback remains at section-level but preserve per-pod context by recording the
exception on the same lines list used by the function (referencing variables
pods, pod, parts, lines and container_status).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 22ae7b4d-fb21-4251-a1f1-cffb8d362112
📒 Files selected for processing (2)
- tests/model_serving/model_server/llmd/llmd_configs/config_base.py
- tests/model_serving/model_server/llmd/utils.py
✅ Files skipped from review due to trivial changes (1)
- tests/model_serving/model_server/llmd/llmd_configs/config_base.py
Status of building tag latest: success.
Pull Request
Summary
Structured diagnostic output enables automated triage by AI debugging agents and reduces time-to-root-cause for CI failures.
Changes
- Log debug diagnostics when `wait_for_llmisvc` times out
Example output
Please review and indicate how it has been tested
Summary by CodeRabbit
Release Notes
Bug Fixes