
[llmd] Work on Azure#930

Open
kpouget wants to merge 13 commits into openshift-psap:main from kpouget:aks

@kpouget
Contributor

@kpouget kpouget commented May 11, 2026

Summary by CodeRabbit

Release Notes

  • Chores

    • Updated CI testing configuration to support Azure H100 preset and adjusted feature toggles
    • Enhanced HTTPRoute information capture during LLM inference service operations
    • Disabled specific visualization reports in inference diagnostics
  • Bug Fixes

    • Refined flavor filtering in baseline comparison report generation


@openshift-ci

openshift-ci Bot commented May 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sjmonson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented May 11, 2026

Warning

Rate limit exceeded

@kpouget has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 59 minutes and 39 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a443a793-2aa6-40c6-96f3-fbba6313cda7

📥 Commits

Reviewing files that changed from the base of the PR and between 16f23b4 and 7edd510.

📒 Files selected for processing (10)
  • projects/llm-d/testing/config.yaml
  • projects/llm-d/testing/llmisvcs/llmisvc-pd.yaml
  • projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml
  • projects/llm-d/testing/prepare_llmd.py
  • projects/llm-d/testing/test_llmd.py
  • projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml
  • projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml
  • projects/llm-d/visualizations/llmd_inference/data/plots.yaml
  • projects/llm-d/visualizations/llmd_inference/data/reports.yaml
  • projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py
📝 Walkthrough

Walkthrough

This PR updates LLM-D testing infrastructure by introducing a new Azure H100 CI preset with related configuration toggles, adding HTTPRoute capture tasks to Ansible playbooks for improved debugging during service deployment, and disabling specific visualization reports while streamlining baseline flavor filtering in plotting logic.

Changes

CI Preset and Feature Configuration

| Layer / File(s) | Summary |
|---|---|
| Azure H100 Preset Addition (projects/llm-d/testing/config.yaml) | New azure_h100 CI preset is defined, extending azure, gpt-oss, and pvc_rwx, with the namespace set to kpouget-dev and the storage class set to azurefile-csi-nfs-premium. The preset is added to ci_presets.to_apply. |
| Feature Toggles and Configuration (projects/llm-d/testing/config.yaml) | prepare.operators.skip, prepare.grafana.skip, prepare.monitoring.skip, and prepare.preload.skip are enabled; prepare.rhoai.tag is updated to a new image hash; and the opt-125m preset's vLLM --max-model-len=2048 argument is commented out. |
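
To make the "extending" relationship concrete, here is a minimal Python sketch of how a preset that extends other presets could be flattened and applied to a config tree. It is not TOPSAIL's implementation: the `extends` key, the dotted key names, and every value other than the kpouget-dev namespace and the azurefile-csi-nfs-premium storage class are assumptions made for illustration.

```python
from copy import deepcopy

# Hypothetical preset table. Only the preset names, the namespace value and the
# storage class value come from the walkthrough; all keys are placeholders.
PRESETS = {
    "azure": {"cluster.platform": "azure"},
    "gpt-oss": {"tests.llmd.model": "gpt-oss"},
    "pvc_rwx": {"prepare.pvc.access_mode": "ReadWriteMany"},
    "azure_h100": {
        "extends": ["azure", "gpt-oss", "pvc_rwx"],
        "prepare.namespace.name": "kpouget-dev",
        "prepare.pvc.storage_class": "azurefile-csi-nfs-premium",
    },
}


def resolve_preset(name, presets, _path=frozenset()):
    """Flatten a preset: parents are applied first, so the child's keys win."""
    if name in _path:
        raise ValueError(f"preset cycle detected at {name!r}")
    preset = presets[name]
    resolved = {}
    for parent in preset.get("extends", []):
        resolved.update(resolve_preset(parent, presets, _path | {name}))
    resolved.update({k: v for k, v in preset.items() if k != "extends"})
    return resolved


def set_dotted(config, dotted_key, value):
    """Assign value at a dotted path, creating intermediate dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def apply_preset(config, name, presets):
    """Return a copy of config with the flattened preset applied on top."""
    result = deepcopy(config)
    for key, value in resolve_preset(name, presets).items():
        set_dotted(result, key, value)
    return result


base_config = {"prepare": {"namespace": {"name": "default-ns"}}}
azure_config = apply_preset(base_config, "azure_h100", PRESETS)
```

Because `apply_preset` works on a copy, the base config stays neutral unless a preset is explicitly selected, which is the behaviour the inline review comments on config.yaml ask for.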

HTTPRoute Debug Artifact Capture

| Layer / File(s) | Summary |
|---|---|
| Inference Service State Capture (projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml) | Two new Ansible tasks capture HTTPRoute YAML and status output for the target LLMInferenceService, filtered by router component label and service name, and store artifacts in the same directory. |
| Service Deployment Debug Tasks (projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml) | Two new always-block tasks query HTTPRoute resources and write YAML and status outputs to separate artifact files, with failure suppression to ensure the deployment workflow continues if HTTPRoute is unavailable. |
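
The PR implements this capture as Ansible tasks; the Python sketch below only illustrates the same pattern (query HTTPRoutes by label selector, write the output to an artifact file, never fail the caller). The helper names are hypothetical, and the label selector is left as a parameter because the walkthrough does not give the actual label keys.

```python
import subprocess
from pathlib import Path


def httproute_command(namespace, selector, output=None):
    """Build the kubectl argv listing HTTPRoutes that match a label selector."""
    cmd = ["kubectl", "get", "httproute", "-n", namespace, "-l", selector]
    if output:
        cmd += ["-o", output]  # e.g. "yaml" for the full objects
    return cmd


def capture(cmd, dest):
    """Run cmd and store stdout and stderr in dest.

    Returns True on success and False otherwise, but never raises on a
    non-zero exit or a missing binary, so the calling workflow can continue.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
        text = proc.stdout + proc.stderr
        ok = proc.returncode == 0
    except OSError as exc:  # for example, kubectl is not installed
        text = f"capture failed: {exc}\n"
        ok = False
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(text)
    return ok
```

A caller would invoke `capture(httproute_command(ns, selector, "yaml"), artifacts_dir / "httproute.yaml")` once for the YAML dump and once without `output` for the status table; the artifact file names here are placeholders, not the ones used by the toolbox.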

Visualization Report Configuration

| Layer / File(s) | Summary |
|---|---|
| Report Entries and Flavor Handling (projects/llm-d/visualizations/llmd_inference/data/plots.yaml, projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py) | Report entries for GuideLLM Results, Performance Analysis, and Prometheus/VLLM metrics are commented out in the generate list. The simple-tp4-x4 baseline flavor now silently skips instead of appending a skip message to the report. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • openshift-psap/topsail#916: Both PRs modify testing configuration and CI presets; main PR skips "simple-tp4-x4" in plotting while retrieved PR adds it to test flavors.
  • openshift-psap/topsail#911: Both PRs modify the same Ansible task files to enhance HTTPRoute capture and service debugging.
  • openshift-psap/topsail#926: Both PRs modify throughput_comparisons.py to skip the "simple-tp4-x4" baseline flavor.

Poem

🐰 A rabbit hops through config files today,
New presets bloom in the Azure way,
HTTPRoutes captured in tasks so bright,
Reports go quiet, flavors skip right,
Testing flows clean, debugging takes flight! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title '[llmd] Work on Azure' is vague and generic, using the term 'Work on' which doesn't convey specific information about the changeset's primary modifications. | Revise the title to be more specific about the main change, such as '[llmd] Add Azure H100 configuration and capture HTTPRoute artifacts' or '[llmd] Configure Azure H100 preset and enhance route debugging'. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py (1)

324-335: ⚡ Quick win

Remove unused filter_flavors function or refactor to use it.

The filter_flavors function is defined but never called within BaselineComparisonsReport.do_plot. Meanwhile, the newly added flavor filtering at lines 348-349 uses a simple if/continue pattern instead.

The other report classes (IntelligentRoutingComparisonsReport and PDComparisonsReport) define identical filter_flavors functions and actually use them for filtering. For consistency and to avoid dead code, consider either:

  1. Remove this unused function (simpler if the inline skip is sufficient), or
  2. Refactor lines 348-349 to use filter_flavors for consistency with other reports.
Option 1: Remove the unused function
-        def filter_flavors(setting_lists, flavor_filter):
-            """Filter flavors from setting_lists based on provided filter function"""
-            updated_setting_lists = []
-            for setting_list in setting_lists:
-                if setting_list and setting_list[0][0] == 'flavor':
-                    # Apply the filter function to flavors
-                    filtered_flavors = [(k, v) for k, v in setting_list if flavor_filter(v)]
-                    if filtered_flavors:
-                        updated_setting_lists.append(filtered_flavors)
-                else:
-                    updated_setting_lists.append(setting_list)
-            setting_lists[:] = updated_setting_lists
-
         # Get simple flavors
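
Option 2 can be sketched as follows. The function body is taken from the diff above and made standalone; the shape of `setting_lists` (a list of lists of `(key, value)` pairs) is inferred from that body, and the sample data, including the `rate` key, is hypothetical. The inline check at lines 348-349 is not shown in this review, so this is an illustration of the refactor, not the code in the PR.

```python
def filter_flavors(setting_lists, flavor_filter):
    """Filter flavors from setting_lists in place, based on flavor_filter."""
    updated_setting_lists = []
    for setting_list in setting_lists:
        if setting_list and setting_list[0][0] == "flavor":
            # Keep only the flavors accepted by the filter function
            filtered_flavors = [(k, v) for k, v in setting_list if flavor_filter(v)]
            if filtered_flavors:
                updated_setting_lists.append(filtered_flavors)
        else:
            # Non-flavor settings pass through untouched
            updated_setting_lists.append(setting_list)
    # Slice assignment mutates the caller's list instead of rebinding the name
    setting_lists[:] = updated_setting_lists


# Hypothetical input: one flavor axis and one unrelated axis
setting_lists = [
    [("flavor", "simple"), ("flavor", "simple-tp2"), ("flavor", "simple-tp4-x4")],
    [("rate", 1), ("rate", 8)],
]

# Drop the baseline flavor that the inline if/continue currently skips
filter_flavors(setting_lists, lambda flavor: flavor != "simple-tp4-x4")
```

Filtering up front keeps the skip rule in one place and matches how IntelligentRoutingComparisonsReport and PDComparisonsReport are described as using the same helper.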
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py`
around lines 324 - 335, The file defines a helper function filter_flavors that
is never used in BaselineComparisonsReport.do_plot while that method instead
uses an inline "if/continue" flavor check; either delete the unused
filter_flavors function to remove dead code, or replace the inline flavor skip
in BaselineComparisonsReport.do_plot with a call to
filter_flavors(setting_lists, flavor_filter) so behavior matches
IntelligentRoutingComparisonsReport and PDComparisonsReport—update any variable
names to match the existing signature and ensure setting_lists is mutated in
place as filter_flavors does.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 7: The config currently forces the azure-specific preset by setting
to_apply: [azure_h100], which causes all runs to inherit Azure behavior
(including the kpouget-dev namespace); remove or unset the global to_apply entry
so the default config is neutral and require selecting the azure_h100 preset
explicitly in CI/job inputs (or move the preset into the specific job/pipeline
entries that need it) to avoid leaking Azure-specific settings into non-Azure
workflows.
- Line 138: The root-level "skip: true" toggle in
projects/llm-d/testing/config.yaml is globally scoped and will disable baseline
prepare steps for all presets; move these Azure-specific skip flags out of the
top-level prepare.*.skip and instead set them only inside the Azure CI preset
(e.g., ci_presets.azure_h100 -> prepare.*.skip) so defaults remain
unchanged—update the three occurrences noted (lines referenced in the review) by
removing or setting them false at root and adding the skip:true entries within
the ci_presets.azure_h100 overrides for the corresponding prepare sections.

---

Nitpick comments:
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py`:
- Around line 324-335: The file defines a helper function filter_flavors that is
never used in BaselineComparisonsReport.do_plot while that method instead uses
an inline "if/continue" flavor check; either delete the unused filter_flavors
function to remove dead code, or replace the inline flavor skip in
BaselineComparisonsReport.do_plot with a call to filter_flavors(setting_lists,
flavor_filter) so behavior matches IntelligentRoutingComparisonsReport and
PDComparisonsReport—update any variable names to match the existing signature
and ensure setting_lists is mutated in place as filter_flavors does.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7ee7db4-319b-46ac-96f9-5edcb3d1ede1

📥 Commits

Reviewing files that changed from the base of the PR and between 0240e09 and 16f23b4.

📒 Files selected for processing (5)
  • projects/llm-d/testing/config.yaml
  • projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml
  • projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml
  • projects/llm-d/visualizations/llmd_inference/data/plots.yaml
  • projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py

Comment thread projects/llm-d/testing/config.yaml Outdated
Comment thread projects/llm-d/testing/config.yaml Outdated
@kpouget kpouget force-pushed the aks branch 3 times, most recently from ec5f6fe to ad0254c, on May 20, 2026 11:13
@kpouget
Contributor Author

kpouget commented May 20, 2026

/test jump-ci llm-d azure_h100 baseline-flavors
/cluster aks-h100
/only test_ci

@psap-forge-bot

🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 20 minutes 16 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors

@kpouget
Contributor Author

kpouget commented May 20, 2026

/test jump-ci llm-d azure_h100 baseline-flavors
/cluster aks-h100
/only test_ci

@psap-forge-bot

🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 21 minutes 40 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors

@kpouget
Contributor Author

kpouget commented May 20, 2026

/test jump-ci llm-d azure_h100 baseline-flavors guidellm_heterogeneous_eval
/cluster aks-h100
/only test_ci

@psap-forge-bot

🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 21 minutes 21 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval

@kpouget
Contributor Author

kpouget commented May 20, 2026

/test jump-ci llm-d azure_h100 intelligentrouting-flavors guidellm_heterogeneous_eval
/cluster aks-h100
/only test_ci

@kpouget
Contributor Author

kpouget commented May 20, 2026

/test jump-ci llm-d azure_h100 baseline-flavors guidellm_heterogeneous_eval llama-70b
/cluster aks-h100
/only test_ci

@psap-forge-bot

🔴 Test of 'llm-d test test_ci' failed after 00 hours 00 minutes 40 seconds. 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: intelligentrouting-flavors
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval

Failure indicator: Empty. (See run.log)

@psap-forge-bot

🟢 Test of 'llm-d test test_ci' succeeded after 02 hours 16 minutes 44 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: intelligentrouting-flavors
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval

@kpouget
Contributor Author

kpouget commented May 21, 2026

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci

@psap-forge-bot

🔴 Test of 'llm-d test test_ci' failed after 00 hours 23 minutes 44 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: llama-70b

Failure indicator:

/tmp/topsail_202605211779347575/001__llm_d_testing/000__flavor_simple/000__llmd__deploy_llm_inference_service/FAILURE | [000__llmd__deploy_llm_inference_service] ./run_toolbox.py llmd deploy_llm_inference_service --name=llm-d-simple --namespace=kpouget-dev --yaml_file=/tmp/topsail_202605211779347575/001__llm_d_testing/000__flavor_simple/llmisvc-simple.yaml --> 2


@kpouget
Contributor Author

kpouget commented May 21, 2026

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci
/var tests.llmd.flavors: [simple-tp2]

@kpouget
Contributor Author

kpouget commented May 21, 2026

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci
/var tests.llmd.flavors: simple-tp2

@psap-forge-bot

🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 08 minutes 02 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: llama-70b
tests.llmd.flavors: simple-tp2
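
The relationship between the slash commands in the comments and the "Test configuration" blocks reported by the bot can be sketched in Python. This is inferred only from the comment and result pairs in this thread, not from the bot's source: the function name is hypothetical, the second word after `/test` (here `llm-d`) is assumed to be a project name that is not counted as a positional argument, and the handling of `/cluster` and `/only` is a guess.

```python
def parse_test_comment(comment):
    """Split a PR comment into reported config entries and other directives."""
    config = {}
    directives = {}
    for raw_line in comment.splitlines():
        line = raw_line.strip()
        if not line.startswith("/"):
            continue  # ignore free-form text
        name, _, rest = line[1:].partition(" ")
        rest = rest.strip()
        if name == "test":
            # "/test <test-name> <project> <args...>": the project is skipped,
            # the test name becomes positional argument 0
            test_name, _project, *args = rest.split()
            for index, arg in enumerate([test_name, *args]):
                config[f"PR_POSITIONAL_ARG_{index}"] = arg
        elif name == "var":
            # "/var key: value" is reported verbatim as "key: value"
            key, _, value = rest.partition(":")
            config[key.strip()] = value.strip()
        else:
            directives[name] = rest
    return config, directives


comment = """\
/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci
/var tests.llmd.flavors: simple-tp2
"""
config, directives = parse_test_comment(comment)
```

For the last comment in this thread, `config` holds the same PR_POSITIONAL_ARG_0 to PR_POSITIONAL_ARG_3 entries and the `tests.llmd.flavors` override that the bot reported above.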
