[llmd] Work on Azure by kpouget · Pull Request #930 · openshift-psap/topsail

kpouget · 2026-05-11T08:19:51Z

Summary by CodeRabbit

Release Notes

Chores
- Updated CI testing configuration to support Azure H100 preset and adjusted feature toggles
- Enhanced HTTPRoute information capture during LLM inference service operations
- Disabled specific visualization reports in inference diagnostics

Bug Fixes
- Refined flavor filtering in baseline comparison report generation

openshift-ci · 2026-05-11T08:19:58Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sjmonson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-11T08:20:04Z

Warning

Rate limit exceeded

@kpouget has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 59 minutes and 39 seconds before requesting another review.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a443a793-2aa6-40c6-96f3-fbba6313cda7

📥 Commits

Reviewing files that changed from the base of the PR and between 16f23b4 and 7edd510.

📒 Files selected for processing (10)

projects/llm-d/testing/config.yaml
projects/llm-d/testing/llmisvcs/llmisvc-pd.yaml
projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml
projects/llm-d/testing/prepare_llmd.py
projects/llm-d/testing/test_llmd.py
projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml
projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml
projects/llm-d/visualizations/llmd_inference/data/plots.yaml
projects/llm-d/visualizations/llmd_inference/data/reports.yaml
projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py

📝 Walkthrough

This PR updates LLM-D testing infrastructure by introducing a new Azure H100 CI preset with related configuration toggles, adding HTTPRoute capture tasks to Ansible playbooks for improved debugging during service deployment, and disabling specific visualization reports while streamlining baseline flavor filtering in plotting logic.

Changes

CI Preset and Feature Configuration

Layer / File(s)	Summary
Azure H100 Preset Addition `projects/llm-d/testing/config.yaml`	New `azure_h100` CI preset is defined, extending `azure`, `gpt-oss`, and `pvc_rwx` with namespace set to `kpouget-dev` and storage class as `azurefile-csi-nfs-premium`. The preset is added to `ci_presets.to_apply`.
Feature Toggles and Configuration `projects/llm-d/testing/config.yaml`	`prepare.operators.skip`, `prepare.grafana.skip`, `prepare.monitoring.skip`, and `prepare.preload.skip` are enabled; `prepare.rhoai.tag` is updated to a new image hash; and the `opt-125m` preset's vLLM `--max-model-len=2048` argument is commented out.

HTTPRoute Debug Artifact Capture

Layer / File(s)	Summary
Inference Service State Capture `projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml`	Two new Ansible tasks capture HTTPRoute YAML and status output for the target LLMInferenceService, filtered by router component label and service name, and store artifacts in the same directory.
Service Deployment Debug Tasks `projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml`	Two new `always`-block tasks query HTTPRoute resources and write YAML and status outputs to separate artifact files, with failure suppression to ensure the deployment workflow continues if HTTPRoute is unavailable.
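
The capture-with-failure-suppression pattern described in the two rows above can be sketched in Python. This is a minimal illustration, not the Ansible tasks from the PR: the helper names `httproute_command` and `capture` are hypothetical, and the label selector is left to the caller because the walkthrough does not give the actual label keys.

```python
import subprocess
from pathlib import Path


def httproute_command(namespace, selector, output="yaml"):
    """Build an `oc get httproute` command line (hypothetical helper)."""
    return ["oc", "get", "httproute", "-n", namespace, "-l", selector, "-o", output]


def capture(cmd, dest, run=subprocess.run):
    """Write the command output to dest without raising if the command fails."""
    try:
        # check=False: a non-zero exit code must not abort the caller.
        proc = run(cmd, capture_output=True, text=True, check=False)
        text = proc.stdout or proc.stderr
    except OSError as exc:  # for example, the `oc` binary is not installed
        text = f"capture failed: {exc}\n"
    Path(dest).write_text(text)
    return dest
```

Calling `capture(httproute_command("kpouget-dev", "<label>=<value>"), "httproute.yaml")` would write whatever `oc` returns, or the error text if the HTTPRoute resource is unavailable, and the calling workflow would continue either way.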

Visualization Report Configuration

Layer / File(s)	Summary
Report Entries and Flavor Handling `projects/llm-d/visualizations/llmd_inference/data/plots.yaml`, `projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py`	Report entries for GuideLLM Results, Performance Analysis, and Prometheus/VLLM metrics are commented out in the generate list. The `simple-tp4-x4` baseline flavor is now skipped silently instead of appending a skip message to the report.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

openshift-psap/topsail#916: Both PRs modify testing configuration and CI presets; this PR skips "simple-tp4-x4" in plotting, while #916 adds it to the test flavors.
openshift-psap/topsail#911: Both PRs modify the same Ansible task files to enhance HTTPRoute capture and service debugging.
openshift-psap/topsail#926: Both PRs modify throughput_comparisons.py to skip the "simple-tp4-x4" baseline flavor.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title '[llmd] Work on Azure' is vague and generic, using the term 'Work on' which doesn't convey specific information about the changeset's primary modifications.	Revise the title to be more specific about the main change, such as '[llmd] Add Azure H100 configuration and capture HTTPRoute artifacts' or '[llmd] Configure Azure H100 preset and enhance route debugging'.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py (1)

324-335: ⚡ Quick win

Remove unused filter_flavors function or refactor to use it.

The filter_flavors function is defined but never called within BaselineComparisonsReport.do_plot. Meanwhile, the newly added flavor filtering at lines 348-349 uses a simple if/continue pattern instead.

The other report classes (IntelligentRoutingComparisonsReport and PDComparisonsReport) define identical filter_flavors functions and actually use them for filtering. For consistency and to avoid dead code, consider either:

1. Remove this unused function (simpler if the inline skip is sufficient), or
2. Refactor lines 348-349 to use filter_flavors for consistency with other reports.

Option 1: Remove the unused function

-        def filter_flavors(setting_lists, flavor_filter):
-            """Filter flavors from setting_lists based on provided filter function"""
-            updated_setting_lists = []
-            for setting_list in setting_lists:
-                if setting_list and setting_list[0][0] == 'flavor':
-                    # Apply the filter function to flavors
-                    filtered_flavors = [(k, v) for k, v in setting_list if flavor_filter(v)]
-                    if filtered_flavors:
-                        updated_setting_lists.append(filtered_flavors)
-                else:
-                    updated_setting_lists.append(setting_list)
-            setting_lists[:] = updated_setting_lists
-
         # Get simple flavors
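
The alternative of refactoring to use `filter_flavors` could look roughly like the sketch below. It reuses the `filter_flavors` body shown in the diff above and expresses the skip as a filter callback instead of an inline `if`/`continue`. The sample `setting_lists` value and the `rate` setting are made up for illustration; the real structure in `BaselineComparisonsReport.do_plot` may differ.

```python
def filter_flavors(setting_lists, flavor_filter):
    """Filter flavors from setting_lists based on the provided filter function."""
    updated_setting_lists = []
    for setting_list in setting_lists:
        if setting_list and setting_list[0][0] == "flavor":
            # Keep only the flavors accepted by the filter function.
            filtered_flavors = [(k, v) for k, v in setting_list if flavor_filter(v)]
            if filtered_flavors:
                updated_setting_lists.append(filtered_flavors)
        else:
            updated_setting_lists.append(setting_list)
    # Mutate in place so callers holding a reference see the filtered lists.
    setting_lists[:] = updated_setting_lists


# Illustrative data only (assumed shape: one list of (key, value) pairs per setting).
setting_lists = [
    [("flavor", "simple"), ("flavor", "simple-tp2"), ("flavor", "simple-tp4-x4")],
    [("rate", 1), ("rate", 8)],
]

# Drop the baseline flavor that the inline check currently skips.
filter_flavors(setting_lists, lambda flavor: flavor != "simple-tp4-x4")
print(setting_lists)
```

Because the list is mutated in place, the call can sit before the loop that iterates over the flavors, which matches how the review describes the other two report classes.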

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py`
around lines 324 - 335, The file defines a helper function filter_flavors that
is never used in BaselineComparisonsReport.do_plot while that method instead
uses an inline "if/continue" flavor check; either delete the unused
filter_flavors function to remove dead code, or replace the inline flavor skip
in BaselineComparisonsReport.do_plot with a call to
filter_flavors(setting_lists, flavor_filter) so behavior matches
IntelligentRoutingComparisonsReport and PDComparisonsReport—update any variable
names to match the existing signature and ensure setting_lists is mutated in
place as filter_flavors does.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 7: The config currently forces the azure-specific preset by setting
to_apply: [azure_h100], which causes all runs to inherit Azure behavior
(including the kpouget-dev namespace); remove or unset the global to_apply entry
so the default config is neutral and require selecting the azure_h100 preset
explicitly in CI/job inputs (or move the preset into the specific job/pipeline
entries that need it) to avoid leaking Azure-specific settings into non-Azure
workflows.
- Line 138: The root-level "skip: true" toggle in
projects/llm-d/testing/config.yaml is globally scoped and will disable baseline
prepare steps for all presets; move these Azure-specific skip flags out of the
top-level prepare.*.skip and instead set them only inside the Azure CI preset
(e.g., ci_presets.azure_h100 -> prepare.*.skip) so defaults remain
unchanged—update the three occurrences noted (lines referenced in the review) by
removing or setting them false at root and adding the skip:true entries within
the ci_presets.azure_h100 overrides for the corresponding prepare sections.

---

Nitpick comments:
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py`:
- Around line 324-335: The file defines a helper function filter_flavors that is
never used in BaselineComparisonsReport.do_plot while that method instead uses
an inline "if/continue" flavor check; either delete the unused filter_flavors
function to remove dead code, or replace the inline flavor skip in
BaselineComparisonsReport.do_plot with a call to filter_flavors(setting_lists,
flavor_filter) so behavior matches IntelligentRoutingComparisonsReport and
PDComparisonsReport—update any variable names to match the existing signature
and ensure setting_lists is mutated in place as filter_flavors does.
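
The scoping concern raised in the two inline comments on `config.yaml` can be illustrated with a small sketch. It assumes presets are flat maps of dotted keys applied on top of the root config; `apply_preset` is a hypothetical helper written for this example, not TOPSAIL's preset loader, and the config dict is a trimmed stand-in for the real file.

```python
import copy


def apply_preset(config, preset_name):
    """Return a copy of config with the preset's dotted-key overrides applied."""
    result = copy.deepcopy(config)
    for dotted_key, value in config["ci_presets"][preset_name].items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


# Trimmed stand-in config: root defaults stay neutral, and the
# Azure-specific skips live only inside the preset.
config = {
    "ci_presets": {
        "to_apply": [],  # nothing is forced globally
        "azure_h100": {
            "prepare.operators.skip": True,
            "prepare.monitoring.skip": True,
        },
    },
    "prepare": {
        "operators": {"skip": False},
        "monitoring": {"skip": False},
    },
}

azure_config = apply_preset(config, "azure_h100")
print(config["prepare"]["operators"]["skip"], azure_config["prepare"]["operators"]["skip"])
```

With this layout, a run that does not select `azure_h100` keeps the default prepare steps, while a run that selects it gets the skips.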

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7ee7db4-319b-46ac-96f9-5edcb3d1ede1

📥 Commits

Reviewing files that changed from the base of the PR and between 0240e09 and 16f23b4.

📒 Files selected for processing (5)

projects/llm-d/testing/config.yaml
projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml
projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml
projects/llm-d/visualizations/llmd_inference/data/plots.yaml
projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py

kpouget · 2026-05-20T11:19:10Z

/test jump-ci llm-d azure_h100 baseline-flavors
/cluster aks-h100
/only test_ci

psap-forge-bot · 2026-05-20T11:44:27Z

🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 20 minutes 16 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
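
Comparing the `/test` comment with the configuration the bot echoes back suggests how the arguments are mapped: the first word after `/test` becomes `PR_POSITIONAL_ARGS` and `PR_POSITIONAL_ARG_0`, the project name (`llm-d`) is not echoed, and the remaining words are numbered from 1. The sketch below encodes only that observed mapping. `parse_directives` is a hypothetical function, not the bot's implementation, and `/cluster` and `/only` are ignored because their effect is not shown in this thread.

```python
def parse_directives(comment):
    """Map '/test' and '/var' lines of a PR comment to the echoed configuration."""
    env = {}
    for line in comment.splitlines():
        line = line.strip()
        if line.startswith("/test "):
            words = line.split()[1:]
            if len(words) < 2:
                continue  # expect at least a job name and a project name
            job, _project, *args = words
            env["PR_POSITIONAL_ARGS"] = job
            env["PR_POSITIONAL_ARG_0"] = job
            for index, arg in enumerate(args, start=1):
                env[f"PR_POSITIONAL_ARG_{index}"] = arg
        elif line.startswith("/var "):
            # '/var key: value' is echoed back as 'key: value'.
            key, _, value = line[len("/var "):].partition(":")
            env[key.strip()] = value.strip()
    return env


comment = """/test jump-ci llm-d azure_h100 baseline-flavors
/cluster aks-h100
/only test_ci"""
for key, value in parse_directives(comment).items():
    print(f"{key}: {value}")
```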

kpouget · 2026-05-20T12:07:44Z

/test jump-ci llm-d azure_h100 baseline-flavors
/cluster aks-h100
/only test_ci

psap-forge-bot · 2026-05-20T12:37:17Z

🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 21 minutes 40 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors

kpouget · 2026-05-20T15:58:01Z

/test jump-ci llm-d azure_h100 baseline-flavors guidellm_heterogeneous_eval
/cluster aks-h100
/only test_ci

psap-forge-bot · 2026-05-20T19:25:11Z

🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 21 minutes 21 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval

kpouget · 2026-05-20T20:29:17Z

/test jump-ci llm-d azure_h100 intelligentrouting-flavors guidellm_heterogeneous_eval
/cluster aks-h100
/only test_ci

kpouget · 2026-05-20T20:33:17Z

/test jump-ci llm-d azure_h100 baseline-flavors guidellm_heterogeneous_eval llama-70b
/cluster aks-h100
/only test_ci

psap-forge-bot · 2026-05-20T20:33:33Z

🔴 Test of 'llm-d test test_ci' failed after 00 hours 00 minutes 40 seconds. 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: intelligentrouting-flavors
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval

Failure indicator: Empty. (See run.log)

psap-forge-bot · 2026-05-20T22:55:03Z

🟢 Test of 'llm-d test test_ci' succeeded after 02 hours 16 minutes 44 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: intelligentrouting-flavors
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval

kpouget · 2026-05-21T06:55:13Z

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci

psap-forge-bot · 2026-05-21T07:36:43Z

🔴 Test of 'llm-d test test_ci' failed after 00 hours 23 minutes 44 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: llama-70b

Failure indicator:

/tmp/topsail_202605211779347575/001__llm_d_testing/000__flavor_simple/000__llmd__deploy_llm_inference_service/FAILURE | [000__llmd__deploy_llm_inference_service] ./run_toolbox.py llmd deploy_llm_inference_service --name=llm-d-simple --namespace=kpouget-dev --yaml_file=/tmp/topsail_202605211779347575/001__llm_d_testing/000__flavor_simple/llmisvc-simple.yaml --> 2

kpouget · 2026-05-21T09:38:09Z

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci
/var tests.llmd.flavors: [simple-tp2]

kpouget · 2026-05-21T10:15:34Z

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci
/var tests.llmd.flavors: simple-tp2

psap-forge-bot · 2026-05-21T10:26:58Z

🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 08 minutes 02 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: llama-70b
tests.llmd.flavors: simple-tp2

coderabbitai Bot reviewed May 11, 2026

Comment thread projects/llm-d/testing/config.yaml Outdated

Comment thread projects/llm-d/testing/config.yaml Outdated

kpouget added 5 commits May 20, 2026 09:07

[llm-d] testing: config: add the 'to_apply' placeholder

5a054ce

[llm-d] testing: config: update to rhoai 3.4

2b70f5f

[llm-d] visualizations: llmd_inference: plotting/throughput_comparisons: improve

61a03fd

[llm-d] toolbox: llmd_deploy_llm_inference_service: capture the httproute

12ebcf9

[llm-d] toolbox: llmd_capture_isvc_state: capture the httproute

2769ad7

kpouget force-pushed the aks branch 3 times, most recently from ec5f6fe to ad0254c Compare May 20, 2026 11:13

kpouget added 4 commits May 20, 2026 14:07

[llm-d] testing: config: add the azure_h100 preset

d91b9c4

[llm-d] testing: add AKS H100 support

73ef433

[llm-d] testing: llmisvcs: llmisvc-simple: use the default scheduler config

62c073e

[llm-d] testing: test_llmd: allow setting a toleration to the router

fa2d523

kpouget force-pushed the aks branch from ad0254c to c83cba3 Compare May 20, 2026 12:07

kpouget force-pushed the aks branch from c83cba3 to 1efcc76 Compare May 21, 2026 06:49

[llm-d] testing: add dynamic IB configuration

f301a7d

kpouget force-pushed the aks branch from 1efcc76 to f301a7d Compare May 21, 2026 06:53

[llm-d] visualizations: llmd_inference: dedicated file for the comparison

ae52834

kpouget added 2 commits May 21, 2026 11:36

[llm-d] testing: test_llmd: correctly set the TP in the prefill container

a73819d

[llm-d] testing: config: update the PD flavors for AKS

7edd510
