feat(observability): Add opt-in observability stack #3426 — #3427

Open
abdullahpathan22 wants to merge 8 commits into kubeflow:master from abdullahpathan22:feature/observability-stack-3426

Conversation

@abdullahpathan22

This PR introduces a new common/observability/ kustomize component that provides an optional, batteries-included monitoring stack.

Key Changes:

  • Deploys to the kubeflow-monitoring-system namespace.
  • Adds Prometheus and Grafana operators.
  • Includes ServiceMonitors for NVIDIA DCGM, AMD ROCm, and Kepler.
  • Provides Grafana dashboards for GPU cluster/namespace usage and availability.
  • Adds an installation test script in tests/observability_install.sh.

Follows mentor feedback regarding namespace and operator inclusion. Opt-in only; no impact on default installations.

Fixes #3426

Copilot AI review requested due to automatic review settings March 27, 2026 14:29
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign juliusvonkohout for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment

Pull request overview

Adds a new opt-in common/observability/ kustomize component intended to provide a bundled monitoring stack (operators, GPU ServiceMonitors, dashboards) and also refactors Model Registry installation/testing by introducing reusable scripts and wiring them into CI and example manifests.

Changes:

  • Introduce common/observability base/overlays/components with operators, ServiceMonitors, Kepler manifests, and Grafana dashboards.
  • Add Model Registry install + integration test scripts and update CI workflows to use them.
  • Update example installation and central dashboard (oauth2-proxy overlay) to surface Model Registry.

Reviewed changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated 11 comments.

Summary per file:

| File | Description |
| --- | --- |
| tests/observability_install.sh | Adds a script to apply the observability kustomize base/overlay and wait for operator pods. |
| tests/model_registry_test.sh | Adds integration tests for Model Registry API and gateway access. |
| tests/model_registry_install.sh | Adds a script to install Model Registry (server, Istio, UI, catalog) and wait for readiness. |
| example/kustomization.yaml | Enables Model Registry resources in the example install. |
| common/observability/overlays/kubeflow/kustomization.yaml | Defines a Kubeflow overlay for the observability component. |
| common/observability/components/kepler/kustomization.yaml | Adds an opt-in Kepler component referencing the Kepler base. |
| common/observability/base/service-monitors/nvidia-dcgm-service-monitor.yaml | Adds ServiceMonitor for NVIDIA DCGM exporter. |
| common/observability/base/service-monitors/kustomization.yaml | Aggregates GPU/Kepler ServiceMonitors. |
| common/observability/base/service-monitors/kepler-service-monitor.yaml | Adds ServiceMonitor intended for Kepler metrics. |
| common/observability/base/service-monitors/amd-gpu-service-monitor.yaml | Adds ServiceMonitor for AMD device metrics exporter. |
| common/observability/base/operators/prometheus-operator/rbac.yaml | Adds Prometheus Operator service account and cluster RBAC. |
| common/observability/base/operators/prometheus-operator/kustomization.yaml | Wires Prometheus Operator deployment + RBAC. |
| common/observability/base/operators/prometheus-operator/deployment.yaml | Adds Prometheus Operator deployment manifest. |
| common/observability/base/operators/grafana-operator/rbac.yaml | Adds Grafana Operator service account and cluster RBAC. |
| common/observability/base/operators/grafana-operator/kustomization.yaml | Wires Grafana Operator deployment/RBAC and Grafana CR. |
| common/observability/base/operators/grafana-operator/grafana.yaml | Adds Grafana custom resource. |
| common/observability/base/operators/grafana-operator/deployment.yaml | Adds Grafana Operator deployment manifest. |
| common/observability/base/namespace.yaml | Creates kubeflow-monitoring-system namespace. |
| common/observability/base/kustomization.yaml | Top-level observability base composition (namespace/operators/monitors/dashboards). |
| common/observability/base/kepler/serviceaccount.yaml | Adds Kepler service account. |
| common/observability/base/kepler/service.yaml | Adds Kepler Service. |
| common/observability/base/kepler/namespace.yaml | Adds (duplicate) namespace manifest for Kepler base. |
| common/observability/base/kepler/kustomization.yaml | Wires Kepler RBAC + DaemonSet + Service. |
| common/observability/base/kepler/daemonset.yaml | Adds Kepler DaemonSet. |
| common/observability/base/kepler/clusterrolebinding.yaml | Adds Kepler ClusterRoleBinding. |
| common/observability/base/kepler/clusterrole.yaml | Adds Kepler ClusterRole. |
| common/observability/base/dashboards/kustomization.yaml | Aggregates the dashboard ConfigMaps. |
| common/observability/base/dashboards/gpu-namespace-usage-dashboard.yaml | Adds a Grafana dashboard ConfigMap (namespace-level GPU usage). |
| common/observability/base/dashboards/gpu-cluster-usage-dashboard.yaml | Adds a Grafana dashboard ConfigMap (cluster-level GPU usage). |
| common/observability/base/dashboards/gpu-availability-allocation-dashboard.yaml | Adds a Grafana dashboard ConfigMap (GPU availability/allocation). |
| applications/centraldashboard/overlays/oauth2-proxy/patches/configmap.yaml | Patches central dashboard links/settings (adds Model Registry link). |
| applications/centraldashboard/overlays/oauth2-proxy/kustomization.yaml | Switches to istio overlay and applies the centraldashboard ConfigMap patch. |
| README.md | Documents optional observability and Model Registry install scripts. |
| .github/workflows/model_registry_test.yaml | Refactors workflow to use the new install/test scripts. |
| .github/workflows/full_kubeflow_integration_test.yaml | Uses new Model Registry scripts and runs Model Registry tests. |

Comment on lines +181 to +195
#### Observability Stack (Optional)

This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace.

Install the observability base component:

```sh
./tests/observability_install.sh
```

To opt into Kepler for energy metrics:

```sh
kustomize build common/observability/components/kepler | kubectl apply -f -
```
Copilot AI Mar 27, 2026

The README describes this as a ready-to-use “monitoring stack”, but the manifests currently do not include the required CRDs (ServiceMonitor/Grafana) and do not create a Prometheus instance or Grafana datasource. As a result, the install instructions here will not work on a fresh cluster and the dashboards will not render without extra manual steps.

Either ship the missing pieces as part of common/observability or document the required pre-existing dependencies and post-install configuration steps.

Comment on lines +5 to +12
```yaml
  namespace: kepler
spec:
  endpoints:
    - interval: 15s
      port: http
  namespaceSelector:
    matchNames:
      - kepler
```
Copilot AI Mar 27, 2026

kepler-service-monitor.yaml creates the ServiceMonitor in namespace kepler and only selects targets in namespace kepler, but the Kepler manifests in this component deploy to kubeflow-monitoring-system. As written, base installation will fail if the kepler namespace does not exist, and Prometheus will not discover the Kepler Service.

Align the ServiceMonitor metadata.namespace and namespaceSelector.matchNames with the namespace where Kepler is actually deployed (or move this ServiceMonitor into the Kepler opt-in component).

Suggested change:

```diff
-  namespace: kepler
+  namespace: kubeflow-monitoring-system
 spec:
   endpoints:
     - interval: 15s
       port: http
   namespaceSelector:
     matchNames:
-      - kepler
+      - kubeflow-monitoring-system
```
resources:
- nvidia-dcgm-service-monitor.yaml
- amd-gpu-service-monitor.yaml
- kepler-service-monitor.yaml
Copilot AI Mar 27, 2026

Kepler is described as opt-in, but base/service-monitors unconditionally includes kepler-service-monitor.yaml. This makes kustomize build common/observability/base | kubectl apply depend on Kepler’s namespace/Service existing and couples base installs to an optional component.

Move the Kepler ServiceMonitor into common/observability/components/kepler (or gate it behind an overlay) so the base component installs cleanly without Kepler.

Suggested change: remove the `- kepler-service-monitor.yaml` entry.
Comment on lines +4 to +9
resources:
- namespace.yaml
- operators/prometheus-operator
- operators/grafana-operator
- service-monitors
- dashboards
Copilot AI Mar 27, 2026

The base component deploys the Prometheus Operator and Grafana Operator, and also applies ServiceMonitor/Grafana resources, but it does not include the required CRDs for monitoring.coreos.com and grafana.integreatly.org. On a cluster without those CRDs preinstalled, kubectl apply will fail before the operators can run.

If this is meant to be “batteries included”, include the CRDs (or an explicit preflight/install step) so tests/observability_install.sh works on a fresh cluster.
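
For example, a minimal preflight check in tests/observability_install.sh could fail fast when the CRDs are absent (a sketch; the CRD names are the standard ones shipped by the upstream operators and are assumptions here):

```sh
# Sketch: fail fast if the required CRDs are not installed yet.
for crd in servicemonitors.monitoring.coreos.com grafanas.grafana.integreatly.org; do
  if ! kubectl get crd "$crd" >/dev/null 2>&1; then
    echo "ERROR: required CRD $crd is missing; install the operator CRD bundles first." >&2
    exit 1
  fi
done
```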

Comment on lines +16 to +22
"targets": [
{ "expr": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(amd_gpu_utilization)", "legendFormat": "GPU Utilization" }
]
},
{
"title": "GPU Memory Used vs Total per Node",
"type": "timeseries",
Copilot AI Mar 27, 2026

The bundled Grafana dashboards hardcode datasource.uid = "prometheus", but this component does not create a corresponding GrafanaDataSource (or any other datasource provisioning). Unless a datasource with UID prometheus is created out-of-band, these dashboards will not render.

Provision the Prometheus datasource as part of this component (or avoid hardcoding a UID and use a default datasource strategy supported by the chosen Grafana operator).
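
A minimal GrafanaDatasource along these lines could be shipped with the component (a sketch, assuming the integreatly operator's v1beta1 API; the prometheus-operated Service name and the `dashboards: "grafana"` selector are assumptions):

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
  namespace: kubeflow-monitoring-system
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"    # must match the labels on the Grafana CR
  datasource:
    name: prometheus
    uid: prometheus            # stable UID so dashboards hardcoding "prometheus" can bind
    type: prometheus
    access: proxy
    url: http://prometheus-operated.kubeflow-monitoring-system.svc:9090
```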

Comment on lines +4 to +10
cd common/observability
kustomize build base | kubectl apply -f -
echo "Waiting for operators to be ready..."
kubectl wait --for=condition=Ready pod -l 'app.kubernetes.io/name=prometheus-operator' --timeout=180s -n kubeflow-monitoring-system
kubectl wait --for=condition=Ready pod -l 'control-plane=controller-manager' --timeout=180s -n kubeflow-monitoring-system
echo "Applying Kubeflow overlay..."
kustomize build overlays/kubeflow | kubectl apply -f -
Copilot AI Mar 27, 2026

This script applies base and then applies overlays/kubeflow, but the overlay currently only re-includes ../../base. That results in duplicate applies without any functional difference.

Apply only the overlay (or only the base) until the overlay adds distinct customization; this reduces install time and avoids confusion about what the overlay changes.

Comment on lines +11 to +15
# ---- Test 1: Direct API access via port-forward ----
echo "Test 1: Direct Model Registry API access..."
nohup kubectl port-forward svc/model-registry-service -n kubeflow 8081:8080 &
timeout 30s bash -c 'until curl -s localhost:8081 > /dev/null 2>&1; do sleep 1; done'

Copilot AI Mar 27, 2026

model_registry_test.sh starts a background kubectl port-forward but does not capture its PID or clean it up on success/failure. This can leak background processes in local runs and can interfere with subsequent steps in CI that expect to bind to the same port.

Capture $! and trap a cleanup handler to kill the port-forward process(es) before exiting.
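
A sketch of that cleanup pattern, using the script's existing port-forward command:

```sh
# Sketch: capture the port-forward PID and always clean it up on exit.
kubectl port-forward svc/model-registry-service -n kubeflow 8081:8080 &
PF_PID=$!
trap 'kill "$PF_PID" 2>/dev/null || true' EXIT

timeout 30s bash -c 'until curl -s localhost:8081 > /dev/null 2>&1; do sleep 1; done'
```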

Comment on lines +102 to +110
# Model Registry
- ../applications/model-registry/upstream/overlays/postgres
# Model Registry Istio networking (VirtualService for /api/model_registry/)
- ../applications/model-registry/upstream/options/istio
# Model Registry UI
- ../applications/model-registry/upstream/options/ui/overlays/istio
# Model Catalog (demo)
- ../applications/model-registry/upstream/options/catalog/overlays/demo

Copilot AI Mar 27, 2026

The PR title/description focuses on adding an opt-in observability stack, but this change also enables Model Registry resources in the example kustomization (and the PR adds Model Registry install/test scripts and workflow changes). This looks like additional scope not described in the PR metadata.

Either update the PR title/description to explicitly cover the Model Registry changes, or split the Model Registry updates into a separate PR for clearer review and release notes.

Comment on lines +4 to +8
resources:
- namespace.yaml
- operators/prometheus-operator
- operators/grafana-operator
- service-monitors
Copilot AI Mar 27, 2026

This “observability stack” deploys the operators but does not create a Prometheus (or PrometheusAgent) instance. Without a Prometheus CR, no scraping will happen and the ServiceMonitors are effectively unused.

Add a minimal Prometheus (and any required RBAC/Service/ServiceAccount) or clarify in the component/docs that an external Prometheus is required.
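
A minimal Prometheus CR could look like this (a sketch; the resource name and the empty selectors are assumptions, and the ServiceAccount must match the RBAC shipped with the component):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kubeflow-monitoring
  namespace: kubeflow-monitoring-system
spec:
  replicas: 1
  serviceAccountName: prometheus        # must be bound to the scraping ClusterRole
  serviceMonitorSelector: {}            # select all ServiceMonitors ...
  serviceMonitorNamespaceSelector: {}   # ... across all namespaces
```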

Comment on lines +108 to +113
INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system \
--selector="app=istio-ingressgateway" \
--output jsonpath='{.items[0].metadata.name}')

nohup kubectl port-forward --namespace istio-system "svc/${INGRESS_GATEWAY_SERVICE}" 8080:80 &
timeout 30s bash -c 'until curl -s localhost:8080 > /dev/null 2>&1; do sleep 1; done'
Copilot AI Mar 27, 2026

This script starts a second background port-forward to the Istio ingress gateway on local port 8080 and does not clean it up. In the full integration workflow, the gateway is already port-forwarded earlier (tests/port_forward_gateway.sh), so this extra port-forward is redundant and can make debugging harder if it fails.

Prefer reusing the existing gateway port-forward (or parameterize the port), and ensure the process is terminated via PID tracking + trap cleanup.
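
One way to do both (a sketch; GATEWAY_PORT is a hypothetical variable introduced for illustration):

```sh
# Sketch: reuse an existing gateway port-forward when one is already listening.
GATEWAY_PORT="${GATEWAY_PORT:-8080}"
if ! curl -s "localhost:${GATEWAY_PORT}" > /dev/null 2>&1; then
  kubectl port-forward --namespace istio-system "svc/${INGRESS_GATEWAY_SERVICE}" "${GATEWAY_PORT}:80" &
  PF_PID=$!
  trap 'kill "$PF_PID" 2>/dev/null || true' EXIT
fi
```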

@abdullahpathan22 force-pushed the feature/observability-stack-3426 branch 2 times, most recently from b15fca8 to 57b1609 on March 27, 2026 16:47
@juliusvonkohout
Member

please rebase to master, there is something wrong with your commits. I see commits that are already on master.

@abdullahpathan22
Author

abdullahpathan22 commented Mar 27, 2026

Hello @juliusvonkohout, there was indeed something wrong. I have fixed it and it is now ready for review; please take a look and let me know of any feedback or necessary changes. Thank you!

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 30 changed files in this pull request and generated 8 comments.

```yaml
kind: ServiceMonitor
metadata:
  name: amd-gpu-metrics-exporter
  namespace: kube-amd-gpu
```
Copilot AI Mar 27, 2026

This ServiceMonitor has no labels, but the Prometheus CR in this component uses a serviceMonitorSelector.matchLabels (currently app.kubernetes.io/name: prometheus-operator). Without adding the expected label(s) here (or adjusting the Prometheus selector), Prometheus will ignore this ServiceMonitor.

Suggested change:

```diff
   namespace: kube-amd-gpu
+  labels:
+    app.kubernetes.io/name: prometheus-operator
```
Comment on lines +17 to +19
```yaml
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
```
Copilot AI Mar 27, 2026

GrafanaDatasource.spec.instanceSelector.matchLabels requires matching labels on the target Grafana resource. The Grafana instance in this PR has no metadata.labels.dashboards: "grafana", so this datasource will remain in the NoMatchingInstances state and never get provisioned.

Suggested change:

```diff
-  instanceSelector:
-    matchLabels:
-      dashboards: "grafana"
+  instanceSelector: {}
```
```yaml
kind: Grafana
metadata:
  name: grafana
  namespace: kubeflow-monitoring-system
```
Copilot AI Mar 27, 2026

The Grafana instance has no labels, but the GrafanaDatasource (and any future GrafanaDashboard) resources in this component rely on instanceSelector.matchLabels to bind to a Grafana instance. Add the expected labels (e.g., dashboards: "grafana") under metadata.labels so provisioning works.

Suggested change:

```diff
   namespace: kubeflow-monitoring-system
+  labels:
+    dashboards: "grafana"
```
Comment on lines +1 to +8
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-cluster-usage-dashboard
  namespace: kubeflow-monitoring-system
  labels:
    grafana_dashboard: "1"
data:
```
Copilot AI Mar 27, 2026

These dashboards are provided as ConfigMaps labeled grafana_dashboard: "1", but this component deploys the Grafana Operator (integreatly) which provisions dashboards via GrafanaDashboard CRs (optionally with spec.configMapRef). Without corresponding GrafanaDashboard resources (or a sidecar-based ConfigMap dashboard loader), these ConfigMaps will not be imported into Grafana.
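
With the integreatly operator, each dashboard would need a GrafanaDashboard CR roughly like this (a sketch, assuming the v1beta1 API; the instance selector labels must match the Grafana CR, and the ConfigMap key is hypothetical):

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: gpu-cluster-usage
  namespace: kubeflow-monitoring-system
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  configMapRef:                       # load the dashboard JSON from the existing ConfigMap
    name: gpu-cluster-usage-dashboard
    key: gpu-cluster-usage.json      # must match the key in the ConfigMap's data
```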

README.md Outdated
Comment on lines +465 to +471
#### Model Registry

Install the Model Registry with its UI and database components:

```sh
./tests/model_registry_install.sh
```
Copilot AI Mar 27, 2026

The PR title/description is scoped to adding an opt-in observability stack, but this change also adds Model Registry installation/testing scripts and workflow usage. Please either update the PR description to cover these additional changes or split them into a separate PR to keep review and release notes coherent.

```yaml
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
```
Copilot AI Mar 27, 2026

This ServiceMonitor has no labels, but the Prometheus CR in this component uses a serviceMonitorSelector.matchLabels (currently app.kubernetes.io/name: prometheus-operator). Without adding the expected label(s) here (or adjusting the Prometheus selector), Prometheus will ignore this ServiceMonitor.

Suggested change:

```diff
   namespace: gpu-operator
+  labels:
+    app.kubernetes.io/name: prometheus-operator
```
```yaml
kind: ServiceMonitor
metadata:
  name: kepler
  namespace: kubeflow-monitoring-system
```
Copilot AI Mar 27, 2026

This ServiceMonitor has no labels, but the Prometheus CR in the base component uses a serviceMonitorSelector.matchLabels. Without adding the expected label(s) here (or adjusting the Prometheus selector), Prometheus will ignore the Kepler ServiceMonitor.

Suggested change:

```diff
   namespace: kubeflow-monitoring-system
+  labels:
+    app.kubernetes.io/name: kepler
```
Comment on lines +8 to +10
```yaml
  serviceMonitorSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus-operator
```
Copilot AI Mar 27, 2026

The Prometheus CR selects ServiceMonitors by label app.kubernetes.io/name: prometheus-operator, but the ServiceMonitors introduced in this component (and the optional Kepler ServiceMonitor) do not set that label. Additionally, because the NVIDIA/AMD ServiceMonitors are created in other namespaces (gpu-operator, kube-amd-gpu), Prometheus will not discover them unless serviceMonitorNamespaceSelector is set (or the ServiceMonitors are moved into kubeflow-monitoring-system). As-is, Prometheus will not scrape the intended targets.

Suggested change:

```diff
-  serviceMonitorSelector:
-    matchLabels:
-      app.kubernetes.io/name: prometheus-operator
+  serviceMonitorSelector: {}
+  serviceMonitorNamespaceSelector: {}
```
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 31 out of 33 changed files in this pull request and generated 16 comments.

Comment on lines +6 to +7
cd common/observability
kustomize build overlays/kubeflow | kubectl apply -f -
Copilot AI Mar 27, 2026

The install script applies a large operator bundle and CRDs using client-side kubectl apply. This is prone to failures due to the last-applied-configuration annotation size limit (and conflicts when CRDs already exist). Align this script with other install scripts in tests/ by using server-side apply (and force-conflicts where appropriate) for the kustomize output.
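
A sketch of the suggested change (both flags are standard kubectl options):

```sh
cd common/observability
# Server-side apply avoids the last-applied-configuration annotation size limit
# on large CRDs and resolves field-ownership conflicts on re-installs.
kustomize build overlays/kubeflow | kubectl apply --server-side --force-conflicts -f -
```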

Comment on lines +14 to +23
verbs: ["*"]
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["*"]
- apiGroups: ["grafana.integreatly.org"]
resources: ["grafanas", "grafanadashboards", "grafanadatasources", "grafanafolders"]
verbs: ["*"]
- apiGroups: ["networking.k8s.io"]
resources: ["ingresses"]
verbs: ["*"]
Copilot AI Mar 27, 2026

This ClusterRole grants wildcard verbs ("*") on several resource types. For a namespaced installation this is typically broader than necessary and increases blast radius. Prefer narrowing to the minimal verb set and using a Role/RoleBinding where cluster scope is not required.

Suggested change:

```diff
-    verbs: ["*"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
   - apiGroups: ["apps"]
     resources: ["deployments"]
-    verbs: ["*"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
   - apiGroups: ["grafana.integreatly.org"]
     resources: ["grafanas", "grafanadashboards", "grafanadatasources", "grafanafolders"]
-    verbs: ["*"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
   - apiGroups: ["networking.k8s.io"]
     resources: ["ingresses"]
-    verbs: ["*"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```
Comment on lines +33 to +34
```yaml
          securityContext:
            privileged: true
```
Copilot AI Mar 27, 2026

securityContext.privileged: true is a very high-privilege setting. If Kepler supports it, harden the container (for example allowPrivilegeEscalation: false, readOnlyRootFilesystem, and explicit dropped capabilities) and document why privileged mode is required.

Suggested change:

```diff
           securityContext:
+            # Kepler requires privileged mode to access host-level energy and kernel
+            # metrics via /proc, /sys, and the container runtime socket.
             privileged: true
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop:
+                - "ALL"
```
README.md Outdated
We are planning to cut 2 releases per year, for example 26.03 and 26.10 before each KubeCon EU and NA.
We ask each working group/component to provide non-breaking patch releases for 6 months based on the version in each date release.
We try to BEST-EFFORT support each release for 6 months as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.
We try to BEST-EFFORT support each realease for 6 monhts as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.
Copilot AI Mar 27, 2026

Spelling: realease should be release.

Suggested change
We try to BEST-EFFORT support each realease for 6 monhts as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.
We try to BEST-EFFORT support each release for 6 months as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.

README.md Outdated
| Katib | applications/katib/upstream | [v0.19.0](https://github.com/kubeflow/katib/tree/v0.19.0/manifests/v1beta1) | 13m | 476Mi | 10GB |
| KServe | applications/kserve/kserve | [v0.16.0](https://github.com/kserve/kserve/releases/tag/v0.16.0/install/v0.16.0) | 600m | 1200Mi | 0GB |
| KServe Models Web Application | applications/kserve/models-web-app | [c71ee4309f0335159d9fdfd4559a538b5c782c92](https://github.com/kserve/models-web-app/tree/c71ee4309f0335159d9fdfd4559a538b5c782c92/manifests/kustomize) | 6m | 259Mi | 0GB |
| KServe Models Web Application | applications/kserve/models-web-app | [v0.15.0](https://github.com/kserve/models-web-app/tree/v0.15.0/config) | 6m | 259Mi | 0GB |
Copilot AI Mar 27, 2026

This table entry links KServe Models Web Application to v0.15.0, but the manifests in applications/kserve/models-web-app currently set the image tag to 0.16.1. Please align the README's upstream revision/link with the version actually shipped.

Suggested change
| KServe Models Web Application | applications/kserve/models-web-app | [v0.15.0](https://github.com/kserve/models-web-app/tree/v0.15.0/config) | 6m | 259Mi | 0GB |
| KServe Models Web Application | applications/kserve/models-web-app | [v0.16.1](https://github.com/kserve/models-web-app/tree/v0.16.1/config) | 6m | 259Mi | 0GB |

Comment on lines +23 to +44
```yaml
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
```
Copilot AI Mar 27, 2026

The ClusterRole is named prometheus, which is very likely to collide with existing Prometheus installations because this is a cluster-scoped resource. Consider prefixing this name with something unique to this component (for example, kubeflow-monitoring-...) to avoid installation conflicts.

Suggested change:

```diff
-  name: prometheus
+  name: kubeflow-monitoring-prometheus
 rules:
   - apiGroups: [""]
     resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
     verbs: ["get", "list", "watch"]
   - apiGroups: [""]
     resources: ["configmaps"]
     verbs: ["get"]
   - nonResourceURLs: ["/metrics"]
     verbs: ["get"]
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
-  name: prometheus
+  name: kubeflow-monitoring-prometheus
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: ClusterRole
-  name: prometheus
+  name: kubeflow-monitoring-prometheus
 subjects:
   - kind: ServiceAccount
     name: prometheus
```
Comment on lines +34 to +38
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
```
Copilot AI Mar 27, 2026

The ClusterRoleBinding is named prometheus, which is very likely to collide with existing Prometheus installations because this is a cluster-scoped resource. Consider prefixing this name with something unique to this component (for example, kubeflow-monitoring-...) to avoid installation conflicts.

```yaml
  namespace: kubeflow-monitoring-system
spec:
  datasource:
    name: prometheus
```
Copilot AI Mar 27, 2026

Dashboards in this component reference the Prometheus datasource by UID ("uid": "prometheus"), but this GrafanaDatasource does not set spec.datasource.uid. Without a stable UID, Grafana will generate a different one and the dashboards will not bind to the intended datasource.

Suggested change:

```diff
     name: prometheus
+    uid: prometheus
```
README.md Outdated
To view all past security scans, head to the [Image Extracting and Security Scanning GitHub Action workflow](https://github.com/kubeflow/manifests/actions/workflows/trivy.yaml). In the logs of the workflow, you can expand the `Run image extracting and security scanning script` step to view the CVE logs. You will find a per-image CVE scan and a JSON dump of per-WorkingGroup aggregated metrics. You can run the Python script from the workflow file locally on your machine to obtain the detailed JSON files for any git commit.

For more information please consult the [SECURITY.md](./SECURITY.md).
For more infromation please consult the [SECURITY.md](./SECURITY.md).
Copilot AI Mar 27, 2026

Spelling: infromation should be information.

Suggested change
For more infromation please consult the [SECURITY.md](./SECURITY.md).
For more information please consult the [SECURITY.md](./SECURITY.md).

README.md Outdated
**A:** Please refer to each individual component's documentation for a dependency compatibility range. For Istio, Knative, Dex, Cert-Manager, and OAuth2 Proxy, the versions in `common` are the ones we have validated.
- **Q:** Can I use Kubeflow in an air-gapped environment?
**A:** Yes you can. You just need to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.
**A:** Yes you can. You just need to to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.
Copilot AI Mar 27, 2026

Spelling/grammar: remove the duplicated word to ("need to to get").

Suggested change
**A:** Yes you can. You just need to to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.
**A:** Yes you can. You just need to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 31 out of 33 changed files in this pull request and generated 3 comments.

Comment on lines 108 to 111
- For the specific Kubernetes version per release, consult the [release notes](https://github.com/kubeflow/manifests/releases).
- Our Kind script below will take care of installing continuously tested Kubernetes, Kustomize and Kubectl versions for you.
- We use Kind as default but also support Minikube, Rancher, EKS, AKS, and GKE. GKE might need tiny adjustments documented here in this file and OpenShift is also possible.

Copilot AI Mar 27, 2026

This PR removes the ARM64/aarch64 support note from the README. That change appears unrelated to the observability stack feature and removes potentially useful guidance for users. If this was accidental, consider restoring the section; if intentional, it likely deserves a separate PR with rationale.

README.md Outdated
Comment on lines +465 to +466


Copilot AI Mar 27, 2026

Two consecutive blank lines were added here, which creates unnecessary diff noise and slightly reduces readability. Consider removing the extra empty lines.

Suggested change: remove the extra blank line.
Comment on lines +14 to +18
"title": "Per-namespace GPU Utilization over time",
"type": "timeseries",
"targets": [
{ "expr": "sum(DCGM_FI_DEV_GPU_UTIL) by (namespace)", "legendFormat": "{{namespace}}" }
]
Copilot AI Mar 27, 2026

The “Per-namespace GPU Utilization over time” panel groups DCGM_FI_DEV_GPU_UTIL by namespace, but for DCGM exporter metrics the namespace label (when present) typically reflects the scrape target’s Kubernetes namespace (for example gpu-operator), not the namespaces where GPU workloads run. This will produce misleading results. Either adjust the query to a workload-namespace-aware metric/join, or rename the panel to reflect what is actually being grouped.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 31 out of 33 changed files in this pull request and generated 4 comments.

```yaml
  name: kepler
  namespace: kubeflow-monitoring-system
  labels:
    app.kubernetes.io/name: prometheus-operator
```
Copilot AI Mar 27, 2026

This ServiceMonitor sets metadata.labels.app.kubernetes.io/name: prometheus-operator, but the Prometheus CR currently uses an empty serviceMonitorSelector, so this label is not doing anything. Either scope Prometheus to select only ServiceMonitors with a dedicated label (recommended), or update/remove this label to avoid implying it is required for discovery.

Suggested change:

```diff
-    app.kubernetes.io/name: prometheus-operator
+    app.kubernetes.io/name: kepler
```
README.md Outdated
Comment on lines +188 to +190
This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace.

Install the observability base component:
Copilot AI Mar 27, 2026

The description here reads like Kepler is included in the base install, but the next section makes Kepler opt-in. Consider adjusting the wording to explicitly say Kepler is optional/opt-in to avoid confusion about what ./tests/observability_install.sh installs by default.

Suggested change:

```diff
-This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace.
-Install the observability base component:
+This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace. Support for energy consumption metrics via Kepler is an additional, opt-in component and is not installed by default by `./tests/observability_install.sh`; see the Kepler section below to enable it.
+Install the observability base component (GPU metrics, Prometheus, and Grafana, without Kepler):
```
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow-monitoring-system
```
Copilot AI Mar 27, 2026

The new namespace manifest does not include the Pod Security Standards label used across other common/ namespaces (for example pod-security.kubernetes.io/enforce: restricted). Without this, installs can behave inconsistently on clusters with PodSecurity admission enabled. Consider adding the same default PSS label here (and any additional labels you need).

Suggested change:

```diff
   name: kubeflow-monitoring-system
+  labels:
+    pod-security.kubernetes.io/enforce: restricted
```
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow-monitoring-system
```
Copilot AI Mar 27, 2026

Kepler runs as a privileged DaemonSet (hostPID/hostNetwork + privileged container), but this namespace manifest does not set a Pod Security Standards label that would allow it on clusters enforcing PSS. Consider setting pod-security.kubernetes.io/enforce: privileged here (or applying an explicit namespace patch as part of the Kepler component) so the opt-in Kepler install is self-contained.

Suggested change:

```diff
   name: kubeflow-monitoring-system
+  labels:
+    pod-security.kubernetes.io/enforce: privileged
```
@abdullahpathan22 force-pushed the feature/observability-stack-3426 branch from a07c50a to 919001f on March 27, 2026 19:20
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 36 out of 38 changed files in this pull request and generated 13 comments.

Comment on lines +9 to +10
| NVIDIA GPU Operator | NVIDIA ServiceMonitor | Deploy in `gpu-operator` ns — silent if absent |
| AMD GPU Operator | AMD ServiceMonitor | Deploy in `kube-amd-gpu` ns — silent if absent |
Copilot AI Mar 27, 2026

The README states the NVIDIA/AMD ServiceMonitors are deployed into gpu-operator / kube-amd-gpu and are “silent if absent”, but the base kustomization sets namespace: kubeflow-monitoring-system (which rewrites namespaced resources). Either update the README to reflect that the ServiceMonitors are created in kubeflow-monitoring-system and target those namespaces via spec.namespaceSelector, or adjust the manifests so the installation is actually silent when the GPU-operator namespaces are missing.

Comment on lines +34 to +39
| NVIDIA DCGM ServiceMonitor | gpu-operator | Scrapes DCGM exporter |
| AMD ROCm ServiceMonitor | kube-amd-gpu | Scrapes device-metrics-exporter |
| 3x GrafanaDashboard CRs | kubeflow-monitoring-system | GPU dashboards |
| Kepler DaemonSet (opt-in) | kepler | Per-pod energy/power draw metrics |
| Kepler ServiceMonitor (opt-in) | kubeflow-monitoring-system | Scrapes Kepler |

Copilot AI Mar 27, 2026

In the “What gets installed” table, the ServiceMonitor namespaces are listed as gpu-operator and kube-amd-gpu, but the component’s namespace: kubeflow-monitoring-system will place namespaced resources into kubeflow-monitoring-system at build time. Please reconcile the table with the actual rendered output (and clarify that spec.namespaceSelector controls the target namespace to scrape).

Suggested change
| NVIDIA DCGM ServiceMonitor | gpu-operator | Scrapes DCGM exporter |
| AMD ROCm ServiceMonitor | kube-amd-gpu | Scrapes device-metrics-exporter |
| 3x GrafanaDashboard CRs | kubeflow-monitoring-system | GPU dashboards |
| Kepler DaemonSet (opt-in) | kepler | Per-pod energy/power draw metrics |
| Kepler ServiceMonitor (opt-in) | kubeflow-monitoring-system | Scrapes Kepler |
| NVIDIA DCGM ServiceMonitor | kubeflow-monitoring-system | Scrapes DCGM exporter in `gpu-operator` namespace via `spec.namespaceSelector` |
| AMD ROCm ServiceMonitor | kubeflow-monitoring-system | Scrapes device-metrics-exporter in `kube-amd-gpu` namespace via `spec.namespaceSelector` |
| 3x GrafanaDashboard CRs | kubeflow-monitoring-system | GPU dashboards |
| Kepler DaemonSet (opt-in) | kepler | Per-pod energy/power draw metrics |
| Kepler ServiceMonitor (opt-in) | kubeflow-monitoring-system | Scrapes Kepler |
Note: All ServiceMonitor resources are created in the `kubeflow-monitoring-system` namespace. The `spec.namespaceSelector` field on each ServiceMonitor controls which target namespaces are scraped.

Comment on lines +4 to +5
# FIX 14: Track port-forward PIDs for cleanup
PF_PIDS=()
Copilot AI Mar 27, 2026

The # FIX 14: comments reference an external/unknown identifier and do not explain intent on their own. Consider replacing with a self-contained explanation (e.g., “Track port-forward PIDs for cleanup”) so the script remains understandable without external context.

Comment on lines +7 to +9
```yaml
  configMapRef:
    name: gpu-cluster-usage-dashboard
    key: gpu-cluster-usage.json
```
Copilot AI Mar 27, 2026

spec.configMapRef references a ConfigMap named gpu-cluster-usage-dashboard, but this repository only defines a GrafanaDashboard with that name (no ConfigMap with key gpu-cluster-usage.json). This dashboard CR will not reconcile as intended. Either switch the referenced object to an actual ConfigMap containing the JSON, or remove configMapRef and keep a single GrafanaDashboard resource (avoid duplicating dashboards in this kustomization).

Suggested change:

```diff
-  configMapRef:
-    name: gpu-cluster-usage-dashboard
-    key: gpu-cluster-usage.json
```
```yaml
    key: gpu-availability-allocation.json
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
```
Copilot AI Mar 27, 2026

spec.configMapRef references a ConfigMap gpu-availability-allocation-dashboard with key gpu-availability-allocation.json, but no such ConfigMap exists in this component. The only matching name is another GrafanaDashboard, so this CR will not reconcile correctly. Please either add the referenced ConfigMap or drop the -cr.yaml variant and keep a single dashboard definition.

Suggested change:

```diff
       dashboards: "grafana"
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: gpu-availability-allocation-dashboard
+  namespace: kubeflow-monitoring-system
+data:
+  gpu-availability-allocation.json: "{}"
```
Comment on lines +51 to +52
Open http://localhost:3000 — default credentials are admin/admin on first login.

Copilot AI Mar 27, 2026

The README documents Grafana default credentials as admin/admin. Since this stack is intended to be “batteries-included”, please add a clear warning and a concrete follow-up step to rotate the admin password (or configure admin credentials via a Secret) to avoid encouraging an insecure default in real clusters.

Suggested change
Open http://localhost:3000 — default credentials are admin/admin on first login.
Open http://localhost:3000 — default credentials are `admin` / `admin` on first login.
**Security warning:** These default administrator credentials are insecure and must not be used beyond initial local testing.
**Immediately after first login, rotate the Grafana administrator password** via the Grafana user interface (`Configuration → Users → admin`). For production or shared clusters, configure hardened administrator credentials through a Kubernetes Secret and the corresponding Grafana custom resource configuration (see Grafana Operator documentation) before exposing Grafana outside the cluster.

Comment on lines +12 to +16
```yaml
      - uses: actions/checkout@v4
      - name: Install kustomize
        run: |
          curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
          sudo mv kustomize /usr/local/bin/
```
Copilot AI Mar 27, 2026

This workflow installs kustomize by piping a script from the kustomize repo’s master branch, which is non-deterministic and bypasses checksum verification. The repository already has a pinned+checksum-verified installer (tests/install_KinD_create_KinD_cluster_install_kustomize.sh) and most workflows use actions/checkout@v5 (this one uses @v4). Please align with the repo’s existing installation approach and pin versions for reproducible CI.

Suggested change:

```diff
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v5
       - name: Install kustomize
         run: |
-          curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
-          sudo mv kustomize /usr/local/bin/
+          chmod +x tests/install_KinD_create_KinD_cluster_install_kustomize.sh
+          tests/install_KinD_create_KinD_cluster_install_kustomize.sh
```
Comment on lines +17 to +20
```yaml
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.2.3
```
Copilot AI Mar 27, 2026

The kubeflow-monitoring-system namespace is labeled with PSS restricted enforcement, but the Grafana pod template here does not specify a pod/container securityContext (e.g., allowPrivilegeEscalation: false, capabilities.drop: ["ALL"], seccompProfile). To avoid relying on operator defaults (and potential PodSecurity admission rejections), consider setting an explicit restricted-compliant security context in this spec.deployment.spec.template.

Suggested change:

```diff
     spec:
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 472
+        runAsGroup: 472
+        fsGroup: 472
+        seccompProfile:
+          type: RuntimeDefault
       containers:
         - name: grafana
           image: grafana/grafana:10.2.3
+          securityContext:
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop:
+                - ALL
```
Comment on lines +7 to +9
```yaml
  configMapRef:
    name: gpu-namespace-usage-dashboard
    key: gpu-namespace-usage.json
```
Copilot AI Mar 27, 2026

spec.configMapRef points to gpu-namespace-usage-dashboard/gpu-namespace-usage.json, but there is no ConfigMap providing that key (the only object with that name is another GrafanaDashboard). This will fail to load the dashboard. Consider keeping a single GrafanaDashboard (inline json) or add the missing ConfigMap and reference it consistently.

Suggested change:

```diff
-  configMapRef:
-    name: gpu-namespace-usage-dashboard
-    key: gpu-namespace-usage.json
+  json: |
+    {
+      "title": "GPU Namespace Usage",
+      "schemaVersion": 38,
+      "version": 1,
+      "timezone": "browser",
+      "editable": true,
+      "panels": []
+    }
```
Comment on lines +4 to +6
```yaml
  name: amd-gpu-metrics-exporter
  namespace: kube-amd-gpu
  labels:
```
Copilot AI Mar 27, 2026

Same as the NVIDIA ServiceMonitor: the top-level namespace: kubeflow-monitoring-system will rewrite this resource’s namespace at build time, so metadata.namespace: kube-amd-gpu is misleading and makes the manifest fail if applied standalone when kube-amd-gpu is absent. Consider placing the ServiceMonitor in kubeflow-monitoring-system (or omitting metadata.namespace) and using spec.namespaceSelector.matchNames: [kube-amd-gpu] to target the exporter.
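
A sketch of the relocated ServiceMonitor (the selector label and port name are hypothetical and must match the exporter's Service):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: amd-gpu-metrics-exporter
  namespace: kubeflow-monitoring-system   # lives with the monitoring stack
spec:
  namespaceSelector:
    matchNames:
      - kube-amd-gpu                      # scrape the exporter where it actually runs
  selector:
    matchLabels:
      app: device-metrics-exporter        # hypothetical; match the exporter Service labels
  endpoints:
    - port: metrics                       # hypothetical; match the Service port name
      interval: 15s
```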

- Remove misleading metadata.namespace from NVIDIA/AMD ServiceMonitors;
  base kustomization sets namespace, spec.namespaceSelector targets exporters
- Add restricted PSS securityContext to Grafana CR pod spec
- Upgrade CI workflow to actions/checkout@v5 and use repo's pinned kustomize
  installer instead of curl|bash
- Add wait_or_dump helper to model_registry_test.sh for failure diagnostics
- Remove stale -cr.yaml dashboard duplicates that referenced non-existent
  ConfigMaps, causing dashboard reconciliation failures
- Fix Kepler DaemonSet inline comment to reference correct 'kepler' namespace
- Rewrite README: correct ServiceMonitor namespace descriptions, add
  namespaceSelector clarification, add Grafana security warning

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 36 out of 38 changed files in this pull request and generated 5 comments.

Comment on lines +21 to +26
if ! kubectl wait --for=condition=available -n "$ns" "deployment/$deploy" --timeout="$timeout"; then
echo "ERROR: deployment $deploy in namespace $ns did not become available"
kubectl events -n "$ns"
kubectl describe "deployment/$deploy" -n "$ns"
kubectl logs "deployment/$deploy" -n "$ns" --all-containers --tail=50 || true
exit 1
Copilot AI Mar 27, 2026

kubectl events is not consistently available across kubectl versions/environments, and with set -e this can cause the failure-reporting path to exit before printing the subsequent describe/logs. Use a more portable command (e.g., kubectl get events) and/or guard it with || true so diagnostics are always emitted.
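
A sketch of the guarded diagnostics path (wait_or_dump is the helper name mentioned in the commit message above):

```sh
# Sketch: always emit diagnostics, even when a diagnostic command itself fails.
wait_or_dump() {
  local ns="$1" deploy="$2" timeout="$3"
  if ! kubectl wait --for=condition=available -n "$ns" "deployment/$deploy" --timeout="$timeout"; then
    echo "ERROR: deployment $deploy in namespace $ns did not become available"
    kubectl get events -n "$ns" --sort-by=.lastTimestamp || true
    kubectl describe "deployment/$deploy" -n "$ns" || true
    kubectl logs "deployment/$deploy" -n "$ns" --all-containers --tail=50 || true
    exit 1
  fi
}
```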

Comment on lines +36 to +43
nohup kubectl port-forward svc/model-registry-service -n kubeflow 8081:8080 &
PF_PIDS+=($!)

while ! curl -s localhost:8081 > /dev/null; do
echo "waiting for port-forwarding 8081..."
sleep 1
done
echo "port-forwarding 8081 ready"
Copilot AI Mar 27, 2026

This port-forward readiness loop has no timeout/max retries. If port-forwarding fails (e.g., port already in use, service missing, RBAC), the script can hang indefinitely in CI. Add a bounded retry/timeout and/or verify the port-forward process is still running (similar to tests/model_catalog_test.sh).
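
A bounded version of the readiness loop might look like this (a sketch; the 60-second budget is arbitrary, and the negative array index requires bash 4.3+):

```sh
# Sketch: bound the readiness loop so CI fails fast instead of hanging.
for i in $(seq 1 60); do
  curl -s localhost:8081 > /dev/null && break
  # Abort early if the port-forward process already died.
  if ! kill -0 "${PF_PIDS[-1]}" 2>/dev/null; then
    echo "ERROR: port-forward process exited unexpectedly" >&2
    exit 1
  fi
  echo "waiting for port-forwarding 8081... ($i/60)"
  sleep 1
done
curl -s localhost:8081 > /dev/null || { echo "ERROR: port 8081 not ready after 60s" >&2; exit 1; }
```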

Comment on lines +52 to +61
nohup kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80 &
PF_PIDS+=($!)
sleep 3
fi

while ! curl -s localhost:8080 > /dev/null; do
echo "waiting for port-forwarding 8080..."
sleep 1
done
echo "port-forwarding 8080 ready"
Copilot AI Mar 27, 2026

This second port-forward readiness loop also lacks a timeout/max retries, so CI can hang indefinitely if the Istio gateway port-forward fails. Add a bounded retry/timeout (or reuse a shared helper) to guarantee the script terminates with diagnostics.

resources:
- gpu-cluster-usage-dashboard.yaml
- gpu-namespace-usage-dashboard.yaml
- gpu-availability-allocation-dashboard.yaml

Copilot AI Mar 27, 2026


The dashboards directory contains both inline-JSON GrafanaDashboard resources and separate *-dashboard-cr.yaml resources that reference ConfigMaps, but the *-dashboard-cr.yaml files are not included here (and there are no ConfigMaps generated for them). This leaves dead/unused manifests in the tree and makes it unclear which provisioning mechanism is intended. Either remove the *-dashboard-cr.yaml files or include them together with the required ConfigMaps/configMapGenerator and reference them consistently.

Suggested change
- gpu-availability-allocation-dashboard.yaml

Comment on lines 77 to 79
| KServe | applications/kserve/kserve | [v0.16.0](https://github.com/kserve/kserve/releases/tag/v0.16.0/install/v0.16.0) | 600m | 1200Mi | 0GB |
| KServe Models Web Application | applications/kserve/models-web-app | [c71ee4309f0335159d9fdfd4559a538b5c782c92](https://github.com/kserve/models-web-app/tree/c71ee4309f0335159d9fdfd4559a538b5c782c92/manifests/kustomize) | 6m | 259Mi | 0GB |
| KServe Models Web Application | applications/kserve/models-web-app | [v0.16.1](https://github.com/kserve/models-web-app/tree/v0.16.1/config) | 6m | 259Mi | 0GB |
| Kubeflow Pipelines | applications/pipeline/upstream | [2.16.0](https://github.com/kubeflow/pipelines/tree/2.16.0/manifests/kustomize) | 970m | 3552Mi | 35GB |

Copilot AI Mar 27, 2026


This PR is described as introducing an opt-in observability stack, but this README change also updates the referenced KServe Models Web Application version/link. If this update is intentional, it should be called out in the PR description (or moved to a separate PR) to avoid mixing unrelated changes.

- Replace kubectl events with more portable command in tests

- Add bounded retry validation for port-forward loops

- Remove orphaned dashboards CRs without ConfigMaps

- Revert unintended KServe Models Web App version bump in README

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 33 out of 35 changed files in this pull request and generated 4 comments.

Comment on lines +8 to +12
kind: ClusterRole
metadata:
  name: grafana-operator
rules:
  - apiGroups: [""]

Copilot AI Mar 28, 2026


The ClusterRole / ClusterRoleBinding names here are very generic (grafana-operator). Because these are cluster-scoped, installing this component into a cluster that already has a Grafana Operator (or any other component using the same names) will cause RBAC resource name collisions and unexpected permission changes. Consider prefixing these names (for example with kubeflow- / kubeflow-monitoring-) to make them globally unique.
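
One low-touch fix is a kustomize namePrefix on the component holding these manifests; kustomize renames cluster-scoped RBAC objects and rewrites the matching roleRef in the ClusterRoleBinding. The file name below is hypothetical:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namePrefix: kubeflow-monitoring-   # grafana-operator -> kubeflow-monitoring-grafana-operator
resources:
  - grafana-operator-rbac.yaml     # hypothetical file with the ClusterRole/ClusterRoleBinding
```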

Comment on lines +44 to +48
  fi
  echo "waiting for port-forwarding $port..."
  sleep 1
  ((count++))
done

Copilot AI Mar 28, 2026


wait_for_port uses ((count++)) while the script runs with set -e. In Bash, ((count++)) returns exit status 1 when count is 0 (first loop iteration), which will terminate the script prematurely. Use an increment form that does not trigger set -e (for example, count=$((count+1)), ((++count)), or add an explicit || true).
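
The failure mode is easy to demonstrate in isolation:

```bash
set -e
count=0
# ((count++)) would kill the script here: post-increment evaluates to the old
# value 0, and an arithmetic command whose result is 0 returns exit status 1.
count=$((count + 1))   # safe: plain assignment never fails
((++count)) || true    # also safe: || true guards the result-is-zero case
echo "count=$count"    # prints count=2
```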

Comment on lines +6 to +12
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector: {}
  serviceMonitorNamespaceSelector: {}
  resources:
    requests:
      memory: 400Mi

Copilot AI Mar 28, 2026


The kubeflow-monitoring-system namespace is labeled with Pod Security Standards restricted, but this Prometheus CR does not define pod/container securityContext (e.g., allowPrivilegeEscalation: false, seccompProfile: RuntimeDefault, runAsNonRoot, dropped capabilities). On clusters enforcing PSA restricted without SeccompDefault, the operator-created Prometheus StatefulSet can be blocked. Configure the Prometheus CR to set an explicit pod and container security context that satisfies PSS restricted.
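
A sketch of the pod-level settings the comment asks for; the values are typical PSS-restricted choices, not taken from this PR:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  # PSS restricted also wants container-level allowPrivilegeEscalation: false
  # and capabilities.drop: [ALL], which can be layered on via spec.containers.
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    fsGroup: 65534
    seccompProfile:
      type: RuntimeDefault
```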

Comment on lines +52 to +64
        - name: containerd
          mountPath: /var/run/containerd
          readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: containerd
          hostPath:
            path: /var/run/containerd

Copilot AI Mar 28, 2026


The Kepler DaemonSet hard-codes a hostPath mount for /var/run/containerd. On nodes that do not use containerd (e.g., CRI-O), this path may not contain the expected socket and Kepler can fail to collect runtime metrics (or silently report incomplete data). Consider making the runtime socket mount configurable (or mount a more general path used by supported runtimes) and document the supported container runtimes explicitly.
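
As one example, a CRI-O overlay could retarget only the volume path via a strategic-merge patch; the DaemonSet and volume names follow the snippet above, while the CRI-O path is an assumption to verify per distribution:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kepler              # assumed DaemonSet name
spec:
  template:
    spec:
      volumes:
        - name: containerd  # merge key: retargets the existing volume
          hostPath:
            path: /var/run/crio
```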

- Rename Grafana Operator ClusterRole/Binding to prevent RBAC collisions

- Fix premature bash exit from set -e in wait_for_port increment

- Add PSS restricted securityContext configuration to Prometheus CR

- Generalize Kepler DaemonSet runtime socket mount path to /var/run

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
…eview feedback

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 33 out of 35 changed files in this pull request and generated 5 comments.

Comment on lines +7 to +9
pod-security.kubernetes.io/enforce-version: latest
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/warn-version: latest

Copilot AI Mar 28, 2026


Using pod-security.kubernetes.io/*-version: latest makes the enforced Pod Security Standard level change automatically as Kubernetes evolves, which can lead to unexpected breakage on cluster upgrades. Consider pinning to a specific Kubernetes minor (matching the project's supported floor) or omitting the version labels so upgrades are explicit and controlled.

Suggested change
- pod-security.kubernetes.io/enforce-version: latest
- pod-security.kubernetes.io/warn: restricted
- pod-security.kubernetes.io/warn-version: latest
+ pod-security.kubernetes.io/warn: restricted
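
Pinning instead of dropping the labels would look like this; the minor version shown is illustrative of "the project's supported floor":

```yaml
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: v1.29   # upgrades now change this explicitly
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/warn-version: v1.29
```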

Comment on lines +7 to +9
pod-security.kubernetes.io/enforce-version: latest
pod-security.kubernetes.io/warn: privileged
pod-security.kubernetes.io/warn-version: latest

Copilot AI Mar 28, 2026


Using pod-security.kubernetes.io/*-version: latest makes the enforced Pod Security Standard level change automatically as Kubernetes evolves, which can lead to unexpected breakage on cluster upgrades. Consider pinning to a specific Kubernetes minor (matching the project's supported floor) or omitting the version labels so upgrades are explicit and controlled.

Suggested change
- pod-security.kubernetes.io/enforce-version: latest
- pod-security.kubernetes.io/warn: privileged
- pod-security.kubernetes.io/warn-version: latest
+ pod-security.kubernetes.io/warn: privileged

Comment on lines +22 to +26
echo "ERROR: deployment $deploy in namespace $ns did not become available"
kubectl get events -n "$ns" --sort-by='.lastTimestamp' || true
kubectl describe "deployment/$deploy" -n "$ns"
kubectl logs "deployment/$deploy" -n "$ns" --all-containers --tail=50 || true
exit 1

Copilot AI Mar 28, 2026


In the failure path, kubectl describe "deployment/$deploy" -n "$ns" is not guarded with || true. With set -euo pipefail, if the deployment does not exist (or describe fails for another reason), the script can exit before printing logs and the explicit exit 1, reducing debuggability. Consider adding || true (like the other debug commands) so the function reliably reaches the intended error output and exit 1.

Comment on lines +4 to +6
# Track port-forward PIDs so they are always killed on exit.
PF_PIDS=()


Copilot AI Mar 28, 2026


Variable names like PF_PIDS, PF_PID_8081, and PF_PID_8080 are abbreviations that are hard to interpret without local context. Consider renaming them to fully spelled-out, descriptive names (for example, something like “port forward process IDs”) to improve long-term readability and maintainability of this test script.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +12 to +17
- apiGroups: [""]
  resources: ["configmaps", "secrets", "services", "serviceaccounts"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

Copilot AI Mar 28, 2026


The Grafana Operator ClusterRole grants cluster-wide write access to Secrets/ConfigMaps/Services/ServiceAccounts and Deployments. Since this stack is intended to be namespaced to kubeflow-monitoring-system, consider scoping the operator to a single namespace (e.g., set a watch namespace) and replacing the ClusterRole/ClusterRoleBinding with a namespaced Role/RoleBinding (or otherwise reducing privileges) to avoid unnecessary cluster-wide secret mutation permissions.
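
If the operator honors a watch-namespace setting (grafana-operator documents a WATCH_NAMESPACE env var; treat the exact mechanism as an assumption to verify against the pinned operator version), the scoping could look like this, paired with a namespaced Role/RoleBinding carrying the same rules:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-operator
  namespace: kubeflow-monitoring-system
spec:
  template:
    spec:
      containers:
        - name: manager
          env:
            - name: WATCH_NAMESPACE              # restrict reconciliation to one namespace
              value: kubeflow-monitoring-system
```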

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 33 out of 35 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (6)

tests/observability_install.sh:1
  • kubectl apply --server-side --force-conflicts can unintentionally take ownership of and overwrite fields for resources that may already exist (especially in shared dev clusters), making the install script risky outside of ephemeral CI. Consider removing --force-conflicts by default, or gating it behind an env var/flag (e.g., FORCE_CONFLICTS=true) as sketched after this list.

tests/observability_install.sh:1
  • Waiting on deployment -l control-plane=controller-manager is not specific to the Grafana operator and can match multiple deployments (or none), which can make the script flaky or hang/fail unexpectedly. Prefer waiting for a specific resource name (e.g., deployment/grafana-operator) or a unique label selector tied to this deployment.

tests/model_registry_test.sh:1 (raised twice, once per port-forward)
  • Using nohup without redirecting stdout/stderr will typically create/update nohup.out, which can clutter CI artifacts and local working directories. Consider redirecting output to a known location (or /dev/null) while still keeping the PID-based cleanup logic.

tests/model_registry_test.sh:1 (raised twice, once per loop)
  • Since PORT_FORWARD_PIDS is always initialized, ${PORT_FORWARD_PIDS[@]:-} is unnecessary and can cause an extra loop iteration with an empty string in some cases. Use a normal array iteration ("${PORT_FORWARD_PIDS[@]}") and (optionally) gate on array length for clearer behavior.
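
For the first suppressed comment, gating the flag could look like this; the FORCE_CONFLICTS variable name comes from the comment itself and the kustomize path is illustrative:

```bash
apply_args=(--server-side)
if [ "${FORCE_CONFLICTS:-false}" = "true" ]; then
  apply_args+=(--force-conflicts)   # opt-in, intended only for ephemeral CI clusters
fi
kubectl apply "${apply_args[@]}" -k common/observability/base
```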

Comment on lines +7 to +13
| Prerequisite | Required for | Notes |
|---|---|---|
| Kubernetes 1.27+ | Everything | |
| kustomize v5+ | Installation | |
| NVIDIA GPU Operator | NVIDIA ServiceMonitor | Runs in `gpu-operator` ns — ServiceMonitor scrapes it via `spec.namespaceSelector`; silent if absent |
| AMD GPU Operator | AMD ServiceMonitor | Runs in `kube-amd-gpu` ns — ServiceMonitor scrapes it via `spec.namespaceSelector`; silent if absent |
| kube-state-metrics | GPU Namespace Usage + Availability dashboards | **Without it 2/3 dashboards render blank with no error** — install via kube-prometheus-stack or standalone |

Copilot AI Mar 28, 2026


The prerequisite (and other) tables use || at the start of rows, which renders as an extra empty column / broken markdown in many renderers. Use standard markdown table formatting (| ... |) consistently so the tables render correctly.

Comment on lines +9 to +13
config:
  auth:
    disable_login_form: "false"
  users:
    allow_sign_up: "false"

Copilot AI Mar 28, 2026


This stack appears to rely on Grafana's default initial credentials (also documented as admin/admin). Even for an opt-in component, shipping a configuration that comes up with well-known credentials is risky. Prefer wiring admin credentials through a Kubernetes Secret supported by the Grafana Operator (or at least make the insecure default explicitly opt-in), and document the required Secret/values.
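
One common shape for this with the Grafana Operator, sketched with an assumed Secret name (the Secret itself would be created out of band):

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
spec:
  deployment:
    spec:
      template:
        spec:
          containers:
            - name: grafana
              env:
                - name: GF_SECURITY_ADMIN_USER
                  valueFrom:
                    secretKeyRef:
                      name: grafana-admin-credentials   # assumed Secret name
                      key: admin-user
                - name: GF_SECURITY_ADMIN_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: grafana-admin-credentials
                      key: admin-password
```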

Comment on lines +42 to +44
capabilities:
  drop:
    - ALL

Copilot AI Mar 28, 2026


Setting privileged: true effectively grants broad capabilities; additionally specifying capabilities.drop: [ALL] is misleading because privileged mode can negate the intent of capability dropping. To avoid conveying a false sense of hardening, either remove the capabilities.drop stanza when running privileged, or (if feasible) run non-privileged with a minimal explicit capability set.

Suggested change
- capabilities:
-   drop:
-     - ALL

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Contributor

Copilot AI left a comment


Copilot reviewed 34 out of 36 changed files in this pull request and generated no new comments.



Development

Successfully merging this pull request may close these issues.

[Feature] Add opt-in observability stack

3 participants