feat(observability): Add opt-in observability stack #3426 — #3427

Open
abdullahpathan22 wants to merge 8 commits into kubeflow:master from abdullahpathan22:feature/observability-stack-3426

Conversation

@abdullahpathan22

This PR introduces a new common/observability/ kustomize component that provides an optional, batteries-included monitoring stack.

Key Changes:

  • Deploys to the kubeflow-monitoring-system namespace.
  • Adds Prometheus and Grafana operators.
  • Includes ServiceMonitors for NVIDIA DCGM, AMD ROCm, and Kepler.
  • Provides Grafana dashboards for GPU cluster/namespace usage and availability.
  • Adds an installation test script in tests/observability_install.sh.

Follows mentor feedback regarding namespace and operator inclusion. Opt-in only; no impact on default installations.

Fixes #3426

Copilot AI review requested due to automatic review settings March 27, 2026 14:29
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign juliusvonkohout for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment

Pull request overview

Adds a new opt-in common/observability/ kustomize component intended to provide a bundled monitoring stack (operators, GPU ServiceMonitors, dashboards) and also refactors Model Registry installation/testing by introducing reusable scripts and wiring them into CI and example manifests.

Changes:

  • Introduce common/observability base/overlays/components with operators, ServiceMonitors, Kepler manifests, and Grafana dashboards.
  • Add Model Registry install + integration test scripts and update CI workflows to use them.
  • Update example installation and central dashboard (oauth2-proxy overlay) to surface Model Registry.

Reviewed changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated 11 comments.

Summary per file:

| File | Description |
| --- | --- |
| tests/observability_install.sh | Adds a script to apply the observability kustomize base/overlay and wait for operator pods. |
| tests/model_registry_test.sh | Adds integration tests for Model Registry API and gateway access. |
| tests/model_registry_install.sh | Adds a script to install Model Registry (server, Istio, UI, catalog) and wait for readiness. |
| example/kustomization.yaml | Enables Model Registry resources in the example install. |
| common/observability/overlays/kubeflow/kustomization.yaml | Defines a Kubeflow overlay for the observability component. |
| common/observability/components/kepler/kustomization.yaml | Adds an opt-in Kepler component referencing the Kepler base. |
| common/observability/base/service-monitors/nvidia-dcgm-service-monitor.yaml | Adds ServiceMonitor for NVIDIA DCGM exporter. |
| common/observability/base/service-monitors/kustomization.yaml | Aggregates GPU/Kepler ServiceMonitors. |
| common/observability/base/service-monitors/kepler-service-monitor.yaml | Adds ServiceMonitor intended for Kepler metrics. |
| common/observability/base/service-monitors/amd-gpu-service-monitor.yaml | Adds ServiceMonitor for AMD device metrics exporter. |
| common/observability/base/operators/prometheus-operator/rbac.yaml | Adds Prometheus Operator service account and cluster RBAC. |
| common/observability/base/operators/prometheus-operator/kustomization.yaml | Wires Prometheus Operator deployment + RBAC. |
| common/observability/base/operators/prometheus-operator/deployment.yaml | Adds Prometheus Operator deployment manifest. |
| common/observability/base/operators/grafana-operator/rbac.yaml | Adds Grafana Operator service account and cluster RBAC. |
| common/observability/base/operators/grafana-operator/kustomization.yaml | Wires Grafana Operator deployment/RBAC and Grafana CR. |
| common/observability/base/operators/grafana-operator/grafana.yaml | Adds Grafana custom resource. |
| common/observability/base/operators/grafana-operator/deployment.yaml | Adds Grafana Operator deployment manifest. |
| common/observability/base/namespace.yaml | Creates kubeflow-monitoring-system namespace. |
| common/observability/base/kustomization.yaml | Top-level observability base composition (namespace/operators/monitors/dashboards). |
| common/observability/base/kepler/serviceaccount.yaml | Adds Kepler service account. |
| common/observability/base/kepler/service.yaml | Adds Kepler Service. |
| common/observability/base/kepler/namespace.yaml | Adds (duplicate) namespace manifest for Kepler base. |
| common/observability/base/kepler/kustomization.yaml | Wires Kepler RBAC + DaemonSet + Service. |
| common/observability/base/kepler/daemonset.yaml | Adds Kepler DaemonSet. |
| common/observability/base/kepler/clusterrolebinding.yaml | Adds Kepler ClusterRoleBinding. |
| common/observability/base/kepler/clusterrole.yaml | Adds Kepler ClusterRole. |
| common/observability/base/dashboards/kustomization.yaml | Aggregates the dashboard ConfigMaps. |
| common/observability/base/dashboards/gpu-namespace-usage-dashboard.yaml | Adds a Grafana dashboard ConfigMap (namespace-level GPU usage). |
| common/observability/base/dashboards/gpu-cluster-usage-dashboard.yaml | Adds a Grafana dashboard ConfigMap (cluster-level GPU usage). |
| common/observability/base/dashboards/gpu-availability-allocation-dashboard.yaml | Adds a Grafana dashboard ConfigMap (GPU availability/allocation). |
| applications/centraldashboard/overlays/oauth2-proxy/patches/configmap.yaml | Patches central dashboard links/settings (adds Model Registry link). |
| applications/centraldashboard/overlays/oauth2-proxy/kustomization.yaml | Switches to istio overlay and applies the centraldashboard ConfigMap patch. |
| README.md | Documents optional observability and Model Registry install scripts. |
| .github/workflows/model_registry_test.yaml | Refactors workflow to use the new install/test scripts. |
| .github/workflows/full_kubeflow_integration_test.yaml | Uses new Model Registry scripts and runs Model Registry tests. |

Comment on lines +181 to +195
#### Observability Stack (Optional)

This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace.

Install the observability base component:

```sh
./tests/observability_install.sh
```

To opt into Kepler for energy metrics:

```sh
kustomize build common/observability/components/kepler | kubectl apply -f -
```
Copilot AI Mar 27, 2026

The README describes this as a ready-to-use “monitoring stack”, but the manifests currently do not include the required CRDs (ServiceMonitor/Grafana) and do not create a Prometheus instance or Grafana datasource. As a result, the install instructions here will not work on a fresh cluster and the dashboards will not render without extra manual steps.

Either ship the missing pieces as part of common/observability or document the required pre-existing dependencies and post-install configuration steps.

Comment on lines +5 to +12
```yaml
  namespace: kepler
spec:
  endpoints:
    - interval: 15s
      port: http
  namespaceSelector:
    matchNames:
      - kepler
```
Copilot AI Mar 27, 2026

kepler-service-monitor.yaml creates the ServiceMonitor in namespace kepler and only selects targets in namespace kepler, but the Kepler manifests in this component deploy to kubeflow-monitoring-system. As written, base installation will fail if the kepler namespace does not exist, and Prometheus will not discover the Kepler Service.

Align the ServiceMonitor metadata.namespace and namespaceSelector.matchNames with the namespace where Kepler is actually deployed (or move this ServiceMonitor into the Kepler opt-in component).

Suggested change:

```diff
-  namespace: kepler
+  namespace: kubeflow-monitoring-system
 spec:
   endpoints:
     - interval: 15s
       port: http
   namespaceSelector:
     matchNames:
-      - kepler
+      - kubeflow-monitoring-system
```
resources:
- nvidia-dcgm-service-monitor.yaml
- amd-gpu-service-monitor.yaml
- kepler-service-monitor.yaml
Copilot AI Mar 27, 2026

Kepler is described as opt-in, but base/service-monitors unconditionally includes kepler-service-monitor.yaml. This makes kustomize build common/observability/base | kubectl apply depend on Kepler’s namespace/Service existing and couples base installs to an optional component.

Move the Kepler ServiceMonitor into common/observability/components/kepler (or gate it behind an overlay) so the base component installs cleanly without Kepler.

Suggested change: remove the `- kepler-service-monitor.yaml` entry.
Comment on lines +4 to +9
resources:
- namespace.yaml
- operators/prometheus-operator
- operators/grafana-operator
- service-monitors
- dashboards
Copilot AI Mar 27, 2026

The base component deploys the Prometheus Operator and Grafana Operator, and also applies ServiceMonitor/Grafana resources, but it does not include the required CRDs for monitoring.coreos.com and grafana.integreatly.org. On a cluster without those CRDs preinstalled, kubectl apply will fail before the operators can run.

If this is meant to be “batteries included”, include the CRDs (or an explicit preflight/install step) so tests/observability_install.sh works on a fresh cluster.
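
For example, a minimal preflight check in tests/observability_install.sh could fail fast when the CRDs are absent (a sketch; the CRD names are the standard ones shipped by the upstream operators and are assumptions here):

```sh
# Sketch: fail fast if the required CRDs are not installed yet.
for crd in servicemonitors.monitoring.coreos.com grafanas.grafana.integreatly.org; do
  if ! kubectl get crd "$crd" >/dev/null 2>&1; then
    echo "ERROR: required CRD $crd is missing; install the operator CRD bundles first." >&2
    exit 1
  fi
done
```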

Comment on lines +16 to +22
"targets": [
{ "expr": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(amd_gpu_utilization)", "legendFormat": "GPU Utilization" }
]
},
{
"title": "GPU Memory Used vs Total per Node",
"type": "timeseries",
Copilot AI Mar 27, 2026

The bundled Grafana dashboards hardcode datasource.uid = "prometheus", but this component does not create a corresponding GrafanaDataSource (or any other datasource provisioning). Unless a datasource with UID prometheus is created out-of-band, these dashboards will not render.

Provision the Prometheus datasource as part of this component (or avoid hardcoding a UID and use a default datasource strategy supported by the chosen Grafana operator).
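
A minimal GrafanaDatasource along these lines could be shipped with the component (a sketch, assuming the integreatly operator's v1beta1 API; the prometheus-operated Service name and the `dashboards: "grafana"` selector are assumptions):

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
  namespace: kubeflow-monitoring-system
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"    # must match the labels on the Grafana CR
  datasource:
    name: prometheus
    uid: prometheus            # stable UID so dashboards hardcoding "prometheus" can bind
    type: prometheus
    access: proxy
    url: http://prometheus-operated.kubeflow-monitoring-system.svc:9090
```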

Comment on lines +4 to +10
cd common/observability
kustomize build base | kubectl apply -f -
echo "Waiting for operators to be ready..."
kubectl wait --for=condition=Ready pod -l 'app.kubernetes.io/name=prometheus-operator' --timeout=180s -n kubeflow-monitoring-system
kubectl wait --for=condition=Ready pod -l 'control-plane=controller-manager' --timeout=180s -n kubeflow-monitoring-system
echo "Applying Kubeflow overlay..."
kustomize build overlays/kubeflow | kubectl apply -f -
Copilot AI Mar 27, 2026

This script applies base and then applies overlays/kubeflow, but the overlay currently only re-includes ../../base. That results in duplicate applies without any functional difference.

Apply only the overlay (or only the base) until the overlay adds distinct customization; this reduces install time and avoids confusion about what the overlay changes.

Comment on lines +11 to +15
# ---- Test 1: Direct API access via port-forward ----
echo "Test 1: Direct Model Registry API access..."
nohup kubectl port-forward svc/model-registry-service -n kubeflow 8081:8080 &
timeout 30s bash -c 'until curl -s localhost:8081 > /dev/null 2>&1; do sleep 1; done'

Copilot AI Mar 27, 2026

model_registry_test.sh starts a background kubectl port-forward but does not capture its PID or clean it up on success/failure. This can leak background processes in local runs and can interfere with subsequent steps in CI that expect to bind to the same port.

Capture $! and trap a cleanup handler to kill the port-forward process(es) before exiting.
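
A sketch of that cleanup pattern, using the script's existing port-forward command:

```sh
# Sketch: capture the port-forward PID and always clean it up on exit.
kubectl port-forward svc/model-registry-service -n kubeflow 8081:8080 &
PF_PID=$!
trap 'kill "$PF_PID" 2>/dev/null || true' EXIT

timeout 30s bash -c 'until curl -s localhost:8081 > /dev/null 2>&1; do sleep 1; done'
```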

Comment on lines +102 to +110
# Model Registry
- ../applications/model-registry/upstream/overlays/postgres
# Model Registry Istio networking (VirtualService for /api/model_registry/)
- ../applications/model-registry/upstream/options/istio
# Model Registry UI
- ../applications/model-registry/upstream/options/ui/overlays/istio
# Model Catalog (demo)
- ../applications/model-registry/upstream/options/catalog/overlays/demo

Copilot AI Mar 27, 2026

The PR title/description focuses on adding an opt-in observability stack, but this change also enables Model Registry resources in the example kustomization (and the PR adds Model Registry install/test scripts and workflow changes). This looks like additional scope not described in the PR metadata.

Either update the PR title/description to explicitly cover the Model Registry changes, or split the Model Registry updates into a separate PR for clearer review and release notes.

Comment on lines +4 to +8
resources:
- namespace.yaml
- operators/prometheus-operator
- operators/grafana-operator
- service-monitors
Copilot AI Mar 27, 2026

This “observability stack” deploys the operators but does not create a Prometheus (or PrometheusAgent) instance. Without a Prometheus CR, no scraping will happen and the ServiceMonitors are effectively unused.

Add a minimal Prometheus (and any required RBAC/Service/ServiceAccount) or clarify in the component/docs that an external Prometheus is required.
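
A minimal Prometheus CR could look like this (a sketch; the resource name and the empty selectors are assumptions, and the ServiceAccount must match the RBAC shipped with the component):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kubeflow-monitoring
  namespace: kubeflow-monitoring-system
spec:
  replicas: 1
  serviceAccountName: prometheus        # must be bound to the scraping ClusterRole
  serviceMonitorSelector: {}            # select all ServiceMonitors ...
  serviceMonitorNamespaceSelector: {}   # ... across all namespaces
```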

Comment on lines +108 to +113
INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system \
--selector="app=istio-ingressgateway" \
--output jsonpath='{.items[0].metadata.name}')

nohup kubectl port-forward --namespace istio-system "svc/${INGRESS_GATEWAY_SERVICE}" 8080:80 &
timeout 30s bash -c 'until curl -s localhost:8080 > /dev/null 2>&1; do sleep 1; done'
Copilot AI Mar 27, 2026

This script starts a second background port-forward to the Istio ingress gateway on local port 8080 and does not clean it up. In the full integration workflow, the gateway is already port-forwarded earlier (tests/port_forward_gateway.sh), so this extra port-forward is redundant and can make debugging harder if it fails.

Prefer reusing the existing gateway port-forward (or parameterize the port), and ensure the process is terminated via PID tracking + trap cleanup.
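
One way to do both (a sketch; GATEWAY_PORT is a hypothetical variable introduced for illustration):

```sh
# Sketch: reuse an existing gateway port-forward when one is already listening.
GATEWAY_PORT="${GATEWAY_PORT:-8080}"
if ! curl -s "localhost:${GATEWAY_PORT}" > /dev/null 2>&1; then
  kubectl port-forward --namespace istio-system "svc/${INGRESS_GATEWAY_SERVICE}" "${GATEWAY_PORT}:80" &
  PF_PID=$!
  trap 'kill "$PF_PID" 2>/dev/null || true' EXIT
fi
```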

@abdullahpathan22 force-pushed the feature/observability-stack-3426 branch 2 times, most recently from b15fca8 to 57b1609 on March 27, 2026 16:47
@juliusvonkohout
Member

please rebase to master, there is something wrong with your commits. I see commits that are already on master.

@abdullahpathan22
Author

abdullahpathan22 commented Mar 27, 2026

Hello @juliusvonkohout, there was indeed something wrong. I have fixed it and it is now ready for review; please take a look and let me know of any feedback or necessary changes. Thank you!

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 30 changed files in this pull request and generated 8 comments.

```yaml
kind: ServiceMonitor
metadata:
  name: amd-gpu-metrics-exporter
  namespace: kube-amd-gpu
```
Copilot AI Mar 27, 2026

This ServiceMonitor has no labels, but the Prometheus CR in this component uses a serviceMonitorSelector.matchLabels (currently app.kubernetes.io/name: prometheus-operator). Without adding the expected label(s) here (or adjusting the Prometheus selector), Prometheus will ignore this ServiceMonitor.

Suggested change:

```diff
   namespace: kube-amd-gpu
+  labels:
+    app.kubernetes.io/name: prometheus-operator
```
Comment on lines +17 to +19
```yaml
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
```
Copilot AI Mar 27, 2026

GrafanaDatasource.spec.instanceSelector.matchLabels requires matching labels on the target Grafana resource. The Grafana instance in this PR has no metadata.labels.dashboards: "grafana", so this datasource will remain in the NoMatchingInstances state and never get provisioned.

Suggested change:

```diff
-  instanceSelector:
-    matchLabels:
-      dashboards: "grafana"
+  instanceSelector: {}
```
```yaml
kind: Grafana
metadata:
  name: grafana
  namespace: kubeflow-monitoring-system
```
Copilot AI Mar 27, 2026

The Grafana instance has no labels, but the GrafanaDatasource (and any future GrafanaDashboard) resources in this component rely on instanceSelector.matchLabels to bind to a Grafana instance. Add the expected labels (e.g., dashboards: "grafana") under metadata.labels so provisioning works.

Suggested change:

```diff
   namespace: kubeflow-monitoring-system
+  labels:
+    dashboards: "grafana"
```
Comment on lines +1 to +8
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-cluster-usage-dashboard
  namespace: kubeflow-monitoring-system
  labels:
    grafana_dashboard: "1"
data:
```
Copilot AI Mar 27, 2026

These dashboards are provided as ConfigMaps labeled grafana_dashboard: "1", but this component deploys the Grafana Operator (integreatly) which provisions dashboards via GrafanaDashboard CRs (optionally with spec.configMapRef). Without corresponding GrafanaDashboard resources (or a sidecar-based ConfigMap dashboard loader), these ConfigMaps will not be imported into Grafana.
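
With the integreatly operator, each dashboard would need a GrafanaDashboard CR roughly like this (a sketch, assuming the v1beta1 API; the instance selector labels must match the Grafana CR, and the ConfigMap key is hypothetical):

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: gpu-cluster-usage
  namespace: kubeflow-monitoring-system
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  configMapRef:                       # load the dashboard JSON from the existing ConfigMap
    name: gpu-cluster-usage-dashboard
    key: gpu-cluster-usage.json      # must match the key in the ConfigMap's data
```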

README.md Outdated
Comment on lines +465 to +471
#### Model Registry

Install the Model Registry with its UI and database components:

```sh
./tests/model_registry_install.sh
```
Copilot AI Mar 27, 2026

The PR title/description is scoped to adding an opt-in observability stack, but this change also adds Model Registry installation/testing scripts and workflow usage. Please either update the PR description to cover these additional changes or split them into a separate PR to keep review and release notes coherent.

```yaml
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
```
Copilot AI Mar 27, 2026

This ServiceMonitor has no labels, but the Prometheus CR in this component uses a serviceMonitorSelector.matchLabels (currently app.kubernetes.io/name: prometheus-operator). Without adding the expected label(s) here (or adjusting the Prometheus selector), Prometheus will ignore this ServiceMonitor.

Suggested change:

```diff
   namespace: gpu-operator
+  labels:
+    app.kubernetes.io/name: prometheus-operator
```
```yaml
kind: ServiceMonitor
metadata:
  name: kepler
  namespace: kubeflow-monitoring-system
```
Copilot AI Mar 27, 2026

This ServiceMonitor has no labels, but the Prometheus CR in the base component uses a serviceMonitorSelector.matchLabels. Without adding the expected label(s) here (or adjusting the Prometheus selector), Prometheus will ignore the Kepler ServiceMonitor.

Suggested change:

```diff
   namespace: kubeflow-monitoring-system
+  labels:
+    app.kubernetes.io/name: kepler
```
Comment on lines +8 to +10
```yaml
  serviceMonitorSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus-operator
```
Copilot AI Mar 27, 2026

The Prometheus CR selects ServiceMonitors by label app.kubernetes.io/name: prometheus-operator, but the ServiceMonitors introduced in this component (and the optional Kepler ServiceMonitor) do not set that label. Additionally, because the NVIDIA/AMD ServiceMonitors are created in other namespaces (gpu-operator, kube-amd-gpu), Prometheus will not discover them unless serviceMonitorNamespaceSelector is set (or the ServiceMonitors are moved into kubeflow-monitoring-system). As-is, Prometheus will not scrape the intended targets.

Suggested change:

```diff
-  serviceMonitorSelector:
-    matchLabels:
-      app.kubernetes.io/name: prometheus-operator
+  serviceMonitorSelector: {}
+  serviceMonitorNamespaceSelector: {}
```
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 31 out of 33 changed files in this pull request and generated 16 comments.

Comment on lines +6 to +7
cd common/observability
kustomize build overlays/kubeflow | kubectl apply -f -
Copilot AI Mar 27, 2026

The install script applies a large operator bundle and CRDs using client-side kubectl apply. This is prone to failures due to the last-applied-configuration annotation size limit (and conflicts when CRDs already exist). Align this script with other install scripts in tests/ by using server-side apply (and force-conflicts where appropriate) for the kustomize output.
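
A sketch of the suggested change (both flags are standard kubectl options):

```sh
cd common/observability
# Server-side apply avoids the last-applied-configuration annotation size limit
# on large CRDs and resolves field-ownership conflicts on re-installs.
kustomize build overlays/kubeflow | kubectl apply --server-side --force-conflicts -f -
```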

Comment on lines +14 to +23
verbs: ["*"]
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["*"]
- apiGroups: ["grafana.integreatly.org"]
resources: ["grafanas", "grafanadashboards", "grafanadatasources", "grafanafolders"]
verbs: ["*"]
- apiGroups: ["networking.k8s.io"]
resources: ["ingresses"]
verbs: ["*"]
Copilot AI Mar 27, 2026

This ClusterRole grants wildcard verbs ("*") on several resource types. For a namespaced installation this is typically broader than necessary and increases blast radius. Prefer narrowing to the minimal verb set and using a Role/RoleBinding where cluster scope is not required.

Suggested change:

```diff
-    verbs: ["*"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
   - apiGroups: ["apps"]
     resources: ["deployments"]
-    verbs: ["*"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
   - apiGroups: ["grafana.integreatly.org"]
     resources: ["grafanas", "grafanadashboards", "grafanadatasources", "grafanafolders"]
-    verbs: ["*"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
   - apiGroups: ["networking.k8s.io"]
     resources: ["ingresses"]
-    verbs: ["*"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```
Comment on lines +33 to +34
```yaml
          securityContext:
            privileged: true
```
Copilot AI Mar 27, 2026

securityContext.privileged: true is a very high-privilege setting. If Kepler supports it, harden the container (for example allowPrivilegeEscalation: false, readOnlyRootFilesystem, and explicit dropped capabilities) and document why privileged mode is required.

Suggested change:

```diff
           securityContext:
+            # Kepler requires privileged mode to access host-level energy and kernel
+            # metrics via /proc, /sys, and the container runtime socket.
             privileged: true
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop:
+                - "ALL"
```
README.md Outdated
We are planning to cut 2 releases per year, for example 26.03 and 26.10 before each KubeCon EU and NA.
We ask each working group/component to provide non-breaking patch releases for 6 months based on the version in each date release.
We try to BEST-EFFORT support each release for 6 months as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.
We try to BEST-EFFORT support each realease for 6 monhts as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.
Copilot AI Mar 27, 2026

Spelling: realease should be release.

Suggested change
We try to BEST-EFFORT support each realease for 6 monhts as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.
We try to BEST-EFFORT support each release for 6 months as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.

README.md Outdated
| Katib | applications/katib/upstream | [v0.19.0](https://github.com/kubeflow/katib/tree/v0.19.0/manifests/v1beta1) | 13m | 476Mi | 10GB |
| KServe | applications/kserve/kserve | [v0.16.0](https://github.com/kserve/kserve/releases/tag/v0.16.0/install/v0.16.0) | 600m | 1200Mi | 0GB |
| KServe Models Web Application | applications/kserve/models-web-app | [c71ee4309f0335159d9fdfd4559a538b5c782c92](https://github.com/kserve/models-web-app/tree/c71ee4309f0335159d9fdfd4559a538b5c782c92/manifests/kustomize) | 6m | 259Mi | 0GB |
| KServe Models Web Application | applications/kserve/models-web-app | [v0.15.0](https://github.com/kserve/models-web-app/tree/v0.15.0/config) | 6m | 259Mi | 0GB |
Copilot AI Mar 27, 2026

This table entry links KServe Models Web Application to v0.15.0, but the manifests in applications/kserve/models-web-app currently set the image tag to 0.16.1. Please align the README's upstream revision/link with the version actually shipped.

Suggested change
| KServe Models Web Application | applications/kserve/models-web-app | [v0.15.0](https://github.com/kserve/models-web-app/tree/v0.15.0/config) | 6m | 259Mi | 0GB |
| KServe Models Web Application | applications/kserve/models-web-app | [v0.16.1](https://github.com/kserve/models-web-app/tree/v0.16.1/config) | 6m | 259Mi | 0GB |

Comment on lines +23 to +44
```yaml
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
```
Copilot AI Mar 27, 2026

The ClusterRole is named prometheus, which is very likely to collide with existing Prometheus installations because this is a cluster-scoped resource. Consider prefixing this name with something unique to this component (for example, kubeflow-monitoring-...) to avoid installation conflicts.

Suggested change:

```diff
-  name: prometheus
+  name: kubeflow-monitoring-prometheus
 rules:
   - apiGroups: [""]
     resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
     verbs: ["get", "list", "watch"]
   - apiGroups: [""]
     resources: ["configmaps"]
     verbs: ["get"]
   - nonResourceURLs: ["/metrics"]
     verbs: ["get"]
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
-  name: prometheus
+  name: kubeflow-monitoring-prometheus
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: ClusterRole
-  name: prometheus
+  name: kubeflow-monitoring-prometheus
 subjects:
   - kind: ServiceAccount
     name: prometheus
```
Comment on lines +34 to +38
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
```
Copilot AI Mar 27, 2026

The ClusterRoleBinding is named prometheus, which is very likely to collide with existing Prometheus installations because this is a cluster-scoped resource. Consider prefixing this name with something unique to this component (for example, kubeflow-monitoring-...) to avoid installation conflicts.

```yaml
  namespace: kubeflow-monitoring-system
spec:
  datasource:
    name: prometheus
```
Copilot AI Mar 27, 2026

Dashboards in this component reference the Prometheus datasource by UID ("uid": "prometheus"), but this GrafanaDatasource does not set spec.datasource.uid. Without a stable UID, Grafana will generate a different one and the dashboards will not bind to the intended datasource.

Suggested change:

```diff
     name: prometheus
+    uid: prometheus
```
README.md Outdated
To view all past security scans, head to the [Image Extracting and Security Scanning GitHub Action workflow](https://github.com/kubeflow/manifests/actions/workflows/trivy.yaml). In the logs of the workflow, you can expand the `Run image extracting and security scanning script` step to view the CVE logs. You will find a per-image CVE scan and a JSON dump of per-WorkingGroup aggregated metrics. You can run the Python script from the workflow file locally on your machine to obtain the detailed JSON files for any git commit.

For more information please consult the [SECURITY.md](./SECURITY.md).
For more infromation please consult the [SECURITY.md](./SECURITY.md).
Copilot AI Mar 27, 2026

Spelling: infromation should be information.

Suggested change
For more infromation please consult the [SECURITY.md](./SECURITY.md).
For more information please consult the [SECURITY.md](./SECURITY.md).

README.md Outdated
**A:** Please refer to each individual component's documentation for a dependency compatibility range. For Istio, Knative, Dex, Cert-Manager, and OAuth2 Proxy, the versions in `common` are the ones we have validated.
- **Q:** Can I use Kubeflow in an air-gapped environment?
**A:** Yes you can. You just need to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.
**A:** Yes you can. You just need to to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.
Copilot AI Mar 27, 2026

Spelling/grammar: remove the duplicated word to ("need to to get").

Suggested change
**A:** Yes you can. You just need to to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.
**A:** Yes you can. You just need to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 31 out of 33 changed files in this pull request and generated 3 comments.

Comment on lines 108 to 111
- For the specific Kubernetes version per release, consult the [release notes](https://github.com/kubeflow/manifests/releases).
- Our Kind script below will take care of installing continuously tested Kubernetes, Kustomize and Kubectl versions for you.
- We use Kind as default but also support Minikube, Rancher, EKS, AKS, and GKE. GKE might need tiny adjustments documented here in this file and OpenShift is also possible.

Copilot AI Mar 27, 2026

This PR removes the ARM64/aarch64 support note from the README. That change appears unrelated to the observability stack feature and removes potentially useful guidance for users. If this was accidental, consider restoring the section; if intentional, it likely deserves a separate PR with rationale.

README.md Outdated
Comment on lines +465 to +466


Copilot AI Mar 27, 2026

Two consecutive blank lines were added here, which creates unnecessary diff noise and slightly reduces readability. Consider removing the extra empty lines.

Suggested change: remove the extra blank line.
Comment on lines +14 to +18
"title": "Per-namespace GPU Utilization over time",
"type": "timeseries",
"targets": [
{ "expr": "sum(DCGM_FI_DEV_GPU_UTIL) by (namespace)", "legendFormat": "{{namespace}}" }
]
Copilot AI Mar 27, 2026

The “Per-namespace GPU Utilization over time” panel groups DCGM_FI_DEV_GPU_UTIL by namespace, but for DCGM exporter metrics the namespace label (when present) typically reflects the scrape target’s Kubernetes namespace (for example gpu-operator), not the namespaces where GPU workloads run. This will produce misleading results. Either adjust the query to a workload-namespace-aware metric/join, or rename the panel to reflect what is actually being grouped.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 31 out of 33 changed files in this pull request and generated 4 comments.

```yaml
  name: kepler
  namespace: kubeflow-monitoring-system
  labels:
    app.kubernetes.io/name: prometheus-operator
```
Copilot AI Mar 27, 2026

This ServiceMonitor sets metadata.labels.app.kubernetes.io/name: prometheus-operator, but the Prometheus CR currently uses an empty serviceMonitorSelector, so this label is not doing anything. Either scope Prometheus to select only ServiceMonitors with a dedicated label (recommended), or update/remove this label to avoid implying it is required for discovery.

Suggested change:

```diff
-    app.kubernetes.io/name: prometheus-operator
+    app.kubernetes.io/name: kepler
```
README.md Outdated
Comment on lines +188 to +190
This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace.

Install the observability base component:
Copilot AI Mar 27, 2026

The description here reads like Kepler is included in the base install, but the next section makes Kepler opt-in. Consider adjusting the wording to explicitly say Kepler is optional/opt-in to avoid confusion about what ./tests/observability_install.sh installs by default.

Suggested change:

```diff
-This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace.
-Install the observability base component:
+This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace. Support for energy consumption metrics via Kepler is an additional, opt-in component and is not installed by default by `./tests/observability_install.sh`; see the Kepler section below to enable it.
+Install the observability base component (GPU metrics, Prometheus, and Grafana, without Kepler):
```
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow-monitoring-system
```
Copilot AI Mar 27, 2026

The new namespace manifest does not include the Pod Security Standards label used across other common/ namespaces (for example pod-security.kubernetes.io/enforce: restricted). Without this, installs can behave inconsistently on clusters with PodSecurity admission enabled. Consider adding the same default PSS label here (and any additional labels you need).

Suggested change:

```diff
   name: kubeflow-monitoring-system
+  labels:
+    pod-security.kubernetes.io/enforce: restricted
```
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow-monitoring-system
```
Copilot AI Mar 27, 2026

Kepler runs as a privileged DaemonSet (hostPID/hostNetwork + privileged container), but this namespace manifest does not set a Pod Security Standards label that would allow it on clusters enforcing PSS. Consider setting pod-security.kubernetes.io/enforce: privileged here (or applying an explicit namespace patch as part of the Kepler component) so the opt-in Kepler install is self-contained.

Suggested change:

```diff
   name: kubeflow-monitoring-system
+  labels:
+    pod-security.kubernetes.io/enforce: privileged
```
@abdullahpathan22 force-pushed the feature/observability-stack-3426 branch from a07c50a to 919001f on March 27, 2026 19:20
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 36 out of 38 changed files in this pull request and generated 13 comments.

Comment on lines +9 to +10
| NVIDIA GPU Operator | NVIDIA ServiceMonitor | Deploy in `gpu-operator` ns — silent if absent |
| AMD GPU Operator | AMD ServiceMonitor | Deploy in `kube-amd-gpu` ns — silent if absent |
Copilot AI Mar 27, 2026

The README states the NVIDIA/AMD ServiceMonitors are deployed into gpu-operator / kube-amd-gpu and are “silent if absent”, but the base kustomization sets namespace: kubeflow-monitoring-system (which rewrites namespaced resources). Either update the README to reflect that the ServiceMonitors are created in kubeflow-monitoring-system and target those namespaces via spec.namespaceSelector, or adjust the manifests so the installation is actually silent when the GPU-operator namespaces are missing.

Comment on lines +34 to +39
| NVIDIA DCGM ServiceMonitor | gpu-operator | Scrapes DCGM exporter |
| AMD ROCm ServiceMonitor | kube-amd-gpu | Scrapes device-metrics-exporter |
| 3x GrafanaDashboard CRs | kubeflow-monitoring-system | GPU dashboards |
| Kepler DaemonSet (opt-in) | kepler | Per-pod energy/power draw metrics |
| Kepler ServiceMonitor (opt-in) | kubeflow-monitoring-system | Scrapes Kepler |

Copilot AI Mar 27, 2026

In the “What gets installed” table, the ServiceMonitor namespaces are listed as gpu-operator and kube-amd-gpu, but the component’s namespace: kubeflow-monitoring-system will place namespaced resources into kubeflow-monitoring-system at build time. Please reconcile the table with the actual rendered output (and clarify that spec.namespaceSelector controls the target namespace to scrape).

Suggested change
| NVIDIA DCGM ServiceMonitor | gpu-operator | Scrapes DCGM exporter |
| AMD ROCm ServiceMonitor | kube-amd-gpu | Scrapes device-metrics-exporter |
| 3x GrafanaDashboard CRs | kubeflow-monitoring-system | GPU dashboards |
| Kepler DaemonSet (opt-in) | kepler | Per-pod energy/power draw metrics |
| Kepler ServiceMonitor (opt-in) | kubeflow-monitoring-system | Scrapes Kepler |
| NVIDIA DCGM ServiceMonitor | kubeflow-monitoring-system | Scrapes DCGM exporter in `gpu-operator` namespace via `spec.namespaceSelector` |
| AMD ROCm ServiceMonitor | kubeflow-monitoring-system | Scrapes device-metrics-exporter in `kube-amd-gpu` namespace via `spec.namespaceSelector` |
| 3x GrafanaDashboard CRs | kubeflow-monitoring-system | GPU dashboards |
| Kepler DaemonSet (opt-in) | kepler | Per-pod energy/power draw metrics |
| Kepler ServiceMonitor (opt-in) | kubeflow-monitoring-system | Scrapes Kepler |
Note: All ServiceMonitor resources are created in the `kubeflow-monitoring-system` namespace. The `spec.namespaceSelector` field on each ServiceMonitor controls which target namespaces are scraped.

Comment on lines +4 to +5
# FIX 14: Track port-forward PIDs for cleanup
PF_PIDS=()
Copilot AI Mar 27, 2026

The # FIX 14: comments reference an external/unknown identifier and do not explain intent on their own. Consider replacing with a self-contained explanation (e.g., “Track port-forward PIDs for cleanup”) so the script remains understandable without external context.

Comment on lines +7 to +9
```yaml
  configMapRef:
    name: gpu-cluster-usage-dashboard
    key: gpu-cluster-usage.json
```
Copilot AI Mar 27, 2026

spec.configMapRef references a ConfigMap named gpu-cluster-usage-dashboard, but this repository only defines a GrafanaDashboard with that name (no ConfigMap with key gpu-cluster-usage.json). This dashboard CR will not reconcile as intended. Either switch the referenced object to an actual ConfigMap containing the JSON, or remove configMapRef and keep a single GrafanaDashboard resource (avoid duplicating dashboards in this kustomization).

Suggested change:

```diff
-  configMapRef:
-    name: gpu-cluster-usage-dashboard
-    key: gpu-cluster-usage.json
```
```yaml
    key: gpu-availability-allocation.json
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
```
Copilot AI Mar 27, 2026

spec.configMapRef references a ConfigMap gpu-availability-allocation-dashboard with key gpu-availability-allocation.json, but no such ConfigMap exists in this component. The only matching name is another GrafanaDashboard, so this CR will not reconcile correctly. Please either add the referenced ConfigMap or drop the -cr.yaml variant and keep a single dashboard definition.

Suggested change:

```diff
       dashboards: "grafana"
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: gpu-availability-allocation-dashboard
+  namespace: kubeflow-monitoring-system
+data:
+  gpu-availability-allocation.json: "{}"
```
Comment on lines +51 to +52
Open http://localhost:3000 — default credentials are admin/admin on first login.

Copilot AI Mar 27, 2026

The README documents Grafana default credentials as admin/admin. Since this stack is intended to be “batteries-included”, please add a clear warning and a concrete follow-up step to rotate the admin password (or configure admin credentials via a Secret) to avoid encouraging an insecure default in real clusters.

Suggested change
Open http://localhost:3000 — default credentials are admin/admin on first login.
Open http://localhost:3000 — default credentials are `admin` / `admin` on first login.
**Security warning:** These default administrator credentials are insecure and must not be used beyond initial local testing.
**Immediately after first login, rotate the Grafana administrator password** via the Grafana user interface (`Configuration → Users → admin`). For production or shared clusters, configure hardened administrator credentials through a Kubernetes Secret and the corresponding Grafana custom resource configuration (see Grafana Operator documentation) before exposing Grafana outside the cluster.

Comment on lines +12 to +16
```yaml
      - uses: actions/checkout@v4
      - name: Install kustomize
        run: |
          curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
          sudo mv kustomize /usr/local/bin/
```
Copilot AI Mar 27, 2026

This workflow installs kustomize by piping a script from the kustomize repo’s master branch, which is non-deterministic and bypasses checksum verification. The repository already has a pinned+checksum-verified installer (tests/install_KinD_create_KinD_cluster_install_kustomize.sh) and most workflows use actions/checkout@v5 (this one uses @v4). Please align with the repo’s existing installation approach and pin versions for reproducible CI.

Suggested change:

```diff
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v5
       - name: Install kustomize
         run: |
-          curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
-          sudo mv kustomize /usr/local/bin/
+          chmod +x tests/install_KinD_create_KinD_cluster_install_kustomize.sh
+          tests/install_KinD_create_KinD_cluster_install_kustomize.sh
```
Comment on lines +17 to +20
```yaml
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.2.3
```
Copilot AI Mar 27, 2026

The kubeflow-monitoring-system namespace is labeled with PSS restricted enforcement, but the Grafana pod template here does not specify a pod/container securityContext (e.g., allowPrivilegeEscalation: false, capabilities.drop: ["ALL"], seccompProfile). To avoid relying on operator defaults (and potential PodSecurity admission rejections), consider setting an explicit restricted-compliant security context in this spec.deployment.spec.template.

Suggested change:

```diff
     spec:
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 472
+        runAsGroup: 472
+        fsGroup: 472
+        seccompProfile:
+          type: RuntimeDefault
       containers:
         - name: grafana
           image: grafana/grafana:10.2.3
+          securityContext:
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop:
+                - ALL
```
Comment on lines +7 to +9
```yaml
  configMapRef:
    name: gpu-namespace-usage-dashboard
    key: gpu-namespace-usage.json
```
Copilot AI Mar 27, 2026

spec.configMapRef points to gpu-namespace-usage-dashboard/gpu-namespace-usage.json, but there is no ConfigMap providing that key (the only object with that name is another GrafanaDashboard). This will fail to load the dashboard. Consider keeping a single GrafanaDashboard (inline json) or add the missing ConfigMap and reference it consistently.

Suggested change:

```diff
-  configMapRef:
-    name: gpu-namespace-usage-dashboard
-    key: gpu-namespace-usage.json
+  json: |
+    {
+      "title": "GPU Namespace Usage",
+      "schemaVersion": 38,
+      "version": 1,
+      "timezone": "browser",
+      "editable": true,
+      "panels": []
+    }
```
Comment on lines +4 to +6
```yaml
  name: amd-gpu-metrics-exporter
  namespace: kube-amd-gpu
  labels:
```
Copilot AI Mar 27, 2026

Same as the NVIDIA ServiceMonitor: the top-level namespace: kubeflow-monitoring-system will rewrite this resource’s namespace at build time, so metadata.namespace: kube-amd-gpu is misleading and makes the manifest fail if applied standalone when kube-amd-gpu is absent. Consider placing the ServiceMonitor in kubeflow-monitoring-system (or omitting metadata.namespace) and using spec.namespaceSelector.matchNames: [kube-amd-gpu] to target the exporter.
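
A sketch of the relocated ServiceMonitor (the selector label and port name are hypothetical and must match the exporter's Service):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: amd-gpu-metrics-exporter
  namespace: kubeflow-monitoring-system   # lives with the monitoring stack
spec:
  namespaceSelector:
    matchNames:
      - kube-amd-gpu                      # scrape the exporter where it actually runs
  selector:
    matchLabels:
      app: device-metrics-exporter        # hypothetical; match the exporter Service labels
  endpoints:
    - port: metrics                       # hypothetical; match the Service port name
      interval: 15s
```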

- Remove misleading metadata.namespace from NVIDIA/AMD ServiceMonitors;
  base kustomization sets namespace, spec.namespaceSelector targets exporters
- Add restricted PSS securityContext to Grafana CR pod spec
- Upgrade CI workflow to actions/checkout@v5 and use repo's pinned kustomize
  installer instead of curl|bash
- Add wait_or_dump helper to model_registry_test.sh for failure diagnostics
- Remove stale -cr.yaml dashboard duplicates that referenced non-existent
  ConfigMaps, causing dashboard reconciliation failures
- Fix Kepler DaemonSet inline comment to reference correct 'kepler' namespace
- Rewrite README: correct ServiceMonitor namespace descriptions, add
  namespaceSelector clarification, add Grafana security warning

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 36 out of 38 changed files in this pull request and generated 5 comments.

Comment on lines +21 to +26
if ! kubectl wait --for=condition=available -n "$ns" "deployment/$deploy" --timeout="$timeout"; then
echo "ERROR: deployment $deploy in namespace $ns did not become available"
kubectl events -n "$ns"
kubectl describe "deployment/$deploy" -n "$ns"
kubectl logs "deployment/$deploy" -n "$ns" --all-containers --tail=50 || true
exit 1
Copilot AI Mar 27, 2026

kubectl events is not consistently available across kubectl versions/environments, and with set -e this can cause the failure-reporting path to exit before printing the subsequent describe/logs. Use a more portable command (e.g., kubectl get events) and/or guard it with || true so diagnostics are always emitted.
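
A sketch of the guarded diagnostics path (wait_or_dump is the helper name mentioned in the commit message above):

```sh
# Sketch: always emit diagnostics, even when a diagnostic command itself fails.
wait_or_dump() {
  local ns="$1" deploy="$2" timeout="$3"
  if ! kubectl wait --for=condition=available -n "$ns" "deployment/$deploy" --timeout="$timeout"; then
    echo "ERROR: deployment $deploy in namespace $ns did not become available"
    kubectl get events -n "$ns" --sort-by=.lastTimestamp || true
    kubectl describe "deployment/$deploy" -n "$ns" || true
    kubectl logs "deployment/$deploy" -n "$ns" --all-containers --tail=50 || true
    exit 1
  fi
}
```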

Comment on lines +36 to +43
nohup kubectl port-forward svc/model-registry-service -n kubeflow 8081:8080 &
PF_PIDS+=($!)

while ! curl -s localhost:8081 > /dev/null; do
echo "waiting for port-forwarding 8081..."
sleep 1
done
echo "port-forwarding 8081 ready"
Copilot AI Mar 27, 2026

This port-forward readiness loop has no timeout/max retries. If port-forwarding fails (e.g., port already in use, service missing, RBAC), the script can hang indefinitely in CI. Add a bounded retry/timeout and/or verify the port-forward process is still running (similar to tests/model_catalog_test.sh).
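
A bounded version of the readiness loop might look like this (a sketch; the 60-second budget is arbitrary, and the negative array index requires bash 4.3+):

```sh
# Sketch: bound the readiness loop so CI fails fast instead of hanging.
for i in $(seq 1 60); do
  curl -s localhost:8081 > /dev/null && break
  # Abort early if the port-forward process already died.
  if ! kill -0 "${PF_PIDS[-1]}" 2>/dev/null; then
    echo "ERROR: port-forward process exited unexpectedly" >&2
    exit 1
  fi
  echo "waiting for port-forwarding 8081... ($i/60)"
  sleep 1
done
curl -s localhost:8081 > /dev/null || { echo "ERROR: port 8081 not ready after 60s" >&2; exit 1; }
```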

Comment on lines +52 to +61
nohup kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80 &
PF_PIDS+=($!)
sleep 3
fi

while ! curl -s localhost:8080 > /dev/null; do
echo "waiting for port-forwarding 8080..."
sleep 1
done
echo "port-forwarding 8080 ready"
Copilot AI Mar 27, 2026

This second port-forward readiness loop also lacks a timeout/max retries, so CI can hang indefinitely if the Istio gateway port-forward fails. Add a bounded retry/timeout (or reuse a shared helper) to guarantee the script terminates with diagnostics.

resources:
- gpu-cluster-usage-dashboard.yaml
- gpu-namespace-usage-dashboard.yaml
- gpu-availability-allocation-dashboard.yaml

Copilot AI Mar 27, 2026


The dashboards directory contains both inline-JSON GrafanaDashboard resources and separate *-dashboard-cr.yaml resources that reference ConfigMaps, but the *-dashboard-cr.yaml files are not included here (and there are no ConfigMaps generated for them). This leaves dead/unused manifests in the tree and makes it unclear which provisioning mechanism is intended. Either remove the *-dashboard-cr.yaml files or include them together with the required ConfigMaps/configMapGenerator and reference them consistently.

Suggested change
- gpu-availability-allocation-dashboard.yaml

Comment on lines 77 to 79
| KServe | applications/kserve/kserve | [v0.16.0](https://github.com/kserve/kserve/releases/tag/v0.16.0/install/v0.16.0) | 600m | 1200Mi | 0GB |
| KServe Models Web Application | applications/kserve/models-web-app | [c71ee4309f0335159d9fdfd4559a538b5c782c92](https://github.com/kserve/models-web-app/tree/c71ee4309f0335159d9fdfd4559a538b5c782c92/manifests/kustomize) | 6m | 259Mi | 0GB |
| KServe Models Web Application | applications/kserve/models-web-app | [v0.16.1](https://github.com/kserve/models-web-app/tree/v0.16.1/config) | 6m | 259Mi | 0GB |
| Kubeflow Pipelines | applications/pipeline/upstream | [2.16.0](https://github.com/kubeflow/pipelines/tree/2.16.0/manifests/kustomize) | 970m | 3552Mi | 35GB |

Copilot AI Mar 27, 2026


This PR is described as introducing an opt-in observability stack, but this README change also updates the referenced KServe Models Web Application version/link. If this update is intentional, it should be called out in the PR description (or moved to a separate PR) to avoid mixing unrelated changes.

- Replace kubectl events with more portable command in tests

- Add bounded retry validation for port-forward loops

- Remove orphaned dashboards CRs without ConfigMaps

- Revert unintended KServe Models Web App version bump in README

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 33 out of 35 changed files in this pull request and generated 4 comments.

Comment on lines +8 to +12
kind: ClusterRole
metadata:
  name: grafana-operator
rules:
  - apiGroups: [""]

Copilot AI Mar 28, 2026


The ClusterRole / ClusterRoleBinding names here are very generic (grafana-operator). Because these are cluster-scoped, installing this component into a cluster that already has a Grafana Operator (or any other component using the same names) will cause RBAC resource name collisions and unexpected permission changes. Consider prefixing these names (for example with kubeflow- / kubeflow-monitoring-) to make them globally unique.
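
One low-touch fix is a kustomize namePrefix on the component holding these manifests; kustomize renames cluster-scoped RBAC objects and rewrites the matching roleRef in the ClusterRoleBinding. The file name below is hypothetical:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namePrefix: kubeflow-monitoring-   # grafana-operator -> kubeflow-monitoring-grafana-operator
resources:
  - grafana-operator-rbac.yaml     # hypothetical file with the ClusterRole/ClusterRoleBinding
```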

Comment on lines +44 to +48
  fi
  echo "waiting for port-forwarding $port..."
  sleep 1
  ((count++))
done

Copilot AI Mar 28, 2026


wait_for_port uses ((count++)) while the script runs with set -e. In Bash, ((count++)) returns exit status 1 when count is 0 (first loop iteration), which will terminate the script prematurely. Use an increment form that does not trigger set -e (for example, count=$((count+1)), ((++count)), or add an explicit || true).
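
The failure mode is easy to demonstrate in isolation:

```bash
set -e
count=0
# ((count++)) would kill the script here: post-increment evaluates to the old
# value 0, and an arithmetic command whose result is 0 returns exit status 1.
count=$((count + 1))   # safe: plain assignment never fails
((++count)) || true    # also safe: || true guards the result-is-zero case
echo "count=$count"    # prints count=2
```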

Comment on lines +6 to +12
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector: {}
  serviceMonitorNamespaceSelector: {}
  resources:
    requests:
      memory: 400Mi

Copilot AI Mar 28, 2026


The kubeflow-monitoring-system namespace is labeled with Pod Security Standards restricted, but this Prometheus CR does not define pod/container securityContext (e.g., allowPrivilegeEscalation: false, seccompProfile: RuntimeDefault, runAsNonRoot, dropped capabilities). On clusters enforcing PSA restricted without SeccompDefault, the operator-created Prometheus StatefulSet can be blocked. Configure the Prometheus CR to set an explicit pod and container security context that satisfies PSS restricted.
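
A sketch of the pod-level settings the comment asks for; the values are typical PSS-restricted choices, not taken from this PR:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  # PSS restricted also wants container-level allowPrivilegeEscalation: false
  # and capabilities.drop: [ALL], which can be layered on via spec.containers.
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    fsGroup: 65534
    seccompProfile:
      type: RuntimeDefault
```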

Comment on lines +52 to +64
        - name: containerd
          mountPath: /var/run/containerd
          readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: containerd
          hostPath:
            path: /var/run/containerd

Copilot AI Mar 28, 2026


The Kepler DaemonSet hard-codes a hostPath mount for /var/run/containerd. On nodes that do not use containerd (e.g., CRI-O), this path may not contain the expected socket and Kepler can fail to collect runtime metrics (or silently report incomplete data). Consider making the runtime socket mount configurable (or mount a more general path used by supported runtimes) and document the supported container runtimes explicitly.
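
As one example, a CRI-O overlay could retarget only the volume path via a strategic-merge patch; the DaemonSet and volume names follow the snippet above, while the CRI-O path is an assumption to verify per distribution:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kepler              # assumed DaemonSet name
spec:
  template:
    spec:
      volumes:
        - name: containerd  # merge key: retargets the existing volume
          hostPath:
            path: /var/run/crio
```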

- Rename Grafana Operator ClusterRole/Binding to prevent RBAC collisions

- Fix premature bash exit from set -e in wait_for_port increment

- Add PSS restricted securityContext configuration to Prometheus CR

- Generalize Kepler DaemonSet runtime socket mount path to /var/run

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
…eview feedback

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 33 out of 35 changed files in this pull request and generated 5 comments.

Comment on lines +7 to +9
pod-security.kubernetes.io/enforce-version: latest
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/warn-version: latest

Copilot AI Mar 28, 2026


Using pod-security.kubernetes.io/*-version: latest makes the enforced Pod Security Standard level change automatically as Kubernetes evolves, which can lead to unexpected breakage on cluster upgrades. Consider pinning to a specific Kubernetes minor (matching the project's supported floor) or omitting the version labels so upgrades are explicit and controlled.

Suggested change
- pod-security.kubernetes.io/enforce-version: latest
- pod-security.kubernetes.io/warn: restricted
- pod-security.kubernetes.io/warn-version: latest
+ pod-security.kubernetes.io/warn: restricted
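
Pinning instead of dropping the labels would look like this; the minor version shown is illustrative of "the project's supported floor":

```yaml
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: v1.29   # upgrades now change this explicitly
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/warn-version: v1.29
```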

Comment on lines +7 to +9
pod-security.kubernetes.io/enforce-version: latest
pod-security.kubernetes.io/warn: privileged
pod-security.kubernetes.io/warn-version: latest

Copilot AI Mar 28, 2026


Using pod-security.kubernetes.io/*-version: latest makes the enforced Pod Security Standard level change automatically as Kubernetes evolves, which can lead to unexpected breakage on cluster upgrades. Consider pinning to a specific Kubernetes minor (matching the project's supported floor) or omitting the version labels so upgrades are explicit and controlled.

Suggested change
- pod-security.kubernetes.io/enforce-version: latest
- pod-security.kubernetes.io/warn: privileged
- pod-security.kubernetes.io/warn-version: latest
+ pod-security.kubernetes.io/warn: privileged

Comment on lines +22 to +26
echo "ERROR: deployment $deploy in namespace $ns did not become available"
kubectl get events -n "$ns" --sort-by='.lastTimestamp' || true
kubectl describe "deployment/$deploy" -n "$ns"
kubectl logs "deployment/$deploy" -n "$ns" --all-containers --tail=50 || true
exit 1

Copilot AI Mar 28, 2026


In the failure path, kubectl describe "deployment/$deploy" -n "$ns" is not guarded with || true. With set -euo pipefail, if the deployment does not exist (or describe fails for another reason), the script can exit before printing logs and the explicit exit 1, reducing debuggability. Consider adding || true (like the other debug commands) so the function reliably reaches the intended error output and exit 1.

Comment on lines +4 to +6
# Track port-forward PIDs so they are always killed on exit.
PF_PIDS=()


Copilot AI Mar 28, 2026


Variable names like PF_PIDS, PF_PID_8081, and PF_PID_8080 are abbreviations that are hard to interpret without local context. Consider renaming them to fully spelled-out, descriptive names (for example, something like “port forward process IDs”) to improve long-term readability and maintainability of this test script.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +12 to +17
- apiGroups: [""]
  resources: ["configmaps", "secrets", "services", "serviceaccounts"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

Copilot AI Mar 28, 2026


The Grafana Operator ClusterRole grants cluster-wide write access to Secrets/ConfigMaps/Services/ServiceAccounts and Deployments. Since this stack is intended to be namespaced to kubeflow-monitoring-system, consider scoping the operator to a single namespace (e.g., set a watch namespace) and replacing the ClusterRole/ClusterRoleBinding with a namespaced Role/RoleBinding (or otherwise reducing privileges) to avoid unnecessary cluster-wide secret mutation permissions.
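
If the operator honors a watch-namespace setting (grafana-operator documents a WATCH_NAMESPACE env var; treat the exact mechanism as an assumption to verify against the pinned operator version), the scoping could look like this, paired with a namespaced Role/RoleBinding carrying the same rules:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-operator
  namespace: kubeflow-monitoring-system
spec:
  template:
    spec:
      containers:
        - name: manager
          env:
            - name: WATCH_NAMESPACE              # restrict reconciliation to one namespace
              value: kubeflow-monitoring-system
```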

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 33 out of 35 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (6)

tests/observability_install.sh:1
  • kubectl apply --server-side --force-conflicts can unintentionally take ownership of and overwrite fields for resources that may already exist (especially in shared dev clusters), making the install script risky outside of ephemeral CI. Consider removing --force-conflicts by default, or gating it behind an env var/flag (e.g., FORCE_CONFLICTS=true) as sketched after this list.

tests/observability_install.sh:1
  • Waiting on deployment -l control-plane=controller-manager is not specific to the Grafana operator and can match multiple deployments (or none), which can make the script flaky or hang/fail unexpectedly. Prefer waiting for a specific resource name (e.g., deployment/grafana-operator) or a unique label selector tied to this deployment.

tests/model_registry_test.sh:1 (raised twice, once per port-forward)
  • Using nohup without redirecting stdout/stderr will typically create/update nohup.out, which can clutter CI artifacts and local working directories. Consider redirecting output to a known location (or /dev/null) while still keeping the PID-based cleanup logic.

tests/model_registry_test.sh:1 (raised twice, once per loop)
  • Since PORT_FORWARD_PIDS is always initialized, ${PORT_FORWARD_PIDS[@]:-} is unnecessary and can cause an extra loop iteration with an empty string in some cases. Use a normal array iteration ("${PORT_FORWARD_PIDS[@]}") and (optionally) gate on array length for clearer behavior.
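
For the first suppressed comment, gating the flag could look like this; the FORCE_CONFLICTS variable name comes from the comment itself and the kustomize path is illustrative:

```bash
apply_args=(--server-side)
if [ "${FORCE_CONFLICTS:-false}" = "true" ]; then
  apply_args+=(--force-conflicts)   # opt-in, intended only for ephemeral CI clusters
fi
kubectl apply "${apply_args[@]}" -k common/observability/base
```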

Comment on lines +7 to +13
| Prerequisite | Required for | Notes |
|---|---|---|
| Kubernetes 1.27+ | Everything | |
| kustomize v5+ | Installation | |
| NVIDIA GPU Operator | NVIDIA ServiceMonitor | Runs in `gpu-operator` ns — ServiceMonitor scrapes it via `spec.namespaceSelector`; silent if absent |
| AMD GPU Operator | AMD ServiceMonitor | Runs in `kube-amd-gpu` ns — ServiceMonitor scrapes it via `spec.namespaceSelector`; silent if absent |
| kube-state-metrics | GPU Namespace Usage + Availability dashboards | **Without it 2/3 dashboards render blank with no error** — install via kube-prometheus-stack or standalone |

Copilot AI Mar 28, 2026


The prerequisite (and other) tables use || at the start of rows, which renders as an extra empty column / broken markdown in many renderers. Use standard markdown table formatting (| ... |) consistently so the tables render correctly.

Comment on lines +9 to +13
config:
  auth:
    disable_login_form: "false"
  users:
    allow_sign_up: "false"

Copilot AI Mar 28, 2026


This stack appears to rely on Grafana's default initial credentials (also documented as admin/admin). Even for an opt-in component, shipping a configuration that comes up with well-known credentials is risky. Prefer wiring admin credentials through a Kubernetes Secret supported by the Grafana Operator (or at least make the insecure default explicitly opt-in), and document the required Secret/values.
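
One common shape for this with the Grafana Operator, sketched with an assumed Secret name (the Secret itself would be created out of band):

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
spec:
  deployment:
    spec:
      template:
        spec:
          containers:
            - name: grafana
              env:
                - name: GF_SECURITY_ADMIN_USER
                  valueFrom:
                    secretKeyRef:
                      name: grafana-admin-credentials   # assumed Secret name
                      key: admin-user
                - name: GF_SECURITY_ADMIN_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: grafana-admin-credentials
                      key: admin-password
```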

Comment on lines +42 to +44
capabilities:
  drop:
    - ALL

Copilot AI Mar 28, 2026


Setting privileged: true effectively grants broad capabilities; additionally specifying capabilities.drop: [ALL] is misleading because privileged mode can negate the intent of capability dropping. To avoid conveying a false sense of hardening, either remove the capabilities.drop stanza when running privileged, or (if feasible) run non-privileged with a minimal explicit capability set.

Suggested change
- capabilities:
-   drop:
-     - ALL

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Contributor

Copilot AI left a comment


Copilot reviewed 34 out of 36 changed files in this pull request and generated no new comments.



Development

Successfully merging this pull request may close these issues.

[Feature] Add opt-in observability stack

3 participants