feat(observability): Add opt-in observability stack #3426 #3427
abdullahpathan22 wants to merge 8 commits into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (none yet). The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Pull request overview
Adds a new opt-in common/observability/ kustomize component intended to provide a bundled monitoring stack (operators, GPU ServiceMonitors, dashboards) and also refactors Model Registry installation/testing by introducing reusable scripts and wiring them into CI and example manifests.
Changes:
- Introduce `common/observability` base/overlays/components with operators, ServiceMonitors, Kepler manifests, and Grafana dashboards.
- Add Model Registry install + integration test scripts and update CI workflows to use them.
- Update example installation and central dashboard (oauth2-proxy overlay) to surface Model Registry.
Reviewed changes
Copilot reviewed 35 out of 35 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/observability_install.sh | Adds a script to apply the observability kustomize base/overlay and wait for operator pods. |
| tests/model_registry_test.sh | Adds integration tests for Model Registry API and gateway access. |
| tests/model_registry_install.sh | Adds a script to install Model Registry (server, Istio, UI, catalog) and wait for readiness. |
| example/kustomization.yaml | Enables Model Registry resources in the example install. |
| common/observability/overlays/kubeflow/kustomization.yaml | Defines a Kubeflow overlay for the observability component. |
| common/observability/components/kepler/kustomization.yaml | Adds an opt-in Kepler component referencing the Kepler base. |
| common/observability/base/service-monitors/nvidia-dcgm-service-monitor.yaml | Adds ServiceMonitor for NVIDIA DCGM exporter. |
| common/observability/base/service-monitors/kustomization.yaml | Aggregates GPU/Kepler ServiceMonitors. |
| common/observability/base/service-monitors/kepler-service-monitor.yaml | Adds ServiceMonitor intended for Kepler metrics. |
| common/observability/base/service-monitors/amd-gpu-service-monitor.yaml | Adds ServiceMonitor for AMD device metrics exporter. |
| common/observability/base/operators/prometheus-operator/rbac.yaml | Adds Prometheus Operator service account and cluster RBAC. |
| common/observability/base/operators/prometheus-operator/kustomization.yaml | Wires Prometheus Operator deployment + RBAC. |
| common/observability/base/operators/prometheus-operator/deployment.yaml | Adds Prometheus Operator deployment manifest. |
| common/observability/base/operators/grafana-operator/rbac.yaml | Adds Grafana Operator service account and cluster RBAC. |
| common/observability/base/operators/grafana-operator/kustomization.yaml | Wires Grafana Operator deployment/RBAC and Grafana CR. |
| common/observability/base/operators/grafana-operator/grafana.yaml | Adds Grafana custom resource. |
| common/observability/base/operators/grafana-operator/deployment.yaml | Adds Grafana Operator deployment manifest. |
| common/observability/base/namespace.yaml | Creates kubeflow-monitoring-system namespace. |
| common/observability/base/kustomization.yaml | Top-level observability base composition (namespace/operators/monitors/dashboards). |
| common/observability/base/kepler/serviceaccount.yaml | Adds Kepler service account. |
| common/observability/base/kepler/service.yaml | Adds Kepler Service. |
| common/observability/base/kepler/namespace.yaml | Adds (duplicate) namespace manifest for Kepler base. |
| common/observability/base/kepler/kustomization.yaml | Wires Kepler RBAC + DaemonSet + Service. |
| common/observability/base/kepler/daemonset.yaml | Adds Kepler DaemonSet. |
| common/observability/base/kepler/clusterrolebinding.yaml | Adds Kepler ClusterRoleBinding. |
| common/observability/base/kepler/clusterrole.yaml | Adds Kepler ClusterRole. |
| common/observability/base/dashboards/kustomization.yaml | Aggregates the dashboard ConfigMaps. |
| common/observability/base/dashboards/gpu-namespace-usage-dashboard.yaml | Adds a Grafana dashboard ConfigMap (namespace-level GPU usage). |
| common/observability/base/dashboards/gpu-cluster-usage-dashboard.yaml | Adds a Grafana dashboard ConfigMap (cluster-level GPU usage). |
| common/observability/base/dashboards/gpu-availability-allocation-dashboard.yaml | Adds a Grafana dashboard ConfigMap (GPU availability/allocation). |
| applications/centraldashboard/overlays/oauth2-proxy/patches/configmap.yaml | Patches central dashboard links/settings (adds Model Registry link). |
| applications/centraldashboard/overlays/oauth2-proxy/kustomization.yaml | Switches to istio overlay and applies the centraldashboard ConfigMap patch. |
| README.md | Documents optional observability and Model Registry install scripts. |
| .github/workflows/model_registry_test.yaml | Refactors workflow to use the new install/test scripts. |
| .github/workflows/full_kubeflow_integration_test.yaml | Uses new Model Registry scripts and runs Model Registry tests. |
#### Observability Stack (Optional)

This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace.

Install the observability base component:

```sh
./tests/observability_install.sh
```

To opt into Kepler for energy metrics:

```sh
kustomize build common/observability/components/kepler | kubectl apply -f -
```
The README describes this as a ready-to-use “monitoring stack”, but the manifests currently do not include the required CRDs (ServiceMonitor/Grafana) and do not create a Prometheus instance or Grafana datasource. As a result, the install instructions here will not work on a fresh cluster and the dashboards will not render without extra manual steps.
Either ship the missing pieces as part of common/observability or document the required pre-existing dependencies and post-install configuration steps.
```yaml
  namespace: kepler
spec:
  endpoints:
    - interval: 15s
      port: http
  namespaceSelector:
    matchNames:
      - kepler
```
kepler-service-monitor.yaml creates the ServiceMonitor in namespace kepler and only selects targets in namespace kepler, but the Kepler manifests in this component deploy to kubeflow-monitoring-system. As written, base installation will fail if the kepler namespace does not exist, and Prometheus will not discover the Kepler Service.
Align the ServiceMonitor metadata.namespace and namespaceSelector.matchNames with the namespace where Kepler is actually deployed (or move this ServiceMonitor into the Kepler opt-in component).
Suggested change:

```diff
-  namespace: kepler
+  namespace: kubeflow-monitoring-system
 spec:
   endpoints:
     - interval: 15s
       port: http
   namespaceSelector:
     matchNames:
-      - kepler
+      - kubeflow-monitoring-system
```
```yaml
resources:
  - nvidia-dcgm-service-monitor.yaml
  - amd-gpu-service-monitor.yaml
  - kepler-service-monitor.yaml
```
Kepler is described as opt-in, but base/service-monitors unconditionally includes kepler-service-monitor.yaml. This makes kustomize build common/observability/base | kubectl apply depend on Kepler’s namespace/Service existing and couples base installs to an optional component.
Move the Kepler ServiceMonitor into common/observability/components/kepler (or gate it behind an overlay) so the base component installs cleanly without Kepler.
Suggested change:

```diff
-  - kepler-service-monitor.yaml
```
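A sketch of how the opt-in component could absorb the ServiceMonitor; the paths and file names are assumptions based on the layout described above:

```yaml
# common/observability/components/kepler/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
resources:
  - ../../base/kepler
  # Moved here from base/service-monitors so the base installs cleanly without Kepler:
  - kepler-service-monitor.yaml
```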
```yaml
resources:
  - namespace.yaml
  - operators/prometheus-operator
  - operators/grafana-operator
  - service-monitors
  - dashboards
```
The base component deploys the Prometheus Operator and Grafana Operator, and also applies ServiceMonitor/Grafana resources, but it does not include the required CRDs for monitoring.coreos.com and grafana.integreatly.org. On a cluster without those CRDs preinstalled, kubectl apply will fail before the operators can run.
If this is meant to be “batteries included”, include the CRDs (or an explicit preflight/install step) so tests/observability_install.sh works on a fresh cluster.
```json
"targets": [
  { "expr": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(amd_gpu_utilization)", "legendFormat": "GPU Utilization" }
]
},
{
  "title": "GPU Memory Used vs Total per Node",
  "type": "timeseries",
```
The bundled Grafana dashboards hardcode datasource.uid = "prometheus", but this component does not create a corresponding GrafanaDataSource (or any other datasource provisioning). Unless a datasource with UID prometheus is created out-of-band, these dashboards will not render.
Provision the Prometheus datasource as part of this component (or avoid hardcoding a UID and use a default datasource strategy supported by the chosen Grafana operator).
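With the Grafana Operator (`grafana.integreatly.org`), datasource provisioning could look roughly like the following sketch; the Service URL and the `dashboards: "grafana"` label are assumptions about this component's layout:

```yaml
# Sketch: GrafanaDatasource with a stable UID matching the bundled dashboards.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
  namespace: kubeflow-monitoring-system
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"  # must match labels on the Grafana CR
  datasource:
    name: prometheus
    uid: prometheus          # matches the hardcoded "uid": "prometheus" in the dashboards
    type: prometheus
    access: proxy
    url: http://prometheus-operated.kubeflow-monitoring-system.svc:9090  # assumed Service name
```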
tests/observability_install.sh (Outdated)
```sh
cd common/observability
kustomize build base | kubectl apply -f -
echo "Waiting for operators to be ready..."
kubectl wait --for=condition=Ready pod -l 'app.kubernetes.io/name=prometheus-operator' --timeout=180s -n kubeflow-monitoring-system
kubectl wait --for=condition=Ready pod -l 'control-plane=controller-manager' --timeout=180s -n kubeflow-monitoring-system
echo "Applying Kubeflow overlay..."
kustomize build overlays/kubeflow | kubectl apply -f -
```
This script applies base and then applies overlays/kubeflow, but the overlay currently only re-includes ../../base. That results in duplicate applies without any functional difference.
Apply only the overlay (or only the base) until the overlay adds distinct customization; this reduces install time and avoids confusion about what the overlay changes.
tests/model_registry_test.sh (Outdated)
```sh
# ---- Test 1: Direct API access via port-forward ----
echo "Test 1: Direct Model Registry API access..."
nohup kubectl port-forward svc/model-registry-service -n kubeflow 8081:8080 &
timeout 30s bash -c 'until curl -s localhost:8081 > /dev/null 2>&1; do sleep 1; done'
```
model_registry_test.sh starts a background kubectl port-forward but does not capture its PID or clean it up on success/failure. This can leak background processes in local runs and can interfere with subsequent steps in CI that expect to bind to the same port.
Capture $! and trap a cleanup handler to kill the port-forward process(es) before exiting.
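The cleanup pattern could look like the following sketch; here `sleep 300` stands in for the long-running `kubectl port-forward` command so the PID-capture and trap logic can be shown on its own:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for: nohup kubectl port-forward svc/model-registry-service -n kubeflow 8081:8080 &
sleep 300 &
PF_PID=$!

cleanup() {
  # Kill the background port-forward on any exit path (success, failure, signal).
  kill "$PF_PID" 2>/dev/null || true
  wait "$PF_PID" 2>/dev/null || true
}
trap cleanup EXIT

# ... run the curl-based tests against localhost here ...
```

The `trap cleanup EXIT` fires on normal exit as well as on `set -e` failures, so the port stays free for subsequent CI steps.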
example/kustomization.yaml (Outdated)
```yaml
# Model Registry
- ../applications/model-registry/upstream/overlays/postgres
# Model Registry Istio networking (VirtualService for /api/model_registry/)
- ../applications/model-registry/upstream/options/istio
# Model Registry UI
- ../applications/model-registry/upstream/options/ui/overlays/istio
# Model Catalog (demo)
- ../applications/model-registry/upstream/options/catalog/overlays/demo
```
The PR title/description focuses on adding an opt-in observability stack, but this change also enables Model Registry resources in the example kustomization (and the PR adds Model Registry install/test scripts and workflow changes). This looks like additional scope not described in the PR metadata.
Either update the PR title/description to explicitly cover the Model Registry changes, or split the Model Registry updates into a separate PR for clearer review and release notes.
```yaml
resources:
  - namespace.yaml
  - operators/prometheus-operator
  - operators/grafana-operator
  - service-monitors
```
This “observability stack” deploys the operators but does not create a Prometheus (or PrometheusAgent) instance. Without a Prometheus CR, no scraping will happen and the ServiceMonitors are effectively unused.
Add a minimal Prometheus (and any required RBAC/Service/ServiceAccount) or clarify in the component/docs that an external Prometheus is required.
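A minimal sketch of such a Prometheus CR, assuming the ServiceAccount from this component; resource sizing is illustrative:

```yaml
# Sketch: minimal Prometheus instance so the ServiceMonitors are actually scraped.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: kubeflow-monitoring-system
spec:
  replicas: 1
  serviceAccountName: prometheus        # assumes the ServiceAccount shipped with this component
  serviceMonitorSelector: {}            # select all ServiceMonitors...
  serviceMonitorNamespaceSelector: {}   # ...in all namespaces (gpu-operator, kube-amd-gpu, etc.)
  resources:
    requests:
      memory: 400Mi
```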
tests/model_registry_test.sh (Outdated)
```sh
INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system \
  --selector="app=istio-ingressgateway" \
  --output jsonpath='{.items[0].metadata.name}')

nohup kubectl port-forward --namespace istio-system "svc/${INGRESS_GATEWAY_SERVICE}" 8080:80 &
timeout 30s bash -c 'until curl -s localhost:8080 > /dev/null 2>&1; do sleep 1; done'
```
This script starts a second background port-forward to the Istio ingress gateway on local port 8080 and does not clean it up. In the full integration workflow, the gateway is already port-forwarded earlier (tests/port_forward_gateway.sh), so this extra port-forward is redundant and can make debugging harder if it fails.
Prefer reusing the existing gateway port-forward (or parameterize the port), and ensure the process is terminated via PID tracking + trap cleanup.
Force-pushed from b15fca8 to 57b1609.
Please rebase to master, there is something wrong with your commits. I see commits that are already on master.
Force-pushed from 57b1609 to 3f6dbe3.
Hello @juliusvonkohout, actually there was something wrong; I have fixed it and it is ready for review. Please take a look and provide me with feedback or the necessary changes, thank you!
```yaml
kind: ServiceMonitor
metadata:
  name: amd-gpu-metrics-exporter
  namespace: kube-amd-gpu
```
This ServiceMonitor has no labels, but the Prometheus CR in this component uses a serviceMonitorSelector.matchLabels (currently app.kubernetes.io/name: prometheus-operator). Without adding the expected label(s) here (or adjusting the Prometheus selector), Prometheus will ignore this ServiceMonitor.
Suggested change:

```diff
   namespace: kube-amd-gpu
+  labels:
+    app.kubernetes.io/name: prometheus-operator
```
```yaml
instanceSelector:
  matchLabels:
    dashboards: "grafana"
```
GrafanaDatasource.spec.instanceSelector.matchLabels requires matching labels on the target Grafana resource. The Grafana instance in this PR has no metadata.labels.dashboards: "grafana", so this datasource will remain in the NoMatchingInstances state and never get provisioned.
Suggested change:

```diff
-  instanceSelector:
-    matchLabels:
-      dashboards: "grafana"
+  instanceSelector: {}
```
```yaml
kind: Grafana
metadata:
  name: grafana
  namespace: kubeflow-monitoring-system
```
The Grafana instance has no labels, but the GrafanaDatasource (and any future GrafanaDashboard) resources in this component rely on instanceSelector.matchLabels to bind to a Grafana instance. Add the expected labels (e.g., dashboards: "grafana") under metadata.labels so provisioning works.
Suggested change:

```diff
   namespace: kubeflow-monitoring-system
+  labels:
+    dashboards: "grafana"
```
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-cluster-usage-dashboard
  namespace: kubeflow-monitoring-system
  labels:
    grafana_dashboard: "1"
data:
```
These dashboards are provided as ConfigMaps labeled grafana_dashboard: "1", but this component deploys the Grafana Operator (integreatly) which provisions dashboards via GrafanaDashboard CRs (optionally with spec.configMapRef). Without corresponding GrafanaDashboard resources (or a sidecar-based ConfigMap dashboard loader), these ConfigMaps will not be imported into Grafana.
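With the integreatly operator, one option is a GrafanaDashboard CR that references the existing ConfigMap; a sketch (the ConfigMap key `dashboard.json` and the `dashboards: "grafana"` label are assumptions):

```yaml
# Sketch: import the ConfigMap-based dashboard via a GrafanaDashboard CR.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: gpu-cluster-usage
  namespace: kubeflow-monitoring-system
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"   # must match labels on the Grafana CR
  configMapRef:
    name: gpu-cluster-usage-dashboard
    key: dashboard.json       # assumed key inside the ConfigMap's data
```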
README.md (Outdated)
#### Model Registry

Install the Model Registry with its UI and database components:

```sh
./tests/model_registry_install.sh
```
The PR title/description is scoped to adding an opt-in observability stack, but this change also adds Model Registry installation/testing scripts and workflow usage. Please either update the PR description to cover these additional changes or split them into a separate PR to keep review and release notes coherent.
```yaml
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
```
This ServiceMonitor has no labels, but the Prometheus CR in this component uses a serviceMonitorSelector.matchLabels (currently app.kubernetes.io/name: prometheus-operator). Without adding the expected label(s) here (or adjusting the Prometheus selector), Prometheus will ignore this ServiceMonitor.
Suggested change:

```diff
   namespace: gpu-operator
+  labels:
+    app.kubernetes.io/name: prometheus-operator
```
```yaml
kind: ServiceMonitor
metadata:
  name: kepler
  namespace: kubeflow-monitoring-system
```
This ServiceMonitor has no labels, but the Prometheus CR in the base component uses a serviceMonitorSelector.matchLabels. Without adding the expected label(s) here (or adjusting the Prometheus selector), Prometheus will ignore the Kepler ServiceMonitor.
Suggested change:

```diff
   namespace: kubeflow-monitoring-system
+  labels:
+    app.kubernetes.io/name: kepler
```
```yaml
serviceMonitorSelector:
  matchLabels:
    app.kubernetes.io/name: prometheus-operator
```
The Prometheus CR selects ServiceMonitors by label app.kubernetes.io/name: prometheus-operator, but the ServiceMonitors introduced in this component (and the optional Kepler ServiceMonitor) do not set that label. Additionally, because the NVIDIA/AMD ServiceMonitors are created in other namespaces (gpu-operator, kube-amd-gpu), Prometheus will not discover them unless serviceMonitorNamespaceSelector is set (or the ServiceMonitors are moved into kubeflow-monitoring-system). As-is, Prometheus will not scrape the intended targets.
Suggested change:

```diff
-  serviceMonitorSelector:
-    matchLabels:
-      app.kubernetes.io/name: prometheus-operator
+  serviceMonitorSelector: {}
+  serviceMonitorNamespaceSelector: {}
```
Force-pushed from 3f6dbe3 to a19b4f8.
tests/observability_install.sh (Outdated)
```sh
cd common/observability
kustomize build overlays/kubeflow | kubectl apply -f -
```
The install script applies a large operator bundle and CRDs using client-side kubectl apply. This is prone to failures due to the last-applied-configuration annotation size limit (and conflicts when CRDs already exist). Align this script with other install scripts in tests/ by using server-side apply (and force-conflicts where appropriate) for the kustomize output.
```yaml
    verbs: ["*"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["*"]
  - apiGroups: ["grafana.integreatly.org"]
    resources: ["grafanas", "grafanadashboards", "grafanadatasources", "grafanafolders"]
    verbs: ["*"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["*"]
```
This ClusterRole grants wildcard verbs ("*") on several resource types. For a namespaced installation this is typically broader than necessary and increases blast radius. Prefer narrowing to the minimal verb set and using a Role/RoleBinding where cluster scope is not required.
Suggested change:

```diff
-    verbs: ["*"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
   - apiGroups: ["apps"]
     resources: ["deployments"]
-    verbs: ["*"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
   - apiGroups: ["grafana.integreatly.org"]
     resources: ["grafanas", "grafanadashboards", "grafanadatasources", "grafanafolders"]
-    verbs: ["*"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
   - apiGroups: ["networking.k8s.io"]
     resources: ["ingresses"]
-    verbs: ["*"]
+    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```
```yaml
securityContext:
  privileged: true
```
securityContext.privileged: true is a very high-privilege setting. If Kepler supports it, harden the container (for example allowPrivilegeEscalation: false, readOnlyRootFilesystem, and explicit dropped capabilities) and document why privileged mode is required.
Suggested change:

```diff
   securityContext:
+    # Kepler requires privileged mode to access host-level energy and kernel metrics
+    # via /proc, /sys, and the container runtime socket.
     privileged: true
+    allowPrivilegeEscalation: false
+    readOnlyRootFilesystem: true
+    capabilities:
+      drop:
+        - "ALL"
```
README.md (Outdated)
We are planning to cut 2 releases per year, for example 26.03 and 26.10 before each KubeCon EU and NA.
We ask each working group/component to provide non-breaking patch releases for 6 months based on the version in each date release.
We try to BEST-EFFORT support each release for 6 months as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.
We try to BEST-EFFORT support each realease for 6 monhts as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.
Spelling: `realease` should be `release`, and `monhts` should be `months`.
Suggested change:

```diff
-We try to BEST-EFFORT support each realease for 6 monhts as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.
+We try to BEST-EFFORT support each release for 6 months as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.
```
README.md (Outdated)
| Katib | applications/katib/upstream | [v0.19.0](https://github.com/kubeflow/katib/tree/v0.19.0/manifests/v1beta1) | 13m | 476Mi | 10GB |
| KServe | applications/kserve/kserve | [v0.16.0](https://github.com/kserve/kserve/releases/tag/v0.16.0/install/v0.16.0) | 600m | 1200Mi | 0GB |
| KServe Models Web Application | applications/kserve/models-web-app | [c71ee4309f0335159d9fdfd4559a538b5c782c92](https://github.com/kserve/models-web-app/tree/c71ee4309f0335159d9fdfd4559a538b5c782c92/manifests/kustomize) | 6m | 259Mi | 0GB |
| KServe Models Web Application | applications/kserve/models-web-app | [v0.15.0](https://github.com/kserve/models-web-app/tree/v0.15.0/config) | 6m | 259Mi | 0GB |
This table entry links KServe Models Web Application to v0.15.0, but the manifests in applications/kserve/models-web-app currently set the image tag to 0.16.1. Please align the README's upstream revision/link with the version actually shipped.
Suggested change:

```diff
-| KServe Models Web Application | applications/kserve/models-web-app | [v0.15.0](https://github.com/kserve/models-web-app/tree/v0.15.0/config) | 6m | 259Mi | 0GB |
+| KServe Models Web Application | applications/kserve/models-web-app | [v0.16.1](https://github.com/kserve/models-web-app/tree/v0.16.1/config) | 6m | 259Mi | 0GB |
```
```yaml
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
```
The ClusterRole is named prometheus, which is very likely to collide with existing Prometheus installations because this is a cluster-scoped resource. Consider prefixing this name with something unique to this component (for example, kubeflow-monitoring-...) to avoid installation conflicts.
Suggested change:

```diff
-  name: prometheus
+  name: kubeflow-monitoring-prometheus
 rules:
   - apiGroups: [""]
     resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
     verbs: ["get", "list", "watch"]
   - apiGroups: [""]
     resources: ["configmaps"]
     verbs: ["get"]
   - nonResourceURLs: ["/metrics"]
     verbs: ["get"]
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
-  name: prometheus
+  name: kubeflow-monitoring-prometheus
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: ClusterRole
-  name: prometheus
+  name: kubeflow-monitoring-prometheus
 subjects:
   - kind: ServiceAccount
     name: prometheus
```
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
```
The ClusterRoleBinding is named prometheus, which is very likely to collide with existing Prometheus installations because this is a cluster-scoped resource. Consider prefixing this name with something unique to this component (for example, kubeflow-monitoring-...) to avoid installation conflicts.
```yaml
  namespace: kubeflow-monitoring-system
spec:
  datasource:
    name: prometheus
```
Dashboards in this component reference the Prometheus datasource by UID ("uid": "prometheus"), but this GrafanaDatasource does not set spec.datasource.uid. Without a stable UID, Grafana will generate a different one and the dashboards will not bind to the intended datasource.
Suggested change:

```diff
     name: prometheus
+    uid: prometheus
```
README.md (Outdated)
To view all past security scans, head to the [Image Extracting and Security Scanning GitHub Action workflow](https://github.com/kubeflow/manifests/actions/workflows/trivy.yaml). In the logs of the workflow, you can expand the `Run image extracting and security scanning script` step to view the CVE logs. You will find a per-image CVE scan and a JSON dump of per-WorkingGroup aggregated metrics. You can run the Python script from the workflow file locally on your machine to obtain the detailed JSON files for any git commit.

For more information please consult the [SECURITY.md](./SECURITY.md).
For more infromation please consult the [SECURITY.md](./SECURITY.md).
Spelling: infromation should be information.
Suggested change:

```diff
-For more infromation please consult the [SECURITY.md](./SECURITY.md).
+For more information please consult the [SECURITY.md](./SECURITY.md).
```
README.md (Outdated)
**A:** Please refer to each individual component's documentation for a dependency compatibility range. For Istio, Knative, Dex, Cert-Manager, and OAuth2 Proxy, the versions in `common` are the ones we have validated.
- **Q:** Can I use Kubeflow in an air-gapped environment?
**A:** Yes you can. You just need to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.
**A:** Yes you can. You just need to to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.
Spelling/grammar: remove the duplicated word to ("need to to get").
Suggested change:

```diff
-**A:** Yes you can. You just need to to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.
+**A:** Yes you can. You just need to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.
```
Force-pushed from a19b4f8 to 54e9895.
- For the specific Kubernetes version per release, consult the [release notes](https://github.com/kubeflow/manifests/releases).
- Our Kind script below will take care of installing continuously tested Kubernetes, Kustomize and Kubectl versions for you.
- We use Kind as default but also support Minikube, Rancher, EKS, AKS, and GKE. GKE might need tiny adjustments documented here in this file and OpenShift is also possible.
This PR removes the ARM64/aarch64 support note from the README. That change appears unrelated to the observability stack feature and removes potentially useful guidance for users. If this was accidental, consider restoring the section; if intentional, it likely deserves a separate PR with rationale.
README.md (Outdated)
Two consecutive blank lines were added here, which creates unnecessary diff noise and slightly reduces readability. Consider removing the extra empty lines.
```json
"title": "Per-namespace GPU Utilization over time",
"type": "timeseries",
"targets": [
  { "expr": "sum(DCGM_FI_DEV_GPU_UTIL) by (namespace)", "legendFormat": "{{namespace}}" }
]
```
The “Per-namespace GPU Utilization over time” panel groups DCGM_FI_DEV_GPU_UTIL by namespace, but for DCGM exporter metrics the namespace label (when present) typically reflects the scrape target’s Kubernetes namespace (for example gpu-operator), not the namespaces where GPU workloads run. This will produce misleading results. Either adjust the query to a workload-namespace-aware metric/join, or rename the panel to reflect what is actually being grouped.
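One hedged way to make this panel workload-aware: when the DCGM exporter's Kubernetes pod mapping is enabled, the exporter attaches its own `namespace` label for the GPU workload, and with the Prometheus default `honor_labels: false` that label is renamed to `exported_namespace` at scrape time. The query below assumes exactly that setup; verify the actual label names on the exporter's `/metrics` output before adopting it.

```json
{
  "title": "Per-namespace GPU Utilization over time",
  "type": "timeseries",
  "targets": [
    { "expr": "sum(DCGM_FI_DEV_GPU_UTIL) by (exported_namespace)", "legendFormat": "{{exported_namespace}}" }
  ]
}
```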
force-pushed from 54e9895 to a07c50a (Compare)
| name: kepler | ||
| namespace: kubeflow-monitoring-system | ||
| labels: | ||
| app.kubernetes.io/name: prometheus-operator |
This ServiceMonitor sets metadata.labels.app.kubernetes.io/name: prometheus-operator, but the Prometheus CR currently uses an empty serviceMonitorSelector, so this label is not doing anything. Either scope Prometheus to select only ServiceMonitors with a dedicated label (recommended), or update/remove this label to avoid implying it is required for discovery.
| app.kubernetes.io/name: prometheus-operator | |
| app.kubernetes.io/name: kepler |
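The recommended scoping could look like the sketch below: the Prometheus CR selects only ServiceMonitors carrying a dedicated label, instead of an empty (match-everything) selector. The label key `monitoring.kubeflow.org/scrape` is an illustrative assumption, not something defined in this PR; each ServiceMonitor in the component (including Kepler's) would then need to carry it.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kubeflow-monitoring
  namespace: kubeflow-monitoring-system
spec:
  serviceAccountName: prometheus
  # Only pick up ServiceMonitors that opt in via this label
  # (hypothetical label key/value, shown for illustration).
  serviceMonitorSelector:
    matchLabels:
      monitoring.kubeflow.org/scrape: "true"
  serviceMonitorNamespaceSelector: {}
```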
README.md
Outdated
| This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace. | ||
|
|
||
| Install the observability base component: |
The description here reads like Kepler is included in the base install, but the next section makes Kepler opt-in. Consider adjusting the wording to explicitly say Kepler is optional/opt-in to avoid confusion about what ./tests/observability_install.sh installs by default.
| This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace. | |
| Install the observability base component: | |
| This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace. Support for energy consumption metrics via Kepler is an additional, opt-in component and is not installed by default by `./tests/observability_install.sh`; see the Kepler section below to enable it. | |
| Install the observability base component (GPU metrics, Prometheus, and Grafana, without Kepler): |
| apiVersion: v1 | ||
| kind: Namespace | ||
| metadata: | ||
| name: kubeflow-monitoring-system |
The new namespace manifest does not include the Pod Security Standards label used across other common/ namespaces (for example pod-security.kubernetes.io/enforce: restricted). Without this, installs can behave inconsistently on clusters with PodSecurity admission enabled. Consider adding the same default PSS label here (and any additional labels you need).
| name: kubeflow-monitoring-system | |
| name: kubeflow-monitoring-system | |
| labels: | |
| pod-security.kubernetes.io/enforce: restricted |
| apiVersion: v1 | ||
| kind: Namespace | ||
| metadata: | ||
| name: kubeflow-monitoring-system |
Kepler runs as a privileged DaemonSet (hostPID/hostNetwork + privileged container), but this namespace manifest does not set a Pod Security Standards label that would allow it on clusters enforcing PSS. Consider setting pod-security.kubernetes.io/enforce: privileged here (or applying an explicit namespace patch as part of the Kepler component) so the opt-in Kepler install is self-contained.
| name: kubeflow-monitoring-system | |
| name: kubeflow-monitoring-system | |
| labels: | |
| pod-security.kubernetes.io/enforce: privileged |
force-pushed from a07c50a to 919001f (Compare)
force-pushed from fa3747f to 6d8d4ca (Compare)
common/observability/README.md
Outdated
| | NVIDIA GPU Operator | NVIDIA ServiceMonitor | Deploy in `gpu-operator` ns — silent if absent | | ||
| | AMD GPU Operator | AMD ServiceMonitor | Deploy in `kube-amd-gpu` ns — silent if absent | |
The README states the NVIDIA/AMD ServiceMonitors are deployed into gpu-operator / kube-amd-gpu and are “silent if absent”, but the base kustomization sets namespace: kubeflow-monitoring-system (which rewrites namespaced resources). Either update the README to reflect that the ServiceMonitors are created in kubeflow-monitoring-system and target those namespaces via spec.namespaceSelector, or adjust the manifests so the installation is actually silent when the GPU-operator namespaces are missing.
common/observability/README.md
Outdated
| | NVIDIA DCGM ServiceMonitor | gpu-operator | Scrapes DCGM exporter | | ||
| | AMD ROCm ServiceMonitor | kube-amd-gpu | Scrapes device-metrics-exporter | | ||
| | 3x GrafanaDashboard CRs | kubeflow-monitoring-system | GPU dashboards | | ||
| | Kepler DaemonSet (opt-in) | kepler | Per-pod energy/power draw metrics | | ||
| | Kepler ServiceMonitor (opt-in) | kubeflow-monitoring-system | Scrapes Kepler | | ||
|
|
In the “What gets installed” table, the ServiceMonitor namespaces are listed as gpu-operator and kube-amd-gpu, but the component’s namespace: kubeflow-monitoring-system will place namespaced resources into kubeflow-monitoring-system at build time. Please reconcile the table with the actual rendered output (and clarify that spec.namespaceSelector controls the target namespace to scrape).
| | NVIDIA DCGM ServiceMonitor | gpu-operator | Scrapes DCGM exporter | | |
| | AMD ROCm ServiceMonitor | kube-amd-gpu | Scrapes device-metrics-exporter | | |
| | 3x GrafanaDashboard CRs | kubeflow-monitoring-system | GPU dashboards | | |
| | Kepler DaemonSet (opt-in) | kepler | Per-pod energy/power draw metrics | | |
| | Kepler ServiceMonitor (opt-in) | kubeflow-monitoring-system | Scrapes Kepler | | |
| | NVIDIA DCGM ServiceMonitor | kubeflow-monitoring-system | Scrapes DCGM exporter in `gpu-operator` namespace via `spec.namespaceSelector` | | |
| | AMD ROCm ServiceMonitor | kubeflow-monitoring-system | Scrapes device-metrics-exporter in `kube-amd-gpu` namespace via `spec.namespaceSelector` | | |
| | 3x GrafanaDashboard CRs | kubeflow-monitoring-system | GPU dashboards | | |
| | Kepler DaemonSet (opt-in) | kepler | Per-pod energy/power draw metrics | | |
| | Kepler ServiceMonitor (opt-in) | kubeflow-monitoring-system | Scrapes Kepler | | |
| Note: All ServiceMonitor resources are created in the `kubeflow-monitoring-system` namespace. The `spec.namespaceSelector` field on each ServiceMonitor controls which target namespaces are scraped. |
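The pattern described in the table note can be sketched as follows: the ServiceMonitor object itself lives in `kubeflow-monitoring-system`, while `spec.namespaceSelector` points the scrape at the GPU operator's namespace. The endpoint port name and selector labels below are illustrative and must match the actual DCGM exporter Service.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: kubeflow-monitoring-system   # where the ServiceMonitor is created
spec:
  namespaceSelector:
    matchNames:
      - gpu-operator                       # where the exporter Service runs
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter            # assumed Service label
  endpoints:
    - port: gpu-metrics                    # assumed Service port name
      interval: 30s
```

Because the ServiceMonitor itself never needs to exist in `gpu-operator`, the install stays silent when that namespace is absent: the selector simply matches nothing.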
tests/model_registry_test.sh
Outdated
| # FIX 14: Track port-forward PIDs for cleanup | ||
| PF_PIDS=() |
The # FIX 14: comments reference an external/unknown identifier and do not explain intent on their own. Consider replacing with a self-contained explanation (e.g., “Track port-forward PIDs for cleanup”) so the script remains understandable without external context.
| configMapRef: | ||
| name: gpu-cluster-usage-dashboard | ||
| key: gpu-cluster-usage.json |
spec.configMapRef references a ConfigMap named gpu-cluster-usage-dashboard, but this repository only defines a GrafanaDashboard with that name (no ConfigMap with key gpu-cluster-usage.json). This dashboard CR will not reconcile as intended. Either switch the referenced object to an actual ConfigMap containing the JSON, or remove configMapRef and keep a single GrafanaDashboard resource (avoid duplicating dashboards in this kustomization).
| configMapRef: | |
| name: gpu-cluster-usage-dashboard | |
| key: gpu-cluster-usage.json |
| key: gpu-availability-allocation.json | ||
| instanceSelector: | ||
| matchLabels: | ||
| dashboards: "grafana" |
spec.configMapRef references a ConfigMap gpu-availability-allocation-dashboard with key gpu-availability-allocation.json, but no such ConfigMap exists in this component. The only matching name is another GrafanaDashboard, so this CR will not reconcile correctly. Please either add the referenced ConfigMap or drop the -cr.yaml variant and keep a single dashboard definition.
| dashboards: "grafana" | |
| dashboards: "grafana" | |
| --- | |
| apiVersion: v1 | |
| kind: ConfigMap | |
| metadata: | |
| name: gpu-availability-allocation-dashboard | |
| namespace: kubeflow-monitoring-system | |
| data: | |
| gpu-availability-allocation.json: "{}" |
common/observability/README.md
Outdated
| Open http://localhost:3000 — default credentials are admin/admin on first login. | ||
|
|
The README documents Grafana default credentials as admin/admin. Since this stack is intended to be “batteries-included”, please add a clear warning and a concrete follow-up step to rotate the admin password (or configure admin credentials via a Secret) to avoid encouraging an insecure default in real clusters.
| Open http://localhost:3000 — default credentials are admin/admin on first login. | |
| Open http://localhost:3000 — default credentials are `admin` / `admin` on first login. | |
| **Security warning:** These default administrator credentials are insecure and must not be used beyond initial local testing. | |
| **Immediately after first login, rotate the Grafana administrator password** via the Grafana user interface (`Configuration → Users → admin`). For production or shared clusters, configure hardened administrator credentials through a Kubernetes Secret and the corresponding Grafana custom resource configuration (see Grafana Operator documentation) before exposing Grafana outside the cluster. |
| - uses: actions/checkout@v4 | ||
| - name: Install kustomize | ||
| run: | | ||
| curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash | ||
| sudo mv kustomize /usr/local/bin/ |
This workflow installs kustomize by piping a script from the kustomize repo’s master branch, which is non-deterministic and bypasses checksum verification. The repository already has a pinned+checksum-verified installer (tests/install_KinD_create_KinD_cluster_install_kustomize.sh) and most workflows use actions/checkout@v5 (this one uses @v4). Please align with the repo’s existing installation approach and pin versions for reproducible CI.
| - uses: actions/checkout@v4 | |
| - name: Install kustomize | |
| run: | | |
| curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash | |
| sudo mv kustomize /usr/local/bin/ | |
| - uses: actions/checkout@v5 | |
| - name: Install kustomize | |
| run: | | |
| chmod +x tests/install_KinD_create_KinD_cluster_install_kustomize.sh | |
| tests/install_KinD_create_KinD_cluster_install_kustomize.sh |
| spec: | ||
| containers: | ||
| - name: grafana | ||
| image: grafana/grafana:10.2.3 |
The kubeflow-monitoring-system namespace is labeled with PSS restricted enforcement, but the Grafana pod template here does not specify a pod/container securityContext (e.g., allowPrivilegeEscalation: false, capabilities.drop: ["ALL"], seccompProfile). To avoid relying on operator defaults (and potential PodSecurity admission rejections), consider setting an explicit restricted-compliant security context in this spec.deployment.spec.template.
| spec: | |
| containers: | |
| - name: grafana | |
| image: grafana/grafana:10.2.3 | |
| spec: | |
| securityContext: | |
| runAsNonRoot: true | |
| runAsUser: 472 | |
| runAsGroup: 472 | |
| fsGroup: 472 | |
| seccompProfile: | |
| type: RuntimeDefault | |
| containers: | |
| - name: grafana | |
| image: grafana/grafana:10.2.3 | |
| securityContext: | |
| allowPrivilegeEscalation: false | |
| capabilities: | |
| drop: | |
| - ALL |
| configMapRef: | ||
| name: gpu-namespace-usage-dashboard | ||
| key: gpu-namespace-usage.json |
spec.configMapRef points to gpu-namespace-usage-dashboard/gpu-namespace-usage.json, but there is no ConfigMap providing that key (the only object with that name is another GrafanaDashboard). This will fail to load the dashboard. Consider keeping a single GrafanaDashboard (inline json) or add the missing ConfigMap and reference it consistently.
| configMapRef: | |
| name: gpu-namespace-usage-dashboard | |
| key: gpu-namespace-usage.json | |
| json: | | |
| { | |
| "title": "GPU Namespace Usage", | |
| "schemaVersion": 38, | |
| "version": 1, | |
| "timezone": "browser", | |
| "editable": true, | |
| "panels": [] | |
| } |
| name: amd-gpu-metrics-exporter | ||
| namespace: kube-amd-gpu | ||
| labels: |
Same as the NVIDIA ServiceMonitor: the top-level namespace: kubeflow-monitoring-system will rewrite this resource’s namespace at build time, so metadata.namespace: kube-amd-gpu is misleading and makes the manifest fail if applied standalone when kube-amd-gpu is absent. Consider placing the ServiceMonitor in kubeflow-monitoring-system (or omitting metadata.namespace) and using spec.namespaceSelector.matchNames: [kube-amd-gpu] to target the exporter.
- Remove misleading metadata.namespace from NVIDIA/AMD ServiceMonitors; base kustomization sets namespace, spec.namespaceSelector targets exporters
- Add restricted PSS securityContext to Grafana CR pod spec
- Upgrade CI workflow to actions/checkout@v5 and use repo's pinned kustomize installer instead of curl|bash
- Add wait_or_dump helper to model_registry_test.sh for failure diagnostics
- Remove stale -cr.yaml dashboard duplicates that referenced non-existent ConfigMaps, causing dashboard reconciliation failures
- Fix Kepler DaemonSet inline comment to reference correct 'kepler' namespace
- Rewrite README: correct ServiceMonitor namespace descriptions, add namespaceSelector clarification, add Grafana security warning

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
| if ! kubectl wait --for=condition=available -n "$ns" "deployment/$deploy" --timeout="$timeout"; then | ||
| echo "ERROR: deployment $deploy in namespace $ns did not become available" | ||
| kubectl events -n "$ns" | ||
| kubectl describe "deployment/$deploy" -n "$ns" | ||
| kubectl logs "deployment/$deploy" -n "$ns" --all-containers --tail=50 || true | ||
| exit 1 |
kubectl events is not consistently available across kubectl versions/environments, and with set -e this can cause the failure-reporting path to exit before printing the subsequent describe/logs. Use a more portable command (e.g., kubectl get events) and/or guard it with || true so diagnostics are always emitted.
tests/model_registry_test.sh
Outdated
| nohup kubectl port-forward svc/model-registry-service -n kubeflow 8081:8080 & | ||
| PF_PIDS+=($!) | ||
|
|
||
| while ! curl -s localhost:8081 > /dev/null; do | ||
| echo "waiting for port-forwarding 8081..." | ||
| sleep 1 | ||
| done | ||
| echo "port-forwarding 8081 ready" |
This port-forward readiness loop has no timeout/max retries. If port-forwarding fails (e.g., port already in use, service missing, RBAC), the script can hang indefinitely in CI. Add a bounded retry/timeout and/or verify the port-forward process is still running (similar to tests/model_catalog_test.sh).
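A bounded readiness loop along these lines would address the comment. This is a self-contained sketch: `check_port` is a stub that succeeds on its third call, standing in for the real probe (`curl -s "localhost:$port" > /dev/null` in the PR's script).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stub readiness probe: succeeds on the third call so the example runs
# without a cluster. In the real script this would be a curl check.
attempts=0
check_port() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

# Bounded wait: gives up after max_attempts instead of hanging CI forever.
wait_for_port() {
  local port="$1" max_attempts="${2:-30}" count=0
  until check_port "$port"; do
    if [ "$count" -ge "$max_attempts" ]; then
      echo "ERROR: port-forward on $port not ready after $max_attempts attempts" >&2
      return 1
    fi
    echo "waiting for port-forwarding $port..."
    sleep 0.1
    count=$((count + 1))   # arithmetic expansion: safe under set -e
  done
  echo "port-forwarding $port ready"
}

wait_for_port 8081 10
```

A further hardening step (as done in tests/model_catalog_test.sh) is to also `kill -0` the port-forward PID inside the loop, so a crashed port-forward fails fast instead of burning all attempts.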
tests/model_registry_test.sh
Outdated
| nohup kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80 & | ||
| PF_PIDS+=($!) | ||
| sleep 3 | ||
| fi | ||
|
|
||
| while ! curl -s localhost:8080 > /dev/null; do | ||
| echo "waiting for port-forwarding 8080..." | ||
| sleep 1 | ||
| done | ||
| echo "port-forwarding 8080 ready" |
This second port-forward readiness loop also lacks a timeout/max retries, so CI can hang indefinitely if the Istio gateway port-forward fails. Add a bounded retry/timeout (or reuse a shared helper) to guarantee the script terminates with diagnostics.
| resources: | ||
| - gpu-cluster-usage-dashboard.yaml | ||
| - gpu-namespace-usage-dashboard.yaml | ||
| - gpu-availability-allocation-dashboard.yaml |
The dashboards directory contains both inline-JSON GrafanaDashboard resources and separate *-dashboard-cr.yaml resources that reference ConfigMaps, but the *-dashboard-cr.yaml files are not included here (and there are no ConfigMaps generated for them). This leaves dead/unused manifests in the tree and makes it unclear which provisioning mechanism is intended. Either remove the *-dashboard-cr.yaml files or include them together with the required ConfigMaps/configMapGenerator and reference them consistently.
| - gpu-availability-allocation-dashboard.yaml |
| | KServe | applications/kserve/kserve | [v0.16.0](https://github.com/kserve/kserve/releases/tag/v0.16.0/install/v0.16.0) | 600m | 1200Mi | 0GB | | ||
| | KServe Models Web Application | applications/kserve/models-web-app | [c71ee4309f0335159d9fdfd4559a538b5c782c92](https://github.com/kserve/models-web-app/tree/c71ee4309f0335159d9fdfd4559a538b5c782c92/manifests/kustomize) | 6m | 259Mi | 0GB | | ||
| | KServe Models Web Application | applications/kserve/models-web-app | [v0.16.1](https://github.com/kserve/models-web-app/tree/v0.16.1/config) | 6m | 259Mi | 0GB | | ||
| | Kubeflow Pipelines | applications/pipeline/upstream | [2.16.0](https://github.com/kubeflow/pipelines/tree/2.16.0/manifests/kustomize) | 970m | 3552Mi | 35GB | |
This PR is described as introducing an opt-in observability stack, but this README change also updates the referenced KServe Models Web Application version/link. If this update is intentional, it should be called out in the PR description (or moved to a separate PR) to avoid mixing unrelated changes.
- Replace kubectl events with more portable command in tests
- Add bounded retry validation for port-forward loops
- Remove orphaned dashboards CRs without ConfigMaps
- Revert unintended KServe Models Web App version bump in README

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
| kind: ClusterRole | ||
| metadata: | ||
| name: grafana-operator | ||
| rules: | ||
| - apiGroups: [""] |
The ClusterRole / ClusterRoleBinding names here are very generic (grafana-operator). Because these are cluster-scoped, installing this component into a cluster that already has a Grafana Operator (or any other component using the same names) will cause RBAC resource name collisions and unexpected permission changes. Consider prefixing these names (for example with kubeflow- / kubeflow-monitoring-) to make them globally unique.
| fi | ||
| echo "waiting for port-forwarding $port..." | ||
| sleep 1 | ||
| ((count++)) | ||
| done |
wait_for_port uses ((count++)) while the script runs with set -e. In Bash, ((count++)) returns exit status 1 when count is 0 (first loop iteration), which will terminate the script prematurely. Use an increment form that does not trigger set -e (for example, count=$((count+1)), ((++count)), or add an explicit || true).
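The `set -e` pitfall described above can be demonstrated in isolation. An arithmetic command `(( ... ))` returns exit status 1 when its expression evaluates to 0, which is exactly what the post-increment `count++` yields on the first iteration when `count` starts at 0.

```shell
#!/usr/bin/env bash
set -e

count=0

# ((count++)) here would abort the script: the post-increment expression
# evaluates to 0 (the pre-increment value), and an arithmetic command with
# a zero-valued expression exits with status 1, which set -e treats as
# a fatal failure.

count=$((count + 1))   # arithmetic *expansion*: exit status is always 0
((++count))            # pre-increment: safe here only because the result (2)
                       # is non-zero; `count=$((count+1))` is the robust form
echo "count=$count"
```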
| spec: | ||
| serviceAccountName: prometheus | ||
| serviceMonitorSelector: {} | ||
| serviceMonitorNamespaceSelector: {} | ||
| resources: | ||
| requests: | ||
| memory: 400Mi |
The kubeflow-monitoring-system namespace is labeled with Pod Security Standards restricted, but this Prometheus CR does not define pod/container securityContext (e.g., allowPrivilegeEscalation: false, seccompProfile: RuntimeDefault, runAsNonRoot, dropped capabilities). On clusters enforcing PSA restricted without SeccompDefault, the operator-created Prometheus StatefulSet can be blocked. Configure the Prometheus CR to set an explicit pod and container security context that satisfies PSS restricted.
| - name: containerd | ||
| mountPath: /var/run/containerd | ||
| readOnly: true | ||
| volumes: | ||
| - name: proc | ||
| hostPath: | ||
| path: /proc | ||
| - name: sys | ||
| hostPath: | ||
| path: /sys | ||
| - name: containerd | ||
| hostPath: | ||
| path: /var/run/containerd |
The Kepler DaemonSet hard-codes a hostPath mount for /var/run/containerd. On nodes that do not use containerd (e.g., CRI-O), this path may not contain the expected socket and Kepler can fail to collect runtime metrics (or silently report incomplete data). Consider making the runtime socket mount configurable (or mount a more general path used by supported runtimes) and document the supported container runtimes explicitly.
- Rename Grafana Operator ClusterRole/Binding to prevent RBAC collisions
- Fix premature bash exit from set -e in wait_for_port increment
- Add PSS restricted securityContext configuration to Prometheus CR
- Generalize Kepler DaemonSet runtime socket mount path to /var/run

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
…eview feedback Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
force-pushed from 8ec9b85 to aab37cf (Compare)
| pod-security.kubernetes.io/enforce-version: latest | ||
| pod-security.kubernetes.io/warn: restricted | ||
| pod-security.kubernetes.io/warn-version: latest |
Using pod-security.kubernetes.io/*-version: latest makes the enforced Pod Security Standard level change automatically as Kubernetes evolves, which can lead to unexpected breakage on cluster upgrades. Consider pinning to a specific Kubernetes minor (matching the project's supported floor) or omitting the version labels so upgrades are explicit and controlled.
| pod-security.kubernetes.io/enforce-version: latest | |
| pod-security.kubernetes.io/warn: restricted | |
| pod-security.kubernetes.io/warn-version: latest | |
| pod-security.kubernetes.io/warn: restricted |
| pod-security.kubernetes.io/enforce-version: latest | ||
| pod-security.kubernetes.io/warn: privileged | ||
| pod-security.kubernetes.io/warn-version: latest |
Using pod-security.kubernetes.io/*-version: latest makes the enforced Pod Security Standard level change automatically as Kubernetes evolves, which can lead to unexpected breakage on cluster upgrades. Consider pinning to a specific Kubernetes minor (matching the project's supported floor) or omitting the version labels so upgrades are explicit and controlled.
| pod-security.kubernetes.io/enforce-version: latest | |
| pod-security.kubernetes.io/warn: privileged | |
| pod-security.kubernetes.io/warn-version: latest | |
| pod-security.kubernetes.io/warn: privileged |
| echo "ERROR: deployment $deploy in namespace $ns did not become available" | ||
| kubectl get events -n "$ns" --sort-by='.lastTimestamp' || true | ||
| kubectl describe "deployment/$deploy" -n "$ns" | ||
| kubectl logs "deployment/$deploy" -n "$ns" --all-containers --tail=50 || true | ||
| exit 1 |
In the failure path, kubectl describe "deployment/$deploy" -n "$ns" is not guarded with || true. With set -euo pipefail, if the deployment does not exist (or describe fails for another reason), the script can exit before printing logs and the explicit exit 1, reducing debuggability. Consider adding || true (like the other debug commands) so the function reliably reaches the intended error output and exit 1.
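The effect of guarding each diagnostic with `|| true` can be shown with a stub: under `set -e`, the first unguarded failing command would end the script before the remaining diagnostics print. Here `false` stands in for a `kubectl describe`/`kubectl logs` call that might fail.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Every debug command in the failure path is guarded so that one failure
# cannot prevent the later diagnostics (and the explicit exit code) from
# being reached.
dump_diagnostics() {
  false || true   # stands in for: kubectl describe "deployment/$deploy" -n "$ns" || true
  echo "events dumped"
  false || true   # stands in for: kubectl logs ... --tail=50 || true
  echo "logs dumped"
}

dump_diagnostics
```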
tests/model_registry_test.sh
Outdated
| # Track port-forward PIDs so they are always killed on exit. | ||
| PF_PIDS=() | ||
|
|
Variable names like PF_PIDS, PF_PID_8081, and PF_PID_8080 are abbreviations that are hard to interpret without local context. Consider renaming them to fully spelled-out, descriptive names (for example, something like “port forward process IDs”) to improve long-term readability and maintainability of this test script.
| - apiGroups: [""] | ||
| resources: ["configmaps", "secrets", "services", "serviceaccounts"] | ||
| verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] | ||
| - apiGroups: ["apps"] | ||
| resources: ["deployments"] | ||
| verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] |
The Grafana Operator ClusterRole grants cluster-wide write access to Secrets/ConfigMaps/Services/ServiceAccounts and Deployments. Since this stack is intended to be namespaced to kubeflow-monitoring-system, consider scoping the operator to a single namespace (e.g., set a watch namespace) and replacing the ClusterRole/ClusterRoleBinding with a namespaced Role/RoleBinding (or otherwise reducing privileges) to avoid unnecessary cluster-wide secret mutation permissions.
Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
Pull request overview
Copilot reviewed 33 out of 35 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (6)
- tests/observability_install.sh:1 - `kubectl apply --server-side --force-conflicts` can unintentionally take ownership of and overwrite fields for resources that may already exist (especially in shared dev clusters), making the install script risky outside of ephemeral CI. Consider removing `--force-conflicts` by default, or gating it behind an env var/flag (e.g., `FORCE_CONFLICTS=true`).
- tests/observability_install.sh:1 - Waiting on `deployment -l control-plane=controller-manager` is not specific to the Grafana operator and can match multiple deployments (or none), which can make the script flaky or hang/fail unexpectedly. Prefer waiting for a specific resource name (e.g., `deployment/grafana-operator`) or a unique label selector tied to this deployment.
- tests/model_registry_test.sh:1 - Using `nohup` without redirecting stdout/stderr will typically create/update `nohup.out`, which can clutter CI artifacts and local working directories. Consider redirecting output to a known location (or `/dev/null`) while still keeping the PID-based cleanup logic.
- tests/model_registry_test.sh:1 - Using `nohup` without redirecting stdout/stderr will typically create/update `nohup.out`, which can clutter CI artifacts and local working directories. Consider redirecting output to a known location (or `/dev/null`) while still keeping the PID-based cleanup logic.
- tests/model_registry_test.sh:1 - Since `PORT_FORWARD_PIDS` is always initialized, `${PORT_FORWARD_PIDS[@]:-}` is unnecessary and can cause an extra loop iteration with an empty string in some cases. Use a normal array iteration (`"${PORT_FORWARD_PIDS[@]}"`) and optionally gate on array length for clearer behavior.
- tests/model_registry_test.sh:1 - Since `PORT_FORWARD_PIDS` is always initialized, `${PORT_FORWARD_PIDS[@]:-}` is unnecessary and can cause an extra loop iteration with an empty string in some cases. Use a normal array iteration (`"${PORT_FORWARD_PIDS[@]}"`) and optionally gate on array length for clearer behavior.
| | Prerequisite | Required for | Notes | | ||
| |---|---|---| | ||
| | Kubernetes 1.27+ | Everything | | | ||
| | kustomize v5+ | Installation | | | ||
| | NVIDIA GPU Operator | NVIDIA ServiceMonitor | Runs in `gpu-operator` ns — ServiceMonitor scrapes it via `spec.namespaceSelector`; silent if absent | | ||
| | AMD GPU Operator | AMD ServiceMonitor | Runs in `kube-amd-gpu` ns — ServiceMonitor scrapes it via `spec.namespaceSelector`; silent if absent | | ||
| | kube-state-metrics | GPU Namespace Usage + Availability dashboards | **Without it 2/3 dashboards render blank with no error** — install via kube-prometheus-stack or standalone | |
The prerequisite (and other) tables use || at the start of rows, which renders as an extra empty column / broken markdown in many renderers. Use standard markdown table formatting (| ... |) consistently so the tables render correctly.
| config: | ||
| auth: | ||
| disable_login_form: "false" | ||
| users: | ||
| allow_sign_up: "false" |
This stack appears to rely on Grafana's default initial credentials (also documented as admin/admin). Even for an opt-in component, shipping a configuration that comes up with well-known credentials is risky. Prefer wiring admin credentials through a Kubernetes Secret supported by the Grafana Operator (or at least make the insecure default explicitly opt-in), and document the required Secret/values.
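One hedged way to wire credentials through a Secret: Grafana reads its admin user/password from the standard `GF_SECURITY_ADMIN_USER` / `GF_SECURITY_ADMIN_PASSWORD` environment variables, which a grafana-operator v5-style Grafana CR can inject via `envFrom`. The Secret name and the exact CR shape below are assumptions; check the operator API version actually used by this component.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-credentials   # hypothetical Secret name
  namespace: kubeflow-monitoring-system
stringData:
  GF_SECURITY_ADMIN_USER: admin
  GF_SECURITY_ADMIN_PASSWORD: change-me   # replace before deploying
---
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  namespace: kubeflow-monitoring-system
spec:
  deployment:
    spec:
      template:
        spec:
          containers:
            - name: grafana
              envFrom:
                - secretRef:
                    name: grafana-admin-credentials
```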
| capabilities: | ||
| drop: | ||
| - ALL |
Setting privileged: true effectively grants broad capabilities; additionally specifying capabilities.drop: [ALL] is misleading because privileged mode can negate the intent of capability dropping. To avoid conveying a false sense of hardening, either remove the capabilities.drop stanza when running privileged, or (if feasible) run non-privileged with a minimal explicit capability set.
| capabilities: | |
| drop: | |
| - ALL |
Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
This PR introduces a new `common/observability/` kustomize component that provides an optional, batteries-included monitoring stack.

Key Changes:
- Deployed in the `kubeflow-monitoring-system` namespace.
- Installed via `tests/observability_install.sh`.

Follows mentor feedback regarding namespace and operator inclusion. Opt-in only, no impact on default installations.
Fixes #3426