Commit 54e9895

feat(observability): Add opt-in observability stack #3426
Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
1 parent 46f3142 commit 54e9895

33 files changed

Lines changed: 56535 additions & 6 deletions

README.md

Lines changed: 20 additions & 6 deletions
@@ -75,7 +75,7 @@ This repository periodically synchronizes all official Kubeflow components from
 | Volumes Web Application | applications/volumes-web-app/upstream | [v1.10.0](https://github.com/kubeflow/kubeflow/tree/v1.10.0/components/crud-web-apps/volumes/manifests) | 4m | 226Mi | 0GB |
 | Katib | applications/katib/upstream | [v0.19.0](https://github.com/kubeflow/katib/tree/v0.19.0/manifests/v1beta1) | 13m | 476Mi | 10GB |
 | KServe | applications/kserve/kserve | [v0.16.0](https://github.com/kserve/kserve/releases/tag/v0.16.0/install/v0.16.0) | 600m | 1200Mi | 0GB |
-| KServe Models Web Application | applications/kserve/models-web-app | [c71ee4309f0335159d9fdfd4559a538b5c782c92](https://github.com/kserve/models-web-app/tree/c71ee4309f0335159d9fdfd4559a538b5c782c92/manifests/kustomize) | 6m | 259Mi | 0GB |
+| KServe Models Web Application | applications/kserve/models-web-app | [v0.16.1](https://github.com/kserve/models-web-app/tree/v0.16.1/config) | 6m | 259Mi | 0GB |
 | Kubeflow Pipelines | applications/pipeline/upstream | [2.16.0](https://github.com/kubeflow/pipelines/tree/2.16.0/manifests/kustomize) | 970m | 3552Mi | 35GB |
 | Kubeflow Model Registry | applications/model-registry/upstream | [v0.3.7](https://github.com/kubeflow/model-registry/tree/v0.3.7/manifests/kustomize) | 510m | 2112Mi | 20GB |
 | Spark Operator | applications/spark/spark-operator | [2.5.0](https://github.com/kubeflow/spark-operator/tree/v2.5.0) | 9m | 41Mi | 0GB |
@@ -84,6 +84,7 @@ This repository periodically synchronizes all official Kubeflow components from
 | Cert Manager | common/cert-manager | [1.19.4](https://github.com/cert-manager/cert-manager/releases/tag/v1.19.4) | 3m | 128Mi | 0GB |
 | Dex | common/dex | [2.45.0](https://github.com/dexidp/dex/releases/tag/v2.45.0) | 3m | 27Mi | 0GB |
 | OAuth2-Proxy | common/oauth2-proxy | [7.14.3](https://github.com/oauth2-proxy/oauth2-proxy/releases/tag/v7.14.3) | 3m | 27Mi | 0GB |
+| Observability | common/observability | [3426](https://github.com/kubeflow/manifests/issues/3426) | - | - | 0GB |
 | **Total** | | | **4380m** | **12341Mi** | **65GB** |
 
 
@@ -108,11 +109,6 @@ The `example` directory contains an example kustomization for the single command
 - Our Kind script below will take care of installing continuously tested Kubernetes, Kustomize and Kubectl versions for you.
 - We use Kind as default but also support Minikube, Rancher, EKS, AKS, and GKE. GKE might need tiny adjustments documented here in this file and OpenShift is also possible.
 
-### ARM64 / aarch64 note
-
-Kubeflow on ARM64/aarch64 may not be fully supported yet because some OCI images might not be available for `linux/arm64`.
-If you hit image pull errors such as “no matching manifest for linux/arm64”, please track/report details in kubeflow/manifests#2745 and take a look at the [Google Summer of Code project for Kubeflow on ARM64](https://www.kubeflow.org/events/upcoming-events/gsoc-2026/#project--end-to-end-arm64-support--validation-on-kubeflow).
-
 ---
 **NOTE**
 
@@ -182,6 +178,22 @@ Install the Kubeflow namespace:
 kustomize build common/kubeflow-namespace/base | kubectl apply -f -
 ```
 
+#### Observability Stack (Optional)
+
+This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace.
+
+Install the observability base component:
+
+```sh
+./tests/observability_install.sh
+```
+
+To opt into Kepler for energy metrics:
+
+```sh
+kustomize build common/observability/components/kepler | kubectl apply -f -
+```
+
 #### Cert-manager
 
 Cert-manager is used by many Kubeflow components to provide certificates for admission webhooks.
@@ -450,6 +462,8 @@ kustomize build applications/tensorboard/tensorboard-controller/upstream/overlay
 ./tests/spark_install.sh
 ```
 
+
+
 #### User Namespaces
 
 Finally, create a new namespace for the default user (named `kubeflow-user-example-com`).
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+apiVersion: grafana.integreatly.org/v1beta1
+kind: GrafanaDashboard
+metadata:
+  name: gpu-availability-allocation
+  namespace: kubeflow-monitoring-system
+spec:
+  configMapRef:
+    name: gpu-availability-allocation-dashboard
+    key: gpu-availability-allocation.json
+  instanceSelector:
+    matchLabels:
+      dashboards: "grafana"
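Reviewer note: the `instanceSelector` in this resource only attaches the dashboard to a `Grafana` instance that carries the label `dashboards: "grafana"`. The instance manifest is not visible in this part of the diff, so the following is only a minimal sketch of a matching instance, assuming grafana-operator v5. The instance name `grafana` and the `spec.config` values are assumptions, not taken from this commit.

```yaml
# Hypothetical Grafana instance; not part of the shown diff.
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana                        # assumed name
  namespace: kubeflow-monitoring-system
  labels:
    dashboards: "grafana"              # must match the dashboards' instanceSelector
spec:
  config:
    log:
      mode: "console"                  # illustrative setting only
```

If the label on the instance differs, the operator would not sync any of the three GPU dashboards to it.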
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: gpu-availability-allocation-dashboard
+  namespace: kubeflow-monitoring-system
+  labels:
+    grafana_dashboard: "1"
+data:
+  gpu-availability-allocation.json: |
+    {
+      "title": "GPU Availability & Allocation",
+      "panels": [
+        {
+          "title": "Pending GPU workloads",
+          "type": "stat",
+          "targets": [
+            { "expr": "count(kube_pod_status_phase{phase=\"Pending\"} * on(pod, namespace) group_left() kube_pod_container_resource_requests{resource=\"nvidia.com/gpu\"})", "legendFormat": "Pending NVIDIA GPU Pods" }
+          ]
+        }
+      ],
+      "datasource": { "uid": "prometheus" }
+    }
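Reviewer note: the dashboard JSON refers to its datasource by `"uid": "prometheus"`, so a datasource with that UID has to exist on the Grafana instance. That manifest is not visible in this part of the diff. A rough sketch of what it could look like with grafana-operator v5 follows. The resource name, the service URL (`prometheus-operated` is the governing service Prometheus Operator normally creates) and the placement of `uid` under `spec.datasource` are assumptions; newer operator releases may expect a top-level `spec.uid` instead.

```yaml
# Hypothetical datasource; names and URL are assumptions, not from this commit.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
  namespace: kubeflow-monitoring-system
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"            # same selector the dashboards use
  datasource:
    name: prometheus
    uid: prometheus                    # must equal the uid referenced in the dashboard JSON
    type: prometheus
    access: proxy
    url: http://prometheus-operated.kubeflow-monitoring-system.svc:9090
    isDefault: true
```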
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+apiVersion: grafana.integreatly.org/v1beta1
+kind: GrafanaDashboard
+metadata:
+  name: gpu-cluster-usage
+  namespace: kubeflow-monitoring-system
+spec:
+  configMapRef:
+    name: gpu-cluster-usage-dashboard
+    key: gpu-cluster-usage.json
+  instanceSelector:
+    matchLabels:
+      dashboards: "grafana"
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: gpu-cluster-usage-dashboard
+  namespace: kubeflow-monitoring-system
+  labels:
+    grafana_dashboard: "1"
+data:
+  gpu-cluster-usage.json: |
+    {
+      "title": "GPU Cluster Usage",
+      "panels": [
+        {
+          "title": "Cluster-wide GPU Utilization %",
+          "type": "timeseries",
+          "targets": [
+            { "expr": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(amd_gpu_utilization)", "legendFormat": "GPU Utilization" }
+          ]
+        },
+        {
+          "title": "GPU Memory Used vs Total per Node",
+          "type": "timeseries",
+          "targets": [
+            { "expr": "sum(DCGM_FI_DEV_FB_USED) by (node)", "legendFormat": "{{node}} Used" },
+            { "expr": "sum(DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED) by (node)", "legendFormat": "{{node}} Total" }
+          ]
+        }
+      ],
+      "datasource": { "uid": "prometheus" }
+    }
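Reviewer note: the `DCGM_FI_DEV_*` series in these panels only exist if Prometheus scrapes an NVIDIA DCGM exporter. How the commit wires that up is not visible in this part of the diff. With Prometheus Operator, one common way is a `ServiceMonitor` along the lines of the sketch below. The resource name, the label selector and the port name `metrics` are assumptions that would need to match the actual exporter Service.

```yaml
# Hypothetical ServiceMonitor; selector labels and port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: kubeflow-monitoring-system
spec:
  namespaceSelector:
    any: true                          # look for the exporter Service in any namespace
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
  - port: metrics                      # named port on the exporter Service
    interval: 30s
```

Without a scrape target like this, the panels would render empty rather than report an error.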
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+apiVersion: grafana.integreatly.org/v1beta1
+kind: GrafanaDashboard
+metadata:
+  name: gpu-namespace-usage
+  namespace: kubeflow-monitoring-system
+spec:
+  configMapRef:
+    name: gpu-namespace-usage-dashboard
+    key: gpu-namespace-usage.json
+  instanceSelector:
+    matchLabels:
+      dashboards: "grafana"
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: gpu-namespace-usage-dashboard
+  namespace: kubeflow-monitoring-system
+  labels:
+    grafana_dashboard: "1"
+data:
+  gpu-namespace-usage.json: |
+    {
+      "title": "GPU Namespace Usage",
+      "panels": [
+        {
+          "title": "Per-namespace GPU Utilization over time",
+          "type": "timeseries",
+          "targets": [
+            { "expr": "sum(DCGM_FI_DEV_GPU_UTIL) by (namespace)", "legendFormat": "{{namespace}}" }
+          ]
+        }
+      ],
+      "datasource": { "uid": "prometheus" }
+    }
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+resources:
+- gpu-cluster-usage-dashboard.yaml
+- gpu-namespace-usage-dashboard.yaml
+- gpu-availability-allocation-dashboard.yaml
+- gpu-cluster-usage-dashboard-cr.yaml
+- gpu-namespace-usage-dashboard-cr.yaml
+- gpu-availability-allocation-dashboard-cr.yaml
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: kepler-role
+rules:
+- apiGroups: [""]
+  resources: ["nodes", "pods", "namespaces"]
+  verbs: ["get", "list", "watch"]
+- apiGroups: [""]
+  resources: ["endpoints"]
+  verbs: ["get"]
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: kepler-role-binding
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: kepler-role
+subjects:
+- kind: ServiceAccount
+  name: kepler-sa
+  namespace: kubeflow-monitoring-system
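Reviewer note: the binding grants `kepler-role` to a ServiceAccount named `kepler-sa` in `kubeflow-monitoring-system`. That ServiceAccount is not visible in this part of the diff and is presumably defined in one of the other changed files. For reference, the subject the binding expects would look roughly like this sketch, which is illustrative only and not copied from the commit:

```yaml
# Sketch of the ServiceAccount referenced by kepler-role-binding.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kepler-sa                      # must match subjects[0].name in the binding
  namespace: kubeflow-monitoring-system
```

The Kepler workload would also need `serviceAccountName: kepler-sa` in its pod spec for the read-only access to nodes, pods, namespaces and endpoints to take effect.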