Commit a07c50a

feat(observability): Add opt-in observability stack #3426
Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
1 parent 46f3142 commit a07c50a

33 files changed

Lines changed: 56534 additions & 3 deletions

README.md

Lines changed: 19 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -75,7 +75,7 @@ This repository periodically synchronizes all official Kubeflow components from
7575
| Volumes Web Application | applications/volumes-web-app/upstream | [v1.10.0](https://github.com/kubeflow/kubeflow/tree/v1.10.0/components/crud-web-apps/volumes/manifests) | 4m | 226Mi | 0GB |
7676
| Katib | applications/katib/upstream | [v0.19.0](https://github.com/kubeflow/katib/tree/v0.19.0/manifests/v1beta1) | 13m | 476Mi | 10GB |
7777
| KServe | applications/kserve/kserve | [v0.16.0](https://github.com/kserve/kserve/releases/tag/v0.16.0/install/v0.16.0) | 600m | 1200Mi | 0GB |
78-
| KServe Models Web Application | applications/kserve/models-web-app | [c71ee4309f0335159d9fdfd4559a538b5c782c92](https://github.com/kserve/models-web-app/tree/c71ee4309f0335159d9fdfd4559a538b5c782c92/manifests/kustomize) | 6m | 259Mi | 0GB |
78+
| KServe Models Web Application | applications/kserve/models-web-app | [v0.16.1](https://github.com/kserve/models-web-app/tree/v0.16.1/config) | 6m | 259Mi | 0GB |
7979
| Kubeflow Pipelines | applications/pipeline/upstream | [2.16.0](https://github.com/kubeflow/pipelines/tree/2.16.0/manifests/kustomize) | 970m | 3552Mi | 35GB |
8080
| Kubeflow Model Registry | applications/model-registry/upstream | [v0.3.7](https://github.com/kubeflow/model-registry/tree/v0.3.7/manifests/kustomize) | 510m | 2112Mi | 20GB |
8181
| Spark Operator | applications/spark/spark-operator | [2.5.0](https://github.com/kubeflow/spark-operator/tree/v2.5.0) | 9m | 41Mi | 0GB |
@@ -84,6 +84,7 @@ This repository periodically synchronizes all official Kubeflow components from
8484
| Cert Manager | common/cert-manager | [1.19.4](https://github.com/cert-manager/cert-manager/releases/tag/v1.19.4) | 3m | 128Mi | 0GB |
8585
| Dex | common/dex | [2.45.0](https://github.com/dexidp/dex/releases/tag/v2.45.0) | 3m | 27Mi | 0GB |
8686
| OAuth2-Proxy | common/oauth2-proxy | [7.14.3](https://github.com/oauth2-proxy/oauth2-proxy/releases/tag/v7.14.3) | 3m | 27Mi | 0GB |
87+
| Observability | common/observability | [3426](https://github.com/kubeflow/manifests/issues/3426) | - | - | 0GB |
8788
| **Total** | | | **4380m** | **12341Mi** | **65GB** |
8889

8990

@@ -111,7 +112,7 @@ The `example` directory contains an example kustomization for the single command
111112
### ARM64 / aarch64 note
112113

113114
Kubeflow on ARM64/aarch64 may not be fully supported yet because some OCI images might not be available for `linux/arm64`.
114-
If you hit image pull errors such as no matching manifest for linux/arm64, please track/report details in kubeflow/manifests#2745 and take a look at the [Google Summer of Code project for Kubeflow on ARM64](https://www.kubeflow.org/events/upcoming-events/gsoc-2026/#project--end-to-end-arm64-support--validation-on-kubeflow).
115+
If you hit image pull errors such as "no matching manifest for linux/arm64", please track/report details in kubeflow/manifests#2745 and take a look at the [Google Summer of Code project for Kubeflow on ARM64](https://www.kubeflow.org/events/upcoming-events/gsoc-2026/#project--end-to-end-arm64-support--validation-on-kubeflow).
115116

116117
---
117118
**NOTE**
@@ -182,6 +183,22 @@ Install the Kubeflow namespace:
182183
kustomize build common/kubeflow-namespace/base | kubectl apply -f -
183184
```
184185

186+
#### Observability Stack (Optional)
187+
188+
This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace.
189+
190+
Install the observability base component:
191+
192+
```sh
193+
./tests/observability_install.sh
194+
```
195+
196+
To opt into Kepler for energy metrics:
197+
198+
```sh
199+
kustomize build common/observability/components/kepler | kubectl apply -f -
200+
```
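
After installing, it can help to confirm that the stack actually came up. The following is a minimal sketch, assuming the pods land in the `kubeflow-monitoring-system` namespace named above; the `check_observability` function and the `KUBECTL` override variable are hypothetical helpers for illustration and are not part of the repository.

```sh
# Hypothetical helper, not shipped with this commit.
# Set KUBECTL=echo to preview the commands without touching a cluster.
check_observability() {
  namespace="${1:-kubeflow-monitoring-system}"
  kubectl_bin="${KUBECTL:-kubectl}"
  # List the pods created by the observability stack.
  "$kubectl_bin" get pods -n "$namespace"
  # Block until every pod in the namespace reports Ready, or time out.
  "$kubectl_bin" wait --for=condition=Ready pods --all -n "$namespace" --timeout=300s
}
```

Calling `check_observability` with no arguments targets the default namespace used by this component. Note that `kubectl wait` on `--all` pods can time out if the namespace contains pods that finish and never become Ready, so treat a timeout as a prompt to inspect the pod list rather than as a definite failure.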
201+
185202
#### Cert-manager
186203

187204
Cert-manager is used by many Kubeflow components to provide certificates for admission webhooks.
@@ -448,7 +465,6 @@ kustomize build applications/tensorboard/tensorboard-controller/upstream/overlay
448465

449466
```sh
450467
./tests/spark_install.sh
451-
```
452468

453469
#### User Namespaces
454470

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
1+
apiVersion: grafana.integreatly.org/v1beta1
2+
kind: GrafanaDashboard
3+
metadata:
4+
name: gpu-availability-allocation
5+
namespace: kubeflow-monitoring-system
6+
spec:
7+
configMapRef:
8+
name: gpu-availability-allocation-dashboard
9+
key: gpu-availability-allocation.json
10+
instanceSelector:
11+
matchLabels:
12+
dashboards: "grafana"
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
apiVersion: v1
2+
kind: ConfigMap
3+
metadata:
4+
name: gpu-availability-allocation-dashboard
5+
namespace: kubeflow-monitoring-system
6+
labels:
7+
grafana_dashboard: "1"
8+
data:
9+
gpu-availability-allocation.json: |
10+
{
11+
"title": "GPU Availability & Allocation",
12+
"panels": [
13+
{
14+
"title": "Pending GPU workloads",
15+
"type": "stat",
16+
"targets": [
17+
{ "expr": "count(kube_pod_status_phase{phase=\"Pending\"} * on(pod, namespace) group_left() kube_pod_container_resource_requests{resource=\"nvidia.com/gpu\"})", "legendFormat": "Pending NVIDIA GPU Pods" }
18+
]
19+
}
20+
],
21+
"datasource": { "uid": "prometheus" }
22+
}
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
1+
apiVersion: grafana.integreatly.org/v1beta1
2+
kind: GrafanaDashboard
3+
metadata:
4+
name: gpu-cluster-usage
5+
namespace: kubeflow-monitoring-system
6+
spec:
7+
configMapRef:
8+
name: gpu-cluster-usage-dashboard
9+
key: gpu-cluster-usage.json
10+
instanceSelector:
11+
matchLabels:
12+
dashboards: "grafana"
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
apiVersion: v1
2+
kind: ConfigMap
3+
metadata:
4+
name: gpu-cluster-usage-dashboard
5+
namespace: kubeflow-monitoring-system
6+
labels:
7+
grafana_dashboard: "1"
8+
data:
9+
gpu-cluster-usage.json: |
10+
{
11+
"title": "GPU Cluster Usage",
12+
"panels": [
13+
{
14+
"title": "Cluster-wide GPU Utilization %",
15+
"type": "timeseries",
16+
"targets": [
17+
{ "expr": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(amd_gpu_utilization)", "legendFormat": "GPU Utilization" }
18+
]
19+
},
20+
{
21+
"title": "GPU Memory Used vs Total per Node",
22+
"type": "timeseries",
23+
"targets": [
24+
{ "expr": "sum(DCGM_FI_DEV_FB_USED) by (node)", "legendFormat": "{{node}} Used" },
25+
{ "expr": "sum(DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED) by (node)", "legendFormat": "{{node}} Total" }
26+
]
27+
}
28+
],
29+
"datasource": { "uid": "prometheus" }
30+
}
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
1+
apiVersion: grafana.integreatly.org/v1beta1
2+
kind: GrafanaDashboard
3+
metadata:
4+
name: gpu-namespace-usage
5+
namespace: kubeflow-monitoring-system
6+
spec:
7+
configMapRef:
8+
name: gpu-namespace-usage-dashboard
9+
key: gpu-namespace-usage.json
10+
instanceSelector:
11+
matchLabels:
12+
dashboards: "grafana"
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
apiVersion: v1
2+
kind: ConfigMap
3+
metadata:
4+
name: gpu-namespace-usage-dashboard
5+
namespace: kubeflow-monitoring-system
6+
labels:
7+
grafana_dashboard: "1"
8+
data:
9+
gpu-namespace-usage.json: |
10+
{
11+
"title": "GPU Namespace Usage",
12+
"panels": [
13+
{
14+
"title": "Per-Workload-Namespace GPU Utilization over time",
15+
"type": "timeseries",
16+
"targets": [
17+
{ "expr": "sum by (exported_namespace) (label_replace(DCGM_FI_DEV_GPU_UTIL * on(pod) group_left(namespace) kube_pod_info{}, \"exported_namespace\", \"$1\", \"namespace\", \"(.*)\"))", "legendFormat": "{{exported_namespace}}" }
18+
]
19+
}
20+
],
21+
"datasource": { "uid": "prometheus" }
22+
}
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
apiVersion: kustomize.config.k8s.io/v1beta1
2+
kind: Kustomization
3+
resources:
4+
- gpu-cluster-usage-dashboard.yaml
5+
- gpu-namespace-usage-dashboard.yaml
6+
- gpu-availability-allocation-dashboard.yaml
7+
- gpu-cluster-usage-dashboard-cr.yaml
8+
- gpu-namespace-usage-dashboard-cr.yaml
9+
- gpu-availability-allocation-dashboard-cr.yaml
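
The kustomization above pairs each dashboard ConfigMap with a `GrafanaDashboard` custom resource. One way to check that both halves were applied is sketched below, assuming the Grafana operator CRDs (`grafana.integreatly.org`) are installed in the cluster. The `list_gpu_dashboards` function and the `KUBECTL` override variable are hypothetical and only serve this illustration.

```sh
# Hypothetical helper, not shipped with this commit.
# Set KUBECTL=echo to preview the commands without touching a cluster.
list_gpu_dashboards() {
  namespace="${1:-kubeflow-monitoring-system}"
  kubectl_bin="${KUBECTL:-kubectl}"
  # GrafanaDashboard custom resources (apiVersion grafana.integreatly.org/v1beta1).
  "$kubectl_bin" get grafanadashboards.grafana.integreatly.org -n "$namespace"
  # ConfigMaps labelled grafana_dashboard=1 hold the JSON referenced by configMapRef.
  "$kubectl_bin" get configmaps -n "$namespace" -l grafana_dashboard=1
}
```

If the three custom resources appear but a ConfigMap is missing (or the reverse), the `configMapRef.name` and `key` fields are the first things to compare.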
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
1+
apiVersion: rbac.authorization.k8s.io/v1
2+
kind: ClusterRole
3+
metadata:
4+
name: kepler-role
5+
rules:
6+
- apiGroups: [""]
7+
resources: ["nodes", "pods", "namespaces"]
8+
verbs: ["get", "list", "watch"]
9+
- apiGroups: [""]
10+
resources: ["endpoints"]
11+
verbs: ["get"]
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
1+
apiVersion: rbac.authorization.k8s.io/v1
2+
kind: ClusterRoleBinding
3+
metadata:
4+
name: kepler-role-binding
5+
roleRef:
6+
apiGroup: rbac.authorization.k8s.io
7+
kind: ClusterRole
8+
name: kepler-role
9+
subjects:
10+
- kind: ServiceAccount
11+
name: kepler-sa
12+
namespace: kubeflow-monitoring-system
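
The ClusterRole and ClusterRoleBinding above grant `kepler-sa` read access to nodes, pods, namespaces, and endpoints. A quick way to confirm the binding took effect is `kubectl auth can-i` with impersonation, sketched below. The `check_kepler_rbac` function and the `KUBECTL` override variable are hypothetical names for illustration, and the caller is assumed to have permission to impersonate service accounts.

```sh
# Hypothetical helper, not shipped with this commit.
# Set KUBECTL=echo to preview the commands without touching a cluster.
check_kepler_rbac() {
  subject="system:serviceaccount:kubeflow-monitoring-system:kepler-sa"
  kubectl_bin="${KUBECTL:-kubectl}"
  # First rule: get/list/watch on nodes, pods, and namespaces.
  for resource in nodes pods namespaces; do
    for verb in get list watch; do
      "$kubectl_bin" auth can-i "$verb" "$resource" --as="$subject"
    done
  done
  # Second rule: get on endpoints only.
  "$kubectl_bin" auth can-i get endpoints --as="$subject"
}
```

Each call prints `yes` or `no` for one verb and resource pair, so a `no` points directly at the rule that is missing or not bound.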
