Commit 57b1609

feat(observability): Add opt-in observability stack #3426
- Deploy to kubeflow-monitoring-system namespace.
- Add Prometheus and Grafana operators.
- Add ServiceMonitors for NVIDIA, AMD, and Kepler.
- Add Grafana dashboards for GPU metrics.
- Add installation test script tests/observability_install.sh.

Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
1 parent 017025d commit 57b1609

34 files changed

Lines changed: 56557 additions & 9 deletions

README.md

Lines changed: 17 additions & 0 deletions
@@ -84,6 +84,7 @@ This repository periodically synchronizes all official Kubeflow components from
 | Cert Manager | common/cert-manager | [1.19.4](https://github.com/cert-manager/cert-manager/releases/tag/v1.19.4) | 3m | 128Mi | 0GB |
 | Dex | common/dex | [2.45.0](https://github.com/dexidp/dex/releases/tag/v2.45.0) | 3m | 27Mi | 0GB |
 | OAuth2-Proxy | common/oauth2-proxy | [7.14.3](https://github.com/oauth2-proxy/oauth2-proxy/releases/tag/v7.14.3) | 3m | 27Mi | 0GB |
+| Observability | common/observability | [3426](https://github.com/kubeflow/manifests/issues/3426) | - | - | 0GB |
 | **Total** | | | **4380m** | **12341Mi** | **65GB** |
 
 
@@ -177,6 +178,22 @@ Install the Kubeflow namespace:
 kustomize build common/kubeflow-namespace/base | kubectl apply -f -
 ```
 
+#### Observability Stack (Optional)
+
+This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace.
+
+Install the observability base component:
+
+```sh
+./tests/observability_install.sh
+```
+
+To opt into Kepler for energy metrics:
+
+```sh
+kustomize build common/observability/components/kepler | kubectl apply -f -
+```
+
 #### Cert-manager
 
 Cert-manager is used by many Kubeflow components to provide certificates for admission webhooks.

applications/profiles/upstream/overlays/kubeflow/kustomization.yaml

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ commonLabels:
 
 patchesStrategicMerge:
 - patches/kfam.yaml
-- patches/remove-namespace.yaml
 
 configurations:
 - params.yaml
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: gpu-availability-allocation-dashboard
+  namespace: kubeflow-monitoring-system
+  labels:
+    grafana_dashboard: "1"
+data:
+  gpu-availability-allocation.json: |
+    {
+      "title": "GPU Availability & Allocation",
+      "panels": [
+        {
+          "title": "Pending GPU workloads",
+          "type": "stat",
+          "targets": [
+            { "expr": "count(kube_pod_status_phase{phase=\"Pending\"} * on(pod, namespace) group_left() kube_pod_container_resource_requests{resource=\"nvidia.com/gpu\"})", "legendFormat": "Pending NVIDIA GPU Pods" }
+          ]
+        }
+      ],
+      "datasource": { "uid": "prometheus" }
+    }
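A note on the "Pending GPU workloads" expression above: `kube_pod_status_phase` exposes one series per pod and phase with a value of 0 or 1, and `count()` counts series regardless of their value, so the query as written may count pods that are not actually pending. The fragment below is a minimal sketch of one way to restrict the count to pending pods, written as a Prometheus rule group purely for illustration. The group and rule names are hypothetical and not part of the commit, and the `resource` label value is copied from the dashboard as-is (some kube-state-metrics versions may expose it in a different form).

```yaml
# Hypothetical rule group for illustration only; not part of the commit.
groups:
  - name: gpu-availability-sketch                        # assumed group name
    rules:
      - record: cluster:pending_nvidia_gpu_pods:count    # assumed rule name
        expr: |
          count(
            # keep only pods whose Pending phase series has value 1
            (kube_pod_status_phase{phase="Pending"} == 1)
            * on(pod, namespace)
            # collapse per-container request series to one series per pod
            max by (pod, namespace) (
              kube_pod_container_resource_requests{resource="nvidia.com/gpu"}
            )
          )
```

Aggregating the right-hand side per pod keeps the match one-to-one, which avoids a many-to-many matching error when several containers in one pod request GPUs.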
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: gpu-cluster-usage-dashboard
+  namespace: kubeflow-monitoring-system
+  labels:
+    grafana_dashboard: "1"
+data:
+  gpu-cluster-usage.json: |
+    {
+      "title": "GPU Cluster Usage",
+      "panels": [
+        {
+          "title": "Cluster-wide GPU Utilization %",
+          "type": "timeseries",
+          "targets": [
+            { "expr": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(amd_gpu_utilization)", "legendFormat": "GPU Utilization" }
+          ]
+        },
+        {
+          "title": "GPU Memory Used vs Total per Node",
+          "type": "timeseries",
+          "targets": [
+            { "expr": "sum(DCGM_FI_DEV_FB_USED) by (node)", "legendFormat": "{{node}} Used" },
+            { "expr": "sum(DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED) by (node)", "legendFormat": "{{node}} Total" }
+          ]
+        }
+      ],
+      "datasource": { "uid": "prometheus" }
+    }
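A note on the "Cluster-wide GPU Utilization %" expression above: in PromQL, `or` returns every element of the left-hand side and adds only those right-hand elements whose label sets have no match on the left. Both `avg(...)` results carry the empty label set, so the expression yields the NVIDIA average whenever it exists and falls back to the AMD average only when no NVIDIA data is present. If a single average across a mixed cluster is the intent, one possible sketch is shown below as a Prometheus rule group. The group and rule names are hypothetical, and the sketch assumes both metrics report utilization on the same scale and that NVIDIA and AMD series differ in at least one label.

```yaml
# Hypothetical rule group for illustration only; not part of the commit.
groups:
  - name: gpu-utilization-sketch              # assumed group name
    rules:
      - record: cluster:gpu_utilization:avg   # assumed rule name
        # union the per-GPU series first, then average over all of them
        expr: avg(DCGM_FI_DEV_GPU_UTIL or amd_gpu_utilization)
```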
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: gpu-namespace-usage-dashboard
+  namespace: kubeflow-monitoring-system
+  labels:
+    grafana_dashboard: "1"
+data:
+  gpu-namespace-usage.json: |
+    {
+      "title": "GPU Namespace Usage",
+      "panels": [
+        {
+          "title": "Per-namespace GPU Utilization over time",
+          "type": "timeseries",
+          "targets": [
+            { "expr": "sum(DCGM_FI_DEV_GPU_UTIL) by (namespace)", "legendFormat": "{{namespace}}" }
+          ]
+        }
+      ],
+      "datasource": { "uid": "prometheus" }
+    }
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+resources:
+- gpu-cluster-usage-dashboard.yaml
+- gpu-namespace-usage-dashboard.yaml
+- gpu-availability-allocation-dashboard.yaml
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: kepler-role
+rules:
+- apiGroups: [""]
+  resources: ["nodes", "pods", "namespaces"]
+  verbs: ["get", "list", "watch"]
+- apiGroups: [""]
+  resources: ["endpoints"]
+  verbs: ["get"]
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: kepler-role-binding
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: kepler-role
+subjects:
+- kind: ServiceAccount
+  name: kepler-sa
+  namespace: kubeflow-monitoring-system
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: kepler
+  namespace: kubeflow-monitoring-system
+  labels:
+    app.kubernetes.io/name: kepler
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: kepler
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: kepler
+    spec:
+      serviceAccountName: kepler-sa
+      hostPID: true
+      hostNetwork: true
+      containers:
+      - name: kepler
+        image: quay.io/sustainable_computing_io/kepler:v0.7.11
+        ports:
+        - name: http
+          containerPort: 9102
+        resources:
+          requests:
+            cpu: 100m
+            memory: 128Mi
+          limits:
+            cpu: 500m
+            memory: 512Mi
+        securityContext:
+          privileged: true
+        volumeMounts:
+        - name: proc
+          mountPath: /proc
+          readOnly: true
+        - name: sys
+          mountPath: /sys
+          readOnly: true
+        - name: containerd
+          mountPath: /var/run/containerd
+          readOnly: true
+      volumes:
+      - name: proc
+        hostPath:
+          path: /proc
+      - name: sys
+        hostPath:
+          path: /sys
+      - name: containerd
+        hostPath:
+          path: /var/run/containerd
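The commit message mentions a ServiceMonitor for Kepler, but that manifest is not part of this excerpt. As a rough sketch of how a Prometheus Operator `ServiceMonitor` could scrape the `http` port (9102) declared by the DaemonSet above, the fragment below assumes a Service in the same namespace that carries the `app.kubernetes.io/name: kepler` label and names its port `http`. The resource name and scrape interval are hypothetical, and the actual manifest in the commit may differ.

```yaml
# Hypothetical sketch; the ServiceMonitor shipped in the commit is not shown here.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kepler                            # assumed name
  namespace: kubeflow-monitoring-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kepler      # assumes the Service carries this label
  endpoints:
    - port: http                          # assumes the Service port is named "http"
      interval: 30s                       # assumed scrape interval
```

A ServiceMonitor selects Services rather than Pods, so the `port` field refers to the Service's port name, not the container port number.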
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+resources:
+- namespace.yaml
+- serviceaccount.yaml
+- clusterrole.yaml
+- clusterrolebinding.yaml
+- daemonset.yaml
+- service.yaml
