Commit 3f6dbe3

feat(observability): Add opt-in observability stack #3426
Signed-off-by: abdullahpathan22 <abdullahpathan22@users.noreply.github.com>
1 parent 46f3142 commit 3f6dbe3

30 files changed

Lines changed: 56490 additions & 12 deletions

README.md

Lines changed: 32 additions & 12 deletions
@@ -10,7 +10,7 @@ You can also install the master branch of [`kubeflow/manifests`](https://github.
 
 We are planning to cut 2 releases per year, for example 26.03 and 26.10 before each KubeCon EU and NA.
 We ask each working group/component to provide non-breaking patch releases for 6 months based on the version in each date release.
-We try to BEST-EFFORT support each release for 6 months as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.
+We try to BEST-EFFORT support each realease for 6 monhts as community. There is [commercial support](https://www.kubeflow.org/docs/started/support/#support-from-commercial-providers-in-the-kubeflow-ecosystem) available if needed.
 The working groups (KFP, Katib, Trainer, ...) are allowed to release new component versions with breaking changes, but they will only be included in the master branch or the next date release.
 This should only apply to “stable” components, as “alpha/beta” components might release breaking changes in patch releases.
 
@@ -63,7 +63,7 @@ This repository periodically synchronizes all official Kubeflow components from
 | Component | Local Manifests Path | Upstream Revision | CPU (millicores) | Memory (Mi) | PVC Storage (GB) |
 | - | - | - | - | - | - |
 | Training Operator | applications/training-operator/upstream | [v1.9.2](https://github.com/kubeflow/training-operator/tree/v1.9.2/manifests) | 3m | 25Mi | 0GB |
-| Trainer | applications/trainer/upstream | [v2.2.0](https://github.com/kubeflow/trainer/tree/v2.2.0/manifests) | 8m | 143Mi | 0GB |
+| Trainer | applications/trainer/upstream | [v2.1.0](https://github.com/kubeflow/trainer/tree/v2.1.0/manifests) | 8m | 143Mi | 0GB |
 | Notebook Controller | applications/jupyter/notebook-controller/upstream | [v1.10.0](https://github.com/kubeflow/kubeflow/tree/v1.10.0/components/notebook-controller/config) | 5m | 93Mi | 0GB |
 | PVC Viewer Controller | applications/pvcviewer-controller/upstream | [v1.10.0](https://github.com/kubeflow/kubeflow/tree/v1.10.0/components/pvcviewer-controller/config) | 15m | 128Mi | 0GB |
 | Tensorboard Controller | applications/tensorboard/tensorboard-controller/upstream | [v1.10.0](https://github.com/kubeflow/kubeflow/tree/v1.10.0/components/tensorboard-controller/config) | 15m | 128Mi | 0GB |
@@ -75,15 +75,16 @@ This repository periodically synchronizes all official Kubeflow components from
 | Volumes Web Application | applications/volumes-web-app/upstream | [v1.10.0](https://github.com/kubeflow/kubeflow/tree/v1.10.0/components/crud-web-apps/volumes/manifests) | 4m | 226Mi | 0GB |
 | Katib | applications/katib/upstream | [v0.19.0](https://github.com/kubeflow/katib/tree/v0.19.0/manifests/v1beta1) | 13m | 476Mi | 10GB |
 | KServe | applications/kserve/kserve | [v0.16.0](https://github.com/kserve/kserve/releases/tag/v0.16.0/install/v0.16.0) | 600m | 1200Mi | 0GB |
-| KServe Models Web Application | applications/kserve/models-web-app | [c71ee4309f0335159d9fdfd4559a538b5c782c92](https://github.com/kserve/models-web-app/tree/c71ee4309f0335159d9fdfd4559a538b5c782c92/manifests/kustomize) | 6m | 259Mi | 0GB |
+| KServe Models Web Application | applications/kserve/models-web-app | [v0.15.0](https://github.com/kserve/models-web-app/tree/v0.15.0/config) | 6m | 259Mi | 0GB |
 | Kubeflow Pipelines | applications/pipeline/upstream | [2.16.0](https://github.com/kubeflow/pipelines/tree/2.16.0/manifests/kustomize) | 970m | 3552Mi | 35GB |
 | Kubeflow Model Registry | applications/model-registry/upstream | [v0.3.7](https://github.com/kubeflow/model-registry/tree/v0.3.7/manifests/kustomize) | 510m | 2112Mi | 20GB |
-| Spark Operator | applications/spark/spark-operator | [2.5.0](https://github.com/kubeflow/spark-operator/tree/v2.5.0) | 9m | 41Mi | 0GB |
+| Spark Operator | applications/spark/spark-operator | [2.5.0-rc.0](https://github.com/kubeflow/spark-operator/tree/v2.5.0-rc.0) | 9m | 41Mi | 0GB |
 | Istio | common/istio | [1.29.0](https://github.com/istio/istio/releases/tag/1.29.0) | 750m | 2364Mi | 0GB |
 | Knative | common/knative/knative-serving <br /> common/knative/knative-eventing | [v1.21.1](https://github.com/knative/serving/releases/tag/knative-v1.21.1) <br /> [v1.21.0](https://github.com/knative/eventing/releases/tag/knative-v1.21.0) | 1450m | 1038Mi | 0GB |
 | Cert Manager | common/cert-manager | [1.19.4](https://github.com/cert-manager/cert-manager/releases/tag/v1.19.4) | 3m | 128Mi | 0GB |
 | Dex | common/dex | [2.45.0](https://github.com/dexidp/dex/releases/tag/v2.45.0) | 3m | 27Mi | 0GB |
 | OAuth2-Proxy | common/oauth2-proxy | [7.14.3](https://github.com/oauth2-proxy/oauth2-proxy/releases/tag/v7.14.3) | 3m | 27Mi | 0GB |
+| Observability | common/observability | [3426](https://github.com/kubeflow/manifests/issues/3426) | - | - | 0GB |
 | **Total** | | | **4380m** | **12341Mi** | **65GB** |
 
 
@@ -108,11 +109,6 @@ The `example` directory contains an example kustomization for the single command
 - Our Kind script below will take care of installing continuously tested Kubernetes, Kustomize and Kubectl versions for you.
 - We use Kind as default but also support Minikube, Rancher, EKS, AKS, and GKE. GKE might need tiny adjustments documented here in this file and OpenShift is also possible.
 
-### ARM64 / aarch64 note
-
-Kubeflow on ARM64/aarch64 may not be fully supported yet because some OCI images might not be available for `linux/arm64`.
-If you hit image pull errors such as “no matching manifest for linux/arm64”, please track/report details in kubeflow/manifests#2745 and take a look at the [Google Summer of Code project for Kubeflow on ARM64](https://www.kubeflow.org/events/upcoming-events/gsoc-2026/#project--end-to-end-arm64-support--validation-on-kubeflow).
-
 ---
 **NOTE**
 
@@ -182,6 +178,22 @@ Install the Kubeflow namespace:
 kustomize build common/kubeflow-namespace/base | kubectl apply -f -
 ```
 
+#### Observability Stack (Optional)
+
+This component provides an optional monitoring stack for GPU metrics (NVIDIA/AMD) and energy consumption (Kepler), along with Grafana dashboards. It includes Prometheus and Grafana operators and is deployed in the `kubeflow-monitoring-system` namespace.
+
+Install the observability base component:
+
+```sh
+./tests/observability_install.sh
+```
+
+To opt into Kepler for energy metrics:
+
+```sh
+kustomize build common/observability/components/kepler | kubectl apply -f -
+```
+
 #### Cert-manager
 
 Cert-manager is used by many Kubeflow components to provide certificates for admission webhooks.
@@ -207,7 +219,7 @@ kustomize build common/istio/istio-install/overlays/gke | kubectl apply -f -
 
 #### Oauth2-proxy
 
-The oauth2-proxy extends your Istio Ingress-Gateway capabilities to function as an OIDC client. It supports user sessions as well as proper token-based machine-to-machine authentication. Authorization which is completely different from authentication is handled via Kubernetes RBAC and Istio authorizationpolicies.
+The oauth2-proxy extends your Istio Ingress-Gateway capabilities to function as an OIDC client. It supports user sessions as well as proper token-based machine-to-machine authentication. Authorization which is completely different form authentication is handled via Kubernetes RBAC and Istio authorizationpolicies.
 
 ```sh
 echo "Installing oauth2-proxy..."
@@ -450,6 +462,14 @@ kustomize build applications/tensorboard/tensorboard-controller/upstream/overlay
 ./tests/spark_install.sh
 ```
 
+#### Model Registry
+
+Install the Model Registry with its UI and database components:
+
+```sh
+./tests/model_registry_install.sh
+```
+
 #### User Namespaces
 
 Finally, create a new namespace for the default user (named `kubeflow-user-example-com`).
@@ -584,7 +604,7 @@ For modifications and in-place upgrades of the Kubeflow platform, we provide a r
 
 To view all past security scans, head to the [Image Extracting and Security Scanning GitHub Action workflow](https://github.com/kubeflow/manifests/actions/workflows/trivy.yaml). In the logs of the workflow, you can expand the `Run image extracting and security scanning script` step to view the CVE logs. You will find a per-image CVE scan and a JSON dump of per-WorkingGroup aggregated metrics. You can run the Python script from the workflow file locally on your machine to obtain the detailed JSON files for any git commit.
 
-For more information please consult the [SECURITY.md](./SECURITY.md).
+For more infromation please consult the [SECURITY.md](./SECURITY.md).
 
 ## Pre-commit Hooks
 
@@ -624,7 +644,7 @@ pre-commit run
 - **Q:** What versions of Istio, Knative, Cert-Manager, Argo, ... are compatible with Kubeflow?
 **A:** Please refer to each individual component's documentation for a dependency compatibility range. For Istio, Knative, Dex, Cert-Manager, and OAuth2 Proxy, the versions in `common` are the ones we have validated.
 - **Q:** Can I use Kubeflow in an air-gapped environment?
-**A:** Yes you can. You just need to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.
+**A:** Yes you can. You just need to to get the list of images from our [trivy CVE scanning script](https://github.com/kubeflow/manifests/blob/master/tests/trivy_scan.py), mirror them and replace the references in the manifests with kustomize components and overlays, see [Upgrading and Extending](#upgrading-and-extending). You could also use a simple kyverno policy to replace the images at runtime, which could be easier to maintain.
 - **Q:** Why does Kubeflow use Istio CNI instead of standard Istio?
 **A:** Istio CNI provides better security by eliminating the need for privileged init containers, making it more compatible with Pod Security Standards (PSS). It also enables native sidecars support introduced in Kubernetes 1.28, which helps address issues with init containers and application lifecycle management.
 - **Q:** Why does Istio CNI fail on Google Kubernetes Engine (GKE) with "read-only file system" errors?
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: gpu-availability-allocation-dashboard
+  namespace: kubeflow-monitoring-system
+  labels:
+    grafana_dashboard: "1"
+data:
+  gpu-availability-allocation.json: |
+    {
+      "title": "GPU Availability & Allocation",
+      "panels": [
+        {
+          "title": "Pending GPU workloads",
+          "type": "stat",
+          "targets": [
+            { "expr": "count(kube_pod_status_phase{phase=\"Pending\"} * on(pod, namespace) group_left() kube_pod_container_resource_requests{resource=\"nvidia.com/gpu\"})", "legendFormat": "Pending NVIDIA GPU Pods" }
+          ]
+        }
+      ],
+      "datasource": { "uid": "prometheus" }
+    }
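
Note on the "Pending GPU workloads" query above: a minimal sketch, assuming kube-state-metrics exposes one `kube_pod_status_phase` series per phase (value 1 for the pod's current phase, 0 otherwise). Under that assumption `count()` counts the matched series rather than their values, so the expression as written would count every pod that requests `nvidia.com/gpu`, not only the pending ones. The pod names and values are hypothetical sample data, and the Python function only imitates the join; it is not part of the commit.

```python
# Hypothetical samples keyed by (namespace, pod). Assumption: the
# phase="Pending" series is 1 for pending pods and 0 for all others.
pending_phase = {
    ("team-a", "train-0"): 1,     # currently Pending
    ("team-a", "train-1"): 0,     # Running, series still present with value 0
    ("team-b", "notebook-0"): 0,  # Running, requests no GPU
}

# GPU requests per (namespace, pod); assumes one GPU container per pod,
# which the group_left() join needs for a unique right-hand side.
gpu_requests = {
    ("team-a", "train-0"): 1,
    ("team-a", "train-1"): 2,
}


def count_joined(phase, requests, only_active=False):
    """Imitate count(phase * on(pod, namespace) group_left() requests)."""
    joined = {
        key: value * requests[key]
        for key, value in phase.items()
        if key in requests and (value == 1 or not only_active)
    }
    return len(joined)  # count() counts series and ignores their values


as_written = count_joined(pending_phase, gpu_requests)
with_filter = count_joined(pending_phase, gpu_requests, only_active=True)
print(as_written, with_filter)
```

If the assumption holds, comparing the phase selector with `== 1` before the join is one way to restrict the count to pending pods.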
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: gpu-cluster-usage-dashboard
+  namespace: kubeflow-monitoring-system
+  labels:
+    grafana_dashboard: "1"
+data:
+  gpu-cluster-usage.json: |
+    {
+      "title": "GPU Cluster Usage",
+      "panels": [
+        {
+          "title": "Cluster-wide GPU Utilization %",
+          "type": "timeseries",
+          "targets": [
+            { "expr": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(amd_gpu_utilization)", "legendFormat": "GPU Utilization" }
+          ]
+        },
+        {
+          "title": "GPU Memory Used vs Total per Node",
+          "type": "timeseries",
+          "targets": [
+            { "expr": "sum(DCGM_FI_DEV_FB_USED) by (node)", "legendFormat": "{{node}} Used" },
+            { "expr": "sum(DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED) by (node)", "legendFormat": "{{node}} Total" }
+          ]
+        }
+      ],
+      "datasource": { "uid": "prometheus" }
+    }
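
Note on the two panels above: a rough sketch of what the expressions compute, assuming the DCGM series carry a `node` label (this depends on the exporter and relabeling configuration, which this excerpt does not show). Both sides of `avg(...) or avg(...)` have an empty label set, so the AMD average would only appear when no NVIDIA samples exist. The sample values and helper names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical framebuffer samples in MiB, keyed by (node, gpu index).
fb_used = {("node-a", 0): 4000, ("node-a", 1): 1000, ("node-b", 0): 2500}
fb_free = {("node-a", 0): 12000, ("node-a", 1): 15000, ("node-b", 0): 13500}


def sum_by_node(samples):
    """Imitate sum(...) by (node)."""
    totals = defaultdict(int)
    for (node, _gpu), value in samples.items():
        totals[node] += value
    return dict(totals)


# sum(DCGM_FI_DEV_FB_USED) by (node)
used = sum_by_node(fb_used)

# sum(DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED) by (node):
# the addition pairs series per GPU first, then the sum groups by node.
total = sum_by_node(
    {key: fb_free[key] + fb_used[key] for key in fb_used if key in fb_free}
)


def avg_or(primary, fallback):
    """Imitate avg(primary) or avg(fallback) for label-less results."""
    source = primary if primary else fallback
    return sum(source) / len(source) if source else None


print(used, total, avg_or([40, 60], [90]), avg_or([], [90]))
```

On a cluster with both vendors, the first panel would therefore show the NVIDIA average only; separate targets per vendor would be one alternative.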
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: gpu-namespace-usage-dashboard
+  namespace: kubeflow-monitoring-system
+  labels:
+    grafana_dashboard: "1"
+data:
+  gpu-namespace-usage.json: |
+    {
+      "title": "GPU Namespace Usage",
+      "panels": [
+        {
+          "title": "Per-namespace GPU Utilization over time",
+          "type": "timeseries",
+          "targets": [
+            { "expr": "sum(DCGM_FI_DEV_GPU_UTIL) by (namespace)", "legendFormat": "{{namespace}}" }
+          ]
+        }
+      ],
+      "datasource": { "uid": "prometheus" }
+    }
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+resources:
+- gpu-cluster-usage-dashboard.yaml
+- gpu-namespace-usage-dashboard.yaml
+- gpu-availability-allocation-dashboard.yaml
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: kepler-role
+rules:
+- apiGroups: [""]
+  resources: ["nodes", "pods", "namespaces"]
+  verbs: ["get", "list", "watch"]
+- apiGroups: [""]
+  resources: ["endpoints"]
+  verbs: ["get"]
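
To make the scope of `kepler-role` concrete, here is a simplified sketch of how these rules would be evaluated. It is an illustration only: the helper names are hypothetical, and the check ignores `resourceNames`, subresources, and non-resource URLs that real Kubernetes RBAC also considers.

```python
# Rules copied from the ClusterRole above.
KEPLER_RULES = [
    {
        "apiGroups": [""],
        "resources": ["nodes", "pods", "namespaces"],
        "verbs": ["get", "list", "watch"],
    },
    {"apiGroups": [""], "resources": ["endpoints"], "verbs": ["get"]},
]


def _matches(allowed, requested):
    """A rule entry matches on an exact value or the '*' wildcard."""
    return "*" in allowed or requested in allowed


def is_allowed(rules, api_group, resource, verb):
    """Return True if any single rule grants the verb on the resource."""
    return any(
        _matches(rule["apiGroups"], api_group)
        and _matches(rule["resources"], resource)
        and _matches(rule["verbs"], verb)
        for rule in rules
    )


print(is_allowed(KEPLER_RULES, "", "pods", "watch"))      # core pods, watch
print(is_allowed(KEPLER_RULES, "", "endpoints", "list"))  # only get is granted
```

On a live cluster, `kubectl auth can-i list endpoints --as=system:serviceaccount:kubeflow-monitoring-system:kepler-sa` should give the authoritative answer once the role binding is applied.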
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: kepler-role-binding
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: kepler-role
+subjects:
+- kind: ServiceAccount
+  name: kepler-sa
+  namespace: kubeflow-monitoring-system
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: kepler
+  namespace: kubeflow-monitoring-system
+  labels:
+    app.kubernetes.io/name: kepler
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: kepler
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: kepler
+    spec:
+      serviceAccountName: kepler-sa
+      hostPID: true
+      hostNetwork: true
+      containers:
+      - name: kepler
+        image: quay.io/sustainable_computing_io/kepler:v0.7.11
+        ports:
+        - name: http
+          containerPort: 9102
+        resources:
+          requests:
+            cpu: 100m
+            memory: 128Mi
+          limits:
+            cpu: 500m
+            memory: 512Mi
+        securityContext:
+          privileged: true
+        volumeMounts:
+        - name: proc
+          mountPath: /proc
+          readOnly: true
+        - name: sys
+          mountPath: /sys
+          readOnly: true
+        - name: containerd
+          mountPath: /var/run/containerd
+          readOnly: true
+      volumes:
+      - name: proc
+        hostPath:
+          path: /proc
+      - name: sys
+        hostPath:
+          path: /sys
+      - name: containerd
+        hostPath:
+          path: /var/run/containerd
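
The README in this commit mentions Pod Security Standards (PSS), so it may help to see which fields of this DaemonSet a baseline-level policy would likely flag. The sketch below is a partial, hypothetical check covering only host namespaces, privileged containers, and `hostPath` volumes; it is not the Kubernetes admission logic, and the dictionary simply mirrors the relevant fields of the pod template above.

```python
# Relevant fields of the Kepler pod template, mirrored as a plain dict.
POD_SPEC = {
    "hostPID": True,
    "hostNetwork": True,
    "containers": [
        {"name": "kepler", "securityContext": {"privileged": True}},
    ],
    "volumes": [
        {"name": "proc", "hostPath": {"path": "/proc"}},
        {"name": "sys", "hostPath": {"path": "/sys"}},
        {"name": "containerd", "hostPath": {"path": "/var/run/containerd"}},
    ],
}


def baseline_findings(spec):
    """List settings that the PSS baseline profile is expected to reject."""
    findings = []
    for field in ("hostPID", "hostIPC", "hostNetwork"):
        if spec.get(field):
            findings.append(f"{field} is enabled")
    for container in spec.get("containers", []):
        if container.get("securityContext", {}).get("privileged"):
            findings.append(f"container {container['name']} is privileged")
    for volume in spec.get("volumes", []):
        if "hostPath" in volume:
            findings.append(f"volume {volume['name']} uses hostPath")
    return findings


findings = baseline_findings(POD_SPEC)
for finding in findings:
    print(finding)
```

If `kubeflow-monitoring-system` were labeled to enforce the baseline or restricted profile, these settings would presumably block the pod, which is consistent with Kepler being an opt-in component.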
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+resources:
+- namespace.yaml
+- serviceaccount.yaml
+- clusterrole.yaml
+- clusterrolebinding.yaml
+- daemonset.yaml
+- service.yaml
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: kubeflow-monitoring-system
