General Technical Review - Kubeflow / Graduation

Project: Kubeflow
Project Version: Every Kubeflow sub-project has its own version.
Website: https://www.kubeflow.org/
Date Updated: 2025-09-05
Template Version: v1.0
Description:

Kubeflow is the foundation of tools for AI Platforms on Kubernetes.

AI platform teams can build on top of Kubeflow by using each project independently or deploying the entire AI reference platform to meet their specific needs. The Kubeflow AI reference platform is composable, modular, portable, and scalable, backed by an ecosystem of Kubernetes-native projects that cover every stage of the AI lifecycle.

Whether you’re an AI practitioner, a platform administrator, or a team of developers, Kubeflow offers modular, scalable, and extensible tools to support your AI use cases.

What are Kubeflow Projects

Kubeflow is composed of multiple open source projects that address different aspects of the AI lifecycle. These projects are designed to be usable both independently and as part of the Kubeflow AI reference platform. This provides flexibility for users who may not need the full end-to-end AI platform capabilities but want to leverage specific functionalities, such as model training or model serving.

Kubeflow Projects in scope for CNCF Graduation

Kubeflow Spark Operator
Kubeflow Notebooks
Kubeflow Trainer
Kubeflow Katib
Kubeflow Model Registry
Kubeflow Pipelines

What is the Kubeflow AI Reference Platform

The Kubeflow AI reference platform refers to the full suite of Kubeflow projects bundled together with additional integration and management tools. It deploys a comprehensive toolkit for the entire AI lifecycle and can be installed via Packaged Distributions or Kubeflow Manifests.

Day 0 - Planning Phase

Scope

Describe the roadmap process

Kubeflow projects are managed by community members who are part of working groups. Each working group defines and agrees on the features for each release, and the release cadence of each working group varies according to community agreement.

All Kubeflow projects keep individual, publicly available roadmap files in their Git repositories that define each release. Because each roadmap is recorded along with its history in the Git repository, every proposed feature gets a standard structure, auditing, versioning, and transparency.

For more information, check ROADMAP for each Kubeflow Project:

Community-wide changes are proposed as Kubeflow Enhancement Proposals (KEPs) in the kubeflow/community repository or as KEPs in the individual Kubeflow sub-projects.

Describe the target persona or user(s) for the project

Users: Data Scientists, ML Engineers, AI Practitioners, Data Engineers.
Operators: MLOps Engineers, AIOps Engineers, Platform Engineers, AI Platform Engineers.
Vendors: Vendors and projects building Kubernetes-based AI Platform products.

Explain the primary use case for the project. What additional use cases are supported by the project

The goal of Kubeflow is to run Cloud Native AI workloads for every stage in the AI lifecycle. By using Kubeflow projects, users can develop and deploy AI applications.

The primary use-cases include:

Large-scale data processing and feature engineering.
Distributed pre-training of foundation models.
Post-training and fine-tuning of LLMs.
Hyperparameter optimization and model tuning.
LLM inference and multi-host serving.

Additional Use Cases:

End-to-End GenAI pipeline building.
Interactive AI development.
Multiple users / projects with hard multi-tenancy on the same cluster.

Explain which use cases have been identified as unsupported by the project

As Kubeflow is composed of multiple projects, each working group makes its own determinations as to what will be excluded from them. However, we have an overarching theme and governance structure (Steering Committee) that has identified the following areas as not being a priority for all projects:

The projects can be deployed on any Kubernetes cluster (each release specifies the tested versions), regardless of the underlying infrastructure, independently through Kubernetes manifests leveraging Kustomize and/or Helm charts. However, the project doesn’t provide an implementation for deployment on infrastructure other than Kubernetes, and we do not officially enforce a deployment method or distribution.
Kubeflow doesn’t provide a GitOps implementation; however, Kubeflow manifests can be integrated into a GitOps solution. For example, Platform Engineers can create an Argo CD Application (a custom resource) that points at the manifests of an individual Kubeflow project, such as Pipelines, to install and configure it. Argo CD reads the Kubeflow Pipelines manifests and deploys the configuration to the target cluster.
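
As a hedged illustration of the GitOps integration described above, the following Python sketch builds an Argo CD Application manifest that points at Kubeflow Pipelines manifests. The repository path, target revision, application name, and namespaces are assumptions chosen for illustration and are not prescribed by Kubeflow; adjust them to the overlay and release you actually use.

```python
import json


def kfp_argocd_application(
    repo_url: str = "https://github.com/kubeflow/pipelines.git",
    # Assumed path and revision: verify the overlay and release tag
    # in the Pipelines repository before using them.
    path: str = "manifests/kustomize/env/platform-agnostic",
    target_revision: str = "master",
    namespace: str = "kubeflow",
) -> dict:
    """Return an Argo CD Application manifest as a plain dictionary."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Hypothetical application name; "argocd" is the usual install namespace.
        "metadata": {"name": "kubeflow-pipelines", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": path,
                "targetRevision": target_revision,
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            # Automated sync keeps the cluster aligned with the Git state.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


# kubectl accepts JSON, so this output can be piped to `kubectl apply -f -`.
print(json.dumps(kfp_argocd_application(), indent=2))
```

Generating the manifest from code is only one option; the same fields can be written directly as YAML and committed to the GitOps repository.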

Describe the intended types of organizations who would benefit from adopting this project

Kubeflow is intended to be used by any organization that needs to run AI workloads on Kubernetes at any stage of the AI lifecycle. An organization might choose to use just one or a few projects, such as Pipelines or Trainer, or it can use all the Kubeflow projects together to improve the user experience of building AI workloads. Additionally, organizations can develop their own customizations on top of the Kubeflow platform or choose to build distributions to help other organizations adopt a customized platform based on Kubeflow projects.

As Kubeflow maintains flexibility, organizations can choose their own path according to their needs.

We encourage adopters to be part of the adopters list in GitHub.

Examples of organizations that use Kubeflow:

AWS
Red Hat
Capital One
CERN
Google
Alibaba Cloud
Bloomberg
IBM
Cisco
Huawei
Microsoft
Tencent
DHL Data & Analytics
Telia
Roblox
Toyota
PepsiCo
Volvo
and others…

Describe any completed end user research and link to any reports.

We regularly undertake user research about Kubeflow and its users. Research could be done by working groups, during events, or by conducting at least one study annually. We ensure these surveys are visible to the public and shared with the community.

Here are some results from previous years.

Usability

How should your target personas interact with your project

AI Practitioner - Kubeflow SDK, Kubeflow UIs

Platform Admins - Operator guides; installing and configuring Kubeflow projects from the command line via the predefined Helm charts or Kustomize manifests available in the Kubeflow documentation.

Describe the user experience (UX) and user interface (UI) of the project

Kubeflow is a collection of projects, and each project provides its own user experience through its own interfaces, APIs, and SDKs.

Describing User Experience through SDK

We are working on a unified Kubeflow SDK that gives AI practitioners a Python-native experience for interacting with Kubeflow APIs.

Through the Kubeflow SDK, users will be able to interact with the different projects, which improves the user experience and reduces complexity.

Example of interacting with Kubeflow Trainer using the Kubeflow SDK:

from kubeflow.trainer import TrainerClient, BuiltinTrainer, TorchTuneConfig

# Fine-Tune Llama 3.2
job_id = TrainerClient().train(
    runtime=TrainerClient().get_runtime(name="torchtune-llama3.2-1b"),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            resources_per_node={
                "memory": "200G",
                "gpu": 1,
            },
        )
    ),
)

# Wait for TrainJob to complete
TrainerClient().wait_for_job_status(job_id)

# Print TrainJob logs
print("\n".join(TrainerClient().get_job_logs(name=job_id)))

Example of interacting with Kubeflow Katib using the Kubeflow SDK:

from kubeflow.optimizer import OptimizerClient, search
from kubeflow.trainer import TrainerClient, BuiltinTrainer, TorchTuneConfig

OptimizerClient().optimize(
    objective="loss",
    mode="min",
    num_trials=5,
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            resources_per_node={
                "memory": "200G",
                "gpu": 1,
            },
            lr=search("0.1", "0.2", "lognormal"),
            num_epochs=search("4", "8", "uniform"),
        )
    ),
    runtime=TrainerClient().get_runtime(name="torchtune-llama3.2-1b"),
)

Find more details and examples about Kubeflow SDK.

Describing User Experience through APIs

Kubeflow Pipelines

REST API is available under the /pipeline/ HTTP path. For example, if you host Kubeflow at https://kubeflow.example.com, the API will be available at https://kubeflow.example.com/pipeline/.

The API is documented using Swagger, and examples of its usage can be found here.
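
To make the path convention above concrete, here is a minimal sketch of how a client might build and call a Pipelines REST URL. The `apis/v2beta1/pipelines` route and the cookie-based authentication are assumptions for illustration; the routes available and the authentication method depend on the Kubeflow Pipelines version and on how the platform is deployed. The helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urljoin


def pipelines_api_url(host: str, resource: str = "pipelines",
                      api_version: str = "v2beta1") -> str:
    """Build a Kubeflow Pipelines REST URL under the /pipeline/ prefix."""
    base = host.rstrip("/") + "/pipeline/"
    # The route layout below is an assumption; check the Swagger docs.
    return urljoin(base, f"apis/{api_version}/{resource}")


def list_pipelines(host: str, session_cookie: str) -> dict:
    """Fetch the pipeline list; needs a reachable, authenticated endpoint."""
    request = urllib.request.Request(
        pipelines_api_url(host),
        # Assumed auth scheme: a session cookie issued by the platform login.
        headers={"Cookie": session_cookie},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


print(pipelines_api_url("https://kubeflow.example.com"))
```

Only `pipelines_api_url` runs without a cluster; `list_pipelines` is shown to indicate where authentication headers would be attached.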

Kubeflow Model Registry

Model Registry REST API is available under the /api/model_registry/ HTTP path. More information can be found here

The Model Catalog Service provides a read-only discovery service for ML models across multiple catalog sources. It acts as a federated metadata aggregation layer, allowing users to search and discover models from various external catalogs through a unified REST API.

Describing User Interfaces through web UI

Kubeflow Central Dashboard acts as a hub for the AI platform and tools by exposing the UIs of components running in the cluster.

End users can access the Central Dashboard installed in their organization, subject to the user and access controls previously set up by the Platform Admin. From the dashboard, users can access Kubeflow projects such as:
- Kubeflow Pipelines (KFP) to manage experiments, pipeline definitions, runs, and recurring runs.
- KFP Artifacts to track artifacts produced by pipelines and stored in MLMD.
- KFP Executions to track executions of pipeline components stored in MLMD.
- Kubeflow Katib experiments to manage AutoML experiments.
- Kubeflow Notebooks to manage Kubeflow Notebooks.
- Kubeflow TensorBoards to manage TensorBoard instances.
- Kubeflow Volumes to manage Kubernetes PVC volumes.
- Contributors page to manage contributors of profiles (namespaces) that you own.

More information can be found in the Kubeflow Dashboard docs.

Describe how this project integrates with other projects in a production environment

All Kubeflow projects are Kubernetes native and so fit into the wider ecosystem of Kubernetes-based tools. Kubeflow projects extensively use tools from the cloud native ecosystem, including Argo Workflows, Istio, Helm, Knative, KServe, JobSet, Kueue, and other projects.

Specific components integrate with popular AI frameworks where applicable (e.g., Kubeflow Trainer integrates with PyTorch and other model frameworks).

The following diagram gives an overview of the Kubeflow Ecosystem and how it relates to the wider Kubernetes AI landscapes.

Specifically for a production environment, Kubeflow can integrate and leverage the following resources and projects:

Cloud Providers and underlying infrastructure
- Kubeflow can leverage the underlying infrastructure provided through Kubernetes, such as access to hardware accelerators from Intel, Nvidia, and AMD through Kubernetes Operators.
- Organizations can run Kubeflow projects on any Kubernetes platform and use cloud provider services, such as scalability across different nodes, taking advantage of projects such as Open Cluster Management.
Configuration as code
- A GitOps approach can be used to install and configure Kubeflow, for example, using Argo CD to manage configuration as code across multiple clusters.
Security
- Security is a key aspect: Kubeflow projects implement Kubernetes security best practices, including Pod Security Standards, Network Policies, and RBAC. Additionally, they can leverage projects such as the External Secrets Operator.
- Data Scientists can also use Sigstore projects to sign and validate their models. This guards against attacks such as model tampering and ensures that the integrity and authenticity of models persist throughout the Model Development Lifecycle.
Simplify User experience
- Organizations can simplify the Data Scientists' experience and scale the reach of Platform Engineers by providing software templates through Backstage that give quick access to environments, leveraging Kubeflow projects such as Notebooks.

Design

Explain the design principles and best practices the project is following

Loosely Coupled and Distributable Services
- Built as modular, independent microservices that can scale and evolve.
- Services communicate through well defined APIs, enabling flexible orchestration and integration.
Kubernetes native with Declarative APIs and Automation
- Uses Kubernetes CRDs for each project.
- Integrates natively with Kubernetes core APIs like Jobs, Pods, and Deployments.
- Supports automated deployment, orchestration, and CI/CD workflows.
- Enables GitOps management for reproducible and auditable operations.
Portability and Platform-Agnosticism
- Designed to run on any Kubernetes cluster, including on-premises, edge, public cloud, or hybrid environments.
- Avoids vendor lock-in while ensuring consistent behavior across different infrastructures.
Observability
- Provides metrics, logs, and traces to monitor system performance and workflow execution.
- Enables debugging, performance analysis, and monitoring of workloads and processes.
Resilience and Availability
- Leverages platform features such as health checks, automatic restarts, and scaling to maintain high availability.
- Designed to tolerate failures and minimize disruption to workloads.
Security and Multi-Tenancy
- Implements access control, isolation, and resource quotas to separate workloads and users.
- Supports secure communication and policy enforcement to maintain a safe multi-tenant environment.
Extensibility and Interoperability
- Provides APIs and runtime contracts to allow integration of new tools, frameworks, or workflows.
- Supports extension and customization without breaking existing functionality.

Kubeflow overall has strong contributing guidelines that inform our development and design. Each project also provides new contributors specific guidance on their GitHub repositories:

For new features, every Kubeflow project follows the KEP guidelines.

Outline or link to the project’s architecture requirements

We follow the same practices as Kubernetes for APIs, with alpha, beta, and stable API status. Details are described here.
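
Under the Kubernetes convention referenced above, the maturity level is encoded in the API version string itself (for example `v1alpha1`, `v1beta1`, `v1`). The following sketch shows how tooling might classify a version string; the function name is hypothetical and the sample versions are illustrative rather than a list of actual Kubeflow API versions.

```python
import re

# Kubernetes-style versions: "v<major>" optionally followed by
# "alpha<n>" or "beta<n>".
_VERSION = re.compile(r"^v(\d+)(?:(alpha|beta)(\d+))?$")


def api_maturity(version: str) -> str:
    """Return "alpha", "beta", or "stable" for a Kubernetes-style version."""
    match = _VERSION.match(version)
    if match is None:
        raise ValueError(f"not a Kubernetes-style API version: {version!r}")
    # Group 2 is None when there is no alpha/beta suffix.
    return match.group(2) or "stable"


for sample in ("v1alpha1", "v2beta1", "v1"):
    print(sample, "->", api_maturity(sample))
```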

Define any specific service dependencies the project relies on in the cluster

The following projects are required to install Kubeflow projects:

Istio, Knative, Cert-manager
Detailed information can be found in the Kubeflow manifests.

Specific projects have other dependencies:

Kubeflow Pipelines: Argo Workflows>=v3.1
Kubeflow Trainer: JobSet>=v0.8.0
Kubeflow Katib: MySQL>=v8.0
Kubeflow Model Registry: MySQL>=v8.0

Describe how the project implements Identity and Access Management

Kubeflow projects use a pluggable system of IAM that connects back to Kubernetes IAM in most cases. As a reference implementation, the optional Kubeflow Manifests/Dashboard components use the following IAM systems:

General security talks about Kubeflow, including an architectural introduction:

Describe how the project has addressed sovereignty

Kubeflow’s projects can be self-hosted on any Kubernetes cluster, including air-gapped environments, which is critical for organizations that operate disconnected environments. Additionally, Kubeflow projects support multi-tenancy, which allows organizations to isolate workloads in a Kubernetes cluster; this can be combined with Network Policies to restrict traffic between components by port or application name. Kubeflow projects support custom service accounts, allowing organizations to control how Kubeflow projects interact with the cluster by leveraging RBAC. Further, Platform Administrators can decide which Kubernetes users have access to the Kubeflow projects by creating different users and RBAC policies. Kubeflow Notebooks also supports profiles, which simplify permission management.
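
As a rough sketch of the Network Policy restriction mentioned above (by port and application name), the snippet below builds a standard `networking.k8s.io/v1` NetworkPolicy that only admits traffic to one application from another on a single TCP port. The `app` label key, the namespace, and the application names are hypothetical; real Kubeflow components may use different labels.

```python
import json


def allow_ingress_from_app(namespace: str, target_app: str,
                           source_app: str, port: int) -> dict:
    """Allow ingress to `target_app` pods only from `source_app` on `port`."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": f"allow-{source_app}-to-{target_app}",
            "namespace": namespace,
        },
        "spec": {
            # Pods this policy protects (assumed to carry an "app" label).
            "podSelector": {"matchLabels": {"app": target_app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [
                        {"podSelector": {"matchLabels": {"app": source_app}}}
                    ],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


# Hypothetical tenant namespace and application names.
policy = allow_ingress_from_app("team-a", "notebook", "gateway", 8888)
print(json.dumps(policy, indent=2))
```

Once a policy selects a pod for ingress, traffic that no rule allows is dropped, which is what provides the isolation between tenants.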

In terms of data management, data sovereignty is managed by those who deploy or package the projects.

Describe any compliance requirements addressed by the project

Kubeflow projects are extensible, which allows users to fit them to their internal compliance requirements. As a result, compliance with specific common frameworks (SOC-2, GDPR, etc.) is the responsibility of end users and vendors. However, we aim to provide a strong foundation, through reference architectures and similar resources, on which to build.

Describe the project’s High Availability requirements

End users can adjust the number of replicas. Kubeflow project controllers, for example Kubeflow Trainer, support leader election.

Describe the project’s resource requirements, including CPU, Network and Memory

The following table shows the resource requirements for each Kubeflow project. CPU and memory are calculated as the maximum of actual usage and configured requests, and storage is taken from the PVC requirements:

Because the maximum can look hefty, in general read each figure as max(manifest requests, average usage).

Component	CPU (millicores)	Memory (Mi)	Storage (GB)
Cert Manager	3m	130Mi	0GB
Dex + OAuth2-Proxy	3m	28Mi	0GB
Istio	2300m	3502Mi	0GB
Kubeflow KServe	600m	1200Mi	0GB
Kubeflow Katib	9m	471Mi	13GB
Kubeflow Core	35m	841Mi	0GB
Metadata	78m	687Mi	30GB
Kubeflow Model Registry	510m	2112Mi	0GB
Other	20m	354Mi	6GB
Kubeflow Pipelines	970m	3552Mi	90GB
Kubeflow Spark Operator	4m	41Mi	0GB
Kubeflow Trainer	3m	25Mi	0GB
Total	4535m	12943Mi	139GB
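
As a quick arithmetic check of the table above, this short sketch sums the per-component figures and converts the totals into whole cores and GiB. The values are copied from the table; only the unit conversions are added.

```python
# (CPU in millicores, memory in Mi, storage in GB), copied from the table.
REQUIREMENTS = {
    "Cert Manager": (3, 130, 0),
    "Dex + OAuth2-Proxy": (3, 28, 0),
    "Istio": (2300, 3502, 0),
    "Kubeflow KServe": (600, 1200, 0),
    "Kubeflow Katib": (9, 471, 13),
    "Kubeflow Core": (35, 841, 0),
    "Metadata": (78, 687, 30),
    "Kubeflow Model Registry": (510, 2112, 0),
    "Other": (20, 354, 6),
    "Kubeflow Pipelines": (970, 3552, 90),
    "Kubeflow Spark Operator": (4, 41, 0),
    "Kubeflow Trainer": (3, 25, 0),
}

# Sum each column across all components.
cpu_m, memory_mi, storage_gb = (sum(col) for col in zip(*REQUIREMENTS.values()))

cpu_cores = cpu_m / 1000        # 1000 millicores = 1 core
memory_gib = memory_mi / 1024   # 1024 Mi = 1 Gi

print(f"CPU: {cpu_m}m ({cpu_cores:.3f} cores)")
print(f"Memory: {memory_mi}Mi ({memory_gib:.2f} GiB)")
print(f"Storage: {storage_gb}GB")
```

The sums match the Total row (4535m, 12943Mi, 139GB), which is roughly 4.5 cores and 12.6 GiB of memory for the full reference platform.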

Describe the project’s storage requirements, including its use of ephemeral and/or persistent storage

Each project can configure storage in different ways. This is also true of the manifests which configure storage for a collection of projects. This can be seen here.

Please outline the project’s API Design

Various Kubeflow projects offer APIs and Python SDKs. See the following sets of reference documentation:

Pipelines reference docs for the Kubeflow Pipelines API and SDK, including the Kubeflow Pipelines domain-specific language (DSL).
Kubeflow SDK to interact with Kubeflow APIs with Python-native interface.

See also Kubeflow APIs and SDKs.

Kubeflow CRDs follow Kubernetes best practices for API changes.

Describe the project’s API topology and conventions

Kubeflow APIs and SDKs documentation.

Describe the project defaults

Some defaults can be seen in the Kubeflow manifests.

Additionally, API defaults are shown in the API docs, such as: Kubeflow Trainer or Kubeflow Katib.

Outline any additional configurations from default to make reasonable use of the project

Kubeflow manifests guides.

Some public, listed distributions have their own ways to install the Kubeflow AI platform.

Describe any new or changed API types and calls—including to cloud providers—that will result from this project being enabled and used

Deploying Kubeflow manifests and individual projects results in exposing new APIs and the possibility to call configured 3rd party APIs if integrations exist. These all have to be explicitly set by the users.

Describe compatibility of any new or changed APIs with API servers, including the Kubernetes API server

Kubeflow follows the same practice for API compatibility as Kubernetes.

Describe versioning of any new or changed APIs, including how breaking changes are handled

Many Kubeflow projects use Kubernetes CRDs and, for these resources, follow the same deprecation policy as Kubernetes.

Describe the project’s release processes, including major, minor and patch releases

Every Kubeflow project follows its own release lifecycle, for example Kubeflow Trainer or Kubeflow Katib.

However, the community also maintains the Kubeflow AI reference platform releases, which install all Kubeflow projects together as an end-to-end AI platform:

Installation

Describe how the project is installed and initialized

Kubeflow Installation guide.

Kubeflow projects can be installed as standalone applications or together using the Kubeflow Manifests or Kubeflow Distributions (public or private).

How does an adopter test and validate the installation

Distributions can verify the installation by following this guide or executing these test suites.

The Kubeflow project docs also explain how platform admins should validate the installation of individual applications, for example Kubeflow Trainer.

Security

Provide a link to the project’s cloud native security self assessment.

Kubeflow Security Self Assessment

How are you satisfying the tenets of cloud native security projects

Make security a design requirement.

Kubeflow projects are built with security as a fundamental concern. Kubeflow projects follow Kubernetes and cloud native security best practices. By leveraging Kubernetes Custom Resource Definitions (CRDs), Role-Based Access Control (RBAC), network policies, and Pod Security Standards, Kubeflow projects integrate into secure cloud native environments. Additionally, Kubeflow projects reuse functionality from other cloud native tools like Istio, cert-manager, and Kubernetes, and inherit their security best practices.

The default configuration for Kubeflow projects follows best practices such as rootless containers, limited privileges, etc.

Applying secure configuration has the best user experience

The default configuration for Kubeflow projects follows best practices such as rootless containers, limited privileges, etc. Users can migrate workloads to more secure configurations without breaking changes, leveraging Kubernetes’ declarative model.

Selecting insecure configuration is a conscious decision

Insecure options require explicit user setup and configurations.

Describe how each of the cloud native principles apply to your project

Kubeflow uses cloud native principles by building on Kubernetes and other cloud native technologies while extending them in our composable projects.

Kubeflow projects use containers as the fundamental unit of task and deployment. For example, every TrainJob uses separate containers to train models, allowing users to package their training code along with its dependencies in a containerized environment.

Kubeflow projects are designed to run natively on Kubernetes and define various CRDs for each AI workload, for example, the TrainJob CR for model training or the Notebook CR for interactive development. They use native Kubernetes primitives like Deployments, Services, Jobs, JobSets, and CRDs to orchestrate AI workloads.

Kubeflow follows a loosely coupled microservices architecture model, where each project can be deployed as a standalone service and be integrated into the desired AI platform on Kubernetes.

Kubeflow APIs are declarative and follow Kubernetes best practices, for instance, SparkApplication to manage Spark jobs or TrainJob to manage distributed training jobs.

How do you recommend users alter security defaults in order to "loosen" the security of the project

Users can install Kubeflow Pipelines without multi-user isolation if they want to loosen the security of the deployment. Otherwise, users are advised to follow the Kubeflow AI reference platform deployment mode, which installs the Kubeflow Pipelines control plane in multi-tenant mode and provides isolation between Pipelines and Runs for users.

Security Hygiene

Describe the frameworks, practices and procedures the project uses to maintain the basic health and security of the project

Robust CI/CD infrastructure: Kubeflow projects have automated unit, integration, and end-to-end tests that are integrated into the CI pipelines to ensure code stability and correctness of PRs.
Dependency management: Kubeflow projects leverage dependency files like go.mod and pyproject.toml, and are progressing toward Software Bill of Materials (SBOM) generation.
Release stability: Kubeflow projects maintain release branches and release tags, which allow maintainers to create patch releases that address bugs and security vulnerabilities. Also, projects maintain changelogs and roadmaps to ensure traceability and transparency for all changes.
Static and Dynamic Code Analysis: Kubeflow projects’ CI integrates code linting and static and dynamic code analysis tools to enforce code quality and detect potential issues before they reach production.
Secure Defaults: Default manifests and Helm charts are configured with security best practices, like Pod Security Standards, rootless containers, and least privilege.
Code Review Process: Kubeflow projects’ maintainers follow a strong review process with peer review and explicit approval before PRs are merged into the main branch.
Open Governance: Kubeflow follows open governance with regular public meetings and communication channels to keep users aware of development and roadmap.
Secure Image Build: Image builds are validated through CI/CD checks to ensure they are reproducible, secure, and consistent.

Some examples can be found here:

Describe how the project has evaluated which features will be a security risk to users if they are not maintained by the project

Features are discussed and reviewed in the community, with security implications considered during design and code review processes. The project tracks the security health of dependencies and evaluates the impact of vulnerabilities or unmaintained projects. The project identifies potential risks by analyzing the attack surface, especially for features that interact with user-supplied code, custom containers, user supplied models or external data sources.

Cloud Native Threat Modeling

Explain the least minimal privileges required by the project and reasons for additional privileges

Each Kubeflow project's control plane needs to watch for its CRDs, for example, Kubeflow Trainer for TrainJob and ClusterTrainingRuntime, Spark Operator for SparkApplications, Katib for Experiments, etc. Controllers have the minimal privileges needed to orchestrate their CRDs and the corresponding resources that need to be created.

Kubeflow control plane is installed cluster-wide, and it requires RBAC to manage namespace-scoped resources.

The default controllers run under the restricted Pod Security Standard and with rootless containers, minimizing the risk of privilege escalation.

Users who interact with Kubeflow SDK get namespace-scoped permission to interact with required resources. For example, users’ role for Kubeflow Trainer can be found here. All roles are aggregated in the kubeflow-edit cluster role after installing Kubeflow Manifests.

Describe how the project is handling certificate rotation and mitigates any issues with certificates

Kubeflow projects use cert-manager to generate and rotate certificates. Those certificates are used for validation and mutation webhooks across Kubeflow projects. Cert-manager handles the issuance, renewal, and rotation of these certificates automatically without downtime.

Describe how the project is following and implementing secure software supply chain best practices

Automated CI/CD infrastructure: Kubeflow projects have automated unit, integration, and end-to-end tests that are integrated into the CI pipelines to ensure code stability and reduce risk of introducing vulnerabilities.
Vulnerability Scanning: Kubeflow uses vulnerability scanning tools like Dependabot and Trivy to identify and address known CVEs in dependencies.
Dependency management: Kubeflow projects leverage dependency files like go.mod and pyproject.toml, and are progressing toward Software Bill of Materials (SBOM) generation.
Code Review Process: Kubeflow projects’ maintainers follow a strong review process with peer review and explicit approval before PRs are merged into the main branch.
DCO Check: Committers are required to sign and comply with the Developer Certificate of Origin (DCO) to affirm the legitimacy and authorship of their contributions.
Branch Protection: The project enforces branch protection rules to prevent unauthorized changes, enforce status checks, require pull request reviews, and control who can push to protected branches.
Prevent Secrets in Source Code: Proactively prevents committing secrets to the source code by using tools and GitHub workflows that detect sensitive data in PRs.
License compliance: Kubeflow uses the FOSSA product provided by the CNCF to scan for license compliance on a regular and ongoing basis.

Day 1 - Installation and Deployment Phase

Project Installation and Configuration

Describe what project installation and configuration look like

Each project has its own standalone installation guide.

Kubeflow projects are packaged by distributions and vendors into platform products which can be used to install our tools.

We provide optional manifests to deploy all Kubeflow projects as the Kubeflow AI reference platform.

Project Enablement and Rollback

How can this project be enabled or disabled in a live cluster? Please describe any downtime required of the control plane or nodes

Users can disable a Kubeflow project by setting the replica count of its controller Deployment to 0. Existing AI workloads should not be impacted, since they simply won’t be reconciled by the controllers.

Updating the Kubeflow AI Reference Platform.

The installation guide describes how to tell when the control plane is ready.

Describe how enabling the project changes any default behavior of the cluster or running workloads

Enable projects by scaling the replica count back to 1. All running AI workloads will be reconciled again by the controllers, which perform the appropriate updates to the custom resources.
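
The disable and enable steps described above amount to scaling a controller Deployment between 0 and 1 replicas. The sketch below wraps the standard `kubectl scale` command; the Deployment name and namespace in the usage lines are assumptions and differ per project and installation, and the helper names are hypothetical.

```python
import subprocess


def scale_command(deployment: str, namespace: str, replicas: int) -> list:
    """Build the kubectl command that sets a Deployment's replica count."""
    if replicas < 0:
        raise ValueError("replicas must be zero or greater")
    return [
        "kubectl", "scale", "deployment", deployment,
        "--namespace", namespace,
        f"--replicas={replicas}",
    ]


def set_controller_enabled(deployment: str, namespace: str,
                           enabled: bool, dry_run: bool = True) -> list:
    """Scale to 1 (enabled) or 0 (disabled); only runs kubectl if not dry_run."""
    command = scale_command(deployment, namespace, 1 if enabled else 0)
    if not dry_run:
        subprocess.run(command, check=True)
    return command


# Assumed names: look up the real ones with `kubectl get deployments -A`.
print(" ".join(set_controller_enabled(
    "kubeflow-trainer-controller-manager", "kubeflow-system", enabled=False)))
```

With `dry_run=True` (the default here) the function only returns the command, which makes it easy to review before touching a live cluster.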

Installing Istio, Knative, or cert-manager might interfere with existing installations of those components in the cluster.

Describe how the project tests enablement and disablement

A conformance program is a work in progress and is intended to provide tests across all Kubeflow projects.

How does the project clean up any resources created, including CRDs

The control plane of a Kubeflow project can be deleted by cleaning up the appropriate resources, for example:

kubectl delete -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"

The command cleans up all CRDs and the control plane deployment. Since Kubeflow custom resources maintain ownerReferences for their associated resources, all associated resources are removed once the custom resources themselves are deleted. For example:

kubectl delete trainjob --all --all-namespaces

Check this guide to clean up the Kubeflow AI reference platform.

Rollout, Upgrade and Rollback Planning

How does the project intend to provide and maintain compatibility with infrastructure and orchestration management tools like Kubernetes and with what frequency

Kubeflow projects publish their supported Kubernetes versions for every release. The supported versions are evaluated and upgraded on every release. We support Kubernetes 1.31+ and test on 1.32+.
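
The two thresholds above (supported from 1.31, tested from 1.32) can be expressed as a small compatibility check. This is only an illustrative sketch: the function name is hypothetical, and the thresholds are taken from the sentence above and change with each release.

```python
import re

SUPPORTED_FROM = (1, 31)  # from the text above; revisit for each release
TESTED_FROM = (1, 32)


def support_tier(git_version: str) -> str:
    """Classify a Kubernetes version string such as "v1.31.4+k3s1"."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version.strip())
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {git_version!r}")
    # Only major.minor matters for the support policy.
    release = (int(match.group(1)), int(match.group(2)))
    if release >= TESTED_FROM:
        return "supported and tested"
    if release >= SUPPORTED_FROM:
        return "supported"
    return "unsupported"


for version in ("v1.30.2", "v1.31.4+k3s1", "1.33.0"):
    print(version, "->", support_tier(version))
```

The input format mirrors the `gitVersion` field reported by `kubectl version`, including vendor suffixes.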

How the project handles rollback procedures

Some projects support leader election and HA to make sure that long-running workloads can complete. Users can scale the controller replicas up or down during a rollback.

How can a rollout or rollback fail? Describe any impact to already running workloads

A newer revision may fail to become active for many reasons, including insufficient cluster resources for the newer spec or an invalid configuration spec. Traffic is not switched to the new revision unless it is ready to accept traffic. Hence, already running workloads are not affected if a rollout fails.

Describe any specific metrics that should inform a rollback

Kubeflow projects’ CRDs expose status fields that report the activity of AI workloads. Additionally, controllers expose various Prometheus metrics that indicate workload status.
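
As one hedged example of turning CRD status into a rollback signal, the sketch below inspects the standard Kubernetes `status.conditions` list of custom resources and flags when the share of failed workloads exceeds a threshold. The `Failed` condition type and the 20% threshold are assumptions for illustration; check each project's API reference for the condition types it actually reports.

```python
def condition_is_true(resource: dict, condition_type: str) -> bool:
    """Return True if the resource reports the condition with status "True"."""
    conditions = (resource.get("status") or {}).get("conditions") or []
    return any(
        c.get("type") == condition_type and c.get("status") == "True"
        for c in conditions
    )


def should_consider_rollback(resources: list, failed_type: str = "Failed",
                             threshold: float = 0.2) -> bool:
    """Flag a rollback when the failed fraction exceeds the threshold."""
    if not resources:
        return False
    failed = sum(condition_is_true(r, failed_type) for r in resources)
    return failed / len(resources) > threshold


# Hand-written sample objects standing in for `kubectl get ... -o json` items.
jobs = [
    {"status": {"conditions": [{"type": "Failed", "status": "True"}]}},
    {"status": {"conditions": [{"type": "Complete", "status": "True"}]}},
    {"status": {}},
]
print(should_consider_rollback(jobs))
```

In practice this kind of check would sit alongside the controllers' Prometheus metrics rather than replace them.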

Explain how upgrades and rollbacks were tested and how the upgrade->downgrade->upgrade path was tested

Currently, this path is tested manually by users, and automated tests are a work in progress.

Explain how the project informs users of deprecations and removals of features and APIs

All API changes are backward compatible and changes are announced in the release notes and changelogs. If some APIs are deprecated, newer versions of APIs are introduced with user awareness. Kubeflow CRDs follow Kubernetes best practices for API compatibility.

For example Kubeflow Trainer breaking changes.

Explain how the project permits utilization of alpha and beta capabilities as part of a rollout

At the moment, functionality is logged explicitly if a feature is alpha or beta. We do not have feature gates. Additionally, we are working on a conversion webhook that helps users with updated API versions of CRDs.

Day 2 - Day-to-Day Operations Phase

Scalability/Reliability

Describe how the project increases the size or count of existing API objects

In general, the API object count complexity is linear in the number of users and workloads. Users can list deployed Kubeflow CRDs across all namespaces.

For example, to list all TrainJobs from Kubeflow Trainer, run:

kubectl get trainjob --all-namespaces
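Building on the listing command above, a small filter can show how the object count is distributed across namespaces. This is a minimal sketch; the function name `count_by_namespace` is illustrative and relies only on the namespace being the first column of the `kubectl get --all-namespaces` output.

```shell
# Illustrative helper: reads `kubectl get <kind> --all-namespaces --no-headers`
# output on stdin and prints "<namespace> <object count>", sorted by namespace.
count_by_namespace() {
  awk 'NF { count[$1]++ } END { for (ns in count) print ns, count[ns] }' | sort
}
```

For example: `kubectl get trainjob --all-namespaces --no-headers | count_by_namespace`.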

The core controllers and resources are constant (beyond replicating specific controllers). See the full kustomize build here.

Describe how the project defines Service Level Objectives (SLOs) and Service Level Indicators (SLIs)

Kubeflow currently does not define its own SLOs and SLIs; however, we follow the Kubernetes SLOs.

Describe any operations that will increase in time covered by existing SLIs/SLOs

As above.

Describe the increase in resource usage in any components as a result of enabling this project, to include CPU, Memory, Storage, Throughput

Resource requirements for Kubeflow projects are set here.

Describe which conditions enabling / using this project would result in resource exhaustion of some node resources

There are some issues specific to each project (e.g. the pipelines controller can run into open file handle issues). Since many workloads are GPU-intensive, Kubernetes platform admins need to ensure that nodes have enough capacity to run AI workloads with Kubeflow projects.

Describe the load testing that has been performed on the project and the results

Some intensive load testing has been performed by Kubeflow users. For example, some LLM foundation models have been trained using Kubeflow Trainer across a few hundred GPUs. At KubeCon, some users have shared their experience of running more than 10,000 GPUs with Kubeflow Training Operator.

Describe the recommended limits of users, requests, system resources, etc. and how they were obtained

Users on the order of hundreds are known to work. Scaling beyond hundreds of users may require increasing the requests/limits or replicas of some deployments, or scaling dependencies like databases.

Describe which resilience pattern the project uses and how, including the circuit breaker pattern

Kubeflow uses the Kubernetes resilience pattern to manage controllers with HA. If one replica fails, another part of the control plane takes over responsibility for orchestrating workloads.

Observability Requirements

Describe the signals the project is using or producing, including logs, metrics, profiles and traces. Please include supported formats, recommended configurations and data storage

Kubeflow controllers expose Prometheus metrics to report workload status. Controllers also expose logs and status for platform admins to ensure stability.

Describe how the project captures audit logging

Platform admins can leverage Kubernetes audit logging for Kubeflow projects.

Describe any dashboards the project uses or implements as well as any dashboard requirements

Kubeflow Dashboard requirements.

Describe how the project surfaces project resource requirements for adopters to monitor cloud and infrastructure costs, e.g. FinOps That must happen on the Kubernetes namespace level

We recommend that users rely on third-party tools like Kubecost to measure the cloud and infrastructure cost of running Kubeflow projects.

Which parameters is the project covering to ensure the health of the application/service and its workloads

Most projects use Kubernetes liveness and readiness probes.

For example, see the Kubeflow Trainer controller manager.
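To see which probes a controller actually declares, an operator could query the Deployment spec. This is a minimal sketch: the function name is illustrative, and the namespace and Deployment name in the usage note (`kubeflow-system`, `kubeflow-trainer-controller-manager`) are assumptions that depend on how Kubeflow Trainer was installed.

```shell
# Illustrative helper: print each container of a Deployment together with the
# HTTP paths of its liveness and readiness probes (empty if none are set).
show_probes() {
  local namespace="$1" deployment="$2"
  kubectl get deployment "$deployment" -n "$namespace" \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\tliveness="}{.livenessProbe.httpGet.path}{"\treadiness="}{.readinessProbe.httpGet.path}{"\n"}{end}'
}
```

For example: `show_probes kubeflow-system kubeflow-trainer-controller-manager`. Probes that are not HTTP-based (exec or TCP) would show up with empty paths in this sketch.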

How can an operator determine if the project is in use by workloads

Check the Pods in kubeflow-profile labeled namespaces.
Check the Kubeflow custom resources in users’ namespaces.
Check the Kubeflow Dashboard resources.
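The first two checks can be sketched with kubectl as follows. The label selector `app.kubernetes.io/part-of=kubeflow-profile` is an assumption about how profile namespaces are labeled, and the `kubeflow.org` suffix filter would miss CRDs registered under other API groups, so both should be adapted to the installation at hand. The function names are illustrative.

```shell
# Sketch only: label and API group suffix are assumptions, see the text above.

# List namespaces that carry the Kubeflow profile label.
kubeflow_profile_namespaces() {
  kubectl get namespaces -l app.kubernetes.io/part-of=kubeflow-profile -o name
}

# List installed CRDs whose API group ends in "kubeflow.org".
kubeflow_crds() {
  kubectl get crd -o name | grep 'kubeflow\.org$'
}
```

An operator could then run `kubectl get pods -n <namespace>` or `kubectl get <resource> -n <namespace>` for each namespace and resource type returned to see whether workloads are present.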

How can someone using this project know that it is working for their instance

Kubeflow Manifests guide.
Run the Kubeflow manifest test suite
Run the upstream examples, for example Kubeflow Trainer PyTorch training.

Dependencies

Describe the specific running services the project depends on in the cluster

Most projects depend on Cert-Manager>=v1.16 and Istio>=v1.26.

Specific projects have other dependencies:

Kubeflow Pipelines: Argo Workflows>=v3.1
Kubeflow Trainer: JobSet>=v0.8.0
Kubeflow Katib: MySQL>=v8.0
Kubeflow Model Registry: MySQL>=v8.0

For details, please take a look at the general security talks about Kubeflow, which include an architectural introduction.

Describe the project’s dependency lifecycle policy

We follow the Kubernetes deprecation policy for supported versions. For example, we support the 3-4 latest versions of Kubernetes to deploy Kubeflow projects.

How does the project incorporate and consider source composition analysis as part of its development and security hygiene? Describe how this source composition analysis (SCA) is tracked

Various static and dynamic code analysis tools are enforced as described above.

Describe how the project implements changes based on source composition analysis (SCA) and the timescale

Various static and dynamic code analysis tools are enforced as described above.

Troubleshooting

How does this project recover if a key component or feature becomes unavailable

Specifics depend on the Kubeflow project (KFP, Katib, etc.). Kubeflow projects are cloud and Kubernetes native apps, so fault tolerance is strongly tied to the health of the underlying cluster. In practice, user workloads sometimes have a retry policy, and the Kubeflow core services should recover automatically.

More details are described in the previous sections.

Describe the known failure modes

Each Kubeflow project handles failure modes differently beyond native Kubernetes fault tolerance. Many of them are configured at the application level in user code.

Security

How is the project executing access control

As described above, each Kubeflow project requires limited RBAC permissions to manage its CRDs.

Cloud Native Threat Modeling

Described in the section above.

How does the project ensure its security reporting and response team is representative of its community diversity (organizational and individual)

Every Kubeflow project follows security guidelines.

Kubeflow Spark Operator security policy.
Kubeflow Notebooks security policy.
Kubeflow Trainer security policy.
Kubeflow Katib security policy.
Kubeflow Model Registry security policy.
Kubeflow Pipelines security policy.

How does the project invite and rotate security reporting team members

Active project maintainers are responsible for ensuring that security reports are addressed. Each project has its own security reporting and disclosure policy. For projects which use GitHub’s security disclosure system, access is granted to those with write access to the relevant GitHub repository.
