- Project: Kubeflow
- Project Version: Every Kubeflow sub-project has its own version.
- Website: https://www.kubeflow.org/
- Date Updated: 2025-09-05
- Template Version: v1.0
- Description:
Kubeflow is the foundation of tools for AI Platforms on Kubernetes.
AI platform teams can build on top of Kubeflow by using each project independently or deploying the entire AI reference platform to meet their specific needs. The Kubeflow AI reference platform is composable, modular, portable, and scalable, backed by an ecosystem of Kubernetes-native projects that cover every stage of the AI lifecycle.
Whether you’re an AI practitioner, a platform administrator, or a team of developers, Kubeflow offers modular, scalable, and extensible tools to support your AI use cases.
Kubeflow is composed of multiple open source projects that address different aspects of the AI lifecycle. These projects are designed to be usable both independently and as part of the Kubeflow AI reference platform. This provides flexibility for users who may not need the full end-to-end AI platform capabilities but want to leverage specific functionalities, such as model training or model serving.
- Kubeflow Spark Operator
- Kubeflow Notebooks
- Kubeflow Trainer
- Kubeflow Katib
- Kubeflow Model Registry
- Kubeflow Pipelines
The Kubeflow AI reference platform refers to the full suite of Kubeflow projects bundled together with additional integration and management tools. The Kubeflow AI reference platform deploys a comprehensive toolkit for the entire AI lifecycle. It can be installed via Packaged Distributions or Kubeflow Manifests.
Kubeflow projects are managed by community members that are part of working groups. Each working group defines and agrees on the features for each release. The release cadence for each working group varies according to community agreement among the working groups.
All Kubeflow projects keep individual roadmap files in their Git repositories. These files define each release and are available to the public. This gives every proposed feature a standard structure and provides auditing, versioning, and transparency, since the roadmap is recorded along with its history in the Git repository.
For more information, check the ROADMAP of each Kubeflow project.
Community-wide changes are proposed as Kubeflow Enhancement Proposals (KEPs) in the kubeflow/community repository or as KEPs in the Kubeflow sub-projects.
- Users: Data Scientists, ML Engineers, AI Practitioners, Data Engineers.
- Operators: MLOps Engineers, AIOps Engineers, Platform Engineers, AI Platform Engineers.
- Vendors: Vendors and projects building Kubernetes-based AI Platform products.
Explain the primary use case for the project. What additional use cases are supported by the project
The goal of Kubeflow is to run Cloud Native AI workloads for every stage of the AI lifecycle. By using Kubeflow projects, users can develop and deploy AI applications.
The primary use-cases include:
- Large-scale data processing and feature engineering.
- Distributed pre-training of foundation models.
- Post-training and fine-tuning of LLMs.
- Hyperparameter optimization and model tuning.
- LLM inference and multi-host serving.
Additional Use Cases:
- End-to-End GenAI pipeline building.
- Interactive AI development.
- Multiple users / projects with hard multi-tenancy on the same cluster.
As Kubeflow is composed of multiple projects, each working group makes its own determinations as to what will be excluded from them. However, we have an overarching theme and governance structure (Steering Committee) that has identified the following areas as not being a priority for all projects:
- The projects can be deployed on any Kubernetes cluster (each release specifies the tested versions), regardless of the underlying infrastructure, independently through Kubernetes manifests leveraging Kustomize and/or Helm Charts. However, the project doesn’t provide an implementation to be deployed on infrastructure besides Kubernetes.
- We do not officially enforce a deployment method or distribution.
- Kubeflow doesn’t provide a GitOps implementation; however, Kubeflow manifests can be integrated into a GitOps solution. For example, Platform Engineers can create an Argo CD Application (CRD) to install and configure Kubeflow projects by providing the manifests of individual Kubeflow projects, for example Pipelines. The GitOps application will read from the Kubeflow Pipelines manifest and Argo CD will deploy the configurations in the target cluster.
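As a rough sketch of the GitOps integration described above, the following Python snippet builds an Argo CD `Application` manifest as a plain dictionary and prints it as JSON, which `kubectl apply -f -` accepts. The repository URL, manifest path, revision, and namespaces are assumptions chosen for illustration; replace them with the values for the project and release you actually deploy.

```python
import json


def build_argocd_application(name, repo_url, path, revision, dest_namespace):
    """Return an Argo CD Application manifest as a plain dictionary."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumes Argo CD itself runs in the "argocd" namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": path,
                "targetRevision": revision,
            },
            "destination": {
                # In-cluster API server address used by Argo CD.
                "server": "https://kubernetes.default.svc",
                "namespace": dest_namespace,
            },
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


# Hypothetical values: adjust the path and revision to the release you use.
app = build_argocd_application(
    name="kubeflow-pipelines",
    repo_url="https://github.com/kubeflow/pipelines",
    path="manifests/kustomize/env/platform-agnostic",
    revision="master",
    dest_namespace="kubeflow",
)
print(json.dumps(app, indent=2))
```

Keeping the manifest as data makes it easy to generate one Application per Kubeflow project from the same helper.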
Kubeflow is intended to be used by any organization that needs to run AI workloads on Kubernetes, in any of the AI lifecycle stages. An organization might choose to use just one or more projects, such as Pipelines or Training. Organizations can also use all the Kubeflow projects together to improve the user experience of building AI workloads. Additionally, organizations can develop their own customizations on top of the Kubeflow platform or choose to build distributions to help other organizations adopt a customized platform based on Kubeflow projects.
As Kubeflow maintains flexibility, organizations can choose their own path according to their needs.
We encourage adopters to be part of the adopters list in GitHub.
Examples of organizations that use Kubeflow:
- AWS
- Red Hat
- Capital One
- CERN
- Alibaba Cloud
- Bloomberg
- IBM
- Cisco
- Huawei
- Microsoft
- Tencent
- DHL Data & Analytics
- Telia
- Roblox
- Toyota
- PepsiCo
- Volvo
- and others…
We regularly undertake user research about Kubeflow and its users. Research could be done by working groups, during events, or by conducting at least one study annually. We ensure these surveys are visible to the public and shared with the community.
Here are some results from previous years.
- Kubeflow 2025 Survey
- 2025: UX designers supporting Model Registry conducted a series of user sessions to understand preferred interaction patterns (link)
- Kubeflow Survey 2024
- Kubeflow Survey 2023
- Kubeflow Survey 2022
AI Practitioner - Kubeflow SDK, Kubeflow UIs
Platform Admins - Operator guides, installing and configuring any Kubeflow projects via Helm charts or Kustomize manifests predefined and available in the Kubeflow documentation from the command line.
Kubeflow is a collection of projects, and each project provides its own user experience through its own interfaces, APIs, and SDKs.
We are working on a unified Kubeflow SDK that gives AI practitioners a Python-native experience to interact with Kubeflow APIs.
Through the Kubeflow SDK, users will be able to interact with the different projects, improving the user experience and reducing complexity.
Example to interact with Kubeflow Trainer using Kubeflow SDK:

    from kubeflow.trainer import TrainerClient, BuiltinTrainer, TorchTuneConfig

    # Fine-Tune Llama 3.2
    job_id = TrainerClient().train(
        runtime=TrainerClient().get_runtime(name="torchtune-llama3.2-1b"),
        trainer=BuiltinTrainer(
            config=TorchTuneConfig(
                resources_per_node={
                    "memory": "200G",
                    "gpu": 1,
                },
            )
        ),
    )

    # Wait for TrainJob to complete
    TrainerClient().wait_for_job_status(job_id)

    # Print TrainJob logs
    print("\n".join(TrainerClient().get_job_logs(name=job_id)))

Example to interact with Kubeflow Katib using Kubeflow SDK:

    from kubeflow.optimizer import OptimizerClient, search
    from kubeflow.trainer import TrainerClient, BuiltinTrainer, TorchTuneConfig

    OptimizerClient().optimize(
        objective="loss",
        mode="min",
        num_trials=5,
        trainer=BuiltinTrainer(
            config=TorchTuneConfig(
                resources_per_node={
                    "memory": "200G",
                    "gpu": 1,
                },
                lr=search("0.1", "0.2", "lognormal"),
                num_epochs=search("4", "8", "uniform"),
            )
        ),
        runtime=TrainerClient().get_runtime(name="torchtune-llama3.2-1b"),
    )

Find more details and examples about Kubeflow SDK.
- Kubeflow Pipelines
REST API is available under the /pipeline/ HTTP path. For example, if you host Kubeflow at
https://kubeflow.example.com, the API will be available at https://kubeflow.example.com/pipeline/.
The API is documented using Swagger, and examples of its usage can be found here
- Kubeflow Model Registry
Model Registry REST API is available under the /api/model_registry/ HTTP path. More information
can be found here
The Model Catalog Service provides a read-only discovery service for ML models across multiple catalog sources. It acts as a federated metadata aggregation layer, allowing users to search and discover models from various external catalogs through a unified REST API.
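To make the HTTP paths listed above concrete, here is a minimal sketch that joins a Kubeflow host with the documented `/pipeline/` and `/api/model_registry/` prefixes. The host is the example host from the text; the `apis/v2beta1/pipelines` resource path is an assumption used for illustration and should be checked against the API reference of your installed version. Authentication, which a real deployment requires, is left out.

```python
from urllib.parse import urljoin

# Example host from the text above; replace with your own.
KUBEFLOW_HOST = "https://kubeflow.example.com"

# HTTP path prefixes documented above.
API_PREFIXES = {
    "pipelines": "/pipeline/",
    "model_registry": "/api/model_registry/",
}


def api_url(service, resource=""):
    """Join the host, the service prefix, and an optional resource path."""
    base = urljoin(KUBEFLOW_HOST, API_PREFIXES[service])
    return urljoin(base, resource.lstrip("/"))


print(api_url("pipelines"))
# Assumed resource path, shown only to illustrate how URLs compose.
print(api_url("pipelines", "apis/v2beta1/pipelines"))
print(api_url("model_registry"))
```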
Kubeflow Central Dashboard acts as a hub for the AI platform and tools by exposing the UIs of components running in the cluster.
End users can access the Central Dashboard version installed in their organization, according to the user and access control setup previously configured by the Platform Admin. Through the dashboard, users can access Kubeflow projects such as:
- Kubeflow Pipelines (KFP) to manage experiments, pipeline definitions, runs, and recurring runs.
- KFP Artifacts to track artifacts produced by pipelines stored in MLMD.
- KFP Executions to track executions of pipeline components stored in MLMD.
- Kubeflow Katib experiments to manage AutoML experiments.
- Kubeflow Notebooks to manage Kubeflow Notebooks.
- Kubeflow TensorBoards to manage TensorBoard instances.
- Kubeflow Volumes to manage Kubernetes PVC Volumes.
- Contributors page to manage contributors of profiles (namespaces) that you own.
More information can be found in the Kubeflow Dashboard docs.
All Kubeflow projects are Kubernetes native and so fit into the wider ecosystem of Kubernetes-based tools. Kubeflow projects extensively use tools from the cloud native ecosystem, including Argo Workflows, Istio, Helm, Knative, KServe, JobSet, Kueue, and other projects.
Specific components integrate with popular AI frameworks where applicable (e.g. Kubeflow Trainer integrates with PyTorch and other model frameworks).
The following diagram gives an overview of the Kubeflow Ecosystem and how it relates to the wider Kubernetes AI landscape.
Specifically for a production environment, Kubeflow can integrate and leverage the following resources and projects:
- Cloud Providers and underlying infrastructure
  - Kubeflow can leverage the underlying infrastructure provided through Kubernetes, such as access to hardware accelerators from Intel, Nvidia, and AMD through Kubernetes Operators.
  - Organizations can run Kubeflow projects on any Kubernetes platform and use cloud provider services, such as scalability across different nodes, taking advantage of projects such as Open Cluster Management.
- Configuration as code
  - A GitOps approach can be used to manage Kubeflow setup and configuration, for example, using Argo CD to manage configuration as code in multiple clusters.
- Security
  - Security is becoming a key aspect. Kubeflow projects implement Kubernetes security best practices, including Pod Security Standards, Network Policies, and RBAC. Additionally, they can leverage projects such as the External Secrets Operator.
  - Data Scientists can also use Sigstore projects to sign and validate their models. This guards against attacks such as model tampering and ensures that the integrity and authenticity of models persist throughout the Model Development Lifecycle.
- Simplify User experience
  - Organizations can simplify the Data Scientists' experience and scale the work of Platform Engineers by providing software templates through Backstage, giving quick access to environments that leverage Kubeflow projects such as Notebooks.
- Loosely Coupled and Distributable Services
  - Built as modular, independent microservices that can scale and evolve.
  - Services communicate through well-defined APIs, enabling flexible orchestration and integration.
- Kubernetes native with Declarative APIs and Automation
  - Uses Kubernetes CRDs for each project.
  - Integrates natively with Kubernetes core APIs like Jobs, Pods, and Deployments.
  - Supports automated deployment, orchestration, and CI/CD workflows.
  - Enables GitOps management for reproducible and auditable operations.
- Portability and Platform-Agnosticism
  - Designed to run on any Kubernetes cluster, including on-premises, edge, public cloud, or hybrid environments.
  - Avoids vendor lock-in while ensuring consistent behavior across different infrastructures.
- Observability
  - Provides metrics, logs, and traces to monitor system performance and workflow execution.
  - Enables debugging, performance analysis, and monitoring of workloads and processes.
- Resilience and Availability
  - Leverages platform features such as health checks, automatic restarts, and scaling to maintain high availability.
  - Designed to tolerate failures and minimize disruption to workloads.
- Security and Multi-Tenancy
  - Implements access control, isolation, and resource quotas to separate workloads and users.
  - Supports secure communication and policy enforcement to maintain a safe multi-tenant environment.
- Extensibility and Interoperability
  - Provides APIs and runtime contracts to allow integration of new tools, frameworks, or workflows.
  - Supports extension and customization without breaking existing functionality.
Kubeflow overall has strong contributing guidelines that inform our development and design. Each project also provides new contributors specific guidance on their GitHub repositories:
- Kubeflow Spark Operator
- Kubeflow Notebooks
- Kubeflow Trainer
- Kubeflow Katib
- Kubeflow Model Registry
- Kubeflow Pipelines
For new features, every Kubeflow project follows the KEP guidelines.
We follow the same practices as Kubernetes for APIs, with alpha, beta, and stable API status. Details are described here.
The following projects are required to install Kubeflow projects:
- Istio, Knative, Cert-manager
- Detailed information can be found in the Kubeflow manifests.
Specific projects have other dependencies:
- Kubeflow Pipelines: Argo Workflows>=v3.1
- Kubeflow Trainer: JobSet>=v0.8.0
- Kubeflow Katib: MySQL>=v8.0
- Kubeflow Model Registry: MySQL>=v8.0
Kubeflow projects use a pluggable system of IAM that connects back to Kubernetes IAM in most cases. As a reference implementation, the optional Kubeflow Manifests/Dashboard components use the following IAM systems:
General security talks about Kubeflow, including an architectural introduction:
- KubeCon / Kubeflow Summit 2024 Lightning Talk.
- KubeCon 2023 Hardening Kubeflow Security for Enterprise Environments.
- Kubeflow Summit 2023 Security Working Group Update.
- Blog Hardening Kubeflow Security for Enterprise Environments.
Kubeflow projects can be self-hosted on any Kubernetes cluster, including air-gapped environments, which is critical for organizations that operate disconnected environments. Additionally, Kubeflow projects support multi-tenancy, which allows organizations to isolate workloads in a Kubernetes cluster. Multi-tenancy can be combined with Network Policies to restrict traffic between components by port or application name. Kubeflow projects support custom service accounts, allowing organizations to control how Kubeflow projects interact with the cluster by leveraging RBAC. Further, Platform Administrators can decide which users in Kubernetes will have access to the Kubeflow projects by creating different users and RBAC policies. Kubeflow Notebooks also supports profiles, which simplify permission management.
In terms of data management, data sovereignty is managed by those who deploy or package the projects.
Kubeflow projects are extensible, which allows users to meet their internal compliance requirements. As a result, specific common compliance frameworks (SOC-2, GDPR, etc.) are the responsibility of end users and vendors. However, we aim to provide a strong foundation through reference architectures and similar resources on which to build.
The end users can adjust the replicas. Kubeflow project controllers support leader election, for example Kubeflow Trainer.
The following table shows the resource requirements for each Kubeflow project, calculated as the maximum of actual usage and configured requests for CPU/memory, plus storage requirements from PVCs:
These maximums can look large, so in general read each figure as max(manifest requests, average usage).
| Component | CPU (millicores) | Memory (MiB) | Storage (GB) |
|---|---|---|---|
| Cert Manager | 3m | 130Mi | 0GB |
| Dex + OAuth2-Proxy | 3m | 28Mi | 0GB |
| Istio | 2300m | 3502Mi | 0GB |
| Kubeflow KServe | 600m | 1200Mi | 0GB |
| Kubeflow Katib | 9m | 471Mi | 13GB |
| Kubeflow Core | 35m | 841Mi | 0GB |
| Metadata | 78m | 687Mi | 30GB |
| Kubeflow Model Registry | 510m | 2112Mi | 0GB |
| Other | 20m | 354Mi | 6GB |
| Kubeflow Pipelines | 970m | 3552Mi | 90GB |
| Kubeflow Spark Operator | 4m | 41Mi | 0GB |
| Kubeflow Trainer | 3m | 25Mi | 0GB |
| Total | 4535m | 12943Mi | 139GB |
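The totals in the last row can be reproduced from the component rows. The sketch below copies the rows from the table, strips the unit suffixes, and sums each column; the helper name `strip_unit` is introduced here for illustration only.

```python
# Rows copied from the table above: (component, cpu, memory, storage).
ROWS = [
    ("Cert Manager", "3m", "130Mi", "0GB"),
    ("Dex + OAuth2-Proxy", "3m", "28Mi", "0GB"),
    ("Istio", "2300m", "3502Mi", "0GB"),
    ("Kubeflow KServe", "600m", "1200Mi", "0GB"),
    ("Kubeflow Katib", "9m", "471Mi", "13GB"),
    ("Kubeflow Core", "35m", "841Mi", "0GB"),
    ("Metadata", "78m", "687Mi", "30GB"),
    ("Kubeflow Model Registry", "510m", "2112Mi", "0GB"),
    ("Other", "20m", "354Mi", "6GB"),
    ("Kubeflow Pipelines", "970m", "3552Mi", "90GB"),
    ("Kubeflow Spark Operator", "4m", "41Mi", "0GB"),
    ("Kubeflow Trainer", "3m", "25Mi", "0GB"),
]


def strip_unit(value, unit):
    """Convert a quantity such as '130Mi' to the integer 130."""
    if not value.endswith(unit):
        raise ValueError(f"{value!r} does not end with {unit!r}")
    return int(value[: -len(unit)])


total_cpu = sum(strip_unit(cpu, "m") for _, cpu, _, _ in ROWS)
total_mem = sum(strip_unit(mem, "Mi") for _, _, mem, _ in ROWS)
total_storage = sum(strip_unit(disk, "GB") for _, _, _, disk in ROWS)

print(f"CPU: {total_cpu}m, Memory: {total_mem}Mi, Storage: {total_storage}GB")
# The same totals in larger units, for capacity planning.
print(f"{total_cpu / 1000:.3f} cores, {total_mem / 1024:.1f} GiB")
```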
Describe the project’s storage requirements, including its use of ephemeral and/or persistent storage
Each project can configure storage in different ways. This is also true of the manifests which configure storage for a collection of projects. This can be seen here.
Various Kubeflow projects offer APIs and Python SDKs. See the following sets of reference documentation:
- Pipelines reference docs for the Kubeflow Pipelines API and SDK, including the Kubeflow Pipelines domain-specific language (DSL).
- Kubeflow SDK to interact with Kubeflow APIs with a Python-native interface.
See also Kubeflow APIs and SDKs.
Kubeflow CRDs follow Kubernetes best practices for API changes.
Kubeflow APIs and SDKs documentation.
Some defaults can be seen in the Kubeflow manifests.
Additionally, API defaults are shown in the API docs, such as: Kubeflow Trainer or Kubeflow Katib.
Some public and listed distributions have their own ways to install the Kubeflow AI platform.
Describe any new or changed API types and calls—including to cloud providers—that will result from this project being enabled and used
Deploying Kubeflow manifests and individual projects results in exposing new APIs and the possibility to call configured 3rd party APIs if integrations exist. These all have to be explicitly set by the users.
Describe compatibility of any new or changed APIs with API servers, including the Kubernetes API server
Kubeflow follows the same practice for API compatibility as Kubernetes.
Many Kubeflow projects use Kubernetes CRDs, and these resources follow the same deprecation policy as Kubernetes.
Every Kubeflow project follows its own release lifecycle, for example Kubeflow Trainer or Kubeflow Katib.
However, the community also maintains the Kubeflow AI reference platform releases, which install all Kubeflow projects together as an end-to-end AI platform.
Kubeflow projects can be installed as standalone applications or together using the Kubeflow Manifests or Kubeflow Distributions (public or private).
Distributions can verify the installation by following this guide or executing this test suite.
Kubeflow project docs also explain how platform admins should validate the installation of individual applications, for example Kubeflow Trainer.
Provide a link to the project’s cloud native security self assessment.
- Make security a design requirement.
Kubeflow projects are built with security as a fundamental concern. Kubeflow projects follow Kubernetes and cloud native security best practices. By leveraging Kubernetes Custom Resource Definitions (CRDs), Role-Based Access Control (RBAC), network policies, and pod security standards, Kubeflow projects integrate into secure cloud native environments. Additionally, Kubeflow projects reuse functionality from other cloud native tools like Istio, Cert-Manager, and Kubernetes, and inherit their security best practices.
Default configurations for Kubeflow projects follow best practices such as rootless containers and limited privileges.
- Applying secure configuration has the best user experience
Default configurations for Kubeflow projects follow best practices such as rootless containers and limited privileges. Users can migrate workloads to more secure configurations without breaking changes, leveraging Kubernetes’ declarative model.
- Selecting insecure configuration is a conscious decision
Insecure options require explicit user setup and configurations.
Describe how each of the cloud native principles apply to your project
Kubeflow uses cloud native principles by building on Kubernetes and other cloud native technologies while extending them in our composable projects.
Kubeflow projects use containers as a fundamental unit of task/deployment. For example, every TrainJob uses separate containers to train models, allowing users to package their training code along with its dependencies in a containerized environment.
Kubeflow projects are designed to run natively on Kubernetes and define various CRDs for each AI workload, for example the TrainJob CR for model training or the Notebook CR for interactive development. They use native Kubernetes primitives like Deployments, Services, Job, JobSet, and CRDs to orchestrate AI workloads.
Kubeflow follows a loosely-coupled microservices architecture model, where each project can be deployed as a standalone service and be integrated into the desired AI platform on Kubernetes.
Kubeflow APIs are declarative and follow Kubernetes best practices. For instance, SparkApplication manages Spark jobs and TrainJob manages distributed training jobs.
Users can install Kubeflow Pipelines without multi-user isolation if they want to loosen the security of the deployment. Users are advised to follow the Kubeflow AI reference platform deployment mode to install the Kubeflow Pipelines control plane in the multi-tenant model. It provides isolation between Pipelines and Runs for users.
Describe the frameworks, practices and procedures the project uses to maintain the basic health and security of the project
- Robust CI/CD infrastructure: Kubeflow projects have automated unit, integration, and end-to-end tests that are integrated into the CI pipelines to ensure code stability and correctness of PRs.
- Dependency management: Kubeflow projects leverage lock files like `go.mod` and `pyproject.toml`, and are progressing towards Software Bill of Materials (SBOM) generation.
- Release stability: Kubeflow projects maintain release branches and release tags which allow maintainers to create patch releases to address bug fixes and security vulnerabilities. Also, projects maintain changelogs and roadmaps to ensure traceability and transparency for all changes.
- Static and Dynamic Code Analysis: Kubeflow projects’ CI integrates code linting and static and dynamic code analysis tools to enforce code quality and detect potential issues before they reach production.
- Secure Defaults: Default manifests and Helm charts are configured with security best practices, like pod security standards, rootless containers, and least privilege.
- Code Review Process: Kubeflow projects’ maintainers follow a strong review process with peer review and explicit approval before PRs are merged into the main branch.
- Open Governance: Kubeflow follows open governance with regular public meetings and communication channels to keep users aware of development and roadmap.
- Secure Image Build: Image builds are validated through CI/CD checks to ensure they are reproducible, secure, and consistent.
Some examples can be found here:
- Kubeflow Spark Operator
- Kubeflow Katib
- Kubeflow Trainer
- Kubeflow Model Registry
- Kubeflow Pipelines
- Kubeflow Manifests
Describe how the project has evaluated which features will be a security risk to users if they are not maintained by the project
Features are discussed and reviewed in the community, with security implications considered during design and code review processes. The project tracks the security health of dependencies and evaluates the impact of vulnerabilities or unmaintained projects. The project identifies potential risks by analyzing the attack surface, especially for features that interact with user-supplied code, custom containers, user supplied models or external data sources.
Each Kubeflow project's control plane needs to watch for its CRDs. For example, Kubeflow Trainer watches TrainJob and ClusterTrainingRuntime, Spark Operator watches SparkApplications, Katib watches Experiments, etc. Controllers have minimal privileges to orchestrate their CRDs and the corresponding resources that need to be created.
Kubeflow control plane is installed cluster-wide, and it requires RBAC to manage namespace-scoped resources.
The default controllers run with PodSecurityStandards restricted and rootless containers, minimizing the risk of privilege escalation.
Users who interact with the Kubeflow SDK get namespace-scoped permission to interact with required resources. For example, the users’ role for Kubeflow Trainer can be found here.
All roles are aggregated in the kubeflow-edit cluster role after installing Kubeflow Manifests.
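To illustrate what running under the restricted Pod Security Standard implies for a container, the sketch below checks a container `securityContext` against a simplified subset of that profile. It is not the full standard: Kubernetes also allows some of these fields to be set at the Pod level, and the profile contains further rules (volume types, host namespaces, and others) that are not covered here. The function name is hypothetical.

```python
def restricted_violations(security_context):
    """Return the restricted-profile rules (subset) that the context fails."""
    violations = []
    if security_context.get("allowPrivilegeEscalation") is not False:
        violations.append("allowPrivilegeEscalation must be false")
    if security_context.get("runAsNonRoot") is not True:
        violations.append("runAsNonRoot must be true")
    dropped = security_context.get("capabilities", {}).get("drop", [])
    if "ALL" not in dropped:
        violations.append("capabilities.drop must include ALL")
    seccomp = security_context.get("seccompProfile", {}).get("type")
    if seccomp not in ("RuntimeDefault", "Localhost"):
        violations.append("seccompProfile.type must be RuntimeDefault or Localhost")
    return violations


# A hardened, rootless container security context.
hardened = {
    "allowPrivilegeEscalation": False,
    "runAsNonRoot": True,
    "capabilities": {"drop": ["ALL"]},
    "seccompProfile": {"type": "RuntimeDefault"},
}

print(restricted_violations(hardened))  # no violations expected
print(restricted_violations({"runAsNonRoot": True}))
```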
Describe how the project is handling certificate rotation and mitigates any issues with certificates
Kubeflow projects use cert-manager to generate and rotate certificates. Those certificates are used for validation and mutation webhooks across Kubeflow projects. Cert-manager handles the issuance, renewal, and rotation of these certificates automatically without downtime.
Describe how the project is following and implementing secure software supply chain best practices
- Automated CI/CD infrastructure: Kubeflow projects have automated unit, integration, and end-to-end tests that are integrated into the CI pipelines to ensure code stability and reduce the risk of introducing vulnerabilities.
- Vulnerability Scanning: Kubeflow uses vulnerability scanning tools like Dependabot and Trivy to identify and address known CVEs in dependencies.
- Dependency management: Kubeflow projects leverage lock files like `go.mod` and `pyproject.toml`, and are progressing towards Software Bill of Materials (SBOM) generation.
- Code Review Process: Kubeflow projects’ maintainers follow a strong review process with peer review and explicit approval before PRs are merged into the main branch.
- DCO Check: Committers are required to sign and comply with the Developer Certificate of Origin (DCO) to affirm the legitimacy and authorship of their contributions.
- Branch Protection: The project enforces branch protection rules to prevent unauthorized changes, enforce status checks, require pull request reviews, and control who can push to protected branches.
- Prevent Secrets in Source Code: The project proactively prevents committing secrets to the source code by using tools and GitHub workflows that detect sensitive data in PRs.
- License compliance: Kubeflow uses the FOSSA product provided by the CNCF to scan for license compliance on a regular and ongoing basis.
Each project has its own standalone installation guide.
Kubeflow projects are packaged by distributions and vendors into platform products which can be used to install our tools.
We provide optional manifests to deploy all Kubeflow projects as the Kubeflow AI reference platform.
How can this project be enabled or disabled in a live cluster? Please describe any downtime required of the control plane or nodes
Users can set the replica count to 0 in the deployment of a Kubeflow project. Existing AI workloads should not be impacted since they won’t be reconciled by controllers.
Updating the Kubeflow AI Reference Platform.
The installation guide describes when the control plane is ready.
Enable projects by scaling replica count back to 1. All running AI workloads will be reconciled again by controllers, and they perform the appropriate updates to the CRDs.
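The disable and enable steps described above amount to two `kubectl scale` invocations. The following sketch only builds and prints the commands. The deployment name and namespace are assumptions based on a Kubeflow Trainer installation and will differ for other projects or custom installs.

```python
import shlex


def scale_command(deployment, namespace, replicas):
    """Build the kubectl argument list that scales a controller deployment."""
    return [
        "kubectl", "scale", "deployment", deployment,
        "--namespace", namespace,
        f"--replicas={replicas}",
    ]


# Assumed names for illustration; check your cluster with
# `kubectl get deployments --all-namespaces`.
DEPLOYMENT = "kubeflow-trainer-controller-manager"
NAMESPACE = "kubeflow-system"

disable = scale_command(DEPLOYMENT, NAMESPACE, 0)  # stop reconciliation
enable = scale_command(DEPLOYMENT, NAMESPACE, 1)   # resume reconciliation

print(shlex.join(disable))
print(shlex.join(enable))
# To run a command for real: subprocess.run(disable, check=True)
```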
Istio, Knative, cert-manager might interfere with existing installations.
The conformance program is a work in progress to ensure tests across all Kubeflow projects.
The control plane of a Kubeflow project can be deleted by cleaning up the appropriate resources, for example:

    kubectl delete -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"

The command will clean up all CRDs and the control plane deployment. Since Kubeflow CRDs maintain ownerReference for associated resources, all resources will be removed after deleting the actual CRDs. For example:

    kubectl delete trainjob --all --all-namespaces

Check this guide to clean up the Kubeflow AI reference platform.
How does the project intend to provide and maintain compatibility with infrastructure and orchestration management tools like Kubernetes and with what frequency
Kubeflow projects publish their supported Kubernetes versions for every release. The supported versions are evaluated and upgraded on every release. We support Kubernetes 1.31+ and test on 1.32+.
Some projects support leader election and HA to make sure that long-running workloads complete. Users can scale controller replicas up or down during rollback.
A newer revision may fail to become active for many reasons, including insufficient cluster resources for the newer spec or an invalid configuration spec. Traffic will not be switched to the new revision unless it is ready to accept traffic. Hence, already running workloads will not be affected if a rollout fails.
Kubeflow project CRDs expose status that reports the activity of AI workloads. Additionally, controllers expose various Prometheus metrics to indicate workload status.
Explain how upgrades and rollbacks were tested and how the upgrade->downgrade->upgrade path was tested
Currently, this is tested manually by users, but automated tests are a work in progress.
All API changes are backward compatible, and changes are announced in the release notes and changelogs. If some APIs are deprecated, newer versions of the APIs are introduced with user awareness. Kubeflow CRDs follow Kubernetes best practices for API compatibility.
For example, see the Kubeflow Trainer breaking changes.
At the moment, functionality is logged explicitly if a feature is alpha or beta. We do not have feature gates. Additionally, we are working on a conversion webhook that helps users with updated API versions of CRDs.
In general, the API object count complexity is linear in the number of users and workloads. Users can list deployed Kubeflow CRDs across all namespaces.
For example, to list all TrainJobs from Kubeflow Trainer, run:

    kubectl get trainjob --all-namespaces

The core controllers and resources are constant (beyond replicating specific controllers), see the full kustomize build here.
Describe how the project defines Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
Kubeflow currently doesn’t provide SLOs and SLIs; however, we follow the Kubernetes SLOs.
As above.
Describe the increase in resource usage in any components as a result of enabling this project, to include CPU, Memory, Storage, Throughput
Resource requirements for Kubeflow projects are set here.
Describe which conditions enabling / using this project would result in resource exhaustion of some node resources
Each project has its own specific issues (e.g. the Pipelines controller can run into open file handle issues). Since many workloads are GPU-intensive, Kubernetes platform admins need to ensure that nodes have enough capacity to run AI workloads with Kubeflow projects.
Some intensive load testing has been performed by Kubeflow users. For example, some LLM foundation models have been trained using Kubeflow Trainer across a few hundred GPUs. At KubeCon, some users shared their experience of running more than 10,000 GPUs with Kubeflow Training Operator.
Describe the recommended limits of users, requests, system resources, etc. and how they were obtained
Users on the order of hundreds are known to work. Scaling beyond hundreds of users may require increasing the requests/limits or replicas of some deployments, or scaling dependencies like databases.
Kubeflow uses the Kubernetes resilience pattern to manage controllers with HA. If one replica fails, another part of the control plane takes responsibility to orchestrate workloads.
Describe the signals the project is using or producing, including logs, metrics, profiles and traces. Please include supported formats, recommended configurations and data storage
Kubeflow controllers expose Prometheus metrics to report workload status. Controllers also expose logs and status for platform admins to ensure stability.
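As a minimal sketch of how such metrics can be consumed, the snippet below parses text in the Prometheus exposition format into a dictionary. The metric name, labels, and values in `SAMPLE` are made up for illustration and are not output from any Kubeflow controller. The parser is deliberately simplified and does not handle optional timestamps or label values containing spaces.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical sample; real metric names depend on the controller.
SAMPLE = """\
# HELP example_reconcile_total Total reconciliations (illustrative).
# TYPE example_reconcile_total counter
example_reconcile_total{controller="trainjob",result="success"} 42
example_reconcile_total{controller="trainjob",result="error"} 3
"""

samples = parse_metrics(SAMPLE)
errors = samples['example_reconcile_total{controller="trainjob",result="error"}']
total = sum(samples.values())
print(f"error ratio: {errors / total:.3f}")
```

In practice a Prometheus server scrapes and stores these series; parsing by hand is mainly useful for quick checks against a controller's metrics endpoint.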
Platform admins can leverage Kubernetes audit for Kubeflow projects.
Kubeflow Dashboard requirements.
Describe how the project surfaces project resource requirements for adopters to monitor cloud and infrastructure costs, e.g. FinOps
That must happen on the Kubernetes namespace level.
Users are recommended to use third-party tools like Kubecost to measure cloud and infrastructure cost of running Kubeflow projects.
Which parameters is the project covering to ensure the health of the application/service and its workloads
Most projects use Kubernetes Liveness and Readiness probes.
For example, see the Kubeflow Trainer controller manager.
- Check the Pods in `kubeflow-profile` labeled namespaces.
- Check the CRDs in the user’s namespaces.
- Check the Kubeflow Dashboard resources.
- Kubeflow Manifests guide.
- Run the Kubeflow manifest test suite.
- Run the upstream examples, for example Kubeflow Trainer PyTorch training.
Most projects depend on Cert-Manager>=v1.16 and Istio>=v1.26.
Specific projects have other dependencies:
- Kubeflow Pipelines: Argo Workflows>=v3.1
- Kubeflow Trainer: JobSet>=v0.8.0
- Kubeflow Katib: MySQL>=v8.0
- Kubeflow Model Registry: MySQL>=v8.0
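The minimum versions listed above can be checked mechanically. The sketch below compares an installed version string against those minimums; the dictionary keys and the installed versions in the usage lines are hypothetical, and pre-release suffixes such as `-rc.1` are not handled.

```python
# Minimum versions taken from the dependency lists above.
MINIMUMS = {
    "cert-manager": "v1.16",
    "istio": "v1.26",
    "argo-workflows": "v3.1",
    "jobset": "v0.8.0",
    "mysql": "v8.0",
}


def parse_version(version):
    """Turn 'v1.26.3' into (1, 26, 3)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def meets_minimum(name, installed):
    """Return True if the installed version satisfies the listed minimum."""
    minimum = parse_version(MINIMUMS[name])
    found = parse_version(installed)
    # Pad with zeros so that 'v1.16' and 'v1.16.0' compare as equal.
    width = max(len(minimum), len(found))
    minimum += (0,) * (width - len(minimum))
    found += (0,) * (width - len(found))
    return found >= minimum


# Hypothetical installed versions.
print(meets_minimum("istio", "v1.26.0"))
print(meets_minimum("jobset", "v0.7.1"))
```

Comparing integer tuples rather than strings avoids the classic pitfall where `"v1.9"` sorts after `"v1.16"`.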
For details, please take a look at General security Talks about Kubeflow including an architectural introduction.
We follow Kubernetes deprecation policy for supported versions. For example, we support the 3-4 latest versions of Kubernetes to deploy Kubeflow projects.
How does the project incorporate and consider source composition analysis as part of its development and security hygiene? Describe how this source composition analysis (SCA) is tracked
Various static and dynamic code analysis tools are enforced as described above.
Describe how the project implements changes based on source composition analysis (SCA) and the timescale
Various static and dynamic code analysis tools are enforced as described above.
Specifics depend on the Kubeflow Project (KFP, Katib, etc.). Kubeflow projects are cloud and Kubernetes native apps, so fault tolerance is strongly tied to the health of the underlying cluster. In practice user workloads sometimes have a retry policy and the Kubeflow core services should just recover automatically.
More details are described in the previous sections.
Each Kubeflow project handles failure modes differently beyond native Kubernetes fault tolerance. Many of them are configured at the application level in user code.
As described above, each Kubeflow project requires limited RBAC to manage its CRDs.
Described in the section above.
How does the project ensure its security reporting and response team is representative of its community diversity (organizational and individual)
Every Kubeflow project follows security guidelines.
- Kubeflow Spark Operator security policy.
- Kubeflow Notebooks security policy.
- Kubeflow Trainer security policy.
- Kubeflow Katib security policy.
- Kubeflow Model Registry security policy.
- Kubeflow Pipelines security policy.
Active project maintainers are responsible for ensuring that security reports are addressed. Each project has its own security reporting and disclosure policy. For projects which use GitHub’s security disclosure system, access control is managed by having write access to the relevant GitHub repository.
