diff --git a/projects/kubeflow/tech-review/2026-04-24.md b/projects/kubeflow/tech-review/2026-04-24.md new file mode 100644 index 000000000..910b872b4 --- /dev/null +++ b/projects/kubeflow/tech-review/2026-04-24.md @@ -0,0 +1,965 @@ +# General Technical Review - Kubeflow / Graduation + +- **Project:** Kubeflow + +- **Project Version:** Every Kubeflow sub-project has its own version. + +- **Website:** https://www.kubeflow.org/ + +- **Date Updated**: 2025-09-05 + +- **Template Version**: v1.0 + +- **Description**: + +Kubeflow is the foundation of tools for AI Platforms on Kubernetes. + +AI platform teams can build on top of Kubeflow by using each project independently or deploying the +entire AI reference platform to meet their specific needs. The Kubeflow AI reference platform is +composable, modular, portable, and scalable, backed by an ecosystem of Kubernetes-native projects +that cover every stage of the +[AI lifecycle.](https://www.kubeflow.org/docs/started/architecture/#kubeflow-projects-in-the-ai-lifecycle) + +Whether you’re an AI practitioner, a platform administrator, or a team of developers, +Kubeflow offers modular, scalable, and extensible tools to support your AI use cases. + +## What are Kubeflow Projects + +Kubeflow is composed of multiple open source projects that address different aspects of the AI +lifecycle. These projects are designed to be usable both independently and as part of the Kubeflow +AI reference platform. This provides flexibility for users who may not need the full end-to-end +AI platform capabilities but want to leverage specific functionalities, such as model training or +model serving. + +### Kubeflow Projects in scope for CNCF Graduation + +- Kubeflow Spark Operator +- Kubeflow Notebooks +- Kubeflow Trainer +- Kubeflow Katib +- Kubeflow Model Registry +- Kubeflow Pipelines + +## What is the Kubeflow AI Reference Platform + +The Kubeflow AI reference platform refers to the full suite of Kubeflow projects bundled together +with additional integration and management tools. Kubeflow AI reference platform deploys the +comprehensive toolkit for the entire AI lifecycle. The Kubeflow AI reference platform can be +installed via [Packaged Distributions](https://www.kubeflow.org/docs/started/installing-kubeflow/#packaged-distributions) +or [Kubeflow Manifests](https://www.kubeflow.org/docs/started/installing-kubeflow/#kubeflow-manifests). + +## Day 0 - Planning Phase + +### Scope + +#### Describe the roadmap process + +Kubeflow projects are managed by community members that are part of working groups. +Each working group defines and agrees on the features for each release. The release cadence for +each working group varies according to community agreement among the working groups. + +Then, all Kubeflow projects have individual roadmap files in the Git repository defining each +release and available to the public. This ensures we have a standard structure for each proposed +feature, auditing, versioning, and transparency, since it is recorded along with the history in the Git repo. + +For more information, check ROADMAP for each Kubeflow Project: + +- [Kubeflow Spark Operator](https://github.com/kubeflow/spark-operator/blob/master/ROADMAP.md) +- [Kubeflow Trainer](https://github.com/kubeflow/trainer/blob/master/ROADMAP.md) +- [Kubeflow Katib](https://github.com/kubeflow/katib/blob/master/ROADMAP.md) +- [Kubeflow Hub](https://github.com/kubeflow/hub/blob/main/ROADMAP.md) +- [Kubeflow Pipelines](https://github.com/kubeflow/pipelines/blob/master/ROADMAP.md) + +Community-wide changes are proposed as [Kubeflow Enhancement proposals (KEPs)](https://github.com/kubeflow/community/tree/master/proposals) +in the `kubeflow/community` repository or in the [Kubeflow sub-projects KEPs](https://github.com/kubeflow/trainer/tree/master/docs/proposals). + +#### Describe the target persona or user(s) for the project + +1. Users: Data Scientists, ML Engineer, AI Practitioners, Data Engineers, AI Practitioners. +2. Operators: MLOps Engineers, AIOps engineers, Platform Engineers, AI Platform Engineers. +3. Vendors: Vendors and projects building Kubernetes based AI Platform products. + +#### Explain the primary use case for the project. What additional use cases are supported by the project + +The goal of Kubeflow is to run Cloud Native AI workloads for every stage in AI lifecycle. By +using Kubeflow projects users can develop and deploy AI applications. + +The primary use-cases include: + +- Large-scale data processing and feature engineering. +- Distributed pre-training of foundation models. +- Post-training and fine-tuning of LLMs. +- Hyperparameter optimization and model tuning. +- LLM inference and multi-host serving. + +Additional Use Cases: + +- End-to-End GenAI pipeline building. +- Interactive AI development. +- Multiple users / projects with hard multi-tenancy on the same cluster. + +#### Explain which use cases have been identified as unsupported by the project + +As Kubeflow is composed of multiple projects, each working group makes its own determinations as t +what will be excluded from them. However we have an overarching theme and governance structure +(Steering Committee) that has identified the following areas as not being a priority for all projects: + +- The projects are deployed in any Kubernetes (each release will specify tested versions), + regardless of the underlying infrastructure, independently through Kubernetes manifests leveraging + Kustomize and/or Helm Charts. However, the project doesn’t provide an implementation to be deployed + on infrastructure besides Kubernetes. - We do not officially enforce a deployment method or distribution. + +- Kubeflow doesn’t provide a GitOps implementation, however Kubeflow manifests can be integrated + into a GitOps solution. For example, Platform Engineers can create an ArgoCD Application (CRD) + to install and configure Kubeflow projects. by providing Kubeflow individual project manifests, + for example Pipelines. The GitOps application will read from the Kubeflow Pipelines manifest and + Argo CD will deploy the configurations in the target cluster. + +#### Describe the intended types of organizations who would benefit from adopting this project + +Kubeflow is intended to be used by any organization which needs to run AI workloads on Kubernetes, +in any of the AI lifecycle stages an organization might choose to use just one or more projects, +such as Pipelines or Training. Organizations can also use all the projects from Kubeflow to increase +the user experience to build AI workloads. Additionally, organizations can develop their own +customizations on top of the Kubeflow platform or choose to build distributions to help other +organizations adopt a customized platform based on Kubeflow projects. + +As Kubeflow maintains flexibility, organizations can choose their own path according to their needs. + +We encourage adopters to be part of the adopters list in [GitHub](https://github.com/kubeflow/community/blob/master/ADOPTERS.md). + +Examples of organizations that use Kubeflow: + +- AWS +- Red Hat +- Capital One +- CERN +- Google +- Alibaba Cloud +- Bloomberg +- IBM +- Cisco +- Huawei +- Microsoft +- Tencent +- DHL Data & Analytics +- Telia +- Roblox +- Toyota +- PepsiCo +- Volvo +- and others… + +#### Describe any completed end user research and link to any reports. + +We regularly undertake user research about Kubeflow and its users. Research could be done by +working groups, during events, or by conducting at least one study annually. We ensure these surveys +are visible to the public and shared with the community. + +Here are some results from previous years. + +- [Kubeflow 2025 Survey](https://docs.google.com/forms/d/11cSe5vmGLrGekJISHBMfjVh_97WFGuhcvGnd0l5aNLg/edit#responses) +- [2025:UX designers supporting Model Registry conducted a series of user sessions to understand preferred interaction patterns (link)](https://docs.google.com/forms/d/11cSe5vmGLrGekJISHBMfjVh_97WFGuhcvGnd0l5aNLg/edit#responses) +- [Kubeflow Survey 2024](https://github.com/kubeflow/community/issues/708) +- [Kubeflow Survey 2023](https://blog.kubeflow.org/kubeflow-user-survey-2023/) +- [Kubeflow Survey 2022](https://blog.kubeflow.org/kubeflow-user-survey-2022/) + +### Usability + +#### How should your target personas interact with your project + +AI Practitioner - Kubeflow SDK, Kubeflow UIs + +Platform Admins - Operator guides, installing and configuring any Kubeflow projects via Helm charts +or Kustomize manifests predefined and available in the Kubeflow documentation from the command line. + +- [Getting Started guide](https://www.kubeflow.org/docs/started/introduction/). +- [How to install Kubeflow projects](https://www.kubeflow.org/docs/started/installing-kubeflow/). +- [How to contribute to Kubeflow](https://www.kubeflow.org/docs/about/contributing/). + +#### Describe the user experience (UX) and user interface (UI) of the project + +Kubeflow user experience in each project is a collection of projects, the user experience for the +projects are each with their own [interfaces, APIs and SDKs](https://www.kubeflow.org/docs/started/architecture/#kubeflow-interfaces). + +##### Describing User Experience through SDK + +We are working on a unified [Kubeflow SDK](https://github.com/kubeflow/sdk) that gives +AI practitioners Python-native experience to interact with Kubeflow APIs. + +Through Kubeflow SDK, users will be able to interact with the different projects, increasing user experience and reducing complexity. + +
+ Kubeflow SDK Personas +
+ +Example to interact with Kubeflow Trainer using Kubeflow SDK: + +```py +from kubeflow.trainer import TrainerClient, BuiltinTrainer, TorchTuneConfig + +# Fine-Tune Llama 3.2 +job_id = TrainerClient().train( + runtime=TrainerClient().get_runtime(name="torchtune-llama3.2-1b"), + trainer=BuiltinTrainer( + config=TorchTuneConfig( + resources_per_node={ + "memory": "200G", + "gpu": 1, + }, + ) + ), +) + +# Wait for TrainJob to complete +TrainerClient().wait_for_job_status(job_id) + +# Print TrainJob logs +print("\n".join(TrainerClient().get_job_logs(name=job_id))) +``` + +Example to interact with Kubeflow Katib using Kubeflow SDK + +```py +from kubeflow.optimizer import OptimizerClient, search +from kubeflow.trainer import TrainerClient, BuiltinTrainer, TorchTuneConfig + +OptimizerClient().optimize( + objective="loss", + mode="min", + num_trials=5, + trainer=BuiltinTrainer( + config=TorchTuneConfig( + resources_per_node={ + "memory": "200G", + "gpu": 1, + }, + lr=search("0.1", "0.2", "lognormal"), + num_epochs=search("4", "8", "uniform"), + ) + ), + runtime=TrainerClient().get_runtime(name="torchtune-llama3.2-1b"), +) +``` + +Find more details and examples about [Kubeflow SDK](https://github.com/kubeflow/sdk). + +##### Describing User Experience through APIs + +- Kubeflow Pipelines + +REST API is available under the `/pipeline/` HTTP path. For example, if you host Kubeflow at +https://kubeflow.example.com, the API will be available at https://kubeflow.example.com/pipeline/. + +The API is documented using swagger and examples about its usage +[can be found here](https://www.kubeflow.org/docs/components/pipelines/reference/api/kubeflow-pipeline-api-spec/) + +- Kubeflow Model Registry + +Model Registry REST API is available under the `/api/model_registry/` HTTP path. More information +[can be found here](https://www.kubeflow.org/docs/components/hub/reference/rest-api/) + +The Model Catalog Service provides a read-only discovery service for ML models across multiple +catalog sources. It acts as a federated metadata aggregation layer, allowing users to search and +discover models from various external catalogs through a unified REST API. + +##### Describing User Interfaces through web UI + +[Kubeflow Central Dashboard](https://www.kubeflow.org/docs/components/central-dash/overview/) +acts as a hub for the AI platform and tools by exposing the UIs of components running in the cluster. + +
+ Kubeflow Dashboard +
+ +End users can access the Central Dashboard installed version on their organization according to +the user and access control setup previously by Platform Admin. Users can access Kubeflow projects +such as Kubeflow Pipelines (KFP) to manage experiments, pipeline definitions, runs, and recurrent +runs. KFP Artifacts to track artifacts produced by pipelines stored in MLMD. KFP Executions to +track executions of pipeline components stored in MLMD. Kubeflow Katib experiments to manage +AutoML experiments. Kubeflow Notebooks to manage Kubeflow Notebooks. Kubeflow TensorBoards to +manage TensorBoard instances. Kubeflow Volumes to manage Kubernetes PVC Volumes. Contributors page +to manage contributors of profiles (namespaces) that you own. + +More information can be found [in the Kubeflow Dashboard docs](https://www.kubeflow.org/docs/components/central-dash/overview/). + +#### Describe how this project integrates with other projects in a production environment + +All Kubeflow projects are Kubernetes native and so fit into the wider ecosystem of Kubernetes +based tools. Kubeflow projects extensively use tools from the cloud native ecosystem, +including, Argo Workflow, Istio, Helm, Knative, KServe, JobSet, Kueue, and other projects. + +Specific components integrate with popular AI frameworks where applicable (e.g. Kubeflow Trainer +integrates with PyTorch and other model frameworks) + +The following diagram gives an overview of the +[Kubeflow Ecosystem](https://www.kubeflow.org/docs/started/architecture/#kubeflow-ecosystem) +and how it relates to the wider Kubernetes AI landscapes. + +
+ Kubeflow Ecosystem +
+ +Specifically for a production environment, Kubeflow can integrate and leverage the following +resources and projects: + +- Cloud Providers and underlying infrastructure + - Kubeflow can leverage the underlying infrastructure provided through Kubernetes, such as access + hardware accelerators from Intel, Nvidia, and AMD through Kubernetes Operators. + - Organizations can run Kubeflow projects on any Kubernetes platform and use cloud provider + services, such as scalability across different nodes, taking advantage of projects such as Open Cluster Management. +- Configuration as code + - GitOps approach can be used to set up Kubeflow setup and configuration, for example, using + Argo CD to manage configuration as code in multiple clusters. +- Security + - Security is becoming a key aspect, Kubeflow projects implement security best practices in + Kubernetes including Pod Security Standards, Network policies, RBAC. Additionally, they can + leverage projects such as the External Secrets Operator. + - Data Scientists can also use Sigstore projects to sign and validate their models to avoid type + of attacks such as tampering with Models to ensure the integrity and authenticity of models + persists during the Model Development Lifecycle. +- Simplify User experience + - Organizations can simplify Data scientists' experience by scaling Platform Engineers by + providing software templates through Backstage to access environments quickly, leveraging + Kubeflow projects such as Notebooks. + +### Design + +#### Explain the design principles and best practices the project is following + +- Loosely Coupled and Distributable Services + - Built as modular, independent microservices that can scale and evolve. + - Services communicate through well defined APIs, enabling flexible orchestration and integration. +- Kubernetes native with Declarative APIs and Automation + - Uses Kubernetes CRD for each project. + - Integrate natively with Kubernetes core APIs like Jobs, Pods, Deployment. + - Supports automated deployment, orchestration, and CI/CD workflows. + - Enables GitOps management for reproducible and auditable operations. +- Portability and Platform-Agnosticism + - Designed to run on any Kubernetes cluster, including onprem, edge, public cloud or hybrid environments. + - Avoids vendor lock-in while ensuring consistent behavior across different infrastructures. +- Observability + - Provides metrics, logs, and traces to monitor system performance and workflow execution. + - Enables debugging, performance analysis, and monitoring of workloads and processes. +- Resilience and Availability + - Leverages platform features such as health checks, automatic restarts, and scaling to maintain high availability. + - Designed to tolerate failures and minimizing disruption to workloads. +- Security and Multi-Tenancy + - Implements access control, isolation, and resource quotas to separate workloads and users. + - Supports secure communication and policy enforcement to maintain a safe multi-tenant environment. +- Extensibility and Interoperability + - Provides APIs and runtime contracts to allow integration of new tools, frameworks, or workflows. + - Supports extension and customization without breaking existing functionality. + +Kubeflow overall has strong [contributing guidelines](https://www.kubeflow.org/docs/about/contributing/) +that inform our development and design. Each project also provides new contributors specific +guidance on their GitHub repositories: + +- [Kubeflow Spark Operator](https://github.com/kubeflow/spark-operator/blob/master/CONTRIBUTING.md) +- [Kubeflow Notebooks](https://github.com/kubeflow/notebooks/blob/main/CONTRIBUTING.md) +- [Kubeflow Trainer](https://github.com/kubeflow/trainer/blob/master/CONTRIBUTING.md) +- [Kubeflow Katib](https://github.com/kubeflow/katib/blob/master/CONTRIBUTING.md) +- [Kubeflow Model Registry](https://github.com/kubeflow/model-registry/blob/main/CONTRIBUTING.md) +- [Kubeflow Pipelines](https://github.com/kubeflow/pipelines/blob/master/CONTRIBUTING.md) + +For the new features every Kubeflow project follows +[the KEP guidelines](https://github.com/kubeflow/community/tree/master/proposals#kep-format-to-propose-and-document-enhancements-to-kubeflow) + +#### Outline or link to the project’s architecture requirements + +We do follow the same practices as Kubernetes for APIs, with alpha, beta, and stable status of APIs. +Details [are described here](https://www.kubeflow.org/docs/started/support/#component-status). + +#### Define any specific service dependencies the project relies on in the cluster + +The following projects are required to install Kubeflow projects: + +- Istio, Knative, Cert-manager +- Detailed information [can be found in the Kubeflow manifests](https://github.com/kubeflow/manifests#kubeflow-components-versions). + +Specific projects have other dependencies: + +- Kubeflow Pipelines: Argo Workflows>=v3.1 +- Kubeflow Trainer: JobSet>=v0.8.0 +- Kubeflow Katib: MySQL>=v8.0 +- Kubeflow Model Registry: MySQL>=v8.0 + +#### Describe how the project implements Identity and Access Management + +Kubeflow projects use a pluggable system of IAM that connects back to Kubernetes IAM in most cases. +As a reference implementation, the optional Kubeflow Manifests/Dashboard components use the +following IAM systems: + +- [OAuth2 Proxy](https://github.com/kubeflow/manifests#oauth2-proxy) +- [Kubeflow manifest with Dex](https://github.com/kubeflow/manifests#dex) + +General security Talks about Kubeflow including an architectural introduction: + +- [KubeCon / Kubeflow Summit 2024 Lightning Talk](https://youtu.be/f7C3v0gJRVI). +- [KubeCon 2023 Hardening Kubeflow Security for Enterprise Environments](https://youtu.be/wQaOa4Kjxs0). +- [Kubeflow Summit 2023 Security Working Group Update](https://youtu.be/XKHVt2yxQFo). +- [Blog Hardening Kubeflow Security for Enterprise Environments](https://blogs.vmware.com/opensource/2023/06/20/hardening-kubeflow-security-for-enterprise-environments-2/). + +#### Describe how the project has addressed sovereignty + +Kubeflow’s projects can be self-hosted on any Kubernetes cluster, including air-gapped environments, +which is critical for those organizations using disconnected environments. Additionally, +Kubeflow projects support multi-tenancy, which allows organizations to isolate workloads in a +Kubernetes cluster, which can be used with Network Policies to restrict isolation from different +components by ports or application name. Kubeflow projects support custom service accounts, +allowing organizations to control how Kubeflow projects interact with the cluster by leveraging RBAC. +Further, Platform Administrators can decide which users in Kubernetes will have access to the +Kubeflow projects by creating different users and RBAC policies. Kubeflow Notebooks also supports +profiles, which allows simplifying the permission management. + +In terms of data management, data sovereignty is managed by those who deploy or package the projects. + +#### Describe any compliance requirements addressed by the project + +Kubeflow projects are extensible which allows users to fit their internal compliance requirements. +As a result, specific common compliance frameworks (SOC-2, GDPR, etc.) are the responsibility of +end users and vendors. However, we aim to provide a strong foundation through reference architectures +similar things from which to build on. + +#### Describe the project’s High Availability requirements + +The end users can adjust the replicas. Kubeflow project controllers support leader election, +[for example Kubeflow Trainer](https://github.com/kubeflow/trainer/blob/master/cmd/trainer-controller-manager/main.go#L80). + +#### Describe the project’s resource requirements, including CPU, Network and Memory + +The following table shows the resource requirements for each Kubeflow project, calculated as the +maximum of actual usage and configured requests for CPU/memory, plus storage requirements from PVCs: + +The maximum looks hefty so rather consider in general the maximum (manifest requests, average usage). + +| Component | CPU (cores) | Memory (Mi) | Storage (GB) | +| ----------------------- | ----------- | ----------- | ------------ | +| Cert Manager | 3m | 130Mi | 0GB | +| Dex + OAuth2-Proxy | 3m | 28Mi | 0GB | +| Istio | 2300m | 3502Mi | 0GB | +| Kubeflow KServe | 600m | 1200Mi | 0GB | +| Kubeflow Katib | 9m | 471Mi | 13GB | +| Kubeflow Core | 35m | 841Mi | 0GB | +| Metadata | 78m | 687Mi | 30GB | +| Kubeflow Model Registry | 510m | 2112Mi | 0GB | +| Other | 20m | 354Mi | 6GB | +| Kubeflow Pipelines | 970m | 3552Mi | 90GB | +| Kubeflow Spark Operator | 4m | 41Mi | 0GB | +| Kubeflow Trainer | 3m | 25Mi | 0GB | +| **Total** | **4535m** | **12943Mi** | **139GB** | + +#### Describe the project’s storage requirements, including its use of ephemeral and/or persistent storage + +Each project can configure storage in different ways. This is also true of the manifests which +configure storage for a collection of projects. +This [can be seen here](https://github.com/kubeflow/manifests/pull/3091). + +#### Please outline the project’s API Design + +Various Kubeflow projects offer APIs and Python SDKs. See the following sets of reference documentation: + +- [Pipelines reference docs](https://www.kubeflow.org/docs/components/pipelines/reference/) + for the Kubeflow Pipelines API and SDK, including the Kubeflow Pipelines domain-specific language (DSL). +- [Kubeflow SDK](https://github.com/kubeflow/sdk) + to interact with Kubeflow APIs with Python-native interface. + + See also [Kubeflow APIs and SDKs](https://www.kubeflow.org/docs/started/architecture/#kubeflow-apis-and-sdks). + +Kubeflow CRDs are following Kubernetes best practices for +[the API changes](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-changes). + +#### Describe the project’s API topology and conventions + +[Kubeflow APIs and SDKs documentation](https://www.kubeflow.org/docs/started/architecture/#kubeflow-apis-and-sdks). + +#### Describe the project defaults + +Some defaults can be seen [in the Kubeflow manifests](https://github.com/kubeflow/manifests#installation). + +Additionally, API defaults are shown in the API docs, such as: +[Kubeflow Trainer](https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainjob_types.go#L128) +or [Kubeflow Katib](https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L46). + +#### Outline any additional configurations from default to make reasonable use of the project + +[Kubeflow manifests guides](https://github.com/kubeflow/manifests/blob/master/README.md). + +Some public and listed distributions have their own ways +[to install Kubeflow AI platform](https://www.kubeflow.org/docs/started/installing-kubeflow/#packaged-distributions) + +#### Describe any new or changed API types and calls—including to cloud providers—that will result from this project being enabled and used + +Deploying Kubeflow manifests and individual projects results in exposing new APIs and the possibility +to call configured 3rd party APIs if integrations exist. These all have to be explicitly set by the users. + +#### Describe compatibility of any new or changed APIs with API servers, including the Kubernetes API server + +Kubeflow follows the same practice for [API compatibility as Kubernetes](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-changes). + +#### Describe versioning of any new or changed APIs, including how breaking changes are handled + +Many Kubeflow projects use Kubernetes CRDs, and for these resources follows +[the same deprecation policy as Kubernetes](https://kubernetes.io/docs/reference/using-api/deprecation-policy/). + +#### Describe the project’s release processes, including major, minor and patch releases + +Every Kubeflow project follow its own release lifecycle, for example +[Kubeflow Trainer](https://github.com/kubeflow/trainer/tree/master/docs/release) +or [Kubeflow Katib](https://github.com/kubeflow/katib/tree/master/docs/release). + +However, community also maintain [the Kubeflow AI reference platform](https://www.kubeflow.org/docs/kubeflow-platform/releases/) +releases which install all Kubeflow projects together for end-to-end AI platform: + +### Installation + +#### Describe how the project is installed and initialized + +[Kubeflow Installation guide](https://www.kubeflow.org/docs/started/installing-kubeflow/). + +Kubeflow projects can be installed as a standalone applications or together using +the Kubeflow Manifests or Kubeflow Distributions (public or private). + +#### How does an adopter test and validate the installation + +Distributions can verify the installation [by following this guide](https://github.com/kubeflow/manifests?tab=readme-ov-file#installation) +or [executing this test suites](https://github.com/kubeflow/manifests/blob/master/.github/workflows/full_kubeflow_integration_test.yaml). + +Kubeflow projects docs also explain how platform admins should validate the installation of individual +applications, [for example Kubeflow Trainer](https://www.kubeflow.org/docs/components/trainer/operator-guides/installation/#installing-the-kubeflow-trainer-controller-manager) + +### Security + +#### Provide a link to the project’s cloud native [security self assessment](https://tag-security.cncf.io/community/assessments/). + +- [Kubeflow Security Self Assessment](https://github.com/kubeflow/community/blob/master/security/self-assessment.md) + +#### How are you satisfying the tenets of cloud native security projects + +1. Make security a design requirement. + +Kubeflow projects are built with security as a fundamental concern. Kubeflow projects follow +Kubernetes and cloud native security best practices. By leveraging Kubernetes Custom Resources +Definition (CRDs), Role-Based Access Control (RBAC), network policies, and pod security standards, +Kubeflow projects seamlessly integrate into cloud native secure environments. Additionally, +Kubeflow projects re-use functionality from other cloud native tools like Istio, Cert-Manager, +Kubernetes and inherit its security best practices. + +Default configuration for Kubeflow projects follow best practices such as rootless containers, +limited privileges, etc. + +2. Applying secure configuration has the best user experience + +Default configuration for Kubeflow projects follow best practices such as rootless containers, +limited privileges, etc. Users can migrate workloads to more secure configurations without breaking +changes, leveraging Kubernetes’ declarative model. + +3. Selecting insecure configuration is a conscious decision + +Insecure options require explicit user setup and configurations. + +#### Describe how each of the [cloud native principles](https://github.com/cncf/toc/blob/main/DEFINITION.md) apply to your project + +Kubeflow uses cloud native principles by building on Kubernetes and other cloud native technologies +while extending them in our composable projects. + +Kubeflow projects use containers as a fundamental unit of task/deployment. For example, +every TrainJob uses separate containers to train models, allowing users to package their training +code along with its dependencies in a containerized environment + +Kubeflow projects are designed to run natively on Kubernetes and define various CRDs for each +AI workload. For example, TrainJob CR for model training or Notebook CR for interactive development. +It uses native Kubernetes primitives like Deployments, Services, Job, JobSet, and CRDs to +orchestrate AI workloads. + +Kubeflow follows a loosely-coupled microservices architecture model, where each project can +be deployed as a standalone service and be integrated into desired AI platform on Kubernetes. + +Kubeflow APIs are declarative and follow Kubernetes best practices. For instance, +SparkApplication to manage Spark jobs or TrainJob to manage distributed training jobs. + +#### How do you recommend users alter security defaults in order to "loosen" the security of the project + +Users can install Kubeflow Pipelines without multi-user isolation if they want to loosen the +security of deployment. Users suggested to follow the Kubeflow AI reference platform deployment +mode to install Kubeflow Pipelines control plane +[in multi-tenant model](https://www.kubeflow.org/docs/components/pipelines/operator-guides/multi-user/). +It provides isolation between Pipelines and Runs for users. + +#### Security Hygiene + +##### Describe the frameworks, practices and procedures the project uses to maintain the basic health and security of the project + +- **Robust CI/CD infrastructure:** Kubeflow projects have automated unit, integration, and end-to-end + tests that are integrated into the CI pipelines to ensure code stability and correctness of PRs. + +- **Dependency management:** Kubeflow projects leverage lock files like `go.mod` and `pyproject.toml`, + and progressing towards Software Bill of Materials (SBOM) generation. + +- **Release stability:** Kubeflow projects maintain release branches and release tags which allow + maintainers create patch releases to address bug fixes and security vulnerabilities. Also, + projects maintain changelogs and roadmaps to ensure traceability and transparency for all changes. + +- **Static and Dynamic Code Analysis:** Kubeflow projects CI integrates code linting, static, + and dynamic code analysis tools to enforce code quality and detect potential issues before they + reach production. + +- **Secure Defaults:** Defaults manifests and Helm charts are configured with security best practices, + like pod security standards, rootless containers, and enforces least privileges. + +- **Code Review Process:** Kubeflow projects’ maintainers follow a strong review process with peer + review and explicit approval before PRs being merged into the main branch. + +- **Open Governance:** Kubeflow follows open governance with regular public meetings and + communication channels to keep users aware of development and roadmap. + +- **Secure Image Build:** Image builds are validated through CI/CD checks to ensure they are + reproducible, security, and consistent. + +Some examples can be found here: + +- [Kubeflow Spark Operator](https://github.com/kubeflow/spark-operator/tree/master/.github/workflows) +- [Kubeflow Katib](https://github.com/kubeflow/katib/tree/master/.github/workflows) +- [Kubeflow Trainer](https://github.com/kubeflow/trainer/tree/master/.github/workflows) +- [Kubeflow Model Registry](https://github.com/kubeflow/model-registry/tree/main/.github/workflows) +- [Kubeflow Pipelines](https://github.com/kubeflow/pipelines/tree/master/.github/workflows) +- [Kubeflow Manifests](https://github.com/kubeflow/manifests/blob/master/.github/workflows/full_kubeflow_integration_test.yaml) + +##### Describe how the project has evaluated which features will be a security risk to users if they are not maintained by the project + +Features are discussed and reviewed in the community, with security implications considered during +design and code review processes. The project tracks the security health of dependencies and evaluates +the impact of vulnerabilities or unmaintained projects. The project identifies potential risks by +analyzing the attack surface, especially for features that interact with user-supplied code, +custom containers, user supplied models or external data sources. + +#### Cloud Native Threat Modeling + +##### Explain the least minimal privileges required by the project and reasons for additional privileges + +Each Kubeflow project's control plane needs to watch for its CRDs. For example, Kubeflow Trainer: +TrainJob and ClusterTrainingRuntime, Spark Operator for SparkApplications, Katib for Experiments, etc. +Controllers have minimal privileges to orchestrate its CRDs and corresponding resources that need to be created + +Kubeflow control plane is installed cluster-wide, and it requires RBAC to manage namespace-scoped resources. + +The default controllers run with PodSecurityStandards restricted and rootless containers, +minimizing the risk of privilege escalation. + +Users who interact with Kubeflow SDK get namespace-scoped permission to interact with required +resources. For example, users’ role for +[Kubeflow Trainer can be found here](https://github.com/kubeflow/trainer/blob/master/manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml#L17). +All roles are aggregated in the `kubeflow-edit` cluster role after installing Kubeflow Manifests. + +##### Describe how the project is handling certificate rotation and mitigates any issues with certificates + +Kubeflow projects use [cert-manager to generate and rotate certificates](https://github.com/kubeflow/manifests/tree/master/common/cert-manager). +Those certificates are used for validation and mutation webhooks across Kubeflow projects. Cert-manager +handles the issuance, renewal, and rotation of these certificates automatically without downtime. + +##### Describe how the project is following and implementing [secure software supply chain best practices](https://project.linuxfoundation.org/hubfs/CNCF_SSCP_v1.pdf) + +- **Automated CI/CD infrastructure:** Kubeflow projects have automated unit, integration, and + end-to-end tests that are integrated into the CI pipelines to ensure code stability and reduce + risk of introducing vulnerabilities. + +- **Vulnerability Scanning:** Kubeflow uses tools for vulnerability scanning like Dependabot, + Trivy to identify CVEs and address known CVEs in dependencies. + +- **Dependency management:** Kubeflow projects leverage lock files like `go.mod` and `pyproject.toml`, + and progressing towards Software Bill of Materials (SBOM) generation. + +- **Code Review Process:** Kubeflow projects’ maintainers follow a strong review process with peer + review and explicit approval before PRs being merged into the main branch. + +- **DCO Check:** Committers are required to sign and comply with the Developer Certificate of Origin + (DCO) to affirm the legitimacy and authorship of their contributions. + +- **Branch Protection:** The project enforces branch protection rules to prevent unauthorized changes, + enforce status checks, require pull request reviews, and control who can push to protected branches. + +- **Prevent Secrets in Source Code:** Proactively prevents committing secrets to the source code + by using tools and GitHub workflows that detect sensitive data in PRs. + +- **License compliance:** Kubeflow uses + [the FOSSA product](https://app.fossa.com/reports/bb8e2d41-254a-4af3-9044-e7f484c34dd1) + provided by the CNCF to scan for license compliance on a regular and ongoing basis. + +## Day 1 - Installation and Deployment Phase + +### Project Installation and Configuration + +#### Describe what project installation and configuration look like + +Each project has its [own standalone installation guide](https://www.kubeflow.org/docs/started/installing-kubeflow/#standalone-kubeflow-components). + +Kubeflow projects are packaged by [distributions and vendors](https://www.kubeflow.org/docs/started/installing-kubeflow/#kubeflow-platform) +into platform products which can be used to install our tools. + +We provide optional manifests to deploy all Kubeflow projects as +[Kubeflow AI reference platform](https://github.com/kubeflow/manifests/blob/master/README.md). + +### Project Enablement and Rollback + +#### How can this project be enabled or disabled in a live cluster? Please describe any downtime required of the control plane or nodes + +Users can set the replica count to 0 in the Kubeflow projects deployment. Existing AI workloads +should not be impacted since it won’t be reconciled by controllers. + +Updating [the Kubeflow AI Reference Platform](https://github.com/kubeflow/manifests/blob/master/README.md#upgrading-and-extending). + +The installation guide described [when control plane is ready](https://www.kubeflow.org/docs/components/trainer/operator-guides/installation/#installing-the-kubeflow-trainer-controller-manager). + +#### Describe how enabling the project changes any default behavior of the cluster or running workloads + +Enable projects by scaling replica count back to 1. All running AI workloads will be reconciled again +by controllers, and they perform the appropriate updates to the CRDs. + +Istio, Knative, cert-manager might interfere with existing installations. + +#### Describe how the project tests enablement and disablement + +Conformance program [is work in progress](https://github.com/kubeflow/kubeflow/tree/master/conformance/1.7) +to ensure tests across all Kubeflow projects. + +#### How does the project clean up any resources created, including CRDs + +The Kubeflow projects control plane can be deleted by cleanup the appropriate resources, +for example: + +```sh +kubectl delete -k https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0 +``` + +The command will cleanup all CRDs and control plane deployment. Since Kubeflow CRDs maintain +`ownerReference` for associated resources, all resources will be removed after deleting the +actual CRDs. For example: + +```sh +kubectl delete trainjob --all --all-namespaces +``` + +Check [this guide to cleanup](https://github.com/kubeflow/manifests/blob/master/README.md#upgrading-and-extending) +the Kubeflow AI reference platform. + +### Rollout, Upgrade and Rollback Planning + +#### How does the project intend to provide and maintain compatibility with infrastructure and orchestration management tools like Kubernetes and with what frequency + +Kubeflow projects publish its supported Kubernetes version for every release. The supported versions +are evaluated and upgraded on every release. We support +[Kubernetes 1.31+ and test on 1.32+](https://github.com/kubeflow/manifests/blob/master/tests/install_KinD_create_KinD_cluster_install_kustomize.sh) + +#### How the project handles rollback procedures + +Some projects support leader election and HA to make sure that long-running workloads are complete. +Users can scale up or down replicas of controllers during rollback. + +#### How can a rollout or rollback fail? Describe any impact to already running workloads + +Newer revision will not be active for many reasons including insufficient cluster resources for the +newer spec, invalid configuration spec. Traffic will not be switched to the new unless the newer +revision is ready to accept traffic. Hence, the already running workloads will not be affected +in any case if rollout fails. + +#### Describe any specific metrics that should inform a rollback + +Kubeflow projects CRDs expose status that inform about activity of AI workloads. Additionally, +controllers are exposed various Prometheus metrics to indicate workload status. + +#### Explain how upgrades and rollbacks were tested and how the upgrade->downgrade->upgrade path was tested + +Currently, it’s being manually tested by users, but automated tests are work in progress. + +#### Explain how the project informs users of deprecations and removals of features and APIs + +All API changes are backward compatible and changes are announced in the release notes and +changelogs. If some APIs are deprecated, newer versions of APIs are introduced with user awareness. +Kubeflow CRDs follow Kubernetes best practices for API compatibility. + +For example [Kubeflow Trainer breaking changes](https://github.com/kubeflow/trainer/blob/master/CHANGELOG.md#breaking-changes). + +#### Explain how the project permits utilization of alpha and beta capabilities as part of a rollout + +At the moment, functionality is logged explicitly if a feature is alpha or beta. We do not have +feature gates. Additionally, we are working on conversion webhook that helps users with updated +API versions of CRDs. + +## Day 2 - Day-to-Day Operations Phase + +### Scalability/Reliability + +#### Describe how the project increases the size or count of existing API objects + +In general, the API object count complexity is linear in the number of users and workloads. Users +can list deployed Kubeflow CRDs across all namespaces. + +For example, to list all TrainJob from Kubeflow Trainer run: + +```sh +kubectl get trainjob --all-namespaces +``` + +The core controllers and resources are constant (beyond replicating specific controllers), +see [the full kustomize build here](https://github.com/kubeflow/manifests?tab=readme-ov-file#install-with-a-single-command). + +#### Describe how the project defines Service Level Objectives (SLOs) and Service Level Indicators (SLIs) + +Kubeflow currently doesn’t provide the SLOs and SLIs, however we follow +[the Kubernetes SLOs](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md) + +#### Describe any operations that will increase in time covered by existing SLIs/SLOs + +As above. + +#### Describe the increase in resource usage in any components as a result of enabling this project, to include CPU, Memory, Storage, Throughput + +Resources requirements for Kubeflow projects [are set here](https://github.com/kubeflow/manifests/pull/3091#issuecomment-3016609243). + +#### Describe which conditions enabling / using this project would result in resource exhaustion of some node resources + +There are some specific issues for each project, we should list some of them (e.g. pipelines +controller can have open file handler issues). Since many workloads are GPU-intensive, +Kubernetes platform admins need to ensure that nodes have enough capacity to run AI workloads with +Kubeflow projects. + +#### Describe the load testing that has been performed on the project and the results + +Some intensive load testing has been performed by Kubeflow users, for example some LLM foundation +models have been trained using Kubeflow Trainer across a few hundreds GPUs. At KubeCon some users +share their experience of running more than 10,000 GPUs with Kubeflow Training Operator. + +#### Describe the recommended limits of users, requests, system resources, etc. and how they were obtained + +Users on the order of hundreds are known to work. Scaling beyond hundreds of users may require an +increase in the requests/limits replicas of some deployments, or scaling dependencies like databases. + +#### Describe which resilience pattern the project uses and how, including the circuit breaker pattern + +Kubeflow uses the Kubernetes resilience pattern to manage controllers with HA. If one replica fails, another part of the control plane takes responsibility to orchestrate workloads. + +### Observability Requirements + +#### Describe the signals the project is using or producing, including logs, metrics, profiles and traces. Please include supported formats, recommended configurations and data storage + +Kubeflow controllers expose Prometheus metrics to report workload status. Controllers also expose +logs and status for platform admins to ensure stability. + +#### Describe how the project captures audit logging + +Platform admins can leverage [Kubernetes audit](https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/) +for Kubeflow projects. + +#### Describe any dashboards the project uses or implements as well as any dashboard requirements + +[Kubeflow Dashboard requirements](https://www.kubeflow.org/docs/components/central-dash/overview/). + +#### Describe how the project surfaces project resource requirements for adopters to monitor cloud and infrastructure costs, e.g. FinOps That must happen on the Kubernetes namespace level + +Users are recommended to use third-party tools like Kubecost to measure cloud and infrastructure +cost of running Kubeflow projects. + +#### Which parameters is the project covering to ensure the health of the application/service and its workloads + +Most project use Kubernetes [Liveness and Readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/). + +For example [Kubeflow Trainer controller manager](https://github.com/kubeflow/trainer/blob/master/manifests/base/manager/manager.yaml#L30-L43). + +#### How can an operator determine if the project is in use by workloads + +- Check the Pods in `kubeflow-profil`e labeled namespaces. +- Check the CRDs in user’s namespaces +- Check the Kubeflow Dashboard resources. + +#### How can someone using this project know that it is working for his instance + +- [Kubeflow Manifests guide](https://github.com/kubeflow/manifests#installation). +- Run the [Kubeflow manifest test suite](https://github.com/kubeflow/manifests/blob/master/.github/workflows/full_kubeflow_integration_test.yaml) +- Run the upstream examples, for example [Kubeflow Trainer PyTorch training](https://github.com/kubeflow/trainer/blob/master/examples/pytorch/question-answering/fine-tune-distilbert.ipynb). + +### Dependencies + +#### Describe the specific running services the project depends on in the cluster + +Most projects depend on Cert-Manager>=v1.16, and Istio>=v1.26 + +Specific projects have other dependencies: + +- Kubeflow Pipelines: Argo Workflows>=v3.1 +- Kubeflow Trainer: JobSet>=v0.8.0 +- Kubeflow Katib: MySQL>=v8.0 +- Kubeflow Model Registry: MySQL>=v8.0 + +For details, please take a look at General security Talks about Kubeflow including +[an architectural introduction](https://github.com/kubeflow/manifests#kubeflow-components-versions). + +#### Describe the project’s dependency lifecycle policy + +We follow Kubernetes deprecation policy for supported versions. For example, we support the +3-4 latest versions of Kubernetes to deploy Kubeflow projects. + +#### How does the project incorporate and consider source composition analysis as part of its development and security hygiene? Describe how this source composition analysis (SCA) is tracked + +Various static and dynamic code analysis tools are enforced as described above. + +#### Describe how the project implements changes based on source composition analysis (SCA) and the timescale + +Various static and dynamic code analysis tools are enforced as described above. + +### Troubleshooting + +#### How does this project recover if a key component or feature becomes unavailable + +Specifics depend on the Kubeflow Project (KFP, Katib, etc.). Kubeflow projects are cloud and +Kubernetes native apps, so fault tolerance is strongly tied to the health of the underlying cluster. +In practice user workloads sometimes have a retry policy and the Kubeflow core services should +just recover automatically. + +More details are described in the previous sections. + +#### Describe the known failure modes + +Each Kubeflow project handles failure modes differently beyond native Kubernetes fault tolerance. +Many of them are configured at the application level in user code. + +### Security + +#### How is the project executing access control + +As described above each Kubeflow projects require limited RBAC to manage its CRDs + +#### Cloud Native Threat Modeling + +Described in the section above. + +#### How does the project ensure its security reporting and response team is representative of its community diversity (organizational and individual) + +Every Kubeflow project follows security guidelines. + +- Kubeflow Spark Operator [security policy](https://github.com/kubeflow/spark-operator/blob/master/SECURITY.md). +- Kubeflow Notebooks [security policy](https://github.com/kubeflow/notebooks/blob/master/SECURITY.md). +- Kubeflow Trainer [security policy](https://github.com/kubeflow/trainer/blob/master/SECURITY.md). +- Kubeflow Katib [security policy](https://github.com/kubeflow/katib/blob/master/SECURITY.md). +- Kubeflow Model Registry [security policy](https://github.com/kubeflow/model-registry/blob/main/SECURITY.md). +- Kubeflow Pipelines [security policy](https://github.com/kubeflow/pipelines/blob/master/SECURITY.md). + +#### How does the project invite and rotate security reporting team members + +Active project maintainers are responsible to ensure that security reports are addressed. Each +project has its own security reporting and disclosure policy. For projects which use GitHub’s security +disclosure system, access control is managed by having write access to the relevant github repository.