Add --gpu option to the create command for NVIDIA support (#94)
Conversation
@click.option("--no-gpu", is_flag=True, help="Create a notebook without GPU support.")
@click.option(
    "--gpu",
    type=click.Choice(SUPPORTED_GPUS),
Does this need to be case sensitive?
-    type=click.Choice(SUPPORTED_GPUS),
+    type=click.Choice(SUPPORTED_GPUS, case_sensitive=False),
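For context, here is a dependency-free sketch of the matching that click.Choice(..., case_sensitive=False) performs; SUPPORTED_GPUS and normalize_gpu_choice are stand-ins, not the PR's code:

```python
SUPPORTED_GPUS = ["nvidia"]  # stand-in for the project's constant

def normalize_gpu_choice(value, choices=SUPPORTED_GPUS):
    """Match a user-supplied value against choices, ignoring case.

    Roughly what click.Choice(..., case_sensitive=False) does: the
    canonical choice string is returned, so downstream code never
    sees "NVIDIA" or "Nvidia".
    """
    for choice in choices:
        if value.lower() == choice.lower():
            return choice
    raise ValueError(f"invalid choice: {value!r} (choose from {choices})")

print(normalize_gpu_choice("NVIDIA"))  # nvidia
```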
    help=f"Path to a Kubernetes config file. Defaults to the value of the KUBECONFIG environment variable, else to '{KUBECONFIG_DEFAULT}'.",  # noqa E501
)
def create_notebook_command(name: str, image: str, kubeconfig: str) -> None:
@click.option("--no-gpu", is_flag=True, help="Create a notebook without GPU support.")
Why do we need a --no-gpu flag? Can the absence of --gpu=something imply no GPU, letting us get rid of this?
It's the UX team's recommendation to provide this --no-gpu flag.
I wonder if there's a miscommunication here. If I understand correctly, these two commands are the same?
dss create my-notebook
dss create my-notebook --no-gpu
Am I missing something about the CLI?
{% if gpu %}
resources:
  limits:
    {{ gpu }}: 1
{% endif %}
Do we support only a single GPU per notebook? If I have two GPUs, should we support using both?
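If multiple GPUs per notebook were desired, the hard-coded 1 could become a parameter. A minimal sketch, using plain string building instead of the project's Jinja template; the count parameter is hypothetical and not part of this PR:

```python
from typing import Optional

def render_gpu_resources(gpu: Optional[str], count: int = 1) -> str:
    """Mirror the template's {% if gpu %} block with a parameterized count."""
    if not gpu:
        return ""  # no --gpu given: emit no resources block at all
    return (
        "resources:\n"
        "  limits:\n"
        f"    {gpu}: {count}\n"
    )
```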
self.msg = str(msg)

def node_has_gpu_labels(lightkube_client: Client, labels: List[str]) -> bool:
This asserts that all nodes have the provided labels, not that a given node has GPU labels.
Alternatively, you could include the GPU labels in the function body and drop the labels input.
-def node_has_gpu_labels(lightkube_client: Client, labels: List[str]) -> bool:
+def all_nodes_have_labels(lightkube_client: Client, labels: List[str]) -> bool:
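To make the renamed semantics concrete, here is a dependency-free sketch of the check. It takes the nodes' label dicts directly instead of a lightkube Client, so the signature and label names are illustrative only:

```python
from typing import Dict, List

def all_nodes_have_labels(node_labels: List[Dict[str, str]], labels: List[str]) -> bool:
    """Return True only if every node carries every one of `labels`."""
    if not node_labels:
        return False  # no nodes at all: treat as "labels not present"
    return all(label in node for node in node_labels for label in labels)
```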
 name (str): The name of the notebook server.
-image (str): The image used for the notebook server.
 lightkube_client (Client): The Kubernetes client.
+image (str): The Docker image used for the notebook server.
Maybe leave it as "image" or "OCI image" instead of "Docker", just so we don't exclude rocks.
)
logger.info(f"Success: Notebook {name} created successfully.")
if gpu:
    logger.info(f"{gpu.title()} GPU attached to notebook.")
Minor suggestion: title() won't be correct in all cases (e.g. "amd" becomes "Amd"). I'd stick to the enumerators we say --gpu should be.
-logger.info(f"{gpu.title()} GPU attached to notebook.")
+logger.info(f"{gpu} GPU attached to notebook.")
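A quick demonstration of why .title() misbehaves here:

```python
# str.title() uppercases only the first letter of each word, which is
# fine for "nvidia" but mangles acronym-style vendor names like "amd".
print("nvidia".title())  # Nvidia
print("amd".title())     # Amd  -- not "AMD"
```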
I don't think the gpu requests are working as expected. Try this:
dss initialize --kubeconfig ~/.kube/config
dss create gpu-omitted --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config
dss create gpu-selected --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config --gpu=nvidia
Then in each notebook server:
- create a terminal and run nvidia-smi. The GPU should be visible only in one notebook server, but it is visible in both
- create a notebook and run import torch; torch.cuda.is_available(). The GPU should be available only in one notebook server, but it is available in both
- create a notebook and run this tutorial. Both will run on the GPU. After doing this, run nvidia-smi on the host system and we'll see two processes both using the GPU, like:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 37C P0 36W / 70W | 280MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 70671 C /opt/conda/bin/python 158MiB |
| 0 N/A N/A 96171 C /opt/conda/bin/python 118MiB |
+---------------------------------------------------------------------------------------+
One thing that is working correctly is the GPU requests. My machine has one gpu, and if I do:
dss create x ... --gpu=nvidia
dss create y ... --gpu=nvidia
notebook y sits pending with the FailedScheduling warning: 1 Insufficient nvidia.com/gpu
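For reference, this scheduling behavior comes from the rendered limits block: nvidia.com/gpu is an extended resource, so the scheduler counts each requested unit against what the node advertises. Assuming --gpu=nvidia maps to the nvidia.com/gpu resource name, the template would render roughly:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```

With a single-GPU node, the second pod's request cannot be satisfied, hence the Insufficient nvidia.com/gpu event.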
see above comments
Tested using a g4dn.xlarge EC2 instance in AWS. Had to bump the storage to ~150GB to run multiple notebooks. The VM was set up with:
sudo snap install microk8s --channel 1.28/stable --classic
sudo usermod -a -G microk8s ubuntu
mkdir ~/.kube
sudo chown -R ubuntu ~/.kube
newgrp microk8s
# This starts a new terminal
microk8s enable storage dns rbac gpu
microk8s config > ~/.kube/config
git clone http://github.com/canonical/data-science-stack
cd data-science-stack/
git checkout --track origin/KF-5420-add-gpu-flag-for-create
pip install -e .
python --version
nvidia-smi
dss initialize --kubeconfig ~/.kube/config
dss create nothing-specified --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config
dss create with-gpu --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config --gpu=nvidia
From there, I enabled a SOCKS proxy from my local machine (ssh -D 9999 -C -N -i PEM_FILE EC2_INSTANCE) and made sample notebooks out of this pytorch example to test whether the GPU was working.
closes: #39
Users can now create GPU-backed notebooks in the cluster.
create also blocks indefinitely until the notebook is created, or until the image turns out to be unpullable.
NOTE: in order to test this you need a device with an NVIDIA GPU set up in microk8s. I have tested on my device.
Other minor changes:
- create now waits indefinitely until it's finished
- wait_for_deployment can now wait indefinitely
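The "wait indefinitely" behavior can be sketched as a generic polling loop where timeout=None means no deadline. Names here are illustrative; the PR's wait_for_deployment checks deployment status via lightkube rather than a callable:

```python
import time
from typing import Callable, Optional

def wait_until(check: Callable[[], bool],
               timeout: Optional[float] = None,
               interval: float = 0.01) -> None:
    """Poll `check` until it returns True.

    timeout=None waits forever; a number raises TimeoutError once the
    deadline passes without the condition being met.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while not check():
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("condition not met before timeout")
        time.sleep(interval)
```

A caller would pass something like lambda: deployment_is_ready(...) (hypothetical helper) as the check.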