Add --gpu option to the create command for NVIDIA support (#94)
Conversation
@click.option("--no-gpu", is_flag=True, help="Create a notebook without GPU support.")
@click.option(
    "--gpu",
    type=click.Choice(SUPPORTED_GPUS),
Does this need to be case sensitive?
-    type=click.Choice(SUPPORTED_GPUS),
+    type=click.Choice(SUPPORTED_GPUS, case_sensitive=False),
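For context, here is a dependency-free sketch of the matching that click.Choice(..., case_sensitive=False) performs; SUPPORTED_GPUS and normalize_gpu_choice are stand-ins, not the PR's code:

```python
SUPPORTED_GPUS = ["nvidia"]  # stand-in for the project's constant

def normalize_gpu_choice(value, choices=SUPPORTED_GPUS):
    """Match a user-supplied value against choices, ignoring case.

    Roughly what click.Choice(..., case_sensitive=False) does: the
    canonical choice string is returned, so downstream code never
    sees "NVIDIA" or "Nvidia".
    """
    for choice in choices:
        if value.lower() == choice.lower():
            return choice
    raise ValueError(f"invalid choice: {value!r} (choose from {choices})")

print(normalize_gpu_choice("NVIDIA"))  # nvidia
```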
    help=f"Path to a Kubernetes config file. Defaults to the value of the KUBECONFIG environment variable, else to '{KUBECONFIG_DEFAULT}'.",  # noqa E501
)
def create_notebook_command(name: str, image: str, kubeconfig: str) -> None:
@click.option("--no-gpu", is_flag=True, help="Create a notebook without GPU support.")
Why do we need a --no-gpu flag? Can the absence of --gpu=something imply no GPU, letting us get rid of this?
It's the UX team's recommendation to provide this --no-gpu flag.
I wonder if there's a miscommunication here. If I understand correctly, these two commands are the same?
dss create my-notebook
dss create my-notebook --no-gpu
Am I missing something about the CLI?
{% if gpu %}
resources:
  limits:
    {{ gpu }}: 1
{% endif %}
Do we support only a single GPU per notebook? If I have two GPUs, should we support using both?
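If multiple GPUs per notebook were desired, the hard-coded 1 could become a parameter. A minimal sketch, using plain string building instead of the project's Jinja template; the count parameter is hypothetical and not part of this PR:

```python
from typing import Optional

def render_gpu_resources(gpu: Optional[str], count: int = 1) -> str:
    """Mirror the template's {% if gpu %} block with a parameterized count."""
    if not gpu:
        return ""  # no --gpu given: emit no resources block at all
    return (
        "resources:\n"
        "  limits:\n"
        f"    {gpu}: {count}\n"
    )
```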
self.msg = str(msg)

def node_has_gpu_labels(lightkube_client: Client, labels: List[str]) -> bool:
This asserts that all nodes have the provided labels, not that a given node has GPU labels.
Alternatively, you could include the GPU labels in the function body and drop the labels input.
-def node_has_gpu_labels(lightkube_client: Client, labels: List[str]) -> bool:
+def all_nodes_have_labels(lightkube_client: Client, labels: List[str]) -> bool:
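To make the renamed semantics concrete, here is a dependency-free sketch of the check. It takes the nodes' label dicts directly instead of a lightkube Client, so the signature and label names are illustrative only:

```python
from typing import Dict, List

def all_nodes_have_labels(node_labels: List[Dict[str, str]], labels: List[str]) -> bool:
    """Return True only if every node carries every one of `labels`."""
    if not node_labels:
        return False  # no nodes at all: treat as "labels not present"
    return all(label in node for node in node_labels for label in labels)
```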
 name (str): The name of the notebook server.
-image (str): The image used for the notebook server.
 lightkube_client (Client): The Kubernetes client.
+image (str): The Docker image used for the notebook server.
Maybe leave it as "image" or "OCI image" instead of "Docker", just so we don't exclude rocks.
)
logger.info(f"Success: Notebook {name} created successfully.")
if gpu:
    logger.info(f"{gpu.title()} GPU attached to notebook.")
Minor suggestion: title() won't be correct in all cases (e.g. "amd" becomes "Amd"). I'd stick to the enumerators we say --gpu should be.
-logger.info(f"{gpu.title()} GPU attached to notebook.")
+logger.info(f"{gpu} GPU attached to notebook.")
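A quick demonstration of why .title() misbehaves here:

```python
# str.title() uppercases only the first letter of each word, which is
# fine for "nvidia" but mangles acronym-style vendor names like "amd".
print("nvidia".title())  # Nvidia
print("amd".title())     # Amd  -- not "AMD"
```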
I don't think the gpu requests are working as expected. Try this:
dss initialize --kubeconfig ~/.kube/config
dss create gpu-omitted --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config
dss create gpu-selected --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config --gpu=nvidia
Then in each notebook server:
- create a terminal and run nvidia-smi. The GPU should be visible only in one notebook server, but it is visible in both
- create a notebook and run import torch; torch.cuda.is_available(). The GPU should be available only in one notebook server, but it is available in both
- create a notebook and run this tutorial. Both will run on the GPU. After doing this, run nvidia-smi on the host system and we'll see two processes both using the GPU, like:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 37C P0 36W / 70W | 280MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 70671 C /opt/conda/bin/python 158MiB |
| 0 N/A N/A 96171 C /opt/conda/bin/python 118MiB |
+---------------------------------------------------------------------------------------+
One thing that is working correctly is the GPU requests. My machine has one gpu, and if I do:
dss create x ... --gpu=nvidia
dss create y ... --gpu=nvidia
notebook y sits pending with the FailedScheduling warning: 1 Insufficient nvidia.com/gpu
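For reference, this scheduling behavior comes from the rendered limits block: nvidia.com/gpu is an extended resource, so the scheduler counts each requested unit against what the node advertises. Assuming --gpu=nvidia maps to the nvidia.com/gpu resource name, the template would render roughly:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```

With a single-GPU node, the second pod's request cannot be satisfied, hence the Insufficient nvidia.com/gpu event.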
see above comments
Tested using a g4dn.xlarge EC2 instance in AWS. Had to bump the storage to ~150GB to run multiple notebooks. The VM was set up with:
sudo snap install microk8s --channel 1.28/stable --classic
sudo usermod -a -G microk8s ubuntu
mkdir ~/.kube
sudo chown -R ubuntu ~/.kube
newgrp microk8s
# This starts a new terminal
microk8s enable storage dns rbac gpu
microk8s config > ~/.kube/config
git clone http://github.com/canonical/data-science-stack
cd data-science-stack/
git checkout --track origin/KF-5420-add-gpu-flag-for-create
pip install -e .
python --version
nvidia-smi
dss initialize --kubeconfig ~/.kube/config
dss create nothing-specified --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config
dss create with-gpu --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config --gpu=nvidia
From there, I enabled a SOCKS proxy from my local machine (ssh -D 9999 -C -N -i PEM_FILE EC2_INSTANCE) and made sample notebooks out of this pytorch example to test whether the GPU was working.
closes: #39
Users can now create GPU-backed notebooks in the cluster.
create also blocks indefinitely until the notebook is created, or until the image turns out to be unpullable.
NOTE: in order to test this you need a device with an NVIDIA GPU set up in microk8s. I have tested on my device.
Other minor changes:
- create now waits indefinitely until it's finished
- wait_for_deployment can now wait indefinitely
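The "wait indefinitely" behavior can be sketched as a generic polling loop where timeout=None means no deadline. Names here are illustrative; the PR's wait_for_deployment checks deployment status via lightkube rather than a callable:

```python
import time
from typing import Callable, Optional

def wait_until(check: Callable[[], bool],
               timeout: Optional[float] = None,
               interval: float = 0.01) -> None:
    """Poll `check` until it returns True.

    timeout=None waits forever; a number raises TimeoutError once the
    deadline passes without the condition being met.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while not check():
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("condition not met before timeout")
        time.sleep(interval)
```

A caller would pass something like lambda: deployment_is_ready(...) (hypothetical helper) as the check.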