
Commit 7915feb

Add files via upload
1 parent c4377b4 commit 7915feb

4 files changed

Lines changed: 638 additions & 0 deletions
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "7b236cf1",
6+
"metadata": {},
7+
"source": [
8+
"# 9.1. `nvmath-python`: Interoperability with CPU and GPU tensor libraries\n",
9+
"The goal of this exercise is to demonstrate how easily `nvmath-python` plugs into existing projects that rely on popular CPU or GPU array libraries, such as NumPy, CuPy, and PyTorch, and how easily a new project can adopt `nvmath-python` alongside such libraries."
10+
]
11+
},
12+
{
13+
"cell_type": "markdown",
14+
"id": "e38c312d",
15+
"metadata": {},
16+
"source": [
17+
"### Pure CuPy implementation\n",
18+
"\n",
19+
"This example demonstrates basic matrix multiplication of CuPy 2D arrays using `matmul`:"
20+
]
21+
},
22+
{
23+
"cell_type": "code",
24+
"execution_count": null,
25+
"id": "b796dc7e",
26+
"metadata": {},
27+
"outputs": [],
28+
"source": [
29+
"import cupy as cp\n",
30+
"\n",
31+
"# Prepare sample input data for matrix matmul\n",
32+
"n, m, k = 2000, 4000, 5000\n",
33+
"a = cp.random.rand(n, k)\n",
34+
"b = cp.random.rand(k, m)\n",
35+
"\n",
36+
"# Perform matrix multiplication\n",
37+
"result = cp.matmul(a, b)\n",
38+
"\n",
39+
"# Print the result\n",
40+
"print(result)\n",
41+
"\n",
42+
"# Print CUDA device for each array\n",
43+
"print(a.device)\n",
44+
"print(b.device)\n",
45+
"print(result.device)"
46+
]
47+
},
48+
{
49+
"cell_type": "markdown",
50+
"id": "7528a6f8",
51+
"metadata": {},
52+
"source": [
53+
"### Using `nvmath-python` alongside CuPy\n",
54+
"\n",
55+
"This is a slight modification of the above example, where the matrix multiplication is done using the corresponding `nvmath-python` implementation.\n",
56+
"\n",
57+
"Note that `nvmath-python` supports multiple frameworks, including CuPy. It uses the framework's memory pool and the current stream for seamless integration. The result of each operation is a tensor of the same framework as the inputs, located on the same device as the inputs."
58+
]
59+
},
60+
{
61+
"cell_type": "code",
62+
"execution_count": null,
63+
"id": "311ee2e9",
64+
"metadata": {},
65+
"outputs": [],
66+
"source": [
67+
"# The same matrix multiplication as in the previous example but using nvmath-python\n",
68+
"import nvmath\n",
69+
"\n",
70+
"# Perform matrix multiplication\n",
71+
"result = nvmath.linalg.advanced.matmul(a, b)\n",
72+
"\n",
73+
"# Print the result\n",
74+
"print(result)\n",
75+
"\n",
76+
"# Print CUDA device for each array\n",
77+
"print(a.device)\n",
78+
"print(b.device)\n",
79+
"print(result.device)\n"
80+
]
81+
},
82+
{
83+
"cell_type": "markdown",
84+
"id": "85b2ae1b",
85+
"metadata": {},
86+
"source": [
87+
"As we can see, the code looks essentially the same. If one measures the performance of the above implementations, it is nearly identical.\n",
88+
"\n",
89+
"This is because CuPy and `nvmath-python` (as well as PyTorch) all use the CUDA-X Math Libraries as their engine. It is up to the user which library to choose for solving a matrix multiplication problem like the one above.\n",
90+
"\n",
91+
"The next exercises demonstrate a few cases where `nvmath-python` may become essential for reaching peak performance."
92+
]
93+
},
94+
{
95+
"cell_type": "code",
96+
"execution_count": null,
97+
"id": "bf34d34d",
98+
"metadata": {},
99+
"outputs": [],
100+
"source": []
101+
}
102+
],
103+
"metadata": {
104+
"kernelspec": {
105+
"display_name": "nersc-nvmath",
106+
"language": "python",
107+
"name": "python3"
108+
},
109+
"language_info": {
110+
"codemirror_mode": {
111+
"name": "ipython",
112+
"version": 3
113+
},
114+
"file_extension": ".py",
115+
"mimetype": "text/x-python",
116+
"name": "python",
117+
"nbconvert_exporter": "python",
118+
"pygments_lexer": "ipython3",
119+
"version": "3.13.5"
120+
}
121+
},
122+
"nbformat": 4,
123+
"nbformat_minor": 5
124+
}
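The exercise above requires a GPU, but its shape logic can be checked on any machine. Per the exercise's intro, `nvmath-python` also accepts NumPy arrays, so the following CPU sketch mirrors the CuPy cell; the scaled-down sizes are an assumption chosen for quick execution, not the notebook's values:

```python
import numpy as np

# Scaled-down version of the shapes used in the CuPy cell above
# (there: n, m, k = 2000, 4000, 5000); small sizes keep this fast on CPU.
n, m, k = 20, 40, 50
a = np.random.rand(n, k)
b = np.random.rand(k, m)

# Same operation, NumPy flavor. On a GPU machine,
# nvmath.linalg.advanced.matmul(a, b) would accept these NumPy
# arrays as well and return a result of the same framework.
result = np.matmul(a, b)

print(result.shape)  # (20, 40)
```

Since both factors are non-negative here, every entry of `result` is non-negative as well, which makes a handy quick sanity check.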
Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "e93afc39",
6+
"metadata": {},
7+
"source": [
8+
"# 9.2. `nvmath-python`: Kernel fusion\n",
9+
"\n",
10+
"Some computational problems have a low ratio of compute instructions to memory accesses, which can often be addressed by reducing the number of memory accesses via kernel fusion. This exercise illustrates how `nvmath-python` fuses simpler operations into a single composite kernel.\n",
11+
"\n",
12+
"We illustrate this with a neural-network forward-pass example and a fast Fourier transform (FFT) example.\n",
13+
"\n",
14+
"## Advanced `matmul` with bias and epilog\n",
15+
"Based on **Exercise 9.1**, it is not clear why one would need `nvmath-python` for matrix multiplications. Indeed, for a basic `matmul` operation, using `nvmath-python` alongside CuPy does seem like overkill. However, in scientific computing and AI applications, `matmul`s are often combined with other operations. For example, in neural networks a quite common usage pattern is as follows.\n",
16+
"\n",
17+
"**Matrix-Matrix Multiplication with Bias and ReLU:**\n",
18+
"\n",
19+
"$$C = \\text{ReLU}(A \\cdot B + b^T)$$\n",
20+
"\n",
21+
"where:\n",
22+
"- $A \\in \\mathbb{R}^{m \\times k}$ is the input matrix\n",
23+
"- $B \\in \\mathbb{R}^{k \\times n}$ is the weight matrix \n",
24+
"- $b \\in \\mathbb{R}^{m}$ is the bias vector (transposed and broadcasted to $m \\times n$)\n",
25+
"- $\\text{ReLU}(x) = \\max(0, x)$ is the Rectified Linear Unit activation function\n",
26+
"- $C \\in \\mathbb{R}^{m \\times n}$ is the output matrix\n",
27+
"\n",
28+
"The bias vector $b$ is reshaped to a column vector in $\\mathbb{R}^{m \\times 1}$ and automatically broadcast across all columns of the result matrix. The ReLU function is applied element-wise to the result of the matrix multiplication plus bias.\n",
29+
"\n"
30+
]
31+
},
32+
{
33+
"cell_type": "code",
34+
"execution_count": null,
35+
"id": "317b5ac3",
36+
"metadata": {},
37+
"outputs": [],
38+
"source": [
39+
"import cupy as cp\n",
40+
"\n",
41+
"# Define the ReLU function\n",
42+
"def relu(x):\n",
43+
" return cp.maximum(0, x)\n",
44+
"\n",
45+
"# Matrix dimensions\n",
46+
"# m, k, n = 2, 5, 4\n",
47+
"m, k, n = 2000, 1000, 4000\n",
48+
"\n",
49+
"# Create input matrix A (m x k)\n",
50+
"A = cp.random.randn(m, k)\n",
51+
"\n",
52+
"# Create weight matrix B (k x n) \n",
53+
"B = cp.random.randn(k, n)\n",
54+
"\n",
55+
"# Create bias vector b (m,) that will be transposed and broadcasted to (m x n)\n",
56+
"b = cp.random.randn(m)\n",
57+
"\n",
58+
"# Implement the formula: C = ReLU(A * B + b^T)\n",
59+
"# Kernel 1: Matrix multiplication\n",
60+
"matmul_result = cp.matmul(A, B)\n",
61+
"\n",
62+
"# Kernel 2: Add bias (broadcasting happens automatically)\n",
63+
"bias_result = matmul_result + b.reshape(-1, 1)\n",
64+
"\n",
65+
"# Kernel 3: Apply ReLU activation\n",
66+
"C = relu(bias_result)\n",
67+
"\n",
68+
"# print(f\"A: {A}\")\n",
69+
"# print(f\"B: {B}\")\n",
70+
"# print(f\"b: {b}\")\n",
71+
"# print(f\"C: {C}\")\n",
72+
"\n",
73+
"\n",
74+
"print(f\"Input matrix A shape: {A.shape}\")\n",
75+
"print(f\"Weight matrix B shape: {B.shape}\") \n",
76+
"print(f\"Bias vector b shape: {b.shape}\")\n",
77+
"print(f\"Transposed bias b^T shape: {b.reshape(-1, 1).shape}\")\n",
78+
"print(f\"Output matrix C shape: {C.shape}\")\n",
79+
"print(f\"Output matrix C device: {C.device}\")"
80+
]
81+
},
82+
{
83+
"cell_type": "markdown",
84+
"id": "39c03a45",
85+
"metadata": {},
86+
"source": [
87+
"**TODO: Validate the above implementation on small, easy-to-comprehend inputs by manually initializing the matrices and bias and by printing intermediate results step by step**\n",
88+
"\n",
89+
"`nvmath-python` leverages the `cuBLASLt` library, which provides a variety of options for implementing such computational patterns. Here is how the above pattern can be implemented using `nvmath-python`:"
90+
]
91+
},
92+
{
93+
"cell_type": "code",
94+
"execution_count": null,
95+
"id": "af245b0d",
96+
"metadata": {},
97+
"outputs": [],
98+
"source": [
99+
"import nvmath\n",
100+
"\n",
101+
"# Kernel 1, 2, and 3 are fused into a single kernel\n",
102+
"C = nvmath.linalg.advanced.matmul(A, B, epilog=nvmath.linalg.advanced.MatmulEpilog.RELU_BIAS, epilog_inputs={\"bias\": b})\n",
103+
"\n",
104+
"# print(f\"A: {A}\")\n",
105+
"# print(f\"B: {B}\")\n",
106+
"# print(f\"b: {b}\")\n",
107+
"# print(f\"C: {C}\")\n",
108+
"\n",
109+
"print(C.shape)\n",
110+
"print(C.device)"
111+
]
112+
},
113+
{
114+
"cell_type": "markdown",
115+
"id": "746a4d0e",
116+
"metadata": {},
117+
"source": [
118+
"**TODO: Ensure that `nvmath-python` results are identical for the same small inputs**\n",
119+
"\n",
120+
"All three kernels are fused into a single kernel using the JIT machinery behind `cuBLASLt`, which in certain problem settings may result in better overall performance due to an improved compute-to-memory-access ratio."
121+
]
122+
},
123+
{
124+
"cell_type": "markdown",
125+
"id": "114319f3",
126+
"metadata": {},
127+
"source": [
128+
"## Using custom FFT callbacks written in Python\n",
129+
"\n",
130+
"In the previous example the epilog was chosen from a predefined set of activation functions and their gradients. This example illustrates the case where the epilog is a custom Python function, which is compiled into an internal intermediate representation (LTO-IR) and then fused with the FFT operation into a single kernel.\n",
131+
"\n",
132+
"Specifically, we illustrate how to perform a convolution by providing a Python callback function as an epilog to the FFT operation.\n",
133+
"\n",
134+
"To begin with, let's create some input data. We will use the batched 1D FFT and apply a sine-form filter in the frequency domain."
135+
]
136+
},
137+
{
138+
"cell_type": "code",
139+
"execution_count": null,
140+
"id": "cbea6060",
141+
"metadata": {},
142+
"outputs": [],
143+
"source": [
144+
"# Create the data for the batched 1-D FFT.\n",
145+
"B, N = 256, 1024\n",
146+
"a = cp.random.rand(B, N) + 1j * cp.random.rand(B, N)\n",
147+
"\n",
148+
"# Create the data to use as a filter.\n",
149+
"filter_data = cp.sin(a)"
150+
]
151+
},
152+
{
153+
"cell_type": "markdown",
154+
"id": "e4211da9",
155+
"metadata": {},
156+
"source": [
157+
"Next, we define the epilog function for the forward FFT: a convolution, which corresponds to pointwise multiplication in the frequency domain. We also scale by the FFT size `N` here."
158+
]
159+
},
160+
{
161+
"cell_type": "code",
162+
"execution_count": null,
163+
"id": "e1e27f3c",
164+
"metadata": {},
165+
"outputs": [],
166+
"source": [
167+
"def convolve(data_out, offset, data, filter_data, unused):\n",
168+
" data_out[offset] = data * filter_data[offset] / N"
169+
]
170+
},
171+
{
172+
"cell_type": "markdown",
173+
"id": "c39d8361",
174+
"metadata": {},
175+
"source": [
176+
"Note that we access `data_out` and `filter_data` with a single integer `offset`, even though the output and `filter_data` are 2D tensors (batches of samples). Care must be taken to ensure that both arrays accessed here have the same memory layout.\n",
177+
"\n",
178+
"The next step is to compile the epilog to the intermediate representation (LTO-IR). On a system with GPUs of different compute capabilities, the `compute_capability` option must be passed to the `compile_prolog` or `compile_epilog` helpers. Alternatively, the epilog can be compiled in the context of the device on which the FFT using it will be executed. Here we use the current device context, where the operands were created:"
179+
]
180+
},
181+
{
182+
"cell_type": "code",
183+
"execution_count": null,
184+
"id": "1eda86b7",
185+
"metadata": {},
186+
"outputs": [],
187+
"source": [
188+
"with cp.cuda.Device():\n",
189+
" epilog = nvmath.fft.compile_epilog(convolve, \"complex128\", \"complex128\")"
190+
]
191+
},
192+
{
193+
"cell_type": "markdown",
194+
"id": "4e8418ea",
195+
"metadata": {},
196+
"source": [
197+
"Finally, we perform the convolution as the forward FFT with the compiled epilog (filter) followed by the inverse FFT transformation:"
198+
]
199+
},
200+
{
201+
"cell_type": "code",
202+
"execution_count": null,
203+
"id": "3c0e951b",
204+
"metadata": {},
205+
"outputs": [],
206+
"source": [
207+
"r = nvmath.fft.fft(a, axes=[-1], epilog={\"ltoir\": epilog, \"data\": filter_data.data.ptr})\n",
208+
"r = nvmath.fft.ifft(r, axes=[-1])"
209+
]
210+
}
211+
],
212+
"metadata": {
213+
"kernelspec": {
214+
"display_name": "nersc-nvmath",
215+
"language": "python",
216+
"name": "python3"
217+
},
218+
"language_info": {
219+
"codemirror_mode": {
220+
"name": "ipython",
221+
"version": 3
222+
},
223+
"file_extension": ".py",
224+
"mimetype": "text/x-python",
225+
"name": "python",
226+
"nbconvert_exporter": "python",
227+
"pygments_lexer": "ipython3",
228+
"version": "3.13.5"
229+
}
230+
},
231+
"nbformat": 4,
232+
"nbformat_minor": 5
233+
}
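The notebook's first TODO — validating the fused `RELU_BIAS` epilog on small inputs — can be served by an unfused CPU reference. A minimal NumPy sketch follows; the seed is an arbitrary choice of mine, while the sizes are the notebook's own commented-out small case (`m, k, n = 2, 5, 4`):

```python
import numpy as np

def relu(x):
    # Element-wise ReLU, matching the notebook's CuPy definition
    return np.maximum(0, x)

# Small, easy-to-comprehend sizes (the notebook's commented-out case)
m, k, n = 2, 5, 4
rng = np.random.default_rng(42)
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
b = rng.standard_normal(m)

# Unfused reference, three separate steps: C = ReLU(A @ B + b^T)
matmul_result = A @ B
bias_result = matmul_result + b.reshape(-1, 1)
C_ref = relu(bias_result)

print(C_ref.shape)               # (2, 4)
print(bool((C_ref >= 0).all()))  # True: ReLU output is non-negative
```

On a GPU machine, `C_ref` (moved to the device, e.g. with `cp.asarray`) can then be compared against the fused `nvmath` result using `cp.allclose`.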

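Similarly, the FFT-epilog convolution in the second notebook has a straightforward NumPy reference. Note that `np.fft.ifft` normalizes by `N` itself, which plays the role of the explicit `/ N` inside the `convolve` epilog (cuFFT's inverse transform is unnormalized). The scaled-down sizes and seed here are assumptions for quick CPU execution, not the notebook's `B, N = 256, 1024`:

```python
import numpy as np

# Scaled-down batched 1-D data
batch, N = 8, 32
rng = np.random.default_rng(0)
a = rng.random((batch, N)) + 1j * rng.random((batch, N))
filter_data = np.sin(a)

# Reference pipeline: forward FFT, pointwise multiply by the filter,
# inverse FFT. The 1/N scaling is folded into np.fft.ifft.
freq = np.fft.fft(a, axis=-1)
r = np.fft.ifft(freq * filter_data, axis=-1)

print(r.shape)  # (8, 32)

# Sanity check: with an all-ones filter the pipeline is an identity.
identity = np.fft.ifft(np.fft.fft(a, axis=-1) * np.ones_like(a), axis=-1)
print(bool(np.allclose(identity, a)))  # True
```

The fused `nvmath.fft.fft(..., epilog=...)` result should agree with this reference up to floating-point tolerance when run with matching inputs on a GPU.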