GPU support for Cartesian problems by kburns · Pull Request #324 · DedalusProject/dedalus

kburns · 2025-10-21T16:05:52Z

This PR adds GPU support for one-dimensional bases and Cartesian problems. Remaining rough edges include choosing good defaults for subproblem coupling, and raising errors or warnings for unsupported features (GPU+MPI, GPU+curvilinear).

csskene · 2025-10-28T16:09:57Z

Just noting here that the trace operator doesn't work on GPUs, as cupy's einsum doesn't take an out argument. A fix similar to the one for the dot product (commit fb9b3d6) fixes things.
Also, problem.add_equation will not work for float32, as unify attributes for dtype raises an 'Objects are not all equal' error.
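
For illustration, one way to work around an einsum that lacks an out keyword is to compute the result and then copy it into the preallocated buffer. The sketch below uses NumPy as a stand-in for cupy so that it runs without a GPU. The name einsum_into is a hypothetical helper, and this is not necessarily what commit fb9b3d6 does.

```python
import numpy as np

def einsum_into(xp, subscripts, *operands, out):
    """Evaluate xp.einsum and store the result in a preallocated array.

    Hypothetical helper: avoids passing out= to einsum, at the cost of
    one temporary array and a copy.
    """
    out[...] = xp.einsum(subscripts, *operands)
    return out

# Trace of a 3x3 tensor field stored as (component, component, grid point)
tensor = np.arange(3 * 3 * 4, dtype=np.float32).reshape(3, 3, 4)
trace = np.empty(4, dtype=np.float32)
einsum_into(np, "iij->j", tensor, out=trace)
```

Passing the array namespace (xp) explicitly keeps the same helper usable with either numpy or cupy, assuming both expose einsum with the same positional signature.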

csskene · 2025-11-10T11:01:30Z

Interpolate doesn't currently work either; it fails with the error:
...
File "~/Packages/dedalus_gpu/dedalus/core/operators.py", line 975, in _subspace_matrix
return cls._full_matrix(input_basis, output_basis, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: InterpolateRealFourier._full_matrix() missing 1 required positional argument: 'position'

csskene · 2026-02-19T21:09:28Z

> Just noting here that the trace operator doesn't work on GPU's as cupy's einsum doesn't take an out argument. A similar fix to dot product (commit fb9b3d6) fixes things. Also problem.add_equation will not work for float32 as unify attributes for dtype raises an 'Objects are not all equal' error.

trace is fixed as of 6e32312.
I've traced the dtype error to MultiplyNumberField. Specifically, this line:

self.dtype = np.result_type(type(arg0), arg1.dtype)

If arg0 is a Python int and arg1 is np.float32, this call returns np.float64, leading to the error. The error is unavoidable, since subtraction invokes a multiplication by the int -1: for example, f - g is evaluated as f + ((-1)*g). I am not sure what the best fix is. I've tried

self.dtype = np.result_type(arg0, arg1.dtype)

which fixes things, i.e. passing arg0 rather than type(arg0). This gives np.float32, which is the correct behaviour here. I wanted to check that this is a good fix before pushing.
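
The difference between the two calls can be reproduced with NumPy alone. This is a minimal sketch; the variable names are illustrative and not taken from MultiplyNumberField.

```python
import numpy as np

field_dtype = np.dtype(np.float32)
scalar = -1  # Python int, as produced when f - g becomes f + (-1)*g

# Passing the *type* treats int as a full integer dtype, which promotes to float64
from_type = np.result_type(type(scalar), field_dtype)

# Passing the *value* lets the Python scalar defer to the field dtype
from_value = np.result_type(scalar, field_dtype)

print(from_type, from_value)  # prints: float64 float32
```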

kburns · 2026-05-05T14:49:58Z

We may want to use cupy's jit-rawkernel approach to handle type generality: https://docs.cupy.dev/en/stable/user_guide/kernel.html#jit-kernel-definition

csskene · 2026-05-07T17:19:43Z

> We may want to use cupy's jit-rawkernel approach to handle type generality: https://docs.cupy.dev/en/stable/user_guide/kernel.html#jit-kernel-definition

This seems to work here:

import cupy as cp
from cupyx import jit

@jit.rawkernel()
def apply_csr_mid_kernel(data, indices, indptr, x3, y3, N1, N2i, N2o, N3):
    n1 = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x # batch index
    n3 = jit.blockIdx.y * jit.blockDim.y + jit.threadIdx.y # output column index
    if n1 >= N1 or n3 >= N3:
        return
    # Loop over output rows = CSR matrix rows
    for i in range(N2o):
        y3[n1, i, n3] = 0
        start = indptr[i]
        end = indptr[i + 1]
        for k in range(start, end):
            j = indices[k]
            y3[n1, i, n3] += data[k] * x3[n1, j, n3]

def cupy_apply_csr_mid(matrix, array, out):
    N1, N2i, N3 = array.shape
    N2o = matrix.shape[0]
    N1 = cp.uint32(N1)
    N2i = cp.uint32(N2i)
    N3 = cp.uint32(N3)
    N2o = cp.uint32(N2o)
    # Choose thread/block config
    threads_y = min(1024, N3) # maximize concurrency along n3
    threads_x = 1024 // threads_y # make block have 1024 threads
    blockdim = (threads_x, threads_y)
    blocks_x = (N1 + threads_x - 1) // threads_x
    blocks_y = (N3 + threads_y - 1) // threads_y
    griddim = (blocks_x, blocks_y)
    # Launch kernel
    apply_csr_mid_kernel(griddim, blockdim, (matrix.data, matrix.indices, matrix.indptr, array, out, N1, N2i, N2o, N3))
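
A CPU reference with the same loop structure may be handy for checking the kernel output on small inputs. The sketch below is NumPy-only. dense_to_csr and numpy_apply_csr_mid are hypothetical names, and the SimpleNamespace container only mimics the data, indices, indptr and shape attributes of a CSR matrix.

```python
import numpy as np
from types import SimpleNamespace

def dense_to_csr(dense):
    """Pack a dense 2D array into CSR arrays (hypothetical stand-in for a csr_matrix)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        cols = np.nonzero(row)[0]
        data.extend(row[cols])
        indices.extend(cols)
        indptr.append(len(data))
    return SimpleNamespace(
        data=np.array(data, dtype=dense.dtype),
        indices=np.array(indices, dtype=np.int32),
        indptr=np.array(indptr, dtype=np.int32),
        shape=dense.shape,
    )

def numpy_apply_csr_mid(matrix, array, out):
    """CPU reference: out[n1, :, n3] = matrix @ array[n1, :, n3]."""
    N2o = matrix.shape[0]
    # Loop over output rows = CSR matrix rows, vectorized over n1 and n3
    for i in range(N2o):
        out[:, i, :] = 0
        for k in range(matrix.indptr[i], matrix.indptr[i + 1]):
            j = matrix.indices[k]
            out[:, i, :] += matrix.data[k] * array[:, j, :]
    return out

rng = np.random.default_rng(0)
dense = rng.standard_normal((5, 4)).astype(np.float32)
dense[np.abs(dense) < 0.5] = 0  # sparsify
x3 = rng.standard_normal((2, 4, 3)).astype(np.float32)
y3 = np.empty((2, 5, 3), dtype=np.float32)
numpy_apply_csr_mid(dense_to_csr(dense), x3, y3)
expected = np.einsum("ij,ajb->aib", dense, x3)
```

Running the same inputs through the GPU path and comparing against y3 for both float32 and float64 would exercise the type generality that the jit kernel is meant to provide.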

kburns · 2026-06-04T10:08:51Z

I've changed the Chebyshev transform default from the cupy DCT to matrix transforms for now, since even up to sizes of 1024 that seems much much faster. The cupy transform seems quite slow for some reason in my tests.

Note there is no CUDA-native DCT, so cupy implements the DCT as an extended FFT.
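
To make the two approaches concrete, here is a small NumPy sketch comparing a matrix transform with one textbook way of getting a DCT from an FFT of the even extension. It assumes the unnormalized DCT-II convention. It is not necessarily the algorithm cupy uses, nor the normalization Dedalus applies, and the function names are illustrative.

```python
import numpy as np

def dct2_matrix(N):
    """Dense DCT-II matrix: C[k, n] = 2 cos(pi k (2n + 1) / (2N))."""
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    return 2 * np.cos(np.pi * k * (2 * n + 1) / (2 * N))

def dct2_via_fft(x):
    """DCT-II from a length-2N FFT of the even extension [x, reversed x]."""
    N = x.shape[0]
    extended = np.concatenate([x, x[::-1]])
    spectrum = np.fft.fft(extended)[:N]
    # A half-sample phase shift turns the FFT output into cosine sums
    phase = np.exp(-1j * np.pi * np.arange(N) / (2 * N))
    return (phase * spectrum).real

x = np.random.default_rng(0).standard_normal(16)
y_matrix = dct2_matrix(16) @ x
y_fft = dct2_via_fft(x)
```

The matrix version costs O(N^2) per transform but is a single dense matrix product, while the FFT version needs an array twice as long plus extra copies and a phase multiplication.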

kburns added 28 commits May 27, 2025 14:47

Add array namespace option for field buffers

c8133e4

Add array-api-compat to setup.py

9d45669

Allow specifying array namespace by string

6db9593

Try fixing cupy allocation from buffer

a39b345

Fix cupy check

dd1f0f7

Add cupy-based complex fourier MMT

62ee03b

Fix transform lookup

68bbd21

Make fill_random array and dtype compatible

dce5d99

Work on cupy real fourier MMTs

e189f41

Generalize Fourier basis for more dtypes

cf8644d

Add cupy complex FFT

2fb0d32

Add cupy real fft

4fbe35b

Fix dtype conversion

8c7985d

Add array compat for basic arithmetic

d6a4525

Beginning adding array_compat to operators

6d05ff0

Quick implementation of apply_sparse for cupy

79d789c

Make einsum in dot compatible with cupy

fb9b3d6

Add custom kernel for cupy csr middle dot product

1e29a80

Convert local grids/modes to device arrays

d240656

Explicitly cast data norms to float

644f3bf

Cast grid spacing to device array in cartesian cfl

ef9091b

Convert field data gathers to numpy on gpu

426cad7

Fix subsystem gather/scatter to copy to/from gpu

c9f5bda

Allow for non-contiguous device copy

63f4033

Fix cupy csr kernel for double instead of float

68e2cb2

Move subsystems, coeff systems, and matrices to GPU

9421231

Build custom cupy superlu wrapper to reuse spsm descriptors

15a2d6e

Move all operator matrices to device. Add Chebyshev transforms

2e674b7

Make einsum in trace compatible with cupy

6e32312

csskene added 8 commits March 10, 2026 15:26

Fix dtype for MultiplyNumberField

115570e

Specify dtype for the CFL reducer

5e347b2

Ensure timestepping coefficients are the correct dtype

7d9e5a8

Convert Jacobi conversion matrices to specified dtype

3f53c89

Specify dtype for GlobalFlowProperty reducer

85a634d

Check if buff size grows even for same spsm descriptor. If it does make a new buff and recompute the analysis.

04f0b0e

Add custom kernel for apply_csr_mid for dtype float32

f9e1ff0

Convert subspace matrices to specified dtype

81b5a4a

csskene and others added 9 commits May 12, 2026 15:20

Use jit.rawkernel for apply_csr_mid_kernel

7429d58

Accumulate into acc in apply_csr_mid_kernel

9b0072b

Update distributor docstring

9f8a5f5

Start simplifying transform library defaults (broken for curvilinear)

d4e9923

Suppress docstring warnings

d066a7c

Add mock jit so linalg_gpu is still importable without cupy

afd169f

Add config option for default gpu subproblem coupling

4753120

Replace np with xp in timesteppers

579558c

Set matrix transform as default for Chebyshev

b0deada

GPU support for Cartesian problems #324
kburns wants to merge 46 commits into master from gpu

2 participants
