
Several bug fixes in the drjit-core #503

Open: lnuic wants to merge 4 commits into master from cuda_ptx_fixes

lnuic (Contributor) commented May 24, 2026

This PR bumps the drjit-core submodule to pick up a series of bug fixes, and adds regression tests for several of them.
Currently, the included tests cover only the logical bugs. This keeps the test suite concise, but I'm happy to add tests for any of the other fixes before this gets merged.


drjit-core c8c9c2d, test 3db9a65f

scatter_inc: fix peer-match pointer hash (shl to shr). We convert the 64-bit pointer to 32 bits in order to use match.any.sync.b32, but a typo used a left shift instead of a right shift. As a result, an offset of 2^28 or larger is treated as overlapping with the start of the array.

# requires roughly 1 GiB of GPU memory
import drjit as dr
from drjit.cuda import UInt32
c = dr.zeros(UInt32, (1 << 28) + 1)
idx = UInt32([0 if i % 2 == 0 else (1 << 28) for i in range(64)])
dr.scatter_inc(c, idx)
dr.eval(c)
print(int(c[0]), int(c[1 << 28]))
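To illustrate the aliasing without a GPU, here is a minimal pure-Python sketch of the hashing step. The base address, the shift amount of 2, and the helper names are assumptions made for illustration and are not taken from the drjit-core source.

```python
MASK32 = 0xFFFFFFFF
SHIFT = 2  # assumed shift amount (log2 of a 4-byte element); hypothetical


def hash_shl(ptr: int) -> int:
    """Buggy variant: the left shift pushes high address bits out of the 32-bit window."""
    return (ptr << SHIFT) & MASK32


def hash_shr(ptr: int) -> int:
    """Fixed variant: the right shift drops the low alignment bits instead."""
    return (ptr >> SHIFT) & MASK32


base = 0x7F0000000000          # hypothetical 64-bit device address
p_first = base                 # address of element 0
p_far = base + (1 << 28) * 4   # address of element 2^28 (4-byte entries)

# With the left shift, both addresses produce the same 32-bit key.
collides_shl = hash_shl(p_first) == hash_shl(p_far)
collides_shr = hash_shr(p_first) == hash_shr(p_far)
print(collides_shl, collides_shr)
```

Under these assumptions, the two addresses collide only with the left shift, which matches the reproducer's use of indices 0 and 1 << 28.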

drjit-core 955ea2f

scatter_{exch,cas}: fix address stride for non-u32 values. The byte stride for the target address was computed from the index type (always 4 bytes) instead of the value type, so any value that is not 4 bytes wide was scattered to the wrong slot.

import drjit as dr
from drjit.cuda import UInt64, UInt32
target = dr.zeros(UInt64, 8)
dr.scatter_exch(target, UInt64(99), UInt32(4))
dr.eval(target)
print(list(target))
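The address arithmetic behind this bug can be sketched in plain Python. The helper name `target_slot` is hypothetical; the sketch only assumes that the target address is the base address plus index times a byte stride.

```python
def target_slot(index: int, stride_bytes: int, value_bytes: int) -> int:
    """Return the element slot that a write at index * stride_bytes lands in."""
    byte_offset = index * stride_bytes
    return byte_offset // value_bytes


INDEX_BYTES = 4  # UInt32 index
VALUE_BYTES = 8  # UInt64 value

# Stride taken from the index type: byte offset 16 falls into slot 2.
slot_buggy = target_slot(4, INDEX_BYTES, VALUE_BYTES)
# Stride taken from the value type: byte offset 32 falls into slot 4.
slot_fixed = target_slot(4, VALUE_BYTES, VALUE_BYTES)
print(slot_buggy, slot_fixed)
```

For 4-byte values both strides agree, which is presumably why the bug only showed up with other value widths.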

drjit-core 7f97a64

local memory: fix masked read for bool/8-bit types. The default initialization of the result register for masked-off lanes used mov.b1 or mov.b8, which do not exist in PTX. The LLVM path had a matching type mismatch and a name collision between the read and write intermediates.

# A prior Bool write is needed to leave %w0=1 in the register; pre fix
# the masked off lane then reads that stale 1 instead of False.
import drjit as dr
from drjit.cuda import Bool, UInt32
s = dr.alloc_local(Bool, 4, value=Bool(True, True))
s.write(Bool(True, True), UInt32(0, 0), Bool(True, True))
mask = UInt32(0, 1) > 0   # lane 0 = False, lane 1 = True
got = s.read(UInt32(1, 1), mask)
dr.eval(got)
print([bool(got[k]) for k in range(2)])   # pre fix: [True, True], post fix: [False, True]

drjit-core fa4597d, test c1aaab53

coop_vec: fix fma zero shortcut for fma(a, b, 0). The shortcut branch was guarded on the second operand being zero instead of the third operand.

import drjit as dr, drjit.nn as nn
from drjit.cuda import Float
a = nn.CoopVec(Float(2), Float(3))
z = nn.CoopVec(Float(0), Float(0))
c = nn.CoopVec(Float(11), Float(13))
r0, r1 = dr.fma(a, z, c)
dr.eval(r0, r1)
print(list(r0)[0], list(r1)[0])   # pre fix: 0.0 0.0, post fix: 11.0 13.0
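A scalar sketch of the shortcut logic, assuming the shortcut simply returns the product when it fires. The function names are hypothetical and the real code operates on cooperative vectors rather than Python floats.

```python
def fma_buggy(a: float, b: float, c: float) -> float:
    """Shortcut guarded on the wrong operand (b instead of c)."""
    if b == 0.0:
        return a * b      # addend c is silently dropped
    return a * b + c


def fma_fixed(a: float, b: float, c: float) -> float:
    """Shortcut fires only when the addend c is zero, so a * b + 0 == a * b."""
    if c == 0.0:
        return a * b
    return a * b + c


print(fma_buggy(2.0, 0.0, 11.0), fma_fixed(2.0, 0.0, 11.0))
```

With the inputs from the reproducer (a = 2, b = 0, c = 11), the buggy guard returns 0.0 while the corrected guard returns 11.0, in line with the pre-fix and post-fix values quoted above.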

drjit-core ce486d1

coop_vec: fix ternary op type check typo (a1 to a2). The type validator checked a0->type != a1->type twice instead of also checking a2, so a third argument with a different type slipped through silently.

import drjit as dr, drjit.nn as nn
from drjit.cuda import Float, Float16
a = nn.CoopVec(Float(1), Float(2))
b = nn.CoopVec(Float(1), Float(2))
c = nn.CoopVec(Float16(1), Float16(2))
try:
    dr.fma(a, b, c)
    print("pre fix: no error raised")
except RuntimeError as e:
    print(f"post fix: {e}")
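The typo can be mirrored in a few lines of Python. The function names and the string type tags are hypothetical stand-ins for the a0->type, a1->type, and a2->type comparisons described above.

```python
def types_ok_buggy(t0: str, t1: str, t2: str) -> bool:
    """Compares t0 against t1 twice, so t2 is never inspected."""
    return not (t0 != t1 or t0 != t1)


def types_ok_fixed(t0: str, t1: str, t2: str) -> bool:
    """Compares t0 against both t1 and t2."""
    return not (t0 != t1 or t0 != t2)


# A Float16 third argument slips through the buggy check.
print(types_ok_buggy("f32", "f32", "f16"), types_ok_fixed("f32", "f32", "f16"))
```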

drjit-core 523d49b

cuda_tex: fix memset size for CUDA_MEMCPY3D d2t path. The struct was being zeroed with sizeof(CUDA_MEMCPY2D) instead of sizeof(CUDA_MEMCPY3D).

import drjit as dr
from drjit.cuda import Float, TensorXf, Texture3f
tex = Texture3f((4, 4, 4), 1)
tex.set_tensor(TensorXf(dr.arange(Float, 64), (4, 4, 4, 1)))
print(tex.eval((0.5, 0.5, 0.5)))

drjit-core 7a96a60

cuda: uint8/int8 for min, max, div, and mod. These ops were missing from the jitc_int8_unsupported list.

import drjit as dr
from drjit.cuda import UInt8
r = dr.minimum(UInt8(5, 7, 3), UInt8(4, 4, 4))
dr.eval(r)
print(list(r))

drjit-core bfe874f, test 2db675e0

loop: fix side effect size scan to walk the list. The size scan inside jitc_var_loop_end was reading se_list.back() on every iteration instead of indexing the list, so only the last side effect contributed to the kernel size.

import drjit as dr
from drjit.cuda import UInt32
big   = dr.zeros(UInt32, 200); dr.make_opaque(big)
small = dr.zeros(UInt32, 4);   dr.make_opaque(small)
@dr.syntax
def f():
    i = UInt32(0)
    while i < 1:
        dr.scatter(big,   dr.arange(UInt32, 100), dr.arange(UInt32, 100))
        dr.scatter(small, UInt32(42), UInt32(0))
        i += 1
f()
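The scan bug can be sketched with a plain Python list. This assumes the kernel size is the maximum over the side-effect sizes, which is a simplification for illustration; the function names are hypothetical.

```python
def kernel_size_buggy(se_sizes: list[int]) -> int:
    """Reads the last entry on every iteration, like se_list.back()."""
    size = 0
    for _ in range(len(se_sizes)):
        size = max(size, se_sizes[-1])
    return size


def kernel_size_fixed(se_sizes: list[int]) -> int:
    """Indexes the list so that every side effect contributes."""
    size = 0
    for i in range(len(se_sizes)):
        size = max(size, se_sizes[i])
    return size


# Sizes of the two scatters in the reproducer: 100 entries, then 1 entry.
sizes = [100, 1]
print(kernel_size_buggy(sizes), kernel_size_fixed(sizes))
```

In this sketch the buggy scan reports a size of 1 because only the last side effect is ever examined.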

drjit-core 7370ca6

op: avoid UB for literal integer div and mod. The literal folding path applied the host C++ / and % operators directly, which is undefined behavior for divide by zero and produced a SIGFPE. We decided to follow the CUDA semantics in this case.

import drjit as dr
from drjit.cuda import Int32
r = Int32(7) // Int32(0)
dr.eval(r)
print(r[0])   # pre fix: SIGFPE (process crash), post fix: -1
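A guarded folding step for signed 32-bit division might look like the sketch below. It only covers the divide-by-zero case, using the post-fix result of -1 quoted above; the function name is hypothetical, and the modulo case and other edge cases are deliberately left out because the description does not state their results.

```python
def fold_div_i32(a: int, b: int) -> int:
    """Fold a literal division without ever dividing by zero on the host."""
    if b == 0:
        return -1  # all bits set, the post-fix result reported for 7 // 0
    # Truncate toward zero like C++ integer division (Python's // floors).
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q


print(fold_div_i32(7, 0), fold_div_i32(7, 2), fold_div_i32(-7, 2))
```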

drjit-core 0d40178

scatter_inc: return 0 for masked off lanes. Two parts: the CUDA runtime mask path left the result register uninitialized on masked off lanes, and the literal mask shortcut returned UINT32_MAX.

import drjit as dr
from drjit.cuda import UInt32, Bool
c_lit = dr.zeros(UInt32, 4)
c_rt  = dr.zeros(UInt32, 4)
dr.eval(c_lit, c_rt)
offs_lit = dr.scatter_inc(c_lit, UInt32(0, 0, 0, 0), False)
offs_rt  = dr.scatter_inc(c_rt,  UInt32(0, 0, 0, 0), Bool(False, False, False, False))
dr.eval(offs_lit, offs_rt)
print(list(offs_lit), list(offs_rt))
# pre fix: [4294967295,4294967295,4294967295,4294967295] vs stale (e.g. [3,3,3,3])
# post fix: [0,0,0,0] [0,0,0,0]

wjakob (Member) commented May 25, 2026

This new test exceeds the available memory of the Windows CI machine (an RTX 2080, I think), so something about the back-of-the-envelope calculation of the required memory isn't right: test43_scatter_gather_power_of_two_indices. In general, it's good to stay below 1 GiB of memory usage per test, since the Linux CI can run up to 3 tests in parallel and students are sometimes using the machine at the same time.

lnuic force-pushed the cuda_ptx_fixes branch from 20f5486 to 23a374d on May 25, 2026 21:19
lnuic (Contributor, Author) commented May 25, 2026

Sorry for the extra debugging on your end. I was hoping to sweep a bit more broadly than just the regression with the 4 GiB, but it makes sense to stay under 1 GiB for tests, and I will keep that in mind for the future. I pushed a commit that brings the memory usage down and targets the regression directly. However, the minimal reproducer is actually 1 GiB + 4 B (idx = 2^28 is needed for the index overlap to occur), which drjit's allocator rounds up to 2 GiB, leaving the test at ~2.5 GiB total. Happy to drop the test, restrict it to Linux CI, or replace it with a static PTX-text check, whichever you prefer.
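The size arithmetic for the counter array can be checked with a short sketch. Rounding up to the next power of two is an assumption about the allocator policy, chosen because it reproduces the 2 GiB figure mentioned above; `next_pow2` is a hypothetical helper.

```python
GIB = 1 << 30


def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


n_elements = (1 << 28) + 1      # index 2^28 must be a valid slot
requested = n_elements * 4      # UInt32 entries: 1 GiB + 4 B
rounded = next_pow2(requested)  # assumed allocator rounding

print(requested - GIB, rounded // GIB)
```

The four extra bytes are enough to push the request past 1 GiB, so under this assumption the allocation doubles to 2 GiB.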
