Several bug fixes in the drjit-core by lnuic · Pull Request #503 · mitsuba-renderer/drjit

lnuic · 2026-05-24T11:15:13Z

This PR bumps the drjit-core submodule to pick up a series of bug fixes, and adds regression tests for several of them.
So far, the included tests cover only the logic bugs. This keeps the test suite concise, but I'm happy to add tests for any of the other fixes before this gets merged.

drjit-core `c8c9c2d`, test `3db9a65f`

scatter_inc: fix peer-match pointer hash (shl to shr). We reduce the 64-bit pointer to 32 bits in order to use match.any.sync.b32, but a typo used a left shift instead of a right shift. As a result, targets at element offsets of 2^28 and beyond hashed to the same value as targets at the start of the array.

# requires roughly 1 GiB of GPU memory
import drjit as dr
from drjit.cuda import UInt32
c = dr.zeros(UInt32, (1 << 28) + 1)
idx = UInt32([0 if i % 2 == 0 else (1 << 28) for i in range(64)])
dr.scatter_inc(c, idx)
dr.eval(c)
print(int(c[0]), int(c[1 << 28]))   # post fix: 32 32
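
The aliasing can be reproduced on the host with plain integer arithmetic. The sketch below is only a model: the shift amount `SHIFT = 2` and the base address are assumptions chosen for illustration, not values taken from drjit-core.

```python
MASK32 = 0xFFFFFFFF
SHIFT = 2  # assumed shift amount (log2 of a 4-byte element), hypothetical

def hash_shl(ptr: int) -> int:
    """Buggy variant: the left shift pushes the high address bits out of range."""
    return (ptr << SHIFT) & MASK32

def hash_shr(ptr: int) -> int:
    """Fixed variant: the right shift drops only the low alignment bits."""
    return (ptr >> SHIFT) & MASK32

base = 0x7F00_0000_0000            # hypothetical device address
p0 = base                          # element 0 of a UInt32 array
p1 = base + (1 << 28) * 4          # element 2^28, i.e. 1 GiB further

collide_shl = hash_shl(p0) == hash_shl(p1)
collide_shr = hash_shr(p0) == hash_shr(p1)
print(collide_shl, collide_shr)    # True False
```

With the left shift, the two distinct addresses look like peers to the match instruction, which is the situation the reproducer above constructs with indices 0 and 2^28.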

drjit-core `955ea2f`

scatter_{exch,cas}: fix address stride for non-u32 values. The byte stride for the target address was computed from the index type (always 4 bytes) instead of the value type, so any value that is not 4 bytes wide was scattered to the wrong slot.

import drjit as dr
from drjit.cuda import UInt64, UInt32
target = dr.zeros(UInt64, 8)
dr.scatter_exch(target, UInt64(99), UInt32(4))
dr.eval(target)
print(list(target))   # post fix: 99 is stored in slot 4
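
The effect of the wrong stride can be modeled on the host with a byte buffer. This is a minimal sketch, not the drjit-core code path; the helper names `scatter_exch_u64` and `slots` are hypothetical.

```python
import struct

def scatter_exch_u64(buf: bytearray, value: int, index: int, stride: int) -> int:
    """Exchange a 64-bit value at byte offset index * stride; return the old value."""
    offset = index * stride
    old = struct.unpack_from("<Q", buf, offset)[0]
    struct.pack_into("<Q", buf, offset, value)
    return old

def slots(buf: bytearray) -> list:
    """View the 64-byte buffer as eight little-endian 64-bit slots."""
    return list(struct.unpack("<8Q", buf))

wrong = bytearray(64)
scatter_exch_u64(wrong, 99, 4, stride=4)   # stride taken from the 4-byte index type

right = bytearray(64)
scatter_exch_u64(right, 99, 4, stride=8)   # stride taken from the 8-byte value type

print(slots(wrong))   # [0, 0, 99, 0, 0, 0, 0, 0]
print(slots(right))   # [0, 0, 0, 0, 99, 0, 0, 0]
```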

drjit-core `7f97a64`

local memory: fix masked read for bool/8-bit types. The default initialization of the result register for masked-off lanes used mov.b1 or mov.b8, which do not exist in PTX. The LLVM path had a matching type mismatch and a name collision between the read and write intermediates.

# A prior Bool write is needed to leave %w0=1 in the register; pre fix
# the masked off lane then reads that stale 1 instead of False.
import drjit as dr
from drjit.cuda import Bool, UInt32
s = dr.alloc_local(Bool, 4, value=Bool(True, True))
s.write(Bool(True, True), UInt32(0, 0), Bool(True, True))
mask = UInt32(0, 1) > 0   # lane 0 = False, lane 1 = True
got = s.read(UInt32(1, 1), mask)
dr.eval(got)
print([bool(got[k]) for k in range(2)])   # pre fix: [True, True], post fix: [False, True]
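
The intended semantics of a masked read can be written as a small host-side reference. This is a sketch under the assumption that masked-off lanes return a zero-initialized default; `masked_read` is a hypothetical helper, not part of Dr.Jit.

```python
def masked_read(storage, indices, mask, default=False):
    """Per-lane read: active lanes load from storage, masked-off lanes get the default."""
    return [storage[i] if active else default
            for i, active in zip(indices, mask)]

storage = [True, True, True, True]      # local memory filled with True
result = masked_read(storage, [1, 1], [False, True])
print(result)                           # [False, True]
```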

drjit-core `fa4597d`, test `c1aaab53`

coop_vec: fix fma zero shortcut for fma(a, b, 0). The shortcut branch was guarded on the second operand being zero instead of the third operand.

import drjit as dr, drjit.nn as nn
from drjit.cuda import Float
a = nn.CoopVec(Float(2), Float(3))
z = nn.CoopVec(Float(0), Float(0))
c = nn.CoopVec(Float(11), Float(13))
r0, r1 = dr.fma(a, z, c)
dr.eval(r0, r1)
print(list(r0)[0], list(r1)[0])   # pre fix: 0.0 0.0, post fix: 11.0 13.0
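
The guard mix-up can be illustrated with scalars. The sketch assumes that the shortcut reduces `fma(a, b, 0)` to a plain multiplication, which is consistent with the pre-fix output above; the function names are hypothetical.

```python
def fma_buggy(a: float, b: float, c: float) -> float:
    # Guard tests the second operand, so c is dropped whenever b == 0
    if b == 0.0:
        return a * b
    return a * b + c

def fma_fixed(a: float, b: float, c: float) -> float:
    # Guard tests the third operand: a * b + 0 is just a * b
    if c == 0.0:
        return a * b
    return a * b + c

print(fma_buggy(2.0, 0.0, 11.0), fma_buggy(3.0, 0.0, 13.0))   # 0.0 0.0
print(fma_fixed(2.0, 0.0, 11.0), fma_fixed(3.0, 0.0, 13.0))   # 11.0 13.0
```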

drjit-core `ce486d1`

coop_vec: fix ternary op type check typo (a1 to a2). The type validator checked a0->type != a1->type twice instead of also checking a2, so a third argument with a different type slipped through silently.

import drjit as dr, drjit.nn as nn
from drjit.cuda import Float, Float16
a = nn.CoopVec(Float(1), Float(2))
b = nn.CoopVec(Float(1), Float(2))
c = nn.CoopVec(Float16(1), Float16(2))
try:
    dr.fma(a, b, c)
    print("pre fix: no error raised")
except RuntimeError as e:
    print(f"post fix: {e}")
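
The typo amounts to comparing the same pair of types twice. A minimal sketch, with type names as plain strings and hypothetical function names:

```python
def types_match_buggy(t0: str, t1: str, t2: str) -> bool:
    # The second comparison repeats t1, so t2 is never inspected
    return not (t0 != t1 or t0 != t1)

def types_match_fixed(t0: str, t1: str, t2: str) -> bool:
    # All three operands must share one type
    return not (t0 != t1 or t0 != t2)

args = ("float32", "float32", "float16")
print(types_match_buggy(*args), types_match_fixed(*args))   # True False
```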

drjit-core `523d49b`

cuda_tex: fix memset size for CUDA_MEMCPY3D d2t path. The struct was being zeroed with sizeof(CUDA_MEMCPY2D) instead of sizeof(CUDA_MEMCPY3D), so the trailing fields of the larger 3D struct were left uninitialized.

import drjit as dr
from drjit.cuda import Float, TensorXf, Texture3f
tex = Texture3f((4, 4, 4), 1)
tex.set_tensor(TensorXf(dr.arange(Float, 64), (4, 4, 4, 1)))
print(tex.eval((0.5, 0.5, 0.5)))

drjit-core `7a96a60`

cuda: uint8/int8 for min, max, div, and mod. These ops were missing from the jitc_int8_unsupported list.

import drjit as dr
from drjit.cuda import UInt8
r = dr.minimum(UInt8(5, 7, 3), UInt8(4, 4, 4))
dr.eval(r)
print(list(r))   # post fix: [4, 4, 3]

drjit-core `bfe874f`, test `2db675e0`

loop: fix side effect size scan to walk the list. The size scan inside jitc_var_loop_end was reading se_list.back() on every iteration instead of indexing the list, so only the last side effect contributed to the kernel size.

import drjit as dr
from drjit.cuda import UInt32
big   = dr.zeros(UInt32, 200); dr.make_opaque(big)
small = dr.zeros(UInt32, 4);   dr.make_opaque(small)
@dr.syntax
def f():
    i = UInt32(0)
    while i < 1:
        dr.scatter(big,   dr.arange(UInt32, 100), dr.arange(UInt32, 100))
        dr.scatter(small, UInt32(42), UInt32(0))
        i += 1
f()
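
The scan bug can be sketched in a few lines. The sketch assumes that the kernel size is the maximum over the sizes of all side effects; the function names and the `se_sizes` list are hypothetical stand-ins for the real data structures.

```python
def kernel_size_buggy(se_sizes: list) -> int:
    size = 0
    for _ in range(len(se_sizes)):
        size = max(size, se_sizes[-1])   # always reads the last entry
    return size

def kernel_size_fixed(se_sizes: list) -> int:
    size = 0
    for i in range(len(se_sizes)):
        size = max(size, se_sizes[i])    # walks every entry
    return size

se_sizes = [100, 1]   # scatter into big (100 lanes), then into small (1 lane)
print(kernel_size_buggy(se_sizes), kernel_size_fixed(se_sizes))   # 1 100
```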

drjit-core `7370ca6`

op: avoid UB for literal integer div and mod. The literal folding path applied the host C++ / and % operators directly, which is undefined behavior for division by zero and crashed the process with SIGFPE. We decided to follow the CUDA semantics in this case.

import drjit as dr
from drjit.cuda import Int32
r = Int32(7) // Int32(0)
dr.eval(r)
print(r[0])   # pre fix: SIGFPE (process crash), post fix: -1
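
A guarded folding step for signed 32-bit division could look like the sketch below. It models only the division case shown above, using the -1 result from the reproducer; the behavior for mod and for overflow (INT_MIN / -1) is not modeled, and `fold_div_i32` is a hypothetical name.

```python
def fold_div_i32(a: int, b: int) -> int:
    """Fold a literal signed division without invoking host undefined behavior."""
    if b == 0:
        return -1                      # all bits set, as in the post-fix output above
    q = abs(a) // abs(b)               # C++ truncates toward zero; Python's // floors
    return q if (a < 0) == (b < 0) else -q

print(fold_div_i32(7, 0), fold_div_i32(7, 2), fold_div_i32(-7, 2))   # -1 3 -3
```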

drjit-core `0d40178`

scatter_inc: return 0 for masked off lanes. Two parts: the CUDA runtime mask path left the result register uninitialized on masked off lanes, and the literal mask shortcut returned UINT32_MAX.

import drjit as dr
from drjit.cuda import UInt32, Bool
c_lit = dr.zeros(UInt32, 4)
c_rt  = dr.zeros(UInt32, 4)
dr.eval(c_lit, c_rt)
offs_lit = dr.scatter_inc(c_lit, UInt32(0, 0, 0, 0), False)
offs_rt  = dr.scatter_inc(c_rt,  UInt32(0, 0, 0, 0), Bool(False, False, False, False))
dr.eval(offs_lit, offs_rt)
print(list(offs_lit), list(offs_rt))
# pre fix: [4294967295,4294967295,4294967295,4294967295] vs stale (e.g. [3,3,3,3])
# post fix: [0,0,0,0] [0,0,0,0]
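
The post-fix contract can be expressed as a sequential host-side reference. This is a sketch only: on the GPU the order in which active lanes receive their offsets is not fixed, and `scatter_inc_ref` is a hypothetical helper.

```python
def scatter_inc_ref(counters: list, indices: list, active: list) -> list:
    """Return the old counter value per active lane; masked-off lanes return 0."""
    out = []
    for idx, on in zip(indices, active):
        if on:
            out.append(counters[idx])
            counters[idx] += 1
        else:
            out.append(0)              # no stale register value, no UINT32_MAX
    return out

counters = [0, 0, 0, 0]
masked = scatter_inc_ref(counters, [0, 0, 0, 0], [False] * 4)
print(masked, counters)                # [0, 0, 0, 0] [0, 0, 0, 0]
```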

wjakob · 2026-05-25T20:02:57Z

This new test exceeds the available memory of the Windows CI machine (RTX2080 I think), so I think that something about the back-of-the-envelope calculation of the needed memory isn't right: test43_scatter_gather_power_of_two_indices. In general I think it's good to stay below 1GiB of memory usage per test since the Linux CI can run up to 3 tests in parallel, and students are sometimes using the machine at the same time.

lnuic · 2026-05-25T21:26:56Z

Sorry for the extra debugging on your end, I was hoping to sweep a bit broader than just the regression with the 4 GiB, but it makes sense to stay under 1 GiB for tests, I will keep that in mind for the future. Pushed a commit that brings it down and targets the regression directly, though the minimal reproducer is actually 1 GiB + 4 B (need idx = 2^28 for index overlap to occur), which drjit's allocator rounds up to 2 GiB, leaving the test at ~2.5 GiB total. Happy to drop the test, restrict it to Linux CI, or replace it with a static PTX-text check, whichever you prefer.
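
The allocation size mentioned in this reply can be checked with a few lines of arithmetic. The sketch assumes that the allocator rounds each request up to the next power of two, which matches the 2 GiB figure quoted above but is not confirmed here; `round_up_pow2` is a hypothetical helper.

```python
def round_up_pow2(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

GIB = 1 << 30
elements = (1 << 28) + 1          # index 2^28 must be valid
requested = elements * 4          # UInt32 entries, 4 bytes each
allocated = round_up_pow2(requested)

print(requested - GIB, allocated // GIB)   # 4 2
```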

lnuic added 3 commits May 24, 2026 10:56

test_memop: add power-of-2 scatter/gather peer-match hash regression test

3db9a65

test_coop_vec: test fma literal shortcut paths

c1aaab5

test_while_loop: multi-size loop side-effects

2db675e

lnuic mentioned this pull request May 24, 2026

Several bug fixes mitsuba-renderer/drjit-core#192

Merged

lnuic requested a review from wjakob May 25, 2026 09:13

test_memop: reduce scatter power-of-two test to 1GB

23a374d

lnuic force-pushed the cuda_ptx_fixes branch from 20f5486 to 23a374d Compare May 25, 2026 21:19

lnuic wants to merge 4 commits into master from cuda_ptx_fixes
