Several bug fixes in drjit-core #503
This new test exceeds the available memory of the Windows CI machine (an RTX 2080, I think), so something about the back-of-the-envelope calculation of the required memory isn't right:

Sorry for the extra debugging on your end. With the 4 GiB I was hoping to sweep a bit more broadly than just the regression, but it makes sense to stay under 1 GiB for tests, and I will keep that in mind for the future. I pushed a commit that brings the memory use down and targets the regression directly. The minimal reproducer is actually 1 GiB + 4 B (idx = 2^28 is needed for the index overlap to occur), which drjit's allocator rounds up to 2 GiB, leaving the test at ~2.5 GiB total. Happy to drop the test, restrict it to Linux CI, or replace it with a static PTX-text check, whichever you prefer.
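The 1 GiB + 4 B figure can be checked with a little arithmetic. The sketch below is only a model, not drjit-core code: it assumes 4-byte (u32) counters and a hash that shifts the 64-bit address by 2 bits before truncating to 32 bits. The shift amount, the base address, and all helper names are illustrative assumptions.

```python
from typing import Callable, Optional

MASK32 = 0xFFFFFFFF
SHIFT = 2        # assumed shift amount: the two low bits of a 4-byte-aligned address are zero
ELEM_SIZE = 4    # assumed element size: u32 counters
BASE = 0x7F0000000000  # hypothetical, 4-byte-aligned device address


def hash_shl(ptr: int) -> int:
    """Buggy variant: the left shift keeps only address bits 0..29."""
    return (ptr << SHIFT) & MASK32


def hash_shr(ptr: int) -> int:
    """Fixed variant: the right shift keeps address bits 2..33."""
    return (ptr >> SHIFT) & MASK32


def first_colliding_index(hash_fn: Callable[[int], int], base: int) -> Optional[int]:
    """Smallest power-of-two index below 2^32 whose address hashes like index 0."""
    for k in range(32):
        idx = 1 << k
        if hash_fn(base + idx * ELEM_SIZE) == hash_fn(base):
            return idx
    return None


def min_array_bytes(idx: int) -> int:
    """Bytes needed for an array that contains element `idx`."""
    return (idx + 1) * ELEM_SIZE
```

Under these assumptions, `first_colliding_index(hash_shl, BASE)` returns `2**28`, and `min_array_bytes(2**28)` is `2**30 + 4`, i.e. 1 GiB + 4 B. With `hash_shr`, no index below 2^32 collides with index 0.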
This PR bumps the drjit-core submodule to pick up a series of bug fixes, and adds regression tests for several of them.
Currently, the included tests only cover logic bugs. This keeps the tests concise, but I'm happy to add tests for any of the remaining fixes before this gets merged.
- drjit-core c8c9c2d, test 3db9a65f. scatter_inc: fix peer-match pointer hash (shl to shr). We convert the 64-bit pointer to 32 bits in order to use `match.any.sync.b32`, but a typo used a left shift here, so offsets from 2^28 onward overlap with the start of the array.
- drjit-core 955ea2f. scatter_{exch,cas}: fix address stride for non-u32 values. The byte stride for the target address was being computed from the index type (always 4 bytes) instead of the value type, so any value that is not 4 bytes wide was scattered to the wrong slot.
- drjit-core 7f97a64. local memory: fix masked read for bool/8-bit types. The default initialization of the result register for masked-off lanes used `mov.b1` or `mov.b8`, which do not exist in PTX, and the LLVM path had a matching type mismatch and a name collision between the read and write intermediates.
- drjit-core fa4597d, test c1aaab53. coop_vec: fix fma zero shortcut for `fma(a, b, 0)`. The shortcut branch was guarded on the second operand being zero instead of the third operand.
- drjit-core ce486d1. coop_vec: fix ternary op type check typo (a1 to a2). The type validator checked `a0->type != a1->type` twice instead of also checking `a2`, so a third argument with a different type slipped through silently.
- drjit-core 523d49b. cuda_tex: fix memset size for `CUDA_MEMCPY3D` d2t path. The struct was being zeroed with `sizeof(CUDA_MEMCPY2D)` instead of `sizeof(CUDA_MEMCPY3D)`.
- drjit-core 7a96a60. cuda: uint8/int8 for `min`, `max`, `div`, and `mod`. These ops were missing from the `jitc_int8_unsupported` list.
- drjit-core bfe874f, test 2db675e0. loop: fix side effect size scan to walk the list. The size scan inside `jitc_var_loop_end` was reading `se_list.back()` on every iteration instead of indexing the list, so only the last side effect contributed to the kernel size.
- drjit-core 7370ca6. op: avoid UB for literal integer div and mod. The literal folding path applied the host C++ `/` and `%` operators directly, which is undefined behavior for divide by zero and produced a SIGFPE. We decided to follow the CUDA semantics in this case.
- drjit-core 0d40178. scatter_inc: return 0 for masked-off lanes. Two parts: the CUDA runtime mask path left the result register uninitialized on masked-off lanes, and the literal mask shortcut returned `UINT32_MAX`.
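For reviewers who want a quick picture of the scatter_{exch,cas} stride fix, here is a host-side model in Python. It is a sketch under assumptions: a `bytearray` stands in for device memory, and `scatter`, `read_slots`, and `demo` are hypothetical helpers, not drjit-core functions. It shows a 64-bit value landing in the wrong slot when the index is scaled by the index size instead of the value size.

```python
import struct

INDEX_SIZE = 4  # bytes in a u32 index


def scatter(buf: bytearray, idx: int, value: float, fmt: str, stride: int) -> None:
    """Write `value` at byte offset idx * stride (models the address computation)."""
    struct.pack_into(fmt, buf, idx * stride, value)


def read_slots(buf: bytearray, fmt: str) -> list:
    """Decode the buffer as a dense array of `fmt` elements."""
    size = struct.calcsize(fmt)
    return [struct.unpack_from(fmt, buf, off)[0] for off in range(0, len(buf), size)]


def demo(stride_from_value_type: bool) -> list:
    fmt = "<d"                         # 8-byte float64 value
    value_size = struct.calcsize(fmt)  # 8
    buf = bytearray(4 * value_size)    # four zero-initialized slots
    stride = value_size if stride_from_value_type else INDEX_SIZE
    scatter(buf, 2, 1.5, fmt, stride)  # intended target: slot 2
    return read_slots(buf, fmt)
```

`demo(False)` models the old behavior and puts the value in slot 1, while `demo(True)` puts it in slot 2 as intended. For 4-byte values both strides coincide, which would explain why u32 scatters were unaffected.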
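The loop side-effect fix follows a familiar pattern: a loop that iterates over a list but reads its last element on every pass. The Python model below uses hypothetical names and assumes the scan keeps the maximum size seen. The description above only says that the last side effect alone contributed to the kernel size, so the exact reduction is an assumption.

```python
def kernel_size_buggy(se_sizes: list) -> int:
    """Models the old scan: every iteration looks at the last entry."""
    size = 0
    for _ in range(len(se_sizes)):
        size = max(size, se_sizes[-1])  # analogous to se_list.back()
    return size


def kernel_size_fixed(se_sizes: list) -> int:
    """Models the fixed scan: every iteration looks at its own entry."""
    size = 0
    for i in range(len(se_sizes)):
        size = max(size, se_sizes[i])
    return size
```

With side effects of sizes `[1024, 16]`, the buggy scan reports 16 and the fixed scan reports 1024. The two agree whenever the largest side effect happens to be last, which may be why the bug went unnoticed.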