Fork a small set of nanoVDB headers to share PyTorch's CUDA allocator by fwilliams · Pull Request #655 · openvdb/fvdb-core

fwilliams · 2026-05-19T17:26:46Z

Summary

Add an in-tree override directory at src/fvdb/nanovdb_overrides/ that holds modified copies of a small number of upstream nanoVDB headers. The build prepends this directory to the include search path so that #include <nanovdb/...> for the forked files resolves here, while the rest of nanoVDB continues to come from the upstream source tree (still fetched unmodified at its pinned SHA).

Two changes are forked in:

nanovdb/cuda/DeviceBuffer.h and nanovdb/cuda/DeviceResource.h route device-side allocations through c10::cuda::CUDACachingAllocator (PyTorch's caching allocator) instead of cudaMallocAsync / cudaFreeAsync. This keeps transient scratch allocations made by nanoVDB internals (MergeGrids, TopologyBuilder, DilateGrid, …) inside the same pool that fvdb / PyTorch tensors use. Without this, the two pools fragment each other and large workloads such as multi-frame TSDF integration hit a clean OOM even when the GPU still has free memory in aggregate.
nanovdb/tools/cuda/TopologyBuilder.cuh adds an opt-in debug trace gated on the FVDB_NANOVDB_TRACE_ALLOCS env var that prints tile count and per-call scratch size from allocateInternalMaskBuffers. Useful for diagnosing topology-op memory blowup on large scenes; off by default and otherwise byte-identical with upstream.

Each forked header carries a short FVDB FORK: banner documenting what diverges from upstream and why, with inline // FVDB FORK: tags on every non-trivial code change so git blame and text search make the delta obvious. src/fvdb/nanovdb_overrides/README.md documents the resync and add-an-override procedures.

CMake wiring lives in src/cmake/get_nanovdb.cmake, with mirror include-path orderings in src/CMakeLists.txt (the fvdb library target) and the top-level CMakeLists.txt (the _fvdb_cpp pybind module).

Why a directory of forked headers rather than a CPM patch?

CPM's PATCH_COMMAND mechanism is fragile: it silently stops applying when upstream moves the surrounding lines, and it makes it hard to edit nanoVDB during development. Forking the exact headers we care about into this tree gives us full edit access with a narrow, reviewable patch surface and a clean diff-against-upstream workflow.

Test plan

./build.sh install succeeds with the new include ordering on a clean tree.
cd tests && pytest unit/ passes against the new install (verified locally, full suite green including the existing TSDF and nanoVDB I/O tests).
FVDB_NANOVDB_TRACE_ALLOCS=1 python -c "..." over any TSDF integration workload prints the per-call tile counts and scratch sizes; unsetting the var reverts to byte-identical upstream behavior.

Followups

This is the foundation for a follow-up PR that adds new TSDF / ESDF / Occupancy / Decay / fast-marching-cubes APIs (stacked on this branch). That PR depends on the allocator routing for the multi-frame integrator paths.

Copilot

Pull request overview

Forks a small set of nanoVDB headers into src/fvdb/nanovdb_overrides/ so that nanoVDB's device-side scratch allocations route through PyTorch's c10::cuda::CUDACachingAllocator instead of cudaMallocAsync/cudaFreeAsync. This unifies the CUDA memory pool between nanoVDB internals and PyTorch tensors to avoid dual-pool OOMs in workloads like multi-frame TSDF integration. Also adds an opt-in env-gated allocation trace (FVDB_NANOVDB_TRACE_ALLOCS).

Changes:

Forked nanovdb/cuda/DeviceBuffer.h and nanovdb/cuda/DeviceResource.h to allocate through PyTorch's caching allocator with optional env-gated alloc traces.
Forked nanovdb/tools/cuda/TopologyBuilder.cuh to add an env-gated tile-count/scratch-size debug print in allocateInternalMaskBuffers.
Updated CMake (src/cmake/get_nanovdb.cmake, src/CMakeLists.txt, top-level CMakeLists.txt) to prepend the overrides directory to the include search path; added a README documenting layout and resync procedure.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`src/fvdb/nanovdb_overrides/README.md`	Documents rationale, layout, and procedures for adding/resyncing overrides.
`src/fvdb/nanovdb_overrides/nanovdb/cuda/DeviceBuffer.h`	Routes device alloc/free through `c10::cuda::CUDACachingAllocator`; adds env-gated alloc trace.
`src/fvdb/nanovdb_overrides/nanovdb/cuda/DeviceResource.h`	Same allocator routing for scratch allocations used by CUB/topology ops.
`src/fvdb/nanovdb_overrides/nanovdb/tools/cuda/TopologyBuilder.cuh`	Adds env-gated tile-count/scratch-size trace in `allocateInternalMaskBuffers`.
`src/cmake/get_nanovdb.cmake`	Defines `FVDB_NANOVDB_OVERRIDES_DIR` and prepends it to the nanovdb interface include path.
`src/CMakeLists.txt`	Adds overrides dir before upstream nanoVDB sources on the `fvdb` target.
`CMakeLists.txt`	Mirror include ordering for the `_fvdb_cpp` pybind module.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

fwilliams · 2026-05-19T17:47:35Z

Addressed Copilot's review comments and the failing code-style checks (force-pushed amended commit, since this PR is still a draft):

Copilot #1 + #2 — raw_alloc vs documented stream-ordered semantics
Both call sites in DeviceBuffer.h::init / deviceUpload and DeviceResource.h::allocateAsync now use c10::cuda::CUDACachingAllocator::raw_alloc_with_stream(bytes, stream) instead of the plain raw_alloc. The comments in those sites already documented the intent (record the stream against the block so torch defers reuse until stream work completes, matching cudaMallocAsync's stream-ordered semantics); the code now actually does that. The (void)stream; silencers are gone.

Copilot #3 — README out of sync with directory contents
Both the ## Layout example block and the ## Current overrides table in src/fvdb/nanovdb_overrides/README.md now list all three forked headers (DeviceBuffer.h, DeviceResource.h, tools/cuda/TopologyBuilder.cuh) with a short line each on what diverges from upstream.

CI: C++ code style (clang-format) + Include guards failures
Added matching exclude / ignore lines in .github/workflows/codestyle.yml so the two style checks skip src/fvdb/nanovdb_overrides/. The forked headers intentionally keep upstream nanoVDB's formatting and include-guard names verbatim (apart from documented FVDB FORK: deltas) so that diff-against-upstream and resync stay tractable, and so the override and any inadvertent upstream copy on the include path share the same guard.

Add an override directory at `src/fvdb/nanovdb_overrides/` that holds modified copies of a few upstream nanoVDB headers. The build prepends this directory to the include search path so that, e.g., `#include <nanovdb/cuda/DeviceBuffer.h>` resolves to the forked copy under `src/fvdb/nanovdb_overrides/`, while the rest of nanoVDB continues to come from the upstream source tree. Forked headers: - `nanovdb/cuda/DeviceBuffer.h` and `nanovdb/cuda/DeviceResource.h` route device-side allocations through `c10::cuda::CUDACachingAllocator` (PyTorch's caching allocator) instead of `cudaMallocAsync` / `cudaFreeAsync`. This keeps transient scratch allocations made by nanoVDB internals (MergeGrids, TopologyBuilder, DilateGrid, ...) inside the same pool that fvdb / PyTorch tensors use. Without this, the two pools fragment each other and large workloads such as multi-frame TSDF integration hit a clean OOM even when the GPU still has free memory in aggregate. - `nanovdb/tools/cuda/TopologyBuilder.cuh` adds an opt-in debug trace gated on the `FVDB_NANOVDB_TRACE_ALLOCS` env var that prints tile count and per-call scratch size from `allocateInternalMaskBuffers`. Useful for diagnosing topology-op memory blowup on large scenes; off by default and otherwise byte-identical with upstream. Each forked header carries a short `FVDB FORK:` banner documenting what diverges from upstream and why, with inline `// FVDB FORK:` tags on every non-trivial change so `git blame` and text search make the delta obvious. `src/fvdb/nanovdb_overrides/README.md` documents the resync and add-an-override procedures. CMake wiring lives in `src/cmake/get_nanovdb.cmake`, with mirror include-path orderings in `src/CMakeLists.txt` (fvdb target) and the top-level `CMakeLists.txt` (_fvdb_cpp pybind module). Signed-off-by: Francis Williams <francis@fwilliams.info>

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

`pyproject.toml`'s `torch>=2.8,<2.12` cap was older than what the CI build pipeline now picks up. `env/build_requirements.txt` and `env/test_requirements.txt` leave `torch` unpinned, so `uv pip install --extra-index-url https://download.pytorch.org/whl/cu130 ...` resolved it to whatever was newest on the CUDA index. That worked while the newest cu130 torch matched the wheel's range. Once torch `2.12.0+cu130` was published, the build env installed 2.12.0+cu130 (so fvdb's `.so` ended up compiled against the 2.12 ABI, which includes `at::TensorBase::const_data_ptr<int8_t>`), but the *test* env's install of the resulting wheel hit pyproject's `<2.12` ceiling and pip downgraded torch to the CPU-only `2.11.0` from PyPI (which is missing those CUDA-side template symbols). The CUDA 13.0 unit-test pipeline then surfaces: ImportError: .../libfvdb.so: undefined symbol: _ZNK2at10TensorBase14const_data_ptrIaEEPKT_v at `import fvdb` time. Every CUDA 13.0 PR since 2026-05-16 has been red on this; CUDA 12.8 isn't affected (its index still serves a wheel in-range). Bumping the upper bound to `<2.13` lets the wheel and both the build and test environments converge on the same torch 2.12.x build, with no extra pin needed in the requirements files. A short comment in pyproject.toml documents the convention so the next torch minor release bump is mechanical. Signed-off-by: Francis Williams <francis@fwilliams.info>

fwilliams · 2026-05-19T19:43:45Z

Replaced the previous CI-fix commit with a cleaner one-liner approach (force-pushed):

Instead of pinning torch in both env/build_requirements.txt and env/test_requirements.txt to a range derived from pyproject.toml (which has to be kept in sync in three places), the requirements files are back to unpinned torch and pyproject.toml's upper bound is bumped from <2.12 to <2.13. With the wheel's range widened, uv pip install resolves to the same torch in both build and test envs (no more downgrade-mismatch on install), and there's only one source of truth for the supported torch range.

The CUDA 12.8 pipeline is already known to build+test fine against torch 2.12 (the build step uses torch 2.12.0+cu130 today; only the test install was downgrading because of the pyproject ceiling). CUDA 13.0 should now match.

A short comment in pyproject.toml documents the convention so the next torch minor-release bump is a one-line edit.

… torch Recent conda-forge `pytorch-gpu` builds (e.g. `pytorch-2.10.0-cuda130_mkl_py312_*_304`) hit a packaging bug where the `{torch,ATen,c10,caffe2,tensorpipe}` symlinks inside `<site-packages>/torch/include/` land with a `.c~` suffix -- the file-conflict rename suffix conda uses when a package install would overwrite an existing file. The result is that `find_package(Torch)`'s references to `<site-packages>/torch/include/torch/...` don't resolve, and fvdb builds against a conda-forge torch fail with: fatal error: torch/custom_class.h: No such file or directory (observed in `src/fvdb/JaggedTensor.h:7` from `src/tests/PackedJaggedAccessorTest.cu` on the Conda Build CI job). The canonical conda-forge install location for those headers is `$CONDA_PREFIX/include/torch/`, `$CONDA_PREFIX/include/ATen/`, etc., and the broken symlinks were meant to point at exactly those paths. Append `$CONDA_PREFIX/include` to `TORCH_INCLUDE_DIRS` when it exists, guarded by `IS_DIRECTORY "$ENV{CONDA_PREFIX}/include/torch"` so we only do this when torch is actually conda-installed (pip-installed torch in a conda env is unaffected, and non-conda environments don't trigger the branch at all). Signed-off-by: Francis Williams <francis@fwilliams.info>

sifakis · 2026-05-19T22:44:05Z

It might be worth prioritizing an improvement/generalization of the core nanovdb ops to address this more properly. At minimum ... we could replace the hardcoded DeviceBuffer used for transient allocation in the topoOps with a templatized ScratchBufferT (which would default to DeviceBuffer, but could be set to TorchDeviceBuffer in fVDB).
The only obstacles to that are minor (I think):

TorchDeviceBuffer::clear() takes no arguments, while DeviceBuffer::clear takes a stream argument. Remedy: Have the TB version accept (and ignore) a stream argument. TDB::create already does the same ...
TDB::data() returns uint8_t, DeviceBuffer::data returns void . This messes up some static_cast's. Remedy: Have both return void.

fwilliams mentioned this pull request May 19, 2026

Add TSDF / ESDF / Occupancy / Decay APIs and a fast marching-cubes variant #656

Draft

fwilliams requested a review from Copilot May 19, 2026 17:29

fwilliams marked this pull request as ready for review May 19, 2026 17:29

fwilliams requested a review from a team as a code owner May 19, 2026 17:29

fwilliams requested review from blackencino and sifakis May 19, 2026 17:29

fwilliams mentioned this pull request May 19, 2026

Add cross-library TSDF / ESDF / Occupancy benchmark drivers openvdb/fvdb-reality-capture#287

Draft

3 tasks

Copilot started reviewing on behalf of fwilliams May 19, 2026 17:29 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread src/fvdb/nanovdb_overrides/nanovdb/cuda/DeviceResource.h

Comment thread src/fvdb/nanovdb_overrides/nanovdb/cuda/DeviceBuffer.h Outdated

Comment thread src/fvdb/nanovdb_overrides/README.md Outdated

fwilliams force-pushed the feat/nanovdb-allocator-overrides branch from 14613ce to 4ec7afe Compare May 19, 2026 17:44

fwilliams force-pushed the feat/nanovdb-allocator-overrides branch from 4ec7afe to bedf87e Compare May 19, 2026 17:49

fwilliams requested a review from Copilot May 19, 2026 17:51

Copilot started reviewing on behalf of fwilliams May 19, 2026 17:51 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

fwilliams force-pushed the feat/nanovdb-allocator-overrides branch from 010424d to c700a40 Compare May 19, 2026 19:41

harrism mentioned this pull request Jun 11, 2026

NanoVDB: injectable CUDA memory resources — design & roadmap AcademySoftwareFoundation/openvdb#2232

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fork a small set of nanoVDB headers to share PyTorch's CUDA allocator#655

Fork a small set of nanoVDB headers to share PyTorch's CUDA allocator#655
fwilliams wants to merge 3 commits into
mainfrom
feat/nanovdb-allocator-overrides

fwilliams commented May 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fwilliams commented May 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

fwilliams commented May 19, 2026

Uh oh!

sifakis commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

fwilliams commented May 19, 2026

Summary

Why a directory of forked headers rather than a CPM patch?

Test plan

Followups

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fwilliams commented May 19, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

fwilliams commented May 19, 2026

Uh oh!

sifakis commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants