Skip to content

Fork a small set of nanoVDB headers to share PyTorch's CUDA allocator#655

Open
fwilliams wants to merge 3 commits into
mainfrom
feat/nanovdb-allocator-overrides
Open

Fork a small set of nanoVDB headers to share PyTorch's CUDA allocator#655
fwilliams wants to merge 3 commits into
mainfrom
feat/nanovdb-allocator-overrides

Conversation

@fwilliams

Copy link
Copy Markdown
Collaborator

Summary

Add an in-tree override directory at src/fvdb/nanovdb_overrides/ that holds modified copies of a small number of upstream nanoVDB headers. The build prepends this directory to the include search path so that #include <nanovdb/...> for the forked files resolves here, while the rest of nanoVDB continues to come from the upstream source tree (still fetched unmodified at its pinned SHA).

Two changes are forked in:

  • nanovdb/cuda/DeviceBuffer.h and nanovdb/cuda/DeviceResource.h route device-side allocations through c10::cuda::CUDACachingAllocator (PyTorch's caching allocator) instead of cudaMallocAsync / cudaFreeAsync. This keeps transient scratch allocations made by nanoVDB internals (MergeGrids, TopologyBuilder, DilateGrid, …) inside the same pool that fvdb / PyTorch tensors use. Without this, the two pools fragment each other and large workloads such as multi-frame TSDF integration hit a clean OOM even when the GPU still has free memory in aggregate.

  • nanovdb/tools/cuda/TopologyBuilder.cuh adds an opt-in debug trace gated on the FVDB_NANOVDB_TRACE_ALLOCS env var that prints tile count and per-call scratch size from allocateInternalMaskBuffers. Useful for diagnosing topology-op memory blowup on large scenes; off by default and otherwise byte-identical with upstream.

Each forked header carries a short FVDB FORK: banner documenting what diverges from upstream and why, with inline // FVDB FORK: tags on every non-trivial code change so git blame and text search make the delta obvious. src/fvdb/nanovdb_overrides/README.md documents the resync and add-an-override procedures.

CMake wiring lives in src/cmake/get_nanovdb.cmake, with mirror include-path orderings in src/CMakeLists.txt (the fvdb library target) and the top-level CMakeLists.txt (the _fvdb_cpp pybind module).

Why a directory of forked headers rather than a CPM patch?

CPM's PATCH_COMMAND mechanism is fragile: it silently stops applying when upstream moves the surrounding lines, and it makes it hard to edit nanoVDB during development. Forking the exact headers we care about into this tree gives us full edit access with a narrow, reviewable patch surface and a clean diff-against-upstream workflow.

Test plan

  • ./build.sh install succeeds with the new include ordering on a clean tree.
  • cd tests && pytest unit/ passes against the new install (verified locally, full suite green including the existing TSDF and nanoVDB I/O tests).
  • FVDB_NANOVDB_TRACE_ALLOCS=1 python -c "..." over any TSDF integration workload prints the per-call tile counts and scratch sizes; unsetting the var reverts to byte-identical upstream behavior.

Followups

This is the foundation for a follow-up PR that adds new TSDF / ESDF / Occupancy / Decay / fast-marching-cubes APIs (stacked on this branch). That PR depends on the allocator routing for the multi-frame integrator paths.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Forks a small set of nanoVDB headers into src/fvdb/nanovdb_overrides/ so that nanoVDB's device-side scratch allocations route through PyTorch's c10::cuda::CUDACachingAllocator instead of cudaMallocAsync/cudaFreeAsync. This unifies the CUDA memory pool between nanoVDB internals and PyTorch tensors to avoid dual-pool OOMs in workloads like multi-frame TSDF integration. Also adds an opt-in env-gated allocation trace (FVDB_NANOVDB_TRACE_ALLOCS).

Changes:

  • Forked nanovdb/cuda/DeviceBuffer.h and nanovdb/cuda/DeviceResource.h to allocate through PyTorch's caching allocator with optional env-gated alloc traces.
  • Forked nanovdb/tools/cuda/TopologyBuilder.cuh to add an env-gated tile-count/scratch-size debug print in allocateInternalMaskBuffers.
  • Updated CMake (src/cmake/get_nanovdb.cmake, src/CMakeLists.txt, top-level CMakeLists.txt) to prepend the overrides directory to the include search path; added a README documenting layout and resync procedure.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/fvdb/nanovdb_overrides/README.md Documents rationale, layout, and procedures for adding/resyncing overrides.
src/fvdb/nanovdb_overrides/nanovdb/cuda/DeviceBuffer.h Routes device alloc/free through c10::cuda::CUDACachingAllocator; adds env-gated alloc trace.
src/fvdb/nanovdb_overrides/nanovdb/cuda/DeviceResource.h Same allocator routing for scratch allocations used by CUB/topology ops.
src/fvdb/nanovdb_overrides/nanovdb/tools/cuda/TopologyBuilder.cuh Adds env-gated tile-count/scratch-size trace in allocateInternalMaskBuffers.
src/cmake/get_nanovdb.cmake Defines FVDB_NANOVDB_OVERRIDES_DIR and prepends it to the nanovdb interface include path.
src/CMakeLists.txt Adds overrides dir before upstream nanoVDB sources on the fvdb target.
CMakeLists.txt Mirror include ordering for the _fvdb_cpp pybind module.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/fvdb/nanovdb_overrides/nanovdb/cuda/DeviceResource.h
Comment thread src/fvdb/nanovdb_overrides/nanovdb/cuda/DeviceBuffer.h Outdated
Comment thread src/fvdb/nanovdb_overrides/README.md Outdated
@fwilliams fwilliams force-pushed the feat/nanovdb-allocator-overrides branch from 14613ce to 4ec7afe Compare May 19, 2026 17:44
@fwilliams

Copy link
Copy Markdown
Collaborator Author

Addressed Copilot's review comments and the failing code-style checks (force-pushed amended commit, since this PR is still a draft):

Copilot #1 + #2raw_alloc vs documented stream-ordered semantics
Both call sites in DeviceBuffer.h::init / deviceUpload and DeviceResource.h::allocateAsync now use c10::cuda::CUDACachingAllocator::raw_alloc_with_stream(bytes, stream) instead of the plain raw_alloc. The comments in those sites already documented the intent (record the stream against the block so torch defers reuse until stream work completes, matching cudaMallocAsync's stream-ordered semantics); the code now actually does that. The (void)stream; silencers are gone.

Copilot #3 — README out of sync with directory contents
Both the ## Layout example block and the ## Current overrides table in src/fvdb/nanovdb_overrides/README.md now list all three forked headers (DeviceBuffer.h, DeviceResource.h, tools/cuda/TopologyBuilder.cuh) with a short line each on what diverges from upstream.

CI: C++ code style (clang-format) + Include guards failures
Added matching exclude / ignore lines in .github/workflows/codestyle.yml so the two style checks skip src/fvdb/nanovdb_overrides/. The forked headers intentionally keep upstream nanoVDB's formatting and include-guard names verbatim (apart from documented FVDB FORK: deltas) so that diff-against-upstream and resync stay tractable, and so the override and any inadvertent upstream copy on the include path share the same guard.

Add an override directory at `src/fvdb/nanovdb_overrides/` that holds
modified copies of a few upstream nanoVDB headers. The build prepends
this directory to the include search path so that, e.g.,
`#include <nanovdb/cuda/DeviceBuffer.h>` resolves to the forked copy
under `src/fvdb/nanovdb_overrides/`, while the rest of nanoVDB
continues to come from the upstream source tree.

Forked headers:

- `nanovdb/cuda/DeviceBuffer.h` and `nanovdb/cuda/DeviceResource.h`
  route device-side allocations through
  `c10::cuda::CUDACachingAllocator` (PyTorch's caching allocator)
  instead of `cudaMallocAsync` / `cudaFreeAsync`. This keeps
  transient scratch allocations made by nanoVDB internals
  (MergeGrids, TopologyBuilder, DilateGrid, ...) inside the same
  pool that fvdb / PyTorch tensors use. Without this, the two
  pools fragment each other and large workloads such as
  multi-frame TSDF integration hit a clean OOM even when the GPU
  still has free memory in aggregate.

- `nanovdb/tools/cuda/TopologyBuilder.cuh` adds an opt-in debug
  trace gated on the `FVDB_NANOVDB_TRACE_ALLOCS` env var that
  prints tile count and per-call scratch size from
  `allocateInternalMaskBuffers`. Useful for diagnosing topology-op
  memory blowup on large scenes; off by default and otherwise
  byte-identical with upstream.

Each forked header carries a short `FVDB FORK:` banner documenting
what diverges from upstream and why, with inline `// FVDB FORK:`
tags on every non-trivial change so `git blame` and text search make
the delta obvious. `src/fvdb/nanovdb_overrides/README.md` documents
the resync and add-an-override procedures.

CMake wiring lives in `src/cmake/get_nanovdb.cmake`, with mirror
include-path orderings in `src/CMakeLists.txt` (fvdb target) and
the top-level `CMakeLists.txt` (_fvdb_cpp pybind module).

Signed-off-by: Francis Williams <francis@fwilliams.info>
@fwilliams fwilliams force-pushed the feat/nanovdb-allocator-overrides branch from 4ec7afe to bedf87e Compare May 19, 2026 17:49
@fwilliams fwilliams requested a review from Copilot May 19, 2026 17:51

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

`pyproject.toml`'s `torch>=2.8,<2.12` cap was older than what the CI
build pipeline now picks up. `env/build_requirements.txt` and
`env/test_requirements.txt` leave `torch` unpinned, so `uv pip install
--extra-index-url https://download.pytorch.org/whl/cu130 ...` resolved
it to whatever was newest on the CUDA index.

That worked while the newest cu130 torch matched the wheel's range.
Once torch `2.12.0+cu130` was published, the build env installed
2.12.0+cu130 (so fvdb's `.so` ended up compiled against the 2.12 ABI,
which includes `at::TensorBase::const_data_ptr<int8_t>`), but the
*test* env's install of the resulting wheel hit pyproject's `<2.12`
ceiling and pip downgraded torch to the CPU-only `2.11.0` from PyPI
(which is missing those CUDA-side template symbols). The CUDA 13.0
unit-test pipeline then surfaces:

    ImportError: .../libfvdb.so: undefined symbol:
      _ZNK2at10TensorBase14const_data_ptrIaEEPKT_v

at `import fvdb` time. Every CUDA 13.0 PR since 2026-05-16 has been
red on this; CUDA 12.8 isn't affected (its index still serves a
wheel in-range).

Bumping the upper bound to `<2.13` lets the wheel and both the build
and test environments converge on the same torch 2.12.x build, with
no extra pin needed in the requirements files. A short comment in
pyproject.toml documents the convention so the next torch minor
release bump is mechanical.

Signed-off-by: Francis Williams <francis@fwilliams.info>
@fwilliams fwilliams force-pushed the feat/nanovdb-allocator-overrides branch from 010424d to c700a40 Compare May 19, 2026 19:41
@fwilliams

Copy link
Copy Markdown
Collaborator Author

Replaced the previous CI-fix commit with a cleaner one-liner approach (force-pushed):

Instead of pinning torch in both env/build_requirements.txt and env/test_requirements.txt to a range derived from pyproject.toml (which has to be kept in sync in three places), the requirements files are back to unpinned torch and pyproject.toml's upper bound is bumped from <2.12 to <2.13. With the wheel's range widened, uv pip install resolves to the same torch in both build and test envs (no more downgrade-mismatch on install), and there's only one source of truth for the supported torch range.

The CUDA 12.8 pipeline is already known to build+test fine against torch 2.12 (the build step uses torch 2.12.0+cu130 today; only the test install was downgrading because of the pyproject ceiling). CUDA 13.0 should now match.

A short comment in pyproject.toml documents the convention so the next torch minor-release bump is a one-line edit.

… torch

Recent conda-forge `pytorch-gpu` builds (e.g.
`pytorch-2.10.0-cuda130_mkl_py312_*_304`) hit a packaging bug where
the `{torch,ATen,c10,caffe2,tensorpipe}` symlinks inside
`<site-packages>/torch/include/` land with a `.c~` suffix -- the
file-conflict rename suffix conda uses when a package install would
overwrite an existing file. The result is that `find_package(Torch)`'s
references to `<site-packages>/torch/include/torch/...` don't resolve,
and fvdb builds against a conda-forge torch fail with:

    fatal error: torch/custom_class.h: No such file or directory

(observed in `src/fvdb/JaggedTensor.h:7` from
`src/tests/PackedJaggedAccessorTest.cu` on the Conda Build CI job).

The canonical conda-forge install location for those headers is
`$CONDA_PREFIX/include/torch/`, `$CONDA_PREFIX/include/ATen/`, etc.,
and the broken symlinks were meant to point at exactly those paths.
Append `$CONDA_PREFIX/include` to `TORCH_INCLUDE_DIRS` when it exists,
guarded by `IS_DIRECTORY "$ENV{CONDA_PREFIX}/include/torch"` so we
only do this when torch is actually conda-installed (pip-installed
torch in a conda env is unaffected, and non-conda environments don't
trigger the branch at all).

Signed-off-by: Francis Williams <francis@fwilliams.info>
@sifakis

sifakis commented May 19, 2026

Copy link
Copy Markdown

It might be worth prioritizing an improvement/generalization of the core nanovdb ops to address this more properly. At minimum ... we could replace the hardcoded DeviceBuffer used for transient allocation in the topoOps with a templatized ScratchBufferT (which would default to DeviceBuffer, but could be set to TorchDeviceBuffer in fVDB).
The only obstacles to that are minor (I think):

  • TorchDeviceBuffer::clear() takes no arguments, while DeviceBuffer::clear takes a stream argument. Remedy: Have the TB version accept (and ignore) a stream argument. TDB::create already does the same ...
  • TDB::data() returns uint8_t, DeviceBuffer::data returns void . This messes up some static_cast's. Remedy: Have both return void.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants