Fork a small set of nanoVDB headers to share PyTorch's CUDA allocator#655
Fork a small set of nanoVDB headers to share PyTorch's CUDA allocator#655fwilliams wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Forks a small set of nanoVDB headers into src/fvdb/nanovdb_overrides/ so that nanoVDB's device-side scratch allocations route through PyTorch's c10::cuda::CUDACachingAllocator instead of cudaMallocAsync/cudaFreeAsync. This unifies the CUDA memory pool between nanoVDB internals and PyTorch tensors to avoid dual-pool OOMs in workloads like multi-frame TSDF integration. Also adds an opt-in env-gated allocation trace (FVDB_NANOVDB_TRACE_ALLOCS).
Changes:
- Forked
nanovdb/cuda/DeviceBuffer.handnanovdb/cuda/DeviceResource.hto allocate through PyTorch's caching allocator with optional env-gated alloc traces. - Forked
nanovdb/tools/cuda/TopologyBuilder.cuhto add an env-gated tile-count/scratch-size debug print inallocateInternalMaskBuffers. - Updated CMake (
src/cmake/get_nanovdb.cmake,src/CMakeLists.txt, top-levelCMakeLists.txt) to prepend the overrides directory to the include search path; added a README documenting layout and resync procedure.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
src/fvdb/nanovdb_overrides/README.md |
Documents rationale, layout, and procedures for adding/resyncing overrides. |
src/fvdb/nanovdb_overrides/nanovdb/cuda/DeviceBuffer.h |
Routes device alloc/free through c10::cuda::CUDACachingAllocator; adds env-gated alloc trace. |
src/fvdb/nanovdb_overrides/nanovdb/cuda/DeviceResource.h |
Same allocator routing for scratch allocations used by CUB/topology ops. |
src/fvdb/nanovdb_overrides/nanovdb/tools/cuda/TopologyBuilder.cuh |
Adds env-gated tile-count/scratch-size trace in allocateInternalMaskBuffers. |
src/cmake/get_nanovdb.cmake |
Defines FVDB_NANOVDB_OVERRIDES_DIR and prepends it to the nanovdb interface include path. |
src/CMakeLists.txt |
Adds overrides dir before upstream nanoVDB sources on the fvdb target. |
CMakeLists.txt |
Mirror include ordering for the _fvdb_cpp pybind module. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
14613ce to
4ec7afe
Compare
|
Addressed Copilot's review comments and the failing code-style checks (force-pushed amended commit, since this PR is still a draft): Copilot #1 + #2 — Copilot #3 — README out of sync with directory contents CI: |
Add an override directory at `src/fvdb/nanovdb_overrides/` that holds modified copies of a few upstream nanoVDB headers. The build prepends this directory to the include search path so that, e.g., `#include <nanovdb/cuda/DeviceBuffer.h>` resolves to the forked copy under `src/fvdb/nanovdb_overrides/`, while the rest of nanoVDB continues to come from the upstream source tree. Forked headers: - `nanovdb/cuda/DeviceBuffer.h` and `nanovdb/cuda/DeviceResource.h` route device-side allocations through `c10::cuda::CUDACachingAllocator` (PyTorch's caching allocator) instead of `cudaMallocAsync` / `cudaFreeAsync`. This keeps transient scratch allocations made by nanoVDB internals (MergeGrids, TopologyBuilder, DilateGrid, ...) inside the same pool that fvdb / PyTorch tensors use. Without this, the two pools fragment each other and large workloads such as multi-frame TSDF integration hit a clean OOM even when the GPU still has free memory in aggregate. - `nanovdb/tools/cuda/TopologyBuilder.cuh` adds an opt-in debug trace gated on the `FVDB_NANOVDB_TRACE_ALLOCS` env var that prints tile count and per-call scratch size from `allocateInternalMaskBuffers`. Useful for diagnosing topology-op memory blowup on large scenes; off by default and otherwise byte-identical with upstream. Each forked header carries a short `FVDB FORK:` banner documenting what diverges from upstream and why, with inline `// FVDB FORK:` tags on every non-trivial change so `git blame` and text search make the delta obvious. `src/fvdb/nanovdb_overrides/README.md` documents the resync and add-an-override procedures. CMake wiring lives in `src/cmake/get_nanovdb.cmake`, with mirror include-path orderings in `src/CMakeLists.txt` (fvdb target) and the top-level `CMakeLists.txt` (_fvdb_cpp pybind module). Signed-off-by: Francis Williams <francis@fwilliams.info>
4ec7afe to
bedf87e
Compare
`pyproject.toml`'s `torch>=2.8,<2.12` cap was older than what the CI build pipeline now picks up. `env/build_requirements.txt` and `env/test_requirements.txt` leave `torch` unpinned, so `uv pip install --extra-index-url https://download.pytorch.org/whl/cu130 ...` resolved it to whatever was newest on the CUDA index. That worked while the newest cu130 torch matched the wheel's range. Once torch `2.12.0+cu130` was published, the build env installed 2.12.0+cu130 (so fvdb's `.so` ended up compiled against the 2.12 ABI, which includes `at::TensorBase::const_data_ptr<int8_t>`), but the *test* env's install of the resulting wheel hit pyproject's `<2.12` ceiling and pip downgraded torch to the CPU-only `2.11.0` from PyPI (which is missing those CUDA-side template symbols). The CUDA 13.0 unit-test pipeline then surfaces: ImportError: .../libfvdb.so: undefined symbol: _ZNK2at10TensorBase14const_data_ptrIaEEPKT_v at `import fvdb` time. Every CUDA 13.0 PR since 2026-05-16 has been red on this; CUDA 12.8 isn't affected (its index still serves a wheel in-range). Bumping the upper bound to `<2.13` lets the wheel and both the build and test environments converge on the same torch 2.12.x build, with no extra pin needed in the requirements files. A short comment in pyproject.toml documents the convention so the next torch minor release bump is mechanical. Signed-off-by: Francis Williams <francis@fwilliams.info>
010424d to
c700a40
Compare
|
Replaced the previous CI-fix commit with a cleaner one-liner approach (force-pushed): Instead of pinning torch in both The CUDA 12.8 pipeline is already known to build+test fine against torch 2.12 (the build step uses torch 2.12.0+cu130 today; only the test install was downgrading because of the pyproject ceiling). CUDA 13.0 should now match. A short comment in |
… torch
Recent conda-forge `pytorch-gpu` builds (e.g.
`pytorch-2.10.0-cuda130_mkl_py312_*_304`) hit a packaging bug where
the `{torch,ATen,c10,caffe2,tensorpipe}` symlinks inside
`<site-packages>/torch/include/` land with a `.c~` suffix -- the
file-conflict rename suffix conda uses when a package install would
overwrite an existing file. The result is that `find_package(Torch)`'s
references to `<site-packages>/torch/include/torch/...` don't resolve,
and fvdb builds against a conda-forge torch fail with:
fatal error: torch/custom_class.h: No such file or directory
(observed in `src/fvdb/JaggedTensor.h:7` from
`src/tests/PackedJaggedAccessorTest.cu` on the Conda Build CI job).
The canonical conda-forge install location for those headers is
`$CONDA_PREFIX/include/torch/`, `$CONDA_PREFIX/include/ATen/`, etc.,
and the broken symlinks were meant to point at exactly those paths.
Append `$CONDA_PREFIX/include` to `TORCH_INCLUDE_DIRS` when it exists,
guarded by `IS_DIRECTORY "$ENV{CONDA_PREFIX}/include/torch"` so we
only do this when torch is actually conda-installed (pip-installed
torch in a conda env is unaffected, and non-conda environments don't
trigger the branch at all).
Signed-off-by: Francis Williams <francis@fwilliams.info>
|
It might be worth prioritizing an improvement/generalization of the core nanovdb ops to address this more properly. At minimum ... we could replace the hardcoded
|
Summary
Add an in-tree override directory at
src/fvdb/nanovdb_overrides/that holds modified copies of a small number of upstream nanoVDB headers. The build prepends this directory to the include search path so that#include <nanovdb/...>for the forked files resolves here, while the rest of nanoVDB continues to come from the upstream source tree (still fetched unmodified at its pinned SHA).Two changes are forked in:
nanovdb/cuda/DeviceBuffer.handnanovdb/cuda/DeviceResource.hroute device-side allocations throughc10::cuda::CUDACachingAllocator(PyTorch's caching allocator) instead ofcudaMallocAsync/cudaFreeAsync. This keeps transient scratch allocations made by nanoVDB internals (MergeGrids, TopologyBuilder, DilateGrid, …) inside the same pool that fvdb / PyTorch tensors use. Without this, the two pools fragment each other and large workloads such as multi-frame TSDF integration hit a clean OOM even when the GPU still has free memory in aggregate.nanovdb/tools/cuda/TopologyBuilder.cuhadds an opt-in debug trace gated on theFVDB_NANOVDB_TRACE_ALLOCSenv var that prints tile count and per-call scratch size fromallocateInternalMaskBuffers. Useful for diagnosing topology-op memory blowup on large scenes; off by default and otherwise byte-identical with upstream.Each forked header carries a short
FVDB FORK:banner documenting what diverges from upstream and why, with inline// FVDB FORK:tags on every non-trivial code change sogit blameand text search make the delta obvious.src/fvdb/nanovdb_overrides/README.mddocuments the resync and add-an-override procedures.CMake wiring lives in
src/cmake/get_nanovdb.cmake, with mirror include-path orderings insrc/CMakeLists.txt(thefvdblibrary target) and the top-levelCMakeLists.txt(the_fvdb_cpppybind module).Why a directory of forked headers rather than a CPM patch?
CPM's
PATCH_COMMANDmechanism is fragile: it silently stops applying when upstream moves the surrounding lines, and it makes it hard to edit nanoVDB during development. Forking the exact headers we care about into this tree gives us full edit access with a narrow, reviewable patch surface and a clean diff-against-upstream workflow.Test plan
./build.sh installsucceeds with the new include ordering on a clean tree.cd tests && pytest unit/passes against the new install (verified locally, full suite green including the existing TSDF and nanoVDB I/O tests).FVDB_NANOVDB_TRACE_ALLOCS=1 python -c "..."over any TSDF integration workload prints the per-call tile counts and scratch sizes; unsetting the var reverts to byte-identical upstream behavior.Followups
This is the foundation for a follow-up PR that adds new TSDF / ESDF / Occupancy / Decay / fast-marching-cubes APIs (stacked on this branch). That PR depends on the allocator routing for the multi-frame integrator paths.