Fix handling of CUDA Compat header on Orin systems#1804

Merged
cdesiniotis merged 3 commits into
mainfrom
repro-orin-out-of-bounds
May 12, 2026
Conversation

Member

@elezar elezar commented May 7, 2026

On Orin systems, the CUDA compat header included in libcuda.so in the container has two problems:

  1. The padding differs from that of the standard compat libraries, which causes an out-of-bounds panic when the relevant note is processed.
  2. The JSON data for the CUDA 13.2 compat libraries is malformed, meaning that no CUDA compat information can be extracted.

This PR makes the following changes:

  1. Updates the code to be resilient to out-of-bounds errors.
  2. Adds a test case including the malformed JSON and an example including the corrected library.

Note that the libraries used as test data have been truncated to only include the relevant notes.

elezar added 3 commits May 12, 2026 08:56
This change truncates the libraries used to test the processing of the
CUDA compat ELF header to avoid adding files that are ~90MB in size.

Signed-off-by: Evan Lezar <elezar@nvidia.com>

This change adds a unit test to reproduce the out-of-bounds access
when processing a libcuda.so Orin compat library.

Signed-off-by: Evan Lezar <elezar@nvidia.com>

This change fixes the handling of the CUDA compat ELF header on Orin
systems. The padding of the header was such that the existing implementation
would cause an out-of-bounds error when processing the relevant ELF note.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the repro-orin-out-of-bounds branch from a9e2d67 to 1a09d4a Compare May 12, 2026 07:00
@elezar elezar changed the title TOFIX: Reproduce orin compat check error Fix handling of CUDA Compat header on Orin systems May 12, 2026
@coveralls

Coverage Report for CI Build 25718833415

Coverage increased (+0.009%) to 43.335%

Details

  • Coverage increased (+0.009%) from the base build.
  • Patch coverage: 2 uncovered changes across 1 file (5 of 7 lines covered, 71.43%).
  • 31 coverage regressions across 2 files.

Uncovered Changes

| File | Changed | Covered | % |
| cmd/nvidia-cdi-hook/cudacompat/cuda-elf-header.go | 7 | 5 | 71.43% |

Coverage Regressions

31 previously-covered lines in 2 files lost coverage.

| File | Lines Losing Coverage | Coverage |
| pkg/nvcdi/driver-nvml.go | 28 | 69.05% |
| cmd/nvidia-cdi-hook/update-ldcache/update-ldcache.go | 3 | 0.0% |

Coverage Stats

Relevant Lines: 14861
Covered Lines: 6440
Line Coverage: 43.33%
Coverage Strength: 0.48 hits per line

💛 - Coveralls

@elezar elezar added the bug Issue/PR to expose/discuss/fix a bug label May 12, 2026
@elezar elezar added this to the v1.19.1 milestone May 12, 2026
@elezar elezar requested review from cdesiniotis and tariq1890 May 12, 2026 07:04
@elezar elezar marked this pull request as ready for review May 12, 2026 07:04
Member Author

elezar commented May 12, 2026

/cherry-pick release-1.19

Contributor

@cdesiniotis cdesiniotis left a comment

Thanks @elezar! The CI failure is known and unrelated to this PR (it is already fixed on main).

@cdesiniotis cdesiniotis merged commit cb1cd2b into main May 12, 2026
19 of 20 checks passed
@cdesiniotis cdesiniotis deleted the repro-orin-out-of-bounds branch May 12, 2026 17:29
@github-actions

🤖 Backport PR created for release-1.19: #1822

Labels

bug Issue/PR to expose/discuss/fix a bug cherry-pick/release-1.19

3 participants