Update systemd unit conditions to execute nvidia-smi and re-add restart logic #1825

Merged
cdesiniotis merged 2 commits into NVIDIA:main from cdesiniotis:cdi-refresh-service-run-nvidia-smi on May 14, 2026

@cdesiniotis cdesiniotis commented May 14, 2026

This change ensures that nvidia-cdi-refresh.service does not start until the driver is loaded and available. Running nvidia-smi has the added benefit of loading the kernel module (if not already loaded) and creating the device nodes (if not already created).

This PR also re-introduces the restart logic that was removed in 5fe6b42. Without the restart logic, our service may be started too early in the boot process, in which case executing "nvidia-smi" fails and no CDI spec is generated. With retries, the "nvidia-smi" invocation should eventually succeed, allowing us to generate a CDI spec.
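
To make the interaction between the nvidia-smi check and the restart logic concrete, here is a minimal sketch of a unit written in this style. It is not the diff from this PR: the use of `ExecStartPre=`, the binary paths, the output path, and the retry values are all assumptions chosen for illustration.

```ini
# Hypothetical sketch, NOT the actual nvidia-cdi-refresh.service from this PR.
[Unit]
Description=Refresh NVIDIA CDI specification file (illustrative)
# Assumed retry budget: stop retrying after 10 start attempts within 5 minutes.
StartLimitIntervalSec=300
StartLimitBurst=10

[Service]
Type=oneshot
# Assumed placement and path: run nvidia-smi first so the kernel module is
# loaded and the device nodes exist before the spec is generated.
# A non-zero exit here fails the unit and triggers the restart logic below.
ExecStartPre=/usr/bin/nvidia-smi
# Assumed binary and output paths.
ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
# Retry when nvidia-smi or the generate step fails, e.g. early in boot.
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Two systemd behaviours matter for a design like this. `Restart=on-failure` combined with `Type=oneshot` requires systemd 244 or newer. And if the check were placed in `ExecCondition=` instead, exit codes 1 through 254 would cause the unit to be skipped without being marked failed, so the restart logic would not fire for those exits.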

Note that re-introducing the restart logic in the systemd service means we should re-open #1624. I believe there are ways to address the use case in #1624 while still keeping the restart logic: 1) reducing the logs emitted by nvidia-ctk cdi generate and/or 2) updating nvidia-ctk cdi generate to exit early with a zero exit code if no GPUs are detected on the system.

This ensures that the nvidia-cdi-refresh.service does not start until
after the driver is loaded and available. The added benefit of running
nvidia-smi is that it will load the module (if not already loaded) and
create the device nodes (if not already created).

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
@cdesiniotis cdesiniotis self-assigned this May 14, 2026
@cdesiniotis cdesiniotis requested a review from tariq1890 May 14, 2026 19:13
Comment thread deployments/systemd/nvidia-cdi-refresh.service
The restart logic was removed when a dependency on multi-user.target was added in
NVIDIA@5fe6b42.
However, this change had unforeseen consequences and was later reverted in
NVIDIA@5eee5ce.
When reverting this change, the restart logic was not re-added to the service.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
@coveralls

Coverage Report for CI Build 25883283353

Coverage remained the same at 43.342%

Details

  • Coverage remained the same as the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • No coverage regressions found.

Uncovered Changes

No uncovered changes found.


Coverage Stats

Coverage Status
Relevant Lines: 14861
Covered Lines: 6441
Line Coverage: 43.34%
Coverage Strength: 0.48 hits per line

💛 - Coveralls

@cdesiniotis cdesiniotis changed the title Update systemd unit conditions to execute nvidia-smi Update systemd unit conditions to execute nvidia-smi and re-add restart logic May 14, 2026
@cdesiniotis

/cherry-pick release-1.19

@cdesiniotis cdesiniotis merged commit fb51f43 into NVIDIA:main May 14, 2026
20 checks passed
@github-actions

🤖 Backport PR created for release-1.19: #1826

3 participants