cmdrunner: release process handle in _pidAlive to avoid pidfd leak by texasich · Pull Request #378 · hashicorp/go-plugin

texasich · 2026-04-22T03:05:57Z

Description

_pidAlive in internal/cmdrunner/process_posix.go calls os.FindProcess(pid) and never releases the returned *os.Process. On Linux with Go 1.23+ that call now opens a pidfd under the hood (os.pidfdFind → pidfd_open), so every invocation leaks one file descriptor.

pidWait polls once per second per plugin, so the leak scales linearly with plugin count × uptime. In the downstream Nomad report (hashicorp/nomad#27847) a client with a few hundred allocations saw the host-side FD count cycle between 20k and 130k, eventually tripping EMFILE and breaking CNI config loads, disk-stats collection, pipe2, and docker socket dials.

The fix is a one-liner: defer proc.Release() right after the FindProcess call, so the pidfd is closed on every return path. The Windows implementation already does the equivalent with defer syscall.CloseHandle(h) in process_windows.go, so this just brings the POSIX side in line.

Before:

pidfd_open(686713, 0) = 90771
pidfd_open(690846, 0) = 90772
pidfd_open(123738, 0) = 90774
...

(one new FD per poll iteration per plugin, never closed)

Related Issue

Downstream report: hashicorp/nomad#27847

How Has This Been Tested?

go build ./... and go vet ./... clean on both host (Windows) and GOOS=linux
go test ./internal/cmdrunner/... passes
Reviewed against the Windows _pidAlive which has always done the equivalent cleanup via defer syscall.CloseHandle(h), so behavior is symmetric across platforms

Caching the *os.Process on the runner would avoid the repeated FindProcess entirely and is probably the better long-term change, but it's a larger refactor touching Runner lifecycle. Keeping this PR to the minimal, backport-friendly fix; happy to follow up with the caching variant in a separate PR if preferred.

os.FindProcess on Linux with Go 1.23+ opens a pidfd, and pidWait polls _pidAlive roughly once per second for every plugin process. Without a matching Release the pidfd leaks on each poll, and under Nomad with a few hundred allocations it adds up fast -- one reporter saw it cycle between 20k and 130k open FDs until the process hit EMFILE. Defer proc.Release() right after FindProcess so the handle is closed on every return path. Mirrors the syscall.CloseHandle defer already used in the Windows implementation. Reported downstream in hashicorp/nomad#27847.

hashicorp-cla-app · 2026-04-22T03:06:12Z

All committers have signed the CLA.

tgross

Thanks for the PR @texasich! LGTM once the lint is addressed!

I've reproduced the circumstances described in hashicorp/nomad#27847 with both the current tip of Nomad and a version of Nomad using this PR for go-plugin (via a replace directive in my go.mod). I run a single minimal job and then restart the agent. (Note that this requires running the agent outside of -dev mode.)

Using this little script:

#!/usr/bin/env bash

while :
do
    ls /proc/$1/fd/ | wc -l
    sleep 1
done

We see that before the patch, the number of open file handles increases after restarting Nomad. After the patch, it does not.

tgross · 2026-04-28T13:53:40Z

 	if err == nil {
+		// On Linux with Go 1.23+, FindProcess opens a pidfd which must be
+		// released or it leaks an FD on every call.
+		defer proc.Release()


This is triggering a very dumb lint. Can we either swallow the return value or put a //nolint:errcheck directive here?

@tgross

defer proc.Release() trips the errcheck lint because Release returns an error. The handle is short-lived and there's nothing actionable to recover from a release failure, so swallow the error explicitly to make the intent clear. Per @tgross review on hashicorp#378. Signed-off-by: texasich <texasich@users.noreply.github.com>

texasich · 2026-04-28T14:21:05Z

Thanks @tgross — appreciate you running the repro with the replace directive. Pushed 8382575 swapping the bare defer proc.Release() for defer func() { _ = proc.Release() }() so errcheck stays quiet.

tgross

LGTM

texasich · 2026-04-28T16:08:59Z

Quick CI update — lint is clean on 8382575. The remaining failure is TestClient_TLS in the top-level package, which looks like a flake. The error tls: first record does not look like a TLS handshake at client_test.go:1064 is the classic TLS-handshake race pattern, and master CI is green on this test. The diff between attempts on our side is just defer proc.Release() → defer func() { _ = proc.Release() }(), which is semantically identical — proc.Release() releases the local handle without touching the running process, so there's no path from cmdrunner internals to the top-level TLS test.

When you get a moment, could you re-run the go-test job? If it fails the same way a second time I'll dig deeper.

ritikrajdev

Thanks for checking this out. Looks Good to me as well.

When Nomad clients built on Go >=1.23 are restarted, the go-plugin client starts leaking a pidfd file handle 1/sec when polling for the plugin server. Long-term we should look at this polling behavior in more detail in the library because the pidfd may let us track the plugin status without polling. But in the meantime the leaking pidfd has been fixed upstream and we should pull that in. Ref: hashicorp/go-plugin#378 Fixes: #27847

texasich requested a review from a team as a code owner April 22, 2026 03:05

texasich mentioned this pull request Apr 27, 2026

Nomad opening a large number of file descriptors to its logging sub-processes when log collection is enabled hashicorp/nomad#27847

Closed

tgross self-requested a review April 28, 2026 13:28

tgross requested changes Apr 28, 2026

tgross approved these changes Apr 28, 2026

ritikrajdev approved these changes Apr 29, 2026

ritikrajdev merged commit 155dcdd into hashicorp:main Apr 29, 2026
4 of 5 checks passed

tgross mentioned this pull request Apr 29, 2026

deps: update go-plugin hashicorp/nomad#27885

Merged

7 tasks

hc-github-team-nomad-core mentioned this pull request May 1, 2026

Backport of deps: update go-plugin into release/2.0.x hashicorp/nomad#27893

Merged

7 tasks

ritikrajdev merged 2 commits into hashicorp:main from texasich:fix/pidfd-leak-pidalive
