Skip to content

fix: isolate exporter timeout phases#154

Open
jingxiang-z wants to merge 3 commits intomainfrom
fix/exporter-timeout-phases
Open

fix: isolate exporter timeout phases#154
jingxiang-z wants to merge 3 commits intomainfrom
fix/exporter-timeout-phases

Conversation

@jingxiang-z
Copy link
Copy Markdown
Collaborator

@jingxiang-z jingxiang-z commented Apr 1, 2026

Summary

Fix exporter timeout handling so a stalled collection phase does not force HTTP upload to inherit an expired context.

What Changed

  • split exporter collection and upload into separate timeout-scoped contexts
  • stop collector work promptly when cancellation or deadline expiry is observed
  • preserve partial event data collected before cancellation
  • add regression tests for partial collection and fresh upload context behavior

Root Cause

The exporter previously used one shared timeout for metadata refresh, collection, and upload. When GPU-facing collection work stalled, the shared context expired and the later event-read and HTTP-upload phases inherited a dead context, causing cascading context deadline exceeded failures.

Impact

The agent now degrades more cleanly under driver/NVML stalls: partial collection can still be uploaded, and cancellation no longer produces misleading per-phase timeout cascades.

Validation

  • go test ./internal/exporter/...

Summary by CodeRabbit

  • Bug Fixes

    • Collection now returns partial health data when components time out or are canceled, and preserves accumulated events/component data on early return.
    • Machine info refresh runs asynchronously and uses a cached result so metrics/events aren’t blocked by slow machine-info reads.
    • Upload uses a fresh timeout and is skipped when collection returns no data, avoiding unnecessary uploads.
  • Tests

    • Added tests covering partial returns, machine-info caching/refresh behavior, and upload context handling.

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

Walkthrough

Collector now checks context cancellation/deadline throughout collection and returns partial HealthData on early exit; machine-info collection is background-refreshed and cached. Exporter separates collection and upload timeouts so uploads can proceed with a fresh context even when collection timed out.

Changes

Cohort / File(s) Summary
Collector Implementation
internal/exporter/collector/collector.go
Added repeated ctx.Err() checks and early returns with partially populated HealthData. Component errors context.Canceled/DeadlineExceeded cause immediate return; events/component-data preserve accumulated results. collectMachineInfo now accepts ctx, uses a mutex-protected cached *machineinfo.MachineInfo, and always triggers an asynchronous refresh without blocking Collect.
Collector Tests
internal/exporter/collector/collector_test.go
Added tests for partial-return behavior (TestCollectReturnsPartialDataWhenEventCollectionTimesOut, TestCollectReturnsPartialDataWhenComponentDataCollectionTimesOut) and machine-info behavior (TestCollector_CollectMachineInfoHangDoesNotBlockOtherData, TestCollector_CollectMachineInfoUsesCachedRefreshResult). Extended mockComponent with eventFunc and added cancelingHealthStateComponent. Added test-only imports to support machine-info/NVML stubs.
Exporter Implementation
internal/exporter/exporter.go
Split single export timeout into collectCtx (for metadata/collector) and uploadCtx (for HTTP upload). Treat collector errors as fatal only when healthData == nil; otherwise proceed to upload using fresh uploadCtx and log partial outcomes. Simplified non-offline control flow.
Exporter Tests
internal/exporter/exporter_test.go
Added mockHTTPWriter and tests: TestExportDoesNotAttemptUploadWhenCollectionReturnsNoData (no upload when collector returns nil data + error) and TestExportUsesFreshUploadContextAfterPartialCollection (upload uses fresh, uncanceled context and receives partial data).
Config Tests
internal/config/config_test.go
Added useTempHomeDir(t *testing.T) helper and updated TestDefault and TestDefaultWithHealthExporter to use a temporary home directory for deterministic state-path resolution.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Exporter
    participant Collector
    participant Components
    participant HTTPWriter

    Client->>Exporter: export()
    activate Exporter

    Note over Exporter: create collectCtx (timeout for collection)
    Exporter->>Collector: Collect(collectCtx)
    activate Collector

    Collector->>Collector: periodic ctx.Err() checks
    Collector->>Components: collectMachineInfo (background refresh), collect metrics, collect events, collect component data
    rect rgba(200,150,100,0.5)
        Note over Components,Collector: a component or step hits deadline/cancel
        Components-->>Collector: context.DeadlineExceeded / context.Canceled
    end

    Collector-->>Exporter: partial HealthData + ctx.Err()
    deactivate Collector

    Note over Exporter: if HealthData != nil -> create fresh uploadCtx
    Exporter->>HTTPWriter: Send(uploadCtx, partial HealthData)
    activate HTTPWriter
    HTTPWriter-->>Exporter: upload result
    deactivate HTTPWriter

    Exporter-->>Client: return result (success or error)
    deactivate Exporter
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I sniff the timeouts, hop and cheer,
Partial carrots brought safely near,
Two clocks ticking—collect then send,
Background refresh, a helpful friend,
I nibble proofs while uploads clear! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.79% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'fix: isolate exporter timeout phases' directly and concisely describes the main change: splitting the exporter's single timeout into isolated phases for collection and upload to prevent stalled operations from cascading.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/exporter-timeout-phases

Comment @coderabbitai help to get the list of available commands and usage tips.

@jingxiang-z jingxiang-z marked this pull request as ready for review April 1, 2026 20:23
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/exporter/collector/collector.go (1)

178-187: Consider adding context checks after attestation and config collection.

The collectAttestationData and collectConfigData methods are not followed by context cancellation checks, unlike the other collection steps. While these are typically fast operations, for consistency and to ensure prompt cancellation, consider adding checks after them as well.

♻️ Optional: Add context checks for consistency
 	// Collect attestation data if provider is available
 	// Attestation is always enabled if manager is available
 	if err := c.collectAttestationData(data); err != nil {
 		log.Logger.Errorw("Failed to collect attestation data", "error", err)
 	}
+	if err := ctx.Err(); err != nil {
+		return data, err
+	}

 	// Collect config data
 	if err := c.collectConfigData(data); err != nil {
 		log.Logger.Errorw("Failed to collect config data", "error", err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/exporter/collector/collector.go` around lines 178 - 187, After
calling c.collectAttestationData(data) and c.collectConfigData(data) add the
same context cancellation checks used elsewhere in this function so collection
stops promptly on cancel; specifically, after each call (and its error handling)
check the request context (ctx or the collector's context variable) for
cancellation (e.g., select on ctx.Done()) and return early if cancelled,
mirroring the existing pattern used after other collection steps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/exporter/collector/collector.go`:
- Around line 178-187: After calling c.collectAttestationData(data) and
c.collectConfigData(data) add the same context cancellation checks used
elsewhere in this function so collection stops promptly on cancel; specifically,
after each call (and its error handling) check the request context (ctx or the
collector's context variable) for cancellation (e.g., select on ctx.Done()) and
return early if cancelled, mirroring the existing pattern used after other
collection steps.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e987c168-0949-4056-a56b-54653c58a4e7

📥 Commits

Reviewing files that changed from the base of the PR and between 752b61f and af43c37.

📒 Files selected for processing (4)
  • internal/exporter/collector/collector.go
  • internal/exporter/collector/collector_test.go
  • internal/exporter/exporter.go
  • internal/exporter/exporter_test.go

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
@jingxiang-z jingxiang-z requested a review from mukilsh April 1, 2026 21:45
@jingxiang-z jingxiang-z self-assigned this Apr 1, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/exporter/exporter.go (1)

180-181: Separate the collection timeout from the HTTP timeout.

internal/config/config.go:96-104 still documents HealthExporter.Timeout as the HTTP request timeout, but these lines now spend that budget twice—once on collection and again on upload. That changes the worst-case duration of one export cycle and couples GPU stall tuning to network tuning. A dedicated collection timeout, or a config/doc rename, would make this behavior much easier to reason about.

Also applies to: 205-206

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/exporter/exporter.go` around lines 180 - 181, The code currently
reuses e.options.timeout for both collection and upload (see the
context.WithTimeout using e.ctx to create collectCtx and the later upload path),
which doubles the intended HTTP timeout and couples collection and network
tuning; define and use a separate collection timeout (e.g. add
e.options.collectTimeout or rename existing config) and replace the collection
context creation (collectCtx, cancelCollect := context.WithTimeout(e.ctx,
e.options.timeout)) with the new value (e.options.collectTimeout), update any
other collection uses (including the similar lines at 205-206), and update the
configuration/docs in internal/config/config.go to document the distinct
collection timeout vs HTTP/upload timeout so the two budgets are independent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/exporter/exporter.go`:
- Around line 180-181: ExportNow(ctx) currently ignores the caller context and
export() roots phase contexts off e.ctx, so thread the caller-provided ctx into
export() (call export(ctx, ...)) and within export use the caller ctx as the
parent for phase timeouts (use context.WithTimeout(ctx, e.options.timeout) for
collection and likewise for upload) instead of context.WithTimeout(e.ctx,...);
after the collection phase returns, check collectCtx.Err() (or context
cancellation) and return early if cancelled/expired so you don't start upload
when collection was cancelled; this also ensures Stop() canceling e.ctx won't
immediately cancel on-demand exports that used the caller ctx.

---

Nitpick comments:
In `@internal/exporter/exporter.go`:
- Around line 180-181: The code currently reuses e.options.timeout for both
collection and upload (see the context.WithTimeout using e.ctx to create
collectCtx and the later upload path), which doubles the intended HTTP timeout
and couples collection and network tuning; define and use a separate collection
timeout (e.g. add e.options.collectTimeout or rename existing config) and
replace the collection context creation (collectCtx, cancelCollect :=
context.WithTimeout(e.ctx, e.options.timeout)) with the new value
(e.options.collectTimeout), update any other collection uses (including the
similar lines at 205-206), and update the configuration/docs in
internal/config/config.go to document the distinct collection timeout vs
HTTP/upload timeout so the two budgets are independent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f68d14bc-8736-4ca9-bf2e-b6867ef6ac8e

📥 Commits

Reviewing files that changed from the base of the PR and between af43c37 and 1372c70.

📒 Files selected for processing (4)
  • internal/exporter/collector/collector.go
  • internal/exporter/collector/collector_test.go
  • internal/exporter/exporter.go
  • internal/exporter/exporter_test.go
✅ Files skipped from review due to trivial changes (1)
  • internal/exporter/exporter_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/exporter/collector/collector.go
  • internal/exporter/collector/collector_test.go

Comment on lines +180 to +181
collectCtx, cancelCollect := context.WithTimeout(e.ctx, e.options.timeout)
defer cancelCollect()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify that ExportNow accepts a caller context while export() roots phase contexts in e.ctx.
rg -n -C2 --type go \
  -e 'func \(e \*healthExporter\) ExportNow' \
  -e 'func \(e \*healthExporter\) export\(' \
  -e 'context\.WithTimeout\(e\.ctx'

Repository: NVIDIA/fleet-intelligence-agent

Length of output: 1364


🏁 Script executed:

#!/bin/bash
# Find all calls to export() within the file
rg -n 'e\.export\(\)' internal/exporter/exporter.go

# Check the Start() method to understand context lifecycle
rg -n -A 20 'func \(e \*healthExporter\) Start' internal/exporter/exporter.go

# Check Stop/Close methods if they exist
rg -n -B 2 -A 10 'func \(e \*healthExporter\) Stop\|func \(e \*healthExporter\) Close' internal/exporter/exporter.go

Repository: NVIDIA/fleet-intelligence-agent

Length of output: 739


🏁 Script executed:

#!/bin/bash
# Show more of the Start() method, particularly around line 147
sed -n '123,160p' internal/exporter/exporter.go

# Search for where e.ctx is created (likely in constructor or New function)
rg -n 'e\.ctx.*=' internal/exporter/exporter.go | head -20

# Show the healthExporter struct definition
rg -n -B 5 -A 15 'type healthExporter struct' internal/exporter/exporter.go

Repository: NVIDIA/fleet-intelligence-agent

Length of output: 1737


🏁 Script executed:

#!/bin/bash
# Show the complete New() function
sed -n '61,100p' internal/exporter/exporter.go

# Show the complete Stop() method
rg -n -A 20 'func \(e \*healthExporter\) Stop\(\)' internal/exporter/exporter.go

Repository: NVIDIA/fleet-intelligence-agent

Length of output: 2028


Pass the caller context to the export method.

ExportNow(ctx) accepts a context parameter but ignores it—the export() method roots both phase contexts in e.ctx instead. This prevents caller deadlines and cancellations from taking effect for on-demand exports.

Additionally, when Stop() calls e.cancel(), it cancels e.ctx. Any in-flight export() call will then create timeout contexts from an already-canceled parent, immediately failing both collection and upload phases and defeating the intended independent timeout budgets. Thread the caller context into export() for on-demand invocations (line 171) and return early when collection is canceled to avoid propagating cancellation into the upload phase.

Also applies to: 193-198, 205-206

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/exporter/exporter.go` around lines 180 - 181, ExportNow(ctx)
currently ignores the caller context and export() roots phase contexts off
e.ctx, so thread the caller-provided ctx into export() (call export(ctx, ...))
and within export use the caller ctx as the parent for phase timeouts (use
context.WithTimeout(ctx, e.options.timeout) for collection and likewise for
upload) instead of context.WithTimeout(e.ctx,...); after the collection phase
returns, check collectCtx.Err() (or context cancellation) and return early if
cancelled/expired so you don't start upload when collection was cancelled; this
also ensures Stop() canceling e.ctx won't immediately cancel on-demand exports
that used the caller ctx.

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
internal/config/config_test.go (1)

187-207: Consider refactoring to use useTempHomeDir for consistency.

This test manually implements the same override/cleanup pattern that the new useTempHomeDir helper provides. The same applies to TestDefaultStateFileRepairsExistingDirectoryPermissions below. Refactoring both to use the helper would reduce duplication and improve maintainability.

♻️ Proposed refactor
 func TestDefaultStateFile(t *testing.T) {
-	origGeteuid := osGeteuid
-	origHomeDirFn := homeDirFn
-	t.Cleanup(func() {
-		osGeteuid = origGeteuid
-		homeDirFn = origHomeDirFn
-	})
-
-	tmpHome := t.TempDir()
-	osGeteuid = func() int { return 1000 }
-	homeDirFn = func() (string, error) { return tmpHome, nil }
+	tmpHome := useTempHomeDir(t)

 	path, err := DefaultStateFile()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/config_test.go` around lines 187 - 207, Replace the manual
osGeteuid/homeDirFn override and cleanup in TestDefaultStateFile (and similarly
in TestDefaultStateFileRepairsExistingDirectoryPermissions) with the existing
useTempHomeDir test helper: call useTempHomeDir(t) to set up tmpHome and to
handle restoration, then use the returned tmpHome when asserting
DefaultStateFile() behavior and directory permissions; keep references to
DefaultStateFile, osGeteuid, and homeDirFn only as context and remove the
explicit override/cleanup blocks to eliminate duplication.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/exporter/collector/collector.go`:
- Around line 203-215: collectMachineInfo currently launches
startMachineInfoRefresh (which spawns machineInfoFetcher) before checking
ctx.Err() and the fetcher has no cancellation or guaranteed reset of
machineInfoCollecting; change the code to check ctx.Err() before calling
startMachineInfoRefresh and modify startMachineInfoRefresh/machineInfoFetcher to
accept a context (or return a cancellable func) so the goroutine selects on
ctx.Done() to abort early, and ensure machineInfoCollecting is always cleared
(use defer or a finalizer) even when fetcher exits early or errors so the flag
cannot remain stuck true.
- Around line 139-145: The machine-info collection currently logs all errors
from c.collectMachineInfo as Errorw and then lets the function hit the ctx.Err()
return path; change this to mirror metrics/events/component behavior: after
calling c.collectMachineInfo(ctx, data) check the returned err, and if err is
context.Canceled or context.DeadlineExceeded (use errors.Is(err,
context.Canceled) or errors.Is(err, context.DeadlineExceeded)) do NOT call
log.Logger.Errorw — instead return the partial data and the ctx error
immediately; for other non-nil errors keep the existing Errorw call and
continue. Ensure you update the logic around c.config.IncludeMachineInfo and the
ctx.Err() handling so machine-info cancellation is suppressed like the other
partial-data phases.

---

Nitpick comments:
In `@internal/config/config_test.go`:
- Around line 187-207: Replace the manual osGeteuid/homeDirFn override and
cleanup in TestDefaultStateFile (and similarly in
TestDefaultStateFileRepairsExistingDirectoryPermissions) with the existing
useTempHomeDir test helper: call useTempHomeDir(t) to set up tmpHome and to
handle restoration, then use the returned tmpHome when asserting
DefaultStateFile() behavior and directory permissions; keep references to
DefaultStateFile, osGeteuid, and homeDirFn only as context and remove the
explicit override/cleanup blocks to eliminate duplication.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e930565d-b6e7-45d8-9c75-e580b2f96aa7

📥 Commits

Reviewing files that changed from the base of the PR and between 1372c70 and 9aba411.

📒 Files selected for processing (3)
  • internal/config/config_test.go
  • internal/exporter/collector/collector.go
  • internal/exporter/collector/collector_test.go

Comment on lines 139 to +145
if c.config.IncludeMachineInfo {
if err := c.collectMachineInfo(data); err != nil {
if err := c.collectMachineInfo(ctx, data); err != nil {
log.Logger.Errorw("Failed to collect machine info", "error", err)
}
}
if err := ctx.Err(); err != nil {
return data, err
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Treat machine-info cancellation like the other partial-data phases.

This branch still logs context.Canceled / context.DeadlineExceeded as Errorw before returning the same partial-data error path. Metrics, events, and component data already suppress that noise; machine info should match.

🔧 Suggested adjustment
 	if c.config.IncludeMachineInfo {
 		if err := c.collectMachineInfo(ctx, data); err != nil {
+			if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
+				return data, err
+			}
 			log.Logger.Errorw("Failed to collect machine info", "error", err)
 		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if c.config.IncludeMachineInfo {
if err := c.collectMachineInfo(data); err != nil {
if err := c.collectMachineInfo(ctx, data); err != nil {
log.Logger.Errorw("Failed to collect machine info", "error", err)
}
}
if err := ctx.Err(); err != nil {
return data, err
if c.config.IncludeMachineInfo {
if err := c.collectMachineInfo(ctx, data); err != nil {
if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
return data, err
}
log.Logger.Errorw("Failed to collect machine info", "error", err)
}
}
if err := ctx.Err(); err != nil {
return data, err
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/exporter/collector/collector.go` around lines 139 - 145, The
machine-info collection currently logs all errors from c.collectMachineInfo as
Errorw and then lets the function hit the ctx.Err() return path; change this to
mirror metrics/events/component behavior: after calling
c.collectMachineInfo(ctx, data) check the returned err, and if err is
context.Canceled or context.DeadlineExceeded (use errors.Is(err,
context.Canceled) or errors.Is(err, context.DeadlineExceeded)) do NOT call
log.Logger.Errorw — instead return the partial data and the ctx error
immediately; for other non-nil errors keep the existing Errorw call and
continue. Ensure you update the logic around c.config.IncludeMachineInfo and the
ctx.Err() handling so machine-info cancellation is suppressed like the other
partial-data phases.

Comment on lines +203 to +215
func (c *collector) collectMachineInfo(ctx context.Context, data *HealthData) error {
if c.nvmlInstance == nil {
return fmt.Errorf("NVML instance not available")
}

var opts []machineinfo.MachineInfoOption
if len(c.dcgmGPUIndexes) > 0 {
opts = append(opts, machineinfo.WithDCGMGPUIndexes(c.dcgmGPUIndexes))
if cached := c.getCachedMachineInfo(); cached != nil {
data.MachineInfo = cached
log.Logger.Debugw("Using cached machine info")
}

machineInfo, err := machineinfo.GetMachineInfo(c.nvmlInstance, opts...)
if err != nil {
return fmt.Errorf("failed to get machine info: %w", err)
c.startMachineInfoRefresh()
if err := ctx.Err(); err != nil {
return err
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't enqueue uncancellable machine-info work from a dead collection.

collectMachineInfo() checks ctx.Err() only after startMachineInfoRefresh(), and the refresh goroutine itself has no cancel/deadline control. A timed-out collection can still launch NVML work, and one wedged machineInfoFetcher call leaves machineInfoCollecting stuck true, so later collections never retry.

Also applies to: 221-252

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/exporter/collector/collector.go` around lines 203 - 215,
collectMachineInfo currently launches startMachineInfoRefresh (which spawns
machineInfoFetcher) before checking ctx.Err() and the fetcher has no
cancellation or guaranteed reset of machineInfoCollecting; change the code to
check ctx.Err() before calling startMachineInfoRefresh and modify
startMachineInfoRefresh/machineInfoFetcher to accept a context (or return a
cancellable func) so the goroutine selects on ctx.Done() to abort early, and
ensure machineInfoCollecting is always cleared (use defer or a finalizer) even
when fetcher exits early or errors so the flag cannot remain stuck true.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant