Conversation
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
📝 WalkthroughWalkthroughCollector now checks context cancellation/deadline throughout collection and returns partial HealthData on early exit; machine-info collection is background-refreshed and cached. Exporter separates collection and upload timeouts so uploads can proceed with a fresh context even when collection timed out. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Exporter
participant Collector
participant Components
participant HTTPWriter
Client->>Exporter: export()
activate Exporter
Note over Exporter: create collectCtx (timeout for collection)
Exporter->>Collector: Collect(collectCtx)
activate Collector
Collector->>Collector: periodic ctx.Err() checks
Collector->>Components: collectMachineInfo (background refresh), collect metrics, collect events, collect component data
rect rgba(200,150,100,0.5)
Note over Components,Collector: a component or step hits deadline/cancel
Components-->>Collector: context.DeadlineExceeded / context.Canceled
end
Collector-->>Exporter: partial HealthData + ctx.Err()
deactivate Collector
Note over Exporter: if HealthData != nil -> create fresh uploadCtx
Exporter->>HTTPWriter: Send(uploadCtx, partial HealthData)
activate HTTPWriter
HTTPWriter-->>Exporter: upload result
deactivate HTTPWriter
Exporter-->>Client: return result (success or error)
deactivate Exporter
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/exporter/collector/collector.go (1)
178-187: Consider adding context checks after attestation and config collection.The
collectAttestationDataandcollectConfigDatamethods are not followed by context cancellation checks, unlike the other collection steps. While these are typically fast operations, for consistency and to ensure prompt cancellation, consider adding checks after them as well.♻️ Optional: Add context checks for consistency
// Collect attestation data if provider is available // Attestation is always enabled if manager is available if err := c.collectAttestationData(data); err != nil { log.Logger.Errorw("Failed to collect attestation data", "error", err) } + if err := ctx.Err(); err != nil { + return data, err + } // Collect config data if err := c.collectConfigData(data); err != nil { log.Logger.Errorw("Failed to collect config data", "error", err) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/exporter/collector/collector.go` around lines 178 - 187, After calling c.collectAttestationData(data) and c.collectConfigData(data) add the same context cancellation checks used elsewhere in this function so collection stops promptly on cancel; specifically, after each call (and its error handling) check the request context (ctx or the collector's context variable) for cancellation (e.g., select on ctx.Done()) and return early if cancelled, mirroring the existing pattern used after other collection steps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/exporter/collector/collector.go`:
- Around line 178-187: After calling c.collectAttestationData(data) and
c.collectConfigData(data) add the same context cancellation checks used
elsewhere in this function so collection stops promptly on cancel; specifically,
after each call (and its error handling) check the request context (ctx or the
collector's context variable) for cancellation (e.g., select on ctx.Done()) and
return early if cancelled, mirroring the existing pattern used after other
collection steps.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e987c168-0949-4056-a56b-54653c58a4e7
📒 Files selected for processing (4)
internal/exporter/collector/collector.gointernal/exporter/collector/collector_test.gointernal/exporter/exporter.gointernal/exporter/exporter_test.go
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/exporter/exporter.go (1)
180-181: Separate the collection timeout from the HTTP timeout.
internal/config/config.go:96-104still documentsHealthExporter.Timeoutas the HTTP request timeout, but these lines now spend that budget twice—once on collection and again on upload. That changes the worst-case duration of one export cycle and couples GPU stall tuning to network tuning. A dedicated collection timeout, or a config/doc rename, would make this behavior much easier to reason about.Also applies to: 205-206
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/exporter/exporter.go` around lines 180 - 181, The code currently reuses e.options.timeout for both collection and upload (see the context.WithTimeout using e.ctx to create collectCtx and the later upload path), which doubles the intended HTTP timeout and couples collection and network tuning; define and use a separate collection timeout (e.g. add e.options.collectTimeout or rename existing config) and replace the collection context creation (collectCtx, cancelCollect := context.WithTimeout(e.ctx, e.options.timeout)) with the new value (e.options.collectTimeout), update any other collection uses (including the similar lines at 205-206), and update the configuration/docs in internal/config/config.go to document the distinct collection timeout vs HTTP/upload timeout so the two budgets are independent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/exporter/exporter.go`:
- Around line 180-181: ExportNow(ctx) currently ignores the caller context and
export() roots phase contexts off e.ctx, so thread the caller-provided ctx into
export() (call export(ctx, ...)) and within export use the caller ctx as the
parent for phase timeouts (use context.WithTimeout(ctx, e.options.timeout) for
collection and likewise for upload) instead of context.WithTimeout(e.ctx,...);
after the collection phase returns, check collectCtx.Err() (or context
cancellation) and return early if cancelled/expired so you don't start upload
when collection was cancelled; this also ensures Stop() canceling e.ctx won't
immediately cancel on-demand exports that used the caller ctx.
---
Nitpick comments:
In `@internal/exporter/exporter.go`:
- Around line 180-181: The code currently reuses e.options.timeout for both
collection and upload (see the context.WithTimeout using e.ctx to create
collectCtx and the later upload path), which doubles the intended HTTP timeout
and couples collection and network tuning; define and use a separate collection
timeout (e.g. add e.options.collectTimeout or rename existing config) and
replace the collection context creation (collectCtx, cancelCollect :=
context.WithTimeout(e.ctx, e.options.timeout)) with the new value
(e.options.collectTimeout), update any other collection uses (including the
similar lines at 205-206), and update the configuration/docs in
internal/config/config.go to document the distinct collection timeout vs
HTTP/upload timeout so the two budgets are independent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f68d14bc-8736-4ca9-bf2e-b6867ef6ac8e
📒 Files selected for processing (4)
internal/exporter/collector/collector.gointernal/exporter/collector/collector_test.gointernal/exporter/exporter.gointernal/exporter/exporter_test.go
✅ Files skipped from review due to trivial changes (1)
- internal/exporter/exporter_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/exporter/collector/collector.go
- internal/exporter/collector/collector_test.go
| collectCtx, cancelCollect := context.WithTimeout(e.ctx, e.options.timeout) | ||
| defer cancelCollect() |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that ExportNow accepts a caller context while export() roots phase contexts in e.ctx.
rg -n -C2 --type go \
-e 'func \(e \*healthExporter\) ExportNow' \
-e 'func \(e \*healthExporter\) export\(' \
-e 'context\.WithTimeout\(e\.ctx'Repository: NVIDIA/fleet-intelligence-agent
Length of output: 1364
🏁 Script executed:
#!/bin/bash
# Find all calls to export() within the file
rg -n 'e\.export\(\)' internal/exporter/exporter.go
# Check the Start() method to understand context lifecycle
rg -n -A 20 'func \(e \*healthExporter\) Start' internal/exporter/exporter.go
# Check Stop/Close methods if they exist
rg -n -B 2 -A 10 'func \(e \*healthExporter\) Stop\|func \(e \*healthExporter\) Close' internal/exporter/exporter.goRepository: NVIDIA/fleet-intelligence-agent
Length of output: 739
🏁 Script executed:
#!/bin/bash
# Show more of the Start() method, particularly around line 147
sed -n '123,160p' internal/exporter/exporter.go
# Search for where e.ctx is created (likely in constructor or New function)
rg -n 'e\.ctx.*=' internal/exporter/exporter.go | head -20
# Show the healthExporter struct definition
rg -n -B 5 -A 15 'type healthExporter struct' internal/exporter/exporter.goRepository: NVIDIA/fleet-intelligence-agent
Length of output: 1737
🏁 Script executed:
#!/bin/bash
# Show the complete New() function
sed -n '61,100p' internal/exporter/exporter.go
# Show the complete Stop() method
rg -n -A 20 'func \(e \*healthExporter\) Stop\(\)' internal/exporter/exporter.goRepository: NVIDIA/fleet-intelligence-agent
Length of output: 2028
Pass the caller context to the export method.
ExportNow(ctx) accepts a context parameter but ignores it—the export() method roots both phase contexts in e.ctx instead. This prevents caller deadlines and cancellations from taking effect for on-demand exports.
Additionally, when Stop() calls e.cancel(), it cancels e.ctx. Any in-flight export() call will then create timeout contexts from an already-canceled parent, immediately failing both collection and upload phases and defeating the intended independent timeout budgets. Thread the caller context into export() for on-demand invocations (line 171) and return early when collection is canceled to avoid propagating cancellation into the upload phase.
Also applies to: 193-198, 205-206
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/exporter/exporter.go` around lines 180 - 181, ExportNow(ctx)
currently ignores the caller context and export() roots phase contexts off
e.ctx, so thread the caller-provided ctx into export() (call export(ctx, ...))
and within export use the caller ctx as the parent for phase timeouts (use
context.WithTimeout(ctx, e.options.timeout) for collection and likewise for
upload) instead of context.WithTimeout(e.ctx,...); after the collection phase
returns, check collectCtx.Err() (or context cancellation) and return early if
cancelled/expired so you don't start upload when collection was cancelled; this
also ensures Stop() canceling e.ctx won't immediately cancel on-demand exports
that used the caller ctx.
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
internal/config/config_test.go (1)
187-207: Consider refactoring to useuseTempHomeDirfor consistency.This test manually implements the same override/cleanup pattern that the new
useTempHomeDirhelper provides. The same applies toTestDefaultStateFileRepairsExistingDirectoryPermissionsbelow. Refactoring both to use the helper would reduce duplication and improve maintainability.♻️ Proposed refactor
func TestDefaultStateFile(t *testing.T) { - origGeteuid := osGeteuid - origHomeDirFn := homeDirFn - t.Cleanup(func() { - osGeteuid = origGeteuid - homeDirFn = origHomeDirFn - }) - - tmpHome := t.TempDir() - osGeteuid = func() int { return 1000 } - homeDirFn = func() (string, error) { return tmpHome, nil } + tmpHome := useTempHomeDir(t) path, err := DefaultStateFile()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/config_test.go` around lines 187 - 207, Replace the manual osGeteuid/homeDirFn override and cleanup in TestDefaultStateFile (and similarly in TestDefaultStateFileRepairsExistingDirectoryPermissions) with the existing useTempHomeDir test helper: call useTempHomeDir(t) to set up tmpHome and to handle restoration, then use the returned tmpHome when asserting DefaultStateFile() behavior and directory permissions; keep references to DefaultStateFile, osGeteuid, and homeDirFn only as context and remove the explicit override/cleanup blocks to eliminate duplication.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/exporter/collector/collector.go`:
- Around line 203-215: collectMachineInfo currently launches
startMachineInfoRefresh (which spawns machineInfoFetcher) before checking
ctx.Err() and the fetcher has no cancellation or guaranteed reset of
machineInfoCollecting; change the code to check ctx.Err() before calling
startMachineInfoRefresh and modify startMachineInfoRefresh/machineInfoFetcher to
accept a context (or return a cancellable func) so the goroutine selects on
ctx.Done() to abort early, and ensure machineInfoCollecting is always cleared
(use defer or a finalizer) even when fetcher exits early or errors so the flag
cannot remain stuck true.
- Around line 139-145: The machine-info collection currently logs all errors
from c.collectMachineInfo as Errorw and then lets the function hit the ctx.Err()
return path; change this to mirror metrics/events/component behavior: after
calling c.collectMachineInfo(ctx, data) check the returned err, and if err is
context.Canceled or context.DeadlineExceeded (use errors.Is(err,
context.Canceled) or errors.Is(err, context.DeadlineExceeded)) do NOT call
log.Logger.Errorw — instead return the partial data and the ctx error
immediately; for other non-nil errors keep the existing Errorw call and
continue. Ensure you update the logic around c.config.IncludeMachineInfo and the
ctx.Err() handling so machine-info cancellation is suppressed like the other
partial-data phases.
---
Nitpick comments:
In `@internal/config/config_test.go`:
- Around line 187-207: Replace the manual osGeteuid/homeDirFn override and
cleanup in TestDefaultStateFile (and similarly in
TestDefaultStateFileRepairsExistingDirectoryPermissions) with the existing
useTempHomeDir test helper: call useTempHomeDir(t) to set up tmpHome and to
handle restoration, then use the returned tmpHome when asserting
DefaultStateFile() behavior and directory permissions; keep references to
DefaultStateFile, osGeteuid, and homeDirFn only as context and remove the
explicit override/cleanup blocks to eliminate duplication.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e930565d-b6e7-45d8-9c75-e580b2f96aa7
📒 Files selected for processing (3)
internal/config/config_test.gointernal/exporter/collector/collector.gointernal/exporter/collector/collector_test.go
| if c.config.IncludeMachineInfo { | ||
| if err := c.collectMachineInfo(data); err != nil { | ||
| if err := c.collectMachineInfo(ctx, data); err != nil { | ||
| log.Logger.Errorw("Failed to collect machine info", "error", err) | ||
| } | ||
| } | ||
| if err := ctx.Err(); err != nil { | ||
| return data, err |
There was a problem hiding this comment.
Treat machine-info cancellation like the other partial-data phases.
This branch still logs context.Canceled / context.DeadlineExceeded as Errorw before returning the same partial-data error path. Metrics, events, and component data already suppress that noise; machine info should match.
🔧 Suggested adjustment
if c.config.IncludeMachineInfo {
if err := c.collectMachineInfo(ctx, data); err != nil {
+ if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
+ return data, err
+ }
log.Logger.Errorw("Failed to collect machine info", "error", err)
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if c.config.IncludeMachineInfo { | |
| if err := c.collectMachineInfo(data); err != nil { | |
| if err := c.collectMachineInfo(ctx, data); err != nil { | |
| log.Logger.Errorw("Failed to collect machine info", "error", err) | |
| } | |
| } | |
| if err := ctx.Err(); err != nil { | |
| return data, err | |
| if c.config.IncludeMachineInfo { | |
| if err := c.collectMachineInfo(ctx, data); err != nil { | |
| if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) { | |
| return data, err | |
| } | |
| log.Logger.Errorw("Failed to collect machine info", "error", err) | |
| } | |
| } | |
| if err := ctx.Err(); err != nil { | |
| return data, err |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/exporter/collector/collector.go` around lines 139 - 145, The
machine-info collection currently logs all errors from c.collectMachineInfo as
Errorw and then lets the function hit the ctx.Err() return path; change this to
mirror metrics/events/component behavior: after calling
c.collectMachineInfo(ctx, data) check the returned err, and if err is
context.Canceled or context.DeadlineExceeded (use errors.Is(err,
context.Canceled) or errors.Is(err, context.DeadlineExceeded)) do NOT call
log.Logger.Errorw — instead return the partial data and the ctx error
immediately; for other non-nil errors keep the existing Errorw call and
continue. Ensure you update the logic around c.config.IncludeMachineInfo and the
ctx.Err() handling so machine-info cancellation is suppressed like the other
partial-data phases.
| func (c *collector) collectMachineInfo(ctx context.Context, data *HealthData) error { | ||
| if c.nvmlInstance == nil { | ||
| return fmt.Errorf("NVML instance not available") | ||
| } | ||
|
|
||
| var opts []machineinfo.MachineInfoOption | ||
| if len(c.dcgmGPUIndexes) > 0 { | ||
| opts = append(opts, machineinfo.WithDCGMGPUIndexes(c.dcgmGPUIndexes)) | ||
| if cached := c.getCachedMachineInfo(); cached != nil { | ||
| data.MachineInfo = cached | ||
| log.Logger.Debugw("Using cached machine info") | ||
| } | ||
|
|
||
| machineInfo, err := machineinfo.GetMachineInfo(c.nvmlInstance, opts...) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to get machine info: %w", err) | ||
| c.startMachineInfoRefresh() | ||
| if err := ctx.Err(); err != nil { | ||
| return err |
There was a problem hiding this comment.
Don't enqueue uncancellable machine-info work from a dead collection.
collectMachineInfo() checks ctx.Err() only after startMachineInfoRefresh(), and the refresh goroutine itself has no cancel/deadline control. A timed-out collection can still launch NVML work, and one wedged machineInfoFetcher call leaves machineInfoCollecting stuck true, so later collections never retry.
Also applies to: 221-252
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/exporter/collector/collector.go` around lines 203 - 215,
collectMachineInfo currently launches startMachineInfoRefresh (which spawns
machineInfoFetcher) before checking ctx.Err() and the fetcher has no
cancellation or guaranteed reset of machineInfoCollecting; change the code to
check ctx.Err() before calling startMachineInfoRefresh and modify
startMachineInfoRefresh/machineInfoFetcher to accept a context (or return a
cancellable func) so the goroutine selects on ctx.Done() to abort early, and
ensure machineInfoCollecting is always cleared (use defer or a finalizer) even
when fetcher exits early or errors so the flag cannot remain stuck true.
Summary
Fix exporter timeout handling so a stalled collection phase does not force HTTP upload to inherit an expired context.
What Changed
Root Cause
The exporter previously used one shared timeout for metadata refresh, collection, and upload. When GPU-facing collection work stalled, the shared context expired and the later event-read and HTTP-upload phases inherited a dead context, causing cascading
context deadline exceededfailures.Impact
The agent now degrades more cleanly under driver/NVML stalls: partial collection can still be uploaded, and cancellation no longer produces misleading per-phase timeout cascades.
Validation
go test ./internal/exporter/...Summary by CodeRabbit
Bug Fixes
Tests