Skip to content

Commit 864cb66

Browse files
committed
gather-info: rewrite as Go binary with DCGM install support
Replace the bash diagnostics collector with a modular Go implementation. Adds NVIDIA CUDA repo bootstrap and DCGM 4 installation for Ubuntu 22.04/24.04, structured output (manifest.json, report.ndjson), and timeout-protected platform detection.
1 parent 758f1df commit 864cb66

66 files changed

Lines changed: 6879 additions & 2302 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.
Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,77 @@
1+
name: Release gather-info
2+
3+
on:
4+
push:
5+
tags:
6+
- "gather-info/v*"
7+
8+
permissions:
9+
contents: write
10+
11+
# working-directory applies to `run` steps only, not `uses` steps.
12+
# All `uses` step inputs (go-version-file, cache-dependency-path, files)
13+
# must use paths relative to the repository root.
14+
defaults:
15+
run:
16+
working-directory: customers/vm-troubleshooting
17+
18+
jobs:
19+
release:
20+
runs-on: ubuntu-latest
21+
steps:
22+
- name: Checkout
23+
uses: actions/checkout@v4
24+
with:
25+
fetch-depth: 0
26+
27+
- name: Set up Go
28+
uses: actions/setup-go@v5
29+
with:
30+
# Path relative to repo root (working-directory does not apply to uses steps)
31+
go-version-file: customers/vm-troubleshooting/go.mod
32+
cache-dependency-path: customers/vm-troubleshooting/go.sum
33+
34+
- name: Download modules
35+
run: go mod download
36+
37+
- name: Vet
38+
run: go vet ./...
39+
40+
- name: Test
41+
run: go test ./...
42+
43+
- name: Build
44+
env:
45+
CGO_ENABLED: "0"
46+
GOOS: linux
47+
GOARCH: amd64
48+
run: |
49+
# Strip the "gather-info/" prefix to get the clean version string
50+
VERSION="${GITHUB_REF_NAME#gather-info/}"
51+
COMMIT="${{ github.sha }}"
52+
DATE="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
53+
go build -trimpath \
54+
-ldflags="-s -w \
55+
-X github.com/NexGenCloud/vm-diagnostics/internal/config.Version=${VERSION} \
56+
-X github.com/NexGenCloud/vm-diagnostics/internal/config.Commit=${COMMIT:0:7} \
57+
-X github.com/NexGenCloud/vm-diagnostics/internal/config.BuildDate=${DATE}" \
58+
-o gather-info \
59+
./cmd/gather-info
60+
61+
- name: Generate checksum
62+
run: sha256sum gather-info > gather-info.sha256
63+
64+
- name: Verify binary
65+
run: |
66+
./gather-info --version
67+
file gather-info | grep -q "statically linked"
68+
69+
- name: Release
70+
uses: softprops/action-gh-release@v2
71+
with:
72+
name: "gather-info ${{ github.ref_name }}"
73+
generate_release_notes: true
74+
# Paths relative to repo root (working-directory does not apply to uses steps)
75+
files: |
76+
customers/vm-troubleshooting/gather-info
77+
customers/vm-troubleshooting/gather-info.sha256

.gitignore

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
# Build artifacts
2+
*.o
3+
*.exe
4+
bin/
5+
6+
# OS metadata
7+
.DS_Store
8+
Thumbs.db
9+
10+
# Editor/IDE
11+
*.swp
12+
*.swo
13+
*~
14+
.idea/
15+
.vscode/
16+
*.sublime-project
17+
*.sublime-workspace
18+
19+
# Environment and secrets
20+
.env
21+
.env.*
22+
*.pem
23+
*.key

AGENTS.md

Lines changed: 74 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,74 @@
1+
# AGENTS.md
2+
3+
## Project purpose
4+
This repository contains customer-facing support scripts and binaries for collecting diagnostics or applying narrow workarounds on customer VMs.
5+
Assume the operator is a customer or support engineer running the tool locally on a machine we do not control and cannot access directly.
6+
7+
## Operating constraints
8+
- Prefer self-contained tooling with minimal external dependencies.
9+
- Assume common Linux distributions first: Ubuntu 20.04/22.04/24.04. Treat other distros as best-effort unless explicitly supported.
10+
- Tools must continue to provide useful output when some commands, drivers, services, or files are missing.
11+
- Never assume outbound internet access.
12+
- Never assume package managers are usable.
13+
- Never assume GPUs, kernel modules, systemd, sudo, or container runtimes are present.
14+
15+
## Safety and privacy
16+
- Do not collect secrets or high-risk data.
17+
- Never collect private keys, tokens, passwords, shell history, environment variables, cloud credentials, SSH material, or application secrets.
18+
- Prefer explicit allowlists of collected files/commands over broad filesystem collection.
19+
- Redact sensitive values if collection is unavoidable.
20+
- Any remediation script must be narrowly scoped, reversible where possible, and clearly logged.
21+
22+
## UX requirements
23+
- Output must be understandable by a customer and useful to support.
24+
- Print what is being collected and why when practical.
25+
- Fail per section, not all-or-nothing.
26+
- Exit codes must be meaningful.
27+
- Write deterministic output paths and filenames.
28+
- Prefer a single archive or directory that the customer can send back.
29+
30+
## Implementation rules
31+
- Prefer read-only diagnostics.
32+
- For any mutating action, require explicit user intent and document risk.
33+
- Use timeouts for subprocesses and slow probes.
34+
- Handle missing commands gracefully.
35+
- Avoid distro-specific parsing when a portable interface exists.
36+
- Avoid brittle text parsing of commands whose output changes across versions.
37+
- Prefer machine-readable sources when available.
38+
- Do not silently ignore errors that affect support value; record them in output.
39+
40+
## Verification
41+
Before considering work complete, run the narrowest relevant checks that exist.
42+
43+
Current commands:
44+
- Bash lint: `shellcheck nvidia-drm-disable-modeset.sh`
45+
- Bash syntax: `bash -n nvidia-drm-disable-modeset.sh`
46+
47+
If a Go implementation exists:
48+
- Format: `cd customers/vm-troubleshooting && gofmt -w .`
49+
- Vet: `cd customers/vm-troubleshooting && go vet ./...`
50+
- Test: `cd customers/vm-troubleshooting && go test ./...`
51+
- Build: `cd customers/vm-troubleshooting && CGO_ENABLED=0 go build ./cmd/gather-info`
52+
53+
## Repo structure
54+
- `customers/`: customer-run support tooling and related assets.
55+
- Root scripts: focused one-off support or remediation utilities.
56+
- `customers/vm-troubleshooting/`: current Go-based diagnostics collector.
57+
- `customers/vm-troubleshooting/CODEMAP.md`: architecture and collector map for the diagnostics collector.
58+
59+
## Diagnostics collector guidance
60+
For the Go-based diagnostics collector:
61+
- Preserve user-visible behavior and output structure unless there is a clear improvement.
62+
- Keep collectors modular.
63+
- Keep command execution timeout-bound.
64+
- Check command availability before execution.
65+
- Make unsupported probes report "not available" rather than fail the whole run.
66+
- Prefer static builds for portability unless there is a strong reason not to.
67+
- Keep `customers/vm-troubleshooting/CODEMAP.md` current when changing architecture or collector ownership.
68+
69+
## Done means
70+
A change is not done until:
71+
- the relevant checks pass,
72+
- the tool behavior is explained in customer-safe terms,
73+
- output remains useful on partially broken systems,
74+
- and no new sensitive data is collected.

CLAUDE.md

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
# CLAUDE.md
2+
3+
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4+
5+
See @AGENTS.md for all project instructions, constraints, and verification commands.
6+
7+
For the diagnostics collector, also see `customers/vm-troubleshooting/AGENTS.md` and `customers/vm-troubleshooting/CODEMAP.md`.
8+
9+
ShellCheck is pre-allowed in `.claude/settings.local.json`.

0 commit comments

Comments
 (0)