|
| 1 | +# AGENTS.md |
| 2 | + |
| 3 | +## Project purpose |
| 4 | +This repository contains customer-facing support scripts and binaries for collecting diagnostics or applying narrow workarounds on customer VMs. |
| 5 | +Assume the operator is a customer or support engineer running the tool locally on a machine we do not control and cannot access directly. |
| 6 | + |
| 7 | +## Operating constraints |
| 8 | +- Prefer self-contained tooling with minimal external dependencies. |
| 9 | +- Assume common Linux distributions first: Ubuntu 20.04/22.04/24.04. Treat other distros as best-effort unless explicitly supported. |
| 10 | +- Tools must continue to provide useful output when some commands, drivers, services, or files are missing. |
| 11 | +- Never assume outbound internet access. |
| 12 | +- Never assume package managers are usable. |
| 13 | +- Never assume GPUs, kernel modules, systemd, sudo, or container runtimes are present. |
| 14 | + |
| 15 | +## Safety and privacy |
| 16 | +- Do not collect secrets or high-risk data. |
| 17 | +- Never collect private keys, tokens, passwords, shell history, environment variables, cloud credentials, SSH material, or application secrets. |
| 18 | +- Prefer explicit allowlists of collected files/commands over broad filesystem collection. |
| 19 | +- Redact sensitive values if collection is unavoidable. |
| 20 | +- Any remediation script must be narrowly scoped, reversible where possible, and clearly logged. |
| 21 | + |
| 22 | +## UX requirements |
| 23 | +- Output must be understandable by a customer and useful to support. |
| 24 | +- Print what is being collected and why when practical. |
| 25 | +- Fail per section, not all-or-nothing. |
| 26 | +- Exit codes must be meaningful. |
| 27 | +- Write deterministic output paths and filenames. |
| 28 | +- Prefer a single archive or directory that the customer can send back. |
| 29 | + |
| 30 | +## Implementation rules |
| 31 | +- Prefer read-only diagnostics. |
| 32 | +- For any mutating action, require explicit user intent and document risk. |
| 33 | +- Use timeouts for subprocesses and slow probes. |
| 34 | +- Handle missing commands gracefully. |
| 35 | +- Avoid distro-specific parsing when a portable interface exists. |
| 36 | +- Avoid brittle text parsing of commands whose output changes across versions. |
| 37 | +- Prefer machine-readable sources when available. |
| 38 | +- Do not silently ignore errors that affect support value; record them in output. |
| 39 | + |
| 40 | +## Verification |
| 41 | +Before considering work complete, run the narrowest relevant checks that exist. |
| 42 | + |
| 43 | +Current commands: |
| 44 | +- Bash lint: `shellcheck nvidia-drm-disable-modeset.sh` |
| 45 | +- Bash syntax: `bash -n nvidia-drm-disable-modeset.sh` |
| 46 | + |
| 47 | +If a Go implementation exists: |
| 48 | +- Format: `cd customers/vm-troubleshooting && gofmt -w .` |
| 49 | +- Vet: `cd customers/vm-troubleshooting && go vet ./...` |
| 50 | +- Test: `cd customers/vm-troubleshooting && go test ./...` |
| 51 | +- Build: `cd customers/vm-troubleshooting && CGO_ENABLED=0 go build ./cmd/gather-info` |
| 52 | + |
| 53 | +## Repo structure |
| 54 | +- `customers/`: customer-run support tooling and related assets. |
| 55 | +- Root scripts: focused one-off support or remediation utilities. |
| 56 | +- `customers/vm-troubleshooting/`: current Go-based diagnostics collector. |
| 57 | +- `customers/vm-troubleshooting/CODEMAP.md`: architecture and collector map for the diagnostics collector. |
| 58 | + |
| 59 | +## Diagnostics collector guidance |
| 60 | +For the Go-based diagnostics collector: |
| 61 | +- Preserve user-visible behavior and output structure unless there is a clear improvement. |
| 62 | +- Keep collectors modular. |
| 63 | +- Keep command execution timeout-bound. |
| 64 | +- Check command availability before execution. |
| 65 | +- Make unsupported probes report "not available" rather than fail the whole run. |
| 66 | +- Prefer static builds for portability unless there is a strong reason not to. |
| 67 | +- Keep `customers/vm-troubleshooting/CODEMAP.md` current when changing architecture or collector ownership. |
| 68 | + |
| 69 | +## Done means |
| 70 | +A change is not done until: |
| 71 | +- the relevant checks pass, |
| 72 | +- the tool behavior is explained in customer-safe terms, |
| 73 | +- output remains useful on partially broken systems, |
| 74 | +- and no new sensitive data is collected. |
0 commit comments