fix(onboard): fail fast in preflight when all dashboard ports are occupied (#3953)#3980
fix(onboard): fail fast in preflight when all dashboard ports are occupied (#3953)#3980nvshaxie wants to merge 1 commit into
Conversation
…upied (#3953) `findAvailableDashboardPort` already raises the right error ("All dashboard ports in range 18789-18799 are occupied …") when every port is held, but it only runs late in onboarding — after preflight, gateway start, and inference selection have already had side effects. A user with all 11 dashboard ports held by external processes (python3, etc.) saw onboard proceed all the way to the inference menu before any port-related failure surfaced, contradicting the contract in the docs / test plan. Add a narrow `preflightDashboardPortRangeAvailability()` helper to `src/lib/onboard/dashboard-port.ts` and call it from `preflight()` right before it returns the GPU detection. The helper only checks host bindings (it does not need OpenShell forward state, so it is safe before gateway start). When every port in `DASHBOARD_PORT_RANGE_START..END` is bound, it prints the same canonical message and exits 1. To keep `src/lib/onboard.ts` net-neutral per the `onboard-entrypoint-budget` workflow, the unused `findDashboardForwardOwner` import / re-export is dropped (no callers outside the colocated unit test, which imports it directly from `./onboard/dashboard-port`). Net delta on `onboard.ts`: +2/-2. Test plan - Added 3 unit tests in `src/lib/onboard/dashboard-port.test.ts` with injected `isPortBoundOnHost`/`process.exit` stubs: * exits 1 with the canonical message when every port in the range is bound; stderr lists each port → non-OpenShell host listener. * returns without exiting when 10 of 11 are bound and 1 is free. * returns without exiting when no port is bound. - `npx vitest run src/lib/onboard/dashboard-port.test.ts` — 11/11 pass. - Manual end-to-end on Ubuntu 24.04 x86_64: * Occupy 18789-18799 with `python3 -m http.server` placeholders. * `nemoclaw onboard --non-interactive ...` now exits inside `[1/8] Preflight checks` with `All dashboard ports in range 18789-18799 are occupied:` followed by one line per occupied port. `[2/8] Starting OpenShell gateway` and `[3/8] Configuring inference` are never printed. Signed-off-by: Shawn Xie <shaxie@nvidia.com>
📝 WalkthroughWalkthroughThis PR adds an early preflight check that fails fast when all dashboard ports (18789–18799) are occupied by external processes. The new ChangesDashboard Port Exhaustion Preflight
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
PR Review AdvisorRecommendation: blocked This is an automated advisory review. A human maintainer must make the final merge decision. Limitations: Review is based on the provided trusted metadata and repository read-only inspection; tests and package-manager commands were not executed.; CI and E2E Advisor were still pending for the provided head SHA, so final pass/fail status is unknown.; No review-thread state beyond the supplied GraphQL snapshot was available; CodeRabbit was still processing.; Linked issue had no comments in the provided context, so acceptance mapping covers the issue body only. Full advisor summaryPR Review AdvisorBase: The implementation adds useful early dashboard-port exhaustion detection, but it blocks the documented --control-ui-port escape hatch and current hard gates are not satisfied (CI pending, merge blocked, E2E Advisor not complete). Gate status
🔴 Blockers
🟡 Warnings
🔵 Suggestions
Acceptance coverage
Security review
Test / E2E status
✅ What looks good
Review completeness
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Line 3844: preflightDashboardPortRangeAvailability() is being called
unconditionally which causes failures even when the user provided an explicit
--control-ui-port; update the logic so the range-exhaustion check runs only when
no explicit control UI port override is present — detect the override (the CLI
option or config value that represents the explicit port, e.g. controlUiPort /
controlUiPortOverride / options.controlUiPort or the corresponding env var used
earlier in the file) and skip calling preflightDashboardPortRangeAvailability()
when that override is set, leaving the existing explicit-port path intact (the
explicit-port handling you preserved earlier) so the onboarding respects
--control-ui-port inputs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a75a1aed-7ee0-41b2-be45-486dcc29bbfe
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/dashboard-port.test.tssrc/lib/onboard/dashboard-port.ts
| } | ||
| } | ||
|
|
||
| preflightDashboardPortRangeAvailability(); // #3953 — fail fast on dashboard-port exhaustion before step 2 |
There was a problem hiding this comment.
Honor explicit --control-ui-port overrides before failing on range exhaustion.
Line 3844 unconditionally checks the default dashboard range, which breaks the explicit-port path you already preserve in Lines 3666-3674. If a user reruns with --control-ui-port 3000, onboarding can still exit just because 18789-18799 are full, even though that range is no longer relevant.
💡 Suggested fix
- preflightDashboardPortRangeAvailability(); // `#3953` — fail fast on dashboard-port exhaustion before step 2
+ if (_preflightDashboardPort === null) {
+ preflightDashboardPortRangeAvailability(); // `#3953` — fail fast on dashboard-port exhaustion before step 2
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/onboard.ts` at line 3844, preflightDashboardPortRangeAvailability()
is being called unconditionally which causes failures even when the user
provided an explicit --control-ui-port; update the logic so the range-exhaustion
check runs only when no explicit control UI port override is present — detect
the override (the CLI option or config value that represents the explicit port,
e.g. controlUiPort / controlUiPortOverride / options.controlUiPort or the
corresponding env var used earlier in the file) and skip calling
preflightDashboardPortRangeAvailability() when that override is set, leaving the
existing explicit-port path intact (the explicit-port handling you preserved
earlier) so the onboarding respects --control-ui-port inputs.
Summary
findAvailableDashboardPortalready raises the right "All dashboard ports in range 18789-18799 are occupied …" error when every port is held, but it only fires in late onboarding (after preflight + gateway start + inference selection). Reporters expect the failure to be upfront.preflightDashboardPortRangeAvailability()helper insrc/lib/onboard/dashboard-port.tsthat only checks host bindings (safe to run before OpenShell is reachable) and call it at the end ofpreflight(). Exits 1 with the same canonical message when every port inDASHBOARD_PORT_RANGE_START..ENDis bound.findDashboardForwardOwnerimport/re-export fromsrc/lib/onboard.tsto keep the entrypoint net-neutral per theonboard-entrypoint-budgetworkflow (the only remaining caller isdashboard-port.test.ts, which imports it directly from the module).Bug reproduction (pre-fix)
```
$ for p in $(seq 18789 18799); do python3 -m http.server "$p" &>/dev/null & done
$ nemoclaw onboard --non-interactive --yes-i-accept-third-party-software --name overflow-test ...
[1/8] Preflight checks ← passes
[2/8] Starting OpenShell gateway ← runs anyway
[3/8] Configuring inference (NIM) ← runs anyway
Inference options menu … ← reaches the wizard
```
Behavior post-fix (same setup)
```
[1/8] Preflight checks
✓ Docker is running … ✓ Memory OK …
All dashboard ports in range 18789-18799 are occupied:
18789 → non-OpenShell host listener
18790 → non-OpenShell host listener
...
18799 → non-OpenShell host listener
Free a sandbox or use --control-ui-port with a port outside this range.
```
Exit 1. `[2/8] Starting OpenShell gateway` is never printed.
Why the helper is sound at this stage
findAvailableDashboardPortneedsopenshell forward list(so it can distinguish "this port is bound by an OpenShell forward that belongs to this sandbox" from "host listener"). The preflight helper deliberately skips that distinction and treats every bound port as a non-OpenShell listener.--control-ui-port <N>or freeing a sandbox).Test plan
Fixes #3953.
Signed-off-by: Shawn Xie shaxie@nvidia.com
Summary by CodeRabbit
New Features
Tests