chore: cutover Proto Fleet source from upstream repo by mcharles-square · Pull Request #2 · block/proto-fleet

mcharles-square · 2026-04-20T17:27:44Z

Summary

- Migrates the full ProtoFleet codebase into this repository as the initial cutover from the source repo
- Adds GitHub workflows, issue templates, hermit-managed tooling, and project source (2506 files changed)

Test plan

- Verify CI workflows trigger correctly on this PR
- Confirm hermit-managed binaries resolve
- Spot-check directory structure matches the source repo


… URL guard

Address three follow-up review findings on the fleet-agent CLI.

#1 Persist state immediately after Register, before the api_key paste prompt. A Ctrl-C or terminal disconnect during the prompt previously discarded the freshly-generated identity + miner-signing private keys while the server held an orphaned agent row that could not be re-registered (unique-name conflict). State now lands as soon as Register returns; recovery is a `fleet-agent refresh --api-key=<paste>` away.

#2 Narrow the refresh credential wipe to api_key rejection only. Define an errAPIKeyRejected sentinel returned only when BeginAuthHandshake reports Unauthenticated. CompleteAuthHandshake's Unauthenticated cases (expired challenge, signature mismatch) no longer wipe local credentials, so transient or signature-side bugs cannot turn a still-valid api_key into a forced re-enrollment.

#3 Re-validate the server URL on refresh. Persist allow_insecure_transport on State so refresh enforces the same https-or-loopback rule that enroll did. Without this, a state.yaml written via --allow-insecure-transport would have refresh silently keep using cleartext for the long-lived api_key on every renewal.

refresh now also accepts --api-key (and FLEET_AGENT_API_KEY) so it can complete a partial enrollment recovered from an interrupted enroll.

Tests cover: state persistence right after Register (no api_key, incomplete state file), refresh happy path, refresh recovering a partial enroll via flag, refresh requiring an api_key, refresh wiping on rejected api_key, refresh preserving state on a signature failure, refresh rejecting plain http when AllowInsecureTransport is unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
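The https-or-loopback rule from finding #3 can be sketched as follows. This is a minimal illustration, not the fleet-agent code: the function name `validateServerURL`, its signature, and the `allowInsecure` parameter (standing in for the persisted allow_insecure_transport flag) are assumptions.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// validateServerURL is a hypothetical stand-in for the agent's URL guard.
// It accepts https for any host, and plain http only for loopback hosts
// or when allowInsecure has been set explicitly.
func validateServerURL(raw string, allowInsecure bool) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse server URL: %w", err)
	}
	switch u.Scheme {
	case "https":
		return nil
	case "http":
		if allowInsecure || isLoopback(u.Hostname()) {
			return nil
		}
		return fmt.Errorf("refusing cleartext http to non-loopback host %q", u.Hostname())
	default:
		return fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
}

// isLoopback reports whether host is "localhost" or a loopback IP literal.
func isLoopback(host string) bool {
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

func main() {
	urls := []string{
		"https://fleet.example.com",
		"http://127.0.0.1:8080",
		"http://fleet.example.com",
	}
	for _, raw := range urls {
		fmt.Printf("%-28s -> %v\n", raw, validateServerURL(raw, false))
	}
}
```

Running the same check in both enroll and refresh, with the flag read back from state, is what keeps a renewal from quietly falling back to cleartext.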

…concurrent runs

#1 refresh now keeps the override candidate separate from the stored api_key during validation. A rejected --api-key override only surfaces an error and leaves the stored credentials on disk; the wipe path fires only when the stored key itself is what the server rejected. A typo in --api-key no longer takes a healthy agent offline.

#2 saveState fsyncs the temp file before close and fsyncs the containing directory after rename. A power loss or kernel crash between rename and disk flush can no longer roll back the rename or leave a torn write. The crash-safety claim in the comment is now honored end-to-end.

#3 enroll and refresh wrap their full load/handshake/save sequences in an exclusive flock on <state-dir>/state.lock. Two concurrent refresh invocations (timer overlap, scripted retry, double-click) now serialize, so a slower writer cannot overwrite a newer session_token the server has already replaced. Lock + dir-fsync are linux/darwin only; non-Unix targets get no-op stubs and a comment noting the gap.

Tests: stored-key survives a rejected --api-key override; 5 concurrent refreshes serialize and converge to the same final state. Existing wipe-on-rejected-stored-key + preserve-on-signature-failure scenarios keep working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
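The write sequence described in #2 (fsync the temp file, rename, fsync the directory) might look roughly like this. It is a sketch under assumptions: `saveAtomic` and `roundTrip` are hypothetical names, the flock from #3 is left out, and the directory fsync is only expected to work on Unix-like targets, matching the linux/darwin caveat above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// saveAtomic writes data to path via a temp file in the same directory.
// The temp file is fsynced before close, and the parent directory is
// fsynced after the rename so the rename itself survives a crash.
func saveAtomic(path string, data []byte) error {
	dir := filepath.Dir(path)
	tmp, err := os.CreateTemp(dir, ".state-*.tmp")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // fails harmlessly once the rename succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush file contents to disk
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmpName, path); err != nil {
		return err
	}
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync() // flush the directory entry created by the rename
}

// roundTrip is a demo helper: save into a fresh temp dir and read back.
func roundTrip(content string) (string, error) {
	dir, err := os.MkdirTemp("", "state-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "state.yaml")
	if err := saveAtomic(path, []byte(content)); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := roundTrip("session_token: example\n")
	fmt.Printf("read back %q, err=%v\n", got, err)
}
```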

…ambiguous auth

#1 Connect-RPC client rejects every 30x response. Default http.Client follows 307/308 redirects, replaying the POST body (enrollment_token, api_key, signature, challenge response) to the redirect target. A downgrade redirect from https to http would defeat the non-loopback https requirement and leak the api_key over cleartext. Connect does not use redirects, so refusing all 30x is the cleanest fix.

#2 Interactive api_key prompt uses term.ReadPassword when stdin is a TTY. The pasted secret no longer ends up in scrollback, tmux logs, session recordings, or screenshots. Piped input keeps the scanner path so scripted enrollment is unaffected. Adds golang.org/x/term as a direct dependency.

#3 refresh no longer wipes local credentials on Unauthenticated. The server returns Unauthenticated for several distinct cases (api_key revocation, identity_pubkey mismatch, expired challenge, bad signature) and refresh cannot tell them apart. The previous wipe turned a recoverable identity-mismatch or transient proxy error into a forced re-enrollment. State is preserved on every failure mode now; the operator decides whether to re-enroll based on the surfaced error. A rejected --api-key override still does not mutate disk and surfaces a clear "retry without --api-key" message.

Tests: redirect rejected for all 5 30x codes, http.Client carries the 30s timeout, refresh preserves state when the stored api_key is rejected, override still preserves stored creds when typo'd.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
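One way to refuse redirects with the standard library is a `CheckRedirect` hook that always returns an error, as sketched below. The names `newNoRedirectClient`, `errRedirectRefused`, and `postRefusesRedirect` are hypothetical, and how the real client wires this into Connect is not shown in the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

var errRedirectRefused = errors.New("redirects are not followed")

// newNoRedirectClient returns a client that fails the request as soon as
// the server answers with a redirect, so the POST body is never replayed.
func newNoRedirectClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return errRedirectRefused
		},
	}
}

// postRefusesRedirect starts a throwaway local server that replies with
// the given status plus a Location header, then reports whether the POST
// failed with errRedirectRefused.
func postRefusesRedirect(status int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Location", "http://127.0.0.1:1/elsewhere")
		w.WriteHeader(status)
	}))
	defer srv.Close()

	body := strings.NewReader(`{"api_key":"placeholder"}`)
	resp, err := newNoRedirectClient().Post(srv.URL, "application/json", body)
	if resp != nil {
		resp.Body.Close()
	}
	return errors.Is(err, errRedirectRefused)
}

func main() {
	for _, code := range []int{301, 302, 303, 307, 308} {
		fmt.Printf("%d refused: %v\n", code, postRefusesRedirect(code))
	}
}
```

Returning a plain error (rather than `http.ErrUseLastResponse`) makes the call fail outright, which suits a protocol that never expects a redirect.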

Triage of the parallel adversarial review run on PR #206.

Findings applied as code:

- [P1 #1] activity_log composite-FK MATCH SIMPLE bypass on NULL organization_id closed via CHECK constraint: ck_activity_log_site_requires_org enforces site_id IS NULL OR organization_id IS NOT NULL. Verified: insert with site_id + NULL org rejected; insert with NULL site_id + NULL org accepted (system events unchanged).
- [P1 #2] command_on_device_log org_id backfill: pre-flight DO block counts orphan rows (codl rows whose device row is missing) and RAISES with a clear message before SET NOT NULL. A clean abort beats SET NOT NULL failing mid-migration with the dirty flag set.
- [P2 #5] InsertError CTE rewrite: documented the contract change (missing device $3 yields sql.ErrNoRows instead of FK violation). Existing caller wraps generically so surface unchanged.
- [P2 #6] Plan doc trimmed: power-contract column list marked DEFERRED in entity description and Phase 1 migration bullet. Future readers won't write service code against columns that don't exist.
- [P2 #7] ListSites count subqueries: added org_id predicate to device and building scans so they hit idx_device_org_site / idx_building_org_deleted instead of full-table scan in multi-tenant prod.
- [P2 #8] InsertDeviceMetrics sub-select dropped AND deleted_at IS NULL to match InsertError / InsertMinerStateSnapshot. Telemetry from a soft-deleted device is still legitimate per-site history; three writers, one behavior.
- [P3 #10] building.default_rack_order_index: added ck_building_default_rack_order_index CHECK (BETWEEN 0 AND 4) to match sibling CHECKs.
- [P3 #11] fk_device_set_rack_device_set_org: added ON DELETE CASCADE to match the single-column FK on device_set_id and the building FK on the same row. Composite adds the org-matching invariant without changing cascade semantics.

Findings deferred:

- #3 (non-CONCURRENTLY indexes inside tx): deploy-time concern, document in PR/deploy notes; restructuring migrations to use CONCURRENTLY is a separate effort.
- #4 (no integration tests): defer to Phase 1B service layer where end-to-end flows exercise the invariants.
- #9 (device.uq_device_id_org_id missing): already exists (verified on live DB); finding is incorrect.
- #12 (CREATE TABLE IF NOT EXISTS): IF NOT EXISTS hides genuine schema errors; force-clean is the right golang-migrate pattern.

Round-trip clean, build clean, lint clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Five fixes from Codex + Copilot reviews:

1. run.go: ValidateServerURL on every entry, not just the refresh path. A tampered or stale state with a fresh session_token would have let bearer heartbeats hit a plaintext non-loopback URL. Added TestRunCmd_ValidatesServerURLBeforeBuildingClient. (Codex MEDIUM; Copilot inline.)

2. run.go: Move signal.NotifyContext above the initial refresh so SIGINT can interrupt a startup refresh against a slow/unreachable gateway. Previously the initial refreshAndSave used context.Background() and could hang up to the 30s HTTP timeout before shutting down. (Codex inline #1, Copilot #3.)

3. run.go: Bump the "heartbeat sent" log line from Debug to Info. The default slog handler is Info, so the daemon was silent under normal operation; a stuck daemon would be indistinguishable from a working one. (Copilot #4.)

4. authclient.go: Streaming wrapper now matches unary on empty token. Previously WrapStreamingClient silently omitted the Authorization header when tokenSource returned empty, letting an unauthenticated stream open and fail later in harder-to-debug ways. Now returns a failingStreamingClientConn whose Send/Receive surface Unauthenticated immediately, matching the unary path's fail-fast contract. (Copilot #5; affects future ControlStream work, not currently exercised.)

5. run_test.go: ed25519.GenerateKey(nil) -> rand.Reader explicitly. The stdlib actually handles nil by falling back to crypto/rand, so this isn't a real panic risk, but explicit is clearer and matches the integration tests' style. (Copilot #2.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
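The ordering change in fix 2 can be illustrated with a small sketch: the signal-aware context is created first and then passed to the startup refresh, so cancellation reaches it. `slowRefresh` and `refreshInterrupted` are hypothetical stand-ins; the real refreshAndSave is not shown in the commit message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// slowRefresh stands in for a startup refresh against a slow gateway.
// It returns early with ctx.Err() if the context is cancelled.
func slowRefresh(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// refreshInterrupted simulates a signal arriving before the refresh
// finishes by cancelling the context up front.
func refreshInterrupted() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return errors.Is(slowRefresh(ctx, 30*time.Second), context.Canceled)
}

func main() {
	// Create the signal-aware context before any network work starts.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	if err := slowRefresh(ctx, 100*time.Millisecond); err != nil {
		fmt.Println("startup refresh interrupted:", err)
		return
	}
	fmt.Println("startup refresh done; entering heartbeat loop")
}
```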

… per-org admin lookup + dual-write

Addresses round-2 PR review:

**Reconciler (Codex Security HIGH #1):** previous implementation re-asserted every seed permission on every boot for ADMIN and FIELD_TECH, which silently restored operator-revoked rows the next time fleetd started. The intent of "additive-only" was operator-edits-survive; actual behavior was "seed-set-is-restored." Fixed by splitting reconciliation: GetBuiltinRoleForOrg first; if the row does not exist, INSERT and seed every permission in the spec; if it does exist, SUPER_ADMIN gets full reconcile (assign + prune), ADMIN and FIELD_TECH get nothing. Once a per-org built-in row exists the operator owns it; new catalog permissions only show up on roles the operator opts into editing. Two new regression tests: TestReconcile_OperatorRevokedSensitivePermStaysRevokedOnRestart and TestReconcile_NewCatalogPermissionDoesNotPropagateToExistingBuiltins.

**Composite role FK (Codex Security HIGH #2):** added UNIQUE(id, organization_id) on role and replaced user_organization_role's role_id FK with a composite FK on (role_id, organization_id) → role(id, organization_id). An assignment can no longer reference a role owned by a different organization, which gives DB-enforced tenant isolation. ListEffectivePermissionsForUser, CountOrgScopeSuperAdminsExcludingAssignment, and CountOrgScopeSuperAdminsExcludingUser now also assert r.organization_id = uor.organization_id in their joins as defense in depth.

**Per-org ADMIN lookup (Codex thread P1):** auth.Service.CreateUser was calling GetRoleByName(AdminRoleName) which now returns an arbitrary per-org ADMIN row across orgs. Added UserManagementStore.GetBuiltinRoleForOrg(orgID, builtinKey) and the SQL implementation; CreateUser uses it with the user's actual orgID so the new user is bound to their own org's ADMIN row.

**Dual-write into user_organization_role (Codex Security MEDIUM):** new users created post-migration only landed in legacy user_organization. The resolver, once it goes live, reads from user_organization_role, which means fresh installs would have a founding SUPER_ADMIN with no effective permissions. Both CreateAdminUserWithOrganization and CreateUserOrganizationRole now dual-write: legacy user_organization plus an org-scope row in user_organization_role. New test TestOnboarding_FoundingUserGetsOrgScopeSuperAdminAssignment locks this in.
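The split reconciliation rule from the Reconciler item reduces to a small decision table. The sketch below only models that decision; the `action` type and `reconcileAction` are hypothetical names, and the role keys are taken from the description above.

```go
package main

import "fmt"

// action describes what the reconciler does for one per-org built-in role.
type action int

const (
	actionNone          action = iota // leave operator-owned row untouched
	actionSeedAll                     // insert row and seed every spec permission
	actionFullReconcile               // assign missing and prune extra permissions
)

func (a action) String() string {
	switch a {
	case actionSeedAll:
		return "insert + seed all"
	case actionFullReconcile:
		return "full reconcile (assign + prune)"
	default:
		return "no change"
	}
}

// reconcileAction mirrors the rule in the description: a missing row is
// seeded; an existing SUPER_ADMIN row is fully reconciled; existing ADMIN
// and FIELD_TECH rows are left alone so operator edits survive restarts.
func reconcileAction(builtinKey string, rowExists bool) action {
	if !rowExists {
		return actionSeedAll
	}
	if builtinKey == "SUPER_ADMIN" {
		return actionFullReconcile
	}
	return actionNone
}

func main() {
	for _, key := range []string{"SUPER_ADMIN", "ADMIN", "FIELD_TECH"} {
		fmt.Printf("%-11s new=%v existing=%v\n",
			key, reconcileAction(key, false), reconcileAction(key, true))
	}
}
```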

Addresses three findings from PR #285 review:

1. RacksPage filter-empty state (Codex P2). `hasEverLoaded` only flips when `total > 0`, so deep-linking `/racks?building=<id>` where the building has zero racks (but the org has racks elsewhere) fell through to the "You haven't set up any racks" null state instead of the filtered-empty state. Including `hasActiveFilters` in the `hasRacks` check short-circuits the null state whenever a filter is active.

2. `handleClearFilters` double-fetch (Copilot suppressed). When the building filter was active, clearing all triggered one synchronous resetAndFetch with stale ref values plus a second via the URL-change effect. Snapshot the building-active state up front; skip the explicit resetAndFetch when the URL-change effect will handle the refetch.

3. No FE tests for the URL parse (me + Copilot). Extracted `parseBuildingIdsFromParams` + `BUILDING_URL_PARAM` to a new `rackManagement/utils/buildingFilterUrl.ts` module and added a vitest covering repeated keys, legacy comma-joined values, invalid input, ordering, and large bigint ids.

Follow-up issue #291 tracks the pre-existing `<Link><Button/></Link>` nesting on BuildingPageHeader (PR review finding #2), fixing both View racks and View miners together rather than mixing scope into this PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

#1 (P1) ProcedurePermissions in middleware/rpc_permissions.go did not classify the two new BuildingService RPCs, so the generated handler interface had unclassified procedures and the rpc_permissions_test RBAC-contract assertion would fail in CI. Added:

- BuildingServiceListBuildingRacksProcedure → PermSiteRead (building-scoped read).
- BuildingServiceAssignRackToBuildingProcedure → PermSiteManage (mutates rack building/site/zone/grid).

#2 (P2) SiteSettingsSingleView kept rendering the previous site's buildings until the new listBuildingsBySite call resolved after an active-site switch. With row-click now opening the edit modal, an operator could open a building from the prior site under the new site's header (with the new site name as cascade context). Paired the buildings state with the siteId that produced it (mirrors BuildingPage's response pairing) so stale rows fall back to the Loading… state instead of being clickable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

#1 server-checks (gosec G115). The int32(len(rows)) cast in the new ListBuildingRacks pagination tripped integer-overflow; pageSize is already service-clamped to ListBuildingRacksMaxPageSize (1000), so cast pageSize to int instead and compare in int space.

#2 PR: stale rack state on building switch (P2 Codex). When the host swaps ManageBuildingModal's building.id without first closing (e.g. clicking a second row in the buildings table while the modal is open), the load-effect started a fresh fetch but the prior building's entries + initialPlacementRef + loadError + isLoading were still visible. Save/Manage controls could fire against the wrong snapshot in the load-pending window. Resolved by keying ManageBuildingModal on building.id in BuildingModals so React unmounts/remounts on switch, which avoids the setState-in-effect anti-pattern (the project's ESLint flags those explicitly).

Test fixture: BuildingModals.test.tsx hoisted the useBuildings mock to a module-level object so mock-fn identity stays stable across renders. The previous per-render `vi.fn()` factory triggered infinite re-renders once the modal's load effect started doing any state work.

E2E protofleet-e2e-tests failure was in 00-onboarding setup spec + 01-miningPools, unrelated to buildings; no references to building/aisle/ListBuildingRacks in the failing log. Likely flake; will re-run after this push.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
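The G115 fix in #1 comes down to widening the int32 instead of narrowing the slice length. The sketch below assumes the store fetches up to one extra row to detect a further page, which the commit message does not state; `trimPage` is a hypothetical helper.

```go
package main

import "fmt"

// trimPage compares in int space: int32 -> int is a widening conversion,
// whereas int32(len(rows)) narrows and is what gosec G115 flags.
// Assumption: the caller fetched up to pageSize+1 rows.
func trimPage[T any](rows []T, pageSize int32) (page []T, hasMore bool) {
	limit := int(pageSize)
	if limit < 0 {
		limit = 0 // defensive; the service is described as clamping pageSize
	}
	if len(rows) > limit {
		return rows[:limit], true
	}
	return rows, false
}

func main() {
	page, more := trimPage([]string{"rack-a", "rack-b", "rack-c"}, 2)
	fmt.Println(page, more)
}
```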

#1 (P2 Codex) page_token cap was 256 bytes; encodeBuildingRackCursor emits base64(JSON{label, id}) where the label can be a 100-char multi-byte string. A page boundary on a long label could push the cursor past 256 bytes and the next request would fail validation mid-loop, stranding ManageBuildingModal on a partial seed. Raised the buf.validate max_len to 2048, which covers worst-case label + id + struct overhead with headroom. The token is server-issued, so this cap is defensive, not a contract.

#2 (ankit) third cursor with the same encode/decode shape: extracted shared encodeCursor[T]/decodeCursor[T] generic helpers and kept the named call-site wrappers (encodeCollectionCursor etc.) so existing store paths read clearly at a glance. The generic carries an errLabel so encode-failure logs still identify the offending list.

#3 (ankit) ClearRackPlacementForSoftDelete was outside the `if collType == COLLECTION_TYPE_RACK` block in DeleteCollection, so the rack-only cleanup fired for groups too. Moved it inside the rack branch; non-rack types now skip it as intended. service_test gomock InOrder updated for the new ordering (clear now lands immediately after the unassign-site cascade, before the generic RemoveAllDevicesFromCollection step). The two group-typed delete tests no longer expect the call.

#4 (ankit) ListRacksOutsideBuildingBounds did not LIMIT; the caller only reads len > 0 and orphans[0] for the error. Added LIMIT 1 to keep the scan cheap on large buildings; result shape stays a slice so the existing service code is unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
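A generic base64(JSON) cursor pair of the kind described in #2 could look like the following. Only the names encodeCursor, decodeCursor, and errLabel come from the commit message; the signatures, the `buildingRackCursor` field layout, and the choice of URL-safe base64 are assumptions. The demo also shows why a 256-byte cap is too small for a 100-character multi-byte label.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// buildingRackCursor is a hypothetical cursor payload: last label + id.
type buildingRackCursor struct {
	Label string `json:"label"`
	ID    int64  `json:"id"`
}

// encodeCursor serializes any cursor value as base64(JSON). errLabel
// names the list in error messages so failures stay attributable.
func encodeCursor[T any](v T, errLabel string) (string, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return "", fmt.Errorf("%s: encode cursor: %w", errLabel, err)
	}
	return base64.URLEncoding.EncodeToString(raw), nil
}

// decodeCursor reverses encodeCursor for the requested cursor type.
func decodeCursor[T any](token, errLabel string) (T, error) {
	var v T
	raw, err := base64.URLEncoding.DecodeString(token)
	if err != nil {
		return v, fmt.Errorf("%s: decode cursor: %w", errLabel, err)
	}
	if err := json.Unmarshal(raw, &v); err != nil {
		return v, fmt.Errorf("%s: decode cursor: %w", errLabel, err)
	}
	return v, nil
}

func main() {
	// 100 three-byte characters: 300 bytes of label before base64 overhead.
	label := strings.Repeat("日", 100)
	token, err := encodeCursor(buildingRackCursor{Label: label, ID: 42}, "building racks")
	if err != nil {
		panic(err)
	}
	fmt.Println("token length:", len(token), "exceeds 256:", len(token) > 256)

	back, err := decodeCursor[buildingRackCursor](token, "building racks")
	fmt.Println("round trip ok:", err == nil && back.ID == 42 && back.Label == label)
}
```

Named wrappers such as encodeCollectionCursor would then be one-line calls into these helpers with a fixed type and label.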

SEC-HIGH: layer permission checks on stats RPCs so site:read alone can't fetch fleet-wide telemetry or miner inventory.

- GetSiteStats now requires site:read (with SiteID narrowing) + fleet:read.
- GetBuildingStats now requires site:read + fleet:read + miner:read; the miner:read gate matches the surface exposed by device_identifiers.

SEC-MED: cap GetBuildingStats device_identifiers at 50k (MaxDevicesPerStatsResponse) to bound response payload + FE memory; mirrors the existing rack cap pattern.

R4-#1: BuildingCard rack grid no longer collapses at high column counts: floor the cell size with MIN_CELL_PX so calc() can't resolve negative when columns × gap > container width.

R4-#2: BuildingCard footer rack count reads stats.rackCount when available so /sites cards stay in sync after rack assignment changes.

R4-#3: thread per-field reporting counts through devicerollup + the stats RPCs so the FE can render the dash when efficiency / hashrate / power is missing on otherwise-reporting devices (avoids misleading partial averages).

R4-#4: SitesPage now polls ListSites + ListBuildings on the shared POLL_INTERVAL_MS so building cards reflect adds/removes without manual refresh.
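The layered checks in the SEC-HIGH item amount to requiring every listed permission, not just the first. A minimal sketch, assuming a simple map-based lookup (`requiredPerms` and `allowed` are hypothetical, and the SiteID narrowing is not modelled):

```go
package main

import "fmt"

// requiredPerms lists the permissions each stats RPC needs, as described
// in the commit message. The map itself is an illustrative structure.
var requiredPerms = map[string][]string{
	"GetSiteStats":     {"site:read", "fleet:read"},
	"GetBuildingStats": {"site:read", "fleet:read", "miner:read"},
}

// allowed reports whether the caller holds every permission the RPC
// requires. Unknown RPCs are denied rather than silently allowed.
func allowed(rpc string, granted map[string]bool) bool {
	required, ok := requiredPerms[rpc]
	if !ok {
		return false
	}
	for _, p := range required {
		if !granted[p] {
			return false
		}
	}
	return true
}

func main() {
	siteOnly := map[string]bool{"site:read": true}
	full := map[string]bool{"site:read": true, "fleet:read": true, "miner:read": true}

	fmt.Println("site:read only, GetSiteStats:", allowed("GetSiteStats", siteOnly))
	fmt.Println("all three, GetBuildingStats:", allowed("GetBuildingStats", full))
}
```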

Six unresolved Codex threads + three security review findings on the FE PR. Stale BE-handler thread resolved separately (now on main).

Sec MED: Per-card stats polling flood

Every BuildingCard previously ran its own GetBuildingStats poll, so the All-Sites view fanned out one RPC per card per tick. Add a viewport gate: cards observe their intersection via a new `useInViewport` hook and only poll while visible. Offscreen cards keep their last-good snapshot (the hook no longer resets state on enabled-toggle), so re-revealing a card doesn't flash a skeleton.

Sec LOW: Sites polling drops request lifecycle (= inline thread #5)

fetchSites / fetchBuildings now return the listSites / listAllBuildings promise so usePoll can await them. Without this the scheduler measured intervals from request start, allowing slow responses to overlap and late stale responses to clobber newer state.

Inline #2: Preserve last-good lists during polling errors

Track `sitesLoadedRef` / `buildingsLoadedRef`. Initial-load failures still take the empty-state / retry path. Once we've shown real data, poll errors keep the last-good list rendered and surface inline retry banners instead of blanking the screen.

Inline #3: Surface GetBuildingStats failures on building page

Plumb `error` + `hasLoaded` + `refetch` from useBuildingStats into BuildingPage and render an inline retry banner when stats fails on initial load; previously the metrics row, diagnostics, and performance section sat indefinitely in skeleton state.

Inline #4: Keep site names visible in overview sections

Render the site name as an h2 above the metric row. Location remains a tile for at-a-glance scanning, but operators can now distinguish two sites that share blank or duplicate locations.

Inline #6: Preserve cascade warning until rack count loads

Disable the BuildingPage Edit affordance until ListBuildingRacks resolves (or fails). Previously clicking Edit before that completed opened the manage/delete dialog with rackCount: 0n, hiding the cascade warning for buildings that actually had racks.

Inline #7 / Sec MED: "No issues" before scope known

Add `enabled` flag to useComponentErrors. While the building's device scope is still loading (memberDeviceIds === null), pass enabled=false so the hook holds undefined counts and the FleetErrors tiles render skeletons rather than racing to "No issues" against an empty-scope fetch.

cutover files from source repo

3e7ed53

Migrate ProtoFleet codebase from the source repository, including GitHub workflows, issue templates, documentation, hermit-managed tooling, and the full project source tree.

ankitgoswami previously approved these changes Apr 20, 2026

negarn previously approved these changes Apr 20, 2026

add default code owners

1d2ac08

mcharles-square dismissed stale reviews from negarn and ankitgoswami via 1d2ac08 April 20, 2026 18:05

mcharles-square and others added 2 commits April 21, 2026 12:21

use default GITHUB_TOKEN for generated-code-check

4295a1f

The workflow only reads and runs git diff — it doesn't push back, so it doesn't need the elevated GitHub App token. This avoids requiring GH_APP_ID/GH_APP_PRIVATE_KEY secrets on the repo.

add per-device feature support matrix to README (#3)

68d9ffe

mcharles-square requested review from ankitgoswami and negarn April 21, 2026 17:21

mcharles-square changed the title ~~cutover: migrate ProtoFleet source from upstream repo~~ chore: cutover Proto Fleet source from upstream repo Apr 21, 2026

ankitgoswami approved these changes Apr 21, 2026

mcharles-square merged commit 2712d0d into main Apr 21, 2026
94 of 106 checks passed

flesher mentioned this pull request May 2, 2026

feat(filters): bifurcated chips + Add Filter trigger (#137) #147

Merged

4 tasks

rongxin-liu mentioned this pull request May 6, 2026

feat(curtailment): persistence + preview selector #185

Closed

6 tasks

ankitgoswami mentioned this pull request May 12, 2026

docs(rfc): expand RFC 0001 to cover operational concerns #222

Merged

ankitgoswami mentioned this pull request May 13, 2026

feat(fleetnode): long-running daemon (heartbeat-first) #227

Merged

6 tasks

rongxin-liu mentioned this pull request May 15, 2026

Harden Dependabot config: cooldown + Go/Rust ecosystems #240

Merged

flesher mentioned this pull request May 20, 2026

Multi-site UX — design pass for Unassigned selection on /sites and /buildings/:id #273

Open

rongxin-liu mentioned this pull request May 21, 2026

feat(curtailment): Stop, staggered restore, max-duration enforcement #232

Merged

mcharles-square mentioned this pull request May 26, 2026

feat(authz): swap existing-gate handlers to RequirePermission (PR 2c) #305

Merged

8 tasks

mcharles-square merged 4 commits into main from cutover


3 participants
