chore: cutover Proto Fleet source from upstream repo#2

Merged
mcharles-square merged 4 commits into main from cutover on Apr 21, 2026

Conversation

mcharles-square commented Apr 20, 2026

Collaborator

Summary

  • Migrates the full ProtoFleet codebase into this repository as the initial cutover from the source repo
  • Adds GitHub workflows, issue templates, hermit-managed tooling, and project source (2506 files changed)

Test plan

  • Verify CI workflows trigger correctly on this PR
  • Confirm hermit-managed binaries resolve
  • Spot-check directory structure matches the source repo

Migrate ProtoFleet codebase from the source repository, including GitHub
workflows, issue templates, documentation, hermit-managed tooling, and
the full project source tree.
ankitgoswami previously approved these changes Apr 20, 2026
negarn previously approved these changes Apr 20, 2026
mcharles-square dismissed stale reviews from negarn and ankitgoswami via 1d2ac08 April 20, 2026 18:05
mcharles-square and others added 2 commits April 21, 2026 12:21
The workflow only reads and runs git diff — it doesn't push back, so it
doesn't need the elevated GitHub App token. This avoids requiring
GH_APP_ID/GH_APP_PRIVATE_KEY secrets on the repo.
mcharles-square changed the title from "cutover: migrate ProtoFleet source from upstream repo" to "chore: cutover Proto Fleet source from upstream repo" Apr 21, 2026
mcharles-square merged commit 2712d0d into main Apr 21, 2026
94 of 106 checks passed
ankitgoswami added a commit that referenced this pull request May 6, 2026
… URL guard

Address three follow-up review findings on the fleet-agent CLI.

#1 Persist state immediately after Register, before the api_key paste
prompt. A Ctrl-C or terminal disconnect during the prompt previously
discarded the freshly-generated identity + miner-signing private keys
while the server held an orphaned agent row that could not be
re-registered (unique-name conflict). State now lands as soon as
Register returns; recovery is a `fleet-agent refresh --api-key=<paste>`
away.

#2 Narrow the refresh credential wipe to api_key rejection only.
Define an errAPIKeyRejected sentinel returned only when
BeginAuthHandshake reports Unauthenticated. CompleteAuthHandshake's
Unauthenticated cases (expired challenge, signature mismatch) no
longer wipe local credentials, so transient or signature-side bugs
cannot turn a still-valid api_key into a forced re-enrollment.

#3 Re-validate the server URL on refresh. Persist
allow_insecure_transport on State so refresh enforces the same
https-or-loopback rule that enroll did. Without this, a state.yaml
written via --allow-insecure-transport would have refresh silently
keep using cleartext for the long-lived api_key on every renewal.
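
The https-or-loopback rule from #3 can be sketched roughly as below. This is a minimal illustration, not the CLI's actual code: `validateServerURL`, `isLoopback`, and the example hosts are hypothetical names, and the real guard may check more than this.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// validateServerURL is a hypothetical stand-in for the CLI's URL guard.
// It accepts https, accepts plain http for loopback hosts, and accepts
// plain http for other hosts only when allowInsecure is set.
func validateServerURL(raw string, allowInsecure bool) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse server URL: %w", err)
	}
	if u.Host == "" {
		return errors.New("server URL has no host")
	}
	switch u.Scheme {
	case "https":
		return nil
	case "http":
		if allowInsecure || isLoopback(u.Hostname()) {
			return nil
		}
		return errors.New("plain http is only allowed for loopback hosts")
	default:
		return fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
}

// isLoopback reports whether host is "localhost" or a loopback IP literal.
func isLoopback(host string) bool {
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

func main() {
	for _, raw := range []string{
		"https://fleet.example.com",
		"http://127.0.0.1:8080",
		"http://fleet.example.com",
	} {
		fmt.Printf("%s -> %v\n", raw, validateServerURL(raw, false))
	}
}
```

Persisting the insecure-transport flag alongside the URL means the same function can be called with the stored flag on refresh, so both paths enforce one rule.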

refresh now also accepts --api-key (and FLEET_AGENT_API_KEY) so it can
complete a partial enrollment recovered from an interrupted enroll.

Tests cover: state persistence right after Register (no api_key,
incomplete state file), refresh happy path, refresh recovering a
partial enroll via flag, refresh requiring an api_key, refresh wiping
on rejected api_key, refresh preserving state on a signature failure,
refresh rejecting plain http when AllowInsecureTransport is unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 7, 2026
…concurrent runs

#1 refresh now keeps the override candidate separate from the stored
api_key during validation. A rejected --api-key override only surfaces
an error and leaves the stored credentials on disk; the wipe path
fires only when the stored key itself is what the server rejected.
A typo in --api-key no longer takes a healthy agent offline.

#2 saveState fsyncs the temp file before close and fsyncs the
containing directory after rename. A power loss or kernel crash
between rename and disk flush can no longer roll back the rename or
leave a torn write. The crash-safety claim in the comment is now
honored end-to-end.

#3 enroll and refresh wrap their full load/handshake/save sequences in
an exclusive flock on <state-dir>/state.lock. Two concurrent refresh
invocations (timer overlap, scripted retry, double-click) now
serialize, so a slower writer cannot overwrite a newer session_token
the server has already replaced. Lock + dir-fsync are linux/darwin
only; non-Unix targets get no-op stubs and a comment noting the gap.
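
The fsync ordering described in #2 above might look like the following sketch. It assumes a Unix-like filesystem (directory fsync is not portable), and the `saveState` signature, temp-file pattern, and `demo` helper are illustrative rather than taken from the repository.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// saveState writes data to <dir>/state.yaml atomically: write a temp file,
// fsync it, rename it into place, then fsync the directory so the rename
// itself survives a crash.
func saveState(dir string, data []byte) error {
	tmp, err := os.CreateTemp(dir, "state-*.tmp")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // fails harmlessly once the rename has succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush file contents before close
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmpName, filepath.Join(dir, "state.yaml")); err != nil {
		return err
	}
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync() // flush the directory entry created by the rename
}

// demo saves a payload into a throwaway directory and reads it back.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "fleet-agent-state")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	if err := saveState(dir, []byte("agent_id: demo\n")); err != nil {
		return "", err
	}
	got, err := os.ReadFile(filepath.Join(dir, "state.yaml"))
	return string(got), err
}

func main() {
	got, err := demo()
	fmt.Printf("%q %v\n", got, err)
}
```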

Tests: stored-key survives a rejected --api-key override; 5 concurrent
refreshes serialize and converge to the same final state. Existing
wipe-on-rejected-stored-key + preserve-on-signature-failure scenarios
keep working.
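
A minimal sketch of how an exclusive flock can serialize concurrent runs, in the spirit of the concurrent-refresh test described above. It is Unix-only (`syscall.Flock`), and `withStateLock`, `bump`, and the counter file are hypothetical stand-ins for the real load/handshake/save sequence.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"sync"
	"syscall"
)

// withStateLock runs fn while holding an exclusive flock on <dir>/state.lock.
func withStateLock(dir string, fn func() error) error {
	f, err := os.OpenFile(filepath.Join(dir, "state.lock"), os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close() // closing the descriptor releases the lock
	for {
		err = syscall.Flock(int(f.Fd()), syscall.LOCK_EX)
		if err != syscall.EINTR { // retry only when interrupted by a signal
			break
		}
	}
	if err != nil {
		return err
	}
	return fn()
}

// bump is an unsynchronized read-modify-write of <dir>/counter; without the
// lock, concurrent callers could lose updates.
func bump(dir string) error {
	path := filepath.Join(dir, "counter")
	n := 0
	if b, err := os.ReadFile(path); err == nil {
		if n, err = strconv.Atoi(string(b)); err != nil {
			return err
		}
	} else if !os.IsNotExist(err) {
		return err
	}
	return os.WriteFile(path, []byte(strconv.Itoa(n+1)), 0o600)
}

// demo runs the given number of concurrent workers and returns the final
// counter value.
func demo(workers int) (int, error) {
	dir, err := os.MkdirTemp("", "fleet-agent-lock")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	errs := make([]error, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			errs[i] = withStateLock(dir, func() error { return bump(dir) })
		}(i)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return 0, err
		}
	}
	b, err := os.ReadFile(filepath.Join(dir, "counter"))
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(string(b))
}

func main() {
	fmt.Println(demo(5))
}
```

Each worker opens its own descriptor, so the lock conflicts even between goroutines of one process; that mirrors separate CLI invocations contending for the same state directory.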

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 7, 2026
…ambiguous auth

#1 Connect-RPC client rejects every 30x response. Default http.Client
follows 307/308 redirects, replaying the POST body (enrollment_token,
api_key, signature, challenge response) to the redirect target. A
downgrade redirect from https to http would defeat the non-loopback
https requirement and leak the api_key over cleartext. Connect does
not use redirects, so refusing all 30x is the cleanest fix.

#2 Interactive api_key prompt uses term.ReadPassword when stdin is a
TTY. The pasted secret no longer ends up in scrollback, tmux logs,
session recordings, or screenshots. Piped input keeps the scanner
path so scripted enrollment is unaffected. Adds golang.org/x/term as
a direct dependency.

#3 refresh no longer wipes local credentials on Unauthenticated. The
server returns Unauthenticated for several distinct cases (api_key
revocation, identity_pubkey mismatch, expired challenge, bad
signature) and refresh cannot tell them apart. The previous wipe
turned a recoverable identity-mismatch or transient proxy error into
a forced re-enrollment. State is preserved on every failure mode now;
the operator decides whether to re-enroll based on the surfaced error.
A rejected --api-key override still does not mutate disk and surfaces
a clear "retry without --api-key" message.

Tests: redirect rejected for all 5 30x codes, http.Client carries the
30s timeout, refresh preserves state when the stored api_key is
rejected, override still preserves stored creds when typo'd.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
flesher added a commit that referenced this pull request May 8, 2026
Triage of the parallel adversarial review run on PR #206. Findings
applied as code:

[P1 #1] activity_log composite-FK MATCH SIMPLE bypass on NULL
organization_id closed via CHECK constraint:
ck_activity_log_site_requires_org enforces site_id IS NULL OR
organization_id IS NOT NULL. Verified: insert with site_id + NULL
org rejected; insert with NULL site_id + NULL org accepted (system
events unchanged).

[P1 #2] command_on_device_log org_id backfill: pre-flight DO block
counts orphan rows (codl rows whose device row is missing) and
RAISES with a clear message before SET NOT NULL. A clean abort beats
SET NOT NULL failing mid-migration with the dirty flag set.

[P2 #5] InsertError CTE rewrite: documented the contract change
(missing device $3 yields sql.ErrNoRows instead of FK violation).
Existing caller wraps generically so surface unchanged.

[P2 #6] Plan doc trimmed: power-contract column list marked DEFERRED
in entity description and Phase 1 migration bullet. Future readers
won't write service code against columns that don't exist.

[P2 #7] ListSites count subqueries: added org_id predicate to
device and building scans so they hit idx_device_org_site /
idx_building_org_deleted instead of full-table scan in multi-tenant
prod.

[P2 #8] InsertDeviceMetrics sub-select dropped AND deleted_at IS NULL
to match InsertError / InsertMinerStateSnapshot. Telemetry from a
soft-deleted device is still legitimate per-site history; three
writers, one behavior.

[P3 #10] building.default_rack_order_index: added
ck_building_default_rack_order_index CHECK (BETWEEN 0 AND 4) to
match sibling CHECKs.

[P3 #11] fk_device_set_rack_device_set_org: added ON DELETE CASCADE
to match the single-column FK on device_set_id and the building FK
on the same row. Composite adds the org-matching invariant without
changing cascade semantics.

Findings deferred:
- #3 (non-CONCURRENTLY indexes inside tx): deploy-time concern,
  document in PR/deploy notes; restructuring migrations to use
  CONCURRENTLY is a separate effort.
- #4 (no integration tests): defer to Phase 1B service layer where
  end-to-end flows exercise the invariants.
- #9 (device.uq_device_id_org_id missing): already exists (verified
  on live DB); finding is incorrect.
- #12 (CREATE TABLE IF NOT EXISTS): IF NOT EXISTS hides genuine
  schema errors; force-clean is the right golang-migrate pattern.

Round-trip clean, build clean, lint clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 13, 2026
Five fixes from Codex + Copilot reviews:

1. run.go: ValidateServerURL on every entry, not just the refresh
   path. A tampered or stale state with a fresh session_token would
   have let bearer heartbeats hit a plaintext non-loopback URL.
   Added TestRunCmd_ValidatesServerURLBeforeBuildingClient. (Codex
   MEDIUM; Copilot inline.)

2. run.go: Move signal.NotifyContext above the initial refresh so
   SIGINT can interrupt a startup refresh against a slow/unreachable
   gateway. Previously the initial refreshAndSave used
   context.Background() and could hang up to the 30s HTTP timeout
   before shutting down. (Codex inline #1, Copilot #3.)

3. run.go: Bump the "heartbeat sent" log line from Debug to Info.
   The default slog level is Info, so the daemon was silent under
   normal operation; a stuck daemon would be indistinguishable from
   a working one. (Copilot #4.)

4. authclient.go: Streaming wrapper now matches unary on empty
   token. Previously WrapStreamingClient silently omitted the
   Authorization header when tokenSource returned empty, letting an
   unauthenticated stream open and fail later in harder-to-debug
   ways. Now returns a failingStreamingClientConn whose Send/Receive
   surface Unauthenticated immediately, matching the unary path's
   fail-fast contract. (Copilot #5; affects future ControlStream
   work, not currently exercised.)

5. run_test.go: ed25519.GenerateKey(nil) -> rand.Reader explicitly.
   The stdlib actually handles nil by falling back to crypto/rand,
   so this isn't a real panic risk, but explicit is clearer and
   matches the integration tests' style. (Copilot #2.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mcharles-square added a commit that referenced this pull request May 21, 2026
… per-org admin lookup + dual-write

Addresses round-2 PR review:

**Reconciler (Codex Security HIGH #1):** previous implementation
re-asserted every seed permission on every boot for ADMIN and
FIELD_TECH, which silently restored operator-revoked rows the next
time fleetd started. The intent of "additive-only" was operator-edits-
survive — actual behavior was "seed-set-is-restored." Fixed by
splitting reconciliation: GetBuiltinRoleForOrg first; if the row does
not exist, INSERT and seed every permission in the spec; if it does
exist, SUPER_ADMIN gets full reconcile (assign + prune), ADMIN and
FIELD_TECH get nothing. Once a per-org built-in row exists the
operator owns it; new catalog permissions only show up on roles the
operator opts into editing. Two new regression tests:
TestReconcile_OperatorRevokedSensitivePermStaysRevokedOnRestart and
TestReconcile_NewCatalogPermissionDoesNotPropagateToExistingBuiltins.

**Composite role FK (Codex Security HIGH #2):** added UNIQUE(id,
organization_id) on role and replaced user_organization_role's
role_id FK with a composite FK on (role_id, organization_id) →
role(id, organization_id). An assignment can no longer reference a
role owned by a different organization — DB-enforced tenant
isolation. ListEffectivePermissionsForUser,
CountOrgScopeSuperAdminsExcludingAssignment, and
CountOrgScopeSuperAdminsExcludingUser now also assert
r.organization_id = uor.organization_id in their joins as defense in
depth.

**Per-org ADMIN lookup (Codex thread P1):** auth.Service.CreateUser
was calling GetRoleByName(AdminRoleName) which now returns an
arbitrary per-org ADMIN row across orgs. Added
UserManagementStore.GetBuiltinRoleForOrg(orgID, builtinKey) and the
SQL implementation; CreateUser uses it with the user's actual orgID
so the new user is bound to their own org's ADMIN row.

**Dual-write into user_organization_role (Codex Security MEDIUM):**
new users created post-migration only landed in legacy
user_organization. The resolver, once it goes live, reads from
user_organization_role, which means fresh installs would have a
founding SUPER_ADMIN with no effective permissions. Both
CreateAdminUserWithOrganization and CreateUserOrganizationRole now
dual-write: legacy user_organization plus an org-scope row in
user_organization_role. New test
TestOnboarding_FoundingUserGetsOrgScopeSuperAdminAssignment locks
this in.
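
The reconciler split described above (seed on first creation, full reconcile for SUPER_ADMIN only, hands off for ADMIN and FIELD_TECH) can be illustrated with an in-memory sketch. The `store` type, `reconcile` signature, and permission strings are assumptions for illustration; the real implementation works against the database.

```go
package main

import "fmt"

type roleKey struct{ org, builtin string }

// store is an in-memory stand-in for the role tables: a permission set per
// (organization, built-in role).
type store map[roleKey]map[string]bool

func toSet(perms []string) map[string]bool {
	set := make(map[string]bool, len(perms))
	for _, p := range perms {
		set[p] = true
	}
	return set
}

// reconcile applies the split for one built-in role in one organization.
func reconcile(s store, org, builtin string, seed []string) {
	key := roleKey{org, builtin}
	_, exists := s[key]
	switch {
	case !exists:
		// First boot for this org: create the role and seed every permission.
		s[key] = toSet(seed)
	case builtin == "SUPER_ADMIN":
		// Full reconcile: assign missing seed permissions and prune extras.
		s[key] = toSet(seed)
	default:
		// ADMIN / FIELD_TECH: the operator owns the row, so leave it alone.
	}
}

func main() {
	s := store{}
	seed := []string{"site:read", "miner:read"}
	admin := roleKey{"org-1", "ADMIN"}

	reconcile(s, "org-1", "ADMIN", seed)
	delete(s[admin], "miner:read") // operator revokes a permission
	reconcile(s, "org-1", "ADMIN", append(seed, "fleet:read"))

	fmt.Println(s[admin]["miner:read"], s[admin]["fleet:read"])
}
```

After the second call the revoked permission stays revoked and the newly added catalog permission does not appear on the existing ADMIN row, which is the behavior the two regression tests named above are meant to lock in.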
flesher added a commit that referenced this pull request May 21, 2026
Addresses three findings from PR #285 review:

1. RacksPage filter-empty state (Codex P2). `hasEverLoaded` only flips when
   `total > 0`, so deep-linking `/racks?building=<id>` where the building has
   zero racks (but the org has racks elsewhere) fell through to the
   "You haven't set up any racks" null state instead of the filtered-empty
   state. Including `hasActiveFilters` in the `hasRacks` check short-circuits
   the null state whenever a filter is active.

2. `handleClearFilters` double-fetch (Copilot suppressed). When the building
   filter was active, clearing all triggered one synchronous resetAndFetch
   with stale ref values plus a second via the URL-change effect. Snapshot
   the building-active state up front; skip the explicit resetAndFetch
   when the URL-change effect will handle the refetch.

3. No FE tests for the URL parse (me + Copilot). Extracted
   `parseBuildingIdsFromParams` + `BUILDING_URL_PARAM` to a new
   `rackManagement/utils/buildingFilterUrl.ts` module and added a vitest
   covering repeated keys, legacy comma-joined values, invalid input,
   ordering, and large bigint ids.

Follow-up issue #291 tracks the pre-existing `<Link><Button/></Link>`
nesting on BuildingPageHeader (PR review finding #2) — fixing both View
racks and View miners together rather than mixing scope into this PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
flesher added a commit that referenced this pull request May 27, 2026
#1 (P1) ProcedurePermissions in middleware/rpc_permissions.go did not
classify the two new BuildingService RPCs, so the generated handler
interface had unclassified procedures and the rpc_permissions_test
RBAC-contract assertion would fail in CI. Added:

- BuildingServiceListBuildingRacksProcedure → PermSiteRead (building-
  scoped read).
- BuildingServiceAssignRackToBuildingProcedure → PermSiteManage
  (mutates rack building/site/zone/grid).

#2 (P2) SiteSettingsSingleView kept rendering the previous site's
buildings until the new listBuildingsBySite call resolved after an
active-site switch. With row-click now opening the edit modal, an
operator could open a building from the prior site under the new
site's header (with the new site name as cascade context). Paired
the buildings state with the siteId that produced it (mirrors
BuildingPage's response pairing) so stale rows fall back to the
Loading… state instead of being clickable.
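
The contract behind #1 is that every procedure must have an entry in the permission map. A small sketch of such a check follows; the procedure path strings, the permission string values, and the `unclassified` helper are hypothetical, since the real names come from generated code.

```go
package main

import "fmt"

type permission string

const (
	permSiteRead   permission = "site:read"
	permSiteManage permission = "site:manage" // assumed string value
)

// allProcedures is illustrative; real procedure names are generated.
var allProcedures = []string{
	"/BuildingService/ListBuildingRacks",
	"/BuildingService/AssignRackToBuilding",
}

var procedurePermissions = map[string]permission{
	"/BuildingService/ListBuildingRacks":    permSiteRead,
	"/BuildingService/AssignRackToBuilding": permSiteManage,
}

// unclassified returns every procedure that has no permission entry.
func unclassified(procs []string, perms map[string]permission) []string {
	var missing []string
	for _, p := range procs {
		if _, ok := perms[p]; !ok {
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	fmt.Println(unclassified(allProcedures, procedurePermissions))
}
```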

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
flesher added a commit that referenced this pull request May 27, 2026
#1 server-checks (gosec G115). The int32(len(rows)) cast in the new
ListBuildingRacks pagination tripped the integer-overflow check. pageSize
is already service-clamped to ListBuildingRacksMaxPageSize (1000), so the
fix casts pageSize to int instead and compares in int space.

#2 PR — stale rack state on building switch (P2 Codex). When the
host swaps ManageBuildingModal's building.id without first closing
(e.g. clicking a second row in the buildings table while the modal
is open), the load-effect started a fresh fetch but the prior
building's entries + initialPlacementRef + loadError + isLoading
were still visible. Save/Manage controls could fire against the
wrong snapshot in the load-pending window. Resolved by keying
ManageBuildingModal on building.id in BuildingModals so React
unmounts/remounts on switch — avoids the setState-in-effect
anti-pattern (the project's ESLint flags those explicitly).
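
For the gosec G115 fix in #1 above, the idea of comparing in int space can be sketched as follows. `hasMore` and the over-fetch-by-one convention are assumptions for illustration; only the 1000 cap comes from the commit message.

```go
package main

import "fmt"

// maxPageSize mirrors the 1000 cap quoted for ListBuildingRacksMaxPageSize.
const maxPageSize int32 = 1000

// hasMore reports whether the store returned more rows than the page size.
// Widening pageSize to int avoids the narrowing int32(len(rows)) conversion
// that gosec G115 flags. Assumes the query over-fetches by one row.
func hasMore(rows []string, pageSize int32) bool {
	if pageSize <= 0 || pageSize > maxPageSize {
		pageSize = maxPageSize
	}
	return len(rows) > int(pageSize)
}

func main() {
	fmt.Println(hasMore(make([]string, 3), 2), hasMore(make([]string, 2), 2))
}
```

Widening int32 to int is always lossless on the platforms Go supports, whereas narrowing len(rows) to int32 can overflow in principle, which is why the comparison direction matters.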

Test fixture: BuildingModals.test.tsx hoisted the useBuildings mock
to a module-level object so mock-fn identity stays stable across
renders. The previous per-render `vi.fn()` factory triggered
infinite re-renders once the modal's load effect started doing any
state work.

E2E protofleet-e2e-tests failure was in 00-onboarding setup spec +
01-miningPools — unrelated to buildings; no references to building/
aisle/ListBuildingRacks in the failing log. Likely flake; will
re-run after this push.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
flesher added a commit that referenced this pull request May 27, 2026
#1 (P2 Codex) page_token cap was 256 bytes; encodeBuildingRackCursor
emits base64(JSON{label, id}) where the label can be a 100-char
multi-byte string. A page boundary on a long label could push the
cursor past 256 bytes and the next request would fail validation
mid-loop, stranding ManageBuildingModal on a partial seed. Raised
the buf.validate max_len to 2048 — covers worst-case label + id +
struct overhead with headroom. The token is server-issued, so this
cap is defensive, not a contract.

#2 (ankit) third cursor with the same encode/decode shape — extracted
shared encodeCursor[T]/decodeCursor[T] generic helpers and kept the
named call-site wrappers (encodeCollectionCursor etc.) so existing
store paths read clearly at a glance. The generic carries an
errLabel so encode-failure logs still identify the offending list.

#3 (ankit) ClearRackPlacementForSoftDelete was outside the
`if collType == COLLECTION_TYPE_RACK` block in DeleteCollection, so
the rack-only cleanup fired for groups too. Moved it inside the
rack branch — non-rack types now skip it as intended. service_test
gomock InOrder updated for the new ordering (clear now lands
immediately after the unassign-site cascade, before the generic
RemoveAllDevicesFromCollection step). The two group-typed delete
tests no longer expect the call.

#4 (ankit) ListRacksOutsideBuildingBounds did not LIMIT — caller
only reads len > 0 and orphans[0] for the error. Added LIMIT 1 to
keep the scan cheap on large buildings; result shape stays a slice
so the existing service code is unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
flesher added a commit that referenced this pull request May 29, 2026
SEC-HIGH: layer permission checks on stats RPCs so site:read alone can't
fetch fleet-wide telemetry or miner inventory.
  - GetSiteStats now requires site:read (with SiteID narrowing) + fleet:read.
  - GetBuildingStats now requires site:read + fleet:read + miner:read; the
    miner:read gate matches the surface exposed by device_identifiers.

SEC-MED: cap GetBuildingStats device_identifiers at 50k (MaxDevicesPerStatsResponse)
to bound response payload + FE memory; mirrors the existing rack cap pattern.

R4-#1: BuildingCard rack grid no longer collapses at high column counts —
floor the cell size with MIN_CELL_PX so calc() can't resolve negative when
columns × gap > container width.

R4-#2: BuildingCard footer rack count reads stats.rackCount when available
so /sites cards stay in sync after rack assignment changes.

R4-#3: thread per-field reporting counts through devicerollup + the stats
RPCs so the FE can render the dash when efficiency / hashrate / power is
missing on otherwise-reporting devices (avoids misleading partial averages).

R4-#4: SitesPage now polls ListSites + ListBuildings on the shared
POLL_INTERVAL_MS so building cards reflect adds/removes without manual
refresh.
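
The per-field reporting counts from R4-#3 can be illustrated with a small rollup sketch. The types, field names, and the all-or-dash display rule are assumptions; the commit only says the counts let the FE render a dash when a field is missing on otherwise-reporting devices.

```go
package main

import "fmt"

// sample is one device's telemetry; a nil field means "not reported".
type sample struct {
	HashrateTHs *float64
	PowerW      *float64
}

// fieldRollup tracks a sum together with how many devices reported the field.
type fieldRollup struct {
	Sum       float64
	Reporting int
}

func (r *fieldRollup) add(v *float64) {
	if v == nil {
		return
	}
	r.Sum += *v
	r.Reporting++
}

// display returns the average when every device in scope reported the field
// and "-" otherwise, so a partial average is never shown as if complete.
func (r *fieldRollup) display(devices int) string {
	if devices == 0 || r.Reporting < devices {
		return "-"
	}
	return fmt.Sprintf("%.1f", r.Sum/float64(r.Reporting))
}

// rollup aggregates each field independently across all samples.
func rollup(samples []sample) (hashrate, power fieldRollup) {
	for _, s := range samples {
		hashrate.add(s.HashrateTHs)
		power.add(s.PowerW)
	}
	return hashrate, power
}

// f returns a pointer to v, for building samples concisely.
func f(v float64) *float64 { return &v }

func main() {
	samples := []sample{{f(90), f(3000)}, {f(110), nil}}
	h, p := rollup(samples)
	fmt.Println(h.display(len(samples)), p.display(len(samples)))
}
```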
flesher added a commit that referenced this pull request Jun 1, 2026
Six unresolved Codex threads + three security review findings on the
FE PR. Stale BE-handler thread resolved separately (now on main).

Sec MED — Per-card stats polling flood
  Every BuildingCard previously ran its own GetBuildingStats poll, so
  the All-Sites view fanned out one RPC per card per tick. Add a
  viewport gate: cards observe their intersection via a new
  `useInViewport` hook and only poll while visible. Offscreen cards
  keep their last-good snapshot (the hook no longer resets state on
  enabled-toggle), so re-revealing a card doesn't flash a skeleton.

Sec LOW — Sites polling drops request lifecycle (= inline thread #5)
  fetchSites / fetchBuildings now return the listSites / listAllBuildings
  promise so usePoll can await them. Without this the scheduler
  measured intervals from request start, allowing slow responses to
  overlap and late stale responses to clobber newer state.

Inline #2 — Preserve last-good lists during polling errors
  Track `sitesLoadedRef` / `buildingsLoadedRef`. Initial-load failures
  still take the empty-state / retry path. Once we've shown real data,
  poll errors keep the last-good list rendered and surface inline
  retry banners instead of blanking the screen.

Inline #3 — Surface GetBuildingStats failures on building page
  Plumb `error` + `hasLoaded` + `refetch` from useBuildingStats into
  BuildingPage and render an inline retry banner when stats fails on
  initial load — previously the metrics row, diagnostics, and
  performance section sat indefinitely in skeleton state.

Inline #4 — Keep site names visible in overview sections
  Render the site name as an h2 above the metric row. Location remains
  a tile for at-a-glance scanning, but operators can now distinguish
  two sites that share blank or duplicate locations.

Inline #6 — Preserve cascade warning until rack count loads
  Disable the BuildingPage Edit affordance until ListBuildingRacks
  resolves (or fails). Previously clicking Edit before that completed
  opened the manage/delete dialog with rackCount: 0n, hiding the
  cascade warning for buildings that actually had racks.

Inline #7 / Sec MED — "No issues" before scope known
  Add `enabled` flag to useComponentErrors. While the building's
  device scope is still loading (memberDeviceIds === null), pass
  enabled=false so the hook holds undefined counts and the FleetErrors
  tiles render skeletons rather than racing to "No issues" against an
  empty-scope fetch.

3 participants