Skip to content

feat(rbac): resolver + RequirePermission middleware (PR 2a of 4)#301

Merged
mcharles-square merged 9 commits into
mainfrom
feat/rbac-pr2a-resolver
May 22, 2026
Merged

feat(rbac): resolver + RequirePermission middleware (PR 2a of 4)#301
mcharles-square merged 9 commits into
mainfrom
feat/rbac-pr2a-resolver

Conversation

@mcharles-square

Copy link
Copy Markdown
Collaborator

First half of PR 2 of the four-part granular RBAC plan (the plan PR). Building on the foundation in #283.

This PR introduces the permission resolver and the RequirePermission middleware — the security gate the rest of the codebase will route through. Critically, no callsites are swapped yet: every existing RequireAdmin check still gates handler entry, this PR is purely additive. The gate exists, but nothing reads it. PR 2b will flip the switch (swap callers and neutralize the legacy user_organization.role_id column).

Splitting PR 2 in half keeps the "live security gate flip" change small and reviewable — see #283's review history for context (five rounds, many design-level pivots). The resolver-and-middleware piece doesn't change behavior, so it can settle in review without holding up the swap.

What landed

domain/authz/effective.go — in-memory decision type

EffectivePermissions is the per-request immutable snapshot of one user's authorization state within one organization.

  • Has(key, ResourceContext) implements narrowing semantics from the plan:
    • Org-scope assignment grants the action everywhere within the org.
    • Site-scope assignment overrides the org grant at that site only, so admins can add a smaller site-scoped role to narrow a user without revoking their org-wide access.
    • Org-scoped actions (no site context in the request) are never shadowed by site-scope grants — there's no site key to narrow on.
  • FlatKeys() projects the union for UserInfo.permissions (client-side coarse gate; the server still enforces scope via Has() on every call).
  • Pure data structure with no I/O; built test-first with 10 unit tests covering the narrowing matrix, multi-site union, empty deny, unknown-key deny, and the FIELD_TECH AE.

domain/authz/resolver.goPermissionResolver

  • LoadEffective(ctx, userID, orgID) runs one query (ListEffectivePermissionsForUser) against user_organization_role × role × role_permission and materializes the result into an EffectivePermissions.
  • Groups flat (assignment_id, scope, permission_key) rows by assignment id so multi-permission roles fold into one Assignment value.
  • A user with no live assignments gets a non-nil empty value — the fail-closed default for a deactivated user or a user who was never in this org.
  • 5 DB-backed integration tests (via testutil.GetTestDB) cover SUPER_ADMIN-at-org-scope, FIELD_TECH-at-site-scope, narrowing end-to-end with two assignments, soft-delete is ignored, and empty deny-all.

handlers/middleware/permission.goRequirePermission

The runtime call site each handler will use in place of RequireAdmin. Returns the session.Info for handlers that need the caller's identity, or one of:

  • Unauthenticated — no session.Info on context.
  • Internal — fail-closed; EffectivePermissions missing from context (the interceptor failed to populate it). ALLOW is never the fail-closed default.
  • PermissionDenied with structured JSON body:
    {"required": "<permission_key>", "scope": {"site_id": <N>}}
    Scope echoes the caller's ResourceContext only — never server-side assignment IDs, role names, or the caller's effective permission list.

Internal-actor short-circuit: when session.Info.Actor is non-empty (scheduler, curtailment-reconciler, etc.), RequirePermission short-circuits to ALLOW without consulting EffectivePermissions. Synthesized sessions have no real user; they're trusted by virtue of running in-process.

WithEffectivePermissions(ctx, eff) is the public stash function used by the interceptor; the read path is private to enforce that handlers only consume via RequirePermission.

8 unit tests cover allow, deny (both org-scoped and site-scoped payload shape), unauthenticated, fail-closed-on-missing, scheduler short-circuit, curtailment-reconciler short-circuit, and the narrowing matrix.

handlers/interceptors/authentication.go — wiring

Both authenticateWithSession and authenticateWithApiKey end with a shared loadEffectivePermissions(ctx, info) helper that calls resolver.LoadEffective and stashes the result via middleware.WithEffectivePermissions before returning the request context.

  • API-key callers inherit the user's current effective set on every request (live inheritance). Revocation propagates on next request. Snapshot-at-issue is out of scope (noted in resolver godoc).
  • DB or wiring errors from LoadEffective flow through the same classifyLookupError path as user/role lookups — broken resolver path fails closed.
  • main.go constructs the resolver after Reconcile (so per-org built-in rows exist) and threads it into NewAuthInterceptor.
  • testutil.ServiceProvider gains a PermissionResolver field; all NewAuthInterceptor callsites updated.

What this PR does NOT do

  • Does not swap any RequireAdmin callers — every handler still uses the old gate. PR 2b.
  • Does not introduce the rpc_permissions.go declaration map or the contract test — PR 2b.
  • Does not rename user_organization.role_id or add the raising trigger — PR 2b.
  • Does not delete handlers/middleware/admin.go — kept until callers move in PR 2b.

Test plan

  • go build ./... clean
  • go vet ./... clean
  • lefthook run pre-push (golangci-lint) clean
  • Non-DB tests pass: 10 EffectivePermissions tests + 8 RequirePermission middleware tests, all green
  • DB-dependent resolver integration tests (5 functions) — need Docker for the full reconcile+assign+resolve loop
  • Behavior smoke: existing handler gates still work because no callsite was swapped; lefthook run pre-push covers the typecheck for the interceptor threading the new resolver in

Rollback

Reverting this PR removes the resolver call from the auth interceptor, the new files, and the PermissionResolver field on AuthInterceptor. No schema change, no data migration, no live behavior change to reverse.

In-memory permission set the per-request resolver materializes from
the user_organization_role × role × role_permission join. Pure data
structure — DB integration lands in the next commit (PermissionResolver).

Has(key, ResourceContext) implements the plan's narrowing rule:
- Org-scope assignment grants the action everywhere within the org.
- Site-scope assignment overrides the org grant at that site only
  (site-scope shadow), so an admin can add a smaller site role to
  narrow a user without revoking their org-wide access.
- Org-scoped actions (no site context in the request) are NEVER
  shadowed by site-scope grants — there is no site key to narrow on,
  so only org-scope assignments can satisfy user:manage and friends.

FlatKeys() projects the union for UserInfo.permissions (client-side
coarse gate; the server still enforces scope via Has() on every call).

Test-first per the plan's execution note. Ten tests cover: org-scope
allows everywhere, site-scope only at that site, org-scoped action
needs org grant, narrowing site-overrides-org at the matched site
but org still applies elsewhere, narrowing doesn't shadow org-scoped
actions, multi-site union, empty deny, unknown-key deny, FlatKeys
determinism, and the FIELD_TECH AE (blink yes, reboot no).
Has() never reads role identity (id/name/builtin_key) — it only
consults permission_key + scope. AssignmentRole was a speculative
field; remove it to keep the resolver's input shape minimal and avoid
a population requirement that the ListEffectivePermissionsForUser
query doesn't even satisfy.
…ions

LoadEffective(userID, orgID) runs a single query against
user_organization_role × role × role_permission via the
ListEffectivePermissionsForUser sqlc binding, groups the flat
(assignment_id, scope, permission_key) rows into one Assignment per
assignment_id, and materializes the slice into an
EffectivePermissions value the middleware can query via Has().

Soft-deleted assignments and roles are excluded by the underlying SQL.
A user with no live assignments returns a non-nil empty value — the
fail-closed default for a deactivated user or a user who was never in
this org.

Integration tests (DB-backed via testutil.GetTestDB) cover:
- SUPER_ADMIN at org-scope gets the full catalog.
- FIELD_TECH at site-scope is bounded at that site.
- Narrowing end-to-end with two assignments (ADMIN @ org +
  FIELD_TECH @ site).
- Soft-deleted assignment is ignored on the next resolver call.
- No-assignments returns empty deny-all (non-nil).
… denial payload

Runtime counterpart of EffectivePermissions.Has() — the call site each
handler uses in place of RequireAdmin (the swap lands in PR 2b).

Behavior:
- Unauthenticated request (no session.Info on context) → Connect
  Unauthenticated.
- Internal actor (session.Info.Actor != "") short-circuits to ALLOW.
  Scheduler and curtailment-reconciler synthesize sessions without a
  real user; they have no rows in user_organization_role and are
  trusted by virtue of running in-process.
- Authenticated user, EffectivePermissions missing on context →
  Connect Internal. Fail-closed: the gate is the security boundary
  and the missing value indicates a wiring bug, not "allow."
- Authenticated user, eff.Has(key, rc) false → Connect
  PermissionDenied with structured JSON body
  `{"required": "<key>", "scope": {"site_id": <N>}}`. Scope echoes
  the caller's ResourceContext only — never server-side assignment
  ids, role names, or the effective-permission list.

WithEffectivePermissions stashes the resolver's per-request output on
the context under a private key so handler code can't accidentally
read or overwrite it.

Eight unit tests cover: allow path, denial payload shape (both
org-scoped and site-scoped variants), unauthenticated, fail-closed on
missing EffectivePermissions, ActorScheduler short-circuit,
ActorCurtailment short-circuit, narrowing semantics threading
through the full middleware path (site-1 denies, site-2 allows under
the same caller).
Every authenticated request now ends with a per-request
EffectivePermissions stashed on the context, regardless of whether the
caller authenticated by session cookie or API key. RequirePermission
(committed in the previous change) consults that stashed value to
gate handler entry.

Implementation:
- AuthInterceptor gains a *authz.PermissionResolver field; both
  authenticateWithSession and authenticateWithApiKey end with a
  shared loadEffectivePermissions helper that calls
  resolver.LoadEffective and wraps the result with
  middleware.WithEffectivePermissions before returning the context.
- API-key callers inherit the user's CURRENT effective set on every
  request (live inheritance). Revocation propagates on next request.
  Snapshot-at-issue is out of scope for this PR and noted in the
  resolver godoc.
- DB or wiring errors from LoadEffective are surfaced via the same
  classifyLookupError path as the existing user/role lookups —
  authenticated requests with a broken resolver path fail closed.
- main.go constructs the resolver after migrations finish (Reconcile
  runs first so per-org built-in rows exist) and threads it into
  NewAuthInterceptor.
- testutil.ServiceProvider gains a PermissionResolver field; all
  NewAuthInterceptor callsites updated. Two unit tests that passed nil
  for unused dependencies (config_test, authentication_apikey_error_test)
  add a nil for the resolver too — those paths exit before the
  resolver runs and don't need a real one.
@mcharles-square mcharles-square requested a review from a team as a code owner May 22, 2026 16:41
Copilot AI review requested due to automatic review settings May 22, 2026 16:41
@github-actions

github-actions Bot commented May 22, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (3f157ca2f68465471f88a1af8397104bae72d239...7a202c74f2fc9f198c8cc7066a75131df2079684, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: MEDIUM

Findings

[MEDIUM] Permission Rechecks Reuse Stale Authorization Snapshot

  • Category: Auth
  • Location: server/internal/handlers/middleware/permission.go:102
  • Description: RequirePermission only reads the EffectivePermissions value cached on the request context during authentication. The new comment says long-running RPCs can re-invoke RequirePermission so revocation propagates within the same streaming session, but re-invocation uses the same cached snapshot and never reloads from the resolver.
  • Impact: If a user or API key loses a role/permission while a long-running RPC is active, later side effects in that same RPC can continue under the permissions loaded at request start. This matters for firmware updates, log downloads, and streaming control paths where a request can outlive an operator’s revocation action.
  • Recommendation: Either document that permissions are strictly request-start snapshots, or add a fresh-check path for long-running handlers that reloads permissions from PermissionResolver before each significant side effect.

Notes

No SQL injection, command injection, pool hijacking, protobuf wire-format, plugin-boundary, or frontend security issues were evident in the reviewed diff.


Generated by Codex Security Review |
Triggered by: @mcharles-square |
Review workflow run

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the new granular RBAC “effective permissions” snapshot + resolver, wires it into the auth interceptor, and introduces a RequirePermission middleware that will become the new handler gate (without switching any call sites yet).

Changes:

  • Introduces authz.EffectivePermissions with narrowing semantics and a DB-backed PermissionResolver.
  • Adds handlers/middleware.RequirePermission and context plumbing for per-request effective permissions.
  • Wires resolver loading into the auth interceptor and updates server/test wiring to pass the resolver.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
server/internal/testutil/service_provider.go Adds PermissionResolver to test service provider wiring.
server/internal/testutil/infrastructure_provider.go Threads the resolver into NewAuthInterceptor in test infra.
server/internal/handlers/middleware/permission.go New RequirePermission middleware + context stash/accessors and structured deny payload.
server/internal/handlers/middleware/permission_test.go Unit tests for allow/deny, payload shape, fail-closed behavior, and actor short-circuit.
server/internal/handlers/interceptors/config_test.go Updates interceptor construction signature in tests.
server/internal/handlers/interceptors/authentication.go Loads/stashes effective permissions after auth (session + API key).
server/internal/handlers/interceptors/authentication_test.go Updates interceptor construction to pass resolver in integration tests.
server/internal/handlers/interceptors/authentication_apikey_error_test.go Updates interceptor construction signature.
server/internal/domain/authz/effective.go New in-memory effective-permissions decision structure + Has/FlatKeys.
server/internal/domain/authz/effective_test.go Unit tests for narrowing semantics, unions, and flat projection.
server/internal/domain/authz/resolver.go New resolver that materializes EffectivePermissions from one SQL query.
server/internal/domain/authz/resolver_test.go DB-backed integration tests for org/site scope and narrowing behavior.
server/cmd/fleetd/main.go Constructs and injects the resolver into the auth interceptor at boot.

Comment thread server/internal/handlers/interceptors/authentication.go
Comment thread server/internal/domain/authz/resolver.go
…-closed

Round-1 review on #301 caught three real bugs.

**Zero-permission site assignments fail open (Codex HIGH + thread 2):**
ListEffectivePermissionsForUser was INNER JOINing role_permission, so
a site-scope role with zero permissions produced no rows. The
resolver never recorded bySite[siteID] for that site and narrowing
collapsed back to the user's org-scope grant — an empty "lockdown"
role intended to revoke broad access at a single site would have
silently failed and allowed miner:reboot / miner:update_pools / etc.
through. Switched to LEFT JOIN role_permission and permission;
permission_key is now nullable. assignmentsFromRows skips null keys
but still creates the Assignment row so the bySite[] entry exists.
Two regression tests added: pure unit test (in-memory empty
assignment narrows correctly) plus DB-backed test exercising the
LEFT JOIN path with a real empty custom role.

**Any non-empty Actor bypasses RBAC (Codex MEDIUM):** RequirePermission
short-circuited on `info.Actor != ""`, which would accept any future
or mistyped Actor value as fully trusted. Switched to an explicit
allowlist: ActorScheduler and ActorCurtailment short-circuit to
ALLOW; any other non-empty Actor returns Internal. Regression test
proves an unknown Actor surfaces as Internal, not ALLOW.

**Nil PermissionResolver panic (thread 1):** loadEffectivePermissions
dereferenced i.permissionResolver without a nil check. A miswired
test or boot path constructing the interceptor with nil (a few unit
tests pass nil for unused dependencies) would have panicked the
server instead of failing closed. Added the nil check — returns
Internal so the wiring bug surfaces immediately rather than as a
crash.
@mcharles-square

Copy link
Copy Markdown
Collaborator Author

Codex Security Review (round 1 on #301) — response

Both findings addressed in e7f762a.

HIGH — Zero-permission site assignments fail open to org-scope grants

A site-scoped role with zero permissions produces no permission rows, so the resolver never records bySite[siteID]; Has then falls back to the user's org-scope grant.

You're right — the INNER JOIN dropped empty-role assignments and narrowing collapsed. Fixed by switching ListEffectivePermissionsForUser to LEFT JOIN role_permission and LEFT JOIN permission, with permission_key now sql.NullString. The resolver creates the Assignment (and therefore the bySite[siteID] marker) on first sight of the assignment id, and only appends a permission key when the row carries a non-null one. An assignment with zero permissions still triggers narrowing.

Two regression tests added:

  • TestEffective_ZeroPermissionSiteAssignmentStillNarrows — pure unit test against NewEffectivePermissions directly.
  • TestResolver_ZeroPermissionSiteAssignmentStillNarrows — DB-backed, creates an empty custom role via UpsertCustomRoleForOrg and confirms an org-scope ADMIN + zero-permission site role denies miner:reboot at the targeted site, but still allows it at sites with no narrower assignment.

MEDIUM — Any non-empty internal actor bypasses all permission checks

Switch on explicit allowed constants (session.ActorScheduler, session.ActorCurtailment) and fail closed for unknown non-empty actors. Add a test proving an unknown actor does not bypass EffectivePermissions.

Done. Replaced the info.Actor != "" short-circuit with an explicit allowlist:

if info.Actor != "" {
    switch info.Actor {
    case session.ActorScheduler, session.ActorCurtailment:
        return info, nil
    default:
        return nil, fleeterror.NewInternalErrorf(
            "authz: unknown internal actor %q; refusing to short-circuit RBAC",
            info.Actor,
        )
    }
}

TestRequirePermission_UnknownActorDoesNotBypass mounts a context with Actor: session.Actor("future-orchestrator-typo") and no EffectivePermissions; the call returns Internal (not ALLOW) — proves the typo path does not silently grant.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7f762ab8c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/handlers/interceptors/authentication.go Outdated
… succeeds

Codex P2 caught: authenticateWithApiKey was calling
RecordSuccessfulUse before loadEffectivePermissions, so a resolver
failure (DB error, nil wiring) would reject the request while still
crediting last_used_at as a successful authentication. Audit and
usage telemetry would report failed auths as successful.

Moved RecordSuccessfulUse below the loadEffectivePermissions call so
it only fires when the full authentication chain succeeds. Behavior
on success is unchanged.
…Manage

CI caught that TestResolver_ZeroPermissionSiteAssignmentStillNarrows
asserted ADMIN holds user:manage at org scope. ADMIN's seed formula is
AllPermissions() minus {user:read, user:manage, role:manage} — the
user:manage exclusion is the whole point of the SUPER_ADMIN-only
floor. Swapped the org-scope assertion to site:manage, which IS in
ADMIN's seed and exercises the same property (org-scoped action,
satisfied by org-scope grant despite the empty site-scope assignment).
…d scope helper

Review-driven simplification pass over PR 2a (no behavior change):

**resolver.go assignmentsFromRows**: dropped the `started bool` + `flush()` closure pattern. Pre-seed the slice from `rows[0]` then iterate uniformly; the same code path now handles first-row and same-assignment-row cases. Extracted `newAssignmentFromRow` for the scope-only seed. Net: -8 lines and clearer flow.

**effective.go Has()**: tightened the inline narration. The type-level doc already explains narrowing in full; the function body's three repeat-explanations were noise. Code is short enough to read directly.

**resolver_test.go assignAssignment**: helper now takes `authz.ScopeType` instead of raw `string`, and call sites use `authz.ScopeOrg` / `authz.ScopeSite` constants instead of literal `"org"` / `"site"`. Removes seven stringly-typed call sites.

**resolver_test.go / backfill_test.go**: moved `insertTestSite` next to its sibling `insertTestOrganization` and `insertTestUser` helpers in backfill_test.go — all three are package-local DB-seeding helpers that belong together.

**permission.go**: dropped the "in U7" task-tracker reference from the RequirePermission godoc; replaced with "as call sites migrate to the permission model" so the doc outlives the migration.

**resolver.go NewPermissionResolver**: trimmed the constructor godoc to the WHY (per-request read query, no surrounding tx) and dropped the sentence that just narrated the signature.

Build, vet, golangci-lint, and non-DB authz/middleware tests all green.

@ankitgoswami ankitgoswami left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@mcharles-square mcharles-square merged commit cf6ff01 into main May 22, 2026
68 checks passed
@mcharles-square mcharles-square deleted the feat/rbac-pr2a-resolver branch May 22, 2026 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants