notifications: preliminary UI work#306
Conversation
🔐 Codex Security Review
Review SummaryOverall Risk: MEDIUM Findings[MEDIUM] Destination Checks Do Not Enforce Grafana Egress
[MEDIUM] Notification HTTP Surface Diverges From Generated Protobuf Contract
NotesNo auth bypass, SQL injection, command injection, cryptostealing/pool hijack logic, or Rust ASIC plugin unsafe-code changes were found in the reviewed diff. This was a static review of Generated by Codex Security Review | |
9534768 to
e52b9ea
Compare
7811814 to
4370492
Compare
5ee6d9f to
a4625a5
Compare
4370492 to
8d92a42
Compare
494025b to
ba75406
Compare
Addresses the Codex security review on PR #306: - RBAC bypass (HIGH): add notification:read / notification:manage to the authz catalog and gate every notifications route through middleware.RequirePermission. The handler now loads effective permissions after session validation, mirroring the Connect auth interceptor. Reads sit on :read; all mutations including TestChannel sit on :manage. - Org-wide silences via empty scopes (HIGH): validate silence scopes in the domain layer — every scope kind now requires its target (rule_id / group_id / site_id / device_ids). The UI hides the group/site/device scope options until their pickers exist. - Channels not routed to alerts (HIGH): made the UI explicit that saved channels are not yet attached to alert routing; policy-tree routing lands as a follow-up. - Secret leakage in Grafana error logs (HIGH): redact authorization_credentials, smtpPassword, and other credential fields from the request/response bodies logged on non-2xx Grafana responses. - SSRF via webhook/SMTP destinations (MEDIUM): new DestinationPolicy rejects destinations that are or resolve to loopback, link-local, private, or unspecified addresses; opt out via FLEET_METRICS_NOTIFICATIONS_ALLOW_PRIVATE_DESTINATIONS for dev stacks (enabled in the dev compose overlay). - Updates wiping stored secrets (MEDIUM): UpdateChannel now carries the existing settings' secret field into the PUT payload when the request doesn't include a fresh secret, so renames and destination edits no longer clear webhook credentials or SMTP passwords. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Addressed the Codex security review findings in 146c469: [HIGH] Notifications endpoints bypass RBAC — Added [HIGH] Empty silence scopes can mute the whole org — [HIGH] Saved channels are not routed to alerts — Routing via the Grafana notification-policy tree is a feature of its own and isn't in this PR; the Channels section copy now states explicitly that saved channels are not yet attached to alert routing and that "Test" sends directly to the destination. Policy-tree routing lands as a follow-up. [HIGH] Grafana error logging leaks notification secrets — Request/response bodies logged on non-2xx Grafana responses now pass through structured key-based redaction ( [MEDIUM] Webhook/SMTP destinations allow server-side egress — New [MEDIUM] Updating a webhook channel clears its stored secret — 🤖 Generated with Claude Code |
Follow-ups from the Codex re-review on PR #306: - DNS rebinding caveat (HIGH, partial): destination DNS lookups now fail closed — an unresolvable host is rejected instead of waved through. The rebinding residual (re-resolution at Grafana delivery time) is documented on checkDestinationHost; a hard guarantee needs egress enforcement at the Grafana container's network boundary. - Webhook URLs in error logs (MEDIUM): "url" added to the redacted log keys — webhook URLs routinely embed capability tokens (Slack, PagerDuty) in the path or query. - Silence regex injection (MEDIUM): device ids in device-scoped silences are validated against the identifier alphabet, escaped with regexp.QuoteMeta, and compiled into an anchored regex matcher so a crafted id like ".*" can't widen the silence org-wide. matchersToScope strips the anchors/escapes on reads. - ADMIN backfill (MEDIUM): migration 000087 seeds notification:read / notification:manage and backfills them onto existing ADMIN roles — the additive reconciler intentionally never re-asserts seed keys on existing built-ins (mirrors 000069). - Grafana admin creds in deployment (MEDIUM, partial): the deployment overlay now passes FLEET_METRICS_GRAFANA_TOKEN through (it takes precedence over basic auth in the client) so operators can move fleet-api onto a least-privilege service-account token; the admin fallback remains until a token is configured. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Second-round review findings addressed in 72d9d9e: [HIGH] SSRF policy bypassable via DNS rebinding — Partially mitigated: DNS lookups now fail closed (an unresolvable destination is rejected rather than allowed), per the reviewer's minimum recommendation. The fundamental rebinding window can't be closed from fleet-api — Grafana re-resolves the hostname at delivery time — so the residual risk is documented on [MEDIUM] Webhook URLs leak into error logs — [MEDIUM] Device-scope silence regex injection — Device ids are now validated against the identifier alphabet ( [MEDIUM] Existing ADMIN roles not backfilled — Correct catch: the additive reconciler never re-asserts seed keys on existing built-ins. Migration [MEDIUM] Grafana admin creds in deployment overlay — The overlay now passes [MEDIUM] Generated Connect API not served — Intentionally deferred, as documented in the handler header: the hand-written JSON routes are the interim surface until the 🤖 Generated with Claude Code |
Follow-ups from the Codex re-review on PR #306: - Webhook URLs leak to read-only users (MEDIUM): webhook URLs embed capability tokens in the path/query, so reads (reachable by notification:read) now return them host-only (scheme://host). UpdateChannel carries the stored full URL through an edit that didn't replace it — an empty or still-redacted URL is treated as unchanged, mirroring the secret-carry path; a manage user submitting a fresh full URL overrides it. - Raw Grafana error bodies returned to clients (MEDIUM): GrafanaError messages reach the browser via err.Error(), and update payloads now carry stored secrets, so a Grafana error echoing the request body could leak them. The message is redacted with the same key-based redactor used for logs. - Hand-rolled JSON routes bypass validation/body limits (MEDIUM): added http.MaxBytesReader (1 MiB) on the body, and enforce the proto's name (255) / comment (1024) / device-id-count (500) bounds in the hand-written path until the generated Connect handlers with the validate interceptor take over. DNS rebinding (MEDIUM) is already mitigated at the application layer (fail-closed resolution) and documented; the network-boundary fix is infra. HistoryService-outside-proto (MEDIUM) is part of the tracked generated-handler migration. Both noted on the PR. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Third-round findings addressed in 0b131db: [MEDIUM] Webhook URLs leak to read-only users — Good catch, consistent with the logger change. Reads now return webhook URLs host-only ( [MEDIUM] Raw Grafana error bodies returned to clients — [MEDIUM] Hand-rolled routes bypass validation/body limits — Added [MEDIUM] DNS rebinding bypasses SSRF checks — Unchanged from round two: mitigated at the application layer (fail-closed resolution) and documented on [MEDIUM] HistoryService outside the proto contract — Acknowledged; this is part of the same generated-handler migration the handler header documents. When 🤖 Generated with Claude Code |
Fourth-round Codex finding: redactSecrets passed non-JSON upstream error bodies through verbatim, and that string is both logged and returned to the browser via GrafanaError. A Grafana/proxy HTML or plaintext error that echoes the request payload could leak bearer tokens, SMTP passwords, or secret-bearing webhook URLs — key-based redaction can't reach secrets in unstructured text. - redactSecrets now returns a length marker for non-JSON input instead of the raw body. - The client-facing GrafanaError message is the generic status text for non-JSON responses; only Grafana's own structured (JSON) error is surfaced, after key redaction. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Fourth-round review — risk is back at HIGH on two findings. Triage: [HIGH] Channel destination updates preserve old secrets — Valid, and a regression from my round-1 secret-carry. This is already fixed in the working tree (in-progress on the branch): the secret carry is now gated on the destination being unchanged — a webhook URL or SMTP host/port change without a fresh secret drops the stored credential instead of delivering the old bearer token / SMTP password to the new destination. A rename or recipients-only edit still preserves it. It'll land with the next push of the in-progress channel work on this branch. [HIGH] DNS rebinding bypasses SSRF check — Unchanged position: this cannot be closed in fleet-api. Pinning the resolved IP and handing Grafana an IP instead of the hostname would break TLS SNI / the Host header for real destinations (Slack, PagerDuty). The validation is an application-level backstop (fail-closed resolution, documented on [MEDIUM] Grafana error redaction passes non-JSON bodies through — Fixed in 988f58a. [MEDIUM] Saved channel tests ignore the saved channel ID — Valid; [MEDIUM] History endpoint outside the proto contract — Same disposition as prior rounds: part of the tracked hand-written-JSON → generated-Connect-handler migration documented in the handler header. When 🤖 Generated with Claude Code |
Fifth-round Codex security findings:
- Grafana JSON error bodies can leak secrets in generic string fields
(MEDIUM): key-based redaction misses a secret Grafana echoes inside
a value like {"message":"failed to POST to https://hooks/.../TOKEN"}.
redactValue now scrubs webhook-URL and bearer-token substrings out
of every string value, wherever they sit — applied to both the log
body and the client-facing GrafanaError message.
- Fleet API always receives the Grafana admin password (MEDIUM): the
deployment overlay hard-required GRAFANA_ADMIN_PASSWORD for
fleet-api even when a service-account token is configured, so a
fleet-api compromise leaked the admin credential. The basic-auth
password now defaults empty and is an explicit opt-in fallback; a
token deployment no longer carries the admin credential in
fleet-api's environment. (Needs a companion run-fleet.sh change to
set the password for no-token deployments — flagged on the PR.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Fifth-round review (MEDIUM). Two fixed in 97ca093, three deferred with consistent rationale: [MEDIUM] Grafana JSON error bodies leak secrets in generic string fields — Fixed. My round-4 redaction was key-based only, so a secret echoed inside a value (e.g. [MEDIUM] Fleet API always receives the Grafana admin password — Fixed in the deployment overlay. It previously hard-required [MEDIUM] Hand-written JSON not Connect/protobuf-compatible & [MEDIUM] History endpoint missing from proto — Both are the same tracked item: the hand-written JSON routes are the documented interim surface (see the handler header) until the [MEDIUM] SSRF is only a pre-resolution check (DNS rebinding) — Unchanged: not closable in fleet-api without breaking TLS SNI/Host for real destinations. Application-level fail-closed validation stays as the backstop; the guarantee needs egress policy at the Grafana container boundary or a pinning delivery proxy. Flagged for human risk-acceptance + infra. (Noted the CGNAT range suggestion — worth folding into the host check if/when we extend it.) 🤖 Generated with Claude Code |
Sixth-round Codex finding (also a regression surfaced by the read-time URL redaction): TestChannel ignored Channel.ID and always probed the request payload. Since reads now hand the UI a redacted destination (webhook URLs host-only, Slack URLs omitted), clicking "Test" on a saved channel hit the wrong target — or a host root that isn't the real destination. - When an id is supplied, TestChannel now resolves the owned Grafana contact point (ownership-checked via findOwnedChannel) and tests its stored settings — the full URL and Grafana's secure fields. An unsaved definition (no id) still validates and tests the request payload for the "Test before save" flow. - The handler now surfaces TestChannel's service error (unknown/foreign id, invalid destination, Grafana unreachable) as a 4xx/5xx instead of discarding it, distinct from a delivery failure reported in the body. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Sixth-round review (MEDIUM). One fixed, two are the same tracked migration item: [MEDIUM] Saved channel tests don't use the saved destination — Fixed in c587607. This was the round-4 item I'd deferred while Caveat noted in the finding: a saved Slack channel still can't be fully tested because Grafana stores the Slack URL as a secure field and doesn't return it on reads — same fundamental limitation as any stored secret, and not regressed by this change (it failed before too). Fully fixing it would require fleet-api to keep a parallel copy of the secret, which the design deliberately avoids. [MEDIUM] Manual wire format diverges from generated protobuf & [MEDIUM] History endpoint not in proto — Same tracked item as prior rounds: the hand-written JSON routes are the documented interim surface until the The DNS-rebinding item moved to the reviewer's Notes this round ("deployment should enforce Grafana egress at the network layer") — agreed, that's the resolution: infra-level egress, not a fleet-api code change. 🤖 Generated with Claude Code |
|
Seventh-round review — converged. No new actionable findings this round; both remaining items are ones already dispositioned over prior rounds, and I'm not making code changes for them (flagging here so it's explicit rather than re-answering each re-review):
All concrete, code-level findings from rounds 1–6 (RBAC bypass, org-wide silences, secret-log/error leakage, SSRF pre-checks, secret-reuse-on-destination-change, secret redaction depth, token-only Grafana auth, saved-channel testing) are addressed and on the branch. The two items above are an infra task and a tracked refactor respectively — both appropriate to handle outside this preliminary-UI PR. I'll continue watching and will only follow up if a future re-review surfaces something genuinely new. 🤖 Generated with Claude Code |
UI work expose notifications controls and alerting sinks.