notifications (4/5): Grafana-backed service, handlers, and server wiring #455

Open
illegalprime wants to merge 3 commits into eden/notifications-3-authz from eden/notifications-4-server

@illegalprime

Contributor

Stack 4/5 — base: `eden/notifications-3-authz`

Wires up the server side end-to-end.

  • Notifications domain: Grafana sidecar client, service with org isolation and value-level secret redaction, models
  • Connect handlers for Channel / Rule / Silence / History
  • `main.go` wiring; Grafana + destination-policy config on the metrics provider
  • Marks channel mutations as redacted + session-only in the interceptor config (a leaked API key can't reconfigure alert delivery or mute alerts)

Stack

  1. proto + generated code
  2. notification_history keyset listing
  3. notification:read / notification:manage permissions
  4. this PR — Grafana-backed service, handlers, server wiring
  5. notifications UI

🤖 Generated with Claude Code

@illegalprime illegalprime requested a review from a team as a code owner June 15, 2026 15:53
@github-actions

github-actions Bot commented Jun 15, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (66e55db8f5c0d4bfe97e1fd3f88307bceeb66a05...75185f66396fe7581a62d34f4869e4be93b79392, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Created notification channels are not used for real alert delivery

  • Category: Reliability
  • Location: server/internal/domain/notifications/service.go:96
  • Description: CreateChannel only creates a Grafana contact point. It never updates Grafana’s notification policy tree or any fleet-api fan-out path to route org alerts to that contact point. The existing provisioned policy still routes alerts to the internal protofleet-internal receiver only.
  • Impact: Operators can create and successfully test channels, but production alerts will not be delivered to those channels. This is a silent alerting failure.
  • Recommendation: On channel create/update/delete, also manage an org-scoped notification policy route, or make the internal webhook fan out to configured channels. Add an integration test that fires a real alert and verifies it reaches a user-created channel.

[HIGH] Destination validation is bypassable by DNS rebinding or redirects

  • Category: Network Discovery
  • Location: server/internal/domain/notifications/service.go:369
  • Description: Notification destinations are classified only at save/test-preview time. Grafana performs the actual outbound request later, and the code explicitly notes the check is not rebinding-proof. A user can save a public domain, then change DNS or return a redirect to loopback, link-local, metadata, or private addresses before TestChannel or alert delivery.
  • Impact: A user with notification:manage can turn Grafana into an SSRF primitive against services reachable from the Grafana container or host network.
  • Recommendation: Enforce egress controls at the Grafana/container network boundary, including loopback, RFC1918, link-local, and metadata IP blocks. If app-level enforcement remains, proxy delivery through code that re-resolves and validates every dial and redirect hop.
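The "re-resolve and validate every dial" option in the recommendation could look roughly like the Go sketch below. It is not code from this PR: `guardDial`, `isBlockedAddr`, and `newGuardedClient` are hypothetical names, and the sketch assumes delivery is proxied through a Go HTTP client instead of being sent by Grafana directly. The check runs in `net.Dialer.Control`, which is called after DNS resolution with the literal IP being dialed, so a rebound DNS answer or a redirect hop is checked at connect time as well.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/netip"
	"syscall"
	"time"
)

var errBlockedDestination = errors.New("destination address is not allowed")

// isBlockedAddr reports whether addr is loopback, RFC1918/ULA private,
// link-local (which covers 169.254.169.254 metadata), unspecified, or multicast.
func isBlockedAddr(addr netip.Addr) bool {
	addr = addr.Unmap() // treat ::ffff:127.0.0.1 like 127.0.0.1
	return addr.IsLoopback() || addr.IsPrivate() || addr.IsLinkLocalUnicast() ||
		addr.IsUnspecified() || addr.IsMulticast()
}

// guardDial has the signature of net.Dialer.Control. The address argument is
// the resolved IP:port, never a hostname.
func guardDial(network, address string, _ syscall.RawConn) error {
	ap, err := netip.ParseAddrPort(address)
	if err != nil || isBlockedAddr(ap.Addr()) {
		return errBlockedDestination
	}
	return nil
}

// newGuardedClient builds an HTTP client whose every connection, including
// those made while following redirects, passes through guardDial.
func newGuardedClient() *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second, Control: guardDial}
	return &http.Client{
		Timeout:   10 * time.Second,
		Transport: &http.Transport{DialContext: dialer.DialContext},
	}
}

func main() {
	_ = newGuardedClient()
	for _, addr := range []string{"127.0.0.1:80", "169.254.169.254:80", "10.0.0.5:443", "93.184.216.34:443"} {
		fmt.Printf("%-22s blocked=%v\n", addr, guardDial("tcp", addr, nil) != nil)
	}
}
```

A dial-time check like this complements, and does not replace, egress rules at the container network boundary.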

[MEDIUM] TestChannel does not implement the request contract

  • Category: gRPC
  • Location: server/internal/handlers/notifications/handler.go:128
  • Description: TestChannel always calls protoToChannel, which rejects CHANNEL_KIND_UNSPECIFIED, even when the request only needs to test an existing saved channel by id. It also contradicts the proto comment that an unsaved definition wins when both id and config are present; the service uses the saved channel whenever id is non-empty.
  • Impact: Saved-channel tests can fail unexpectedly, and edited unsaved destinations can get false-positive test results from the old saved destination.
  • Recommendation: Treat id-only requests as Channel{ID: id} without requiring kind; when both id and a definition are present, either test the supplied definition or reject the ambiguous request. Add handler-level tests.
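One way to read the recommendation is as a small decision function that runs before any kind validation. The sketch below is illustrative only: `testRequest`, `resolveTestTarget`, and the target constants are hypothetical, and it picks the "reject the ambiguous request" option instead of the "definition wins" rule from the proto comment.

```go
package main

import (
	"errors"
	"fmt"
)

type ChannelKind int

const (
	ChannelKindUnspecified ChannelKind = iota
	ChannelKindWebhook
	ChannelKindSMTP
	ChannelKindSlack
)

// testRequest is a hypothetical, flattened stand-in for TestChannelRequest.
type testRequest struct {
	ID   string
	Kind ChannelKind
}

type testTarget int

const (
	targetSaved   testTarget = iota + 1 // probe the stored channel by id
	targetUnsaved                       // probe the definition in the request
)

var (
	errEmptyTestRequest     = errors.New("test request needs an id or a channel definition")
	errAmbiguousTestRequest = errors.New("test request carries both an id and a definition")
)

// resolveTestTarget decides which channel a test should probe. An id-only
// request is valid even though its kind is unspecified.
func resolveTestTarget(req testRequest) (testTarget, error) {
	hasID := req.ID != ""
	hasDefinition := req.Kind != ChannelKindUnspecified
	switch {
	case hasID && hasDefinition:
		return 0, errAmbiguousTestRequest
	case hasID:
		return targetSaved, nil
	case hasDefinition:
		return targetUnsaved, nil
	default:
		return 0, errEmptyTestRequest
	}
}

func main() {
	fmt.Println(resolveTestTarget(testRequest{ID: "ch-1"}))
	fmt.Println(resolveTestTarget(testRequest{Kind: ChannelKindSlack}))
	fmt.Println(resolveTestTarget(testRequest{ID: "ch-1", Kind: ChannelKindSlack}))
}
```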

[MEDIUM] Grafana admin password is exposed in process arguments

  • Category: Infrastructure
  • Location: deployment-files/run-fleet.sh:933
  • Description: The new token provisioning path passes admin:${admin_pass} to curl -u. Command-line arguments are commonly visible via ps or /proc to local users while the command runs.
  • Impact: A local user on the deployment host could capture the Grafana admin password and mint broader Grafana access.
  • Recommendation: Avoid secrets in argv. Use a temporary 0600 curl config/netrc file or stdin-fed config, remove it immediately after use, and keep the password out of process arguments.

[MEDIUM] Grafana JSON error bodies are returned after incomplete redaction

  • Category: gRPC
  • Location: server/internal/domain/notifications/grafana_client.go:213
  • Description: For JSON error responses, the client returns the redacted body as the error message. The redactor only handles known JSON keys plus URLs and bearer strings inside generic string fields, so JSON messages that echo SMTP passwords or other secrets in free-form text can still reach logs and RPC errors.
  • Impact: Notification credentials can leak to server logs or API clients during Grafana validation/provisioning failures.
  • Recommendation: Do not surface upstream Grafana bodies to clients. Return a generic status/category, and log only whitelisted safe fields or a fully generic redacted summary.
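As a rough illustration of "return a generic status/category", the sketch below maps an upstream status code to a fixed message and never copies the response body into the error. `upstreamError` and `classifyUpstream` are hypothetical names, not the PR's `grafana_client.go` API.

```go
package main

import (
	"fmt"
	"net/http"
)

// upstreamError carries only a status code and a fixed category string.
type upstreamError struct {
	Status   int
	Category string
}

func (e *upstreamError) Error() string {
	return fmt.Sprintf("grafana request failed: %s (status %d)", e.Category, e.Status)
}

// classifyUpstream turns a non-2xx response into a generic error. The body is
// accepted but deliberately ignored so that secrets echoed by the upstream
// cannot reach logs or RPC clients through this path.
func classifyUpstream(status int, _ []byte) error {
	if status >= 200 && status < 300 {
		return nil
	}
	category := "unexpected response"
	switch {
	case status == http.StatusUnauthorized || status == http.StatusForbidden:
		category = "authentication rejected"
	case status == http.StatusNotFound:
		category = "resource not found"
	case status >= 400 && status < 500:
		category = "request rejected"
	case status >= 500:
		category = "upstream unavailable"
	}
	return &upstreamError{Status: status, Category: category}
}

func main() {
	body := []byte(`{"message":"smtp auth failed for password hunter2"}`)
	fmt.Println(classifyUpstream(http.StatusBadRequest, body))
}
```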

Notes

Reviewed only .git/codex-review.diff. I did not find changed code in this diff touching miner pool configuration, command execution, SQL queries, frontend rendering, plugin execution, Rust ASIC code, or protobuf wire-format definitions.


Generated by Codex Security Review |
Triggered by: @illegalprime |
Review workflow run

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8c55e88d7a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

FLEET_METRICS_GRAFANA_TOKEN: "${FLEET_METRICS_GRAFANA_TOKEN:-}"
# Basic-auth fallback; defaults empty so a token deployment never holds the Grafana admin credential.
FLEET_METRICS_GRAFANA_USER: "${FLEET_METRICS_GRAFANA_USER:-admin}"
FLEET_METRICS_GRAFANA_PASSWORD: "${FLEET_METRICS_GRAFANA_PASSWORD:-}"

P1 Badge Pass the generated Grafana password to fleet-api

In the documented production path (run-fleet.sh --enable-beta-notifications), the script generates GRAFANA_ADMIN_PASSWORD but does not set FLEET_METRICS_GRAFANA_TOKEN or FLEET_METRICS_GRAFANA_PASSWORD; this overlay then passes an empty password while GrafanaConfig falls back to basic auth as admin. The stack therefore starts with fleet-api unable to authenticate to the Grafana sidecar, so the newly wired channel/rule/silence APIs return Grafana 401s unless an operator manually discovers and sets an extra env var.

}
case ChannelKindSlack:
// Slack URLs are write-only secrets: an edit without a fresh URL keeps the stored destination, a fresh URL is both a destination change and a new secret.
keepStoredSlackURL = c.Slack == nil || c.Slack.WebhookURL == ""

P2 Badge Require a Slack URL when changing channel kind

When an existing non-Slack channel is updated to SLACK with no webhook_url, this treats the missing URL as “keep the stored Slack URL” without checking owned.Kind; validation is skipped and carrySecretSettings later copies the old settings url into the new Slack contact point. That can silently reuse a webhook destination as a Slack secret when changing kinds, so only preserve an empty Slack URL for existing Slack channels and require a fresh URL otherwise.
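The suggested rule can be sketched as a helper that takes the stored channel's kind into account. `shouldKeepStoredSlackURL` and the string-typed `ChannelKind` are assumptions made for this example; the PR's own types may differ.

```go
package main

import (
	"errors"
	"fmt"
)

type ChannelKind string

const (
	ChannelKindWebhook ChannelKind = "webhook"
	ChannelKindSlack   ChannelKind = "slack"
)

var errSlackURLRequired = errors.New("a Slack webhook URL is required when the channel was not already a Slack channel")

// shouldKeepStoredSlackURL reports whether an update may reuse the stored
// Slack URL. Reuse is only allowed when the stored channel is already Slack;
// a kind change must supply a fresh URL.
func shouldKeepStoredSlackURL(storedKind ChannelKind, requestedURL string) (bool, error) {
	if requestedURL != "" {
		return false, nil // fresh URL: validate and store it
	}
	if storedKind != ChannelKindSlack {
		return false, errSlackURLRequired
	}
	return true, nil
}

func main() {
	fmt.Println(shouldKeepStoredSlackURL(ChannelKindSlack, ""))
	fmt.Println(shouldKeepStoredSlackURL(ChannelKindWebhook, ""))
}
```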

case SilenceScopeGroup:
if sil.Scope.GroupID != "" {
matchers = append(matchers, GrafanaSilenceMatcher{
Name: "group_id",

P2 Badge Match group silences to labels alerts actually emit

A group-scoped silence is accepted but written with a group_id matcher; none of the bundled Grafana rules emit a group_id label (the static rule label is rule_group, while the metrics contract uses device_group). In the current notification stack this silence will look active in the API but will never match firing alerts, so group mutes do not stop notifications.

case SilenceScopeSite:
if sil.Scope.SiteID != "" {
matchers = append(matchers, GrafanaSilenceMatcher{
Name: "site_id",

P2 Badge Include site labels before accepting site silences

Site-scoped silences are written with a site_id matcher, but the provisioned Grafana alert queries in proto-fleet-rules.yaml select only organization_id/device_id for device alerts and only organization_id for fleet alerts, so alert instances do not carry site_id. In that deployment a site silence is stored and returned as active but cannot match any alert, causing notifications for the site to continue firing.
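Both silence comments come down to matchers that name labels no alert carries. One possible guard is to validate matcher names against the set of labels the provisioned rules emit, as sketched below. The `emittedLabels` set is an assumption taken from the review text (`organization_id` and `device_id`), and `matcher` and `validateMatchers` are hypothetical names.

```go
package main

import "fmt"

// emittedLabels lists labels that alert instances are assumed to carry,
// based on the review comment about proto-fleet-rules.yaml.
var emittedLabels = map[string]bool{
	"organization_id": true,
	"device_id":       true,
}

type matcher struct {
	Name  string
	Value string
}

// validateMatchers rejects a silence whose matchers reference labels that no
// alert emits, since such a silence would be stored but never match.
func validateMatchers(ms []matcher) error {
	for _, m := range ms {
		if !emittedLabels[m.Name] {
			return fmt.Errorf("silence matcher %q is not a label emitted by the alert rules", m.Name)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateMatchers([]matcher{{"organization_id", "42"}, {"device_id", "d-7"}}))
	fmt.Println(validateMatchers([]matcher{{"organization_id", "42"}, {"site_id", "s-1"}}))
}
```

Rejecting the request makes the gap visible to the caller until the rules emit site and group labels.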

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6cdd7d2724

Type: grafanaTypeFor(c.Kind),
Settings: settings,
}
created, err := s.grafana.CreateContactPoint(ctx, cp)

P1 Badge Route created channels into Grafana notification policies

When a user creates a Slack/SMTP/webhook channel, this only creates a contact point; the provisioned policy tree in server/monitoring/grafana/provisioning/alerting/notification-policies.yaml still has a single root receiver protofleet-internal and there is no policy API update in this service. In that deployed configuration Grafana will keep routing alerts only to the internal webhook, so newly created channels can be listed and tested but will never receive real alert notifications unless an operator manually edits Grafana routing.
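To make the missing routing step concrete, here is a sketch of adding an org-scoped child route under the root receiver. It is an assumption-heavy illustration: the `route` struct and its JSON field names (`receiver`, `object_matchers`, `continue`, `routes`) follow my understanding of Grafana's notification policy tree and should be checked against the provisioning API documentation, `withOrgRoute` and the `ch-ops-slack` receiver name are hypothetical, and no HTTP call is made.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// route is a reduced, assumed shape of a notification policy node.
type route struct {
	Receiver       string      `json:"receiver"`
	ObjectMatchers [][3]string `json:"object_matchers,omitempty"`
	Continue       bool        `json:"continue,omitempty"`
	Routes         []route     `json:"routes,omitempty"`
}

// withOrgRoute returns a copy of root with one child route that sends alerts
// labelled with orgID to receiver. An existing child for the same receiver is
// replaced, so repeated calls do not accumulate duplicates.
func withOrgRoute(root route, orgID, receiver string) route {
	out := root
	out.Routes = make([]route, 0, len(root.Routes)+1)
	for _, r := range root.Routes {
		if r.Receiver != receiver {
			out.Routes = append(out.Routes, r)
		}
	}
	out.Routes = append(out.Routes, route{
		Receiver:       receiver,
		ObjectMatchers: [][3]string{{"organization_id", "=", orgID}},
		Continue:       true, // keep evaluating so the internal receiver still fires
	})
	return out
}

func main() {
	root := route{Receiver: "protofleet-internal"}
	updated := withOrgRoute(root, "42", "ch-ops-slack")
	b, _ := json.MarshalIndent(updated, "", "  ")
	fmt.Println(string(b))
}
```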

return nil, err
}
// TestChannelRequest carries no name; the domain service derives one from the id when present.
dom, err := protoToChannel(req.Msg.GetId(), "", req.Msg.GetKind(), req.Msg.GetWebhook(), req.Msg.GetSmtp(), req.Msg.GetSlack())

P2 Badge Honor unsaved test definitions when an id is present

For TestChannelRequest payloads that include both an existing id and edited channel settings, passing the id through here makes Service.TestChannel take the saved-channel branch and ignore the supplied webhook/SMTP/Slack fields. That contradicts the proto contract that the unsaved definition wins when both are set, so an edit form that tests a changed destination before saving will send the probe to the old stored channel and can report success for settings that were never tested.

@illegalprime illegalprime force-pushed the eden/notifications-3-authz branch from 507a4e5 to 3bd492f Compare June 15, 2026 17:37
illegalprime added a commit that referenced this pull request Jun 15, 2026
The prod overlay left FLEET_METRICS_GRAFANA_PASSWORD empty so fleet-api never
held the Grafana admin credential, but that meant a default deployment had no
way to authenticate to Grafana and every Grafana-proxying notification RPC
401'd while the stack came up healthy.

Instead of defaulting fleet-api to the admin password, mint a least-privilege
Grafana service account (Editor on org 1) and token post-boot in run-fleet.sh,
persist it to .env, and recreate fleet-api so it authenticates with the scoped
token. The basic-auth password stays empty by default. Idempotent across
re-runs.

Addresses Codex security review finding (MEDIUM) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@illegalprime illegalprime force-pushed the eden/notifications-4-server branch from 6cdd7d2 to 6655a15 Compare June 15, 2026 17:37
@illegalprime

Contributor Author

Codex findings on this PR:

  • [MEDIUM] prod overlay defaults fleet-api to Grafana admin creds — fixed (commit 6655a15). Reverted the admin-password default back to empty and instead provision a least-privilege Grafana service-account token (Editor on org 1) post-boot in run-fleet.sh, persisted to .env, with fleet-api recreated to pick it up. Basic-auth password stays empty by default, so a token deployment never holds the admin credential. Idempotent across re-runs.
  • [MEDIUM] destination validation bypass / SSRF — not fixed inline; the durable fix is Grafana-side egress enforcement (the pre-flight DNS check in fleet-api isn't the real egress boundary). Tracked in Notifications: enforce Grafana egress restrictions (destination SSRF) #459.

⚠️ The run-fleet.sh token provisioning calls the live Grafana API and could not be end-to-end tested here — please validate on a real --enable-beta-notifications deploy.

illegalprime and others added 2 commits June 15, 2026 13:43
Adds the notifications domain (Grafana sidecar client, service with org
isolation and secret redaction, models), the Connect handlers for Channel/
Rule/Silence/History, and the main.go wiring. Marks channel mutations as
redacted + session-only in the interceptor config and adds Grafana +
destination-policy config to the metrics provider.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The prod overlay left FLEET_METRICS_GRAFANA_PASSWORD empty so fleet-api never
held the Grafana admin credential, but that meant a default deployment had no
way to authenticate to Grafana and every Grafana-proxying notification RPC
401'd while the stack came up healthy.

Instead of defaulting fleet-api to the admin password, mint a least-privilege
Grafana service account (Editor on org 1) and token post-boot in run-fleet.sh,
persist it to .env, and recreate fleet-api so it authenticates with the scoped
token. The basic-auth password stays empty by default. Idempotent across
re-runs.

Addresses Codex security review finding (MEDIUM) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6655a150e3

func (g *Grafana) UpdateContactPoint(ctx context.Context, uid string, cp GrafanaContactPoint) (*GrafanaContactPoint, error) {
var out GrafanaContactPoint
if err := g.do(ctx, http.MethodPut, "/api/v1/provisioning/contact-points/"+uid, cp, &out); err != nil {

P1 Badge Handle Grafana contact-point update ACK

When updating any saved channel against real Grafana, PUT /api/v1/provisioning/contact-points/:uid succeeds with a 202 Ack rather than an EmbeddedContactPoint. Decoding that ACK into GrafanaContactPoint leaves Settings empty (or returns EOF on an empty body), and UpdateChannel then passes it to contactPointToChannel, so a successful edit is reported as an RPC error and the UI cannot reliably update channels. Return/refetch the contact point or build the response from the request instead.

}
return GrafanaSilence{
StartsAt: sil.StartsAt,
EndsAt: sil.EndsAt,

P2 Badge Require or materialize a silence end time

When CreateSilence/UpdateSilence omits ends_at, protoToSilence leaves sil.EndsAt at the Go zero time; this copies that zero into the Alertmanager payload even though the Alertmanager silence schema requires endsAt. The service's own silenceActive treats zero as indefinite, so open-ended mutes accepted by this API shape are rejected by Grafana instead of being created; either reject missing ends_at or convert it to a real future expiration before posting.

@illegalprime illegalprime force-pushed the eden/notifications-3-authz branch from 3bd492f to 66e55db Compare June 15, 2026 17:44
@illegalprime illegalprime force-pushed the eden/notifications-4-server branch from 6655a15 to 753a5b8 Compare June 15, 2026 17:44
…Ls in parse errors

Two follow-ups from review:
- run-fleet.sh swallowed the post-token fleet-api restart (>/dev/null 2>&1
  || true), so a failed recreate left fleet-api on the pre-token env (still
  401ing) while the script reported success. Check the result and fail loudly
  with manual-restart instructions.
- checkDestinationURL returned url.Parse's error verbatim, which embeds the
  raw input — a webhook/Slack URL can carry a capability token. Return a
  generic 'not parseable' message so the secret can't leak via errors or logs.

Addresses Codex security review findings (MEDIUM) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@illegalprime

Contributor Author

Re the re-review (the first two are now fixed in commit 75185f6):

  • [MEDIUM] malformed webhook URLs leak tokens through errors — fixed. checkDestinationURL no longer returns url.Parse's error verbatim (it embeds the raw input, which can carry a capability token); returns a generic "not parseable" message instead.
  • [MEDIUM] token provisioning can strand fleet-api on hidden restart failure — fixed. run-fleet.sh no longer swallows the post-token fleet-api recreate (>/dev/null 2>&1 || true); it checks the result and fails loudly with manual-restart instructions.
  • [MEDIUM] TestChannel validates stored channel instead of edited definition — by design (documented at the call site): for a saved channel the echoed-back payload is redacted, so testing the request body would probe a secret-stripped target; we test the stored settings instead. New/unsaved channels are tested from the entered payload directly (AddChannelModal → testChannelApi). Not changing this.
  • [HIGH] DNS-rebinding SSRF — unchanged: tracked in Notifications: enforce Grafana egress restrictions (destination SSRF) #459 (needs Grafana-side egress enforcement, not a fleet-api pre-flight fix).
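For context on the first item: the error returned by Go's `url.Parse` is a `*url.Error` whose message quotes the input, which is why returning it verbatim can leak a token embedded in the URL. The sketch below shows the generic-message pattern; `parseDestination` is a hypothetical name, not the PR's `checkDestinationURL`, and the scheme check is an extra assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

var errDestinationNotParseable = errors.New("destination URL is not parseable")

// parseDestination parses raw and never includes raw, or the parser's own
// error text, in the returned error.
func parseDestination(raw string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, errDestinationNotParseable
	}
	if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return nil, errDestinationNotParseable
	}
	return u, nil
}

func main() {
	raw := "https://hooks.example.com/%zz?token=s3cret" // invalid percent-escape
	_, verbatim := url.Parse(raw)
	fmt.Println("verbatim:", verbatim) // message quotes the raw input, token included
	_, generic := parseDestination(raw)
	fmt.Println("generic: ", generic)
}
```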

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 75185f6639

Comment on lines +804 to +807
"smtpHost": c.SMTP.Host,
"smtpPort": c.SMTP.Port,
"smtpUsername": c.SMTP.Username,
"fromAddress": c.SMTP.From,

P1 Badge Don't store SMTP server settings on email contact points

For SMTP channels, these per-channel smtpHost/smtpPort/credential fields are written into the Grafana email contact point, but Grafana OSS reads SMTP server configuration from [smtp]/GF_SMTP_*, not from the contact point; the Grafana email docs require SMTP to be configured in grafana.ini and list the email contact-point settings as addresses/single-email/template fields (https://grafana.com/docs/grafana/latest/alerting/configure-notifications/manage-contact-points/integrations/configure-email/). Since the bundled server/monitoring/grafana/grafana.ini has no [smtp] section, creating/testing an SMTP channel with a host/password here can appear saved while real email delivery still uses an unconfigured global SMTP transport and fails.

return nil, ErrNotFound
}
sil.OrganizationID = orgID
gs := domainSilenceToGrafana(orgID, sil)

P2 Badge Preserve the silence creator on update

When updating an existing silence, UpdateSilenceRequest has no created_by, so sil.CreatedBy is still empty when this payload is built. For any edit of a silence created by a user, the POST replaces the Grafana silence with an empty createdBy, so subsequent reads lose the audit/display owner (and may be rejected if Grafana enforces the Alertmanager field); carry e.CreatedBy from the matched existing silence or stamp the updater before calling domainSilenceToGrafana.
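A minimal sketch of carrying the creator across an update is shown below. The `silence` struct and `mergeForUpdate` are hypothetical simplifications; in the PR the equivalent step would run before `domainSilenceToGrafana` is called.

```go
package main

import "fmt"

type silence struct {
	ID        string
	CreatedBy string
	Comment   string
}

// mergeForUpdate copies the incoming edit onto the existing silence's
// identity. The creator is kept from the existing record; if that record has
// none, the updater is stamped so the field is never sent empty.
func mergeForUpdate(existing, incoming silence, updater string) silence {
	out := incoming
	out.ID = existing.ID
	switch {
	case existing.CreatedBy != "":
		out.CreatedBy = existing.CreatedBy
	case incoming.CreatedBy == "":
		out.CreatedBy = updater
	}
	return out
}

func main() {
	existing := silence{ID: "sil-1", CreatedBy: "alice", Comment: "maintenance"}
	edit := silence{Comment: "maintenance, extended"}
	fmt.Printf("%+v\n", mergeForUpdate(existing, edit, "bob"))
}
```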
