notifications (4/5): Grafana-backed service, handlers, and server wiring#455
notifications (4/5): Grafana-backed service, handlers, and server wiring#455illegalprime wants to merge 3 commits into
Conversation
🔐 Codex Security Review
Review SummaryOverall Risk: HIGH Findings[HIGH] Created notification channels are not used for real alert delivery
[HIGH] Destination validation is bypassable by DNS rebinding or redirects
[MEDIUM]
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8c55e88d7a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| FLEET_METRICS_GRAFANA_TOKEN: "${FLEET_METRICS_GRAFANA_TOKEN:-}" | ||
| # Basic-auth fallback; defaults empty so a token deployment never holds the Grafana admin credential. | ||
| FLEET_METRICS_GRAFANA_USER: "${FLEET_METRICS_GRAFANA_USER:-admin}" | ||
| FLEET_METRICS_GRAFANA_PASSWORD: "${FLEET_METRICS_GRAFANA_PASSWORD:-}" |
There was a problem hiding this comment.
Pass the generated Grafana password to fleet-api
In the documented production path (run-fleet.sh --enable-beta-notifications), the script generates GRAFANA_ADMIN_PASSWORD but does not set FLEET_METRICS_GRAFANA_TOKEN or FLEET_METRICS_GRAFANA_PASSWORD; this overlay then passes an empty password while GrafanaConfig falls back to basic auth as admin. The stack therefore starts with fleet-api unable to authenticate to the Grafana sidecar, so the newly wired channel/rule/silence APIs return Grafana 401s unless an operator manually discovers and sets an extra env var.
Useful? React with 👍 / 👎.
| } | ||
| case ChannelKindSlack: | ||
| // Slack URLs are write-only secrets: an edit without a fresh URL keeps the stored destination, a fresh URL is both a destination change and a new secret. | ||
| keepStoredSlackURL = c.Slack == nil || c.Slack.WebhookURL == "" |
There was a problem hiding this comment.
Require a Slack URL when changing channel kind
When an existing non-Slack channel is updated to SLACK with no webhook_url, this treats the missing URL as “keep the stored Slack URL” without checking owned.Kind; validation is skipped and carrySecretSettings later copies the old settings url into the new Slack contact point. That can silently reuse a webhook destination as a Slack secret when changing kinds, so only preserve an empty Slack URL for existing Slack channels and require a fresh URL otherwise.
Useful? React with 👍 / 👎.
| case SilenceScopeGroup: | ||
| if sil.Scope.GroupID != "" { | ||
| matchers = append(matchers, GrafanaSilenceMatcher{ | ||
| Name: "group_id", |
There was a problem hiding this comment.
Match group silences to labels alerts actually emit
A group-scoped silence is accepted but written with a group_id matcher; none of the bundled Grafana rules emit a group_id label (the static rule label is rule_group, while the metrics contract uses device_group). In the current notification stack this silence will look active in the API but will never match firing alerts, so group mutes do not stop notifications.
Useful? React with 👍 / 👎.
| case SilenceScopeSite: | ||
| if sil.Scope.SiteID != "" { | ||
| matchers = append(matchers, GrafanaSilenceMatcher{ | ||
| Name: "site_id", |
There was a problem hiding this comment.
Include site labels before accepting site silences
Site-scoped silences are written with a site_id matcher, but the provisioned Grafana alert queries in proto-fleet-rules.yaml select only organization_id/device_id for device alerts and only organization_id for fleet alerts, so alert instances do not carry site_id. In that deployment a site silence is stored and returned as active but cannot match any alert, causing notifications for the site to continue firing.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6cdd7d2724
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| Type: grafanaTypeFor(c.Kind), | ||
| Settings: settings, | ||
| } | ||
| created, err := s.grafana.CreateContactPoint(ctx, cp) |
There was a problem hiding this comment.
Route created channels into Grafana notification policies
When a user creates a Slack/SMTP/webhook channel, this only creates a contact point; the provisioned policy tree in server/monitoring/grafana/provisioning/alerting/notification-policies.yaml still has a single root receiver protofleet-internal and there is no policy API update in this service. In that deployed configuration Grafana will keep routing alerts only to the internal webhook, so newly created channels can be listed and tested but will never receive real alert notifications unless an operator manually edits Grafana routing.
Useful? React with 👍 / 👎.
| return nil, err | ||
| } | ||
| // TestChannelRequest carries no name; the domain service derives one from the id when present. | ||
| dom, err := protoToChannel(req.Msg.GetId(), "", req.Msg.GetKind(), req.Msg.GetWebhook(), req.Msg.GetSmtp(), req.Msg.GetSlack()) |
There was a problem hiding this comment.
Honor unsaved test definitions when an id is present
For TestChannelRequest payloads that include both an existing id and edited channel settings, passing the id through here makes Service.TestChannel take the saved-channel branch and ignore the supplied webhook/SMTP/Slack fields. That contradicts the proto contract that the unsaved definition wins when both are set, so an edit form that tests a changed destination before saving will send the probe to the old stored channel and can report success for settings that were never tested.
Useful? React with 👍 / 👎.
507a4e5 to
3bd492f
Compare
The prod overlay left FLEET_METRICS_GRAFANA_PASSWORD empty so fleet-api never held the Grafana admin credential, but that meant a default deployment had no way to authenticate to Grafana and every Grafana-proxying notification RPC 401'd while the stack came up healthy. Instead of defaulting fleet-api to the admin password, mint a least-privilege Grafana service account (Editor on org 1) and token post-boot in run-fleet.sh, persist it to .env, and recreate fleet-api so it authenticates with the scoped token. The basic-auth password stays empty by default. Idempotent across re-runs. Addresses Codex security review finding (MEDIUM) on #455. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6cdd7d2 to
6655a15
Compare
|
Codex findings on this PR:
|
Adds the notifications domain (Grafana sidecar client, service with org isolation and secret redaction, models), the Connect handlers for Channel/ Rule/Silence/History, and the main.go wiring. Marks channel mutations as redacted + session-only in the interceptor config and adds Grafana + destination-policy config to the metrics provider. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The prod overlay left FLEET_METRICS_GRAFANA_PASSWORD empty so fleet-api never held the Grafana admin credential, but that meant a default deployment had no way to authenticate to Grafana and every Grafana-proxying notification RPC 401'd while the stack came up healthy. Instead of defaulting fleet-api to the admin password, mint a least-privilege Grafana service account (Editor on org 1) and token post-boot in run-fleet.sh, persist it to .env, and recreate fleet-api so it authenticates with the scoped token. The basic-auth password stays empty by default. Idempotent across re-runs. Addresses Codex security review finding (MEDIUM) on #455. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6655a150e3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
|
||
| func (g *Grafana) UpdateContactPoint(ctx context.Context, uid string, cp GrafanaContactPoint) (*GrafanaContactPoint, error) { | ||
| var out GrafanaContactPoint | ||
| if err := g.do(ctx, http.MethodPut, "/api/v1/provisioning/contact-points/"+uid, cp, &out); err != nil { |
There was a problem hiding this comment.
Handle Grafana contact-point update ACK
When updating any saved channel against real Grafana, PUT /api/v1/provisioning/contact-points/:uid succeeds with a 202 Ack rather than an EmbeddedContactPoint. Decoding that ACK into GrafanaContactPoint leaves Settings empty (or returns EOF on an empty body), and UpdateChannel then passes it to contactPointToChannel, so a successful edit is reported as an RPC error and the UI cannot reliably update channels. Return/refetch the contact point or build the response from the request instead.
Useful? React with 👍 / 👎.
| } | ||
| return GrafanaSilence{ | ||
| StartsAt: sil.StartsAt, | ||
| EndsAt: sil.EndsAt, |
There was a problem hiding this comment.
Require or materialize a silence end time
When CreateSilence/UpdateSilence omits ends_at, protoToSilence leaves sil.EndsAt at the Go zero time; this copies that zero into the Alertmanager payload even though the Alertmanager silence schema requires endsAt. The service's own silenceActive treats zero as indefinite, so open-ended mutes accepted by this API shape are rejected by Grafana instead of being created; either reject missing ends_at or convert it to a real future expiration before posting.
Useful? React with 👍 / 👎.
3bd492f to
66e55db
Compare
6655a15 to
753a5b8
Compare
…Ls in parse errors Two follow-ups from review: - run-fleet.sh swallowed the post-token fleet-api restart (>/dev/null 2>&1 || true), so a failed recreate left fleet-api on the pre-token env (still 401ing) while the script reported success. Check the result and fail loudly with manual-restart instructions. - checkDestinationURL returned url.Parse's error verbatim, which embeds the raw input — a webhook/Slack URL can carry a capability token. Return a generic 'not parseable' message so the secret can't leak via errors or logs. Addresses Codex security review findings (MEDIUM) on #455. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Re the re-review (the first two are now fixed in commit 75185f6):
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 75185f6639
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| "smtpHost": c.SMTP.Host, | ||
| "smtpPort": c.SMTP.Port, | ||
| "smtpUsername": c.SMTP.Username, | ||
| "fromAddress": c.SMTP.From, |
There was a problem hiding this comment.
Don't store SMTP server settings on email contact points
For SMTP channels, these per-channel smtpHost/smtpPort/credential fields are written into the Grafana email contact point, but Grafana OSS reads SMTP server configuration from [smtp]/GF_SMTP_*, not from the contact point; the Grafana email docs require SMTP to be configured in grafana.ini and list the email contact-point settings as addresses/single-email/template fields (https://grafana.com/docs/grafana/latest/alerting/configure-notifications/manage-contact-points/integrations/configure-email/). Since the bundled server/monitoring/grafana/grafana.ini has no [smtp] section, creating/testing an SMTP channel with a host/password here can appear saved while real email delivery still uses an unconfigured global SMTP transport and fails.
Useful? React with 👍 / 👎.
| return nil, ErrNotFound | ||
| } | ||
| sil.OrganizationID = orgID | ||
| gs := domainSilenceToGrafana(orgID, sil) |
There was a problem hiding this comment.
Preserve the silence creator on update
When updating an existing silence, UpdateSilenceRequest has no created_by, so sil.CreatedBy is still empty when this payload is built. For any edit of a silence created by a user, the POST replaces the Grafana silence with an empty createdBy, so subsequent reads lose the audit/display owner (and may be rejected if Grafana enforces the Alertmanager field); carry e.CreatedBy from the matched existing silence or stamp the updater before calling domainSilenceToGrafana.
Useful? React with 👍 / 👎.
Stack 4/5 — base: `eden/notifications-3-authz`
Wires up the server side end-to-end.
Stack
🤖 Generated with Claude Code