118 changes: 118 additions & 0 deletions docs/Alerts & Notifications/Alert Configuration Reference.mdx
@@ -1382,6 +1382,124 @@ template: ml_5min_node
<br/>
</details><br/>

<details>
<summary><strong>Example 8: Boolean / Binary Metric Alerting</strong></summary><br/>

**Scenario:** Monitor a boolean 0/1 health-check gauge and choose the right aggregation method for your alerting intent.

**Why This Matters:** Boolean metrics require different aggregation strategies depending on whether you need to detect any single failure, confirm a sustained outage, or check the current state. Choosing the wrong method leads to missed alerts or alert noise.

**Approach 1: Detect Any Failure Event (average)**

```text
alarm: service_failure_event
on: my_service.health_status
lookup: average -10s of health_status
every: 10s
warn: $this > 0
info: any failure detected in the last 10 seconds
to: sysadmin
```

Use when the metric acts as a failure indicator — the value is 0 normally and 1 when a failure occurs. `average` over a short window naturally reflects any non-zero sample: if the metric was 1 at any point, the average will be greater than 0. This is the same pattern used by Netdata's Docker container health monitoring (`average -10s of unhealthy`, `warn: $this > 0`).

:::note

Do not use `sum` for boolean 0/1 gauges. `sum -5m unaligned absolute` would technically detect failures, since any non-zero sample makes the sum positive, but the result is a count of samples spent in state 1 (seconds, at a 1-second collection interval). That count grows with the window length, so it is awkward to threshold. Reserve `sum` for counter/cumulative metrics such as packet drops or error totals; see [Example 4: Network Packet Drops](#example-4-network-packet-drops) for a correct `sum` use case.

:::
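The arithmetic behind Approach 1 can be checked outside Netdata. The following is a minimal Python sketch, not Netdata code: it assumes one sample per second, and the sample windows and the `window_average` helper are hypothetical names made up for illustration.

```python
def window_average(samples):
    """Mean of the samples in a lookup window (what `average` reduces to)."""
    return sum(samples) / len(samples)


# Hypothetical 10-second windows of a 0/1 failure gauge, one sample per second.
quiet = [0] * 10
one_blip = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

for name, window in (("quiet", quiet), ("one_blip", one_blip)):
    avg = window_average(window)
    # `warn: $this > 0` is true as soon as a single sample is non-zero.
    print(f"{name}: average={avg} sum={sum(window)} warn={avg > 0}")
```

Under these assumptions a single failed sample is enough to push the average above 0. The `sum` column shows the raw count of failing samples, which would keep growing if the window were made longer.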

**Approach 2: Detect Any Downtime (min) or Continuous Outage (max)**

```text
alarm: service_any_downtime
on: my_service.health_status
lookup: min -5m unaligned
every: 10s
crit: $this == 0
info: metric dropped to 0 at some point in the last 5 minutes
to: sysadmin
```

Use when the metric is 1 = healthy and 0 = unhealthy. `min` returns the lowest value in the window — if the metric dropped to 0 at any point, the alert fires. This catches even brief outages.

For the stricter check of **continuous outage** (metric was never 1), use `max`:

```text
alarm: service_continuous_outage
on: my_service.health_status
lookup: max -5m unaligned
every: 10s
crit: $this == 0
info: service was down for the entire last 5 minutes
to: sysadmin
```

`max` returns the highest value in the window. If `max == 0`, the metric never reached 1 — the service was down the entire time.
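The difference between the two checks is easiest to see on concrete windows. This is a small Python sketch under the assumption of a 1 = healthy, 0 = down gauge; the helper names and sample data are hypothetical and only mirror the `crit: $this == 0` conditions above.

```python
def any_downtime(samples):
    """Mirrors `lookup: min ...` with `crit: $this == 0`."""
    return min(samples) == 0


def continuous_outage(samples):
    """Mirrors `lookup: max ...` with `crit: $this == 0`."""
    return max(samples) == 0


# Hypothetical windows of a 1 = healthy, 0 = down gauge.
windows = {
    "all_healthy": [1, 1, 1, 1, 1],
    "brief_dip": [1, 1, 0, 1, 1],
    "full_outage": [0, 0, 0, 0, 0],
}

for name, samples in windows.items():
    print(name, any_downtime(samples), continuous_outage(samples))
```

A brief dip triggers only the `min` check, while a full outage triggers both, so the `max` variant is the quieter of the two alerts.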

**Approach 3: Measure Failure Rate (average)**

```text
alarm: service_failure_rate
on: my_service.health_status
lookup: average -5m unaligned of health_status
every: 1m
warn: $this > 0.1
crit: $this > 0.5
info: failure rate exceeded threshold over the last 5 minutes
to: sysadmin
```

When the metric is 0 = healthy and 1 = failure, `average` over the window returns a value between 0.0 and 1.0 representing the fraction of time spent in failure. `warn: $this > 0.1` fires when the service was failing more than 10% of the time, and `crit: $this > 0.5` fires when failures exceeded half the window. This is useful for SLO-style alerting where occasional failures are acceptable.
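As a worked example of the thresholds, here is a rough Python sketch. It assumes a 5-minute window sampled once per second (300 samples) and collapses the separate `warn` and `crit` expressions into one function for readability; `failure_rate_status` and the sample counts are hypothetical.

```python
def failure_rate_status(samples, warn=0.1, crit=0.5):
    """Classify a window of 0/1 failure samples by its average."""
    rate = sum(samples) / len(samples)
    if rate > crit:
        return "CRITICAL"
    if rate > warn:
        return "WARNING"
    return "CLEAR"


def window_with_failures(failed, total=300):
    """Build a hypothetical window with `failed` failing samples out of `total`."""
    return [1] * failed + [0] * (total - failed)


for failed in (20, 45, 200):
    status = failure_rate_status(window_with_failures(failed))
    print(f"{failed}/300 failing samples -> {status}")
```

With these numbers, 20 failing seconds (about 6.7%) stays clear, 45 (15%) warns, and 200 (about 67%) is critical.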

:::note

**Note on `percentage`:** The `percentage` option calculates each dimension's share of the chart total — it is designed for multi-dimension charts like `system.ram` (see [Task 3: Create a Simple Alert](#task-3-create-a-simple-alert): `lookup: average -1m percentage of used`). For a single-dimension boolean gauge, `percentage` always returns 100. Use plain `average` and compare against 0.0–1.0 thresholds instead.

:::

**Approach 4: Instant State Check (calc, no lookup)**

```text
alarm: service_current_state
on: my_service.health_status
calc: $health_status
every: 10s
crit: $this == 0
info: service is currently down
to: sysadmin
delay: down 5m
```

Use to check only the current value without time-window aggregation. The `calc: $health_status` references the chart dimension directly — no `lookup` needed. Note that `$status` is a built-in alert variable (the alert's own status code, −2 to 3) and must not be used here; use the dimension name instead (e.g. `$health_status` for a dimension named `health_status`). The `delay: down 5m` debounces recovery notifications, requiring the alert to stay clear for 5 minutes before sending recovery. This is the same pattern used in `health.d/timex.conf` for clock sync state monitoring (`calc: $state`).

**Comparison: Which Method to Use**

| Intent | Method | Lookup / Calc | Condition | Fires When |
| ------------------------------------------------------- | ------ | -------------------------- | ------------ | ----------------------------------------- |
| Any failure event (metric is 0 normally, 1 on failure) | `average` | `average -10s of health_status` | `$this > 0` | Metric was non-zero at any point in the window |
| Any downtime (metric is 1=healthy, 0=down) | `min` | `min -5m unaligned` | `$this == 0` | Metric hit 0 at any point in the window |
| Continuous outage (metric is 1=healthy, 0=down) | `max` | `max -5m unaligned` | `$this == 0` | Metric was 0 for the entire window |
| Failure rate over time | `average` | `average -5m unaligned of health_status` | `$this > 0.N` | Failure fraction exceeds threshold (0.0–1.0) |
| Current state only | `calc` | `calc: $health_status` (no lookup) | `$this == 0` | Current value is 0 (debounce with delay) |

For a full list of available lookup methods and processing options (`average`, `min`, `max`, `sum`, `percentage`, `absolute`, etc.), see the [Alert Line `lookup`](#alert-line-lookup) section.

**Key Points:**

- Boolean 0/1 metrics work with all standard lookup methods — the choice depends on your alerting intent
- Use `average` over a short window for failure detection (`average -10s of <dimension>`, `warn: $this > 0`) — the same pattern Netdata uses in its own health configs (e.g., `health.d/docker.conf`)
- Use `min` for "was it ever down?" and `max` for "was it continuously down?"
- Use `average` for SLO-style failure-rate alerting (returns 0.0–1.0 fraction of time in failure state; compare against decimal thresholds)
- Use `calc` without `lookup` for instant state checks, combined with `delay` for debouncing
- Avoid `sum` on boolean gauges: it produces a count of samples in state 1 (seconds, at a 1-second collection interval) that grows with the window length, not an intuitive threshold. Use `sum` only for counter/cumulative metrics (e.g., total packet drops in a time window)

**Variables Used:**

- `$this` — Result of the `lookup` or `calc` expression
- `$health_status` — Dimension value from the chart (used in the `calc` approach; the variable name matches the dimension name, e.g. `health_status`)

<br/>
</details><br/>

**Next Steps:** Having trouble with your alerts? Continue to [Troubleshooting](#troubleshooting) for debugging techniques.

## Troubleshooting
14 changes: 9 additions & 5 deletions docs/Collecting Metrics/Collectors/Collectors.mdx
@@ -254,6 +254,10 @@ import { Grid, Box } from '@site/src/components/Grid_integrations';
<img custom-image src="https://netdata.cloud/img/cassandra.svg" style={{width: '90%', maxHeight: '100%', verticalAlign: 'middle' }} data-integration-logo="true" data-logo-contrast-light="unknown" data-logo-contrast-dark="unknown" data-logo-contrast-confidence="low"/>
</Box>

<Box banner="by Netdata" banner_color="#00ab44" to="/docs/collecting-metrics/collectors/networking/cato-networks" title="Cato Networks">
<img custom-image src="https://netdata.cloud/img/network-wired.svg" style={{width: '90%', maxHeight: '100%', verticalAlign: 'middle' }} data-integration-logo="true" data-logo-contrast-light="unknown" data-logo-contrast-dark="unknown" data-logo-contrast-confidence="low"/>
</Box>

<Box banner="by Netdata" banner_color="#00ab44" to="/docs/collecting-metrics/collectors/storage-and-filesystems/ceph" title="Ceph">
<img custom-image src="https://netdata.cloud/img/ceph.svg" style={{width: '90%', maxHeight: '100%', verticalAlign: 'middle' }} data-integration-logo="true" data-logo-contrast-light="unknown" data-logo-contrast-dark="unknown" data-logo-contrast-confidence="low"/>
</Box>
@@ -810,10 +814,6 @@ import { Grid, Box } from '@site/src/components/Grid_integrations';
<img custom-image src="https://netdata.cloud/img/netfilter.png" style={{width: '90%', maxHeight: '100%', verticalAlign: 'middle' }} data-integration-logo="true" data-logo-contrast-light="low" data-logo-contrast-dark="ok" data-logo-contrast-confidence="high"/>
</Box>

<Box banner="by Netdata" banner_color="#00ab44" to="/docs/collecting-metrics/collectors/networking/network-connections" title="Network Connections">
<img custom-image src="https://netdata.cloud/img/network.svg" style={{width: '90%', maxHeight: '100%', verticalAlign: 'middle' }} data-integration-logo="true" data-logo-contrast-light="unknown" data-logo-contrast-dark="unknown" data-logo-contrast-confidence="low"/>
</Box>

<Box banner="by Netdata" banner_color="#00ab44" to="/docs/collecting-metrics/collectors/networking/network-interfaces" title="Network interfaces">
<img custom-image src="https://netdata.cloud/img/network-wired.svg" style={{width: '90%', maxHeight: '100%', verticalAlign: 'middle' }} data-integration-logo="true" data-logo-contrast-light="unknown" data-logo-contrast-dark="unknown" data-logo-contrast-confidence="low"/>
</Box>
@@ -1123,7 +1123,7 @@ import { Grid, Box } from '@site/src/components/Grid_integrations';
</Box>

<Box banner="by Netdata" banner_color="#00ab44" to="/docs/collecting-metrics/collectors/operating-systems/system-statistics" title="System statistics">
<img custom-image src="https://netdata.cloud/img/linuxserver.svg" style={{width: '90%', maxHeight: '100%', verticalAlign: 'middle' }} data-integration-logo="true" data-logo-contrast-light="unknown" data-logo-contrast-dark="unknown" data-logo-contrast-confidence="low"/>
<img custom-image src="https://netdata.cloud/img/windows.svg" style={{width: '90%', maxHeight: '100%', verticalAlign: 'middle' }} data-integration-logo="true" data-logo-contrast-light="ok" data-logo-contrast-dark="ok" data-logo-contrast-confidence="medium"/>
</Box>

<Box banner="by Netdata" banner_color="#00ab44" to="/docs/collecting-metrics/collectors/hardware-and-sensors/system-thermal-zone" title="System thermal zone">
@@ -1694,6 +1694,10 @@ import { Grid, Box } from '@site/src/components/Grid_integrations';
<img custom-image src="https://netdata.cloud/img/netatmo.svg" style={{width: '90%', maxHeight: '100%', verticalAlign: 'middle' }} data-integration-logo="true" data-logo-contrast-light="low" data-logo-contrast-dark="ok" data-logo-contrast-confidence="medium"/>
</Box>

<Box banner="by Community" banner_color="rgba(0, 0, 0, 0.25)" to="/docs/collecting-metrics/collectors/networking/network-connections" title="Network Connections">

</Box>

<Box banner="by Community" banner_color="rgba(0, 0, 0, 0.25)" to="/docs/collecting-metrics/collectors/applications/nextcloud-servers" title="Nextcloud servers">
<img custom-image src="https://netdata.cloud/img/nextcloud.png" style={{width: '90%', maxHeight: '100%', verticalAlign: 'middle' }} data-integration-logo="true" data-logo-contrast-light="ok" data-logo-contrast-dark="ok" data-logo-contrast-confidence="high"/>
</Box>