|
| 1 | +--- |
| 2 | +title: "Prometheus Alert Rule Configuration" |
| 3 | +description: "This document provides detailed instructions on configuring alert rules for Prometheus data sources in Monitors. Monitors is compatible with Prometheus PromQL query syntax and offers flexible alert evaluation and recovery mechanisms to meet various monitoring requirements." |
| 4 | +date: "2025-12-30T18:21:50.775+08:00" |
| 5 | +url: "https://docs.flashcat.cloud/en/flashduty/monitors/prometheus-alert-rules" |
| 6 | +--- |
| 7 | + |
| 8 | +This document provides detailed instructions on configuring alert rules for Prometheus data sources in Monitors. Monitors is compatible with Prometheus PromQL query syntax and offers flexible alert evaluation and recovery mechanisms to meet various monitoring requirements. |
| 9 | + |
| 10 | +## Core concepts |
| 11 | + |
| 12 | +Before configuring alert rules, it's essential to understand how Monitors processes Prometheus data. The alert engine supports three core evaluation modes: |
| 13 | + |
| 14 | +1. **Threshold**: The query returns raw metric values, and the alert engine performs threshold comparisons in memory. |
| 15 | +2. **Data exists**: The query itself contains filter conditions and only returns anomalous data. The engine triggers an alert when data is found. |
| 16 | +3. **No data**: Used to monitor scenarios where data reporting is interrupted. |
| 17 | + |
| 18 | +--- |
| 19 | + |
| 20 | +## 1. Threshold mode |
| 21 | + |
| 22 | +This mode is suitable for scenarios requiring multi-level alerts on the same metric (e.g., Info/Warning/Critical) or when precise recovery values are needed. |
| 23 | + |
| 24 | +### Configuration |
| 25 | + |
| 26 | +- **Query (PromQL)**: Write a PromQL query **without** comparison operators so that it returns raw metric values. |
| 27 | + - Example: Query memory usage |
| 28 | + ```promql |
| 29 | + (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 |
| 30 | + ``` |
| 31 | +- **Threshold conditions**: Define threshold expressions for different severity levels in the rule configuration. The variable `$A` represents the query result value. |
| 32 | + - **Critical**: `$A > 90` (triggers critical alert when memory usage exceeds 90%) |
| 33 | + - **Warning**: `$A > 80` (triggers warning alert when memory usage exceeds 80%) |
| 34 | +
|
| 35 | +### Multiple queries and data correlation |
| 36 | +
|
| 37 | +Monitors supports configuring multiple query statements in a single alert rule (named A, B, C...) and referencing these query results simultaneously in threshold expressions (e.g., `$A > 90 and $B < 50`). |
| 38 | +
|
| 39 | +- **Auto join**: The alert engine automatically correlates results from different queries based on **labels**. |
| 40 | +- **Alignment requirements**: Two queries can only be correlated within the same context when their returned data contains **exactly the same** label sets. |
| 41 | + - Example: If query A returns `cpu_usage_percent{instance="host-1", job="node"}` and query B returns `mem_usage_percent{instance="host-1", job="node"}`, then `$A` and `$B` can be successfully correlated. |
| 42 | + - Note: If query A has an additional label (e.g., `disk="/"`), but query B does not, they cannot be correlated. It's recommended to use aggregation operations like `sum by (...)` or `avg by (...)` in PromQL to explicitly control returned labels and ensure label consistency across multiple queries. |
| 43 | +
|
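The alignment rule above can be illustrated with a minimal Python sketch. It is not the engine's implementation: the function names and the `(labels, value)` data shape are hypothetical, and it assumes the metric name is not part of the label set being compared.

```python
from typing import Dict, FrozenSet, List, Tuple

LabelKey = FrozenSet[Tuple[str, str]]
Sample = Tuple[Dict[str, str], float]  # (labels, value); hypothetical shape


def label_key(labels: Dict[str, str]) -> LabelKey:
    """Turn a label dict into a hashable key; order of labels does not matter."""
    return frozenset(labels.items())


def join_by_labels(a: List[Sample], b: List[Sample]) -> Dict[LabelKey, Tuple[float, float]]:
    """Pair values from query A and query B only when label sets are identical."""
    b_index = {label_key(labels): value for labels, value in b}
    joined = {}
    for labels, value in a:
        key = label_key(labels)
        if key in b_index:
            joined[key] = (value, b_index[key])
    return joined


query_a = [
    ({"instance": "host-1", "job": "node"}, 95.0),
    # Extra "disk" label: this series has no exact match in query B.
    ({"instance": "host-2", "job": "node", "disk": "/"}, 97.0),
]
query_b = [
    ({"instance": "host-1", "job": "node"}, 40.0),
    ({"instance": "host-2", "job": "node"}, 30.0),
]

joined = join_by_labels(query_a, query_b)
# Evaluate the expression `$A > 90 and $B < 50` on each correlated pair.
firing = [dict(key) for key, (a_val, b_val) in joined.items() if a_val > 90 and b_val < 50]
```

Here only `host-1` is correlated and evaluated. The `host-2` series is dropped because of its extra `disk` label, which is the case that `sum by (...)` or `avg by (...)` is meant to prevent.
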
| 44 | +### How it works |
| 45 | +
|
| 46 | +The alert engine periodically executes PromQL and retrieves current values for all Series. Then it iterates through each Series, matching Critical, Warning, and Info conditions in order. Once a condition is met, an alert event of the corresponding severity is generated. |
| 47 | +
|
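As a rough sketch of the evaluation loop described above, the following Python snippet checks each Series against severity conditions in order. The names `SEVERITY_ORDER` and `evaluate` are hypothetical, and stopping at the first matching severity is an assumption about the engine's behavior rather than something this document states outright.

```python
from typing import Callable, Dict, Optional

# Conditions are checked from most to least severe.
SEVERITY_ORDER = ["Critical", "Warning", "Info"]


def evaluate(value: float, conditions: Dict[str, Callable[[float], bool]]) -> Optional[str]:
    """Return the first severity whose condition matches, or None if none match."""
    for severity in SEVERITY_ORDER:
        check = conditions.get(severity)
        if check is not None and check(value):
            return severity
    return None


# Equivalent of the expressions `$A > 90` (Critical) and `$A > 80` (Warning).
conditions = {
    "Critical": lambda a: a > 90,
    "Warning": lambda a: a > 80,
}

# One current value per Series, keyed by a simplified Series identifier.
series = {"host-1": 95.0, "host-2": 85.0, "host-3": 42.0}
events = {name: evaluate(value, conditions) for name, value in series.items()}
```

With these inputs, `host-1` maps to `Critical`, `host-2` to `Warning`, and `host-3` to `None` (no alert).
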
| 48 | +### Recovery logic |
| 49 | +
|
| 50 | +Multiple recovery strategies are supported: |
| 51 | +
|
| 52 | +- **Auto recovery**: When the latest query result no longer meets any alert threshold, a recovery event is automatically generated. |
| 53 | +- **Specific recovery conditions**: Configure additional recovery expressions (e.g., `$A < 75`). Recovery is only confirmed when the value falls back to the specified level, preventing frequent flapping near the threshold. |
| 54 | +- **Recovery query**: Allows users to define a custom PromQL query for recovery evaluation. |
| 55 | + - **Principle**: After an alert is triggered, the engine periodically executes this recovery query. If the query returns data (i.e., the result is not empty), the incident is considered recovered. |
| 56 | + - **Variable support**: The recovery query supports embedded variables (in the format `${label_name}`), which are automatically replaced with corresponding label values from the alert event. This enables the recovery query to perform precise detection for specific alert objects (such as a specific `instance` or `device`). |
| 57 | +
|
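To show why a specific recovery condition reduces flapping, here is a small Python sketch of the resulting hysteresis. The function `next_state` and the sample values are illustrative assumptions; the thresholds mirror the `$A > 90` trigger and `$A < 75` recovery expressions used in this section.

```python
def next_state(firing: bool, value: float, trigger: float = 90.0, recover: float = 75.0) -> bool:
    """Return whether the alert is firing after seeing the latest value."""
    if not firing:
        # Not firing yet: only the trigger condition ($A > 90) can start an alert.
        return value > trigger
    # Already firing: stay firing until the recovery condition ($A < 75) holds.
    return not (value < recover)


state = False
history = []
for value in [85, 92, 88, 91, 80, 74, 85]:
    state = next_state(state, value)
    history.append(state)
```

The alert starts at 92 and stays active through 88, 91, and 80, even though some of those values are below the trigger threshold. It recovers only at 74, so `history` ends up as `[False, True, True, True, True, False, False]`.
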
| 58 | +--- |
| 59 | +
|
| 60 | +## 2. Data exists mode |
| 61 | +
|
| 62 | +This mode behaves consistently with Prometheus native alerting rules. It's suitable for users who prefer defining thresholds directly in PromQL or scenarios requiring high-performance processing of large numbers of Series. |
| 63 | +
|
| 64 | +### Configuration |
| 65 | +
|
| 66 | +- **Query (PromQL)**: Write a PromQL query **with** comparison operators so that it returns only anomalous data. |
| 67 | + - Example: Query nodes with memory usage exceeding 90% |
| 68 | + ```promql |
| 69 | + (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90 |
| 70 | + ``` |
| 71 | +- **Evaluation rule**: No additional threshold expressions are needed. The engine triggers an alert as soon as it gets query results. The number of alert events equals the number of data rows returned. |
| 72 | +
|
| 73 | +### Recovery logic |
| 74 | +
|
| 75 | +- **Data disappearance means recovery**: The engine queries periodically. If a Series that previously triggered an alert no longer appears in the query results, the engine determines that the corresponding alert has recovered. Note: Series are identified by their label sets. |
| 76 | +- **Recovery query (optional)**: Configure an independent query for recovery evaluation (e.g., `up{instance="${instance}"} == 1` to confirm service recovery). Recovery is confirmed only when data is found. The variable `${instance}` in this query will be replaced with the actual label value from the alert event. |
| 77 | +
|
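The default recovery behavior in this mode amounts to comparing label sets between evaluation cycles. The Python sketch below illustrates that idea under stated assumptions: `diff_cycles` is a hypothetical helper, and each Series is represented only by a frozen set of its label pairs.

```python
from typing import FrozenSet, Iterable, Set, Tuple

LabelKey = FrozenSet[Tuple[str, str]]


def diff_cycles(previously_firing: Set[LabelKey], current_results: Iterable[LabelKey]):
    """Compare the label sets returned now with those already firing."""
    current = set(current_results)
    new_alerts = current - previously_firing  # returned now, not firing before
    recovered = previously_firing - current   # firing before, no longer returned
    return new_alerts, recovered, current


host1 = frozenset({("instance", "host-1"), ("job", "node")})
host2 = frozenset({("instance", "host-2"), ("job", "node")})

# Cycle 1: both hosts exceed the threshold, so the query returns both Series.
new_1, recovered_1, firing = diff_cycles(set(), [host1, host2])
# Cycle 2: host-2 dropped below the threshold, so only host-1 is returned.
new_2, recovered_2, firing = diff_cycles(firing, [host1])
```

After the second cycle, `recovered_2` contains only the `host-2` label set, while `host-1` remains in `firing`.
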
| 78 | +### Pros and cons |
| 79 | +
|
| 80 | +- **Pros**: |
| 81 | + - **Better performance**: Filtering logic is pushed down to the Prometheus server, reducing data transmitted to the alert engine. |
| 82 | + - **Low migration cost**: Existing Prometheus rule statements can be directly reused. |
| 83 | +- **Cons**: |
| 84 | + - **Single severity level**: One rule typically corresponds to one severity level. To distinguish between >90 and >80, two rules must be configured. |
| 85 | + - **Recovery value retrieval**: During recovery, since Prometheus no longer returns data (because the threshold is no longer met), the alert engine cannot directly obtain the specific value at recovery time. However, you can use enrichment queries to get real-time values. |
| 86 | +
|
| 87 | +--- |
| 88 | +
|
| 89 | +## 3. No data mode |
| 90 | +
|
| 91 | +This mode is designed to detect whether monitored objects are still alive and whether data reporting pipelines are functioning properly. |
| 92 | +
|
| 93 | +### Configuration |
| 94 | +
|
| 95 | +- **Query**: Write a query for metrics that should always exist. |
| 96 | + - Example: `up{job="my-service"}` |
| 97 | +- **Evaluation rule**: If no data for a Series can be queried for N consecutive cycles (note: this means completely no data, not a value of 0), a "no data" alert is triggered. |
| 98 | +- **Engine restart**: If the alert engine (monitedge) restarts, its in-memory state is lost and the missing-cycle counter starts over. As a result, a Series that was queryable before the restart but is unavailable afterwards will not trigger an alert. |
| 99 | +
|
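The consecutive-cycle rule and the restart caveat can be sketched in Python as follows. `NoDataTracker` is a hypothetical class, not the monitedge implementation, and it assumes an alert fires exactly once when the missing count reaches N.

```python
from typing import Dict, Set


class NoDataTracker:
    """Counts consecutive cycles in which a previously seen Series is missing."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold          # N consecutive missing cycles
        self.missing: Dict[str, int] = {}   # in-memory only; lost on restart

    def observe(self, present: Set[str]) -> Set[str]:
        """Record one evaluation cycle and return Series that just hit N misses."""
        alerts = set()
        for key in present:
            self.missing[key] = 0  # seen this cycle (a value of 0 still counts as data)
        for key in self.missing:
            if key not in present:
                self.missing[key] += 1
                if self.missing[key] == self.threshold:
                    alerts.add(key)
        return alerts


tracker = NoDataTracker(threshold=3)
results = [tracker.observe(present) for present in [{"a", "b"}, {"a"}, {"a"}, {"a"}]]

# Simulated restart: a fresh tracker has never seen "b", so it cannot miss it.
restarted = NoDataTracker(threshold=3)
after_restart = [restarted.observe({"a"}) for _ in range(3)]
```

In the first run, `b` triggers a no data alert on the third cycle in which it is missing. After the simulated restart, no alert is raised for `b` because the new tracker has no record that the Series ever existed.
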
| 100 | +### Typical applications |
| 101 | +
|
| 102 | +- Monitor Exporter crashes. |
| 103 | +- Monitor instrumentation reporting service interruptions. |
| 104 | +- Monitor batch jobs not executing on schedule. |
| 105 | +
|
| 106 | +### Comparison with Prometheus native `absent()` function |
| 107 | +
|
| 108 | +- Prometheus's `absent()` function requires listing every combination of identifying labels for a metric, e.g., `absent(up{instance="host-1"})`. Multiple statements are needed for multiple instances. |
| 109 | +- Monitors' no data mode only requires one query. The alert engine automatically monitors all returned Series, and any Series with missing data triggers an alert. |
| 110 | +
|
| 111 | +--- |
| 112 | +
|
| 113 | +## Advanced configuration |
| 114 | +
|
| 115 | +### Labels and variables |
| 116 | +
|
| 117 | +Monitors automatically parses labels returned by Prometheus. In **recovery queries** or **enrichment queries**, you can use `${label_name}` to reference label values. |
| 118 | +
|
| 119 | +For example, if a query result contains labels `instance="host-1"` and `job="node"`, you can write the recovery query as: |
| 120 | +
|
| 121 | +```promql |
| 122 | +up{instance="${instance}", job="${job}"} == 1 |
| 123 | +``` |
| 124 | +
|
| 125 | +When executing, the alert engine replaces `${instance}` with `host-1` and `${job}` with `node`, which are the actual values from the alert event labels. |
| 126 | +
|
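The substitution step can be approximated with a few lines of Python. This is only a sketch: `render_query` is a hypothetical helper, and leaving unknown variables untouched is an assumption, since the engine's behavior for missing labels is not described here.

```python
import re
from typing import Dict

# Matches ${label_name}, where label_name follows Prometheus label naming rules.
_VARIABLE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def render_query(template: str, labels: Dict[str, str]) -> str:
    """Replace each ${label_name} with the matching label value from the alert event."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        # Assumption: variables without a matching label are left as written.
        return labels.get(name, match.group(0))

    return _VARIABLE.sub(replace, template)


event_labels = {"instance": "host-1", "job": "node"}
rendered = render_query('up{instance="${instance}", job="${job}"} == 1', event_labels)
```

For the alert event above, `rendered` becomes `up{instance="host-1", job="node"} == 1`.
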
| 127 | +### Enrichment query |
| 128 | +
|
| 129 | +To enrich alert notification content, you can configure enrichment queries. Enrichment queries do not participate in alert evaluation and are only used to retrieve additional information. |
| 130 | +
|
| 131 | +- **Scenario**: During a CPU alert, simultaneously query the machine's `mem_usage_percent` to display in the alert details for troubleshooting assistance. |
| 132 | +- **Variables**: Enrichment queries support `${label_name}` variables for querying specific alert objects. |
| 133 | +- **Result display**: Enrichment query results can be referenced in the alert rule's description field using the `$relates` variable. Please refer to the **description field** documentation for detailed usage instructions. |
| 134 | +
|