Commit e00fbc8

add monitors docs
1 parent 5d8d878 commit e00fbc8

17 files changed

Lines changed: 1573 additions & 488 deletions

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
---
title: "Prometheus Alert Rule Configuration"
description: "This document provides detailed instructions on configuring alert rules for Prometheus data sources in Monitors. Monitors is compatible with Prometheus PromQL query syntax and offers flexible alert evaluation and recovery mechanisms to meet various monitoring requirements."
date: "2025-12-30T18:21:50.775+08:00"
url: "https://docs.flashcat.cloud/en/flashduty/monitors/prometheus-alert-rules"
---

This document provides detailed instructions on configuring alert rules for Prometheus data sources in Monitors. Monitors is compatible with Prometheus PromQL query syntax and offers flexible alert evaluation and recovery mechanisms to meet various monitoring requirements.
## Core concepts

Before configuring alert rules, it's essential to understand how Monitors processes Prometheus data. The alert engine supports three core evaluation modes:

1. **Threshold**: The query returns raw metric values, and the alert engine performs threshold comparisons in memory.
2. **Data exists**: The query itself contains filter conditions and only returns anomalous data. The engine triggers an alert when data is found.
3. **No data**: Used to monitor scenarios where data reporting is interrupted.

---
## 1. Threshold mode

This mode is suitable for scenarios requiring multi-level alerts on the same metric (e.g., Info/Warning/Critical) or when precise recovery values are needed.

### Configuration

- **Query (PromQL)**: Write a PromQL query **without** comparison operators, so that it returns raw metric values.
  - Example: Query memory usage

    ```promql
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
    ```

- **Threshold conditions**: Define threshold expressions for different severity levels in the rule configuration. The variable `$A` represents the query result value.
  - **Critical**: `$A > 90` (triggers a critical alert when memory usage exceeds 90%)
  - **Warning**: `$A > 80` (triggers a warning alert when memory usage exceeds 80%)
### Multiple queries and data correlation

Monitors supports configuring multiple query statements in a single alert rule (named A, B, C...) and referencing these query results simultaneously in threshold expressions (e.g., `$A > 90 and $B < 50`).

- **Auto join**: The alert engine automatically correlates results from different queries based on **labels**.
- **Alignment requirements**: Two queries can only be correlated within the same context when their returned data contains **exactly the same** label sets.
  - Example: If query A returns `cpu_usage_percent{instance="host-1", job="node"}` and query B returns `mem_usage_percent{instance="host-1", job="node"}`, then `$A` and `$B` can be successfully correlated.
  - Note: If query A has an additional label (e.g., `disk="/"`) that query B does not, they cannot be correlated. It's recommended to use aggregation operations like `sum by (...)` or `avg by (...)` in PromQL to explicitly control returned labels and ensure label consistency across multiple queries, as in the sketch below.
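For instance, both queries can be aggregated down to the same label set so the join succeeds. A minimal sketch using the example metrics above (the aggregation you actually need depends on your data):

```promql
# Query A: average CPU usage per instance
avg by (instance, job) (cpu_usage_percent)

# Query B: average memory usage, aggregated to the same labels as query A
avg by (instance, job) (mem_usage_percent)
```

Both queries now return series keyed only by `instance` and `job`, so `$A` and `$B` line up row by row and an expression such as `$A > 90 and $B < 50` can be evaluated per instance.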
### How it works

The alert engine periodically executes PromQL and retrieves current values for all Series. Then it iterates through each Series, matching Critical, Warning, and Info conditions in order. Once a condition is met, an alert event of the corresponding severity is generated.
### Recovery logic

Multiple recovery strategies are supported:

- **Auto recovery**: When the latest query result no longer meets any alert threshold, a recovery event is automatically generated.
- **Specific recovery conditions**: Configure additional recovery expressions (e.g., `$A < 75`). Recovery is only confirmed when the value falls back to the specified level, preventing frequent flapping near the threshold.
- **Recovery query**: Allows users to define a custom PromQL query for recovery evaluation.
  - **Principle**: After an alert is triggered, the engine periodically executes this recovery query. If the query returns data (i.e., the result is not empty), the incident is considered recovered.
  - **Variable support**: The recovery query supports embedded variables (in the format `${label_name}`), which are automatically replaced with corresponding label values from the alert event. This enables the recovery query to perform precise detection for specific alert objects (such as a specific `instance` or `device`); see the example after this list.
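For example, a recovery query for the memory-usage rule above could pin the check to the instance that raised the alert and require the value to fall back below a safer level. A minimal sketch (the 75% level is an illustrative assumption):

```promql
(
  node_memory_MemTotal_bytes{instance="${instance}"}
  - node_memory_MemAvailable_bytes{instance="${instance}"}
)
/ node_memory_MemTotal_bytes{instance="${instance}"} * 100 < 75
```

Before execution, the engine substitutes `${instance}` with the instance label from the alert event; if the query returns data, the incident is considered recovered.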
---

## 2. Data exists mode

This mode behaves consistently with Prometheus native alerting rules. It's suitable for users who prefer defining thresholds directly in PromQL, or for scenarios requiring high-performance processing of large numbers of Series.

### Configuration

- **Query (PromQL)**: Write a PromQL query **with** comparison operators, so that only anomalous data is returned.
  - Example: Query nodes with memory usage exceeding 90%

    ```promql
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    ```

- **Evaluation rule**: No additional threshold expressions are needed. The engine triggers an alert as soon as the query returns results. The number of alert events equals the number of data rows returned.
### Recovery logic

- **Data disappearance means recovery**: The engine queries periodically. If a previously returned Series can no longer be found, the engine determines that the corresponding alert has recovered. Note: Series are identified by their label sets.
- **Recovery query (optional)**: Configure an independent query for recovery evaluation (e.g., `up{instance="${instance}"} == 1` to confirm service recovery). Recovery is confirmed only when data is found. The variable `${instance}` in this query will be replaced with the actual label value from the alert event.
### Pros and cons

- **Pros**:
  - **Better performance**: Filtering logic is pushed down to the Prometheus server, reducing the data transmitted to the alert engine.
  - **Low migration cost**: Existing Prometheus rule statements can be directly reused.
- **Cons**:
  - **Single severity level**: One rule typically corresponds to one severity level. To distinguish between >90 and >80, two rules must be configured (see the sketch below).
  - **Recovery value retrieval**: During recovery, since Prometheus no longer returns data (because the threshold is no longer met), the alert engine cannot directly obtain the specific value at recovery time. However, you can use enrichment queries to get real-time values.
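For example, reproducing the two-level memory alert from threshold mode would require two separate rules in this mode, each with its own query (a sketch based on the earlier example):

```promql
# Rule 1 (Critical): memory usage above 90%
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90

# Rule 2 (Warning): memory usage above 80%
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
```

Note that a value above 90% matches both queries, so both rules will fire; bound the Warning query (e.g., combine it with a `<= 90` condition using PromQL's `and` operator) if the two rules should be mutually exclusive.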
---

## 3. No data mode

This mode is specifically designed to monitor whether monitored objects are alive or whether data reporting pipelines are functioning properly.

### Configuration

- **Query**: Write a query for metrics that should always exist.
  - Example: `up{job="my-service"}`
- **Evaluation rule**: If no data for a Series can be queried for N consecutive cycles (note: this means completely no data, not a value of 0), a "no data" alert is triggered.
- **Engine restart**: If the alert engine (monitedge) restarts, the in-memory state is lost. If data that was queryable before the restart happens to be unavailable after the restart, no alert will be triggered, because the missing-cycle counter starts over when the engine restarts.
### Typical applications

- Detecting Exporter crashes.
- Detecting interruptions in instrumentation data reporting.
- Detecting batch jobs that fail to run on schedule.
### Comparison with Prometheus native `absent()` function

- Prometheus's `absent()` function requires listing every combination of identifying labels for a metric, e.g., `absent(up{instance="host-1"})`. Multiple statements are needed for multiple instances.
- Monitors' no data mode only requires one query. The alert engine automatically monitors all returned Series, and any Series with missing data triggers an alert (see the comparison below).
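To make the contrast concrete, here is a sketch based on the `up{job="my-service"}` example above (the instance names are illustrative):

```promql
# Native Prometheus: one absent() expression per known instance
absent(up{job="my-service", instance="host-1"})
absent(up{job="my-service", instance="host-2"})

# Monitors no data mode: a single query; every Series it returns is tracked,
# and any Series that stops reporting triggers a "no data" alert
up{job="my-service"}
```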
---

## Advanced configuration

### Labels and variables

Monitors automatically parses labels returned by Prometheus. In **recovery queries** or **enrichment queries**, you can use `${label_name}` to reference label values.

For example, if a query result contains labels `instance="host-1"` and `job="node"`, you can write the recovery query as:

```promql
up{instance="${instance}", job="${job}"} == 1
```

When executing, the alert engine replaces `${instance}` with `host-1` and `${job}` with `node`, which are the actual values from the alert event labels.
### Enrichment query

To enrich alert notification content, you can configure enrichment queries. Enrichment queries do not participate in alert evaluation and are only used to retrieve additional information.

- **Scenario**: During a CPU alert, simultaneously query the machine's `mem_usage_percent` to display in the alert details for troubleshooting assistance.
- **Variables**: Enrichment queries support `${label_name}` variables for querying specific alert objects (see the example below).
- **Result display**: Enrichment query results can be referenced in the alert rule's description field using the `$relates` variable. Please refer to the **description field** documentation for detailed usage instructions.
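For the CPU alert scenario above, the enrichment query might look like the following sketch, which fetches the memory usage of the instance that triggered the alert (the metric name follows the earlier correlation example):

```promql
mem_usage_percent{instance="${instance}"}
```

The engine replaces `${instance}` with the instance label of the alert event before running the query, and the result can then be rendered in the description via `$relates`.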
Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
---
title: "ElasticSearch Alert Rule Configuration"
description: "This document provides detailed instructions on configuring alert rules for ElasticSearch data sources in Monitors. Monitors uses ElasticSearch SQL functionality to monitor log and metric data, supporting flexible aggregation queries and alert evaluation."
date: "2025-12-30T18:21:50.775+08:00"
url: "https://docs.flashcat.cloud/en/flashduty/monitors/elasticsearch-alert-rules"
---

This document provides detailed instructions on configuring alert rules for ElasticSearch data sources in Monitors. Monitors uses ElasticSearch SQL functionality to monitor log and metric data, supporting flexible aggregation queries and alert evaluation.
## Core concepts

- **Version requirements**: Due to SQL feature dependencies, only **ElasticSearch 6.3** and above are supported.
- **Query language**: Currently only **SQL** syntax is supported.
- **Field handling**: The alert engine automatically converts all field names in query results to **lowercase**. When configuring value fields and label fields, always use lowercase letters.

---
## 1. Threshold mode

This mode is suitable for scenarios requiring threshold comparisons on aggregated values, such as monitoring "error log count in the last 5 minutes".

### Configuration

1. **Query**: Write a SQL aggregation query that returns numeric columns and, optionally, grouping columns.
   - Example: Count error logs per service in the last 5 minutes.

     ```sql
     SELECT service_name, count(*) AS error_cnt
     FROM "app-logs-*"
     WHERE "@timestamp" > now() - INTERVAL 5 MINUTES AND log_level = 'ERROR'
     GROUP BY service_name
     ```

2. **Field mapping**:
   - **Label fields**: Fields used to distinguish different alert objects. In the example above, this is `service_name`. This field can be left empty, in which case Monitors automatically treats all fields except the value fields as label fields.
   - **Value fields**: Numeric fields used for threshold evaluation. In the example above, this is `error_cnt`.
3. **Threshold conditions**:
   - Use `$A.field_name` to reference values.
   - Example: `Critical: $A.error_cnt > 50`, `Warning: $A.error_cnt > 10`.
   - Shorthand: If only one value field is configured, you can use `$A` directly, e.g., `$A > 50`.
### How it works

The engine executes the SQL query and retrieves tabular data. It groups the data by label fields, then extracts the value fields for comparison against the threshold expressions.

Note: The label field combination uniquely identifies an alert object, so query results must not contain multiple rows with the same combination of label field values; ensure each alert object corresponds to exactly one row of data. In the example above, `service_name` values must be unique. If two rows share the same `service_name`, the alert engine cannot properly distinguish between alert objects.
### Recovery logic

Similar to Prometheus data sources, ElasticSearch threshold mode also supports flexible recovery strategies:

- **Auto recovery**: When the latest SQL query result shows that a data group's value no longer meets any alert threshold (Critical/Warning), a recovery event is automatically generated.
- **Specific recovery conditions**: Configure additional recovery expressions (e.g., `$A.error_cnt < 5`). Recovery is only confirmed when the value falls below this threshold, preventing alert flapping.
- **Recovery query**:
  - **Scenario**: Sometimes alert queries and recovery queries have different logic. For example, the alert checks for "error log count > 10", while recovery might check for "success log count > 100" or query a different status index.
  - **Configuration**: Write an independent SQL statement for recovery evaluation.
  - **Variable support**: Recovery SQL supports using `${label_name}` to reference alert event label values.
    - Example: The alert SQL found that the network card with `network_host="a", interface="b"` is down. The recovery SQL can be:

      ```sql
      SELECT network_host, interface, status FROM "network-status-*"
      WHERE "@timestamp" > now() - INTERVAL 5 MINUTES
        AND network_host = '${network_host}'
        AND interface = '${interface}'
        AND status = 'UP'
      ```

    - The engine replaces `${network_host}` and `${interface}` with actual values before executing the query. If data is found, recovery is confirmed.
---

## 2. Data exists mode

This mode is suitable for scenarios where the filtering logic is written directly in SQL, or when you only need to check "whether any data is returned".

### Configuration

1. **Query**: Use a `HAVING` clause in SQL so that only anomalous data is returned.
   - Example: Directly query services with more than 50 errors.

     ```sql
     SELECT service_name, count(*) AS error_cnt
     FROM "app-logs-*"
     WHERE "@timestamp" > now() - INTERVAL 5 MINUTES AND log_level = 'ERROR'
     GROUP BY service_name
     HAVING count(*) > 50
     ```

2. **Field mapping**:
   - In this mode, label fields and value fields are **optional**. If both are left empty, the engine treats all fields in the query result as label fields, which can be referenced in rule descriptions.
### Recovery logic

- **Data disappearance means recovery**: When the SQL query result is empty (i.e., the HAVING condition is no longer met), the engine determines the incident has recovered. This is the most common recovery method.
- **Recovery query**:
  - **Scenario**: Sometimes "no data found" doesn't mean recovery (it could be that log collection failed), or stricter recovery conditions are needed (e.g., no errors for N consecutive minutes).
  - **Configuration**: Write an independent SQL statement for recovery evaluation. If this query finds data, the incident is considered recovered.
  - **Variable support**: Recovery SQL supports using `${label_name}` to reference alert event label values for precise recovery detection; see the sketch after this list.
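For instance, a stricter recovery query could require a healthy volume of successful logs for the affected service before confirming recovery. A minimal sketch, assuming the `app-logs-*` index from the earlier example and `log_level = 'INFO'` for successful requests (adjust fields and thresholds to your data):

```sql
SELECT service_name, count(*) AS ok_cnt
FROM "app-logs-*"
WHERE "@timestamp" > now() - INTERVAL 5 MINUTES
  AND log_level = 'INFO'
  AND service_name = '${service_name}'
GROUP BY service_name
HAVING count(*) > 100
```

The engine substitutes `${service_name}` with the label value from the alert event and considers the incident recovered only when this query returns data.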
### Pros and cons

- **Pros**: Leverages the ES cluster's computing power for filtering, reducing network transmission and improving performance.
- **Cons**: Cannot distinguish between multiple severity levels (e.g., Info/Warning), because SQL can only return data meeting specific conditions.
---

## 3. No data mode

This mode monitors scenarios where data is expected but missing, and is commonly used to detect log collection pipeline interruptions or scheduled tasks that fail to run.

### Configuration

1. **Query**: Write a SQL query that should continuously return data.
   - Example: Query heartbeat logs from all hosts.

     ```sql
     SELECT host_name
     FROM "heartbeat-logs-*"
     WHERE "@timestamp" > now() - INTERVAL 5 MINUTES
     GROUP BY host_name
     ```

2. **Evaluation rule**:
   - The engine periodically executes this SQL.
   - If a `host_name` appeared in previous cycles but has been absent for N consecutive cycles, a "no data" alert is triggered.
   - Note: This is the opposite of data exists mode. Data exists triggers alerts when data is found; no data triggers alerts when data is not found.
### Recovery logic

- **Data appearance means recovery**: Once the `host_name` reappears in query results, the alert automatically recovers.
- **Auto recovery time**: Configure an auto recovery time (e.g., 24 hours). If the alert has not recovered after this time, the engine automatically closes it. This is typically used for decommissioned machines that no longer need monitoring.
---

## 4. Use case example

Log alerting often requires: counting ERROR logs in the last 5 minutes, triggering an alert if the count exceeds a threshold, and displaying the most recent ERROR log as a sample in the alert message. Here's the configuration:

- **Main alert condition**: Use threshold mode with a SQL statement that counts ERROR logs in the last 5 minutes, and configure threshold conditions on the count.
- **Enrichment query**: Configure an enrichment query with a SQL statement that retrieves the most recent ERROR log, using variables like `${service_name}` to limit it to the alerting service (see the sketch below).
- **Rule description**: Reference the enrichment query results in the alert rule's description field using the `$relates` variable to render the original log content.
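Putting it together, the main alert query can reuse the error-count SQL from the threshold mode example above, and the enrichment query might look like the following sketch (the `message` field is an assumption about the log schema):

```sql
SELECT "@timestamp", service_name, message
FROM "app-logs-*"
WHERE "@timestamp" > now() - INTERVAL 5 MINUTES
  AND log_level = 'ERROR'
  AND service_name = '${service_name}'
ORDER BY "@timestamp" DESC
LIMIT 1
```

`${service_name}` is replaced with the service label from the alert event, and the returned row can be rendered in the rule description via the `$relates` variable.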
