Commit e00fbc8

add monitors docs
1 parent 5d8d878 commit e00fbc8

17 files changed

Lines changed: 1573 additions & 488 deletions

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
---
title: "Prometheus Alert Rule Configuration"
description: "This document provides detailed instructions on configuring alert rules for Prometheus data sources in Monitors. Monitors is compatible with Prometheus PromQL query syntax and offers flexible alert evaluation and recovery mechanisms to meet various monitoring requirements."
date: "2025-12-30T18:21:50.775+08:00"
url: "https://docs.flashcat.cloud/en/flashduty/monitors/prometheus-alert-rules"
---

This document provides detailed instructions on configuring alert rules for Prometheus data sources in Monitors. Monitors is compatible with Prometheus PromQL query syntax and offers flexible alert evaluation and recovery mechanisms to meet various monitoring requirements.
## Core concepts

Before configuring alert rules, it's essential to understand how Monitors processes Prometheus data. The alert engine supports three core evaluation modes:

1. **Threshold**: The query returns raw metric values, and the alert engine performs threshold comparisons in memory.
2. **Data exists**: The query itself contains filter conditions and only returns anomalous data. The engine triggers an alert when data is found.
3. **No data**: Used to monitor scenarios where data reporting is interrupted.

---
## 1. Threshold mode

This mode is suitable for scenarios requiring multi-level alerts on the same metric (e.g., Info/Warning/Critical) or when precise recovery values are needed.

### Configuration

- **Query (PromQL)**: Write a PromQL query **without** comparison operators, so that it returns raw metric values.
  - Example: Query memory usage

    ```promql
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
    ```

- **Threshold conditions**: Define threshold expressions for different severity levels in the rule configuration. The variable `$A` represents the query result value.
  - **Critical**: `$A > 90` (triggers a critical alert when memory usage exceeds 90%)
  - **Warning**: `$A > 80` (triggers a warning alert when memory usage exceeds 80%)
### Multiple queries and data correlation

Monitors supports configuring multiple query statements in a single alert rule (named A, B, C...) and referencing these query results simultaneously in threshold expressions (e.g., `$A > 90 and $B < 50`).

- **Auto join**: The alert engine automatically correlates results from different queries based on **labels**.
- **Alignment requirements**: Two queries can only be correlated within the same context when their returned data contains **exactly the same** label sets.
  - Example: If query A returns `cpu_usage_percent{instance="host-1", job="node"}` and query B returns `mem_usage_percent{instance="host-1", job="node"}`, then `$A` and `$B` can be successfully correlated.
  - Note: If query A has an additional label (e.g., `disk="/"`) that query B does not, they cannot be correlated. It's recommended to use aggregation operations like `sum by (...)` or `avg by (...)` in PromQL to explicitly control returned labels and ensure label consistency across multiple queries, as in the sketch below.
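For instance, both queries can be aggregated down to the same label set so the join succeeds. A minimal sketch using the example metrics above (the aggregation you actually need depends on your data):

```promql
# Query A: average CPU usage per instance
avg by (instance, job) (cpu_usage_percent)

# Query B: average memory usage, aggregated to the same labels as query A
avg by (instance, job) (mem_usage_percent)
```

Both queries now return series keyed only by `instance` and `job`, so `$A` and `$B` line up row by row and an expression such as `$A > 90 and $B < 50` can be evaluated per instance.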
### How it works

The alert engine periodically executes PromQL and retrieves current values for all Series. Then it iterates through each Series, matching Critical, Warning, and Info conditions in order. Once a condition is met, an alert event of the corresponding severity is generated.
### Recovery logic

Multiple recovery strategies are supported:

- **Auto recovery**: When the latest query result no longer meets any alert threshold, a recovery event is automatically generated.
- **Specific recovery conditions**: Configure additional recovery expressions (e.g., `$A < 75`). Recovery is only confirmed when the value falls back to the specified level, preventing frequent flapping near the threshold.
- **Recovery query**: Allows users to define a custom PromQL query for recovery evaluation.
  - **Principle**: After an alert is triggered, the engine periodically executes this recovery query. If the query returns data (i.e., the result is not empty), the incident is considered recovered.
  - **Variable support**: The recovery query supports embedded variables (in the format `${label_name}`), which are automatically replaced with corresponding label values from the alert event. This enables the recovery query to perform precise detection for specific alert objects (such as a specific `instance` or `device`); see the example after this list.
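For example, a recovery query for the memory-usage rule above could pin the check to the instance that raised the alert and require the value to fall back below a safer level. A minimal sketch (the 75% level is an illustrative assumption):

```promql
(
  node_memory_MemTotal_bytes{instance="${instance}"}
  - node_memory_MemAvailable_bytes{instance="${instance}"}
)
/ node_memory_MemTotal_bytes{instance="${instance}"} * 100 < 75
```

Before execution, the engine substitutes `${instance}` with the instance label from the alert event; if the query returns data, the incident is considered recovered.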
---

## 2. Data exists mode

This mode behaves consistently with Prometheus native alerting rules. It's suitable for users who prefer defining thresholds directly in PromQL, or for scenarios requiring high-performance processing of large numbers of Series.

### Configuration

- **Query (PromQL)**: Write a PromQL query **with** comparison operators, so that only anomalous data is returned.
  - Example: Query nodes with memory usage exceeding 90%

    ```promql
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    ```

- **Evaluation rule**: No additional threshold expressions are needed. The engine triggers an alert as soon as the query returns results. The number of alert events equals the number of data rows returned.
### Recovery logic

- **Data disappearance means recovery**: The engine queries periodically. If a previously returned Series can no longer be found, the engine determines that the corresponding alert has recovered. Note: Series are identified by their label sets.
- **Recovery query (optional)**: Configure an independent query for recovery evaluation (e.g., `up{instance="${instance}"} == 1` to confirm service recovery). Recovery is confirmed only when data is found. The variable `${instance}` in this query will be replaced with the actual label value from the alert event.
### Pros and cons

- **Pros**:
  - **Better performance**: Filtering logic is pushed down to the Prometheus server, reducing the data transmitted to the alert engine.
  - **Low migration cost**: Existing Prometheus rule statements can be directly reused.
- **Cons**:
  - **Single severity level**: One rule typically corresponds to one severity level. To distinguish between >90 and >80, two rules must be configured (see the sketch below).
  - **Recovery value retrieval**: During recovery, since Prometheus no longer returns data (because the threshold is no longer met), the alert engine cannot directly obtain the specific value at recovery time. However, you can use enrichment queries to get real-time values.
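For example, reproducing the two-level memory alert from threshold mode would require two separate rules in this mode, each with its own query (a sketch based on the earlier example):

```promql
# Rule 1 (Critical): memory usage above 90%
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90

# Rule 2 (Warning): memory usage above 80%
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
```

Note that a value above 90% matches both queries, so both rules will fire; bound the Warning query (e.g., combine it with a `<= 90` condition using PromQL's `and` operator) if the two rules should be mutually exclusive.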
---

## 3. No data mode

This mode is specifically designed to monitor whether monitored objects are alive or whether data reporting pipelines are functioning properly.

### Configuration

- **Query**: Write a query for metrics that should always exist.
  - Example: `up{job="my-service"}`
- **Evaluation rule**: If no data for a Series can be queried for N consecutive cycles (note: this means completely no data, not a value of 0), a "no data" alert is triggered.
- **Engine restart**: If the alert engine (monitedge) restarts, the in-memory state is lost. If data that was queryable before the restart happens to be unavailable after the restart, no alert will be triggered, because the missing-cycle counter starts over when the engine restarts.
### Typical applications

- Detecting Exporter crashes.
- Detecting interruptions in instrumentation data reporting.
- Detecting batch jobs that fail to run on schedule.
### Comparison with Prometheus native `absent()` function

- Prometheus's `absent()` function requires listing every combination of identifying labels for a metric, e.g., `absent(up{instance="host-1"})`. Multiple statements are needed for multiple instances.
- Monitors' no data mode only requires one query. The alert engine automatically monitors all returned Series, and any Series with missing data triggers an alert (see the comparison below).
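To make the contrast concrete, here is a sketch based on the `up{job="my-service"}` example above (the instance names are illustrative):

```promql
# Native Prometheus: one absent() expression per known instance
absent(up{job="my-service", instance="host-1"})
absent(up{job="my-service", instance="host-2"})

# Monitors no data mode: a single query; every Series it returns is tracked,
# and any Series that stops reporting triggers a "no data" alert
up{job="my-service"}
```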
---

## Advanced configuration

### Labels and variables

Monitors automatically parses labels returned by Prometheus. In **recovery queries** or **enrichment queries**, you can use `${label_name}` to reference label values.

For example, if a query result contains labels `instance="host-1"` and `job="node"`, you can write the recovery query as:

```promql
up{instance="${instance}", job="${job}"} == 1
```

When executing, the alert engine replaces `${instance}` with `host-1` and `${job}` with `node`, which are the actual values from the alert event labels.
### Enrichment query

To enrich alert notification content, you can configure enrichment queries. Enrichment queries do not participate in alert evaluation and are only used to retrieve additional information.

- **Scenario**: During a CPU alert, simultaneously query the machine's `mem_usage_percent` to display in the alert details for troubleshooting assistance.
- **Variables**: Enrichment queries support `${label_name}` variables for querying specific alert objects (see the example below).
- **Result display**: Enrichment query results can be referenced in the alert rule's description field using the `$relates` variable. Please refer to the **description field** documentation for detailed usage instructions.
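For the CPU alert scenario above, the enrichment query might look like the following sketch, which fetches the memory usage of the instance that triggered the alert (the metric name follows the earlier correlation example):

```promql
mem_usage_percent{instance="${instance}"}
```

The engine replaces `${instance}` with the instance label of the alert event before running the query, and the result can then be rendered in the description via `$relates`.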
Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
---
title: "ElasticSearch Alert Rule Configuration"
description: "This document provides detailed instructions on configuring alert rules for ElasticSearch data sources in Monitors. Monitors uses ElasticSearch SQL functionality to monitor log and metric data, supporting flexible aggregation queries and alert evaluation."
date: "2025-12-30T18:21:50.775+08:00"
url: "https://docs.flashcat.cloud/en/flashduty/monitors/elasticsearch-alert-rules"
---

This document provides detailed instructions on configuring alert rules for ElasticSearch data sources in Monitors. Monitors uses ElasticSearch SQL functionality to monitor log and metric data, supporting flexible aggregation queries and alert evaluation.
## Core concepts

- **Version requirements**: Due to SQL feature dependencies, only **ElasticSearch 6.3** and above are supported.
- **Query language**: Currently only **SQL** syntax is supported.
- **Field handling**: The alert engine automatically converts all field names in query results to **lowercase**. When configuring value fields and label fields, always use lowercase letters.

---
## 1. Threshold mode

This mode is suitable for scenarios requiring threshold comparisons on aggregated values, such as monitoring "error log count in the last 5 minutes".

### Configuration

1. **Query**: Write a SQL aggregation query that returns numeric columns and, optionally, grouping columns.
   - Example: Count error logs per service in the last 5 minutes.

     ```sql
     SELECT service_name, count(*) AS error_cnt
     FROM "app-logs-*"
     WHERE "@timestamp" > now() - INTERVAL 5 MINUTES AND log_level = 'ERROR'
     GROUP BY service_name
     ```

2. **Field mapping**:
   - **Label fields**: Fields used to distinguish different alert objects. In the example above, this is `service_name`. This field can be left empty, in which case Monitors automatically treats all fields except the value fields as label fields.
   - **Value fields**: Numeric fields used for threshold evaluation. In the example above, this is `error_cnt`.
3. **Threshold conditions**:
   - Use `$A.field_name` to reference values.
   - Example: `Critical: $A.error_cnt > 50`, `Warning: $A.error_cnt > 10`.
   - Shorthand: If only one value field is configured, you can use `$A` directly, e.g., `$A > 50`.
### How it works

The engine executes the SQL query and retrieves tabular data. It groups the data by label fields, then extracts the value fields for comparison against the threshold expressions.

Note: The label field combination uniquely identifies an alert object, so query results must not contain multiple rows with the same combination of label field values; ensure each alert object corresponds to exactly one row of data. In the example above, `service_name` values must be unique. If two rows share the same `service_name`, the alert engine cannot properly distinguish between alert objects.
### Recovery logic

Similar to Prometheus data sources, ElasticSearch threshold mode also supports flexible recovery strategies:

- **Auto recovery**: When the latest SQL query result shows that a data group's value no longer meets any alert threshold (Critical/Warning), a recovery event is automatically generated.
- **Specific recovery conditions**: Configure additional recovery expressions (e.g., `$A.error_cnt < 5`). Recovery is only confirmed when the value falls below this threshold, preventing alert flapping.
- **Recovery query**:
  - **Scenario**: Sometimes alert queries and recovery queries have different logic. For example, the alert checks for "error log count > 10", while recovery might check for "success log count > 100" or query a different status index.
  - **Configuration**: Write an independent SQL statement for recovery evaluation.
  - **Variable support**: Recovery SQL supports using `${label_name}` to reference alert event label values.
    - Example: The alert SQL found that the network card with `network_host="a", interface="b"` is down. The recovery SQL can be:

      ```sql
      SELECT network_host, interface, status FROM "network-status-*"
      WHERE "@timestamp" > now() - INTERVAL 5 MINUTES
        AND network_host = '${network_host}'
        AND interface = '${interface}'
        AND status = 'UP'
      ```

    - The engine replaces `${network_host}` and `${interface}` with actual values before executing the query. If data is found, recovery is confirmed.
---

## 2. Data exists mode

This mode is suitable for scenarios where the filtering logic is written directly in SQL, or when you only need to check "whether any data is returned".

### Configuration

1. **Query**: Use a `HAVING` clause in SQL so that only anomalous data is returned.
   - Example: Directly query services with more than 50 errors.

     ```sql
     SELECT service_name, count(*) AS error_cnt
     FROM "app-logs-*"
     WHERE "@timestamp" > now() - INTERVAL 5 MINUTES AND log_level = 'ERROR'
     GROUP BY service_name
     HAVING count(*) > 50
     ```

2. **Field mapping**:
   - In this mode, label fields and value fields are **optional**. If both are left empty, the engine treats all fields in the query result as label fields, which can be referenced in rule descriptions.
### Recovery logic

- **Data disappearance means recovery**: When the SQL query result is empty (i.e., the HAVING condition is no longer met), the engine determines the incident has recovered. This is the most common recovery method.
- **Recovery query**:
  - **Scenario**: Sometimes "no data found" doesn't mean recovery (it could be that log collection failed), or stricter recovery conditions are needed (e.g., no errors for N consecutive minutes).
  - **Configuration**: Write an independent SQL statement for recovery evaluation. If this query finds data, the incident is considered recovered.
  - **Variable support**: Recovery SQL supports using `${label_name}` to reference alert event label values for precise recovery detection; see the sketch after this list.
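For instance, a stricter recovery query could require a healthy volume of successful logs for the affected service before confirming recovery. A minimal sketch, assuming the `app-logs-*` index from the earlier example and `log_level = 'INFO'` for successful requests (adjust fields and thresholds to your data):

```sql
SELECT service_name, count(*) AS ok_cnt
FROM "app-logs-*"
WHERE "@timestamp" > now() - INTERVAL 5 MINUTES
  AND log_level = 'INFO'
  AND service_name = '${service_name}'
GROUP BY service_name
HAVING count(*) > 100
```

The engine substitutes `${service_name}` with the label value from the alert event and considers the incident recovered only when this query returns data.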
### Pros and cons

- **Pros**: Leverages the ES cluster's computing power for filtering, reducing network transmission and improving performance.
- **Cons**: Cannot distinguish between multiple severity levels (e.g., Info/Warning), because SQL can only return data meeting specific conditions.
---

## 3. No data mode

This mode monitors scenarios where data is expected but missing, and is commonly used to detect log collection pipeline interruptions or scheduled tasks that fail to run.

### Configuration

1. **Query**: Write a SQL query that should continuously return data.
   - Example: Query heartbeat logs from all hosts.

     ```sql
     SELECT host_name
     FROM "heartbeat-logs-*"
     WHERE "@timestamp" > now() - INTERVAL 5 MINUTES
     GROUP BY host_name
     ```

2. **Evaluation rule**:
   - The engine periodically executes this SQL.
   - If a `host_name` appeared in previous cycles but has been absent for N consecutive cycles, a "no data" alert is triggered.
   - Note: This is the opposite of data exists mode. Data exists triggers alerts when data is found; no data triggers alerts when data is not found.
### Recovery logic

- **Data appearance means recovery**: Once the `host_name` reappears in query results, the alert automatically recovers.
- **Auto recovery time**: Configure an auto recovery time (e.g., 24 hours). If the alert has not recovered after this time, the engine automatically closes it. This is typically used for decommissioned machines that no longer need monitoring.
---

## 4. Use case example

Log alerting often requires: counting ERROR logs in the last 5 minutes, triggering an alert if the count exceeds a threshold, and displaying the most recent ERROR log as a sample in the alert message. Here's the configuration:

- **Main alert condition**: Use threshold mode with a SQL statement that counts ERROR logs in the last 5 minutes, and configure threshold conditions on the count.
- **Enrichment query**: Configure an enrichment query with a SQL statement that retrieves the most recent ERROR log, using variables like `${service_name}` to limit it to the alerting service (see the sketch below).
- **Rule description**: Reference the enrichment query results in the alert rule's description field using the `$relates` variable to render the original log content.
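Putting it together, the main alert query can reuse the error-count SQL from the threshold mode example above, and the enrichment query might look like the following sketch (the `message` field is an assumption about the log schema):

```sql
SELECT "@timestamp", service_name, message
FROM "app-logs-*"
WHERE "@timestamp" > now() - INTERVAL 5 MINUTES
  AND log_level = 'ERROR'
  AND service_name = '${service_name}'
ORDER BY "@timestamp" DESC
LIMIT 1
```

`${service_name}` is replaced with the service label from the alert event, and the returned row can be rendered in the rule description via the `$relates` variable.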
