flashcatcloud
diff --git a/flashduty/en/3. Monitors/2. Alert Rules/3.loki.md b/flashduty/en/3. Monitors/2. Alert Rules/3.loki.md
Lines changed: 53 additions & 59 deletions
diff --git a/flashduty/en/3. Monitors/2. Alert Rules/9.victorialogs.md b/flashduty/en/3. Monitors/2. Alert Rules/9.victorialogs.md
Lines changed: 147 additions & 0 deletions
@@ -1,110 +1,104 @@
 ---
 title: "Loki Alert Rule Configuration"
-description: "This document provides detailed instructions on configuring alert rules for Loki data sources in Monitors. Monitors supports Loki's LogQL query syntax and can perform aggregation analysis on log data to trigger alerts."
+description: "This document details how to configure Loki data source alert rules in the Monitors alert engine. Monitors supports Loki's LogQL query syntax, enabling aggregation analysis of log data and alert triggering."
 date: "2025-12-30T18:21:50.775+08:00"
-url: "https://docs.flashcat.cloud/en/flashduty/monitors/loki-alert-rules"
+url: "https://docs.flashcat.cloud/en/flashduty/monitors/loki-alert-rules?nav=01JCQ7A4N4WRWNXW8EWEHXCMF5"
 ---
 
-This document provides detailed instructions on configuring alert rules for Loki data sources in Monitors. Monitors supports Loki's LogQL query syntax and can perform aggregation analysis on log data to trigger alerts.
+This document details how to configure Loki data source alert rules in the Monitors alert engine. Monitors supports Loki's LogQL query syntax, enabling aggregation analysis of log data and alert triggering.
 
 ## Core concepts
 
 Loki's query language LogQL is divided into two categories:
-
 1. **Log queries**: Return log line content (Stream).
-2. **Metric queries**: Count or aggregate logs, returning numeric values (Vector).
-
-**Monitors alert engine primarily uses metric queries**. Always use functions like `count_over_time`, `rate`, `sum` to convert logs into numeric series for threshold evaluation.
+2. **Metric queries**: Count or aggregate logs, for example with the `count_over_time` function, and return numeric values (Vector).
 
 ---
 
 ## 1. Threshold mode
 
-This mode is suitable for scenarios requiring multi-level threshold evaluation on log aggregation values (e.g., Info/Warning/Critical).
+This mode is suitable for scenarios that require multi-level threshold evaluation (such as Info/Warning/Critical) on log aggregation values.
 
 ### Configuration
 
-- **Query**: Write a LogQL that returns numeric vectors.
-  - Example: Count log entries containing the `error` keyword in the `mysql` job over the last 5 minutes.
-    ```logql
-    count_over_time({job="mysql"} |= "error" [5m])
-    ```
-- **Threshold conditions**:
-  - **Critical**: `$A > 50` (more than 50 error logs in 5 minutes)
-  - **Warning**: `$A > 10` (more than 10 error logs in 5 minutes)
+* **Query statement (LogQL)**: Write a LogQL query that returns a numeric vector (select the "Statistics" query mode).
+    * *Example*: Count the number of logs containing the `error` keyword in the `mysql` job over the last 5 minutes.
+        ```logql
+        count_over_time({job="mysql"} |= "error" [5m])
+        ```
+* **Threshold conditions**:
+    * **Critical**: `$A > 50` (more than 50 error logs in 5 minutes)
+    * **Warning**: `$A > 10` (more than 10 error logs in 5 minutes)
 
 ### How it works
-
-The engine executes the LogQL query and retrieves time series data with labels (Vector). It iterates through each series, extracting values to compare against the configured threshold expressions.
+The engine executes the LogQL query and retrieves time series data (Vector) with labels. The engine iterates through each series, extracts the value, and compares it against the configured threshold expressions.
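The per-series evaluation described above can be sketched in Python. This is an illustration only, not the Monitors implementation: the `evaluate_thresholds` function, the `(labels, value)` data shape, and the rule that the most severe matching level wins are all assumptions made for the example.

```python
# Illustrative sketch only; names and data shapes are hypothetical.

# Thresholds ordered from most to least severe, mirroring `$A > 50` and `$A > 10`.
THRESHOLDS = [("Critical", 50.0), ("Warning", 10.0)]


def evaluate_thresholds(series):
    """Return {series_key: severity} for every series that exceeds a threshold.

    `series` is a list of (labels, value) pairs, where labels is a dict such as
    {"job": "mysql", "instance": "db-1"} and value is the latest sample.
    """
    results = {}
    for labels, value in series:
        for severity, limit in THRESHOLDS:
            if value > limit:
                # Assumption: the first (most severe) matching level wins.
                results[tuple(sorted(labels.items()))] = severity
                break
    return results


vector = [
    ({"job": "mysql", "instance": "db-1"}, 72.0),
    ({"job": "mysql", "instance": "db-2"}, 12.0),
    ({"job": "mysql", "instance": "db-3"}, 3.0),
]
alerts = evaluate_thresholds(vector)
```

Here `db-1` would be classified as Critical, `db-2` as Warning, and `db-3` would not alert.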
 
 ### Recovery logic
-
-- **Auto recovery**: Automatically recovers when query result values fall below the threshold.
-- **Specific recovery conditions**: Configure conditions like `$A < 5` to prevent flapping near the threshold.
-- **Recovery query**:
-  - Supports configuring an independent LogQL for recovery evaluation.
-  - Supports `${label_name}` variable substitution.
-  - Example: Alert checks for error logs, recovery checks for specific recovery logs `count_over_time({job="mysql"} |= "recovered" [5m])`.
+* **Automatic recovery**: The alert recovers automatically when the query result value no longer exceeds the threshold.
+* **Specific recovery condition**: Configure a condition such as `$A < 5` to avoid flapping near the threshold.
+* **Recovery query**:
+    * Supports configuring a separate LogQL query for recovery evaluation; the alert recovers as soon as this query returns data.
+    * Supports `${label_name}` variable substitution.
+    * *Example*: Alert on error logs, recover on specific recovery logs `count_over_time({job="mysql"} |= "recovered" [5m])`.
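The `${label_name}` substitution in a recovery query can be pictured with a short Python sketch. The `render_recovery_query` helper is hypothetical, and leaving unknown placeholders untouched is an assumption; the actual engine behavior may differ.

```python
import re


def render_recovery_query(template, labels):
    """Replace each ${name} placeholder with the matching label value."""

    def substitute(match):
        name = match.group(1)
        # Assumption: placeholders without a matching label are left as-is.
        return labels.get(name, match.group(0))

    return re.sub(r"\$\{(\w+)\}", substitute, template)


# Labels carried by the alert event (example values).
event_labels = {"job": "mysql", "instance": "db-1"}
template = 'count_over_time({job="mysql", instance="${instance}"} |= "recovered" [5m])'
query = render_recovery_query(template, event_labels)
```

With these example labels, the recovery query only inspects logs from the instance that raised the alert.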
 
 ---
 
 ## 2. Data exists mode
 
-This mode is suitable for users who prefer writing filter conditions directly in LogQL, or scenarios where you only care about "whether anomalous data exists".
+This mode is suitable for scenarios where you prefer to write filter conditions directly in LogQL, or only care about "whether abnormal data exists". This mode is recommended for log anomaly detection alerts.
 
 ### Configuration
 
-- **Query**: Write a LogQL with comparison operators that only returns data meeting the conditions.
-  - Example: Directly filter services with error rates exceeding 5%.
-    ```logql
-    rate({job="ingress"} |= "500" [1m]) / rate({job="ingress"} [1m]) * 100 > 5
-    ```
-- **Evaluation rule**: An alert is triggered as soon as the LogQL query returns data.
+* **Query statement (LogQL)**: Write a LogQL containing comparison operators that returns only data meeting the conditions.
+    * *Example*: Directly filter out services with error rates exceeding 5%.
+        ```logql
+        count_over_time({job="ingress"} |= "error-code-500" [5m]) / count_over_time({job="ingress"} [5m]) * 100 > 5
+        ```
+* **Evaluation rule**: An alert is triggered as long as the LogQL query returns data.
 
 ### Pros and cons
-
-- **Pros**: Computation logic is pushed down to the Loki server, reducing data transmission.
-- **Cons**: Cannot distinguish between alert severity levels; can only trigger a single level alert.
+* **Pros**: Computation logic is pushed down to the Loki server, reducing data transfer.
+* **Cons**: Cannot distinguish alert levels; can only trigger a single level of alert.
 
 ### Recovery logic
-
-- **Data disappearance means recovery**: Recovery is confirmed when the LogQL query result is empty (i.e., the `> 5` condition is no longer met).
-- **Recovery query**: Supports configuring additional query statements to assist in determining recovery status.
+* **Recover when data disappears**: When the LogQL query result is empty (i.e., the `> 5` condition is no longer met), the alert is considered recovered.
+* **Recovery query**: Supports configuring additional query statements to assist in determining recovery status.
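The trigger and recovery rules of data exists mode amount to comparing what is alerting with what the query just returned. A minimal Python sketch follows; tracking state per series and the `evaluate_data_exists` name are assumptions for illustration, not confirmed engine behavior.

```python
# Illustrative sketch; the per-series bookkeeping is an assumption.


def evaluate_data_exists(firing, returned):
    """Compare the series that are alerting with the series just returned.

    Both arguments are sets of series identifiers (for example label strings).
    Returns (newly_firing, recovered).
    """
    newly_firing = returned - firing  # data appeared: trigger an alert
    recovered = firing - returned     # data disappeared: recover
    return newly_firing, recovered


# Cycle 1: the `> 5` filter returns one service.
new_1, rec_1 = evaluate_data_exists(set(), {'job="ingress"'})
# Cycle 2: the query returns nothing, so the alert recovers.
new_2, rec_2 = evaluate_data_exists({'job="ingress"'}, set())
```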
 
 ---
 
 ## 3. No data mode
 
-This mode monitors whether log reporting pipelines are interrupted, or whether logs that should be continuously generated have stopped.
+This mode is used to monitor whether the log reporting pipeline is interrupted, or whether logs that should be continuously generated have stopped.
 
 ### Configuration
 
-- **Query**: Write a query that should always have data.
-  - Example: Calculate log reporting rate for all hosts.
-    ```logql
-    rate({job="node-logs"} [1m])
-    ```
-- **Evaluation rule**: If a Series (uniquely identified by labels, e.g., `instance="host-1"`) existed in previous cycles but cannot be found in the current and N consecutive cycles, a "no data" alert is triggered.
+* **Query statement (LogQL)**: Write a query that should always have data.
+    * *Example*: Count the log reporting rate for all hosts.
+        ```logql
+        rate({job="node-logs"} [1m])
+        ```
+* **Evaluation rule**: If a Series (uniquely identified by its labels, such as `instance="host-1"`) existed in previous cycles but is missing from the current cycle and from N consecutive cycles, a "no data" alert is triggered.
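The evaluation rule can be sketched as a small state tracker in Python. The `NoDataDetector` class is hypothetical, and it assumes the alert fires once a previously seen series has been missing for exactly N consecutive cycles; the engine's exact counting may differ.

```python
# Illustrative sketch; class name and counting rule are assumptions.


class NoDataDetector:
    def __init__(self, missing_cycles):
        self.missing_cycles = missing_cycles  # N in the evaluation rule
        self.missing = {}  # series key -> consecutive cycles without data

    def observe(self, present):
        """Process one evaluation cycle; return series that just reached N misses."""
        present = set(present)
        for key in present:
            self.missing[key] = 0  # seen this cycle: start or reset tracking
        alerts = []
        for key in list(self.missing):
            if key in present:
                continue
            self.missing[key] += 1
            if self.missing[key] == self.missing_cycles:
                alerts.append(key)
        return sorted(alerts)


detector = NoDataDetector(missing_cycles=2)
detector.observe(['instance="host-1"', 'instance="host-2"'])  # both report
first = detector.observe(['instance="host-2"'])   # host-1 missing once
second = detector.observe(['instance="host-2"'])  # host-1 missing twice
```

A series that was never seen is not tracked, which matches the requirement that the series "existed in previous cycles".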
 
 ### Typical applications
-
-- Monitor whether collection agents like Promtail/Fluentd have stopped working.
-- Monitor whether critical business logs (such as order creation logs) have been abnormally interrupted.
+* Monitor whether collection agents like Promtail/Fluentd have stopped working.
+* Monitor whether critical business logs (such as order creation logs) have been abnormally interrupted.
 
 ---
 
-## 4. Best practices and considerations
-
-### Avoid querying raw logs
-
-**Do not** use LogQL that only returns log streams in alert rules (e.g., `{job="mysql"} |= "error"`).
-
-- **Reason**: The alert engine needs numeric values for calculations and evaluation. Raw log streams cannot be directly used for threshold comparisons.
-- **Correct approach**: Must wrap with aggregation functions like `count_over_time(...)`.
+## 4. Get original logs when alerting
 
-### Performance optimization
+Related queries let you fetch the original logs when an alert fires. Fetching many logs is usually not recommended; a single log line is enough as a sample to include in the alert message.
 
-- **Time range**: The time range in LogQL (e.g., `[5m]`) should be moderate. Too large a range leads to slow queries, while too small a range may cause high data volatility.
-- **Label filtering**: Use precise label filters in the LogQL Stream Selector section (within braces `{...}`) as much as possible to reduce the amount of data scanned.
+![](https://docs-cdn.flashcat.cloud/imges/mon/ca5ea15fcfca066f260e09d0f8d91cf2.png)
 
+The results of related queries can be rendered in the "Note description" field, for example:
 
+```
+{{- if eq $status "firing" }}
+error log count: {{ $value | printf "%.3f" }}
+{{- range $x := $relates.R1}}
+Loki log time: {{(nanoTime $x.Fields.__time__ 8).Format "2006-01-02T15:04:05Z07:00"}}
+Loki log line: {{$x.Fields.__log__}}
+{{- end}}
+{{- end}}
+```
@@ -0,0 +1,147 @@
+---
+title: "VictoriaLogs Alert Rule Configuration"
+description: "This document describes how to configure VictoriaLogs data source alert rules in the Monitors alert engine."
+date: "2025-12-30T18:21:50.775+08:00"
+url: "https://docs.flashcat.cloud/en/flashduty/monitors/victorialogs-alert-rules?nav=01JCQ7A4N4WRWNXW8EWEHXCMF5"
+---
+
+This document describes how to configure VictoriaLogs data source alert rules in the Monitors alert engine. Monitors queries VictoriaLogs via HTTP, supporting raw log queries and statistical analysis, and performs threshold evaluation and data exists/missing detection based on the results.
+
+## 1. Prerequisites
+
+### 1.1 How it works
+
+Monitors provides two query modes for VictoriaLogs data source alert rule configuration:
+
+- **Raw query**: Calls the `/select/logsql/query` endpoint. The returned data can be viewed as a two-dimensional table. In **threshold mode**, "label fields" and "value fields" need to be mapped.
+- **Statistics**: Calls the `/select/logsql/stats_query` endpoint. The returned data follows the Prometheus protocol format. Monitors automatically identifies which fields are labels and which are values, requiring no additional configuration.
+
+VictoriaLogs data sources support all three alert modes (threshold, data exists, and no data). **Data exists mode** is the most recommended, as it suits log scenarios best.
+
+### 1.2 Raw query
+
+In "raw query" mode, the relevant configuration items are:
+
+- **Query statement**: Example: `error | fields _time, _stream, _msg | sort by (_time) desc`
+- **Result limit**: This configuration limits the maximum number of rows returned by a query to avoid performance impact from returning too much data in a single query. In Monitors, the maximum can be set to 100.
+- **Time range**: Specify the query time window, such as "last 5 minutes".
+- **Label fields**: Specify which fields in the query results serve as labels for the alert object, used to distinguish different alert entities. Multiple label fields can be configured. If left empty, Monitors will treat all fields except the value field as label fields.
+- **Value field**: Specify which field in the query results serves as the numeric value for threshold evaluation. This is usually a numeric type field. Required in **threshold mode**, optional in other modes.
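The label field and value field mapping can be illustrated in Python. This is a sketch under stated assumptions: the `split_row` helper is hypothetical, and rows are assumed to arrive as dictionaries of strings.

```python
# Illustrative sketch; helper name and row format are assumptions.


def split_row(row, value_field, label_fields=None):
    """Split one result row into (labels, value)."""
    if not label_fields:
        # With no label fields configured, every field except the
        # value field is treated as a label.
        label_fields = [name for name in row if name != value_field]
    labels = {name: row[name] for name in label_fields if name in row}
    return labels, float(row[value_field])


row = {"level": "ERROR", "service": "checkout", "total": "150"}

# Label fields left empty: both `level` and `service` become labels.
labels, value = split_row(row, value_field="total")

# Label fields configured explicitly: only `level` identifies the alert object.
explicit_labels, _ = split_row(row, value_field="total", label_fields=["level"])
```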
+
+
+### 1.3 Statistics
+
+In "statistics" mode, the `stats` keyword must be used. Relevant configuration items are:
+
+- **Query statement**: Example: `_time:1d | stats by (level) count(*) total`
+- **No other parameters**: The query statement must include a `_time` filter, such as `_time:5m`, to limit the query time range. Otherwise the query scans all data, which may cause performance issues.
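To make the two endpoints concrete, the following Python sketch builds the request URLs for each query mode without sending them. The base address is a placeholder, the helper names are hypothetical, and how Monitors actually passes the time range for raw queries is not shown here.

```python
from urllib.parse import urlencode

# Placeholder address; replace with your own VictoriaLogs instance.
BASE_URL = "http://localhost:9428"


def raw_query_url(query, limit):
    """URL for raw query mode; `limit` caps the number of returned rows."""
    params = urlencode({"query": query, "limit": limit})
    return f"{BASE_URL}/select/logsql/query?{params}"


def stats_query_url(query):
    """URL for statistics mode; the time range comes from the `_time` filter."""
    params = urlencode({"query": query})
    return f"{BASE_URL}/select/logsql/stats_query?{params}"


raw_url = raw_query_url(
    "error | fields _time, _stream, _msg | sort by (_time) desc", limit=100
)
stats_url = stats_query_url("_time:1d | stats by (level) count(*) total")
```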
+
+---
+
+## 2. Threshold mode
+
+Both **raw query** and **statistics** query modes can be used. Examples are provided below.
+
+### 2.1 Raw query example
+
+Query statement example:
+
+```
+level:ERROR | stats by (level) count(*) total
+```
+
+The result looks like:
+
+| level | total |
+|-------|-------|
+| ERROR | 150   |
+
+Configure the value field as `total` and the label field as `level` (or leave the label field unconfigured, in which case Monitors treats every field other than the value field as a label). Example configuration for different thresholds and levels:
+
+- Warning: `$A.total >= 50` or simply `$A >= 50` (since there's only one value field: total)
+- Critical: `$A.total >= 100` or simply `$A >= 100` (since there's only one value field: total)
+
+### 2.2 Statistics example
+
+Query statement example:
+
+`_time:1d and level:ERROR | stats by (level) count(*) total`
+
+The result follows the Prometheus protocol format:
+
+```
+total{level="ERROR"} 150
+```
+
+Example configuration for different thresholds and levels:
+
+- Warning: `$A.total >= 50` or simply `$A >= 50` (since there's only one metric field: total)
+- Critical: `$A.total >= 100` or simply `$A >= 100` (since there's only one metric field: total)
+
+
+### 2.3 Recovery logic
+
+Similar to Prometheus / Elasticsearch threshold mode, VictoriaLogs threshold mode supports:
+
+- **Automatic recovery**: When the latest query result shows that an object's value no longer meets any alert threshold, a recovery event is automatically generated.
+- **Specific recovery condition**: A recovery expression can be configured, such as `$A.total < 10`, to only consider recovery when the error count drops below 10, reducing flapping.
+- **Recovery query**: A separate VictoriaLogs query statement can be configured for recovery evaluation.
+  - How it works: After an alert is triggered, Monitors periodically executes this recovery query statement. As long as the query returns data (i.e., the result is not empty), the incident is considered recovered.
+  - Variable support: The recovery query statement supports embedded variables (format: `${label_name}`), which are automatically replaced with the corresponding label values from the alert event, allowing the recovery query to detect specific alert objects.
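The effect of a specific recovery condition on flapping can be sketched in Python. The limits mirror the examples `$A.total >= 50` and `$A.total < 10` used in this section; the `next_status` function is hypothetical and only illustrates the hysteresis idea.

```python
# Illustrative sketch; function name and default limits are assumptions.


def next_status(firing, total, alert_at=50, recover_below=10):
    """Return the new firing flag for one alert object given its latest value."""
    if total >= alert_at:
        return True
    if firing and total < recover_below:
        return False
    return firing  # between the two limits: keep the current status


statuses = []
firing = False
for total in [60, 30, 12, 8, 30]:
    firing = next_status(firing, total)
    statuses.append(firing)
```

In this sequence the alert stays active while the count sits between 10 and 50, and only recovers once it drops to 8.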
+
+---
+
+## 3. Data exists mode
+
+This mode puts all filtering logic in the VictoriaLogs query; Monitors only checks whether data is returned. It suits scenarios where an alert should fire whenever data meets the conditions. **This is the most recommended VictoriaLogs alert configuration method**: threshold mode requires data to be continuously available with only the value changing, which is a poor fit for logs, whereas data exists mode matches log scenarios well.
+
+Query statement example (using **statistics** query mode):
+
+`_time:15m and level:ERROR | stats by (level) count(*) total | filter total:>10`
+
+Here `| filter total:>10` keeps only rows where `total` is greater than 10. As long as any row meets this condition, Monitors triggers an alert. Once no row meets the condition, the alert is considered recovered.
+
+
+---
+
+## 4. No data mode
+
+No data mode is used to monitor situations where "logs that should be continuously generated no longer appear", commonly seen in:
+
+- Application instances no longer produce logs (for example, because the process exited).
+- The log collection pipeline is failing (for example, an agent crash or blocked output).
+
+### 4.1 Configuration example
+
+Query statement (**statistics** mode):
+
+```
+_time:15m and level:INFO | stats by (level) count(*) total
+```
+
+Scenario: A service should always have INFO log output. If there is no INFO log generated in the last 15 minutes, trigger an alert.
+
+## 5. Get original logs when alerting
+
+Alert query conditions typically use "statistics" mode, which does not return original logs. Monitors supports configuring "related queries" in alert rules to additionally query original logs when an alert is triggered.
+
+![](https://docs-cdn.flashcat.cloud/imges/mon/b5c1890d90cecf967695a4b0a4b4fba0.png)
+
+The results of "related queries" can be rendered in the "Note description" field, for example:
+
+```
+{{- if eq $status "firing" }}
+triggered value: {{ $value | printf "%.3f" }}
+{{- range $x := $relates.R1}}
+{{- range $k, $v := $x.Fields }}
+{{- if eq $k "_time" }}
+{{ $k }} : {{ timeFormat $v "2006-01-02T15:04:05Z07:00" 8 }}
+{{- else }}
+{{ $k }} : {{ $v }}
+{{- end }}
+{{- end }}
+{{- end}}
+{{- else}}
+Recovered
+{{- end}}
+```
+