Commit 74acdbd

committed: add monitors docs

1 parent 2502d6a commit 74acdbd

101 files changed
Lines changed: 4537 additions & 59 deletions

Lines changed: 53 additions & 59 deletions
---
title: "Loki Alert Rule Configuration"
description: "This document details how to configure Loki data source alert rules in the Monitors alert engine. Monitors supports Loki's LogQL query syntax, enabling aggregation analysis of log data and alert triggering."
date: "2025-12-30T18:21:50.775+08:00"
url: "https://docs.flashcat.cloud/en/flashduty/monitors/loki-alert-rules?nav=01JCQ7A4N4WRWNXW8EWEHXCMF5"
---

This document details how to configure Loki data source alert rules in the Monitors alert engine. Monitors supports Loki's LogQL query syntax, enabling aggregation analysis of log data and alert triggering.

## Core concepts

Loki's query language, LogQL, falls into two categories:

1. **Log queries**: Return log line content (Stream).
2. **Metric queries**: Count or aggregate logs, such as using the `count_over_time` function to return numeric values (Vector); a minimal illustration follows.
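A minimal illustration of the two categories, assuming a hypothetical `job="mysql"` log stream:

```logql
# Log query: returns the matching log lines themselves (Stream)
{job="mysql"} |= "error"

# Metric query: aggregates those lines into a numeric vector that the alert engine can evaluate
count_over_time({job="mysql"} |= "error" [5m])
```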
---

## 1. Threshold mode

This mode is suitable for scenarios that require multi-level threshold evaluation (such as Info/Warning/Critical) on log aggregation values.

### Configuration

* **Query statement (LogQL)**: Write a LogQL query that returns a numeric vector (select the "Statistics" query mode).
  * *Example*: Count the logs containing the `error` keyword in the `mysql` job over the last 5 minutes.

    ```logql
    count_over_time({job="mysql"} |= "error" [5m])
    ```

* **Threshold conditions**:
  * **Critical**: `$A > 50` (more than 50 error logs in 5 minutes)
  * **Warning**: `$A > 10` (more than 10 error logs in 5 minutes)

### How it works

The engine executes the LogQL query and retrieves labeled time series data (Vector). It iterates through each series, extracts the value, and compares it against the configured threshold expressions.
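As a purely hypothetical illustration (the labels and values below are invented), the example query above might return a vector like this, and each series would be evaluated independently against the `$A > 50` and `$A > 10` conditions:

```
{job="mysql", filename="/var/log/mysql/error.log"}  57   -> matches Critical ($A > 50)
{job="mysql", filename="/var/log/mysql/slow.log"}   12   -> matches Warning ($A > 10)
```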

### Recovery logic

* **Automatic recovery**: When the query result value falls below the threshold, the alert recovers automatically.
* **Specific recovery condition**: Can be configured, for example `$A < 5`, to avoid flapping near the threshold.
* **Recovery query**:
  * Supports configuring a separate LogQL query for recovery evaluation; recovery is triggered as long as the query returns data.
  * Supports `${label_name}` variable substitution, as shown in the sketch below.
  * *Example*: Alert on error logs, recover on specific recovery logs: `count_over_time({job="mysql"} |= "recovered" [5m])`.
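A minimal sketch of a recovery query using variable substitution. It assumes the alerting series carries an `instance` label, which is not part of the document's example:

```logql
# ${instance} is replaced with the label value from the alert event,
# so recovery is evaluated only for the instance that alerted
count_over_time({job="mysql", instance="${instance}"} |= "recovered" [5m])
```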

---

## 2. Data exists mode

This mode is suitable for scenarios where you prefer to write filter conditions directly in LogQL, or only care about "whether abnormal data exists". This mode is recommended for log anomaly detection alerts.

### Configuration

* **Query statement (LogQL)**: Write a LogQL query containing comparison operators, so that it returns only data meeting the conditions.
  * *Example*: Directly filter out services with an error rate exceeding 5%.

    ```logql
    count_over_time({job="ingress"} |= "error-code-500" [5m]) / count_over_time({job="ingress"} [5m]) * 100 > 5
    ```

* **Evaluation rule**: An alert is triggered as long as the LogQL query returns data.

### Pros and cons

* **Pros**: Computation logic is pushed down to the Loki server, reducing data transfer.
* **Cons**: Cannot distinguish alert levels; can only trigger a single level of alert.

### Recovery logic

* **Recover when data disappears**: When the LogQL query result is empty (i.e., it no longer meets the `> 5` condition), the alert is considered recovered.
* **Recovery query**: Supports configuring an additional query statement to assist in determining recovery (see the sketch below).
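For example, a hedged sketch of such an additional recovery query could invert the comparison, returning data once the error rate is back under the threshold (note that if no `error-code-500` lines exist at all in the window, the numerator returns no series and this query returns nothing; the primary "recover when data disappears" rule still applies in that case):

```logql
count_over_time({job="ingress"} |= "error-code-500" [5m]) / count_over_time({job="ingress"} [5m]) * 100 < 5
```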

---

## 3. No data mode

This mode is used to monitor whether the log reporting pipeline is interrupted, or whether logs that should be continuously generated have stopped.

### Configuration

* **Query statement (LogQL)**: Write a query that should always have data.
  * *Example*: Calculate the log reporting rate of all hosts.

    ```logql
    rate({job="node-logs"} [1m])
    ```

* **Evaluation rule**: If a series (uniquely identified by its labels, such as `instance="host-1"`) existed in previous cycles but has been missing for N consecutive evaluation cycles, a "no data" alert is triggered.

### Typical applications

* Monitor whether collection agents like Promtail/Fluentd have stopped working.
* Monitor whether critical business logs (such as order creation logs) have been abnormally interrupted.

---

## 4. Get original logs when alerting

When an alert fires, you can fetch the original logs through a related query. It is usually unnecessary to fetch many; a single log line as a sample to include in the alert message is enough. A hedged sketch of such a query follows.
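This is only an illustration of what the related raw-log query might look like; how the sample is limited to one line depends on the related-query settings:

```logql
# Raw log query matching the alert condition; its lines can be embedded in the alert message
{job="mysql"} |= "error"
```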

![](https://docs-cdn.flashcat.cloud/imges/mon/ca5ea15fcfca066f260e09d0f8d91cf2.png)

The results of related queries can be rendered in the "Note description", for example:

```
{{- if eq $status "firing" }}
error log count: {{ $value | printf "%.3f" }}
{{- range $x := $relates.R1}}
Loki log time: {{(nanoTime $x.Fields.__time__ 8).Format "2006-01-02T15:04:05Z07:00"}}
Loki Log line: {{$x.Fields.__log__}}
{{- end}}
{{- end}}
```
Lines changed: 147 additions & 0 deletions
---
title: "VictoriaLogs Alert Rule Configuration"
description: "This document describes how to configure VictoriaLogs data source alert rules in the Monitors alert engine."
date: "2025-12-30T18:21:50.775+08:00"
url: "https://docs.flashcat.cloud/en/flashduty/monitors/victorialogs-alert-rules?nav=01JCQ7A4N4WRWNXW8EWEHXCMF5"
---

This document describes how to configure VictoriaLogs data source alert rules in the Monitors alert engine. Monitors queries VictoriaLogs via HTTP, supporting raw log queries and statistical analysis, and performs threshold evaluation and data exists/missing detection based on the results.

## 1. Prerequisites

### 1.1 How it works

Monitors provides two query modes for VictoriaLogs data source alert rule configuration:

- **Raw query**: Calls the `/select/logsql/query` endpoint. The returned data can be viewed as a two-dimensional table. In **threshold mode**, "label fields" and "value fields" need to be mapped.
- **Statistics**: Calls the `/select/logsql/stats_query` endpoint. The returned data follows the Prometheus protocol format. Monitors automatically identifies which fields are labels and which are values, so no additional configuration is required.

VictoriaLogs data sources support all three alert modes. The **data exists mode** is the most recommended, as it is the best fit for log scenarios.

### 1.2 Raw query

In "raw query" mode, the relevant configuration items are as follows (a field-mapping sketch appears after the list):

- **Query statement**: Example: `error | fields _time, _stream, _msg | sort by (_time) desc`
- **Result limit**: Limits the maximum number of rows a query returns, to avoid the performance impact of returning too much data in a single query. In Monitors the maximum is 100.
- **Time range**: Specifies the query time window, such as "last 5 minutes".
- **Label fields**: Specify which fields in the query results serve as labels for the alert object, used to distinguish different alert entities. Multiple label fields can be configured. If left empty, Monitors treats all fields except the value field as label fields.
- **Value field**: Specifies which field in the query results serves as the numeric value for threshold evaluation. This is usually a numeric field. Required in **threshold mode**, optional in other modes.
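As a hedged illustration of the field mapping (the `host` and `duration` fields are hypothetical and assume structured logs), a raw query such as the following could be configured with `host` as a label field and `duration` as the value field:

```
_time:5m error | fields _time, host, duration | sort by (_time) desc
```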

### 1.3 Statistics

In "statistics" mode, the `stats` keyword must be used. The relevant configuration items are:

- **Query statement**: Example: `_time:1d | stats by (level) count(*) total`
- **No other parameters**: Note that the query statement must include a `_time` filter condition, such as `_time:5m`, to limit the query time range. Otherwise it queries all data, which may cause performance issues.

---

## 2. Threshold mode

Both the **raw query** and **statistics** query modes can be used. Examples are provided below.

### 2.1 Raw query example

Query statement example:

```
level:ERROR | stats by (level) count(*) total
```

The result looks like:

| level | total |
|-------|-------|
| ERROR | 150   |

Configure the value field as `total` and the label field as `level` (or leave it unconfigured and Monitors will identify it automatically). Example configuration of different thresholds and levels:

- Warning: `$A.total >= 50`, or simply `$A >= 50` (since there is only one value field: total)
- Critical: `$A.total >= 100`, or simply `$A >= 100` (since there is only one value field: total)

### 2.2 Statistics example

Query statement example:

`_time:1d and level:ERROR | stats by (level) count(*) total`

The result follows the Prometheus protocol format:

```
total{level="ERROR"} 150
```

Example configuration of different thresholds and levels:

- Warning: `$A.total >= 50`, or simply `$A >= 50` (since there is only one metric field: total)
- Critical: `$A.total >= 100`, or simply `$A >= 100` (since there is only one metric field: total)

### 2.3 Recovery logic

Similar to the Prometheus / ElasticSearch threshold mode, the VictoriaLogs threshold mode supports:

- **Automatic recovery**: When the latest query result shows that an object's value no longer meets any alert threshold, a recovery event is automatically generated.
- **Specific recovery condition**: A recovery expression can be configured, such as `$A.total < 10`, so that recovery is only considered when the error count drops below 10, reducing flapping.
- **Recovery query**: A separate VictoriaLogs query statement can be configured for recovery evaluation (see the sketch below).
  - How it works: After an alert is triggered, Monitors periodically executes this recovery query. As long as the query returns data (i.e., the result is not empty), the incident is considered recovered.
  - Variable support: The recovery query supports embedded variables (format: `${label_name}`), which are automatically replaced with the corresponding label values from the alert event, allowing the recovery query to target a specific alert object.
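A minimal sketch of a recovery query with variable substitution, assuming the alerting object carries a `level` label (as in the threshold examples above) and that recovery means its error count has dropped below 10:

```
_time:5m and level:${level} | stats by (level) count(*) total | filter total:<10
```

Note that if no logs match `level:${level}` in the window at all, `stats by (level)` produces no rows and this particular query returns nothing, so adjust the sketch to your own data.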

---

## 3. Data exists mode

This mode puts all the filtering logic in the VictoriaLogs query; Monitors only determines "whether data is returned". It is suitable for "alert whenever there is data meeting the conditions" scenarios. **This is the most recommended way to configure VictoriaLogs alerts**: threshold mode assumes data is continuously present with only the value changing, which rarely holds for logs, so log scenarios are better suited to data exists mode.

Query statement example (using the **statistics** query mode):

`_time:15m and level:ERROR | stats by (level) count(*) total | filter total:>10`

Here `| filter total:>10` keeps only rows where `total` is greater than 10. As long as any data row meets this condition, Monitors triggers an alert. If at some point no data row meets the condition, the alert is considered recovered.

---

## 4. No data mode

No data mode is used to monitor situations where "logs that should be continuously generated no longer appear", commonly seen in:

- Application instances that no longer produce logs (possibly because the process exited).
- Log collection pipeline anomalies (such as an agent crash or blocked output).

### 4.1 Configuration example

Query statement (**statistics** mode):

```
_time:15m and level:INFO | stats by (level) count(*) total
```

Scenario: a service should always produce INFO logs. If no INFO logs are generated in the last 15 minutes, an alert is triggered.

## 5. Get original logs when alerting

Alert query conditions typically use the "statistics" mode, which does not return original logs. Monitors supports configuring "related queries" in alert rules to additionally query the original logs when an alert is triggered; a sketch of such a query follows.
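For illustration only (the exact related query depends on your logs), a raw query that pulls the most recent matching log line might look like this:

```
_time:15m and level:ERROR | fields _time, _msg | sort by (_time) desc | limit 1
```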

![](https://docs-cdn.flashcat.cloud/imges/mon/b5c1890d90cecf967695a4b0a4b4fba0.png)

The results of "related queries" can be rendered in the "Note description", for example:

```
{{- if eq $status "firing" }}
triggered value: {{ $value | printf "%.3f" }}
{{- range $x := $relates.R1}}
{{- range $k, $v := $x.Fields }}
{{- if eq $k "_time" }}
{{ $k }} : {{ timeFormat $v "2006-01-02T15:04:05Z07:00" 8 }}
{{- else }}
{{ $k }} : {{ $v }}
{{- end }}
{{- end }}
{{- end}}
{{- else}}
Recovered
{{- end}}
```
