Commit ba38921

ysyneu and UlricQin authored
add monitors doc (#100)
* add monitors doc
* feat: add Monitors documentation and translations
  - Polish Chinese Monitors docs following style guidelines
  - Add English translations for Monitors documentation
  - Update introduction pages with Monitors product links
  - Reorganize directory structure (Platform moved from 3 to 5)
  - Add complete Monitors section with Getting Started and FAQ

---------

Co-authored-by: Ulric Qin <ulric.qin@gmail.com>
1 parent c192a16 commit ba38921

14 files changed

Lines changed: 301 additions & 3 deletions

flashduty/en/0. Overview/1. Introduction.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ You can learn more at [On-call](https://docs.flashcat.cloud/en/flashduty/getting
2525
### Monitors (Beta)
2626
Multi-source monitoring and alerting engine that supports alert detection on data from sources such as Prometheus, Elasticsearch, and ClickHouse.
2727

28-
You can experience it in the console.
28+
You can learn more at [Monitors](https://docs.flashcat.cloud/en/flashduty/monitors/introduction?nav=01JCQ7A4N4WRWNXW8EWEHXCMF5).
2929

3030
### RUM (Real User Monitoring, Beta)
3131
Real-time user experience monitoring that collects actual end-user data, analyzes page performance, error rates, and key metrics to help enterprises continuously optimize product experience and increase customer satisfaction.
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
---
2+
title: "Alert Engine (Monitors) Introduction"
3+
description: "The Alert Engine (Monitors) integrates with various metric and log data sources, performs threshold evaluation based on user-configured alert rules through periodic data queries, generates alert events, and finally pushes them to Flashduty On-call for aggregation and delivery."
4+
date: "2025-11-07T18:49:47.055+08:00"
5+
url: "https://docs.flashcat.cloud/en/flashduty/monitors/introduction?nav=01JCQ7A4N4WRWNXW8EWEHXCMF5"
6+
---
7+
8+
## What is the Alert Engine (Monitors)?
9+
10+
The Alert Engine (Monitors) connects to metric and log data sources, periodically queries them and evaluates thresholds according to the alert rules you configure, generates alert events, and pushes them to Flashduty On-call for aggregation and delivery.
11+
12+
Flashduty Monitors can replace the alerting capabilities of products such as Nightingale, vmalert, and elastalert. The alert engine is designed for flexibility and integrates deeply with Flashduty On-call, so it can cover complex alerting requirements.
13+
14+
## Alert Engine (Monitors) Architecture Design
15+
16+
Flashduty is a SaaS product, so it cannot reach data sources inside your private network from the SaaS side. The Alert Engine (Monitors) therefore consists of two parts:
17+
18+
- **SaaS Server**: Responsible for managing alert rules and permissions
19+
- **monitedge**: Deployed inside your private network. It synchronizes alert rules from the SaaS side, periodically queries data sources and evaluates thresholds, then generates alert events and pushes them to the SaaS side
20+
21+
The architecture diagram is shown below:
22+
23+
![Flashduty Monitors Architecture Diagram](https://docs-cdn.flashcat.cloud/imges/mon/a4341737494509d131b637a74399a43c.png)
24+
25+
The diagram assumes that the customer has two data centers, East US and South China, each running one `monitedge` instance. Each instance evaluates alerts for the data sources in its own data center and pushes alert events to the SaaS side.
26+
27+
If you only have one data center, or if the network quality between data centers is good, you can also deploy only one `monitedge` instance to handle alert evaluation for all data sources.
28+
29+
If a single `monitedge` instance is a single point of failure you want to avoid, you can deploy multiple instances as a cluster. For example, deploy 2 `monitedge` instances in the East US data center and start both with `--alerter.clusterName meidong` so that they share one cluster name; deploy 2 more instances in the South China data center and start both with `--alerter.clusterName huanan` to form a second cluster. (`meidong` and `huanan` are the pinyin names for East US and South China.)
30+
31+
The instances in an alert engine cluster automatically shard the alert rules among themselves. For example, if a two-instance cluster needs to process 100 alert rules, the system balances the load so that each `monitedge` instance processes 50 rules. If one instance fails, the other takes over all 100 rules, which preserves high availability while avoiding duplicate alert events.
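
The sharding and failover behavior described above can be illustrated with a minimal Python sketch. The source does not say how `monitedge` actually assigns rules; the modulo scheme, the `assign_rules` function, and the instance names `edge-a` and `edge-b` are assumptions chosen only to reproduce the 50/50 split and the takeover on failure.

```python
from collections import defaultdict


def assign_rules(rule_ids, live_instances):
    """Map each rule to exactly one live instance (hypothetical modulo sharding)."""
    if not live_instances:
        raise ValueError("no live monitedge instances in the cluster")
    # Every instance must derive the same ordering to agree on ownership.
    ordered = sorted(live_instances)
    shards = defaultdict(list)
    for rule_id in rule_ids:
        owner = ordered[rule_id % len(ordered)]
        shards[owner].append(rule_id)
    return dict(shards)


rules = range(1, 101)  # 100 alert rules with hypothetical numeric IDs

# Both instances alive: the rules are split evenly.
both = assign_rules(rules, ["edge-a", "edge-b"])

# edge-a has failed: edge-b owns every rule, and no rule has two owners.
failover = assign_rules(rules, ["edge-b"])
```

Because each rule has exactly one owner at any time, only one instance evaluates it, which is what prevents duplicate alert events in this sketch.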
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
1+
---
2+
title: "Quick Start"
3+
description: "Monitors quick start guide to help you quickly get started with Monitors functionality."
4+
date: "2025-11-07T19:24:43.956+08:00"
5+
url: "https://docs.flashcat.cloud/en/flashduty/monitors/installation?nav=01JCQ7A4N4WRWNXW8EWEHXCMF5"
6+
---
7+
8+
Getting started with Monitors takes three core steps: install `monitedge`, create data sources, and create alert rules.
9+
10+
## 1. Install monitedge
11+
12+
`monitedge` must be deployed inside your private network. It synchronizes alert rules from the SaaS side, periodically queries data sources and evaluates thresholds, then generates alert events and pushes them to the SaaS side. Alerting does not work until `monitedge` is installed.
13+
14+
Menu entry: Alert Engine → Engine Installation/Upgrade. You can choose any of the three installation methods: Linux, Docker, or Kubernetes.
15+
16+
Pay special attention to the **Engine Cluster Name**. `monitedge` instances that share the same **Engine Cluster Name** form a cluster and shard the alert rules among themselves, which avoids a single point of failure. If you plan to run only one `monitedge` cluster, you can keep the default name `default`. If you plan to run several clusters, for example one for the East US data center and one for the South China data center, give each cluster a different engine cluster name.
17+
18+
![Alert Engine Installation](https://docs-cdn.flashcat.cloud/imges/mon/64af7649b50f8ba705da9067afffc36a.png)
19+
20+
### Alert Engine Status
21+
22+
After the alert engine `monitedge` is installed, it will automatically connect to the SaaS side and periodically synchronize alert rules. You can view the current alert engine status information on the alert engine status page.
23+
24+
Instances that have not sent a heartbeat for a long time display a delete button. Click it to remove these long-inactive instances from the system so that they do not trigger engine disconnection alerts.
25+
26+
### Engine Disconnection Alerts
27+
28+
If the alert engine (`monitedge`) fails, alert rules stop being evaluated, so Monitors provides engine disconnection alerts that notify you promptly when the engine goes down. For a cluster of multiple instances, the disconnection alert is not triggered as long as at least one instance is alive, because the cluster can still work normally.
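
The cluster-level rule above ("alert only when no instance is alive") can be sketched as follows. This is an illustration, not the product's implementation: the `cluster_disconnected` function, the instance names, and the 3-minute heartbeat timeout are assumptions, since the source does not state how long an instance may stay silent before it counts as disconnected.

```python
from datetime import datetime, timedelta


def cluster_disconnected(last_heartbeats, now, timeout=timedelta(minutes=3)):
    """Return True only when no instance has sent a heartbeat within `timeout`.

    `last_heartbeats` maps an instance name to the time of its last heartbeat.
    The timeout value is a hypothetical default.
    """
    return not any(now - seen <= timeout for seen in last_heartbeats.values())


now = datetime(2025, 11, 7, 19, 0, 0)

# One instance is stale, the other is fresh: the cluster still works.
partial_outage = {
    "edge-a": now - timedelta(minutes=30),
    "edge-b": now - timedelta(seconds=20),
}

# Both instances are stale: the disconnection alert should fire.
full_outage = {
    "edge-a": now - timedelta(minutes=30),
    "edge-b": now - timedelta(minutes=10),
}

alert_on_partial = cluster_disconnected(partial_outage, now)
alert_on_full = cluster_disconnected(full_outage, now)
```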
29+
30+
## 2. Create Data Sources
31+
32+
Menu entry: Data Sources, click the **New** button to create a data source.
33+
34+
![Create Data Source](https://docs-cdn.flashcat.cloud/imges/mon/3eee06165afe05405db6187205f7025c.png)
35+
36+
The two most critical configuration items:
37+
38+
- **Associated Alert Engine**: Through this configuration item, specify which alert engine cluster will perform data queries and alert evaluation for this data source. Usually, select the alert engine cluster in the same data center.
39+
- **Data Source Connection Address**: This address is for `monitedge` to connect to, and must be an address that `monitedge` can access. Usually, this is an internal network address.
40+
41+
## 3. Create Alert Rules
42+
43+
Menu entry: Alert Rules.
44+
45+
As the number of alert rules grows, they need to be categorized and managed. Monitors organizes alert rules in a tree of groups, and every alert rule must belong to a group. Create the groups first, then create alert rules under them.
46+
47+
The following details the various configurations of alert rules. Each field usually has help tips next to it. You can hover your mouse over the help tip icons to view specific instructions.
48+
49+
### Basic Configuration
50+
51+
![Basic Configuration](https://docs-cdn.flashcat.cloud/imges/mon/3a2978a22d7a23dd862fdbd409adf663.png)
52+
53+
- **Rule Name**: The name of the alert rule, for easy identification and management. Variable references are not supported because names may be used for filtering, aggregation and other operations in the future, and fixed names are more convenient for processing.
54+
- **Additional Labels**: Similar to `labels` in Prometheus alert rules, they will be attached to all alert events generated by this rule, facilitating filtering, routing, inhibition and other operations in On-call.
55+
56+
### Data Source Selection
57+
58+
![Data Source Selection](https://docs-cdn.flashcat.cloud/imges/mon/9971af45b4bc19bfe807898bf1bf10a0.png)
59+
60+
Monitors can make one rule effective for multiple data sources, and wildcards can be used, such as `db-*`, indicating that this rule will apply to all data sources whose names start with `db-`.
61+
62+
> ⚠️ Note: Because wildcards need to be supported here for data sources, data source names are stored instead of data source IDs. If the data source name is modified, it will affect the effectiveness of alert rules. Please be cautious when modifying data source names.
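
The effect of renaming a data source can be shown with a short Python sketch. It assumes shell-style glob semantics for the wildcard, which is consistent with the `db-*` example but is not confirmed by the source; the `matching_sources` helper and the data source names are hypothetical.

```python
from fnmatch import fnmatchcase


def matching_sources(pattern, source_names):
    """Return the data source names that a rule's pattern applies to."""
    return [name for name in source_names if fnmatchcase(name, pattern)]


sources = ["db-mysql-01", "db-redis-01", "web-prom-01"]
before = matching_sources("db-*", sources)

# Renaming a data source silently removes it from the rule's scope,
# because the rule stores the name pattern rather than the data source ID.
renamed = ["mysql-01" if name == "db-mysql-01" else name for name in sources]
after = matching_sources("db-*", renamed)
```

In this sketch the rule covers two data sources before the rename and only one afterwards, with no error raised, which is why renames deserve caution.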
63+
64+
### Query Detection Method
65+
66+
![Query Detection Method](https://docs-cdn.flashcat.cloud/imges/mon/4e38b46952d6cbbfb97c2d28843dcdbe.png)
67+
68+
This section is used to configure how to query data from data sources and how to determine alert conditions. This functionality is designed to be very flexible, which also brings higher complexity. Please read the usage instructions on the right side of **Query Detection Method** on the page to understand the configuration method.
69+
70+
### Detection Frequency & Effective Time
71+
72+
![Detection Frequency & Effective Time](https://docs-cdn.flashcat.cloud/imges/mon/0980d71a653985a1706243fc6795685e.png)
73+
74+
- **Detection Frequency**: Rules are usually evaluated at a fixed interval, but you can also configure a `cron` expression. `cron` expressions in Monitors are accurate to the second.
75+
- **Effective Time**: Configure the effective time period for alert rules. Alerts will not be triggered during non-effective time periods.
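
To make "accurate to the second" concrete, here is a minimal Python sketch of a second-precision `cron` matcher. The six-field layout (second, minute, hour, day of month, month, day of week) is an assumption borrowed from common second-precision cron libraries; the source does not document the exact syntax Monitors accepts. The matcher is deliberately simplified: it supports only `*`, `*/n`, and comma-separated numbers, and it requires every field to match.

```python
from datetime import datetime


def field_matches(spec, value):
    """Match one cron field supporting `*`, `*/n`, and comma-separated numbers."""
    if spec == "*":
        return True
    if spec.startswith("*/"):
        return value % int(spec[2:]) == 0
    return value in {int(part) for part in spec.split(",")}


def cron_matches(expr, when):
    """Check a six-field expression (assumed layout: sec min hour dom mon dow)."""
    specs = expr.split()
    if len(specs) != 6:
        raise ValueError("expected six fields: sec min hour dom mon dow")
    # isoweekday() % 7 maps Sunday to 0, the usual cron convention.
    values = (when.second, when.minute, when.hour,
              when.day, when.month, when.isoweekday() % 7)
    return all(field_matches(s, v) for s, v in zip(specs, values))


# Every 15 seconds.
every_15s = "*/15 * * * * *"
# 09:30:00 on Mondays.
monday_morning = "0 30 9 * * 1"

hit = cron_matches(every_15s, datetime(2025, 11, 7, 18, 49, 45))
miss = cron_matches(every_15s, datetime(2025, 11, 7, 18, 49, 47))
monday = cron_matches(monday_morning, datetime(2025, 11, 10, 9, 30, 0))
```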
76+
77+
### Event Configuration
78+
79+
- **Custom Fields**: Similar to `annotations` in Prometheus alert rules, they will be attached to all alert events generated by this rule, such as attaching dashboard URLs, SOP URLs, etc.
80+
- **Associated Query**: The results of an associated query are not used to evaluate the alert threshold, but they can be referenced as variables in the remark description, which gives on-call personnel more context for troubleshooting. For example, suppose a rule triggers when the number of Error logs in the last 5 minutes is greater than 0 and the current count is 1000. To attach a log sample to the alert event, use an associated query.
81+
- **Remark Description**: This field is extremely critical. It is an unstructured text field that supports variable references. Alert events will display the content of this field, facilitating rapid positioning and problem handling by on-call personnel. For specific configuration methods, please refer to the usage instructions on the right side of **Remark Description**.
82+
- **Channel**: Refers to a channel in Flashduty On-call. If a channel is specified, alert events are sent to that channel. If none is specified, alert events are sent to the integration, and the routing rules configured in the integration decide which channels receive them. For details, refer to the instructions on the right side of **Channel**.
83+
- **Repeat Notification**: If an alert does not recover, notifications can be repeated at a specified interval. You can also set the maximum number of notifications, which defaults to 10000.
84+
85+
> ⚠️ Note: The maximum number of notifications is not the number of messages that end users receive. Alert events generated by Monitors are delivered to On-call, which may aggregate them and apply noise reduction. The final number of messages sent to end users depends on the On-call configuration.
86+
87+
## 4. Results
88+
89+
After completing the above configuration, if alert conditions are triggered, alert events will be generated, and the status in front of the alert rule will also change to `Triggered`.
90+
91+
![Alert Rules List Page](https://docs-cdn.flashcat.cloud/imges/mon/6f1fad7d65b5aee6b89bf0f0a564a1be.png)
92+
93+
Clicking `Triggered` will show the alert events generated by this rule (you can also view them in On-call):
94+
95+
![Alert Events List](https://docs-cdn.flashcat.cloud/imges/mon/3a307bf41012c5e085d81ca8b2dc443b.png)
96+
97+
Click the alert event title to open the alert event details, which are divided into three tabs: **Alert Overview**, **Timeline**, and **Associated Events**. These are features of the On-call system, and the fields are self-explanatory, so they are not described individually here.
98+
99+
## 5. Import Alert Rules
100+
101+
If you already have a batch of Prometheus alert rules and want to quickly import them into Monitors for use, you can use the alert rule import function. Menu entry: Alert Rules → Import.
102+
103+
![Import Alert Rules](https://docs-cdn.flashcat.cloud/imges/mon/a613b20d1aeaf7321be5ab43bf07a83f.png)
104+
105+
The import accepts Prometheus alert rules as YAML text in the standard Prometheus rule file format, with `groups` as the root node. The YAML indentation must be correct, otherwise the import fails.
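
As an illustration of the expected input, the Python sketch below embeds a small rule file in the standard Prometheus format and runs a pre-flight check before you paste it into the import dialog. The rule itself (`HighErrorRate`, its expression and labels) is a made-up example, and `preflight` is a hypothetical helper, not part of Monitors. It checks only two things, that `groups` is the root node and that no indentation uses tabs, so it is not a full YAML validation.

```python
RULES = """\
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: High 5xx rate on {{ $labels.instance }}
"""


def preflight(text):
    """Return a list of problems found in Prometheus rule YAML text."""
    problems = []
    # Ignore blank lines, comments, and the optional document start marker.
    content = [
        line for line in text.splitlines()
        if line.strip()
        and not line.lstrip().startswith("#")
        and line.strip() != "---"
    ]
    if not content or not content[0].startswith("groups:"):
        problems.append("root node must be `groups`")
    for number, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append(f"line {number}: tab character in indentation")
    return problems


ok = preflight(RULES)
bad = preflight("rules:\n\t- alert: X\n")
```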
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
1+
---
2+
title: "Alert Engine (Monitors) FAQ"
3+
date: "2025-11-07T20:56:02.070+08:00"
4+
url: "https://docs.flashcat.cloud/en/flashduty/monitors/faq?nav=01JCQ7A4N4WRWNXW8EWEHXCMF5"
5+
---
6+
7+
1. **An alert rule does not behave as expected. How do I debug it?**
8+
- On the alert rules list page, there is a debug log switch. You can turn on this switch, after which the alert engine process (`monitedge`) will output detailed execution logs for this rule, making it convenient for you to troubleshoot issues.
9+
- `monitedge` outputs logs to standard output, i.e., `stdout`.
10+
- If `monitedge` is running via Docker, you can view log content through the `docker logs <container_id>` command.
11+
- If `monitedge` is running via Kubernetes, you can view log content through the `kubectl logs <pod_name>` command.
12+
- If `monitedge` is started via systemd (the Linux installation mode), you can view the logs with the `journalctl -u monitedge.service -f` command, or read the `/var/log/messages` file directly. Following cloud-native practice, `monitedge` writes logs only to standard output rather than to separate log files, which makes the logs easier to collect, rotate, and compress with your existing log tooling.
File renamed without changes.
File renamed without changes.
File renamed without changes.

flashduty/zh/0. 概览/1. 简介.md

Lines changed: 2 additions & 2 deletions
@@ -23,9 +23,9 @@ Flashduty 不是夜莺(Nightingale)。夜莺作为 Flashcat 主要贡献的
2323
您可前往[On-call](https://docs.flashcat.cloud/zh/flashduty/getting-started)了解更多。
2424

2525
### Monitors (Beta)
26-
多源监控与告警引擎,支持对Prometheus、Elasticsearch、Clickhouse等各种来源数据进行告警判断
26+
多源监控与告警引擎,支持对 Prometheus、Elasticsearch、Clickhouse 等各种来源数据进行告警判断
2727

28-
您可前往控制台进行体验
28+
您可前往[Monitors](https://docs.flashcat.cloud/zh/flashduty/monitors/introduction)了解更多
2929

3030
### RUM (Real User Monitoring, Beta)
3131
实时用户体验监控,采集终端用户真实数据,分析页面性能、错误率与关键指标,帮助企业持续优化产品体验,提升客户满意度。
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
1+
---
2+
title: "告警引擎(Monitors)介绍"
3+
description: "告警引擎(Monitors)对接各类指标、日志数据源,根据用户配置的告警规则,周期性查询数据并做阈值判定,进而产生告警事件,最后推送给 Flashduty On-call 做聚合发送。"
4+
date: "2025-11-07T18:49:47.055+08:00"
5+
url: "https://docs.flashcat.cloud/zh/flashduty/monitors/introduction"
6+
---
7+
8+
## 什么是告警引擎(Monitors)?
9+
10+
告警引擎(Monitors)对接各类指标、日志数据源,根据您配置的告警规则,周期性查询数据并进行阈值判定,进而产生告警事件,最后推送给 Flashduty On-call 进行聚合发送。
11+
12+
Flashduty Monitors 可以替代 Nightingale、vmalert、elastalert 等产品的告警能力。Monitors 的告警引擎设计极为灵活,深度整合 On-call 产品,能够满足各种复杂的告警需求。
13+
14+
## 告警引擎(Monitors)架构设计
15+
16+
Flashduty 是一个 SaaS 服务,无法从 SaaS 侧访问用户私有网络内的数据源,因此告警引擎(Monitors)包含两部分:
17+
18+
- **SaaS 服务端**:负责管理告警规则、管理权限
19+
- **monitedge**:部署在用户私有网络内,从 SaaS 同步告警规则,周期性查询数据源并进行阈值判定,产生告警事件并推送给 SaaS 端
20+
21+
架构图如下所示:
22+
23+
![Flashduty Monitors 架构图](https://docs-cdn.flashcat.cloud/imges/mon/a4341737494509d131b637a74399a43c.png)
24+
25+
示意图中假设客户有两个机房,美东机房和华南机房,每个机房内都部署了一个 `monitedge` 实例,分别负责各自机房内数据源的告警判定,并将告警事件推送给 SaaS 端。
26+
27+
如果您只有一个机房,或者机房间网络质量很好,也可以只部署一个 `monitedge` 实例,负责所有数据源的告警判定。
28+
29+
如果部署一个 `monitedge` 担心单点故障风险,也可以部署多个 `monitedge` 实例组成集群。比如美东机房部署 2 个 `monitedge` 实例组成集群,实例启动时通过 `--alerter.clusterName meidong` 参数设置相同的集群名字;华南机房部署 2 个 `monitedge` 实例组成另一个集群,这两个实例启动时通过 `--alerter.clusterName huanan` 参数设置另一个集群名字。
30+
31+
一个告警引擎集群中的多个实例会自动分片处理告警规则。比如这个集群要处理 100 条告警规则,系统会自动均衡,让每一个 `monitedge` 实例分别处理 50 条。如果其中一个实例挂掉,另一个实例会接管所有的这 100 条告警规则的处理,既保证了高可用,又避免了告警事件重复发送。
32+
33+
