Alerting with Prometheus and Alertmanager
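Overview
- Prometheus compares the data it scrapes against rule files to decide whether an alert should fire. Alertmanager is then configured to deliver the notifications.
- A rule file states the condition: when the value of a PromQL expression meets the configured rule, an alert fires. The rule file is therefore what triggers an alert, while Alertmanager is the tool that delivers it.
Resources
OS | Spec | Hostname | IP | Required software |
---|---|---|---|---|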
CentOS 7.9 | 2C4G | prometheus | 192.168.93.101 | prometheus-2.37.8.linux-amd64.tar.gz |
CentOS 7.9 | 2C4G | alertmanager | 192.168.93.102 | node_exporter-1.6.1.linux-amd64.tar.gz alertmanager-0.26.0.linux-amd64.tar.gz |
Base environment
- Disable the firewall
bash
systemctl stop firewalld
systemctl disable firewalld
- Disable SELinux
bash
setenforce 0
sed -i "s/^SELINUX=.*/SELINUX=disabled/g" /etc/selinux/config
- Set the hostname (run each command on its own host)
bash
# On 192.168.93.101
hostnamectl set-hostname prometheus
# On 192.168.93.102
hostnamectl set-hostname alertmanager
1. Deploy the Prometheus Service
- Role: collects and displays metrics
1.1 Extract the archive
bash
[root@prometheus ~]# tar -zxvf prometheus-2.37.8.linux-amd64.tar.gz
- Move it to the target directory
bash
[root@prometheus ~]# mv prometheus-2.37.8.linux-amd64 /usr/local/prometheus
1.2 Configure systemd startup
bash
[root@prometheus ~]# cat > /etc/systemd/system/prometheus.service << EOF
[Unit]
Description=xinjizhiwa Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--web.enable-lifecycle
ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65535
[Install]
WantedBy=multi-user.target
EOF
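- Reload systemd and start the service
bash
[root@prometheus ~]# systemctl daemon-reload
[root@prometheus ~]# systemctl enable prometheus.service --now
1.3 Listening port
- Prometheus listens on port 9090 by default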
bash
[root@prometheus ~]# netstat -anpt | grep 9090
tcp6 0 0 :::9090 :::* LISTEN 8311/prometheus
tcp6 0 0 ::1:9090 ::1:35776 ESTABLISHED 8311/prometheus
tcp6 0 0 ::1:35776 ::1:9090 ESTABLISHED 8311/prometheus
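A listening socket only shows that the process has bound the port. As a complementary check, the sketch below polls an HTTP endpoint until it answers; Prometheus serves `/-/healthy` for this purpose. The helper name `wait_for_http` and its retry parameters are assumptions made for illustration and are not part of the original setup.

```shell
# wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until it returns an HTTP success status, or gives up.
# Hypothetical helper: the name and defaults are illustrative assumptions.
wait_for_http() {
  local url="$1" attempts="${2:-10}" delay="${3:-1}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP errors as failure, -s: silent, -o: discard the body
    if curl -fs -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (run on the prometheus host):
# wait_for_http http://localhost:9090/-/healthy 10 1 && echo "Prometheus is healthy"
```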
2. Deploy Node Exporter
- Role: collects metrics from the node
2.1 Extract the archive
bash
[root@alertmanager ~]# tar -zxvf node_exporter-1.6.1.linux-amd64.tar.gz
- Move it to the target directory
bash
[root@alertmanager ~]# mv node_exporter-1.6.1.linux-amd64 /usr/local/node_exporter
2.2 Configure systemd startup
bash
[root@alertmanager ~]# cat > /etc/systemd/system/node-exporter.service << EOF
[Unit]
Description=xinjizhiwa node-exporter
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter
ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65535
[Install]
WantedBy=multi-user.target
EOF
- Reload systemd and start the service
bash
[root@alertmanager ~]# systemctl daemon-reload
[root@alertmanager ~]# systemctl enable node-exporter.service --now
2.3 Listening port
- node-exporter listens on port 9100 by default
bash
[root@alertmanager ~]# netstat -anpt | grep 9100
tcp6 0 0 :::9100 :::* LISTEN 8320/node_exporter
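Beyond the port check, it can help to confirm that the exporter actually serves metrics. The sketch below pulls one sample out of the text exposition format. The helper name `metric_value` is an assumption for illustration, and the simple field split only suits series whose label values contain no spaces (such as `node_load1`, which has no labels).

```shell
# metric_value NAME
# Reads Prometheus text exposition format on stdin and prints the value of
# the sample whose series name (first field) equals NAME.
# Hypothetical helper; comment lines start with "#" and never match.
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Example (run from any host that can reach the exporter):
# curl -s http://192.168.93.102:9100/metrics | metric_value node_load1
```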
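3. Configure Prometheus to Scrape the Exporter
- node-exporter gathers the node's metrics and exposes them, waiting for Prometheus to scrape and display them
3.1 Edit the Prometheus configuration file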
bash
[root@prometheus ~]# grep -v "#" /usr/local/prometheus/prometheus.yml
global:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
rule_files:
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
####################################################
  - job_name: "alertmanager"
    static_configs:
      - targets: ["192.168.93.102:9100"]
####################################################
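3.2 Reload Prometheus
- This reload does not go through systemctl. It uses the HTTP reload endpoint, which is available because the unit file starts Prometheus with --web.enable-lifecycle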
bash
[root@prometheus ~]# curl -X POST http://192.168.93.101:9090/-/reload
3.3 Open the Prometheus dashboard
- Click Status > Targets.
- The newly configured scrape target now appears in the list.
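The same information shown on the Targets page is available from the Prometheus HTTP API at `/api/v1/targets`. The sketch below summarizes target health from that response. The helper name `target_health_summary` is an assumption for illustration, and the `grep` pattern assumes the compact JSON the API returns; a JSON-aware tool such as `jq` would be more robust if it is installed.

```shell
# target_health_summary
# Reads the JSON body of /api/v1/targets on stdin and prints one
# "health=count" line per health value found (for example up, down).
# Hypothetical helper; relies on compact JSON without spaces around ":".
target_health_summary() {
  grep -o '"health":"[a-z]*"' \
    | sort \
    | uniq -c \
    | awk '{ gsub(/"/, "", $2); sub(/^health:/, "", $2); print $2 "=" $1 }'
}

# Example:
# curl -s http://192.168.93.101:9090/api/v1/targets | target_health_summary
```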
4. Deploy Alertmanager
- Role: delivers alert notifications, here by email
4.1 Download the package
bash
[root@alertmanager ~]# wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
4.2 Extract the package
bash
[root@alertmanager ~]# tar -zxvf alertmanager-0.26.0.linux-amd64.tar.gz
[root@alertmanager ~]# mv alertmanager-0.26.0.linux-amd64 /usr/local/alertmanager
5. Configure Alertmanager Email Alerts
5.1 Edit the Alertmanager configuration file
bash
[root@alertmanager ~]# cat /usr/local/alertmanager/alertmanager.yml
# 1. Sender settings
global:
  # Time after which an alert is declared resolved if it has not been updated
  resolve_timeout: 5m
  # Sender address
  smtp_from: '2516786946@qq.com'
  # SMTP server host and port
  smtp_smarthost: 'smtp.qq.com:465'
  # SMTP login username (the sender's mailbox account)
  smtp_auth_username: 'wzh@qq.com'
  # Authorization code for the sender's mailbox
  smtp_auth_password: 'sqwmtjbnsbrlebie'
  # Whether to require TLS when sending
  smtp_require_tls: false
  smtp_hello: 'qq.com'
# 2. Routing and notification intervals
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  # Interval between repeated notifications while an alert stays unresolved, for example 5 minutes
  repeat_interval: 5m
  # Which receiver to use; this tutorial uses email
  receiver: 'email'
# 3. Receivers: who gets the alerts
receivers:
  # Receiver name
  - name: 'email'
    email_configs:
      # Recipient address
      - to: '2516786946@qq.com'
        send_resolved: true
inhibit_rules:
  - source_match:
      # Severity to match
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
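5.2 Start Alertmanager
bash
# This command runs in the foreground. --config.file is given explicitly because the default path, alertmanager.yml, is resolved relative to the current directory
[root@alertmanager ~]# /usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml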
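A foreground process stops when the terminal session ends. One option is to manage Alertmanager with systemd, the same way as the other two services in this article. The sketch below is modeled on those unit files; the helper name `write_alertmanager_unit`, the unit name `alertmanager.service`, and the `--storage.path` location are assumptions, not part of the original setup.

```shell
# write_alertmanager_unit [PATH]
# Writes a systemd unit for Alertmanager to PATH
# (default: /etc/systemd/system/alertmanager.service).
# Hypothetical helper; unit name and storage path are illustrative choices.
write_alertmanager_unit() {
  local unit_path="${1:-/etc/systemd/system/alertmanager.service}"
  # Quoted 'EOF' keeps $MAINPID literal so systemd expands it, not the shell
  cat > "$unit_path" << 'EOF'
[Unit]
Description=Alertmanager
After=network.target
[Service]
Restart=on-failure
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65535
[Install]
WantedBy=multi-user.target
EOF
}

# Example (run as root on the alertmanager host):
# write_alertmanager_unit
# systemctl daemon-reload
# systemctl enable alertmanager.service --now
```

Alertmanager re-reads its configuration on SIGHUP, so `systemctl reload alertmanager.service` applies later edits to alertmanager.yml without a restart.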
5.3 Listening port
- Alertmanager listens on port 9093 by default
bash
[root@alertmanager ~]# netstat -anpt | grep 9093
tcp6 0 0 :::9093 :::* LISTEN 8393/alertmanager
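To test the email path before any Prometheus rule exists, an alert can be posted by hand to the Alertmanager v2 API at `/api/v2/alerts`. The sketch below only builds the JSON body. The helper name `test_alert_payload`, the alert name, and the summary text are made up for illustration, and the arguments are inserted without JSON escaping, so they should be plain words.

```shell
# test_alert_payload ALERTNAME SEVERITY
# Prints a JSON array with one alert, suitable as the body of a
# POST to /api/v2/alerts. Hypothetical helper; arguments are not
# JSON-escaped, so avoid quotes and backslashes in them.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]\n' "$1" "$2"
}

# Example (sends the alert to the Alertmanager from this article):
# test_alert_payload TestAlert warning \
#   | curl -s -X POST -H 'Content-Type: application/json' --data @- \
#       http://192.168.93.102:9093/api/v2/alerts
```

If the SMTP settings are correct, the `email` receiver should get a message once `group_wait` has elapsed.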
6. Configure Prometheus Alerting Rules
6.1 Edit the configuration file
bash
[root@prometheus ~]# cat /usr/local/prometheus/prometheus.yml
global:
  scrape_interval: 3s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.93.102:9093   ### Alertmanager address
rule_files:
  - "/usr/local/prometheus/rules.yml"   ### Rule file path
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "alertmanager"
    static_configs:
      - targets: ["192.168.93.102:9100"]
6.2 Edit the rule file
- The rule file defines the value at which monitored data triggers an alert that is sent to Alertmanager
bash
[root@prometheus ~]# cat /usr/local/prometheus/rules.yml
groups:
  - name: wzh-alert
    rules:
      - alert: Node102Down
        # Start alerting when this PromQL expression is true, i.e. up equals 0 (the node is down)
        expr: up{instance="192.168.93.102:9100"} == 0
        # The condition must hold for 3s before the alert fires
        for: 3s
        labels:
          prometheus: wzh
          # IP of the monitored node
          node: 192.168.93.102
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 3s!"
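The time from a node going down to a notification leaving Alertmanager depends on several settings together, not on `for` alone. The sketch below adds them up as a rough upper bound under a simplified model: one scrape interval to notice the failure, one evaluation interval until the rule first sees it, enough evaluation cycles to cover `for`, then `group_wait`. The helper name `alert_delay_upper_bound` is an assumption, and the model ignores scrape timeouts, network latency, and mail delivery time.

```shell
# alert_delay_upper_bound SCRAPE EVAL FOR GROUP_WAIT   (whole seconds)
# Prints a rough upper bound on the delay between a target failing and
# Alertmanager sending the first notification.
# Hypothetical helper implementing a simplified model, not exact behavior.
alert_delay_upper_bound() {
  local scrape="$1" eval_int="$2" for_dur="$3" group_wait="$4"
  # Evaluation cycles needed to cover the "for" duration (ceiling division)
  local for_cycles=$(( (for_dur + eval_int - 1) / eval_int ))
  echo $(( scrape + eval_int + for_cycles * eval_int + group_wait ))
}

# Values from this article: scrape_interval 3s, evaluation_interval 15s,
# for 3s, group_wait 5s
alert_delay_upper_bound 3 15 3 5   # prints 38
```

Because rules are only evaluated every 15s here, a `for` of 3s is effectively rounded up to one evaluation cycle.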
6.3 Check the Prometheus configuration syntax
bash
[root@prometheus ~]# cd /usr/local/prometheus/
bash
[root@prometheus prometheus]# ./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 1 rule files found
SUCCESS: prometheus.yml is valid prometheus config file syntax
Checking /usr/local/prometheus/rules.yml
SUCCESS: 1 rules found
6.4 Reload Prometheus
bash
[root@prometheus prometheus]# curl -X POST http://192.168.93.101:9090/-/reload
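The syntax check and the reload can be chained so that a broken file is never submitted for reload. The sketch below is one way to do that. The helper name `safe_reload` and the `PROMTOOL` override variable are assumptions for illustration; the `promtool check config` command and the `/-/reload` endpoint are the ones used above.

```shell
# Path to promtool; can be overridden through the environment (assumption)
PROMTOOL="${PROMTOOL:-/usr/local/prometheus/promtool}"

# safe_reload CONFIG_FILE BASE_URL
# Validates CONFIG_FILE with promtool and posts to BASE_URL/-/reload
# only if validation succeeds. Hypothetical helper.
safe_reload() {
  local config="$1" base_url="$2"
  if ! "$PROMTOOL" check config "$config"; then
    echo "config check failed; not reloading" >&2
    return 1
  fi
  # -f makes curl return non-zero if Prometheus rejects the reload
  curl -fsS -X POST "${base_url}/-/reload"
}

# Example:
# safe_reload /usr/local/prometheus/prometheus.yml http://192.168.93.101:9090
```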
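7. Simulate a Failure of the Monitored Node
- If no alert is sent after stopping the exporter, try restarting the Prometheus and Alertmanager services, several times if necessary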
bash
[root@alertmanager ~]# systemctl stop node-exporter.service
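After stopping the exporter, the state of the alert can be read from the Prometheus HTTP API at `/api/v1/alerts`, where each alert carries a `state` of `pending` or `firing`. The sketch below counts firing alerts in that response. The helper name `firing_alert_count` is an assumption for illustration, and the pattern assumes the compact JSON the API returns.

```shell
# firing_alert_count
# Reads the JSON body of /api/v1/alerts on stdin and prints how many
# alerts are in the "firing" state. Hypothetical helper.
firing_alert_count() {
  grep -o '"state":"firing"' | wc -l | tr -d ' '
}

# Example (run a little while after stopping node-exporter):
# curl -s http://192.168.93.101:9090/api/v1/alerts | firing_alert_count
#
# Starting the exporter again should clear the alert; with
# send_resolved: true the receiver is also notified of the recovery:
# systemctl start node-exporter.service
```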