Table of contents
- I. Deploying the Alertmanager components
- II. Testing email alerts
- III. DingTalk group / WeCom group alerts
- Summary
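Prometheus implements alerting through the Alertmanager component. Prometheus evaluates the metrics it collects against the alerting rules; when a rule's condition is met, it sends the alert event to Alertmanager, which then delivers it to the configured receivers.
Steps:
- Deploy Alertmanager
- Configure the alert receivers
- Configure Prometheus to communicate with Alertmanager
- Create alerting rules in Prometheus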
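The Prometheus side of steps 3 and 4 can be sketched as the fragment below. It assumes Prometheus runs in the same cluster and reaches Alertmanager through the `alertmanager` Service in the `ops` namespace on port 9093 (the Service defined in this article); the `rule_files` path is a hypothetical mount location and should be adapted to your own Prometheus deployment.

```yaml
# prometheus.yml (fragment) - a sketch, not a complete configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # <service>.<namespace>:<port> of the Alertmanager Service
            - alertmanager.ops:9093

rule_files:
  # Hypothetical path: point this at wherever your rule files are mounted
  - /etc/config/rules/*.rules
```

After changing this file, Prometheus has to reload its configuration before the new Alertmanager target and rule files take effect.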
I. Deploying the Alertmanager components
1. alertmanager-config
# alertmanager-config.yaml: the main configuration file, which holds Alertmanager's alerting settings
yaml
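apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: ops
data:
  alertmanager.yml: |
    global:
      # Resolve timeout: if an alert receives no update from Prometheus for 5m,
      # it is treated as resolved and a resolved notification is sent
      resolve_timeout: 5m
      # SMTP server
      smtp_smarthost: 'smtp.exmail.qq.com:465'
      # Sender address for alert mail
      smtp_from: 'fanxxxxuai@cxxxxne.com'
      # Login username of the sender mailbox
      smtp_auth_username: 'fanxxxxuai@cxxxxne.com'
      # Authorization code of the sender mailbox (for WeCom enterprise mail, the sender's login password)
      smtp_auth_password: '123456'
      # Disable STARTTLS. It is enabled by default; leaving it on causes the error described in the Summary
      smtp_require_tls: false
    # Notification templates
    templates:
      - '/etc/alertmanager/msg-tmpl/*.tmpl'
    # Root route
    route:
      # Default receiver
      receiver: 'mail-receiver'
      # Group alerts by the values of the cluster and alertname labels
      group_by: [cluster, alertname]
      # When a new group is created, wait 30s so that other alerts in the same group are sent together in the first notification
      group_wait: 30s
      # Interval between notifications for a group that has already been notified, e.g. when new alerts join it
      group_interval: 5m
      # If the alerts are still firing and unchanged, wait 30m before sending the notification again
      repeat_interval: 30m
      # A repeat goes out on the first group_interval tick after repeat_interval has elapsed,
      # so repeated notifications arrive roughly 30m to 35m apart here
    # Receivers
    receivers:
      - name: 'mail-receiver'
        # Deliver by email
        email_configs:
          - to: 'fanxxxxuai@cxxxxne.com'
            send_resolved: true
            html: '{{ template "emailMessage" . }}'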
2. alertmanager-message-tmpl
# alertmanager-message-tmpl.yaml: notification template (email)
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-message-tmpl
  namespace: ops
data:
  email.tmpl: |
    {{ define "emailMessage" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{- range $index, $alert := .Alerts.Firing -}}
    {{- if eq $index 0 }}
    ------ 告警问题 ------<br>
    告警状态:{{ .Status }}<br>
    告警级别:{{ .Labels.severity }}<br>
    告警名称:{{ .Labels.alertname }}<br>
    故障实例:{{ .Labels.instance }}<br>
    告警概要:{{ .Annotations.summary }}<br>
    告警详情:{{ .Annotations.description }}<br>
    故障时间:{{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
    ------ END ------<br>
    {{- end }}
    {{- end }}
    {{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{- range $index, $alert := .Alerts.Resolved -}}
    {{- if eq $index 0 }}
    ------ 告警恢复 ------<br>
    告警状态:{{ .Status }}<br>
    告警级别:{{ .Labels.severity }}<br>
    告警名称:{{ .Labels.alertname }}<br>
    恢复实例:{{ .Labels.instance }}<br>
    告警概要:{{ .Annotations.summary }}<br>
    告警详情:{{ .Annotations.description }}<br>
    故障时间:{{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
    恢复时间:{{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
    ------ END ------<br>
    {{- end }}
    {{- end }}
    {{- end }}
    {{- end }}
3. alertmanager
# alertmanager.yaml: deploys Alertmanager
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
        # Sidecar that hot-reloads the configuration file
        - name: prometheus-alertmanager-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: config
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
        - name: alertmanager
          image: "prom/alertmanager:latest"
          ports:
            - containerPort: 9093
          readinessProbe:
            httpGet:
              # Alertmanager's readiness endpoint
              path: /-/ready
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          livenessProbe:
            httpGet:
              # Alertmanager's health endpoint
              path: /-/healthy
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /etc/alertmanager
            - name: message-tmpl
              mountPath: /etc/alertmanager/msg-tmpl
            - name: data
              mountPath: /data
            - name: timezone
              mountPath: /etc/localtime
      volumes:
        - name: config
          configMap:
            name: alertmanager-config
        - name: message-tmpl
          configMap:
            name: alertmanager-message-tmpl
        - name: data
          persistentVolumeClaim:
            claimName: alertmanager-data
        - name: timezone
          hostPath:
            path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager-data
  namespace: ops
spec:
  storageClassName: "managed-nfs-storage"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: ops
spec:
  type: NodePort
  ports:
    - name: http
      port: 9093
      protocol: TCP
      targetPort: 9093
      nodePort: 30093
  selector:
    app: alertmanager
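Once the deployment is complete, the Alertmanager web UI is available at IP:30093.
II. Testing email alerts
At this point you can start a pod to test alerting.
Set its resource requests higher than the cluster can provide, so that the pod stays in the Pending state,
as follows: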
test-alertmanager.yaml
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          ports:
            - containerPort: 80
          resources:
            requests:
              memory: "24Gi"
              cpu: "12000m"
            limits:
              memory: "24Gi"
              cpu: "12000m"
Start this Pod and observe the result.
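A Pending pod only produces a notification if Prometheus has a rule that matches it. The following is a minimal sketch of such a rule, assuming kube-state-metrics is installed and scraped (it provides the `kube_pod_status_phase` metric). The group name, alert name, `for` duration, and annotation texts are illustrative choices, not taken from the original setup.

```yaml
# pod-rules.yml - hypothetical rule file; requires kube-state-metrics
groups:
  - name: pod-status
    rules:
      - alert: PodPending
        # 1 for the phase a pod is currently in, 0 otherwise
        expr: kube_pod_status_phase{phase="Pending"} == 1
        # Only fire if the pod has stayed Pending for a full minute
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is Pending"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been Pending for more than 1 minute."
```

The `summary` and `description` annotations are the fields that the email template prints as 告警概要 and 告警详情.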
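The email alert looks like this:
III. DingTalk group / WeCom group alerts
(A custom webhook robot is all that is needed.)
Prometheus (Alertmanager) currently has no built-in integration for DingTalk or WeCom group robots, so you need to write your own webhook that converts the payload, or use one that somebody else has written,
for example: https://github.com/bougou/alertmanager-webhook-adapter
3.1 Add a DingTalk group robot
Once the robot is created it has a webhook address; the token in that address is used later.
3.2 Add a WeCom group robot
Once the robot is created it has a webhook address; the key in that address is used later.
3.3 Deploy alertmanager-webhook-adapter
message-tmpl
# message-tmpl.yaml: notification templates; I put the DingTalk and WeCom templates in the same ConfigMap
Templates are taken from: https://github.com/bougou/alertmanager-webhook-adapter/tree/main/pkg/models/templates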
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: message-tmpl
  namespace: ops
data:
  ####################################################################################################################
  dingding.tmpl: |
    {{ define "__subject" -}}
    【{{ .Signature }}】
    {{- if eq (index .Alerts 0).Labels.severity "ok" }} OK{{ end }}
    {{- if eq (index .Alerts 0).Labels.severity "info" }} INFO{{ end }}
    {{- if eq (index .Alerts 0).Labels.severity "warning" }} WARNING{{ end }}
    {{- if eq (index .Alerts 0).Labels.severity "error" }} ERROR{{ end }}
    {{- ` • ` }}
    {{- if .CommonLabels.alertname_cn }}{{ .CommonLabels.alertname_cn }}{{ else if .CommonLabels.alertname_custom }}{{ .CommonLabels.alertname_custom }}{{ else if .CommonAnnotations.alertname }}{{ .CommonAnnotations.alertname }}{{ else }}{{ .GroupLabels.alertname }}{{ end }}
    {{- ` • ` }}
    {{- if gt (.Alerts.Firing|len) 0 }}告警中:{{ .Alerts.Firing|len }}{{ end }}
    {{- if and (gt (.Alerts.Firing|len) 0) (gt (.Alerts.Resolved|len) 0) }}/{{ end }}
    {{- if gt (.Alerts.Resolved|len) 0 }}已恢复:{{ .Alerts.Resolved|len }}{{ end }}
    {{ end }}
    {{ define "__externalURL" -}}
    {{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}
    {{- end }}
    {{ define "__alertinstance" -}}
    {{- if ne .Labels.alertinstance nil -}}{{ .Labels.alertinstance }}
    {{- else if ne .Labels.instance nil -}}{{ .Labels.instance }}
    {{- else if ne .Labels.node nil -}}{{ .Labels.node }}
    {{- else if ne .Labels.nodename nil -}}{{ .Labels.nodename }}
    {{- else if ne .Labels.host nil -}}{{ .Labels.host }}
    {{- else if ne .Labels.hostname nil -}}{{ .Labels.hostname }}
    {{- else if ne .Labels.ip nil -}}{{ .Labels.ip }}
    {{- end -}}
    {{- end }}
    {{ define "__alert_list" }}
    {{ range . }}
    ---
    > **告警名称**: {{ if .Labels.alertname_cn }}{{ .Labels.alertname_cn }}{{ else if .Labels.alertname_custom }}{{ .Labels.alertname_custom }}{{ else if .Annotations.alertname }}{{ .Annotations.alertname }}{{ else }}{{ .Labels.alertname }}{{ end }}
    >
    > **告警级别**: {{ ` ` }}
    {{- if eq .Labels.severity "ok" }}OK{{ end -}}
    {{- if eq .Labels.severity "info" }}INFO{{ end -}}
    {{- if eq .Labels.severity "warning" }}WARNING{{ end -}}
    {{- if eq .Labels.severity "error" }}ERROR{{ end }}
    >
    > **告警实例**: `{{ template "__alertinstance" . }}`
    >
    {{- if .Labels.region }}
    > **地域**: {{ .Labels.region }}
    >
    {{- end }}
    {{- if .Labels.zone }}
    > **可用区**: {{ .Labels.zone }}
    >
    {{- end }}
    {{- if .Labels.product }}
    > **产品**: {{ .Labels.product }}
    >
    {{- end }}
    {{- if .Labels.component }}
    > **组件**: {{ .Labels.component }}
    >
    {{- end }}
    > **告警状态**: {{ if eq .Status "firing" }}🚨{{ else }}✅{{ end }} {{ .Status | toUpper }}
    >
    > **开始时间**: {{ .StartsAt.Format "2006-01-02T15:04:05Z07:00" }}
    >
    > **结束时间**: {{ if .EndsAt.After .StartsAt }}{{ .EndsAt.Format "2006-01-02T15:04:05Z07:00" }}{{ else }}Not End{{ end }}
    >
    {{- if eq .Status "firing" }}
    > 告警描述: {{ if .Annotations.description_cn }}{{ .Annotations.description_cn }}{{ else }}{{ .Annotations.description }}{{ end }}
    >
    {{- end }}
    {{ end }}
    {{ end }}
    {{ define "__alert_summary" }}
    {{ range . }}- {{ template "__alertinstance" . }}
    {{ end }}
    {{ end }}
    {{ define "prom.title" }}
    {{ template "__subject" . }}
    {{ end }}
    {{ define "prom.markdown" }}
    {{ .MessageAt.Format "2006-01-02T15:04:05Z07:00" }}
    #### **摘要**
    {{ if gt (.Alerts.Firing|len ) 0 }}
    ##### **🚨 触发中告警 [{{ .Alerts.Firing|len }}]**
    {{ template "__alert_summary" .Alerts.Firing }}
    {{ end }}
    {{ if gt (.Alerts.Resolved|len) 0 }}
    ##### **✅ 已恢复告警 [{{ .Alerts.Resolved|len }}]**
    {{ template "__alert_summary" .Alerts.Resolved }}
    {{ end }}
    #### **详请**
    {{ if gt (.Alerts.Firing|len ) 0 }}
    ##### **🚨 触发中告警 [{{ .Alerts.Firing|len }}]**
    {{ template "__alert_list" .Alerts.Firing }}
    {{ end }}
    {{ if gt (.Alerts.Resolved|len) 0 }}
    ##### **✅ 已恢复告警 [{{ .Alerts.Resolved|len }}]**
    {{ template "__alert_list" .Alerts.Resolved }}
    {{ end }}
    {{ end }}
    {{ define "prom.text" }}
    {{ template "prom.markdown" . }}
    {{ end }}
  ####################################################################################################################
  wechat.tmpl: |
    {{ define "__subject" -}}
    【{{ .Signature }}】
    {{- if eq (index .Alerts 0).Labels.severity "ok" }} OK{{ end }}
    {{- if eq (index .Alerts 0).Labels.severity "info" }} INFO{{ end }}
    {{- if eq (index .Alerts 0).Labels.severity "warning" }} WARNING{{ end }}
    {{- if eq (index .Alerts 0).Labels.severity "error" }} ERROR{{ end }}
    {{- ` • ` }}
    {{- if .CommonLabels.alertname_cn }}{{ .CommonLabels.alertname_cn }}{{ else if .CommonLabels.alertname_custom }}{{ .CommonLabels.alertname_custom }}{{ else if .CommonAnnotations.alertname }}{{ .CommonAnnotations.alertname }}{{ else }}{{ .GroupLabels.alertname }}{{ end }}
    {{- ` • ` }}
    {{- if gt (.Alerts.Firing|len) 0 }}告警中:{{ .Alerts.Firing|len }}{{ end }}
    {{- if and (gt (.Alerts.Firing|len) 0) (gt (.Alerts.Resolved|len) 0) }}/{{ end }}
    {{- if gt (.Alerts.Resolved|len) 0 }}已恢复:{{ .Alerts.Resolved|len }}{{ end }}
    {{ end }}
    {{ define "__externalURL" -}}
    {{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}
    {{- end }}
    {{ define "__alertinstance" -}}
    {{- if ne .Labels.alertinstance nil -}}{{ .Labels.alertinstance }}
    {{- else if ne .Labels.instance nil -}}{{ .Labels.instance }}
    {{- else if ne .Labels.node nil -}}{{ .Labels.node }}
    {{- else if ne .Labels.nodename nil -}}{{ .Labels.nodename }}
    {{- else if ne .Labels.host nil -}}{{ .Labels.host }}
    {{- else if ne .Labels.hostname nil -}}{{ .Labels.hostname }}
    {{- else if ne .Labels.ip nil -}}{{ .Labels.ip }}
    {{- end -}}
    {{- end }}
    {{ define "__alert_list" }}
    {{ range . }}
    > <font color="comment"> 告警名称 </font>: {{ if .Labels.alertname_cn }}{{ .Labels.alertname_cn }}{{ else if .Labels.alertname_custom }}{{ .Labels.alertname_custom }}{{ else if .Annotations.alertname }}{{ .Annotations.alertname }}{{ else }}{{ .Labels.alertname }}{{ end }}
    >
    > <font color="comment"> 告警级别 </font>:{{ ` ` }}
    {{- if eq .Labels.severity "ok" }}OK{{ end -}}
    {{- if eq .Labels.severity "info" }}INFO{{ end -}}
    {{- if eq .Labels.severity "warning" }}WARNING{{ end -}}
    {{- if eq .Labels.severity "error" }}ERROR{{ end }}
    >
    > <font color="comment"> 实例 </font>: `{{ template "__alertinstance" . }}`
    >
    {{- if .Labels.region }}
    > <font color="comment"> 地域 </font>: {{ .Labels.region }}
    >
    {{- end }}
    {{- if .Labels.zone }}
    > <font color="comment"> 可用区 </font>: {{ .Labels.zone }}
    >
    {{- end }}
    {{- if .Labels.product }}
    > <font color="comment"> 产品 </font>: {{ .Labels.product }}
    >
    {{- end }}
    {{- if .Labels.component }}
    > <font color="comment"> 组件 </font>: {{ .Labels.component }}
    >
    {{- end }}
    > <font color="comment"> 告警状态 </font>: {{ if eq .Status "firing" }}🚨{{ else }}✅{{ end }} <font color="{{ if eq .Status "firing" }}warning{{ else }}info{{ end }}">{{ .Status | toUpper }}</font>
    >
    > <font color="comment"> 开始时间 </font>: {{ .StartsAt.Format "2006-01-02T15:04:05Z07:00" }}
    >
    > <font color="comment"> 结束时间 </font>: {{ if .EndsAt.After .StartsAt }}{{ .EndsAt.Format "2006-01-02T15:04:05Z07:00" }}{{ else }}Not End{{ end }}
    {{- if eq .Status "firing" }}
    >
    > <font color="comment"> 告警描述 </font>: {{ if .Annotations.description_cn }}{{ .Annotations.description_cn }}{{ else }}{{ .Annotations.description }}{{ end }}
    {{- end }}
    {{ end }}
    {{ end }}
    {{ define "__alert_summary" -}}
    {{ range . }}
    <font color="{{ if eq .Status "firing" }}warning{{ else }}info{{ end }}">{{ template "__alertinstance" . }}</font>
    {{ end }}
    {{ end }}
    {{ define "prom.title" -}}
    {{ template "__subject" . }}
    {{ end }}
    {{ define "prom.markdown" }}
    {{ .MessageAt.Format "2006-01-02T15:04:05Z07:00" }}
    #### 摘要
    {{ if gt (.Alerts.Firing|len ) 0 }}
    ##### <font color="warning">🚨 触发中告警 [{{ .Alerts.Firing|len }}]</font>
    {{ template "__alert_summary" .Alerts.Firing }}
    {{ end }}
    {{ if gt (.Alerts.Resolved|len) 0 }}
    ##### <font color="info">✅ 已恢复告警 [{{ .Alerts.Resolved|len }}]</font>
    {{ template "__alert_summary" .Alerts.Resolved }}
    {{ end }}
    #### 详请
    {{ if gt (.Alerts.Firing|len ) 0 }}
    ##### <font color="warning">🚨 触发中告警 [{{ .Alerts.Firing|len }}]</font>
    {{ template "__alert_list" .Alerts.Firing }}
    {{ end }}
    {{ if gt (.Alerts.Resolved|len) 0 }}
    ##### <font color="info">✅ 已恢复告警 [{{ .Alerts.Resolved|len }}]</font>
    {{ template "__alert_list" .Alerts.Resolved }}
    {{ end }}
    {{ end }}
    {{ define "prom.text" }}
    {{ template "prom.markdown" . }}
    {{ end }}
alertmanager-webhook-adapter
# alertmanager-webhook-adapter.yaml: the webhook adapter service
Source: https://github.com/bougou/alertmanager-webhook-adapter/tree/main/deploy/k8s
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager-webhook-adapter
  namespace: ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager-webhook-adapter
  template:
    metadata:
      labels:
        app: alertmanager-webhook-adapter
    spec:
      containers:
        - name: webhook
          image: bougou/alertmanager-webhook-adapter:v1.1.7
          command:
            - /alertmanager-webhook-adapter
            # Listen address
            - --listen-address=:8090
            # Signature shown on the first line of each message to identify the source (any value)
            - --signature=MyIDC
            # Directory that holds the notification templates
            - --tmpl-dir=/msg-tmpl
            # Which template to use (depends on the app that should receive the alerts)
            # For a DingTalk group robot:  --tmpl-name=dingding
            # For a WeCom group robot:     --tmpl-name=wechat
            - --tmpl-name=dingding
            #- --tmpl-lang=zh
          env:
            - name: TZ
              value: Asia/Shanghai
          resources:
            requests:
              memory: 50Mi
              cpu: 100m
            limits:
              memory: 250Mi
              cpu: 500m
          volumeMounts:
            - name: message-tmpl
              mountPath: /msg-tmpl
      volumes:
        - name: message-tmpl
          configMap:
            name: message-tmpl
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-webhook-adapter
  namespace: ops
spec:
  ports:
    - port: 80
      targetPort: 8090
      protocol: TCP
  selector:
    app: alertmanager-webhook-adapter
  sessionAffinity: None
alertmanager-config
# alertmanager-config.yaml: the main configuration file; the change here is mainly how alerts are delivered
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: ops
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.exmail.qq.com:465'
      smtp_from: 'fanxxxxuai@cxxxxne.com'
      smtp_auth_username: 'fanxxxxuai@cxxxxne.com'
      smtp_auth_password: '12345'
      smtp_require_tls: false
    templates:
      - '/etc/alertmanager/msg-tmpl/*.tmpl'
    route:
      #receiver: 'email-receiver'
      receiver: 'dingding-receiver'
      #receiver: 'wechat-receiver'
      group_by: [cluster, alertname]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 30m
    receivers:
      # Email alerts
      - name: 'email-receiver'
        email_configs:
          - to: 'fanxxxxuai@cxxxxne.com'
            send_resolved: true
            html: '{{ template "emailMessage" . }}'
      # DingTalk group alerts
      - name: 'dingding-receiver'
        webhook_configs:
          # In the URL below, only the token needs to be replaced
          # with the token from the webhook of the DingTalk group robot you created
          #531c0f251944b69c6e731a3bea9a609d9557ebdbcf17b1bc0df8f7b9cf506734
          - url: http://alertmanager-webhook-adapter:80/webhook/send?channel_type=dingtalk&token=531c0f251944b69c6e731a3bea9a609d9557ebdbcf17b1bc0df8f7b9cf506734
            # Whether to send a notification when the alert is resolved
            send_resolved: true
      # WeCom group alerts
      - name: 'wechat-receiver'
        webhook_configs:
          # In the URL below, only the token needs to be replaced
          # with the key from the webhook of the WeCom group robot you created
          #1cb01f46-f536-4c98-aeac-1455e1472e5d
          - url: http://alertmanager-webhook-adapter:80/webhook/send?channel_type=weixin&token=1cb01f46-f536-4c98-aeac-1455e1472e5d
            # Whether to send a notification when the alert is resolved
            send_resolved: true
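The configuration above switches channels by commenting receivers in and out, so only one channel is active at a time. If several channels should receive alerts at once, child routes are one option. The `route` section below is a sketch of that idea, not part of the original setup. It assumes an Alertmanager version that supports the `matchers` syntax (0.22 or later) and rules that set a `severity` label; the severity split between channels is purely illustrative.

```yaml
# Fragment of alertmanager.yml - replaces the route section above
route:
  # Fallback receiver for alerts that match no child route
  receiver: 'email-receiver'
  group_by: [cluster, alertname]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
  routes:
    # warning and error alerts go to the DingTalk group ...
    - receiver: 'dingding-receiver'
      matchers:
        - severity=~"warning|error"
      # ... and evaluation continues with the next sibling route
      continue: true
    # error alerts additionally go to the WeCom group
    - receiver: 'wechat-receiver'
      matchers:
        - severity="error"
```

Note that the adapter Deployment above loads a single template through `--tmpl-name`, so sending to DingTalk and WeCom at the same time may need one adapter instance per channel.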
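Note:
Add an 'alertinstance' label to the labels of your Prometheus alerting rules; without it, the title, description, and other fields shown in the results below will not appear.
The rule looks like this: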
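As an illustration, a rule carrying the `alertinstance` label could be written as follows. This is a hedged sketch: the group name, alert name, expression, and annotation texts are hypothetical examples. Only the `alertinstance` label, the `severity` values (`ok`, `info`, `warning`, `error`), and the `description` annotation are taken from what the templates above read.

```yaml
# instance-rules.yml - hypothetical rule file
groups:
  - name: instance-status
    rules:
      - alert: InstanceDown
        # up is 0 when Prometheus fails to scrape a target
        expr: up == 0
        for: 1m
        labels:
          # Use ok / info / warning / error so the template's severity branches match
          severity: error
          # Read by the "__alertinstance" template; copied from the instance label
          alertinstance: '{{ $labels.instance }}'
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 1 minute."
```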
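3.4 DingTalk group alert message
3.5 WeCom group alert message
Summary
When alerting through Alertmanager with smtp_require_tls: true, sending the notification fails with an error; setting it to false lets the mail go through.
smtp_require_tls controls STARTTLS, whereas a connection to port 465 already uses implicit TLS.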