DingTalk and WeCom (Enterprise WeChat) Alert Configuration for Prometheus Operator


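Alert notifications

  • 1. DingTalk alerts
    • [1.1 Add the robot](#1.1 Add the robot)
    • [1.2 Install the webhook](#1.2 Install the webhook)
    • [1.3 Configure alertmanager-alertmanager.yaml](#1.3 Configure alertmanager-alertmanager.yaml)
    • [1.4 Create the AlertmanagerConfig](#1.4 Create the AlertmanagerConfig)
    • [1.5 Test the alerts](#1.5 Test the alerts)
  • 2. WeCom (Enterprise WeChat) alerts
    • [2.1 Add the robot](#2.1 Add the robot)
    • [2.2 Convert the alert format](#2.2 Convert the alert format)
    • [2.3 Configure alertmanager-alertmanager.yaml](#2.3 Configure alertmanager-alertmanager.yaml)
    • [2.4 Create the AlertmanagerConfig](#2.4 Create the AlertmanagerConfig)
    • [2.5 Test the alerts](#2.5 Test the alerts)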

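1. DingTalk alerts

1.1 Add the robot

In the DingTalk group settings, add a custom robot and give it a name. For the security setting, prefer signature mode (加签), which reduces the risk of malicious calls flooding the group with messages. After the robot is created, save its Webhook URL and signing secret; both are needed as credentials in the alert configuration that follows.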
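To make the role of the signing secret concrete, here is a minimal sketch of how a signed request URL is typically derived for a DingTalk custom robot in signature mode: HMAC-SHA256 over the millisecond timestamp, a newline, and the secret, then Base64 and URL encoding. The function name `sign_dingtalk_url` and the example values are hypothetical; the webhook forwarder installed in 1.2 does this for you, so the sketch is only for understanding or manual testing.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse
from typing import Optional


def sign_dingtalk_url(webhook_url: str, secret: str,
                      timestamp_ms: Optional[int] = None) -> str:
    """Return webhook_url with the timestamp and sign parameters appended.

    Assumes webhook_url already carries ?access_token=..., as the URL
    copied from the robot settings page does.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # The string to sign is "<timestamp>\n<secret>", keyed with the secret itself.
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(secret.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    # Base64-encode the digest, then URL-encode it for use in the query string.
    sign = urllib.parse.quote_plus(base64.b64encode(digest).decode("utf-8"))
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"
```

Calling `sign_dingtalk_url(url, secret)` just before each request keeps the timestamp fresh, since a signature is only valid for a short window around its timestamp.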

Add the robot from within the group chat.

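1.2 Install the webhook

Alertmanager has no native receiver for the DingTalk message format, so a DingTalk webhook forwarder has to be deployed to convert the payload.

The forwarder receives the alert JSON that Alertmanager pushes, computes the DingTalk signature, and repackages the data into a message structure the DingTalk robot accepts.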
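The conversion step can be pictured with a short sketch. It is not the forwarder's actual code (prometheus-webhook-dingtalk is written in Go and renders messages from templates); it only assumes the standard Alertmanager webhook payload fields (`status`, `alerts`, `labels`, `annotations`) and the DingTalk robot `markdown` message type. The function name `alertmanager_to_dingtalk` is hypothetical.

```python
def alertmanager_to_dingtalk(payload: dict) -> dict:
    """Turn an Alertmanager webhook payload into a DingTalk markdown message."""
    alerts = payload.get("alerts", [])
    status = payload.get("status", "unknown").upper()
    title = f"[{status}] {len(alerts)} alert(s)"
    lines = [f"### {title}"]
    for alert in alerts:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # One bullet per alert: name, severity, and the rule's summary annotation.
        lines.append(
            f"- **{labels.get('alertname', 'unknown')}** "
            f"({labels.get('severity', 'n/a')}): {annotations.get('summary', '')}"
        )
    return {
        "msgtype": "markdown",
        "markdown": {"title": title, "text": "\n".join(lines)},
    }
```

The returned dictionary is what would be serialized to JSON and POSTed to the signed robot URL.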

root@k8s-master1:~# git clone https://github.com/timonwong/prometheus-webhook-dingtalk.git
root@k8s-master1:~# cd prometheus-webhook-dingtalk/contrib/k8s/
root@k8s-master1:~/prometheus-webhook-dingtalk/contrib/k8s# vim config/config.yaml
root@k8s-master1:~/prometheus-webhook-dingtalk/contrib/k8s# vim deployment.yaml
root@k8s-master1:~/prometheus-webhook-dingtalk/contrib/k8s# kubectl kustomize | kubectl apply -f - -n monitoring
root@k8s-master1:~/prometheus-webhook-dingtalk/contrib/k8s# kubectl get pod -n monitoring | grep ding
alertmanager-webhook-dingtalk-cb7f6c584-92sqj   1/1     Running   0             13s

1.3 Configure alertmanager-alertmanager.yaml

root@k8s-master1:~/prometheus-webhook-dingtalk/contrib/k8s# cd /root/kube-prometheus/manifests/
root@k8s-master1:~/kube-prometheus/manifests# vim alertmanager-alertmanager.yaml

1.4 Create the AlertmanagerConfig

root@k8s-master1:~/kube-prometheus/manifests# vim dingding-alertmanagerconfig.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: dingding
  labels:
    # Must match the AlertmanagerConfig label selector set in alertmanager-alertmanager.yaml
    alertmanagerConfig: email
  namespace: monitoring
spec:
  route:
    groupBy: ['severity']
    groupWait: 1m
    groupInterval: 1m
    repeatInterval: 1m
    receiver: dingding-webhook
  receivers:
    - name: "dingding-webhook"
      webhookConfigs:
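        # Also send a notification when the alert resolves
        - sendResolved: true
          # Address of the DingTalk webhook service
          url: "http://alertmanager-webhook-dingtalk.monitoring/dingtalk/webhook1/send"
root@k8s-master1:~/kube-prometheus/manifests# kubectl apply -f dingding-alertmanagerconfig.yaml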

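1.5 Test the alerts

I have a MySQL high-availability cluster running, so the following uses it as an example of how to write the Prometheus alert rules that trigger notifications.

root@k8s-master1:~/kube-prometheus/manifests# kubectl get pod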
NAME                 READY   STATUS    RESTARTS       AGE
mysql-rep-master-0   2/2     Running   8 (165m ago)   5d21h
mysql-rep-slave-0    2/2     Running   6 (165m ago)   5d21h
mysql-rep-slave-1    2/2     Running   6 (165m ago)   5d21h

## MySQL alert rules
root@k8s-master1:~/kube-prometheus/manifests# vim mysql-rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    role: alerting-rules
    prometheus: kube-prometheus-stack-prometheus
  name: prometheus-mysql-alerts
  namespace: monitoring  # Replace with the namespace your Prometheus runs in
spec:
  groups:
  - name: mysql
    rules:
    # ==================== 1. Cluster availability alerts ====================
    - alert: MySQLDown
      expr: mysql_up == 0
      for: 1m
      labels:
        severity: critical
        namespace: monitoring
      annotations:
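        summary: "MySQL instance {{ $labels.instance }} is down"
        description: "Prometheus cannot reach the MySQL instance on {{ $labels.pod }}. This usually means the mysqld process has stopped or the exporter cannot connect."
    # ==================== 2. Replication alerts ====================
    # 2.1 Replication lag too high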
    - alert: MySQLReplicationLagHigh
      expr: mysql_slave_status_seconds_behind_source > 30
      for: 2m
      labels:
        severity: warning
        namespace: monitoring
      annotations:
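        summary: "MySQL replication lag is high"
        description: "Replica {{ $labels.pod }} (instance: {{ $labels.instance }}) is {{ $value }} seconds behind the source. Check network latency or the write load on the source."
    # 2.2 Replication threads stopped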
    - alert: MySQLReplicationSQLThreadDown
      expr: mysql_slave_status_replica_sql_running == 0
      for: 1m
      labels:
        severity: critical
        namespace: monitoring
      annotations:
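        summary: "MySQL replication SQL thread stopped"
        description: "The SQL thread on replica {{ $labels.pod }} is not running, so data synchronization is interrupted. Check the relay log for corruption or errors."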
    - alert: MySQLReplicationIOThreadDown
      expr: mysql_slave_status_replica_io_running == 0
      for: 1m
      labels:
        severity: critical
        namespace: monitoring
      annotations:
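        summary: "MySQL replication IO thread stopped"
        description: "The IO thread on replica {{ $labels.pod }} is not running, so it cannot fetch binary logs from the source. The network connection may be down."
root@k8s-master1:~/kube-prometheus/manifests# kubectl apply -f mysql-rule.yaml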

Exec into the MySQL replica pod and stop replication to check whether an alert arrives.

root@k8s-master1:~/kube-prometheus/manifests# kubectl exec -it mysql-rep-slave-0 -- /bin/bash
I have no name!@mysql-rep-slave-0:/$ mysql -uroot -p'Root@12345'
mysql> stop replica;

The alert message arrives.

Exec into the MySQL replica pod again and restart replication to check whether a resolved notification arrives.

root@k8s-master1:~/kube-prometheus/manifests# kubectl exec -it mysql-rep-slave-0 -- /bin/bash
I have no name!@mysql-rep-slave-0:/$ mysql -uroot -p'Root@12345'
mysql> start replica;

The resolved notification arrives.

That completes the DingTalk alerting setup for Prometheus.

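2. WeCom (Enterprise WeChat) alerts

2.1 Add the robot

Add the robot from within the group chat.

2.2 Convert the alert format

Alertmanager sends its own webhook JSON payload, while a WeCom group robot expects WeCom's own message format. The two are incompatible, so a middle layer has to parse the payload, extract the content, and reassemble it before monitoring alerts can reach the WeCom group.

I prepared a wechat.yaml that creates two kinds of Kubernetes resources:

Deployment: runs an alert-forwarding container written with Python Flask that starts a web service listening on port 5000. It receives the alert JSON pushed by Alertmanager, converts it into a message format WeCom accepts, and calls the WeCom API to send alert and resolved notifications.

Service: gives the Pod a stable in-cluster entry point. Alertmanager can reach the forwarder through the service name prometheus-webhook-wechat.monitoring:5000 instead of depending on the Pod's changing IP address.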
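The parse, extract, and reassemble flow described above might look roughly like the sketch below. The internals of the linge365/webhook-wechat image are not shown in this article, so this is an assumption-laden illustration rather than that image's code: it relies only on the standard Alertmanager webhook fields and the WeCom group robot `markdown` message type, and the names `format_alert` and `build_wecom_request` are hypothetical. It builds the HTTP request without sending it.

```python
import json
import urllib.request

# WeCom group robot endpoint; the key is the value passed as ROBOT_TOKEN.
WECOM_SEND_URL = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key={key}"


def format_alert(alert: dict) -> str:
    """Render one Alertmanager alert as a short markdown fragment."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    heading = "Resolved" if alert.get("status") == "resolved" else "Firing"
    return (
        f"**{heading}: {labels.get('alertname', 'unknown')}**\n"
        f"> severity: {labels.get('severity', 'n/a')}\n"
        f"> {annotations.get('description', '')}"
    )


def build_wecom_request(payload: dict, robot_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a WeCom group robot."""
    content = "\n\n".join(format_alert(a) for a in payload.get("alerts", []))
    body = {"msgtype": "markdown", "markdown": {"content": content}}
    return urllib.request.Request(
        WECOM_SEND_URL.format(key=robot_key),
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A Flask handler on port 5000 would parse the incoming JSON, call `build_wecom_request` with the key from the `ROBOT_TOKEN` environment variable, and pass the result to `urllib.request.urlopen`.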

root@k8s-master1:~# vim wechat.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus-webhook-wechat
  name: prometheus-webhook-wechat
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-webhook-wechat
  template:
    metadata:
      labels:
        app: prometheus-webhook-wechat
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "2"
      containers:
      - name: prometheus-webhook-wechat
        image: linge365/webhook-wechat:latest
        imagePullPolicy: IfNotPresent
        env:
        - name: ROBOT_TOKEN
          # Paste the token copied from the WeCom robot earlier
          value: "6a1b465b-8e27-42c5-acc1-29c09084fa18"
        ports:
        - containerPort: 5000
          protocol: TCP
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 200m
            memory: 500Mi
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus-webhook-wechat
  name: prometheus-webhook-wechat
  namespace: monitoring
spec:
  ports:
  - port: 5000
    protocol: TCP
    targetPort: 5000
  selector:
    app: prometheus-webhook-wechat
root@k8s-master1:~# kubectl apply -f wechat.yaml

2.3 Configure alertmanager-alertmanager.yaml

root@k8s-master1:~# cd /root/kube-prometheus/manifests/
root@k8s-master1:~/kube-prometheus/manifests# vim alertmanager-alertmanager.yaml

2.4 Create the AlertmanagerConfig

root@k8s-master1:~/kube-prometheus/manifests# vim wechat-alertmanagerconfig.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: wechat
  labels:
    alertmanagerConfig: email
  namespace: monitoring
spec:
  route:
    groupBy: ['severity']
    groupWait: 1m
    groupInterval: 1m
    repeatInterval: 5m  
    receiver: wechat-webhook
  receivers:
    - name: "wechat-webhook"
      webhookConfigs:
        - sendResolved: true
          url: "http://prometheus-webhook-wechat:5000"
root@k8s-master1:~/kube-prometheus/manifests# kubectl apply -f wechat-alertmanagerconfig.yaml

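2.5 Test the alerts

As with the DingTalk test, I use my existing MySQL high-availability cluster.

Exec into the MySQL replica pod and stop the IO thread to check whether an alert arrives.

root@k8s-master1:~/kube-prometheus/manifests# kubectl exec -it mysql-rep-slave-0 -- /bin/bash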
I have no name!@mysql-rep-slave-0:/$ mysql -uroot -p'Root@12345'
mysql> STOP REPLICA IO_THREAD;

The alert message arrives.

Exec into the MySQL replica pod again and restart replication to check whether a resolved notification arrives.

root@k8s-master1:~/kube-prometheus/manifests# kubectl exec -it mysql-rep-slave-0 -- /bin/bash
I have no name!@mysql-rep-slave-0:/$ mysql -uroot -p'Root@12345'
mysql> start replica;

The resolved notification arrives.

That completes the WeCom alerting setup.


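Notes:

If you spot any omissions or mistakes, corrections are welcome.

This article is original work; if you repost it, please credit the original author.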
