I. Overall architecture (final working version)
Node Exporter → Prometheus → Alertmanager → DingTalk Webhook → DingTalk group
↓
Alert resolved
✅ Key takeaway:
As a rule of thumb, about 90% of alert delivery failures come down to three things: service names, paths, and templates.
II. Directory layout (must match exactly)
/docker-compose/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       └── node.yml
├── alertmanager/
│   └── alertmanager.yml
└── webhook/
    ├── config.yml
    └── templates/
        └── default.tmpl   ✅ if this file is missing, the webhook returns 400
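Because a single missing file is enough to break delivery, a quick pre-flight check can help. The sketch below is a hypothetical helper, not part of the original setup: the names REQUIRED_FILES and missing_files are made up for illustration, and the paths are copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
REQUIRED_FILES = [
    "docker-compose.yml",
    "prometheus/prometheus.yml",
    "prometheus/rules/node.yml",
    "alertmanager/alertmanager.yml",
    "webhook/config.yml",
    "webhook/templates/default.tmpl",
]


def missing_files(base_dir):
    """Return the required files that do not exist under base_dir."""
    base = Path(base_dir)
    return [rel for rel in REQUIRED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("/docker-compose")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Layout OK")
```

Running it before starting the stack lists whatever still needs to be created.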
III. Core configuration (all verified in practice, taken from the docx)
✅ 1️⃣ Alertmanager (the easiest place to make a mistake)
/docker-compose/alertmanager/alertmanager.yml
route:
  group_by: ["alertname"]
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 1h
  receiver: dingtalk
receivers:
  - name: dingtalk
    webhook_configs:
      - url: "http://dingtalk-webhook:8060/dingtalk/node/send"
        send_resolved: true
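🚨 Pitfalls
- ❌ altermanager:9093 (misspelled; the correct name is alertmanager)
- ❌ Pointing the URL at webhook:8060, a name Docker DNS cannot resolve
- ✅ The host must be the container name, dingtalk-webhook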
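The host, port, and path in that URL can be checked mechanically before restarting anything. This is a minimal sketch, assuming the container name dingtalk-webhook, port 8060, and the target name node used in this guide; check_webhook_url is a hypothetical helper and only inspects the string, it does not contact the service.

```python
from urllib.parse import urlparse


def check_webhook_url(url, container_name="dingtalk-webhook", target="node", port=8060):
    """Return a list of problems found in an Alertmanager webhook URL."""
    parsed = urlparse(url)
    problems = []
    # The host must be a name Docker DNS can resolve on the shared network.
    if parsed.hostname != container_name:
        problems.append(f"host is {parsed.hostname!r}, expected {container_name!r}")
    if parsed.port != port:
        problems.append(f"port is {parsed.port!r}, expected {port}")
    # The path names the target defined in the webhook's config.yml.
    expected_path = f"/dingtalk/{target}/send"
    if parsed.path != expected_path:
        problems.append(f"path is {parsed.path!r}, expected {expected_path!r}")
    return problems


if __name__ == "__main__":
    for problem in check_webhook_url("http://webhook:8060/dingtalk/node/send"):
        print(problem)
```

An empty list means the URL matches the names used elsewhere in this setup.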
✅ 2️⃣ Prometheus alert rules
/docker-compose/prometheus/rules/node.yml
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node_exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is offline"
          description: "Node Exporter has been unreachable for more than 1 minute"
Then load the rules from prometheus.yml:
rule_files:
  - "/etc/prometheus/rules/*.yml"
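🚨 Pitfalls
- If job_name in prometheus.yml differs from job= in the rule, the alert never fires
- For testing, change == 0 to == 1; the alert goes pending at the next rule evaluation and fires once the for: 1m window has passed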
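The job mismatch is easy to miss by eye, so it can be worth comparing the two files as text. The following is a rough sketch using regular expressions rather than a YAML parser; the function names are hypothetical, and it only understands plain job="..." matchers, not regex matchers such as job=~"...".

```python
import re


def scrape_job_names(prometheus_yml_text):
    """Collect every job_name value from the text of prometheus.yml."""
    pattern = r"""job_name:\s*["']?([^"'\s#]+)"""
    return set(re.findall(pattern, prometheus_yml_text))


def rule_job_matchers(rules_yml_text):
    """Collect every job="..." equality matcher used in rule expressions."""
    return set(re.findall(r"""\bjob\s*=\s*["']([^"']+)["']""", rules_yml_text))


def unmatched_jobs(prometheus_yml_text, rules_yml_text):
    """Return job names referenced by rules but not scraped by Prometheus."""
    return rule_job_matchers(rules_yml_text) - scrape_job_names(prometheus_yml_text)


if __name__ == "__main__":
    prom = 'scrape_configs:\n  - job_name: "node"\n'
    rules = 'expr: up{job="node_exporter"} == 0\n'
    print(unmatched_jobs(prom, rules))
```

Any name it reports is a job the rule waits for but Prometheus never scrapes.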
✅ 3️⃣ Webhook configuration (the key to fixing the 400)
/docker-compose/webhook/config.yml
targets:
  node:
    url: https://oapi.dingtalk.com/robot/send?access_token=7f92268b91852caa413f46e18783424e53db16df7a1a390301b2868ddebd6368
    message:
      text: '{{ template "dingtalk.default.message" . }}'
    http_config:
      headers:
        Content-Type: application/json
templates:
  - /etc/prometheus-webhook-dingtalk/templates/default.tmpl
✅ Content-Type: application/json must be declared explicitly
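To test the webhook on its own, without waiting for Prometheus and Alertmanager, one option is to post an Alertmanager-style payload to it by hand. The sketch below follows the general shape of Alertmanager's webhook payload (version "4"); the field values, the fingerprint, and the URL http://localhost:8060/dingtalk/node/send are assumptions for illustration (the URL relies on port 8060 being published to the host), and both function names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_payload(status="firing", instance="test-host:9100"):
    """Build a payload shaped like the one Alertmanager sends to webhooks."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    labels = {"alertname": "NodeDown", "instance": instance, "severity": "critical"}
    annotations = {
        "summary": "Node " + instance + " is offline",
        "description": "Manual test alert",
    }
    alert = {
        "status": status,
        "labels": labels,
        "annotations": annotations,
        "startsAt": now,
        # Firing alerts carry a zero end time; resolved ones a real timestamp.
        "endsAt": now if status == "resolved" else "0001-01-01T00:00:00Z",
        "generatorURL": "",
        "fingerprint": "manual-test",  # made-up value
    }
    return {
        "version": "4",
        "groupKey": '{}:{alertname="NodeDown"}',
        "status": status,
        "receiver": "dingtalk",
        "groupLabels": {"alertname": "NodeDown"},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "",
        "alerts": [alert],
    }


def send_test_alert(url="http://localhost:8060/dingtalk/node/send"):
    """POST the test payload as JSON; urlopen raises HTTPError on a 4xx/5xx."""
    body = json.dumps(build_test_payload()).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status, response.read().decode("utf-8", "replace")
```

Calling send_test_alert() from the Docker host and then reading the webhook container's logs shows whether the webhook accepted the request and what DingTalk answered.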
✅ 4️⃣ Alert template (Go template)
/docker-compose/webhook/templates/default.tmpl
{{ define "dingtalk.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range .Alerts.Firing -}}
==== [ALERT FIRING] ====
Alert: {{ .Labels.alertname }}
Instance: {{ .Labels.instance }}
Severity: {{ .Labels.severity }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Time: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range .Alerts.Resolved -}}
==== [ALERT RESOLVED] ====
Alert: {{ .Labels.alertname }}
Instance: {{ .Labels.instance }}
Time: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{- end }}
{{ end }}
🚨 Without this file, no alert will be delivered
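The intended shape of the message is: every firing alert under a firing header, then every resolved alert under a resolved header. The sketch below is a rough Python equivalent of that logic, written only to make the expected output easy to reason about; it is not how the webhook renders messages, render_message is a hypothetical name, and it prints the raw timestamps instead of formatting them.

```python
def render_message(alerts):
    """Render firing alerts first, then resolved ones, as plain text."""
    blocks = []
    for alert in alerts:
        if alert["status"] != "firing":
            continue
        labels = alert["labels"]
        annotations = alert.get("annotations", {})
        blocks.append("\n".join([
            "==== [ALERT FIRING] ====",
            "Alert: " + labels.get("alertname", ""),
            "Instance: " + labels.get("instance", ""),
            "Severity: " + labels.get("severity", ""),
            "Summary: " + annotations.get("summary", ""),
            "Details: " + annotations.get("description", ""),
            "Time: " + alert.get("startsAt", ""),
        ]))
    for alert in alerts:
        if alert["status"] != "resolved":
            continue
        labels = alert["labels"]
        blocks.append("\n".join([
            "==== [ALERT RESOLVED] ====",
            "Alert: " + labels.get("alertname", ""),
            "Instance: " + labels.get("instance", ""),
            "Time: " + alert.get("endsAt", ""),
        ]))
    return "\n".join(blocks)
```

Feeding it a mixed list of firing and resolved alerts makes it obvious when an alert ends up under the wrong header.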
✅ 5️⃣ Docker Compose (webhook service)
dingtalk-webhook:
  image: timonwong/prometheus-webhook-dingtalk:latest
  container_name: dingtalk-webhook
  restart: unless-stopped
  ports:
    - "8060:8060"
  volumes:
    - /docker-compose/webhook/config.yml:/etc/prometheus-webhook-dingtalk/config.yml
    - /docker-compose/webhook/templates:/etc/prometheus-webhook-dingtalk/templates:ro
  networks:
    - ruoyi-net
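IV. The six classic pitfalls from the docx
| Symptom | Root cause | Fix |
|---|---|---|
| no such host | Docker DNS cannot resolve the hostname | Use the container name dingtalk-webhook, not webhook |
| Prometheus does not alert | job mismatch | Use job="node_exporter" on both sides |
| No message in DingTalk | 400 error | Add Content-Type: application/json |
| Webhook reports an error | Template file missing | Create default.tmpl |
| Rules do not take effect | rules directory not mounted | Mount the rules directory, then restart the whole Compose stack |
| Alertmanager unreachable | Typo in the hostname | altermanager → alertmanager |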
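V. Verification procedure (follow the steps in order)
# 1. Trigger the alert
vi /docker-compose/prometheus/rules/node.yml
# change == 0 to == 1 in expr
# 2. Restart
docker compose restart prometheus alertmanager dingtalk-webhook
# 3. Check the webhook logs (allow for the for: 1m window before the alert fires)
docker logs dingtalk-webhook --tail=50
# 4. Resolve the alert
# change expr back to == 0
docker compose restart prometheus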
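Between steps 2 and 3 it helps to confirm that Prometheus itself sees the alert before looking at the webhook. A minimal sketch, assuming Prometheus is reachable from the host at http://localhost:9090 (the port mapping is not shown in this guide, so treat it as an assumption); it reads the /api/v1/alerts endpoint of the Prometheus HTTP API, and both function names are hypothetical.

```python
import json
import urllib.request


def alert_states(api_response, alertname="NodeDown"):
    """Map instance -> state for one alert name in a /api/v1/alerts response."""
    alerts = api_response.get("data", {}).get("alerts", [])
    return {
        a["labels"].get("instance", ""): a["state"]
        for a in alerts
        if a["labels"].get("alertname") == alertname
    }


def fetch_alert_states(base_url="http://localhost:9090", alertname="NodeDown"):
    """Query Prometheus and return the current state of the named alert."""
    with urllib.request.urlopen(base_url + "/api/v1/alerts", timeout=10) as response:
        return alert_states(json.load(response), alertname)
```

A state of pending means the rule matched but the for: 1m window has not elapsed yet; firing means Prometheus has handed the alert to Alertmanager, so any remaining problem is further down the chain.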