Prometheus 05-01: Alerting Rules and Alertmanager Configuration
Related documentation links
Alert rule design, Alertmanager configuration, and notification integration
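Official documentation
- Alerting Overview - overview of the Prometheus alerting system
- Alerting Rules - alerting rule configuration documentation
- Alertmanager Configuration - complete Alertmanager configuration reference
- Notification Template Reference - notification template reference
GitHub projects
- Awesome Prometheus Alerts - collection of commonly used alerting rules
- Alertmanager - Alertmanager project home page
- Alert Rules Examples - official alerting rule examples
- Prometheus Operator Alerts - alerting rules for Kubernetes environments
Chinese-language resources and tutorials
- Prometheus Alerting System - alerting tutorial from the cloud native community
- Alertmanager Configuration Explained - detailed configuration guide in Chinese
- Alerting Rule Best Practices - best practices for alert configuration
- WeChat and DingTalk Alert Integration - notification integration with Chinese messaging platforms
Online tools and resources
- Alertmanager WebUI - Alertmanager web interface documentation
- PrometheusRule Validator - alerting rule validation tool
- Alert Rule Testing - unit testing for rules
- Grafana Alerting - Grafana's integrated alerting feature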
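1. Alerting System Architecture Overview
1.1 Prometheus alerting system components
The Prometheus alerting system consists of the following core components: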
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│                 │    │                 │    │                 │
│   Prometheus    │───▶│  Alertmanager   │───▶│    Receivers    │
│     Server      │    │                 │    │  (Email/Slack/  │
│                 │    │                 │    │  WebHook etc.)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         │                      │                      │
    ┌────▼────┐             ┌───▼───┐             ┌────▼────┐
    │  Alert  │             │ Route │             │ Silence │
    │  Rules  │             │ Tree  │             │ Manager │
    └─────────┘             └───────┘             └─────────┘
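Component responsibilities
- Prometheus Server:
  - Evaluates alerting rules
  - Generates alert instances
  - Sends alerts to Alertmanager
- Alertmanager:
  - Receives alerts from Prometheus
  - Groups, inhibits, and silences alerts
  - Routes alerts to the appropriate receivers
  - Manages the alert lifecycle
- Alert Rules:
  - Define the alert conditions
  - Configure alert labels and annotations
  - Set the alert duration (the for clause)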
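The division of labor above is wired together in the Prometheus server's own configuration file. The fragment below is a minimal sketch, assuming the rule groups are saved as alert_rules.yml next to prometheus.yml and that a single Alertmanager listens on localhost:9093 (its default port). Both the path and the address are assumptions to adapt to your environment.

```yaml
# prometheus.yml (sketch; the path and address below are assumptions)
global:
  evaluation_interval: 15s       # how often alerting rules are evaluated

rule_files:
  - "alert_rules.yml"            # assumed location of the rule groups

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"   # assumed Alertmanager address
```

Prometheus evaluates every rule listed under rule_files at each evaluation interval and pushes pending and firing alerts to every Alertmanager listed under alerting.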
1.2 Alert flow in detail
Metrics Collection → Rule Evaluation → Alert Condition Met?
- Yes: Generate Alert → Send to Alertmanager → Grouping → Routing → Inhibition Check → Silence Check → Send Notification?
  - Yes: Send to Receiver → Notification Sent
  - No: Suppress Alert
- No: Continue Monitoring
Wait for Next Cycle
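The Alertmanager half of this flow (grouping, routing, inhibition) is driven by its configuration file. The following is a minimal sketch rather than a recommended setup: the receiver names and webhook URLs are hypothetical, and the inhibition rule assumes alerts carry the severity and category labels used by the rules in this article.

```yaml
# alertmanager.yml (sketch; receiver names and URLs are hypothetical)
route:
  receiver: default-webhook              # fallback receiver
  group_by: ["alertname", "instance"]    # alerts sharing these labels form one group
  group_wait: 30s                        # delay before the first notification for a new group
  group_interval: 5m                     # delay before notifying about new alerts in a group
  repeat_interval: 4h                    # re-send interval while alerts keep firing
  routes:
    - matchers:
        - severity = "critical"
      receiver: oncall-webhook

inhibit_rules:
  # Mute warnings while a critical alert of the same category
  # is firing for the same instance.
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["instance", "category"]

receivers:
  - name: default-webhook
    webhook_configs:
      - url: "http://127.0.0.1:5001/alerts"    # hypothetical endpoint
  - name: oncall-webhook
    webhook_configs:
      - url: "http://127.0.0.1:5001/oncall"    # hypothetical endpoint
```

Silences do not appear in this file: they are created at runtime, for example through the Alertmanager web UI or the amtool command-line client.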
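Alert state transitions
Inactive ──condition_met──▶ Pending ──for_duration──▶ Firing
   ▲                           │                         │
   │                           │                         │
   └──condition_not_met────────┴────condition_not_met────┘
- Inactive: the alert condition is not met
- Pending: the alert condition is met, but not yet for the configured duration
- Firing: the alert condition has been met for at least the configured duration, and notifications start being sent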
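The Pending to Firing transition can be checked with a promtool rule unit test. The sketch below assumes a hypothetical file service_rules.yml that holds a single rule named ServiceDown with expr: up == 0, for: 1m, one label (severity: critical), and no annotations; the job and instance values are made up. If the rule under test defines annotations or further labels, they would have to be listed under exp_annotations and exp_labels for the test to pass.

```yaml
# service_rules_test.yml (sketch; run with: promtool test rules service_rules_test.yml)
rule_files:
  - service_rules.yml        # hypothetical file containing the ServiceDown rule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The target is down from the first sample onward.
      - series: 'up{job="api", instance="host1:9100"}'
        values: "0 0 0 0"
    alert_rule_test:
      # At 0m the condition has only just become true: the alert is Pending,
      # so no firing alert is expected.
      - eval_time: 0m
        alertname: ServiceDown
        exp_alerts: []
      # At 2m the condition has held longer than "for: 1m": the alert is Firing.
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
              instance: "host1:9100"
```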
2. Alerting Rule Configuration
2.1 Basic alerting rule syntax
Rule file structure
yaml
# alert_rules.yml
groups:
  - name: "Basic system alerts"
    rules:
      - alert: "HighCPUUsage"
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on instance {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} is {{ $value }}%, above the 80% threshold"
  - name: "Application service alerts"
    rules:
      - alert: "ServiceDown"
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Service {{ $labels.job }} is unavailable"
          description: "Service {{ $labels.job }} has stopped running on instance {{ $labels.instance }}"
Rule configuration elements
yaml
# Complete alerting rule example
- alert: "AlertName"                     # Alert name (required)
  expr: prometheus_query_expression      # PromQL expression (required)
  for: 5m                                # Duration (optional, default 0)
  labels:                                # Alert labels (optional)
    severity: warning
    team: platform
    env: production
  annotations:                           # Alert annotations (optional)
    summary: "Short description"
    description: "Detailed description; template variables are supported"
    runbook_url: "https://runbook.example.com/alerts/alert-name"
    dashboard_url: "https://grafana.example.com/dashboard"
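Annotation templates can go beyond plain variable substitution by piping values through Prometheus's built-in template functions. The fragment below is a small sketch of that idea; the rule name is hypothetical, and the expression is deliberately written as a ratio between 0 and 1 so that humanizePercentage renders it as a percentage.

```yaml
# Sketch: belongs under "rules:" in a rule group; the rule name is hypothetical
- alert: HighMemoryUsageExample
  # A ratio in the range 0..1, as humanizePercentage expects.
  expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: >-
      Memory usage on {{ $labels.instance }} (job {{ $labels.job }})
      is {{ $value | humanizePercentage }}.
```

Here $labels exposes the labels of the alerting series and $value its current sample value; related functions such as humanize and humanizeDuration follow the same pipe syntax.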
2.2 Common system alerting rules
CPU alerts
yaml
groups:
  - name: "cpu_alerts"
    rules:
      # CPU usage alert
      - alert: "HighCPUUsage"
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) > 80
        for: 2m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High CPU usage warning"
          description: "CPU usage on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"
      # CPU load alert
      # on(instance) is needed because node_load5 also carries a job label,
      # which the count by(instance) result does not have.
      - alert: "HighCPULoad"
        expr: node_load5 / on(instance) count by(instance) (node_cpu_seconds_total{mode="idle"}) > 1.5
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High CPU load warning"
          description: "5-minute load per core on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}, above 1.5 times the number of CPU cores"
      # Critical CPU usage alert
      - alert: "CriticalCPUUsage"
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) > 95
        for: 1m
        labels:
          severity: critical
          category: system
        annotations:
          summary: "Critical CPU usage alert"
          description: "CPU usage on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 95%; immediate action is required"
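The warning and critical CPU rules above repeat the same usage expression. One way to avoid the duplication is a recording rule that computes the value once and lets both alerts compare against it. The sketch below assumes a hypothetical group name and a hypothetical recorded series name, and it uses rate() in place of irate() because the Prometheus documentation recommends rate() for alerting.

```yaml
groups:
  - name: "cpu_alerts_with_recording"    # hypothetical group name
    rules:
      # Rules in a group run in order, so the recorded series is fresh
      # when the alerts below are evaluated.
      - record: instance:node_cpu_usage:percent    # hypothetical series name
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - alert: "HighCPUUsage"
        expr: instance:node_cpu_usage:percent > 80
        for: 2m
        labels:
          severity: warning
          category: system
      - alert: "CriticalCPUUsage"
        expr: instance:node_cpu_usage:percent > 95
        for: 1m
        labels:
          severity: critical
          category: system
```

The recorded series can also be graphed directly, which keeps dashboards and alerts on the same definition of CPU usage.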
Memory alerts
yaml
groups:
  - name: "memory_alerts"
    rules:
      # Memory usage alert
      - alert: "HighMemoryUsage"
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High memory usage warning"
          description: "Memory usage on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"
      # Low available memory alert
      - alert: "LowMemoryAvailable"
        expr: node_memory_MemAvailable_bytes / 1024 / 1024 / 1024 < 1
        for: 1m
        labels:
          severity: critical
          category: system
        annotations:
          summary: "Critically low memory"
          description: "Instance {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}GB of available memory left"
      # Swap usage alert
      - alert: "HighSwapUsage"
        expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 50
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High swap usage warning"
          description: "Swap usage on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 50%"
Disk alerts
yaml
groups:
  - name: "disk_alerts"
    rules:
      # Disk space usage alert
      - alert: "HighDiskUsage"
        expr: (1 - (node_filesystem_free_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High disk usage warning"
          description: "Disk usage on instance {{ $labels.instance }}, mount point {{ $labels.mountpoint }}, is {{ printf \"%.2f\" $value }}%, above 80%"
      # Disk space nearly exhausted
      - alert: "DiskSpaceLow"
        expr: node_filesystem_free_bytes{fstype!="tmpfs"} / 1024 / 1024 / 1024 < 5
        for: 1m
        labels:
          severity: critical
          category: system
        annotations:
          summary: "Critically low disk space"
          description: "Instance {{ $labels.instance }}, mount point {{ $labels.mountpoint }}, has only {{ printf \"%.2f\" $value }}GB of free space left"
      # Disk I/O alert. Note: this expression measures the share of time the
      # device was busy with I/O (utilization), not per-request latency.
      - alert: "HighDiskIOLatency"
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 3m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High disk I/O utilization"
          description: "Disk {{ $labels.device }} on instance {{ $labels.instance }} was busy with I/O {{ printf \"%.2f\" $value }}% of the time, above 80%"
      # Disk fill prediction alert
      - alert: "DiskWillFillSoon"
        expr: predict_linear(node_filesystem_free_bytes{fstype!="tmpfs"}[1h], 4 * 3600) <= 0
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "Disk space forecast warning"
          description: "At the current trend, mount point {{ $labels.mountpoint }} on instance {{ $labels.instance }} will run out of space within 4 hours"
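A prediction-only rule can fire on a short burst of writes to a mostly empty disk. A common refinement is to require that the filesystem is already fairly full and writable before trusting the forecast. The sketch below illustrates that combination; the rule name, the 20% floor, and the 6-hour lookback window are assumptions, and it uses node_filesystem_avail_bytes (space available to non-root users) rather than node_filesystem_free_bytes.

```yaml
# Sketch: belongs under "rules:" in a rule group; name and thresholds are assumptions
- alert: "DiskFillingUpAndAlreadyLow"
  expr: |
    (
      node_filesystem_avail_bytes{fstype!="tmpfs"}
        / node_filesystem_size_bytes{fstype!="tmpfs"} < 0.2
    and
      predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
    and
      node_filesystem_readonly{fstype!="tmpfs"} == 0
    )
  for: 15m
  labels:
    severity: warning
    category: system
  annotations:
    summary: "Disk is filling up"
    description: "Mount point {{ $labels.mountpoint }} on instance {{ $labels.instance }} has less than 20% space available and is predicted to fill within 4 hours"
```

All three operands carry the same label set, so the and operators match them series by series without extra on() or ignoring() modifiers.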
Network alerts
yaml
groups:
  - name: "network_alerts"
    rules:
      # Network interface state alert
      - alert: "NetworkInterfaceDown"
        expr: node_network_up{device!="lo"} == 0
        for: 1m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "Network interface failure"
          description: "Network interface {{ $labels.device }} on instance {{ $labels.instance }} is down"
      # High network error rate
      - alert: "HighNetworkErrors"
        expr: rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "High network error rate"
          description: "Interface {{ $labels.device }} on instance {{ $labels.instance }} has a network error rate of {{ printf \"%.2f\" $value }} errors/sec"
      # High bandwidth usage
      # The expression yields outbound throughput in Gbps; the 0.8 threshold
      # corresponds to 80% utilization only on a 1 Gbps link.
      - alert: "HighBandwidthUsage"
        expr: rate(node_network_transmit_bytes_total{device!="lo"}[5m]) * 8 / 1000 / 1000 / 1000 > 0.8
        for: 3m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "High bandwidth usage"
          description: "Interface {{ $labels.device }} on instance {{ $labels.instance }} has outbound throughput of {{ printf \"%.2f\" $value }}Gbps"
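The bandwidth rule above compares against a fixed 0.8 Gbps, which only matches 80% utilization on a 1 Gbps link. Where node_exporter reports the link speed, the threshold can be made relative instead. The following is a sketch under that assumption: the rule name is hypothetical, and node_network_speed_bytes is only meaningful for interfaces whose driver exposes a speed, so non-positive values are filtered out.

```yaml
# Sketch: belongs under "rules:" in a rule group; the rule name is hypothetical
- alert: "HighBandwidthUtilization"
  expr: |
    rate(node_network_transmit_bytes_total{device!="lo"}[5m])
      / (node_network_speed_bytes{device!="lo"} > 0)
      > 0.8
  for: 3m
  labels:
    severity: warning
    category: network
  annotations:
    summary: "High bandwidth utilization"
    description: "Interface {{ $labels.device }} on instance {{ $labels.instance }} is using {{ $value | humanizePercentage }} of its link speed for outbound traffic"
```

Both metrics are in bytes per second, so the ratio needs no unit conversion.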
2.3 Application service alerting rules
HTTP service alerts
yaml
groups:
  - name: "http_service_alerts"
    rules:
      # Service unavailable
      - alert: "ServiceDown"
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          category: service
        annotations:
          summary: "Service unavailable"
          description: "Instance {{ $labels.instance }} of service {{ $labels.job }} has stopped responding"
      # High error rate
      # The regex "5.." matches three-character status codes starting with 5.
      - alert: "HighErrorRate"
        expr: (sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)) * 100 > 5
        for: 2m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "High HTTP error rate"
          description: "HTTP 5xx error rate of service {{ $labels.job }} is {{ printf \"%.2f\" $value }}%, above 5%"
      # High response time
      - alert: "HighResponseTime"
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)) > 1
        for: 3m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "High response time"
          description: "P95 response time of service {{ $labels.job }} is {{ printf \"%.2f\" $value }}s, above 1 second"
      # Low QPS alert
      - alert: "LowRequestRate"
        expr: sum(rate(http_requests_total[5m])) by (job) < 10
        for: 5m
        labels:
          severity: info
          category: application
        annotations:
          summary: "Low request rate"
          description: "QPS of service {{ $labels.job }} is {{ printf \"%.2f\" $value }}, below the normal level"
      # Abnormal traffic spike
      - alert: "TrafficSpike"
        expr: sum(rate(http_requests_total[5m])) by (job) / avg_over_time((sum(rate(http_requests_total[5m])) by (job))[1h:5m]) > 3
        for: 2m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "Abnormal traffic spike"
          description: "Current QPS of service {{ $labels.job }} is {{ printf \"%.2f\" $value }} times its average over the past hour"
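A rule based on up == 0 only fires while the target is still being scraped. If the whole job disappears from the scrape configuration or service discovery, no up series exists and the rule stays silent. The sketch below covers that gap with absent(); the rule name and the job name "api" are placeholders.

```yaml
# Sketch: belongs under "rules:" in a rule group; rule and job names are placeholders
- alert: "ServiceMetricsAbsent"
  expr: absent(up{job="api"})
  for: 5m
  labels:
    severity: critical
    category: service
  annotations:
    summary: "No targets found for job {{ $labels.job }}"
    description: "No up series exists for job {{ $labels.job }}; the job may have been removed from the scrape configuration, or service discovery returned no targets"
```

Because the selector uses an equality matcher, absent() copies job="api" onto its result, which is why $labels.job is available in the annotations. One such rule is needed per job.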
Database alerts
yaml
groups:
  - name: "database_alerts"
    rules:
      # MySQL connection count alert
      - alert: "MySQLHighConnections"
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "MySQL connection count too high"
          description: "Connection usage of MySQL instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"
      # MySQL slow query alert
      - alert: "MySQLSlowQueries"
        expr: rate(mysql_global_status_slow_queries[5m]) > 5
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "Too many MySQL slow queries"
          description: "Slow query rate of MySQL instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }} queries/sec"
      # Redis memory usage alert
      # The "and" clause skips instances without a memory limit (maxmemory = 0),
      # where the division would otherwise yield +Inf and always fire.
      - alert: "RedisHighMemoryUsage"
        expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 80 and redis_memory_max_bytes > 0
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "Redis memory usage too high"
          description: "Memory usage of Redis instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"
      # Redis connection count alert
      - alert: "RedisHighConnections"
        expr: redis_connected_clients > 100
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "Redis connection count too high"
          description: "Redis instance {{ $labels.instance }} currently has {{ $value }} connections, above 100"
2.4 Business alerting rules
Business metric alerts
yaml
groups:
  - name: "business_alerts"
    rules:
      # Abnormal order volume
      # rate() returns a per-second value, so the threshold is 5 orders per second.
      - alert: "LowOrderRate"
        expr: rate(orders_total[10m]) < 5
        for: 5m
        labels:
          severity: warning
          category: business
        annotations:
          summary: "Order volume abnormally low"
          description: "Current order rate of {{ printf \"%.2f\" $value }} orders/sec is below the normal level"
      # Payment failure rate too high
      - alert: "HighPaymentFailureRate"
        expr: (sum(rate(payments_total{status="failed"}[5m])) / sum(rate(payments_total[5m]))) * 100 > 2
        for: 2m
        labels:
          severity: critical
          category: business
        annotations:
          summary: "Payment failure rate too high"
          description: "Payment failure rate of {{ printf \"%.2f\" $value }}% exceeds the 2% threshold"
      # User registration anomaly
      - alert: "UserRegistrationAnomaly"
        expr: |
          rate(user_registrations_total[5m]) / avg_over_time(rate(user_registrations_total[5m])[1h:5m]) > 5
          or
          rate(user_registrations_total[5m]) / avg_over_time(rate(user_registrations_total[5m])[1h:5m]) < 0.2
        for: 3m
        labels:
          severity: warning
          category: business
        annotations:
          summary: "User registration anomaly"
          description: "User registration rate is abnormal; the ratio of the current value to the historical average is {{ printf \"%.2f\" $value }}"
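Business metrics such as order volume usually follow a daily cycle, so a fixed lower bound tends to fire every night. One possible mitigation is to gate the rule on the time of day with hour(). The sketch below assumes business hours of 09:00 to 21:00 in UTC+8, which is 01:00 to 13:00 in UTC, the time zone hour() works in; the rule name and the hours are assumptions.

```yaml
# Sketch: belongs under "rules:" in a rule group; rule name and hours are assumptions
- alert: "LowOrderRateBusinessHours"
  expr: |
    rate(orders_total[10m]) < 5
    and on() (hour() >= 1 and hour() < 13)
  for: 5m
  labels:
    severity: warning
    category: business
  annotations:
    summary: "Order volume abnormally low during business hours"
    description: "Current order rate of {{ printf \"%.2f\" $value }} orders/sec is below the normal level"
```

hour() returns a single sample without labels, so on() with an empty label list is what lets it act as a switch for every series on the left-hand side.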