Prometheus 05-01: Alerting Rules and Alertmanager Configuration

Alerting rule design, Alertmanager configuration, and notification integration

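I. Alerting System Architecture Overview

1.1 Prometheus Alerting System Components

The Prometheus alerting system consists of the following core components: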

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│                 │    │                 │    │                 │
│   Prometheus    │───▶│  Alertmanager   │───▶│   Receivers     │
│     Server      │    │                 │    │ (Email/Slack/   │
│                 │    │                 │    │  WebHook etc.)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         │                       │                       │
    ┌────▼────┐              ┌───▼───┐              ┌────▼────┐
    │ Alert   │              │ Route │              │ Silence │
    │ Rules   │              │ Tree  │              │ Manager │
    └─────────┘              └───────┘              └─────────┘
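Component responsibilities
  1. Prometheus Server

    • Evaluates alerting rules
    • Generates alert instances
    • Sends alerts to Alertmanager
  2. Alertmanager

    • Receives alerts from Prometheus
    • Groups, inhibits, and silences alerts
    • Routes alerts to the appropriate receivers
    • Manages the alert lifecycle
  3. Alert Rules

    • Define the alert conditions
    • Attach labels and annotations to alerts
    • Set how long a condition must hold (the for duration) before the alert fires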
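
The division of labor above shows up directly in the Prometheus server configuration: the server loads rule files, evaluates them on a fixed interval, and is told where Alertmanager lives. The fragment below is a minimal sketch of that wiring. The rule file path and the Alertmanager address are assumptions for illustration, not values taken from this article's environment.

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 15s       # how often alerting rules are evaluated

rule_files:
  - "alert_rules.yml"            # assumed path to the alerting rule file

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"   # assumed Alertmanager address (9093 is its default port)
```

With this in place, Prometheus only evaluates rules and forwards alerts; grouping, routing, and notification stay on the Alertmanager side.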

1.2 Alert Flow in Detail

Metrics Collection ──▶ Rule Evaluation ──▶ Alert Condition Met?
  • No:  Continue Monitoring ──▶ Wait for Next Cycle
  • Yes: Generate Alert ──▶ Send to Alertmanager ──▶ Grouping ──▶ Routing ──▶ Inhibition Check ──▶ Silence Check ──▶ Send Notification?
      • Yes: Send to Receiver ──▶ Notification Sent
      • No:  Suppress Alert
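
The Alertmanager half of this flow (grouping, routing, and inhibition) is driven by its configuration file. The following is a minimal sketch, assuming Alertmanager 0.22 or later for the matchers syntax. The receiver names and webhook URLs are hypothetical placeholders. Silences do not appear here because they are created at runtime through the Alertmanager UI or API rather than in this file.

```yaml
# alertmanager.yml (sketch)
route:
  receiver: default                      # fallback receiver
  group_by: ["alertname", "instance"]    # alerts sharing these labels form one notification
  group_wait: 30s                        # wait before sending the first notification for a new group
  group_interval: 5m                     # wait before notifying about new alerts in an existing group
  repeat_interval: 4h                    # re-send interval while alerts keep firing
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall                   # hypothetical receiver name

inhibit_rules:
  # While a critical alert fires for an instance, mute warning alerts for the same instance.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ["instance"]

receivers:
  - name: default
    webhook_configs:
      - url: "http://127.0.0.1:5001/"    # hypothetical endpoint
  - name: oncall
    webhook_configs:
      - url: "http://127.0.0.1:5002/"    # hypothetical endpoint
```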

Alert state transitions
Inactive ──condition_met──▶ Pending ──for_duration──▶ Firing
    ▲                          │                        │
    │                          │                        │
    └──condition_not_met───────┴────condition_not_met──┘
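  • Inactive: the alert condition is not met
  • Pending: the condition is met but has not yet held for the full for duration
  • Firing: the condition has held for at least the for duration, so Prometheus sends the alert to Alertmanager, which handles notification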
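
The Pending to Firing transition can be checked offline with a promtool rule unit test. The file below is a sketch under stated assumptions: it expects a hypothetical rule file service_down.yml containing one alert named ServiceDown with expr up == 0, for: 1m, a single label severity: critical, and no annotations. The job and instance values are made up for the test.

```yaml
# service_down_test.yml: run with `promtool test rules service_down_test.yml`
rule_files:
  - service_down.yml          # hypothetical rule file described above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: the target is up at t=0 and down from t=1m onward.
      - series: 'up{job="api", instance="host1:9100"}'
        values: '1 0 0 0 0'
    alert_rule_test:
      # t=1m: the condition has only just become true, so the alert is Pending
      # and nothing is reported as firing.
      - eval_time: 1m
        alertname: ServiceDown
        exp_alerts: []
      # t=3m: the condition has held longer than the 1m for duration, so the alert is Firing.
      - eval_time: 3m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
              instance: host1:9100
```

promtool compares both labels and annotations, so a rule that defines annotations would also need a matching exp_annotations block.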

II. Alerting Rule Configuration

2.1 Basic Alerting Rule Syntax

Rule file structure
# alert_rules.yml
groups:
  - name: "basic_system_alerts"
    rules:
      - alert: "HighCPUUsage"
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on instance {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} is {{ $value }}%, above the 80% threshold"

  - name: "application_service_alerts"
    rules:
      - alert: "ServiceUnavailable"
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Service {{ $labels.job }} is unavailable"
          description: "Service {{ $labels.job }} on instance {{ $labels.instance }} can no longer be scraped"
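Rule configuration elements
# Complete alerting rule example
- alert: "AlertName"                    # Alert name (required)
  expr: prometheus_query_expression     # PromQL expression (required)
  for: 5m                               # How long the condition must hold (optional, default 0)
  labels:                               # Alert labels (optional)
    severity: warning
    team: platform
    env: production
  annotations:                          # Alert annotations (optional)
    summary: "Short description"
    description: "Detailed description; template variables are supported"
    runbook_url: "https://runbook.example.com/alerts/alert-name"
    dashboard_url: "https://grafana.example.com/dashboard"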
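
Beyond the fields listed above, two details are often useful in practice: annotation templates can pipe values through built-in formatting functions, and recent Prometheus releases accept a keep_firing_for field that keeps an alert firing for a while after its condition clears. The rule below is a sketch, assuming Prometheus 2.42 or later for keep_firing_for. The group name and thresholds are illustrative.

```yaml
groups:
  - name: example_rule_elements
    rules:
      - alert: HighMemoryUsage
        # Ratio between 0 and 1, so humanizePercentage can format it below.
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.8
        for: 5m                 # must hold for 5 minutes before firing
        keep_firing_for: 10m    # stays firing for 10 minutes after the condition clears
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
```

Holding the alert open briefly reduces flapping notifications when a value hovers around its threshold.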

2.2 Common System Alerting Rules

CPU alerts
groups:
  - name: "cpu_alerts"
    rules:
      # CPU usage alert
      - alert: "HighCPUUsage"
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) > 80
        for: 2m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High CPU usage warning"
          description: "Instance {{ $labels.instance }} CPU usage {{ printf \"%.2f\" $value }}% exceeds 80%"

      # CPU load alert
      - alert: "HighCPULoad"
        # count without (cpu, mode) keeps instance and job, so both sides of the
        # division carry the same labels and can be matched one-to-one.
        expr: node_load5 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 1.5
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High CPU load warning"
          description: "Instance {{ $labels.instance }} 5-minute load per core is {{ printf \"%.2f\" $value }}, above 1.5 times the number of CPU cores"

      # Critical CPU usage alert
      - alert: "CriticalCPUUsage"
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) > 95
        for: 1m
        labels:
          severity: critical
          category: system
        annotations:
          summary: "Critical CPU usage alert"
          description: "Instance {{ $labels.instance }} CPU usage {{ printf \"%.2f\" $value }}% exceeds 95% and needs immediate attention"
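
The two CPU usage alerts above repeat the same expression with different thresholds. One common way to avoid that duplication is to precompute the value with a recording rule and alert on the recorded series. The group below is a sketch of that pattern. The recorded series name is an assumption that follows the level:metric:operations naming convention, and rate is used instead of irate because it is less sensitive to single-sample spikes.

```yaml
groups:
  - name: cpu_usage_recorded
    rules:
      # Rules in one group are evaluated in order, so the alerts below see a fresh value.
      - record: instance:node_cpu_utilisation:percent
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

      - alert: HighCPUUsage
        expr: instance:node_cpu_utilisation:percent > 80
        for: 2m
        labels:
          severity: warning
          category: system

      - alert: CriticalCPUUsage
        expr: instance:node_cpu_utilisation:percent > 95
        for: 1m
        labels:
          severity: critical
          category: system
```
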
Memory alerts
groups:
  - name: "memory_alerts"
    rules:
      # Memory usage alert
      - alert: "HighMemoryUsage"
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High memory usage warning"
          description: "Instance {{ $labels.instance }} memory usage {{ printf \"%.2f\" $value }}% exceeds 80%"

      # Low available memory alert
      - alert: "LowMemoryAvailable"
        expr: node_memory_MemAvailable_bytes / 1024 / 1024 / 1024 < 1
        for: 1m
        labels:
          severity: critical
          category: system
        annotations:
          summary: "Critically low memory"
          description: "Instance {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}GiB of available memory left"

      # Swap usage alert
      - alert: "HighSwapUsage"
        expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 50
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High swap usage warning"
          description: "Instance {{ $labels.instance }} swap usage {{ printf \"%.2f\" $value }}% exceeds 50%"
Disk alerts
groups:
  - name: "disk_alerts"
    rules:
      # Disk space usage alert
      - alert: "HighDiskUsage"
        expr: (1 - (node_filesystem_free_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High disk usage warning"
          description: "Instance {{ $labels.instance }} mount point {{ $labels.mountpoint }} disk usage {{ printf \"%.2f\" $value }}% exceeds 80%"

      # Disk space about to run out
      - alert: "DiskSpaceLow"
        expr: node_filesystem_free_bytes{fstype!="tmpfs"} / 1024 / 1024 / 1024 < 5
        for: 1m
        labels:
          severity: critical
          category: system
        annotations:
          summary: "Critically low disk space"
          description: "Instance {{ $labels.instance }} mount point {{ $labels.mountpoint }} has only {{ printf \"%.2f\" $value }}GiB free"

      # Disk I/O utilization alert
      # node_disk_io_time_seconds_total counts seconds spent doing I/O, so its rate
      # is the fraction of time the device was busy (utilization), not latency.
      - alert: "HighDiskIOUtilization"
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 3m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High disk I/O utilization"
          description: "Instance {{ $labels.instance }} disk {{ $labels.device }} I/O utilization {{ printf \"%.2f\" $value }}% exceeds 80%"

      # Disk fill prediction alert
      - alert: "DiskWillFillSoon"
        expr: predict_linear(node_filesystem_free_bytes{fstype!="tmpfs"}[1h], 4 * 3600) <= 0
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "Disk space forecast warning"
          description: "Based on the current trend, mount point {{ $labels.mountpoint }} on instance {{ $labels.instance }} will run out of space within 4 hours"
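
A prediction based on a single hour of data can fire on short bursts of writes, such as a log rotation or a large temporary file. A possible refinement, sketched below, combines the forecast with a current-usage floor, uses node_filesystem_avail_bytes (space available to non-root users), and skips read-only filesystems. The 20% floor, the 6h window, the fstype exclusion list, and the rule name are all assumptions to tune per environment.

```yaml
groups:
  - name: disk_alerts_sketch
    rules:
      - alert: DiskFillingUp
        expr: |
          (
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs"} * 100 < 20
          )
          and
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}[6h], 4 * 3600) < 0
          and
          node_filesystem_readonly{fstype!~"tmpfs|overlay|squashfs"} == 0
        for: 30m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "Filesystem is predicted to fill up"
          description: "{{ $labels.mountpoint }} on {{ $labels.instance }} has {{ printf \"%.2f\" $value }}% space left and is on track to fill within 4 hours"
```

Because and keeps the left-hand sample, $value here is the percentage of space still available.
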
Network alerts
groups:
  - name: "network_alerts"
    rules:
      # Network interface status alert
      - alert: "NetworkInterfaceDown"
        expr: node_network_up{device!="lo"} == 0
        for: 1m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "Network interface down"
          description: "Instance {{ $labels.instance }} network interface {{ $labels.device }} is down"

      # High network error rate
      - alert: "HighNetworkErrors"
        expr: rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "High network error rate"
          description: "Instance {{ $labels.instance }} interface {{ $labels.device }} network error rate is {{ printf \"%.2f\" $value }} errors/sec"

      # High bandwidth usage (absolute outbound throughput above 0.8 Gbps)
      - alert: "HighBandwidthUsage"
        expr: rate(node_network_transmit_bytes_total{device!="lo"}[5m]) * 8 / 1000 / 1000 / 1000 > 0.8
        for: 3m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "High bandwidth usage"
          description: "Instance {{ $labels.instance }} interface {{ $labels.device }} outbound throughput is {{ printf \"%.2f\" $value }}Gbps"
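
The bandwidth rule above compares throughput against a fixed 0.8 Gbps, which only makes sense for 1 Gbps links. Where node_exporter reports the link speed, the threshold can be expressed as a percentage of capacity instead. The rule below is a sketch that assumes node_network_speed_bytes is exposed for the interfaces of interest and carries the same device label. Virtual interfaces often report no speed or a non-positive one, which the > 0 filter drops.

```yaml
groups:
  - name: network_alerts_sketch
    rules:
      - alert: HighBandwidthUtilization
        expr: |
          rate(node_network_transmit_bytes_total{device!="lo"}[5m])
            / (node_network_speed_bytes{device!="lo"} > 0)
            * 100 > 80
        for: 3m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "High outbound bandwidth utilization"
          description: "Instance {{ $labels.instance }} interface {{ $labels.device }} is using {{ printf \"%.2f\" $value }}% of its link speed for outbound traffic"
```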

2.3 Application Service Alerting Rules

HTTP service alerts
groups:
  - name: "http_service_alerts"
    rules:
      # Service unavailable
      - alert: "ServiceDown"
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          category: service
        annotations:
          summary: "Service unavailable"
          description: "Service {{ $labels.job }} instance {{ $labels.instance }} has stopped responding"

      # High error rate
      # The regex "5.." matches three-character status codes starting with 5.
      - alert: "HighErrorRate"
        expr: (sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)) * 100 > 5
        for: 2m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "High HTTP error rate"
          description: "Service {{ $labels.job }} HTTP 5xx error rate {{ printf \"%.2f\" $value }}% exceeds 5%"

      # High response time
      - alert: "HighResponseTime"
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)) > 1
        for: 3m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "High response time"
          description: "Service {{ $labels.job }} P95 response time {{ printf \"%.2f\" $value }}s exceeds 1 second"

      # Low QPS alert
      - alert: "LowRequestRate"
        expr: sum(rate(http_requests_total[5m])) by (job) < 10
        for: 5m
        labels:
          severity: info
          category: application
        annotations:
          summary: "Low request rate"
          description: "Service {{ $labels.job }} QPS {{ printf \"%.2f\" $value }} is below the normal level"

      # Abnormal traffic spike
      - alert: "TrafficSpike"
        expr: sum(rate(http_requests_total[5m])) by (job) / avg_over_time((sum(rate(http_requests_total[5m])) by (job))[1h:5m]) > 3
        for: 2m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "Abnormal traffic spike"
          description: "Service {{ $labels.job }} current QPS is {{ printf \"%.2f\" $value }} times the average of the past hour"
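
The TrafficSpike expression re-evaluates a subquery on every rule evaluation. An alternative that is usually cheaper and easier to read is to record the per-job request rate once and compare it with its own recent average. The group below is a sketch of that approach. The recorded series name job:http_requests:rate5m is an assumption, and the alert only becomes meaningful once roughly an hour of recorded history exists.

```yaml
groups:
  - name: http_request_rate_recorded
    rules:
      # Recorded first, so the alert in the same group reads the current value.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - alert: TrafficSpike
        expr: job:http_requests:rate5m / avg_over_time(job:http_requests:rate5m[1h]) > 3
        for: 2m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "Abnormal traffic spike"
          description: "Service {{ $labels.job }} current QPS is {{ printf \"%.2f\" $value }} times the average of the past hour"
```
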
Database alerts
groups:
  - name: "database_alerts"
    rules:
      # MySQL connection count alert
      - alert: "MySQLHighConnections"
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "MySQL connection count too high"
          description: "MySQL instance {{ $labels.instance }} connection usage {{ printf \"%.2f\" $value }}% exceeds 80%"

      # MySQL slow query alert
      - alert: "MySQLSlowQueries"
        expr: rate(mysql_global_status_slow_queries[5m]) > 5
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "Too many MySQL slow queries"
          description: "MySQL instance {{ $labels.instance }} slow query rate is {{ printf \"%.2f\" $value }} queries/sec"

      # Redis memory usage alert
      # redis_memory_max_bytes is 0 when no maxmemory limit is set; the second
      # condition skips those instances instead of dividing by zero.
      - alert: "RedisHighMemoryUsage"
        expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 80 and redis_memory_max_bytes > 0
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "Redis memory usage too high"
          description: "Redis instance {{ $labels.instance }} memory usage {{ printf \"%.2f\" $value }}% exceeds 80%"

      # Redis connection count alert
      - alert: "RedisHighConnections"
        expr: redis_connected_clients > 100
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "Redis connection count too high"
          description: "Redis instance {{ $labels.instance }} current connection count {{ $value }} exceeds 100"

2.4 Business Alerting Rules

Business metric alerts
groups:
  - name: "business_alerts"
    rules:
      # Abnormal order volume
      # rate() returns a per-second value, so the threshold and the message use orders/sec.
      - alert: "LowOrderRate"
        expr: rate(orders_total[10m]) < 5
        for: 5m
        labels:
          severity: warning
          category: business
        annotations:
          summary: "Order volume abnormally low"
          description: "Current order rate {{ printf \"%.2f\" $value }} orders/sec is below the normal level"

      # Payment failure rate too high
      - alert: "HighPaymentFailureRate"
        expr: (sum(rate(payments_total{status="failed"}[5m])) / sum(rate(payments_total[5m]))) * 100 > 2
        for: 2m
        labels:
          severity: critical
          category: business
        annotations:
          summary: "Payment failure rate too high"
          description: "Payment failure rate {{ printf \"%.2f\" $value }}% exceeds the 2% threshold"

      # Abnormal user registrations
      - alert: "UserRegistrationAnomaly"
        expr: rate(user_registrations_total[5m]) / avg_over_time(rate(user_registrations_total[5m])[1h:5m]) > 5 or rate(user_registrations_total[5m]) / avg_over_time(rate(user_registrations_total[5m])[1h:5m]) < 0.2
        for: 3m
        labels:
          severity: warning
          category: business
        annotations:
          summary: "Abnormal user registrations"
          description: "User registration rate is abnormal; the ratio of the current value to the historical average is {{ printf \"%.2f\" $value }}"
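
Threshold rules like these stay silent when the underlying metric disappears entirely: a rate over a missing series returns no samples, so the comparison matches nothing and no alert fires. A companion rule using absent() can cover that gap. The sketch below uses orders_total from the rule above; the alert name, duration, and severity are assumptions.

```yaml
groups:
  - name: business_metric_presence
    rules:
      - alert: OrdersMetricMissing
        # absent() returns 1 when no series with this name exists.
        expr: absent(orders_total)
        for: 5m
        labels:
          severity: warning
          category: business
        annotations:
          summary: "orders_total is not being reported"
          description: "No orders_total series has been found for 5 minutes, so the order rate alert cannot evaluate"
```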