Prometheus 05-01: Alerting Rules and Alertmanager Configuration

Alert rule design, Alertmanager configuration, and notification integration

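I. Alerting System Architecture Overview

1.1 Components of the Prometheus Alerting System

The Prometheus alerting system consists of the following core components:
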
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│                 │    │                 │    │                 │
│   Prometheus    │───▶│  Alertmanager   │───▶│   Receivers     │
│     Server      │    │                 │    │ (Email/Slack/   │
│                 │    │                 │    │  WebHook etc.)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         │                       │                       │
    ┌────▼────┐              ┌───▼───┐              ┌────▼────┐
    │ Alert   │              │ Route │              │ Silence │
    │ Rules   │              │ Tree  │              │ Manager │
    └─────────┘              └───────┘              └─────────┘
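Component responsibilities
  1. Prometheus Server

    • Evaluates alerting rules
    • Generates alert instances
    • Sends alerts to Alertmanager
  2. Alertmanager

    • Receives alerts from Prometheus
    • Groups, inhibits, and silences alerts
    • Routes alerts to different receivers
    • Manages the alert lifecycle
  3. Alert Rules

    • Define the alert conditions
    • Configure alert labels and annotations
    • Set how long a condition must hold before the alert fires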
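
The division of responsibilities above shows up in the Prometheus server configuration, which loads the rule files and points at Alertmanager. The excerpt below is a minimal sketch: the file name, the evaluation interval, and the Alertmanager address are assumptions, not values taken from this article.

```yaml
# prometheus.yml (excerpt; all concrete values are assumptions)
global:
  evaluation_interval: 30s        # how often alerting rules are evaluated

rule_files:
  - "alert_rules.yml"             # assumed rule file name

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"    # assumed address; 9093 is Alertmanager's default port
```

Prometheus only evaluates rules and pushes the resulting alerts. Grouping, routing, inhibition, and silencing all happen on the Alertmanager side.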

1.2 The Alerting Flow in Detail

The alerting flow proceeds as follows:

  1. Metrics Collection
  2. Rule Evaluation
  3. Alert Condition Met?
       No:  Continue Monitoring, then Wait for Next Cycle
       Yes: Generate Alert, then Send to Alertmanager
  4. Grouping
  5. Routing
  6. Inhibition Check
  7. Silence Check
  8. Send Notification?
       Yes: Send to Receiver, then Notification Sent
       No:  Suppress Alert
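
The grouping, routing, and inhibition stages of this flow are driven by the Alertmanager configuration file. The sketch below is a minimal illustration rather than a recommended setup: receiver names, webhook URLs, and all timing values are placeholders, and the matchers syntax assumes Alertmanager 0.22 or later.

```yaml
# alertmanager.yml (minimal sketch; names, URLs, and timings are placeholders)
route:
  receiver: default-webhook          # fallback receiver when no child route matches
  group_by: [alertname, instance]    # alerts sharing these labels are batched together
  group_wait: 30s                    # delay before the first notification of a new group
  group_interval: 5m                 # delay before notifying about new alerts in a group
  repeat_interval: 4h                # delay before re-sending a still-firing notification
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall-webhook

receivers:
  - name: default-webhook
    webhook_configs:
      - url: "http://alert-gateway.example.internal/default"   # hypothetical endpoint
  - name: oncall-webhook
    webhook_configs:
      - url: "http://alert-gateway.example.internal/oncall"    # hypothetical endpoint

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: [instance]                # inhibit only when both alerts share the same instance
```

Silences do not appear in this file. They are created at runtime through the Alertmanager UI, API, or amtool.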

Alert state transitions
Inactive ──condition_met──▶ Pending ──for_duration──▶ Firing
    ▲                          │                        │
    │                          │                        │
    └──condition_not_met───────┴────condition_not_met──┘
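  • Inactive: the alert condition is not met
  • Pending: the condition is met but has not yet held for the duration configured in the rule's for field
  • Firing: the condition has held for at least that duration, and the alert is sent to Alertmanager for notification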
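
To make the transitions concrete, the sketch below annotates a single rule with a timeline. The group name, alert name, and the 30s interval are assumptions chosen so that the arithmetic is easy to follow.

```yaml
groups:
  - name: state_transition_demo      # hypothetical group name
    interval: 30s                    # evaluation interval for this group (assumed value)
    rules:
      - alert: TargetDown            # hypothetical alert name
        expr: up == 0
        for: 2m
        # Timeline, assuming the target goes down just before the evaluation at t=0:
        #   t=0s      expr returns a result for the first time -> Pending
        #   t=30s..   expr keeps returning it                  -> Pending
        #   t=120s    the condition has held for 2m            -> Firing
        # If expr returns nothing at any evaluation in between,
        # the alert drops back to Inactive and the timer starts over.
```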

II. Alerting Rule Configuration

2.1 Basic Alerting Rule Syntax

Rule file structure
# alert_rules.yml
groups:
  - name: "basic_system_alerts"
    rules:
      - alert: "HighCPUUsage"
        # rate() is used rather than irate(): the Prometheus documentation recommends rate() for alerting
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "CPU usage on instance {{ $labels.instance }} is too high"
          description: "CPU usage on {{ $labels.instance }} is {{ $value }}%, above the 80% threshold"

  - name: "application_service_alerts"
    rules:
      - alert: "ServiceDown"
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Service {{ $labels.job }} is unavailable"
          description: "Service {{ $labels.job }} on instance {{ $labels.instance }} has stopped running"
Rule configuration elements
# Complete alerting rule example
- alert: "AlertName"                    # alert name (required)
  expr: prometheus_query_expression     # PromQL expression (required)
  for: 5m                               # hold duration (optional, default 0)
  labels:                               # alert labels (optional)
    severity: warning
    team: platform
    env: production
  annotations:                          # alert annotations (optional)
    summary: "Short description"
    description: "Detailed description; template variables are supported"
    runbook_url: "https://runbook.example.com/alerts/alert-name"
    dashboard_url: "https://grafana.example.com/dashboard"
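
Annotations are rendered as templates, so they can reference the alert's labels and value and pipe them through Prometheus's built-in template functions. The rule below is an illustrative sketch: the group name, alert name, and thresholds are hypothetical, and it uses humanizePercentage to turn a 0 to 1 ratio into a readable percentage.

```yaml
groups:
  - name: template_examples             # hypothetical group name
    rules:
      - alert: FilesystemAlmostFull     # hypothetical alert name
        # Ratio of available to total space; fires when less than 10% is left
        expr: |
          node_filesystem_avail_bytes{fstype!="tmpfs"}
            / node_filesystem_size_bytes{fstype!="tmpfs"} < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          # $labels exposes the labels of the series that triggered the alert
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is almost full"
          # $value is the result of expr; humanizePercentage formats a ratio as a percentage
          description: "Only {{ $value | humanizePercentage }} of the space is still available"
```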

2.2 Common System Alerting Rules

CPU alerts
groups:
  - name: "cpu_alerts"
    rules:
      # CPU usage alert (rate() rather than irate(): the Prometheus documentation recommends rate() for alerting)
      - alert: "HighCPUUsage"
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) > 80
        for: 2m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High CPU usage warning"
          description: "CPU usage on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"

      # CPU load alert (on(instance) is required because count by(instance) drops the job label that node_load5 still carries)
      - alert: "HighCPULoad"
        expr: node_load5 / on(instance) count by(instance) (node_cpu_seconds_total{mode="idle"}) > 1.5
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High CPU load warning"
          description: "5-minute load per CPU core on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}, above the 1.5 threshold"

      # Critical CPU usage alert
      - alert: "CriticalCPUUsage"
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)) > 95
        for: 1m
        labels:
          severity: critical
          category: system
        annotations:
          summary: "Critical CPU usage alert"
          description: "CPU usage on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 95%; immediate action is required"
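
The warning and critical CPU alerts repeat the same expression with different thresholds. One way to avoid the duplication is a recording rule that precomputes the usage once. This is a sketch of that option, not something the original rules require. The recorded series name instance:node_cpu_usage:percent and the group name are assumptions.

```yaml
groups:
  - name: cpu_usage_recorded          # hypothetical group name
    rules:
      # Rules in a group run in order, so the alerts below see the value recorded here
      - record: instance:node_cpu_usage:percent     # assumed name, level:metric:operation style
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      - alert: HighCPUUsage
        expr: instance:node_cpu_usage:percent > 80
        for: 2m
        labels:
          severity: warning
          category: system

      - alert: CriticalCPUUsage
        expr: instance:node_cpu_usage:percent > 95
        for: 1m
        labels:
          severity: critical
          category: system
```

The recorded series is also convenient for dashboards, since the same number then drives both graphs and alerts.
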
Memory alerts
groups:
  - name: "memory_alerts"
    rules:
      # Memory usage alert
      - alert: "HighMemoryUsage"
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High memory usage warning"
          description: "Memory usage on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"

      # Low available memory alert
      - alert: "LowMemoryAvailable"
        expr: node_memory_MemAvailable_bytes / 1024 / 1024 / 1024 < 1
        for: 1m
        labels:
          severity: critical
          category: system
        annotations:
          summary: "Critically low memory"
          description: "Instance {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}GB of memory available"

      # Swap usage alert
      - alert: "HighSwapUsage"
        expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 50
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High swap usage warning"
          description: "Swap usage on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 50%"
Disk alerts
groups:
  - name: "disk_alerts"
    rules:
      # Disk space usage alert
      - alert: "HighDiskUsage"
        expr: (1 - (node_filesystem_free_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High disk usage warning"
          description: "Disk usage of mount point {{ $labels.mountpoint }} on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"

      # Disk space about to run out
      - alert: "DiskSpaceLow"
        expr: node_filesystem_free_bytes{fstype!="tmpfs"} / 1024 / 1024 / 1024 < 5
        for: 1m
        labels:
          severity: critical
          category: system
        annotations:
          summary: "Critically low disk space"
          description: "Mount point {{ $labels.mountpoint }} on instance {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}GB left"

      # Disk I/O alert. node_disk_io_time_seconds_total counts the seconds the device spent doing I/O,
      # so this expression measures utilization (percent of time busy), not per-request latency.
      - alert: "HighDiskIOUtilization"
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 3m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "High disk I/O utilization"
          description: "Disk {{ $labels.device }} on instance {{ $labels.instance }} was busy {{ printf \"%.2f\" $value }}% of the time, above 80%"

      # Disk fill prediction alert
      - alert: "DiskWillFillSoon"
        expr: predict_linear(node_filesystem_free_bytes{fstype!="tmpfs"}[1h], 4 * 3600) <= 0
        for: 5m
        labels:
          severity: warning
          category: system
        annotations:
          summary: "Disk space forecast warning"
          description: "Based on the current trend, mount point {{ $labels.mountpoint }} on instance {{ $labels.instance }} will run out of space within 4 hours"
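
A linear forecast over a one-hour window can be noisy: a short burst of writes may extrapolate to a full disk even when plenty of space remains. The variant below is a sketch of one common mitigation, which is to require that the filesystem is already fairly full and writable before the forecast counts. The longer window, the 40% floor, the alert name, and the group name are all assumptions to tune per environment.

```yaml
groups:
  - name: disk_prediction_sketch          # hypothetical group name
    rules:
      - alert: DiskWillFillSoonGuarded    # hypothetical alert name
        # All three conditions must hold for the same filesystem series:
        #   1. the 6h trend predicts no available space within 24h
        #   2. less than 40% of the space is available right now
        #   3. the filesystem is not mounted read-only
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 24 * 3600) < 0
          and
          node_filesystem_avail_bytes{fstype!="tmpfs"}
            / node_filesystem_size_bytes{fstype!="tmpfs"} < 0.40
          and
          node_filesystem_readonly{fstype!="tmpfs"} == 0
        for: 1h
        labels:
          severity: warning
          category: system
```
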
Network alerts
groups:
  - name: "network_alerts"
    rules:
      # Network interface state alert
      - alert: "NetworkInterfaceDown"
        expr: node_network_up{device!="lo"} == 0
        for: 1m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "Network interface down"
          description: "Network interface {{ $labels.device }} on instance {{ $labels.instance }} is down"

      # High network error rate
      - alert: "HighNetworkErrors"
        expr: rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "High network error rate"
          description: "Interface {{ $labels.device }} on instance {{ $labels.instance }} is seeing {{ printf \"%.2f\" $value }} errors/sec"

      # High bandwidth usage (the 0.8 Gbps threshold corresponds to 80% of a 1 Gbps link)
      - alert: "HighBandwidthUsage"
        expr: rate(node_network_transmit_bytes_total{device!="lo"}[5m]) * 8 / 1000 / 1000 / 1000 > 0.8
        for: 3m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "High bandwidth usage"
          description: "Egress throughput of interface {{ $labels.device }} on instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}Gbps"
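
The fixed 0.8 Gbps threshold only fits 1 Gbps links. Where node_exporter reports the link speed, the threshold can be expressed relative to it instead. The rule below is a sketch under that assumption: node_network_speed_bytes may be missing or non-positive on virtual interfaces, which is why the second condition filters those out. The group and alert names are hypothetical.

```yaml
groups:
  - name: network_utilization_sketch      # hypothetical group name
    rules:
      - alert: HighLinkUtilization        # hypothetical alert name
        # Both sides are in bytes per second, so the ratio is the fraction of link capacity in use
        expr: |
          (
            rate(node_network_transmit_bytes_total{device!="lo"}[5m])
              / node_network_speed_bytes{device!="lo"}
          ) * 100 > 80
          and
          node_network_speed_bytes{device!="lo"} > 0
        for: 3m
        labels:
          severity: warning
          category: network
        annotations:
          summary: "High link utilization"
          description: "Interface {{ $labels.device }} on instance {{ $labels.instance }} is using {{ printf \"%.2f\" $value }}% of its egress capacity"
```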

2.3 Application Service Alerting Rules

HTTP service alerts
groups:
  - name: "http_service_alerts"
    rules:
      # Service unavailable
      - alert: "ServiceDown"
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          category: service
        annotations:
          summary: "Service unavailable"
          description: "Instance {{ $labels.instance }} of service {{ $labels.job }} has stopped responding"

      # High error rate ("5.." matches three-character codes such as 500 or 503; regex matchers are fully anchored)
      - alert: "HighErrorRate"
        expr: (sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)) * 100 > 5
        for: 2m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "High HTTP error rate"
          description: "HTTP 5xx error rate of service {{ $labels.job }} is {{ printf \"%.2f\" $value }}%, above 5%"

      # High response time
      - alert: "HighResponseTime"
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)) > 1
        for: 3m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "High response time"
          description: "P95 response time of service {{ $labels.job }} is {{ printf \"%.2f\" $value }}s, above 1 second"

      # Low QPS alert
      - alert: "LowRequestRate"
        expr: sum(rate(http_requests_total[5m])) by (job) < 10
        for: 5m
        labels:
          severity: info
          category: application
        annotations:
          summary: "Low request rate"
          description: "QPS of service {{ $labels.job }} is {{ printf \"%.2f\" $value }}, below the normal level"

      # Abnormal traffic spike (the aggregation is parenthesized so the [1h:5m] subquery applies to the whole sum)
      - alert: "TrafficSpike"
        expr: sum by (job) (rate(http_requests_total[5m])) / avg_over_time((sum by (job) (rate(http_requests_total[5m])))[1h:5m]) > 3
        for: 2m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "Abnormal traffic spike"
          description: "Current QPS of service {{ $labels.job }} is {{ printf \"%.2f\" $value }} times the average of the past hour"
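
An up == 0 rule only fires while Prometheus still has the target in its target list. If a target drops out of service discovery entirely, the up series disappears and the rule returns nothing. The sketch below covers that case with absent(). The job name api and the group and alert names are placeholders.

```yaml
groups:
  - name: absent_target_sketch       # hypothetical group name
    rules:
      - alert: ServiceAbsent         # hypothetical alert name
        # absent() returns a single series with value 1 when no matching series exists
        expr: absent(up{job="api"})  # "api" is a placeholder job name
        for: 5m
        labels:
          severity: critical
          category: service
        annotations:
          summary: "No targets found for job api"
          description: "Prometheus has no up series for job api; its targets may have dropped out of service discovery"
```

Because absent() needs a concrete selector, a rule like this has to be written per job.
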
Database alerts
groups:
  - name: "database_alerts"
    rules:
      # MySQL connection count alert
      - alert: "MySQLHighConnections"
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "MySQL connection count too high"
          description: "Connection usage on MySQL instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"

      # MySQL slow query alert
      - alert: "MySQLSlowQueries"
        expr: rate(mysql_global_status_slow_queries[5m]) > 5
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "Too many MySQL slow queries"
          description: "Slow query rate on MySQL instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }} queries/sec"

      # Redis memory usage alert (skipped when maxmemory is unset, since redis_memory_max_bytes is then 0)
      - alert: "RedisHighMemoryUsage"
        expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 80 and redis_memory_max_bytes > 0
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "Redis memory usage too high"
          description: "Memory usage on Redis instance {{ $labels.instance }} is {{ printf \"%.2f\" $value }}%, above 80%"

      # Redis connection count alert
      - alert: "RedisHighConnections"
        expr: redis_connected_clients > 100
        for: 2m
        labels:
          severity: warning
          category: database
        annotations:
          summary: "Redis connection count too high"
          description: "Redis instance {{ $labels.instance }} currently has {{ $value }} connections, above 100"
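
The rules above assume the exporters can actually reach their databases. The generic up metric only says that the exporter itself was scraped. mysqld_exporter and redis_exporter each expose a separate gauge for the database connection. A sketch of availability alerts built on those gauges, with hypothetical group and alert names:

```yaml
groups:
  - name: database_availability_sketch   # hypothetical group name
    rules:
      - alert: MySQLDown                  # hypothetical alert name
        expr: mysql_up == 0               # 0 means the exporter could not query MySQL
        for: 1m
        labels:
          severity: critical
          category: database
        annotations:
          summary: "MySQL is unreachable"
          description: "The exporter on {{ $labels.instance }} cannot connect to MySQL"

      - alert: RedisDown                  # hypothetical alert name
        expr: redis_up == 0               # 0 means the exporter could not reach Redis
        for: 1m
        labels:
          severity: critical
          category: database
        annotations:
          summary: "Redis is unreachable"
          description: "The exporter on {{ $labels.instance }} cannot connect to Redis"
```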

2.4 Business Alerting Rules

Business metric alerts
groups:
  - name: "business_alerts"
    rules:
      # Abnormal order volume (rate() returns a per-second value)
      - alert: "LowOrderRate"
        expr: rate(orders_total[10m]) < 5
        for: 5m
        labels:
          severity: warning
          category: business
        annotations:
          summary: "Order volume abnormally low"
          description: "Current order rate is {{ printf \"%.2f\" $value }} orders/sec, below the normal level"

      # Payment failure rate too high
      - alert: "HighPaymentFailureRate"
        expr: (sum(rate(payments_total{status="failed"}[5m])) / sum(rate(payments_total[5m]))) * 100 > 2
        for: 2m
        labels:
          severity: critical
          category: business
        annotations:
          summary: "Payment failure rate too high"
          description: "Payment failure rate is {{ printf \"%.2f\" $value }}%, above the 2% threshold"

      # User registration anomaly
      - alert: "UserRegistrationAnomaly"
        expr: rate(user_registrations_total[5m]) / avg_over_time(rate(user_registrations_total[5m])[1h:5m]) > 5 or rate(user_registrations_total[5m]) / avg_over_time(rate(user_registrations_total[5m])[1h:5m]) < 0.2
        for: 3m
        labels:
          severity: warning
          category: business
        annotations:
          summary: "User registration anomaly"
          description: "User registration rate is abnormal; the ratio of the current value to the historical average is {{ printf \"%.2f\" $value }}"
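
Business metrics such as order volume usually follow a daily cycle, so a fixed lower bound tends to fire every night. One option is to gate the rule on the time of day. The sketch below assumes business hours of 01:00 to 13:00 UTC (hour() always returns UTC). The window, the group name, and the alert name are placeholders.

```yaml
groups:
  - name: business_hours_sketch            # hypothetical group name
    rules:
      - alert: LowOrderRateBusinessHours   # hypothetical alert name
        # "and on()" keeps the left-hand result only while the right-hand side returns a sample,
        # which happens when the current UTC hour falls inside the assumed window
        expr: |
          rate(orders_total[10m]) < 5
          and on() (hour() >= 1)
          and on() (hour() < 13)
        for: 5m
        labels:
          severity: warning
          category: business
        annotations:
          summary: "Order volume abnormally low during business hours"
          description: "Current order rate is {{ printf \"%.2f\" $value }} orders/sec"
```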