Monitoring Redis 8.0 three-master, three-replica clusters with Prometheus + Grafana

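I. Current situation

We currently manage three Redis clusters: two with 3 masters and 3 replicas, and one with 5 masters and 5 replicas, all running Redis 8.0.22. The business side has reported occasional slowness, data synchronization and backup problems, and big keys affecting the business.

To keep a constant view of how the Redis clusters are running, we plan to monitor them with the existing Prometheus + Grafana stack and to alert on Redis cluster anomalies.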

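II. Software involved

Component | Recommended version | Purpose
Redis Cluster | 8.0.22 | Three masters, three replicas; cluster already initialized (slots assigned, replication healthy)
Prometheus | v2.40+ | Metric collection and alert evaluation
redis_exporter | v1.50+ | Works with the Redis 8.0 features in use (e.g. ACL, cluster topology discovery)
Grafana | v10.0+ | Dashboards
Environment | Linux (e.g. Ubuntu 22.04, Rocky Linux) | Ports 9121-9123 (exporter) and 9090 (Prometheus) open

Installation of the software above is not covered step by step in this document; only the relevant configuration files are shown.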

III. Monitoring the Redis clusters

1. Install redis_exporter

wget https://github.com/oliver006/redis_exporter/releases/download/v1.50.0/redis_exporter-v1.50.0.linux-amd64.tar.gz
tar -zxvf redis_exporter-v1.50.0.linux-amd64.tar.gz
mv  redis_exporter-v1.50.0.linux-amd64  redis_exporter
cp -r redis_exporter  /usr/local/

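2. Configure redis_exporter

Because there are three Redis clusters and redis_exporter is installed under /usr/local/, three redis_exporter processes are run, one per cluster. Create a unit file for each process in /etc/systemd/system; the three processes listen on ports 9121-9123.

systemd unit file for process 1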

cat redis-11-exporter.service

[Unit]
Description=Redis Exporter for Prometheus
After=network.target

[Service]
ExecStart=/usr/local/redis_exporter/redis_exporter -redis.addr redis://10.10.10.1:6379 -is-cluster -web.listen-address :9121 -redis.password "1234565"
Restart=always

[Install]
WantedBy=multi-user.target

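-redis.addr redis://10.10.10.1:6379: the Redis address. In cluster mode a single node address is enough; use it together with -is-cluster.

-is-cluster: enables cluster mode.

-web.listen-address :9121: the address and port the exporter listens on.

-redis.password "1234565": the Redis password.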
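Passing the password with -redis.password leaves it readable in the unit file and in the process list. redis_exporter also reads settings from environment variables (REDIS_PASSWORD for the password), so one option is to move the secret into a root-only file loaded through systemd's EnvironmentFile=. The sketch below only prints the two files; the path /etc/redis_exporter/redis-11.env and the function names are assumptions made for illustration and are not part of the original setup.

```shell
#!/bin/sh
# Sketch: keep the Redis password out of the ExecStart line.
# ENV_FILE is a hypothetical path; adjust it to local conventions.
ENV_FILE=/etc/redis_exporter/redis-11.env

# Print the environment file. redis_exporter reads REDIS_PASSWORD at startup.
render_env_file() {
  printf 'REDIS_PASSWORD=%s\n' "$1"
}

# Print a unit equivalent to redis-11-exporter.service without the password flag.
render_unit_with_env() {
  cat <<EOF
[Unit]
Description=Redis Exporter for Prometheus
After=network.target

[Service]
EnvironmentFile=$ENV_FILE
ExecStart=/usr/local/redis_exporter/redis_exporter -redis.addr redis://10.10.10.1:6379 -is-cluster -web.listen-address :9121
Restart=always

[Install]
WantedBy=multi-user.target
EOF
}

render_env_file '1234565'
render_unit_with_env
```

The first printed line would go into the environment file (owned by root, mode 600) and the rest into the unit file, followed by systemctl daemon-reload.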

systemd unit file for process 2

cat redis-12-exporter.service

[Unit]
Description=Redis Exporter for Prometheus
After=network.target

[Service]
ExecStart=/usr/local/redis_exporter/redis_exporter -redis.addr redis://10.10.10.7:6379 -is-cluster -web.listen-address :9122 -redis.password "1234565"
Restart=always

[Install]
WantedBy=multi-user.target

systemd unit file for process 3

cat redis-13-exporter.service

[Unit]
Description=Redis Exporter for Prometheus
After=network.target

[Service]
ExecStart=/usr/local/redis_exporter/redis_exporter -redis.addr redis://10.10.10.15:6379 -is-cluster -web.listen-address :9123 -redis.password "1234565"
Restart=always

[Install]
WantedBy=multi-user.target
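The three unit files differ only in the Redis seed node and the listen port, so they can also be generated from one template. This is a minimal sketch rather than the author's procedure: the function names are hypothetical, and the output directory and password are passed in as arguments.

```shell
#!/bin/sh
# Sketch: generate the three exporter unit files from one template.

# $1 = Redis seed node (host:port), $2 = exporter listen port, $3 = Redis password
render_exporter_unit() {
  cat <<EOF
[Unit]
Description=Redis Exporter for Prometheus
After=network.target

[Service]
ExecStart=/usr/local/redis_exporter/redis_exporter -redis.addr redis://$1 -is-cluster -web.listen-address :$2 -redis.password "$3"
Restart=always

[Install]
WantedBy=multi-user.target
EOF
}

# $1 = output directory, $2 = Redis password.
# Unit names, seed nodes and ports are the ones used in the unit files above.
write_units() {
  while read -r name addr port; do
    render_exporter_unit "$addr" "$port" "$2" > "$1/$name.service"
  done <<'LIST'
redis-11-exporter 10.10.10.1:6379 9121
redis-12-exporter 10.10.10.7:6379 9122
redis-13-exporter 10.10.10.15:6379 9123
LIST
}

# Example: write_units /etc/systemd/system '1234565'
```

Adding a fourth cluster then means adding one line to the list instead of copying a whole unit file.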

Start the services

systemctl daemon-reload
systemctl start redis-11-exporter
systemctl start redis-12-exporter
systemctl start redis-13-exporter
systemctl enable redis-11-exporter
systemctl enable redis-12-exporter
systemctl enable redis-13-exporter
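Once the services are running, each exporter can be checked by hand before Prometheus is involved. The exporter's /scrape endpoint takes the Redis node as a target query parameter and should report redis_up 1 for a reachable node. A small sketch, with helper names that are illustrative only:

```shell
#!/bin/sh
# Sketch: query one exporter for one Redis node and read back redis_up.

# $1 = exporter host:port, $2 = Redis node URI
scrape_url() {
  printf 'http://%s/scrape?target=%s\n' "$1" "$2"
}

# Reads Prometheus exposition text on stdin and prints the redis_up value.
redis_up_from_scrape() {
  awk '$1 == "redis_up" || $1 ~ /^redis_up[{]/ { print $NF }'
}

# Usage (needs a running exporter and a reachable Redis node):
#   systemctl is-active redis-11-exporter
#   curl -s "$(scrape_url 127.0.0.1:9121 redis://10.10.10.1:6379)" | redis_up_from_scrape
```

A result of 0, or no output at all, usually points to a wrong address or password, or to a closed port.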

3. Configure Prometheus

global:
  scrape_interval: 15s  # global scrape interval
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'redis_cluster-11'
    metrics_path: /scrape  # multi-target endpoint; the node to scrape is passed as ?target=
    static_configs:
      - targets:  # list every node of the cluster
        - redis://master1:6379
        - redis://master2:6379
        - redis://master3:6379
        - redis://slave1:6379
        - redis://slave2:6379
        - redis://slave3:6379
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target  # pass the node address as the target parameter
      - source_labels: [__param_target]
        target_label: instance  # use the node address as the instance label
      - target_label: __address__
        replacement: exporter-ip:9121  # replace with the exporter address and port
    scrape_timeout: 15s  # raised from the 10s default; must not exceed scrape_interval
  - job_name: 'redis_cluster-12'
    metrics_path: /scrape  # multi-target endpoint; the node to scrape is passed as ?target=
    static_configs:
      - targets:  # list every node of the cluster
        - redis://master1:6379
        - redis://master2:6379
        - redis://master3:6379
        - redis://slave1:6379
        - redis://slave2:6379
        - redis://slave3:6379
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target  # pass the node address as the target parameter
      - source_labels: [__param_target]
        target_label: instance  # use the node address as the instance label
      - target_label: __address__
        replacement: exporter-ip:9122  # replace with the exporter address and port
    scrape_timeout: 15s  # raised from the 10s default; must not exceed scrape_interval
  - job_name: 'redis_cluster-13'
    metrics_path: /scrape  # multi-target endpoint; the node to scrape is passed as ?target=
    static_configs:
      - targets:  # list every node of the cluster
        - redis://master1:6379
        - redis://master2:6379
        - redis://master3:6379
        - redis://slave1:6379
        - redis://slave2:6379
        - redis://slave3:6379
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target  # pass the node address as the target parameter
      - source_labels: [__param_target]
        target_label: instance  # use the node address as the instance label
      - target_label: __address__
        replacement: exporter-ip:9123  # replace with the exporter address and port
    scrape_timeout: 15s  # raised from the 10s default; must not exceed scrape_interval

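Replace the Redis node addresses and the exporter address and port in each job. Take particular care that the exporter address and port match the configuration of the corresponding redis_exporter process.

The following is the Prometheus configuration file from the production environment.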
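Since each job must pair one cluster's nodes with the port of that cluster's exporter, generating the stanza from a single argument list keeps the two from drifting apart. The helper below is a hypothetical sketch that prints one scrape job in the same shape as the sample configuration; its name and argument order are assumptions.

```shell
#!/bin/sh
# Sketch: print one scrape job for a Redis cluster.
# $1 = job name, $2 = exporter host:port, remaining args = Redis nodes (host:port)
render_redis_job() {
  job=$1
  exporter=$2
  shift 2
  printf '  - job_name: %s\n' "$job"
  printf '    metrics_path: /scrape\n'
  printf '    static_configs:\n'
  printf '      - targets:\n'
  for node in "$@"; do
    printf '        - redis://%s\n' "$node"
  done
  printf '    relabel_configs:\n'
  printf '      - source_labels: [__address__]\n'
  printf '        target_label: __param_target\n'
  printf '      - source_labels: [__param_target]\n'
  printf '        target_label: instance\n'
  printf '      - target_label: __address__\n'
  printf '        replacement: %s\n' "$exporter"
}

render_redis_job redis_cluster-11 exporter-ip:9121 \
  master1:6379 master2:6379 master3:6379 slave1:6379 slave2:6379 slave3:6379
```

The output is meant to be pasted under scrape_configs; the placeholders exporter-ip and master1 to slave3 still need real addresses.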

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  evaluation_interval: 15s
runtime:
  gogc: 75
alerting:
  alertmanagers:
  - follow_redirects: true
    enable_http2: true
    http_headers: null
    scheme: http
    timeout: 10s
    api_version: v2
    static_configs:
    - targets:
      - 10.20.12.75:9093
rule_files:
- /usr/local/prometheus/rules.d/*.yml
scrape_configs:
- job_name: prometheus
  honor_timestamps: true
  track_timestamps_staleness: false
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /metrics
  scheme: http
  enable_compression: true
  follow_redirects: true
  enable_http2: true
  http_headers: null
  static_configs:
  - targets:
    - 10.20.12.75:9090
- job_name: nodes
  honor_timestamps: true
  track_timestamps_staleness: false
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /metrics
  scheme: http
  enable_compression: true
  follow_redirects: true
  enable_http2: true
  http_headers: null
  file_sd_configs:
  - files:
    - /usr/local/prometheus/conf.d/node*.yml
    refresh_interval: 5m
- job_name: portstatus
  honor_timestamps: true
  track_timestamps_staleness: false
  params:
    module:
    - tcp_connect
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /probe
  scheme: http
  enable_compression: true
  follow_redirects: true
  enable_http2: true
  http_headers: null
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: 10.20.12.75:9115
    action: replace
  file_sd_configs:
  - files:
    - /usr/local/prometheus/conf.d/portstatus.yml
    refresh_interval: 5m
- job_name: redis11
  honor_timestamps: true
  track_timestamps_staleness: false
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /scrape
  scheme: http
  enable_compression: true
  follow_redirects: true
  enable_http2: true
  http_headers: null
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: 10.20.10.75:9121
    action: replace
  static_configs:
  - targets:
    - redis://10.10.10.11:5000
    - redis://10.10.10.12:5000
    - redis://10.10.10.13:5000
    - redis://10.10.10.14:5000
    - redis://10.10.10.15:5000
    - redis://10.10.10.16:5000
    - redis://10.10.10.17:5000
    - redis://10.10.10.18:5000
    - redis://10.10.10.19:5000
    - redis://10.10.10.20:5000
- job_name: redis21
  honor_timestamps: true
  track_timestamps_staleness: false
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /scrape
  scheme: http
  enable_compression: true
  follow_redirects: true
  enable_http2: true
  http_headers: null
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: 10.20.10.75:9122
    action: replace
  static_configs:
  - targets:
    - redis://10.10.10.21:5000
    - redis://10.10.10.22:5000
    - redis://10.10.10.23:5000
    - redis://10.10.10.24:5000
    - redis://10.10.10.25:5000
    - redis://10.10.10.26:5000
- job_name: redis44
  honor_timestamps: true
  track_timestamps_staleness: false
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /scrape
  scheme: http
  enable_compression: true
  follow_redirects: true
  enable_http2: true
  http_headers: null
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: 10.20.10.75:9123
    action: replace
  static_configs:
  - targets:
    - redis://10.20.10.44:6379
    - redis://10.20.10.45:6379
    - redis://10.20.10.46:6379
    - redis://10.20.10.47:6379
    - redis://10.20.10.48:6379
    - redis://10.20.10.49:6379

This configuration file covers host, port, and Redis scraping, along with the alerting rule files.

4. Prometheus alerting rules for Redis
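Before reloading Prometheus with a configuration of this size, it helps to validate it first: promtool check config parses the file and the rule files it references. The sketch below assumes the configuration lives at /usr/local/prometheus/prometheus.yml (inferred from the rule_files path, not stated in the source) and that Prometheus is reachable at the address used in the prometheus job.

```shell
#!/bin/sh
# Assumed locations; override them through the environment if they differ.
PROM_CONFIG=${PROM_CONFIG:-/usr/local/prometheus/prometheus.yml}
PROM_URL=${PROM_URL:-http://10.20.12.75:9090}

# Validate the configuration, then ask Prometheus to reload it.
check_and_reload() {
  promtool check config "$PROM_CONFIG" || return 1
  # The reload endpoint works only when Prometheus runs with
  # --web.enable-lifecycle; otherwise send SIGHUP to the process instead.
  curl -fsS -X POST "$PROM_URL/-/reload"
}

# Usage: check_and_reload
```

Stopping at the first failed check avoids reloading a configuration that Prometheus would reject.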

cat redis_8.0_alerts.yml

# A rule whose metric is not exported never fires. Check every metric name in this
# file against the exporter's /scrape output before relying on the rule.
groups:
- name: redis_8.0_alerts
  rules:
  # ====================== 1. Cluster health alerts ======================
  - alert: RedisClusterSlotUncovered
    expr: 16384 - redis_cluster_slots_assigned > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis cluster slots not covered"
      description: "Instance {{ $labels.instance }} reports {{ $value }} unassigned slots; the cluster is unavailable"

  - alert: RedisClusterNodeRoleMismatch
    expr: redis_cluster_node_master{role="slave"} == 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis node role mismatch"
      description: "Instance {{ $labels.instance }} is actually a replica, but the cluster marks it as a master"

  - alert: RedisClusterReplicaLagHigh
    expr: redis_cluster_slave_offset_delay_seconds > 5
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Redis replica replication lag is high"
      description: "Replica {{ $labels.instance }} lags its master by {{ $value }} seconds"

  # ====================== 2. Performance and resource alerts ======================
  - alert: RedisMemoryUsageHigh
    # The second condition skips instances without a maxmemory limit (division by zero).
    expr: (redis_memory_used_bytes / redis_memory_max_bytes) > 0.85 and redis_memory_max_bytes > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis memory usage is high"
      description: "Instance {{ $labels.instance }} memory usage is at {{ $value | humanizePercentage }}"

  - alert: RedisMemoryFragmentationExcessive
    expr: redis_mem_fragmentation_ratio > 1.6
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Redis memory fragmentation is high"
      description: "Instance {{ $labels.instance }} has a memory fragmentation ratio of {{ $value }}; consider running MEMORY PURGE"

  - alert: RedisCommandLatencyHigh
    expr: redis_command_duration_seconds:99quantile{command=~"GET|SET|HGETALL"} > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis command latency is high"
      description: "P99 latency of the {{ $labels.command }} command is {{ $value }} seconds"

  # ====================== 3. Redis 8.0 feature alerts ======================
  - alert: RedisACLPermissionDenied
    expr: increase(redis_acl_denied_commands_total[5m]) > 10
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Too many Redis ACL permission denials"
      description: "Instance {{ $labels.instance }} had {{ $value }} commands denied by ACL in the last 5 minutes"

  - alert: RedisMemoryPurgeFailed
    expr: increase(redis_memory_purge_failures_total[1h]) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis memory purge failed"
      description: "Memory defragmentation failed on instance {{ $labels.instance }}"

  - alert: RedisConfigRewriteFailed
    expr: increase(redis_config_rewrite_failures_total[1h]) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis config rewrite failed"
      description: "Persisting the configuration failed on instance {{ $labels.instance }}"

  # ====================== 4. Availability alerts ======================
  - alert: RedisInstanceDown
    expr: redis_up == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Redis instance down"
      description: "Instance {{ $labels.instance }} has been offline for more than 3 minutes"

  - alert: RedisExporterUnreachable
    # The job regex must match the Redis job names in prometheus.yml (redis11, redis21, ...).
    expr: up{job=~"redis.*"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Redis exporter unreachable"
      description: "Scraping {{ $labels.instance }} through the exporter is failing; metrics cannot be collected"

  # ====================== 5. Persistence alerts ======================
  - alert: RedisAofFsyncFailed
    # Status gauge: 1 = last AOF write ok, 0 = failed.
    expr: redis_aof_last_write_status == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis AOF write failed"
      description: "The last AOF write on instance {{ $labels.instance }} failed; data may be lost, check disk space"

  - alert: RedisAofLoadError
    expr: redis_aof_loading_error == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis AOF file failed to load"
      description: "The AOF file on instance {{ $labels.instance }} is corrupt or malformed; data cannot be restored after a restart"

  - alert: RedisRdbSaveFailed
    # Status gauge: 1 = last background save ok, 0 = failed.
    expr: redis_rdb_last_bgsave_status == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis RDB persistence failed"
      description: "The last RDB background save on instance {{ $labels.instance }} failed; check disk permissions and space"

  - alert: RedisAofRewriteInProgressBlock
    expr: redis_aof_rewrite_in_progress == 1 and rate(redis_cpu_sys_seconds_total[5m]) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis AOF rewrite is slowing down requests"
      description: "AOF rewrite on instance {{ $labels.instance }} has run for 5 minutes with high system CPU usage and may be blocking requests"

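  # ====================== 6. Concurrency and connection alerts ======================
  - alert: RedisMaxClientsReached
    expr: redis_connected_clients >= redis_config_maxclients
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Redis connection limit reached"
      description: "Instance {{ $labels.instance }} has {{ $value }} connections and has reached maxclients; new connections are rejected"

  - alert: RedisClientsAbnormalIncrease
    # delta() rather than increase(): redis_connected_clients is a gauge.
    expr: delta(redis_connected_clients[5m]) / (redis_connected_clients offset 5m) > 2
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Redis connection count surged"
      description: "Connections on instance {{ $labels.instance }} grew by more than 2x within 5 minutes; check for client connection leaks or a traffic burst"

  - alert: RedisSlowlogAbnormalIncrease
    # redis_slowlog_length is capped by slowlog-max-len, so count new entries by ID instead.
    expr: increase(redis_slowlog_last_id[5m]) > 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis slow queries surged"
      description: "Instance {{ $labels.instance }} logged {{ $value }} new slow queries in the last 5 minutes; check for big keys or complex commands"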

  # ====================== 7. Security alerts ======================
  - alert: RedisACLUserLocked
    expr: redis_acl_users_locked > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis ACL user locked"
      description: "Instance {{ $labels.instance }} has {{ $value }} locked ACL users; check for too many wrong password attempts"

  - alert: RedisAnonymousUserAccess
    expr: redis_acl_anonymous_users > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis allows anonymous access"
      description: "Instance {{ $labels.instance }} has an anonymous user enabled and is accessible without authentication; disable it immediately with acl setuser default off"

  - alert: RedisSensitiveCommandExecuted
    # redis_exporter reports per-command call counts with a lowercase cmd label.
    expr: increase(redis_commands_total{cmd=~"config|flushdb|flushall|del"}[10m]) > 50
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis sensitive commands executed frequently"
      description: "Instance {{ $labels.instance }} ran {{ $labels.cmd }} {{ $value }} times in the last 10 minutes; watch for operator error or intrusion"

  # ====================== 8. Special memory alerts ======================
  - alert: RedisMemoryEvictionTriggered
    expr: increase(redis_evicted_keys_total[5m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis key eviction triggered"
      description: "Instance {{ $labels.instance }} evicted {{ $value }} keys in the last 5 minutes; add memory or tune the expiry policy"

  - alert: RedisTransparentHugePageEnabled
    expr: redis_transparent_hugepage == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Transparent huge pages enabled on Redis host"
      description: "Instance {{ $labels.instance }} has THP enabled, which causes latency spikes; run echo never > /sys/kernel/mm/transparent_hugepage/enabled immediately"

  - alert: RedisMemoryAllocFailed
    expr: increase(redis_allocator_failures_total[1m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis memory allocation failed"
      description: "Instance {{ $labels.instance }} had {{ $value }} memory allocation failures and can no longer allocate memory; free memory or scale up immediately"

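  # ====================== 9. Node and cluster alerts ======================
  - alert: RedisClusterNodeDisconnected
    expr: redis_cluster_connected_nodes < 6  # for a 3-master, 3-replica cluster (6 nodes); adjust as needed
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Redis cluster node disconnected"
      description: "The cluster currently sees only {{ $value }} connected nodes (6 expected); check node networking and cluster state"

  - alert: RedisClusterSlotMigrationFailed
    expr: redis_cluster_slots_migrating > 0 and redis_cluster_slots_migrating == redis_cluster_slots_migrating offset 5m
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis cluster slot migration stuck"
      description: "Slot migration on instance {{ $labels.instance }} has not finished after 5 minutes; check the network between nodes"

  - alert: RedisClusterReplicaInsufficient
    # Each master needs at least 1 replica; count() by instance can never drop below 1,
    # so compare the replica count reported by each master instead.
    expr: redis_connected_slaves < 1 and on (instance) redis_instance_info{role="master"}
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Redis cluster has too few replicas"
      description: "Master {{ $labels.instance }} has no available replica; automatic failover is impossible if it goes down"

  - alert: RedisClusterBusError
    expr: increase(redis_cluster_bus_errors_total[1m]) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis cluster bus error"
      description: "Cluster bus (default port 16379) communication errors on instance {{ $labels.instance }}, affecting replication and cluster topology"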

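  # ====================== 10. System and self-health alerts ======================
  - alert: RedisRestartDetected
    expr: redis_uptime_in_seconds < 300  # restarted within the last 5 minutes
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis instance restarted unexpectedly"
      description: "Instance {{ $labels.instance }} restarted within the last 5 minutes; check for OOM or a crash"

  - alert: RedisClusterVersionMismatch
    # Counts the distinct redis_version values per job (one job per cluster).
    expr: count by (job) (count by (job, redis_version) (redis_instance_info)) > 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis cluster versions differ"
      description: "The cluster runs more than one Redis version, which can cause slot and replication problems; upgrade all nodes to 8.0.x"

  - alert: RedisServerDiskFull
    expr: redis_disk_used_percent > 90
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Redis server disk full"
      description: "Disk usage on the server hosting instance {{ $labels.instance }} is at {{ $value }}%; free up space immediately"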
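A rule that references a metric the exporter does not export simply never fires, so it is worth comparing the metric names in the rules file with a saved scrape before trusting the alerts. The helper below is a rough sketch under stated assumptions: the function name is hypothetical, it only looks at names starting with redis_ on expr: lines, and the scrape file is expected to be the output of one /scrape request saved to disk.

```shell
#!/bin/sh
# Sketch: list redis_* metric names used in a rules file that are absent
# from a saved exporter scrape.
# $1 = rules file, $2 = file holding the text of one /scrape response
missing_metrics() {
  known=$(mktemp)
  # Metric names present in the scrape (comment lines start with '#', so they are skipped).
  grep -oE '^redis_[a-zA-Z0-9_:]+' "$2" | sort -u > "$known"
  # Names used in expr: lines, minus the ones the exporter actually returned.
  grep -E '^[[:space:]]*expr:' "$1" \
    | grep -oE 'redis_[a-zA-Z0-9_:]+' \
    | sort -u \
    | grep -vxFf "$known"
  rm -f "$known"
}

# Usage (exporter address and node are examples taken from the configuration above):
#   curl -s 'http://10.20.10.75:9121/scrape?target=redis://10.10.10.11:5000' > scrape.txt
#   missing_metrics redis_8.0_alerts.yml scrape.txt
#   promtool check rules redis_8.0_alerts.yml
```

Every name it prints needs either a corrected metric name or a different data source. promtool check rules covers the complementary case of syntax errors in the file.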

5. Grafana dashboards

Import a Redis dashboard template from the official Grafana site.
