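I. Current Situation
We currently manage three Redis clusters: two have three masters and three replicas each, and one has five masters and five replicas. All run Redis 8.0.22. The business teams report occasional slow responses, data synchronization and backup problems, and large keys that affect the business.
To keep the running state of the Redis clusters under continuous watch, we plan to monitor them with the existing Prometheus + Grafana deployment and to alert on Redis cluster anomalies.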
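II. Software Involved
| Component | Recommended version | Purpose / notes |
|---|---|---|
| Redis Cluster | 8.0.22 | Three masters and three replicas (five and five for the larger cluster); cluster initialization is complete (slots assigned, master-replica replication healthy) |
| Prometheus | v2.40+ | Core metric collection and alert evaluation |
| redis_exporter | v1.50+ | Supports Redis 8.0 features (e.g. ACL, cluster topology discovery) |
| Grafana | v10.0+ | Dashboards and visualization |
| Environment | Linux (e.g. Ubuntu 22.04, Rocky Linux) | Open ports 9121-9123 (exporters) and 9090 (Prometheus) |
This document does not walk through installing each of the components above one by one; it only shows the corresponding configuration files.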
III. Implementing Redis Cluster Monitoring
1. Install redis_exporter
wget https://github.com/oliver006/redis_exporter/releases/download/v1.50.0/redis_exporter-v1.50.0.linux-amd64.tar.gz
tar -zxvf redis_exporter-v1.50.0.linux-amd64.tar.gz
mv redis_exporter-v1.50.0.linux-amd64 redis_exporter
cp -r redis_exporter /usr/local/
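2. Configure redis_exporter
Because there are three Redis clusters, three redis_exporter processes are started from the install directory /usr/local/redis_exporter. Create one systemd unit file per process under /etc/systemd/system. The three processes listen on ports 9121, 9122, and 9123.
systemd unit file for process 1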
cat redis-11-exporter.service
[Unit]
Description=Redis Exporter for Prometheus
After=network.target
[Service]
ExecStart=/usr/local/redis_exporter/redis_exporter -redis.addr redis://10.10.10.1:6379 -is-cluster -web.listen-address :9121 -redis.password "1234565"
Restart=always
[Install]
WantedBy=multi-user.target
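-redis.addr redis://10.10.10.1:6379: the Redis address. In cluster mode the address of a single node is enough; use it together with -is-cluster.
-is-cluster: the target is a Redis Cluster.
-web.listen-address :9121: the address and port the exporter listens on.
-redis.password "1234565": the Redis password.
systemd unit file for process 2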
cat redis-12-exporter.service
[Unit]
Description=Redis Exporter for Prometheus
After=network.target
[Service]
ExecStart=/usr/local/redis_exporter/redis_exporter -redis.addr redis://10.10.10.7:6379 -is-cluster -web.listen-address :9122 -redis.password "1234565"
Restart=always
[Install]
WantedBy=multi-user.target
systemd unit file for process 3
cat redis-13-exporter.service
[Unit]
Description=Redis Exporter for Prometheus
After=network.target
[Service]
ExecStart=/usr/local/redis_exporter/redis_exporter -redis.addr redis://10.10.10.15:6379 -is-cluster -web.listen-address :9123 -redis.password "1234565"
Restart=always
[Install]
WantedBy=multi-user.target
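The three unit files differ only in name, Redis address, and port, so they can be generated from one template. The following is a minimal sketch of that idea, not part of the original setup: the `render_unit` helper, the `./units` output directory, and the `/etc/redis_exporter/<name>.env` path are all assumptions made for illustration. It also moves the password out of `ExecStart` and into an environment file (redis_exporter reads `REDIS_PASSWORD` from the environment), so the password does not show up in the process list.

```shell
#!/usr/bin/env bash
# Sketch: generate the three exporter unit files from one template.

# render_unit NAME REDIS_ADDR PORT: print a unit file to stdout.
render_unit() {
  local name="$1" addr="$2" port="$3"
  cat <<EOF
[Unit]
Description=Redis Exporter for Prometheus (${name})
After=network.target

[Service]
# Hypothetical env file holding one line: REDIS_PASSWORD=<password>
EnvironmentFile=/etc/redis_exporter/${name}.env
ExecStart=/usr/local/redis_exporter/redis_exporter -redis.addr ${addr} -is-cluster -web.listen-address :${port}
Restart=always

[Install]
WantedBy=multi-user.target
EOF
}

# Write into a scratch directory first; review, then copy to /etc/systemd/system.
OUT_DIR="${OUT_DIR:-./units}"
mkdir -p "$OUT_DIR"

render_unit redis-11-exporter redis://10.10.10.1:6379  9121 > "$OUT_DIR/redis-11-exporter.service"
render_unit redis-12-exporter redis://10.10.10.7:6379  9122 > "$OUT_DIR/redis-12-exporter.service"
render_unit redis-13-exporter redis://10.10.10.15:6379 9123 > "$OUT_DIR/redis-13-exporter.service"
```

If this approach is used, each env file should be readable by root only (for example mode 600), since it holds the Redis password.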
Start the services and enable them at boot
systemctl daemon-reload
systemctl start redis-11-exporter
systemctl start redis-12-exporter
systemctl start redis-13-exporter
systemctl enable redis-11-exporter
systemctl enable redis-12-exporter
systemctl enable redis-13-exporter
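Before wiring the exporters into Prometheus, it can help to confirm that each one actually reaches its cluster. A minimal sketch follows; `redis_up_value` is a hypothetical helper name. It reads exporter output on standard input and prints the value of `redis_up` (1 means the exporter connected to Redis). The commented `curl` calls assume they are run on the exporter host and use the node addresses from the unit files.

```shell
#!/usr/bin/env bash
# Print the value of redis_up from exporter output read on stdin.
# Exits non-zero when the metric is not present at all.
redis_up_value() {
  awk '$1 ~ /^redis_up(\{.*\})?$/ { print $NF; found = 1 }
       END { exit found ? 0 : 1 }'
}

# Example against live exporters (run on the exporter host):
#   curl -s 'http://localhost:9121/scrape?target=redis://10.10.10.1:6379'  | redis_up_value
#   curl -s 'http://localhost:9122/scrape?target=redis://10.10.10.7:6379'  | redis_up_value
#   curl -s 'http://localhost:9123/scrape?target=redis://10.10.10.15:6379' | redis_up_value
```

A printed `0` usually points at a wrong address or password; no output at all suggests the exporter itself is not answering on that port.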
3. Configure Prometheus
global:
  scrape_interval: 15s # global scrape interval
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'redis_cluster-11'
    metrics_path: /scrape # use the /scrape endpoint in cluster mode
    static_configs:
      - targets: # list every node of the cluster
          - redis://master1:6379
          - redis://master2:6379
          - redis://master3:6379
          - redis://slave1:6379
          - redis://slave2:6379
          - redis://slave3:6379
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target # pass the node address as the target parameter
      - source_labels: [__param_target]
        target_label: instance # use the node address as the instance label
      - target_label: __address__
        replacement: exporter-ip:9121 # replace with the exporter address and port
    scrape_timeout: 15s # longer than the 10s default for cluster scrapes; must not exceed scrape_interval
  - job_name: 'redis_cluster-12'
    metrics_path: /scrape # use the /scrape endpoint in cluster mode
    static_configs:
      - targets: # list every node of the cluster
          - redis://master1:6379
          - redis://master2:6379
          - redis://master3:6379
          - redis://slave1:6379
          - redis://slave2:6379
          - redis://slave3:6379
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target # pass the node address as the target parameter
      - source_labels: [__param_target]
        target_label: instance # use the node address as the instance label
      - target_label: __address__
        replacement: exporter-ip:9122 # replace with the exporter address and port
    scrape_timeout: 15s # longer than the 10s default for cluster scrapes; must not exceed scrape_interval
  - job_name: 'redis_cluster-13'
    metrics_path: /scrape # use the /scrape endpoint in cluster mode
    static_configs:
      - targets: # list every node of the cluster
          - redis://master1:6379
          - redis://master2:6379
          - redis://master3:6379
          - redis://slave1:6379
          - redis://slave2:6379
          - redis://slave3:6379
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target # pass the node address as the target parameter
      - source_labels: [__param_target]
        target_label: instance # use the node address as the instance label
      - target_label: __address__
        replacement: exporter-ip:9123 # replace with the exporter address and port
    scrape_timeout: 15s # longer than the 10s default for cluster scrapes; must not exceed scrape_interval
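For each job, replace the Redis node addresses and the exporter address and port in the configuration file. Pay particular attention that the exporter address and port match the configuration of the corresponding redis_exporter process.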
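Since the three jobs share the same shape, one way to avoid a mismatched exporter port is to render every job from the same template. The following is a sketch under that assumption, not the author's tooling: `render_job` is a hypothetical helper that takes a job name, the exporter `host:port`, and the Redis node URIs, and prints one scrape job for the `scrape_configs` section.

```shell
#!/usr/bin/env bash
# render_job JOB EXPORTER_ADDR TARGET...: print one scrape job stanza.
render_job() {
  local job="$1" exporter="$2" target
  shift 2
  cat <<EOF
  - job_name: '${job}'
    metrics_path: /scrape
    static_configs:
      - targets:
EOF
  # One list entry per Redis node of this cluster.
  for target in "$@"; do
    printf '          - %s\n' "$target"
  done
  cat <<EOF
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: ${exporter}
EOF
}

# Placeholder host names, as in the template configuration.
render_job redis_cluster-11 exporter-ip:9121 \
  redis://master1:6379 redis://master2:6379 redis://master3:6379 \
  redis://slave1:6379 redis://slave2:6379 redis://slave3:6379
```

After the rendered jobs are pasted under `scrape_configs:`, running `promtool check config` on the full file is a quick way to catch syntax errors before Prometheus is reloaded.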
The following is the content of the Prometheus configuration file in the production environment.
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  evaluation_interval: 15s
runtime:
  gogc: 75
alerting:
  alertmanagers:
  - follow_redirects: true
    enable_http2: true
    http_headers: null
    scheme: http
    timeout: 10s
    api_version: v2
    static_configs:
    - targets:
      - 10.20.12.75:9093
rule_files:
- /usr/local/prometheus/rules.d/*.yml
scrape_configs:
- job_name: prometheus
  honor_timestamps: true
  track_timestamps_staleness: false
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /metrics
  scheme: http
  enable_compression: true
  follow_redirects: true
  enable_http2: true
  http_headers: null
  static_configs:
  - targets:
    - 10.20.12.75:9090
- job_name: nodes
  honor_timestamps: true
  track_timestamps_staleness: false
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /metrics
  scheme: http
  enable_compression: true
  follow_redirects: true
  enable_http2: true
  http_headers: null
  file_sd_configs:
  - files:
    - /usr/local/prometheus/conf.d/node*.yml
    refresh_interval: 5m
- job_name: portstatus
  honor_timestamps: true
  track_timestamps_staleness: false
  params:
    module:
    - tcp_connect
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /probe
  scheme: http
  enable_compression: true
  follow_redirects: true
  enable_http2: true
  http_headers: null
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: 10.20.12.75:9115
    action: replace
  file_sd_configs:
  - files:
    - /usr/local/prometheus/conf.d/portstatus.yml
    refresh_interval: 5m
- job_name: redis11
  honor_timestamps: true
  track_timestamps_staleness: false
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /scrape
  scheme: http
  enable_compression: true
  follow_redirects: true
  enable_http2: true
  http_headers: null
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: 10.20.10.75:9121
    action: replace
  static_configs:
  - targets:
    - redis://10.10.10.11:5000
    - redis://10.10.10.12:5000
    - redis://10.10.10.13:5000
    - redis://10.10.10.14:5000
    - redis://10.10.10.15:5000
    - redis://10.10.10.16:5000
    - redis://10.10.10.17:5000
    - redis://10.10.10.18:5000
    - redis://10.10.10.19:5000
    - redis://10.10.10.20:5000
- job_name: redis21
  honor_timestamps: true
  track_timestamps_staleness: false
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /scrape
  scheme: http
  enable_compression: true
  follow_redirects: true
  enable_http2: true
  http_headers: null
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: 10.20.10.75:9122
    action: replace
  static_configs:
  - targets:
    - redis://10.10.10.21:5000
    - redis://10.10.10.22:5000
    - redis://10.10.10.23:5000
    - redis://10.10.10.24:5000
    - redis://10.10.10.25:5000
    - redis://10.10.10.26:5000
- job_name: redis44
  honor_timestamps: true
  track_timestamps_staleness: false
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /scrape
  scheme: http
  enable_compression: true
  follow_redirects: true
  enable_http2: true
  http_headers: null
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: 10.20.10.75:9123
    action: replace
  static_configs:
  - targets:
    - redis://10.20.10.44:6379
    - redis://10.20.10.45:6379
    - redis://10.20.10.46:6379
    - redis://10.20.10.47:6379
    - redis://10.20.10.48:6379
    - redis://10.20.10.49:6379
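This configuration file covers host monitoring, port monitoring, and Redis monitoring, as well as the location of the alerting rule files.
4. Prometheus alerting rules for Redis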
cat redis_8.0_alerts.yml
groups:
- name: redis_8.0_alerts
  rules:
  # ====================== 1. Cluster health alerts ======================
  - alert: RedisClusterSlotUncovered
    expr: redis_cluster_slots_assigned < 16384
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis cluster has unassigned slots"
      description: "Instance {{ $labels.instance }} reports only {{ $value }} of 16384 slots assigned; the cluster is unavailable"
  - alert: RedisClusterNodeRoleMismatch
    expr: redis_cluster_node_master{role="slave"} == 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis node role mismatch"
      description: "Instance {{ $labels.instance }} is actually a replica, but the cluster marks it as a master"
  - alert: RedisClusterReplicaLagHigh
    expr: redis_connected_slave_lag_seconds > 5
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Redis replica replication lag too high"
      description: "Replica {{ $labels.slave_ip }}:{{ $labels.slave_port }} of master {{ $labels.instance }} lags by {{ $value }} seconds"
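  # ====================== 2. Performance and resource alerts ======================
  - alert: RedisMemoryUsageHigh
    # The second condition skips instances without a maxmemory limit (max_bytes = 0).
    expr: (redis_memory_used_bytes / redis_memory_max_bytes) > 0.85 and redis_memory_max_bytes > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis memory usage too high"
      description: "Instance {{ $labels.instance }} memory usage is {{ $value | humanizePercentage }}"
  - alert: RedisMemoryFragmentationExcessive
    expr: redis_mem_fragmentation_ratio > 1.6
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Redis memory fragmentation ratio too high"
      description: "Instance {{ $labels.instance }} memory fragmentation ratio is {{ $value }}; consider running MEMORY PURGE"
  - alert: RedisCommandLatencyHigh
    # Average latency per call, derived from the exporter's per-command counters.
    expr: rate(redis_commands_duration_seconds_total{cmd=~"get|set|hgetall"}[5m]) / rate(redis_commands_total{cmd=~"get|set|hgetall"}[5m]) > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis command latency too high"
      description: "Average latency of the {{ $labels.cmd }} command is {{ $value }} seconds"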
  # ====================== 3. Redis 8.0 feature alerts ======================
  - alert: RedisACLPermissionDenied
    expr: increase(redis_acl_denied_commands_total[5m]) > 10
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Too many Redis ACL permission denials"
      description: "Instance {{ $labels.instance }} denied {{ $value }} commands through ACL in the last 5 minutes"
  - alert: RedisMemoryPurgeFailed
    expr: increase(redis_memory_purge_failures_total[1h]) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis memory defragmentation failed"
      description: "Memory defragmentation failed on instance {{ $labels.instance }}"
  - alert: RedisConfigRewriteFailed
    expr: increase(redis_config_rewrite_failures_total[1h]) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis config rewrite failed"
      description: "Persisting the configuration failed on instance {{ $labels.instance }}"
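  # ====================== 4. Availability alerts ======================
  - alert: RedisInstanceDown
    expr: redis_up == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Redis instance down"
      description: "Instance {{ $labels.instance }} has been offline for more than 3 minutes"
  - alert: RedisExporterUnreachable
    # Matches the Redis job names used in the Prometheus configuration (redis11, redis21, redis44, redis_cluster-*).
    expr: up{job=~"redis.*"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Redis exporter unreachable"
      description: "The exporter scrape for {{ $labels.instance }} is failing; metrics cannot be collected"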
  # ====================== 5. Persistence alerts ======================
  - alert: RedisAofFsyncFailed
    # aof_last_write_status is exported as 1 (ok) or 0 (error).
    expr: redis_aof_last_write_status == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis AOF write failed"
      description: "The last AOF write on instance {{ $labels.instance }} failed; data may be lost, check disk space"
  - alert: RedisAofLoadError
    expr: redis_aof_loading_error == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis AOF file failed to load"
      description: "The AOF file on instance {{ $labels.instance }} is corrupted or malformed; data cannot be recovered after a restart"
  - alert: RedisRdbSaveFailed
    # rdb_last_bgsave_status is exported as 1 (ok) or 0 (error).
    expr: redis_rdb_last_bgsave_status == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis RDB persistence failed"
      description: "The last RDB save on instance {{ $labels.instance }} failed; check disk permissions and free space"
  - alert: RedisAofRewriteInProgressBlock
    expr: redis_aof_rewrite_in_progress == 1 and rate(redis_cpu_sys_seconds_total[5m]) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis AOF rewrite is blocking traffic"
      description: "An AOF rewrite on instance {{ $labels.instance }} has run for 5 minutes with high CPU usage and is blocking core requests"
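  # ====================== 6. Concurrency and connection alerts ======================
  - alert: RedisMaxClientsReached
    expr: redis_connected_clients >= redis_config_maxclients
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Redis maxclients limit reached"
      description: "Instance {{ $labels.instance }} has {{ $value }} connections and has reached its maxclients limit; new connections are rejected"
  - alert: RedisClientsAbnormalIncrease
    # connected_clients is a gauge, so delta() is used instead of increase().
    expr: delta(redis_connected_clients[5m]) / (redis_connected_clients offset 5m) > 2
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Redis connection count surge"
      description: "Connections on instance {{ $labels.instance }} grew by more than 2x within 5 minutes; check for client connection leaks or a traffic burst"
  - alert: RedisSlowlogAbnormalIncrease
    # slowlog_length is capped by slowlog-max-len; slowlog_last_id keeps growing with every new entry.
    expr: delta(redis_slowlog_last_id[5m]) > 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis slow query surge"
      description: "Instance {{ $labels.instance }} logged {{ $value }} new slow queries in the last 5 minutes; check for big keys or expensive commands"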
  # ====================== 7. Security alerts ======================
  - alert: RedisACLUserLocked
    expr: redis_acl_users_locked > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis ACL user locked"
      description: "Instance {{ $labels.instance }} has {{ $value }} locked ACL users; check whether the password retry limit was exceeded"
  - alert: RedisAnonymousUserAccess
    expr: redis_acl_anonymous_users > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis allows anonymous access"
      description: "Instance {{ $labels.instance }} has an anonymous user enabled and can be accessed without authentication; disable it immediately with ACL SETUSER default off"
  - alert: RedisSensitiveCommandExecuted
    # The exporter labels per-command counters with a lowercase cmd label; config.* also covers subcommands.
    expr: increase(redis_commands_total{cmd=~"config.*|flushdb|flushall|del"}[10m]) > 50
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis sensitive command executed at high frequency"
      description: "Instance {{ $labels.instance }} executed the {{ $labels.cmd }} command {{ $value }} times in the last 10 minutes; watch for operator error or intrusion"
  # ====================== 8. Special memory scenario alerts ======================
  - alert: RedisMemoryEvictionTriggered
    expr: increase(redis_evicted_keys_total[5m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Redis key eviction triggered"
      description: "Instance {{ $labels.instance }} evicted {{ $value }} keys in the last 5 minutes; add memory or tune the expiration policy"
  - alert: RedisTransparentHugePageEnabled
    expr: redis_transparent_hugepage == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Transparent huge pages enabled for Redis"
      description: "THP is enabled on instance {{ $labels.instance }}, which causes latency spikes; run echo never > /sys/kernel/mm/transparent_hugepage/enabled immediately"
  - alert: RedisMemoryAllocFailed
    expr: increase(redis_allocator_failures_total[1m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis memory allocation failed"
      description: "Instance {{ $labels.instance }} failed to allocate memory {{ $value }} times and can no longer allocate new memory; free memory or add capacity immediately"
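  # ====================== 9. Node and cluster alerts ======================
  - alert: RedisClusterNodeDisconnected
    expr: redis_cluster_connected_nodes < 6 # sized for a 3-master/3-replica cluster (6 nodes); use 10 for the 5-master/5-replica cluster
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Redis cluster node disconnected"
      description: "The cluster currently has only {{ $value }} connected nodes (6 expected); check node networking and cluster state"
  - alert: RedisClusterSlotMigrationFailed
    expr: redis_cluster_slots_migrating > 0 and redis_cluster_slots_migrating == redis_cluster_slots_migrating offset 5m
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis cluster slot migration stuck"
      description: "Slot migration on instance {{ $labels.instance }} has not completed after 5 minutes; check the network between nodes"
  - alert: RedisClusterReplicaInsufficient
    # Every master needs at least one connected replica; the role filter excludes replica nodes.
    expr: redis_connected_slaves < 1 and on (instance) redis_instance_info{role="master"}
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Redis cluster master has no replica"
      description: "Master {{ $labels.instance }} has no available replica; automatic failover is impossible if it goes down"
  - alert: RedisClusterBusError
    expr: increase(redis_cluster_bus_errors_total[1m]) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis cluster bus error"
      description: "Instance {{ $labels.instance }} has cluster bus communication errors (port 16379 by default), which affect replication and the cluster topology"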
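  # ====================== 10. System and self-health alerts ======================
  - alert: RedisRestartDetected
    expr: redis_uptime_in_seconds < 300 # restarted within the last 5 minutes
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis instance restarted unexpectedly"
      description: "Instance {{ $labels.instance }} restarted within the last 5 minutes; check for OOM or a crash"
  - alert: RedisClusterVersionMismatch
    # Counts the distinct redis_version values per job (one job per cluster).
    expr: count by (job) (count by (job, redis_version) (redis_instance_info)) > 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis cluster version mismatch"
      description: "The cluster scraped by job {{ $labels.job }} runs {{ $value }} different Redis versions, which can cause slot and replication problems; upgrade all nodes to the same 8.0.x release"
  - alert: RedisServerDiskFull
    expr: redis_disk_used_percent > 90
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Redis server disk full"
      description: "Disk usage on the server hosting instance {{ $labels.instance }} is {{ $value }}%; free up space immediately"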
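Metric names differ between redis_exporter versions, and a rule that references a metric the exporter does not export simply never fires. One way to catch this is to compare the names used in the rule file against real exporter output. The following is a rough sketch of such a check; `missing_metrics` is a hypothetical helper, and it only looks at names beginning with `redis_` on `expr:` lines, so a label name with that prefix (such as `redis_version` inside a `by (...)` clause) can show up as a false positive.

```shell
#!/usr/bin/env bash
# missing_metrics RULES_FILE METRICS_FILE
# Print every redis_* name used on an "expr:" line of RULES_FILE that does
# not appear as a metric name in METRICS_FILE (saved exporter output).
missing_metrics() {
  local rules="$1" metrics="$2" name
  grep -E '^[[:space:]]*expr:' "$rules" \
    | grep -oE 'redis_[A-Za-z0-9_]+' \
    | sort -u \
    | while read -r name; do
        # A metric line starts with the name, followed by "{" or a space.
        grep -qE "^${name}(\{| )" "$metrics" || printf '%s\n' "$name"
      done
}

# Example (run on the exporter host):
#   curl -s 'http://localhost:9121/scrape?target=redis://10.10.10.1:6379' > /tmp/redis_metrics.txt
#   missing_metrics redis_8.0_alerts.yml /tmp/redis_metrics.txt
```

Any name the helper prints is worth checking against the exporter's documentation before relying on that alert. `promtool check rules redis_8.0_alerts.yml` complements this by validating the YAML structure and PromQL syntax of the file.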
5. Grafana dashboards
Import a Redis dashboard template from the official Grafana dashboard site.
