Building a Redis Monitoring and Alerting System: From Zero to Enterprise-Grade Practice
- Chapter 1: Architecture Design
  - 1.1 Overall Architecture Diagram
  - 1.2 Component Responsibilities
- Chapter 2: Environment Preparation and Deployment
  - 2.1 Redis Exporter Deployment
    - 2.1.1 Docker Deployment
    - 2.1.2 Binary Deployment
    - 2.1.3 Kubernetes Deployment
  - 2.2 Prometheus Configuration
    - 2.2.1 Main Configuration File
    - 2.2.2 Recording Rules
  - 2.3 Alertmanager Configuration
    - 2.3.1 Alert Routing
    - 2.3.2 DingTalk Alerting
- Chapter 3: Redis Monitoring Metrics
  - 3.1 Key Performance Metric Categories
    - 3.1.1 Memory Metrics
    - 3.1.2 Connection Metrics
    - 3.1.3 Performance Metrics
    - 3.1.4 Persistence Metrics
  - 3.2 Custom Business Metrics
    - 3.2.1 Big Key Monitoring
    - 3.2.2 Hot Key Monitoring
- Chapter 4: Alerting Rules
  - 4.1 Memory Alerts
  - 4.2 Connection Alerts
  - 4.3 Performance Alerts
  - 4.4 Cluster State Alerts
- Chapter 5: Grafana Dashboard Development
  - 5.1 Overall Monitoring Dashboard
    - 5.1.1 Global Overview Panel
    - 5.1.2 Performance Panel
  - 5.2 Advanced Monitoring Features
    - 5.2.1 Predictive Monitoring
    - 5.2.2 Multi-Cluster Monitoring
- Chapter 6: Advanced Features and Optimization
  - 6.1 Downsampling Monitoring Data
  - 6.2 Dynamic Label Management
  - 6.3 Performance Tuning
- Chapter 7: Case Studies and Troubleshooting
  - 7.1 Common Failure Scenarios
    - 7.1.1 Memory Leak Investigation
    - 7.1.2 Performance Bottleneck Analysis
  - 7.2 Verifying the Monitoring System
    - 7.2.1 End-to-End Testing
- Chapter 8: Summary and Best Practices
  - 8.1 Monitoring Checklist
  - 8.2 Continuous Improvement
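Chapter 1: Architecture Design
1.1 Overall Architecture Diagram
We start with an architecture diagram that shows the components of the monitoring system and how data flows between them.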
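[Figure: architecture diagram. Data collection layer: Redis Exporter instances collect metrics from the Redis master, replicas, and cluster nodes. Storage and compute layer: Prometheus (time-series database, alerting rules, recording rules). Visualization and alerting layer: Grafana (monitoring dashboards, performance analysis) and Alertmanager, which sends notifications by email, DingTalk, and SMS.]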
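1.2 Component Responsibilities
| Component | Responsibility | Key technical points |
| --- | --- | --- |
| Redis Exporter | Collects Redis metrics and exposes an HTTP endpoint for Prometheus to scrape | Written in Go; supports cluster mode |
| Prometheus | Scrapes metrics on a schedule, stores time-series data, evaluates alerting rules | Time-series database; PromQL query language |
| Grafana | Visualizes data and hosts the monitoring dashboards | Data source plugins; panel editor |
| Alertmanager | Deduplicates, groups, routes, and silences alerts | Grouping strategy; inhibition rules |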
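Chapter 2: Environment Preparation and Deployment
2.1 Redis Exporter Deployment
2.1.1 Docker Deployment
# docker-compose.yml
version: '3.8'
services:
  redis-exporter:
    # The default image contains only the exporter binary (no shell, no wget),
    # so the alpine variant is used here to make the wget healthcheck work.
    image: oliver006/redis_exporter:alpine
    container_name: redis-exporter
    ports:
      - "9121:9121"
    environment:
      # Inside the container, localhost is the exporter itself. Point this at
      # the Redis host or service name ("redis" is a placeholder).
      - REDIS_ADDR=redis://redis:6379
      - REDIS_PASSWORD=your_redis_password
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9121/metrics"]
      interval: 30s
      timeout: 10s
      retries: 3
# Start the service
docker-compose up -d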
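2.1.2 Binary Deployment
# Download a pinned release (v1.45.0 is used throughout this guide)
wget https://github.com/oliver006/redis_exporter/releases/download/v1.45.0/redis_exporter-v1.45.0.linux-amd64.tar.gz
tar -xzf redis_exporter-v1.45.0.linux-amd64.tar.gz
cd redis_exporter-v1.45.0.linux-amd64
# Start the exporter for one instance. Since v1.0 the exporter no longer accepts
# a comma-separated -redis.addr list; further instances (e.g. redis2:6379) are
# scraped through its /scrape?target=... endpoint instead.
# The password is read from a file into REDIS_PASSWORD so that it stays off the command line.
REDIS_PASSWORD="$(cat /etc/redis/password.txt)" ./redis_exporter \
  -redis.addr redis://redis1:6379 \
  -web.listen-address :9121 \
  -web.telemetry-path /metrics \
  -log-format json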
2.1.3 Kubernetes Deployment
# redis-exporter-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-exporter
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: redis-exporter
  template:
    metadata:
      labels:
        app: redis-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9121"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: redis-exporter
          image: oliver006/redis_exporter:v1.45.0
          ports:
            - containerPort: 9121
          env:
            - name: REDIS_ADDR
              value: "redis-service:6379"
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-secret
                  key: password
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "100m"
          livenessProbe:
            httpGet:
              path: /metrics
              port: 9121
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: redis-exporter-service
  namespace: monitoring
  labels:
    app: redis-exporter
spec:
  selector:
    app: redis-exporter
  ports:
    - name: metrics
      port: 9121
      targetPort: 9121
2.2 Prometheus Configuration
2.2.1 Main Configuration File
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: production
    cluster: redis-cluster-01
# Alertmanager endpoint; without it, firing alerts are never delivered
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
# Rule files. The recording rules must be listed as well, otherwise the
# redis:* series used by the alerting rules are never produced.
rule_files:
  - "recording_rules/redis_rules.yml"
  - "alerts/redis_alerts.yml"
  - "alerts/system_alerts.yml"
# Scrape configuration
scrape_configs:
  # Redis Exporter
  - job_name: 'redis'
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'redis-exporter-service:9121'  # exporter reached through its Kubernetes Service DNS name
          - '192.168.1.100:9121'           # exporter on a static IP
    # No relabeling is needed here: each target is an exporter bound to one
    # Redis instance. Target-rewriting relabel rules apply only to the
    # exporter's multi-target /scrape mode.
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Node Exporter (host metrics)
  - job_name: 'node'
    scrape_interval: 30s
    static_configs:
      - targets: ['node-exporter:9100']
2.2.2 Recording Rules
# recording_rules/redis_rules.yml
groups:
  - name: redis_recording_rules
    interval: 30s
    rules:
      # redis_memory_max_bytes is 0 when maxmemory is not set; the "> 0" filter
      # drops those instances instead of producing +Inf.
      - record: redis:memory_usage_percent
        expr: redis_memory_used_bytes / (redis_memory_max_bytes > 0) * 100
      # The exporter publishes the maxclients setting as redis_config_maxclients.
      - record: redis:connected_clients_percent
        expr: redis_connected_clients / redis_config_maxclients * 100
      - record: redis:instantaneous_ops_per_second
        expr: rate(redis_commands_processed_total[2m])
      - record: redis:keyspace_hits_rate
        expr: rate(redis_keyspace_hits_total[5m])
      - record: redis:keyspace_misses_rate
        expr: rate(redis_keyspace_misses_total[5m])
      - record: redis:hit_ratio
        expr: redis:keyspace_hits_rate / (redis:keyspace_hits_rate + redis:keyspace_misses_rate) * 100
      - record: redis:network_input_rate
        expr: rate(redis_net_input_bytes_total[2m])
      - record: redis:network_output_rate
        expr: rate(redis_net_output_bytes_total[2m])
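The hit-ratio rule divides two per-second counter rates. As a rough illustration of that arithmetic, the Python sketch below computes the same ratio from two counter samples. It is a simplification, not how Prometheus implements `rate()` (which also extrapolates to the window boundaries), and the sample values and function names are hypothetical.

```python
def per_second_rate(first: float, last: float, seconds: float) -> float:
    """Simplified counter rate: increase divided by elapsed time.

    A drop in the counter is treated as a reset to zero, so the whole
    `last` value counts as the increase.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    increase = last - first if last >= first else last
    return increase / seconds


def hit_ratio_percent(hits_rate: float, misses_rate: float) -> float:
    """Cache hit ratio in percent; NaN when there was no lookup traffic."""
    total = hits_rate + misses_rate
    if total == 0:
        return float("nan")  # mirrors PromQL, where 0/0 yields NaN
    return hits_rate / total * 100


# Hypothetical samples of the hit and miss counters, taken 300 s apart.
hits = per_second_rate(first=10_000, last=19_000, seconds=300)
misses = per_second_rate(first=2_000, last=3_000, seconds=300)
ratio = hit_ratio_percent(hits, misses)  # roughly 90 percent
```

The NaN case matters in practice: an idle instance has no hits and no misses, so a rule such as `redis:hit_ratio < 80` simply does not match it rather than firing.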
2.3 Alertmanager Configuration
2.3.1 Alert Routing
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: 'monitoring@company.com'
  smtp_auth_username: 'monitoring@company.com'
  smtp_auth_password: 'your-smtp-password'
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_interval: 1m
      repeat_interval: 5m
      # Without this flag, critical Redis alerts stop at this route and never
      # reach the Redis routes below.
      continue: true
    - match:
        service: redis
      receiver: 'redis-team'
      routes:
        - match:
            severity: warning
          receiver: 'redis-warning'
        - match:
            severity: critical
          receiver: 'redis-critical'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'instance']
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'devops-team@company.com'
        headers:
          subject: '[Monitoring] Alert: {{ .GroupLabels.alertname }}'
  - name: 'critical-alerts'
    email_configs:
      - to: 'sre-team@company.com'
    webhook_configs:
      - url: 'http://alert-hook:8080/alerts'
        send_resolved: true
  - name: 'redis-team'
    email_configs:
      - to: 'redis-dba@company.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-redis'
        title: 'Redis Alert: {{ .CommonLabels.alertname }}'
        # Double quotes so that YAML turns \n into a real line break
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
  # Every receiver named in a route must be defined, or Alertmanager rejects
  # the configuration. The two below reuse the DBA mailbox as a placeholder.
  - name: 'redis-warning'
    email_configs:
      - to: 'redis-dba@company.com'
  - name: 'redis-critical'
    email_configs:
      - to: 'redis-dba@company.com'
        send_resolved: true
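Alertmanager walks the routing tree top-down and, among sibling routes, stops at the first match unless that route sets `continue: true`. The Python sketch below models only this behaviour for the tree in this section; it is a simplified illustration (exact label matching only, no grouping or timing), and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict[str, str] = field(default_factory=dict)
    continue_matching: bool = False  # models Alertmanager's `continue` flag
    routes: list["Route"] = field(default_factory=list)


def matches(route: Route, labels: dict[str, str]) -> bool:
    """A route matches when every label in `match` has the required value."""
    return all(labels.get(name) == value for name, value in route.match.items())


def resolve(route: Route, labels: dict[str, str]) -> list[str]:
    """Receivers for an alert that already matched `route`, depth-first."""
    receivers: list[str] = []
    for child in route.routes:
        if matches(child, labels):
            receivers.extend(resolve(child, labels))
            if not child.continue_matching:
                break  # first match wins among siblings
    # If no child matched, the alert is handled by this node's receiver.
    return receivers or [route.receiver]


def build_tree(critical_continues: bool) -> Route:
    """The routing tree from this section, with `continue` as a parameter."""
    return Route("default-receiver", routes=[
        Route("critical-alerts", {"severity": "critical"}, critical_continues),
        Route("redis-team", {"service": "redis"}, routes=[
            Route("redis-warning", {"severity": "warning"}),
            Route("redis-critical", {"severity": "critical"}),
        ]),
    ])


alert = {"service": "redis", "severity": "critical"}
without_continue = resolve(build_tree(False), alert)
with_continue = resolve(build_tree(True), alert)
```

With `critical_continues=False` a critical Redis alert reaches only `critical-alerts`; with `True` it also reaches `redis-critical`, which is usually what a dedicated Redis on-call rotation expects.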
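2.3.2 DingTalk Alerting
# DingTalk robot receiver (add it under `receivers:` and reference it from a route).
# Alertmanager's webhook payload is not in the DingTalk robot message format, so
# alerts are sent to an adapter such as prometheus-webhook-dingtalk, which holds
# the robot access token. The host name and target name below are placeholders.
- name: 'dingtalk-redis'
  webhook_configs:
    - url: 'http://dingtalk-webhook:8060/dingtalk/webhook1/send'
      send_resolved: true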
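DingTalk robots expect their own JSON message format, so something has to translate the Alertmanager webhook payload before it reaches the robot URL. The Python sketch below shows one way such a translation could look. It is an illustration under assumptions, not the code of any existing adapter; the function name and the sample payload are hypothetical.

```python
def to_dingtalk_markdown(payload: dict) -> dict:
    """Convert an Alertmanager webhook payload into a DingTalk markdown message."""
    status = payload.get("status", "unknown")
    alerts = payload.get("alerts", [])
    name = payload.get("commonLabels", {}).get("alertname", "alert")
    title = f"[{status.upper()}] {name} ({len(alerts)})"
    lines = [f"### {title}"]
    for alert in alerts:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # Prefer the description annotation, fall back to the summary.
        detail = annotations.get("description", annotations.get("summary", ""))
        lines.append(f"- **{labels.get('instance', 'n/a')}**: {detail}")
    return {
        "msgtype": "markdown",
        "markdown": {"title": title, "text": "\n".join(lines)},
    }


# Hypothetical payload in the shape Alertmanager sends to webhook receivers.
sample = {
    "status": "firing",
    "commonLabels": {"alertname": "RedisMemoryUsageCritical"},
    "alerts": [
        {
            "labels": {"instance": "redis-1:9121", "severity": "critical"},
            "annotations": {"description": "memory usage is 93%"},
        }
    ],
}
message = to_dingtalk_markdown(sample)
```

The resulting dictionary would be serialized to JSON and posted to the robot URL; signing and keyword restrictions configured on the robot are outside the scope of this sketch.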
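Chapter 3: Redis Monitoring Metrics
3.1 Key Performance Metric Categories
3.1.1 Memory Metrics
# Memory usage (%); only meaningful when maxmemory is set (redis_memory_max_bytes > 0)
redis_memory_used_bytes / (redis_memory_max_bytes > 0) * 100
# Memory fragmentation ratio (exported by redis_exporter as redis_mem_fragmentation_ratio)
redis_mem_fragmentation_ratio
# Memory usage forecast: projected value 24 hours ahead, based on the last 6 hours
predict_linear(redis_memory_used_bytes[6h], 86400)
# Eviction rate: keys evicted per second under the maxmemory policy
rate(redis_evicted_keys_total[5m])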
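`predict_linear()` fits a straight line to the samples in the range by least squares and extrapolates it. The Python sketch below shows that idea on hypothetical samples. It is a simplified model: Prometheus extrapolates from the evaluation time, whereas this sketch extrapolates from the last sample.

```python
def predict_linear(samples: list[tuple[float, float]], seconds_ahead: float) -> float:
    """Least-squares line through (timestamp, value) samples, evaluated
    `seconds_ahead` seconds after the last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must differ")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Hypothetical memory samples in MiB, one per hour, growing 10 MiB per hour.
history = [(0.0, 100.0), (3600.0, 110.0), (7200.0, 120.0)]
forecast = predict_linear(history, 86400)  # projected usage 24 h after the last sample
```

Because the fit is linear, a short burst inside the window can dominate the slope, which is why the forecast is usually evaluated over several hours of history.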
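3.1.2 Connection Metrics
# Connection usage (%); the maxclients setting is exported as redis_config_maxclients
redis_connected_clients / redis_config_maxclients * 100
# Rejected connections per second
rate(redis_rejected_connections_total[5m])
# Connection trend: redis_connected_clients is a gauge, so use deriv() rather than rate()
deriv(redis_connected_clients[10m])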
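3.1.3 Performance Metrics
# Operations per second
rate(redis_commands_processed_total[2m])
# Average latency per command, in seconds. redis_exporter publishes per-command
# counters (redis_commands_duration_seconds_total and redis_commands_total), not
# a redis_commands_duration_seconds_bucket histogram, so histogram_quantile()
# cannot be applied to them.
sum by (instance, cmd) (rate(redis_commands_duration_seconds_total[5m]))
  /
sum by (instance, cmd) (rate(redis_commands_total[5m]))
# Cache hit ratio (%)
rate(redis_keyspace_hits_total[5m]) /
  (rate(redis_keyspace_hits_total[5m]) +
   rate(redis_keyspace_misses_total[5m])) * 100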
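3.1.4 Persistence Metrics
# RDB persistence: time of the last successful save and changes since then
redis_rdb_last_save_timestamp_seconds
redis_rdb_changes_since_last_save
# AOF persistence: whether AOF is enabled and how long the last rewrite took
redis_aof_enabled
redis_aof_last_rewrite_duration_sec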
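3.2 Custom Business Metrics
3.2.1 Big Key Monitoring
# Scan for big keys with redis-cli (-i 0.1 sleeps 0.1 s per 100 SCAN calls to limit load)
redis-cli --bigkeys -i 0.1
-- Lua script: report keys matching the pattern in ARGV[1] whose memory usage
-- exceeds ARGV[2] bytes. KEYS blocks the server while it walks the whole
-- keyspace, so run this against a replica or a small keyspace only.
local keys = redis.call('KEYS', ARGV[1])
local result = {}
for _, key in ipairs(keys) do
  local size = redis.call('MEMORY', 'USAGE', key)
  -- MEMORY USAGE returns nil (false in Lua) if the key expired in the meantime
  if size and size > tonumber(ARGV[2]) then
    table.insert(result, {key, size})
  end
end
return result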
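3.2.2 Hot Key Monitoring
# Per-command call rates show read and write load, but not which keys are hot
rate(redis_commands_total{cmd="get"}[2m])
rate(redis_commands_total{cmd="set"}[2m])
# Hot keys themselves can be sampled with `redis-cli --hotkeys` (requires an LFU maxmemory-policy)
# or exposed as custom metrics by instrumenting hot key access in the application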
Chapter 4: Alerting Rules
4.1 Memory Alerts
# alerts/redis_alerts.yml
groups:
  - name: redis_memory_alerts
    rules:
      - alert: RedisMemoryUsageCritical
        expr: redis:memory_usage_percent > 90
        for: 5m
        labels:
          severity: critical
          service: redis
        annotations:
          summary: "Redis memory usage above 90%"
          description: "Instance {{ $labels.instance }} memory usage is currently {{ $value }}%, act immediately"
          runbook: "https://wiki/redis-memory-optimization"
      - alert: RedisMemoryFragmentationHigh
        expr: redis_mem_fragmentation_ratio > 1.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory fragmentation ratio is high"
          description: "Instance {{ $labels.instance }} has a fragmentation ratio of {{ $value }}; consider defragmenting"
      - alert: RedisMemoryOOMWarning
        # Projected usage one hour ahead, as a percentage of maxmemory
        expr: predict_linear(redis_memory_used_bytes[1h], 3600) / (redis_memory_max_bytes > 0) * 100 > 100
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis is predicted to run out of memory"
          description: "Instance {{ $labels.instance }} is predicted to exhaust its memory within 1 hour (projected usage {{ $value }}%)"
4.2 Connection Alerts
  - name: redis_connection_alerts
    rules:
      - alert: RedisConnectionsHigh
        expr: redis:connected_clients_percent > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis connection usage is high"
          description: "Instance {{ $labels.instance }} is using {{ $value }}% of its maximum client connections"
      - alert: RedisConnectionsRejected
        # increase() counts rejections over the window, matching the description below
        expr: increase(redis_rejected_connections_total[5m]) > 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis is rejecting connections"
          description: "Instance {{ $labels.instance }} rejected {{ $value }} connections in the last 5 minutes"
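4.3 Performance Alerts
  - name: redis_performance_alerts
    rules:
      - alert: RedisHighLatency
        # Average command latency. redis_exporter publishes duration and call
        # counters rather than a latency histogram, so a P95 cannot be derived
        # with histogram_quantile(). The 0.1 s threshold may need lowering for an average.
        expr: sum by (instance) (rate(redis_commands_duration_seconds_total[2m])) / sum by (instance) (rate(redis_commands_total[2m])) > 0.1
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Redis command latency is high"
          description: "Instance {{ $labels.instance }} average command latency is {{ $value }}s"
      - alert: RedisLowHitRatio
        expr: redis:hit_ratio < 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis cache hit ratio is low"
          description: "Instance {{ $labels.instance }} hit ratio is only {{ $value }}%"
      - alert: RedisHighCPUUsage
        # CPU seconds consumed per second, scaled to percent of one core
        expr: (rate(redis_cpu_sys_seconds_total[2m]) + rate(redis_cpu_user_seconds_total[2m])) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis CPU usage is high"
          description: "Instance {{ $labels.instance }} CPU usage is {{ $value }}%"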
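4.4 Cluster State Alerts
  - name: redis_cluster_alerts
    rules:
      - alert: RedisClusterDown
        expr: redis_cluster_state != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis cluster state is abnormal"
          # `cluster` is an external label, so it is read from $externalLabels rather than $labels
          description: "Cluster {{ $externalLabels.cluster }} is in an abnormal state, current state code: {{ $value }}"
      - alert: RedisMasterLinkDown
        # The exporter publishes master_link_status as redis_master_link_up (1 = up)
        expr: redis_master_link_up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Redis replication link is down"
          description: "Replica {{ $labels.instance }} has lost its replication link to the master"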
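Every rule in this chapter uses a `for:` clause: the expression must stay true at each evaluation for that long before the alert moves from pending to firing, and a single false evaluation resets the timer. The Python sketch below models that behaviour on hypothetical evaluation results. It is a simplified illustration, not Prometheus's rule manager.

```python
from typing import Optional


def firing_since(evaluations: list[tuple[float, bool]], for_seconds: float) -> Optional[float]:
    """Return the timestamp at which the alert starts firing, or None.

    `evaluations` is a time-ordered list of (timestamp, condition_is_true).
    """
    pending_since: Optional[float] = None
    for timestamp, active in evaluations:
        if not active:
            pending_since = None  # any false evaluation resets the pending state
            continue
        if pending_since is None:
            pending_since = timestamp
        if timestamp - pending_since >= for_seconds:
            return timestamp
    return None


# Hypothetical evaluations once a minute with `for: 5m` (300 s).
# The dip at t=120 resets the timer, so firing starts at t=480, not t=300.
results = [(0, True), (60, True), (120, False), (180, True), (240, True),
           (300, True), (360, True), (420, True), (480, True)]
fired_at = firing_since(results, for_seconds=300)
```

This is why a flapping metric with a long `for:` may never fire; shortening `for:` or smoothing the expression with an `avg_over_time` window are the usual adjustments.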
Chapter 5: Grafana Dashboard Development
5.1 Overall Monitoring Dashboard
5.1.1 Global Overview Panel
{
  "dashboard": {
    "title": "Redis Cluster Monitoring Overview",
    "tags": ["redis", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Memory Usage",
        "type": "stat",
        "targets": [{
          "expr": "redis:memory_usage_percent",
          "legendFormat": "{{instance}}"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"value": null, "color": "green"},
                {"value": 80, "color": "yellow"},
                {"value": 90, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "Connection Trend",
        "type": "timeseries",
        "targets": [{
          "expr": "redis_connected_clients",
          "legendFormat": "{{instance}}"
        }]
      }
    ]
  }
}
5.1.2 Performance Panel
{
  "title": "Redis Performance",
  "type": "timeseries",
  "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
  "targets": [{
    "expr": "rate(redis_commands_processed_total[2m])",
    "legendFormat": "{{instance}} OPS"
  }],
  "fieldConfig": {
    "defaults": {
      "color": {"mode": "palette-classic"},
      "thresholds": {
        "steps": [
          {"value": null, "color": "green"},
          {"value": 1000, "color": "yellow"},
          {"value": 5000, "color": "red"}
        ]
      }
    }
  }
}
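5.2 Advanced Monitoring Features
5.2.1 Predictive Monitoring
# Memory growth forecast (PromQL): projected value 24 hours ahead
predict_linear(redis_memory_used_bytes[6h], 86400)
# Capacity planning alert (alerting rule): projected usage in 7 days, as a percentage of maxmemory
- alert: RedisCapacityPlanning
  expr: predict_linear(redis_memory_used_bytes[24h], 604800) / (redis_memory_max_bytes > 0) * 100 > 80
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Redis capacity planning warning"
    description: "Instance {{ $labels.instance }} is projected to reach {{ $value }}% memory usage in 7 days"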
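5.2.2 Multi-Cluster Monitoring
# Label each Prometheus server with its cluster identity
global:
  external_labels:
    region: us-east-1
    environment: production
    cluster: redis-cluster-01
# Cluster-level aggregation rules. External labels are attached only when data
# leaves a Prometheus server (remote write, federation, alerts), so evaluate
# these rules in the central store that receives data from all clusters.
- record: cluster:memory_usage_avg
  expr: avg by (cluster) (redis:memory_usage_percent)
- record: cluster:qps_sum
  expr: sum by (cluster) (redis:instantaneous_ops_per_second)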
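Chapter 6: Advanced Features and Optimization
6.1 Downsampling Monitoring Data
# Long-term storage
remote_write:
  - url: http://victoriametrics:8428/api/v1/write
    write_relabel_configs:
      # [_:] also keeps recorded series such as redis:memory_usage_percent:1h,
      # which a plain redis_ prefix would drop
      - action: keep
        source_labels: [__name__]
        regex: redis[_:](memory|connected|commands).*
# Recording rule used for downsampling
groups:
  - name: redis_downsample
    interval: 1h
    rules:
      - record: redis:memory_usage_percent:1h
        expr: avg_over_time(redis:memory_usage_percent[1h])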
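The downsampling rule stores one hourly average in place of many raw samples. The Python sketch below shows the same reduction on hypothetical samples. It is a simplification: it groups samples into aligned hourly buckets, whereas the recording rule averages the trailing hour at each evaluation.

```python
from collections import defaultdict


def downsample_hourly(samples: list[tuple[float, float]]) -> dict[int, float]:
    """Average (timestamp, value) samples per aligned one-hour bucket.

    Returns a mapping from bucket start (seconds) to the mean value.
    """
    buckets: dict[int, list[float]] = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = int(timestamp // 3600) * 3600
        buckets[bucket_start].append(value)
    return {start: sum(values) / len(values) for start, values in sorted(buckets.items())}


# Hypothetical memory usage percentages sampled every 30 minutes.
raw = [(0, 10.0), (1800, 20.0), (3600, 30.0)]
hourly = downsample_hourly(raw)
```

Averaging hides short spikes, so a `max_over_time` companion series is often recorded alongside the average when peaks matter for capacity planning.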
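6.2 Dynamic Label Management
# Add business labels with relabel_configs
# (the __meta_kubernetes_* labels exist only for targets discovered via kubernetes_sd_configs)
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    target_label: application
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  # Static label: with no source_labels, the replacement is always applied
  - target_label: environment
    replacement: "production"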
6.3 Performance Tuning
# Prometheus performance tuning
global:
  scrape_interval: 30s
  scrape_timeout: 10s
# Limit samples and labels per scrape
scrape_configs:
  - job_name: 'redis'
    sample_limit: 5000
    label_limit: 50
    label_name_length_limit: 100
    label_value_length_limit: 100
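Chapter 7: Case Studies and Troubleshooting
7.1 Common Failure Scenarios
7.1.1 Memory Leak Investigation
# Memory growth trend in bytes per second; deriv() is used because redis_memory_used_bytes is a gauge
deriv(redis_memory_used_bytes[1h])
# Key count
redis_db_keys{db="db0"}
# Big key identification: redis_key_size is exported only for keys listed via
# the exporter's --check-keys or --check-single-keys option
redis_key_size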
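7.1.2 Performance Bottleneck Analysis
# Slow query analysis: change in slow log length over 5 minutes (a gauge, so delta() rather than rate())
delta(redis_slowlog_length[5m])
# Network bandwidth
rate(redis_net_input_bytes_total[2m])
rate(redis_net_output_bytes_total[2m])
# Average time per call for each command, in seconds (the exporter publishes
# duration and call counters, not a latency histogram)
sum by (cmd) (rate(redis_commands_duration_seconds_total[5m]))
  /
sum by (cmd) (rate(redis_commands_total[5m]))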
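7.2 Verifying the Monitoring System
7.2.1 End-to-End Testing
# 1. Check that the exporter is up
curl -s http://redis-exporter:9121/metrics | grep redis_up
# 2. Check that Prometheus is scraping it (the URL is quoted so the shell does not interpret "?")
curl -s 'http://prometheus:9090/api/v1/query?query=redis_up'
# 3. Fire a test alert (API v2; the v1 alerts API was removed in Alertmanager 0.27)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "instance": "test-redis",
      "severity": "warning"
    },
    "annotations": {
      "summary": "Test alert",
      "description": "This is a test alert"
    }
  }
]'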
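When the end-to-end test runs from a script rather than a shell, the test alert can be built and posted with the standard library. The Python sketch below is one possible shape, assuming the Alertmanager v2 alerts endpoint and the `alertmanager:9093` address used in this section; the helper names are hypothetical. Setting `endsAt` makes the test alert resolve on its own after a few minutes.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(now: datetime, ttl_minutes: int = 5) -> list[dict]:
    """Build a one-element alert list that resolves itself after `ttl_minutes`."""
    return [{
        "labels": {
            "alertname": "TestAlert",
            "instance": "test-redis",
            "severity": "warning",
        },
        "annotations": {
            "summary": "Test alert",
            "description": "This is a test alert",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
    }]


def post_alerts(base_url: str, alerts: list[dict]) -> int:
    """POST alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


alerts = build_test_alert(datetime(2024, 1, 1, tzinfo=timezone.utc))
# To send it (requires a reachable Alertmanager):
# post_alerts("http://alertmanager:9093", alerts)
```

The network call is left commented out so the snippet can be run anywhere; in a real check the timestamp would be `datetime.now(timezone.utc)`.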
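Chapter 8: Summary and Best Practices
8.1 Monitoring Checklist
| Check | Standard | Verification method |
| --- | --- | --- |
| Data collection completeness | Every Redis instance is monitored | redis_up == 1 |
| Alerting rule effectiveness | Every key metric has a matching alert | Simulated trigger test |
| Notification channel delivery | Alerts reach the intended recipients | End-to-end test |
| Dashboard availability | Key metrics are visualized | Grafana panel review |
| Performance impact | The monitoring stack uses a reasonable amount of resources | Resource monitoring |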
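The first checklist item can be automated by reading the result of the `redis_up` query from the Prometheus HTTP API. The Python sketch below only interprets an instant-query response of the standard shape; fetching it is left out, and the function name and the sample response are hypothetical.

```python
def down_instances(response: dict) -> list[str]:
    """Return the instances whose redis_up value is not 1.

    `response` is the decoded JSON of an instant query for `redis_up`.
    """
    if response.get("status") != "success":
        raise ValueError("query failed: " + str(response.get("error", "unknown error")))
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) != 1.0:
            down.append(series["metric"].get("instance", "unknown"))
    return sorted(down)


# Hypothetical response with one healthy and one unreachable instance.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "redis_up", "instance": "redis-1:9121"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "redis_up", "instance": "redis-2:9121"},
             "value": [1700000000.0, "0"]},
        ],
    },
}
unreachable = down_instances(sample_response)
```

Note that an instance missing from the result entirely (for example, an exporter that was never added as a target) will not show up here, so the returned list should also be compared against an inventory of expected instances.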
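8.2 Continuous Improvement
- Review alerting rules regularly: adjust thresholds based on false positives and missed alerts.
- Tune the data retention policy: set retention periods according to storage cost.
- Plan capacity: forecast the monitoring system's own capacity needs from business growth.
- Maintain documentation: keep runbooks and incident-handling procedures up to date.
With the steps in this guide in place, you will have an enterprise-grade monitoring system that covers the main aspects of Redis, detects Redis-related problems early, and helps keep the business running reliably.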