Building a Redis Monitoring and Alerting System: From Zero to Enterprise-Grade Practice
Chapter 1: Architecture Design
1.1 Overall Architecture Diagram
1.2 Component Responsibilities
Chapter 2: Environment Setup and Deployment
2.1 Deploying Redis Exporter
2.1.1 Docker Deployment
2.1.2 Binary Deployment
2.1.3 Kubernetes Deployment
2.2 Prometheus Configuration
2.2.1 Main Configuration File
2.2.2 Recording Rules
2.3 Alertmanager Configuration
2.3.1 Alert Routing
2.3.2 DingTalk Alerts
Chapter 3: Redis Monitoring Metrics
3.1 Key Performance Metrics by Category
3.1.1 Memory Metrics
3.1.2 Connection Metrics
3.1.3 Performance Metrics
3.1.4 Persistence Metrics
3.2 Custom Business Metrics
3.2.1 Big Key Monitoring
3.2.2 Hot Key Monitoring
Chapter 4: Alert Rules
4.1 Memory Alert Rules
4.2 Connection Alert Rules
4.3 Performance Alert Rules
4.4 Cluster State Alerts
Chapter 5: Building Grafana Dashboards
5.1 Overview Dashboard
5.1.1 Global Overview Panels
5.1.2 Performance Panel
5.2 Advanced Monitoring
5.2.1 Predictive Monitoring
5.2.2 Multi-Cluster Monitoring
Chapter 6: Advanced Features and Tuning
6.1 Downsampling Monitoring Data
6.2 Dynamic Label Management
6.3 Performance Tuning
Chapter 7: Case Studies and Troubleshooting
7.1 Common Failure Scenarios
7.1.1 Investigating Memory Leaks
7.1.2 Analyzing Performance Bottlenecks
7.2 Validating the Monitoring Stack
7.2.1 End-to-End Test
Chapter 8: Summary and Best Practices
8.1 Monitoring Checklist
8.2 Recommendations for Continuous Improvement
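Chapter 1: Architecture Design
1.1 Overall Architecture Diagram
We start with an architecture diagram that gives a global view of the components in the monitoring stack and how data flows between them.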
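[Figure: three-layer architecture diagram. Data collection layer: Redis instances (master, replica, cluster), each scraped through a Redis Exporter. Storage and computation layer: Prometheus (time-series database, alert rules, recording rules). Visualization and alerting layer: Grafana (monitoring dashboards, performance analysis) and Alertmanager, which delivers to the notification channels (email, DingTalk, SMS).]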
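1.2 Component Responsibilities
| Component | Responsibility | Key technical points |
| --- | --- | --- |
| Redis Exporter | Collects Redis metrics and exposes an HTTP endpoint for Prometheus to scrape | Written in Go; supports cluster mode |
| Prometheus | Scrapes metrics on a schedule, stores time-series data, evaluates alert rules | Time-series database; PromQL query language |
| Grafana | Visualizes data and builds monitoring dashboards | Data source plugins; panel editor |
| Alertmanager | Deduplicates, groups, routes, and silences alerts | Grouping strategy; inhibition rules |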
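To make the exporter's role more concrete, here is a minimal sketch of what "exposing an HTTP endpoint for Prometheus to scrape" looks like on the wire: plain text lines of the form `name{labels} value`. The sample payload and the `parse_metrics` helper are illustrative assumptions, not output captured from a real exporter, and the parser handles only a simplified subset of the exposition format (no escaping, no timestamps).

```python
import re

# Hypothetical sample of the text an exporter might serve on /metrics.
SAMPLE = """\
# HELP redis_up Information about the Redis instance
# TYPE redis_up gauge
redis_up 1
redis_connected_clients 42
redis_db_keys{db="db0"} 1500
"""

# metric name, optional {label="value",...} section, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse a simplified subset of the Prometheus text exposition format.

    Returns a list of (name, labels_dict, value) tuples. Comment lines
    (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Prometheus does the equivalent of this on every scrape, then attaches target labels such as `instance` and `job` before storing the samples.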
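Chapter 2: Environment Setup and Deployment
2.1 Deploying Redis Exporter
2.1.1 Docker Deployment
```bash
# Create docker-compose.yml
cat > docker-compose.yml <<'EOF'
version: '3.8'
services:
  redis-exporter:
    image: oliver006/redis_exporter:latest
    container_name: redis-exporter
    ports:
      - "9121:9121"
    environment:
      # Inside the container, localhost is the exporter itself, so point
      # this at the actual Redis host (redis-host is a placeholder).
      - REDIS_ADDR=redis://redis-host:6379
      - REDIS_PASSWORD=your_redis_password
      # REDIS_ALIAS is omitted: it appears to be a 0.x-only option.
      # Attach an alias label in Prometheus instead.
    restart: unless-stopped
    healthcheck:
      # Needs wget inside the image. If the default image does not ship it,
      # use an image variant that does, or drop the healthcheck.
      test: ["CMD", "wget", "--quiet", "--spider", "http://localhost:9121/metrics"]
      interval: 30s
      timeout: 10s
      retries: 3
EOF

# Start the service
docker-compose up -d
```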
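2.1.2 Binary Deployment
```bash
# Download a release (v1.45.0 shown; check the releases page for the latest)
wget https://github.com/oliver006/redis_exporter/releases/download/v1.45.0/redis_exporter-v1.45.0.linux-amd64.tar.gz
tar -xzf redis_exporter-v1.45.0.linux-amd64.tar.gz
cd redis_exporter-v1.45.0.linux-amd64

# Start the exporter. -redis.addr takes one default instance; additional
# instances (e.g. redis2:6379) are scraped through the /scrape?target= endpoint.
# The password is passed via the REDIS_PASSWORD environment variable so that it
# does not appear on the command line.
REDIS_PASSWORD="$(cat /etc/redis/password.txt)" ./redis_exporter \
  -redis.addr redis://redis1:6379 \
  -web.listen-address :9121 \
  -web.telemetry-path /metrics \
  -log-format json
```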
2.1.3 Kubernetes Deployment
```yaml
# redis-exporter-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-exporter
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: redis-exporter
  template:
    metadata:
      labels:
        app: redis-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9121"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: redis-exporter
          image: oliver006/redis_exporter:v1.45.0
          ports:
            - containerPort: 9121
          env:
            - name: REDIS_ADDR
              value: "redis-service:6379"
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-secret
                  key: password
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "100m"
          livenessProbe:
            httpGet:
              path: /metrics
              port: 9121
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: redis-exporter-service
  namespace: monitoring
  labels:
    app: redis-exporter
spec:
  selector:
    app: redis-exporter
  ports:
    - name: metrics
      port: 9121
      targetPort: 9121
```
2.2 Prometheus Configuration
2.2.1 Main Configuration File
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: production
    cluster: redis-cluster-01

# Rule files (alert rules and recording rules)
rule_files:
  - "alerts/redis_alerts.yml"
  - "alerts/system_alerts.yml"
  - "recording_rules/redis_rules.yml"

# Scrape configuration
scrape_configs:
  # Redis instances, scraped through the exporter's multi-target /scrape endpoint
  - job_name: 'redis'
    static_configs:
      - targets:
          # Redis instances to monitor (not exporter addresses)
          - 'redis://redis1:6379'
          - 'redis://redis2:6379'
    metrics_path: /scrape
    relabel_configs:
      # Pass the Redis address to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the Redis address as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual HTTP request to the exporter
      - target_label: __address__
        replacement: redis-exporter-service:9121
    scrape_interval: 30s
    scrape_timeout: 10s

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter system monitoring
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 30s
```
2.2.2 Recording Rules
```yaml
# recording_rules/redis_rules.yml
groups:
  - name: redis_recording_rules
    interval: 30s
    rules:
      # Instances without maxmemory report redis_memory_max_bytes = 0;
      # the "> 0" filter drops them instead of dividing by zero.
      - record: redis:memory_usage_percent
        expr: redis_memory_used_bytes / (redis_memory_max_bytes > 0) * 100
      - record: redis:connected_clients_percent
        expr: redis_connected_clients / redis_config_maxclients * 100
      - record: redis:instantaneous_ops_per_second
        expr: rate(redis_commands_processed_total[2m])
      - record: redis:keyspace_hits_rate
        expr: rate(redis_keyspace_hits_total[5m])
      - record: redis:keyspace_misses_rate
        expr: rate(redis_keyspace_misses_total[5m])
      - record: redis:hit_ratio
        expr: redis:keyspace_hits_rate / (redis:keyspace_hits_rate + redis:keyspace_misses_rate) * 100
      - record: redis:network_input_rate
        expr: rate(redis_net_input_bytes_total[2m])
      - record: redis:network_output_rate
        expr: rate(redis_net_output_bytes_total[2m])
```
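As a worked example of what these rules compute, the sketch below reproduces the hit-ratio and memory-usage arithmetic on hand-picked sample values. The function names and the numbers are assumptions for illustration only; Prometheus itself evaluates the expressions per series and per evaluation interval.

```python
def per_second_rate(first_value, last_value, window_seconds):
    """Average per-second increase of a counter across a window."""
    return (last_value - first_value) / window_seconds


def hit_ratio_percent(hits_rate, misses_rate):
    """Cache hit ratio in percent, mirroring hits / (hits + misses) * 100."""
    total = hits_rate + misses_rate
    if total == 0:
        return None  # no lookups in the window; PromQL would yield NaN here
    return hits_rate / total * 100


def memory_usage_percent(used_bytes, max_bytes):
    """Memory usage in percent; None when maxmemory is not configured (0)."""
    if max_bytes <= 0:
        return None
    return used_bytes / max_bytes * 100


# Counters sampled 5 minutes (300 s) apart: hits 1000 -> 4000, misses 200 -> 500
hits = per_second_rate(1000, 4000, 300)    # 10 hits per second
misses = per_second_rate(200, 500, 300)    # 1 miss per second
ratio = hit_ratio_percent(hits, misses)    # 10 / 11 * 100
```

With these inputs the ratio comes out at roughly 90.9%, which would sit comfortably above an 80% low-hit-ratio threshold.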
2.3 Alertmanager Configuration
2.3.1 Alert Routing
```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: 'monitoring@company.com'
  smtp_auth_username: 'monitoring@company.com'
  smtp_auth_password: 'your-smtp-password'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_interval: 1m
      repeat_interval: 5m
      # Keep evaluating sibling routes; without this, critical Redis alerts
      # stop here and never reach the redis-critical receiver below.
      continue: true
    - match:
        service: redis
      receiver: 'redis-team'
      routes:
        - match:
            severity: warning
          receiver: 'redis-warning'
        - match:
            severity: critical
          receiver: 'redis-critical'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'instance']

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'devops-team@company.com'
        headers:
          subject: '[Monitoring] Alert: {{ .GroupLabels.alertname }}'
  - name: 'critical-alerts'
    email_configs:
      - to: 'sre-team@company.com'
    webhook_configs:
      - url: 'http://alert-hook:8080/alerts'
        send_resolved: true
  - name: 'redis-team'
    email_configs:
      - to: 'redis-dba@company.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-redis'
        title: 'Redis Alert: {{ .CommonLabels.alertname }}'
        # Double quotes so that \n is a real newline
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
  # Every receiver named in a route must be defined, otherwise Alertmanager
  # rejects the configuration. Placeholder definitions:
  - name: 'redis-warning'
    email_configs:
      - to: 'redis-dba@company.com'
  - name: 'redis-critical'
    email_configs:
      - to: 'redis-dba@company.com'
```
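The way a routing tree picks receivers is easy to get wrong, so here is a small sketch of the resolution logic as I understand it: routes are walked depth-first, the first matching sibling wins unless it sets `continue: true`, and a node with no matching children falls back to its own receiver. The functions and the `build_tree` helper are hypothetical simplifications (exact-match labels only, every route carries its own receiver), not Alertmanager code.

```python
def route_matches(route, labels):
    """A route matches when every key in its 'match' map equals the alert's label."""
    return all(labels.get(key) == value
               for key, value in route.get("match", {}).items())


def resolve_receivers(route, labels):
    """Return the receiver names an alert with these labels is delivered to."""
    if not route_matches(route, labels):
        return []
    receivers = []
    for child in route.get("routes", []):
        found = resolve_receivers(child, labels)
        if found:
            receivers.extend(found)
            if not child.get("continue", False):
                break  # first matching sibling wins unless it sets continue
    # No child matched: the alert stays with this node's receiver.
    return receivers or [route["receiver"]]


def build_tree(continue_after_critical):
    """Routing tree shaped like the one in this section."""
    return {
        "receiver": "default-receiver",
        "routes": [
            {"match": {"severity": "critical"}, "receiver": "critical-alerts",
             "continue": continue_after_critical},
            {"match": {"service": "redis"}, "receiver": "redis-team", "routes": [
                {"match": {"severity": "warning"}, "receiver": "redis-warning"},
                {"match": {"severity": "critical"}, "receiver": "redis-critical"},
            ]},
        ],
    }


critical_redis = {"severity": "critical", "service": "redis"}
without_continue = resolve_receivers(build_tree(False), critical_redis)
with_continue = resolve_receivers(build_tree(True), critical_redis)
```

Comparing `without_continue` and `with_continue` shows why the order of sibling routes matters: without `continue`, a critical Redis alert is consumed by the generic critical route and the Redis-specific sub-route is never consulted.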
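2.3.2 DingTalk Alerts
```yaml
  # DingTalk robot receiver (add under receivers: in alertmanager.yml).
  # Alertmanager's webhook payload is not in the DingTalk robot message format,
  # so posting straight to oapi.dingtalk.com/robot/send does not work. Point the
  # webhook at a bridge such as prometheus-webhook-dingtalk, which holds the
  # robot access token (your-token). Host, port and target name are placeholders.
  - name: 'dingtalk-redis'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/webhook1/send'
        send_resolved: true
```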
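Chapter 3: Redis Monitoring Metrics
3.1 Key Performance Metrics by Category
3.1.1 Memory Metrics
```promql
# Memory usage (%); only meaningful when maxmemory is set
redis_memory_used_bytes / redis_memory_max_bytes * 100
# Memory fragmentation ratio
redis_mem_fragmentation_ratio
# Memory usage forecast: projected value 24h ahead, fitted on the last 6h
predict_linear(redis_memory_used_bytes[6h], 86400)
# Keys evicted per second under the maxmemory eviction policy
rate(redis_evicted_keys_total[5m])
```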
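The forecast query relies on `predict_linear`, which fits a straight line to the samples in the window and extrapolates it. The sketch below shows the underlying least-squares arithmetic on made-up samples. It is a simplification: it extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the query evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares linear extrapolation over (timestamp_seconds, value) samples.

    Returns the fitted value `seconds_ahead` seconds after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = covariance / variance          # value units per second
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Memory growing by 1 byte per second, sampled once a minute (made-up numbers)
memory_samples = [(0, 100.0), (60, 160.0), (120, 220.0)]
forecast = predict_linear(memory_samples, 3600)
```

Because the fit is linear, a short burst of growth inside the window can produce an alarming projection; longer windows smooth this out at the cost of reacting more slowly.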
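3.1.2 Connection Metrics
```promql
# Connection usage (%)
redis_connected_clients / redis_config_maxclients * 100
# Rejected connections per second
rate(redis_rejected_connections_total[5m])
# Connection count trend (a gauge, so use deriv() rather than rate())
deriv(redis_connected_clients[10m])
```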
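3.1.3 Performance Metrics
```promql
# Operations per second
rate(redis_commands_processed_total[2m])
# Average latency per command in seconds (time spent / number of calls).
# A P95 via histogram_quantile() needs a *_bucket histogram, which is only
# available if your exporter and Redis versions expose one.
rate(redis_commands_duration_seconds_total[5m])
  / rate(redis_commands_total[5m])
# Cache hit ratio (%)
rate(redis_keyspace_hits_total[5m]) /
  (rate(redis_keyspace_hits_total[5m]) +
   rate(redis_keyspace_misses_total[5m])) * 100
```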
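Most of the queries above wrap a counter in `rate()`. One detail worth seeing in code is how a counter reset (for example after a Redis restart) is handled: a drop in value is treated as a restart from zero rather than as a negative increase. The sketch below is a simplified model with made-up samples; real Prometheus additionally extrapolates to the window boundaries, which is not reproduced here.

```python
def counter_increase(samples):
    """Total increase of a counter over (timestamp, value) samples.

    Any decrease between consecutive samples is treated as a counter reset.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            total += value             # restarted from 0 and climbed to `value`
        else:
            total += value - previous  # normal monotonic growth
        previous = value
    return total


def counter_rate(samples):
    """Average per-second increase between the first and last sample."""
    duration = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / duration


# Counter climbs, resets (Redis restart), then climbs again
ops_samples = [(0, 100.0), (30, 160.0), (60, 20.0), (90, 80.0)]
ops_increase = counter_increase(ops_samples)   # 60 + 20 + 60
ops_rate = counter_rate(ops_samples)
```

This is also why `rate()` should only be applied to counters: on a gauge such as `redis_connected_clients`, every decrease would be misread as a reset.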
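3.1.4 Persistence Metrics
```promql
# RDB persistence status
redis_rdb_last_save_timestamp_seconds
redis_rdb_changes_since_last_save
# Seconds since the last successful RDB save
time() - redis_rdb_last_save_timestamp_seconds
# AOF persistence status
redis_aof_enabled
redis_aof_last_rewrite_duration_sec
```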
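3.2 Custom Business Metrics
3.2.1 Big Key Monitoring
```bash
# Scan for big keys with redis-cli (-i 0.1 throttles the scan to limit load)
redis-cli --bigkeys -i 0.1

# Lua script: report keys matching a pattern whose memory usage exceeds a
# threshold. Warning: KEYS blocks Redis while it walks the whole keyspace,
# so run this against a replica or a small dataset, not a busy master.
cat > bigkeys.lua <<'EOF'
local keys = redis.call('keys', ARGV[1])
local result = {}
for i, key in ipairs(keys) do
  local size = redis.call('memory', 'usage', key)
  -- MEMORY USAGE returns nil (false in Lua) if the key vanished meanwhile
  if size and size > tonumber(ARGV[2]) then
    table.insert(result, {key, size})
  end
end
return result
EOF

# ARGV[1] = key pattern, ARGV[2] = size threshold in bytes (example values)
redis-cli --eval bigkeys.lua , 'user:*' 1048576
```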
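A gentler alternative to a `KEYS`-based script is to iterate with `SCAN` from the client side. The following is a minimal sketch under stated assumptions: `client` is any object offering `scan_iter(match=..., count=...)` and `memory_usage(key)`, which is how the redis-py client names these calls; the `FakeRedis` class is a hypothetical stand-in so the example runs without a server, and the key names and sizes are invented.

```python
import fnmatch


def find_big_keys(client, pattern, threshold_bytes, scan_count=100):
    """Return (key, size) pairs above the threshold, largest first.

    Uses incremental SCAN iteration instead of KEYS, so the server is not
    blocked for the duration of a full keyspace walk.
    """
    big = []
    for key in client.scan_iter(match=pattern, count=scan_count):
        size = client.memory_usage(key)
        if size is not None and size > threshold_bytes:  # None: key vanished
            big.append((key, size))
    return sorted(big, key=lambda item: item[1], reverse=True)


class FakeRedis:
    """Stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self, sizes):
        self._sizes = dict(sizes)

    def scan_iter(self, match=None, count=None):
        for key in self._sizes:
            if match is None or fnmatch.fnmatchcase(key, match):
                yield key

    def memory_usage(self, key):
        return self._sizes.get(key)


fake = FakeRedis({"user:1": 2_000_000, "user:2": 10, "order:1": 5_000_000})
big_user_keys = find_big_keys(fake, "user:*", 1_048_576)
```

Against a real server you would pass a redis-py client instead of `FakeRedis` and export the result as a custom metric or log line; that wiring is left out here.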
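3.2.2 Hot Key Monitoring
```promql
# Per-command call rates; a spike hints at hot-key traffic but does not
# identify the key itself
rate(redis_commands_total{cmd="get"}[2m])
rate(redis_commands_total{cmd="set"}[2m])
# Custom metrics:
# instrument hot-key access in the application and expose it as a metric
```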
Chapter 4: Alert Rules
4.1 Memory Alert Rules
```yaml
# alerts/redis_alerts.yml
groups:
  - name: redis_memory_alerts
    rules:
      - alert: RedisMemoryUsageCritical
        expr: redis:memory_usage_percent > 90
        for: 5m
        labels:
          severity: critical
          service: redis
        annotations:
          summary: "Redis memory usage above 90%"
          description: "Instance {{ $labels.instance }} memory usage is {{ $value }}%; act immediately"
          runbook: "https://wiki/redis-memory-optimization"
      - alert: RedisMemoryFragmentationHigh
        expr: redis_mem_fragmentation_ratio > 1.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory fragmentation ratio too high"
          description: "Instance {{ $labels.instance }} fragmentation ratio is {{ $value }}; consider defragmenting"
      - alert: RedisMemoryOOMWarning
        # "> 0" skips instances without maxmemory (redis_memory_max_bytes = 0)
        expr: predict_linear(redis_memory_used_bytes[1h], 3600) / (redis_memory_max_bytes > 0) > 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis is projected to run out of memory"
          description: "Instance {{ $labels.instance }} is projected to exhaust memory within 1 hour; projected usage is {{ $value | humanizePercentage }} of maxmemory"
```
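Each rule above combines a threshold expression with a `for:` duration. The sketch below models how that clause behaves as I understand it: the alert is `pending` from the first evaluation where the expression holds, becomes `firing` once it has held continuously for the `for` duration, and resets as soon as one evaluation fails. The function name and the sample timeline are assumptions for illustration.

```python
def alert_states(evaluations, for_seconds):
    """Map rule evaluations to alert states.

    evaluations: list of (timestamp_seconds, condition_is_true) in time order.
    Returns one of "inactive", "pending", "firing" per evaluation.
    """
    states = []
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None          # any failed evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp     # first evaluation where the rule holds
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# One evaluation per minute, rule configured with the equivalent of "for: 2m"
timeline = [(0, True), (60, True), (120, False),
            (180, True), (240, True), (300, True)]
states = alert_states(timeline, for_seconds=120)
```

In this timeline the single failed evaluation at 120 s throws away two minutes of pending time, which is why a flapping metric may never fire with a long `for:` value.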
4.2 Connection Alert Rules
```yaml
  # Append under groups: in alerts/redis_alerts.yml
  - name: redis_connection_alerts
    rules:
      - alert: RedisConnectionsHigh
        expr: redis:connected_clients_percent > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis connection usage too high"
          description: "Instance {{ $labels.instance }} connection usage is {{ $value }}% of maxclients"
      - alert: RedisConnectionsRejected
        expr: rate(redis_rejected_connections_total[5m]) > 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis is rejecting connections"
          description: "Instance {{ $labels.instance }} is rejecting {{ $value }} connections per second (5m average)"
```
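4.3 Performance Alert Rules
```yaml
  # Append under groups: in alerts/redis_alerts.yml
  - name: redis_performance_alerts
    rules:
      - alert: RedisHighLatency
        # Average seconds per call for each command. The exporter's
        # redis_commands_duration_seconds_total is a counter, not a histogram,
        # so histogram_quantile() cannot be applied to it.
        expr: rate(redis_commands_duration_seconds_total[2m]) / rate(redis_commands_total[2m]) > 0.1
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Redis command latency too high"
          description: "Instance {{ $labels.instance }} command {{ $labels.cmd }} averages {{ $value }}s per call"
      - alert: RedisLowHitRatio
        expr: redis:hit_ratio < 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis cache hit ratio too low"
          description: "Instance {{ $labels.instance }} hit ratio is only {{ $value }}%"
      - alert: RedisHighCPUUsage
        # The sum is a fraction of one CPU core (0.8 = 80%)
        expr: rate(redis_cpu_sys_seconds_total[2m]) + rate(redis_cpu_user_seconds_total[2m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis CPU usage too high"
          description: "Instance {{ $labels.instance }} is using {{ $value | humanizePercentage }} of one CPU core"
```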
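4.4 Cluster State Alerts
```yaml
  # Append under groups: in alerts/redis_alerts.yml
  - name: redis_cluster_alerts
    rules:
      - alert: RedisClusterDown
        expr: redis_cluster_state != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis cluster state abnormal"
          # The cluster name comes from external_labels, which rule templates
          # read through $externalLabels rather than $labels.
          description: "Cluster {{ $externalLabels.cluster }} is in an abnormal state (reported by {{ $labels.instance }}), current state code: {{ $value }}"
      - alert: RedisMasterLinkDown
        expr: redis_master_link_up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Redis replication link broken"
          description: "Replica {{ $labels.instance }} has lost its replication link to the master"
```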
Chapter 5: Building Grafana Dashboards
5.1 Overview Dashboard
5.1.1 Global Overview Panels
```json
{
  "dashboard": {
    "title": "Redis Cluster Monitoring Overview",
    "tags": ["redis", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Memory usage",
        "type": "stat",
        "targets": [{
          "expr": "redis:memory_usage_percent",
          "legendFormat": "{{instance}}"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"value": null, "color": "green"},
                {"value": 80, "color": "yellow"},
                {"value": 90, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "Connection count trend",
        "type": "timeseries",
        "targets": [{
          "expr": "redis_connected_clients",
          "legendFormat": "{{instance}}"
        }]
      }
    ]
  }
}
```
5.1.2 Performance Panel
```json
{
  "title": "Redis Performance",
  "type": "timeseries",
  "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
  "targets": [{
    "expr": "rate(redis_commands_processed_total[2m])",
    "legendFormat": "{{instance}} OPS"
  }],
  "fieldConfig": {
    "defaults": {
      "color": {"mode": "palette-classic"},
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"value": null, "color": "green"},
          {"value": 1000, "color": "yellow"},
          {"value": 5000, "color": "red"}
        ]
      }
    }
  }
}
```
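Hand-editing panel JSON gets tedious once a dashboard has more than a few panels. One option is to generate the JSON from code; the sketch below builds a panel dictionary shaped like the one above. The `timeseries_panel` helper and its parameters are hypothetical, and the field layout simply mirrors the JSON in this section rather than a complete Grafana schema.

```python
import json


def timeseries_panel(title, expr, legend, x, y, w=12, h=8, thresholds=None):
    """Build a time series panel dict with a green base threshold step.

    thresholds: optional list of (value, color) pairs in ascending order.
    """
    steps = [{"value": None, "color": "green"}]  # base step, serialized as null
    for value, color in (thresholds or []):
        steps.append({"value": value, "color": color})
    return {
        "title": title,
        "type": "timeseries",
        "gridPos": {"x": x, "y": y, "w": w, "h": h},
        "targets": [{"expr": expr, "legendFormat": legend}],
        "fieldConfig": {
            "defaults": {
                "color": {"mode": "palette-classic"},
                "thresholds": {"mode": "absolute", "steps": steps},
            }
        },
    }


ops_panel = timeseries_panel(
    "Redis Performance",
    "rate(redis_commands_processed_total[2m])",
    "{{instance}} OPS",
    x=0, y=0,
    thresholds=[(1000, "yellow"), (5000, "red")],
)
ops_panel_json = json.dumps(ops_panel, indent=2)
```

The resulting string can be pasted into a dashboard's JSON model or placed in the `panels` array of a dashboard definition kept under version control.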
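5.2 Advanced Monitoring
5.2.1 Predictive Monitoring
```yaml
# Memory growth forecast (PromQL), projected 24h ahead from the last 6h:
#   predict_linear(redis_memory_used_bytes[6h], 86400)

# Capacity planning warning (alert rule; place it under a rule group)
- alert: RedisCapacityPlanning
  # Projected usage 7 days ahead, as a percentage of maxmemory
  expr: predict_linear(redis_memory_used_bytes[24h], 604800) / (redis_memory_max_bytes > 0) * 100 > 80
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Redis capacity planning warning"
    description: "Instance {{ $labels.instance }} memory usage is projected to reach {{ $value }}% in 7 days"
```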
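5.2.2 Multi-Cluster Monitoring
```yaml
# prometheus.yml: labels that identify each cluster
global:
  external_labels:
    region: us-east-1
    environment: production
    cluster: redis-cluster-01
---
# Rules file: cluster-level aggregation queries.
# External labels are attached only when data leaves this Prometheus
# (remote write, federation, alerts), so evaluate these rules where the
# cluster label is present on the series. The group name is a placeholder.
groups:
  - name: redis_cluster_aggregation
    rules:
      - record: cluster:memory_usage_avg
        expr: avg by (cluster) (redis:memory_usage_percent)
      - record: cluster:qps_sum
        expr: sum by (cluster) (redis:instantaneous_ops_per_second)
```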
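Chapter 6: Advanced Features and Tuning
6.1 Downsampling Monitoring Data
```yaml
# prometheus.yml: long-term storage
remote_write:
  - url: http://victoriametrics:8428/api/v1/write
    write_relabel_configs:
      # Keep selected raw series plus the recorded redis:* series;
      # without the second alternative the downsampled series would be dropped.
      - action: keep
        source_labels: [__name__]
        regex: redis_(memory|connected|commands).*|redis:.*
---
# Rules file: recording rule used for downsampling
groups:
  - name: redis_downsample
    interval: 1h
    rules:
      - record: redis:memory_usage_percent:1h
        expr: avg_over_time(redis:memory_usage_percent[1h])
```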
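To show what the hourly recording rule does to the data, the sketch below averages raw samples into fixed one-hour buckets. It is an approximation for illustration: `avg_over_time(...[1h])` evaluated hourly averages a trailing window at each evaluation time, while this helper uses buckets aligned to multiples of the bucket size. The sample values are invented.

```python
def downsample_avg(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start_timestamp, average_value), oldest first.
    """
    buckets = {}
    for timestamp, value in samples:
        start = timestamp - timestamp % bucket_seconds  # align to bucket start
        buckets.setdefault(start, []).append(value)
    return [(start, sum(values) / len(values))
            for start, values in sorted(buckets.items())]


# Memory usage percent sampled every 30 minutes (made-up values)
raw = [(0, 10.0), (1800, 20.0), (3600, 30.0), (5400, 50.0)]
hourly = downsample_avg(raw, 3600)
```

Averaging hides short spikes, so it can be worth recording a `max_over_time` series alongside the average when peaks matter for capacity decisions.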
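6.2 Dynamic Label Management
```yaml
# Add business labels with relabel_configs.
# The __meta_kubernetes_* labels exist only on targets discovered through
# kubernetes_sd_configs.
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    target_label: application
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  # No source_labels: set a fixed value on every target
  - target_label: environment
    replacement: "production"
```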
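The relabeling rules above all use the default `replace` action. As a rough model of what that action does, the sketch below joins the source label values, matches them against a fully anchored regex, and writes the expanded replacement into the target label. The `relabel_replace` function is a hypothetical simplification: it covers only `replace`, and it translates Prometheus-style `$1` references into Python's syntax.

```python
import re


def relabel_replace(labels, source_labels=(), target_label=None,
                    regex="(.*)", replacement="$1", separator=";"):
    """Apply one 'replace' relabel rule and return a new label dict."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)   # the regex is anchored at both ends
    if match is None:
        return dict(labels)              # no match: labels are left unchanged
    # Convert $1 / ${1} references into Python's \g<1> form before expanding.
    template = re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", replacement)
    new_value = match.expand(template)
    result = dict(labels)
    if new_value == "":
        result.pop(target_label, None)   # an empty result removes the label
    else:
        result[target_label] = new_value
    return result


discovered = {"__meta_kubernetes_namespace": "cache"}
step1 = relabel_replace(discovered, ["__meta_kubernetes_namespace"], "namespace")
step2 = relabel_replace(step1, target_label="environment", replacement="production")
```

After both steps the target carries `namespace="cache"` and `environment="production"`; labels starting with `__` are dropped by Prometheus once relabeling finishes, which this sketch does not model.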
6.3 Performance Tuning
```yaml
# Prometheus performance tuning
global:
  scrape_interval: 30s
  scrape_timeout: 10s

# Cap samples and labels per scrape; a scrape that exceeds a limit fails
scrape_configs:
  - job_name: 'redis'
    sample_limit: 5000
    label_limit: 50
    label_name_length_limit: 100
    label_value_length_limit: 100
```
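Chapter 7: Case Studies and Troubleshooting
7.1 Common Failure Scenarios
7.1.1 Investigating Memory Leaks
```promql
# Memory growth trend in bytes per second (a gauge, so deriv() rather than rate())
deriv(redis_memory_used_bytes[1h])
# Key count
redis_db_keys{db="db0"}
# Big key identification: size of the keys the exporter was told to watch
# (requires starting the exporter with -check-keys or -check-single-keys)
redis_key_size
```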
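7.1.2 Analyzing Performance Bottlenecks
```promql
# Slow query analysis: new slowlog entries over the last 5 minutes
# (the slowlog length itself is a capped gauge, so rate() on it is not meaningful)
delta(redis_slowlog_last_id[5m])
# Network bandwidth
rate(redis_net_input_bytes_total[2m])
rate(redis_net_output_bytes_total[2m])
# Command latency: average seconds per call, by command
rate(redis_commands_duration_seconds_total[5m]) / rate(redis_commands_total[5m])
```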
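7.2 Validating the Monitoring Stack
7.2.1 End-to-End Test
```bash
# 1. Check that the exporter is healthy
curl -s http://redis-exporter:9121/metrics | grep redis_up

# 2. Check that Prometheus is scraping (quote the URL so the shell leaves ? alone)
curl -s 'http://prometheus:9090/api/v1/query?query=redis_up'

# 3. Simulate an alert (Alertmanager v2 API; recent releases no longer serve /api/v1)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "instance": "test-redis",
      "severity": "warning"
    },
    "annotations": {
      "summary": "Test alert",
      "description": "This is a test alert"
    }
  }
]'
```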
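The same test alert can be sent from a script, which is handy in a CI job or a scheduled self-check. The sketch below assumes Alertmanager's v2 API path `/api/v2/alerts` and the hostname used in this section; `build_test_alert` and `post_alerts` are hypothetical helper names. Only the payload builder runs at import time; the network call is left for you to invoke.

```python
import json
import urllib.request


def build_test_alert(alertname="TestAlert", instance="test-redis",
                     severity="warning"):
    """Build the JSON-serializable payload for one synthetic alert."""
    return [{
        "labels": {
            "alertname": alertname,
            "instance": instance,
            "severity": severity,
        },
        "annotations": {
            "summary": "Test alert",
            "description": "This is a test alert",
        },
    }]


def post_alerts(base_url, alerts, timeout=5):
    """POST alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


payload = build_test_alert()
# Example call (requires a reachable Alertmanager):
#   post_alerts("http://alertmanager:9093", payload)
```

Giving the synthetic alert its own `alertname` makes it easy to route to a test receiver, so the check does not page anyone.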
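Chapter 8: Summary and Best Practices
8.1 Monitoring Checklist
| Check | Standard | Verification method |
| --- | --- | --- |
| Data collection completeness | Every Redis instance is monitored | `redis_up == 1` |
| Alert rule effectiveness | Every key metric has a matching alert | Simulated trigger test |
| Notification channels working | Alerts are delivered correctly | End-to-end test |
| Dashboard availability | Key metrics are visualized | Grafana panel review |
| Performance impact | The monitoring stack uses a reasonable amount of resources | Resource monitoring |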
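The first checklist item can be automated by querying Prometheus for `redis_up` and listing every instance that does not report 1. The sketch below parses a response in the shape returned by the Prometheus HTTP query API for an instant vector; the embedded sample response and the `down_instances` helper are assumptions for illustration, and fetching the response over HTTP is left out.

```python
import json

# Hypothetical response body for: /api/v1/query?query=redis_up
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"resultType": "vector", "result": [
   {"metric": {"__name__": "redis_up", "instance": "redis://redis1:6379"},
    "value": [1700000000, "1"]},
   {"metric": {"__name__": "redis_up", "instance": "redis://redis2:6379"},
    "value": [1700000000, "0"]}
 ]}}
"""


def down_instances(response_text):
    """Return the sorted instance labels whose redis_up value is not 1."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for series in body["data"]["result"]:
        _timestamp, raw_value = series["value"]  # the value arrives as a string
        if float(raw_value) != 1.0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return sorted(down)


unhealthy = down_instances(SAMPLE_RESPONSE)
```

An instance that is missing from the result entirely is not detected by this check, so it is worth comparing the returned instances against an inventory of expected Redis addresses as well.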
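8.2 Recommendations for Continuous Improvement
Review alert rules regularly: adjust thresholds based on false positives and missed alerts
Optimize the data retention policy: adjust retention time according to storage cost
Capacity planning: forecast the monitoring system's capacity needs from business growth
Documentation maintenance: keep runbooks and incident-handling procedures up to date
By implementing this guide in full, you will have an enterprise-grade monitoring system that covers the key aspects of Redis, detects and handles Redis problems promptly, and keeps the business running stably.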