Automatic Failure Detection and Failover in Redis Cluster Mode
1. Failure Detection in Cluster Mode
bash
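# Cluster mode has built-in failure detection and automatic failover.
# Key configuration parameters:
# cluster-node-timeout 15000 # node timeout (milliseconds)
# cluster-require-full-coverage yes # stop serving queries unless every hash slot is covered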
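How failure detection works:
- Heartbeats: every node periodically sends PING messages to the other nodes.
- PFAIL flag: if a node does not reply within cluster-node-timeout, the node that pinged it flags it as PFAIL (possible failure).
- FAIL confirmation: once a majority of the master nodes report the node as PFAIL, it is flagged as FAIL (confirmed failure).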
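The promotion from PFAIL to FAIL described above comes down to a majority check over the masters. The sketch below only illustrates that arithmetic; `fail_threshold` and `is_fail_confirmed` are hypothetical helper names, not Redis commands.

```bash
#!/bin/bash
# Hypothetical helpers that illustrate the majority rule; not part of Redis.

# Print the smallest number of masters that forms a majority of $1 masters.
fail_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Succeed if $2 PFAIL reports out of $1 masters are enough to flag FAIL.
is_fail_confirmed() {
    [ "$2" -ge "$(fail_threshold "$1")" ]
}
```

For a three-master cluster, `is_fail_confirmed 3 2` succeeds while `is_fail_confirmed 3 1` does not, which is why a single master's opinion never triggers a failover on its own.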
Automatic failover walkthrough:
bash
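# 1. Inspect the cluster state before the failure
redis-cli -c -p 7000 -a redis123 CLUSTER NODES
# Example output (node IDs abbreviated):
# master1 127.0.0.1:7000@17000 master - 0 1642534567890 1 connected 0-5461
# slave1 127.0.0.1:7003@17003 slave master1 0 1642534567890 4 connected
# 2. Simulate a master failure
redis-cli -p 7000 -a redis123 DEBUG SEGFAULT # crashes the server process
# 3. Watch the failover (allow at least cluster-node-timeout before checking)
redis-cli -c -p 7001 -a redis123 CLUSTER NODES
# After the failover, slave1 is promoted to be the new master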
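To know in advance which replica should take over, the CLUSTER NODES output can be parsed by field rather than by substring. The following is a minimal sketch, assuming the standard field order (node ID, address, flags, master ID); `replicas_of` is a hypothetical helper name.

```bash
#!/bin/bash
# replicas_of PORT: read CLUSTER NODES output on stdin and print the client
# port of every replica whose master listens on PORT. Hypothetical helper.
replicas_of() {
    awk -v port="$1" '
        # Reduce "ip:port@cport" to the client port.
        function client_port(a) {
            sub(/@.*/, "", a)
            sub(/.*:/, "", a)
            return a
        }
        { id[NR] = $1; addr[NR] = $2; flags[NR] = $3; master[NR] = $4 }
        END {
            # First pass: find the ID of the master on the requested port.
            for (i = 1; i <= NR; i++)
                if (flags[i] ~ /master/ && client_port(addr[i]) == port)
                    master_id = id[i]
            if (master_id == "") exit 1
            # Second pass: print every replica that points at that ID.
            for (i = 1; i <= NR; i++)
                if (flags[i] ~ /slave/ && master[i] == master_id)
                    print client_port(addr[i])
        }'
}
```

A typical call would be `redis-cli -p 7000 -a redis123 CLUSTER NODES | replicas_of 7000`. Comparing the master ID in field 4 avoids false matches on timestamps or bus ports that a plain `grep 7000` could hit.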
2. Cluster Failover Step by Step
The following script walks through a Cluster failover in detail:
Redis Cluster Automatic Failover Mechanism
bash
# Create the Cluster failover demo script
cat > ~/redis-practice/cluster/cluster-failover-demo.sh << 'EOF'
#!/bin/bash
PASSWORD="redis123"
echo "=== Redis Cluster Failover Demo ==="
# 1. Check the initial cluster state
echo "1. Initial cluster state:"
redis-cli -c -p 7000 -a $PASSWORD CLUSTER NODES | grep master
# 2. Write test data
echo -e "\n2. Writing test data..."
for i in {1..10}; do
    redis-cli -c -p 7000 -a $PASSWORD SET "test:$i" "value-$i"
done
# 3. Verify how the keys are distributed
echo -e "\n3. Data distribution:"
for port in 7000 7001 7002; do
    count=$(redis-cli -p $port -a $PASSWORD DBSIZE)
    echo "Node $port: $count keys"
done
# 4. Find the replica of the first master by matching on the master's node ID
MASTER_PORT=7000
NODES=$(redis-cli -c -p $MASTER_PORT -a $PASSWORD CLUSTER NODES 2>/dev/null)
MASTER_ID=$(echo "$NODES" | awk -v p=":$MASTER_PORT@" 'index($2, p) && $3 ~ /master/ {print $1}')
SLAVE_PORT=$(echo "$NODES" | awk -v id="$MASTER_ID" '$3 ~ /slave/ && $4 == id {split($2, a, /[:@]/); print a[2]; exit}')
echo -e "\n4. Preparing to simulate a failure..."
echo "Master: $MASTER_PORT"
echo "Its replica: $SLAVE_PORT"
read -p "Press Enter to simulate a failure of master $MASTER_PORT..."
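# 5. Simulate a master failure
echo -e "\n5. Simulating master failure..."
redis-cli -p $MASTER_PORT -a $PASSWORD DEBUG SEGFAULT &>/dev/null ||
    redis-cli -p $MASTER_PORT -a $PASSWORD SHUTDOWN &>/dev/null
# 6. Wait for the failover. Detection alone takes cluster-node-timeout,
#    so poll the replica's role rather than sleeping for a fixed time.
echo -e "\n6. Waiting for the failover to complete (up to 60 seconds)..."
for i in {1..60}; do
    redis-cli -p $SLAVE_PORT -a $PASSWORD INFO replication 2>/dev/null | grep -q "^role:master" && break
    sleep 1
done
# 7. Check the failover result
echo -e "\n7. Cluster state after failover:"
redis-cli -c -p 7001 -a $PASSWORD CLUSTER NODES | grep -E "(master|fail)"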
# 8. Verify data integrity
echo -e "\n8. Verifying data integrity:"
for i in {1..10}; do
    value=$(redis-cli -c -p 7001 -a $PASSWORD GET "test:$i" 2>/dev/null)
    if [ "$value" = "value-$i" ]; then
        echo "✅ test:$i = $value"
    else
        echo "❌ test:$i is missing or wrong"
    fi
done
# 9. Check the new master (the promoted replica keeps its port)
echo -e "\n9. New master state:"
NEW_MASTER_PORT=$SLAVE_PORT
redis-cli -p $NEW_MASTER_PORT -a $PASSWORD INFO replication | grep role
echo -e "\n=== Failover demo complete ==="
EOF
chmod +x ~/redis-practice/cluster/cluster-failover-demo.sh
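Because failure detection takes at least cluster-node-timeout, waiting for a failover is more robust as a bounded poll than as a fixed pause. The sketch below shows one way to write that; `wait_until` and `is_master` are hypothetical helper names, and the ports and password are the demo's own.

```bash
#!/bin/bash
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-run COMMAND once per second until it succeeds or TIMEOUT_SECONDS elapse.
# Returns 0 on success and 1 on timeout. Hypothetical helper, not a Redis tool.
wait_until() {
    timeout_s=$1
    shift
    elapsed=0
    while ! "$@"; do
        [ "$elapsed" -ge "$timeout_s" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example predicate: succeeds once the local node on port $1 reports
# role:master. Assumes the demo password redis123.
is_master() {
    redis-cli -p "$1" -a redis123 INFO replication 2>/dev/null | grep -q '^role:master'
}
```

For example, `wait_until 60 is_master 7003 || echo "no failover within 60 s"` returns as soon as the replica is promoted and reports a timeout otherwise.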
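Automatic Failure Detection and Failover with Master-Replica + Sentinel
1. Failure Detection with Sentinel
Master-replica replication has no automatic failover of its own; Redis Sentinel has to provide it: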
bash
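# Key Sentinel configuration parameters:
# sentinel monitor mymaster 127.0.0.1 6379 2 # quorum: Sentinels that must agree the master is down
# sentinel down-after-milliseconds mymaster 5000 # unreachable time (ms) before a Sentinel flags the master as down
# sentinel parallel-syncs mymaster 1 # replicas reconfigured to the new master at the same time
# sentinel failover-timeout mymaster 10000 # failover timeout (ms)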
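The quorum in `sentinel monitor` only decides when the master is considered objectively down. Starting the failover additionally needs authorization from a majority of all Sentinels, or from quorum-many Sentinels when the quorum is larger than the majority. A minimal sketch of that arithmetic, with `failover_votes_needed` as a hypothetical helper name:

```bash
#!/bin/bash
# failover_votes_needed TOTAL QUORUM
# Print how many Sentinel votes are needed to authorize a failover:
# the larger of the majority of TOTAL and the configured QUORUM.
# Hypothetical helper for illustration; not a Sentinel command.
failover_votes_needed() {
    majority=$(( $1 / 2 + 1 ))
    if [ "$2" -gt "$majority" ]; then
        echo "$2"
    else
        echo "$majority"
    fi
}
```

With three Sentinels and a quorum of 2, `failover_votes_needed 3 2` prints 2, so losing a single Sentinel still leaves the deployment able to fail over.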
2. Sentinel Failover Step by Step
The following script demonstrates a Sentinel failover in detail:
Redis Sentinel Automatic Failover Mechanism
bash
# Create the Sentinel failover demo script
cat > ~/redis-practice/sentinel/sentinel-failover-demo.sh << 'EOF'
#!/bin/bash
REDIS_PASSWORD="redis123"
SENTINEL_PASSWORD="sentinel123"
echo "=== Redis Sentinel Failover Demo ==="
# 1. Check what Sentinel is monitoring
# redis-cli prints each field name and its value on separate lines; -A1 keeps the value
echo "1. Sentinel monitoring state:"
redis-cli -p 26379 -a $SENTINEL_PASSWORD SENTINEL MASTERS | grep -A1 -E '^(name|ip|port|flags)$'
# 2. Check the master and replica state
echo -e "\n2. Current master and replica state:"
CURRENT_MASTER=$(redis-cli -p 26379 -a $SENTINEL_PASSWORD SENTINEL GET-MASTER-ADDR-BY-NAME mymaster)
MASTER_IP=$(echo $CURRENT_MASTER | awk '{print $1}')
MASTER_PORT=$(echo $CURRENT_MASTER | awk '{print $2}')
echo "Current master: $MASTER_IP:$MASTER_PORT"
redis-cli -p $MASTER_PORT -a $REDIS_PASSWORD INFO replication | grep -E "role|connected_slaves"
echo -e "\nReplica state:"
redis-cli -p 26379 -a $SENTINEL_PASSWORD SENTINEL SLAVES mymaster | grep -A1 -E '^(name|ip|port|flags)$'
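# 3. Write test data
echo -e "\n3. Writing test data to the master..."
for i in {1..5}; do
    redis-cli -p $MASTER_PORT -a $REDIS_PASSWORD SET "sentinel:test:$i" "failover-test-$i"
done
# 4. Verify that the replicas received the data
echo -e "\n4. Verifying replication to the replicas:"
for port in 6380 6381; do
    if redis-cli -p $port -a $REDIS_PASSWORD ping >/dev/null 2>&1; then
        # --scan iterates incrementally instead of blocking the server like KEYS
        count=$(redis-cli -p $port -a $REDIS_PASSWORD --scan --pattern "sentinel:test:*" 2>/dev/null | wc -l)
        echo "Replica $port: $count test keys"
    fi
done
# read -p takes the prompt as its argument, so print the blank line separately
echo
read -p "Press Enter to simulate a master failure..."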
# 5. Simulate a master failure
echo -e "\n5. Simulating failure of master $MASTER_PORT..."
redis-cli -p $MASTER_PORT -a $REDIS_PASSWORD DEBUG SEGFAULT &>/dev/null ||
    redis-cli -p $MASTER_PORT -a $REDIS_PASSWORD SHUTDOWN &>/dev/null
echo "Master stopped; Sentinel is now detecting the failure..."
# 6. Monitor the failover
echo -e "\n6. Monitoring the failover:"
for i in {1..20}; do
    sleep 1
    NEW_MASTER=$(redis-cli -p 26379 -a $SENTINEL_PASSWORD SENTINEL GET-MASTER-ADDR-BY-NAME mymaster 2>/dev/null)
    NEW_MASTER_PORT=$(echo $NEW_MASTER | awk '{print $2}')
    if [ -n "$NEW_MASTER_PORT" ] && [ "$NEW_MASTER_PORT" != "$MASTER_PORT" ]; then
        echo "✅ Failover complete! New master port: $NEW_MASTER_PORT"
        break
    fi
    echo "Waiting for failover... ($i/20)"
done
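# 7. Verify the new master
echo -e "\n7. Verifying the new master:"
# Require a port that differs from the failed master, so a timed-out wait is reported as a failure
if [ -n "$NEW_MASTER_PORT" ] && [ "$NEW_MASTER_PORT" != "$MASTER_PORT" ]; then
    redis-cli -p $NEW_MASTER_PORT -a $REDIS_PASSWORD INFO replication | grep -E "role|connected_slaves"
    # 8. Verify data integrity
    echo -e "\n8. Verifying data integrity:"
    for i in {1..5}; do
        value=$(redis-cli -p $NEW_MASTER_PORT -a $REDIS_PASSWORD GET "sentinel:test:$i")
        if [ "$value" = "failover-test-$i" ]; then
            echo "✅ sentinel:test:$i = $value"
        else
            echo "❌ sentinel:test:$i has unexpected data"
        fi
    done
    # 9. Test writes on the new master
    echo -e "\n9. Testing writes on the new master:"
    if redis-cli -p $NEW_MASTER_PORT -a $REDIS_PASSWORD SET "post-failover" "$(date)" 2>/dev/null | grep -q OK; then
        echo "✅ The new master accepts writes"
    else
        echo "❌ Write to the new master failed"
    fi
else
    echo "❌ Failover may have failed; check the Sentinel configuration"
fi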
# 10. Show a summary of the Sentinel failover log
echo -e "\n10. Sentinel failover log summary:"
if [ -f ~/redis-practice/sentinel/logs/sentinel-1.log ]; then
    # Filter first, then keep the last 10 matching lines
    grep -E "failover|switch-master" ~/redis-practice/sentinel/logs/sentinel-1.log | tail -10
fi
echo -e "\n=== Sentinel failover demo complete ==="
EOF
chmod +x ~/redis-practice/sentinel/sentinel-failover-demo.sh
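When redis-cli output is piped, replies to SENTINEL MASTERS and SENTINEL SLAVES arrive as alternating lines: a field name, then its value. Filtering on the field name alone therefore drops the value. The sketch below pairs the lines up; `sentinel_field` is a hypothetical helper name.

```bash
#!/bin/bash
# sentinel_field NAME: read the line-per-element output of a SENTINEL command
# on stdin and print the value of every field called NAME.
# Hypothetical helper; assumes field names and values strictly alternate.
sentinel_field() {
    awk -v want="$1" '
        NR % 2 == 1 { key = $0; next }   # odd lines carry field names
        key == want { print }            # even lines carry the matching value
    '
}
```

For example, `redis-cli -p 26379 -a sentinel123 SENTINEL MASTERS | sentinel_field flags` prints one flags value per monitored master, which is easier to test for s_down or o_down than the raw output.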
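Failover Mechanism Comparison
Key differences between the two modes:
Feature | Redis Cluster | Master-replica + Sentinel |
---|---|---|
Automatic failover | ✅ Built in | ✅ Requires Sentinel |
Detection time | cluster-node-timeout | down-after-milliseconds |
Decision mechanism | A majority of masters confirms the failure | A quorum of Sentinels confirms the failure; a majority of Sentinels authorizes the failover |
Data sharding | ✅ Automatic | ❌ Must be handled by the application |
Client complexity | Needs a cluster-aware client | Needs a Sentinel-aware client |
Typical minimum deployment | 6 nodes (3 masters + 3 replicas) | 3 Redis nodes (1 master + 2 replicas) plus 3 Sentinels |
Failover time comparison: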
bash
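# Create the failover timing comparison script
cat > ~/redis-practice/failover-time-comparison.sh << 'EOF'
#!/bin/bash
echo "=== Failover Timing Comparison ==="
# Time a Cluster-mode failover (assumes master 7000 with replica 7003)
test_cluster_failover_time() {
    echo "1. Cluster-mode failover timing"
    start_time=$(date +%s)
    # Simulate the failure
    redis-cli -p 7000 -a redis123 DEBUG SEGFAULT &>/dev/null
    # Poll until 7003 is promoted, giving up after 120 seconds
    while true; do
        if redis-cli -c -p 7001 -a redis123 CLUSTER NODES 2>/dev/null | grep -q ":7003@.*master"; then
            end_time=$(date +%s)
            failover_time=$((end_time - start_time))
            echo " Cluster failover completed in ${failover_time} seconds"
            break
        fi
        if [ $(( $(date +%s) - start_time )) -ge 120 ]; then
            echo " Cluster failover did not complete within 120 seconds"
            break
        fi
        sleep 1
    done
}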
# Time a Sentinel-mode failover
test_sentinel_failover_time() {
    echo "2. Sentinel-mode failover timing"
    # Look up the current master before starting the clock
    current_master_port=$(redis-cli -p 26379 -a sentinel123 SENTINEL GET-MASTER-ADDR-BY-NAME mymaster | tail -1)
    start_time=$(date +%s)
    # Simulate the failure
    redis-cli -p $current_master_port -a redis123 DEBUG SEGFAULT &>/dev/null
    # Poll until Sentinel reports a different master, giving up after 120 seconds
    while true; do
        new_master_port=$(redis-cli -p 26379 -a sentinel123 SENTINEL GET-MASTER-ADDR-BY-NAME mymaster 2>/dev/null | tail -1)
        if [ -n "$new_master_port" ] && [ "$new_master_port" != "$current_master_port" ]; then
            end_time=$(date +%s)
            failover_time=$((end_time - start_time))
            echo " Sentinel failover completed in ${failover_time} seconds"
            break
        fi
        if [ $(( $(date +%s) - start_time )) -ge 120 ]; then
            echo " Sentinel failover did not complete within 120 seconds"
            break
        fi
        sleep 1
    done
}
echo "Note: make sure the matching deployment is running"
echo "Choose a test mode:"
echo "1) Cluster mode"
echo "2) Sentinel mode"
echo "3) Compare both modes"
read -p "Select [1-3]: " choice
case $choice in
    1) test_cluster_failover_time ;;
    2) test_sentinel_failover_time ;;
    3)
        test_cluster_failover_time
        echo ""
        test_sentinel_failover_time
        ;;
    *) echo "Invalid choice" ;;
esac
EOF
chmod +x ~/redis-practice/failover-time-comparison.sh
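Production Best Practices
1. Tuning Failover Settings
bash
# Note: redis.conf and sentinel.conf accept comments only on their own lines
# Cluster mode tuning (redis.conf)
# Shorter failure detection time
cluster-node-timeout 5000
# Keep serving the covered slots when some slots are unavailable
cluster-require-full-coverage no
# Sentinel mode tuning (sentinel.conf)
# Fast failure detection
sentinel down-after-milliseconds mymaster 3000
# Failover timeout in milliseconds (the shipped default is 180000)
sentinel failover-timeout mymaster 10000
# Reconfigure one replica at a time to avoid a resync storm
sentinel parallel-syncs mymaster 1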
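Lowering cluster-node-timeout shortens detection, but a replica still waits before it asks for votes. The Redis Cluster specification gives that delay as 500 ms, plus a random 0 to 500 ms, plus 1000 ms per replica rank. The sketch below turns this into a rough lower bound; it ignores gossip propagation and the vote itself, and `cluster_min_failover_ms` is a hypothetical helper name.

```bash
#!/bin/bash
# cluster_min_failover_ms NODE_TIMEOUT_MS REPLICA_RANK
# Rough lower bound, in milliseconds, between a master becoming unreachable
# and its replica requesting votes: the node timeout, the fixed 500 ms delay,
# and 1000 ms per rank (rank 0 is the most up-to-date replica).
# The random 0-500 ms component is left out, so real times are higher.
cluster_min_failover_ms() {
    echo $(( $1 + 500 + $2 * 1000 ))
}
```

With the tuned value above, `cluster_min_failover_ms 5000 0` gives 5500, so even an aggressive timeout leaves several seconds during which the affected slots are unavailable.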
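2. Monitoring and Alerting
bash
# Create the failover monitoring script
cat > ~/redis-practice/failover-monitor.sh << 'EOF'
#!/bin/bash
# Monitor Cluster-mode failures
monitor_cluster() {
    redis-cli -c -p 7000 -a redis123 CLUSTER NODES 2>/dev/null | while read -r line; do
        # Field 3 holds the flags; match fail and fail? but not nofailover
        if echo "$line" | awk '{print $3}' | grep -qE '(^|,)fail[?]?(,|$)'; then
            echo "ALERT: Cluster node failed - $line"
            # Add email or SMS alerting here
        fi
    done
}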
# Monitor Sentinel-mode failures
monitor_sentinel() {
    # redis-cli prints "flags" and its value on consecutive lines,
    # so inspect the line that follows "flags" for the down states
    redis-cli -p 26379 -a sentinel123 SENTINEL MASTERS 2>/dev/null | grep -A1 '^flags$' | while read -r line; do
        if echo "$line" | grep -q "s_down\|o_down"; then
            echo "ALERT: Sentinel detected master down - $line"
        fi
    done
}
echo "Redis failover monitor started..."
echo "Press Ctrl+C to stop"
while true; do
    monitor_cluster
    monitor_sentinel
    sleep 10
done
EOF
chmod +x ~/redis-practice/failover-monitor.sh
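Matching the substring fail anywhere in a CLUSTER NODES line can also catch the nofailover flag, which marks a replica that will not attempt a failover and says nothing about node health. A minimal sketch that inspects only the flags field; `failed_nodes` is a hypothetical helper name.

```bash
#!/bin/bash
# failed_nodes: read CLUSTER NODES output on stdin and print "address flags"
# for every node whose flags include fail (confirmed) or fail? (PFAIL).
# Hypothetical helper; the comma anchors keep nofailover from matching.
failed_nodes() {
    awk '$3 ~ /(^|,)fail[?]?(,|$)/ { print $2, $3 }'
}
```

Piping `redis-cli -p 7000 -a redis123 CLUSTER NODES` into `failed_nodes` yields one line per suspect node, and an empty result means there is nothing to alert on.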
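Each mode has its strengths: Cluster adds automatic sharding, while master-replica + Sentinel keeps a single dataset highly available. Choose the one that fits your business requirements and technology stack.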