Redis Sentinel in Depth: Automatic Failover and Leader Election

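Master-replica replication gives you data redundancy, but when the master fails someone has to step in by hand, which is slow and error-prone. Redis Sentinel adds automatic failure detection and failover on top of replication, and that is what makes the deployment highly available. This article walks through Sentinel's architecture, its leader election, and the failover procedure.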

📖 Contents

- Why Sentinel?
- 6.1 Sentinel Architecture
  - Overall architecture
  - Core functions
  - Sentinel cluster
- 6.2 Failure Detection
- 6.3 Leader Election
  - A Raft-like protocol
  - Election flow
  - Election algorithm
- 6.4 Failover
- 6.5 Sentinel Communication
- 6.6 Client Support
  - Client connection
  - Automatic switchover
  - Java implementation
- 6.7 Sentinel Configuration
- In Production
- FAQ

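Why Sentinel?

Pain points of plain replication

```text
Scenario: 3 a.m., the master goes down

With replication only:
1. 📞 The on-call engineer is woken by an alert
2. 🏃 They get up and log in to the server
3. 🔍 They confirm the master is really down
4. 🤔 They pick a replica to promote
5. ⌨️  They run: REPLICAOF NO ONE
6. 🔄 They repoint the other replicas
7. 🔄 They update the application config (master address)
8. 🔄 They restart the application
9. ✅ Service is restored (10-30 minutes later)

Problems:
• ⏰ Slow switchover (10-30 minutes)
• 😴 Needs a human (woken in the middle of the night)
• ❌ Error-prone (manual steps)
• 💸 Long outage (high cost)
```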

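How Sentinel solves it

```text
Scenario: 3 a.m., the master goes down

With Sentinel:
1. 🤖 Sentinel detects that the master is down (after down-after-milliseconds)
2. 🤖 The Sentinels confirm it among themselves
3. 🤖 The best replica is selected automatically
4. 🤖 The failover runs automatically
5. 🤖 Clients are notified automatically
6. ✅ Service is restored (about 10-30 seconds when down-after-milliseconds
      is tuned to a few seconds; the 30-second default adds to this)

Benefits:
• ⚡ Fast switchover (seconds)
• 🤖 Fully automatic (the on-call engineer can sleep)
• ✅ Less error-prone (scripted, repeatable)
• 💰 Short outage (low cost)
```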

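Comparison:

| Dimension | Replication only | Sentinel |
|---|---|---|
| Failure detection | ❌ Manual | ✅ Automatic |
| Failover | ❌ Manual switch | ✅ Automatic switch |
| Switchover time | 10-30 minutes | 10-30 seconds |
| Operating cost | High (24/7 on call) | Low (automated) |
| Reliability | Low (human error) | High (scripted) |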

6.1 Sentinel Architecture

Overall architecture

```text
                         Client
                           │ asks for the master address
                           ↓
    ┌─────────────────────────────────────────────┐
    │       Sentinel cluster (the monitors)       │
    │                                             │
    │ ┌───────────┐  ┌───────────┐  ┌───────────┐ │
    │ │ Sentinel1 │──│ Sentinel2 │──│ Sentinel3 │ │
    │ │  (26379)  │  │  (26380)  │  │  (26381)  │ │
    │ └───────────┘  └───────────┘  └───────────┘ │
    └──────────────────────┬──────────────────────┘
                           │ monitoring, heartbeats, voting
                           ↓
    ┌─────────────────────────────────────────────┐
    │          Redis master-replica set           │
    │                                             │
    │                ┌──────────┐                 │
    │                │  Master  │                 │
    │                │  (6379)  │                 │
    │                └────┬─────┘                 │
    │                     │ replication           │
    │                ┌────┴─────┐                 │
    │                ↓          ↓                 │
    │            ┌────────┐ ┌────────┐            │
    │            │ Slave1 │ │ Slave2 │            │
    │            │ (6380) │ │ (6381) │            │
    │            └────────┘ └────────┘            │
    └─────────────────────────────────────────────┘
```

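Core functions

Sentinel provides four core functions:

1️⃣ Monitoring

```text
Sentinel continuously checks that:
• the master is running
• the replicas are running
• the other Sentinels are running

How:
It sends PING to every known instance about once per second.
No valid reply within down-after-milliseconds → the instance is marked as down.
```

2️⃣ Notification

```text
When something fails, Sentinel can notify:
• operators (email, SMS, DingTalk) through a notification script
• other programs, through Pub/Sub events or a script

Typical events:
• +sdown: subjectively down
• +odown: objectively down
• +try-failover: a failover attempt starts
• +failover-end: the failover finished
• +switch-master: the master address changed
```

3️⃣ Automatic Failover

```text
Master detected as down
↓
Sentinels agree that it is objectively down
↓
A leader Sentinel is elected
↓
The leader picks the best replica
↓
That replica is promoted to master
↓
The other replicas are repointed to the new master
↓
Clients are notified
↓
Failover complete (typically seconds to tens of seconds after detection)
```

4️⃣ Configuration Provider

```text
Clients do not hard-code the Redis address.
They first ask Sentinel: "Where is the master?"

Client → Sentinel: "What is the master address?"
Sentinel → Client: "The master is at 192.168.1.100:6379"

After a failover:
Client → Sentinel: "What is the master address?"
Sentinel → Client: "The master is at 192.168.1.101:6380 (new address)"

The client then connects to the new master.
```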

Sentinel cluster

Why does Sentinel itself need to be a cluster?

```text
A single Sentinel is a single point of failure:
┌──────────┐
│Sentinel1 │ → monitors the master
└──────────┘
     ↓ Sentinel crashes
  💥 Nobody is monitoring

A Sentinel cluster:
┌──────────┐  ┌──────────┐  ┌──────────┐
│Sentinel1 │  │Sentinel2 │  │Sentinel3 │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     └─────────────┼─────────────┘
                   ↓
     mutual monitoring + voting

One Sentinel crashes?
The others keep working ✅
```

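How many Sentinels?

```text
Recommended: an odd number (3, 5, 7)

Why odd?
• A failover must be authorized by more than half of all Sentinels
• This is separate from quorum, which only decides objective down

Examples:
3 Sentinels: at least 2 must agree (2/3 > 50%) ✅  tolerates 1 failure
4 Sentinels: at least 3 must agree (3/4 > 50%)     tolerates 1 failure
5 Sentinels: at least 3 must agree (3/5 > 50%) ✅  tolerates 2 failures

Conclusion: 4 and 5 Sentinels both need 3 votes,
            but 5 tolerates one more failure than 4,
            and 4 tolerates no more than 3 does,
            so an odd count gives more for the money.
```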
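The arithmetic behind the odd-number advice can be checked with a short sketch. The helper names `majority` and `tolerated_failures` are made up for this illustration and are not part of Redis:

```python
def majority(total_sentinels: int) -> int:
    """Votes needed to authorize a failover: more than half of all Sentinels."""
    return total_sentinels // 2 + 1


def tolerated_failures(total_sentinels: int) -> int:
    """How many Sentinels can be lost while a majority is still reachable."""
    return total_sentinels - majority(total_sentinels)


for n in (3, 4, 5, 6, 7):
    print(f"{n} Sentinels: majority={majority(n)}, tolerates={tolerated_failures(n)}")
```

Going from 3 to 4 (or from 5 to 6) Sentinels raises the majority without raising the number of failures the cluster survives, which is why even counts are usually skipped.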

6.2 Failure Detection

Subjective down (SDOWN)

Subjectively down means that one Sentinel, on its own, believes an instance is down.

```text
Sentinel1                         Master
  │                                 │
  │──── PING ──────────────────────→│
  │                                 │
  │ waits down-after-milliseconds   │
  │ (default: 30 seconds)           │
  │                                 │
  │ Timeout! No valid reply         │
  │                                 │
  │ "I think the master is down"    │
  │ marks it +sdown                 │
  │                                 │
  └─ but that is only my own view
     (a local judgment that may be wrong)
```

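Configuration:

```bash
# sentinel.conf
sentinel monitor mymaster 127.0.0.1 6379 2

# down-after-milliseconds: how long without a valid reply before the
# instance counts as down (30000 ms = 30 seconds).
# Keep comments on their own line: Redis config files do not support
# trailing comments after a directive.
sentinel down-after-milliseconds mymaster 30000
```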

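Trigger condition:

```c
// Simplified from sentinelCheckSubjectivelyDown(); not the verbatim Redis source.
mstime_t elapsed = mstime() - ri->link->last_avail_time;  // time since the last valid PING reply
if (elapsed > ri->down_after_period) {
    // No valid reply for longer than down-after-milliseconds (30 s by default)
    if ((ri->flags & SRI_S_DOWN) == 0) {
        sentinelEvent(LL_WARNING, "+sdown", ri, "%@");  // publish the event once
        ri->flags |= SRI_S_DOWN;                        // mark subjectively down
    }
}
```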
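The same timeout rule, reduced to a pure function, is sketched below. The function name and the millisecond timestamps are assumptions chosen for the example; only the comparison against `down-after-milliseconds` comes from the text above:

```python
DOWN_AFTER_MS = 30_000  # mirrors the 30-second default discussed above


def is_subjectively_down(last_valid_reply_ms: int, now_ms: int,
                         down_after_ms: int = DOWN_AFTER_MS) -> bool:
    """True when no valid reply has arrived for longer than down_after_ms."""
    return now_ms - last_valid_reply_ms > down_after_ms


# The last valid PONG arrived at t=0 ms.
print(is_subjectively_down(0, 29_000))  # still within the window
print(is_subjectively_down(0, 30_001))  # the window has been exceeded
```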

Objective down (ODOWN)

Objectively down means that at least quorum Sentinels (the number configured in sentinel monitor) believe the master is down. Quorum does not have to be a majority.

```text
Sentinel1: "I think the master is down (subjectively down)"
           ↓
      asks the other Sentinels

Sentinel1 → Sentinel2: "Do you think the master is down?"
Sentinel2 → Sentinel1: "Yes, I think so too"

Sentinel1 → Sentinel3: "Do you think the master is down?"
Sentinel3 → Sentinel1: "Yes, I think so too"

Sentinel1: "Good, everyone agrees"
           Count: 2 other Sentinels agree (3 including myself)
           quorum = 2 (configured)
           3 >= 2 ✅
           ↓
      mark the master objectively down (ODOWN)
           ↓
      start the failover
```

Configuration:

```bash
# sentinel.conf
# The last argument is the quorum.
sentinel monitor mymaster 127.0.0.1 6379 2

# quorum = 2 means:
# at least 2 Sentinels must consider the master down
# before it is marked objectively down
```

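Implementation:

```c
// Simplified from sentinel.c; rate limiting and error handling are omitted.

// Ask the other Sentinels what they think of the master
void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
    dictIterator *di;
    dictEntry *de;

    // Iterate over every other Sentinel known for this master
    di = dictGetIterator(master->sentinels);
    while ((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        char port[32];

        ll2string(port, sizeof(port), master->addr->port);
        // SENTINEL is-master-down-by-addr <ip> <port> <current-epoch> <runid>
        // runid "*" only asks for the state; our own run ID also requests a vote.
        redisAsyncCommand(ri->link->cc, sentinelReceiveIsMasterDownReply, ri,
            "SENTINEL is-master-down-by-addr %s %s %llu %s",
            master->addr->ip, port,
            (unsigned long long) sentinel.current_epoch,
            (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
                sentinel.myid : "*");
    }
    dictReleaseIterator(di);
}

// Decide whether the master is objectively down
void sentinelCheckObjectivelyDownMaster(sentinelRedisInstance *master) {
    dictIterator *di;
    dictEntry *de;
    unsigned int quorum = 0, odown = 0;

    // Only a master that we ourselves consider down can become ODOWN
    if (master->flags & SRI_S_DOWN) {
        quorum = 1;  // count this Sentinel

        // Count the other Sentinels that reported the master as down
        di = dictGetIterator(master->sentinels);
        while ((de = dictNext(di)) != NULL) {
            sentinelRedisInstance *ri = dictGetVal(de);
            if (ri->flags & SRI_MASTER_DOWN) quorum++;
        }
        dictReleaseIterator(di);

        // Has the configured quorum been reached?
        if (quorum >= master->quorum) odown = 1;
    }

    if (odown) {
        // Set the flag once and publish +odown
        if ((master->flags & SRI_O_DOWN) == 0) {
            sentinelEvent(LL_WARNING, "+odown", master, "%@ #quorum %d/%d",
                quorum, master->quorum);
            master->flags |= SRI_O_DOWN;
        }
    } else if (master->flags & SRI_O_DOWN) {
        // Agreement was lost: clear the flag and publish -odown
        sentinelEvent(LL_WARNING, "-odown", master, "%@");
        master->flags &= ~SRI_O_DOWN;
    }
}
```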

Detection flow

The complete failure detection flow:

```text
1. Continuous monitoring
   Every Sentinel sends PING to the master about once per second
   ↓
2. Subjective judgment
   Sentinel1: no valid reply for 30 seconds
   → marks the master +sdown (subjectively down)
   ↓
3. Ask for confirmation
   Sentinel1 asks the other Sentinels:
   "Do you think the master is down?"
   ↓
4. Objective confirmation
   Sentinel2: "Yes"
   Sentinel3: "Yes"
   → quorum reached
   → marks the master +odown (objectively down)
   ↓
5. Start the failover
```

Worked example:

```text
Scenario: 3 Sentinels monitor 1 master, quorum=2

Normal operation:
Sentinel1 → Master: PING → PONG ✅
Sentinel2 → Master: PING → PONG ✅
Sentinel3 → Master: PING → PONG ✅

Master crashes:
Sentinel1 → Master: PING → no reply ❌
           (30 seconds later)
Sentinel1: "The master is subjectively down"

Sentinel1 → Sentinel2: "Is the master down?"
Sentinel2: "Yes, I cannot reach it either"

Sentinel1 → Sentinel3: "Is the master down?"
Sentinel3: "Yes, I cannot reach it either"

Count: all 3 Sentinels consider it down
3 >= quorum(2) ✅
→ the master is objectively down

Failover starts...
```
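The quorum count in this example can be sketched as a small function. `is_objectively_down` and its parameters are illustrative names, not Redis APIs; the rule it encodes (the asking Sentinel must itself see the master as down, then counts itself plus every agreeing peer) follows the flow described above:

```python
from typing import Sequence


def is_objectively_down(own_view_down: bool, peer_views: Sequence[bool],
                        quorum: int) -> bool:
    """Return True when at least `quorum` Sentinels consider the master down."""
    if not own_view_down:
        # A Sentinel that can still reach the master never declares ODOWN.
        return False
    agreeing = 1 + sum(1 for down in peer_views if down)  # self + agreeing peers
    return agreeing >= quorum


# 3 Sentinels, quorum=2, both peers agree: 3 >= 2
print(is_objectively_down(True, [True, True], quorum=2))
# Only this Sentinel lost contact (likely a local network problem): 1 < 2
print(is_objectively_down(True, [False, False], quorum=2))
```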

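6.3 Leader Election

A Raft-like protocol

Sentinel elects a leader with a voting scheme that borrows ideas from Raft's leader election (epochs, one vote per epoch, majority wins). It is not a full Raft implementation.

Why elect a leader at all?

```text
Problem: several Sentinels run the failover at the same time

Sentinel1: "I will run the failover"
           promotes Slave1 to master

Sentinel2: "I will run the failover too"
           promotes Slave2 to master

Result: more than one master ❌ (split brain)

Solution: elect one leader first.
Only the leader may run the failover.
```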

Election flow

```text
1. Start the election
   Sentinel1 sees that the master is objectively down
   ↓
   Sentinel1: "I want to be the leader!"
   It sends a vote request to every other Sentinel

2. Voting rules
   Each Sentinel casts at most 1 vote per epoch
   First come, first served

3. Voting
   Sentinel1 → Sentinel2: "Vote for me"
   Sentinel2: "OK, you have my vote" ✅

   Sentinel1 → Sentinel3: "Vote for me"
   Sentinel3: "OK, you have my vote" ✅

   Sentinel1: "I have 3 votes (including my own)"

4. Confirm the result
   3 Sentinels in total
   Majority needed: 3/2 + 1 = 2 votes (and never fewer than quorum)
   Sentinel1 has 3 votes >= 2 ✅

   Sentinel1: "I am the leader! Starting the failover"
```

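Election algorithm

```c
// Simplified from sentinel.c (not verbatim).

// Start a failover attempt for this master
void sentinelStartFailover(sentinelRedisInstance *master) {
    // 1. Enter the first failover state
    master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_START;
    master->flags |= SRI_FAILOVER_IN_PROGRESS;

    // 2. Bump the epoch: every election round gets its own epoch
    master->failover_epoch = ++sentinel.current_epoch;

    // 3. Votes are then requested by sending
    //    SENTINEL is-master-down-by-addr with our own run ID
    //    (sentinelAskMasterStateToOtherSentinels with SENTINEL_ASK_FORCED).
}

// Voting logic: at most one vote per epoch, the first request wins
char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch,
                         char *req_runid, uint64_t *leader_epoch) {
    // A newer epoch starts a new round
    if (req_epoch > sentinel.current_epoch)
        sentinel.current_epoch = req_epoch;

    // No vote cast in this epoch yet? Vote for the requester.
    if (master->leader_epoch < req_epoch && sentinel.current_epoch <= req_epoch) {
        sdsfree(master->leader);
        master->leader = sdsnew(req_runid);
        master->leader_epoch = sentinel.current_epoch;
    }

    // Otherwise report the vote that was already cast
    *leader_epoch = master->leader_epoch;
    return master->leader ? sdsnew(master->leader) : NULL;
}

// Counting votes (pseudocode). The winner needs an absolute majority of
// all known Sentinels AND at least `quorum` votes:
//
//   votes         = Sentinels whose reply names my run ID for this epoch
//   voters_quorum = total_sentinels / 2 + 1
//   elected       = votes >= voters_quorum && votes >= master->quorum
```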

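Worked example:

```text
Scenario: 5 Sentinels, the master is objectively down

T0: every Sentinel has seen the master go down
epoch = 100 (a new election round)

T1: Sentinel1 is the first to start an election
Sentinel1 → S2: "Vote for me! epoch=100, runid=s1"
Sentinel1 → S3: "Vote for me! epoch=100, runid=s1"
Sentinel1 → S4: "Vote for me! epoch=100, runid=s1"
Sentinel1 → S5: "Vote for me! epoch=100, runid=s1"

S2: "epoch=100, I have not voted yet, you get my vote" → s1 ✅
S3: "epoch=100, I have not voted yet, you get my vote" → s1 ✅
S4: "epoch=100, I have not voted yet, you get my vote" → s1 ✅
S5: "epoch=100, I have not voted yet, you get my vote" → s1 ✅

Sentinel1 counts: 5 votes (including its own)
Needed: 5/2 + 1 = 3 votes
5 >= 3 ✅
Sentinel1: "I am the leader!"

T2: Sentinel2 also wants to be the leader (one step too late)
Sentinel2 → S1: "Vote for me! epoch=100, runid=s2"
S1: "Sorry, I already voted for s1" → s1 ❌

Sentinel2 → S3: "Vote for me! epoch=100, runid=s2"
S3: "Sorry, I already voted for s1" → s1 ❌

Sentinel2: "I was not elected" (not enough votes)

Result: Sentinel1 is the leader
```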
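The scenario above can be replayed with a toy model. This is a sketch under stated assumptions, not Sentinel's code: the `Sentinel` class and `campaign` function are invented for the example, all messages are delivered instantly and in order, and the candidate simply asks itself for a vote like any other voter.

```python
class Sentinel:
    """Toy model of one Sentinel's voting state for a single master."""

    def __init__(self, run_id):
        self.run_id = run_id
        self.current_epoch = 0
        self.leader = None      # run ID this Sentinel last voted for
        self.leader_epoch = 0   # epoch in which that vote was cast

    def vote(self, req_epoch, req_run_id):
        """Cast at most one vote per epoch; later requests get the earlier answer."""
        if req_epoch > self.current_epoch:
            self.current_epoch = req_epoch
        if self.leader_epoch < req_epoch and self.current_epoch <= req_epoch:
            self.leader = req_run_id
            self.leader_epoch = req_epoch
        return self.leader, self.leader_epoch


def campaign(candidate, peers, epoch, quorum):
    """Ask every Sentinel (candidate included) for a vote and count the result."""
    voters = [candidate] + list(peers)
    votes = 0
    for sentinel in voters:
        leader, leader_epoch = sentinel.vote(epoch, candidate.run_id)
        if leader == candidate.run_id and leader_epoch == epoch:
            votes += 1
    needed = max(len(voters) // 2 + 1, quorum)  # majority, and never below quorum
    return votes >= needed, votes


sentinels = [Sentinel(f"s{i}") for i in range(1, 6)]
s1, s2 = sentinels[0], sentinels[1]
print(campaign(s1, sentinels[1:], epoch=100, quorum=2))         # s1 asks first
print(campaign(s2, [s1] + sentinels[2:], epoch=100, quorum=2))  # s2 is too late
```

Because every Sentinel has already voted in epoch 100, the second campaign collects no votes. A candidate that loses has to wait and retry with a higher epoch.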

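6.4 Failover

Choosing the new master

The leader Sentinel has to pick one replica to become the new master.

Selection criteria (highest precedence first):

```text
1️⃣ Priority (replica-priority)
   ├─ priority = 0: never promoted
   └─ the lower the number, the higher the precedence

2️⃣ Replication offset
   └─ the larger the offset, the more complete the data

3️⃣ Run ID
   └─ the lexicographically smallest wins (tie-breaker)
```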

Worked example:

```text
Scenario: the master is down, 3 candidate replicas

Slave1:
• priority: 100
• offset: 1234567
• run_id: aaa...

Slave2:
• priority: 90   ← lowest number (highest precedence)
• offset: 1234500
• run_id: bbb...

Slave3:
• priority: 100
• offset: 1234600
• run_id: ccc...

Selection:
1. Compare priority
   Slave2 has the lowest value, 90 ✅

2. offset and run_id are not consulted

Result: Slave2 becomes the new master
```

Another example:

```text
Scenario: equal priority

Slave1:
• priority: 100
• offset: 1234567  ← largest
• run_id: ccc...

Slave2:
• priority: 100
• offset: 1234500
• run_id: bbb...

Slave3:
• priority: 100
• offset: 1234000
• run_id: aaa...

Selection:
1. All priorities are 100 (equal)
   ↓ move to the next criterion

2. Compare offset
   Slave1 has the largest offset, 1234567 (most complete data) ✅

Result: Slave1 is selected
```

Selection algorithm:

```c
// Simplified: the real sentinelSelectSlave() filters the candidates and then
// sorts them with qsort(); this loop applies the same ordering.
sentinelRedisInstance *sentinelSelectSlave(sentinelRedisInstance *master) {
    sentinelRedisInstance *slave, *best = NULL;
    dictIterator *di;
    dictEntry *de;

    di = dictGetIterator(master->slaves);
    while ((de = dictNext(di)) != NULL) {
        slave = dictGetVal(de);

        // Filters
        if (slave->flags & (SRI_S_DOWN|SRI_O_DOWN)) continue;  // skip replicas that are down
        if (slave->slave_priority == 0) continue;               // priority 0 is never promoted
        if (mstime() - slave->link->last_avail_time > SENTINEL_PING_PERIOD*5)
            continue;                                           // no recent PING reply

        // Lower priority value wins
        if (best == NULL || slave->slave_priority < best->slave_priority) {
            best = slave;
        } else if (slave->slave_priority == best->slave_priority) {
            // Same priority: larger replication offset wins
            if (slave->slave_repl_offset > best->slave_repl_offset) {
                best = slave;
            } else if (slave->slave_repl_offset == best->slave_repl_offset) {
                // Same offset too: smaller run ID wins
                if (strcasecmp(slave->runid, best->runid) < 0) {
                    best = slave;
                }
            }
        }
    }
    dictReleaseIterator(di);

    return best;
}
```
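The three-level ordering maps naturally onto a sort key. The sketch below is an illustration only: the `Replica` dataclass, its `down` flag, and `select_replica` are assumptions for the example, and the health filter is reduced to a single boolean.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    priority: int
    repl_offset: int
    run_id: str
    down: bool = False  # stand-in for the SDOWN/ODOWN/unreachable filters


def select_replica(replicas):
    """Pick by lowest priority, then largest offset, then smallest run ID."""
    candidates = [r for r in replicas if not r.down and r.priority != 0]
    if not candidates:
        return None
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))


first = [
    Replica("Slave1", 100, 1234567, "aaa"),
    Replica("Slave2", 90, 1234500, "bbb"),
    Replica("Slave3", 100, 1234600, "ccc"),
]
second = [
    Replica("Slave1", 100, 1234567, "ccc"),
    Replica("Slave2", 100, 1234500, "bbb"),
    Replica("Slave3", 100, 1234000, "aaa"),
]
print(select_replica(first).name)   # priority decides
print(select_replica(second).name)  # offset decides
```

Negating the offset lets a single `min` call express "smaller priority, larger offset, smaller run ID" without nested comparisons.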

Failover procedure

The full sequence the leader Sentinel runs:

```text
Failover in 6 steps

What the leader Sentinel does:

1️⃣ Select the new master
   "Slave2 is the best candidate"

2️⃣ Send REPLICAOF NO ONE
   Leader → Slave2: "REPLICAOF NO ONE"
   Slave2: "OK, I am a master now"

3️⃣ Wait for confirmation
   The leader sends INFO every second to check whether Slave2 has become a master
   Slave2: role:master ✅

4️⃣ Reconfigure the other replicas
   Leader → Slave1: "REPLICAOF <Slave2-IP> <Slave2-Port>"
   Leader → Slave3: "REPLICAOF <Slave2-IP> <Slave2-Port>"

5️⃣ Update the configuration
   Every Sentinel updates its configuration:
   • the new master address
   • the old master is now recorded as a replica

6️⃣ Handle the old master when it returns
   If the old master comes back:
   Sentinel → old master: "REPLICAOF <new-master-IP> <new-master-Port>"
   Old master: "OK, I am a replica now"
```

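Implementation detail:

```c
// Simplified from sentinel.c.
void sentinelFailoverStateMachine(sentinelRedisInstance *master) {
    // Only runs while a failover is in progress for this master
    if (!(master->flags & SRI_FAILOVER_IN_PROGRESS)) return;

    // The failover is driven by a state machine
    switch (master->failover_state) {
        case SENTINEL_FAILOVER_STATE_WAIT_START:
            // Wait until this Sentinel has been elected leader
            sentinelFailoverWaitStart(master);
            break;

        case SENTINEL_FAILOVER_STATE_SELECT_SLAVE:
            // Select the new master
            sentinelFailoverSelectSlave(master);
            break;

        case SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE:
            // Send REPLICAOF NO ONE to the selected replica
            sentinelFailoverSendSlaveOfNoOne(master);
            break;

        case SENTINEL_FAILOVER_STATE_WAIT_PROMOTION:
            // Wait for the promotion to show up in INFO
            sentinelFailoverWaitPromotion(master);
            break;

        case SENTINEL_FAILOVER_STATE_RECONF_SLAVES:
            // Repoint the remaining replicas
            sentinelFailoverReconfNextSlave(master);
            break;
    }
    // SENTINEL_FAILOVER_STATE_UPDATE_CONFIG is not handled in this switch:
    // once it is reached, the monitoring loop switches the monitored
    // master over to the promoted replica.
}
```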

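Configuration propagation

How does the new configuration spread once the failover is done?

```text
Leader Sentinel (the one that ran the failover)
  │ updates its configuration:
  │ • Master: new address
  │ • Replicas: new list
  ↓
publishes to the __sentinel__:hello channel
  ↓
  ├─→ Sentinel2: receives the message, updates its configuration ✅
  ├─→ Sentinel3: receives the message, updates its configuration ✅
  └─→ Sentinel4: receives the message, updates its configuration ✅

All Sentinels now share the same configuration
  ↓
each one rewrites its sentinel.conf
  ↓
the change is persisted and survives a restart
```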

6.5 Sentinel Communication

Pub/Sub channel

Sentinels discover each other through Redis Pub/Sub:

```text
Every Sentinel:
1. subscribes to the __sentinel__:hello channel on the master (and on the replicas)
2. publishes its own information to that channel every 2 seconds

Sentinel1 → Master:
   PUBLISH __sentinel__:hello "127.0.0.1,26379,<runid>,..."

Sentinel2 receives the message:
   "There is a Sentinel at 127.0.0.1:26379"
   "I can talk to it directly now"

Sentinel3 receives the message:
   "There is a Sentinel at 127.0.0.1:26379..."
```

Published message:

```text
Format (a single comma-separated line):
<sentinel_ip>,<sentinel_port>,<sentinel_runid>,
<current_epoch>,<master_name>,<master_ip>,<master_port>,
<master_config_epoch>

Example:
127.0.0.1,26379,9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b,
100,mymaster,127.0.0.1,6379,99
```
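Parsing this message is a matter of splitting eight comma-separated fields. The sketch below assumes the format exactly as listed above; `Hello`, `parse_hello`, and `should_adopt` are names invented for the example. `should_adopt` illustrates the rule that a Sentinel takes over a master configuration only when the advertised config epoch is newer than its own.

```python
from typing import NamedTuple


class Hello(NamedTuple):
    sentinel_ip: str
    sentinel_port: int
    sentinel_runid: str
    current_epoch: int
    master_name: str
    master_ip: str
    master_port: int
    master_config_epoch: int


def parse_hello(payload: str) -> Hello:
    """Split a __sentinel__:hello payload into typed fields."""
    parts = payload.strip().split(",")
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}")
    ip, port, runid, epoch, name, m_ip, m_port, cfg_epoch = parts
    return Hello(ip, int(port), runid, int(epoch), name, m_ip, int(m_port), int(cfg_epoch))


def should_adopt(local_config_epoch: int, hello: Hello) -> bool:
    """A newer config epoch means the sender knows about a more recent failover."""
    return hello.master_config_epoch > local_config_epoch


msg = ("127.0.0.1,26379,9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b,"
       "100,mymaster,127.0.0.1,6379,99")
hello = parse_hello(msg)
print(hello.master_name, hello.master_ip, hello.master_port)
print(should_adopt(98, hello))
```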

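Information exchange

Sentinels also exchange information with commands:

```bash
# 1. Ask another Sentinel about the master's state
SENTINEL is-master-down-by-addr <ip> <port> <current-epoch> <runid>

# Arguments:
# • current-epoch: the sender's current epoch
# • runid:
#   - "*": only ask for the state, do not request a vote
#   - <runid>: request a vote for this run ID

# Reply:
# 1) down_state (0 or 1)
# 2) leader_runid (who the replying Sentinel voted for)
# 3) leader_epoch (the epoch of that vote)

# The commands below are also useful for operators and clients.

# 2. List the other Sentinels
SENTINEL sentinels <master-name>

# 3. List the replicas (SENTINEL replicas is the newer name, Redis 5.0+)
SENTINEL slaves <master-name>

# 4. Show the master
SENTINEL master <master-name>
```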

Topology discovery

How does a Sentinel discover the whole topology by itself?

```text
1. Discover the master
   Sentinel1 is configured with:
   sentinel monitor mymaster 127.0.0.1 6379 2
   ↓
   It connects to the master

2. Discover the replicas
   Sentinel1 → Master: INFO (every 10 seconds)
   Master → Sentinel1:
   "I have 2 replicas:
    Slave1: 127.0.0.1:6380
    Slave2: 127.0.0.1:6381"
   ↓
   Sentinel1 connects to Slave1 and Slave2

3. Discover the other Sentinels
   Sentinel1 subscribes to __sentinel__:hello on the master
   Sentinel2 publishes: "I am Sentinel2 at 127.0.0.1:26380"
   Sentinel3 publishes: "I am Sentinel3 at 127.0.0.1:26381"
   ↓
   Sentinel1 connects to Sentinel2 and Sentinel3

Result:
Sentinel1 knows the full topology:
• Master: 127.0.0.1:6379
• Replicas: 127.0.0.1:6380, 127.0.0.1:6381
• Sentinels: 127.0.0.1:26380, 127.0.0.1:26381
```
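Step 2 relies on the `slaveN:` lines in the master's INFO output. A minimal sketch of extracting replica addresses from that text follows. The sample text is hand-written for the example, and `parse_replicas` is a hypothetical helper, not how Sentinel itself is implemented:

```python
def parse_replicas(info_text: str):
    """Return (ip, port) for every 'slaveN:ip=...,port=...' line in INFO output."""
    replicas = []
    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Keep only keys of the form slave0, slave1, ...
        if not sep or not key.startswith("slave") or not key[5:].isdigit():
            continue
        fields = dict(item.split("=", 1) for item in value.split(","))
        replicas.append((fields["ip"], int(fields["port"])))
    return replicas


sample_info = (
    "# Replication\r\n"
    "role:master\r\n"
    "connected_slaves:2\r\n"
    "slave0:ip=127.0.0.1,port=6380,state=online,offset=1234,lag=0\r\n"
    "slave1:ip=127.0.0.1,port=6381,state=online,offset=1234,lag=1\r\n"
)
print(parse_replicas(sample_info))
```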

6.6 Client Support

Client connection

A client does not hard-code the Redis address. It asks Sentinel for the current master first:

```text
1. The client starts
   It is configured with the Sentinel addresses:
   • 127.0.0.1:26379
   • 127.0.0.1:26380
   • 127.0.0.1:26381

2. Ask for the master address
   Client → Sentinel1: "Where is the master of mymaster?"
   Sentinel1 → Client: "127.0.0.1:6379"

3. Connect to the master
   Client → Master(6379): open a connection

4. Subscribe to switch events
   Client → Sentinel1: subscribe to +switch-master

5. When the master changes
   Sentinel1 → Client: "+switch-master mymaster 127.0.0.1 6379 127.0.0.1 6380"
                       "The master changed, the new one is on 6380"

6. Switch connections
   Client: closes the old connection (6379)
   Client: connects to the new master (6380)
```
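Step 5 delivers the old and new addresses in one space-separated payload on the `+switch-master` channel. A client library has to split it before reconnecting. The helper below is a sketch with an invented name and covers only the parsing, not the subscription or reconnect logic:

```python
def parse_switch_master(payload: str):
    """Parse '<master-name> <old-ip> <old-port> <new-ip> <new-port>'."""
    parts = payload.split()
    if len(parts) != 5:
        raise ValueError(f"unexpected +switch-master payload: {payload!r}")
    name, old_ip, old_port, new_ip, new_port = parts
    return name, (old_ip, int(old_port)), (new_ip, int(new_port))


name, old_addr, new_addr = parse_switch_master("mymaster 127.0.0.1 6379 127.0.0.1 6380")
print(f"{name}: {old_addr} -> {new_addr}")
```

In practice a client should also re-query Sentinel for the master address whenever a connection fails, since a Pub/Sub message can be missed.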

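Automatic switchover

```text
Scenario: the master fails over
(assumes down-after-milliseconds is set to 5000)

Timeline:
00:00  Master(6379) is healthy, the client is connected
       Client → Master(6379): SET key value

00:30  Master(6379) crashes 💥

00:35  Sentinel detects the failure and starts the failover

00:45  Failover complete, Slave(6380) is now the master

00:46  Sentinel notifies the client:
       "+switch-master mymaster 127.0.0.1 6379 127.0.0.1 6380"

00:47  The client switches:
       disconnects from 6379
       connects to 6380 ✅

00:48  Client → Master(6380): SET key value
       Service is back to normal ✅

Outage: about 18 seconds (00:30 - 00:48)
```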

Java implementation

```java
import java.util.HashSet;
import java.util.Set;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPoolConfig;
import redis.clients.jedis.JedisSentinelPool;

/**
 * Jedis Sentinel client
 */
public class RedisSentinelClient {

    private final JedisSentinelPool sentinelPool;

    public RedisSentinelClient() {
        // Sentinel addresses
        Set<String> sentinels = new HashSet<>();
        sentinels.add("127.0.0.1:26379");
        sentinels.add("127.0.0.1:26380");
        sentinels.add("127.0.0.1:26381");

        // Connection pool settings (library defaults are used here)
        JedisPoolConfig poolConfig = new JedisPoolConfig();

        // Create the Sentinel-backed connection pool
        sentinelPool = new JedisSentinelPool(
            "mymaster",        // master name
            sentinels,         // Sentinel addresses
            poolConfig,        // pool configuration
            "password"         // Redis password
        );
    }

    public void set(String key, String value) {
        try (Jedis jedis = sentinelPool.getResource()) {
            // The pool hands out a connection to the current master
            jedis.set(key, value);

            // After a failover the pool reconnects to the new master.
            // Commands in flight during the switch can still fail,
            // so callers should be prepared to retry.
        }
    }

    public String get(String key) {
        try (Jedis jedis = sentinelPool.getResource()) {
            return jedis.get(key);
        }
    }

    public void close() {
        sentinelPool.close();
    }
}
```

Spring Boot configuration:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisSentinelConfiguration;
import org.springframework.data.redis.connection.jedis.JedisConnectionFactory;
import org.springframework.data.redis.core.RedisTemplate;

@Configuration
public class RedisSentinelConfig {
    
    @Bean
    public RedisSentinelConfiguration sentinelConfiguration() {
        RedisSentinelConfiguration config = new RedisSentinelConfiguration();
        
        // Master name
        config.setMaster("mymaster");
        
        // Sentinel nodes
        config.sentinel("127.0.0.1", 26379);
        config.sentinel("127.0.0.1", 26380);
        config.sentinel("127.0.0.1", 26381);
        
        // Password
        config.setPassword("password");
        
        return config;
    }
    
    @Bean
    public JedisConnectionFactory jedisConnectionFactory(
            RedisSentinelConfiguration sentinelConfig) {
        return new JedisConnectionFactory(sentinelConfig);
    }
    
    @Bean
    public RedisTemplate<String, Object> redisTemplate(
            JedisConnectionFactory connectionFactory) {
        RedisTemplate<String, Object> template = new RedisTemplate<>();
        template.setConnectionFactory(connectionFactory);
        return template;
    }
}
```

6.7 Sentinel Configuration

Core configuration

```bash
# sentinel.conf
# Note: comments must be on their own line. Redis config files do not
# support trailing comments after a directive.

# ============ Basics ============

# Bind address
bind 0.0.0.0

# Sentinel port
port 26379

# Run in the background
daemonize yes

# Log file
logfile "/var/log/redis/sentinel.log"

# ============ Monitoring ============

# Master to monitor
# sentinel monitor <master-name> <ip> <port> <quorum>
sentinel monitor mymaster 127.0.0.1 6379 2

# Master password
sentinel auth-pass mymaster <password>

# ============ Failure detection ============

# Time without a valid reply before SDOWN, in milliseconds (30 seconds)
sentinel down-after-milliseconds mymaster 30000

# Failover timeout in milliseconds (3 minutes)
sentinel failover-timeout mymaster 180000

# ============ Failover ============

# How many replicas resync with the new master at the same time
sentinel parallel-syncs mymaster 1

# Why 1?
# With 3, three replicas resync at once,
# which puts load on the new master.
# With 1, replicas resync one by one, which is safer.

# ============ Notification ============

# Script run on Sentinel events
sentinel notification-script mymaster /opt/scripts/notify.sh

# Script run when clients need to be reconfigured
sentinel client-reconfig-script mymaster /opt/scripts/reconfig.sh
```

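Tuning parameters

1️⃣ down-after-milliseconds

```text
# Timeout for declaring an instance down
sentinel down-after-milliseconds mymaster 30000

Guidelines:
• Too short (5000): network jitter can cause false positives
• Too long (60000): failures are detected slowly
• Suggested: 30000 (30 seconds, the default)

By environment:
• Same data center: 30000 (stable network)
• Across data centers: 60000 (higher latency)
• Test environment: 5000 (fast feedback)
```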

2️⃣ quorum

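```text
# Quorum
sentinel monitor mymaster 127.0.0.1 6379 2

Guidelines:
• quorum = number of Sentinels / 2 + 1 (recommended)
• at least 2

Examples:
• 3 Sentinels: quorum = 2
• 5 Sentinels: quorum = 3
• 7 Sentinels: quorum = 4

Notes:
• quorum decides when the master is objectively down
• the leader election additionally needs votes from a majority of all Sentinels
• if quorum is set higher than the majority, the leader needs at least quorum votes
```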

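Quorum versus majority (important!):

```text
Scenario: 5 Sentinels, quorum=2

Objective down:
• Needs: quorum = 2 Sentinels in agreement
• Sentinel1 and Sentinel2 consider the master down
• 2 >= 2 ✅ objectively down

Leader election:
• Needs: 5/2 + 1 = 3 votes
• If only 2 Sentinels are reachable, Sentinel1 gets 2 votes
• 2 < 3 ❌ the election fails

Conclusion:
• quorum can be small (failures are flagged quickly)
• the leader election always needs a majority (this prevents split brain)
• if too many Sentinels are down (fewer than a majority left), no leader can be elected

Concrete case:
5 Sentinels, 2 are down, 3 remain
• Objective down: quorum=2, 3 Sentinels are enough
• Leader election: 3 votes needed, 3 Sentinels are just enough
• If one more goes down and 2 remain, no leader can be elected ❌
```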
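The two thresholds can be put side by side in a few lines. `failover_outcome` is a made-up helper. It assumes every reachable Sentinel both sees the master as down and votes for the same candidate, which is the best case:

```python
def failover_outcome(total: int, reachable: int, quorum: int):
    """Return (objectively_down, leader_elected) for the best-case vote."""
    majority = total // 2 + 1
    objectively_down = reachable >= quorum
    votes_needed = max(majority, quorum)  # majority, and never below quorum
    leader_elected = objectively_down and reachable >= votes_needed
    return objectively_down, leader_elected


# 5 Sentinels with quorum=2, losing Sentinels one at a time
for alive in (5, 4, 3, 2, 1):
    print(alive, failover_outcome(total=5, reachable=alive, quorum=2))
```

With 2 reachable Sentinels the master is flagged ODOWN but nobody can be elected, so the failover never starts. That is the gap between quorum and majority described above.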

3️⃣ failover-timeout

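```text
# Failover timeout (180000 ms = 3 minutes)
sentinel failover-timeout mymaster 180000

Purpose:
• A failover that has not finished within this time is treated as failed
• A new attempt can then be started

Guidelines:
• Keep it several times larger than down-after-milliseconds
  (the original rule of thumb here is at least 6 times)
• Suggested: 180000 (3 minutes, the default)
```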

4️⃣ parallel-syncs

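```text
# Number of replicas that resync at the same time
sentinel parallel-syncs mymaster 1

Guidelines:
• parallel-syncs = 1: conservative (recommended)
• parallel-syncs = N: aggressive (more load on the new master)

Scenario:
1 master and 3 replicas, Slave1 becomes the new master

parallel-syncs = 1:
Slave2 → sync → done
                  ↓
              Slave3 → sync → done
Duration: 2 × sync time
Pro: little load on the new master
Con: it takes longer until every replica is back in sync

parallel-syncs = 3:
Slave2 ──→ sync ─→ done
Slave3 ──→ sync ─→ done
Duration: 1 × sync time
Pro: replicas are back in sync quickly
Con: more load on the new master (2 resyncs at once)

Recommendation: parallel-syncs = 1 (stability first)
```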
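The trade-off is a simple ceiling division. The sketch assumes every resync takes about the same time and that replicas are processed in batches of `parallel_syncs`; `resync_rounds` is a name invented for the example:

```python
import math


def resync_rounds(replicas_to_reconfigure: int, parallel_syncs: int) -> int:
    """Number of sequential batches needed to repoint all remaining replicas."""
    if parallel_syncs < 1:
        raise ValueError("parallel_syncs must be at least 1")
    return math.ceil(replicas_to_reconfigure / parallel_syncs)


# 1 master + 3 replicas: one replica is promoted, two are left to resync.
print(resync_rounds(2, parallel_syncs=1))  # two rounds, one replica at a time
print(resync_rounds(2, parallel_syncs=3))  # a single round
```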

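Deployment advice

```text
============ Deployment principles ============

1. Number of Sentinels: odd
Suggested sizes:
• Small deployments: 3 Sentinels
• Medium deployments: 5 Sentinels
• Large deployments: 7 Sentinels (more is rarely needed)

2. Placement: spread them out
❌ Bad:
Sentinel1, Sentinel2, Sentinel3 on the same server
→ the server goes down and every Sentinel goes with it

✅ Recommended:
Sentinel1: server A
Sentinel2: server B
Sentinel3: server C
→ one server goes down and the other Sentinels keep working

3. Network: make sure everything can reach everything
• All Sentinels can talk to each other
• All Sentinels can reach the master and the replicas
• Clients can reach all Sentinels

4. Resources: prefer not to share a server with Redis
• Sentinel itself is lightweight
• Still, avoid placing it on the same machine as a Redis instance where you can
• If that machine fails, you lose the Redis instance and a Sentinel vote together
```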

In Production

Complete configuration example

Sentinel configuration:

```bash
# /etc/redis/sentinel.conf

# ============ Basics ============
port 26379
daemonize yes
pidfile "/var/run/redis-sentinel-26379.pid"
logfile "/var/log/redis/sentinel-26379.log"
dir "/var/lib/redis"

# ============ Monitoring ============
# Master-replica set 1
sentinel monitor mymaster 192.168.1.100 6379 2
sentinel auth-pass mymaster Redis@2024
sentinel down-after-milliseconds mymaster 30000
sentinel failover-timeout mymaster 180000
sentinel parallel-syncs mymaster 1

# Master-replica set 2 (one Sentinel can monitor several masters)
sentinel monitor mymaster2 192.168.1.200 6379 2
sentinel auth-pass mymaster2 Redis@2024
sentinel down-after-milliseconds mymaster2 30000
sentinel failover-timeout mymaster2 180000
sentinel parallel-syncs mymaster2 1

# ============ Notification ============
# Event notification script
sentinel notification-script mymaster /opt/scripts/sentinel-notify.sh

# Client reconfiguration script
sentinel client-reconfig-script mymaster /opt/scripts/sentinel-reconfig.sh

# ============ Logging ============
loglevel notice
```

Notification script example

```bash
#!/bin/bash
# /opt/scripts/sentinel-notify.sh
# Sentinel notification script
#
# Sentinel calls this script with exactly two arguments:
#   $1: event type, for example +sdown
#   $2: event description, for example "master mymaster 192.168.1.100 6379"

EVENT_TYPE=$1
EVENT_DESC=$2
WEBHOOK="https://oapi.dingtalk.com/robot/send?access_token=xxx"

# Log every event
echo "[$(date)] Sentinel Event: $EVENT_TYPE $EVENT_DESC" >> /var/log/sentinel-events.log

# Send a text alert (example: DingTalk robot)
notify() {
    curl -s -X POST "$WEBHOOK" \
         -H 'Content-Type: application/json' \
         -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"$1\"}}"
}

case "$EVENT_TYPE" in
    "+sdown")
        # Subjectively down
        notify "⚠️ Redis subjectively down: $EVENT_DESC"
        ;;
    "+odown")
        # Objectively down
        notify "🚨 Redis objectively down: $EVENT_DESC, failover is starting"
        ;;
    "+failover-end")
        # Failover finished (the description still names the old master)
        notify "✅ Redis failover finished: $EVENT_DESC"
        ;;
    "+switch-master")
        # Description: <master-name> <old-ip> <old-port> <new-ip> <new-port>
        read -r NAME OLD_IP OLD_PORT NEW_IP NEW_PORT <<< "$EVENT_DESC"
        notify "🔄 Redis master switched: $NAME $OLD_IP:$OLD_PORT -> $NEW_IP:$NEW_PORT"
        ;;
esac

exit 0
```

Monitoring script

```bash
#!/bin/bash
# sentinel-monitor.sh
# Health check for the Sentinel cluster

SENTINEL_HOSTS=("127.0.0.1:26379" "127.0.0.1:26380" "127.0.0.1:26381")
MASTER_NAME="mymaster"

echo "========================================="
echo "Sentinel cluster health check"
echo "Time: $(date)"
echo "========================================="

for sentinel in "${SENTINEL_HOSTS[@]}"; do
    IFS=':' read -r host port <<< "$sentinel"

    echo ""
    echo "Checking Sentinel: $host:$port"
    echo "-----------------------------------------"

    # Is the Sentinel reachable?
    if redis-cli -h "$host" -p "$port" PING > /dev/null 2>&1; then
        echo "✅ Sentinel is up"

        # Master details. redis-cli prints one field name per line followed by
        # its value, so match the field name exactly (-x) and take the next line.
        master_info=$(redis-cli -h "$host" -p "$port" SENTINEL master "$MASTER_NAME")
        master_ip=$(echo "$master_info" | grep -x -A1 "ip" | tail -1)
        master_port=$(echo "$master_info" | grep -x -A1 "port" | tail -1)
        master_flags=$(echo "$master_info" | grep -x -A1 "flags" | tail -1)

        echo "  Master: $master_ip:$master_port"
        echo "  Flags: $master_flags"

        # Number of replicas (one "name" field per replica)
        slaves=$(redis-cli -h "$host" -p "$port" SENTINEL slaves "$MASTER_NAME")
        slave_count=$(echo "$slaves" | grep -c -x "name")
        echo "  Replicas: $slave_count"

        # Number of other Sentinels
        sentinels=$(redis-cli -h "$host" -p "$port" SENTINEL sentinels "$MASTER_NAME")
        sentinel_count=$(echo "$sentinels" | grep -c -x "name")
        echo "  Other Sentinels: $sentinel_count"

    else
        echo "❌ Sentinel is down"
    fi
done

echo ""
echo "========================================="
echo "Check complete"
echo "========================================="
```

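Failover drill

```bash
#!/bin/bash
# failover-drill.sh
# Failover drill script
#
# Note: DEBUG SLEEP blocks the master for the given number of seconds.
# On Redis 7.0 and later the DEBUG command has to be enabled first
# (enable-debug-command).

echo "🎯 Starting failover drill..."
echo ""

# 1. Record the current master
current_master=$(redis-cli -h 127.0.0.1 -p 26379 SENTINEL get-master-addr-by-name mymaster | xargs)
echo "📍 Current master: $current_master"

# 2. Simulate a master failure
echo "💥 Simulating a master failure (blocking the master process)"
master_ip=$(echo "$current_master" | cut -d' ' -f1)
master_port=$(echo "$current_master" | cut -d' ' -f2)
redis-cli -h "$master_ip" -p "$master_port" DEBUG sleep 60 &

# 3. Watch what Sentinel does
echo ""
echo "👀 Watching Sentinel detect the failure and switch..."
sleep 5

failed_over=0
for i in {1..12}; do
    new_master=$(redis-cli -h 127.0.0.1 -p 26379 SENTINEL get-master-addr-by-name mymaster | xargs)
    timestamp=$(date '+%H:%M:%S')

    echo "[$timestamp] [$i] Master: $new_master"

    if [ "$new_master" != "$current_master" ]; then
        echo ""
        echo "✅ Failover complete!"
        echo "   Old master: $current_master"
        echo "   New master: $new_master"
        echo ""
        failed_over=1
        break
    fi

    sleep 5
done

# Stop here if the master never changed; writing to the old master proves nothing
if [ "$failed_over" -ne 1 ]; then
    echo "❌ No failover observed within the drill window"
    exit 1
fi

# 4. Verify the new master
new_master_ip=$(echo "$new_master" | cut -d' ' -f1)
new_master_port=$(echo "$new_master" | cut -d' ' -f2)

echo "🔍 Verifying that the new master accepts writes..."
result=$(redis-cli -h "$new_master_ip" -p "$new_master_port" SET test_failover_key "failover_success_$(date +%s)" 2>&1)

if [[ $result == "OK" ]]; then
    echo "✅ Write to the new master succeeded"
else
    echo "❌ Write to the new master failed: $result"
fi

echo ""
echo "🎉 Drill complete"
```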

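FAQ

Q1: Is Sentinel itself highly available?

A: Yes. The Sentinels in a cluster monitor each other.

```text
• If one Sentinel goes down, the others keep working
• Deploy 3 to 5 Sentinels
• Spread them across different servers
```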

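Q2: What is the difference between quorum and the election majority?

A: These two are the easiest concepts to mix up.

```text
quorum: decides objective down
• Configured in: sentinel monitor mymaster 127.0.0.1 6379 2
• Meaning: 2 Sentinels that consider the master down are enough

Majority: leader election
• Computed as: number of Sentinels / 2 + 1
• Meaning: a Sentinel must win a majority of votes to become leader

Example (5 Sentinels, quorum=2):
• Objective down: 2 Sentinels in agreement are enough
• Leader election: 3 votes are needed

Notes:
• quorum can be smaller than the majority
• the leader election always needs a majority
• so if too many Sentinels are down, no leader can be elected
```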

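Q3: Does Sentinel monitor replicas?

A: Yes, but it never fails over a replica.

```text
Sentinel monitors replicas:
• It sends PING to check that they are alive
• It marks them subjectively down
• It does not run a failover for them

Why:
• If one replica is down, other replicas remain
• Writes are unaffected (the master is healthy)
• There is nothing to fail over
```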

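Q4: Can I write data during a failover?

A: No. There is a short service interruption.

```text
Master down → failover complete
Writes fail during this window (detection time plus roughly 10-30 seconds)

Mitigations:
1. Implement retries in the client
2. Buffer write requests in a message queue
3. Degrade gracefully at the application layer
```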
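The first mitigation, client-side retries, can be sketched as a small wrapper with exponential backoff. Everything here is an assumption for illustration: `call_with_retry` is not part of any Redis client, and the built-in `ConnectionError` stands in for whatever exception your client library raises while the master is unavailable.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying on ConnectionError with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 0.5 s, 1 s, 2 s, ...
    raise last_error


# Simulated write that fails twice (master unavailable) and then succeeds.
state = {"calls": 0}

def flaky_write():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("master unavailable")
    return "OK"

print(call_with_retry(flaky_write, base_delay=0.01))
```

Size the total retry window to cover the detection time plus the failover itself. Otherwise the retries give up before the new master is available.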

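Q5: Can Sentinel run on the same machine as Redis?

A: It can, but separate machines are safer.

```text
❌ Not recommended:
Sentinel1 and the master on the same server
→ the server goes down and that Sentinel goes with it

✅ Recommended:
Sentinel and Redis on separate machines
→ a Redis server goes down and the Sentinels can still detect it and switch
```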

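Q6: How do I trigger a failover manually?

```bash
# Force a failover
127.0.0.1:26379> SENTINEL failover mymaster
OK

# Use cases:
# • testing the failover procedure
# • switching masters on purpose (for example before maintenance)
# • disaster recovery drills
```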

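Summary

This article took a close look at Redis Sentinel:

Architecture

🔍 Sentinel cluster: several Sentinels monitor each other
🎯 Master monitoring: continuous heartbeats
🤖 Automatic failover: no human in the loop
📢 Configuration provider: clients discover the master automatically

Failure detection

👁️ Subjective down (SDOWN): the view of a single Sentinel
👀 Objective down (ODOWN): agreement among at least quorum Sentinels
⏱️ down-after-milliseconds: the timeout that decides SDOWN
🔢 quorum: how many Sentinels must agree

Leader election

🗳️ Raft-like voting: a simplified scheme, not full Raft
🎫 Voting: first come, first served, one vote per epoch
👑 Majority: more than half of all Sentinels must vote for the leader
🔄 epoch: the election round (prevents double voting)

Failover

🎯 Choosing the new master: priority → offset → run_id
🔄 Promotion: REPLICAOF NO ONE
📡 Configuration propagation: synchronized over Pub/Sub
📢 Client notification: the +switch-master event

Best practices

✅ Deploy an odd number of Sentinels (3/5/7)
✅ Spread them out (different servers or data centers)
✅ quorum = number of Sentinels / 2 + 1
✅ Monitor the Sentinels themselves
✅ Run failover drills regularly
✅ Use a Sentinel-aware client library

Understanding Sentinel helps you:

✅ Make Redis highly available
✅ Fail over automatically within seconds
✅ Reduce operational work
✅ Improve system stability

💡 Coming next: "Redis Cluster in Depth: Distributed Architecture and Slots"

From high availability to horizontal scaling: Redis Cluster takes you past the limits of a single machine!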