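Redis Sentinel: Principles and High-Availability Implementation

Preface

Redis Sentinel is the official high-availability (HA) solution for Redis. It monitors a master-replica deployment and, when the Master fails, automatically performs a failover so that the service stays available. This article walks through Sentinel's architecture, its core algorithms, simplified sketches of the logic in sentinel.c, and production practice.

Tags: Redis, Sentinel, high availability, source code analysis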

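1. Sentinel Architecture Overview

1.1 Why Sentinel?

Plain Redis master-replica replication has the following problems:

Single point of failure: once the Master goes down, nothing can be written
Manual recovery: an operator has to promote a Slave to Master by hand
Client configuration: clients have to learn about the Master change somehow

What Sentinel provides:

Automatic failure detection: continuously monitors the Master and its Slaves
Automatic failover: when the Master fails, a Slave is promoted to be the new Master
Configuration provider: clients ask Sentinel for the current Master address
Notification: events are published over Pub/Sub, and an optional script can alert administrators

1.2 Sentinel Architecture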

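[Architecture diagram: the client (an application server) queries and subscribes to a Sentinel cluster of three nodes (Sentinel-1, Sentinel-2 and Sentinel-3, each on port 26379). The Redis data nodes are a Master (:6379) that replicates to Replica-1 (:6380) and Replica-2 (:6381).]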

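Key characteristics:

Distributed: several Sentinel nodes cooperate, so Sentinel itself is not a single point of failure
Quorum: the Master is marked objectively down only when at least quorum Sentinels agree; the failover itself must additionally be authorized by a majority of all Sentinels
Auto-discovery: Sentinels find each other through the Master's __sentinel__:hello channel
Client-friendly: Master changes are announced over Pub/Sub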

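1.3 Sentinel Compared with Other HA Options

Feature	Redis Sentinel	Redis Cluster	Keepalived
Architecture	Master-replica replication plus Sentinels	Decentralized cluster	VIP failover
Failure detection	Quorum among Sentinels	Gossip protocol	VRRP
Failover	Automatic election	Automatic election	VIP switch
Sharding	Not supported	Supported (16384 slots)	Not supported
Client	Queries and subscribes to Sentinel	Smart client	Connects to the VIP
Deployment complexity	Medium	Higher	Low
Best suited for	Highly available master-replica	Horizontal scaling	Simple master-replica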

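2. Core Principles

2.1 Three Periodic Tasks

Sentinel relies on three periodic tasks for monitoring and failure detection:

Simplified sketch (modeled on sentinel.c in Redis 7.2.0, not verbatim source):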

// Simplified sketch of Sentinel's periodic work for one monitored instance.
// Not verbatim source: in sentinel.c this logic is spread across
// sentinelHandleRedisInstance() and sentinelSendPeriodicCommands().
// The next_*_time fields and sendInfo() are illustrative names.
void sentinelPeriodicSketch(sentinelRedisInstance *ri) {
    mstime_t now = mstime();

    /* 1. Every 10 s: send INFO to the Master and Slaves (not to Sentinels) */
    if (!(ri->flags & SRI_SENTINEL) && now >= ri->next_info_time) {
        sendInfo(ri);
        ri->next_info_time = now + 10000;
    }

    /* 2. Every 1 s: send PING to the Master, Slaves and other Sentinels */
    if (now >= ri->next_ping_time) {
        sentinelSendPing(ri);
        ri->next_ping_time = now + 1000;
    }

    /* 3. Every 2 s: publish a hello message on __sentinel__:hello */
    if (now >= ri->next_hello_time) {
        sentinelSendHello(ri);
        ri->next_hello_time = now + 2000;
    }

    /* 4. Failure detection; ODOWN and failover apply to Masters only */
    sentinelCheckSubjectivelyDown(ri);
    if (ri->flags & SRI_MASTER) {
        sentinelCheckObjectivelyDown(ri);
        sentinelFailoverStateMachine(ri);
    }
}

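The three tasks in detail:

Task	Period	Purpose	Command
INFO	Every 10 s (every 1 s for Slaves while the Master is down or a failover is in progress)	Discover new Slaves, read role and replication state	INFO
PING	Every 1 s	Check whether the instance is alive	PING
HELLO	Every 2 s	Discover other Sentinels, propagate the current epoch and Master configuration	PUBLISH __sentinel__:hello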
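
To make the timing concrete, here is a minimal Python sketch of how a monitor might decide which periodic command is due. It assumes periods of 10 s for INFO, 1 s for PING and 2 s for hello messages; `InstanceTimers` and `due_commands` are hypothetical names for this illustration and are not part of Redis.

```python
from dataclasses import dataclass
from typing import List

# Periods in milliseconds (assumed: INFO 10 s, PING 1 s, hello 2 s).
INFO_PERIOD_MS = 10_000
PING_PERIOD_MS = 1_000
HELLO_PERIOD_MS = 2_000


@dataclass
class InstanceTimers:
    """Hypothetical bookkeeping: when each command was last sent (ms)."""
    last_info: int = 0
    last_ping: int = 0
    last_hello: int = 0


def due_commands(timers: InstanceTimers, now_ms: int,
                 is_sentinel: bool = False) -> List[str]:
    """Return the commands due at now_ms and record them as sent."""
    due = []
    # INFO goes only to masters and replicas, never to other Sentinels.
    if not is_sentinel and now_ms - timers.last_info >= INFO_PERIOD_MS:
        due.append("INFO")
        timers.last_info = now_ms
    if now_ms - timers.last_ping >= PING_PERIOD_MS:
        due.append("PING")
        timers.last_ping = now_ms
    if now_ms - timers.last_hello >= HELLO_PERIOD_MS:
        due.append("HELLO")
        timers.last_hello = now_ms
    return due


if __name__ == "__main__":
    timers = InstanceTimers()
    # Simulate ten one-second ticks for a master or replica.
    for now in range(1_000, 11_000, 1_000):
        print(now, due_commands(timers, now))
```

Keeping a "last sent" timestamp per command, rather than one global timer, lets each period be tuned independently for each monitored instance.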

2.2 Subjective and Objective Down

Sentinel decides that an instance has failed in two stages:

Subjectively down (SDOWN):

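// Simplified sketch of the SDOWN check (modeled on sentinel.c, Redis 7.2.0)
void sentinelCheckSubjectivelyDown(sentinelRedisInstance *ri) {
    /* 1. Time since the instance last gave a valid reply */
    mstime_t elapsed = mstime() - ri->link->last_avail_time;

    /* 2. No valid reply for longer than down-after-milliseconds: SDOWN.
     *    This applies to Masters, Slaves and other Sentinels alike. */
    if (elapsed > ri->down_after_period) {
        if (!(ri->flags & SRI_S_DOWN)) {
            sentinelEvent(LL_WARNING, "+sdown", ri, "%@");
            ri->flags |= SRI_S_DOWN;
        }
    } else if (ri->flags & SRI_S_DOWN) {
        /* 3. The instance answers again: clear the flag */
        sentinelEvent(LL_WARNING, "-sdown", ri, "%@");
        ri->flags &= ~SRI_S_DOWN;
    }
}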

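Objectively down (ODOWN):

// Simplified sketch of the ODOWN check (modeled on sentinel.c, Redis 7.2.0)
void sentinelCheckObjectivelyDown(sentinelRedisInstance *master) {
    dictIterator *di;
    dictEntry *de;
    unsigned int quorum = 0;

    /* 1. Only a Master this Sentinel already sees as SDOWN can become ODOWN */
    if (master->flags & SRI_S_DOWN) {
        quorum = 1;  /* count this Sentinel itself */

        /* 2. Add the other Sentinels that reported the Master as down */
        di = dictGetIterator(master->sentinels);
        while ((de = dictNext(di)) != NULL) {
            sentinelRedisInstance *ri = dictGetVal(de);
            if (ri->flags & SRI_MASTER_DOWN) quorum++;
        }
        dictReleaseIterator(di);
    }

    /* 3. At least master->quorum Sentinels agree: objectively down */
    if (quorum > 0 && quorum >= master->quorum) {
        if (!(master->flags & SRI_O_DOWN)) {
            sentinelEvent(LL_WARNING, "+odown", master, "%@ #quorum %u/%u",
                          quorum, master->quorum);
            master->flags |= SRI_O_DOWN;
        }
    } else if (master->flags & SRI_O_DOWN) {
        /* 4. Agreement is gone: clear the flag */
        sentinelEvent(LL_WARNING, "-odown", master, "%@");
        master->flags &= ~SRI_O_DOWN;
    }
}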

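Failure detection flow:

Online -> Subjectively down (SDOWN): no valid PING reply for longer than down-after-milliseconds
Subjectively down (SDOWN) -> Online: PING replies resume
Subjectively down (SDOWN) -> Objectively down (ODOWN): at least quorum Sentinels agree
Objectively down (ODOWN) -> Online: the Master recovers before a failover has started
Objectively down (ODOWN) -> Failover: a Leader Sentinel is elected
Failover -> Online: the new Master is up

SDOWN is the judgment of a single Sentinel, controlled by down-after-milliseconds (default 30000 ms, i.e. 30 seconds).
ODOWN is the judgment of several Sentinels, controlled by the quorum argument of sentinel monitor (it has no default; 2 is typical for 3 Sentinels).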
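
The two-stage decision can be modeled in a few lines of Python. This is a sketch under simplifying assumptions, not Sentinel's actual code: `is_sdown` and `is_odown` are hypothetical helpers, and the peer reports are passed in as a plain dictionary instead of being collected over the network.

```python
from typing import Dict


def is_sdown(now_ms: int, last_valid_reply_ms: int,
             down_after_ms: int = 30_000) -> bool:
    """One Sentinel's local view: silence for longer than down_after_ms."""
    return now_ms - last_valid_reply_ms > down_after_ms


def is_odown(local_sdown: bool, peer_reports: Dict[str, bool],
             quorum: int) -> bool:
    """ODOWN needs the local SDOWN plus enough agreeing peers to reach quorum."""
    if not local_sdown:
        return False
    # Count this Sentinel itself, then every peer that also reports "down".
    agreeing = 1 + sum(1 for down in peer_reports.values() if down)
    return agreeing >= quorum


if __name__ == "__main__":
    now = 100_000
    local = is_sdown(now, last_valid_reply_ms=65_000)  # 35 s of silence
    peers = {"sentinel-2": True, "sentinel-3": False}
    print("SDOWN:", local)
    print("ODOWN:", is_odown(local, peers, quorum=2))
```

Note that a Sentinel never declares ODOWN for a Master it can still reach itself, no matter what its peers report.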

2.3 Sentinel Leader Election

Once the Master has been marked objectively down, the Sentinels elect a Leader that will carry out the failover:

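// Simplified sketch of leader election; not verbatim Redis source. In
// sentinel.c the vote requests are sent by
// sentinelAskMasterStateToOtherSentinels() and the replies are tallied later,
// asynchronously, by sentinelGetLeader(). requestVote() is an illustrative helper.
void sentinelLeaderElectionSketch(sentinelRedisInstance *master) {
    dictIterator *di;
    dictEntry *de;

    /* 1. Ask every other Sentinel for its vote:
     *    SENTINEL is-master-down-by-addr <ip> <port> <current-epoch> <runid>
     *    Sending our own run ID (instead of "*") requests a vote. */
    di = dictGetIterator(master->sentinels);
    while ((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        requestVote(ri, master, sentinel.current_epoch, sentinel.myid);
    }
    dictReleaseIterator(di);

    /* 2. Count the votes cast for this Sentinel in the current epoch */
    unsigned int voters = dictSize(master->sentinels) + 1;
    unsigned int votes = 1;  /* this Sentinel votes for itself */
    di = dictGetIterator(master->sentinels);
    while ((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        if (ri->leader != NULL &&
            ri->leader_epoch == sentinel.current_epoch &&
            strcasecmp(ri->leader, sentinel.myid) == 0) votes++;
    }
    dictReleaseIterator(di);

    /* 3. The Leader needs an absolute majority and at least quorum votes */
    if (votes >= voters / 2 + 1 && votes >= master->quorum) {
        master->flags |= SRI_FAILOVER_IN_PROGRESS;
        master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_START;
    }
}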

Election flow (example with three Sentinels):

1. Sentinel-1, Sentinel-2 and Sentinel-3 each detect that the Master is objectively down.
2. In epoch 1 every Sentinel asks the others for a vote with is-master-down-by-addr, carrying its own run ID (runid1, runid2, runid3).
3. The votes are counted: runid1 has 1 vote, runid2 has 1 vote, runid3 has 1 vote. The epoch is the same and the result is a tie, so nobody is elected.
4. The epoch is updated and the vote is repeated: Sentinel-1 sends is-master-down-by-addr runid1 epoch2 to Sentinel-2 and Sentinel-3.
5. Both reply down-yes-leader runid1. Sentinel-1 now holds 3 votes (including its own) and becomes the Leader.
6. Sentinel-1 starts the failover.
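
The tie and the re-vote can be reproduced with a small Python model. It is a sketch that assumes each Sentinel grants at most one vote per epoch on a first-come-first-served basis, and that the winner needs both a majority of all Sentinels and at least quorum votes. `VotingSentinel` and `elect` are hypothetical names, not Redis APIs.

```python
from typing import Dict, Optional


class VotingSentinel:
    """Hypothetical model of one Sentinel's voting state."""

    def __init__(self, run_id: str):
        self.run_id = run_id
        self.voted_epoch = 0
        self.voted_for: Optional[str] = None

    def request_vote(self, candidate: str, epoch: int) -> Optional[str]:
        """Grant at most one vote per epoch, first come first served."""
        if epoch > self.voted_epoch:
            self.voted_epoch = epoch
            self.voted_for = candidate
        # Report the vote only if it belongs to the requested epoch.
        return self.voted_for if epoch == self.voted_epoch else None


def elect(candidate: str, epoch: int,
          sentinels: Dict[str, VotingSentinel], quorum: int) -> bool:
    """True if candidate gets a majority of all Sentinels and >= quorum votes."""
    votes = sum(1 for s in sentinels.values()
                if s.request_vote(candidate, epoch) == candidate)
    majority = len(sentinels) // 2 + 1
    return votes >= majority and votes >= quorum


if __name__ == "__main__":
    nodes = {rid: VotingSentinel(rid) for rid in ("runid1", "runid2", "runid3")}
    # Epoch 1: every Sentinel votes for itself first, so nobody wins.
    for rid, node in nodes.items():
        node.request_vote(rid, epoch=1)
    print("epoch 1:", elect("runid1", 1, nodes, quorum=2))
    # Epoch 2: runid1 asks first and collects all three votes.
    print("epoch 2:", elect("runid1", 2, nodes, quorum=2))
```

Because a vote is bound to an epoch, a Sentinel that lost one round can only win by starting a new round with a higher epoch.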

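3. Failover Procedure

3.1 The Complete Failover Flow

The failover moves through these states:

Wait for start: a Leader Sentinel has been elected
Select Slave: pick the Slave with the highest promotion priority
Send SLAVEOF NO ONE: tell the selected Slave to stop replicating
Wait for promotion: the Slave becomes the Master
Resync the other Slaves: they start replicating from the new Master
Broadcast the new configuration: the failover is complete

Selection criteria, in order:

Lowest slave-priority (a Slave with priority 0 is never promoted)
Largest replication offset
Smallest runid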

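3.2 Slave Selection Algorithm

// Simplified sketch of Slave selection (modeled on sentinelSelectSlave() and
// compareSlavesForPromotion() in sentinel.c; not verbatim source).
// isBetterCandidate() is an illustrative helper.
static int isBetterCandidate(sentinelRedisInstance *a, sentinelRedisInstance *b) {
    /* Rule 1: lower slave-priority wins */
    if (a->slave_priority != b->slave_priority)
        return a->slave_priority < b->slave_priority;
    /* Rule 2: larger replication offset (more up-to-date data) wins */
    if (a->slave_repl_offset != b->slave_repl_offset)
        return a->slave_repl_offset > b->slave_repl_offset;
    /* Rule 3: lexicographically smaller runid wins (deterministic tie-break) */
    return strcasecmp(a->runid, b->runid) < 0;
}

sentinelRedisInstance *sentinelSelectSlave(sentinelRedisInstance *master) {
    sentinelRedisInstance *best = NULL;
    dictIterator *di;
    dictEntry *de;

    /* 1. Walk all Slaves of this Master */
    di = dictGetIterator(master->slaves);
    while ((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);

        /* 2. Skip Slaves that are down, disconnected, or excluded by priority 0 */
        if (slave->flags & (SRI_S_DOWN | SRI_O_DOWN)) continue;
        if (slave->link->disconnected) continue;
        if (slave->slave_priority == 0) continue;

        /* 3. Keep the best candidate seen so far */
        if (best == NULL || isBetterCandidate(slave, best)) best = slave;
    }
    dictReleaseIterator(di);

    return best;  /* NULL if no Slave qualifies */
}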

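Slave selection priority:

Order	Criterion	Notes
1	Lowest `slave-priority`	Set in the Slave's configuration file (named `replica-priority` since Redis 5), default 100; lower values are preferred and 0 means never promote
2	Largest replication offset	The Slave with the most recent data, closest to the old Master
3	Lexicographically smallest `runid`	A deterministic tie-break, so the result does not depend on iteration order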
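
The three-level ordering maps naturally onto a tuple sort key. The Python sketch below illustrates the rule only; the `Replica` class and `select_replica` function are hypothetical and simply hold the fields the ordering needs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    """Hypothetical snapshot of what a Sentinel knows about one replica."""
    run_id: str
    priority: int          # slave-priority; 0 means "never promote"
    repl_offset: int       # replication offset reported by INFO
    is_down: bool = False


def select_replica(replicas: List[Replica]) -> Optional[Replica]:
    """Pick the promotion candidate, or None if no replica qualifies."""
    candidates = [r for r in replicas if not r.is_down and r.priority != 0]
    if not candidates:
        return None
    # Lower priority first, then larger offset, then smaller run ID.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))


if __name__ == "__main__":
    pool = [
        Replica("bbb", priority=100, repl_offset=500),
        Replica("aaa", priority=100, repl_offset=500),
        Replica("ccc", priority=100, repl_offset=900, is_down=True),
        Replica("ddd", priority=0, repl_offset=999),
    ]
    print(select_replica(pool))
```

Negating the offset inside the key lets a single `min` call express "smallest priority, largest offset, smallest run ID".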

3.3 Executing the Failover

// Simplified sketch of the failover state machine (modeled on sentinel.c,
// Redis 7.2.0; not verbatim source). The fixed 5-second waits are a
// simplification: the real code waits for INFO to confirm each step and
// aborts when failover-timeout expires.
void sentinelFailoverStateMachine(sentinelRedisInstance *master) {
    if (!(master->flags & SRI_FAILOVER_IN_PROGRESS)) return;

    switch (master->failover_state) {
        case SENTINEL_FAILOVER_STATE_WAIT_START:
            /* Wait before starting */
            if (mstime() - master->failover_start_time < 5000) break;
            master->failover_state = SENTINEL_FAILOVER_STATE_SELECT_SLAVE;
            break;

        case SENTINEL_FAILOVER_STATE_SELECT_SLAVE:
            /* Pick the Slave to promote */
            master->promoted_slave = sentinelSelectSlave(master);
            if (!master->promoted_slave) {
                sentinelEvent(LL_WARNING, "-failover-abort-no-good-slave",
                              master, "%@");
                sentinelAbortFailover(master);  /* reset the state, clear the flag */
                return;
            }
            master->failover_state = SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE;
            break;

        case SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE:
            /* Send SLAVEOF NO ONE (a NULL address means "no one") */
            sentinelSendSlaveOf(master->promoted_slave, NULL);
            master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_PROMOTION;
            master->failover_state_change_time = mstime();
            break;

        case SENTINEL_FAILOVER_STATE_WAIT_PROMOTION:
            /* Wait for the Slave to report itself as a Master */
            if (mstime() - master->failover_state_change_time < 5000) break;
            master->failover_state = SENTINEL_FAILOVER_STATE_RECONF_SLAVES;
            break;

        case SENTINEL_FAILOVER_STATE_RECONF_SLAVES:
            /* Point the remaining Slaves at the new Master */
            sentinelFailoverReconfNextSlave(master);
            master->failover_state = SENTINEL_FAILOVER_STATE_UPDATE_CONFIG;
            break;

        case SENTINEL_FAILOVER_STATE_UPDATE_CONFIG:
            /* Persist the new configuration; hello messages spread it */
            sentinelFlushConfig();
            sentinelEvent(LL_WARNING, "+failover-end", master, "%@");
            master->flags &= ~SRI_FAILOVER_IN_PROGRESS;
            master->failover_state = SENTINEL_FAILOVER_STATE_NONE;
            break;
    }
}

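4. Sentinel Configuration

4.1 Core Parameters

# sentinel.conf

# === Monitoring ===
# Format: sentinel monitor <master-name> <ip> <port> <quorum>
sentinel monitor mymaster 127.0.0.1 6379 2

# === Failure detection ===
# Time without a valid reply before an instance is considered subjectively down (ms)
sentinel down-after-milliseconds mymaster 30000

# Failover timeout (ms)
sentinel failover-timeout mymaster 180000

# === Parallel resynchronization ===
# Number of Slaves reconfigured to sync with the new Master at the same time
sentinel parallel-syncs mymaster 1

# === Scripts ===
# Script called when a warning-level event occurs (for example +sdown, +odown)
# sentinel notification-script mymaster /var/redis/notify.sh

# Script called when the Master address changes after a failover
# sentinel client-reconfig-script mymaster /var/redis/reconfig.sh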

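Parameter details:

Parameter	Default	Description	Recommendation
`quorum`	None (required argument of sentinel monitor)	Number of Sentinels that must agree before the Master is marked objectively down	Number of Sentinels / 2 + 1
`down-after-milliseconds`	30000 ms	Time before an unresponsive instance is marked subjectively down	Tune to network conditions
`failover-timeout`	180000 ms	Timeout for the failover as a whole	3 to 5 minutes
`parallel-syncs`	1	Number of Slaves reconfigured at the same time	Can be raised when there are many Slaves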
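
A quick way to catch a mis-set quorum before deployment is to parse the configuration and compare it with the planned number of Sentinels. The Python sketch below is an illustration only: `parse_sentinel_conf` and `check_master` are hypothetical helpers, they handle just the `sentinel <directive> <master-name> ...` lines shown above, and the warning texts are made up for this example.

```python
from typing import Dict, List


def parse_sentinel_conf(text: str) -> Dict[str, Dict[str, str]]:
    """Collect `sentinel <directive> <master-name> ...` lines per master."""
    masters: Dict[str, Dict[str, str]] = {}
    for raw in text.splitlines():
        parts = raw.split("#", 1)[0].split()   # drop comments, then tokenize
        if len(parts) < 4 or parts[0] != "sentinel":
            continue
        directive, name, args = parts[1], parts[2], parts[3:]
        entry = masters.setdefault(name, {})
        if directive == "monitor" and len(args) == 3:
            entry.update(ip=args[0], port=args[1], quorum=args[2])
        elif len(args) == 1:
            entry[directive] = args[0]
    return masters


def check_master(entry: Dict[str, str], total_sentinels: int) -> List[str]:
    """Return human-readable warnings for one monitored master."""
    warnings = []
    quorum = int(entry["quorum"])
    majority = total_sentinels // 2 + 1
    if quorum > total_sentinels:
        warnings.append("quorum exceeds the number of Sentinels")
    elif quorum > majority:
        warnings.append("quorum is higher than the majority")
    return warnings


if __name__ == "__main__":
    conf = """
    sentinel monitor mymaster 127.0.0.1 6379 2
    sentinel down-after-milliseconds mymaster 30000
    sentinel failover-timeout mymaster 180000
    """
    for name, entry in parse_sentinel_conf(conf).items():
        print(name, entry, check_master(entry, total_sentinels=3))
```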

4.2 Production Configuration

# sentinel.conf - suggested production settings

# === Basics ===
port 26379
daemonize yes
logfile "/var/log/redis/sentinel.log"
dir "/var/lib/redis"

# === Monitoring ===
sentinel monitor mymaster 192.168.1.10 6379 2
sentinel down-after-milliseconds mymaster 10000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 300000

# === Authentication ===
# requirepass protects the Sentinel itself; if the Master requires a password,
# also set: sentinel auth-pass mymaster <password>
# requirepass "your_sentinel_password"

# === Scripts ===
sentinel notification-script mymaster /usr/local/bin/redis-notify.sh
sentinel client-reconfig-script mymaster /usr/local/bin/redis-reconfig.sh

# === Other Sentinels ===
# Discovered automatically, no manual configuration needed

4.3 Monitoring Several Masters

# One Sentinel can monitor several Masters at once
sentinel monitor prod-master 192.168.1.10 6379 2
sentinel monitor cache-master 192.168.1.20 6379 2
sentinel monitor session-master 192.168.1.30 6379 2

5. Client Integration

5.1 Java Client (Jedis)

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPoolConfig;
import redis.clients.jedis.JedisSentinelPool;
import java.util.HashSet;
import java.util.Set;

public class RedisSentinelExample {
    public static void main(String[] args) {
        // 1. Sentinel addresses
        Set<String> sentinels = new HashSet<>();
        sentinels.add("127.0.0.1:26379");
        sentinels.add("127.0.0.1:26380");
        sentinels.add("127.0.0.1:26381");

        // 2. Create the pool; it asks the Sentinels for the current Master
        //    and is closed automatically by try-with-resources
        try (JedisSentinelPool pool = new JedisSentinelPool(
                "mymaster",             // Master name
                sentinels,              // Sentinel addresses
                new JedisPoolConfig(),  // pool settings (defaults)
                2000,                   // timeout in milliseconds
                "password")) {          // Redis password

            // 3. Borrow a connection
            try (Jedis jedis = pool.getResource()) {
                // 4. Run commands
                jedis.set("key", "value");
                String value = jedis.get("key");
                System.out.println("Value: " + value);
            }
        }
    }
}

5.2 Python Client (redis-py)

from redis.sentinel import Sentinel

# 1. Sentinel addresses
sentinel = Sentinel([
    ('127.0.0.1', 26379),
    ('127.0.0.1', 26380),
    ('127.0.0.1', 26381),
], socket_timeout=0.5)

# 2. Connection to the current Master
master = sentinel.master_for(
    'mymaster',
    socket_timeout=0.5,
    password='password'
)

# 3. Connection to a Slave (read-only)
slave = sentinel.slave_for(
    'mymaster',
    socket_timeout=0.5,
    password='password'
)

# 4. Run commands
master.set('key', 'value')
# Replication is asynchronous, so a read from a Slave issued right after
# the write may not see the new value yet
value = slave.get('key')
print(f"Value: {value}")

5.3 Subscribing to Sentinel Events

import redis

def update_master_config(ip, port):
    """Placeholder: plug in the application's own reconfiguration logic."""
    print(f"New Master: {ip}:{port}")

# Subscribe to Master switch events
def subscribe_sentinel_events():
    sentinel = redis.StrictRedis(
        host='127.0.0.1',
        port=26379
    )

    pubsub = sentinel.pubsub()

    # Sentinel publishes each event on a channel named after the event
    # (for example +sdown, +odown, +switch-master), so '*' matches them all
    pubsub.psubscribe('*')

    for message in pubsub.listen():
        if message['type'] == 'pmessage':
            channel = message['channel'].decode()
            data = message['data'].decode()

            if channel == '+switch-master':
                print(f"Master switch event: {data}")
                # Payload: <master-name> <old-ip> <old-port> <new-ip> <new-port>
                parts = data.split()
                new_master_ip = parts[3]
                new_master_port = int(parts[4])

                # Update the application's configuration
                update_master_config(new_master_ip, new_master_port)

if __name__ == "__main__":
    subscribe_sentinel_events()
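
Parsing the event payload is easier to test when it is kept separate from the network code. Here is a minimal sketch that assumes the five-field payload `<master-name> <old-ip> <old-port> <new-ip> <new-port>`; `MasterSwitch` and `parse_switch_master` are hypothetical names introduced for this example.

```python
from typing import NamedTuple


class MasterSwitch(NamedTuple):
    master_name: str
    old_ip: str
    old_port: int
    new_ip: str
    new_port: int


def parse_switch_master(payload: str) -> MasterSwitch:
    """Parse '<master-name> <old-ip> <old-port> <new-ip> <new-port>'."""
    parts = payload.split()
    if len(parts) != 5:
        raise ValueError(f"unexpected +switch-master payload: {payload!r}")
    name, old_ip, old_port, new_ip, new_port = parts
    return MasterSwitch(name, old_ip, int(old_port), new_ip, int(new_port))


if __name__ == "__main__":
    event = parse_switch_master("mymaster 192.168.1.10 6379 192.168.1.11 6379")
    print(f"{event.master_name}: {event.old_ip}:{event.old_port} "
          f"-> {event.new_ip}:{event.new_port}")
```

Rejecting payloads with an unexpected number of fields avoids an `IndexError` deep inside the subscriber loop.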

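6. Troubleshooting and Monitoring

6.1 Common Monitoring Commands

# 1. Show Sentinel status
redis-cli -p 26379 INFO Sentinel

# Example output:
# sentinel_masters:1
# sentinel_tilt:0
# sentinel_running_scripts:0
# sentinel_scripts_queue_length:0
# sentinel_simulate_failure_flags:0

# 2. Show the monitored Masters
redis-cli -p 26379 SENTINEL masters

# Example output:
# 1) "name"
# 2) "mymaster"
# ...
# 9) "flags"
# 10) "master"

# 3. Show the Slaves (SENTINEL replicas is the newer name)
redis-cli -p 26379 SENTINEL slaves mymaster

# 4. Show the other Sentinels
redis-cli -p 26379 SENTINEL sentinels mymaster

# 5. Get the current Master address
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster

# Example output:
# 1) "192.168.1.10"
# 2) "6379"

# 6. Show failover state (look for failover_in_progress in the flags field)
redis-cli -p 26379 SENTINEL master mymaster

# 7. Force a failover by hand. This promotes a Slave right away, so use it for drills only
redis-cli -p 26379 SENTINEL failover mymaster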

6.2 Monitoring Metrics

import redis
import time

def monitor_sentinel():
    sentinel = redis.StrictRedis(host='127.0.0.1', port=26379, db=0)

    while True:
        # Fetch the Master's state as seen by this Sentinel
        master_info = sentinel.sentinel_master('mymaster')

        print("=== Sentinel monitor ===")
        print(f"Master IP: {master_info['ip']}")
        print(f"Master port: {master_info['port']}")
        print(f"Master flags: {master_info['flags']}")
        print(f"Slaves: {master_info['num-slaves']}")
        print(f"Other Sentinels: {master_info['num-other-sentinels']}")
        print(f"Quorum: {master_info['quorum']}")
        print(f"Failover timeout: {master_info['failover-timeout']}ms")

        # Any flag besides "master" (s_down, o_down, ...) signals trouble
        if master_info['flags'] == 'master':
            print("✅ Master is healthy")
        else:
            print(f"⚠️ Master state is abnormal: {master_info['flags']}")

        print(f"\nUpdated at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
        print("-" * 50)

        time.sleep(10)  # check every 10 seconds

if __name__ == "__main__":
    monitor_sentinel()
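
The flags field is a comma-separated list, so an alerting rule usually wants it split into individual flags first. The sketch below maps a flags string to a coarse status. The flag names are the ones Sentinel reports (`s_down`, `o_down`, `failover_in_progress`, `disconnected`); the status labels and both helper functions are assumptions made for this illustration.

```python
from typing import Set


def parse_flags(flags: str) -> Set[str]:
    """Split a flags value such as 'master,s_down,o_down' into a set."""
    return {flag for flag in flags.split(",") if flag}


def master_health(flags: str) -> str:
    """Map Sentinel flags to a coarse status label (labels are illustrative)."""
    parsed = parse_flags(flags)
    # Check the most severe condition first.
    if "failover_in_progress" in parsed:
        return "failover"
    if "o_down" in parsed:
        return "objectively_down"
    if "s_down" in parsed:
        return "subjectively_down"
    if "disconnected" in parsed:
        return "disconnected"
    return "ok"


if __name__ == "__main__":
    for sample in ("master", "master,s_down", "master,s_down,o_down"):
        print(sample, "->", master_health(sample))
```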

6.3 Common Problems

Problem 1: Sentinel does not detect that the Master is down

Causes:

down-after-milliseconds is set too high
Network latency is too high

Fix:

# Lower the timeout
redis-cli -p 26379 SENTINEL set mymaster down-after-milliseconds 5000

# Check network latency
ping 192.168.1.10

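Problem 2: The failover fails

Causes:

Slave misconfiguration (read-only setting, persistence problems)
A network partition that leads to split brain

Fix:

# Check the Slave's read-only setting
redis-cli -p 6380 CONFIG GET slave-read-only

# Check whether the Slave accepts writes (a read-only Slave rejects this with a READONLY error)
redis-cli -p 6380 SET test_key test_value

# Trigger a failover by hand (for testing)
redis-cli -p 26379 SENTINEL failover mymaster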

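Problem 3: Split brain (more than one Master)

Causes:

Network partition
min-replicas-to-write is not configured

Fix:

# On the Master, require at least one Slave that acknowledged within the last 10 seconds.
# Otherwise the Master refuses writes, which bounds the data an isolated old Master can accept.
redis-cli -p 6379 CONFIG SET min-replicas-to-write 1
redis-cli -p 6379 CONFIG SET min-replicas-max-lag 10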
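
The effect of these two settings can be illustrated with a small model. It is a sketch of the rule, not Redis code: `master_accepts_writes` is a hypothetical function that takes, for each connected replica, the number of seconds since its last acknowledgement.

```python
from typing import Iterable


def master_accepts_writes(replica_lags_s: Iterable[float],
                          min_replicas_to_write: int = 1,
                          min_replicas_max_lag: int = 10) -> bool:
    """Count replicas whose last ACK is recent enough and compare to the minimum."""
    if min_replicas_to_write == 0:
        return True  # the check is disabled
    good = sum(1 for lag in replica_lags_s if lag <= min_replicas_max_lag)
    return good >= min_replicas_to_write


if __name__ == "__main__":
    print("healthy:", master_accepts_writes([1, 2]))
    # An isolated old Master has no replicas acknowledging any more.
    print("partitioned:", master_accepts_writes([]))
    print("lagging:", master_accepts_writes([30, 45]))
```

An old Master cut off from its replicas stops accepting writes once their acknowledgements are older than the allowed lag, so only the writes made during that window are at risk.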

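7. Best Practices

7.1 Deployment Architecture

Recommended deployment:

[Deployment diagram: one Master, three Replicas and four Sentinels (Sentinel-1 to Sentinel-4) spread over data centers A, B and C, with the Master replicating to each Replica.]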

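Best practices:

Odd number of Sentinels: use 3, 5 or 7 so that a majority can always be formed; an even number adds nodes without tolerating more failures
Spread across data centers: avoids a single point of failure
Dedicated servers: do not run Sentinel on the same hosts as Redis
Monitoring and alerting: integrate with Prometheus and Grafana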
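
The case for an odd number of Sentinels follows from simple arithmetic: a failover needs a majority, so the number of Sentinel failures a deployment survives is the total minus the majority. A minimal sketch, with hypothetical helper names:

```python
def majority(total_sentinels: int) -> int:
    """Smallest number of Sentinels that is more than half of the total."""
    return total_sentinels // 2 + 1


def tolerated_failures(total_sentinels: int) -> int:
    """How many Sentinels can fail while a majority is still reachable."""
    return total_sentinels - majority(total_sentinels)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} Sentinels: majority={majority(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Running it shows that 4 Sentinels tolerate no more failures than 3, and 6 no more than 5.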

7.2 Configuration Checklist

#!/bin/bash
# sentinel-check.sh

MASTER_NAME="mymaster"
SENTINEL_PORT=26379

# Print one field of `SENTINEL master <name>`. When its output is piped,
# redis-cli prints each field name and its value on consecutive lines.
master_field() {
    redis-cli -p "$SENTINEL_PORT" SENTINEL master "$MASTER_NAME" | grep -x -A1 -- "$1" | tail -n 1
}

echo "=== Sentinel configuration check ==="

# 1. Number of Sentinels (the reply lists only the other Sentinels, so add 1)
OTHER_SENTINELS=$(redis-cli -p "$SENTINEL_PORT" SENTINEL sentinels "$MASTER_NAME" | grep -cx "ip")
SENTINEL_COUNT=$((OTHER_SENTINELS + 1))
echo "Sentinels: $SENTINEL_COUNT"

if [ "$SENTINEL_COUNT" -lt 3 ]; then
    echo "⚠️ Too few Sentinels, at least 3 are recommended"
fi

# 2. Quorum
QUORUM=$(master_field quorum)
echo "Quorum: $QUORUM"

if [ "$QUORUM" -gt $((SENTINEL_COUNT / 2 + 1)) ]; then
    echo "⚠️ Quorum is set too high"
fi

# 3. Number of Slaves
SLAVE_COUNT=$(redis-cli -p "$SENTINEL_PORT" SENTINEL slaves "$MASTER_NAME" | grep -cx "ip")
echo "Slaves: $SLAVE_COUNT"

if [ "$SLAVE_COUNT" -lt 2 ]; then
    echo "⚠️ Too few Slaves, at least 2 are recommended"
fi

# 4. Failover timeout
TIMEOUT=$(master_field failover-timeout)
echo "Failover timeout: ${TIMEOUT}ms"

if [ "$TIMEOUT" -lt 60000 ]; then
    echo "⚠️ Failover timeout is too short, at least 60000ms is recommended"
fi

# 5. Master reachability
MASTER_ADDR=$(redis-cli -p "$SENTINEL_PORT" SENTINEL get-master-addr-by-name "$MASTER_NAME")
MASTER_IP=$(echo "$MASTER_ADDR" | head -n 1)
MASTER_PORT=$(echo "$MASTER_ADDR" | tail -n 1)

if redis-cli -h "$MASTER_IP" -p "$MASTER_PORT" PING > /dev/null 2>&1; then
    echo "✅ Master is reachable"
else
    echo "❌ Master is unreachable"
fi

echo "=== Check complete ==="

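8. Summary

Redis Sentinel is a key building block for Redis high availability.

Key points:

Three periodic tasks: INFO, PING and HELLO provide monitoring and state propagation
Two-stage failure detection: subjectively down (SDOWN), then objectively down (ODOWN)
Leader election: a Raft-like vote in which the winner needs more than half of the Sentinels
Failover: the best Slave is selected automatically and promoted to Master
Client integration: clients ask Sentinel for the Master address and subscribe to switch events

Good fit:

✅ High availability for a master-replica setup
✅ Automatic failover
✅ Small to medium deployments (under 100 GB), where the dataset fits on a single Master
❌ Large-scale horizontal scaling (use Redis Cluster instead)

Production advice:

Deploy at least 3 Sentinel nodes
Choose down-after-milliseconds and failover-timeout deliberately
Configure min-replicas-to-write to limit the damage of a split brain
Monitor the Sentinel logs and status metrics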

Author: [your nickname]
Published: 2026-04-01
Redis version: 7.2.0