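Redis Sentinel Failure Analysis

1. Background

This case comes from a test environment: after a power outage, the Redis master/slave failover did not happen. Redis version: 3.2.11.

There are 3 Redis nodes and 3 Sentinel nodes, running on these hosts: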

10.200.5.50
10.100.5.32
10.70.0.20

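10.70.0.20 was the master. On 2020-06-17 this host lost power, taking down both the Sentinel and the Redis instance on it. After the outage the application was unusable for a long time and reported that it could not connect to the Redis master (10.70.0.20). In other words, failover had not elected a new Redis master.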

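2. Log analysis

2.1 Sentinel logs

Sentinel log on node 50

Sentinel log on node 32

2.2 Analysis

The logs show the following sequence (nodes are referred to by the last octet of their IP address):

1) 15:58:05: the Sentinel on node 50 marks the master on node 20 as subjectively down (SDOWN).

2) 15:58:05: node 50 marks the Sentinel on node 20 as subjectively down.

3) 15:58:05: node 32 marks the Sentinel on node 20 as subjectively down.

4) 16:13:03: node 32 marks the master as objectively down (ODOWN) and starts a failover.

5) 16:13:03: node 32 becomes the leader.

6) 16:13:03: node 32's failover fails because no suitable slave could be selected (it would be retried after 6 minutes).

7) 16:18:17: a manual failover is triggered on the Sentinel on node 50, which becomes the leader.

8) 16:18:17: the Redis instance on node 50 is selected as the new master.

Open questions

1) Why did node 32 detect the master failure about 15 minutes late?

2) Why did the first failover fail, that is, why was no suitable slave selected?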
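The gaps between these timestamps matter for the rest of the analysis, so it helps to compute them exactly. The helper below is only an illustrative sketch; seconds_of_day and the two delay functions are made-up names, not Redis functions, and the times are the ones quoted from the logs above.

```c
/* Hypothetical helpers for checking the log timeline; not part of Redis. */

/* Convert a wall-clock time on a single day to seconds since midnight. */
int seconds_of_day(int hour, int minute, int second) {
    return hour * 3600 + minute * 60 + second;
}

/* Seconds between node 50's +sdown (15:58:05) and node 32's failover
 * attempt (16:13:03). */
int delay_until_node32_failover(void) {
    return seconds_of_day(16, 13, 3) - seconds_of_day(15, 58, 5);
}

/* Seconds between node 50's +sdown (15:58:05) and the manual failover
 * (16:18:17). */
int delay_until_manual_failover(void) {
    return seconds_of_day(16, 18, 17) - seconds_of_day(15, 58, 5);
}
```

Node 32 therefore lagged node 50 by 898 s (14 min 58 s), and the manual failover came 1212 s (20 min 12 s) after node 50's +sdown.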

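3. Question 1: why did node 32 detect the master failure 15 minutes late?

Root cause: Sentinel's subjective-down detection logic has a timing-dependent bug that triggers only some of the time. For details, see: Long delay before detecting master is subjectively down github.com/antirez/red...

The analysis below walks through the Redis 3.2 source.

1) The subjective-down check. When act_ping_time is non-zero, elapsed is the current time minus act_ping_time, that is, the time since the oldest PING that is still unanswered. If elapsed is greater than ri->down_after_period (30 s by default), the instance is marked subjectively down.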

//sentinel.c#sentinelCheckSubjectivelyDown
void sentinelCheckSubjectivelyDown(sentinelRedisInstance *ri) {
  mstime_t elapsed = 0;
  if (ri->link->act_ping_time)
      elapsed = mstime() - ri->link->act_ping_time;
  else if (ri->link->disconnected)
      elapsed = mstime() - ri->link->last_avail_time;
 ...
 if (elapsed > ri->down_after_period ||
        (ri->flags & SRI_MASTER &&
         ri->role_reported == SRI_SLAVE &&
         mstime() - ri->role_reported_time >
          (ri->down_after_period+SENTINEL_INFO_PERIOD*2))){
        /* Is subjectively down */
        if ((ri->flags & SRI_S_DOWN) == 0) {
            sentinelEvent(REDIS_WARNING,"+sdown",ri,"%@");
            ri->s_down_since_time = mstime();
            ri->flags |= SRI_S_DOWN;
        }
    } 
 
}

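2) sentinelSendPeriodicCommands sends the INFO command, the PING command, and the PUBLISH hello message. The checks form an if / else if chain, so at most one of them runs per call: once the INFO condition holds, PING is not sent. This is where the problem lies.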

//sentinel.c#sentinelSendPeriodicCommands
void sentinelSendPeriodicCommands(sentinelRedisInstance *ri) {
    ...
    ping_period = ri->down_after_period;
    if (ping_period > SENTINEL_PING_PERIOD) ping_period = SENTINEL_PING_PERIOD;

    if ((ri->flags & SRI_SENTINEL) == 0 &&
        (ri->info_refresh == 0 ||
        (now - ri->info_refresh) > info_period))
    {
        /* Send INFO to masters and slaves, not sentinels. */
        retval = redisAsyncCommand(ri->link->cc,
            sentinelInfoReplyCallback, ri, "INFO");
        if (retval == C_OK) ri->link->pending_commands++;
    } else if ((now - ri->link->last_pong_time) > ping_period &&
               (now - ri->link->last_ping_time) > ping_period/2) {
        /* Send PING to all the three kinds of instances. */
        sentinelSendPing(ri);
    } else if ((now - ri->last_pub_time) > SENTINEL_PUBLISH_PERIOD) {
        /* PUBLISH hello messages to all the three kinds of instances. */
        sentinelSendHello(ri);
    }
}

3) The INFO reply callback updates info_refresh.

//sentinel.c#sentinelInfoReplyCallback
void sentinelInfoReplyCallback(redisAsyncContext *c, void *reply, void *privdata) {
	...
	 ri->info_refresh = mstime();
	...
}

4) sentinelSendPing sends the PING command. If the send succeeds and act_ping_time is 0, act_ping_time is set to the current time.

//sentinel.c#sentinelSendPing
int sentinelSendPing(sentinelRedisInstance *ri) {
    int retval = redisAsyncCommand(ri->link->cc,
        sentinelPingReplyCallback, ri, "PING");
    if (retval == C_OK) {
        ri->link->pending_commands++;
        ri->link->last_ping_time = mstime();
        /* We update the active ping time only if we received the pong for
         * the previous ping, otherwise we are technically waiting since the
         * first ping that did not received a reply. */
        if (ri->link->act_ping_time == 0)
            ri->link->act_ping_time = ri->link->last_ping_time;
        return 1;
    } else {
        return 0;
    }
}

5) In the PING reply callback, an acceptable reply resets act_ping_time to 0 and updates last_avail_time; last_pong_time is updated for any reply.

//sentinel.c#sentinelPingReplyCallback
void sentinelPingReplyCallback(redisAsyncContext *c, void *reply, void *privdata) {
   ...
    if (r->type == REDIS_REPLY_STATUS ||
        r->type == REDIS_REPLY_ERROR) {
        /* Update the "instance available" field only if this is an
         * acceptable reply. */
        if (strncmp(r->str,"PONG",4) == 0 ||
            strncmp(r->str,"LOADING",7) == 0 ||
            strncmp(r->str,"MASTERDOWN",10) == 0)
        {
             link->last_avail_time = mstime();
             link->act_ping_time = 0;/* Flag the pong as received. */
        } else {
          ...
        }
    }
    link->last_pong_time = mstime();
}
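Steps 4 and 5 together define the life cycle of act_ping_time: it records the oldest PING that is still unanswered and drops back to 0 on an acceptable reply. The sketch below models just that bookkeeping. link_model, model_ping_sent and model_pong_received are hypothetical names used for illustration; they simplify the real instanceLink structure and callbacks.

```c
#include <stdint.h>

typedef int64_t mstime_t;

/* Simplified stand-in for the timing fields of instanceLink. */
typedef struct {
    mstime_t last_ping_time;   /* when the most recent PING was sent */
    mstime_t act_ping_time;    /* when the oldest unanswered PING was sent, 0 if none */
    mstime_t last_avail_time;  /* when the last acceptable reply arrived */
    mstime_t last_pong_time;   /* when the last reply of any kind arrived */
} link_model;

/* Mirrors the bookkeeping in sentinelSendPing after a successful send. */
void model_ping_sent(link_model *l, mstime_t now) {
    l->last_ping_time = now;
    if (l->act_ping_time == 0) l->act_ping_time = now;
}

/* Mirrors sentinelPingReplyCallback for an acceptable reply such as PONG. */
void model_pong_received(link_model *l, mstime_t now) {
    l->last_avail_time = now;
    l->act_ping_time = 0;
    l->last_pong_time = now;
}
```

If two PINGs go out at 1000 ms and 1600 ms without a reply, act_ping_time stays at 1000 while last_ping_time moves to 1600. A PONG at 1700 ms resets act_ping_time to 0. The subjective-down timer therefore only starts once a PING has actually been sent and left unanswered.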

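Given this logic, suppose the last INFO was sent 9.5 seconds ago and a PING has just been sent and answered. The master goes down at exactly this moment. The state is then: info_refresh is 9.5 seconds in the past, act_ping_time is 0, and link->disconnected is also 0.

About 0.5 seconds later the INFO condition becomes true and INFO is sent, but the master will never reply, so info_refresh is never updated. From then on every call takes the INFO branch and PING is never sent again. In the subjective-down check, act_ping_time = 0 and link->disconnected = 0 mean elapsed stays 0, so the master can never be marked subjectively down.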
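The scenario can be reproduced with a small discrete-time model. This is a sketch under stated assumptions, not Sentinel's real event loop: the 100 ms tick, the instantaneous replies from a live master, and the names simulate_sdown and else_if_chain are all made up for illustration. The periods follow the values mentioned in the text (INFO about every 10 s, PING every 1 s, down_after_period 30 s). The else_if_chain flag selects between the 3.2 behavior and independent checks for each command.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t;

enum {
    TICK_MS = 100,           /* assumed timer granularity of the model */
    INFO_PERIOD_MS = 10000,  /* INFO every 10 s while the master looks healthy */
    PING_PERIOD_MS = 1000,   /* PING every 1 s */
    DOWN_AFTER_MS = 30000    /* down_after_period default from the text */
};

/* Simulate one Sentinel watching one master that stops replying at
 * death_time. Replies from a live master are treated as instantaneous and
 * the TCP link is assumed never to report an error within the horizon.
 * Returns the time at which SDOWN would be flagged, or -1 if it never is. */
mstime_t simulate_sdown(mstime_t death_time, bool else_if_chain, mstime_t horizon) {
    mstime_t info_refresh = 0, last_ping = 0, last_pong = 0, act_ping = 0;

    for (mstime_t now = TICK_MS; now <= horizon; now += TICK_MS) {
        bool alive = now < death_time;
        bool info_due = info_refresh == 0 || now - info_refresh > INFO_PERIOD_MS;
        bool ping_due = now - last_pong > PING_PERIOD_MS &&
                        now - last_ping > PING_PERIOD_MS / 2;

        /* INFO branch: info_refresh only moves when a reply arrives. */
        if (info_due && alive) info_refresh = now;

        /* PING branch: with the else-if chain it is skipped whenever the
         * INFO branch was taken in this call. */
        if (ping_due && !(else_if_chain && info_due)) {
            last_ping = now;
            if (act_ping == 0) act_ping = now;
            if (alive) { act_ping = 0; last_pong = now; }
        }

        /* Subjective-down check, act_ping_time branch only. */
        if (act_ping != 0 && now - act_ping > DOWN_AFTER_MS) return now;
    }
    return -1;
}
```

In this model a master that dies at t = 20.0 s, just after a PING was answered and just before the next INFO is due, is never flagged when else_if_chain is true, while independent checks flag it at t = 51.0 s. A master that dies at t = 15.0 s is flagged at t = 45.5 s even with the else-if chain, because one more PING goes out before INFO becomes due.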

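3.1 Why was the Sentinel on node 20 detected as down so quickly?

Sentinels are never sent INFO, so for a Sentinel instance the INFO branch is never taken and only the PING check applies. With PING alone the logic works correctly.

3.2 Why did node 50 detect the master failure quickly?

Because the bug is timing-dependent. If the master dies within roughly the first 9 seconds after the last INFO was sent, the Sentinel still sends at least one more PING before the next INFO is due, so act_ping_time becomes non-zero and the subjective-down check works as intended. The bug is triggered roughly 10% of the time.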

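3.3 Why did node 32 also detect the master failure after about 15 minutes?

The data node 32 had sent was never ACKed, so the operating system kept retransmitting it. After enough failed retries the OS reports an error on the connection. The total retransmission time is roughly 13 to 30 minutes; see TCP/IP Illustrated, Volume 1.

TCP/IP Illustrated, Volume 1, section 14.2: Simple Timeout and Retransmission Example

Once the retransmission limit is reached, the OS reports an error, the application learns that the connection is dead, and it can then close the connection or take similar action. At this point Redis runs the following logic.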
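A back-of-the-envelope calculation shows where a figure of around 15 minutes can come from. The sketch below assumes Linux defaults, which the case description does not confirm: 15 retransmissions (tcp_retries2), a minimum retransmission timeout of 200 ms, and a cap of 120 s, with the timeout doubling after every attempt. Real stacks derive the starting timeout from the measured round-trip time, so this is an estimate only. tcp_give_up_after_ms is a hypothetical name.

```c
/* Total time spent waiting before the connection is declared dead:
 * one wait per transmission (the original send plus each retransmission),
 * with the timeout doubling up to rto_max_ms. */
long tcp_give_up_after_ms(int retries, long rto_min_ms, long rto_max_ms) {
    long total = 0;
    long rto = rto_min_ms;
    for (int i = 0; i <= retries; i++) {
        total += rto;
        rto = (rto * 2 > rto_max_ms) ? rto_max_ms : rto * 2;
    }
    return total;
}
```

With the assumed defaults, tcp_give_up_after_ms(15, 200, 120000) is 924600 ms, about 15.4 minutes, which is in line with the delay observed on node 32.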

//sentinel.c#sentinelReconnectInstance
void sentinelReconnectInstance(sentinelRedisInstance *ri) {
    ...
    if (link->cc == NULL) {
      ...
      redisAsyncSetDisconnectCallback(link->cc,sentinelDisconnectCallback);
    }
}

sentinelDisconnectCallback calls instanceLinkConnectionError:

//sentinel.c#sentinelDisconnectCallback
void sentinelDisconnectCallback(const redisAsyncContext *c, int status) {
    UNUSED(status);
    instanceLinkConnectionError(c);
}

instanceLinkConnectionError clears the connection and sets disconnected to 1:

//sentinel.c#instanceLinkConnectionError
void instanceLinkConnectionError(const redisAsyncContext *c) {
    instanceLink *link = c->data;
    int pubsub;

    if (!link) return;

    pubsub = (link->pc == c);
    if (pubsub)
        link->pc = NULL;
    else
        link->cc = NULL;
    link->disconnected = 1;
}

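In the subjective-down check, the second branch applies once ri->link->disconnected is 1: elapsed is the current time minus ri->link->last_avail_time. Since last_avail_time is when the last PONG was received, roughly 15 minutes earlier, elapsed is far greater than down_after_period and the instance is marked subjectively down.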

//sentinel.c#sentinelCheckSubjectivelyDown
void sentinelCheckSubjectivelyDown(sentinelRedisInstance *ri) {
  mstime_t elapsed = 0;
  if (ri->link->act_ping_time)
      elapsed = mstime() - ri->link->act_ping_time;
  else if (ri->link->disconnected)
      elapsed = mstime() - ri->link->last_avail_time;
 ...
 if (elapsed > ri->down_after_period ||
        (ri->flags & SRI_MASTER &&
         ri->role_reported == SRI_SLAVE &&
         mstime() - ri->role_reported_time >
          (ri->down_after_period+SENTINEL_INFO_PERIOD*2))){
        /* Is subjectively down */
        if ((ri->flags & SRI_S_DOWN) == 0) {
            sentinelEvent(REDIS_WARNING,"+sdown",ri,"%@");
            ri->s_down_since_time = mstime();
            ri->flags |= SRI_S_DOWN;
        }
    } 
}
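The two branches at the top of this check explain both halves of node 32's behavior: silence while the link still looked connected, then an immediate SDOWN once the link reported an error. The following sketch isolates that logic. link_state, elapsed_unreachable and looks_subjectively_down are hypothetical names, and the role-change condition of the original check is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t;

/* Simplified stand-in for the fields the check reads from instanceLink. */
typedef struct {
    mstime_t act_ping_time;    /* 0 when no PING is waiting for a reply */
    mstime_t last_avail_time;  /* time of the last acceptable reply */
    bool disconnected;         /* set once the link reported an error */
} link_state;

/* How long the instance has looked unreachable, following the two branches
 * at the top of sentinelCheckSubjectivelyDown. */
mstime_t elapsed_unreachable(const link_state *l, mstime_t now) {
    if (l->act_ping_time) return now - l->act_ping_time;
    if (l->disconnected) return now - l->last_avail_time;
    return 0;
}

/* Only the elapsed-time condition is modelled here. */
bool looks_subjectively_down(const link_state *l, mstime_t now,
                             mstime_t down_after_period) {
    return elapsed_unreachable(l, now) > down_after_period;
}
```

Taking the last PONG as t = 0 and looking 900 s (15 minutes) later: with act_ping_time = 0 and disconnected = false, elapsed is 0 and the master still looks healthy. As soon as disconnected becomes true, elapsed jumps to 900000 ms, far above the 30000 ms threshold.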

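3.4 How did Redis fix this upstream?

The Redis 5.0 code is shown below. In sentinelSendPeriodicCommands each command now has its own independent check. This guarantees that PING keeps being sent, so act_ping_time becomes non-zero once the instance stops replying.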

//sentinel.c#sentinelSendPeriodicCommands
 /* Send INFO to masters and slaves, not sentinels. */
    if ((ri->flags & SRI_SENTINEL) == 0 &&
        (ri->info_refresh == 0 ||
        (now - ri->info_refresh) > info_period))
    {
        retval = redisAsyncCommand(ri->link->cc,
            sentinelInfoReplyCallback, ri, "%s",
            sentinelInstanceMapCommand(ri,"INFO"));
        if (retval == C_OK) ri->link->pending_commands++;
    }

    /* Send PING to all the three kinds of instances. */
    if ((now - ri->link->last_pong_time) > ping_period &&
               (now - ri->link->last_ping_time) > ping_period/2) {
        sentinelSendPing(ri);
    }

    /* PUBLISH hello messages to all the three kinds of instances. */
    if ((now - ri->last_pub_time) > SENTINEL_PUBLISH_PERIOD) {
        sentinelSendHello(ri);
    }

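4. Question 2: why did the failover fail?

Root cause: node 32's Sentinel marked the master subjectively down far too late, so when it selected a slave to promote, every candidate was filtered out.

The Redis 3.2 slave selection logic is as follows.

max_master_down_time is 10 times master->down_after_period (30 s by default), i.e. 300 s, plus the time the master has already spent in SDOWN as seen by this Sentinel.

In the source, the check that causes trouble is slave->master_link_down_time > max_master_down_time. Let us trace where slave->master_link_down_time comes from.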

//sentinel.c#sentinelSelectSlave
sentinelRedisInstance *sentinelSelectSlave(sentinelRedisInstance *master) {
    sentinelRedisInstance **instance =
        zmalloc(sizeof(instance[0])*dictSize(master->slaves));
    sentinelRedisInstance *selected = NULL;
    int instances = 0;
    dictIterator *di;
    dictEntry *de;
    mstime_t max_master_down_time = 0;

    if (master->flags & SRI_S_DOWN)
        max_master_down_time += mstime() - master->s_down_since_time;
    max_master_down_time += master->down_after_period * 10;

    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);
        mstime_t info_validity_time;

        if (slave->flags & (SRI_S_DOWN|SRI_O_DOWN|SRI_DISCONNECTED)) continue;
        if (mstime() - slave->last_avail_time > SENTINEL_PING_PERIOD*5) continue;
        if (slave->slave_priority == 0) continue;

        /* If the master is in SDOWN state we get INFO for slaves every second.
         * Otherwise we get it with the usual period so we need to account for
         * a larger delay. */
        if (master->flags & SRI_S_DOWN)
            info_validity_time = SENTINEL_PING_PERIOD*5;
        else
            info_validity_time = SENTINEL_INFO_PERIOD*3;
        if (mstime() - slave->info_refresh > info_validity_time) continue;
        if (slave->master_link_down_time > max_master_down_time) continue;
        instance[instances++] = slave;
    }
    ...
    return selected;
}

A slave's master_link_down_time comes from its INFO reply and is refreshed on every reply. How does the slave produce this value?

//sentinel.c#sentinelRefreshInstanceInfo
/* master_link_down_since_seconds:<seconds> */
if (sdslen(l) >= 32 &&
            !memcmp(l,"master_link_down_since_seconds",30)) {
	ri->master_link_down_time = strtoll(l+31,NULL,10)*1000;
}
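The offset 31 in the excerpt above is the length of the field name plus the colon. A standalone version of the same parsing step might look like the sketch below; parse_master_link_down is a hypothetical helper written for illustration, not the Sentinel function.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Parse one line of INFO output. On a match, store the link-down time in
 * milliseconds in *out_ms and return true; otherwise return false. */
bool parse_master_link_down(const char *line, long long *out_ms) {
    static const char key[] = "master_link_down_since_seconds:";
    const size_t key_len = sizeof(key) - 1;  /* 31 characters, hence l+31 */

    if (strncmp(line, key, key_len) != 0) return false;
    *out_ms = strtoll(line + key_len, NULL, 10) * 1000;
    return true;
}
```

For example, the line master_link_down_since_seconds:840 yields 840000 ms, i.e. a link that has been down for 14 minutes.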

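Looking at the INFO command implementation, the value is server.unixtime - server.repl_down_since: the current time minus the time replication went down. So the next question is where repl_down_since is set.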

//redis.c#genRedisInfoString
if (server.repl_state != REDIS_REPL_CONNECTED) {
	info = sdscatprintf(info,
	                    "master_link_down_since_seconds:%jd\r\n",
                    (intmax_t)server.unixtime-server.repl_down_since);
 }

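In replicationCron, the slave checks its connection to the master: if nothing has been received from the master for more than repl_timeout (60 s by default), it closes the master connection.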

//replication.c#replicationCron
if (server.masterhost && server.repl_state == REDIS_REPL_CONNECTED &&
        (time(NULL)-server.master->lastinteraction) > server.repl_timeout) {
	redisLog(REDIS_WARNING,"MASTER timeout: no data nor PING received...");
	freeClient(server.master);
}

freeClient calls a dedicated handler when the client being freed is the master:

//networking.c#freeClient
void freeClient(redisClient *c) {
   ...
   if (c->flags & REDIS_MASTER) replicationHandleMasterDisconnection();
   ...
}

That handler sets repl_down_since to the current time:

//replication.c#replicationHandleMasterDisconnection
void replicationHandleMasterDisconnection(void) {
    server.master = NULL;
    server.repl_state = REDIS_REPL_CONNECT;
    server.repl_down_since = server.unixtime;
    /* We lost connection with our master, don't disconnect slaves yet,
     * maybe we'll be able to PSYNC with our master later. We'll disconnect
     * the slaves only if we'll have to do a full resync with our master. */
}
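The slave-side timing that results from these three excerpts can be written down as two small functions. This is a simplified sketch: it assumes the timeout check runs once per second, that the master's last message is the reference point, and that repl_down_since is set at the moment the timeout fires. slave_gives_up_at and link_down_since_seconds are hypothetical names.

```c
/* All times are in whole seconds, like server.unixtime. */

/* First second at which the timeout check in replicationCron fires, given
 * the time of the last data or PING received from the master. */
long slave_gives_up_at(long last_interaction, long repl_timeout) {
    return last_interaction + repl_timeout + 1;  /* the check uses '>' */
}

/* Value the slave would report as master_link_down_since_seconds at 'now',
 * assuming repl_down_since was set when the timeout fired. */
long link_down_since_seconds(long now, long repl_down_since) {
    return now - repl_down_since;
}
```

Taking the master's last message as t = 0 and the default repl_timeout of 60 s, the slave gives up at t = 61 s. Queried at t = 901 s, it reports a link-down time of 840 s, or 14 minutes.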

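Putting this together: no suitable slave was selected because the Sentinel on node 32 detected the master failure far too late. By the time it selected a slave, every slave's link to the master had been down for too long, so all slaves were filtered out.

A slave needs at most about 1 minute to notice that the master is down.

Node 32 took about 15 minutes to notice that the master was down.

So slave->master_link_down_time (about 14 minutes) > max_master_down_time (300 s = 5 minutes) holds, and no slave was selected.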

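4.1 Why did the manual failover on node 50 succeed?

Node 50 marked the master subjectively down early, about 30 s after the failure, so its max_master_down_time is much larger.

The manual failover was triggered at 16:18:17, but node 50 had marked the master subjectively down at 15:58:05, so max_master_down_time is about 20 + 5 = 25 minutes.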

//sentinel.c#sentinelSelectSlave
 if (master->flags & SRI_S_DOWN)
        max_master_down_time += mstime() - master->s_down_since_time;
    max_master_down_time += master->down_after_period * 10;

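As shown above, a slave needs at most about 1 minute to notice that the master is down, so the slave's master_link_down_time is smaller than max_master_down_time and a suitable slave can be selected for failover.

slave->master_link_down_time (about 20.5 minutes) > max_master_down_time (about 25 minutes)

This condition is false, so the slave is not filtered out, a suitable slave is selected, and the failover completes.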
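The two outcomes can be compared side by side with a small model of the filter. This is a sketch, not the Sentinel code: max_master_down_time and slave_filtered_out are hypothetical helper names, only the link-down filter of sentinelSelectSlave is modelled, and the inputs are the approximate figures quoted in this section.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t;

/* Upper bound on a slave's master_link_down_time, following the arithmetic
 * at the top of sentinelSelectSlave. */
mstime_t max_master_down_time(bool master_sdown, mstime_t now,
                              mstime_t s_down_since_time,
                              mstime_t down_after_period) {
    mstime_t max = 0;
    if (master_sdown) max += now - s_down_since_time;
    max += down_after_period * 10;
    return max;
}

/* True when the slave is skipped because its link to the master has been
 * down for too long. The other filters are omitted. */
bool slave_filtered_out(mstime_t master_link_down_time, mstime_t max_down_time) {
    return master_link_down_time > max_down_time;
}
```

Measure time in milliseconds from node 50's +sdown at 15:58:05. Assuming node 32 flagged SDOWN at essentially the moment it started its failover (t = 898000), its bound is 300000 ms, so a slave whose link has been down for about 14 minutes (840000 ms) is rejected. Node 50 fails over at t = 1212000 with SDOWN since t = 0, so its bound is 1512000 ms (about 25 minutes), and a link-down time of about 20.5 minutes (1230000 ms) passes the filter.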

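5. Takeaways

Use a recent Redis version where possible. The problem described here is fixed by upgrading to 5.0; alternatively, patch the fix into the source of the branch in use.
Keep the down-after-milliseconds setting, which controls how long each Sentinel waits before declaring an instance down, identical across all Sentinels.

6. References

Redis issue: github.com/redis/redis...
Redis 3.2.11 source
Redis 5.0 branch source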
