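Redis Cluster in Depth (Part 1): Architecture, Data Structures, and the Gossip Protocol

Sentinel solves high availability, but a Sentinel deployment is still bounded by the capacity of a single machine. Redis Cluster shards data across nodes to scale horizontally, so it provides high availability while lifting the single-node memory limit. This article is Part 1 of a Redis Cluster series. It examines the cluster architecture, the data structures each node maintains, the hash slot mechanism, and how the Gossip protocol propagates state.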

📖 Contents

Why Redis Cluster?
Overall Cluster Architecture
Node Data Structures in Depth
- clusterState: Global State
- clusterNode: Per-Node Information
- The nodes.conf Configuration File
- Slot Storage Structures
Hash Slot Mechanism
- Why 16384 Slots?
- Key Routing in Detail
- Hash Tags in Practice
Gossip Protocol in Depth
- Gossip Message Format
- PING/PONG Contents
- Propagation Algorithm
- Config Epoch Mechanism
- Convergence Process
Summary

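Why Redis Cluster?

Bottlenecks of Sentinel Mode

Three limits of Sentinel mode:

1. Capacity:
┌──────────────────────────┐
│  Master (64GB)           │  single-machine memory ceiling
│  • data grows to 64GB    │
│  • cannot scale further  │  ❌
└──────────────────────────┘

2. Write throughput:
Master: 50,000 QPS
• all writes land on one node
• write capacity cannot scale horizontally ❌

3. Cost:
• one 256GB server: about 80,000 CNY
• eight 32GB servers: about 30,000 CNY
• large servers are expensive and inflexible ❌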

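How Redis Cluster Addresses Them

Distributed cluster: data sharding + high availability

Capacity scaling:
┌────────────┐  ┌────────────┐  ┌────────────┐
│ Master1    │  │ Master2    │  │ Master3    │
│ 20GB       │  │ 20GB       │  │ 20GB       │
│ Slots      │  │ Slots      │  │ Slots      │
│ 0-5460     │  │ 5461-10922 │  │ 10923-16383│
└────────────┘  └────────────┘  └────────────┘
   Total capacity: 60GB ✅ (can grow to the TB range)

Write scaling:
Master1: 20,000 QPS
Master2: 20,000 QPS
Master3: 20,000 QPS
Total: 60,000 QPS ✅ (grows roughly with the number of masters)

Cost:
• use several commodity servers ✅
• scale on demand ✅
• cost reduced by 50% or more ✅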

Overall Cluster Architecture

┌────────────────────────────────────────────────────────┐
│    Redis Cluster architecture (3 masters, 3 slaves)    │
└────────────────────────────────────────────────────────┘

              Client (smart client)
                        ↓
            1. Compute the key's slot
            slot = CRC16(key) % 16384
                        ↓
         2. Look up the local slot cache
               slot 14909 → Node3
                        ↓
     3. Connect directly to the target node
        ┌───────────────┼───────────────┐
        ↓               ↓               ↓
  ┌────────────┐  ┌────────────┐  ┌────────────┐
  │ Master1    │  │ Master2    │  │ Master3    │
  │ 6379       │  │ 6380       │  │ 6381       │
  │ Slots:     │  │ Slots:     │  │ Slots:     │
  │ 0-5460     │  │ 5461-10922 │  │ 10923-16383│
  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘
        │ replication   │ replication   │ replication
        ↓               ↓               ↓
  ┌────────────┐  ┌────────────┐  ┌────────────┐
  │ Slave1     │  │ Slave2     │  │ Slave3     │
  │ 6382       │  │ 6383       │  │ 6384       │
  └────────────┘  └────────────┘  └────────────┘

  ┌──────────────────────────────────────────────────┐
  │  Cluster bus (client port + 10000 by default)    │
  │  16379, 16380, 16381, 16382, ...                 │
  └────────────────────────┬─────────────────────────┘
                           ↓
    Gossip protocol (node-to-node communication)
    • PING/PONG heartbeats (every second)
    • configuration propagation (slots, epochs)
    • failure detection (PFAIL/FAIL)

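Node Data Structures in Depth

clusterState: Global State

Every node maintains a complete view of the cluster:

// cluster.h (abridged; the exact fields vary across Redis versions)
typedef struct clusterState {
    clusterNode *myself;                    // this node
    uint64_t currentEpoch;                  // current epoch (cluster-wide)
    int state;                              // cluster state (CLUSTER_OK / CLUSTER_FAIL)
    int size;                               // number of masters serving at least one slot

    // ===== Node management =====
    dict *nodes;                            // node name -> clusterNode, for every known node
    dict *nodes_black_list;                 // nodes we do not re-add for a while

    // ===== Slot management =====
    clusterNode *slots[CLUSTER_SLOTS];      // slot -> owning node (16384 entries)
    uint64_t slots_keys_count[CLUSTER_SLOTS]; // number of keys in each slot
    rax *slots_to_keys;                     // radix tree mapping slot -> keys

    // ===== Migration state =====
    clusterNode *migrating_slots_to[CLUSTER_SLOTS];      // per slot: node we migrate it to
    clusterNode *importing_slots_from[CLUSTER_SLOTS];    // per slot: node we import it from

    // ===== Failover =====
    mstime_t failover_auth_time;            // time of the previous or next election
    int failover_auth_count;                // votes received so far
    int failover_auth_sent;                 // true if we already asked for votes
    uint64_t failover_auth_epoch;           // epoch of the current election

    // ===== Statistics =====
    long long stats_bus_messages_sent;      // messages sent over the cluster bus
    long long stats_bus_messages_received;  // messages received over the cluster bus

} clusterState;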

Memory layout sketch:

Layout of Node1's clusterState in memory:
┌───────────────────────────────────────────────────┐
│                   clusterState                    │
├───────────────────────────────────────────────────┤
│ myself → ┌──────────────────────┐                 │
│          │ clusterNode (myself) │                 │
│          │ name: 07c37dfe...    │                 │
│          │ flags: MASTER|MYSELF │                 │
│          │ slots: [2KB bitmap]  │                 │
│          └──────────────────────┘                 │
├───────────────────────────────────────────────────┤
│ nodes (dict) → {                                  │
│   "07c37dfe..." → clusterNode (Node1, myself)     │
│   "67ed2db8..." → clusterNode (Node2)             │
│   "2dcb8d1f..." → clusterNode (Node3)             │
│   "9f8e7d6c..." → clusterNode (Node4)             │
│   "5a4b3c2d..." → clusterNode (Node5)             │
│   "3e2f1d0c..." → clusterNode (Node6)             │
│ }                                                 │
├───────────────────────────────────────────────────┤
│ slots[16384] → slot-to-node table                 │
│   slots[0]     → Node1 ──┐                        │
│   slots[1]     → Node1   │                        │
│   ...                    │ owned by Node1         │
│   slots[5460]  → Node1 ──┘                        │
│   slots[5461]  → Node2 ──┐                        │
│   ...                    │ owned by Node2         │
│   slots[10922] → Node2 ──┘                        │
│   slots[10923] → Node3 ──┐                        │
│   ...                    │ owned by Node3         │
│   slots[16383] → Node3 ──┘                        │
├───────────────────────────────────────────────────┤
│ slots_keys_count[16384] → keys per slot           │
│   slots_keys_count[0] = 1500                      │
│   slots_keys_count[1] = 2300                      │
│   ...                                             │
├───────────────────────────────────────────────────┤
│ currentEpoch: 100  (cluster-wide epoch)           │
│ state: CLUSTER_OK                                 │
│ size: 3  (3 masters)                              │
└───────────────────────────────────────────────────┘

Total size: on a 64-bit build the four 16384-entry arrays above take 16384 × 8 bytes = 128KB each, about 512KB in total, plus the per-node records.

clusterNode: Per-Node Information

// cluster.h (abridged)
typedef struct clusterNode {
    mstime_t ctime;                         // creation time of this node object
    char name[CLUSTER_NAMELEN];             // node ID (40 hex characters)
    int flags;                              // state flags (CLUSTER_NODE_*)
    uint64_t configEpoch;                   // last configEpoch observed for this node

    // ===== Slots (2KB bitmap) =====
    unsigned char slots[CLUSTER_SLOTS/8];   // 16384/8 = 2048 bytes
    int numslots;                           // number of slots this node serves

    // ===== Master/slave relationship =====
    int numslaves;                          // number of slaves, if this is a master
    struct clusterNode **slaves;            // array of slave nodes
    struct clusterNode *slaveof;            // master of this node, if it is a slave

    // ===== Network =====
    mstime_t ping_sent;                     // time the latest PING was sent
    mstime_t pong_received;                 // time the latest PONG was received
    mstime_t data_received;                 // time any data was last received
    mstime_t fail_time;                     // time the FAIL flag was set
    char ip[NET_IP_STR_LEN];                // IP address
    int port;                               // client port
    int cport;                              // cluster bus port
    clusterLink *link;                      // TCP link to this node

    // ===== Failure reports =====
    list *fail_reports;                     // nodes that reported this node as failing

} clusterNode;

Flag details:

// flags is a bitmask
#define CLUSTER_NODE_MASTER 1        // 0x0001: master
#define CLUSTER_NODE_SLAVE 2         // 0x0002: slave
#define CLUSTER_NODE_PFAIL 4         // 0x0004: possibly failing (subjective)
#define CLUSTER_NODE_FAIL 8          // 0x0008: failing (agreed by a majority of masters)
#define CLUSTER_NODE_MYSELF 16       // 0x0010: this node
#define CLUSTER_NODE_HANDSHAKE 32    // 0x0020: handshake in progress
#define CLUSTER_NODE_NOADDR 64       // 0x0040: address unknown
#define CLUSTER_NODE_MEET 128        // 0x0080: send a MEET message to this node
#define CLUSTER_NODE_MIGRATE_TO 256  // 0x0100: master eligible for replica migration

// Example combinations
flags = CLUSTER_NODE_MASTER | CLUSTER_NODE_MYSELF  // 17 (0x0011)
flags = CLUSTER_NODE_SLAVE | CLUSTER_NODE_PFAIL    // 6  (0x0006)
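
To make the bitmask concrete, here is a minimal sketch that turns a flags value into the comma-separated names that appear in nodes.conf. The helper name `flags_to_string` and the lookup table are hypothetical and written for this article; they are not Redis functions. The flag values are the ones listed above, and "fail?" is the label Redis prints for PFAIL.

```c
#include <stdio.h>
#include <string.h>

#define CLUSTER_NODE_MASTER     1
#define CLUSTER_NODE_SLAVE      2
#define CLUSTER_NODE_PFAIL      4
#define CLUSTER_NODE_FAIL       8
#define CLUSTER_NODE_MYSELF    16
#define CLUSTER_NODE_HANDSHAKE 32
#define CLUSTER_NODE_NOADDR    64

/* Hypothetical lookup table: one printable name per flag bit. */
static const struct { int bit; const char *name; } flag_names[] = {
    { CLUSTER_NODE_MYSELF,    "myself"    },
    { CLUSTER_NODE_MASTER,    "master"    },
    { CLUSTER_NODE_SLAVE,     "slave"     },
    { CLUSTER_NODE_PFAIL,     "fail?"     },
    { CLUSTER_NODE_FAIL,      "fail"      },
    { CLUSTER_NODE_HANDSHAKE, "handshake" },
    { CLUSTER_NODE_NOADDR,    "noaddr"    },
};

/* Writes the names of all set bits into out (capacity outlen) and returns out.
 * If the buffer is too small, the output stops at the last name that fit. */
char *flags_to_string(int flags, char *out, size_t outlen) {
    size_t used = 0;
    if (outlen == 0) return out;
    out[0] = '\0';
    for (size_t i = 0; i < sizeof(flag_names) / sizeof(flag_names[0]); i++) {
        if (!(flags & flag_names[i].bit)) continue;
        int n = snprintf(out + used, outlen - used, "%s%s",
                         used ? "," : "", flag_names[i].name);
        if (n < 0 || (size_t)n >= outlen - used) {
            out[used] = '\0';   /* drop the partially written name */
            break;
        }
        used += (size_t)n;
    }
    return out;
}
```

Calling `flags_to_string(17, buf, sizeof buf)` renders the MASTER | MYSELF combination from the example above.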

A clusterNode instance:

Full record for Node1:
┌────────────────────────────────────────────────┐
│              clusterNode (Node1)               │
├────────────────────────────────────────────────┤
│ name: 07c37dfeb235213a872192d90877d0cd55635b91 │
│       (40 hex characters, randomly generated)  │
├────────────────────────────────────────────────┤
│ flags: 17  (binary: 0001 0001)                 │
│        = MASTER (1) | MYSELF (16)              │
├────────────────────────────────────────────────┤
│ configEpoch: 1  (configuration version)        │
├────────────────────────────────────────────────┤
│ slots (2KB bitmap, lowest slot first):         │
│   byte 0:    11111111  (slots 0-7)             │
│   byte 1:    11111111  (slots 8-15)            │
│   ...                                          │
│   byte 682:  11111000  (slots 5456-5463)       │
│   byte 683:  00000000  (slots 5464-5471)       │
│   ...                                          │
│   byte 2047: 00000000  (slots 16376-16383)     │
│                                                │
│ numslots: 5461                                 │
├────────────────────────────────────────────────┤
│ slaveof: NULL  (this node is a master)         │
│ numslaves: 1                                   │
│ slaves[0] → Node2 (slave)                      │
├────────────────────────────────────────────────┤
│ ip: 127.0.0.1                                  │
│ port: 6379                                     │
│ cport: 16379  (cluster bus port)               │
├────────────────────────────────────────────────┤
│ ping_sent: 1698307200000  (ms timestamp)       │
│ pong_received: 1698307201000                   │
│ fail_time: 0  (not failed)                     │
├────────────────────────────────────────────────┤
│ link → ┌─────────────────┐                     │
│        │ clusterLink     │                     │
│        │ fd: 25          │                     │
│        │ sndbuf: ...     │                     │
│        │ rcvbuf: ...     │                     │
│        └─────────────────┘                     │
├────────────────────────────────────────────────┤
│ fail_reports: []  (failure reports, empty)     │
└────────────────────────────────────────────────┘

Byte 682 has five bits set because Node1 owns slots 5456-5460 of that byte. In memory the lowest slot is the least significant bit, so the byte value is 0x1F.

Total size: about 3KB per node, dominated by the 2KB bitmap.

The nodes.conf Configuration File

# nodes-6379.conf
# Redis maintains this file automatically; do not edit it by hand.
# The lines starting with # are annotations for this article. The real file
# contains only the node lines and the vars line.

# ===== Format =====
# <node_id> <ip:port@cport> <flags> <master_id> <ping_sent> <pong_recv> <config_epoch> <link_state> <slots>

# ===== Contents =====
07c37dfeb235213a872192d90877d0cd55635b91 127.0.0.1:6379@16379 myself,master - 0 0 1 connected 0-5460
67ed2db8d677e59ec4a4cefb06858cf2a1a89fa1 127.0.0.1:6380@16380 slave 07c37dfeb235213a872192d90877d0cd55635b91 0 1698307201568 1 connected
2dcb8d1f8c9a4e6b3f7d5a9c2e8b4d6f1a3c5e7b 127.0.0.1:6381@16381 master - 0 1698307202789 2 connected 5461-10922
9f8e7d6c5b4a3d2c1e0f9d8c7b6a5d4c3e2f1d0c 127.0.0.1:6382@16382 slave 2dcb8d1f8c9a4e6b3f7d5a9c2e8b4d6f1a3c5e7b 0 1698307203456 2 connected
5a4b3c2d1e0f9e8d7c6b5a4d3c2e1f0d9c8b7a6 127.0.0.1:6383@16383 master - 0 1698307204123 3 connected 10923-16383
3e2f1d0c9b8a7d6c5e4f3d2c1e0f9d8c7b6a5d4 127.0.0.1:6384@16384 slave 5a4b3c2d1e0f9e8d7c6b5a4d3c2e1f0d9c8b7a6 0 1698307205678 3 connected

# ===== Global variables =====
vars currentEpoch 6 lastVoteEpoch 0

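Field by field:

Field 1: node_id (40 characters)
  07c37dfeb235213a872192d90877d0cd55635b91
  • generated randomly when the node first starts
  • stays the same even if the IP changes (only CLUSTER RESET HARD assigns a new one)
  • plays the role of an identity number

Field 2: ip:port@cport
  127.0.0.1:6379@16379
  • 6379: port for client connections
  • 16379: cluster bus port (node-to-node traffic only)
  • default: cport = port + 10000

Field 3: flags
  myself,master
  • myself: the node that wrote this file
  • master/slave: role
  • fail: marked as failed
  • handshake: handshake in progress
  • noaddr: address unknown

Field 4: master_id
  07c37dfeb... or -
  • a slave records the ID of its master
  • a master shows "-"

Field 5: ping_sent
  0 or a timestamp
  • time the latest PING was sent
  • 0 means no PING is pending (always 0 for myself)

Field 6: pong_received
  1698307201568
  • time the latest PONG was received
  • used to decide whether the node is still alive

Field 7: config_epoch
  1
  • configuration epoch (a version number)
  • resolves conflicting slot ownership claims
  • the larger value is the newer one

Field 8: link_state
  connected or disconnected
  • state of the cluster bus link

Field 9: slots
  0-5460, 5461-10922, or several ranges
  • the slot ranges the node serves
  • ranges need not be contiguous: 0-100 200-300 400-500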
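
The nine fields can be pulled apart with a single `sscanf` call. The sketch below is an illustration only: the `node_line` struct and `parse_node_line` function are hypothetical names invented here, not part of Redis, and the parser assumes the IPv4-style `ip:port@cport` layout shown in the sample file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one node line of nodes.conf (not a Redis struct). */
typedef struct {
    char id[41];                 /* 40 hex characters + NUL */
    char ip[46];
    int port, cport;
    char flags[64];
    char master_id[41];          /* "-" for a master */
    long long ping_sent, pong_received;
    unsigned long long config_epoch;
    char link_state[16];
    char slots[128];             /* raw remainder, e.g. "0-5460"; empty for slaves */
} node_line;

/* Returns 1 on success, 0 if the line does not follow the node-line layout. */
int parse_node_line(const char *line, node_line *out) {
    int consumed = 0;
    memset(out, 0, sizeof(*out));
    int n = sscanf(line, "%40s %45[^:]:%d@%d %63s %40s %lld %lld %llu %15s%n",
                   out->id, out->ip, &out->port, &out->cport, out->flags,
                   out->master_id, &out->ping_sent, &out->pong_received,
                   &out->config_epoch, out->link_state, &consumed);
    if (n != 10) return 0;                          /* %n is not counted */
    while (line[consumed] == ' ') consumed++;       /* skip the separator */
    snprintf(out->slots, sizeof(out->slots), "%s", line + consumed);
    out->slots[strcspn(out->slots, "\r\n")] = '\0'; /* trim a trailing newline */
    return 1;
}
```

The `vars currentEpoch ...` line is rejected by this parser because it has no `ip:port@cport` field, so a caller can handle it separately.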

Slot Storage Structures

Three data structures describe slots:

// 1. slots[16384]: array indexed by slot
clusterNode *slots[CLUSTER_SLOTS];

// O(1) lookup of the node that owns a slot
clusterNode *node = server.cluster->slots[14909];

// 2. clusterNode->slots[2048]: bitmap
unsigned char slots[CLUSTER_SLOTS/8];

// Test whether a node serves a given slot
int hasSlot(clusterNode *n, int slot) {
    return (n->slots[slot/8] & (1 << (slot%8))) != 0;
}

// 3. slots_to_keys: radix tree
rax *slots_to_keys;

// Maps slot -> keys; used during slot migration
// Shape:
// Slot 100 → [key1, key2, key3, ...]
// Slot 101 → [key4, key5, ...]
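Bitmap operations in detail (simplified from cluster.c):

// Assign a slot to node n
void clusterAddSlot(clusterNode *n, int slot) {
    // Slot already assigned: leave it alone so numslots stays accurate
    if (server.cluster->slots[slot] != NULL) return;

    // Set the bit in the node's bitmap
    n->slots[slot/8] |= (1 << (slot%8));
    n->numslots++;

    // Update the global slot -> node table
    server.cluster->slots[slot] = n;
}

// Example: assign slot 100
// 100 / 8 = 12 (the 13th byte)
// 100 % 8 = 4  (the 5th bit)
// slots[12] |= (1 << 4)
// slots[12] |= 0001 0000

// Unassign a slot
void clusterDelSlot(int slot) {
    clusterNode *n = server.cluster->slots[slot];
    if (n == NULL) return;  // slot was not assigned

    // Clear the bit in the owner's bitmap
    n->slots[slot/8] &= ~(1 << (slot%8));
    n->numslots--;

    // Clear the global table entry
    server.cluster->slots[slot] = NULL;
}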

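Why a bitmap:

Why a bitmap rather than an array?

Option 1: array (int slots[16384])
• size: 16384 × 4 = 64KB
• every heartbeat would carry 64KB
• too much network overhead ❌

Option 2: bitmap (unsigned char slots[2048])
• size: 16384 / 8 = 2KB
• a heartbeat carries only 2KB ✅
• 32 times smaller

Bitmap operations:
• set: O(1)
• test: O(1)
• scan: O(16384/8) = O(2048) bytes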
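
Scanning the bitmap is what turns the 2048 bytes back into the compact range notation seen in nodes.conf. The following is a minimal sketch of that scan; `slot_set`, `slot_is_set` and `bitmap_to_ranges` are hypothetical helper names chosen for this example, and the bit layout (lowest slot in the least significant bit) follows the `slot/8`, `slot%8` convention used above.

```c
#include <stdio.h>
#include <string.h>

#define CLUSTER_SLOTS 16384

/* Test one slot: byte slot/8, bit slot%8. */
static int slot_is_set(const unsigned char *bitmap, int slot) {
    return (bitmap[slot / 8] >> (slot % 8)) & 1;
}

/* Set one slot in the bitmap. */
static void slot_set(unsigned char *bitmap, int slot) {
    bitmap[slot / 8] |= (unsigned char)(1 << (slot % 8));
}

/* Writes ranges such as "0-100 200-300" into out.
 * Returns the number of ranges, or -1 if out is too small. */
int bitmap_to_ranges(const unsigned char *bitmap, char *out, size_t outlen) {
    int ranges = 0, start = -1;
    size_t used = 0;
    if (outlen > 0) out[0] = '\0';
    /* Iterate one position past the end so an open range is always closed. */
    for (int slot = 0; slot <= CLUSTER_SLOTS; slot++) {
        int set = slot < CLUSTER_SLOTS && slot_is_set(bitmap, slot);
        if (set && start < 0) start = slot;          /* a range opens */
        if (!set && start >= 0) {                    /* a range closes at slot-1 */
            int n;
            if (start == slot - 1)
                n = snprintf(out + used, outlen - used, "%s%d",
                             ranges ? " " : "", start);
            else
                n = snprintf(out + used, outlen - used, "%s%d-%d",
                             ranges ? " " : "", start, slot - 1);
            if (n < 0 || (size_t)n >= outlen - used) return -1;
            used += (size_t)n;
            ranges++;
            start = -1;
        }
    }
    return ranges;
}
```

Setting slots 0 through 5460 with `slot_set` and then calling `bitmap_to_ranges` reproduces the single range a master with that assignment would show.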

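Hash Slot Mechanism

Why 16384 Slots?

Reasons 1 and 2 follow the explanation given by the Redis author (antirez); reasons 3 and 4 are additional practical considerations.

Reason 1: heartbeat size
┌────────────────────────────────────────────┐
│        PING/PONG message structure         │
├────────────────────────────────────────────┤
│ Header fields: about 200 bytes             │
│ Slot bitmap: 2KB (16384 slots)             │
│ Gossip data: about 1KB (10 nodes)          │
│ Total: about 3-4KB                         │
└────────────────────────────────────────────┘

With 65536 slots:
• bitmap: 65536/8 = 8KB
• total size: about 9-10KB
• heartbeats are exchanged continuously between all nodes
• heartbeat bandwidth grows by a factor of 2-3

Reason 2: cluster size
• Redis recommends keeping a cluster under about 1000 nodes
• 16384 / 1000 ≈ 16 slots per node
• that granularity is already fine enough

Reason 3: slot migration cost
• more slots mean more metadata to manage during migration
• 16384 slots already give enough flexibility

Reason 4: cheap slot computation
• CRC16 yields a value in 0-65535
• 16384 is a power of two, so slot = crc16 & 16383 (& 0x3FFF) equals crc16 % 16384
• a bit mask avoids the division

Rules of thumb:
• small cluster (3-10 masters): 16384 / 3 ≈ 5461 slots per node
• medium cluster (10-50 masters): 16384 / 20 ≈ 819 slots per node
• large cluster (50-100 masters): 16384 / 100 ≈ 163 slots per node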

Key Routing in Detail

The full routing function:

// cluster.c
unsigned int keyHashSlot(char *key, int keylen) {
    int s, e;

    // ======== Phase 1: look for a hash tag ========
    // Hash tag format: prefix{tag}suffix
    // Rule: only the {tag} part is hashed

    // Scan left to right for '{'
    for (s = 0; s < keylen; s++) {
        if (key[s] == '{') break;
    }

    // Case 1: no '{' found
    if (s == keylen) {
        // Hash the whole key
        return crc16(key, keylen) & 0x3FFF;  // 0x3FFF = 16383
    }

    // ======== Phase 2: look for the first '}' after it ========
    // Start scanning right after the '{'
    for (e = s+1; e < keylen; e++) {
        if (key[e] == '}') break;
    }

    // Case 2: no '}' found, or nothing between the braces
    if (e == keylen || e == s+1) {
        // Unclosed or empty tag: hash the whole key
        return crc16(key, keylen) & 0x3FFF;
    }

    // ======== Phase 3: hash the tag ========
    // Only the bytes between the braces are hashed
    // key+s+1: skip the '{'
    // e-s-1: length without '{' and '}'
    return crc16(key+s+1, e-s-1) & 0x3FFF;
}

// CRC16 (XMODEM variant); crc16tab is the 256-entry lookup table from crc16.c
uint16_t crc16(const char *buf, int len) {
    int counter;
    uint16_t crc = 0;

    for (counter = 0; counter < len; counter++) {
        crc = (crc<<8) ^ crc16tab[((crc>>8) ^ *buf++)&0x00FF];
    }

    return crc;
}

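Worked examples (indexes are zero-based):

# Example 1: standard hash tag
Key: "user:{1001}:name"
Steps:
  1. Find '{': index 5
  2. Find '}': index 10
  3. Extract the tag: "1001" (indexes 6-9)
  4. CRC16("1001") = 48159 (0xBC1F)
  5. 48159 & 16383 = 15391
  Result: slot 15391

# Example 2: several {} pairs
Key: "user:{1001}:{2002}:name"
Steps:
  1. Find the first '{': index 5
  2. Find the first '}' after it: index 10
  3. Extract the tag: "1001" (only the first pair counts)
  4. CRC16("1001") = 48159
  5. Slot 15391

# Example 3: empty {}
Key: "user:{}:name"
Steps:
  1. Find '{': index 5
  2. Find '}': index 6
  3. e == s+1 (empty tag)
  4. Hash the whole key: CRC16("user:{}:name")
  Result: the slot of the full key (check with CLUSTER KEYSLOT)

# Example 4: unclosed brace
Key: "user:{1001:name"
Steps:
  1. Find '{': index 5
  2. Find '}': not found
  3. e == keylen (unclosed)
  4. Hash the whole key: CRC16("user:{1001:name")
  Result: the slot of the full key (check with CLUSTER KEYSLOT)

# Example 5: nested {}
Key: "user:{10{01}}:name"
Steps:
  1. Find the first '{': index 5
  2. Find the first '}' after it: index 11 (the inner one; the scan does not match pairs)
  3. Extract the tag: "10{01" (indexes 6-10)
  4. Hash the tag: CRC16("10{01")
  Result: the slot of "10{01" (check with CLUSTER KEYSLOT)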
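
These cases are easy to check offline. The sketch below reimplements the routing rule in a self-contained form: `crc16_bitwise` computes the same CRC16 (XMODEM: polynomial 0x1021, initial value 0) bit by bit instead of through the lookup table, and `key_slot` applies the hash tag rule of `keyHashSlot`. Both function names are chosen for this example and are not Redis identifiers.

```c
#include <stdint.h>
#include <string.h>

/* CRC16 XMODEM computed bit by bit: polynomial 0x1021, initial value 0. */
uint16_t crc16_bitwise(const char *buf, int len) {
    uint16_t crc = 0;
    for (int i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Same hash tag rule as keyHashSlot: hash only the bytes between the first
 * '{' and the first '}' after it, unless that span is missing or empty. */
unsigned int key_slot(const char *key, int keylen) {
    int s, e;
    for (s = 0; s < keylen; s++)
        if (key[s] == '{') break;
    if (s == keylen) return crc16_bitwise(key, keylen) & 0x3FFF;

    for (e = s + 1; e < keylen; e++)
        if (key[e] == '}') break;
    if (e == keylen || e == s + 1) return crc16_bitwise(key, keylen) & 0x3FFF;

    return crc16_bitwise(key + s + 1, e - s - 1) & 0x3FFF;
}
```

A quick sanity check is that `key_slot("user:{1001}:name", 16)` and `key_slot("user:{1001}:orders", 18)` return the same value. On a live server, `CLUSTER KEYSLOT <key>` reports the slot Redis itself computes.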

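Hash Tags in Practice

A typical use case:

import java.util.List;
import java.util.Map;
import java.util.Set;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;
import redis.clients.jedis.Transaction;

// Scenario 1: keep all data of one user in the same slot.
// Assumption: jedis is connected to the node that owns the user's slot.
// User and mergeResults are application-level placeholders.
public class UserService {

    private final Jedis jedis;

    public UserService(Jedis jedis) {
        this.jedis = jedis;
    }

    public void saveUser(User user) {
        // Every key shares the tag {userId}, so all of them hash to one slot
        String prefix = "user:{" + user.getId() + "}:";

        jedis.hset(prefix + "profile", "name", user.getName());
        jedis.hset(prefix + "profile", "age", String.valueOf(user.getAge()));
        jedis.lpush(prefix + "orders", "order123", "order456");
        jedis.sadd(prefix + "tags", "vip", "active");

        // Same slot, so the keys can take part in one transaction
        Transaction tx = jedis.multi();
        tx.hset(prefix + "profile", "updated", "true");
        tx.incr(prefix + "login_count");
        tx.exec();  // ✅ all keys are in one slot
    }

    // Batch read
    public Map<String, String> getUserInfo(String userId) {
        String prefix = "user:{" + userId + "}:";

        // All keys live on one node, so one pipeline round trip fetches them
        Pipeline p = jedis.pipelined();
        Response<Map<String, String>> profile = p.hgetAll(prefix + "profile");
        Response<List<String>> orders = p.lrange(prefix + "orders", 0, -1);
        Response<Set<String>> tags = p.smembers(prefix + "tags");
        p.sync();

        // Return the merged result
        return mergeResults(profile.get(), orders.get(), tags.get());
    }
}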

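Hash tag pitfalls:

# Pitfall: data skew

Scenario: an e-commerce system that uses the seller ID as the tag
shop:{seller:1001}:product:1
shop:{seller:1001}:product:2
...
shop:{seller:1001}:product:10000  # 10,000 products

# Consequences:
# • every product of this seller lands in the same slot
# • one node holds far more data than the others
# • load becomes unbalanced

Fix 1: second-level sharding
shop:{seller:1001:shard:0}:product:1-100
shop:{seller:1001:shard:1}:product:101-200
shop:{seller:1001:shard:2}:product:201-300

# The shards spread across different slots

Fix 2: monitoring plus manual adjustment
# Check the data distribution regularly
# Adjust the keying strategy as soon as skew shows up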
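
Second-level sharding amounts to putting a shard number inside the braces. Here is a minimal sketch of a key builder; the function name `sharded_product_key` is hypothetical, and it picks the shard with a modulo on the product ID rather than the range split shown above, which is an assumption made to keep the example short.

```c
#include <stdio.h>

/* Hypothetical helper: builds "shop:{seller:<id>:shard:<n>}:product:<pid>".
 * The shard is product_id % shards (an assumption; any stable rule works).
 * Returns the key length, or -1 if shards is 0 or the buffer is too small. */
int sharded_product_key(char *out, size_t outlen,
                        unsigned long seller_id, unsigned long product_id,
                        unsigned int shards) {
    if (shards == 0) return -1;
    unsigned int shard = (unsigned int)(product_id % shards);
    int n = snprintf(out, outlen, "shop:{seller:%lu:shard:%u}:product:%lu",
                     seller_id, shard, product_id);
    return (n < 0 || (size_t)n >= outlen) ? -1 : n;
}
```

The trade-off is that keys of one seller no longer share a slot, so multi-key commands and transactions only work within a single shard.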

Gossip Protocol in Depth

Gossip Message Format

The binary protocol:

// cluster.h
#define CLUSTERMSG_TYPE_PING 0
#define CLUSTERMSG_TYPE_PONG 1
#define CLUSTERMSG_TYPE_MEET 2
#define CLUSTERMSG_TYPE_FAIL 3
#define CLUSTERMSG_TYPE_PUBLISH 4
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST 5
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK 6
#define CLUSTERMSG_TYPE_UPDATE 7
#define CLUSTERMSG_TYPE_MFSTART 8

typedef struct {
    char sig[4];                            // "RCmb" (Redis Cluster message bus)
    uint32_t totlen;                        // total length of this message
    uint16_t ver;                           // protocol version (currently 1)
    uint16_t port;                          // sender's client port
    uint16_t type;                          // message type
    uint16_t count;                         // number of gossip entries
    uint64_t currentEpoch;                  // sender's current epoch
    uint64_t configEpoch;                   // config epoch (of the sender's master if it is a slave)
    uint64_t offset;                        // replication offset
    char sender[CLUSTER_NAMELEN];           // sender's node ID (40 bytes)
    unsigned char myslots[CLUSTER_SLOTS/8]; // sender's slots (2KB)
    char slaveof[CLUSTER_NAMELEN];          // master ID, if the sender is a slave
    char myip[NET_IP_STR_LEN];              // sender's IP address
    char notused1[34];                      // reserved for future use
    uint16_t cport;                         // sender's cluster bus port
    uint16_t flags;                         // sender's node flags
    unsigned char state;                    // cluster state as seen by the sender
    unsigned char mflags[3];                // message flags
    union clusterMsgData data;              // message body (variable length)
} clusterMsg;

// One gossip entry
typedef struct {
    char nodename[CLUSTER_NAMELEN];         // node ID
    uint32_t ping_sent;                     // time the last PING was sent (seconds)
    uint32_t pong_received;                 // time the last PONG was received (seconds)
    char ip[NET_IP_STR_LEN];                // IP
    uint16_t port;                          // client port
    uint16_t cport;                         // cluster bus port
    uint16_t flags;                         // node flags
    uint32_t notused1;                      // reserved
} clusterMsgDataGossip;

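Message size calculation:

PING message size:
Fixed header:
  sig: 4 bytes
  totlen: 4 bytes
  ver: 2 bytes
  port: 2 bytes
  type: 2 bytes
  count: 2 bytes
  currentEpoch: 8 bytes
  configEpoch: 8 bytes
  offset: 8 bytes
  sender: 40 bytes
  myslots: 2048 bytes  ← the largest field
  slaveof: 40 bytes
  myip: 46 bytes
  notused1: 34 bytes
  cport: 2 bytes
  flags: 2 bytes
  state: 1 byte
  mflags: 3 bytes
  Subtotal: 2256 bytes

Gossip data (variable length):
Per entry:
  nodename: 40 bytes
  ping_sent: 4 bytes
  pong_received: 4 bytes
  ip: 46 bytes
  port: 2 bytes
  cport: 2 bytes
  flags: 2 bytes
  notused1: 4 bytes
  Subtotal: 104 bytes

With 10 entries: 10 × 104 = 1040 bytes

Total: 2256 + 1040 = 3296 bytes ≈ 3.2KB

Frequency (default settings): each node pings one randomly sampled node per second, plus any node it has not heard from within cluster-node-timeout/2
6-node cluster: roughly 6-12 PINGs per second cluster-wide, each answered by a PONG; each message carries 3 gossip entries (2256 + 312 = 2568 bytes ≈ 2.5KB)
Total traffic: at most about 24 messages × 2.5KB ≈ 60KB/s (negligible)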
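
The size arithmetic can be written down as code so it is easy to redo for other gossip counts. This is a sketch under two assumptions: the field list is the `clusterMsg` and `clusterMsgDataGossip` layout shown earlier, counting every field including the reserved `notused1` bytes, and the compiler inserts no padding between them. The function names are hypothetical.

```c
#include <stddef.h>

enum {
    CLUSTER_SLOTS   = 16384,
    CLUSTER_NAMELEN = 40,
    NET_IP_STR_LEN  = 46
};

/* Bytes in the fixed clusterMsg header, field by field. */
size_t cluster_msg_header_size(void) {
    return 4                   /* sig */
         + 4                   /* totlen */
         + 2 + 2 + 2 + 2       /* ver, port, type, count */
         + 8 + 8 + 8           /* currentEpoch, configEpoch, offset */
         + CLUSTER_NAMELEN     /* sender */
         + CLUSTER_SLOTS / 8   /* myslots */
         + CLUSTER_NAMELEN     /* slaveof */
         + NET_IP_STR_LEN      /* myip */
         + 34                  /* notused1 */
         + 2 + 2               /* cport, flags */
         + 1 + 3;              /* state, mflags */
}

/* Bytes in one gossip entry. */
size_t gossip_entry_size(void) {
    return CLUSTER_NAMELEN     /* nodename */
         + 4 + 4               /* ping_sent, pong_received */
         + NET_IP_STR_LEN      /* ip */
         + 2 + 2 + 2           /* port, cport, flags */
         + 4;                  /* notused1 */
}

/* Total size of a PING or PONG carrying the given number of gossip entries. */
size_t ping_size(int gossip_entries) {
    return cluster_msg_header_size()
         + (size_t)gossip_entries * gossip_entry_size();
}
```

`ping_size(10)` gives the size of a message with ten gossip entries, and `ping_size(3)` the size typical for a six-node cluster.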

PING/PONG Contents

What a PING message carries:

Sender: Node1
Time: T0

┌──────────────────────────────────────────────────┐
│              PING message contents               │
├──────────────────────────────────────────────────┤
│ sig: "RCmb"                                      │
│ type: PING (0)                                   │
│ sender: 07c37dfeb235213a872192d90877d0cd...      │
│ currentEpoch: 100                                │
│ configEpoch: 1                                   │
│ offset: 1234567  (replication offset)            │
│ myip: 127.0.0.1                                  │
│ port: 6379                                       │
│ cport: 16379                                     │
│ flags: MASTER (1)                                │
│ state: CLUSTER_OK                                │
├──────────────────────────────────────────────────┤
│ myslots (2KB bitmap, lowest slot first):         │
│   [11111111][11111111]...[11111000][00000000]    │
│   owns slots 0-5460                              │
├──────────────────────────────────────────────────┤
│ count: 3  (3 gossip entries follow)              │
├──────────────────────────────────────────────────┤
│ Gossip[0]: about Node2                           │
│   nodename: 67ed2db8...                          │
│   ping_sent: 1698307200  (seconds)               │
│   pong_received: 1698307201                      │
│   ip: 127.0.0.1                                  │
│   port: 6380                                     │
│   cport: 16380                                   │
│   flags: SLAVE                                   │
├──────────────────────────────────────────────────┤
│ Gossip[1]: about Node5                           │
│   nodename: 5a4b3c2d...                          │
│   ping_sent: 1698307195                          │
│   pong_received: 1698307180  ← 21 s ago          │
│   flags: MASTER | PFAIL  ← suspected down        │
├──────────────────────────────────────────────────┤
│ Gossip[2]: about Node3                           │
│   ...                                            │
└──────────────────────────────────────────────────┘

How the receiver handles it:
When Node2 receives the message it:
1. Updates what it knows about Node1 (IP, port, slots, epochs)
2. Records that data arrived from Node1 (data_received); pong_received is refreshed only when a PONG arrives
3. Processes the gossip entries:
   • Node5 is flagged PFAIL → add a failure report (only reports sent by masters count)
   • check whether enough reports exist to mark Node5 as FAIL
4. Replies with a PONG
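
The check in step 3 is a majority vote among masters. As a rough sketch of that rule: a node is promoted from PFAIL to FAIL once the failure reports from masters, plus the local node's own opinion if it is a master, reach a majority of the masters that serve slots. The function name and parameters below are hypothetical; the real logic lives in cluster.c and also expires old reports.

```c
#include <stdbool.h>

/* Returns true when a PFAIL node can be promoted to FAIL.
 * masters_with_slots: masters serving at least one slot (clusterState.size)
 * reports_from_other_masters: still-valid failure reports received via gossip
 * i_am_master: whether the local node is a master and itself sees PFAIL */
bool can_mark_fail(int masters_with_slots, int reports_from_other_masters,
                   bool i_am_master) {
    int needed_quorum = masters_with_slots / 2 + 1;   /* strict majority */
    int failures = reports_from_other_masters + (i_am_master ? 1 : 0);
    return failures >= needed_quorum;
}
```

With three masters the quorum is two, so one report from another master plus the local master's own PFAIL flag is enough.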

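Propagation Algorithm

How gossip entries are selected:

// cluster.c (condensed sketch of clusterSendPing; buffer allocation and error handling elided)
void clusterSendPing(clusterLink *link, int type) {
    clusterMsg *hdr;     // points to a buffer large enough for the entries (allocation elided)
    int totlen;          // header size + gossipcount entries (computation elided)
    int gossipcount = 0;
    int wanted;
    // Upper bound on usable entries: every known node except ourselves and the receiver
    int freshnodes = dictSize(server.cluster->nodes) - 2;

    // ======== 1. How many gossip entries to carry ========
    // One tenth of the known nodes, at least 3, at most freshnodes
    wanted = floor(dictSize(server.cluster->nodes)/10);
    if (wanted < 3) wanted = 3;
    if (wanted > freshnodes) wanted = freshnodes;

    // ======== 2. Pick random nodes ========
    int maxiterations = wanted*3;

    while(freshnodes > 0 && gossipcount < wanted && maxiterations--) {
        dictEntry *de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);

        if (this == myself) continue;                    // the header already describes us
        if (this->flags & CLUSTER_NODE_PFAIL) continue;  // PFAIL nodes are added in step 3

        // Skip nodes in handshake, without an address, or disconnected and slotless
        if (this->flags & (CLUSTER_NODE_HANDSHAKE|CLUSTER_NODE_NOADDR) ||
            (this->link == NULL && this->numslots == 0)) {
            freshnodes--;
            continue;
        }

        // Do not add the same node twice
        if (clusterNodeIsInGossipSection(hdr, gossipcount, this)) continue;

        clusterSetGossipEntry(hdr, gossipcount, this);
        freshnodes--;
        gossipcount++;
    }

    // ======== 3. Failure information has priority ========
    // Redis 4.0 and later append every PFAIL node after the random picks,
    // so failure reports spread faster than ordinary gossip (loop elided).

    // ======== 4. Fill in the header (clusterBuildMessageHdr in the source) ========
    // A slave advertises the configEpoch and slots of its master
    clusterNode *master = (nodeIsSlave(myself) && myself->slaveof) ?
                          myself->slaveof : myself;
    hdr->count = htons(gossipcount);
    hdr->currentEpoch = htonu64(server.cluster->currentEpoch);
    hdr->configEpoch = htonu64(master->configEpoch);
    memcpy(hdr->myslots, master->slots, sizeof(hdr->myslots));
    // ...

    // ======== 5. Send ========
    clusterSendMessage(link, (unsigned char*)hdr, totlen);
}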
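
The entry count from step 1 is worth isolating, since it determines how the message grows with cluster size. A minimal sketch follows; the function name `gossip_entries_wanted` is made up for this example, and the result is clamped at zero as an extra safety measure for clusters of one or two nodes.

```c
/* Number of gossip entries a PING or PONG tries to carry:
 * one tenth of the known nodes, at least 3, but never more than the
 * nodes that are neither the sender nor the receiver. */
int gossip_entries_wanted(int known_nodes) {
    int freshnodes = known_nodes - 2;   /* exclude ourselves and the receiver */
    int wanted = known_nodes / 10;      /* integer division, same as floor here */
    if (wanted < 3) wanted = 3;
    if (wanted > freshnodes) wanted = freshnodes;
    return wanted < 0 ? 0 : wanted;
}
```

For a six-node cluster this yields 3 entries per message, and for a 100-node cluster 10 entries.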

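Config Epoch Mechanism

How config epochs are kept unique:

// Config epochs version each master's slot configuration. Two masters must
// never share the same value, otherwise slot conflicts could not be ordered.

// Case: two masters end up with the same configEpoch
// (simplified from cluster.c)
void clusterHandleConfigEpochCollision(clusterNode *sender) {
    // ======== Detect the collision ========
    // Both nodes are masters and advertise the same configEpoch
    if (sender->configEpoch != myself->configEpoch ||
        !nodeIsMaster(sender) || !nodeIsMaster(myself)) return;

    // Tie-break on node ID: only the node with the lexicographically
    // smaller ID moves, so the two nodes never bump at the same time
    if (memcmp(sender->name, myself->name, CLUSTER_NAMELEN) <= 0) return;

    // ======== Resolve the collision ========
    // Take the next epoch this node knows to be free
    server.cluster->currentEpoch++;
    myself->configEpoch = server.cluster->currentEpoch;

    serverLog(LL_WARNING,
        "Config epoch collision with node %.40s (%llu)."
        " Updating my config epoch to %llu",
        sender->name,
        (unsigned long long) sender->configEpoch,
        (unsigned long long) myself->configEpoch);

    // Persist the new configuration; it reaches the other nodes
    // through the regular PING/PONG headers
    clusterSaveConfigOrDie(1);
}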

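Worked example:

Slot ownership conflicts are settled by a companion rule: when two masters claim the same slot, the claim with the higher configEpoch wins.

Scenario: after a network partition, two nodes both claim slot 100

Initial state:
Partition A:
  Node1: slot 100, configEpoch=5

Partition B:
  Node2: slot 100, configEpoch=3

After the network heals:
The two nodes start talking again

Node1 → Node2: PING {
  sender: node1,
  configEpoch: 5,
  myslots: [slot 100 = 1]  (the bitmap claims slot 100)
}

Node2 receives the message:
1. Parses it: Node1 also claims slot 100
2. Compares configEpoch: 5 > 3
3. Decides: Node1 wins ✅
4. Updates its table: slots[100] = Node1
5. Clears its own bit for slot 100: myslots[100/8] &= ~(1<<(100%8))

Node2 → Node1: PING {
  configEpoch: 3,
  myslots: [slot 100 = 0]  (cleared)
}

Result:
Node1: slot 100 ✅
Node2: no longer serves slot 100

Conflict resolved!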
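
The decision Node2 makes can be sketched as a small function: a slot changes hands only when the sender's configEpoch is higher than the current owner's. The `node` struct and `apply_slot_claim` below are hypothetical and heavily pared down; the real update path in cluster.c also deals with importing slots, replicas following their master, and dirty slots.

```c
#include <stdint.h>
#include <stddef.h>

#define CLUSTER_SLOTS 16384

/* Hypothetical, pared-down node record (not the Redis clusterNode). */
typedef struct node {
    const char *name;
    uint64_t config_epoch;
    unsigned char slots[CLUSTER_SLOTS / 8];
} node;

/* Applies the sender's claim on one slot to the local slot table.
 * Returns the owner of the slot after the claim was processed. */
node *apply_slot_claim(node **table, node *sender, int slot) {
    node *owner = table[slot];
    if (owner == sender) return owner;              /* nothing to change */

    /* Unassigned slot, or the claim carries a newer configuration. */
    if (owner == NULL || sender->config_epoch > owner->config_epoch) {
        if (owner != NULL)
            owner->slots[slot / 8] &= (unsigned char)~(1 << (slot % 8));
        sender->slots[slot / 8] |= (unsigned char)(1 << (slot % 8));
        table[slot] = sender;
    }
    return table[slot];
}
```

A claim with a lower configEpoch leaves the table untouched, which is why a stale master that rejoins after a partition cannot take its old slots back.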

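Convergence Process

A full convergence example (six nodes; the timeline is illustrative):

Scenario: Node6 joins the cluster

Initial state:
Node1 knows: [Node1, Node2, Node3, Node4, Node5]
Node2 knows: [Node1, Node2, Node3, Node4, Node5]
Node3 knows: [Node1, Node2, Node3, Node4, Node5]
Node4 knows: [Node1, Node2, Node3, Node4, Node5]
Node5 knows: [Node1, Node2, Node3, Node4, Node5]
Node6 knows: [Node6]  ← new node, knows only itself

T0: the administrator runs
127.0.0.1:6379> CLUSTER MEET 127.0.0.1 6385

Node1 → Node6: MEET {sender: node1, ...}
Node6 receives it:
  • adds Node1 to nodes
  • Node6 knows: [Node1, Node6]

Node6 → Node1: PONG {sender: node6, ...}
Node1 receives it:
  • confirms that Node6 has joined
  • Node1 knows: [Node1-Node6]

T1 (100ms later): Node1 sends a regular heartbeat
Node1 pings Node2 and attaches 3 randomly chosen gossip entries: Node6, Node3, Node5

Node1 → Node2: PING {
  gossip: [
    {nodename: node6, ip: 127.0.0.1, port: 6385, ...}  ← carries Node6
    {nodename: node3, ...},
    {nodename: node5, ...}
  ]
}

Node2 receives it:
  • discovers the new node Node6
  • Node2 → Node6: MEET
  • Node2 knows: [Node1-Node6]

T2 (200ms later): Node2 sends a heartbeat
Node2 → Node3: PING {
  gossip: [{nodename: node6, ...}, ...]
}

Node3 receives it:
  • discovers Node6
  • Node3 → Node6: MEET
  • Node3 knows: [Node1-Node6]

T3-T10: propagation continues...

T10 (about 1 second later): converged
Every node knows: [Node1-Node6]

Convergence: the number of gossip rounds grows as O(log N)
6 nodes: about 1 second
100 nodes: about 2-3 seconds
(rough figures; the actual time depends on heartbeat timing and which entries get picked)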
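
The logarithmic growth can be observed with a toy simulation. The sketch below models idealized push gossip, in which every informed node tells one random other node per round. It is not Redis's exact behaviour (Redis piggybacks a few entries on each heartbeat), and the function names and the small deterministic random generator are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 1024

/* Small linear congruential generator, so a given seed always
 * produces the same run. */
static uint32_t lcg_next(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Idealized push gossip: node 0 starts informed; each round, every node
 * that was informed at the start of the round tells one random other node.
 * Returns the rounds until all n nodes are informed, or -1 on bad input. */
int gossip_rounds(int n, uint32_t seed) {
    bool informed[MAX_NODES] = { false };
    bool next[MAX_NODES];
    if (n < 1 || n > MAX_NODES) return -1;

    informed[0] = true;
    int count = 1, rounds = 0;
    while (count < n) {
        for (int i = 0; i < n; i++) next[i] = informed[i];
        for (int i = 0; i < n; i++) {
            if (!informed[i]) continue;
            int peer = (int)(lcg_next(&seed) % (uint32_t)(n - 1));
            if (peer >= i) peer++;                 /* never pick ourselves */
            if (!next[peer]) { next[peer] = true; count++; }
        }
        for (int i = 0; i < n; i++) informed[i] = next[i];
        if (++rounds > 100000) return -1;          /* safety cap */
    }
    return rounds;
}
```

Because the informed set can at most double per round, no run can finish in fewer than ceil(log2 N) rounds; comparing `gossip_rounds(6, seed)` with `gossip_rounds(100, seed)` for a few seeds shows the round count growing far more slowly than the node count.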

How a selected node is written into the gossip section:

// cluster.c
void clusterSetGossipEntry(clusterMsg *hdr, int i, clusterNode *n) {
    clusterMsgDataGossip *gossip = &(hdr->data.ping.gossip[i]);

    // Fill in the gossip entry (timestamps are sent in seconds, network byte order)
    memcpy(gossip->nodename, n->name, CLUSTER_NAMELEN);
    gossip->ping_sent = htonl(n->ping_sent/1000);
    gossip->pong_received = htonl(n->pong_received/1000);
    memcpy(gossip->ip, n->ip, sizeof(n->ip));
    gossip->port = htons(n->port);
    gossip->cport = htons(n->cport);
    gossip->flags = htons(n->flags);
    gossip->notused1 = 0;
}

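Summary

This article examined the architecture and core mechanisms of Redis Cluster:

Node data structures

📊 clusterState: every node keeps a full view of the cluster (about 512KB for the slot arrays on a 64-bit build)
🔢 clusterNode: per-node details (about 3KB per node)
🗺️ Bitmap: 2KB describes 16384 slots (32 times smaller than an int array)
📁 nodes.conf: cluster configuration persisted automatically

Hash slot mechanism

🎯 16384 slots: a balance between heartbeat size, cluster size, and migration cost
🔢 CRC16: fast to compute (& 0x3FFF)
🏷️ Hash tags: {tag} controls key placement and enables multi-key operations
📍 Three-level mapping: Key → Slot → Node (O(1) lookup)

Gossip protocol

💬 Binary protocol: "RCmb" signature + structured messages
📦 Message contents: slot bitmap (2KB) + gossip data (about 1KB)
🔄 Propagation strategy: random selection + failures first
⏱️ Convergence: O(log N) rounds, roughly 1 second for 6 nodes and 2-3 seconds for 100 nodes
🎯 Config epoch: a version number that resolves slot conflicts

Design takeaways

✅ Bitmap compression: 2KB instead of 64KB (saves bandwidth)
✅ Gossip protocol: decentralized and fault tolerant
✅ Random propagation: avoids message storms
✅ Failures first: important information spreads quickly

Understanding these low-level mechanisms helps you:

✅ Reason about the performance characteristics of a cluster
✅ Design a sound hash tag strategy
✅ Avoid data skew
✅ Understand how configuration conflicts are resolved

💡 Coming next: "Redis Cluster in Depth (Part 2): Failover, Scaling Out and In, and Production Practice"

A close look at slot migration, online scaling, and the full failover flow!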