Redis 源码分析-内部数据结构 ziplist

如果提到双向链表，我们应该很熟悉，每一个对象都有三个字段

指向前驱节点的指针
指向后向节点的指针
该节点的数据

双向链表可以在两端执行 O(1) 级别的操作，缺点是可能产生内存碎片，对空间利用率低(每个节点除了存储数据还要存指向相邻节点的指针)

而 ziplist 就是一个特殊编码的双向链表，为什么说特殊呢？

ziplist 每个节点没有使用前后指针
ziplist 整体分配一大块内存
ziplist 中的每个元素会根据实际情况分配不同大小的空间，充分体现了Redis对于存储效率的追求

ziplist

以下是官方的注释信息：

c 复制代码

 * ZIPLIST OVERALL LAYOUT
 * ======================
 *
 * The general layout of the ziplist is as follows:
 *
 * <zlbytes> <zltail> <zllen> <entry> <entry> ... <entry> <zlend>
 *
 * NOTE: all fields are stored in little endian, if not specified otherwise.
 *
 * <uint32_t zlbytes> is an unsigned integer to hold the number of bytes that
 * the ziplist occupies, including the four bytes of the zlbytes field itself.
 * This value needs to be stored to be able to resize the entire structure
 * without the need to traverse it first.
 *
 * <uint32_t zltail> is the offset to the last entry in the list. This allows
 * a pop operation on the far side of the list without the need for full
 * traversal.
 *
 * <uint16_t zllen> is the number of entries. When there are more than
 * 2^16-2 entries, this value is set to 2^16-1 and we need to traverse the
 * entire list to know how many items it holds.
 *
 * <uint8_t zlend> is a special entry representing the end of the ziplist.
 * Is encoded as a single byte equal to 255. No other normal entry starts
 * with a byte set to the value of 255.

大概意思是说，ziplist 通常长下面这个样子：

zlbytes：4 字节，ziplistBytes，标识该 ziplist 的长度
zltail：4 字节，ziplistTail，标识尾节点偏移量
zllen：2字节，ziplistLen，zipList 中的 entry 个数
zlend：1字节，ziplistEnd，结束标识，固定值 255

根据这个定义，我们可以很容易的找到这个 ziplist 的首项和尾项，起始指针偏移 10 字节就是头节点，偏移 zltail 值对应的字节就是尾节点，不过你可能有其他的问题：zllen 字段只有 2 字节，那 ziplist 是不是只能存 2 ¹⁶ -1 个元素呢，这个问题的答案我们放在最后说，另外，请注意 zlend 的默认值 255，接下来会用到

ziplist entry

我们思考一个这样的问题，链表能够进行双向便利是存储了指向相邻节点的指针，那 ziplist 没有这样的指针，该怎么进行遍历呢？

ziplist中每个元素使用一个字段标识了前一个元素的长度，一个字段标识了本元素的长度，这样只要找到其中的任意一个元素，都可以进行双向遍历了。

ziplist 中的元素我们叫 entry，其结构如下：

c 复制代码

 *
 * <prevlen from 0 to 253> <encoding> <entry>
 *
 * |00pppppp| - 1 byte
 *      String value with length less than or equal to 63 bytes (6 bits).
 *      "pppppp" represents the unsigned 6 bit length.
 * |01pppppp|qqqqqqqq| - 2 bytes
 *      String value with length less than or equal to 16383 bytes (14 bits).
 *      IMPORTANT: The 14 bit number is stored in big endian.
 * |10000000|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| - 5 bytes
 *      String value with length greater than or equal to 16384 bytes.
 *      Only the 4 bytes following the first byte represents the length
 *      up to 32^2-1. The 6 lower bits of the first byte are not used and
 *      are set to zero.
 *      IMPORTANT: The 32 bit number is stored in big endian.
 * |11000000| - 3 bytes
 *      Integer encoded as int16_t (2 bytes).
 * |11010000| - 5 bytes
 *      Integer encoded as int32_t (4 bytes).
 * |11100000| - 9 bytes
 *      Integer encoded as int64_t (8 bytes).
 * |11110000| - 4 bytes
 *      Integer encoded as 24 bit signed (3 bytes).
 * |11111110| - 2 bytes
 *      Integer encoded as 8 bit signed (1 byte).
 * |1111xxxx| - (with xxxx between 0000 and 1101) immediate 4 bit integer.
 *      Unsigned integer from 0 to 12. The encoded value is actually from
 *      1 to 13 because 0000 and 1111 can not be used, so 1 should be
 *      subtracted from the encoded 4 bit value to obtain the right value.
 * |11111111| - End of ziplist special entry.

prevlen 前一项的长度，是一个 0 ～ 253 的值
encoding 当前项的编码
data 存放真正的数据

我在最开始提到：ziplist 中的每个元素会根据实际情况分配不同大小的空间，充分体现了Redis对于存储效率的追求。这是什么意思呢？以下就是 ziplist 最繁琐的部分了：

针对 prevlen 字段，如果说前一个 entry 非常小，不足 254 字节，那么 1 个字节就可以存下，prevlen 字段就用一个字节来存储；否则，prevlen 占用 5 字节，第 1 字节定义为特殊值 254，标识该字段占用 5 个字节，真正的长度在后面 4 字节。为什么不用 255 呢，原因是 255 就是我们上面提到的 zlend 标识，同时 254 也被赋予了特殊意义。

针对 encoding 字段，更复杂，共有以下 9 种情况：

|00pppppp| - 1 byte。第1个字节最高两个 bit 是00，那么 <encoding> 字段只有1个字节，剩余的 6 个 bit 用来表示长度值，最高可以表示 63 (2⁶-1)。
|01pppppp|qqqqqqqq| - 2 bytes。第1个字节最高两个bit是01，那么 <encoding> 字段占2个字节，总共有14个 bit 用来表示长度值，最高可以表示 16383 (2¹⁴-1)。
|10**__**|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| - 5 bytes。第1个字节最高两个 bit 是 10，那么 encoding 字段占 5 个字节，总共使用 32 个 bit 来表示长度值（6个 bit 舍弃不用），最高可以表示 2³²-1。需要注意的是：在前三种情况下，<data>都是按字符串来存储的；从下面第4种情况开始，<data>开始变为按整数来存储了。
|11000000| - 1 byte。<encoding>字段占用1个字节，值为0xC0，后面的数据<data>存储为 2 个字节的int16_t类型。
|11010000| - 1 byte。<encoding>字段占用1个字节，值为0xD0，后面的数据<data>存储为 4 个字节的int32_t类型。
|11100000| - 1 byte。<encoding>字段占用1个字节，值为0xE0，后面的数据<data>存储为 8 个字节的int64_t类型。
|11110000| - 1 byte。<encoding>字段占用1个字节，值为0xF0，后面的数据<data>存储为 3 个字节长的整数。
|11111110| - 1 byte。<encoding>字段占用1个字节，值为0xFE，后面的数据<data>存储为 1 个字节的整数。
|1111xxxx| - - (xxxx的值在0001和1101之间)。这是一种特殊情况，xxxx从 1 到 13 一共 13 个值，这时就用这 13 个值来表示真正的数据。注意，这里是表示真正的数据，而不是数据长度了。也就是说，在这种情况下，后面不再需要一个单独的<data>字段来表示真正的数据了，而是<encoding>和<data>合二为一了。另外，由于xxxx只能取 0001 和 1101 这 13 个值了（其它可能的值和其它情况冲突了，比如0000和1110分别同前面第7种第8种情况冲突，1111跟结束标记冲突），而小数值应该从0开始，因此这13个值分别表示0到12，即xxxx的值减去 1 才是它所要表示的那个整数数据的值

关键代码分析

创建函数

c 复制代码

unsigned char *ziplistNew(void) {
    // 分配头空间和尾空间 11 字节
    unsigned int bytes = ZIPLIST_HEADER_SIZE + ZIPLIST_END_SIZE;
    unsigned char *zl = zmalloc(bytes);

    ZIPLIST_BYTES(zl) = intrev32ifbe(bytes);
    ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(ZIPLIST_HEADER_SIZE);
    ZIPLIST_LENGTH(zl) = 0;

    // 将列表尾设置为特定标识 255
    zl[bytes - 1] = ZIP_END;
    return zl;
}

插入函数

c 复制代码

unsigned char *ziplistPush(unsigned char *zl, unsigned char *s, unsigned int slen, int where) {
    unsigned char *p;
    // 注意,这里 p 对应的不是头节点就是尾节点
    p = (where == ZIPLIST_HEAD) ? ZIPLIST_ENTRY_HEAD(zl) : ZIPLIST_ENTRY_END(zl);
    return __ziplistInsert(zl, p, s, slen);
}

c 复制代码

/**
 * 对 zipilst 执行插入操作，在指定的位置 p 插入一段新的数据
 * @param zl
 * @param p     ziplist 中某一个数据项的起始位置，或者在向尾端插入的时候，它指向 ziplist 的结束标记 <zlend>
 * @param s     待插入数据的地址指针
 * @param slen  待插入数据长度
 */
unsigned char *__ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen) {
    size_t curlen = intrev32ifbe(ZIPLIST_BYTES(zl)), reqlen;
    unsigned int prevlensize, prevlen = 0;
    size_t offset;
    int nextdiff = 0;
    unsigned char encoding = 0;
    long long value = 123456789; /* initialized to avoid warning. Using a value
                                    that is easy to see if for some reason
                                    we use it uninitialized. */
    zlentry tail;

    // 查找前一个条目的长度
    if (p[0] != ZIP_END) {
        // 如果 p 不是 ziplist 的结束标志，从 p 中解码出前一个条目的长度
        ZIP_DECODE_PREVLEN(p, prevlensize, prevlen);
    } else {
        // 如果 p 是结束标志，获取 ziplist 的尾部条目并提取该尾部条目的长度
        unsigned char *ptail = ZIPLIST_ENTRY_TAIL(zl);
        if (ptail[0] != ZIP_END) {
            prevlen = zipRawEntryLength(ptail);
        }
    }

    // 尝试将待插入的条目 s 编码为整数。如果成功，则返回所需的编码大小；否则，使用字符串长度
    if (zipTryEncoding(s, slen, &value, &encoding)) {
        // 根据 encoding 编码类型,返回该编码类型对应的数据部分需要的字节数
        reqlen = zipIntSize(encoding);
    } else {
        reqlen = slen;
    }

    // 还需要记录 （记录上一个条目的长度需要的字节数 1 or 5）
    reqlen += zipStorePrevEntryLength(NULL, prevlen);
    // 当前条目的实际长度
    reqlen += zipStoreEntryEncoding(NULL, encoding, slen);

    // 判断下一个条目是否能够容纳当前条目的长度。如果不能，需要调整下一个条目的长度字节数所需的空间。
    int forcelarge = 0;
    nextdiff = (p[0] != ZIP_END) ? zipPrevLenByteDiff(p, reqlen) : 0;
    if (nextdiff == -4 && reqlen < 4) {
        nextdiff = 0;
        forcelarge = 1;
    }

    // 计算插入位置的偏移量，并调整 ziplist 的大小，以容纳新的条目
    offset = p - zl;
    zl = ziplistResize(zl, curlen + reqlen + nextdiff);
    p = zl + offset;

    // 如果 p 不是 ziplist 结束，则移动内存以为新的条目腾出空间
    if (p[0] != ZIP_END) {
        /* Subtract one because of the ZIP_END bytes */
        memmove(p + reqlen, p - nextdiff, curlen - offset - 1 + nextdiff);

        /* Encode this entry's raw length in the next entry. */
        if (forcelarge)
            zipStorePrevEntryLengthLarge(p + reqlen, reqlen);
        else
            zipStorePrevEntryLength(p + reqlen, reqlen);

        /* Update offset for tail */
        ZIPLIST_TAIL_OFFSET(zl) =
                intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl)) + reqlen);

        /* When the tail contains more than one entry, we need to take
         * "nextdiff" in account as well. Otherwise, a change in the
         * size of prevlen doesn't have an effect on the *tail* offset. */
        zipEntry(p + reqlen, &tail);
        if (p[reqlen + tail.headersize + tail.len] != ZIP_END) {
            ZIPLIST_TAIL_OFFSET(zl) =
                    intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl)) + nextdiff);
        }
    } else {
        // 更新尾指针以反映新条目的插入
        ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(p - zl);
    }

    // 如果 nextdiff 不为零，表示后续条目的长度发生了变化，因此需要级联更新
    if (nextdiff != 0) {
        offset = p - zl;
        zl = __ziplistCascadeUpdate(zl, p + reqlen);
        p = zl + offset;
    }

    // 写入新条目
    p += zipStorePrevEntryLength(p, prevlen);
    p += zipStoreEntryEncoding(p, encoding, slen);
    if (ZIP_IS_STR(encoding)) {
        memcpy(p, s, slen);
    } else {
        zipSaveInteger(p, value, encoding);
    }

    // 增加 ziplist 的长度计数并返回更新后的 ziplist
    ZIPLIST_INCR_LENGTH(zl, 1);
    return zl;
}

主要逻辑分析如下：

查找前一个元素的长度，计算 prevlen
计算当前数据项占用的总字节数reqlen，它包含三部分：<prevlen>, <encoding>和真正的数据。其中的数据部分会通过调用zipTryEncoding先来尝试转成整数，否则直接返回字符串长度
判断下一个条目的 prevlen 是否能够容纳当前条目的长度。如果不能，需要调整下一个条目的长度字节数所需的空间，同时由于下一个条目的 prevlen 长度变化，可能导致后续元素的 prevlen 也需要进行级联更新
计算出了最终需要的空间，调用 ziplistResize 去扩容，并把 p 后的元素继续移动，插入新元素，增加 ziplist 中的元素数量并返回更新后的 ziplist

hash 随着数据的增大，其底层数据结构的实现是会发生变化的，当然存储效率也就不同。在 field 比较少，各个 value 值也比较小的时候，hash 采用 ziplist 来实现；而随着 field 增多和 value 值增大，hash 可能会变成dict 来实现，原因如下：

ziplist 查询元素时只能进行遍历，效率较低，在整体数量不多时不明显，一旦数量过大，就会影响效率
当元素越来越多或越来越大时，插入元素可能会引起较大的内存拷贝，如由于 prevlen 而引起的级联更新

而这个大或者多的定义是什么呢？

最大键值对数量(注意是键值对，如果是 ziplist 中的 zllen，其实是 1024)

c 复制代码

hash_max_ziplist_entries 	// 512

最大元素长度，插入的任意一个 value 的长度超过 64

c 复制代码

hash_max_ziplist_value		// 64

回答最开始的问题：zllen 只有 2 字节，所以最多存 2¹⁶ - 1 个元素吗，答案是否定的，如果 zllen = 2¹⁶ - 1，此时该字段就无效了，如果想知道具体数量，就需要遍历来查了。