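A Novice Studies the Redis Source (1): Dynamic Strings, SDS

Following the reading approach described in Huang Jianhong's article "How to Read the Redis Source Code?" (《如何阅读 Redis 源码？》), this novice is using this blog post to record the process of studying the Redis data structure "dynamic string", sds.

I. Data Structure

The Redis source has two sds-related files, sds.c and sds.h, and sds itself is defined in sds.h: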

typedef char *sds;

/* Note: sdshdr5 is never used, we just access the flags byte directly.
 * However is here to document the layout of type 5 SDS strings. */
struct __attribute__ ((__packed__)) sdshdr5 {
    unsigned char flags; /* 3 lsb of type, and 5 msb of string length */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len; /* used */
    uint8_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr16 {
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {
    uint32_t len; /* used */
    uint32_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {
    uint64_t len; /* used */
    uint64_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};

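As the source above shows, sds is defined as a char pointer (typedef char *sds). But an sds is not a plain C string, where '\0' alone marks the end of the string. It points to the char buf[] member of an sdshdr header, and the sdshdr records metadata about the sds.

Apart from the unused sdshdr5, every sdshdr contains the four members len, alloc, flags and buf, which record the used length, the allocated length, the type and the actual string bytes, always in that order. Given the pointer to buf (the sds), the other three members can be reached by subtracting fixed offsets. This is why sdshdr is declared with __attribute__ ((__packed__)): it tells the compiler not to insert alignment padding, so the struct occupies exactly the bytes of its members. For more on this attribute, see "attribute((packed)) explained" (《attribute((packed))详解》).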
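To convince myself of this layout, here is a minimal sketch, assuming GCC or Clang (for __attribute__). The demo_hdr8 struct and the demo_* helpers are my own illustrative names, not part of Redis. They only mimic sdshdr8 to show that the header sits directly in front of buf and can be reached by stepping back from the string pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for sdshdr8, with the same member order. */
struct __attribute__ ((__packed__)) demo_hdr8 {
    uint8_t len;
    uint8_t alloc;
    unsigned char flags;
    char buf[];
};

/* Allocate header + text + '\0' and return a pointer to buf, like an sds. */
static char *demo_new(const char *text) {
    size_t n = strlen(text);
    assert(n <= UINT8_MAX);             /* must fit in the 8-bit len field */
    struct demo_hdr8 *h = malloc(sizeof(struct demo_hdr8) + n + 1);
    if (h == NULL) return NULL;
    h->len = (uint8_t)n;
    h->alloc = (uint8_t)n;
    h->flags = 1;                       /* plays the role of SDS_TYPE_8 */
    memcpy(h->buf, text, n + 1);        /* copies the trailing '\0' too */
    return h->buf;
}

/* Step back from the buf pointer by the header size to reach the header. */
static struct demo_hdr8 *demo_hdr(char *s) {
    return (struct demo_hdr8 *)(s - sizeof(struct demo_hdr8));
}

/* The block was allocated at the header, so free it from there. */
static void demo_free(char *s) {
    if (s != NULL) free(demo_hdr(s));
}
```

Because the struct is packed, sizeof(struct demo_hdr8) is 3, and s[-1] is always the flags byte regardless of which header type precedes the string.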

Why does Redis define five kinds of sdshdr? In earlier versions there was only one:

struct sdshdr {
    int len;
    int free;
    char buf[];
};

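With this layout the sdshdr header has a fixed length of 8 bytes (with a 4-byte int), so when the stored string is short, the header accounts for a large share of the memory used. Starting with version 3.2, Redis optimizes this by picking a suitable sdshdr according to the length of the sds: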

static inline char sdsReqType(size_t string_size) {
    if (string_size < 1<<5) 
        return SDS_TYPE_5;
    if (string_size < 1<<8) 
        return SDS_TYPE_8;
    if (string_size < 1<<16)
        return SDS_TYPE_16;
#if (LONG_MAX == LLONG_MAX)
    if (string_size < 1ll<<32)
        return SDS_TYPE_32;
    return SDS_TYPE_64;
#else
    return SDS_TYPE_32;
#endif
}
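To see what this selection saves, the sketch below measures the header size each length range ends up with. It assumes GCC or Clang and a 64-bit build. The demo* structs are hypothetical copies of the packed layouts used only for sizeof, and demo_header_bytes is my own helper, not a Redis function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copies of the packed layouts, used only to measure sizeof. */
struct __attribute__ ((__packed__)) demo5  { unsigned char flags; char buf[]; };
struct __attribute__ ((__packed__)) demo8  { uint8_t  len, alloc; unsigned char flags; char buf[]; };
struct __attribute__ ((__packed__)) demo16 { uint16_t len, alloc; unsigned char flags; char buf[]; };
struct __attribute__ ((__packed__)) demo32 { uint32_t len, alloc; unsigned char flags; char buf[]; };
struct __attribute__ ((__packed__)) demo64 { uint64_t len, alloc; unsigned char flags; char buf[]; };

/* Header bytes needed for a string of the given length, using the same
 * thresholds as sdsReqType on a 64-bit build. */
static size_t demo_header_bytes(uint64_t string_size) {
    if (string_size < (1u << 5))    return sizeof(struct demo5);
    if (string_size < (1u << 8))    return sizeof(struct demo8);
    if (string_size < (1u << 16))   return sizeof(struct demo16);
    if (string_size < (1ull << 32)) return sizeof(struct demo32);
    return sizeof(struct demo64);
}
```

Under these assumptions the five headers take 1, 3, 5, 9 and 17 bytes, so a 10-byte string pays 1 byte of header instead of the 8 bytes of the old fixed layout.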

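II. Features

Using sds has the following benefits:

1. Binary safe

A plain C string uses the special character '\0' to decide where the string ends, so a string that itself contains '\0' gets cut short. For example, for str = "1234\0abc", strlen(str) = 4, so plain C strings are not binary safe. sds needs no special character to mark the end, because the actual length is stored in len, so sds is binary safe. This means an sds can store any kind of data, such as a JPEG image or a serialized Ruby object.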
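The difference can be checked with a few lines of C. This is only a sketch: demo_blob is a hypothetical length-plus-pointer pair standing in for the idea behind sdshdr's len, not a Redis type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-prefixed view of a byte buffer (not a Redis type). */
struct demo_blob {
    size_t len;          /* number of payload bytes, like sdshdr's len */
    const char *bytes;   /* payload, may contain '\0' anywhere */
};

/* Eight payload bytes with a '\0' in the middle. sizeof also counts the
 * terminator the compiler appends, so demo_wrap subtracts 1. */
static const char demo_raw[] = "1234\0abc";

static struct demo_blob demo_wrap(void) {
    struct demo_blob b = { sizeof(demo_raw) - 1, demo_raw };
    return b;
}
```

strlen(demo_raw) stops at the embedded '\0' and reports 4, while demo_wrap().len keeps all 8 bytes reachable.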

2. O(1) length lookup

Getting the length of an sds only requires reading the len member of its sdshdr, so the time complexity is O(1):

#define SDS_TYPE_5  0
#define SDS_TYPE_8  1
#define SDS_TYPE_16 2
#define SDS_TYPE_32 3
#define SDS_TYPE_64 4
#define SDS_TYPE_MASK 7
#define SDS_TYPE_BITS 3
// Get a pointer to the sdshdr that owns sds s
#define SDS_HDR(T,s) ((struct sdshdr##T *)((s)-(sizeof(struct sdshdr##T))))
#define SDS_TYPE_5_LEN(f) ((f)>>SDS_TYPE_BITS)
static inline size_t sdslen(const sds s) {
    unsigned char flags = s[-1]; // the flags member of the sdshdr
    switch(flags&SDS_TYPE_MASK) { // derive the sdshdr type from flags
        case SDS_TYPE_5:
            return SDS_TYPE_5_LEN(flags);
        case SDS_TYPE_8:
            return SDS_HDR(8,s)->len;
        case SDS_TYPE_16:
            return SDS_HDR(16,s)->len;
        case SDS_TYPE_32:
            return SDS_HDR(32,s)->len;
        case SDS_TYPE_64:
            return SDS_HDR(64,s)->len;
    }
    return 0;
}

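A plain C string, by contrast, has to be scanned until the '\0' character is found, so the time complexity is O(n).

3. Memory preallocation

sds optimizes for appends: it speeds up append operations and reduces the number of memory allocations. The cost is some extra memory, and that memory is not released proactively.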
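To get a feel for how much preallocation helps, here is a simplified model that counts how often the buffer has to grow while appending one byte at a time. It assumes the growth rule of sdsMakeRoomFor analyzed later in this post (double below 1 MB, otherwise add 1 MB) and an allocator that hands back exactly the requested size. The names demo_count_growths and DEMO_MAX_PREALLOC are mine.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PREALLOC (1024*1024)

/* Count how many times the buffer must grow while appending `appends`
 * single bytes to an empty string. With greedy == 0 the buffer grows to
 * exactly the needed size; with greedy != 0 it preallocates. */
static size_t demo_count_growths(size_t appends, int greedy) {
    size_t len = 0, alloc = 0, growths = 0;
    for (size_t i = 0; i < appends; i++) {
        if (alloc - len < 1) {                 /* no free byte left */
            size_t newlen = len + 1;
            if (greedy) {
                if (newlen < DEMO_MAX_PREALLOC) newlen *= 2;
                else newlen += DEMO_MAX_PREALLOC;
            }
            alloc = newlen;
            growths++;
        }
        len++;
    }
    return growths;
}
```

In this model, 1000 single-byte appends need 1000 growths without preallocation and 9 with it.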

III. API

Let's look at several sds functions.

1. sdshdr type

The low three bits of the flags member store the type, so up to 2^3 = 8 types can be represented. sdshdr uses only five of them, 0 through 4, for sdshdr5, sdshdr8, sdshdr16, sdshdr32 and sdshdr64. To determine the sdshdr type, step back one byte from the sds pointer to read flags (s[-1]), then bitwise-AND flags with binary 00000111, i.e. decimal 7 (SDS_TYPE_MASK).
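The bit layout of flags can be sketched as follows. The DEMO_* constants mirror SDS_TYPE_MASK and SDS_TYPE_BITS, and the demo_* functions are illustrative helpers of my own, not part of sds.h.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TYPE_MASK 7   /* binary 00000111 */
#define DEMO_TYPE_BITS 3

/* Pack a type (0..4) and, for type 5 headers, a length (0..31) into flags. */
static unsigned char demo_pack_flags(unsigned char type, size_t len5) {
    assert(type <= 4 && len5 < 32);
    return (unsigned char)(type | (len5 << DEMO_TYPE_BITS));
}

/* Low 3 bits: the header type. */
static unsigned char demo_type_of(unsigned char flags) {
    return (unsigned char)(flags & DEMO_TYPE_MASK);
}

/* High 5 bits: the string length, meaningful only for type 5. */
static size_t demo_len5_of(unsigned char flags) {
    return (size_t)(flags >> DEMO_TYPE_BITS);
}
```

For example, a type 5 string of length 11 gets flags = 0 | (11 << 3) = 88. Masking with 7 gives type 0 again and shifting right by 3 gives back 11.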

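2. Owning sdshdr

Once the sdshdr type is known, stepping back from buf by the size of the corresponding struct gives the sdshdr (sdshdr5 is special and is not analyzed here). This operation is defined in the SDS_HDR macro:

// ## is the preprocessor token-pasting operator
// s: the sds; T: the header type, which must be one of 8, 16, 32, 64
#define SDS_HDR(T,s) ((struct sdshdr##T *)((s)-(sizeof(struct sdshdr##T))))

// Compute the pointer to the sdshdr that owns s and assign it to sh
#define SDS_HDR_VAR(T,s) struct sdshdr##T *sh = (void*)((s)-(sizeof(struct sdshdr##T)));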

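3. Actual length, allocated length, available length

Computing the allocated length and the available length of an sds works like computing the actual length: return the header's alloc member and alloc - len respectively. See the sdslen, sdsalloc and sdsavail functions in sds.h.

4. Creation

The following functions create an sds: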

sds sdsnewlen(const void *init, size_t initlen);
sds sdsnew(const char *init);
sds sdsempty(void);
sds sdsdup(const sds s);

α. sdsnewlen

Creates an sds from the content that the init pointer points to and the given length initlen. init may point to arbitrary content. It ends up calling _sdsnewlen to do the work.

sds sdsnewlen(const void *init, size_t initlen) {
    return _sdsnewlen(init, initlen, 0);
}

β. sdsnew

Creates an sds from a C string (terminated by '\0'). It first uses strlen to get the length of the C string and then calls sdsnewlen, which in turn calls _sdsnewlen.

// sdsnew -> sdsnewlen -> _sdsnewlen
sds sdsnew(const char *init) {
    size_t initlen = (init == NULL) ? 0 : strlen(init);
    return sdsnewlen(init, initlen);
}

γ. sdsempty

Creates an empty sds (length 0).

sds sdsempty(void) {
    return sdsnewlen("",0);
}

δ. sdsdup

Duplicates an existing sds.

sds sdsdup(const sds s) {
    return sdsnewlen(s, sdslen(s));
}

ε. _sdsnewlen

As shown above, all of the creation functions end up calling _sdsnewlen to create the sds. The source comment describes the function as follows:

Create a new sds string with the content specified by the 'init' pointer and 'initlen'. If NULL is used for 'init' the string is initialized with zero bytes. If SDS_NOINIT is used, the buffer is left uninitialized; The string is always null-terminated (all the sds strings are, always) so even if you create an sds string with: mystring = sdsnewlen("abc",3); You can print the string with printf() as there is an implicit \0 at the end of the string. However the string is binary safe and can contain \0 characters in the middle, as the length is stored in the sds header.

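Note that sds also appends a \0 at the end of the string by default, so an sds can be printed with printf(). But because sds is binary safe, if the sds contains a \0 in the middle, printf() will print only the part before it.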
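One way around the truncated output is to print by length and escape the bytes that are not printable. The helper below is only a sketch of that idea under my own naming (demo_escape). It does not reproduce any Redis function.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Write s[0..len) into out, turning every non-printable byte into \xNN.
 * out must hold at least 4*len + 1 bytes. Returns the number of
 * characters written, not counting the terminator. */
static size_t demo_escape(const char *s, size_t len, char *out) {
    size_t w = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (isprint(c)) {
            out[w++] = (char)c;
        } else {
            /* 4 characters plus the terminator snprintf adds */
            w += (size_t)snprintf(out + w, 5, "\\x%02x", (unsigned)c);
        }
    }
    out[w] = '\0';
    return w;
}
```

With the 5-byte input "ab\0cd", printf("%s") would stop after "ab", while demo_escape produces the 8-character text ab\x00cd, which can then be printed safely.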

/*
 * trymalloc: when non-zero, allocate with s_trymalloc_usable;
 * otherwise allocate with s_malloc_usable
 */
sds _sdsnewlen(const void *init, size_t initlen, int trymalloc) {
    void *sh; // pointer to the sdshdr struct
    sds s; // the sds itself, i.e. a char *
    char type = sdsReqType(initlen); // pick the sdshdr type from the data size
    /* Empty strings are usually created in order to append. Use type 8
     * since type 5 is not good at this. */
    if (type == SDS_TYPE_5 && initlen == 0) type = SDS_TYPE_8;
    int hdrlen = sdsHdrSize(type); // header size for that type
    unsigned char *fp; /* flags pointer. */
    size_t usable;

    assert(initlen + hdrlen + 1 > initlen); /* Catch size_t overflow */
    // allocate: total = header + string bytes + 1 byte for the trailing '\0'
    sh = trymalloc?
        s_trymalloc_usable(hdrlen+initlen+1, &usable) :
        s_malloc_usable(hdrlen+initlen+1, &usable);
    if (sh == NULL) return NULL;
    if (init==SDS_NOINIT)
        init = NULL;
    else if (!init)
        memset(sh, 0, hdrlen+initlen+1); // zero the whole allocation
    s = (char*)sh+hdrlen; // point s at the start of the string bytes
    fp = ((unsigned char*)s)-1;  // point fp at the flags member of the sdshdr
    usable = usable-hdrlen-1;
    if (usable > sdsTypeMaxSize(type))
        usable = sdsTypeMaxSize(type);
     
    // build the sdshdr
    switch(type) {
        case SDS_TYPE_5: {
            *fp = type | (initlen << SDS_TYPE_BITS);
            break;
        }
        case SDS_TYPE_8: {
            SDS_HDR_VAR(8,s);
            sh->len = initlen; // initialize the len field
            sh->alloc = usable; // initialize the alloc field
            *fp = type; // initialize the flags field
            break;
        }
        case SDS_TYPE_16: {
            SDS_HDR_VAR(16,s);
            sh->len = initlen;
            sh->alloc = usable;
            *fp = type;
            break;
        }
        case SDS_TYPE_32: {
            SDS_HDR_VAR(32,s);
            sh->len = initlen;
            sh->alloc = usable;
            *fp = type;
            break;
        }
        case SDS_TYPE_64: {
            SDS_HDR_VAR(64,s);
            sh->len = initlen;
            sh->alloc = usable;
            *fp = type;
            break;
        }
    }
    
    if (initlen && init)
        memcpy(s, init, initlen);  // copy the content into the sds
    s[initlen] = '\0'; // add the C-style trailing '\0'
    return s;
}

5. Concatenation

The main concatenation functions are:

sds sdscatlen(sds s, const void *t, size_t len); // append len bytes of binary-safe data pointed to by t to sds s
sds sdscatsds(sds s, const sds t); // append sds t to sds s
sds sdscat(sds s, const char *t); // append C string t to sds s

As with creation, sdscat and sdscatsds both end up calling sdscatlen:

sds sdscatlen(sds s, const void *t, size_t len) {
    size_t curlen = sdslen(s);
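    // Decide from the length to append (len) and the free space in s
    // whether more room is needed; the return value is still the
    // address of the start of the string
    s = sdsMakeRoomFor(s,len);  // 1. capacity check and growth
    if (s == NULL) return NULL;
    memcpy(s+curlen, t, len); // 2. copy the new bytes
    sdssetlen(s, curlen+len); // 3. update the sds length
    s[curlen+len] = '\0'; // 4. add the trailing '\0'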
    return s;
}

As the source shows, an sds append consists of the steps above. The most important one is the capacity check and growth in sdsMakeRoomFor:

sds sdsMakeRoomFor(sds s, size_t addlen) {
    return _sdsMakeRoomFor(s, addlen, 1);
}
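Before digging into _sdsMakeRoomFor, the four steps of sdscatlen can be tried out on a deliberately simplified string type. This sketch assumes a single fixed header (real sds picks among five) and always doubles when it has to grow. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical single-format header; real sds picks among five. */
struct demo_hdr {
    size_t len;    /* bytes in use */
    size_t alloc;  /* bytes available for the string, excluding '\0' */
    char buf[];
};

#define DEMO_HDR_SIZE offsetof(struct demo_hdr, buf)

static struct demo_hdr *demo_hdr_of(char *s) {
    return (struct demo_hdr *)(s - DEMO_HDR_SIZE);
}

/* Create a string from initlen bytes at init. */
static char *demo_newlen(const void *init, size_t initlen) {
    struct demo_hdr *h = malloc(DEMO_HDR_SIZE + initlen + 1);
    if (h == NULL) return NULL;
    h->len = h->alloc = initlen;
    if (initlen) memcpy(h->buf, init, initlen);
    h->buf[initlen] = '\0';
    return h->buf;
}

/* Append len bytes at t, following the four steps of sdscatlen. */
static char *demo_catlen(char *s, const void *t, size_t len) {
    struct demo_hdr *h = demo_hdr_of(s);
    size_t curlen = h->len;
    if (h->alloc - curlen < len) {                 /* 1. make room */
        size_t newalloc = (curlen + len) * 2;
        h = realloc(h, DEMO_HDR_SIZE + newalloc + 1);
        if (h == NULL) return NULL;                /* old string stays valid */
        h->alloc = newalloc;
        s = h->buf;
    }
    memcpy(s + curlen, t, len);                    /* 2. copy the new bytes */
    h->len = curlen + len;                         /* 3. update the length */
    s[curlen + len] = '\0';                        /* 4. terminate */
    return s;
}

static void demo_free(char *s) {
    if (s != NULL) free(demo_hdr_of(s));
}
```

As with sds, the caller must keep using the returned pointer, because growing may move the string: s = demo_catlen(s, "bar", 3).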

Here is the official comment on the _sdsMakeRoomFor function:

Enlarge the free space at the end of the sds string so that the caller is sure that after calling this function can overwrite up to addlen bytes after the end of the string, plus one more byte for nul term. If there's already sufficient free space, this function returns without any action, if there isn't sufficient free space, it'll allocate what's missing, and possibly more: When greedy is 1, enlarge more than needed, to avoid need for future reallocs on incremental growth. When greedy is 0, enlarge just enough so that there's free space for 'addlen'. Note: this does not change the length of the sds string as returned by sdslen(), but only the free buffer space we have.

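With the help of Google Translate and my own shaky English, the gist is this: _sdsMakeRoomFor guarantees that there are addlen + 1 bytes of free space at the end of s. If s already has enough free space, the function returns s unchanged. If not, it allocates what is missing, and possibly more (preallocation), depending on the greedy parameter:

greedy is 1: allocate more than what is missing, to avoid future reallocations;
greedy is 0: allocate only what is missing.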

#define SDS_MAX_PREALLOC (1024*1024)
sds _sdsMakeRoomFor(sds s, size_t addlen, int greedy) {
    void *sh, *newsh;
    // sdsavail: s->alloc - s->len, the free space left in the sds
    size_t avail = sdsavail(s);
    size_t len, newlen, reqlen;
    // read the sds type (oldtype) from flags
    char type, oldtype = s[-1] & SDS_TYPE_MASK;
    int hdrlen;
    size_t usable;

    /* Return ASAP if there is enough space left. */
    // free space >= bytes to add: no growth needed, return the string as is
    if (avail >= addlen) return s;

    // current length
    len = sdslen(s);
    // pointer to the start of the header
    sh = (char*)s-sdsHdrSize(oldtype);
    // new length
    reqlen = newlen = (len+addlen);
    assert(newlen > len);   /* Catch size_t overflow */
    if (greedy == 1) { // greedy mode
        if (newlen < SDS_MAX_PREALLOC)  // SDS_MAX_PREALLOC is 1024*1024, i.e. 1 MB
            newlen *= 2;  // below 1 MB: double the new length
        else
            newlen += SDS_MAX_PREALLOC; // 1 MB or more: preallocate 1 MB extra
    }

    // end: the code above computes the new length for s

    type = sdsReqType(newlen); // pick the sds type for the new size

    /* Don't use type 5: the user is appending to the string and type 5 is
     * not able to remember empty space, so sdsMakeRoomFor() must be called
     * at every appending operation. */
    // SDS_TYPE_5 is not used here; SDS_TYPE_8 takes its place
    if (type == SDS_TYPE_5) type = SDS_TYPE_8;

    hdrlen = sdsHdrSize(type);  // header length
    assert(hdrlen + newlen + 1 > reqlen);  /* Catch size_t overflow */
    if (oldtype==type) { // same header type as before, so the existing block can be reused
        newsh = s_realloc_usable(sh, hdrlen+newlen+1, &usable);  // resize the existing block
        if (newsh == NULL) return NULL;
        s = (char*)newsh+hdrlen;
    } else {
        /* Since the header size changes, need to move the string forward,
         * and can't use realloc */
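        // the header type changed, so the header size changed and a fresh block is needed,
        // because s_realloc keeps the data at the same offset and cannot make room in front of it
        newsh = s_malloc_usable(hdrlen+newlen+1, &usable);
        if (newsh == NULL) return NULL;
        memcpy((char*)newsh+hdrlen, s, len+1);
        s_free(sh); // free the old block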
        s = (char*)newsh+hdrlen;
        s[-1] = type;
        sdssetlen(s, len);
    }
    usable = usable-hdrlen-1;
    if (usable > sdsTypeMaxSize(type))
        usable = sdsTypeMaxSize(type);
    sdssetalloc(s, usable); // update the alloc field
    return s;
}

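Summary of the automatic growth mechanism:

α. Growth stage:

If the free space avail in the sds is greater than or equal to the length addlen of the new content, no growth is needed;
If avail is less than addlen, the sds must grow. With greedy set to 1, space is preallocated: if the total length after the append, len+addlen, is less than 1 MB, the sds grows to twice that new length; if len+addlen is 1 MB or more, it grows to the new length plus 1 MB. With greedy set to 0, it grows to exactly len+addlen.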
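The rule is easy to check with concrete numbers. The function below is my own restatement of the branch in _sdsMakeRoomFor, with the hypothetical names demo_target_len and DEMO_MAX_PREALLOC. The real code can end up with more space than this, because the allocator may return a larger usable size.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PREALLOC (1024*1024)  /* same value as SDS_MAX_PREALLOC */

/* Capacity requested when a string of length len must grow to fit addlen
 * more bytes. addlen must be greater than 0. */
static size_t demo_target_len(size_t len, size_t addlen, int greedy) {
    size_t newlen = len + addlen;
    assert(newlen > len);                 /* catch size_t overflow */
    if (greedy) {
        if (newlen < DEMO_MAX_PREALLOC) newlen *= 2;
        else newlen += DEMO_MAX_PREALLOC;
    }
    return newlen;
}
```

Appending 5 bytes to a 10-byte string requests 30 bytes in greedy mode and 15 otherwise. A new length of exactly 1 MB already takes the "plus 1 MB" branch, because the comparison is a strict less-than.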

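β. Memory allocation stage:

The sds type is chosen from the length after growth:

If the type is unchanged, s_realloc_usable simply enlarges the existing allocation;
If the type changes, memory is allocated anew for the entire sds and the original content is copied to the new location.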
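Why a changed type rules out realloc becomes clearer with two header sizes side by side. The sketch below assumes GCC or Clang and uses only two hypothetical header types (demo_h8 and demo_h16). It is a reduced model of the two branches, not the Redis implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { DEMO_T8 = 1, DEMO_T16 = 2 };   /* hypothetical type tags */

struct __attribute__ ((__packed__)) demo_h8  { uint8_t  len, alloc; unsigned char flags; char buf[]; };
struct __attribute__ ((__packed__)) demo_h16 { uint16_t len, alloc; unsigned char flags; char buf[]; };

static size_t demo_hdr_size(int type) {
    return type == DEMO_T8 ? sizeof(struct demo_h8) : sizeof(struct demo_h16);
}

static size_t demo_len(const char *s) {
    if (s[-1] == DEMO_T8)
        return ((const struct demo_h8 *)(s - sizeof(struct demo_h8)))->len;
    return ((const struct demo_h16 *)(s - sizeof(struct demo_h16)))->len;
}

/* Fill in the header in front of s. */
static void demo_set(char *s, int type, size_t len, size_t alloc) {
    if (type == DEMO_T8) {
        struct demo_h8 *h = (struct demo_h8 *)(s - sizeof(struct demo_h8));
        h->len = (uint8_t)len; h->alloc = (uint8_t)alloc;
    } else {
        struct demo_h16 *h = (struct demo_h16 *)(s - sizeof(struct demo_h16));
        h->len = (uint16_t)len; h->alloc = (uint16_t)alloc;
    }
    s[-1] = (char)type;
}

/* Create a short string with the 8-bit header. */
static char *demo_new(const char *text) {
    size_t n = strlen(text);
    assert(n < 256);
    char *block = malloc(demo_hdr_size(DEMO_T8) + n + 1);
    if (block == NULL) return NULL;
    char *s = block + demo_hdr_size(DEMO_T8);
    memcpy(s, text, n + 1);
    demo_set(s, DEMO_T8, n, n);
    return s;
}

/* Grow s so that it can hold newalloc bytes (newalloc < 65536). */
static char *demo_grow(char *s, size_t newalloc) {
    int oldtype = s[-1];
    int type = newalloc < 256 ? DEMO_T8 : DEMO_T16;
    size_t len = demo_len(s);
    char *oldblock = s - demo_hdr_size(oldtype);
    char *block;
    if (type == oldtype) {
        /* Same header size: the string keeps its offset, realloc is enough. */
        block = realloc(oldblock, demo_hdr_size(type) + newalloc + 1);
        if (block == NULL) return NULL;
    } else {
        /* Header size changes: the string must move to a new offset. */
        block = malloc(demo_hdr_size(type) + newalloc + 1);
        if (block == NULL) return NULL;
        memcpy(block + demo_hdr_size(type), s, len + 1);
        free(oldblock);
    }
    s = block + demo_hdr_size(type);
    demo_set(s, type, len, newalloc);
    return s;
}

static void demo_free(char *s) {
    if (s != NULL) free(s - demo_hdr_size(s[-1]));
}
```

Growing "hello" to a capacity of 100 keeps the 3-byte header and goes through realloc. Growing it to 1000 switches to the 5-byte header, so the string bytes land two bytes further into a fresh block.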
