[Redis] Internals: The Network Model

Fundamentals

User Space and Kernel Space

A process's address space (its range of memory addresses) is divided into two parts: kernel space and user space.

The address space depends on the CPU: a 32-bit CPU can address at most 2^32 bytes. Each address corresponds to one byte of memory, so the maximum addressable space is 4GB. Within this virtual address range, the low addresses (0 - 3GB) are user space and the high addresses (3GB - 4GB) are kernel space.

  • User space can only execute unprivileged instructions (Ring 3) and cannot access system resources directly.
  • Kernel space can execute privileged instructions (Ring 0) and access all system resources.
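The 2^32 = 4GB arithmetic above can be checked in a few lines of C. This is a standalone sketch; the 3GB/1GB split shown is the classic default for 32-bit Linux, and the helper names are ours:

```c
#include <stdint.h>
#include <assert.h>

// 2^32 addresses, one byte each, gives a 4 GiB address space;
// the classic 3/1 split leaves 3 GiB for user space (0 - 3GB)
// and 1 GiB for the kernel (3GB - 4GB).
uint64_t address_space_bytes(void) { return 1ULL << 32; }
uint64_t user_space_bytes(void)    { return 3ULL << 30; } // 0 - 3 GiB
uint64_t kernel_space_bytes(void)  { return 1ULL << 30; } // 3 - 4 GiB
```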

I/O Reads and Writes

To improve I/O efficiency, Linux adds buffers in both user space and kernel space.

  • On a write, data is copied from the user buffer to the kernel buffer, then written to the device.
  • On a read, data is read from the device into the kernel buffer, then copied to the user buffer.
    In other words, I/O involves waiting for data, plus copying data between buffers in different address spaces.

Linux I/O Models

Blocking I/O

Blocking I/O waits, blocked, during both phases.

Phase one (the user requests data from the kernel; the data has not arrived yet):

  • The user process attempts to read data (e.g. from the network card)

  • The data has not arrived yet, so the kernel must wait for it

  • Meanwhile the user process is also blocked

Phase two (data is copied from the kernel buffer to the user buffer):

  • The data arrives and is copied into the kernel buffer: it is now ready

  • The kernel data is copied to the user buffer

  • During the copy, the user process remains blocked

  • Once the copy completes, the user process unblocks and handles the data

Non-blocking I/O

Phase one is a busy-wait; phase two still blocks.

Phase one:

  • The user process attempts to read data (e.g. from the network card)

  • The data has not arrived yet, so the kernel would have to wait

  • Instead, an error is returned to the user process immediately

  • On seeing the error, the user process tries the read again

  • This repeats until the data is ready

Phase two:

  • The kernel data is copied to the user buffer

  • During the copy, the user process remains blocked

  • Once the copy completes, the user process unblocks and handles the data

Non-blocking I/O does not actually improve read performance; it merely replaces blocking with busy-waiting.
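The immediate-error behavior comes from `O_NONBLOCK`. A minimal sketch (the helper name `nonblocking_read_demo` is ours): with the pipe's read end set non-blocking, the first read fails fast with EAGAIN, and a retry after data arrives succeeds:

```c
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <assert.h>

// Returns 1 if the first read failed with EAGAIN (no data yet) and the
// second read, after data was written, succeeded; 0 otherwise.
int nonblocking_read_demo(void) {
    int fds[2];
    if (pipe(fds) == -1) return 0;
    // Put the read end into non-blocking mode.
    int flags = fcntl(fds[0], F_GETFL, 0);
    fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

    char buf[8];
    ssize_t n = read(fds[0], buf, sizeof(buf));  // no data: fails immediately
    int got_eagain = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));

    write(fds[1], "ok", 2);                      // data becomes ready
    n = read(fds[0], buf, sizeof(buf));          // retry now returns the data
    close(fds[0]); close(fds[1]);
    return got_eagain && n == 2 && memcmp(buf, "ok", 2) == 0;
}
```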

I/O Multiplexing

A single thread monitors multiple FDs and learns each FD's state.

When requesting data via recvfrom with none available, blocking I/O leaves the CPU blocked, while non-blocking I/O makes the CPU spin.

A single thread can only service I/O requests one after another.

Ways to improve efficiency:

Multithreading

An event-notification mechanism: a thread is told to handle an I/O operation only once its data is ready.

How does a user process learn whether data in the kernel is ready?

File descriptors: in Linux, everything is a file.

I/O multiplexing: using a single thread to monitor multiple FDs simultaneously and be notified when some FD becomes readable or writable, thereby avoiding useless waiting and making full use of the CPU.

Phase one:

The user process calls select, specifying the set of FDs to monitor

The kernel monitors the sockets behind those FDs

As soon as one or more sockets have data ready, select returns readable

During this the user process is blocked

Phase two:

The user process locates the ready socket(s)

It calls recvfrom on each in turn to read the data

The kernel copies the data into user space

The user process handles the data

Three multiplexing implementations:

select
C code:
// Type alias __fd_mask, essentially a long int
typedef long int __fd_mask;
/* fd_set records the set of fds to monitor, and their state */
typedef struct {
    // fds_bits is an array of longs holding __FD_SETSIZE = 1024 bits in
    // total (e.g. 1024/64 = 16 longs on a 64-bit system, where
    // __NFDBITS = 8 * sizeof(long) = 64); each bit marks one fd
    __fd_mask fds_bits[__FD_SETSIZE / __NFDBITS];
    // ...
} fd_set;
// select monitors an fd_set, i.e. a collection of fds
int select(
    int nfds,          // highest-numbered fd in any of the sets, plus 1
    fd_set *readfds,   // fds to monitor for read events
    fd_set *writefds,  // fds to monitor for write events
    fd_set *exceptfds, // fds to monitor for exceptional events
    // timeout: NULL = block indefinitely; 0 = return immediately;
    // > 0 = wait at most this long
    struct timeval *timeout
);

select is limited to 1024 FDs because it marks them with a fixed-size bitmap.

Each call copies the fd sets twice (into the kernel and back out), and finding which FDs are ready requires scanning the whole set.
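The select flow above can be exercised end to end on a pipe (a sketch; `select_demo` is our own name): mark the fd in the bitmap with FD_SET, call select, then test the bitmap with FD_ISSET:

```c
#include <sys/select.h>
#include <unistd.h>
#include <assert.h>

// Returns 1 if select() reports the pipe's read end as readable after a
// write, 0 otherwise.
int select_demo(void) {
    int fds[2];
    if (pipe(fds) == -1) return 0;
    write(fds[1], "x", 1);         // make the read end ready

    fd_set readfds;
    FD_ZERO(&readfds);             // clear the bitmap
    FD_SET(fds[0], &readfds);      // set the bit for the fd we monitor
    struct timeval tv = {1, 0};    // wait at most 1 second

    int n = select(fds[0] + 1, &readfds, NULL, NULL, &tv);
    int ready = (n == 1 && FD_ISSET(fds[0], &readfds));
    close(fds[0]); close(fds[1]);
    return ready;
}
```

Note that `readfds` is modified in place by the kernel, which is exactly the copy-back the text describes; a real server must re-fill the set before every call.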

poll

I/O flow:

① Create a pollfd array and add the fds of interest to it; the array size is chosen by the caller.

② Call poll; the pollfd array is copied into kernel space and stored as a linked list.

③ The kernel traverses the fds, checking whether each is ready.

④ When data is ready or the timeout expires, the pollfd array is copied back to user space and the number of ready fds, n, is returned.

⑤ The user process checks whether n is greater than 0.

⑥ If so, it traverses the pollfd array to find the ready fds.

C code:
// event types in pollfd
#define POLLIN     // readable
#define POLLOUT    // writable
#define POLLERR    // error
#define POLLNVAL   // fd not open

// the pollfd structure
struct pollfd {
    int fd;            /* fd to monitor */
    short int events;  /* event types to monitor: read, write, error */
    short int revents; /* event types that actually occurred */
};
// the poll function
int poll(
    struct pollfd *fds, // pollfd array, size chosen by the caller
    nfds_t nfds,        // number of elements in the array
    int timeout         // timeout in milliseconds
);

Pros and cons, compared with select:

In select the fd_set is fixed at 1024 entries, whereas poll takes a caller-sized pollfd array (held as a linked list in the kernel), so in theory there is no upper limit.

But the more FDs are monitored, the longer each traversal takes, so performance degrades as the count grows.
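Steps ①-⑥ above map onto a short sketch (the helper name `poll_demo` is ours). Note how the request (`events`) and the result (`revents`) live in the same pollfd entry, which is what gets copied back:

```c
#include <poll.h>
#include <unistd.h>
#include <assert.h>

// Returns 1 if poll() reports POLLIN on the pipe's read end, 0 otherwise.
int poll_demo(void) {
    int pipefds[2];
    if (pipe(pipefds) == -1) return 0;
    write(pipefds[1], "x", 1);       // make the read end ready

    struct pollfd fds[1];            // ① the caller-sized array
    fds[0].fd = pipefds[0];
    fds[0].events = POLLIN;          // we care about readability
    fds[0].revents = 0;

    int n = poll(fds, 1, 1000);      // ②-④ wait up to 1000 ms
    int ready = (n == 1 && (fds[0].revents & POLLIN));  // ⑤-⑥
    close(pipefds[0]); close(pipefds[1]);
    return ready;
}
```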

epoll

epoll improves on select and poll; it exposes three functions:

The three functions correspond to three steps: create the instance, add/modify FDs, and check the ready list.

C code:
struct eventpoll {
    //...
    struct rb_root  rbr;     // a red-black tree recording the FDs to monitor
    struct list_head rdlist; // a linked list recording the ready FDs
    //...
};

// 1. Create an epoll instance (an eventpoll) and return its handle, epfd
int epoll_create(int size);

// 2. Add an FD to epoll's red-black tree and register ep_poll_callback;
// when the callback fires, the FD is appended to the rdlist ready list
int epoll_ctl(
    int epfd,  // handle of the epoll instance
    int op,    // operation to perform: ADD, MOD or DEL
    int fd,    // FD to monitor
    struct epoll_event *event // event types to monitor: read, write, error, etc.
);

// 3. Check whether rdlist is empty; if not, return the number of ready FDs
int epoll_wait(
    int epfd,                   // handle of the epoll instance
    struct epoll_event *events, // empty event array that receives the ready FDs
    int maxevents,              // capacity of the events array
    int timeout   // -1 = block forever; 0 = don't block; > 0 = wait that long
);

An FD is a small non-negative integer; epoll records them in a red-black tree. ep_poll_callback is a callback whose job is to watch an fd and, when it becomes ready, append that FD to rdlist.

(In the original diagram, red marks the ready FDs.)

Note: epoll copies only the ready FDs to user space, unlike select and poll, which copy the entire monitored array or list on every call.
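The three steps can be run end to end on a pipe (a sketch; `epoll_demo` is our own name, and epoll is Linux-only):

```c
#include <sys/epoll.h>
#include <unistd.h>
#include <assert.h>

// Returns 1 if epoll_wait reports the pipe's read end as readable, 0 otherwise.
int epoll_demo(void) {
    int pipefds[2];
    if (pipe(pipefds) == -1) return 0;
    int epfd = epoll_create(1);                       // 1. create the instance
    if (epfd == -1) return 0;

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;                              // monitor read events (default LT mode)
    ev.data.fd = pipefds[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefds[0], &ev);  // 2. register the FD

    write(pipefds[1], "x", 1);                        // make the FD ready

    struct epoll_event events[4];
    int n = epoll_wait(epfd, events, 4, 1000);        // 3. fetch ready FDs
    int ready = (n == 1 && events[0].data.fd == pipefds[0] &&
                 (events[0].events & EPOLLIN));
    close(epfd); close(pipefds[0]); close(pipefds[1]);
    return ready;
}
```

Unlike select/poll, the registration in step 2 persists across calls: step 3 receives only the ready entries, without re-submitting or re-scanning the full set.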

epoll is not only more efficient at transferring results; it also supports multiple trigger modes:

Level-triggered and edge-triggered, analogous to level vs. edge triggering in digital circuits.

The two modes require different handling logic:

Level-triggered (LT)

With level triggering you need not drain all the data at once: another event will notify you to read again next time.

C code:
// Relatively simple: no risk of missing data
while (true) {
    int nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < nfds; i++) {
        if (events[i].events & EPOLLIN) {
            int fd = events[i].data.fd;
            // Read some data; it need not all be read now:
            // the next epoll_wait will notify again
            char buf[1024];
            int n = read(fd, buf, sizeof(buf));
            // Even if data remains, we will be notified again
        }
    }
}
Edge-triggered (ET)

With edge triggering, all the data must be drained after each notification.

C code:
// Must consume all available data in one pass
while (true) {
    int nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < nfds; i++) {
        if (events[i].events & EPOLLIN) {
            int fd = events[i].data.fd;
            // Must loop until the fd is drained
            while (true) {
                char buf[1024];
                int n = read(fd, buf, sizeof(buf));
                if (n <= 0) {
                    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                        // all data consumed
                        break;
                    }
                    // error, or peer closed (n == 0): clean up
                    close(fd);
                    break;
                }
                // handle the data
                process_data(buf, n);
            }
        }
    }
}

I/O Multiplexing: the Event-Notification Mechanism

When an FD has readable data, calling epoll_wait yields a notification, but there are two notification modes:

LT: while an FD still has readable data, it is reported repeatedly until the data is fully consumed. This is epoll's default mode.

ET: when an FD becomes readable, it is reported only once, whether or not the data is fully consumed.

When an FD becomes ready, ET and LT treat the list_head ready list differently: ET unlinks the ready FD from the list and then copies the event from kernel to user space, so you are effectively notified only once; LT instead re-adds the FD to list_head after the copy.

Differences between LT and ET:

In ET mode, do not read in a loop on a blocking fd: once no data is left, the read would block forever.

In LT mode, multiple threads waiting on the same FD can all be woken for one event and consume it redundantly: the thundering-herd problem.

Summary:

ET avoids the thundering-herd effect that LT can exhibit;

ET is best combined with non-blocking I/O to drain the FD, which makes it somewhat more complex than LT.
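The notify-once behavior of ET can be demonstrated directly (a sketch; `et_demo` is our own name): after a partial read, a second epoll_wait reports nothing, even though a byte is still sitting in the pipe. Under LT the second call would report the FD again:

```c
#include <sys/epoll.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>

// Returns 1 if, after a partial read in ET mode, a second epoll_wait reports
// no event even though data remains in the pipe; 0 otherwise.
int et_demo(void) {
    int pipefds[2];
    if (pipe(pipefds) == -1) return 0;
    // ET should be paired with a non-blocking fd
    fcntl(pipefds[0], F_SETFL, fcntl(pipefds[0], F_GETFL, 0) | O_NONBLOCK);

    int epfd = epoll_create(1);
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | EPOLLET;          // edge-triggered
    ev.data.fd = pipefds[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefds[0], &ev);

    write(pipefds[1], "ab", 2);             // FD becomes ready: one edge

    struct epoll_event events[4];
    int first = epoll_wait(epfd, events, 4, 1000);  // reports the edge
    char c;
    read(pipefds[0], &c, 1);                // partial read: 1 byte remains
    int second = epoll_wait(epfd, events, 4, 0);    // no new edge: reports nothing
    close(epfd); close(pipefds[0]); close(pipefds[1]);
    return first == 1 && second == 0;
}
```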

A Web Service Based on epoll

(The original flowchart is reconstructed as the steps below.) On the server side:

epoll_create creates the eventpoll instance: a red-black tree (rb_root) recording the monitored FDs, and a linked list (list_head) recording the ready FDs.

Create the ServerSocket, obtaining its FD: ssfd.

epoll_ctl registers and monitors ssfd, installing ep_poll_callback so that when an FD becomes ready it is recorded in the list_head ready list.

epoll_wait waits for FDs to become ready, i.e. checks whether list_head is non-empty.

When some FD is ready, inspect the event type: EPOLLIN on ssfd means a connection request, so accept the client socket, obtain its FD, and register it the same way; EPOLLIN on a client FD means the request data can be read, after which the response is written out; EPOLLERR indicates an error.

Signal-Driven I/O

The application associates the SIGIO signal with the kernel and installs a handler; when an FD becomes ready, the kernel raises SIGIO to notify the user process. In the meantime the application can perform other work instead of blocking.

Asynchronous I/O

Synchronous vs. Asynchronous

Whether an I/O operation is synchronous or asynchronous hinges on the copy of data between kernel space and user space (the data-transfer I/O itself), i.e. whether phase two is synchronous: in all the models above the user process waits through that copy (synchronous), whereas in asynchronous I/O the kernel completes both phases by itself and only then notifies the process.

Redis Network Model

Single-threaded or multi-threaded?

Redis's core business logic (command processing) is single-threaded; the Redis process as a whole is multi-threaded.

Key milestones:

Redis 4.0: introduced background threads to handle certain slow tasks asynchronously, e.g. the UNLINK delete command.

Redis 6.0: introduced multi-threading in the core network model to better utilize multi-core CPUs.

Why did Redis choose a single thread?

Persistence aside, Redis operates purely in memory and executes extremely fast; its bottleneck is network latency, not execution speed, so multi-threading would not bring a large performance gain.

Multi-threading causes extra context switches, adding unnecessary overhead.

Multi-threading also raises thread-safety issues, forcing locks and similar safeguards into the design: complexity rises and performance suffers.

The Single-Threaded Network Model Before 6.0

Redis improves network performance through I/O multiplexing, supports the various multiplexing implementations, and wraps them behind a unified high-performance event library, AE:

Redis thus supports each platform's I/O multiplexing facility: evport on Solaris (Sun's Unix), kqueue on the BSD family (e.g. macOS), epoll on Linux, with select as the fallback.

The file ae.c checks what the current system supports and selects the corresponding multiplexer:

C code:
#ifdef HAVE_EVPORT
#include "ae_evport.c"
#else
    #ifdef HAVE_EPOLL
    #include "ae_epoll.c"
    #else
        #ifdef HAVE_KQUEUE
        #include "ae_kqueue.c"
        #else
        #include "ae_select.c"
        #endif
    #endif
#endif

Whichever multiplexer is used, the wrappers expose essentially the same function interface, closely mirroring epoll: create, register, wait for readiness, and so on. The earlier epoll flow diagram maps onto these steps directly.
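The idea of one loop over swappable backends can be sketched in miniature. This is not Redis's actual AE API: the names (MiniLoop, mini_register, mini_tick, fake_poll) are invented for illustration, and the backend is faked so the sketch runs without sockets:

```c
#include <stddef.h>
#include <assert.h>

// A loop owns a table of fd -> handler registrations; "poll" is hidden
// behind a function pointer so select/poll/epoll/kqueue backends are
// interchangeable, like ae_select.c / ae_epoll.c / ae_kqueue.c in Redis.
#define MINI_MAX_FDS 16

typedef void (*mini_handler)(int fd, void *ctx);
typedef int (*mini_poll_backend)(int *ready_fds, int max); // returns #ready

typedef struct {
    mini_handler handlers[MINI_MAX_FDS]; // per-fd callbacks (like epoll_ctl ADD)
    void *ctxs[MINI_MAX_FDS];
    mini_poll_backend poll;              // swappable backend
} MiniLoop;

void mini_register(MiniLoop *l, int fd, mini_handler h, void *ctx) {
    l->handlers[fd] = h;
    l->ctxs[fd] = ctx;
}

// One iteration of the event loop: ask the backend which fds are ready,
// then dispatch each to its registered handler (like aeProcessEvents).
int mini_tick(MiniLoop *l) {
    int ready[MINI_MAX_FDS];
    int n = l->poll(ready, MINI_MAX_FDS);
    for (int i = 0; i < n; i++) {
        int fd = ready[i];
        if (fd >= 0 && fd < MINI_MAX_FDS && l->handlers[fd])
            l->handlers[fd](fd, l->ctxs[fd]);
    }
    return n;
}

// Demo backend: pretends fd 3 is ready once.
static int fake_poll(int *ready, int max) { (void)max; ready[0] = 3; return 1; }
static void count_handler(int fd, void *ctx) { (void)fd; (*(int *)ctx)++; }

int mini_demo(void) {
    MiniLoop loop = { .poll = fake_poll };
    int hits = 0;
    mini_register(&loop, 3, count_handler, &hits);
    mini_tick(&loop);        // dispatches fd 3 to count_handler
    return hits;
}
```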

Redis Startup Flow

Execution begins in server.c:

Core call ①: initServer(); // initialize the server

Inside it, aeCreateEventLoop is called, which in turn calls aeApiCreate, analogous to epoll_create.

Source:

C code:
aeEventLoop *aeCreateEventLoop(int setsize) {
    aeEventLoop *eventLoop;
    int i;

    monotonicInit();    /* just in case the calling app didn't initialize */

    if ((eventLoop = zmalloc(sizeof(*eventLoop))) == NULL) goto err;
    eventLoop->events = zmalloc(sizeof(aeFileEvent)*setsize);
    eventLoop->fired = zmalloc(sizeof(aeFiredEvent)*setsize);
    if (eventLoop->events == NULL || eventLoop->fired == NULL) goto err;
    eventLoop->setsize = setsize;
    eventLoop->timeEventHead = NULL;
    eventLoop->timeEventNextId = 0;
    eventLoop->stop = 0;
    eventLoop->maxfd = -1;
    eventLoop->beforesleep = NULL;
    eventLoop->aftersleep = NULL;
    eventLoop->flags = 0;
    if (aeApiCreate(eventLoop) == -1) goto err;
    /* Events with mask == AE_NONE are not set. So let's initialize the
     * vector with it. */
    for (i = 0; i < setsize; i++)
        eventLoop->events[i].mask = AE_NONE;
    return eventLoop;

err:
    if (eventLoop) {
        zfree(eventLoop->events);
        zfree(eventLoop->fired);
        zfree(eventLoop);
    }
    return NULL;
}

listenToPort then listens on the TCP port, creating the ServerSocket and obtaining its FD.

So this step yields the FD; next, that FD must be registered with the I/O multiplexer so its readiness is monitored.

That is done by createSocketAcceptHandler, which registers the connection handler; internally it calls

aeCreateFileEvent to monitor read events on the listening socket's FD and bind the accept handler (the per-client read handler, readQueryFromClient, is bound later).

C code:
/* Create an event handler for accepting new connections in TCP or TLS domain sockets.
 * This works atomically for all socket fds */
int createSocketAcceptHandler(socketFds *sfd, aeFileProc *accept_handler) {
    int j;

    for (j = 0; j < sfd->count; j++) {
        if (aeCreateFileEvent(server.el, sfd->fd[j], AE_READABLE, accept_handler,NULL) == AE_ERR) {
            /* Rollback */
            for (j = j-1; j >= 0; j--) aeDeleteFileEvent(server.el, sfd->fd[j], AE_READABLE);
            return C_ERR;
        }
    }
    return C_OK;
}

The accept_handler passed in here is acceptTcpHandler:

C code:
void acceptTcpHandler(aeEventLoop *el, int fd, void *privdata, int mask) {
    int cport, cfd, max = MAX_ACCEPTS_PER_CALL;
    char cip[NET_IP_STR_LEN];
    UNUSED(el);
    UNUSED(mask);
    UNUSED(privdata);

    while(max--) {
        cfd = anetTcpAccept(server.neterr, fd, cip, sizeof(cip), &cport);
        if (cfd == ANET_ERR) {
            if (errno != EWOULDBLOCK)
                serverLog(LL_WARNING,
                    "Accepting client connection: %s", server.neterr);
            return;
        }
        anetCloexec(cfd);
        serverLog(LL_VERBOSE,"Accepted %s:%d", cip, cport);
        acceptCommonHandler(connCreateAcceptedSocket(cfd),0,cip);
    }
}

Inside anetTcpAccept:

C code:
int anetTcpAccept(char *err, int s, char *ip, size_t ip_len, int *port) {
    int fd;
    struct sockaddr_storage sa;
    socklen_t salen = sizeof(sa);
    if ((fd = anetGenericAccept(err,s,(struct sockaddr*)&sa,&salen)) == -1)
        return ANET_ERR;

    if (sa.ss_family == AF_INET) {
        struct sockaddr_in *s = (struct sockaddr_in *)&sa;
        if (ip) inet_ntop(AF_INET,(void*)&(s->sin_addr),ip,ip_len);
        if (port) *port = ntohs(s->sin_port);
    } else {
        struct sockaddr_in6 *s = (struct sockaddr_in6 *)&sa;
        if (ip) inet_ntop(AF_INET6,(void*)&(s->sin6_addr),ip,ip_len);
        if (port) *port = ntohs(s->sin6_port);
    }
    return fd;
}
C code:
static int anetGenericAccept(char *err, int s, struct sockaddr *sa, socklen_t *len) {
    int fd;
    while(1) {
        fd = accept(s,sa,len);
        if (fd == -1) {
            if (errno == EINTR)
                continue;
            else {
                anetSetError(err, "accept: %s", strerror(errno));
                return ANET_ERR;
            }
        }
        break;
    }
    return fd;
}

As the code shows, this part's main job is accepting the socket connection and obtaining its FD.

Once the FD is obtained, a connection is created and associated with it,

namely in acceptCommonHandler(connCreateAcceptedSocket(cfd),0,cip);

C code:
connection *connCreateAcceptedSocket(int fd) {
    connection *conn = connCreateSocket();
    conn->fd = fd;
    conn->state = CONN_STATE_ACCEPTING;
    return conn;
}

// Not fully verified: only the source below has been located; the call chain has not been traced

It then internally calls aeApiAddEvent(fd, READABLE) // not yet confirmed

to monitor read events on the socket's FD and bind the read handler readQueryFromClient:

connSetReadHandler(conn, readQueryFromClient);

C code:
networking.c
void protectClient(client *c) {
    c->flags |= CLIENT_PROTECTED;
    if (c->conn) {
        connSetReadHandler(c->conn,NULL);
        connSetWriteHandler(c->conn,NULL);
    }
}

/* This will undo the client protection done by protectClient() */
void unprotectClient(client *c) {
    if (c->flags & CLIENT_PROTECTED) {
        c->flags &= ~CLIENT_PROTECTED;
        if (c->conn) {
            connSetReadHandler(c->conn,readQueryFromClient);
            if (clientHasPendingReplies(c)) clientInstallWriteHandler(c);
        }
    }
}

Then come aeSetBeforeSleepProc(server.el,beforeSleep);

and aeSetAfterSleepProc(server.el,afterSleep);

While waiting for FDs to become ready, the thread sleeps; any work to perform just before and just after that sleep can be registered through aeSetBeforeSleepProc and aeSetAfterSleepProc, the before/after-sleep handlers.

To recap, initServer() mainly does the following:

aeCreateEventLoop: analogous to epoll_create

listenToPort: creates the listening FD

createSocketAcceptHandler: analogous to epoll_ctl, also passing in the handler to invoke on readiness

aeSetBeforeSleepProc/aeSetAfterSleepProc: the pre/post-sleep handlers

Core call ②: aeMain(server.el); // start the event loop

Source:

C code:
void aeMain(aeEventLoop *eventLoop) {
    eventLoop->stop = 0;
    while (!eventLoop->stop) {
        aeProcessEvents(eventLoop, AE_ALL_EVENTS|
                                   AE_CALL_BEFORE_SLEEP|
                                   AE_CALL_AFTER_SLEEP);
    }
}

The code above is the canonical shape of an event loop; let's continue into aeProcessEvents.

eventLoop is the runtime structure through which Redis drives epoll. aeProcessEvents performs a series of steps around each loop iteration: aeApiPoll waits for the registered FDs to become ready, analogous to epoll_wait; beforeSleep is invoked before it and afterSleep after it; then a for loop walks the ready FDs and calls each FD's handler.

Redis's Overall Network Model

The overall model looks like this:

Next, the source of the command-request handler:

readQueryFromClient

C code:
networking.c
void readQueryFromClient(connection *conn) {
    client *c = connGetPrivateData(conn);
    int nread, readlen;
    size_t qblen;

    /* Check if we want to read from the client later when exiting from
     * the event loop. This is the case if threaded I/O is enabled. */
    if (postponeClientRead(c)) return;

    /* Update total number of reads on server */
    atomicIncr(server.stat_total_reads_processed, 1);

    readlen = PROTO_IOBUF_LEN;
    /* If this is a multi bulk request, and we are processing a bulk reply
     * that is large enough, try to maximize the probability that the query
     * buffer contains exactly the SDS string representing the object, even
     * at the risk of requiring more read(2) calls. This way the function
     * processMultiBulkBuffer() can avoid copying buffers to create the
     * Redis Object representing the argument. */
    if (c->reqtype == PROTO_REQ_MULTIBULK && c->multibulklen && c->bulklen != -1
        && c->bulklen >= PROTO_MBULK_BIG_ARG)
    {
        ssize_t remaining = (size_t)(c->bulklen+2)-sdslen(c->querybuf);

        /* Note that the 'remaining' variable may be zero in some edge case,
         * for example once we resume a blocked client after CLIENT PAUSE. */
        if (remaining > 0 && remaining < readlen) readlen = remaining;
    }

    qblen = sdslen(c->querybuf);
    if (c->querybuf_peak < qblen) c->querybuf_peak = qblen;
    c->querybuf = sdsMakeRoomFor(c->querybuf, readlen);
    nread = connRead(c->conn, c->querybuf+qblen, readlen);
    if (nread == -1) {
        if (connGetState(conn) == CONN_STATE_CONNECTED) {
            return;
        } else {
            serverLog(LL_VERBOSE, "Reading from client: %s",connGetLastError(c->conn));
            freeClientAsync(c);
            return;
        }
    } else if (nread == 0) {
        serverLog(LL_VERBOSE, "Client closed connection");
        freeClientAsync(c);
        return;
    } else if (c->flags & CLIENT_MASTER) {
        /* Append the query buffer to the pending (not applied) buffer
         * of the master. We'll use this buffer later in order to have a
         * copy of the string applied by the last command executed. */
        c->pending_querybuf = sdscatlen(c->pending_querybuf,
                                        c->querybuf+qblen,nread);
    }

    sdsIncrLen(c->querybuf,nread);
    c->lastinteraction = server.unixtime;
    if (c->flags & CLIENT_MASTER) c->read_reploff += nread;
    atomicIncr(server.stat_net_input_bytes, nread);
    if (sdslen(c->querybuf) > server.client_max_querybuf_len) {
        sds ci = catClientInfoString(sdsempty(),c), bytes = sdsempty();

        bytes = sdscatrepr(bytes,c->querybuf,64);
        serverLog(LL_WARNING,"Closing client that reached max query buffer length: %s (qbuf initial bytes: %s)", ci, bytes);
        sdsfree(ci);
        sdsfree(bytes);
        freeClientAsync(c);
        return;
    }

    /* There is more data in the client input buffer, continue parsing it
     * in case to check if there is a full command to execute. */
     processInputBuffer(c);
}

It reads data from the client into the query buffer, parses the buffered bytes, converts the parsed strings into a Redis command stored in the client's (client *c) argv, and then processes the commands held in argv.
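The parsing step can be sketched in miniature. This is a hypothetical, heavily simplified stand-in for what processMultibulkBuffer does: it turns a RESP multibulk request such as `*2\r\n$3\r\nGET\r\n$3\r\nkey\r\n` into an argv array, assuming the whole well-formed command is already buffered (real Redis handles partial and malformed input):

```c
#include <string.h>
#include <stdlib.h>
#include <assert.h>

// Parse a complete RESP multibulk command into argv; returns argc, or -1
// on malformed input. Each argv[i] is malloc'd and NUL-terminated.
int parse_multibulk(const char *buf, char *argv[], int max_args) {
    if (*buf != '*') return -1;
    int argc = atoi(buf + 1);                 // "*N\r\n": number of bulk strings
    if (argc <= 0 || argc > max_args) return -1;
    const char *p = strstr(buf, "\r\n");
    if (!p) return -1;
    p += 2;
    for (int i = 0; i < argc; i++) {
        if (*p != '$') return -1;
        int len = atoi(p + 1);                // "$N\r\n": bulk length header
        p = strstr(p, "\r\n");
        if (!p || len < 0) return -1;
        p += 2;
        argv[i] = malloc(len + 1);            // copy the bulk payload
        memcpy(argv[i], p, len);
        argv[i][len] = '\0';
        p += len + 2;                         // skip payload and trailing CRLF
    }
    return argc;
}
```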

Below is the source of processInputBuffer:

C code:
/* This function is called every time, in the client structure 'c', there is
 * more query buffer to process, because we read more data from the socket
 * or because a client was blocked and later reactivated, so there could be
 * pending query buffer, already representing a full command, to process. */
void processInputBuffer(client *c) {
    /* Keep processing while there is something in the input buffer */
    while(c->qb_pos < sdslen(c->querybuf)) {
        /* Immediately abort if the client is in the middle of something. */
        if (c->flags & CLIENT_BLOCKED) break;

        /* Don't process more buffers from clients that have already pending
         * commands to execute in c->argv. */
        if (c->flags & CLIENT_PENDING_COMMAND) break;

        /* Don't process input from the master while there is a busy script
         * condition on the slave. We want just to accumulate the replication
         * stream (instead of replying -BUSY like we do with other clients) and
         * later resume the processing. */
        if (server.lua_timedout && c->flags & CLIENT_MASTER) break;

        /* CLIENT_CLOSE_AFTER_REPLY closes the connection once the reply is
         * written to the client. Make sure to not let the reply grow after
         * this flag has been set (i.e. don't process more commands).
         *
         * The same applies for clients we want to terminate ASAP. */
        if (c->flags & (CLIENT_CLOSE_AFTER_REPLY|CLIENT_CLOSE_ASAP)) break;

        /* Determine request type when unknown. */
        if (!c->reqtype) {
            if (c->querybuf[c->qb_pos] == '*') {
                c->reqtype = PROTO_REQ_MULTIBULK;
            } else {
                c->reqtype = PROTO_REQ_INLINE;
            }
        }

        if (c->reqtype == PROTO_REQ_INLINE) {
            if (processInlineBuffer(c) != C_OK) break;
            /* If the Gopher mode and we got zero or one argument, process
             * the request in Gopher mode. To avoid data race, Redis won't
             * support Gopher if enable io threads to read queries. */
            if (server.gopher_enabled && !server.io_threads_do_reads &&
                ((c->argc == 1 && ((char*)(c->argv[0]->ptr))[0] == '/') ||
                  c->argc == 0))
            {
                processGopherRequest(c);
                resetClient(c);
                c->flags |= CLIENT_CLOSE_AFTER_REPLY;
                break;
            }
        } else if (c->reqtype == PROTO_REQ_MULTIBULK) {
            if (processMultibulkBuffer(c) != C_OK) break;
        } else {
            serverPanic("Unknown request type");
        }

        /* Multibulk processing could see a <= 0 length. */
        if (c->argc == 0) {
            resetClient(c);
        } else {
            /* If we are in the context of an I/O thread, we can't really
             * execute the command here. All we can do is to flag the client
             * as one that needs to process the command. */
            if (c->flags & CLIENT_PENDING_READ) {
                c->flags |= CLIENT_PENDING_COMMAND;
                break;
            }

            /* We are finally ready to execute the command. */
            if (processCommandAndResetClient(c) == C_ERR) {
                /* If the client is no longer valid, we avoid exiting this
                 * loop and trimming the client buffer later. So we return
                 * ASAP in that case. */
                return;
            }
        }
    }

    /* Trim to pos */
    if (c->qb_pos) {
        sdsrange(c->querybuf,c->qb_pos,-1);
        c->qb_pos = 0;
    }
}

This function performs a series of checks, then processes the command and resets the client.

Next is processCommandAndResetClient; its most important call is

processCommand, which handles the command stored in c->argv.

C code:
networking.c
/* This function calls processCommand(), but also performs a few sub tasks
 * for the client that are useful in that context:
 *
 * 1. It sets the current client to the client 'c'.
 * 2. calls commandProcessed() if the command was handled.
 *
 * The function returns C_ERR in case the client was freed as a side effect
 * of processing the command, otherwise C_OK is returned. */
int processCommandAndResetClient(client *c) {
    int deadclient = 0;
    client *old_client = server.current_client;
    server.current_client = c;
    if (processCommand(c) == C_OK) {
        commandProcessed(c);
    }
    if (server.current_client == NULL) deadclient = 1;
    /*
     * Restore the old client, this is needed because when a script
     * times out, we will get into this code from processEventsWhileBlocked.
     * Which will cause to set the server.current_client. If not restored
     * we will return 1 to our caller which will falsely indicate the client
     * is dead and will stop reading from its buffer.
     */
    server.current_client = old_client;
    /* performEvictions may flush slave output buffers. This may
     * result in a slave, that may be the active client, to be
     * freed. */
    return deadclient ? C_ERR : C_OK;
}

Next, the source of processCommand:

It mainly processes the command in c->argv.

C code:
server.c
/* If this function gets called we already read a whole
 * command, arguments are in the client argv/argc fields.
 * processCommand() execute the command or prepare the
 * server for a bulk read from the client.
 *
 * If C_OK is returned the client is still alive and valid and
 * other operations can be performed by the caller. Otherwise
 * if C_ERR is returned the client was destroyed (i.e. after QUIT). */
int processCommand(client *c) {
    if (!server.lua_timedout) {
        /* Both EXEC and EVAL call call() directly so there should be
         * no way in_exec or in_eval or propagate_in_transaction is 1.
         * That is unless lua_timedout, in which case client may run
         * some commands. */
        serverAssert(!server.propagate_in_transaction);
        serverAssert(!server.in_exec);
        serverAssert(!server.in_eval);
    }

    moduleCallCommandFilters(c);

    /* The QUIT command is handled separately. Normal command procs will
     * go through checking for replication and QUIT will cause trouble
     * when FORCE_REPLICATION is enabled and would be implemented in
     * a regular command proc. */
    if (!strcasecmp(c->argv[0]->ptr,"quit")) {
        addReply(c,shared.ok);
        c->flags |= CLIENT_CLOSE_AFTER_REPLY;
        return C_ERR;
    }

    /* Now lookup the command and check ASAP about trivial error conditions
     * such as wrong arity, bad command name and so forth. */
    c->cmd = c->lastcmd = lookupCommand(c->argv[0]->ptr);
    if (!c->cmd) {
        sds args = sdsempty();
        int i;
        for (i=1; i < c->argc && sdslen(args) < 128; i++)
            args = sdscatprintf(args, "`%.*s`, ", 128-(int)sdslen(args), (char*)c->argv[i]->ptr);
        rejectCommandFormat(c,"unknown command `%s`, with args beginning with: %s",
            (char*)c->argv[0]->ptr, args);
        sdsfree(args);
        return C_OK;
    } else if ((c->cmd->arity > 0 && c->cmd->arity != c->argc) ||
               (c->argc < -c->cmd->arity)) {
        rejectCommandFormat(c,"wrong number of arguments for '%s' command",
            c->cmd->name);
        return C_OK;
    }

    int is_read_command = (c->cmd->flags & CMD_READONLY) ||
                           (c->cmd->proc == execCommand && (c->mstate.cmd_flags & CMD_READONLY));
    int is_write_command = (c->cmd->flags & CMD_WRITE) ||
                           (c->cmd->proc == execCommand && (c->mstate.cmd_flags & CMD_WRITE));
    int is_denyoom_command = (c->cmd->flags & CMD_DENYOOM) ||
                             (c->cmd->proc == execCommand && (c->mstate.cmd_flags & CMD_DENYOOM));
    int is_denystale_command = !(c->cmd->flags & CMD_STALE) ||
                               (c->cmd->proc == execCommand && (c->mstate.cmd_inv_flags & CMD_STALE));
    int is_denyloading_command = !(c->cmd->flags & CMD_LOADING) ||
                                 (c->cmd->proc == execCommand && (c->mstate.cmd_inv_flags & CMD_LOADING));
    int is_may_replicate_command = (c->cmd->flags & (CMD_WRITE | CMD_MAY_REPLICATE)) ||
                                   (c->cmd->proc == execCommand && (c->mstate.cmd_flags & (CMD_WRITE | CMD_MAY_REPLICATE)));

    if (authRequired(c)) {
        /* AUTH and HELLO and no auth commands are valid even in
         * non-authenticated state. */
        if (!(c->cmd->flags & CMD_NO_AUTH)) {
            rejectCommand(c,shared.noautherr);
            return C_OK;
        }
    }

    /* Check if the user can run this command according to the current
     * ACLs. */
    int acl_errpos;
    int acl_retval = ACLCheckAllPerm(c,&acl_errpos);
    if (acl_retval != ACL_OK) {
        addACLLogEntry(c,acl_retval,acl_errpos,NULL);
        switch (acl_retval) {
        case ACL_DENIED_CMD:
            rejectCommandFormat(c,
                "-NOPERM this user has no permissions to run "
                "the '%s' command or its subcommand", c->cmd->name);
            break;
        case ACL_DENIED_KEY:
            rejectCommandFormat(c,
                "-NOPERM this user has no permissions to access "
                "one of the keys used as arguments");
            break;
        case ACL_DENIED_CHANNEL:
            rejectCommandFormat(c,
                "-NOPERM this user has no permissions to access "
                "one of the channels used as arguments");
            break;
        default:
            rejectCommandFormat(c, "no permission");
            break;
        }
        return C_OK;
    }

    /* If cluster is enabled perform the cluster redirection here.
     * However we don't perform the redirection if:
     * 1) The sender of this command is our master.
     * 2) The command has no key arguments. */
    if (server.cluster_enabled &&
        !(c->flags & CLIENT_MASTER) &&
        !(c->flags & CLIENT_LUA &&
          server.lua_caller->flags & CLIENT_MASTER) &&
        !(!cmdHasMovableKeys(c->cmd) && c->cmd->firstkey == 0 &&
          c->cmd->proc != execCommand))
    {
        int hashslot;
        int error_code;
        clusterNode *n = getNodeByQuery(c,c->cmd,c->argv,c->argc,
                                        &hashslot,&error_code);
        if (n == NULL || n != server.cluster->myself) {
            if (c->cmd->proc == execCommand) {
                discardTransaction(c);
            } else {
                flagTransaction(c);
            }
            clusterRedirectClient(c,n,hashslot,error_code);
            c->cmd->rejected_calls++;
            return C_OK;
        }
    }

    /* Handle the maxmemory directive.
     *
     * Note that we do not want to reclaim memory if we are here re-entering
     * the event loop since there is a busy Lua script running in timeout
     * condition, to avoid mixing the propagation of scripts with the
     * propagation of DELs due to eviction. */
    if (server.maxmemory && !server.lua_timedout) {
        int out_of_memory = (performEvictions() == EVICT_FAIL);
        /* performEvictions may flush slave output buffers. This may result
         * in a slave, that may be the active client, to be freed. */
        if (server.current_client == NULL) return C_ERR;

        int reject_cmd_on_oom = is_denyoom_command;
        /* If client is in MULTI/EXEC context, queuing may consume an unlimited
         * amount of memory, so we want to stop that.
         * However, we never want to reject DISCARD, or even EXEC (unless it
         * contains denied commands, in which case is_denyoom_command is already
         * set. */
        if (c->flags & CLIENT_MULTI &&
            c->cmd->proc != execCommand &&
            c->cmd->proc != discardCommand &&
            c->cmd->proc != resetCommand) {
            reject_cmd_on_oom = 1;
        }

        if (out_of_memory && reject_cmd_on_oom) {
            rejectCommand(c, shared.oomerr);
            return C_OK;
        }

        /* Save out_of_memory result at script start, otherwise if we check OOM
         * until first write within script, memory used by lua stack and
         * arguments might interfere. */
        if (c->cmd->proc == evalCommand || c->cmd->proc == evalShaCommand) {
            server.lua_oom = out_of_memory;
        }
    }

    /* Make sure to use a reasonable amount of memory for client side
     * caching metadata. */
    if (server.tracking_clients) trackingLimitUsedSlots();

    /* Don't accept write commands if there are problems persisting on disk
     * and if this is a master instance. */
    int deny_write_type = writeCommandsDeniedByDiskError();
    if (deny_write_type != DISK_ERROR_TYPE_NONE &&
        server.masterhost == NULL &&
        (is_write_command ||c->cmd->proc == pingCommand))
    {
        if (deny_write_type == DISK_ERROR_TYPE_RDB)
            rejectCommand(c, shared.bgsaveerr);
        else
            rejectCommandFormat(c,
                "-MISCONF Errors writing to the AOF file: %s",
                strerror(server.aof_last_write_errno));
        return C_OK;
    }

    /* Don't accept write commands if there are not enough good slaves and
     * user configured the min-slaves-to-write option. */
    if (server.masterhost == NULL &&
        server.repl_min_slaves_to_write &&
        server.repl_min_slaves_max_lag &&
        is_write_command &&
        server.repl_good_slaves_count < server.repl_min_slaves_to_write)
    {
        rejectCommand(c, shared.noreplicaserr);
        return C_OK;
    }

    /* Don't accept write commands if this is a read only slave. But
     * accept write commands if this is our master. */
    if (server.masterhost && server.repl_slave_ro &&
        !(c->flags & CLIENT_MASTER) &&
        is_write_command)
    {
        rejectCommand(c, shared.roslaveerr);
        return C_OK;
    }

    /* Only allow a subset of commands in the context of Pub/Sub if the
     * connection is in RESP2 mode. With RESP3 there are no limits. */
    if ((c->flags & CLIENT_PUBSUB && c->resp == 2) &&
        c->cmd->proc != pingCommand &&
        c->cmd->proc != subscribeCommand &&
        c->cmd->proc != unsubscribeCommand &&
        c->cmd->proc != psubscribeCommand &&
        c->cmd->proc != punsubscribeCommand &&
        c->cmd->proc != resetCommand) {
        rejectCommandFormat(c,
            "Can't execute '%s': only (P)SUBSCRIBE / "
            "(P)UNSUBSCRIBE / PING / QUIT / RESET are allowed in this context",
            c->cmd->name);
        return C_OK;
    }

    /* Only allow commands with flag "t", such as INFO, SLAVEOF and so on,
     * when slave-serve-stale-data is no and we are a slave with a broken
     * link with master. */
    if (server.masterhost && server.repl_state != REPL_STATE_CONNECTED &&
        server.repl_serve_stale_data == 0 &&
        is_denystale_command)
    {
        rejectCommand(c, shared.masterdownerr);
        return C_OK;
    }

    /* Loading DB? Return an error if the command has not the
     * CMD_LOADING flag. */
    if (server.loading && is_denyloading_command) {
        rejectCommand(c, shared.loadingerr);
        return C_OK;
    }

    /* Lua script too slow? Only allow a limited number of commands.
     * Note that we need to allow the transactions commands, otherwise clients
     * sending a transaction with pipelining without error checking, may have
     * the MULTI plus a few initial commands refused, then the timeout
     * condition resolves, and the bottom-half of the transaction gets
     * executed, see Github PR #7022. */
    if (server.lua_timedout &&
          c->cmd->proc != authCommand &&
          c->cmd->proc != helloCommand &&
          c->cmd->proc != replconfCommand &&
          c->cmd->proc != multiCommand &&
          c->cmd->proc != discardCommand &&
          c->cmd->proc != watchCommand &&
          c->cmd->proc != unwatchCommand &&
          c->cmd->proc != resetCommand &&
        !(c->cmd->proc == shutdownCommand &&
          c->argc == 2 &&
          tolower(((char*)c->argv[1]->ptr)[0]) == 'n') &&
        !(c->cmd->proc == scriptCommand &&
          c->argc == 2 &&
          tolower(((char*)c->argv[1]->ptr)[0]) == 'k'))
    {
        rejectCommand(c, shared.slowscripterr);
        return C_OK;
    }

    /* Prevent a replica from sending commands that access the keyspace.
     * The main objective here is to prevent abuse of client pause check
     * from which replicas are exempt. */
    if ((c->flags & CLIENT_SLAVE) && (is_may_replicate_command || is_write_command || is_read_command)) {
        rejectCommandFormat(c, "Replica can't interract with the keyspace");
        return C_OK;
    }

    /* If the server is paused, block the client until
     * the pause has ended. Replicas are never paused. */
    if (!(c->flags & CLIENT_SLAVE) && 
        ((server.client_pause_type == CLIENT_PAUSE_ALL) ||
        (server.client_pause_type == CLIENT_PAUSE_WRITE && is_may_replicate_command)))
    {
        c->bpop.timeout = 0;
        blockClient(c,BLOCKED_PAUSE);
        return C_OK;       
    }

    /* Exec the command */
    if (c->flags & CLIENT_MULTI &&
        c->cmd->proc != execCommand && c->cmd->proc != discardCommand &&
        c->cmd->proc != multiCommand && c->cmd->proc != watchCommand &&
        c->cmd->proc != resetCommand)
    {
        queueMultiCommand(c);
        addReply(c,shared.queued);
    } else {
        call(c,CMD_CALL_FULL);
        c->woff = server.master_repl_offset;
        if (listLength(server.ready_keys))
            handleClientsBlockedOnKeys();
    }

    return C_OK;
}

这段代码的核心逻辑有:

根据命令名称寻找命令对应的command:

之后执行command,得到响应结果。

上文中的proc定义如下:

c
typedef void redisCommandProc(client *c);
typedef int redisGetKeysProc(struct redisCommand *cmd, robj **argv, int argc, getKeysResult *result);
struct redisCommand {
    char *name;
    redisCommandProc *proc;
    int arity;
    char *sflags;   /* Flags as string representation, one char per flag. */
    uint64_t flags; /* The actual flags, obtained from the 'sflags' field. */
    /* Use a function to determine keys arguments in a command line.
     * Used for Redis Cluster redirect. */
    redisGetKeysProc *getkeys_proc;
    /* What keys should be loaded in background when calling this command? */
    int firstkey; /* The first argument that's a key (0 = no keys) */
    int lastkey;  /* The last argument that's a key */
    int keystep;  /* The step between first and last key */
    long long microseconds, calls, rejected_calls, failed_calls;
    int id;     /* Command ID. This is a progressive ID starting from 0 that
                   is assigned at runtime, and is used in order to check
                   ACLs. A connection is able to execute a given command if
                   the user associated to the connection has this command
                   bit set in the bitmap of allowed commands. */
};
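命令查找本质上是一次字典查询:Redis 启动时把所有 redisCommand 注册进 server.commands 字典,执行时按命令名取出对应的 proc 并调用。下面用 Java 写一个极简的查表-分发示意(CommandTable、dispatch 等名称以及 PING/ECHO 的处理逻辑均为演示用的假设实现,并非 Redis 源码):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class CommandTable {
    // 命令名 -> 处理函数,模拟 server.commands 字典
    static final Map<String, Function<String[], String>> COMMANDS = new HashMap<>();
    static {
        COMMANDS.put("ping", args -> "+PONG\r\n");
        COMMANDS.put("echo", args -> "$" + args[1].length() + "\r\n" + args[1] + "\r\n");
    }

    // 模拟 processCommand 的查表-分发:按命令名取出处理函数,找不到则返回错误
    static String dispatch(String... args) {
        Function<String[], String> proc = COMMANDS.get(args[0].toLowerCase());
        if (proc == null) return "-ERR unknown command '" + args[0] + "'\r\n";
        return proc.apply(args);
    }

    public static void main(String[] args) {
        System.out.println(dispatch("PING").trim());
    }
}
```

真实的 Redis 中,查到的是 redisCommand 结构体,随后还要经过参数个数、ACL、集群槽位等一系列校验才会调用 proc。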

最后把执行结果写出,比如收到 ping 命令时,就会返回 pong 给 Client。

接下来是addReply的源码:

addReply 会先尝试把结果写入 c->buf(客户端写缓冲区);

c
/* -----------------------------------------------------------------------------
 * Higher level functions to queue data on the client output buffer.
 * The following functions are the ones that commands implementations will call.
 * -------------------------------------------------------------------------- */

/* Add the object 'obj' string representation to the client output buffer. */
void addReply(client *c, robj *obj) {
    if (prepareClientToWrite(c) != C_OK) return;

    if (sdsEncodedObject(obj)) {
        if (_addReplyToBuffer(c,obj->ptr,sdslen(obj->ptr)) != C_OK)
            _addReplyProtoToList(c,obj->ptr,sdslen(obj->ptr));
    } else if (obj->encoding == OBJ_ENCODING_INT) {
        /* For integer encoded strings we just convert it into a string
         * using our optimized function, and attach the resulting string
         * to the output buffer. */
        char buf[32];
        size_t len = ll2string(buf,sizeof(buf),(long)obj->ptr);
        if (_addReplyToBuffer(c,buf,len) != C_OK)
            _addReplyProtoToList(c,buf,len);
    } else {
        serverPanic("Wrong obj->encoding in addReply()");
    }
}

如果 c->buf 写不下,就调用 _addReplyProtoToList 把数据追加到 c->reply,这是一个链表。

之后将客户端添加到server.clients_pending_write这个队列中,等待被写出。
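这种"固定缓冲区 + 链表"的两级输出缓冲结构,可以用下面的 Java 片段示意(OutputBuffer 等名称为演示用的假设实现;注意链表一旦非空,后续数据必须继续追加到链表,以保证响应顺序):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class OutputBuffer {
    private final byte[] buf;      // 模拟 c->buf 固定缓冲区
    private int bufpos = 0;        // 已用字节数,模拟 c->bufpos
    private final Deque<byte[]> reply = new ArrayDeque<>(); // 模拟 c->reply 链表

    OutputBuffer(int size) { buf = new byte[size]; }

    // 先尝试写入固定缓冲区,写不下(或链表已有数据)则追加到链表
    // 对应 _addReplyToBuffer 失败后回退到 _addReplyProtoToList 的逻辑
    void add(byte[] data) {
        if (reply.isEmpty() && bufpos + data.length <= buf.length) {
            System.arraycopy(data, 0, buf, bufpos, data.length);
            bufpos += data.length;
        } else {
            reply.addLast(data);
        }
    }

    int bufferedInFixed() { return bufpos; }
    int listNodes() { return reply.size(); }
}
```

固定缓冲区避免了小响应频繁分配内存,链表则兜底大响应,这正是 addReply 中两个分支的意图。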

到这里,aeEventLoop 中这一轮的事件处理已经全部执行完成。但会发现一个问题:缓冲区中未成功写出的数据仍留在队列里,并未真正发送。这部分工作已经不属于当前模块的职责,Redis 将其放在了 beforeSleep 中完成。

beforeSleep 中会调用 handleClientsWithPendingWrites(),其源码如下:它使用迭代器遍历 server.clients_pending_write 队列,逐个取出客户端并尝试把缓冲区数据写出;如果一次没写完,就注册该 Socket FD 的写事件(内部调用 aeApiAddEvent(fd, WRITABLE)),并绑定写处理器 sendReplyToClient,由它把剩余的响应写到客户端 Socket。

c
int handleClientsWithPendingWrites(void) {
    listIter li;
    listNode *ln;
    int processed = listLength(server.clients_pending_write);

    listRewind(server.clients_pending_write,&li);
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);
        c->flags &= ~CLIENT_PENDING_WRITE;
        listDelNode(server.clients_pending_write,ln);

        /* If a client is protected, don't do anything,
         * that may trigger write error or recreate handler. */
        if (c->flags & CLIENT_PROTECTED) continue;

        /* Don't write to clients that are going to be closed anyway. */
        if (c->flags & CLIENT_CLOSE_ASAP) continue;

        /* Try to write buffers to the client socket. */
        if (writeToClient(c,0) == C_ERR) continue;

        /* If after the synchronous writes above we still have data to
         * output to the client, we need to install the writable handler. */
        if (clientHasPendingReplies(c)) {
            int ae_barrier = 0;
            /* For the fsync=always policy, we want that a given FD is never
             * served for reading and writing in the same event loop iteration,
             * so that in the middle of receiving the query, and serving it
             * to the client, we'll call beforeSleep() that will do the
             * actual fsync of AOF to disk. the write barrier ensures that. */
            if (server.aof_state == AOF_ON &&
                server.aof_fsync == AOF_FSYNC_ALWAYS)
            {
                ae_barrier = 1;
            }
            if (connSetWriteHandlerWithBarrier(c->conn, sendReplyToClient, ae_barrier) == C_ERR) {
                freeClientAsync(c);
            }
        }
    }
    return processed;
}

所以总体的流程如下:

6.0之后多线程改进单线程模型

之前单线程的瓶颈点:

命令处理:

需要从客户端连接的 Socket 中读取数据,并把读到的字节解析为 Redis 命令。这一步涉及读 Socket,因此受网络 I/O 的影响较大。

添加多线程的部分:

Redis 6.0 把读写 Socket 和协议解析这部分交给多个 I/O 线程处理,以减少网络 I/O 造成的性能损耗;命令本身的执行仍是单线程。这虽然没有缩短单条命令的处理时间,但整体吞吐量得到了明显提升。
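多线程 I/O 默认是关闭的,需要在 redis.conf 中显式开启。下面是一份示意配置(线程数应结合机器核数调整):

```conf
# 开启 4 个 I/O 线程(官方建议小于 CPU 核数)
io-threads 4
# 默认多线程只用于写出响应;打开此项后,读 Socket 和协议解析也使用多线程
io-threads-do-reads yes
```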

通信协议

RESP协议

Redis 是一种 C/S 架构的软件,客户端与服务端之间通过 RESP(Redis Serialization Protocol)协议通信。

Redis 1.2 引入了 RESP 协议;2.0 起 RESP2 成为 Redis 通信的标准;6.0 推出了 RESP3,增加了更多的数据类型并支持客户端缓存,但出于兼容性考虑,默认仍使用 RESP2。

RESP中通过首字节的字符来区分不同的数据类型,常用的数据类型包括了5种:

  • 单行字符串:"+",后面跟上单行字符串,以CRLF("\r\n")结尾。
  • 错误:"-",与单行字符串类似,只是字符串是异常信息。
  • 数值:":",后面跟上数字格式的字符串,以CRLF结尾。
  • 多行字符串:"$",表示二进制安全的字符串,最大支持512MB;先记录字节长度,再跟上具体内容,均以CRLF结尾。例如:$5\r\nhello\r\n。特殊情况:长度为0表示空字符串($0\r\n\r\n),长度为-1表示不存在($-1\r\n)。
  • 数组:"*",后面跟上数组元素个数,再依次跟上各个元素,元素的数据类型不限。客户端发送的请求通常就是一个数组。
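例如,客户端发送 SET key value 时,会把它编码为一个包含 3 个多行字符串的数组。下面的 Java 片段演示这一编码规则(RespEncoder 为演示用的假设类名):

```java
import java.nio.charset.StandardCharsets;

public class RespEncoder {
    // 把一条命令编码为 RESP 数组:*<元素个数>\r\n 后跟若干多行字符串
    static String encode(String... args) {
        StringBuilder sb = new StringBuilder();
        sb.append('*').append(args.length).append("\r\n");
        for (String arg : args) {
            byte[] bytes = arg.getBytes(StandardCharsets.UTF_8);
            // 多行字符串:$<字节长度>\r\n<内容>\r\n,长度按字节计,保证二进制安全
            sb.append('$').append(bytes.length).append("\r\n").append(arg).append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // SET key value 编码为:*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n
        System.out.print(encode("SET", "key", "value"));
    }
}
```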

模拟Redis客户端

java
import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Main {
    static Socket s;
    static PrintWriter writer;
    static BufferedReader reader;

    public static void main(String[] args) {
        String host = "127.0.0.1";
        int port = 6379;
        try {
            //建立连接
            s = new Socket(host, port);
            //获取输入流和输出流
            writer = new PrintWriter(new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8));
            reader = new BufferedReader(new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
            //发送请求,例如 ping
            sendRequest("ping");
            //解析响应并打印
            System.out.println(handleResponse());
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private static Object handleResponse() {
        //五种情况:读取首字节
        try {
            int prefix = reader.read();
            switch (prefix) {
                case '+' :
                    return reader.readLine();
                case '-' :
                    throw new RuntimeException(reader.readLine());
                case ':' :
                    return Long.parseLong(reader.readLine());
                case '$' :
                    //注意:BufferedReader.readLine 不是二进制安全的,仅用于演示
                    int len = Integer.parseInt(reader.readLine());
                    if (len == -1) {
                        return null;
                    }
                    //len == 0 时 readLine 恰好读到空行,返回空字符串
                    return reader.readLine();
                case '*' :
                    return readBulkString();
                default:
                    throw new RuntimeException("错误的数据格式!");
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private static Object readBulkString() throws IOException {
        //获取数组大小
        int len = Integer.parseInt(reader.readLine());
        if (len <= 0) {
            return null;
        }
        //定义一个集合接收元素,遍历依次读取
        List<Object> list = new ArrayList<>(len);
        for (int i = 0; i < len; i++) {
            list.add(handleResponse());
        }
        return list;
    }


    private static void sendRequest(String... args) {
        //RESP 要求每行以 \r\n 结尾,不能依赖 println 的平台相关换行符
        writer.print("*" + args.length + "\r\n");
        for (String arg : args) {
            writer.print("$" + arg.getBytes(StandardCharsets.UTF_8).length + "\r\n");
            writer.print(arg + "\r\n");
        }
        writer.flush();
    }
}