RDMA Programming Example: the rdma_cm API

RDMA Programming Basics

Storage Master Class | Introduction to RDMA and Programming Basics - https://zhuanlan.zhihu.com/p/387549948

1. Setting up an RDMA learning environment

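RDMA normally requires a dedicated RDMA NIC or an InfiniBand adapter. If you want to learn RDMA without such hardware, you can use a software RDMA emulation environment, softiwarp.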

More rdma_cm examples:

Note that this example uses IPv6 connections by default. To test in an IPv4 environment, first change the code to use IPv4 addresses.
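Since rdma_bind_addr takes a plain `struct sockaddr *`, switching between IPv4 and IPv6 mostly comes down to which address structure is filled in. The following is a minimal sketch of that idea, not code from the examples mentioned above; `make_wildcard_addr` is a hypothetical helper name introduced here for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of librdmacm): fill `out` with a wildcard
 * address of the requested family and return its length, or 0 if the
 * family is not supported. */
socklen_t make_wildcard_addr(int family, uint16_t port,
                             struct sockaddr_storage *out)
{
    memset(out, 0, sizeof(*out));

    if (family == AF_INET) {
        struct sockaddr_in *a4 = (struct sockaddr_in *)out;
        a4->sin_family = AF_INET;
        a4->sin_port = htons(port);              /* 0: let the stack pick a port */
        a4->sin_addr.s_addr = htonl(INADDR_ANY);
        return sizeof(*a4);
    }
    if (family == AF_INET6) {
        struct sockaddr_in6 *a6 = (struct sockaddr_in6 *)out;
        a6->sin6_family = AF_INET6;
        a6->sin6_port = htons(port);
        a6->sin6_addr = in6addr_any;
        return sizeof(*a6);
    }
    return 0;                                    /* unsupported family */
}
```

In a server, the filled structure would then be passed along the lines of `rdma_bind_addr(listener, (struct sockaddr *)&ss)`, so only the `family` argument changes between an IPv4 and an IPv6 build.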

2. RDMA compared with sockets

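As with sockets, RDMA connections come in reliable and unreliable flavors, but the analogy is not exact. With sockets, the reliable connection is TCP, which is stream-oriented, and the unreliable one is UDP, which is message-oriented. With RDMA, both reliable and unreliable connections are message-oriented.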
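The difference between stream and message semantics can be observed without any RDMA hardware. The sketch below is an illustration under stated assumptions: a local AF_UNIX socket pair stands in for TCP (SOCK_STREAM) and UDP (SOCK_DGRAM), and `first_read_size` is a hypothetical helper name. It writes two 4-byte messages and reports how many bytes a single read delivers.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write two 4-byte messages into one end of a local socket pair, then issue
 * a single read on the other end. Returns the number of bytes that read
 * delivered, or -1 on error. */
ssize_t first_read_size(int sock_type)
{
    int sv[2];
    char buf[16];
    ssize_t n = -1;

    if (socketpair(AF_UNIX, sock_type, 0, sv) != 0)
        return -1;

    if (write(sv[0], "AAAA", 4) == 4 && write(sv[0], "BBBB", 4) == 4)
        n = read(sv[1], buf, sizeof(buf));

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

With SOCK_DGRAM the read returns exactly one message (4 bytes), because message boundaries are preserved. With SOCK_STREAM the two writes are just bytes in a stream, and the read is free to return them merged. RDMA send/receive behaves like the message case on both reliable and unreliable connections: each send is consumed by one posted receive.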

From a programming point of view, RDMA code is also split into a server side and a client side, with bind, listen, connect and accept steps, but many details differ.

It is also worth looking at Mellanox's VMA, which apparently lets you communicate directly through the socket API and is much more convenient:

[RDMA] To reduce CPU usage, is there VMA besides RDMA (verbs)? | RDMA programming with sockets? (bandaoyu's notes, CSDN blog)

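Foreword: judging by its description, VMA appears to be a library that Mellanox provides for its kernel-bypass NICs (RDMA NICs). The library exposes the socket API, so user programs can use a kernel-bypass NIC (such as an RDMA NIC) without modification. RDMA NICs are currently programmed through the rdma_cm and verbs APIs, which differ from sockets, so being able to program RDMA through sockets would be a real benefit. The post goes on to quote the official "What is VMA?" introduction from the Mellanox Interconnect Community.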

https://blog.csdn.net/bandaoyu/article/details/120726746

rdma_cm API documentation:

https://linux.die.net/man/3/rdma_create_id (recommended)

https://www.ibm.com/docs/en/aix/7.2?topic=operations-rdma-listen (less detailed)

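The rdma_cm API manages connections (setup and teardown); the verbs API handles sending and receiving.

RDMA hosts communicate through queue pairs (QPs). A host creates a QP consisting of a send queue (SQ) and a receive queue (RQ), and uses the verbs API to post operations to these queues. (So rdma_cm manages the connection, while sending and receiving still go through the verbs API.)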

3. Code flow of an RDMA server

main()

{

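1. channel = rdma_create_event_channel()

This step creates an event channel. The event channel is how the application is notified when an asynchronous connection-management operation completes or when an event such as a connection request occurs. Internally it is a file descriptor, so it can be used with poll and similar calls.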

2. rdma_create_id(channel, &id, context, ps)

This step creates an rdma_cm_id, which is conceptually equivalent to the listening socket in socket programming.

3. rdma_bind_addr(id, addr)

As in socket programming, a local address and port must be bound first so that the id can listen.

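4. rdma_listen(id, backlog)

Start listening for client connection requests. The second argument is the backlog of pending connection requests.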

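5. rdma_get_cm_event(channel, &event)

This call operates on the event channel created in the first step and fetches one event from it. It is a blocking call and returns only when an event is available. If all goes well, it returns an RDMA_CM_EVENT_CONNECT_REQUEST event, meaning a client has initiated a connection.

The event carries a new rdma_cm_id. This differs from sockets, where a new socket fd is created only after accept.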

on_event()

{

on_connect_request()//RDMA_CM_EVENT_CONNECT_REQUEST

{

build_context()

{

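6. ibv_alloc_pd

Create a protection domain (PD). A protection domain can be seen as a unit of memory protection: it establishes an association between memory regions and queues, preventing unauthorized access.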

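7. ibv_create_comp_channel

Like the event channel created earlier, this is also an event channel, but it only reports events from the completion queue (CQ). When new work completes in the CQ, the application is notified through this channel.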

8. ibv_create_cq

Create the completion queue, bound at creation time to the completion channel from step 7.

}//--end build_context()

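9. rdma_create_qp

Create a queue pair (QP). A queue pair consists of a send queue and a receive queue. The CQ created above is specified as its completion queue, and the QP is associated at creation time with the PD from step 6.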

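10. ibv_reg_mr

Register a memory region (MR). Memory used by RDMA must be registered beforehand. This makes sense: memory used for DMA has requirements on things like alignment and whether it may be swapped out, and registration pins the pages so that the NIC can access them safely.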

11. rdma_accept

All the preparation is now done, and accept can be called to accept the client's request. Time for a sigh of relief :) ... but not so fast.

}

//--end on_connect_request()

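12. rdma_ack_cm_event

Every event obtained from the event channel must be acknowledged with the ack function, otherwise memory leaks. This ack corresponds to the get in step 5. Every get call needs a matching ack call.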

13. rdma_get_cm_event

Call rdma_get_cm_event again. If all goes well, this returns an RDMA_CM_EVENT_ESTABLISHED event, meaning the connection has been established. No extra handling is needed; just call rdma_ack_cm_event.

}//--end on_event()

Data transfer can finally begin (how to transfer data is covered in the next post).

Reference: http://10.165.104.246:8080/#/c/43882/

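4. Closing the connection

Disconnecting

When rdma_get_cm_event returns an RDMA_CM_EVENT_DISCONNECTED event, the client has disconnected and the server must clean up accordingly. Call rdma_ack_cm_event to release the event, then call the following functions in order to release the connection, memory and queue resources.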

rdma_disconnect

rdma_destroy_qp

ibv_dereg_mr

rdma_destroy_id

Release the rdma_cm_id of the client connection.

rdma_destroy_id

Release the rdma_cm_id used for listening.

rdma_destroy_event_channel

Release the event channel.

}

// end main
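The event-channel step in the flow above notes that the channel is backed by a file descriptor and can therefore be polled. As a hedged sketch of what that could look like: `wait_readable` below is a hypothetical helper built only on POSIX poll(), and `demo_with_pipe` exercises it with a pipe standing in for the channel, so the block runs without RDMA hardware.

```c
#include <poll.h>
#include <unistd.h>

/* Wait until `fd` has something to read.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Self-check that needs no RDMA hardware: a pipe stands in for the event
 * channel's descriptor. Returns 1 if wait_readable() behaved as expected. */
int demo_with_pipe(void)
{
    int p[2];
    int ok;

    if (pipe(p) != 0)
        return 0;

    ok = (wait_readable(p[0], 0) == 0);          /* nothing written yet: timeout */
    ok = ok && (write(p[1], "x", 1) == 1);
    ok = ok && (wait_readable(p[0], 100) == 1);  /* data pending: readable */

    close(p[0]);
    close(p[1]);
    return ok;
}
```

With librdmacm, the descriptor is the `fd` member of `struct rdma_event_channel`, so a server could call something like `wait_readable(ec->fd, 1000)` and only then call rdma_get_cm_event, which gives the event loop a timeout instead of blocking indefinitely.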

Example

Source code: https://github.com/tarickb/the-geek-in-the-corner

Usage

    [root@localhost 01_basic-client-server]# ./server
    listening on port 42956.

    client <server-address> <server-port>

Makefile

    .PHONY: clean

    CFLAGS := -Wall -g
    LDLIBS := ${LDLIBS} -lrdmacm -libverbs -lpthread

    APPS := server client

    all: ${APPS}

    clean:
    	rm -f ${APPS}

Note: the Makefile does not pass -L to specify a library path, so the libraries behind -lrdmacm -libverbs -lpthread (librdmacm.so, libibverbs.so, libpthread.so) must be in a default search path such as /usr/lib or /usr/lib64.

Server (server.c):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <rdma/rdma_cma.h>

    #define TEST_NZ(x) do { if ( (x)) die("error: " #x " failed (returned non-zero)." ); } while (0)
    #define TEST_Z(x)  do { if (!(x)) die("error: " #x " failed (returned zero/null)."); } while (0)

    const int BUFFER_SIZE = 1024;

    /* Resources shared by all connections on one device. */
    struct context {
      struct ibv_context *ctx;
      struct ibv_pd *pd;
      struct ibv_cq *cq;
      struct ibv_comp_channel *comp_channel;

      pthread_t cq_poller_thread;
    };

    /* Per-connection state. */
    struct connection {
      struct ibv_qp *qp;

      struct ibv_mr *recv_mr;
      struct ibv_mr *send_mr;

      char *recv_region;
      char *send_region;
    };

    static void die(const char *reason);

    static void build_context(struct ibv_context *verbs);
    static void build_qp_attr(struct ibv_qp_init_attr *qp_attr);
    static void * poll_cq(void *);
    static void post_receives(struct connection *conn);
    static void register_memory(struct connection *conn);

    static void on_completion(struct ibv_wc *wc);
    static int on_connect_request(struct rdma_cm_id *id);
    static int on_connection(void *context);
    static int on_disconnect(struct rdma_cm_id *id);
    static int on_event(struct rdma_cm_event *event);

    static struct context *s_ctx = NULL;

    int main(int argc, char **argv)
    {
    #if _USE_IPV6
      struct sockaddr_in6 addr;
    #else
      struct sockaddr_in addr;
    #endif
      struct rdma_cm_event *event = NULL;
      struct rdma_cm_id *listener = NULL;
      struct rdma_event_channel *ec = NULL;
      uint16_t port = 0;

      memset(&addr, 0, sizeof(addr));
    #if _USE_IPV6
      addr.sin6_family = AF_INET6;
    #else
      addr.sin_family = AF_INET;
    #endif

      TEST_Z(ec = rdma_create_event_channel());
      TEST_NZ(rdma_create_id(ec, &listener, NULL, RDMA_PS_TCP));
      TEST_NZ(rdma_bind_addr(listener, (struct sockaddr *)&addr));
      TEST_NZ(rdma_listen(listener, 10)); /* backlog=10 is arbitrary */

      /* rdma_get_src_port() returns the port the listener is bound to */
      port = ntohs(rdma_get_src_port(listener));

      printf("listening on port %d.\n", port);

      while (rdma_get_cm_event(ec, &event) == 0) {
        struct rdma_cm_event event_copy;

        /* copy the event so it can be acknowledged before it is handled */
        memcpy(&event_copy, event, sizeof(*event));
        rdma_ack_cm_event(event);

        if (on_event(&event_copy))
          break;
      }

      rdma_destroy_id(listener);
      rdma_destroy_event_channel(ec);

      return 0;
    }

    void die(const char *reason)
    {
      fprintf(stderr, "%s\n", reason);
      exit(EXIT_FAILURE);
    }

    void build_context(struct ibv_context *verbs)
    {
      if (s_ctx) {
        if (s_ctx->ctx != verbs)
          die("cannot handle events in more than one context.");

        return;
      }

      s_ctx = (struct context *)malloc(sizeof(struct context));

      s_ctx->ctx = verbs;

      TEST_Z(s_ctx->pd = ibv_alloc_pd(s_ctx->ctx));
      TEST_Z(s_ctx->comp_channel = ibv_create_comp_channel(s_ctx->ctx));
      TEST_Z(s_ctx->cq = ibv_create_cq(s_ctx->ctx, 10, NULL, s_ctx->comp_channel, 0)); /* cqe=10 is arbitrary */
      /* arm the CQ: request a notification on the completion channel for the next completion */
      TEST_NZ(ibv_req_notify_cq(s_ctx->cq, 0));

      TEST_NZ(pthread_create(&s_ctx->cq_poller_thread, NULL, poll_cq, NULL));
    }

    void build_qp_attr(struct ibv_qp_init_attr *qp_attr)
    {
      memset(qp_attr, 0, sizeof(*qp_attr));

      qp_attr->send_cq = s_ctx->cq;
      qp_attr->recv_cq = s_ctx->cq;
      qp_attr->qp_type = IBV_QPT_RC; /* reliable connection */

      qp_attr->cap.max_send_wr = 10;
      qp_attr->cap.max_recv_wr = 10;
      qp_attr->cap.max_send_sge = 1;
      qp_attr->cap.max_recv_sge = 1;
    }

    void * poll_cq(void *ctx)
    {
      struct ibv_cq *cq;
      struct ibv_wc wc;

      while (1) {
        /* block until the completion channel signals, then re-arm and drain the CQ */
        TEST_NZ(ibv_get_cq_event(s_ctx->comp_channel, &cq, &ctx));
        ibv_ack_cq_events(cq, 1);
        TEST_NZ(ibv_req_notify_cq(cq, 0));

        while (ibv_poll_cq(cq, 1, &wc))
          on_completion(&wc);
      }

      return NULL;
    }

    void post_receives(struct connection *conn)
    {
      struct ibv_recv_wr wr, *bad_wr = NULL;
      struct ibv_sge sge;

      wr.wr_id = (uintptr_t)conn;
      wr.next = NULL;
      wr.sg_list = &sge;
      wr.num_sge = 1;

      sge.addr = (uintptr_t)conn->recv_region;
      sge.length = BUFFER_SIZE;
      sge.lkey = conn->recv_mr->lkey;

      TEST_NZ(ibv_post_recv(conn->qp, &wr, &bad_wr));
    }

    void register_memory(struct connection *conn)
    {
      conn->send_region = malloc(BUFFER_SIZE);
      conn->recv_region = malloc(BUFFER_SIZE);

      TEST_Z(conn->send_mr = ibv_reg_mr(
        s_ctx->pd,
        conn->send_region,
        BUFFER_SIZE,
        0));

      /* the receive buffer is written by the NIC, so it needs local write access */
      TEST_Z(conn->recv_mr = ibv_reg_mr(
        s_ctx->pd,
        conn->recv_region,
        BUFFER_SIZE,
        IBV_ACCESS_LOCAL_WRITE));
    }

    void on_completion(struct ibv_wc *wc)
    {
      if (wc->status != IBV_WC_SUCCESS)
        die("on_completion: status is not IBV_WC_SUCCESS.");

      if (wc->opcode & IBV_WC_RECV) {
        struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;

        printf("received message: %s\n", conn->recv_region);

      } else if (wc->opcode == IBV_WC_SEND) {
        printf("send completed successfully.\n");
      }
    }

    int on_connect_request(struct rdma_cm_id *id)
    {
      struct ibv_qp_init_attr qp_attr;
      struct rdma_conn_param cm_params;
      struct connection *conn;

      printf("received connection request.\n");

      build_context(id->verbs);
      build_qp_attr(&qp_attr);

      TEST_NZ(rdma_create_qp(id, s_ctx->pd, &qp_attr));

      id->context = conn = (struct connection *)malloc(sizeof(struct connection));
      conn->qp = id->qp;

      register_memory(conn);
      post_receives(conn); /* post the receive before accepting, so the peer's send finds a buffer */

      memset(&cm_params, 0, sizeof(cm_params));
      TEST_NZ(rdma_accept(id, &cm_params));

      return 0;
    }

    int on_connection(void *context)
    {
      struct connection *conn = (struct connection *)context;
      struct ibv_send_wr wr, *bad_wr = NULL;
      struct ibv_sge sge;

      snprintf(conn->send_region, BUFFER_SIZE, "message from passive/server side with pid %d", getpid());

      printf("connected. posting send...\n");

      memset(&wr, 0, sizeof(wr));

      wr.opcode = IBV_WR_SEND;
      wr.sg_list = &sge;
      wr.num_sge = 1;
      wr.send_flags = IBV_SEND_SIGNALED;

      sge.addr = (uintptr_t)conn->send_region;
      sge.length = BUFFER_SIZE;
      sge.lkey = conn->send_mr->lkey;

      TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));

      return 0;
    }

    int on_disconnect(struct rdma_cm_id *id)
    {
      struct connection *conn = (struct connection *)id->context;

      printf("peer disconnected.\n");

      rdma_destroy_qp(id);

      ibv_dereg_mr(conn->send_mr);
      ibv_dereg_mr(conn->recv_mr);

      free(conn->send_region);
      free(conn->recv_region);

      free(conn);

      rdma_destroy_id(id);

      return 0;
    }

    int on_event(struct rdma_cm_event *event)
    {
      int r = 0;

      if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST)
        r = on_connect_request(event->id);
      else if (event->event == RDMA_CM_EVENT_ESTABLISHED)
        r = on_connection(event->id->context);
      else if (event->event == RDMA_CM_EVENT_DISCONNECTED)
        r = on_disconnect(event->id);
      else
        die("on_event: unknown event.");

      return r;
    }

Client (client.c):

    #include <netdb.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <rdma/rdma_cma.h>

    #define TEST_NZ(x) do { if ( (x)) die("error: " #x " failed (returned non-zero)." ); } while (0)
    #define TEST_Z(x)  do { if (!(x)) die("error: " #x " failed (returned zero/null)."); } while (0)

    const int BUFFER_SIZE = 1024;
    const int TIMEOUT_IN_MS = 500; /* ms */

    struct context {
      struct ibv_context *ctx;
      struct ibv_pd *pd;
      struct ibv_cq *cq;
      struct ibv_comp_channel *comp_channel;

      pthread_t cq_poller_thread;
    };

    struct connection {
      struct rdma_cm_id *id;
      struct ibv_qp *qp;

      struct ibv_mr *recv_mr;
      struct ibv_mr *send_mr;

      char *recv_region;
      char *send_region;

      int num_completions;
    };

    static void die(const char *reason);

    static void build_context(struct ibv_context *verbs);
    static void build_qp_attr(struct ibv_qp_init_attr *qp_attr);
    static void * poll_cq(void *);
    static void post_receives(struct connection *conn);
    static void register_memory(struct connection *conn);

    static int on_addr_resolved(struct rdma_cm_id *id);
    static void on_completion(struct ibv_wc *wc);
    static int on_connection(void *context);
    static int on_disconnect(struct rdma_cm_id *id);
    static int on_event(struct rdma_cm_event *event);
    static int on_route_resolved(struct rdma_cm_id *id);

    static struct context *s_ctx = NULL;

    int main(int argc, char **argv)
    {
      struct addrinfo *addr;
      struct rdma_cm_event *event = NULL;
      struct rdma_cm_id *conn= NULL;
      struct rdma_event_channel *ec = NULL;

      if (argc != 3)
        die("usage: client <server-address> <server-port>");

      TEST_NZ(getaddrinfo(argv[1], argv[2], NULL, &addr));

      TEST_Z(ec = rdma_create_event_channel());
      TEST_NZ(rdma_create_id(ec, &conn, NULL, RDMA_PS_TCP));
      TEST_NZ(rdma_resolve_addr(conn, NULL, addr->ai_addr, TIMEOUT_IN_MS));

      freeaddrinfo(addr);

      while (rdma_get_cm_event(ec, &event) == 0) {
        struct rdma_cm_event event_copy;

        /* copy the event so it can be acknowledged before it is handled */
        memcpy(&event_copy, event, sizeof(*event));
        rdma_ack_cm_event(event);

        if (on_event(&event_copy))
          break;
      }

      rdma_destroy_event_channel(ec);

      return 0;
    }

    void die(const char *reason)
    {
      fprintf(stderr, "%s\n", reason);
      exit(EXIT_FAILURE);
    }

    void build_context(struct ibv_context *verbs)
    {
      if (s_ctx) {
        if (s_ctx->ctx != verbs)
          die("cannot handle events in more than one context.");

        return;
      }

      s_ctx = (struct context *)malloc(sizeof(struct context));

      s_ctx->ctx = verbs;

      TEST_Z(s_ctx->pd = ibv_alloc_pd(s_ctx->ctx));
      TEST_Z(s_ctx->comp_channel = ibv_create_comp_channel(s_ctx->ctx));
      TEST_Z(s_ctx->cq = ibv_create_cq(s_ctx->ctx, 10, NULL, s_ctx->comp_channel, 0)); /* cqe=10 is arbitrary */
      TEST_NZ(ibv_req_notify_cq(s_ctx->cq, 0));

      TEST_NZ(pthread_create(&s_ctx->cq_poller_thread, NULL, poll_cq, NULL));
    }

    void build_qp_attr(struct ibv_qp_init_attr *qp_attr)
    {
      memset(qp_attr, 0, sizeof(*qp_attr));

      qp_attr->send_cq = s_ctx->cq;
      qp_attr->recv_cq = s_ctx->cq;
      qp_attr->qp_type = IBV_QPT_RC;

      qp_attr->cap.max_send_wr = 10;
      qp_attr->cap.max_recv_wr = 10;
      qp_attr->cap.max_send_sge = 1;
      qp_attr->cap.max_recv_sge = 1;
    }

    void * poll_cq(void *ctx)
    {
      struct ibv_cq *cq;
      struct ibv_wc wc;

      while (1) {
        TEST_NZ(ibv_get_cq_event(s_ctx->comp_channel, &cq, &ctx));
        ibv_ack_cq_events(cq, 1);
        TEST_NZ(ibv_req_notify_cq(cq, 0));

        while (ibv_poll_cq(cq, 1, &wc))
          on_completion(&wc);
      }

      return NULL;
    }

    void post_receives(struct connection *conn)
    {
      struct ibv_recv_wr wr, *bad_wr = NULL;
      struct ibv_sge sge;

      wr.wr_id = (uintptr_t)conn;
      wr.next = NULL;
      wr.sg_list = &sge;
      wr.num_sge = 1;

      sge.addr = (uintptr_t)conn->recv_region;
      sge.length = BUFFER_SIZE;
      sge.lkey = conn->recv_mr->lkey;

      TEST_NZ(ibv_post_recv(conn->qp, &wr, &bad_wr));
    }

    void register_memory(struct connection *conn)
    {
      conn->send_region = malloc(BUFFER_SIZE);
      conn->recv_region = malloc(BUFFER_SIZE);

      TEST_Z(conn->send_mr = ibv_reg_mr(
        s_ctx->pd,
        conn->send_region,
        BUFFER_SIZE,
        0));

      TEST_Z(conn->recv_mr = ibv_reg_mr(
        s_ctx->pd,
        conn->recv_region,
        BUFFER_SIZE,
        IBV_ACCESS_LOCAL_WRITE));
    }

    int on_addr_resolved(struct rdma_cm_id *id)
    {
      struct ibv_qp_init_attr qp_attr;
      struct connection *conn;

      printf("address resolved.\n");

      build_context(id->verbs);
      build_qp_attr(&qp_attr);

      TEST_NZ(rdma_create_qp(id, s_ctx->pd, &qp_attr));

      id->context = conn = (struct connection *)malloc(sizeof(struct connection));

      conn->id = id;
      conn->qp = id->qp;
      conn->num_completions = 0;

      register_memory(conn);
      post_receives(conn);

      TEST_NZ(rdma_resolve_route(id, TIMEOUT_IN_MS));

      return 0;
    }

    void on_completion(struct ibv_wc *wc)
    {
      struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;

      if (wc->status != IBV_WC_SUCCESS)
        die("on_completion: status is not IBV_WC_SUCCESS.");

      if (wc->opcode & IBV_WC_RECV)
        printf("received message: %s\n", conn->recv_region);
      else if (wc->opcode == IBV_WC_SEND)
        printf("send completed successfully.\n");
      else
        die("on_completion: completion isn't a send or a receive.");

      /* one send plus one receive completed: the exchange is done */
      if (++conn->num_completions == 2)
        rdma_disconnect(conn->id);
    }

    int on_connection(void *context)
    {
      struct connection *conn = (struct connection *)context;
      struct ibv_send_wr wr, *bad_wr = NULL;
      struct ibv_sge sge;

      snprintf(conn->send_region, BUFFER_SIZE, "message from active/client side with pid %d", getpid());

      printf("connected. posting send...\n");

      memset(&wr, 0, sizeof(wr));

      wr.wr_id = (uintptr_t)conn;
      wr.opcode = IBV_WR_SEND;
      wr.sg_list = &sge;
      wr.num_sge = 1;
      wr.send_flags = IBV_SEND_SIGNALED;

      sge.addr = (uintptr_t)conn->send_region;
      sge.length = BUFFER_SIZE;
      sge.lkey = conn->send_mr->lkey;

      TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));

      return 0;
    }

    int on_disconnect(struct rdma_cm_id *id)
    {
      struct connection *conn = (struct connection *)id->context;

      printf("disconnected.\n");

      rdma_destroy_qp(id);

      ibv_dereg_mr(conn->send_mr);
      ibv_dereg_mr(conn->recv_mr);

      free(conn->send_region);
      free(conn->recv_region);

      free(conn);

      rdma_destroy_id(id);

      return 1; /* exit event loop */
    }

    int on_event(struct rdma_cm_event *event)
    {
      int r = 0;

      if (event->event == RDMA_CM_EVENT_ADDR_RESOLVED)
        r = on_addr_resolved(event->id);
      else if (event->event == RDMA_CM_EVENT_ROUTE_RESOLVED)
        r = on_route_resolved(event->id);
      else if (event->event == RDMA_CM_EVENT_ESTABLISHED)
        r = on_connection(event->id->context);
      else if (event->event == RDMA_CM_EVENT_DISCONNECTED)
        r = on_disconnect(event->id);
      else
        die("on_event: unknown event.");

      return r;
    }

    int on_route_resolved(struct rdma_cm_id *id)
    {
      struct rdma_conn_param cm_params;

      printf("route resolved.\n");

      memset(&cm_params, 0, sizeof(cm_params));
      TEST_NZ(rdma_connect(id, &cm_params));

      return 0;
    }

Further tutorials

InfiniBand, Verbs, RDMA | https://thegeekinthecorner.wordpress.com/category/infiniband-verbs-rdma/

RDMA read and write with IB verbs | https://thegeekinthecorner.wordpress.com/2010/09/28/rdma-read-and-write-with-ib-verbs/

http://www.hpcadvisorycouncil.com/pdf/building-an-rdma-capable-application-with-ib-verbs.pdf
