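RDMA (Remote Direct Memory Access) was created to reduce the latency of server-side data processing during network transfers. RDMA moves data over the network directly into the memory of a remote system without involving the operating system, so very little CPU processing is needed. It eliminates extra memory copies and context switches, which frees memory bandwidth and CPU cycles for the application. RDMA requires an RDMA-capable NIC; the one used here is a Mellanox ConnectX-5 (CX5).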
Environment: CentOS 7.8 x86_64
1. Identify the CX5 NIC
# lspci | grep Mellanox
5e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
5e:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
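The check above can also be scripted. The following is a minimal sketch, assuming the `lspci` line layout shown above (address, device class, description); the function name `find_mellanox_devices` is hypothetical and only used for illustration.

```python
import re

# Matches lines such as:
# 5e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
LSPCI_LINE = re.compile(r"^(?P<addr>\S+)\s+(?P<cls>[^:]+):\s+(?P<desc>.*Mellanox.*)$")


def find_mellanox_devices(lspci_text):
    """Return (pci_address, description) pairs for every Mellanox device line."""
    devices = []
    for line in lspci_text.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match:
            devices.append((match.group("addr"), match.group("desc")))
    return devices


# Sample input copied from the lspci output above.
SAMPLE = """\
5e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
5e:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
"""

for addr, desc in find_mellanox_devices(SAMPLE):
    print(addr, desc)
```

On a real host the text would come from running `lspci`, for example via `subprocess.check_output(["lspci"], universal_newlines=True)`.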
2. Install the MLNX_OFED driver
Download the driver package that matches the OS from https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/.
Official documentation: https://enterprise-support.nvidia.com/s/article/howto-install-mlnx-ofed-driver
# Download MLNX_OFED_LINUX-23.10-3.2.2.0-rhel7.8-x86_64.tgz
#yum install createrepo
#yum install tcl fuse-libs tk
#tar zxvf MLNX_OFED_LINUX-23.10-3.2.2.0-rhel7.8-x86_64.tgz
#cd MLNX_OFED_LINUX-23.10-3.2.2.0-rhel7.8-x86_64
#./mlnxofedinstall --add-kernel-support --with-nvmf --force # --with-nvmf is used here because NVMe-oF over RDMA is tested later
#dracut -f
# /etc/init.d/openibd restart
Unloading HCA driver: [ OK ]
Loading HCA driver and Access Layer: [ OK ]
#systemctl enable openibd
#reboot
# lsmod |grep -i nvme
nvme 47306 8
nvme_core 94686 5 nvme
mlx_compat 55285 13 nvme,rdma_cm,ib_cm,iw_cm,auxiliary,mlx5_ib,nvme_core,ib_core,ib_umad,ib_uverbs,mlx5_core,rdma_ucm,ib_ipoib
3. Check the devices
#ibdev2netdev
mlx5_0 port 1 ==> eth2 (Up)
mlx5_1 port 1 ==> eth3 (Up)
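To use this mapping in a script, the output can be parsed into a dictionary. This is a sketch that assumes the exact line layout printed above (`<device> port <n> ==> <netdev> (<state>)`); `parse_ibdev2netdev` is a hypothetical helper name.

```python
import re

# Matches lines such as: mlx5_0 port 1 ==> eth2 (Up)
MAPPING_LINE = re.compile(
    r"^(?P<dev>\S+) port (?P<port>\d+) ==> (?P<netdev>\S+) \((?P<state>\w+)\)$"
)


def parse_ibdev2netdev(text):
    """Map (rdma_device, port) to (network_interface, link_state)."""
    mapping = {}
    for line in text.splitlines():
        match = MAPPING_LINE.match(line.strip())
        if match:
            key = (match.group("dev"), int(match.group("port")))
            mapping[key] = (match.group("netdev"), match.group("state"))
    return mapping


# Sample input copied from the ibdev2netdev output above.
SAMPLE = "mlx5_0 port 1 ==> eth2 (Up)\nmlx5_1 port 1 ==> eth3 (Up)\n"
print(parse_ibdev2netdev(SAMPLE))
```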
#ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:1270:fdff:fe27:f982
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 25 Gb/sec (1X EDR)
link_layer: Ethernet
Infiniband device 'mlx5_1' port 1 status:
default gid: fe80:0000:0000:0000:1270:fdff:fe27:f983
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 25 Gb/sec (1X EDR)
link_layer: Ethernet
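The fields worth checking in this output are `state`, `phys state` and `link_layer`. A minimal parsing sketch follows, assuming the `ibstatus` layout shown above; `parse_ibstatus` and `port_is_ready` are hypothetical names.

```python
import re

# Matches the per-port header line, e.g.:
# Infiniband device 'mlx5_0' port 1 status:
HEADER = re.compile(r"^Infiniband device '([^']+)' port (\d+) status:$")


def parse_ibstatus(text):
    """Return {(device, port): {field: value}} from ibstatus output."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = {}
            ports[(header.group(1), int(header.group(2)))] = current
        elif current is not None and ":" in line:
            # Split on the first colon only, so values such as
            # "4: ACTIVE" or a GID with colons stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports


def port_is_ready(fields):
    """True when the port is logically ACTIVE and physically LinkUp."""
    return (fields.get("state", "").endswith("ACTIVE")
            and fields.get("phys state", "").endswith("LinkUp"))


# Sample input copied from the ibstatus output above (first port only).
SAMPLE = """\
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:1270:fdff:fe27:f982
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 25 Gb/sec (1X EDR)
link_layer: Ethernet
"""

for key, fields in parse_ibstatus(SAMPLE).items():
    print(key, port_is_ready(fields), fields["link_layer"])
```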
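Common commands
- ibstat: query the basic state of the InfiniBand devices
- ibstatus: NIC information
- ibv_devinfo: NIC device information (ibv_devinfo -d mlx5_0 -v)
- ibv_devices: list the InfiniBand devices on this host
- ibnodes: list the InfiniBand devices in the network
- show_gids: show the RoCE versions the NIC supports
- show_counters: port statistics, such as the amount of data sent and received
- mlxconfig: NIC configuration (mlxconfig -d mlx5_1 q queries the current configuration)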
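Throughput test
Write throughput
Installing the RDMA driver also installs a set of RDMA tools; ib_write_bw can be used to test write throughput.
Server A (server):
ib_write_bw -a -d mlx5_0
Server B (client):
ib_write_bw -a -d mlx5_0 10.192.51.152 # IP address of the server
Read throughput
The read throughput test works the same way as the write throughput test, with ib_read_bw in place of ib_write_bw.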
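When the same test is run repeatedly, the two command lines can be generated rather than typed. The sketch below only builds the argument lists from the flags used above (`-a`, `-d`); `perftest_command` is a hypothetical helper, and actually launching the command (for example with `subprocess.run`) is left to the caller.

```python
# Bandwidth tools named in this section.
BANDWIDTH_TOOLS = ("ib_write_bw", "ib_read_bw")


def perftest_command(tool, device, server_ip=None):
    """Build the argument list for the server side (no IP) or the client side."""
    if tool not in BANDWIDTH_TOOLS:
        raise ValueError("unsupported tool: %s" % tool)
    command = [tool, "-a", "-d", device]
    if server_ip is not None:
        # The client passes the server's IP address as the last argument.
        command.append(server_ip)
    return command


# Server A, then server B (client), using the values from the commands above.
print(" ".join(perftest_command("ib_write_bw", "mlx5_0")))
print(" ".join(perftest_command("ib_write_bw", "mlx5_0", "10.192.51.152")))
```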
Latency test
Latency is likewise measured separately for reads and writes, using ib_read_lat and ib_write_lat.
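Bandwidth statistics
When using RDMA, an application can track the amount of data it sends and receives itself, which gives an exact picture of its own traffic.
To find out how much data the RDMA NIC as a whole is sending and receiving, read the relevant nodes under sysfs.
- Transmitted data (bytes / 4):
/sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
- Received data (bytes / 4):
/sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data
Note: port_xmit_data and port_rcv_data hold 1/4 of the actual byte count, so the real figure is the counter value multiplied by 4. This is presumably meant to delay counter overflow.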
port_xmit_data: (RO) Total number of data octets, divided by 4 (lanes), transmitted on all VLs. This is 64 bit counter
port_rcv_data: (RO) Total number of data octets, divided by 4 (lanes), received on all VLs. This is 64 bit counter.
Source:
Documentation/ABI/stable/sysfs-class-infiniband
pma_cnt_ext->port_xmit_data =
cpu_to_be64(MLX5_SUM_CNT(out, transmitted_ib_unicast.octets,
transmitted_ib_multicast.octets) >> 2);
pma_cnt_ext->port_rcv_data =
cpu_to_be64(MLX5_SUM_CNT(out, received_ib_unicast.octets,
received_ib_multicast.octets) >> 2);
file: drivers/infiniband/hw/mlx5/mad.c
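Since the driver shifts the octet count right by 2, a bandwidth figure is obtained by sampling the counter twice, multiplying the difference by 4 to get bytes, and dividing by the elapsed time. The sketch below illustrates this under the assumption that the sysfs path shown earlier exists on the host; the function names are hypothetical.

```python
import time
from pathlib import Path

# Path taken from the text above; adjust the device and port as needed.
COUNTER_DIR = Path("/sys/class/infiniband/mlx5_0/ports/1/counters")


def read_counter(name, counter_dir=COUNTER_DIR):
    """Read one raw counter value (an integer) from sysfs."""
    return int((Path(counter_dir) / name).read_text().strip())


def bandwidth_bits_per_sec(before, after, interval_sec):
    """Convert two raw samples (in 4-octet units) into bits per second."""
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    octets = (after - before) * 4  # undo the driver's ">> 2"
    return octets * 8 / interval_sec


def sample_bandwidth(name="port_xmit_data", interval_sec=1.0,
                     counter_dir=COUNTER_DIR):
    """Sample a counter twice and return the average rate in bits per second."""
    start_value = read_counter(name, counter_dir)
    start_time = time.monotonic()
    time.sleep(interval_sec)
    end_value = read_counter(name, counter_dir)
    elapsed = time.monotonic() - start_time
    return bandwidth_bits_per_sec(start_value, end_value, elapsed)
```

Calling `sample_bandwidth("port_rcv_data")` on a host with the NIC gives the receive rate in the same way.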
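Network connectivity test
Because the NIC currently runs only in Ethernet mode (link_layer: Ethernet in the ibstatus output), the ping test has to use ibv_rc_pingpong.
Server A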
# ibv_rc_pingpong -d mlx5_0 -g 0
local address: LID 0x0000, QPN 0x000088, PSN 0xf4799b, GID fe80::1270:fdff:fe27:f982
remote address: LID 0x0000, QPN 0x000088, PSN 0x22fd0b, GID fe80::ac0:ebff:fef4:4bf4
8192000 bytes in 0.01 seconds = 6475.25 Mbit/sec
1000 iters in 0.01 seconds = 10.12 usec/iter
client B
# ibv_rc_pingpong -d mlx5_1 -g 0 10.192.51.152
local address: LID 0x0000, QPN 0x000088, PSN 0x22fd0b, GID fe80::ac0:ebff:fef4:4bf4
remote address: LID 0x0000, QPN 0x000088, PSN 0xf4799b, GID fe80::1270:fdff:fe27:f982
8192000 bytes in 0.01 seconds = 6746.55 Mbit/sec
1000 iters in 0.01 seconds = 9.71 usec/iter
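The two summary lines printed by ibv_rc_pingpong can be extracted for logging or comparison between runs. This is a sketch that assumes the summary format shown above; `parse_pingpong` is a hypothetical name.

```python
import re

# Match the two summary lines, e.g.:
# 8192000 bytes in 0.01 seconds = 6475.25 Mbit/sec
# 1000 iters in 0.01 seconds = 10.12 usec/iter
THROUGHPUT = re.compile(r"(\d+) bytes in ([\d.]+) seconds = ([\d.]+) Mbit/sec")
LATENCY = re.compile(r"(\d+) iters in ([\d.]+) seconds = ([\d.]+) usec/iter")


def parse_pingpong(text):
    """Return the throughput and per-iteration latency found in the output."""
    result = {}
    match = THROUGHPUT.search(text)
    if match:
        result["bytes"] = int(match.group(1))
        result["mbit_per_sec"] = float(match.group(3))
    match = LATENCY.search(text)
    if match:
        result["iters"] = int(match.group(1))
        result["usec_per_iter"] = float(match.group(3))
    return result


# Sample input copied from the server A output above.
SAMPLE = """\
8192000 bytes in 0.01 seconds = 6475.25 Mbit/sec
1000 iters in 0.01 seconds = 10.12 usec/iter
"""
print(parse_pingpong(SAMPLE))
```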
counters
# ls -lsh /sys/class/infiniband/mlx5_0/ports/1/counters/
total 0
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 excessive_buffer_overrun_errors
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 link_downed
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 link_error_recovery
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 local_link_integrity_errors
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 multicast_rcv_packets
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 multicast_xmit_packets
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 port_rcv_constraint_errors
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 port_rcv_data
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 port_rcv_errors
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 port_rcv_packets
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 port_rcv_remote_physical_errors
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 port_rcv_switch_relay_errors
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 port_xmit_constraint_errors
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 port_xmit_data
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 port_xmit_discards
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 port_xmit_packets
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 port_xmit_wait
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 symbol_error
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 unicast_rcv_packets
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 unicast_xmit_packets
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 VL15_dropped
Counter Description:
Counter | Description | InfiniBand Spec Name | Group |
---|---|---|---|
port_rcv_data | The total number of data octets, divided by 4, (counting in double words, 32 bits), received on all VLs from the port. | PortRcvData | Informative |
port_rcv_packets | Total number of packets (this may include packets containing errors). This is a 64-bit counter. | PortRcvPkts | Informative |
port_multicast_rcv_packets | Total number of multicast packets, including multicast packets containing errors. | PortMultiCastRcvPkts | Informative |
port_unicast_rcv_packets | Total number of unicast packets, including unicast packets containing errors. | PortUnicastRcvPkts | Informative |
port_xmit_data | The total number of data octets, divided by 4, (counting in double words, 32 bits), transmitted on all VLs from the port. | PortXmitData | Informative |
port_xmit_packets / port_xmit_packets_64 | Total number of packets transmitted on all VLs from this port. This may include packets with errors. This is a 64-bit counter. | PortXmitPkts | Informative |
port_rcv_switch_relay_errors | Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay. | PortRcvSwitchRelayErrors | Error |
port_rcv_errors | Total number of packets containing an error that were received on the port. | PortRcvErrors | Informative |
port_rcv_constraint_errors | Total number of packets received on the switch physical port that are discarded. | PortRcvConstraintErrors | Error |
local_link_integrity_errors | The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors. | LocalLinkIntegrityErrors | Error |
port_xmit_wait | The number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration). | PortXmitWait | Informative |
port_multicast_xmit_packets | Total number of multicast packets transmitted on all VLs from the port. This may include multicast packets with errors. | PortMultiCastXmitPkts | Informative |
port_unicast_xmit_packets | Total number of unicast packets transmitted on all VLs from the port. This may include unicast packets with errors. | PortUnicastXmitPkts | Informative |
port_xmit_discards | Total number of outbound packets discarded by the port because the port is down or congested. | PortXmitDiscards | Error |
port_xmit_constraint_errors | Total number of packets not transmitted from the switch physical port. | PortXmitConstraintErrors | Error |
port_rcv_remote_physical_errors | Total number of packets marked with the EBP delimiter received on the port. | PortRcvRemotePhysicalErrors | Error |
symbol_error | Total number of minor link errors detected on one or more physical lanes. | SymbolErrorCounter | Error |
VL15_dropped | Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) of the port. | VL15Dropped | Error |
link_error_recovery | Total number of times the Port Training state machine has successfully completed the link error recovery process. | LinkErrorRecoveryCounter | Error |
link_downed | Total number of times the Port Training state machine has failed the link error recovery process and downed the link. | LinkDownedCounter | Error |
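A quick health check is to read every file in the counters directory and report the counters from the Error group that are non-zero. The sketch below assumes the set of error counters can be taken from the Group column of the table above, using the file names from the `ls` listing; the helper names are hypothetical.

```python
from pathlib import Path

# Counters marked "Error" in the table above, under their sysfs file names.
ERROR_COUNTERS = {
    "port_rcv_switch_relay_errors", "port_rcv_constraint_errors",
    "local_link_integrity_errors", "port_xmit_discards",
    "port_xmit_constraint_errors", "port_rcv_remote_physical_errors",
    "symbol_error", "VL15_dropped", "link_error_recovery", "link_downed",
}


def read_counters(counter_dir):
    """Return {counter_name: value} for every readable integer file."""
    values = {}
    for entry in sorted(Path(counter_dir).iterdir()):
        if not entry.is_file():
            continue
        try:
            values[entry.name] = int(entry.read_text().strip())
        except (ValueError, OSError):
            # Skip files that are unreadable or do not hold a plain integer.
            continue
    return values


def nonzero_errors(values):
    """Keep only the Error-group counters whose value is above zero."""
    return {name: value for name, value in values.items()
            if name in ERROR_COUNTERS and value > 0}
```

For example, `nonzero_errors(read_counters("/sys/class/infiniband/mlx5_0/ports/1/counters"))` returns an empty dictionary on a port that has logged no errors.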
hw_counters
# ls -lsh /sys/class/infiniband/mlx5_0/ports/1/hw_counters/
total 0
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 duplicate_request
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 implied_nak_seq_err
0 -rw-r--r-- 1 root root 4.0K 5月 28 16:42 lifespan
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 local_ack_timeout_err
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 np_cnp_sent
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 np_ecn_marked_roce_packets
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 out_of_buffer
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 out_of_sequence
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 packet_seq_err
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 req_cqe_error
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 req_cqe_flush_error
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 req_remote_access_errors
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 req_remote_invalid_request
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 resp_cqe_error
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 resp_cqe_flush_error
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 resp_local_length_error
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 resp_remote_access_errors
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 rnr_nak_retry_err
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 rp_cnp_handled
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 rp_cnp_ignored
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 rx_atomic_requests
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 rx_icrc_encapsulated
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 rx_read_requests
0 -r--r--r-- 1 root root 4.0K 5月 24 15:28 rx_write_requests
HW Counters Description:
Counter | Description | Group |
---|---|---|
duplicate_request | Number of received packets. A duplicate request is a request that had been previously executed. | Error |
implied_nak_seq_err | Number of times the requester detected an ACK with a PSN larger than the expected PSN for an RDMA read or response. | Error |
lifespan | The maximum period in ms which defines the aging of the counter reads. Two consecutive reads within this period might return the same values | Informative |
local_ack_timeout_err | The number of times the QP's ack timer expired for RC, XRC, DCT QPs at the sender side. The QP retry limit was not exceeded, so the error is still recoverable. | Error |
np_cnp_sent | The number of CNP packets sent by the Notification Point when it noticed congestion experienced in the RoCEv2 IP header (ECN bits). This counter was added in MLNX_OFED 4.1. | Informative |
np_ecn_marked_roce_packets | The number of RoCEv2 packets received by the Notification Point which were marked as having experienced congestion (ECN bits were '11' on the ingress RoCE traffic). This counter was added in MLNX_OFED 4.1. | Informative |
out_of_buffer | The number of drops occurred due to lack of WQE for the associated QPs. | Error |
out_of_sequence | The number of out of sequence packets received. | Error |
packet_seq_err | The number of received NAK sequence error packets. The QP retry limit was not exceeded. | Error |
req_cqe_error | The number of times the requester detected CQEs completed with errors. This counter was added in MLNX_OFED 4.1. | Error |
req_cqe_flush_error | The number of times the requester detected CQEs completed with flushed errors. This counter was added in MLNX_OFED 4.1. | Error |
req_remote_access_errors | The number of times the requester detected remote access errors. This counter was added in MLNX_OFED 4.1. | Error |
req_remote_invalid_request | The number of times the requester detected remote invalid request errors. This counter was added in MLNX_OFED 4.1. | Error |
resp_cqe_error | The number of times the responder detected CQEs completed with errors. This counter was added in MLNX_OFED 4.1. | Error |
resp_cqe_flush_error | The number of times the responder detected CQEs completed with flushed errors. This counter was added in MLNX_OFED 4.1. | Error |
resp_local_length_error | The number of times the responder detected local length errors. This counter was added in MLNX_OFED 4.1. | Error |
resp_remote_access_errors | The number of times the responder detected remote access errors. This counter was added in MLNX_OFED 4.1. | Error |
rnr_nak_retry_err | The number of received RNR NAK packets. The QP retry limit was not exceeded. | Error |
rp_cnp_handled | The number of CNP packets handled by the Reaction Point HCA to throttle the transmission rate. This counter was added in MLNX_OFED 4.1. | Informative |
rp_cnp_ignored | The number of CNP packets received and ignored by the Reaction Point HCA. This counter should not increase if RoCE Congestion Control is enabled in the network. If it does increase, verify that ECN is enabled on the adapter. See HowTo Configure DCQCN (RoCE CC) values for ConnectX-4 (Linux). This counter was added in MLNX_OFED 4.1. | Error |
rx_atomic_requests | The number of received ATOMIC request for the associated QPs. | Informative |
rx_dct_connect | The number of received connection request for the associated DCTs. | Informative |
rx_read_requests | The number of received READ requests for the associated QPs. | Informative |
rx_write_requests | The number of received WRITE requests for the associated QPs. | Informative |
rx_icrc_encapsulated | The number of RoCE packets with ICRC errors. This counter was added in MLNX_OFED 4.4 and kernel 4.19. | Error |
roce_adp_retrans | Counts the number of adaptive retransmissions for RoCE traffic. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative |
roce_adp_retrans_to | Counts the number of times RoCE traffic reached timeout due to adaptive retransmission. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative |
roce_slow_restart | Counts the number of times RoCE slow restart was used. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative |
roce_slow_restart_cnps | Counts the number of times RoCE slow restart generated CNP packets. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative |
roce_slow_restart_trans | Counts the number of times RoCE slow restart changed state to slow restart. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative |
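- duplicate_request (Duplicated packets): number of received packets that were duplicate requests; a duplicate request is a request that had already been executed.
- out_of_sequence (Drop out of sequence): number of out-of-sequence packets received, which indicates that packet loss has already occurred.
- packet_seq_err (NAK sequence rcvd): number of NAK sequence error packets received; the QP retry limit was not exceeded.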
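These three counters are cumulative, so what matters during a test is how much they grow. A minimal sketch of comparing two snapshots follows; the snapshots are plain dictionaries of counter values (for instance read from the hw_counters directory before and after a run), and the helper names and the choice of these three counters as loss indicators are assumptions based on the descriptions above.

```python
# Counters singled out above as signs of duplicated or lost packets.
LOSS_INDICATORS = ("duplicate_request", "out_of_sequence", "packet_seq_err")


def counter_deltas(before, after):
    """Return the increase of every counter present in both snapshots."""
    return {name: after[name] - before[name]
            for name in after if name in before}


def loss_indicators(before, after):
    """Return the loss-related counters that increased between snapshots."""
    deltas = counter_deltas(before, after)
    return {name: deltas[name] for name in LOSS_INDICATORS
            if deltas.get(name, 0) > 0}


# Hypothetical snapshot values, for illustration only.
before = {"duplicate_request": 0, "out_of_sequence": 5, "packet_seq_err": 2}
after = {"duplicate_request": 0, "out_of_sequence": 9, "packet_seq_err": 2}
print(loss_indicators(before, after))
```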