RDMA Network Testing with a ConnectX-5 (CX5) NIC on CentOS 7

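RDMA (Remote Direct Memory Access) was created to reduce the latency of server-side data processing during network transfers. RDMA moves data over the network directly into the memory of a remote system without involving the operating system, so very little CPU processing is needed. It eliminates extra memory copies and context switches, which frees memory bandwidth and CPU cycles for the application. RDMA requires an RDMA-capable NIC; a Mellanox ConnectX-5 (CX5) is used here.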

Environment: CentOS 7.8 x86_64

1. Identify the CX5 NIC
#lspci |grep Mellanox
5e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
5e:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
2. Install the MLNX driver

Download the driver package that matches your OS from https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/.

Official documentation: https://enterprise-support.nvidia.com/s/article/howto-install-mlnx-ofed-driver

#First download MLNX_OFED_LINUX-23.10-3.2.2.0-rhel7.8-x86_64.tgz from the page above
#yum install createrepo
#yum install tcl fuse-libs tk
#tar zxvf MLNX_OFED_LINUX-23.10-3.2.2.0-rhel7.8-x86_64.tgz
#cd MLNX_OFED_LINUX-23.10-3.2.2.0-rhel7.8-x86_64
#./mlnxofedinstall  --add-kernel-support --with-nvmf --force    #--with-nvmf because NVMe-oF over RDMA will be tested later
#dracut -f
# /etc/init.d/openibd restart
Unloading HCA driver:                                      [  OK  ]
Loading HCA driver and Access Layer:                       [  OK  ]
#systemctl enable openibd
#reboot
# lsmod |grep -i nvme
nvme                   47306  8 
nvme_core              94686  5 nvme
mlx_compat             55285  13 nvme,rdma_cm,ib_cm,iw_cm,auxiliary,mlx5_ib,nvme_core,ib_core,ib_umad,ib_uverbs,mlx5_core,rdma_ucm,ib_ipoib
3. Check the devices
#ibdev2netdev
mlx5_0 port 1 ==> eth2 (Up)
mlx5_1 port 1 ==> eth3 (Up)
#ibstatus
Infiniband device 'mlx5_0' port 1 status:
        default gid:     fe80:0000:0000:0000:1270:fdff:fe27:f982
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            25 Gb/sec (1X EDR)
        link_layer:      Ethernet

Infiniband device 'mlx5_1' port 1 status:
        default gid:     fe80:0000:0000:0000:1270:fdff:fe27:f983
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            25 Gb/sec (1X EDR)
        link_layer:      Ethernet
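
The ibstatus output above is easy to check by eye for two ports, but a script helps when many hosts are involved. The following is a minimal Python sketch that parses text in the format shown above; the function names `parse_ibstatus` and `ports_ready` are hypothetical helpers, not part of MLNX_OFED, and the parser assumes the layout stays exactly as printed here.

```python
import re

# Sample text copied from the ibstatus output above (first port only).
SAMPLE = """Infiniband device 'mlx5_0' port 1 status:
        default gid:     fe80:0000:0000:0000:1270:fdff:fe27:f982
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            25 Gb/sec (1X EDR)
        link_layer:      Ethernet
"""

HEADER_RE = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")


def parse_ibstatus(text):
    """Return {(device, port): {field: value}} from ibstatus-style text."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = ports.setdefault((header.group(1), int(header.group(2))), {})
            continue
        if current is not None and ":" in line:
            # Split on the first colon only, since values contain colons too.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports


def ports_ready(ports):
    """True if every parsed port is ACTIVE and its physical state is LinkUp."""
    return bool(ports) and all(
        fields.get("state", "").endswith("ACTIVE")
        and fields.get("phys state", "").endswith("LinkUp")
        for fields in ports.values()
    )
```

For real use, the text would come from running ibstatus (for example through `subprocess.run(["ibstatus"], capture_output=True, text=True).stdout`) instead of the embedded sample.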

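Common commands

  • ibstat: query the basic status of InfiniBand devices
  • ibstatus: show NIC port status (state, rate, link layer)
  • ibv_devinfo: show NIC device details (ibv_devinfo -d mlx5_0 -v)
  • ibv_devices: list the InfiniBand devices on this host
  • ibnodes: list the InfiniBand devices in the network
  • show_gids: show which RoCE versions the NIC supports
  • show_counters: show port statistics, such as the amount of data sent and received
  • mlxconfig: NIC configuration (mlxconfig -d mlx5_1 q queries the current configuration)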

Throughput test

Write throughput

Installing the RDMA driver also installs a set of RDMA tools; ib_write_bw can be used to test write throughput.

Server A (server):

ib_write_bw -a -d mlx5_0

Server B (client):

ib_write_bw -a -d mlx5_0 10.192.51.152    # IP of the server

Read throughput

The read throughput test works the same way as the write throughput test; just use ib_read_bw instead.

Latency test

Latency is likewise tested for reads and writes; the tools are ib_read_lat and ib_write_lat.
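
All four tools follow the same server/client pattern: the server side runs without an address and the client side appends the server IP. The sketch below builds the argument lists in Python so that both sides stay in sync. `perftest_argv` is a hypothetical helper. The text only shows `-a` and `-d` with ib_write_bw, so reusing them for the other three tools is an assumption that should be checked against each tool's `--help`.

```python
# Tool names taken from the throughput and latency sections of this article.
PERFTEST_TOOLS = ("ib_write_bw", "ib_read_bw", "ib_write_lat", "ib_read_lat")


def perftest_argv(tool, device, server_ip=None, all_sizes=True):
    """Build the argv for one side of a test.

    server_ip=None builds the server command; passing an IP builds the client.
    all_sizes adds -a, as in the ib_write_bw example in the text.
    """
    if tool not in PERFTEST_TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    argv = [tool]
    if all_sizes:
        argv.append("-a")
    argv += ["-d", device]
    if server_ip is not None:
        argv.append(server_ip)
    return argv


def as_command_line(argv):
    """Join an argv list into a printable command line."""
    return " ".join(argv)
```

For example, `as_command_line(perftest_argv("ib_read_bw", "mlx5_0", "10.192.51.152"))` gives the client command for the read throughput test, and the returned list can be passed straight to `subprocess.run`.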

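Bandwidth statistics

When using RDMA, an application can collect its own send and receive byte counts, which gives a clear picture of how much data the program itself transfers.

To find out how much data the RDMA NIC as a whole has sent and received, read the corresponding nodes under sysfs.

  • Transmitted data (octets divided by 4): /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
  • Received data (octets divided by 4): /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data

The values of port_xmit_data and port_rcv_data are 1/4 of the actual byte counts, so multiply them by 4 to get bytes. The counters count 32-bit words rather than octets, presumably to delay overflow.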

port_xmit_data: (RO) Total number of data octets, divided by 4 (lanes), transmitted on all VLs. This is 64 bit counter

port_rcv_data: (RO) Total number of data octets, divided by 4 (lanes), received on all VLs. This is 64 bit counter.

Source: Documentation/ABI/stable/sysfs-class-infiniband

pma_cnt_ext->port_xmit_data =
    cpu_to_be64(MLX5_SUM_CNT(out, transmitted_ib_unicast.octets,
                 transmitted_ib_multicast.octets) >> 2);
pma_cnt_ext->port_rcv_data =
    cpu_to_be64(MLX5_SUM_CNT(out, received_ib_unicast.octets,
                 received_ib_multicast.octets) >> 2);

file: drivers/infiniband/hw/mlx5/mad.c
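
Putting the pieces above together, NIC-level bandwidth can be estimated by sampling the two counters twice and multiplying the difference by 4. The following Python sketch assumes the sysfs path from this article (device mlx5_0, port 1); the function names are hypothetical, and the result is an average over the sampling interval rather than an instantaneous rate.

```python
import time
from pathlib import Path

# Path taken from the text above; adjust device and port for your system.
COUNTER_DIR = Path("/sys/class/infiniband/mlx5_0/ports/1/counters")


def read_counter(name, counter_dir=COUNTER_DIR):
    """Read one sysfs counter file and return its value as an int."""
    return int((Path(counter_dir) / name).read_text().strip())


def words_to_bytes(words):
    """port_xmit_data and port_rcv_data are octets divided by 4; scale back."""
    return words * 4


def rate_gbps(words_before, words_after, seconds):
    """Average rate in Gbit/s between two raw counter samples."""
    return words_to_bytes(words_after - words_before) * 8 / seconds / 1e9


def sample_bandwidth(interval=1.0, counter_dir=COUNTER_DIR):
    """Sample both data counters twice and return (tx_gbps, rx_gbps)."""
    start = time.monotonic()
    tx0 = read_counter("port_xmit_data", counter_dir)
    rx0 = read_counter("port_rcv_data", counter_dir)
    time.sleep(interval)
    tx1 = read_counter("port_xmit_data", counter_dir)
    rx1 = read_counter("port_rcv_data", counter_dir)
    elapsed = time.monotonic() - start
    return rate_gbps(tx0, tx1, elapsed), rate_gbps(rx0, rx1, elapsed)
```

Calling `sample_bandwidth()` on a host with the NIC installed returns the transmit and receive rates over roughly one second. As a sanity check on the arithmetic, a counter that advances by 781250000 in one second corresponds to 3.125 GB/s, which is the 25 Gb/sec line rate reported by ibstatus.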

Network connectivity test

Because the NIC here supports only Ethernet mode, ibv_rc_pingpong has to be used for the ping test.

Server A

# ibv_rc_pingpong -d mlx5_0 -g 0
  local address:  LID 0x0000, QPN 0x000088, PSN 0xf4799b, GID fe80::1270:fdff:fe27:f982
  remote address: LID 0x0000, QPN 0x000088, PSN 0x22fd0b, GID fe80::ac0:ebff:fef4:4bf4
8192000 bytes in 0.01 seconds = 6475.25 Mbit/sec
1000 iters in 0.01 seconds = 10.12 usec/iter

Server B (client)

# ibv_rc_pingpong -d mlx5_1 -g 0 10.192.51.152
  local address:  LID 0x0000, QPN 0x000088, PSN 0x22fd0b, GID fe80::ac0:ebff:fef4:4bf4
  remote address: LID 0x0000, QPN 0x000088, PSN 0xf4799b, GID fe80::1270:fdff:fe27:f982
8192000 bytes in 0.01 seconds = 6746.55 Mbit/sec
1000 iters in 0.01 seconds = 9.71 usec/iter
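
When the ping test is repeated across many hosts, it can help to extract the two summary lines programmatically. The following Python sketch parses the summary format shown in the output above; `parse_pingpong` is a hypothetical helper, and the regular expressions assume the output format is exactly as printed here.

```python
import re

# Patterns for the two summary lines printed by ibv_rc_pingpong above.
BW_RE = re.compile(r"(\d+) bytes in ([\d.]+) seconds = ([\d.]+) Mbit/sec")
LAT_RE = re.compile(r"(\d+) iters in ([\d.]+) seconds = ([\d.]+) usec/iter")

# Summary lines copied from the server A output above.
SAMPLE = """8192000 bytes in 0.01 seconds = 6475.25 Mbit/sec
1000 iters in 0.01 seconds = 10.12 usec/iter
"""


def parse_pingpong(text):
    """Return the summary numbers as a dict; missing lines are omitted."""
    result = {}
    bw = BW_RE.search(text)
    if bw:
        result["bytes"] = int(bw.group(1))
        result["mbit_per_sec"] = float(bw.group(3))
    lat = LAT_RE.search(text)
    if lat:
        result["iters"] = int(lat.group(1))
        result["usec_per_iter"] = float(lat.group(3))
    return result
```

The seconds field is printed with only two decimals, so the sketch keeps the tool's own Mbit/sec and usec/iter figures rather than recomputing them from the rounded duration.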

counters

# ls -lsh /sys/class/infiniband/mlx5_0/ports/1/counters/
total 0
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 excessive_buffer_overrun_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 link_downed
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 link_error_recovery
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 local_link_integrity_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 multicast_rcv_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 multicast_xmit_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_rcv_constraint_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_rcv_data
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_rcv_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_rcv_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_rcv_remote_physical_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_rcv_switch_relay_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_xmit_constraint_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_xmit_data
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_xmit_discards
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_xmit_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 port_xmit_wait
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 symbol_error
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 unicast_rcv_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 unicast_xmit_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 VL15_dropped

Counter Description:

Counter | Description | InfiniBand Spec Name | Group
port_rcv_data | The total number of data octets, divided by 4 (counting in double words, 32 bits), received on all VLs from the port. | PortRcvData | Informative
port_rcv_packets | Total number of packets (this may include packets containing errors). This is a 64-bit counter. | PortRcvPkts | Informative
port_multicast_rcv_packets | Total number of multicast packets, including multicast packets containing errors. | PortMultiCastRcvPkts | Informative
port_unicast_rcv_packets | Total number of unicast packets, including unicast packets containing errors. | PortUnicastRcvPkts | Informative
port_xmit_data | The total number of data octets, divided by 4 (counting in double words, 32 bits), transmitted on all VLs from the port. | PortXmitData | Informative
port_xmit_packets, port_xmit_packets_64 | Total number of packets transmitted on all VLs from this port. This may include packets with errors. This is a 64-bit counter. | PortXmitPkts | Informative
port_rcv_switch_relay_errors | Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay. | PortRcvSwitchRelayErrors | Error
port_rcv_errors | Total number of packets containing an error that were received on the port. | PortRcvErrors | Informative
port_rcv_constraint_errors | Total number of packets received on the switch physical port that are discarded. | PortRcvConstraintErrors | Error
local_link_integrity_errors | The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors. | LocalLinkIntegrityErrors | Error
port_xmit_wait | The number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration). | PortXmitWait | Informative
port_multicast_xmit_packets | Total number of multicast packets transmitted on all VLs from the port. This may include multicast packets with errors. | PortMultiCastXmitPkts | Informative
port_unicast_xmit_packets | Total number of unicast packets transmitted on all VLs from the port. This may include unicast packets with errors. | PortUnicastXmitPkts | Informative
port_xmit_discards | Total number of outbound packets discarded by the port because the port is down or congested. | PortXmitDiscards | Error
port_xmit_constraint_errors | Total number of packets not transmitted from the switch physical port. | PortXmitConstraintErrors | Error
port_rcv_remote_physical_errors | Total number of packets marked with the EBP delimiter received on the port. | PortRcvRemotePhysicalErrors | Error
symbol_error | Total number of minor link errors detected on one or more physical lanes. | SymbolErrorCounter | Error
VL15_dropped | Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) of the port. | VL15Dropped | Error
link_error_recovery | Total number of times the Port Training state machine has successfully completed the link error recovery process. | LinkErrorRecoveryCounter | Error
link_downed | Total number of times the Port Training state machine has failed the link error recovery process and downed the link. | LinkDownedCounter | Error
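
A quick health check is to read every file in the counters directory and report the counters from the Error group that are non-zero. The following Python sketch does that. The `ERROR_COUNTERS` tuple is copied from the rows marked Error in the table above, using the file names from the directory listing; `read_counters` and `nonzero_errors` are hypothetical helper names.

```python
from pathlib import Path

# Counters marked "Error" in the description table above.
ERROR_COUNTERS = (
    "port_rcv_switch_relay_errors",
    "port_rcv_constraint_errors",
    "local_link_integrity_errors",
    "port_xmit_discards",
    "port_xmit_constraint_errors",
    "port_rcv_remote_physical_errors",
    "symbol_error",
    "VL15_dropped",
    "link_error_recovery",
    "link_downed",
)


def read_counters(counter_dir):
    """Read every readable integer file in a sysfs counter directory."""
    values = {}
    for entry in sorted(Path(counter_dir).iterdir()):
        if not entry.is_file():
            continue
        try:
            values[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Skip files that cannot be read or do not hold a plain integer.
            continue
    return values


def nonzero_errors(values):
    """Return only the Error-group counters whose value is above zero."""
    return {name: values[name] for name in ERROR_COUNTERS if values.get(name, 0) > 0}
```

On a host with the NIC, `nonzero_errors(read_counters("/sys/class/infiniband/mlx5_0/ports/1/counters"))` returns an empty dict when no error counter has fired.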

hw_counters

# ls -lsh /sys/class/infiniband/mlx5_0/ports/1/hw_counters/
total 0
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 duplicate_request
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 implied_nak_seq_err
0 -rw-r--r-- 1 root root 4.0K 5月  28 16:42 lifespan
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 local_ack_timeout_err
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 np_cnp_sent
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 np_ecn_marked_roce_packets
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 out_of_buffer
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 out_of_sequence
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 packet_seq_err
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 req_cqe_error
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 req_cqe_flush_error
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 req_remote_access_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 req_remote_invalid_request
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 resp_cqe_error
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 resp_cqe_flush_error
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 resp_local_length_error
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 resp_remote_access_errors
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rnr_nak_retry_err
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rp_cnp_handled
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rp_cnp_ignored
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rx_atomic_requests
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rx_icrc_encapsulated
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rx_read_requests
0 -r--r--r-- 1 root root 4.0K 5月  24 15:28 rx_write_requests

HW Counters Description:

Counter | Description | Group
duplicate_request | Number of duplicate request packets received. A duplicate request is a request that had already been executed. | Error
implied_nak_seq_err | Number of times the requester detected an ACK with a PSN larger than the expected PSN for an RDMA read or response. | Error
lifespan | The maximum period in ms which defines the aging of the counter reads. Two consecutive reads within this period might return the same values. | Informative
local_ack_timeout_err | The number of times a QP's ACK timer expired for RC, XRC, and DCT QPs at the sender side. The QP retry limit was not exceeded, so the error is still recoverable. | Error
np_cnp_sent | The number of CNP packets sent by the Notification Point when it noticed congestion experienced in the RoCEv2 IP header (ECN bits). This counter was added in MLNX_OFED 4.1. | Informative
np_ecn_marked_roce_packets | The number of RoCEv2 packets received by the Notification Point which were marked as having experienced congestion (ECN bits were '11' on the ingress RoCE traffic). This counter was added in MLNX_OFED 4.1. | Informative
out_of_buffer | The number of drops that occurred due to a lack of WQEs for the associated QPs. | Error
out_of_sequence | The number of out-of-sequence packets received. | Error
packet_seq_err | The number of received NAK sequence error packets. The QP retry limit was not exceeded. | Error
req_cqe_error | The number of times the requester detected CQEs completed with errors. This counter was added in MLNX_OFED 4.1. | Error
req_cqe_flush_error | The number of times the requester detected CQEs completed with flushed errors. This counter was added in MLNX_OFED 4.1. | Error
req_remote_access_errors | The number of times the requester detected remote access errors. This counter was added in MLNX_OFED 4.1. | Error
req_remote_invalid_request | The number of times the requester detected remote invalid request errors. This counter was added in MLNX_OFED 4.1. | Error
resp_cqe_error | The number of times the responder detected CQEs completed with errors. This counter was added in MLNX_OFED 4.1. | Error
resp_cqe_flush_error | The number of times the responder detected CQEs completed with flushed errors. This counter was added in MLNX_OFED 4.1. | Error
resp_local_length_error | The number of times the responder detected local length errors. This counter was added in MLNX_OFED 4.1. | Error
resp_remote_access_errors | The number of times the responder detected remote access errors. This counter was added in MLNX_OFED 4.1. | Error
rnr_nak_retry_err | The number of received RNR NAK packets. The QP retry limit was not exceeded. | Error
rp_cnp_handled | The number of CNP packets handled by the Reaction Point HCA to throttle the transmission rate. This counter was added in MLNX_OFED 4.1. | Informative
rp_cnp_ignored | The number of CNP packets received and ignored by the Reaction Point HCA. This counter should not increase if RoCE Congestion Control is enabled in the network. If it does increase, verify that ECN is enabled on the adapter. See HowTo Configure DCQCN (RoCE CC) values for ConnectX-4 (Linux). This counter was added in MLNX_OFED 4.1. | Error
rx_atomic_requests | The number of received ATOMIC requests for the associated QPs. | Informative
rx_dct_connect | The number of received connection requests for the associated DCTs. | Informative
rx_read_requests | The number of received READ requests for the associated QPs. | Informative
rx_write_requests | The number of received WRITE requests for the associated QPs. | Informative
rx_icrc_encapsulated | The number of RoCE packets with ICRC errors. This counter was added in MLNX_OFED 4.4 and kernel 4.19. | Error
roce_adp_retrans | Counts the number of adaptive retransmissions for RoCE traffic. This counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative
roce_adp_retrans_to | Counts the number of times RoCE traffic reached timeout due to adaptive retransmission. This counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative
roce_slow_restart | Counts the number of times RoCE slow restart was used. This counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative
roce_slow_restart_cnps | Counts the number of times RoCE slow restart generated CNP packets. This counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative
roce_slow_restart_trans | Counts the number of times RoCE slow restart changed state to slow restart. This counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative
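
The counters most relevant to packet loss:

  • duplicate_request: (Duplicated packets) number of duplicate request packets received; a duplicate request is one that had already been executed.
  • out_of_sequence: (Drop out of sequence) number of out-of-sequence packets received, which indicates that packet loss has occurred.
  • packet_seq_err: (NAK sequence rcvd) number of NAK sequence error packets received; the QP retry limit was not exceeded.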
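
Because these counters are cumulative, what matters during a test is how much they grow between two reads. The following Python sketch compares two snapshots of hw_counters and reports growth in the three counters listed above. The helper names and the `WATCHED` tuple are hypothetical; `lifespan` is skipped because, per the table, it is a setting for counter aging rather than an event count.

```python
from pathlib import Path

# The three loss-related counters discussed above.
WATCHED = ("duplicate_request", "out_of_sequence", "packet_seq_err")


def read_hw_counters(hw_dir):
    """Snapshot the integer files in an hw_counters directory, except lifespan."""
    snapshot = {}
    for entry in sorted(Path(hw_dir).iterdir()):
        if entry.name == "lifespan" or not entry.is_file():
            continue
        try:
            snapshot[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Skip files that cannot be read or do not hold a plain integer.
            continue
    return snapshot


def counter_deltas(before, after):
    """Return {name: change} for counters present in both snapshots that changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


def loss_indicators(before, after):
    """Return only the watched counters that grew between the two snapshots."""
    deltas = counter_deltas(before, after)
    return {name: deltas[name] for name in WATCHED if deltas.get(name, 0) > 0}
```

A typical use is to call `read_hw_counters("/sys/class/infiniband/mlx5_0/ports/1/hw_counters")` before and after a test run and pass both snapshots to `loss_indicators`. Two reads closer together than the lifespan period may return the same values.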