Requirements for RDMA on KVM

Requirements for VF passthrough on KVM

The CPU must support Intel VT-d or AMD-Vi (IOMMU).

On an Intel host, the dmesg output must contain the following two lines:

  • DMAR: Intel(R) Virtualization Technology for Directed I/O
  • DMAR: IOMMU enabled

Check whether the CPU supports VT-d or AMD-Vi:

    # dmesg | grep -e "DMAR" -e "IOMMU" | grep -e "Virtualization" -e enabled
    [    0.000000] DMAR: IOMMU enabled
    [    0.001068] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
    [    1.150702] DMAR: Intel(R) Virtualization Technology for Directed I/O
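The two-line requirement above can be checked mechanically. The following is a minimal sketch, not a definitive test: the function name `check_iommu_dmesg` is hypothetical, it reads kernel log text from standard input, and it only looks for the two Intel DMAR lines listed above (an AMD host logs different messages and is not covered).

```shell
#!/usr/bin/env bash
# check_iommu_dmesg (hypothetical helper): reads kernel log text on stdin and
# succeeds only if both DMAR lines listed above are present.
check_iommu_dmesg() {
    local log
    log=$(cat)
    # -F matches fixed strings, so "(R)" and "/" need no escaping.
    grep -qF 'DMAR: Intel(R) Virtualization Technology for Directed I/O' <<<"$log" &&
        grep -qF 'DMAR: IOMMU enabled' <<<"$log"
}
```

A possible invocation is `dmesg | check_iommu_dmesg && echo "IOMMU is enabled"`.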

The kernel must support the vfio, vfio_iommu_type1, and vfio_pci modules.

Check that the kernel has loaded the IOMMU-related modules:

    [root@stgExt1 qemu]# lsmod | grep -e vfio -e iommu
    vfio_pci               61440  0
    vfio_virqfd            16384  1 vfio_pci
    vfio_iommu_type1       36864  0
    vfio                   36864  2 vfio_iommu_type1,vfio_pci
    irqbypass              16384  422 vfio_pci,kvm
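The module check can also be scripted. This is a minimal sketch under stated assumptions: the function name `missing_vfio_modules` is hypothetical, and it only inspects `lsmod` output fed on standard input, so a module that is built into the kernel rather than loaded would be reported as missing.

```shell
#!/usr/bin/env bash
# missing_vfio_modules (hypothetical helper): reads `lsmod` output on stdin and
# prints each required module that is not loaded, one per line.
missing_vfio_modules() {
    local loaded mod
    # The first column of lsmod output is the module name.
    loaded=$(awk '{ print $1 }')
    for mod in vfio vfio_iommu_type1 vfio_pci; do
        # -x requires the whole line to match, so "vfio" does not match "vfio_pci".
        grep -qx "$mod" <<<"$loaded" || echo "$mod"
    done
}
```

Running `lsmod | missing_vfio_modules` prints nothing when all three modules are loaded; any name it prints can then be loaded with `modprobe`.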

QEMU must be version 2.0 or later.

CentOS 8.4 ships QEMU 4.2.0. The BVT environment has been upgraded to 8.0.2, and QEMU has to be rebuilt from source.

Configure command:

    ./configure --prefix=/usr/local/qemu_rdma/ \
        --enable-debug --enable-kvm --enable-vnc \
        --target-list=x86_64-softmmu \
        --enable-spice --enable-spice-protocol \
        --enable-usb-redir --enable-rdma

Steps to replace QEMU

Example:

    ln -sf /usr/local/qemu_rdma/bin/qemu-system-x86_64 /usr/libexec/qemu-kvm
    setenforce 0

libvirt must be version 1.2.9 or later.

CentOS 8.4 ships libvirt 6.0.0.
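The minimum versions stated above (QEMU 2.0, libvirt 1.2.9) can be compared against installed versions with a small helper. This is a sketch only: `version_ge` is a hypothetical name, and it assumes GNU coreutils, whose `sort -V` performs a version-aware sort.

```shell
#!/usr/bin/env bash
# version_ge A B (hypothetical helper): succeeds if version A >= version B.
# Sorts the two versions with `sort -V`; if B comes first (or they are equal),
# then A is at least B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}
```

With the versions quoted in this section, `version_ge 8.0.2 2.0` and `version_ge 6.0.0 1.2.9` both succeed, while `version_ge 1.2.8 1.2.9` fails.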

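SR-IOV support on KVM

We refer to the virtual NICs created by SR-IOV as VFs. The commands below show that each physical port of the NIC, ens4f0 and ens4f1 (the PFs), supports at most 8 VFs: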

    # cat /sys/class/net/ens4f0/device/sriov_totalvfs
    8
    # cat /sys/class/net/ens4f1/device/sriov_totalvfs
    8

Create 6 VFs on the single port ens4f0:

    # echo 6 > /sys/class/net/ens4f0/device/sriov_numvfs
    # lspci | grep Mellanox
    b1:00.0 Ethernet controller: Mellanox Technologies MT2894 Family [ConnectX-6 Lx]
    b1:00.1 Ethernet controller: Mellanox Technologies MT2894 Family [ConnectX-6 Lx]
    b1:00.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
    b1:00.3 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
    b1:00.4 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
    b1:00.5 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
    b1:00.6 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
    b1:00.7 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
    # ip link | grep ens4
    261: ens4f0v0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    262: ens4f0v1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    263: ens4f0v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    264: ens4f0v3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    265: ens4f0v4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    266: ens4f0v5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    18: ens4f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    19: ens4f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    # ip link show ens4f0v0
    261: ens4f0v0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
        link/ether 56:ba:79:b5:fb:3a brd ff:ff:ff:ff:ff:ff
    [root@stgExt1 qemu]# ip link show ens4f0v1
    262: ens4f0v1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
        link/ether 42:f9:c8:62:be:fd brd ff:ff:ff:ff:ff:ff
    [root@stgExt1 qemu]# ip link show ens4f0v2
    263: ens4f0v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
        link/ether 2e:2b:21:22:a7:da brd ff:ff:ff:ff:ff:ff
    [root@stgExt1 qemu]# ip link show ens4f0v3
    264: ens4f0v3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
        link/ether 22:cd:f8:8e:8b:39 brd ff:ff:ff:ff:ff:ff
    [root@stgExt1 qemu]# ip link show ens4f0v4
    265: ens4f0v4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
        link/ether b6:b1:22:d5:28:46 brd ff:ff:ff:ff:ff:ff
    [root@stgExt1 qemu]# ip link show ens4f0v5
    266: ens4f0v5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
        link/ether be:64:4f:36:e0:f7 brd ff:ff:ff:ff:ff:ff
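The two sysfs steps above (read `sriov_totalvfs`, then write `sriov_numvfs`) can be combined into one guarded helper. This is a minimal sketch: `set_numvfs` is a hypothetical name, and it takes the PF's sysfs device directory as an argument so that it can be tried against a scratch directory first. It writes 0 before the new value because the kernel refuses to change one non-zero VF count directly to another; note that this tears down any existing VFs on that PF.

```shell
#!/usr/bin/env bash
# set_numvfs DEVDIR COUNT (hypothetical helper): DEVDIR is a PF's sysfs device
# directory, for example /sys/class/net/ens4f0/device.
set_numvfs() {
    local dev=$1 want=$2 total
    total=$(cat "$dev/sriov_totalvfs") || return 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, but only $total are supported" >&2
        return 1
    fi
    # Reset to 0 first, then write the requested count.
    echo 0 > "$dev/sriov_numvfs"
    echo "$want" > "$dev/sriov_numvfs"
}
```

As root, `set_numvfs /sys/class/net/ens4f0/device 6` would correspond to the `echo 6` command shown above.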

lspci output with vendor and device IDs:

    # lspci -nn | grep Mellanox
    b1:00.0 Ethernet controller [0200]: Mellanox Technologies MT2894 Family [ConnectX-6 Lx] [15b3:101f]
    b1:00.1 Ethernet controller [0200]: Mellanox Technologies MT2894 Family [ConnectX-6 Lx] [15b3:101f]
    b1:00.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
    b1:00.3 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
    b1:00.4 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
    b1:00.5 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
    b1:00.6 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
    b1:00.7 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
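In the listing above the PFs carry the ID 15b3:101f and the VFs carry 15b3:101e, so the VF PCI addresses needed for passthrough can be picked out by ID. The following sketch assumes that format; `list_pci_by_id` is a hypothetical name and it reads `lspci -nn` output from standard input.

```shell
#!/usr/bin/env bash
# list_pci_by_id ID (hypothetical helper): reads `lspci -nn` output on stdin and
# prints the PCI address (first field) of every device tagged [ID].
list_pci_by_id() {
    # index() does a literal substring search, so the brackets need no escaping.
    awk -v id="[$1]" 'index($0, id) { print $1 }'
}
```

For example, `lspci -nn | list_pci_by_id 15b3:101e` would print b1:00.2 through b1:00.7 for the host shown above.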

To make the setting persistent, the following is also required.

Create the file /etc/modprobe.d/mlx5.conf with the following content:

    # cat /etc/modprobe.d/mlx5.conf
    options mlx5_core num_vfs=2

Create a udev rule for the VF interfaces, /etc/udev/rules.d/ens4f0.rules, so that the VFs are recreated persistently:

    # cat /etc/udev/rules.d/ens4f0.rules
    ACTION=="add", SUBSYSTEM=="net", DRIVERS=="mlx5_core", ATTR{device/sriov_numvfs}="8"

Reload the mlx5_core kernel module for the configuration to take effect:

    $ modprobe -r mlx5_core && modprobe mlx5_core

Once the configuration is in effect, the VFs are visible, for example:

    $ ip link show

Check the RDMA link state:

    $ rdma link show
    0/1: mlx5_0/1: state ACTIVE physical_state LINK_UP netdev ens1f0np0
    1/1: mlx5_1/1: state ACTIVE physical_state LINK_UP netdev ens1f1np1

A port whose Link layer is reported as Ethernet is running the RoCE protocol:

    # ibstat
    CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.30.1004
        Hardware version: 0
        Node GUID: 0xb83fd20300d3e4c6
        System image GUID: 0xb83fd20300d3e4c6
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 100
            Base lid: 0
            LMC: 0
            SM lid: 0
            Capability mask: 0x00010000
            Port GUID: 0xba3fd2fffed3e4c6
            Link layer: Ethernet
    CA 'mlx5_1'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.30.1004
        Hardware version: 0
        Node GUID: 0xb83fd20300d3e4c7
        System image GUID: 0xb83fd20300d3e4c6
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 100
            Base lid: 0
            LMC: 0
            SM lid: 0
            Capability mask: 0x00010000
            Port GUID: 0xba3fd2fffed3e4c7
            Link layer: Ethernet
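To read the link layer of every adapter at a glance, the `ibstat` output can be reduced to one line per port. This is a sketch that assumes the layout shown above, where each adapter starts with a `CA '<name>'` line; `ca_link_layers` is a hypothetical name and it reads the output from standard input.

```shell
#!/usr/bin/env bash
# ca_link_layers (hypothetical helper): reads `ibstat` output on stdin and
# prints "<CA name> <link layer>" for every port.
ca_link_layers() {
    # Split fields on the single quote so the CA name is the second field.
    awk -F"'" '
        $1 == "CA " { ca = $2 }
        /Link layer:/ {
            sub(/.*Link layer:[ \t]*/, "")
            print ca, $0
        }
    '
}
```

For the host above, `ibstat | ca_link_layers` would report Ethernet for both mlx5_0 and mlx5_1.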

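In the output of ibv_devinfo -v, each network interface may carry multiple GIDs (Global Identifiers). A GID is a globally unique identifier for a node or port in an InfiniBand network, and each GID entry specifies a protocol version, such as RoCE v1 or RoCE v2.

In the output of ibv_devinfo -v:

  • If you see link_layer: Ethernet, the port runs over Ethernet (the transport field still reads InfiniBand, because RoCE carries the InfiniBand transport over Ethernet);
  • If the GID entries are also tagged RoCE v1 and RoCE v2, the RoCE protocol is in use.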

To list only the GID entries, filter with ibv_devinfo -v | grep GID. The full output is:

    # ibv_devinfo -v
    hca_id: mlx5_0
        transport: InfiniBand (0)
        fw_ver: 20.30.1004
        node_guid: b83f:d203:00d3:e4c6
        sys_image_guid: b83f:d203:00d3:e4c6
        vendor_id: 0x02c9
        vendor_part_id: 4123
        hw_ver: 0x0
        board_id: LNV0000000017
        phys_port_cnt: 1
        max_mr_size: 0xffffffffffffffff
        page_size_cap: 0xfffffffffffff000
        max_qp: 262144
        max_qp_wr: 32768
        device_cap_flags: 0x25321c36
            BAD_PKEY_CNTR
            BAD_QKEY_CNTR
            AUTO_PATH_MIG
            CHANGE_PHY_PORT
            PORT_ACTIVE_EVENT
            SYS_IMAGE_GUID
            RC_RNR_NAK_GEN
            MEM_WINDOW
            XRC
            MEM_MGT_EXTENSIONS
            MEM_WINDOW_TYPE_2B
            RAW_IP_CSUM
            MANAGED_FLOW_STEERING
        max_sge: 30
        max_sge_rd: 30
        max_cq: 16777216
        max_cqe: 4194303
        max_mr: 16777216
        max_pd: 8388608
        max_qp_rd_atom: 16
        max_ee_rd_atom: 0
        max_res_rd_atom: 4194304
        max_qp_init_rd_atom: 16
        max_ee_init_rd_atom: 0
        atomic_cap: ATOMIC_HCA (1)
        max_ee: 0
        max_rdd: 0
        max_mw: 16777216
        max_raw_ipv6_qp: 0
        max_raw_ethy_qp: 0
        max_mcast_grp: 2097152
        max_mcast_qp_attach: 240
        max_total_mcast_qp_attach: 503316480
        max_ah: 2147483647
        max_fmr: 0
        max_srq: 8388608
        max_srq_wr: 32767
        max_srq_sge: 31
        max_pkeys: 128
        local_ca_ack_delay: 16
        general_odp_caps:
            ODP_SUPPORT
            ODP_SUPPORT_IMPLICIT
        rc_odp_caps:
            SUPPORT_SEND
            SUPPORT_RECV
            SUPPORT_WRITE
            SUPPORT_READ
            SUPPORT_SRQ
        uc_odp_caps:
            NO SUPPORT
        ud_odp_caps:
            SUPPORT_SEND
        xrc_odp_caps:
            SUPPORT_SEND
            SUPPORT_WRITE
            SUPPORT_READ
            SUPPORT_SRQ
        completion timestamp_mask: 0x7fffffffffffffff
        hca_core_clock: 156250kHZ
        raw packet caps:
            C-VLAN stripping offload
            Scatter FCS offload
            IP csum offload
            Delay drop
        device_cap_flags_ex: 0x3000005425321C36
            RAW_SCATTER_FCS
            PCI_WRITE_END_PADDING
            Unknown flags: 0x3000004000000000
        tso_caps:
            max_tso: 262144
            supported_qp:
                SUPPORT_RAW_PACKET
        rss_caps:
            max_rwq_indirection_tables: 1048576
            max_rwq_indirection_table_size: 2048
            rx_hash_function: 0x1
            rx_hash_fields_mask: 0x800000FF
            supported_qp:
                SUPPORT_RAW_PACKET
        max_wq_type_rq: 8388608
        packet_pacing_caps:
            qp_rate_limit_min: 1kbps
            qp_rate_limit_max: 100000000kbps
            supported_qp:
                SUPPORT_RAW_PACKET
        tag matching not supported
        cq moderation caps:
            max_cq_count: 65535
            max_cq_period: 4095 us
        maximum available device memory: 131072Bytes
        num_comp_vectors: 63
            port: 1
                state: PORT_ACTIVE (4)
                max_mtu: 4096 (5)
                active_mtu: 1024 (3)
                sm_lid: 0
                port_lid: 0
                port_lmc: 0x00
                link_layer: Ethernet
                max_msg_sz: 0x40000000
                port_cap_flags: 0x04010000
                port_cap_flags2: 0x0000
                max_vl_num: invalid value (0)
                bad_pkey_cntr: 0x0
                qkey_viol_cntr: 0x0
                sm_sl: 0
                pkey_tbl_len: 1
                gid_tbl_len: 255
                subnet_timeout: 0
                init_type_reply: 0
                active_width: 4X (2)
                active_speed: 25.0 Gbps (32)
                phys_state: LINK_UP (5)
                GID[  0]: fe80:0000:0000:0000:ba3f:d2ff:fed3:e4c6, RoCE v1
                GID[  1]: fe80::ba3f:d2ff:fed3:e4c6, RoCE v2

    hca_id: mlx5_1
        transport: InfiniBand (0)
        fw_ver: 20.30.1004
        node_guid: b83f:d203:00d3:e4c7
        sys_image_guid: b83f:d203:00d3:e4c6
        vendor_id: 0x02c9
        vendor_part_id: 4123
        hw_ver: 0x0
        board_id: LNV0000000017
        phys_port_cnt: 1
        max_mr_size: 0xffffffffffffffff
        page_size_cap: 0xfffffffffffff000
        max_qp: 262144
        max_qp_wr: 32768
        device_cap_flags: 0x25321c36
            BAD_PKEY_CNTR
            BAD_QKEY_CNTR
            AUTO_PATH_MIG
            CHANGE_PHY_PORT
            PORT_ACTIVE_EVENT
            SYS_IMAGE_GUID
            RC_RNR_NAK_GEN
            MEM_WINDOW
            XRC
            MEM_MGT_EXTENSIONS
            MEM_WINDOW_TYPE_2B
            RAW_IP_CSUM
            MANAGED_FLOW_STEERING
        max_sge: 30
        max_sge_rd: 30
        max_cq: 16777216
        max_cqe: 4194303
        max_mr: 16777216
        max_pd: 8388608
        max_qp_rd_atom: 16
        max_ee_rd_atom: 0
        max_res_rd_atom: 4194304
        max_qp_init_rd_atom: 16
        max_ee_init_rd_atom: 0
        atomic_cap: ATOMIC_HCA (1)
        max_ee: 0
        max_rdd: 0
        max_mw: 16777216
        max_raw_ipv6_qp: 0
        max_raw_ethy_qp: 0
        max_mcast_grp: 2097152
        max_mcast_qp_attach: 240
        max_total_mcast_qp_attach: 503316480
        max_ah: 2147483647
        max_fmr: 0
        max_srq: 8388608
        max_srq_wr: 32767
        max_srq_sge: 31
        max_pkeys: 128
        local_ca_ack_delay: 16
        general_odp_caps:
            ODP_SUPPORT
            ODP_SUPPORT_IMPLICIT
        rc_odp_caps:
            SUPPORT_SEND
            SUPPORT_RECV
            SUPPORT_WRITE
            SUPPORT_READ
            SUPPORT_SRQ
        uc_odp_caps:
            NO SUPPORT
        ud_odp_caps:
            SUPPORT_SEND
        xrc_odp_caps:
            SUPPORT_SEND
            SUPPORT_WRITE
            SUPPORT_READ
            SUPPORT_SRQ
        completion timestamp_mask: 0x7fffffffffffffff
        hca_core_clock: 156250kHZ
        raw packet caps:
            C-VLAN stripping offload
            Scatter FCS offload
            IP csum offload
            Delay drop
        device_cap_flags_ex: 0x3000005425321C36
            RAW_SCATTER_FCS
            PCI_WRITE_END_PADDING
            Unknown flags: 0x3000004000000000
        tso_caps:
            max_tso: 262144
            supported_qp:
                SUPPORT_RAW_PACKET
        rss_caps:
            max_rwq_indirection_tables: 1048576
            max_rwq_indirection_table_size: 2048
            rx_hash_function: 0x1
            rx_hash_fields_mask: 0x800000FF
            supported_qp:
                SUPPORT_RAW_PACKET
        max_wq_type_rq: 8388608
        packet_pacing_caps:
            qp_rate_limit_min: 1kbps
            qp_rate_limit_max: 100000000kbps
            supported_qp:
                SUPPORT_RAW_PACKET
        tag matching not supported
        cq moderation caps:
            max_cq_count: 65535
            max_cq_period: 4095 us
        maximum available device memory: 131072Bytes
        num_comp_vectors: 63
            port: 1
                state: PORT_ACTIVE (4)
                max_mtu: 4096 (5)
                active_mtu: 1024 (3)
                sm_lid: 0
                port_lid: 0
                port_lmc: 0x00
                link_layer: Ethernet
                max_msg_sz: 0x40000000
                port_cap_flags: 0x04010000
                port_cap_flags2: 0x0000
                max_vl_num: invalid value (0)
                bad_pkey_cntr: 0x0
                qkey_viol_cntr: 0x0
                sm_sl: 0
                pkey_tbl_len: 1
                gid_tbl_len: 255
                subnet_timeout: 0
                init_type_reply: 0
                active_width: 4X (2)
                active_speed: 25.0 Gbps (32)
                phys_state: LINK_UP (5)
                GID[  0]: fe80:0000:0000:0000:ba3f:d2ff:fed3:e4c7, RoCE v1
                GID[  1]: fe80::ba3f:d2ff:fed3:e4c7, RoCE v2
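Since the full `ibv_devinfo -v` output is long, the RoCE-relevant part can be condensed to one line per GID. The sketch below assumes the `hca_id:` and `GID[ n]: ..., RoCE vN` line formats shown in the output above; `roce_gids` is a hypothetical name and it reads the output from standard input.

```shell
#!/usr/bin/env bash
# roce_gids (hypothetical helper): reads `ibv_devinfo -v` output on stdin and
# prints "<hca_id> <GID index> <RoCE version>" for each RoCE-tagged GID entry.
roce_gids() {
    awk '
        $1 == "hca_id:" { hca = $2 }
        /GID\[/ && /RoCE v[12]/ {
            idx = $0
            sub(/.*GID\[[ \t]*/, "", idx)   # drop everything up to the index
            sub(/\].*/, "", idx)            # drop everything after the index
            ver = $0
            sub(/.*RoCE /, "", ver)         # keep only "v1" or "v2"
            sub(/[ \t]+$/, "", ver)         # trim trailing whitespace
            print hca, idx, ver
        }
    '
}
```

Piping `ibv_devinfo -v` into `roce_gids` on the host above would list GID 0 as v1 and GID 1 as v2 for each of mlx5_0 and mlx5_1.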

Further reading:

QEMU official site: Download QEMU - QEMU
