IPoIB (IP over InfiniBand) is a technique for running the IP protocol over an InfiniBand fabric: IP packets are encapsulated inside InfiniBand datagrams, giving IP traffic access to the fabric's bandwidth and latency. This article analyzes in detail how IPoIB receives data and hands it to the network stack, and how it takes data from the network stack and transmits it.
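For reference, the encapsulation is thin. Each IPoIB packet starts with a 4-byte header carrying the ethertype, and inside the kernel a 20-byte pseudo header carries the IB hardware address until the transmit path consumes it. A sketch mirroring the definitions in drivers/infiniband/ulp/ipoib/ipoib.h:
/* 4-byte on-the-wire IPoIB encapsulation header */
struct ipoib_header {
	__be16	proto;		/* ethertype of the encapsulated payload */
	u16	reserved;
};
/* Kernel-internal pseudo header: the 20-byte IPoIB hardware address
 * (QPN + GID), pulled off before the packet is posted to the HCA. */
struct ipoib_pseudo_header {
	u8	hwaddr[INFINIBAND_ALEN];	/* INFINIBAND_ALEN == 20 */
};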
1. How IPoIB receives data and hands it to the network stack
When the InfiniBand HCA receives a packet, the packet travels through the InfiniBand hardware queues to the IPoIB driver. The driver's job is to process these packets and deliver them to the Linux kernel network stack.
1.1 Main steps of the receive path
- Hardware receive queue:
  - After the HCA receives a packet, it places it on the receive queue (RQ).
  - Each packet on the receive queue is DMA-ed (direct memory access) into host memory.
- Receive handling in the IPoIB driver:
  - The driver picks up completed receives via polling or interrupts.
  - It calls ipoib_ib_handle_rx_wc to process each received packet. The function parses the packet, checks its validity, and converts it into an sk_buff that the Linux network stack can handle.
- Packet processing:
  - With the packet wrapped in an sk_buff, the driver handles it according to its type (unicast, multicast, or broadcast).
  - If the packet is an ARP request, IPoIB creates the corresponding path entry.
  - The packet is handed to the network stack via netif_receive_skb (or napi_gro_receive) and eventually reaches the upper protocol layers.
- Reposting the receive buffer:
  - After a packet has been processed, the driver reposts a receive buffer (ipoib_ib_post_receive) so that new packets can be received; a sketch of this step follows the list.
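Reposting comes down to an ib_post_recv() on the UD queue pair, with the work-request id tagged so the CQ poller can recognize it as a receive. A minimal sketch modeled on the upstream ipoib_ib_post_receive() (field names follow the driver's private structures; treat it as illustrative rather than the exact source):
static int ipoib_post_receive_sketch(struct net_device *dev, int id)
{
	struct ipoib_dev_priv *priv = ipoib_priv(dev);
	int ret;

	/* Tag the WR so ipoib_rx_poll() can tell receives from sends. */
	priv->rx_wr.wr_id = id | IPOIB_OP_RECV;
	priv->rx_sge[0].addr = priv->rx_ring[id].mapping[0];
	/* (additional SG entries omitted in this sketch) */

	ret = ib_post_recv(priv->qp, &priv->rx_wr, NULL);
	if (unlikely(ret)) {
		/* Posting failed: unmap and drop the buffer we tried to post. */
		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[id].mapping);
		dev_kfree_skb_any(priv->rx_ring[id].skb);
		priv->rx_ring[id].skb = NULL;
	}
	return ret;
}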
1.2 Key code excerpt
static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
{
	struct ipoib_dev_priv *priv = ipoib_priv(dev);
	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
	struct sk_buff *skb;
	union ib_gid *dgid;

	/* (abridged) recover the posted buffer as an sk_buff */
	skb = priv->rx_ring[wr_id].skb;
	skb_put(skb, wc->byte_len);

	/* set skb->pkt_type according to the packet type */
	dgid = &((struct ib_grh *)skb->data)->dgid;
	if (!(wc->wc_flags & IB_WC_GRH) || dgid->raw[0] != 0xff)
		skb->pkt_type = PACKET_HOST;
	else if (memcmp(dgid, dev->broadcast + 4, sizeof(union ib_gid)) == 0)
		skb->pkt_type = PACKET_BROADCAST;
	else
		skb->pkt_type = PACKET_MULTICAST;

	/* hand the packet to the network stack */
	napi_gro_receive(&priv->recv_napi, skb);

	/* repost a receive buffer */
	ipoib_ib_post_receive(dev, wr_id);
}
2. How IPoIB takes data from the network stack and sends it
When the Linux network stack has a packet to transmit, it calls the IPoIB driver's transmit function. The driver's job is to wrap the packet as an InfiniBand datagram and send it out through the HCA.
2.1 Main steps of the transmit path
- The network stack calls the transmit function:
  - When the stack has a packet to send, it invokes the driver's ndo_start_xmit hook, which for IPoIB is ipoib_start_xmit.
  - This function processes the packet and decides how to send it based on its type (unicast or multicast).
- Packet encapsulation:
  - The driver wraps the IP packet for InfiniBand; for multicast packets it also fills in the P_Key.
  - It then calls ipoib_send or ipoib_send_rss to place the packet on the HCA's send queue (SQ).
- Transmitting the packet:
  - The packet is DMA-ed to the HCA's send queue, and the HCA transmits it to the destination node.
  - If the send queue is full, the driver stops the netdev TX queue until space frees up; the stop/wake pairing is sketched after this list.
- Send-completion handling:
  - When the HCA finishes transmitting a packet, it generates a completion (CQE) on the send completion queue.
  - The driver consumes these completions via polling or interrupts, releases the associated resources, and wakes the netdev TX queue.
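The stop/wake pairing mentioned above is the standard netdev flow-control pattern. Condensed from the transmit path and the TX-completion handler shown in the source section below:
/* Transmit path: stop this subqueue when the ring is about to fill. */
if (atomic_read(&send_ring->tx_outstanding) == priv->sendq_size - 1)
	netif_stop_subqueue(dev, queue_index);

/* TX-completion handler: wake the queue once the ring has drained
 * to half capacity and the interface is still administratively up. */
if (netif_queue_stopped(dev) &&
    atomic_read(&priv->tx_outstanding) <= priv->sendq_size >> 1 &&
    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
	netif_wake_queue(dev);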
2.2 Key code excerpts
static netdev_tx_t ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct ipoib_dev_priv *priv = ipoib_priv(dev);
	struct rdma_netdev *rn = netdev_priv(dev);
	struct ipoib_neigh *neigh;
	struct ipoib_pseudo_header *phdr = (struct ipoib_pseudo_header *)skb->data;

	/* (abridged) classify the packet and pick a transmit path */
	if (unlikely(phdr->hwaddr[4] == 0xff)) {
		/* multicast */
		ipoib_mcast_send(dev, phdr->hwaddr, skb);
	} else {
		/* unicast */
		neigh = ipoib_neigh_get(dev, phdr->hwaddr);
		if (ipoib_cm_get(neigh)) {
			/* send over a connected-mode (CM) connection */
			priv->fp.ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
		} else if (neigh->ah && neigh->ah->valid) {
			/* datagram send straight through rn->send */
			rn->send(dev, skb, neigh->ah->ah, IPOIB_QPN(phdr->hwaddr));
		}
	}
	return NETDEV_TX_OK;
}
int ipoib_send_rss(struct net_device *dev, struct sk_buff *skb,
		   struct ib_ah *address, u32 dqpn)
{
	/* (abridged) place the packet on the send ring */
	tx_req = &send_ring->tx_ring[req_index];
	tx_req->skb = skb;
	/* post the send work request */
	rc = post_send_rss(send_ring, req_index, address, dqpn, tx_req, phead, hlen);
	if (unlikely(rc)) {
		/* posting failed: drop the packet */
		dev_kfree_skb_any(skb);
	}
	return rc;
}
3. Summary
- Receive: the IPoIB driver takes packets from the HCA's receive queue, converts them into sk_buff structures, and hands them to the Linux kernel network stack.
- Transmit: the driver takes packets from the network stack, wraps them as InfiniBand datagrams, and sends them out through the HCA's send queue.
The IPoIB driver mediates between the InfiniBand hardware queues and the Linux network stack, which is what makes IP traffic over InfiniBand possible. The mechanism exploits InfiniBand's high bandwidth and low latency while remaining compatible with the existing IP stack, making it a strong foundation for HPC and data-center networks.
Related source code
/**
* struct rdma_netdev - rdma netdev
* For cases where netstack interfacing is required.
*/
struct rdma_netdev {
void *clnt_priv;
struct ib_device *hca;
u8 port_num;
/*
* cleanup function must be specified.
* FIXME: This is only used for OPA_VNIC and that usage should be
* removed too.
*/
void (*free_rdma_netdev)(struct net_device *netdev);
/* control functions */
void (*set_id)(struct net_device *netdev, int id);
/* send packet */
int (*send)(struct net_device *dev, struct sk_buff *skb,
struct ib_ah *address, u32 dqpn);
/* multicast */
int (*attach_mcast)(struct net_device *dev, struct ib_device *hca,
union ib_gid *gid, u16 mlid,
int set_qkey, u32 qkey);
int (*detach_mcast)(struct net_device *dev, struct ib_device *hca,
union ib_gid *gid, u16 mlid);
};
.ndo_start_xmit = ipoib_start_xmit,
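(The fragment above is from the driver's net_device_ops table, where the stack's TX entry point is wired up. For context, a minimal sketch of such a table; ipoib_open and ipoib_stop are the driver's real open/stop handlers, but the table shown here is abbreviated, not the full one from the source:)
static const struct net_device_ops ipoib_netdev_ops_sketch = {
	.ndo_open	= ipoib_open,		/* bring the interface up */
	.ndo_stop	= ipoib_stop,		/* take the interface down */
	.ndo_start_xmit	= ipoib_start_xmit,	/* TX entry point from the stack */
};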
static netdev_tx_t ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
struct rdma_netdev *rn = netdev_priv(dev);
struct ipoib_neigh *neigh;
struct ipoib_pseudo_header *phdr;
struct ipoib_header *header;
unsigned long flags;
phdr = (struct ipoib_pseudo_header *) skb->data;
skb_pull(skb, sizeof(*phdr));
header = (struct ipoib_header *) skb->data;
if (unlikely(phdr->hwaddr[4] == 0xff)) {
/* multicast, arrange "if" according to probability */
if ((header->proto != htons(ETH_P_IP)) &&
(header->proto != htons(ETH_P_IPV6)) &&
(header->proto != htons(ETH_P_ARP)) &&
(header->proto != htons(ETH_P_RARP)) &&
(header->proto != htons(ETH_P_TIPC))) {
/* ethertype not supported by IPoIB */
++dev->stats.tx_dropped;
dev_kfree_skb_any(skb);
return NETDEV_TX_OK;
}
/* Add in the P_Key for multicast*/
phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
phdr->hwaddr[9] = priv->pkey & 0xff;
neigh = ipoib_neigh_get(dev, phdr->hwaddr);
if (likely(neigh))
goto send_using_neigh;
ipoib_mcast_send(dev, phdr->hwaddr, skb);
return NETDEV_TX_OK;
}
/* unicast, arrange "switch" according to probability */
switch (header->proto) {
case htons(ETH_P_IP):
case htons(ETH_P_IPV6):
case htons(ETH_P_TIPC):
neigh = ipoib_neigh_get(dev, phdr->hwaddr);
if (unlikely(!neigh)) {
neigh = neigh_add_path(skb, phdr->hwaddr, dev);
if (likely(!neigh))
return NETDEV_TX_OK;
}
break;
case htons(ETH_P_ARP):
case htons(ETH_P_RARP):
/* for unicast ARP and RARP should always perform path find */
unicast_arp_send(skb, dev, phdr);
return NETDEV_TX_OK;
default:
/* ethertype not supported by IPoIB */
++dev->stats.tx_dropped;
dev_kfree_skb_any(skb);
return NETDEV_TX_OK;
}
send_using_neigh:
/* note we now hold a ref to neigh */
if (ipoib_cm_get(neigh)) {
if (ipoib_cm_up(neigh)) {
priv->fp.ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
goto unref;
}
} else if (neigh->ah && neigh->ah->valid) {
neigh->ah->last_send = rn->send(dev, skb, neigh->ah->ah,
IPOIB_QPN(phdr->hwaddr));
goto unref;
} else if (neigh->ah) {
neigh_refresh_path(neigh, phdr->hwaddr, dev);
}
if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
spin_lock_irqsave(&priv->lock, flags);
/*
* to avoid race with path_rec_completion check if it already
* done, if yes re-send the packet, otherwise push the skb into
* the queue.
* it is safe to check it here while priv->lock around.
*/
if (neigh->ah && neigh->ah->valid)
if (!ipoib_cm_get(neigh) ||
(ipoib_cm_get(neigh) && ipoib_cm_up(neigh))) {
spin_unlock_irqrestore(&priv->lock, flags);
goto send_using_neigh;
}
push_pseudo_header(skb, phdr->hwaddr);
__skb_queue_tail(&neigh->queue, skb);
spin_unlock_irqrestore(&priv->lock, flags);
} else {
++dev->stats.tx_dropped;
dev_kfree_skb_any(skb);
}
unref:
ipoib_neigh_put(neigh);
return NETDEV_TX_OK;
}
int ipoib_send_rss(struct net_device *dev, struct sk_buff *skb,
struct ib_ah *address, u32 dqpn)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
struct ipoib_tx_buf *tx_req;
struct ipoib_send_ring *send_ring;
u16 queue_index;
int hlen, rc;
void *phead;
int req_index;
unsigned usable_sge = priv->max_send_sge - !!skb_headlen(skb);
/* Find the correct QP to submit the IO to */
queue_index = skb_get_queue_mapping(skb);
send_ring = priv->send_ring + queue_index;
if (skb_is_gso(skb)) {
hlen = skb_transport_offset(skb) + tcp_hdrlen(skb);
phead = skb->data;
if (unlikely(!skb_pull(skb, hlen))) {
ipoib_warn(priv, "linear data too small\n");
++send_ring->stats.tx_dropped;
++send_ring->stats.tx_errors;
dev_kfree_skb_any(skb);
return -1;
}
} else {
if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN);
++send_ring->stats.tx_dropped;
++send_ring->stats.tx_errors;
ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
return -1;
}
phead = NULL;
hlen = 0;
}
if (skb_shinfo(skb)->nr_frags > usable_sge) {
if (skb_linearize(skb) < 0) {
ipoib_warn(priv, "skb could not be linearized\n");
++send_ring->stats.tx_dropped;
++send_ring->stats.tx_errors;
dev_kfree_skb_any(skb);
return -1;
}
/* Does skb_linearize return ok without reducing nr_frags? */
if (skb_shinfo(skb)->nr_frags > usable_sge) {
ipoib_warn(priv, "too many frags after skb linearize\n");
++send_ring->stats.tx_dropped;
++send_ring->stats.tx_errors;
dev_kfree_skb_any(skb);
return -1;
}
}
ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n",
skb->len, address, dqpn);
/*
* We put the skb into the tx_ring _before_ we call post_send_rss()
* because it's entirely possible that the completion handler will
* run before we execute anything after the post_send_rss(). That
* means we have to make sure everything is properly recorded and
* our state is consistent before we call post_send_rss().
*/
req_index = send_ring->tx_head & (priv->sendq_size - 1);
tx_req = &send_ring->tx_ring[req_index];
tx_req->skb = skb;
if (skb->len < ipoib_inline_thold &&
!skb_shinfo(skb)->nr_frags) {
tx_req->is_inline = 1;
send_ring->tx_wr.wr.send_flags |= IB_SEND_INLINE;
} else {
if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) {
++send_ring->stats.tx_errors;
dev_kfree_skb_any(skb);
return -1;
}
tx_req->is_inline = 0;
send_ring->tx_wr.wr.send_flags &= ~IB_SEND_INLINE;
}
if (skb->ip_summed == CHECKSUM_PARTIAL)
send_ring->tx_wr.wr.send_flags |= IB_SEND_IP_CSUM;
else
send_ring->tx_wr.wr.send_flags &= ~IB_SEND_IP_CSUM;
/* increase the tx_head after send success, but use it for queue state */
if (atomic_read(&send_ring->tx_outstanding) == priv->sendq_size - 1) {
ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
netif_stop_subqueue(dev, queue_index);
}
skb_orphan(skb);
skb_dst_drop(skb);
if (__netif_subqueue_stopped(dev, queue_index))
if (ib_req_notify_cq(send_ring->send_cq, IB_CQ_NEXT_COMP |
IB_CQ_REPORT_MISSED_EVENTS))
ipoib_warn(priv, "request notify on send CQ failed\n");
rc = post_send_rss(send_ring, req_index,
address, dqpn, tx_req, phead, hlen);
if (unlikely(rc)) {
ipoib_warn(priv, "post_send_rss failed, error %d\n", rc);
++send_ring->stats.tx_errors;
if (!tx_req->is_inline)
ipoib_dma_unmap_tx(priv, tx_req);
dev_kfree_skb_any(skb);
if (__netif_subqueue_stopped(dev, queue_index))
netif_wake_subqueue(dev, queue_index);
rc = 0;
} else {
netdev_get_tx_queue(dev, queue_index)->trans_start = jiffies;
rc = send_ring->tx_head;
++send_ring->tx_head;
atomic_inc(&send_ring->tx_outstanding);
}
return rc;
}
static void ipoib_napi_add(struct net_device *dev)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
netif_napi_add(dev, &priv->recv_napi, ipoib_rx_poll, IPOIB_NUM_WC);
netif_napi_add(dev, &priv->send_napi, ipoib_tx_poll, MAX_SEND_CQE);
}
int ipoib_rx_poll(struct napi_struct *napi, int budget)
{
struct ipoib_dev_priv *priv =
container_of(napi, struct ipoib_dev_priv, recv_napi);
struct net_device *dev = priv->dev;
int done;
int t;
int n, i;
done = 0;
poll_more:
while (done < budget) {
int max = (budget - done);
t = min(IPOIB_NUM_WC, max);
n = ib_poll_cq(priv->recv_cq, t, priv->ibwc);
for (i = 0; i < n; i++) {
struct ib_wc *wc = priv->ibwc + i;
if (wc->wr_id & IPOIB_OP_RECV) {
++done;
if (wc->wr_id & IPOIB_OP_CM)
ipoib_cm_handle_rx_wc(dev, wc);
else
ipoib_ib_handle_rx_wc(dev, wc);
} else {
pr_warn("%s: Got unexpected wqe id\n", __func__);
}
}
if (n != t)
break;
}
if (done < budget) {
#ifdef CONFIG_COMPAT_LRO_ENABLED_IPOIB
if (dev->features & NETIF_F_LRO)
lro_flush_all(&priv->lro.lro_mgr);
#endif
napi_complete(napi);
if (unlikely(ib_req_notify_cq(priv->recv_cq,
IB_CQ_NEXT_COMP |
IB_CQ_REPORT_MISSED_EVENTS)) &&
napi_reschedule(napi))
goto poll_more;
}
return done;
}
int ipoib_tx_poll(struct napi_struct *napi, int budget)
{
struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv,
send_napi);
struct net_device *dev = priv->dev;
int n, i;
struct ib_wc *wc;
poll_more:
n = ib_poll_cq(priv->send_cq, MAX_SEND_CQE, priv->send_wc);
for (i = 0; i < n; i++) {
wc = priv->send_wc + i;
if (wc->wr_id & IPOIB_OP_CM)
ipoib_cm_handle_tx_wc(dev, wc);
else
ipoib_ib_handle_tx_wc(dev, wc);
}
if (n < budget) {
napi_complete(napi);
if (unlikely(ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP |
IB_CQ_REPORT_MISSED_EVENTS)) &&
napi_reschedule(napi))
goto poll_more;
}
return n < 0 ? 0 : n;
}
static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
unsigned int wr_id = wc->wr_id;
struct ipoib_tx_buf *tx_req;
ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
wr_id, wc->status);
if (unlikely(wr_id >= priv->sendq_size)) {
ipoib_warn(priv, "send completion event with wrid %d (> %d)\n",
wr_id, priv->sendq_size);
return;
}
tx_req = &priv->tx_ring[wr_id];
if (!tx_req->is_inline)
ipoib_dma_unmap_tx(priv, tx_req);
++dev->stats.tx_packets;
dev->stats.tx_bytes += tx_req->skb->len;
dev_kfree_skb_any(tx_req->skb);
tx_req->skb = NULL;
++priv->tx_tail;
atomic_dec(&priv->tx_outstanding);
if (unlikely(netif_queue_stopped(dev) &&
(atomic_read(&priv->tx_outstanding) <= priv->sendq_size >> 1) &&
test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)))
netif_wake_queue(dev);
if (wc->status != IB_WC_SUCCESS &&
wc->status != IB_WC_WR_FLUSH_ERR) {
struct ipoib_qp_state_validate *qp_work;
ipoib_warn(priv,
"failed send event (status=%d, wrid=%d vend_err %#x)\n",
wc->status, wr_id, wc->vendor_err);
qp_work = kzalloc(sizeof(*qp_work), GFP_ATOMIC);
if (!qp_work)
return;
INIT_WORK(&qp_work->work, ipoib_qp_state_validate_work);
qp_work->priv = priv;
queue_work(priv->wq, &qp_work->work);
}
}
static int ipoib_add_one(struct ib_device *device)
{
struct list_head *dev_list;
struct net_device *dev;
struct ipoib_dev_priv *priv;
unsigned int p;
int count = 0;
dev_list = kmalloc(sizeof(*dev_list), GFP_KERNEL);
if (!dev_list)
return -ENOMEM;
INIT_LIST_HEAD(dev_list);
rdma_for_each_port (device, p) {
if (!rdma_protocol_ib(device, p))
continue;
dev = ipoib_add_port("ib%d", device, p);
if (!IS_ERR(dev)) {
priv = ipoib_priv(dev);
list_add_tail(&priv->list, dev_list);
count++;
}
}
if (!count) {
kfree(dev_list);
return -EOPNOTSUPP;
}
ib_set_client_data(device, &ipoib_client, dev_list);
return 0;
}
static struct net_device *ipoib_add_port(const char *format,
struct ib_device *hca, u8 port)
{
struct rtnl_link_ops *ops = ipoib_get_link_ops();
struct rdma_netdev_alloc_params params;
struct ipoib_dev_priv *priv;
struct net_device *ndev;
int result;
ndev = ipoib_intf_alloc(hca, port, format);
if (IS_ERR(ndev)) {
pr_warn("%s, %d: ipoib_intf_alloc failed %ld\n", hca->name, port,
PTR_ERR(ndev));
return ndev;
}
priv = ipoib_priv(ndev);
INIT_IB_EVENT_HANDLER(&priv->event_handler,
priv->ca, ipoib_event);
ib_register_event_handler(&priv->event_handler);
/* call event handler to ensure pkey in sync */
queue_work(ipoib_workqueue, &priv->flush_heavy);
result = register_netdev(ndev);
if (result) {
pr_warn("%s: couldn't register ipoib port %d; error %d\n",
hca->name, port, result);
ipoib_parent_unregister_pre(ndev);
ipoib_intf_free(ndev);
free_netdev(ndev);
return ERR_PTR(result);
}
if (hca->ops.rdma_netdev_get_params) {
int rc = hca->ops.rdma_netdev_get_params(hca, port,
RDMA_NETDEV_IPOIB,
&params);
if (!rc && ops->priv_size < params.sizeof_priv)
ops->priv_size = params.sizeof_priv;
}
#ifdef CONFIG_COMPAT_LRO_ENABLED_IPOIB
/* force lro on the dev->features, because the function
* register_netdev disable it according to our private lro
*/
set_lro_features_bit(priv);
#endif
/*
* We cannot set priv_destructor before register_netdev because we
* need priv to be always valid during the error flow to execute
* ipoib_parent_unregister_pre(). Instead handle it manually and only
* enter priv_destructor mode once we are completely registered.
*/
#ifdef HAVE_NET_DEVICE_NEEDS_FREE_NETDEV
ndev->priv_destructor = ipoib_intf_free;
#endif
if (ipoib_intercept_dev_id_attr(ndev))
goto sysfs_failed;
if (ipoib_cm_add_mode_attr(ndev))
goto sysfs_failed;
if (ipoib_add_pkey_attr(ndev))
goto sysfs_failed;
if (ipoib_add_umcast_attr(ndev))
goto sysfs_failed;
if (device_create_file(&ndev->dev, &dev_attr_create_child))
goto sysfs_failed;
if (device_create_file(&ndev->dev, &dev_attr_delete_child))
goto sysfs_failed;
if (device_create_file(&priv->dev->dev, &dev_attr_set_mac))
goto sysfs_failed;
if (priv->max_tx_queues > 1) {
if (ipoib_set_rss_sysfs(priv))
goto sysfs_failed;
}
return ndev;
sysfs_failed:
ipoib_parent_unregister_pre(ndev);
unregister_netdev(ndev);
return ERR_PTR(-ENOMEM);
}
struct net_device *ipoib_intf_alloc(struct ib_device *hca, u8 port,
const char *name)
{
struct ipoib_dev_priv *priv;
struct net_device *dev;
int rc;
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv)
return ERR_PTR(-ENOMEM);
dev = ipoib_alloc_netdev(hca, port, name, priv);
if (IS_ERR(dev)) {
kfree(priv);
return dev;
}
rc = ipoib_intf_init(hca, port, name, dev, priv);
if (rc) {
kfree(priv);
free_netdev(dev);
return ERR_PTR(rc);
}
/*
* Upon success the caller must ensure ipoib_intf_free is called or
* register_netdevice succeed'd and priv_destructor is set to
* ipoib_intf_free.
*/
return dev;
}
int ipoib_intf_init(struct ib_device *hca, u8 port, const char *name,
struct net_device *dev, struct ipoib_dev_priv *priv)
{
struct rdma_netdev *rn = netdev_priv(dev);
int rc;
priv->ca = hca;
priv->port = port;
rc = rdma_init_netdev(hca, port, RDMA_NETDEV_IPOIB, name,
NET_NAME_UNKNOWN, ipoib_setup_common, dev,
!ipoib_enhanced_enabled);
if (rc) {
if (rc != -EOPNOTSUPP)
goto out;
if (priv->num_tx_queues > 1) {
netif_set_real_num_tx_queues(dev, priv->num_tx_queues);
netif_set_real_num_rx_queues(dev, priv->num_rx_queues);
rn->attach_mcast = ipoib_mcast_attach_rss;
rn->send = ipoib_send_rss;
/* Override ethtool_ops to ethtool_ops_rss */
ipoib_set_ethtool_ops_rss(dev);
} else {
rn->attach_mcast = ipoib_mcast_attach;
rn->send = ipoib_send;
}
dev->netdev_ops = ipoib_get_rn_ops(priv);
rn->detach_mcast = ipoib_mcast_detach;
rn->hca = hca;
}
priv->rn_ops = dev->netdev_ops;
dev->netdev_ops = ipoib_get_netdev_ops(priv);
rn->clnt_priv = priv;
/*
* Only the child register_netdev flows can handle priv_destructor
* being set, so we force it to NULL here and handle manually until it
* is safe to turn on.
*/
#ifdef HAVE_NET_DEVICE_NEEDS_FREE_NETDEV
priv->next_priv_destructor = dev->priv_destructor;
dev->priv_destructor = NULL;
#endif
ipoib_build_priv(dev);
return 0;
out:
return rc;
}
static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
struct sk_buff *skb;
u64 mapping[IPOIB_UD_RX_SG];
union ib_gid *dgid;
union ib_gid *sgid;
ipoib_dbg_data(priv, "recv completion: id %d, status: %d\n",
wr_id, wc->status);
if (unlikely(wr_id >= priv->recvq_size)) {
ipoib_warn(priv, "recv completion event with wrid %d (> %d)\n",
wr_id, priv->recvq_size);
return;
}
skb = priv->rx_ring[wr_id].skb;
if (unlikely(wc->status != IB_WC_SUCCESS)) {
if (wc->status != IB_WC_WR_FLUSH_ERR)
ipoib_warn(priv,
"failed recv event (status=%d, wrid=%d vend_err %#x)\n",
wc->status, wr_id, wc->vendor_err);
ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[wr_id].mapping);
dev_kfree_skb_any(skb);
priv->rx_ring[wr_id].skb = NULL;
return;
}
memcpy(mapping, priv->rx_ring[wr_id].mapping,
IPOIB_UD_RX_SG * sizeof(*mapping));
/*
* If we can't allocate a new RX buffer, dump
* this packet and reuse the old buffer.
*/
if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
++dev->stats.rx_dropped;
goto repost;
}
ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
wc->byte_len, wc->slid);
ipoib_ud_dma_unmap_rx(priv, mapping);
skb_put(skb, wc->byte_len);
/* First byte of dgid signals multicast when 0xff */
dgid = &((struct ib_grh *)skb->data)->dgid;
if (!(wc->wc_flags & IB_WC_GRH) || dgid->raw[0] != 0xff)
skb->pkt_type = PACKET_HOST;
else if (memcmp(dgid, dev->broadcast + 4, sizeof(union ib_gid)) == 0)
skb->pkt_type = PACKET_BROADCAST;
else
skb->pkt_type = PACKET_MULTICAST;
sgid = &((struct ib_grh *)skb->data)->sgid;
/*
* Drop packets that this interface sent, ie multicast packets
* that the HCA has replicated.
*/
if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num) {
int need_repost = 1;
if ((wc->wc_flags & IB_WC_GRH) &&
sgid->global.interface_id != priv->local_gid.global.interface_id)
need_repost = 0;
if (need_repost) {
dev_kfree_skb_any(skb);
goto repost;
}
}
skb_pull(skb, IB_GRH_BYTES);
#if (LINUX_VERSION_CODE < KERNEL_VERSION(3,14,0)) && ! defined(HAVE_SK_BUFF_CSUM_LEVEL)
/* indicate size for reasmb, only for old kernels */
skb->truesize = SKB_TRUESIZE(skb->len);
#endif
skb->protocol = ((struct ipoib_header *) skb->data)->proto;
skb_add_pseudo_hdr(skb);
++dev->stats.rx_packets;
dev->stats.rx_bytes += skb->len;
if (skb->pkt_type == PACKET_MULTICAST)
dev->stats.multicast++;
if (unlikely(be16_to_cpu(skb->protocol) == ETH_P_ARP))
ipoib_create_repath_ent(dev, skb, wc->slid);
skb->dev = dev;
if ((dev->features & NETIF_F_RXCSUM) &&
likely(wc->wc_flags & IB_WC_IP_CSUM_OK))
skb->ip_summed = CHECKSUM_UNNECESSARY;
#ifdef CONFIG_COMPAT_LRO_ENABLED_IPOIB
if (dev->features & NETIF_F_LRO)
lro_receive_skb(&priv->lro.lro_mgr, skb, NULL);
else
netif_receive_skb(skb);
#else
napi_gro_receive(&priv->recv_napi, skb);
#endif
repost:
if (unlikely(ipoib_ib_post_receive(dev, wr_id)))
ipoib_warn(priv, "ipoib_ib_post_receive failed "
"for buf %d\n", wr_id);
}
IPoIB (IP over InfiniBand) runs the IP protocol over an InfiniBand network. The walkthrough below traces, in the driver source, how IPoIB receives data and passes it to the network stack, and how it takes data from the network stack and sends it out.
1. IPoIB receives data and passes it to the network stack
1.1 Where reception starts: polling the completion queue (CQ)
IPoIB handles reception through the NAPI mechanism: the ipoib_napi_add function registers the receive NAPI context and names ipoib_rx_poll as its poll routine.
static void ipoib_napi_add(struct net_device *dev)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
netif_napi_add(dev, &priv->recv_napi, ipoib_rx_poll, IPOIB_NUM_WC);
...
}
1.2 Polling the completion queue
The ipoib_rx_poll function repeatedly polls the receive completion queue priv->recv_cq, using ib_poll_cq to fetch completed work requests (Work Completions, WCs).
int ipoib_rx_poll(struct napi_struct *napi, int budget)
{
struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, recv_napi);
...
n = ib_poll_cq(priv->recv_cq, t, priv->ibwc);
for (i = 0; i < n; i++) {
struct ib_wc *wc = priv->ibwc + i;
if (wc->wr_id & IPOIB_OP_RECV) {
...
if (wc->wr_id & IPOIB_OP_CM)
ipoib_cm_handle_rx_wc(dev, wc);
else
ipoib_ib_handle_rx_wc(dev, wc);
}
}
...
}
1.3 Handling receive completions
For an ordinary (non-connected-mode) receive completion, ipoib_ib_handle_rx_wc does the processing.
static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
...
skb = priv->rx_ring[wr_id].skb;
if (unlikely(wc->status != IB_WC_SUCCESS)) {
...
}
...
skb_put(skb, wc->byte_len);
...
skb->protocol = ((struct ipoib_header *) skb->data)->proto;
skb_add_pseudo_hdr(skb);
...
if ((dev->features & NETIF_F_RXCSUM) && likely(wc->wc_flags & IB_WC_IP_CSUM_OK))
skb->ip_summed = CHECKSUM_UNNECESSARY;
#ifdef CONFIG_COMPAT_LRO_ENABLED_IPOIB
if (dev->features & NETIF_F_LRO)
lro_receive_skb(&priv->lro.lro_mgr, skb, NULL);
else
netif_receive_skb(skb);
#else
napi_gro_receive(&priv->recv_napi, skb);
#endif
...
}
1.4 Passing data to the network stack
Inside ipoib_ib_handle_rx_wc, the received packet is handed to the network stack in one of several ways, depending on device features and configuration:
- If LRO (Large Receive Offload) is enabled, the packet goes through lro_receive_skb.
- Otherwise it is delivered via netif_receive_skb or napi_gro_receive.
2. IPoIB takes data from the network stack and sends it out
2.1 The network stack calls the transmit function
When the network stack needs to send data, it invokes the ndo_start_xmit hook in netdev_ops, which for IPoIB is ipoib_start_xmit.
static netdev_tx_t ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
struct rdma_netdev *rn = netdev_priv(dev);
...
if (unlikely(phdr->hwaddr[4] == 0xff)) {
...
neigh = ipoib_neigh_get(dev, phdr->hwaddr);
if (likely(neigh))
goto send_using_neigh;
ipoib_mcast_send(dev, phdr->hwaddr, skb);
return NETDEV_TX_OK;
}
...
send_using_neigh:
...
if (ipoib_cm_get(neigh)) {
if (ipoib_cm_up(neigh)) {
priv->fp.ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
goto unref;
}
} else if (neigh->ah && neigh->ah->valid) {
neigh->ah->last_send = rn->send(dev, skb, neigh->ah->ah, IPOIB_QPN(phdr->hwaddr));
goto unref;
}
...
}
2.2 Choosing the send function
ipoib_start_xmit picks a transmit path based on the neighbour entry and connection state:
- If the connection manager (CM) is in use and the connection is up, the packet is sent through ipoib_cm_send.
- Otherwise, if the neighbour's address handle (AH) is valid, the packet goes through rn->send; in the multi-queue case rn->send points at ipoib_send_rss. How rn->send gets bound is shown right after this list.
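Which function rn->send points at is decided once, at interface-init time. The relevant branch from ipoib_intf_init (shown in full in the source section above):
if (priv->num_tx_queues > 1) {
	rn->attach_mcast = ipoib_mcast_attach_rss;
	rn->send = ipoib_send_rss;	/* multi-queue (RSS) datagram send */
} else {
	rn->attach_mcast = ipoib_mcast_attach;
	rn->send = ipoib_send;		/* single-queue datagram send */
}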
2.3 Actually sending the data
Taking ipoib_send_rss as the example: the function does the preparatory work, such as picking the correct send ring, handling GSO (Generic Segmentation Offload), and DMA-mapping the buffers, and finally calls post_send_rss to push the packet out.
int ipoib_send_rss(struct net_device *dev, struct sk_buff *skb,
struct ib_ah *address, u32 dqpn)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
...
queue_index = skb_get_queue_mapping(skb);
send_ring = priv->send_ring + queue_index;
...
rc = post_send_rss(send_ring, req_index,
address, dqpn, tx_req, phead, hlen);
...
return rc;
}
To sum up, IPoIB uses the NAPI mechanism to poll the completion queues for received data and delivers it to the network stack according to the device's features; on transmit, it takes packets from the network stack and chooses the appropriate send function based on the connection state and neighbour information.
IPoIB (IP over InfiniBand) implements the IP protocol on top of an InfiniBand network, letting devices on the fabric communicate over IP. It is implemented as a kernel module that interfaces with the network stack to receive and send packets. The following explains in detail how IPoIB receives data and passes it up to the network stack, and how it takes data from the stack and sends it out.
1. Receiving data and passing it to the network stack
1.1 Receiving packets
IPoIB receives packets through the InfiniBand hardware and delivers them to the kernel network stack. The process involves the following steps:
- Interrupt and NAPI scheduling:
  - When the InfiniBand hardware receives a packet, it raises a completion interrupt.
  - The completion callback does no real work in interrupt context; it merely schedules NAPI (a sketch of that callback follows this list).
- NAPI polling:
  - The poll routine ipoib_rx_poll drains the receive completion queue with ib_poll_cq and dispatches each receive work completion (WC) to ipoib_ib_handle_rx_wc.
- Handling a receive completion:
  - ipoib_ib_handle_rx_wc inspects the WC and retrieves the received packet from the receive ring.
  - On success it allocates a fresh receive buffer to replace the one just consumed, then finishes the sk_buff (length, pkt_type, protocol, checksum state).
  - Finally the packet is queued to the network stack via netif_receive_skb or napi_gro_receive.
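The CQ callback is deliberately tiny: it runs in interrupt context and only arms NAPI. A minimal sketch (the upstream driver registers a callback of this shape when creating the receive CQ; the function name here is illustrative):
/* Completion callback on the receive CQ: defer everything to NAPI. */
static void ipoib_rx_completion_sketch(struct ib_cq *cq, void *ctx_ptr)
{
	struct ipoib_dev_priv *priv = ctx_ptr;

	napi_schedule(&priv->recv_napi);	/* ipoib_rx_poll() drains the CQ */
}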
1.2 Code example
static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
struct sk_buff *skb;
u64 mapping[IPOIB_UD_RX_SG];
union ib_gid *dgid;
union ib_gid *sgid;
ipoib_dbg_data(priv, "recv completion: id %d, status: %d\n", wr_id, wc->status);
if (unlikely(wr_id >= priv->recvq_size)) {
ipoib_warn(priv, "recv completion event with wrid %d (> %d)\n", wr_id, priv->recvq_size);
return;
}
skb = priv->rx_ring[wr_id].skb;
if (unlikely(wc->status != IB_WC_SUCCESS)) {
if (wc->status != IB_WC_WR_FLUSH_ERR)
ipoib_warn(priv, "failed recv event (status=%d, wrid=%d vend_err %#x)\n", wc->status, wr_id, wc->vendor_err);
ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[wr_id].mapping);
dev_kfree_skb_any(skb);
priv->rx_ring[wr_id].skb = NULL;
return;
}
memcpy(mapping, priv->rx_ring[wr_id].mapping, IPOIB_UD_RX_SG * sizeof(*mapping));
if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
++dev->stats.rx_dropped;
goto repost;
}
ipoib_ud_dma_unmap_rx(priv, mapping);
skb_put(skb, wc->byte_len);
dgid = &((struct ib_grh *)skb->data)->dgid;
if (!(wc->wc_flags & IB_WC_GRH) || dgid->raw[0] != 0xff)
skb->pkt_type = PACKET_HOST;
else if (memcmp(dgid, dev->broadcast + 4, sizeof(union ib_gid)) == 0)
skb->pkt_type = PACKET_BROADCAST;
else
skb->pkt_type = PACKET_MULTICAST;
sgid = &((struct ib_grh *)skb->data)->sgid;
if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num) {
int need_repost = 1;
if ((wc->wc_flags & IB_WC_GRH) && sgid->global.interface_id != priv->local_gid.global.interface_id)
need_repost = 0;
if (need_repost) {
dev_kfree_skb_any(skb);
goto repost;
}
}
skb_pull(skb, IB_GRH_BYTES);
skb->protocol = ((struct ipoib_header *)skb->data)->proto;
skb_add_pseudo_hdr(skb);
++dev->stats.rx_packets;
dev->stats.rx_bytes += skb->len;
if (skb->pkt_type == PACKET_MULTICAST)
dev->stats.multicast++;
if (unlikely(be16_to_cpu(skb->protocol) == ETH_P_ARP))
ipoib_create_repath_ent(dev, skb, wc->slid);
skb->dev = dev;
if ((dev->features & NETIF_F_RXCSUM) && likely(wc->wc_flags & IB_WC_IP_CSUM_OK))
skb->ip_summed = CHECKSUM_UNNECESSARY;
#ifdef CONFIG_COMPAT_LRO_ENABLED_IPOIB
if (dev->features & NETIF_F_LRO)
lro_receive_skb(&priv->lro.lro_mgr, skb, NULL);
else
netif_receive_skb(skb);
#else
napi_gro_receive(&priv->recv_napi, skb);
#endif
repost:
if (unlikely(ipoib_ib_post_receive(dev, wr_id)))
ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", wr_id);
}
2. Taking data from the network stack and sending it out
2.1 Sending packets
IPoIB takes packets from the kernel network stack and sends them onto the InfiniBand network. The process involves the following steps:
- The network stack calls the transmit function:
  - When the stack needs to send a packet, it calls dev_queue_xmit.
  - dev_queue_xmit eventually invokes the device's netdev_ops->ndo_start_xmit hook, which for an IPoIB device is ipoib_start_xmit.
- Handling the transmit request:
  - ipoib_start_xmit checks whether the packet is unicast or multicast and picks the corresponding transmit path.
  - Unicast packets go out over an established connection (connected mode) or path (datagram mode).
  - Multicast packets go through the multicast group machinery.
- Sending the packet:
  - The packet is wrapped for InfiniBand and handed to the send routine (ipoib_send_rss in the multi-queue case), which calls post_send_rss to post it to the hardware send queue.
  - Completion is reported asynchronously on the send completion queue.
- TX-completion NAPI processing:
  - After transmission, the hardware posts a send work completion (WC).
  - ipoib_tx_poll consumes send WCs, releases the transmit buffers, and updates the transmit statistics.
2.2 Code example
static netdev_tx_t ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
struct rdma_netdev *rn = netdev_priv(dev);
struct ipoib_neigh *neigh;
struct ipoib_pseudo_header *phdr;
struct ipoib_header *header;
unsigned long flags;
phdr = (struct ipoib_pseudo_header *)skb->data;
skb_pull(skb, sizeof(*phdr));
header = (struct ipoib_header *)skb->data;
if (unlikely(phdr->hwaddr[4] == 0xff)) {
if ((header->proto != htons(ETH_P_IP)) &&
(header->proto != htons(ETH_P_IPV6)) &&
(header->proto != htons(ETH_P_ARP)) &&
(header->proto != htons(ETH_P_RARP)) &&
(header->proto != htons(ETH_P_TIPC))) {
++dev->stats.tx_dropped;
dev_kfree_skb_any(skb);
return NETDEV_TX_OK;
}
phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
phdr->hwaddr[9] = priv->pkey & 0xff;
neigh = ipoib_neigh_get(dev, phdr->hwaddr);
if (likely(neigh))
goto send_using_neigh;
ipoib_mcast_send(dev, phdr->hwaddr, skb);
return NETDEV_TX_OK;
}
switch (header->proto) {
case htons(ETH_P_IP):
case htons(ETH_P_IPV6):
case htons(ETH_P_TIPC):
neigh = ipoib_neigh_get(dev, phdr->hwaddr);
if (unlikely(!neigh)) {
neigh = neigh_add_path(skb, phdr->hwaddr, dev);
if (likely(!neigh))
return NETDEV_TX_OK;
}
break;
case htons(ETH_P_ARP):
case htons(ETH_P_RARP):
unicast_arp_send(skb, dev, phdr);
return NETDEV_TX_OK;
default:
++dev->stats.tx_dropped;
dev_kfree_skb_any(skb);
return NETDEV_TX_OK;
}
send_using_neigh:
if (ipoib_cm_get(neigh)) {
if (ipoib_cm_up(neigh)) {
priv->fp.ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
goto unref;
}
} else if (neigh->ah && neigh->ah->valid) {
neigh->ah->last_send = rn->send(dev, skb, neigh->ah->ah, IPOIB_QPN(phdr->hwaddr));
goto unref;
} else if (neigh->ah) {
neigh_refresh_path(neigh, phdr->hwaddr, dev);
}
if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
spin_lock_irqsave(&priv->lock, flags);
push_pseudo_header(skb, phdr->hwaddr);
__skb_queue_tail(&neigh->queue, skb);
spin_unlock_irqrestore(&priv->lock, flags);
} else {
++dev->stats.tx_dropped;
dev_kfree_skb_any(skb);
}
unref:
ipoib_neigh_put(neigh);
return NETDEV_TX_OK;
}
int ipoib_send_rss(struct net_device *dev, struct sk_buff *skb, struct ib_ah *address, u32 dqpn)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
struct ipoib_tx_buf *tx_req;
struct ipoib_send_ring *send_ring;
u16 queue_index;
int hlen, rc;
void *phead;
int req_index;
unsigned usable_sge = priv->max_send_sge - !!skb_headlen(skb);
queue_index = skb_get_queue_mapping(skb);
send_ring = priv->send_ring + queue_index;
if (skb_is_gso(skb)) {
hlen = skb_transport_offset(skb) + tcp_hdrlen(skb);
phead = skb->data;
if (unlikely(!skb_pull(skb, hlen))) {
++send_ring->stats.tx_dropped;
++send_ring->stats.tx_errors;
dev_kfree_skb_any(skb);
return -1;
}
} else {
if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
++send_ring->stats.tx_dropped;
++send_ring->stats.tx_errors;
ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
return -1;
}
phead = NULL;
hlen = 0;
}
req_index = send_ring->tx_head & (priv->sendq_size - 1);
tx_req = &send_ring->tx_ring[req_index];
tx_req->skb = skb;
if (skb->len < ipoib_inline_thold && !skb_shinfo(skb)->nr_frags) {
tx_req->is_inline = 1;
send_ring->tx_wr.wr.send_flags |= IB_SEND_INLINE;
} else {
if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) {
++send_ring->stats.tx_errors;
dev_kfree_skb_any(skb);
return -1;
}
tx_req->is_inline = 0;
send_ring->tx_wr.wr.send_flags &= ~IB_SEND_INLINE;
}
if (skb->ip_summed == CHECKSUM_PARTIAL)
send_ring->tx_wr.wr.send_flags |= IB_SEND_IP_CSUM;
else
send_ring->tx_wr.wr.send_flags &= ~IB_SEND_IP_CSUM;
if (atomic_read(&send_ring->tx_outstanding) == priv->sendq_size - 1) {
netif_stop_subqueue(dev, queue_index);
}
skb_orphan(skb);
skb_dst_drop(skb);
rc = post_send_rss(send_ring, req_index, address, dqpn, tx_req, phead, hlen);
if (unlikely(rc)) {
++send_ring->stats.tx_errors;
if (!tx_req->is_inline)
ipoib_dma_unmap_tx(priv, tx_req);
dev_kfree_skb_any(skb);
if (__netif_subqueue_stopped(dev, queue_index))
netif_wake_subqueue(dev, queue_index);
rc = 0;
} else {
netdev_get_tx_queue(dev, queue_index)->trans_start = jiffies;
rc = send_ring->tx_head;
++send_ring->tx_head;
atomic_inc(&send_ring->tx_outstanding);
}
return rc;
}
int ipoib_tx_poll(struct napi_struct *napi, int budget)
{
struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, send_napi);
struct net_device *dev = priv->dev;
int n, i;
struct ib_wc *wc;
poll_more:
n = ib_poll_cq(priv->send_cq, MAX_SEND_CQE, priv->send_wc);
for (i = 0; i < n; i++) {
wc = priv->send_wc + i;
if (wc->wr_id & IPOIB_OP_CM)
ipoib_cm_handle_tx_wc(dev, wc);
else
ipoib_ib_handle_tx_wc(dev, wc);
}
if (n < budget) {
napi_complete(napi);
if (unlikely(ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)) && napi_reschedule(napi))
goto poll_more;
}
return n < 0 ? 0 : n;
}
static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
{
struct ipoib_dev_priv *priv = ipoib_priv(dev);
unsigned int wr_id = wc->wr_id;
struct ipoib_tx_buf *tx_req;
ipoib_dbg_data(priv, "send completion: id %d, status: %d\n", wr_id, wc->status);
if (unlikely(wr_id >= priv->sendq_size)) {
ipoib_warn(priv, "send completion event with wrid %d (> %d)\n", wr_id, priv->sendq_size);
return;
}
tx_req = &priv->tx_ring[wr_id];
if (!tx_req->is_inline)
ipoib_dma_unmap_tx(priv, tx_req);
++dev->stats.tx_packets;
dev->stats.tx_bytes += tx_req->skb->len;
dev_kfree_skb_any(tx_req->skb);
tx_req->skb = NULL;
++priv->tx_tail;
atomic_dec(&priv->tx_outstanding);
if (unlikely(netif_queue_stopped(dev) && (atomic_read(&priv->tx_outstanding) <= priv->sendq_size >> 1) && test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)))
netif_wake_queue(dev);
if (wc->status != IB_WC_SUCCESS && wc->status != IB_WC_WR_FLUSH_ERR) {
struct ipoib_qp_state_validate *qp_work;
ipoib_warn(priv, "failed send event (status=%d, wrid=%d vend_err %#x)\n", wc->status, wr_id, wc->vendor_err);
qp_work = kzalloc(sizeof(*qp_work), GFP_ATOMIC);
if (!qp_work)
return;
INIT_WORK(&qp_work->work, ipoib_qp_state_validate_work);
qp_work->priv = priv;
queue_work(priv->wq, &qp_work->work);
}
}
Summary
IPoIB takes packets from the kernel network stack through the standard netdev interface (dev_queue_xmit leading to ndo_start_xmit) and pushes them onto the InfiniBand network; in the other direction, it consumes the receive work completions (WCs) raised by the InfiniBand hardware and delivers the received packets to the kernel network stack. The process spans several layers: the network stack, the device driver, and the hardware. A simplified sketch of the TX entry hop follows.
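Where ipoib_start_xmit sits in that interaction can be summarized in one step: the stack's generic entry point dev_queue_xmit() resolves, through the qdisc layer, to the device's ndo_start_xmit hook. A simplified sketch of that last hop (generic-stack logic, not IPoIB code):
/* Simplified: how the core stack reaches the driver's TX hook. */
static netdev_tx_t xmit_one_sketch(struct sk_buff *skb, struct net_device *dev)
{
	const struct net_device_ops *ops = dev->netdev_ops;

	/* dev_queue_xmit() -> qdisc -> dev_hard_start_xmit() ends up here: */
	return ops->ndo_start_xmit(skb, dev);	/* == ipoib_start_xmit for IPoIB */
}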
IPoIB (IP over InfiniBand) encapsulates the IP protocol for transport over an InfiniBand (IB) network and relies on RDMA (Remote Direct Memory Access) hardware for efficient data movement. The receive and send paths work as follows:
Receive path
- Hardware interrupt: when a packet arrives from the InfiniBand network, the hardware raises an interrupt to tell the kernel that data is ready.
- Completion-queue processing: the kernel polls the completion queue (CQ) with ib_poll_cq to collect finished work requests (WRs). ipoib_rx_poll is the receive poll routine; it calls ib_poll_cq to fetch receive-completion events.
- Packet processing:
  - If a receive completion reports success, the IPoIB driver takes the packet out of the receive ring and wraps it in an sk_buff.
  - The driver reads the packet's protocol type and sets the sk_buff protocol field accordingly.
  - For multicast and broadcast packets, the driver sets the sk_buff pkt_type field based on the destination GID (Global Identifier).
- Handing packets to the network stack:
  - The finished sk_buff is submitted to the network stack through napi_gro_receive or netif_receive_skb, which passes it on to the upper protocol layers.
  - If the device has LRO (Large Receive Offload) enabled, the packet instead goes through lro_receive_skb.
A condensed sketch of the poll loop that drives steps 2 to 4 appears below.
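The sketch, based on the ipoib_rx_poll source shown earlier (error handling and the connected-mode branch omitted):
static int rx_poll_sketch(struct napi_struct *napi, int budget)
{
	struct ipoib_dev_priv *priv =
		container_of(napi, struct ipoib_dev_priv, recv_napi);
	int done = 0, i, n, t;

	while (done < budget) {
		/* Fetch up to budget-done completions from the receive CQ. */
		t = min(IPOIB_NUM_WC, budget - done);
		n = ib_poll_cq(priv->recv_cq, t, priv->ibwc);
		for (i = 0; i < n; i++) {
			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
				++done;
				ipoib_ib_handle_rx_wc(priv->dev, &priv->ibwc[i]);
			}
		}
		if (n != t)	/* CQ drained */
			break;
	}

	if (done < budget) {
		napi_complete(napi);
		/* Re-arm CQ interrupts; re-poll if completions slipped in. */
		if (ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP |
				     IB_CQ_REPORT_MISSED_EVENTS))
			napi_reschedule(napi);
	}
	return done;
}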
Send path
- The network stack calls the transmit function: when the stack has data to send, it calls ipoib_start_xmit, IPoIB's transmit entry point.
- Packet encapsulation:
  - The driver inspects the packet's destination address and protocol type. For a unicast address it looks up the neighbour table to obtain the corresponding address handle and queue pair number (QPN).
  - For a multicast address it adds the P_Key (partition key) to the pseudo header and calls ipoib_mcast_send to send the multicast packet.
- Send-queue processing:
  - The driver wraps the packet in a send work request (WR) and places it on the send queue (SQ).
  - If the send queue is full, the driver stops the netdev queue until the send queue has room again.
- Hardware transmission:
  - The driver calls post_send_rss to submit the send request to the hardware, which transmits the packet to the destination according to the WR; a sketch of the underlying UD post appears after this list.
  - Once transmission finishes, the hardware records a completion event on the send completion queue (CQ); the driver polls the CQ to collect these events and does the corresponding cleanup.
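Underneath post_send_rss, the hardware post is a UD (unreliable datagram) send work request carrying the address handle, the destination QPN, and the Q_Key. A sketch under those assumptions (post_send_rss itself is not shown in the excerpts above, so the structure here is illustrative):
static int ud_post_sketch(struct ipoib_dev_priv *priv, struct ib_ah *ah,
			  u32 dqpn, struct ib_sge *sge, int num_sge)
{
	struct ib_ud_wr wr = {
		.wr = {
			.opcode	    = IB_WR_SEND,
			.sg_list    = sge,
			.num_sge    = num_sge,
			.send_flags = IB_SEND_SIGNALED,
		},
		.ah	     = ah,		/* address handle for the destination */
		.remote_qpn  = dqpn,		/* destination queue pair number */
		.remote_qkey = priv->qkey,
	};

	/* Hand the WR to the HCA; completion shows up on the send CQ. */
	return ib_post_send(priv->qp, &wr.wr, NULL);
}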
In this way IPoIB achieves efficient receive and transmit, making full use of RDMA to reduce CPU load and raise data-transfer efficiency.
ipoib is the IP over InfiniBand driver module, which implements the IP protocol stack on top of an InfiniBand network. Its job includes receiving and sending data frames and interacting with the network stack. The following covers how the ipoib module receives data and passes it to the network stack, and how it takes data from the stack and sends it out.
Receiving data and passing it to the network stack
- Receiving data:
When the InfiniBand device receives a data frame, a completion event fires. The ipoib driver handles these completions in functions such as ipoib_cm_handle_rx_wc (connected mode) and ipoib_ib_handle_rx_wc (datagram mode), which in this driver have the signatures:
static void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc);
static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc);
- Processing the received data:
These handlers examine the work completion (WC) to determine whether the packet arrived intact. On success, the driver recovers the packet from the receive ring and finishes it as a socket buffer (skb):
struct sk_buff *skb;
...
skb = priv->rx_ring[wr_id].skb;
skb_put(skb, wc->byte_len);
- Passing it to the network stack:
Once the packet is wrapped in an skb, ipoib submits it to the kernel network stack with netif_receive_skb (or napi_gro_receive):
netif_receive_skb(skb);
Taking data from the network stack and sending it out
- Registering the xmit function:
The ipoib driver registers a transmit function, ipoib_start_xmit, in the net device's operations table; the upper network protocol stack calls it whenever it has data to send.
dev->netdev_ops = &ipoib_netdev_ops;
...
static const struct net_device_ops ipoib_netdev_ops = {
	.ndo_start_xmit = ipoib_start_xmit,
};
- Receiving the data from the stack:
The upper protocol stack calls ipoib_start_xmit, passing an skb that holds the data to send along with the net device:
static netdev_tx_t ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev);
- Processing and sending the data:
Inside ipoib_start_xmit the driver extracts the data from the skb and builds the appropriate InfiniBand message, then pushes it onto an InfiniBand send queue. Ultimately this goes through the low-level verbs API, ib_post_send, to actually transmit:
ret = ib_post_send(qp, &wr, &bad_wr);
Summary
- Receive: when the InfiniBand device receives data, a completion event fires; the ipoib driver's handlers process the received frame, wrap it in an skb, and call netif_receive_skb to hand it to the kernel network stack.
- Send: the network stack hands data to the ipoib driver as an skb via ipoib_start_xmit; the driver unpacks it and sends the frame out through the InfiniBand API.
In this way the ipoib module handles IP packets on an InfiniBand network and interacts efficiently with the upper network protocol stack.