IPoIB (IP over InfiniBand): How Data Is Received and Transmitted

IPoIB (IP over InfiniBand) is a technique for carrying the IP protocol over an InfiniBand network: it encapsulates IP packets inside InfiniBand datagrams, enabling efficient IP communication across the fabric. This article analyzes in detail how IPoIB receives data and hands it to the network stack, and how it fetches data from the network stack and sends it out.


1. How IPoIB Receives Data and Passes It to the Network Stack

When the InfiniBand adapter receives a packet, the packet travels through the InfiniBand hardware queues to the IPoIB driver. The driver's main job is to process these packets and deliver them to the Linux kernel's network stack.

1.1 Main Receive Flow

  1. Hardware receive queue

    • After the InfiniBand adapter receives a packet, it places the packet on the receive queue (Receive Queue, RQ).

    • Packets on the receive queue are DMA'd (direct memory access) into host memory.

  2. IPoIB driver receive handling

    • The IPoIB driver pulls packets from the receive queue via polling or interrupts.

    • The driver calls ipoib_ib_handle_rx_wc to process each received packet. The function parses the packet, validates it, and converts it into the sk_buff structure that the Linux kernel network stack works with.

  3. Packet processing

    • The packet is wrapped in an sk_buff, and the driver handles it according to its type (unicast, multicast, or broadcast).

    • If the packet is an ARP packet, IPoIB creates a corresponding path entry.

    • The packet is handed to the network stack via netif_receive_skb (or napi_gro_receive) and is finally processed by the upper protocol layers.

  4. Reposting the receive buffer

    • After a packet is processed, the IPoIB driver reposts a receive buffer (ipoib_ib_post_receive) so that new packets can be received.
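The four steps above can be modeled in miniature as a fixed ring of posted buffers: a completion hands one buffer to the stack, and the slot is immediately refilled and reposted. This is an illustrative userspace sketch (all names invented), not kernel code:

```c
#define RX_RING_SIZE 4  /* toy ring depth; real rings hold hundreds of buffers */

struct rx_slot {
    int posted;      /* 1 while a buffer is posted to the hardware */
    int generation;  /* counts how many buffers this slot has cycled through */
};

/* Handle one receive completion for slot wr_id, then repost it,
 * mirroring the ipoib_ib_handle_rx_wc -> ipoib_ib_post_receive cycle. */
int handle_rx_completion(struct rx_slot *ring, unsigned int wr_id)
{
    if (wr_id >= RX_RING_SIZE || !ring[wr_id].posted)
        return -1;                /* bogus completion id */
    ring[wr_id].posted = 0;       /* buffer now owned by the network stack */
    ring[wr_id].generation++;     /* allocate a replacement buffer */
    ring[wr_id].posted = 1;       /* repost the slot to the hardware */
    return 0;
}
```

The key property the real driver also maintains: a slot is reposted on every completion, so the ring never drains.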

1.2 Key Code Snippet

static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
{
    /* Condensed excerpt: process the received packet and
     * turn it into an sk_buff */
    skb = priv->rx_ring[wr_id].skb;
    skb_put(skb, wc->byte_len);

    /* Set skb->pkt_type according to the destination GID in the GRH */
    if (!(wc->wc_flags & IB_WC_GRH) || dgid->raw[0] != 0xff)
        skb->pkt_type = PACKET_HOST;
    else if (memcmp(dgid, dev->broadcast + 4, sizeof(union ib_gid)) == 0)
        skb->pkt_type = PACKET_BROADCAST;
    else
        skb->pkt_type = PACKET_MULTICAST;

    /* Hand the packet to the network stack */
    napi_gro_receive(&priv->recv_napi, skb);

    /* Repost the receive buffer */
    ipoib_ib_post_receive(dev, wr_id);
}
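The pkt_type decision in the snippet can be isolated into a small pure function. The sketch below (hypothetical names, raw 16-byte GIDs) reproduces the same three-way test: no GRH or a non-multicast GID means a host-directed packet, a GID matching the broadcast group means broadcast, anything else is multicast:

```c
#include <string.h>
#include <stdint.h>

/* Illustrative model, not kernel definitions: the first byte of an IB
 * multicast GID is 0xff. */
enum pkt_type { PKT_HOST, PKT_BROADCAST, PKT_MULTICAST };

enum pkt_type classify_pkt(int has_grh, const uint8_t dgid[16],
                           const uint8_t bcast_gid[16])
{
    if (!has_grh || dgid[0] != 0xff)
        return PKT_HOST;            /* no GRH, or a unicast GID */
    if (memcmp(dgid, bcast_gid, 16) == 0)
        return PKT_BROADCAST;       /* matches the broadcast group GID */
    return PKT_MULTICAST;           /* any other multicast GID */
}
```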

2. How IPoIB Takes Data from the Network Stack and Transmits It

When the Linux network stack has packets to send, it calls the IPoIB driver's transmit function. The driver's main job is to encapsulate those packets as InfiniBand datagrams and send them out through the InfiniBand adapter.

2.1 Main Transmit Flow

  1. The network stack calls the transmit function

    • When the network stack has a packet to send, it invokes the driver's ndo_start_xmit callback, which for IPoIB is ipoib_start_xmit.

    • This function processes the packet and decides how to send it based on its type (unicast or multicast).

  2. Packet encapsulation

    • The IPoIB driver encapsulates the IP packet in an InfiniBand datagram. For multicast packets, IPoIB patches the P_Key and related addressing information into the hardware address.

    • The driver calls ipoib_send (or ipoib_send_rss on multi-queue devices) to post the packet to the InfiniBand adapter's send queue (Send Queue, SQ).

  3. Transmitting the packet

    • The packet is DMA'd into the adapter's send queue, and the adapter transmits it to the destination node.

    • If the send queue is full, the IPoIB driver stops the network stack's transmit queue until enough space frees up.

  4. Send-completion handling

    • When the adapter finishes sending a packet, it generates a send completion (Completion Queue Entry, CQE).

    • The IPoIB driver processes these completions via polling or interrupts, releases the associated resources, and wakes the network stack's transmit queue.
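Steps 3 and 4 describe a stop/wake backpressure loop. A minimal userspace model (invented names; the thresholds follow the ones ipoib_send_rss and its completion handler use: stop when the ring is about to fill, wake once it has drained to half) looks like:

```c
/* Toy model of TX backpressure, not kernel code. */
struct txq_model {
    int outstanding;   /* sends posted but not yet completed */
    int size;          /* send queue depth */
    int stopped;       /* 1 while the stack's queue is paused */
};

void txq_post(struct txq_model *q)
{
    if (q->outstanding == q->size - 1)
        q->stopped = 1;          /* mirrors netif_stop_subqueue() */
    q->outstanding++;
}

void txq_complete(struct txq_model *q)
{
    q->outstanding--;
    if (q->stopped && q->outstanding <= q->size / 2)
        q->stopped = 0;          /* mirrors netif_wake_subqueue() */
}
```

The hysteresis (stop near full, wake at half) avoids rapid stop/wake flapping when the ring hovers around a single threshold.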

2.2 Key Code Snippets

static netdev_tx_t ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    /* Condensed excerpt: dispatch the packet */
    if (unlikely(phdr->hwaddr[4] == 0xff)) {
        /* multicast path */
        ipoib_mcast_send(dev, phdr->hwaddr, skb);
    } else {
        /* unicast path */
        neigh = ipoib_neigh_get(dev, phdr->hwaddr);
        if (ipoib_cm_get(neigh)) {
            /* send over a connected-mode (CM) connection */
            priv->fp.ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
        } else if (neigh->ah && neigh->ah->valid) {
            /* datagram send via the neighbour's address handle */
            rn->send(dev, skb, neigh->ah->ah, IPOIB_QPN(phdr->hwaddr));
        }
    }

    return NETDEV_TX_OK;
}

int ipoib_send_rss(struct net_device *dev, struct sk_buff *skb,
                   struct ib_ah *address, u32 dqpn)
{
    /* Condensed excerpt: place the packet on the send ring */
    tx_req = &send_ring->tx_ring[req_index];
    tx_req->skb = skb;

    /* Post the send request */
    rc = post_send_rss(send_ring, req_index, address, dqpn, tx_req, phead, hlen);
    if (unlikely(rc)) {
        /* send failed; free the skb */
        dev_kfree_skb_any(skb);
    }

    return rc;
}
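The req_index used above is computed in the full function as tx_head & (sendq_size - 1), which is only a valid modulo when the send-queue size is a power of two. A small standalone illustration of that invariant (not a kernel helper):

```c
#include <assert.h>

/* Map a monotonically increasing head counter onto a ring slot.
 * Masking with size-1 equals head % size only for power-of-two sizes. */
unsigned int ring_index(unsigned long head, unsigned int sendq_size)
{
    assert((sendq_size & (sendq_size - 1)) == 0); /* power of two */
    return (unsigned int)(head & (sendq_size - 1));
}
```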

3. Summary

  • Receive: the IPoIB driver takes packets off the InfiniBand adapter's receive queue, converts them into sk_buff structures, and hands them to the Linux kernel network stack.

  • Transmit: the IPoIB driver takes packets from the network stack, encapsulates them as InfiniBand datagrams, and sends them out through the adapter's send queue.

The IPoIB driver bridges the InfiniBand hardware queues and the Linux network stack, making it possible to carry IP traffic over an InfiniBand network. This mechanism exploits InfiniBand's high bandwidth and low latency while remaining compatible with the existing IP stack, making it a strong fit for high-performance computing and data-center networks.

Related Source Code

/**
 * struct rdma_netdev - rdma netdev
 * For cases where netstack interfacing is required.
 */
struct rdma_netdev {
	void              *clnt_priv;
	struct ib_device  *hca;
	u8                 port_num;

	/*
	 * cleanup function must be specified.
	 * FIXME: This is only used for OPA_VNIC and that usage should be
	 * removed too.
	 */
	void (*free_rdma_netdev)(struct net_device *netdev);

	/* control functions */
	void (*set_id)(struct net_device *netdev, int id);
	/* send packet */
	int (*send)(struct net_device *dev, struct sk_buff *skb,
		    struct ib_ah *address, u32 dqpn);
	/* multicast */
	int (*attach_mcast)(struct net_device *dev, struct ib_device *hca,
			    union ib_gid *gid, u16 mlid,
			    int set_qkey, u32 qkey);
	int (*detach_mcast)(struct net_device *dev, struct ib_device *hca,
			    union ib_gid *gid, u16 mlid);
};
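The struct above is the indirection layer: IPoIB core code calls rn->send() without caring which implementation sits behind it. The toy types below (all invented) show the same pattern of selecting the function pointer once at init time, as ipoib_intf_init does when it picks ipoib_send or ipoib_send_rss:

```c
struct fake_netdev;

struct fake_rdma_ops {
    int (*send)(struct fake_netdev *dev, const char *data, unsigned int qpn);
};

struct fake_netdev {
    struct fake_rdma_ops ops;
    int single_queue_sends;
    int rss_sends;
};

int fake_send(struct fake_netdev *dev, const char *data, unsigned int qpn)
{
    (void)data; (void)qpn;
    return ++dev->single_queue_sends;   /* stand-in for ipoib_send */
}

int fake_send_rss(struct fake_netdev *dev, const char *data, unsigned int qpn)
{
    (void)data; (void)qpn;
    return ++dev->rss_sends;            /* stand-in for ipoib_send_rss */
}

/* Driver init picks the implementation once; callers only see ops.send. */
void fake_init(struct fake_netdev *dev, int multiqueue)
{
    dev->single_queue_sends = dev->rss_sends = 0;
    dev->ops.send = multiqueue ? fake_send_rss : fake_send;
}
```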

	.ndo_start_xmit		 = ipoib_start_xmit,

static netdev_tx_t ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct ipoib_dev_priv *priv = ipoib_priv(dev);
	struct rdma_netdev *rn = netdev_priv(dev);
	struct ipoib_neigh *neigh;
	struct ipoib_pseudo_header *phdr;
	struct ipoib_header *header;
	unsigned long flags;

	phdr = (struct ipoib_pseudo_header *) skb->data;
	skb_pull(skb, sizeof(*phdr));
	header = (struct ipoib_header *) skb->data;

	if (unlikely(phdr->hwaddr[4] == 0xff)) {
		/* multicast, arrange "if" according to probability */
		if ((header->proto != htons(ETH_P_IP)) &&
		    (header->proto != htons(ETH_P_IPV6)) &&
		    (header->proto != htons(ETH_P_ARP)) &&
		    (header->proto != htons(ETH_P_RARP)) &&
		    (header->proto != htons(ETH_P_TIPC))) {
			/* ethertype not supported by IPoIB */
			++dev->stats.tx_dropped;
			dev_kfree_skb_any(skb);
			return NETDEV_TX_OK;
		}
		/* Add in the P_Key for multicast*/
		phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
		phdr->hwaddr[9] = priv->pkey & 0xff;

		neigh = ipoib_neigh_get(dev, phdr->hwaddr);
		if (likely(neigh))
			goto send_using_neigh;
		ipoib_mcast_send(dev, phdr->hwaddr, skb);
		return NETDEV_TX_OK;
	}

	/* unicast, arrange "switch" according to probability */
	switch (header->proto) {
	case htons(ETH_P_IP):
	case htons(ETH_P_IPV6):
	case htons(ETH_P_TIPC):
		neigh = ipoib_neigh_get(dev, phdr->hwaddr);
		if (unlikely(!neigh)) {
			neigh = neigh_add_path(skb, phdr->hwaddr, dev);
			if (likely(!neigh))
				return NETDEV_TX_OK;
		}
		break;
	case htons(ETH_P_ARP):
	case htons(ETH_P_RARP):
		/* for unicast ARP and RARP should always perform path find */
		unicast_arp_send(skb, dev, phdr);
		return NETDEV_TX_OK;
	default:
		/* ethertype not supported by IPoIB */
		++dev->stats.tx_dropped;
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;
	}

send_using_neigh:
	/* note we now hold a ref to neigh */
	if (ipoib_cm_get(neigh)) {
		if (ipoib_cm_up(neigh)) {
			priv->fp.ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
			goto unref;
		}
	} else if (neigh->ah && neigh->ah->valid) {
		neigh->ah->last_send = rn->send(dev, skb, neigh->ah->ah,
						IPOIB_QPN(phdr->hwaddr));
		goto unref;
	} else if (neigh->ah) {
		neigh_refresh_path(neigh, phdr->hwaddr, dev);
	}

	if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
		spin_lock_irqsave(&priv->lock, flags);
		/*
		 * to avoid race with path_rec_completion check if it already
		 * done, if yes re-send the packet, otherwise push the skb into
		 * the queue.
		 * it is safe to check it here while priv->lock around.
		 */
		if (neigh->ah && neigh->ah->valid)
			if (!ipoib_cm_get(neigh) ||
			    (ipoib_cm_get(neigh) && ipoib_cm_up(neigh))) {
				spin_unlock_irqrestore(&priv->lock, flags);
				goto send_using_neigh;
			}
		push_pseudo_header(skb, phdr->hwaddr);
		__skb_queue_tail(&neigh->queue, skb);
		spin_unlock_irqrestore(&priv->lock, flags);
	} else {
		++dev->stats.tx_dropped;
		dev_kfree_skb_any(skb);
	}

unref:
	ipoib_neigh_put(neigh);

	return NETDEV_TX_OK;
}
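Two details of ipoib_start_xmit are worth isolating: byte 4 of the 20-byte IPoIB hardware address flags multicast, and for multicast the P_Key is patched into bytes 8 and 9. The sketch below is a simplified view of that address manipulation, not the full IPoIB address format:

```c
#include <stdint.h>

/* Byte 4 of the 20-byte IPoIB hardware address is 0xff for multicast. */
int hwaddr_is_multicast(const uint8_t hwaddr[20])
{
    return hwaddr[4] == 0xff;
}

/* For multicast, the P_Key is written big-endian into bytes 8-9,
 * as ipoib_start_xmit does before the multicast send. */
void hwaddr_set_pkey(uint8_t hwaddr[20], uint16_t pkey)
{
    hwaddr[8] = (pkey >> 8) & 0xff;  /* high byte */
    hwaddr[9] = pkey & 0xff;         /* low byte */
}
```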

int ipoib_send_rss(struct net_device *dev, struct sk_buff *skb,
		   struct ib_ah *address, u32 dqpn)
{
	struct ipoib_dev_priv *priv = ipoib_priv(dev);
	struct ipoib_tx_buf *tx_req;
	struct ipoib_send_ring *send_ring;
	u16 queue_index;
	int hlen, rc;
	void *phead;
	int req_index;
	unsigned usable_sge = priv->max_send_sge - !!skb_headlen(skb);

	/* Find the correct QP to submit the IO to */
	queue_index = skb_get_queue_mapping(skb);
	send_ring = priv->send_ring + queue_index;

	if (skb_is_gso(skb)) {
		hlen = skb_transport_offset(skb) + tcp_hdrlen(skb);
		phead = skb->data;
		if (unlikely(!skb_pull(skb, hlen))) {
			ipoib_warn(priv, "linear data too small\n");
			++send_ring->stats.tx_dropped;
			++send_ring->stats.tx_errors;
			dev_kfree_skb_any(skb);
			return -1;
		}
	} else {
		if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
			ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
				   skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN);
			++send_ring->stats.tx_dropped;
			++send_ring->stats.tx_errors;
			ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
			return -1;
		}
		phead = NULL;
		hlen  = 0;
	}
	if (skb_shinfo(skb)->nr_frags > usable_sge) {
		if (skb_linearize(skb) < 0) {
			ipoib_warn(priv, "skb could not be linearized\n");
			++send_ring->stats.tx_dropped;
			++send_ring->stats.tx_errors;
			dev_kfree_skb_any(skb);
			return -1;
		}
		/* Does skb_linearize return ok without reducing nr_frags? */
		if (skb_shinfo(skb)->nr_frags > usable_sge) {
			ipoib_warn(priv, "too many frags after skb linearize\n");
			++send_ring->stats.tx_dropped;
			++send_ring->stats.tx_errors;
			dev_kfree_skb_any(skb);
			return -1;
		}
	}

	ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n",
		       skb->len, address, dqpn);

	/*
	 * We put the skb into the tx_ring _before_ we call post_send_rss()
	 * because it's entirely possible that the completion handler will
	 * run before we execute anything after the post_send_rss().  That
	 * means we have to make sure everything is properly recorded and
	 * our state is consistent before we call post_send_rss().
	 */
	req_index = send_ring->tx_head & (priv->sendq_size - 1);
	tx_req = &send_ring->tx_ring[req_index];
	tx_req->skb = skb;

	if (skb->len < ipoib_inline_thold &&
	    !skb_shinfo(skb)->nr_frags) {
		tx_req->is_inline = 1;
		send_ring->tx_wr.wr.send_flags |= IB_SEND_INLINE;
	} else {
		if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) {
			++send_ring->stats.tx_errors;
			dev_kfree_skb_any(skb);
			return -1;
		}
		tx_req->is_inline = 0;
		send_ring->tx_wr.wr.send_flags &= ~IB_SEND_INLINE;
	}

	if (skb->ip_summed == CHECKSUM_PARTIAL)
		send_ring->tx_wr.wr.send_flags |= IB_SEND_IP_CSUM;
	else
		send_ring->tx_wr.wr.send_flags &= ~IB_SEND_IP_CSUM;
	/* increase the tx_head after send success, but use it for queue state */
	if (atomic_read(&send_ring->tx_outstanding) == priv->sendq_size - 1) {
		ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
		netif_stop_subqueue(dev, queue_index);
	}

	skb_orphan(skb);
	skb_dst_drop(skb);

	if (__netif_subqueue_stopped(dev, queue_index))
		if (ib_req_notify_cq(send_ring->send_cq, IB_CQ_NEXT_COMP |
				     IB_CQ_REPORT_MISSED_EVENTS))
			ipoib_warn(priv, "request notify on send CQ failed\n");

	rc = post_send_rss(send_ring, req_index,
			   address, dqpn, tx_req, phead, hlen);
	if (unlikely(rc)) {
		ipoib_warn(priv, "post_send_rss failed, error %d\n", rc);
		++send_ring->stats.tx_errors;
		if (!tx_req->is_inline)
			ipoib_dma_unmap_tx(priv, tx_req);
		dev_kfree_skb_any(skb);
		if (__netif_subqueue_stopped(dev, queue_index))
			netif_wake_subqueue(dev, queue_index);
		rc = 0;
	} else {
		netdev_get_tx_queue(dev, queue_index)->trans_start = jiffies;
		rc = send_ring->tx_head;
		++send_ring->tx_head;
		atomic_inc(&send_ring->tx_outstanding);
	}

	return rc;
}
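One decision inside ipoib_send_rss is whether to use IB_SEND_INLINE: a small, unfragmented payload is copied directly into the work request instead of being DMA-mapped. Reduced to a predicate (the threshold is whatever ipoib_inline_thold is configured to; the value in the test is arbitrary):

```c
/* Inline-send decision: small and linear means copy into the WQE,
 * otherwise DMA-map the buffer. Illustrative, not a kernel helper. */
int should_send_inline(unsigned int len, unsigned int nr_frags,
                       unsigned int inline_thold)
{
    return len < inline_thold && nr_frags == 0;
}
```

Inlining trades a memcpy for skipping the DMA map/unmap, which wins for small packets.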

static void ipoib_napi_add(struct net_device *dev)
{
	struct ipoib_dev_priv *priv = ipoib_priv(dev);

	netif_napi_add(dev, &priv->recv_napi, ipoib_rx_poll, IPOIB_NUM_WC);
	netif_napi_add(dev, &priv->send_napi, ipoib_tx_poll, MAX_SEND_CQE);
}

int ipoib_rx_poll(struct napi_struct *napi, int budget)
{
	struct ipoib_dev_priv *priv =
		container_of(napi, struct ipoib_dev_priv, recv_napi);
	struct net_device *dev = priv->dev;
	int done;
	int t;
	int n, i;

	done  = 0;

poll_more:
	while (done < budget) {
		int max = (budget - done);

		t = min(IPOIB_NUM_WC, max);
		n = ib_poll_cq(priv->recv_cq, t, priv->ibwc);

		for (i = 0; i < n; i++) {
			struct ib_wc *wc = priv->ibwc + i;

			if (wc->wr_id & IPOIB_OP_RECV) {
				++done;
				if (wc->wr_id & IPOIB_OP_CM)
					ipoib_cm_handle_rx_wc(dev, wc);
				else
					ipoib_ib_handle_rx_wc(dev, wc);
			} else {
				pr_warn("%s: Got unexpected wqe id\n", __func__);
			}
		}

		if (n != t)
			break;
	}

	if (done < budget) {
#ifdef CONFIG_COMPAT_LRO_ENABLED_IPOIB
		if (dev->features & NETIF_F_LRO)
			lro_flush_all(&priv->lro.lro_mgr);
#endif
		napi_complete(napi);
		if (unlikely(ib_req_notify_cq(priv->recv_cq,
					      IB_CQ_NEXT_COMP |
					      IB_CQ_REPORT_MISSED_EVENTS)) &&
		    napi_reschedule(napi))
			goto poll_more;
	}

	return done;
}
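ipoib_rx_poll follows the standard NAPI contract: consume at most budget completions per invocation, and only signal completion (allowing interrupts to be re-armed) when fewer than budget were available. A userspace model of that contract, with a plain counter standing in for the completion queue:

```c
struct fake_cq { int pending; };  /* number of unconsumed completions */

/* Returns the number processed; *complete is set to 1 when polling may
 * stop (mirroring the done < budget check before napi_complete). */
int napi_poll_model(struct fake_cq *cq, int budget, int *complete)
{
    int done = 0;

    while (done < budget && cq->pending > 0) {
        cq->pending--;      /* "handle" one work completion */
        done++;
    }
    *complete = (done < budget);
    return done;
}
```

Returning exactly budget tells the NAPI core there may be more work, so it keeps polling instead of re-enabling interrupts.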

int ipoib_tx_poll(struct napi_struct *napi, int budget)
{
	struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv,
						   send_napi);
	struct net_device *dev = priv->dev;
	int n, i;
	struct ib_wc *wc;

poll_more:
	n = ib_poll_cq(priv->send_cq, MAX_SEND_CQE, priv->send_wc);

	for (i = 0; i < n; i++) {
		wc = priv->send_wc + i;
		if (wc->wr_id & IPOIB_OP_CM)
			ipoib_cm_handle_tx_wc(dev, wc);
		else
			ipoib_ib_handle_tx_wc(dev, wc);
	}

	if (n < budget) {
		napi_complete(napi);
		if (unlikely(ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP |
					      IB_CQ_REPORT_MISSED_EVENTS)) &&
		    napi_reschedule(napi))
			goto poll_more;
	}
	return n < 0 ? 0 : n;
}

static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
{
	struct ipoib_dev_priv *priv = ipoib_priv(dev);
	unsigned int wr_id = wc->wr_id;
	struct ipoib_tx_buf *tx_req;

	ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
		       wr_id, wc->status);

	if (unlikely(wr_id >= priv->sendq_size)) {
		ipoib_warn(priv, "send completion event with wrid %d (> %d)\n",
			   wr_id, priv->sendq_size);
		return;
	}

	tx_req = &priv->tx_ring[wr_id];

	if (!tx_req->is_inline)
		ipoib_dma_unmap_tx(priv, tx_req);

	++dev->stats.tx_packets;
	dev->stats.tx_bytes += tx_req->skb->len;

	dev_kfree_skb_any(tx_req->skb);
	tx_req->skb = NULL;

	++priv->tx_tail;
	atomic_dec(&priv->tx_outstanding);

	if (unlikely(netif_queue_stopped(dev) &&
		     (atomic_read(&priv->tx_outstanding) <= priv->sendq_size >> 1) &&
		     test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)))
		netif_wake_queue(dev);

	if (wc->status != IB_WC_SUCCESS &&
	    wc->status != IB_WC_WR_FLUSH_ERR) {
		struct ipoib_qp_state_validate *qp_work;
		ipoib_warn(priv,
			   "failed send event (status=%d, wrid=%d vend_err %#x)\n",
			   wc->status, wr_id, wc->vendor_err);
		qp_work = kzalloc(sizeof(*qp_work), GFP_ATOMIC);
		if (!qp_work)
			return;

		INIT_WORK(&qp_work->work, ipoib_qp_state_validate_work);
		qp_work->priv = priv;
		queue_work(priv->wq, &qp_work->work);
	}
}

static int ipoib_add_one(struct ib_device *device)
{
	struct list_head *dev_list;
	struct net_device *dev;
	struct ipoib_dev_priv *priv;
	unsigned int p;
	int count = 0;

	dev_list = kmalloc(sizeof(*dev_list), GFP_KERNEL);
	if (!dev_list)
		return -ENOMEM;

	INIT_LIST_HEAD(dev_list);

	rdma_for_each_port (device, p) {
		if (!rdma_protocol_ib(device, p))
			continue;
		dev = ipoib_add_port("ib%d", device, p);
		if (!IS_ERR(dev)) {
			priv = ipoib_priv(dev);
			list_add_tail(&priv->list, dev_list);
			count++;
		}
	}

	if (!count) {
		kfree(dev_list);
		return -EOPNOTSUPP;
	}

	ib_set_client_data(device, &ipoib_client, dev_list);
	return 0;
}

static struct net_device *ipoib_add_port(const char *format,
					 struct ib_device *hca, u8 port)
{
	struct rtnl_link_ops *ops = ipoib_get_link_ops();
	struct rdma_netdev_alloc_params params;
	struct ipoib_dev_priv *priv;
	struct net_device *ndev;
	int result;

	ndev = ipoib_intf_alloc(hca, port, format);
	if (IS_ERR(ndev)) {
		pr_warn("%s, %d: ipoib_intf_alloc failed %ld\n", hca->name, port,
			PTR_ERR(ndev));
		return ndev;
	}
	priv = ipoib_priv(ndev);

	INIT_IB_EVENT_HANDLER(&priv->event_handler,
			      priv->ca, ipoib_event);
	ib_register_event_handler(&priv->event_handler);

	/* call event handler to ensure pkey in sync */
	queue_work(ipoib_workqueue, &priv->flush_heavy);

	result = register_netdev(ndev);
	if (result) {
		pr_warn("%s: couldn't register ipoib port %d; error %d\n",
			hca->name, port, result);

		ipoib_parent_unregister_pre(ndev);
		ipoib_intf_free(ndev);
		free_netdev(ndev);

		return ERR_PTR(result);
	}

	if (hca->ops.rdma_netdev_get_params) {
		int rc = hca->ops.rdma_netdev_get_params(hca, port,
						     RDMA_NETDEV_IPOIB,
						     &params);

		if (!rc && ops->priv_size < params.sizeof_priv)
			ops->priv_size = params.sizeof_priv;
	}
#ifdef CONFIG_COMPAT_LRO_ENABLED_IPOIB
	/* force lro on the dev->features, because the function
	 * register_netdev disable it according to our private lro
	 */
	set_lro_features_bit(priv);
#endif

	/*
	 * We cannot set priv_destructor before register_netdev because we
	 * need priv to be always valid during the error flow to execute
	 * ipoib_parent_unregister_pre(). Instead handle it manually and only
	 * enter priv_destructor mode once we are completely registered.
	 */
#ifdef HAVE_NET_DEVICE_NEEDS_FREE_NETDEV
	ndev->priv_destructor = ipoib_intf_free;
#endif
	if (ipoib_intercept_dev_id_attr(ndev))
		goto sysfs_failed;
	if (ipoib_cm_add_mode_attr(ndev))
		goto sysfs_failed;
	if (ipoib_add_pkey_attr(ndev))
		goto sysfs_failed;
	if (ipoib_add_umcast_attr(ndev))
		goto sysfs_failed;
	if (device_create_file(&ndev->dev, &dev_attr_create_child))
		goto sysfs_failed;
	if (device_create_file(&ndev->dev, &dev_attr_delete_child))
		goto sysfs_failed;
	if (device_create_file(&priv->dev->dev, &dev_attr_set_mac))
		goto sysfs_failed;

	if (priv->max_tx_queues > 1) {
		if (ipoib_set_rss_sysfs(priv))
			goto sysfs_failed;
	}

	return ndev;

sysfs_failed:
	ipoib_parent_unregister_pre(ndev);
	unregister_netdev(ndev);
	return ERR_PTR(-ENOMEM);
}

struct net_device *ipoib_intf_alloc(struct ib_device *hca, u8 port,
				    const char *name)
{
	struct ipoib_dev_priv *priv;
	struct net_device *dev;
	int rc;

	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
	if (!priv)
		return ERR_PTR(-ENOMEM);

	dev = ipoib_alloc_netdev(hca, port, name, priv);
	if (IS_ERR(dev)) {
		kfree(priv);
		return dev;
	}

	rc = ipoib_intf_init(hca, port, name, dev, priv);
	if (rc) {
		kfree(priv);
		free_netdev(dev);
		return ERR_PTR(rc);
	}

	/*
	 * Upon success the caller must ensure ipoib_intf_free is called or
	 * register_netdevice succeed'd and priv_destructor is set to
	 * ipoib_intf_free.
	 */
	return dev;
}

int ipoib_intf_init(struct ib_device *hca, u8 port, const char *name,
		    struct net_device *dev, struct ipoib_dev_priv *priv)
{
	struct rdma_netdev *rn = netdev_priv(dev);
	int rc;

	priv->ca = hca;
	priv->port = port;

	rc = rdma_init_netdev(hca, port, RDMA_NETDEV_IPOIB, name,
			      NET_NAME_UNKNOWN, ipoib_setup_common, dev,
			      !ipoib_enhanced_enabled);
	if (rc) {
		if (rc != -EOPNOTSUPP)
			goto out;

		if (priv->num_tx_queues > 1) {
			netif_set_real_num_tx_queues(dev, priv->num_tx_queues);
			netif_set_real_num_rx_queues(dev, priv->num_rx_queues);

			rn->attach_mcast = ipoib_mcast_attach_rss;
			rn->send = ipoib_send_rss;

			/* Override ethtool_ops to ethtool_ops_rss */
			ipoib_set_ethtool_ops_rss(dev);
		} else {
			rn->attach_mcast = ipoib_mcast_attach;
			rn->send = ipoib_send;
		}

		dev->netdev_ops = ipoib_get_rn_ops(priv);
		rn->detach_mcast = ipoib_mcast_detach;
		rn->hca = hca;
	}

	priv->rn_ops = dev->netdev_ops;
	dev->netdev_ops = ipoib_get_netdev_ops(priv);
	rn->clnt_priv = priv;

	/*
	 * Only the child register_netdev flows can handle priv_destructor
	 * being set, so we force it to NULL here and handle manually until it
	 * is safe to turn on.
	 */
#ifdef HAVE_NET_DEVICE_NEEDS_FREE_NETDEV
	priv->next_priv_destructor = dev->priv_destructor;
	dev->priv_destructor = NULL;
#endif
	ipoib_build_priv(dev);

	return 0;

out:
	return rc;
}

static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
{
	struct ipoib_dev_priv *priv = ipoib_priv(dev);
	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
	struct sk_buff *skb;
	u64 mapping[IPOIB_UD_RX_SG];
	union ib_gid *dgid;
	union ib_gid *sgid;

	ipoib_dbg_data(priv, "recv completion: id %d, status: %d\n",
		       wr_id, wc->status);

	if (unlikely(wr_id >= priv->recvq_size)) {
		ipoib_warn(priv, "recv completion event with wrid %d (> %d)\n",
			   wr_id, priv->recvq_size);
		return;
	}

	skb  = priv->rx_ring[wr_id].skb;

	if (unlikely(wc->status != IB_WC_SUCCESS)) {
		if (wc->status != IB_WC_WR_FLUSH_ERR)
			ipoib_warn(priv,
				   "failed recv event (status=%d, wrid=%d vend_err %#x)\n",
				   wc->status, wr_id, wc->vendor_err);
		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[wr_id].mapping);
		dev_kfree_skb_any(skb);
		priv->rx_ring[wr_id].skb = NULL;
		return;
	}

	memcpy(mapping, priv->rx_ring[wr_id].mapping,
	       IPOIB_UD_RX_SG * sizeof(*mapping));

	/*
	 * If we can't allocate a new RX buffer, dump
	 * this packet and reuse the old buffer.
	 */
	if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
		++dev->stats.rx_dropped;
		goto repost;
	}

	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
		       wc->byte_len, wc->slid);

	ipoib_ud_dma_unmap_rx(priv, mapping);

	skb_put(skb, wc->byte_len);

	/* First byte of dgid signals multicast when 0xff */
	dgid = &((struct ib_grh *)skb->data)->dgid;

	if (!(wc->wc_flags & IB_WC_GRH) || dgid->raw[0] != 0xff)
		skb->pkt_type = PACKET_HOST;
	else if (memcmp(dgid, dev->broadcast + 4, sizeof(union ib_gid)) == 0)
		skb->pkt_type = PACKET_BROADCAST;
	else
		skb->pkt_type = PACKET_MULTICAST;

	sgid = &((struct ib_grh *)skb->data)->sgid;

	/*
	 * Drop packets that this interface sent, ie multicast packets
	 * that the HCA has replicated.
	 */
	if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num) {
		int need_repost = 1;

		if ((wc->wc_flags & IB_WC_GRH) &&
		    sgid->global.interface_id != priv->local_gid.global.interface_id)
			need_repost = 0;

		if (need_repost) {
			dev_kfree_skb_any(skb);
			goto repost;
		}
	}

	skb_pull(skb, IB_GRH_BYTES);
#if (LINUX_VERSION_CODE < KERNEL_VERSION(3,14,0)) && ! defined(HAVE_SK_BUFF_CSUM_LEVEL)
	/* indicate size for reasmb, only for old kernels */
	skb->truesize = SKB_TRUESIZE(skb->len);
#endif
	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
	skb_add_pseudo_hdr(skb);

	++dev->stats.rx_packets;
	dev->stats.rx_bytes += skb->len;
	if (skb->pkt_type == PACKET_MULTICAST)
		dev->stats.multicast++;

	if (unlikely(be16_to_cpu(skb->protocol) == ETH_P_ARP))
		ipoib_create_repath_ent(dev, skb, wc->slid);

	skb->dev = dev;
	if ((dev->features & NETIF_F_RXCSUM) &&
			likely(wc->wc_flags & IB_WC_IP_CSUM_OK))
		skb->ip_summed = CHECKSUM_UNNECESSARY;
#ifdef CONFIG_COMPAT_LRO_ENABLED_IPOIB
	if (dev->features & NETIF_F_LRO)
		lro_receive_skb(&priv->lro.lro_mgr, skb, NULL);
	else
		netif_receive_skb(skb);
#else
	napi_gro_receive(&priv->recv_napi, skb);
#endif

repost:
	if (unlikely(ipoib_ib_post_receive(dev, wr_id)))
		ipoib_warn(priv, "ipoib_ib_post_receive failed "
			   "for buf %d\n", wr_id);
}
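The tail of the receive path above peels headers: skb_pull removes the 40-byte GRH, and the 2-byte protocol field of the IPoIB encapsulation header becomes skb->protocol. The sketch below models only that offset arithmetic; the 4-byte encapsulation header and flat buffer layout are simplifications of the real struct ipoib_header handling:

```c
#include <stdint.h>
#include <stddef.h>

#define FAKE_GRH_BYTES 40  /* size of the IB Global Route Header */

/* Skip the GRH, read the big-endian protocol from the encapsulation
 * header, and report where the IP payload begins. Returns 0 on a
 * too-short buffer. Assumed layout, not a kernel API. */
uint16_t peel_and_get_proto(const uint8_t *buf, size_t len,
                            const uint8_t **payload)
{
    if (len < FAKE_GRH_BYTES + 4)
        return 0;
    buf += FAKE_GRH_BYTES;                              /* skb_pull(IB_GRH_BYTES) */
    uint16_t proto = (uint16_t)((buf[0] << 8) | buf[1]); /* network byte order */
    *payload = buf + 4;                                 /* past the 4-byte header */
    return proto;
}
```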

IPoIB (IP over InfiniBand) runs the IP protocol over an InfiniBand network. The following walks through how IPoIB receives data and passes it up to the network stack, and how it takes data from the network stack and sends it out.

1. IPoIB Receives Data and Passes It to the Network Stack

1.1 Where Receiving Starts: Polling the Completion Queue (CQ)

IPoIB handles receive processing through the NAPI mechanism: ipoib_napi_add registers the receive NAPI context and designates ipoib_rx_poll as its poll function.

static void ipoib_napi_add(struct net_device *dev)
{
    struct ipoib_dev_priv *priv = ipoib_priv(dev);

    netif_napi_add(dev, &priv->recv_napi, ipoib_rx_poll, IPOIB_NUM_WC);
    ...
}

1.2 Polling the Completion Queue

ipoib_rx_poll continually polls the receive completion queue priv->recv_cq, using ib_poll_cq to fetch completed work requests (Work Completions, WCs).

int ipoib_rx_poll(struct napi_struct *napi, int budget)
{
    struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, recv_napi);
    ...
    n = ib_poll_cq(priv->recv_cq, t, priv->ibwc);
    for (i = 0; i < n; i++) {
        struct ib_wc *wc = priv->ibwc + i;
        if (wc->wr_id & IPOIB_OP_RECV) {
            ...
            if (wc->wr_id & IPOIB_OP_CM)
                ipoib_cm_handle_rx_wc(dev, wc);
            else
                ipoib_ib_handle_rx_wc(dev, wc);
        }
    }
    ...
}

1.3 Handling Receive Completions

Ordinary (datagram-mode) receive completions are handled by ipoib_ib_handle_rx_wc.

static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
{
    struct ipoib_dev_priv *priv = ipoib_priv(dev);
    ...
    skb  = priv->rx_ring[wr_id].skb;
    if (unlikely(wc->status != IB_WC_SUCCESS)) {
        ...
    }
    ...
    skb_put(skb, wc->byte_len);
    ...
    skb->protocol = ((struct ipoib_header *) skb->data)->proto;
    skb_add_pseudo_hdr(skb);
    ...
    if ((dev->features & NETIF_F_RXCSUM) && likely(wc->wc_flags & IB_WC_IP_CSUM_OK))
        skb->ip_summed = CHECKSUM_UNNECESSARY;
#ifdef CONFIG_COMPAT_LRO_ENABLED_IPOIB
    if (dev->features & NETIF_F_LRO)
        lro_receive_skb(&priv->lro.lro_mgr, skb, NULL);
    else
        netif_receive_skb(skb);
#else
    napi_gro_receive(&priv->recv_napi, skb);
#endif
    ...
}

1.4 Passing the Data to the Network Stack

In ipoib_ib_handle_rx_wc, the received data is handed to the network stack in one of several ways, depending on device features and configuration:

  • If LRO (Large Receive Offload) is enabled, lro_receive_skb is called.
  • Otherwise, the packet is passed up via netif_receive_skb or napi_gro_receive.

2. IPoIB Gets Data from the Network Stack and Sends It Out

2.1 The Network Stack Calls the Transmit Function

When the network stack needs to transmit, it calls the ndo_start_xmit member of netdev_ops, which for IPoIB is ipoib_start_xmit.

static netdev_tx_t ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct ipoib_dev_priv *priv = ipoib_priv(dev);
    struct rdma_netdev *rn = netdev_priv(dev);
    ...
    if (unlikely(phdr->hwaddr[4] == 0xff)) {
        ...
        neigh = ipoib_neigh_get(dev, phdr->hwaddr);
        if (likely(neigh))
            goto send_using_neigh;
        ipoib_mcast_send(dev, phdr->hwaddr, skb);
        return NETDEV_TX_OK;
    }
    ...
send_using_neigh:
    ...
    if (ipoib_cm_get(neigh)) {
        if (ipoib_cm_up(neigh)) {
            priv->fp.ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
            goto unref;
        }
    } else if (neigh->ah && neigh->ah->valid) {
        neigh->ah->last_send = rn->send(dev, skb, neigh->ah->ah, IPOIB_QPN(phdr->hwaddr));
        goto unref;
    }
    ...
}

2.2 Choosing the Send Function

In ipoib_start_xmit, the transmit path is chosen from the neighbour entry and connection state:

  • If the connection manager (Connection Manager, CM) is in use and the connection is up, ipoib_cm_send is called.
  • Otherwise, if the neighbour's address handle (Address Handle, AH) is valid, rn->send is called; in the multi-queue case rn->send points to ipoib_send_rss.

2.3 Actually Sending the Data

Taking ipoib_send_rss as an example: the function performs a series of preparation steps, such as picking the correct send ring, handling GSO (Generic Segmentation Offload), and DMA-mapping the buffers, and finally calls post_send_rss to post the packet to the hardware.

int ipoib_send_rss(struct net_device *dev, struct sk_buff *skb,
                   struct ib_ah *address, u32 dqpn)
{
    struct ipoib_dev_priv *priv = ipoib_priv(dev);
    ...
    queue_index = skb_get_queue_mapping(skb);
    send_ring = priv->send_ring + queue_index;
    ...
    rc = post_send_rss(send_ring, req_index,
                       address, dqpn, tx_req, phead, hlen);
    ...
    return rc;
}
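The first thing ipoib_send_rss does is ring selection: the stack records a queue index in the skb, and the driver indexes its send-ring array with it. In miniature (invented types):

```c
/* Per-queue send ring; a plain array stands in for priv->send_ring. */
struct fake_ring { int posted; };

/* Mirrors send_ring = priv->send_ring + queue_index, where queue_index
 * comes from skb_get_queue_mapping(skb). */
struct fake_ring *pick_send_ring(struct fake_ring *rings,
                                 unsigned int queue_mapping)
{
    return rings + queue_mapping;
}
```

Because each CPU tends to map its flows to its own queue, the per-ring state (head, outstanding count, stats) needs no cross-queue locking on this path.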

To summarize: IPoIB receives data by polling its completion queues through the NAPI mechanism and passes packets to the network stack according to the device's features; on transmit, it takes packets from the network stack and picks the appropriate send function based on connection state and neighbour information.


    if (skb_is_gso(skb)) {
        hlen = skb_transport_offset(skb) + tcp_hdrlen(skb);
        phead = skb->data;
        if (unlikely(!skb_pull(skb, hlen))) {
            ++send_ring->stats.tx_dropped;
            ++send_ring->stats.tx_errors;
            dev_kfree_skb_any(skb);
            return -1;
        }
    } else {
        if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
            ++send_ring->stats.tx_dropped;
            ++send_ring->stats.tx_errors;
            ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
            return -1;
        }
        phead = NULL;
        hlen = 0;
    }

    req_index = send_ring->tx_head & (priv->sendq_size - 1);
    tx_req = &send_ring->tx_ring[req_index];
    tx_req->skb = skb;

    if (skb->len < ipoib_inline_thold && !skb_shinfo(skb)->nr_frags) {
        tx_req->is_inline = 1;
        send_ring->tx_wr.wr.send_flags |= IB_SEND_INLINE;
    } else {
        if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) {
            ++send_ring->stats.tx_errors;
            dev_kfree_skb_any(skb);
            return -1;
        }
        tx_req->is_inline = 0;
        send_ring->tx_wr.wr.send_flags &= ~IB_SEND_INLINE;
    }

    if (skb->ip_summed == CHECKSUM_PARTIAL)
        send_ring->tx_wr.wr.send_flags |= IB_SEND_IP_CSUM;
    else
        send_ring->tx_wr.wr.send_flags &= ~IB_SEND_IP_CSUM;

    if (atomic_read(&send_ring->tx_outstanding) == priv->sendq_size - 1) {
        netif_stop_subqueue(dev, queue_index);
    }

    skb_orphan(skb);
    skb_dst_drop(skb);

    rc = post_send_rss(send_ring, req_index, address, dqpn, tx_req, phead, hlen);
    if (unlikely(rc)) {
        ++send_ring->stats.tx_errors;
        if (!tx_req->is_inline)
            ipoib_dma_unmap_tx(priv, tx_req);
        dev_kfree_skb_any(skb);
        if (netif_subqueue_stopped(dev, queue_index))
            netif_wake_subqueue(dev, queue_index);
        rc = 0;
    } else {
        netdev_get_tx_queue(dev, queue_index)->trans_start = jiffies;
        rc = send_ring->tx_head;
        ++send_ring->tx_head;
        atomic_inc(&send_ring->tx_outstanding);
    }

    return rc;
}

int ipoib_tx_poll(struct napi_struct *napi, int budget)
{
    struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, send_napi);
    struct net_device *dev = priv->dev;
    int n, i;
    struct ib_wc *wc;

poll_more:
    n = ib_poll_cq(priv->send_cq, MAX_SEND_CQE, priv->send_wc);

    for (i = 0; i < n; i++) {
        wc = priv->send_wc + i;
        if (wc->wr_id & IPOIB_OP_CM)
            ipoib_cm_handle_tx_wc(dev, wc);
        else
            ipoib_ib_handle_tx_wc(dev, wc);
    }

    if (n < budget) {
        napi_complete(napi);
        if (unlikely(ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)) && napi_reschedule(napi))
            goto poll_more;
    }
    return n < 0 ? 0 : n;
}

static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
{
    struct ipoib_dev_priv *priv = ipoib_priv(dev);
    unsigned int wr_id = wc->wr_id;
    struct ipoib_tx_buf *tx_req;

    ipoib_dbg_data(priv, "send completion: id %d, status: %d\n", wr_id, wc->status);

    if (unlikely(wr_id >= priv->sendq_size)) {
        ipoib_warn(priv, "send completion event with wrid %d (> %d)\n", wr_id, priv->sendq_size);
        return;
    }

    tx_req = &priv->tx_ring[wr_id];

    if (!tx_req->is_inline)
        ipoib_dma_unmap_tx(priv, tx_req);

    ++dev->stats.tx_packets;
    dev->stats.tx_bytes += tx_req->skb->len;

    dev_kfree_skb_any(tx_req->skb);
    tx_req->skb = NULL;

    ++priv->tx_tail;
    atomic_dec(&priv->tx_outstanding);

    if (unlikely(netif_queue_stopped(dev) && (atomic_read(&priv->tx_outstanding) <= priv->sendq_size >> 1) && test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)))
        netif_wake_queue(dev);

    if (wc->status != IB_WC_SUCCESS && wc->status != IB_WC_WR_FLUSH_ERR) {
        struct ipoib_qp_state_validate *qp_work;
        ipoib_warn(priv, "failed send event (status=%d, wrid=%d vend_err %#x)\n", wc->status, wr_id, wc->vendor_err);
        qp_work = kzalloc(sizeof(*qp_work), GFP_ATOMIC);
        if (!qp_work)
            return;
        INIT_WORK(&qp_work->work, ipoib_qp_state_validate_work);
        qp_work->priv = priv;
        queue_work(priv->wq, &qp_work->work);
    }
}

Summary

IPoIB receives packets from the kernel network stack through interfaces such as dev_queue_xmit and transmits them onto the InfiniBand network. In the other direction, it processes the receive work completions (WCs) generated by the InfiniBand hardware and hands the received packets up to the kernel network stack. The whole path involves interaction across several layers: the network stack, the device driver, and the hardware.


IPoIB (IP over InfiniBand) is a technology that encapsulates the IP protocol for transport over an InfiniBand (IB) network, relying on RDMA (Remote Direct Memory Access) hardware for efficient data transfer. Its receive and send paths work as follows:

Receive Path

  1. Hardware interrupt: when a packet arrives from the InfiniBand network, the hardware raises an interrupt to notify the kernel that data is available.

  2. Completion-queue polling: the kernel polls the completion queue (CQ) with ib_poll_cq to collect completed work requests (WRs). ipoib_rx_poll is the receive-side NAPI poll function; it calls ib_poll_cq to retrieve receive completions.

  3. Packet processing

    • If a receive completion reports success, the IPoIB driver takes the packet out of the receive ring and wraps it in an sk_buff.

    • The driver checks the packet's protocol type and sets the sk_buff protocol field accordingly.

    • For multicast and broadcast packets, the driver sets the sk_buff pkt_type field based on the destination GID (Global Identifier).

  4. Handing packets to the network stack

    • The driver submits the finished sk_buff to the network stack through napi_gro_receive or netif_receive_skb, which delivers it to the upper-layer protocols.

    • If LRO (Large Receive Offload) is enabled on the device, the packet goes through lro_receive_skb for aggregation instead.

Send Path

  1. The network stack calls the transmit function: when the stack needs to send data, it calls ipoib_start_xmit, the IPoIB transmit handler.

  2. Packet encapsulation

    • The driver examines the packet's destination address and protocol type. For a unicast address, it looks up the neighbour table to obtain the address handle and queue pair number (QPN).

    • For a multicast address, it writes the P_Key (partition key) into the pseudo header and calls ipoib_mcast_send to transmit the packet.

  3. Send-queue handling

    • The driver wraps the packet in a send work request (WR) and posts it to the send queue (SQ).

    • If the send queue is full, the driver stops the netdev queue until enough space is freed.

  4. Hardware transmission

    • The driver calls post_send_rss to hand the work request to the hardware, which transmits the packet to the destination described in the WR.

    • After transmission, the hardware records a completion event in the send completion queue (CQ); the driver polls the CQ to reap these events and perform cleanup.

In this way IPoIB achieves efficient receive and transmit paths, exploiting the strengths of RDMA-capable hardware to lower CPU load and raise transfer efficiency.

ipoib is the IP over InfiniBand (IPoIB) driver module, which implements the IP protocol stack on top of an InfiniBand network. Its job is to receive and transmit data frames and to interface with the network stack. The following explains how the `ipoib` module receives data and hands it to the network stack, and how it obtains data from the network stack and transmits it.

Receiving Data and Passing It to the Network Stack

  1. Receiving data:

When the InfiniBand device receives a data frame, it raises a completion event. The `ipoib` driver registers callbacks to handle these events, such as `ipoib_cm_handle_rx_wc` (connected mode) and `ipoib_ib_handle_rx_wc` (datagram mode), which are invoked to process the received data.

   void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc);
   static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc);
  2. Processing received data:

These callbacks examine the work completion (WC) to determine whether the frame was received successfully. On success, the driver takes the packet from the receive ring and wraps it in a network-device skb (socket buffer):

   struct sk_buff *skb = priv->rx_ring[wr_id].skb;
   skb_put(skb, wc->byte_len);
  3. Passing it to the network stack:

Once the packet is wrapped in an `skb`, `ipoib` calls `netif_receive_skb` (or `napi_gro_receive`) to hand the `skb` to the kernel network stack for processing:

   netif_receive_skb(skb);

Getting Data from the Network Stack and Transmitting It

  1. Registering the xmit function:

The ipoib driver registers a transmit function, `ipoib_start_xmit`, in the net device operations; the upper network protocol stack calls it whenever it has data to send.

   dev->netdev_ops = &ipoib_netdev_ops;
   ...
   static const struct net_device_ops ipoib_netdev_ops = {
       .ndo_start_xmit = ipoib_start_xmit,
   };
  2. Getting the data:

The upper protocol stack calls `ipoib_start_xmit`, passing an `skb` containing the data to send together with the net device structure.

   static netdev_tx_t ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev);
  3. Processing and transmitting the data:

Inside `ipoib_start_xmit`, the driver takes the data from the `skb`, builds the corresponding InfiniBand message, and transmits it through the InfiniBand send queue. This typically ends in a call to a low-level InfiniBand verb such as `ib_post_send`:

   ret = ib_post_send(qp, &wr, &bad_wr);

Summary

  • Receive: when the InfiniBand device receives data, it raises a completion event; the `ipoib` driver's callbacks process the received frame, wrap it in an `skb`, and call `netif_receive_skb` to deliver it to the kernel network stack.

  • Send: the network stack calls `ipoib_start_xmit` to hand an `skb` to the `ipoib` driver, which unpacks the data and transmits the frame through the InfiniBand API.

In this way, the `ipoib` module handles IP packets over an InfiniBand network and interoperates efficiently with the upper network protocol stack.
