OpenStack Nova: Optimizing Virtual Machine NIC Attach and Detach Performance in Practice

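1. Symptoms

In day-to-day operations we ran into a series of problems when attaching and detaching virtual machine NICs (ports):

Long latency: on a VM with multiple ports, a single attach or detach took 70 to 90+ seconds.
Tasks blocking one another: attach, detach, Neutron event callbacks, and the periodic cache sync all queued on the same lock, so one slow task caused everything behind it to pile up.
Consistency problems under concurrency: when the same port was attached and detached repeatedly within a short time, we saw leftover NICs, failed attaches, and failed detaches.

The sections below break down each problem and how we solved it.

2. Problem 1: a single operation takes 70 to 90 seconds

Diagnosis

After adding timing logs along the critical path, we found that the bottleneck was the full cache refresh in _build_network_info_model.

The call chain for a single attach_interface looks roughly like this: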

attach_interface
  ├── allocate_port_for_instance      # request a port from Neutron
  ├── driver.attach_interface         # libvirt hot-plugs the NIC
  └── get_instance_nw_info            # refresh info_cache (bottleneck)
        └── _build_network_info_model
              ├── list_ports           # list all ports of this VM
              └── query Neutron for network info, once per port

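The problem is the last step: even when only one port is attached, the network info for every port on the VM is fetched again. If the VM has 10 ports, each of them costs a Neutron query, and the total time grows with the number of ports.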
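To make the scaling concrete, here is a minimal sketch that only counts calls. FakeNeutron and its methods are hypothetical stand-ins, not the Neutron client API, and the sketch assumes (as described above) one lookup per port on every full refresh.

```python
class FakeNeutron:
    """Hypothetical stand-in for a Neutron client; it only counts calls."""

    def __init__(self, port_ids):
        self.port_ids = list(port_ids)
        self.calls = 0

    def list_ports(self):
        self.calls += 1
        return list(self.port_ids)

    def get_network_for_port(self, port_id):
        self.calls += 1
        return {"port_id": port_id, "network": "net-of-" + port_id}


def full_refresh(client):
    """Rebuild the cache by looking up every port on the VM."""
    return [client.get_network_for_port(p) for p in client.list_ports()]


def calls_for(port_count):
    """Number of API calls one full refresh makes for a VM with port_count ports."""
    client = FakeNeutron("port-%d" % i for i in range(port_count))
    full_refresh(client)
    return client.calls


print(calls_for(1), calls_for(10))   # 2 11
```

Under this assumption the call count is one list call plus one lookup per port, so attaching a single port to a 10-port VM still pays for all 10 lookups.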

The logs show:

_build_network_info_model list_ports took: 0.8s
_build_network_info_model _gather_port_ids_and_networks took: 28.3s
_build_network_info_model _build_vif_model took: 1.2s

The _gather_port_ids_and_networks stage takes close to 30 seconds; this is the per-port query for network info from Neutron.
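Per-stage timing lines like the ones above could be produced with a small context manager. This is only a sketch of one way to do it: the helper name timed and the durations dict are hypothetical, and the sleeps are placeholders for the real calls.

```python
import logging
import time
from contextlib import contextmanager

LOG = logging.getLogger(__name__)


@contextmanager
def timed(stage, results=None):
    """Log how long the wrapped block took; optionally record it in a dict."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        if results is not None:
            results[stage] = elapsed
        LOG.info("_build_network_info_model %s took: %.1fs", stage, elapsed)


durations = {}
with timed("list_ports", durations):
    time.sleep(0.01)      # placeholder for the real Neutron call
with timed("_gather_port_ids_and_networks", durations):
    time.sleep(0.02)      # placeholder for the per-port lookups
```

Using time.monotonic rather than time.time keeps the measurement unaffected by wall-clock adjustments on the compute host.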

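Solution: incremental update instead of full refresh

The core change is a new method, incremental_instance_cache_with_nw_info. In the attach path it updates the cache only for the newly added port instead of doing a full refresh: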

# nova/network/base_api.py

from nova.network import model as network_model
from nova import objects


def incremental_instance_cache_with_nw_info(impl, context,
                                            instance, nw_info=None,
                                            update_cells=True):
    """Incremental update: append the new port's network info to the cache."""
    current_nw_info = instance.get_network_info()
    # nw_info defaults to None; treat that as "nothing to append".
    new_nw_info = network_model.NetworkInfo(
        list(current_nw_info) + list(nw_info or [])
    )
    ic = objects.InstanceInfoCache.get_by_instance_uuid(
        context, instance.uuid)
    ic.network_info = new_nw_info
    ic.save(update_cells=update_cells)
    instance.info_cache = ic
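The append step can be tried in isolation with plain dicts in place of Nova's VIF objects. The sketch below is an illustration, not the code from the change: merge_network_info is a hypothetical helper, and skipping ids that are already cached is an extra safeguard added here, not something the method above does.

```python
def merge_network_info(current_vifs, new_vifs=None):
    """Append new VIFs to the cached list, skipping ids already present."""
    merged = list(current_vifs)
    seen = {vif["id"] for vif in merged}
    for vif in new_vifs or []:
        if vif["id"] not in seen:
            merged.append(vif)
            seen.add(vif["id"])
    return merged


cached = [{"id": "port-a"}, {"id": "port-b"}]
updated = merge_network_info(cached, [{"id": "port-c"}])
unchanged = merge_network_info(cached, None)
deduplicated = merge_network_info(cached, [{"id": "port-b"}])
```

Here updated holds three entries, while unchanged and deduplicated both keep the original two; the input list is never modified in place.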

We also changed the order of calls in attach_interface: driver.attach_interface (the hot-plug) runs first, and the incremental cache update runs afterwards:

attach_interface (after optimization)
  ├── allocate_port_for_instance
  ├── driver.attach_interface          # finish the driver-level operation first
  └── incremental_instance_cache       # append only the new port's info to the cache

This avoids a full cache refresh before the driver operation. Before and after:

Operation	Time before	Time after
Attach port	~71s	10-17s
Detach port	~92s	~7s

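3. Problem 2: different operations block one another

Diagnosis

In the original code, all of the following operations used the same instance-level lock, refresh_cache-{instance_uuid}:

Attach NIC → triggers get_instance_nw_info → takes the refresh_cache lock
Detach NIC → triggers get_instance_nw_info → takes the refresh_cache lock
Neutron network-changed event → triggers get_instance_nw_info → takes the refresh_cache lock
Periodic task _heal_instance_info_cache → triggers get_instance_nw_info → takes the refresh_cache lock

This means that if the periodic task is refreshing the cache (which can take 30 seconds) and an attach request arrives in the meantime, the request simply has to wait.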

Timeline (before optimization):

Periodic task ─────[refresh_cache lock]──────────────────── 30s
                                         Attach request ─── waits for lock ────[runs]── the attach itself may need only 10s
                                                                      Detach request ─── keeps waiting...

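Solution: finer-grained locks

The refresh_cache lock is split into two locks with separate purposes:

port-{port_uuid} lock: guards the driver-level attach/detach of a specific port
refresh_cache-{instance_uuid} lock: used only for full cache refreshes (periodic task, event callbacks, and so on)

Attach and detach no longer hold the refresh_cache lock; each holds a port-level lock instead: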

# nova/compute/manager.py

from oslo_concurrency import lockutils


def attach_interface(self, context, instance, network_id, port_id, ...):
    if port_id:
        # Serialize with other operations on the same port.
        with lockutils.lock('port-%s' % port_id):
            return self._attach_interface(...)
    else:
        # No port id was given, so there is no port to lock on.
        return self._attach_interface(...)

def detach_interface(self, context, instance, port_id):
    with lockutils.lock('port-%s' % port_id):
        # detach logic...

After the change:

Timeline (after optimization):

Periodic task  ─────[refresh_cache lock]──────────
Attach request ─────[port-aaa lock]── no waiting, runs immediately
Detach request ─────[port-bbb lock]── no waiting, runs immediately

Attach and detach are no longer blocked by the periodic task or by other events.
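The independence of the two lock names can be checked with a small in-process simulation. This is a sketch under stated assumptions: named_lock is a hypothetical stand-in built on threading.Lock that mimics the "one lock per name" behaviour of lockutils.lock, and the attach body is a placeholder rather than real driver work.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager

_locks = defaultdict(threading.Lock)
_locks_guard = threading.Lock()


@contextmanager
def named_lock(name):
    """In-process stand-in for a named lock: one threading.Lock per name."""
    with _locks_guard:
        lock = _locks[name]
    with lock:
        yield


def attach(port_id, done):
    with named_lock("port-%s" % port_id):
        done.set()        # placeholder for the driver-level hot-plug


attached = threading.Event()
# Simulate the periodic task holding the instance-level lock...
with named_lock("refresh_cache-vm-1"):
    worker = threading.Thread(target=attach, args=("aaa", attached))
    worker.start()
    # ...while an attach on port-aaa proceeds without waiting for it.
    finished_while_held = attached.wait(timeout=5)
worker.join()
print(finished_while_held)   # True
```

If attach took "refresh_cache-vm-1" instead of the port lock, the wait would time out, because the worker could not enter its critical section until the outer block released the lock.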

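4. Problem 3: consistency problems under concurrency

Reproducing the problem

When the same port goes through attach → detach → attach in quick succession, the following can happen:

The detach request reaches the driver layer first, but the attach request finishes its cache write first → the cache contains the port although the driver layer has already detached it → leftover NIC
Two operations modify the cache concurrently → the cache data is corrupted → later operations cannot find the port, or find it twice → attach/detach fails

Solution

This problem was solved as part of splitting the lock. Every operation on a given port (attach, detach, the network-vif-deleted event) first acquires the port-{port_uuid} lock, which guarantees serial execution: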

# attach
with lockutils.lock('port-%s' % port_id):
    self._attach_interface(...)

# detach
with lockutils.lock('port-%s' % port_id):
    self._detach_interface(...)

# neutron event: network-vif-deleted
with lockutils.lock('port-%s' % event.tag):
    self._process_instance_vif_deleted_event(...)

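Operations on the same port queue up and run one at a time, while different ports do not block each other. We also kept the _heal_instance_info_cache periodic task, which periodically does a full cache sync as a safety net for data consistency.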
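The serialization property for a single port can be illustrated with threads that share one lock. This is a simplified sketch: port_lock stands in for the port-{port_uuid} lock, the operation names are labels only, and the sleep is a placeholder for the driver work plus the cache write.

```python
import threading
import time

port_lock = threading.Lock()   # stand-in for the port-{port_uuid} lock
events = []                    # (operation, phase) in the order they happened


def run(operation):
    with port_lock:
        events.append((operation, "start"))
        time.sleep(0.01)       # placeholder for driver work + cache write
        events.append((operation, "end"))


threads = [threading.Thread(target=run, args=(op,))
           for op in ("attach", "detach", "attach-again")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each operation's start is immediately followed by its own end.
serialized = all(
    events[i][0] == events[i + 1][0]
    and (events[i][1], events[i + 1][1]) == ("start", "end")
    for i in range(0, len(events), 2)
)
```

The check only covers mutual exclusion: no operation starts while another is in progress. A plain lock does not by itself promise any particular order among waiting threads, so ordering guarantees would have to come from elsewhere.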

The overall lock structure is shown in the diagram below:

┌─────────────────────────────────────────────────────┐
│                   Instance level                    │
│                                                     │
│   refresh_cache-{instance_uuid} lock                │
│   └── periodic task _heal_instance_info_cache       │
│   └── network-changed event callback                │
│                                                     │
├─────────────────────────────────────────────────────┤
│                     Port level                      │
│                                                     │
│   port-{port_uuid} lock                             │
│   └── attach_interface                              │
│   └── detach_interface                              │
│   └── network-vif-deleted event                     │
│                                                     │
└─────────────────────────────────────────────────────┘

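5. Results

Metric	Before	After
Port attach time	~71s	10-17s
Port detach time	~92s	~7s
Attach/detach waits for other tasks	Yes, queued on the lock	No, runs independently
Same-port operations under high concurrency	Leftover NICs or failures possible	Serialized in request order, consistent result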

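6. Summary

The optimization comes down to three ideas:

Full → incremental: instead of rebuilding the info for every port on each refresh, the cache update appends only the port that changed
Coarse lock → fine lock: the large instance-level lock is split into small port-level locks, which cuts unnecessary waiting
Locking at a single entry point: attach, detach, and event handling for the same port share one port lock, which guarantees serial execution

Together with the periodic task's full sync as a safety net, these three changes improve performance while preserving data consistency.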