YOLO 双 Backbone 双模态融合：以 LLVIP 数据集为例的红外 - 可见光目标检测实践

在弱光、夜间监控和安防场景中，目标检测面临极大挑战：可见光图像噪声强、细节丢失、目标模糊，而红外图像在结构边缘保持得更好，但语义信息不足。如果能将红外（IR）与可见光（VIS）两种模态有效融合，通常可以大幅提升夜间检测表现。

一种常见且高效的解决方案是 ------ 基于 YOLO 架构，构建"双 Backbone + 双模态输入"的融合检测器 。本文以 LLVIP（Low-Light Visible-Infrared Person） 双模态数据集为例，介绍整个双模态 YOLO 的构建过程，包括：

数据准备
双 Backbone 结构设计
模态融合方式
训练流程
模型可视化推理示例

1. LLVIP 数据集简介

LLVIP（2021）是一个专门用于弱光下行人检测的双模态数据集，包含：

项目	数值
图像对数（VIS + IR）	15488 对
图像分辨率	约 1080P
标注类别	Person
场景	室外/弱光/夜间
是否配准	✔ 完全像素级配准

可见光样例（噪声大）：

红外样例（结构清晰）：

双模态检测的关键优势就在于：

VIS 提供纹理、颜色信息
IR 提供结构、轮廓与稳定亮度

因此构建双 Backbone 模型能让两种特征互补。

2. 为什么要"双 Backbone + 双模态"？

1. 单模态检测的局限性

可见光模态：白天表现优异，但夜间 / 雾霾场景下易受噪声、低对比度影响，目标易漏检。

红外模态：不受光照影响，能清晰捕捉目标轮廓，但缺乏颜色、纹理等细节，小目标检测精度低。

双模态融合的核心是互补信息聚合：用红外模态提供目标 "位置锚点"，用可见光模态补充 "细节特征"，提升复杂场景的鲁棒性。

2. 双 Backbone 架构设计思路

VIS 与 IR 的分布差异极大，如果直接拼接输入：

模态差异太大：卷积难以同时学习
低光噪声干扰特征
IR 热特征稀疏，不适合早期拼接

为每个模态单独设计一个 Backbone 分支，在 Neck 阶段进行高层语义融合， 在 Head 阶段共享检测器。

3. YOLO 双 Backbone 结构设计

我们以 YOLOv8/YOLOv11 为基础，对主干部分进行双模态改造。

(1) 融合方式1：Concat + 1×1 Conv（最常用）

python 复制代码

fusion = Conv1x1( concat([feat_vis, feat_ir], dim=1) )

(2) 融合方式2：加权融合（效果更好）

W 模态权重可学习：

python 复制代码

fusion = w1 * feat_vis + w2 * feat_ir

(3) 融合方式3：Cross Attention（效果最好）

对 VIS → IR、IR → VIS 分别建立注意力：

python 复制代码

att_vis2ir = Softmax(Q_vis · K_ir^T)
att_ir2vis = Softmax(Q_ir · K_vis^T)

feat_fused = feat_vis + att_vis2ir · V_ir + feat_ir + att_ir2vis · V_vis

4. 训练流程（基于 LLVIP）

代码获取：

https://github.com/tgf123/YOLOv8_improve

YOLO 双 Backbone 双模态融合：以 LLVIP 数据集为例的红外 - 可见光目标检测实践_哔哩哔哩_bilibili

YOLOv11模型改进讲解，教您如何使用双backbone提升YOLO11检测精度_哔哩哔哩_bilibili

YOLO双backbone讲，使用Mambaout替换YOLO backbone，可用于单模态和双模态检测，提高精度_哔哩哔哩_bilibili

4.1数据准备

从下面的数据截屏可以看到，图像和标签都是两份

LLVIP/

├── images/

│ ├── train/

│ ├── train2/ # IR

│ ├── val/

│ ├── val2/ # IR

├── labels/

│ ├── train/

│ ├── train2/

│ ├── val/

│ ├── val2/

python 复制代码

path:  E:/Part_time_job_orders/YOLOv8_double_backbone/LLVIP/LLVIP

train: images/train      # VIS
train2: images/train2    # IR
val: images/val          # VIS
val2: images/val2        # IR

nc: 1
names:
  0: Person

4.2模型的训练

从下面的的图中可以看出，和平常的yolo训练时一样的，也是将yaml文件的路径和数据的yaml路径复制一下，然后直接运行就可以。

4.3模型的可视化

python 复制代码

from ultralytics import YOLO

if __name__ == "__main__":
    pth_path = r"E:\Part_time_job_orders\YOLOv11_double_backbone\YOLOv11_double_backbone\runs\detect\train5\weights\best.pt"

    test_path = [r"E:\Part_time_job_orders\YOLOv8_double_backbone\LLVIP\LLVIP\images\train\010001.jpg",r"E:\Part_time_job_orders\YOLOv8_double_backbone\LLVIP\LLVIP\images\train2\010001.jpg"]
                 # ["/home/shengtuo/tangfan/Crack_detection/data/RDD2022_China/images/val/China_MotorBike_001974.jpg","/home/shengtuo/tangfan/Crack_detection/data/RDD2022_China/images/val/China_MotorBike_001974.jpg"]]
    # Load a model
    # model = YOLO('yolov8n.pt')  # load an official model
    model = YOLO(pth_path)  # load a custom model

    # Predict with the model
    results = model(test_path ,imgsz=640, save=True, conf=0.5)  # predict on an image

数据集名称	参考论文	下载链接
MSRS 数据集	PIAFusion: A progressive infrared and visible image fusion network based on illumination aware	https://github.com/Linfeng-Tang/MSRS
RoadScene 数据集	U2Fusion: A Unified Unsupervised Image Fusion Network	https://github.com/hanna-xu/RoadScene
M3FD 数据集	Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection	https://github.com/JinyuanLiu-CV/TarDAL
TNO 数据集	The TNO Multiband Image Data Collection	https://figshare.com/articles/dataset/TNO_Image_Fusion_Dataset/1008029
LLVIP 数据集	LLVIP: A Visible-infrared Paired Dataset for Low-light Vision	https://github.com/bupt-ai-cz/LLVIP
FMB 数据集	Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation	https://github.com/JinyuanLiu-CV/SegMiF
FLIR 数据集	FREE Teledyne FLIR Thermal Dataset for Algorithm Training（参考博客）	https://www.flir.com/oem/adas/adas-dataset-form/
MFNet 数据集	MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes	https://www.mi.t.u-tokyo.ac.jp/static/projects/mil_multispectral/
KAIST 数据集	Multispectral Pedestrian Detection: Benchmark Dataset and Baseline（KAIST Multispectral Pedestrian Detection Benchmark）	https://github.com/SoonminHwang/rgbt-ped-detection
PST900 数据集	PST900: RGB-Thermal Calibration, Dataset and Segmentation Network	https://github.com/ShreyasSkandanS/pst900_thermal_rgb
HDO 数据集	RCVS: Round-the-Clock Video Stream registration and fusion scheme	https://github.com/xiehousheng/HDO
MMVS 数据集	VRFF: Video Registration and Fusion Framework	https://github.com/Meng-Sang/VRFF；https://github.com/Meng-Sang/MMVS
DVTOD 数据集	Misaligned Visible-Thermal Object Detection: A Drone-based Benchmark and Baseline	https://github.com/VDT-2048/DVTOD
AWMM-100k 数据集	All-weather Multi-Modality Image Fusion: Unified Framework and 100k Benchmark	https://github.com/ixilai/AWFusion
EMS 数据集	Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion	https://github.com/XunpengYi/Text-IF
INO 数据集	Video Analytics Dataset（参考博客）	https://www.ino.ca/en/technologies/video-analytics-dataset/videos/
M3SVD 数据集	VideoFusion: A Spatio-Temporal Collaborative Network for Mutli-modal Video Fusion and Restoration	https://github.com/Linfeng-Tang/M2VD
VF-Bench 数据集	A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking (NeurIPS 25)	https://github.com/Zhaozixiang1228/VF-Bench
Airsim-VID 数据集	VIFNet: An End-to-end Visible-Infrared Fusion Network for Image Dehazing	https://github.com/mengyu212/VIFNet_dehazing
DeMMI-RF 数据集	Task-Gated Multi-Expert Collaboration Network for Degraded Multi-Modal Image Fusion	https://github.com/LeeX54946/TG-ECNet
AVMS 数据集	FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution	https://github.com/XylonXu01/FS-Diff
RGBT550K 数据集	M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision	https://github.com/CalayZhou/M-SpecGene
RGBT-Tiny 数据集	Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines	https://github.com/XinyiYing/RGBT-Tiny
IVT 数据集	TextFusion: Unveiling the power of textual semantics for controllable image fusion	https://github.com/AWCXV/TextFusion
RGB-NIR Scene Dataset 数据集	Multi-spectral SIFT for Scene Category Recognition (CVPR2011)	https://www.epfl.ch/labs/ivrl/research/downloads/rgb-nir-scene-dataset/
RGBD1K 数据集	RGBD1K: A Large-Scale Dataset and Benchmark for RGB-D Object Tracking (AAAI 23)	https://github.com/xuefeng-zhu5/RGBD1K
UniMod1K 数据集	UniMod1K: Towards a More Universal Large-Scale Dataset and Benchmark for Multi-modal Learning (IJCV 24)	https://github.com/xuefeng-zhu5/UniMod1K
IVSD 数据集	Multispectral Image Stitching via Global-Aware Quadrature Pyramid Regression (TIP 24)	https://github.com/Jzy2017/MSGA
Havard 多模态医学图像数据集	无明确对应论文（The Whole Brain Atlas）	https://www.med.harvard.edu/aanlib/home.html；https://github.com/yidamyth/Havard-Medical-Image-Fusion-Datasets
VIFB 数据集基准	VIFB: A Visible and Infrared Image Fusion Benchmark (2020 CVPR Workshops)	https://github.com/xingchenzhang/VIFB
GTPLD 电力线数据集	Ground Truth of Power Line Dataset (Infrared-IR and Visible Light-VL)	https://www.selectdataset.com/dataset/4371a791b8e4090388449273a001dab8