Text-Guided Image Fusion Methods

TextFusion (Information Fusion 2025, 2023.12)

Semantic

Test√

1. Encode the text with CLIP, project the encoded features to the same dimension as the image features, then apply them as weights on the image features:

x = x_vis + text_features * x_ir
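The weighting step above can be sketched in NumPy as follows. This is a minimal illustration, not the TextFusion implementation: the projection matrix `proj`, the embedding size 512, and the function name `text_weighted_fusion` are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_weighted_fusion(x_vis, x_ir, text_emb, proj):
    """Fuse features as x = x_vis + w * x_ir, with w derived from the text.

    x_vis, x_ir: (B, C, H, W) image features.
    text_emb:    (B, D) text embedding, e.g. from a CLIP text encoder.
    proj:        (D, C) matrix, a hypothetical stand-in for a learned linear layer.
    """
    w = text_emb @ proj          # (B, C): map text to the channel dimension
    w = w[:, :, None, None]      # (B, C, 1, 1): broadcast over H and W
    return x_vis + w * x_ir

B, C, H, W, D = 2, 8, 16, 16, 512
x_vis = rng.standard_normal((B, C, H, W))
x_ir = rng.standard_normal((B, C, H, W))
text_emb = rng.standard_normal((B, D))
proj = rng.standard_normal((D, C)) / np.sqrt(D)

x = text_weighted_fusion(x_vis, x_ir, text_emb, proj)
print(x.shape)  # (2, 8, 16, 16)
```

With a zero text embedding the infrared branch is switched off and the output equals `x_vis`, which shows that the text acts as a per-channel gate on the infrared features.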

2. Use coarse-to-fine association to generate the mask Bf and the heatmap M, which are applied in the loss function.
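The notes do not record how Bf and M enter the loss, so the following is only a generic sketch of a spatially weighted pixel loss. The weighting `1 + mask * heatmap` and the function name are assumptions made for illustration and may differ from the paper's actual loss.

```python
import numpy as np

def weighted_intensity_loss(fused, target, heatmap, mask):
    """Mean absolute error, re-weighted by heatmap values inside the mask.

    All inputs are (H, W) arrays; `mask` is binary and `heatmap` lies in [0, 1].
    The weight 1 + mask * heatmap is an assumption for illustration only.
    """
    weight = 1.0 + mask * heatmap
    return float(np.sum(weight * np.abs(fused - target)) / np.sum(weight))

target = np.zeros((4, 4))
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0            # text-relevant region
heatmap = np.ones((4, 4))     # full confidence everywhere

inside = target.copy()
inside[0, 0] = 1.0            # error inside the masked region
outside = target.copy()
outside[3, 3] = 1.0           # same error outside the masked region

print(weighted_intensity_loss(inside, target, heatmap, mask))   # 0.1
print(weighted_intensity_loss(outside, target, heatmap, mask))  # 0.05
```

The same pixel error costs twice as much inside the text-relevant region, which is the general effect a mask or heatmap has when it is multiplied into a loss term.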

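3. Create the text-guided fused image:

A text-attention-based evaluation metric for image fusion is proposed:

W0 and CT denote the weighting terms of the traditional and the improved evaluation components, respectively. The confidence score CT is computed from the heatmap generated by the association mechanism and is defined as: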

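4. Dataset

The IVT dataset is proposed. Each image comes with five text descriptions, mostly semantic: descriptions 1, 2, and 3 each cover a separate part of the scene, while 4 and 5 are comprehensive descriptions. Train: IVT-LLVIP (10,000 RGBT and text triples). Test: IVT-TNO, IVT-RoadScene, IVT-LLVIP (remaining images).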

IF-FILM: Image Fusion via Vision-Language Model (ICML 2024, 2024.03; by the authors of DDFM, CDDFuse, and an IJCAI paper)

Semantic

Test×

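First, BLIP2, GRIT, and Segment Anything are used to generate an Image Caption, a Dense Caption, and a Semantic Mask for each image. These are fed to ChatGPT to produce a text description, which is then passed through a frozen BLIP2 to obtain text features. Finally, the text features are concatenated and used as the query (q) in cross attention with the image features, and the result is decoded.

Dataset: detailed text descriptions of the visible and infrared images, including the coordinates of each semantic object and the image resolution. The BLIP-encoded text features are also provided.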
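The final cross-attention step can be sketched as a single-head attention in NumPy, with the concatenated text features as the query as described in these notes. The weight matrices, feature sizes, and token counts below are assumptions for illustration and are not taken from the FILM code.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, img_feats, w_q, w_k, w_v):
    """Single-head cross attention with text as query, image as key/value.

    text_feats: (T, D) text tokens; img_feats: (N, D) image tokens.
    Returns the attended features (T, D_h) and the attention map (T, N).
    """
    q = text_feats @ w_q
    k = img_feats @ w_k
    v = img_feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn

rng = np.random.default_rng(0)
D, D_h = 32, 16
# Hypothetical per-source text features, concatenated along the token axis.
text_parts = [rng.standard_normal((4, D)),
              rng.standard_normal((6, D)),
              rng.standard_normal((5, D))]
text_feats = np.concatenate(text_parts, axis=0)   # (15, D)
img_feats = rng.standard_normal((64, D))          # e.g. an 8x8 feature map, flattened
w_q, w_k, w_v = (rng.standard_normal((D, D_h)) / np.sqrt(D) for _ in range(3))

out, attn = cross_attention(text_feats, img_feats, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (15, 16) (15, 64)
```

Each row of `attn` sums to 1, so every text token produces a weighted average of the image tokens it attends to.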

Eg：In this black and white photo of cars on a street at night, the resolution is 384X288. The image captures various objects and activities within the frame. One can see a car in the picture, positioned at [183, 138, 240, 190], as well as a car making a left turn at [25, 142, 109, 188]. Furthermore, a person walking in the rain is also present, depicted at [16, 139, 31, 182]. The photo also includes a large truck, which can be seen at [272, 105, 363, 190] and another car within the frame at [127, 146, 187, 171]. Additionally, the scene shows a car on the street at [232, 148, 257, 177] and a car parked on the street at [369, 154, 381, 190]. Moreover, a person standing can be noticed in the photo, positioned at [107, 139, 118, 170]. Lastly, a tall electrical tower stands tall at [280, 3, 332, 107] and there is also a tall building with several stories at [0, 1, 180, 145]. This image offers a glimpse into a city street at night, showcasing the hustle and bustle of urban life.
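Descriptions in this format can be parsed back into structured data with regular expressions. The sketch below assumes every box is written as four comma-separated integers in square brackets and that the resolution follows the phrase "resolution is"; the function name `parse_description` is introduced here and is not part of the dataset tooling.

```python
import re

BOX_PATTERN = re.compile(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")
RES_PATTERN = re.compile(r"resolution is (\d+)\s*[Xx]\s*(\d+)")

def parse_description(text):
    """Return (resolution, boxes) extracted from one text description."""
    res = RES_PATTERN.search(text)
    resolution = (int(res.group(1)), int(res.group(2))) if res else None
    boxes = [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(text)]
    return resolution, boxes

sample = (
    "In this black and white photo of cars on a street at night, "
    "the resolution is 384X288. One can see a car in the picture, "
    "positioned at [183, 138, 240, 190], as well as a car making "
    "a left turn at [25, 142, 109, 188]."
)
resolution, boxes = parse_description(sample)
print(resolution)  # (384, 288)
print(boxes)       # [(183, 138, 240, 190), (25, 142, 109, 188)]
```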

OCCO: LVM-guided Infrared and Visible Image Fusion Framework based on Object-aware and Contextual COntrastive Learning (arXiv 2025.03, Li Hui)

Semantic

Test×

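Given a text prompt as input, Grounding DINO and SAM are used to extract background and semantic information (shared and unique).

The loss function is then built on the idea of contrastive learning.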
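The notes do not give the exact form of the OCCO loss. As a generic reference for the contrastive idea, the following is a standard InfoNCE-style loss in NumPy, where an anchor feature is pulled toward one positive and pushed away from several negatives. Treating fused-region features as the anchor is an assumption made here for illustration.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss on cosine similarities.

    anchor, positive: (D,) feature vectors; negatives: (K, D).
    """
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positive), normalize(negatives)
    pos = np.dot(a, p) / temperature            # similarity to the positive
    neg = n @ a / temperature                   # similarities to the negatives, (K,)
    logits = np.concatenate([[pos], neg])
    m = logits.max()                            # stabilise the log-sum-exp
    log_sum = m + np.log(np.sum(np.exp(logits - m)))
    return float(log_sum - pos)                 # -log softmax of the positive

anchor = np.array([1.0, 0.0])                   # hypothetical fused-region feature
aligned = info_nce(anchor, np.array([1.0, 0.0]),
                   np.array([[0.0, 1.0], [-1.0, 0.0]]))
misaligned = info_nce(anchor, np.array([0.0, 1.0]),
                      np.array([[1.0, 0.0], [-1.0, 0.0]]))
print(aligned < misaligned)  # True
```

The loss is close to zero when the anchor matches its positive and grows when the anchor is closer to a negative.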

Text-IF (CVPR 2024)

Degradation

Test√

1. Encode the text with CLIP:

text_feature = self.model_clip.encode_text(text)

The encoded features are then converted into weights that interact with the fusion features:

gamma, beta = self.MLP(text_embed).view(batch, -1, 1, 1).chunk(2, dim=1)

x = (1 + gamma) * x + beta
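The modulation above has the form of a feature-wise affine transform. A self-contained NumPy sketch follows; a single linear layer (`w`, `b`) stands in for `self.MLP`, and all shapes are assumptions chosen for illustration.

```python
import numpy as np

def film_modulate(x, text_embed, w, b):
    """Apply x = (1 + gamma) * x + beta, with gamma and beta predicted from text.

    x:          (B, C, H, W) fusion features.
    text_embed: (B, D) text embedding.
    w, b:       (D, 2C) and (2C,), a hypothetical one-layer stand-in for the MLP.
    """
    batch = x.shape[0]
    params = (text_embed @ w + b).reshape(batch, -1, 1, 1)  # (B, 2C, 1, 1)
    gamma, beta = np.split(params, 2, axis=1)               # each (B, C, 1, 1)
    return (1 + gamma) * x + beta

rng = np.random.default_rng(0)
B, C, H, W, D = 2, 4, 8, 8, 512
x = rng.standard_normal((B, C, H, W))
text_embed = rng.standard_normal((B, D))
w = rng.standard_normal((D, 2 * C)) / np.sqrt(D)
b = np.zeros(2 * C)

y = film_modulate(x, text_embed, w, b)
print(y.shape)  # (2, 4, 8, 8)
```

Writing the scale as `1 + gamma` means that a zero MLP output leaves the features unchanged, so the text only has to predict a deviation from the identity.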

2. In the loss function, the weights of the individual loss terms are adjusted to suit the different situations:

if task_type == "low_light":
    loss, ssim_loss, max_loss, color_loss, grad_loss = self.fusion_loss(img_A, img_B, img_fused, max_ratio=8, ssim_ratio=1, text_ratio=10)
else:
    raise ValueError(f"Unknown task type: {task_type}")
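One way to keep such per-task weights in one place is a lookup table. The sketch below is an assumption about how this could be organised, not the Text-IF code: only the `low_light` values are taken from the call above, and `LOSS_RATIOS` and `get_loss_ratios` are hypothetical names.

```python
# Per-task loss weights. Only the low_light entry comes from these notes;
# other task types would need their own values from the original code.
LOSS_RATIOS = {
    "low_light": {"max_ratio": 8, "ssim_ratio": 1, "text_ratio": 10},
}

def get_loss_ratios(task_type):
    """Return a copy of the keyword arguments for the given task type."""
    try:
        return dict(LOSS_RATIOS[task_type])
    except KeyError:
        raise ValueError(f"Unknown task type: {task_type}") from None

print(get_loss_ratios("low_light"))
# {'max_ratio': 8, 'ssim_ratio': 1, 'text_ratio': 10}
```

The returned dictionary can be unpacked into the loss call, for example `self.fusion_loss(img_A, img_B, img_fused, **get_loss_ratios(task_type))`, which replaces the if/else chain with a single line.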

3. Dataset

Visible-infrared image pairs with degradation annotations: EMS (MFNet, RoadScene/FLIR_aligned, and LLVIP).