X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks (Paper Notes)

Nick Blog, 2023-09-12 9:45

Title: X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

Code

1. Motivation

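Methods such as CLIP align vision and text only at the image level.

Other methods use a pre-trained object detector to align vision and text at the object level, but they encode only the features inside each object and cannot effectively express the contextual relations among multiple objects.

This paper targets vision-language alignment pre-training at multiple granularities (objects, regions, and images).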

2. Model architecture

3. Loss functions

3.1 contrastive loss

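The similarity between visual features and text features is defined as:

$$s(V, T) = g_v(v_{cls})^\top g_w(w_{cls})$$

where $g_v$ and $g_w$ map the [CLS] embeddings of the vision and text encoders to normalized lower-dimensional representations.

vision-to-text similarity

$$p^{v2t}(V) = \frac{\exp(s(V, T)/\tau)}{\sum_{i=1}^{N} \exp(s(V, T^i)/\tau)}$$

text-to-vision similarity

$$p^{t2v}(T) = \frac{\exp(s(V, T)/\tau)}{\sum_{i=1}^{N} \exp(s(V^i, T)/\tau)}$$

Ground truth: one-hot vectors $y^{v2t}(V)$ and $y^{t2v}(T)$.
cross-entropy loss

$$\mathcal{L}_{cl} = \frac{1}{2}\,\mathbb{E}_{(V,T)\sim D}\left[\mathrm{H}(y^{v2t}(V), p^{v2t}(V)) + \mathrm{H}(y^{t2v}(T), p^{t2v}(T))\right]$$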
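
As a rough illustration of the in-batch contrastive objective described above (softmax over similarities in both directions, one-hot targets, cross-entropy), here is a minimal pure-Python sketch. It assumes the matched pairs sit on the diagonal of the similarity matrix. The function names and the temperature value of 0.07 are illustrative assumptions, not taken from the paper's code, and a real implementation would use a tensor library rather than lists.

```python
import math


def softmax(row, temperature):
    """Numerically stable softmax of a list of similarities."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric in-batch contrastive loss.

    sim[i][j] is the similarity between visual concept i and text j.
    Assumption: the matched pair for index i is (i, i), so the one-hot
    ground truth picks the diagonal entry of each softmax.
    """
    n = len(sim)
    # vision-to-text: normalize each row over all texts in the batch
    v2t = [softmax(sim[i], temperature) for i in range(n)]
    # text-to-vision: normalize each column over all visual concepts
    t2v = [softmax([sim[i][j] for i in range(n)], temperature) for j in range(n)]
    # cross-entropy against a one-hot target reduces to -log p(correct)
    loss_v2t = -sum(math.log(v2t[i][i]) for i in range(n)) / n
    loss_t2v = -sum(math.log(t2v[j][j]) for j in range(n)) / n
    return 0.5 * (loss_v2t + loss_t2v)
```

With a similarity matrix that carries no information (all entries equal), the loss equals the log of the batch size; it shrinks as the diagonal entries come to dominate their rows and columns.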

3.2 matching loss

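For each visual concept in a mini-batch, we sample an in-batch hard negative text by following $p^{v2t}(V)$ (the closer a text is to the current visual feature, the more likely it is to be sampled).
We also sample one hard negative visual concept for each text.
We put the pairs as inputs for the fusion module, and then use $x_{cls}$, the output [CLS] embedding of the fusion module, to predict the matching probability $p^{match}$, and the loss is:

$$\mathcal{L}_{match} = \mathbb{E}_{(V,T)\sim D}\,\mathrm{H}(y^{match}, p^{match}(V, T))$$

where $y^{match}$ is the one-hot ground-truth label indicating whether the pair matches.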
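
The hard-negative sampling step can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes the positive text for visual concept i is at index i, takes a precomputed matrix of vision-to-text probabilities, and uses hypothetical helper names (`sample_hard_negative_texts`, `matching_loss`). The fusion module that produces the matching probability is not modeled; the loss function just takes predicted probabilities and binary labels.

```python
import math
import random


def sample_hard_negative_texts(p_v2t, rng):
    """Pick one negative text index per visual concept.

    p_v2t[i][j] is the vision-to-text probability of text j given
    visual concept i. The positive (assumed to be j == i) gets zero
    weight, so texts more similar to the visual concept are sampled
    more often as hard negatives.
    """
    negatives = []
    for i, row in enumerate(p_v2t):
        weights = [0.0 if j == i else p for j, p in enumerate(row)]
        negatives.append(rng.choices(range(len(row)), weights=weights, k=1)[0])
    return negatives


def matching_loss(p_match, labels):
    """Binary cross-entropy between predicted match probabilities and 0/1 labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(p_match, labels):
        total += -math.log(max(p if y == 1 else 1.0 - p, eps))
    return total / len(labels)
```

Sampling hard negative visual concepts for each text works the same way on the transposed (text-to-vision) probabilities. The positives and sampled negatives are then scored together, with label 1 for matched pairs and 0 for the negatives.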

3.3 masked language modeling loss (MLM)

3.4 bbox loss
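
No formula is given for this loss in these notes. A common choice for box regression, and to my understanding the one used in this line of work, combines an L1 term with a generalized IoU (GIoU) term on normalized (cx, cy, w, h) boxes. The sketch below illustrates that combination under those assumptions; the function names are hypothetical and the equal weighting of the two terms is an assumption.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def giou(box_a, box_b):
    """Generalized IoU of two (cx, cy, w, h) boxes with positive area."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # overlap rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # smallest axis-aligned box enclosing both inputs
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose


def bbox_loss(pred, target):
    """L1 distance plus GIoU loss (1 - GIoU) for one predicted box."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1 + (1.0 - giou(pred, target))
```

The GIoU term still gives a useful signal when the predicted and target boxes do not overlap, which a plain IoU loss would not.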
