[VL|Ref]UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces

1. BaseInfo

Title UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
Address https://arxiv.org/abs/2312.15715
Journal/Time ICCV2023
Author The University of Hong Kong (HKU), ByteDance
Code https://github.com/FoundationVision/UniRef
Read 20241002

2. Creative Q&A

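  1. A unified framework that combines four tasks: referring image segmentation (RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and video object segmentation (VOS). It is an upgraded version of UniRef. (The reference for VOS / FSS is image + mask; for RVOS it is image + mask + text.)
  2. UniFusion: the module that handles the different tasks.
  3. A unified Transformer architecture for instance-level segmentation.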

3. Concrete

Visual and reference information are fused hierarchically.

3.1. Model

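UniFusion uses the visual features as Q and the reference features as K and V.

The reference features are pooled and passed through a regression layer, which yields scale, shift, and gate parameters.

The output of the multi-head cross-attention is modulated with these parameters and then added back through a residual connection.

The cross-attention is executed with FlashAttention, which is more efficient and uses less memory when computing on dense feature maps. Inspired by the adaLN-Zero block, the shift and gate parameters are zero-initialized.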
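The fusion step described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it uses a single attention head, omits the learned Q/K/V projections and FlashAttention, and assumes the modulation takes the adaLN-style form `visual + gate * (attended * (1 + scale) + shift)`. The names `unifusion`, `cross_attention`, and `w_mod` are hypothetical, and the whole regressor is zero-initialized here for simplicity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def unifusion(visual, reference, w_mod):
    """Fuse reference tokens into visual tokens.

    visual:    N tokens of size C, used as Q.
    reference: M tokens of size C, used as K and V.
    w_mod:     hypothetical (3*C) x C regressor producing [scale | shift | gate].
    """
    channels = len(visual[0])
    attended = cross_attention(visual, reference, reference)
    # Pool the reference tokens (mean pooling is an assumption).
    pooled = [sum(tok[j] for tok in reference) / len(reference)
              for j in range(channels)]
    # Regress the modulation parameters from the pooled reference feature.
    mod = [sum(w * p for w, p in zip(row, pooled)) for row in w_mod]
    scale = mod[:channels]
    shift = mod[channels:2 * channels]
    gate = mod[2 * channels:]
    # Modulate the attended features, then add them back residually.
    return [[v[j] + gate[j] * (a[j] * (1.0 + scale[j]) + shift[j])
             for j in range(channels)]
            for v, a in zip(visual, attended)]


visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
reference = [[1.0, 2.0], [3.0, 4.0]]
zero_init = [[0.0, 0.0] for _ in range(6)]  # zero-initialized regressor
fused = unifusion(visual, reference, zero_init)
```

With a zero-initialized regressor the gate is zero, so the module starts out as an identity mapping on the visual features. That is the usual motivation for adaLN-Zero-style initialization: the fusion branch is switched on gradually during training.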

3.1.1. Input

Image + text + mask
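As a rough illustration of how the reference input differs per task, the mapping below records which reference modalities each task supplies alongside the image. The FSS, VOS, and RVOS entries follow the notes above; the RIS entry (text only) is an assumption, and `Reference`, `TASK_REFERENCES`, and `check_reference` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reference:
    """Reference information accompanying the input image."""
    text: Optional[str] = None   # language expression
    mask: Optional[list] = None  # mask annotated on a reference image or frame


TASK_REFERENCES = {
    "RIS": {"text"},           # assumption: not spelled out in these notes
    "FSS": {"mask"},
    "VOS": {"mask"},
    "RVOS": {"text", "mask"},
}


def check_reference(task, ref):
    """Return the modalities a task needs, or raise if one is missing."""
    provided = {name for name in ("text", "mask")
                if getattr(ref, name) is not None}
    missing = TASK_REFERENCES[task] - provided
    if missing:
        raise ValueError(f"{task} is missing reference input: {sorted(missing)}")
    return sorted(TASK_REFERENCES[task])


needed = check_reference("RVOS", Reference(text="the dog on the left", mask=[[0, 1]]))
```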

3.1.2. Backbone

ResNet-50 / Swin-L

BERT-base / RoBERTa-base

3.1.3. Neck

3.1.4. Decoder

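Deformable DETR (which predicts the dynamic convolution kernel parameters for the masks) + DynamicConv. The structure is FPN-like and uses only the features from levels 2, 3, and 4. The final mask feature map has shape H/4 x W/4 x C.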
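A minimal sketch of the dynamic-convolution idea: the decoder predicts a kernel per object query, and that kernel is applied to the H/4 x W/4 x C mask feature map to produce mask logits. It is simplified to a single 1x1 dynamic convolution, whereas real mask heads typically stack several layers; `dynamic_mask_logits` and `mask_feature_shape` are hypothetical helper names.

```python
def mask_feature_shape(height, width, channels):
    """Shape of the mask feature map for an input of size height x width."""
    return (height // 4, width // 4, channels)


def dynamic_mask_logits(features, kernel, bias=0.0):
    """Apply a per-query 1x1 dynamic convolution.

    features: nested list of shape H x W x C (the mask feature map).
    kernel:   C weights predicted for one object query.
    """
    return [[sum(k * f for k, f in zip(kernel, pixel)) + bias for pixel in row]
            for row in features]


# Toy 2 x 2 feature map with C = 2 channels.
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[2.0, 2.0], [3.0, 1.0]]]
kernel = [1.0, -1.0]  # stands in for the kernel predicted by the decoder
logits = dynamic_mask_logits(features, kernel)
binary_mask = [[value > 0 for value in row] for row in logits]
```

Because the kernel is produced per query, each query yields its own mask from the same shared feature map.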

3.1.5. Loss

The loss weights are set to λcls = 2.0, λL1 = 5.0, λmask = 2.0, and λdice = 5.0.
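The weights combine the individual loss terms into one training objective as a weighted sum. A small sketch, assuming each term has already been computed as a scalar; the dictionary keys and the example loss values are hypothetical.

```python
# Loss weights as listed in the notes above.
LOSS_WEIGHTS = {"cls": 2.0, "l1": 5.0, "mask": 2.0, "dice": 5.0}


def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the individual loss terms."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Hypothetical per-term values for one training step.
example = total_loss({"cls": 1.0, "l1": 1.0, "mask": 1.0, "dice": 1.0})
```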

3.2. Training

3.2.1. Resource

NVIDIA A100 GPUs.

4 × 8 A100 GPUs for the Objects365 pretraining and 2 × 8 GPUs for the subsequent image-level and video-level training.

3.2.2. Dataset

(i) RIS: RefCOCO [122] consists of 142,209 language descriptions for 50,000 objects in 19,994 images. RefCOCO+ [122] has 141,564 expressions for 49,856 objects in 19,992 images. RefCOCOg [67] includes 85,474 referring expressions for 54,822 objects in 26,711 images. The UMD split is used for RefCOCOg [67].

(ii) FSS: FSS-1000 [50] is a large-scale dataset for the FSS task. It contains 10,000 images from 1,000 classes.

(iii) RVOS: Ref-Youtube-VOS [82] is a large-scale referring video object segmentation dataset that contains 3,978 videos with around 15k language descriptions. Ref-DAVIS17 [42] provides referring expressions for each object in DAVIS17 [75]. It contains 90 videos in total.

(iv) VOS: Youtube-VOS [109] is the popular benchmark for video object segmentation. There are 474 and 507 videos in the validation sets of the 2018 and 2019 versions, respectively. LVOS [31] is a long-term video object segmentation benchmark consisting of 220 videos. The videos in LVOS have an average duration of 1.59 minutes, whereas the videos in Youtube-VOS last 6 seconds on average. MOSE [21] is a newly proposed dataset for evaluating VOS algorithms in complex scenes, such as occlusion and disappearance. It has 2,149 video clips and 5,200 objects from 36 categories, with a total of 431,725 annotated masks.

3.3. Eval

Table 1: The comparison does not seem entirely fair, since the compared methods use different encoders.

3.4. Ablation

UniFusion can be used as a plug-and-play component in SAM.

4. Reference

5. Additional

The paper includes an appendix.

The appendix gives the detailed experimental configurations.
