1. BaseInfo
| Title | UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces |
| Address | https://arxiv.org/abs/2312.15715 |
| Journal/Time | ICCV2023 |
| Author | HKU, ByteDance |
| Code | https://github.com/FoundationVision/UniRef |
| Read | 20241002 |
2. Creative Q&A
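- Referring image segmentation (RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and video object segmentation (VOS): a single unified framework that combines these four tasks. It is an upgraded version of UniRef. (VOS / FSS take image + mask; RVOS takes image + mask + text.)
- UniFusion: the module that handles the different tasks.
- A unified Transformer architecture for instance-level segmentation.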
3. Concrete

Visual and reference information are fused hierarchically, at multiple feature levels.
3.1. Model
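UniFusion uses the visual features as Q and the reference features as K and V.
The reference features are pooled and regressed to produce scale, shift, and gate parameters.
The output of the multi-head cross-attention is modulated with these parameters and then added back through a residual connection.
FlashAttention is used for the cross-attention, which is faster and uses less memory when computing on dense feature maps. Inspired by the adaLN-Zero block, the shift and gate parameters are zero-initialized.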
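The fusion step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single attention head and plain softmax attention instead of FlashAttention, the class name `UniFusionSketch` and all weight names are hypothetical, the exact way scale, shift, and gate are combined is assumed, and the whole regression head is zero-initialized for simplicity (so scale also starts at zero).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class UniFusionSketch:
    """Single-head illustration of reference-to-visual fusion (hypothetical)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Projection matrices for Q (visual) and K, V (reference).
        self.w_q = rng.normal(0.0, 0.02, size=(dim, dim))
        self.w_k = rng.normal(0.0, 0.02, size=(dim, dim))
        self.w_v = rng.normal(0.0, 0.02, size=(dim, dim))
        # Regression head: pooled reference feature -> (scale, shift, gate).
        # Assumption: zero-initialized, so the gate starts at 0.
        self.w_mod = np.zeros((dim, 3 * dim))
        self.b_mod = np.zeros(3 * dim)

    def __call__(self, visual, reference):
        # visual: (N_v, dim) flattened feature map; reference: (N_r, dim).
        q = visual @ self.w_q
        k = reference @ self.w_k
        v = reference @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
        fused = attn @ v  # cross-attention output, shape (N_v, dim)

        # Pool the reference tokens, then regress the modulation parameters.
        pooled = reference.mean(axis=0)
        scale, shift, gate = np.split(pooled @ self.w_mod + self.b_mod, 3)

        # Modulate the attention output and add it back as a residual.
        return visual + gate * (fused * (1.0 + scale) + shift)
```

With a zero gate the module is an identity mapping on the visual features at initialization, so inserting it into a pretrained backbone does not disturb the existing features until training moves the gate away from zero.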
3.1.1. Input
Image + text + mask
3.1.2. Backbone
ResNet-50 / Swin-L
BERT-base / RoBERTa-base
3.1.3. Neck
3.1.4. Decoder
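Deformable DETR (predicts the dynamic convolution kernel parameters for the mask) + DynamicConv. An FPN-like structure that uses only the features from levels 2, 3, and 4. The final feature map is H/4 × W/4 × C.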
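To make the dynamic-convolution idea concrete, here is a small sketch under simplifying assumptions: each query contributes a single 1×1 kernel of length C, which is applied to the H/4 × W/4 × C feature map to produce one mask logit map per query. The function name `dynamic_mask_logits` is hypothetical, and the real mask head may stack several dynamic layers.

```python
import numpy as np


def dynamic_mask_logits(mask_features, kernels):
    """Apply per-query 1x1 dynamic kernels to a shared mask feature map.

    mask_features: (H4, W4, C) array, the H/4 x W/4 x C feature map.
    kernels:       (N, C) array, one predicted kernel per query.
    Returns:       (N, H4, W4) array of mask logits.
    """
    # A 1x1 convolution is a dot product over the channel axis at each pixel.
    return np.einsum("hwc,nc->nhw", mask_features, kernels)


def logits_to_masks(logits):
    """Binarize logits; a logit above 0 means sigmoid(logit) > 0.5."""
    return logits > 0.0
```

Because the kernels come from the decoder queries while the feature map is shared, adding more queries costs only one extra dot product per pixel.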
3.1.5. Loss

The loss weights are set to λcls = 2.0, λL1 = 5.0, λmask = 2.0, and λdice = 5.0.
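As a small illustration, the total loss is a weighted sum of the individual terms. The sketch below uses only the four distinct weights named in these notes; the dictionary keys and the function name `total_loss` are hypothetical, and the individual loss values are assumed to be computed elsewhere.

```python
# Loss weights as recorded in these notes (key names are hypothetical).
LOSS_WEIGHTS = {"cls": 2.0, "l1": 5.0, "mask": 2.0, "dice": 5.0}


def total_loss(losses, weights=LOSS_WEIGHTS):
    """Return the weighted sum of the individual loss terms.

    losses: dict mapping each loss name to its scalar value.
    """
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)
```

For example, `total_loss({"cls": 0.5, "l1": 0.1, "mask": 0.2, "dice": 0.3})` combines the four terms with the weights above.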
3.2. Training
3.2.1. Resource
NVIDIA A100 GPUs.
4 × 8 A100 GPUs for the Objects365 pretraining and 2 × 8 GPUs for the subsequent image-level and video-level training.
3.2.2. Dataset
(i) RIS: RefCOCO [122] consists of 142,209 language descriptions for 50,000 objects in 19,994 images. RefCOCO+ [122] has 141,564 expressions for 49,856 objects in 19,992 images. RefCOCOg [67] includes 85,474 referring expressions for 54,822 objects in 26,711 images. We use the UMD split for RefCOCOg [67].
(ii) FSS: FSS-1000 [50] is a large-scale dataset for the FSS task. It contains 10,000 images from 1,000 classes.
(iii) RVOS: Ref-Youtube-VOS [82] is a large-scale referring video object segmentation dataset that contains 3,978 videos with around 15k language descriptions. Ref-DAVIS17 [42] provides referring expressions for each object in DAVIS17 [75]. It contains 90 videos in total.
(iv) VOS: Youtube-VOS [109] is the popular benchmark for video object segmentation. There are 474 and 507 videos in the validation sets of the 2018 and 2019 versions, respectively. LVOS [31] is a long-term video object segmentation benchmark consisting of 220 videos. Videos in LVOS have an average duration of 1.59 minutes, whereas the videos in Youtube-VOS last 6 seconds on average. MOSE [21] is a newly proposed dataset for evaluating VOS algorithms in complex scenes, such as occlusion and disappearance. It has 2,149 video clips and 5,200 objects from 36 categories, with a total of 431,725 annotated masks.
3.3. Eval
Table 1: This does not seem like a fair comparison, since the compared methods use different encoders.
3.4. Ablation

UniFusion works as a plug-and-play component in SAM.
4. Reference
5. Additional
Includes an appendix.
The appendix gives the detailed experimental configuration.