[TPAMI 2024]Vision-Language Models for Vision Tasks: A Survey

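Paper URL: Vision-Language Models for Vision Tasks: A Survey | IEEE Journals & Magazine | IEEE Xplore

Paper GitHub page: GitHub - jingyi0000/VLM_survey: Collection of AWESOME vision-language models for vision tasks

The English below was typed entirely by hand! It is a summary and paraphrase of the original paper. Some spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution.

To be continued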

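1. Thoughts

(1) Taking it easy again. It has also been a long time since I last read a TPAMI paper, and I have always thought highly of TPAMI's quality, so here is a respectful read-through.

2. Close reading of the paper, section by section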

2.1. Abstract

①Existing problem: training a separate DNN for each visual task is laborious and time-consuming

②Content: a) background of VLMs in visual tasks, b) foundations of VLMs, c) datasets, d) pre-training, transfer learning and knowledge distillation methods for VLMs, e) benchmarks, f) challenges

laborious adj. requiring a lot of effort; arduous

2.2. Introduction

①New paradigm: Pre-training (on large-scale data with or without labels), Fine-tuning (on task-specific labelled training data), and Prediction, see (a) and (b):

②Vision-Language Model Pre-training and Zero-shot Prediction, which does not need fine-tuning:

③Number of VLM publications on Google Scholar:

frisbee n. a flying disc (used in throwing games)

2.3. Background

2.3.1. Training Paradigms for Visual Recognition

(1)Traditional Machine Learning and Prediction

①Mostly hand-crafted and lightweight, but hard to cope with complex tasks or multiple tasks

②Poor scalability

(2)Deep Learning From Scratch and Prediction

①Slow convergence when training from scratch

②A large amount of labelled data is needed

(3)Supervised Pre-Training, Fine-Tuning and Prediction

①Speed up convergence

(4)Unsupervised Pre-Training, Fine-Tuning & Prediction

①Does not require labelled data

②Better performance due to learning from more samples

(5)VLM Pre-Training and Zero-Shot Prediction

①Discarding fine-tuning

②Future directions: a) large scale informative image-text data, b) high-capacity models, c) new pre-training objectives
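The zero-shot prediction step of paradigm (5) can be pictured as matching an image embedding against the text embeddings of the candidate class names and picking the closest one. The sketch below is a minimal illustration under that assumption, not code from the survey; `cosine`, `zero_shot_predict` and the toy embeddings are hypothetical, and in practice the embeddings would come from pre-trained image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def zero_shot_predict(image_embedding, class_text_embeddings):
    """Return the class name whose text embedding is most similar to the image.

    class_text_embeddings maps a class name to the embedding of its text
    prompt; no task-specific fine-tuning is involved.
    """
    return max(
        class_text_embeddings,
        key=lambda name: cosine(image_embedding, class_text_embeddings[name]),
    )

# Toy 2-D embeddings standing in for real encoder outputs (assumed values).
text_embeddings = {"dog": [1.0, 0.1], "cat": [0.1, 1.0]}
prediction = zero_shot_predict([0.9, 0.2], text_embeddings)
```

Because the class set is given only through text, swapping in new class names changes the classifier without any retraining.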

2.3.2. Development of VLMs for Visual Recognition

①3 improvements to VLMs:

2.3.3. Relevant Surveys

①Framework of their review:

2.4. VLM Foundations

2.4.1. Network Architectures

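①Number of image-text pairs: $N$

②Image-text pairs from which features are extracted: $\mathcal{D} = \{x_n^I, x_n^T\}_{n=1}^{N}$, where $x$ with superscript $I$ denotes an image sample and $x$ with superscript $T$ denotes its paired text

③Image encoder and text encoder in DNN: $f_\theta$ / $f_\phi$

④Encoding operation: $z_n^I = f_\theta(x_n^I)$ and $z_n^T = f_\phi(x_n^T)$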

(1)Architectures for Learning Image Features

①CNN-based architectures: such as VGG, ResNet and EfficientNet

②Transformer-based architectures: such as ViT

(2)Architectures for Learning Language Features

①The framework of the standard Transformer: 6 blocks in the encoder (each with a multi-head attention layer and an MLP) and 6 blocks in the decoder (each with a multi-head attention layer, a masked multi-head attention layer and an MLP)
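The core operation inside each multi-head attention layer is scaled dot-product attention. The following is a minimal single-head sketch in plain Python, assuming queries, keys and values are given as lists of vectors; it leaves out the learned projections, the multiple heads and the masking, and the function name is hypothetical.

```python
import math

def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V on lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        # Softmax with max subtraction for numerical stability.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Two identical keys receive equal weight, so the output is the mean value.
attended = scaled_dot_product_attention(
    [[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]]
)
```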

2.4.2. VLM Pre-Training Objectives

(1)Contrastive Objectives

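①Image Contrastive Learning: pull the query close to its positive key and push it far away from the negative keys in the embedding space. For a batch of $B$ images (the authors phrase this in a particular way, "for a batch of size $B$", which is close to how code is written; conceptually you can simply read it as there being $B$ samples in total), the loss is usually:

$$\mathcal{L}_{I}^{\mathrm{InfoNCE}} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(z_i^I \cdot z_+^I/\tau)}{\sum_{j=1,\, j\neq i}^{B+1}\exp(z_i^I \cdot z_j^I/\tau)}$$

where $z_i^I$ denotes the query embedding, $\{z_j^I\}_{j=1,\, j\neq i}^{B+1}$ denotes the key embeddings, $z_+^I$ denotes the positive key of the $i$-th sample, and $\tau$ denotes the temperature hyper-parameter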
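To make the image contrastive objective concrete, here is a minimal InfoNCE-style sketch in plain Python. It assumes each query comes with one positive key and that a shared list of negative keys is available; the function names, the way the keys are organised and the toy vectors are assumptions for illustration, not the survey's implementation.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(queries, positives, negatives, tau=0.1):
    """Average InfoNCE loss over a batch.

    queries[i]   : query embedding of the i-th image
    positives[i] : its positive key (e.g. another view of the same image)
    negatives    : keys used as negatives for every query
    tau          : temperature hyper-parameter
    """
    total = 0.0
    for q, pos in zip(queries, positives):
        keys = [pos] + negatives  # index 0 is the positive key
        logits = [dot(q, k) / tau for k in keys]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[0] - log_denom)
    return total / len(queries)

# A query aligned with its positive key gives a lower loss than a misaligned one.
aligned = info_nce([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]], tau=1.0)
misaligned = info_nce([[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 0.0]], tau=1.0)
```

A smaller `tau` sharpens the softmax, so hard negatives are penalised more strongly.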

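②Image-Text Contrastive Learning: pull paired image and text embeddings close and push unpaired ones away:

$$\mathcal{L}_{I\to T} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(z_i^I \cdot z_i^T/\tau)}{\sum_{j=1}^{B}\exp(z_i^I \cdot z_j^T/\tau)}$$

$$\mathcal{L}_{T\to I} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(z_i^T \cdot z_i^I/\tau)}{\sum_{j=1}^{B}\exp(z_i^T \cdot z_j^I/\tau)}$$

where $\mathcal{L}_{I\to T}$ denotes contrasting the query image with the text keys, $\mathcal{L}_{T\to I}$ denotes contrasting the query text with the image keys, and the total loss is $\mathcal{L}_{\mathrm{InfoNCE}}^{IT} = \mathcal{L}_{I\to T} + \mathcal{L}_{T\to I}$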
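The symmetric image-text objective can be sketched as two softmax cross-entropies over the same batch, one per direction. This is a minimal plain-Python illustration assuming `img_emb[i]` and `txt_emb[i]` form a matched pair; the function names and toy embeddings are hypothetical.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _log_softmax_at(logits, index):
    """log(softmax(logits)[index]), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[index] - log_z

def image_text_contrastive(img_emb, txt_emb, tau=0.07):
    """Image-to-text term plus text-to-image term, each averaged over the batch.

    Every non-matching pairing inside the batch acts as a negative.
    """
    b = len(img_emb)
    loss_i2t = 0.0
    loss_t2i = 0.0
    for i in range(b):
        i2t_logits = [_dot(img_emb[i], txt_emb[j]) / tau for j in range(b)]
        t2i_logits = [_dot(txt_emb[i], img_emb[j]) / tau for j in range(b)]
        loss_i2t -= _log_softmax_at(i2t_logits, i)
        loss_t2i -= _log_softmax_at(t2i_logits, i)
    return (loss_i2t + loss_t2i) / b

images = [[1.0, 0.0], [0.0, 1.0]]
matched = image_text_contrastive(images, [[1.0, 0.0], [0.0, 1.0]], tau=1.0)
shuffled = image_text_contrastive(images, [[0.0, 1.0], [1.0, 0.0]], tau=1.0)
```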

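③Image-Text-Label Contrastive Learning: adds supervised label information to image-text contrastive learning:

$$\mathcal{L}_{I\to T}^{ITL} = -\sum_{i=1}^{B}\frac{1}{|\mathcal{P}(i)|}\sum_{k\in\mathcal{P}(i)}\log\frac{\exp(z_i^I \cdot z_k^T/\tau)}{\sum_{j=1}^{B}\exp(z_i^I \cdot z_j^T/\tau)}$$

$$\mathcal{L}_{T\to I}^{ITL} = -\sum_{i=1}^{B}\frac{1}{|\mathcal{P}(i)|}\sum_{k\in\mathcal{P}(i)}\log\frac{\exp(z_i^T \cdot z_k^I/\tau)}{\sum_{j=1}^{B}\exp(z_i^T \cdot z_j^I/\tau)}$$

where $k \in \mathcal{P}(i) = \{k \mid k \in B,\ y_k = y_i\}$, and $y$ denotes the class label of $(z^I, z^T)$ (this effectively adds one more loop over the samples that share the same class)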
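The label-aware variant can be sketched by letting every text that shares the query image's class label count as a positive. Below is a minimal plain-Python illustration of the image-to-text term only; the function name, the averaging over the positive set and the toy data are assumptions for illustration.

```python
import math

def image_text_label_i2t(img_emb, txt_emb, labels, tau=0.07):
    """Image-to-text term of a label-aware contrastive loss.

    For query image i, every text k with labels[k] == labels[i] is a
    positive, and the log-probabilities are averaged over that set.
    """
    b = len(img_emb)
    loss = 0.0
    for i in range(b):
        logits = [
            sum(x * y for x, y in zip(img_emb[i], txt_emb[j])) / tau
            for j in range(b)
        ]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # The extra inner loop: all samples in the batch with the same label.
        positives = [k for k in range(b) if labels[k] == labels[i]]
        loss -= sum(logits[k] - log_z for k in positives) / len(positives)
    return loss

emb = [[1.0, 0.0], [0.0, 1.0]]
distinct = image_text_label_i2t(emb, emb, labels=[0, 1], tau=1.0)
shared = image_text_label_i2t(emb, emb, labels=[0, 0], tau=1.0)
```

With all labels distinct, each positive set contains only the paired text, and the term reduces to the plain image-to-text contrastive sum.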

(2)Generative Objectives

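①Masked Image Modelling: learns cross-patch correlation by masking a set of patches and reconstructing the image. The loss is usually:

$$\mathcal{L}_{MIM} = -\frac{1}{B}\sum_{i=1}^{B}\log f_\theta(\bar{x}_i^I \mid \hat{x}_i^I)$$

where $\bar{x}_i^I$ denotes the masked patches and $\hat{x}_i^I$ denotes the unmasked patches (What is this "|"? Conditional probability? At first it did not seem to make sense to me: the probability of the masked patches given the unmasked ones felt reversed. Read as conditioning, though, it is the intended direction: the model sees the unmasked patches and is scored on how well it predicts the masked ones.)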
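The masking step and the "predict masked from unmasked" direction can be sketched as follows. This is a minimal plain-Python illustration that swaps the log-likelihood for a mean squared reconstruction error over the masked patches, a common stand-in; the function names, the 75% ratio and the toy patches are assumptions.

```python
import random

def split_patches(num_patches, mask_ratio, seed=0):
    """Randomly choose which patch indices are masked and which stay visible."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return masked, visible

def masked_reconstruction_error(patches, predictions, masked):
    """Mean squared error computed only over the masked patches.

    The model would receive the visible patches as input; only its
    predictions at the masked positions are scored. Assumes at least
    one masked patch.
    """
    total, count = 0.0, 0
    for idx in masked:
        for target, pred in zip(patches[idx], predictions[idx]):
            total += (target - pred) ** 2
            count += 1
    return total / count

masked, visible = split_patches(16, mask_ratio=0.75)
patches = [[float(i), float(i)] for i in range(16)]
perfect = masked_reconstruction_error(patches, patches, masked)
```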

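②Masked Language Modelling: masks text tokens at a specific ratio and predicts them from the unmasked tokens:

$$\mathcal{L}_{MLM} = -\frac{1}{B}\sum_{i=1}^{B}\log f_\phi(\bar{x}_i^T \mid \hat{x}_i^T)$$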
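Masking at a fixed ratio can be sketched in a few lines. The example below assumes a 15% ratio (the value used by BERT) and a `[MASK]` placeholder symbol; the function name and the example sentence are hypothetical, and the model that predicts the masked tokens is left out.

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask symbol.

    Returns the corrupted sequence and a dict {position: original token}
    that serves as the prediction target.
    """
    rng = random.Random(seed)
    num_masked = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), num_masked))
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        corrupted[pos] = mask_token
    return corrupted, targets

tokens = "a photo of a dog catching a frisbee on the grass".split()
corrupted, targets = mask_tokens(tokens)
```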

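③Masked Cross-Modal Modelling: randomly masks a subset of image patches and a subset of text tokens, then reconstructs them from the unmasked ones:

$$\mathcal{L}_{MCM} = -\frac{1}{B}\sum_{i=1}^{B}\log f_\theta(\bar{x}_i^I, \bar{x}_i^T \mid \hat{x}_i^I, \hat{x}_i^T)$$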

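④Image-to-Text Generation: predicts the text token by token from the paired image and the preceding tokens:

$$\mathcal{L}_{ITG} = -\sum_{l=1}^{L}\log f_\theta(x_l^T \mid x_{<l}^T, z^I)$$

where $L$ denotes the number of tokens to be predicted for $x^T$, and $z^I$ is the embedding of the image paired with $x^T$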
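The image-to-text generation objective amounts to summing the negative log-probability of each ground-truth token. The sketch below assumes the per-step distributions have already been produced by some captioning model conditioned on the image embedding and the earlier tokens; `caption_nll` and the toy probabilities are hypothetical.

```python
import math

def caption_nll(step_probs, target_ids):
    """Negative log-likelihood of a caption under teacher forcing.

    step_probs[l] is the model's distribution over the vocabulary at
    step l, already conditioned on the image embedding and the tokens
    before position l. target_ids[l] is the ground-truth token id.
    """
    return -sum(
        math.log(probs[tok]) for probs, tok in zip(step_probs, target_ids)
    )

# Toy example: vocabulary of size 2, caption of length 2.
loss = caption_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```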

(3)Alignment Objectives

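①Image-Text Matching: BCE loss (negative log-likelihood form):

$$\mathcal{L}_{IT} = -\left[\,p\log\mathcal{S}(z^I, z^T) + (1-p)\log\left(1-\mathcal{S}(z^I, z^T)\right)\right]$$

where $\mathcal{S}(\cdot)$ measures the alignment probability between the image and the text, and $p=1$ when the image and text match, otherwise $p=0$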
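A minimal sketch of the matching loss for a single image-text pair, assuming the score function has already produced a probability in (0, 1). The function name and the `eps` clamp are assumptions added for illustration and numerical safety.

```python
import math

def image_text_matching_loss(score, is_match, eps=1e-12):
    """Binary cross-entropy for one image-text pair.

    score    : predicted probability that the pair is aligned
    is_match : 1 if the image and text are a true pair, else 0
    """
    score = min(max(score, eps), 1.0 - eps)  # avoid log(0)
    return -(
        is_match * math.log(score)
        + (1 - is_match) * math.log(1.0 - score)
    )

confident_right = image_text_matching_loss(0.9, 1)
confident_wrong = image_text_matching_loss(0.1, 1)
```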

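②Region-Word Matching: models local cross-modal correlation in dense scenes:

$$\mathcal{L}_{RW} = -\left[\,p\log\mathcal{S}^r(r^I, w^T) + (1-p)\log\left(1-\mathcal{S}^r(r^I, w^T)\right)\right]$$

where $(r^I, w^T)$ denotes a region-word pair, and $p=1$ when the region and word match, otherwise $p=0$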

2.4.3. VLM Pre-Training Frameworks

①two-tower, two-leg and one-tower pre-training approaches:

2.4.4. Evaluation Setups and Downstream Tasks

(1)Zero-Shot Prediction

(2)Linear Probing

2.5. Datasets

2.5.1. Datasets for Pre-Training VLMs

2.5.2. Datasets for VLM Evaluation

2.6. Vision-Language Model Pre-Training

2.6.1. VLM Pre-Training With Contrastive Objectives

(1)Image Contrastive Learning

(2)Image-Text Contrastive Learning

(3)Image-Text-Label Contrastive Learning

(4)Discussion

2.6.2. VLM Pre-Training With Generative Objectives

(1)Masked Image Modelling

(2)Masked Language Modelling

(3)Masked Cross-Modal Modelling

(4)Image-to-Text Generation

(5)Discussion

2.6.3. VLM Pre-Training With Alignment Objectives

(1)Image-Text Matching

(2)Region-Word Matching

(3)Discussion

2.6.4. Summary and Discussion

2.7. VLM Transfer Learning

2.7.1. Motivation of Transfer Learning

2.7.2. Common Setup of Transfer Learning

2.7.3. Common Transfer Learning Methods

(1)Transfer Via Prompt Tuning

(2)Transfer Via Feature Adaptation

(3)Other Transfer Methods

2.7.4. Summary and Discussion

2.8. VLM Knowledge Distillation

2.8.1. Motivation of Distilling Knowledge From VLMs

2.8.2. Common Knowledge Distillation Methods

(1)Knowledge Distillation for Object Detection

(2)Knowledge Distillation for Semantic Segmentation

2.8.3. Summary and Discussion

2.9. Performance Comparison

2.9.1. Performance of VLM Pre-Training

2.9.2. Performance of VLM Transfer Learning

2.9.3. Performance of VLM Knowledge Distillation

2.9.4. Summary

2.10. Future Directions

2.11. Conclusion

3. Supplementary knowledge

4. Reference

Zhang, J. et al. (2024) Vision-Language Models for Vision Tasks: A Survey. TPAMI, 46(8): 5625-5644. doi: 10.1109/TPAMI.2024.3369699
