How Well Can Vision Language Models See Image Details?
Large Language Model-based Vision-Language Models (LLM-based VLMs) have demonstrated impressive results in various vision-language understanding tasks. However, how well these VLMs can see image details beyond the semantic level remains unclear. In our study, we introduce a pixel value prediction (PVP) task to explore "How Well Can Vision Language Models See Image Details?" and to help VLMs perceive more details. Typically, these models comprise a frozen CLIP visual encoder, a large language model, and a connection module. After fine-tuning VLMs on the PVP task, we find: 1) existing VLMs struggle to predict precise pixel values when only the connection module and the LLM are fine-tuned; and 2) prediction precision improves significantly when the vision encoder is also adapted. Additionally, our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks, together with vision encoder adaptation, markedly boosts VLM performance on downstream image-language understanding tasks that require detailed image perception, such as referring image segmentation (with an average +10.19 cIoU improvement) and video game decision making (with average score improvements of +80.34 and +70.54 on two games, respectively).
Fig. 1: Method. a) shows our findings: using the original CLIP vision features, VLMs can only reconstruct a blurry contour without many visual details. The reconstruction result can be improved by adapting the vision encoder. The reconstructed image is generated by querying pixel values with pixel locations, as shown in (b). For clarity, the connection module between the ViT and the LLM is omitted. b) shows that we incorporate pixel prediction as a pre-training task for the VLM. c) illustrates some downstream tasks performed by the VLM, which require both understanding of visual details and language information. Our pre-training improves VLM performance on these tasks.
1. Pixel value prediction (PVP) task
Output: the RGB pixel value [r, g, b] at that coordinate.
2. Pixel reconstruction pre-training
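The query-and-answer structure of the PVP task can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names `make_pvp_sample` and `reconstruct`, and the nested-list image layout are all assumptions made for the example. The only parts taken from the text are that the model is queried with a pixel location, answers with `[r, g, b]`, and that an image is reconstructed by querying pixel locations.

```python
def make_pvp_sample(image, x, y):
    """Build one (question, answer) text pair for pixel value prediction.

    image: list of rows, each row a list of (r, g, b) tuples.
    The prompt wording below is hypothetical, not taken from the paper.
    """
    r, g, b = image[y][x]
    question = f"What is the pixel value at location ({x}, {y})?"
    answer = f"[{r}, {g}, {b}]"
    return question, answer


def reconstruct(predict_pixel, width, height):
    """Rebuild an image by querying a predictor at every pixel location.

    predict_pixel: any callable (x, y) -> (r, g, b); in practice this would
    wrap a VLM call, here it is left abstract.
    """
    return [[predict_pixel(x, y) for x in range(width)] for y in range(height)]
```

With a perfect predictor, `reconstruct` returns the original image; the blurrier reconstructions reported for the frozen-CLIP baseline correspond to a predictor whose answers deviate from the true values.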
Fig. 2: Examples of Game Playing by VLM. The input to the VLM is the stacked images and the game instructions. The first row shows an example of playing Carracing. The second row shows the SpaceInvaders game. The number of stacked frames depends on the expert model we used. For example, Carracing uses two frames and SpaceInvaders uses four.
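A minimal sketch of the frame stacking described in the caption is given below. The class name `FrameStacker`, the `FRAMES_PER_GAME` table, and the choice to pad the stack by repeating the first frame at reset are assumptions for illustration; only the frame counts (two for Carracing, four for SpaceInvaders) come from the caption.

```python
from collections import deque

# Frame counts stated in the caption; the dictionary itself is illustrative.
FRAMES_PER_GAME = {"Carracing": 2, "SpaceInvaders": 4}


class FrameStacker:
    """Keep the most recent num_frames observations as one stacked input."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Assumption: pad the stack by repeating the first frame of an episode.
        self.frames.clear()
        for _ in range(self.num_frames):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return list(self.frames)
```

For example, `FrameStacker(FRAMES_PER_GAME["SpaceInvaders"])` always returns four frames per step, oldest first, which would then be passed to the VLM together with the game instructions.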
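1. Performance evaluation on the PVP task
2. Performance improvements on downstream tasks
Video game decision making: The researchers also evaluated VLMs on two video games, Car Racing and Space Invaders. The results show that PAE-LaMM improved the average score on these two games by 80.34 and 70.54, respectively.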
Fig. 3: Qualitative results of reconstruction. (a) and (d) are the ground truth for reconstruction. (b) and (e) are the images reconstructed by our method. (c) and (f) are the baseline results without CLIP-ViT adaptation. Compared with the baseline, our method reconstructs images with more details. The average reconstruction errors of our method and the baseline on these 10 images are 6.67 and 24.56, respectively.
Fig. 4: Qualitative results of Referring Image Segmentation. We first use the referring localization ability of the fine-tuned model to generate a bounding box (bbox) for the referring object, and then predict the segmentation mask inside the bbox.
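The two-stage procedure in this caption ends with a mask that is defined only inside the bounding box, so it has to be placed back into full-image coordinates. The sketch below shows one way to do that; the function name `paste_mask`, the `(x0, y0, x1, y1)` box convention with exclusive upper bounds, and the nested-list mask format are assumptions, and the prediction step itself is not modeled.

```python
def paste_mask(local_mask, bbox, width, height):
    """Place a mask predicted inside a bbox into a full-size binary mask.

    bbox: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive (assumed
    convention). local_mask: (y1 - y0) rows of (x1 - x0) values in {0, 1}.
    Pixels outside the bbox are set to 0.
    """
    x0, y0, x1, y1 = bbox
    if len(local_mask) != y1 - y0 or any(len(row) != x1 - x0 for row in local_mask):
        raise ValueError("local_mask shape does not match bbox size")
    full = [[0] * width for _ in range(height)]
    for dy, row in enumerate(local_mask):
        for dx, value in enumerate(row):
            full[y0 + dy][x0 + dx] = value
    return full
```

Restricting the mask to the box means any localization error in the first stage directly caps the achievable segmentation quality, which is one reason the referring localization step matters here.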
Fig. 5: Qualitative results of Carracing. We show the game observation from different models, including the expert Reinforcement Learning (RL) Model, Baseline Model, and Our Method, all playing under the same game seed. These images depict how each model behaves when controlling the car and approaching the same corner.