How Well Can Vision Language Models See Image Details? [Affiliation: Monash University]

Abstract

Large Language Model-based Vision-Language Models (LLM-based VLMs) have demonstrated impressive results in various vision-language understanding tasks. However, how well these VLMs can see image detail beyond the semantic level remains unclear. In our study, we introduce a pixel value prediction task (PVP) to explore "How Well Can Vision Language Models See Image Details?" and to assist VLMs in perceiving more details. Typically, these models comprise a frozen CLIP visual encoder, a large language model, and a connecting module. After fine-tuning VLMs on the PVP task, we find: 1) existing VLMs struggle to predict precise pixel values by only fine-tuning the connection module and LLM; and 2) prediction precision is significantly improved when the vision encoder is also adapted. Additionally, our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks requiring detailed image perception, such as referring image segmentation (with an average +10.19 cIoU improvement) and video game decision making (with average score improvements of +80.34 and +70.54 on two games, respectively).

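Motivation

With the revolutionary progress that large language models (LLMs) have brought to artificial intelligence, LLM-based vision-language models (VLMs) have developed rapidly at the intersection of vision and language. However, how well these VLMs perceive image details beyond the semantic level remains unclear. Specifically, although VLMs perform well on tasks such as visual question answering (VQA) and referring expression comprehension (REC), it is still an open question whether they truly "see" the raw details of an image. This study therefore sets out to probe how well current VLMs perceive image details and to propose a method for strengthening that ability.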

Fig. 1: Method. a) shows our findings: Using the original CLIP vision features, VLMs can only reconstruct a blurry contour without many visual details. The reconstruction result can be improved by adapting the vision encoder. The reconstructed image is generated by querying pixel values with pixel locations, as shown in (b). For better illustration, the connection module between ViT and LLM is ignored. b) shows that we incorporate pixel prediction as a pretraining task for VLM. c) illustrates some downstream tasks performed by VLM, which require both vision detail understanding and language information. Our pretraining improves VLM performance on these tasks.

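Method

To probe how well VLMs perceive image details, the authors propose a pixel value prediction (PVP) task and design a pixel reconstruction pre-training pipeline to strengthen the VLM's understanding of image details. The details are as follows: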

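1. Pixel Value Prediction Task (PVP)

The authors cast pixel value prediction as a visual question answering (VQA) style task. Given the CLIP features of an image and an (x, y) coordinate on that image, the model is asked to predict the RGB pixel value at that coordinate. The input and output are:

Input: the image's CLIP features and a coordinate (x, y).

Output: the RGB pixel value [r, g, b] at that coordinate.

This setup lets the authors measure how well VLMs perceive detail at the pixel level.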
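As a rough illustration of the VQA-style framing, the sketch below builds question/answer pairs from a raw image. The prompt wording, the function names `make_pvp_sample` and `sample_pvp_batch`, and the nested-list image layout are all assumptions made for this example; the source does not give the exact template used by the authors.

```python
import random


def make_pvp_sample(image, x, y):
    """Build one (question, answer) pair for pixel value prediction.

    `image` is a list of rows; each row is a list of (r, g, b) tuples.
    The question template is hypothetical, not the paper's wording.
    """
    height, width = len(image), len(image[0])
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("coordinate outside the image")
    r, g, b = image[y][x]  # row index is y, column index is x
    question = f"What is the pixel value at location ({x}, {y})?"
    answer = f"[{r}, {g}, {b}]"
    return question, answer


def sample_pvp_batch(image, n, seed=0):
    """Draw n random coordinates and turn each into a PVP sample."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    coords = [(rng.randrange(width), rng.randrange(height)) for _ in range(n)]
    return [make_pvp_sample(image, x, y) for x, y in coords]
```

In a real pipeline the image itself would go through the vision encoder, and only the question text (with the coordinate) would be tokenized; the answer string is the supervision target.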

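2. Pixel Reconstruction Pre-training

To strengthen the VLM's understanding of image details, the authors add the PVP task to the VLM pre-training pipeline. Pre-training proceeds in three stages:

Stage 1: Train only the large language model (LLM) and the connection module so that they become familiar with the new pixel reconstruction task.

Stage 2: Continue training the LLM and the connection module while also adapting the vision encoder, to improve the VLM's understanding of visual details.

Stage 3: Freeze the vision encoder and lower the sampling ratio of the pixel reconstruction task, to balance low-level detail and high-level semantics in the vision-language space.

Through these three stages, the authors expect the VLM to understand image details better while retaining its general vision-language knowledge.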
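The three-stage schedule can be written down as a small configuration table. This is only a sketch: the stage names and the sampling ratios (0.5 and 0.1) are placeholder values invented for illustration, and the assumption that the LLM and connection module keep training in stage 3 is ours, since the source only states that the vision encoder is frozen and the PVP sampling ratio is reduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    train_vision_encoder: bool
    train_connector: bool
    train_llm: bool
    pvp_sampling_ratio: float  # hypothetical value, not from the paper


STAGES = [
    # Stage 1: LLM and connector learn the new pixel task; encoder frozen.
    Stage("stage1_warmup", False, True, True, 0.5),
    # Stage 2: the vision encoder is adapted as well.
    Stage("stage2_adapt_encoder", True, True, True, 0.5),
    # Stage 3: encoder frozen again, PVP data sampled less often.
    Stage("stage3_rebalance", False, True, True, 0.1),
]


def trainable_modules(stage):
    """Return the names of the modules that receive gradients in a stage."""
    flags = {
        "vision_encoder": stage.train_vision_encoder,
        "connector": stage.train_connector,
        "llm": stage.train_llm,
    }
    return [name for name, enabled in flags.items() if enabled]
```

A training loop would consult `trainable_modules(stage)` to decide which parameters to unfreeze and use `pvp_sampling_ratio` when mixing PVP samples with the other vision-language data.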

Fig. 2: Examples of Game Playing by VLM. The input to the VLM is the stacked images and the game instructions. The first row shows an example of playing Carracing. The second row shows the SpaceInvaders game. The number of stacked frames depends on the expert model we used. For example, Carracing uses two frames and SpaceInvaders uses four.

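Experiments

To validate the proposed method, the authors run several experiments: an evaluation on the PVP task itself, and measurements of the gains on downstream tasks such as referring image segmentation and video game decision making.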

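1. Performance on the PVP Task

Experimental setup: The authors first evaluate VLMs fine-tuned on the PVP task with only the connection module and LLM updated (vision encoder frozen). They then evaluate VLMs whose vision encoder is also adapted during fine-tuning.

Results: VLMs with only the connection module and LLM fine-tuned can reconstruct only a blurry image contour, whereas VLMs whose vision encoder is also adapted predict pixel values with markedly higher precision.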
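Reconstruction here means querying the model once per pixel location and assembling the answers into an image. The sketch below shows that loop with a stand-in callable `predict_pixel(x, y)` in place of the VLM, and scores the result with a mean absolute error over all channels. Both the callable and the choice of error metric are assumptions for illustration; the source does not say how the reconstruction error is computed.

```python
def reconstruct(predict_pixel, width, height):
    """Query a pixel predictor at every location and assemble an image.

    `predict_pixel(x, y)` is a hypothetical stand-in for the VLM and
    must return an (r, g, b) tuple.
    """
    return [[predict_pixel(x, y) for x in range(width)] for y in range(height)]


def mean_abs_error(pred, target):
    """Mean absolute per-channel difference between two images.

    This metric is an assumption; the paper's error measure may differ.
    """
    total = 0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for pred_px, target_px in zip(pred_row, target_row):
            for p, t in zip(pred_px, target_px):
                total += abs(p - t)
                count += 1
    return total / count
```

A predictor that only captures coarse structure would score a high error under this kind of metric, while one that recovers fine detail would score a low one.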

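2. Gains on Downstream Tasks

Referring image segmentation: In this task, the VLM must accurately locate the object described by a given sentence and produce its segmentation mask. The VLM with pixel reconstruction pre-training (PAE-LaMM) improves markedly on segmentation, with an average gain of +10.19 cIoU.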
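For readers unfamiliar with the metric, the sketch below computes cIoU under the assumption that it denotes cumulative IoU as commonly used in referring segmentation: intersections and unions are summed over the whole dataset before dividing, rather than averaging per-image IoU. The source only names the metric, so this reading and the function name `cumulative_iou` are assumptions.

```python
def cumulative_iou(pred_masks, gt_masks):
    """Cumulative IoU: total intersection over total union across samples.

    Each mask is a list of rows of 0/1 (or False/True) values, and the
    predicted and ground-truth masks of a sample share the same shape.
    """
    intersection = 0
    union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                if p and g:
                    intersection += 1
                if p or g:
                    union += 1
    return intersection / union if union else 0.0
```

Because the sums are pooled, large objects contribute more to cumulative IoU than small ones, which is one reason it can differ from a per-image average.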

Video game decision making: The authors also evaluate VLMs on two video games, Car Racing and Space Invaders. PAE-LaMM raises the average score on the two games by +80.34 and +70.54, respectively.
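According to the Fig. 2 caption, the VLM receives stacked frames together with the game instructions, with two frames for Carracing and four for SpaceInvaders. A minimal frame-stacking helper might look like the sketch below. The class name `FrameStacker`, the dictionary `FRAMES_PER_GAME`, and the choice to pad with copies of the first frame on reset are assumptions for this example, not details taken from the paper.

```python
from collections import deque

# Frame counts taken from the Fig. 2 caption; the dictionary itself is
# a hypothetical convenience for this sketch.
FRAMES_PER_GAME = {"Carracing": 2, "SpaceInvaders": 4}


class FrameStacker:
    """Keep the most recent `num_frames` observations in order."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Assumption: at episode start, repeat the first frame to fill the stack.
        self.frames.clear()
        for _ in range(self.num_frames):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # The deque drops the oldest frame automatically once it is full.
        self.frames.append(frame)
        return list(self.frames)
```

The list returned by `push` (oldest frame first) is what would be tiled or concatenated into the image input before it is paired with the instruction text.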

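These results support the effectiveness of the method in strengthening the VLM's understanding of image details and show its potential across a range of downstream tasks.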

Fig. 3: Qualitative results of Reconstruction. (a) and (d) are the ground truth for reconstruction. (b) and (e) are the images reconstructed by our method. (c) and (f) are the baseline results without CLIP-ViT adaptation. Compared with the baseline, our method reconstructs images with more details. The average reconstruction errors of our method and the baseline on these 10 images are 6.67 and 24.56, respectively.

Fig. 4: Qualitative results of Referring Image Segmentation. We first use the referring localization ability of the fine-tuned model to generate a bounding box (bbox) for the referring object, and then predict the segmentation mask inside the bbox.

Fig. 5: Qualitative results of Carracing. We show the game observation from different models, including the expert Reinforcement Learning (RL) Model, Baseline Model, and Our Method, all playing under the same game seed. These images depict how each model behaves when controlling the car and approaching the same corner.