Contents
- Preface
- Abstract
- 1. Introduction
- 2. Related Work
  - 1. Visual reasoning with large language models
  - 2. Chain-of-thought in large language models
  - 3. Inference time scaling
- 3. Method
  - 1. Enhancing Reasoning Capability through Structured Thinking
    - 1. Reasoning Stages
    - 2. Data Preparation and Model Training
  - 2. Effective Inference Time Scaling using Stage-level Beam Search
- 4. Post-Training Performance
  - 1. Experimental Setup
  - 2. Benchmark Results
  - 3. Ablation Study
- 5. Inference Time Scaling
  - 1. Benchmark Results
  - 2. Comparison to Baseline Methods
  - 3. Scaling Trend of Stage-level Beam Search
- 6. Comparison to State-of-the-Art VLMs
- 7. Conclusion
- 8. LLaVA-CoT Data Construction
  - 1. Training Data Format
  - 2. Data Generation (Very Important)
Preface
Note that LLaVA-CoT is not built on a LLaVA base model; its base model is Llama-3.2-11B-Vision-Instruct [42].
Paper: https://arxiv.org/abs/2411.10440
Code: https://github.com/PKU-YuanGroup/LLaVA-CoT
Data: https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k/tree/master
Abstract
Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference time scaling method, LLaVA-CoT not only outperforms its base model by 7.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.
1. Introduction
Large language models, represented by OpenAI o1 [67], demonstrate strong capabilities for systematic and in-depth reasoning, validating the efficacy of inference-time scaling for language models [50]. However, vision is equally important for enabling models to fully understand the world and extend their cognitive abilities [6]. Therefore, developing a multimodal model that integrates both language and vision while facilitating effective, systematic, and deep reasoning holds substantial significance.
Early open-source vision language models (VLMs) mainly employ a direct prediction approach [23, 32, 34], generating brief answers immediately in response to a question. The main limitation of this direct-response paradigm is its lack of a structured reasoning process, making it less effective for tasks demanding logical reasoning [66]. Recent studies have shown that incorporating Chain-of-Thought (CoT) reasoning encourages the model to reason step by step, significantly improving its question-answering capabilities [56]. However, even with CoT reasoning, most VLMs frequently produce errors or hallucinated outputs during the reasoning process [26, 33, 53].
Our findings suggest that a significant cause of these issues is the insufficiently systematic and structured nature of the reasoning process in existing VLMs. Specifically, by systematic, we mean that the model does not generate a direct reasoning chain but instead engages in multistage reasoning. Structured, on the other hand, refers to the model's ability to clearly identify the reasoning stage it is in and understand the primary task to be addressed at each stage. We observe that VLMs often initiate responses without adequately organizing the problem and the available information. Moreover, they frequently deviate from logical reasoning toward conclusions, instead presenting a conclusion prematurely and subsequently attempting to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path. Examples of these issues can be found in Appendix A.
OpenAI o1 [67] addresses these issues effectively by enabling the model to independently engage in systematic and structured reasoning through language. Building on this insight, we design LLaVA-CoT. While the community has made some preliminary explorations into the underlying mechanisms of OpenAI o1 [45, 58], the model remains a black box, with its technical details largely unknown. This work demonstrates a potential way to enhance a model's ability to perform autonomous, stage-by-stage reasoning by employing supervised fine-tuning. Specifically, LLaVA-CoT is capable of generating four distinct stages: summary, caption, reasoning, and conclusion. Each stage serves a unique purpose in the reasoning process.
- Summary: A brief outline in which the model summarizes the forthcoming task.
- Caption: A description of the relevant parts of an image (if present), focusing on elements related to the question.
- Reasoning: A detailed analysis in which the model systematically considers the question.
- Conclusion: A concise summary of the answer, providing a final response based on the preceding reasoning.
To enhance the understanding of CoT processes in LLMs, LLaVA-CoT marks each stage with a dedicated pair of tags (e.g., <SUMMARY> ... </SUMMARY>) to denote its beginning and end. These tags enable the model to maintain clarity throughout the reasoning process. Unlike traditional CoT reasoning, which allows the model to think freely, our method promotes structured thinking by first organizing the problem and known information, followed by a detailed thought process, and then deriving a conclusion. To achieve this, we construct the LLaVA-CoT-100k dataset by generating responses stage by stage using GPT-4o [3] and then train the model using supervised fine-tuning.
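To make the tagged format concrete, here is a minimal sketch of how the four stage texts could be wrapped in their dedicated tag pairs. Only the four tag names come from the paper; the helper function and the sample contents below are invented for illustration.

```python
# Illustrative only: wrap each stage text in its dedicated <TAG> ... </TAG> pair.
STAGE_TAGS = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def build_tagged_response(summary: str, caption: str, reasoning: str, conclusion: str) -> str:
    """Assemble the four stages, in fixed order, into one tagged response string."""
    stages = {"SUMMARY": summary, "CAPTION": caption,
              "REASONING": reasoning, "CONCLUSION": conclusion}
    return "\n".join(f"<{tag}> {stages[tag]} </{tag}>" for tag in STAGE_TAGS)

# Invented example content, just to show the shape of a training target.
print(build_tagged_response(
    "I will read the chart and compare the two revenue bars.",
    "The bar chart shows revenue for 2022 and 2023.",
    "The 2023 bar (120) is higher than the 2022 bar (95), so 2023 had higher revenue.",
    "2023",
))
```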
The structured reasoning in LLaVA-CoT also facilitates efficient inference time scaling. In contrast to conventional scaling methods, such as best-of-N sampling [4, 55] and sentence-level beam search [17, 52], LLaVA-CoT employs a novel stage-level beam search method that generates multiple candidate results at each stage and selects the best one to continue the generation process.
We conduct experiments on several multimodal reasoning benchmarks, including MMStar [10], MMBench [35], MMVet [64], MathVista [37], AI2D [25], and HallusionBench [18], and observe that LLaVA-CoT offers two primary advantages: First, enabling the model to perform structured reasoning independently substantially outperforms traditional CoT prompting, particularly in complex reasoning tasks that require systematic analysis. Second, our stage-level beam search method is scalable and improves performance reliability, making it more effective in achieving stable and accurate results. Our contributions are summarized as follows:
- We introduce LLaVA-CoT, a visual language model designed for systematic reasoning, demonstrating exceptional performance on tasks that require structured thinking and reasoning.
- We demonstrate that LLaVA-CoT, using stage-level beam search, is inference-time scalable. This means that with increased computational resources, the performance of our approach can be further enhanced, making it applicable to more complex tasks and scenarios.
- Extensive experiments on various benchmarks demonstrate that our method achieves superior performance relative to larger and closed-source models, underscoring the effectiveness of LLaVA-CoT for multimodal reasoning.
2. Related Work
1. Visual reasoning with large language models
Visual reasoning demands the model's visual perception capability and the high-level cognition ability [24, 39]. Several tasks have been applied to evaluate the visual reasoning ability of VLMs, including VQA [22, 28] requiring models to answer from visual contents and textual questions, and Visual Entailment [51] requiring models to determine the consistency of text descriptions and visual contents, etc. Traditional vision-language models employ neural symbolic approaches [5, 12] to explicitly model the visual reasoning process. With the development of LLMs, vision-language models leverage the advanced reasoning abilities of LLMs to interpret visual tasks [34, 63]. Some vision-language models enhance visual reasoning by optimizing the visual encoding strategy [23, 31, 34] to produce cognition-focused visual tokens. VISPROG [19] positions the LLM as a decision-making agent, enhancing visual reasoning by invoking task-specific visual modules. Hu et al. [20] improve reasoning capabilities through sequential instruction tuning. Additionally, instruction learning techniques for language models, including prompt tuning [65], in-context learning, and supervised fine-tuning [49], also contribute to improvements in visual reasoning capabilities.
2. Chain-of-thought in large language models
Chain-of-thought prompting [56] offers a step-by-step reasoning trajectory when LLMs face hard questions, including commonsense reasoning [16, 47], logical reasoning [29, 59], etc. Specifically, CoT prompting decomposes the question into a group of reasoning steps and builds a chain to guide the model to generate the results of complex problems step-by-step [13]. Recent works have demonstrated that CoT prompting substantially improves the LLM's capability on reasoning tasks. For instance, Prism [44] prompts LLMs by dividing the process into a perception stage and a reasoning stage. MSG [8] pioneers the use of forced Chain-of-Thoughts, establishing a new direction for structured prompting techniques.
3. Inference time scaling
Existing methods for inference time scaling fall into two main categories: those that rely on an external verifier for selection [27, 60] and those that operate independently of any external verifier [21, 57]. Methods that rely on an external verifier can be applied on top of most prevalent approaches. On the other hand, inference time scaling methods that do not rely on an external verifier primarily include majority voting [21], best-of-N search [4, 55], and sentence-level beam search [17, 52]. Majority voting is effective for certain types of problems that have standard answers, but it is not suitable for open-ended tasks. Best-of-N search generates N complete answers and allows the model to select the best response. However, generating full answers for selection can complicate the evaluation of their accuracy. Sentence-level beam search generates multiple candidate sentences, selects the best one, and iteratively continues this process. However, this approach operates at too granular a level, which makes it difficult for the model to effectively assess the quality of its responses on a per-sentence basis.
3. Method
Our LLaVA-CoT facilitates a progressive, step-by-step reasoning process that enhances the reasoning capabilities of Vision-Language Models (VLMs) and allows for effective inference time scaling [50]. Using structured thinking, LLaVA-CoT achieves a systematic and efficient reasoning process. Its inference-time reasoning framework enables it to outperform existing methods in inference time scalability. This design ensures both robustness and accuracy in complex tasks requiring reasoning, which separates it from traditional approaches. Figure 1 illustrates our general framework of the reasoning process.
1. Enhancing Reasoning Capability through Structured Thinking
Our goal during training time is to develop a visual language model capable of extended chains of reasoning, allowing it to engage in systematic and in-depth reasoning.
1. Reasoning Stages
Our proposed model, LLaVA-CoT, decomposes the answer generation process into four structured reasoning stages:
- Summary Stage: In this initial phase, LLaVA-CoT provides a high-level summary interpretation of the question, outlining the primary aspects of the problem it intends to address.
- Caption Stage: If an image is present, LLaVA-CoT offers a concise overview of the visual elements relevant to the question, helping to understand multimodal input.
- Reasoning Stage: Building on the initial summary, LLaVA-CoT conducts structured, logical reasoning to derive a preliminary answer.
- Conclusion Stage: In this final stage, LLaVA-CoT synthesizes an answer based on the preceding reasoning. Here, the output from the conclusion stage is the direct response provided to the user, while the prior three stages are internal "hidden stages" representing LLaVA-CoT's reasoning process. The output at this stage adapts to the user's requirements: for instance, if the user requests a brief answer, the conclusion will be concise; if detailed explanations are desired, the conclusion provides a thorough, comprehensive response.
Each stage is initiated at the model's discretion, without external prompt engineering frameworks or additional prompting. Specifically, we provide the model with four pairs of special tags: <SUMMARY></SUMMARY>, <CAPTION></CAPTION>, <REASONING></REASONING>, and <CONCLUSION></CONCLUSION>. These tags correspond to summarizing the response approach, describing relevant image content, conducting reasoning, and preparing a final answer, respectively.
Upon training, the model autonomously selects these tags as needed, activating each stage based on its own judgment. As with OpenAI o1 [67], all stages are completed by the model in a single inference pass. This structured approach enables the model to independently manage its reasoning process, improving its adaptability and performance on complex reasoning tasks.
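Because only the conclusion stage is surfaced to the user while the other three stages remain hidden, a caller needs to split the tagged response back into its stages. Below is a minimal parsing sketch; it reflects my own assumption of how such post-processing could look and is not part of the released implementation.

```python
import re

# Tag names follow the paper; the parsing helpers themselves are illustrative.
STAGE_TAGS = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")

def parse_stages(response: str) -> dict:
    """Return {stage_name: text} for every <TAG>...</TAG> block found in the response."""
    stages = {}
    for tag in STAGE_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        if match:
            stages[tag] = match.group(1).strip()
    return stages

def user_visible_answer(response: str) -> str:
    """Only the CONCLUSION stage is shown to the user; the rest are hidden stages."""
    return parse_stages(response).get("CONCLUSION", response)
```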
2. Data Preparation and Model Training
Most existing VQA datasets lack detailed reasoning processes needed to train the LLaVA-CoT model. Therefore, we compile a new dataset, integrating samples from several widely used VQA datasets, resulting in a total of 99k image QA pairs (each pair may include one or multiple rounds of questioning). As shown in Figure 3, since no multimodal model currently exists that can directly produce systematic, structured reasoning, we use GPT-4o [3] to generate detailed reasoning processes, including summary, caption, reasoning, and conclusion, and compile these into the LLaVA-CoT-100k dataset, which we plan to release for public use. Details of the generation process and examples of the generated data are provided in Appendix B. We include data from both general-purpose VQA datasets and science-targeted VQA datasets specified below:
General VQA Datasets: We include several general-purpose VQA datasets with distinct focuses. ShareGPT4V [9] provides multi-turn question-answering data from GPT-4V [61] interactions. ChartQA [40] focuses on interpreting charts and graphs. A-OKVQA [48] emphasizes external knowledge beyond visible content. DocVQA [41] involves document-based questions requiring textual comprehension. We also include PISC [30] to understand social relationships, and CLEVR [24] to address object properties, spatial relationships, and counting tasks.
Science-Targeted VQA Datasets: These datasets include GeoQA+ [7] for geometric reasoning, along with AI2D [25] and ScienceQA [36], which target scientific questions. CLEVR-Math [14], an extension of CLEVR, focuses on arithmetic analysis in visual contexts. Table 1 shows the number of QA pairs selected from each dataset.
Model Training: The LLaVA-CoT-100k dataset we construct can be used to further conduct Supervised Fine-Tuning (SFT) on any existing model to enhance reasoning capabilities. In this work, we select the Llama-3.2-11B-Vision-Instruct [42] model as the base model, and perform full-parameter fine-tuning using the LLaVA-CoT-100k dataset. The training is conducted on a single node with 8 H100 GPUs. Details on the specific training parameters, including training epochs, learning rate, and optimization settings, are provided in Appendix C.
2. Effective Inference Time Scaling using Stage-level Beam Search
After training, our objective is to further enhance the model's reasoning ability during inference. Specifically, we leverage the stage-based outputs of LLaVA-CoT, which provides an ideal granularity for inference time scaling. Our method follows the steps below:
- Sample N responses for the first stage in the solution.
- Randomly sample 2 responses and let the model determine which is better, keeping the better response.
- Repeat step 2 N − 1 times, retaining the best response.
- Sample N responses for the next stage, then repeat steps 2-4 until all stages are processed (a schematic implementation is sketched after this list).
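Putting these steps together, a schematic implementation of the stage-level beam search loop might look like the sketch below. The helpers `generate_stage` and `judge_better` are hypothetical stand-ins for the VLM's generation call and its pairwise self-judging call (the judging prompt is described in Appendix D, excerpted later); only the loop structure follows the four steps above.

```python
import random

# Schematic stage-level beam search; `generate_stage` and `judge_better` are
# hypothetical callables wrapping the VLM, not part of the released code.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def stage_level_beam_search(question, image, generate_stage, judge_better, n=4):
    """Generate a response stage by stage, keeping the best of n candidates per stage."""
    context = ""  # accumulated text of all stages kept so far
    for stage in STAGES:
        # Step 1: sample n candidate continuations for the current stage.
        candidates = [generate_stage(question, image, context, stage) for _ in range(n)]
        # Steps 2-3: n-1 pairwise comparisons in random order, always keeping the winner.
        random.shuffle(candidates)
        best = candidates[0]
        for challenger in candidates[1:]:
            if judge_better(question, stage, challenger, best):
                best = challenger
        # Step 4: append the winning stage text, then move on to the next stage.
        context += best
    return context
```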
Notably, it is the structured output design of LLaVA-CoT that makes this approach feasible, enabling efficient and accurate verification at each stage. This validates the effectiveness of structured output in improving inference time scaling. An illustration of the three approaches is shown in Figure 4, and the detailed implementation of our stage-level beam search can be found in Appendix D.
Excerpt from Appendix D of the paper (stage-level beam search) begins:
D. Implementation Details of Stage-level Beam Search
In this section, we provide the implementation details of Stage-level Beam Search. As mentioned in the main paper, we randomly sample two responses and allow the model to determine which response is better, retaining the superior one. However, the implementation details for this process are omitted. Here, we elaborate on our method using prompt engineering within LLaVA-CoT, wherein the model acts as a judge to evaluate responses.
The general form of the prompt we use is:
Prompt for Stage-level Beam Search
Now you act as a judge, helping me determine which of the two texts I provide better provides a summary/caption/reasoning/conclusion to solve the question.
For each stage, we provide the model with specific guidance to ensure accurate evaluation:
- Summary stage: Please note that a better summary should focus on outlining the main approach instead of stating specific analytical reasoning or math formula.
- Caption stage: Please note that a better caption should be as thorough as possible while remaining accurate, capturing as many details as possible rather than providing only general commentary.
- Reasoning stage: Begin by thoroughly reviewing the question, followed by an in-depth examination of each answer individually, noting any differences. Subsequently, analyze these differences to determine which response demonstrates stronger reasoning and provide a clear conclusion.
- Conclusion stage: Please note that a better conclusion should align with the reasoning. The conclusion should never refuse to answer the question.
The model follows these guidelines to evaluate the two responses for each stage and selects the superior one, ensuring the final output aligns with a high-quality reasoning pathway.
End of the Appendix D excerpt.
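Based on the guidance quoted above, the stage-specific judge prompt could be assembled roughly as follows. The base instruction and the per-stage guidance strings are taken from Appendix D; substituting the current stage name into the base instruction, the wrapper function, and the layout of the question and the two candidates are my own assumptions.

```python
# Base judge instruction quoted from Appendix D, with the stage name filled in
# (the paper lists "summary/caption/reasoning/conclusion" in one prompt).
JUDGE_PROMPT = (
    "Now you act as a judge, helping me determine which of the two texts I "
    "provide better provides a {stage} to solve the question."
)

# Per-stage guidance, quoted from Appendix D.
STAGE_GUIDANCE = {
    "summary": "Please note that a better summary should focus on outlining the main "
               "approach instead of stating specific analytical reasoning or math formula.",
    "caption": "Please note that a better caption should be as thorough as possible while "
               "remaining accurate, capturing as many details as possible rather than "
               "providing only general commentary.",
    "reasoning": "Begin by thoroughly reviewing the question, followed by an in-depth "
                 "examination of each answer individually, noting any differences. "
                 "Subsequently, analyze these differences to determine which response "
                 "demonstrates stronger reasoning and provide a clear conclusion.",
    "conclusion": "Please note that a better conclusion should align with the reasoning. "
                  "The conclusion should never refuse to answer the question.",
}

def build_judge_prompt(stage: str, question: str, response_a: str, response_b: str) -> str:
    """Combine the base judge instruction, stage guidance, and the two candidates (illustrative layout)."""
    return "\n\n".join([
        JUDGE_PROMPT.format(stage=stage),
        STAGE_GUIDANCE[stage],
        f"Question: {question}",
        f"Text 1: {response_a}",
        f"Text 2: {response_b}",
    ])
```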
We provide an example in Figure 5. When inference time scaling is not applied, although the model generates correct reasoning steps, it fails to arrive at a concrete answer during the reasoning process. This causes the model to make a guess in the conclusion phase, leading to an incorrect result. In contrast, with inference time scaling, the model retains the reasoning steps leading to the final result, ensuring the correctness of the answer.

My take: very impressive!
4. Post-Training Performance
In this section, we compare LLaVA-CoT with the base model, Llama-3.2-11B-Vision-Instruct, on six commonly used multimodal benchmarks to demonstrate the effectiveness of our approach during the training phase. Following this comparison, we conduct ablation studies to evaluate the contribution of each component within our method, addressing the following three key questions: (1) Is our LLaVA-CoT-100k dataset more effective than directly using the original dataset's Q&A pairs? (2) What is the impact of structured tags on the performance? Specifically, we explore whether LLaVA-CoT can function without tags by implicitly segmenting different stages of the response. (3) In which specific areas does our model show the most improvement compared to the base model, and does it genuinely enhance reasoning capabilities?
1. Experimental Setup
We selected six widely used and challenging benchmarks for our experiments: MMStar [10], MMBench V1.1 [35], MMVet [64], MathVista [37], AI2D [25], and HallusionBench [18]. MMStar, MMBench, and MMVet primarily evaluate the general visual question-answering capabilities of models, while MathVista and AI2D focus on models' proficiency in mathematical and scientific reasoning. HallusionBench specifically assesses the models' handling of language hallucinations and visual illusions. For MMBench, we use the V1.1 version of the test set, MathVista is evaluated using the testmini set, and the remaining datasets each have a single test set. To ensure fairness and reproducibility, all evaluations are conducted using VLMEvalKit [15], an open-source evaluation toolkit for large vision-language models. The performance metrics of all baseline models are derived from VLMEvalKit's testing results [1].
2. Benchmark Results
We found that LLaVA-CoT achieves significant performance improvements, despite using only 100k data. According to Table 2, compared to the base model, Llama-3.2-11B-Vision-Instruct, LLaVA-CoT demonstrates notable improvements across general VQA, mathematical reasoning, scientific VQA, and hallucination control tasks, with an average benchmark score increase of 5.8%, thereby validating the effectiveness of our approach.
3. Ablation Study
Effectiveness of LLaVA-CoT-100k Compared to Original Datasets. To demonstrate the effectiveness of our improved LLaVA-CoT-100k dataset, we present a comparison between LLaVA-CoT and the model trained on the original Q&A pairs across different benchmarks in Table 2. Although the model trained directly on the original Q&A pairs shows some overall improvement over the base model, its average performance remains significantly lower. In particular, on the MMVet benchmark, which requires more detailed responses, its performance is even worse than the base model. This result underscores the importance of the multi-stage format of our LLaVA-CoT-100k dataset for training models capable of advanced reasoning.
Structured Tags are Essential for Enhanced Performance. To examine whether the four tags we introduced improve the model's performance, we compare LLaVA-CoT with the model trained on the LLaVA-CoT-100k dataset with structured tags removed. As shown in Table 2, our results show a significant drop in performance when the tags are removed, indicating that the structured tagging facilitates reasoning and improves model performance. To the best of our knowledge, LLaVA-CoT is the first attempt to successfully enhance a model's reasoning ability and overall performance through structured reasoning with tags.
Performance Gains Primarily in Reasoning-Intensive Areas. To analyze the specific areas in which LLaVA-CoT has improved compared to the base model, we conduct a detailed assessment of the model's performance across different skills on the MMStar benchmark. MMStar is designed to evaluate six key capabilities: coarse perception, fine-grained perception, instance reasoning, logical reasoning, math, and science & technology. In Table 3, we compare the base model with LLaVA-CoT. Our analysis reveals that LLaVA-CoT demonstrates notable improvements in tasks requiring systematic reasoning, such as instance reasoning, logical reasoning, math, and science & technology, while showing relatively smaller gains in coarse perception and fine-grained perception. This suggests that our method can mainly improve reasoning capabilities of the model.
5. Inference Time Scaling
In this section, we aim to compare the effectiveness of our stage-level beam search approach with traditional methods like best-of-N and sentence-level beam search under comparable computational constraints. The experimental setup mirrors that used in the previous section, with evaluations conducted across the same six benchmarks: MMStar, MMBench V1.1, MMVet, MathVista, AI2D, and HallusionBench. All methods are evaluated using VLMEvalKit to ensure reproducibility.
1. Benchmark Results
As shown in Table 4, stage-level beam search demonstrates substantial effectiveness in leveraging the structured reasoning stages of LLaVA-CoT. By evaluating outputs at each reasoning stage, this approach strikes a balance between rigorous quality control and computational efficiency, yielding higher inference accuracy on complex reasoning tasks without significant computational overhead. These findings suggest that stage-level beam search, which is made possible by the structured output design of LLaVA-CoT, is an effective and powerful approach for inference time scaling.
2. Comparison to Baseline Methods
We compare our method with baseline inference scaling methods on the MMVet benchmark to evaluate relative performance. For a fair comparison, our stage-level beam search method and the baseline models are evaluated using comparable levels of inference time compute. Specifically, we set N = 10 for the best-of-N method, generate 4 candidate responses per stage for our stage-level beam search, and use a sentence-level beam search generating 2 candidates per sentence. As shown in Table 5, the best-of-N method yields only a modest improvement of 0.6%, while sentence-level beam search even shows a 1.9% decrease in performance. We examine the sub-scores and found that the main reason for the performance drop in sentence-level beam search is the excessively granular sentence-level approach, which struggles to effectively address open-ended questions. In contrast, our stage-level beam search improved performance by 2.6%, highlighting the superiority of stage-based search.

3. Scaling Trend of Stage-level Beam Search
To better illustrate the effectiveness of our stage-level beam search as inference time compute increases, we evaluate LLaVA-CoT with different beam sizes on the MMVet benchmark. As shown in Table 6, we test the performance of the model by generating 1 (i.e., no inference time scaling), 2, 3, and 4 candidate responses at each reasoning stage, allowing the model to select the best answer from these options. Our findings show that as the number of candidate responses increases, the model's performance consistently improves, confirming that our stage-level beam search approach is scalable. Due to computational resource constraints, we only test a beam size of 2 across all benchmarks. However, it is expected that increasing the beam size will lead to even more significant improvements.

6. Comparison to State-of-the-Art VLMs
As shown in Table 7, we compare LLaVA-CoT with other state-of-the-art open-source and closed-source vision language models (VLM) across six benchmarks that require advanced reasoning capabilities: MMStar-R, MMBench-R, MMVet-R, MathVista, AI2D, and HallusionBench. MMStar-R, MMBench-R, and MMVet-R are custom benchmarks derived from MMStar, MMBench V1.1, and MMVet, respectively, with tasks requiring only coarse perception, fine-grained perception, and OCR removed. These filtered benchmarks retain tasks that demand complex reasoning, with further details on the selection criteria available in Appendix E. MathVista, AI2D, and HallusionBench inherently focus on advanced reasoning, so we retained all tasks within these benchmarks.
Our results show that LLaVA-CoT consistently outperforms many open-source models of similar or even larger sizes, such as InternVL2-8B [11], Ovis1.5-Gemma2-9B [38], MiniCPM-V2.6-8B [62], Llama-3.2-90B-Vision-Instruct [42], and VILA-1.5-40B [32]. Remarkably, LLaVA-CoT even surpasses certain closed-source models like GPT-4o-mini [43] and Gemini-1.5-pro [46], underscoring the effectiveness of our structured reasoning approach. This comparison validates the advantages of our method, particularly in benchmarks that heavily depend on reasoning skills, and highlights LLaVA-CoT as a competitive model in the domain of reasoning-intensive VLM tasks.
7. Conclusion
In this paper, we present LLaVA-CoT, a novel vision language model that performs structured, autonomous reasoning in multiple stages. By introducing four distinct stages, LLaVA-CoT achieves a systematic reasoning process. Our contributions are twofold: first, the creation of the LLaVA-CoT-100k dataset with detailed reasoning annotations, which supports training on systematic, structured responses; and second, the proposal of a stage-level beam search method, enabling effective inference time scaling. Overall, LLaVA-CoT establishes a new standard for multimodal reasoning in VLMs, offering robust performance and scalability, especially in inference time. Our work paves the way for future research on structured reasoning in VLMs, including potential expansions with external verifiers and the use of reinforcement learning to further enhance complex multimodal reasoning capabilities.
8. LLaVA-CoT Data Construction
1. Training Data Format
I looked up this model's training data format and list only a single example here. The SUMMARY, CAPTION, REASONING, and other stage tags all appear inside the single gpt turn, so I believe the loss is computed over the whole structured response at once; a schematic sample is sketched below.
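The following is a schematic training sample, assuming the standard LLaVA-style `conversations` layout with the entire tagged response inside the single gpt turn. The field names follow the common LLaVA convention; the concrete question, image path, and answer are invented for illustration.

```python
# Schematic (invented) sample in LLaVA-style "conversations" format.
sample = {
    "id": "example_0001",
    "image": "chartqa/train/png/example.png",  # hypothetical path
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nWhich year had the higher revenue?",
        },
        {
            "from": "gpt",
            # All four tagged stages sit in one gpt turn, so SFT loss covers them jointly.
            "value": (
                "<SUMMARY> I will compare the revenue bars for the two years shown "
                "in the chart. </SUMMARY> "
                "<CAPTION> The bar chart shows revenue for 2022 and 2023. </CAPTION> "
                "<REASONING> The 2023 bar reaches 120 while the 2022 bar reaches 95, "
                "so 2023 is higher. </REASONING> "
                "<CONCLUSION> 2023 </CONCLUSION>"
            ),
        },
    ],
}
```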

2. Data Generation (Very Important)
Overall, we provide GPT-4o with a question, an image, and the original dataset's answer to generate systematic and structured datasets.
Specifically, we guide GPT-4o to generate response data in stages using a carefully designed prompt. The prompt is formatted as follows:
Prompt for data generation
I have an image and a question that I want you to answer. I need you to strictly follow the format with four specific sections: SUMMARY, CAPTION, REASONING, and CONCLUSION. It is crucial that you adhere to this structure exactly as outlined and that the final answer in the CONCLUSION matches the standard correct answer precisely.
To explain further: In SUMMARY, briefly explain what steps you'll take to solve the problem. In CAPTION, describe the contents of the image, specifically focusing on details relevant to the question. In REASONING, outline a step-by-step thought process you would use to solve the problem based on the image. In CONCLUSION, give the final answer in a direct format, and it must match the correct answer exactly. If it's a multiple choice question, the conclusion should only include the option without repeating what the option is.
Here's how the format should look:
<SUMMARY> [Summarize how you will approach the problem and explain the steps you will take to reach the answer.] </SUMMARY>
<CAPTION> [Provide a detailed description of the image, particularly emphasizing the aspects related to the question.] </CAPTION>
<REASONING> [Provide a chain-of-thought, logical explanation of the problem. This should outline step-by-step reasoning.] </REASONING>
<CONCLUSION> [State the final answer in a clear and direct format. It must match the correct answer exactly.] </CONCLUSION>
(Do not forget </CONCLUSION>!)
Please apply this format meticulously to analyze the given image and answer the related question, ensuring that the answer matches the standard one perfectly.
After generating data using this prompt, we verify whether the data generated by GPT-4o adheres to the prescribed format and filter out any data that does not comply. Next, we extract the content within <CONCLUSION> ... </CONCLUSION> and apply the following prompt to filter out cases where GPT-4o either refuses to answer or provides an answer that is inconsistent with the original dataset's standard answer:
Prompt for data verification
Evaluate whether the assistant's response is valid. Respond with 'valid' if the assistant's response is not a refusal and it aligns with the standard answer in meaning. Respond with 'invalid' if the response is a refusal or differs from the standard answer in a meaningful way.
A refusal means the assistant states it cannot recognize a specific person/object or refuses to answer the question. Do not consider a response to be a refusal just because it includes the word 'no' or other negative terms.
Standard answer: {standard answer}
Assistant's response: {assistant response}
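Putting the two prompts together, the dataset construction loop could be sketched as follows. `ask_gpt4o` is a hypothetical helper standing in for whatever GPT-4o client is used, and the format and consistency filters are my own reading of the procedure described above, not the authors' released code.

```python
import re

# Hypothetical pipeline sketch; `ask_gpt4o` is a placeholder for a GPT-4o call.
TAGS = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")
GENERATION_PROMPT = "..."  # the "Prompt for data generation" quoted above
VERIFY_PROMPT = "..."      # the "Prompt for data verification" quoted above

def has_all_stages(text: str) -> bool:
    """Format filter: every stage must appear exactly once as <TAG>...</TAG>."""
    return all(len(re.findall(rf"<{t}>.*?</{t}>", text, re.DOTALL)) == 1 for t in TAGS)

def build_sample(question, image, standard_answer, ask_gpt4o):
    """Generate one structured response and keep it only if it passes both filters."""
    response = ask_gpt4o(GENERATION_PROMPT, question=question, image=image,
                         answer=standard_answer)
    if not has_all_stages(response):
        return None  # discard: output does not follow the prescribed format
    conclusion = re.search(r"<CONCLUSION>(.*?)</CONCLUSION>", response, re.DOTALL).group(1)
    verdict = ask_gpt4o(VERIFY_PROMPT, standard_answer=standard_answer,
                        assistant_response=conclusion)
    return response if verdict.strip().lower() == "valid" else None
```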