Contents
[6.2.1 Vision-Language Pretraining](#6.2.1 Vision-Language Pretraining)
[6.2.1.1 Contrastive Learning (CLIP) and Image-Text Alignment](#6.2.1.1 Contrastive Learning (CLIP) and Image-Text Alignment)
[6.2.1.2 Generative VLMs (BLIP/BLIP-2 and the Q-Former)](#6.2.1.2 Generative VLMs (BLIP/BLIP-2 and the Q-Former))
[6.2.1.3 Projection Layer Design (MLP Adapter vs Transformer Connector)](#6.2.1.3 Projection Layer Design (MLP Adapter vs Transformer Connector))
[6.2.1.4 Multimodal Instruction Tuning (InstructBLIP/LLaVA-1.5)](#6.2.1.4 Multimodal Instruction Tuning (InstructBLIP/LLaVA-1.5))
[Script 2: BLIP-2 Q-Former Visual Question Answering System](#Script 2: BLIP-2 Q-Former Visual Question Answering System)
[Script 3: Projection Layer Architecture Comparison (ScienceQA)](#Script 3: Projection Layer Architecture Comparison (ScienceQA))
Part I: Principles in Detail
6.2.1 Vision-Language Pretraining
Vision-language pretraining builds a unified representation space that understands and relates the visual and textual modalities. The core of the field is learning cross-modal semantic alignment from large-scale image-text pairs, so that one model acquires both visual perception and language understanding and can solve downstream tasks in zero-shot or few-shot settings.
6.2.1.1 Contrastive Learning (CLIP) and Image-Text Alignment
Contrastive Language-Image Pretraining (CLIP) constructs a joint vision-language embedding space with a dual-encoder ("two-tower") architecture: an image encoder and a text encoder independently map images and text into a shared $d$-dimensional embedding space.
The image encoder is a Vision Transformer (ViT) or a convolutional network (ResNet) that encodes an input image $x_{\text{img}} \in \mathbb{R}^{H \times W \times 3}$ into a feature vector $v \in \mathbb{R}^d$. The text encoder is Transformer-based: the text sequence $x_{\text{text}}$ is tokenized with byte-pair encoding (BPE), embedded, and processed by stacked self-attention layers to produce a text feature $t \in \mathbb{R}^d$.
The core of contrastive learning is the image-text similarity matrix. For a training batch of size $N$, the model computes cosine similarities between the normalized image features $\{v_1, v_2, \ldots, v_N\}$ and text features $\{t_1, t_2, \ldots, t_N\}$:
$$s_{i,j} = \frac{v_i \cdot t_j}{\lVert v_i \rVert \, \lVert t_j \rVert \cdot \tau}$$
where $\tau$ is a learnable temperature parameter that controls the sharpness of the distribution. The InfoNCE loss maximizes the similarity of positive pairs while minimizing that of negative pairs:
$$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log \frac{\exp(s_{i,i})}{\sum_{j=1}^{N} \exp(s_{i,j})} + \log \frac{\exp(s_{i,i})}{\sum_{j=1}^{N} \exp(s_{j,i})} \right]$$
This symmetric loss contrasts in both directions, image-to-text and text-to-image. The N-pair loss is a variant that contrasts a single positive against $N-1$ negatives, mitigating the vanishing-gradient problem a plain softmax suffers at small batch sizes.
Learning the temperature $\tau$ well is critical. It is typically initialized to $0.07$ and adjusted dynamically by backpropagation. A lower $\tau$ sharpens the distribution and strengthens discrimination of hard negatives; a higher $\tau$ smooths the distribution and yields more stable gradients early in training.
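The symmetric, temperature-scaled InfoNCE objective can be sketched in a few lines of PyTorch. This is a toy illustration on random features; `clip_infonce` and its argument names are illustrative, not CLIP's actual API:

```python
import torch
import torch.nn.functional as F

def clip_infonce(v: torch.Tensor, t: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    # v, t: L2-normalized image / text features of shape [N, d]
    logits = (v @ t.T) / tau                      # cosine similarities scaled by 1/tau
    labels = torch.arange(v.size(0))              # positives lie on the diagonal
    loss_i2t = F.cross_entropy(logits, labels)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, labels)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random features and the usual tau initialization of 0.07
v = F.normalize(torch.randn(8, 32), dim=-1)
t = F.normalize(torch.randn(8, 32), dim=-1)
tau = torch.nn.Parameter(torch.tensor(0.07))
loss = clip_infonce(v, t, tau)
```

Because $\tau$ participates in the forward pass, its gradient is produced by the same backward pass that updates the encoders.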
Zero-shot classification is realized through prompt templates. For the $K$ ImageNet classes, construct a set of text descriptions $\{\text{text}_k\}_{k=1}^{K}$, compute the similarity between the query image and each class description, and take the class with the highest response as the prediction:
$$\hat{y} = \arg\max_{k} \frac{v_{\text{query}} \cdot t_k}{\lVert v_{\text{query}} \rVert \, \lVert t_k \rVert}$$
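Template-based zero-shot classification can be sketched as follows. Here `encode_text` stands in for the trained text encoder and is an assumption of this example, not part of CLIP's API; in practice many more templates are averaged per class:

```python
import torch
import torch.nn.functional as F

TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}."]

def zero_shot_classify(image_feats, class_names, encode_text):
    # Build one averaged, re-normalized text embedding per class
    class_feats = []
    for name in class_names:
        prompts = [tpl.format(name) for tpl in TEMPLATES]
        feats = encode_text(prompts)                  # [T, d], one row per template
        class_feats.append(F.normalize(feats.mean(dim=0), dim=-1))
    W = torch.stack(class_feats)                      # [K, d] acts as a classifier weight matrix
    return (image_feats @ W.T).argmax(dim=-1)         # [B] predicted class indices
```

Averaging across templates reduces sensitivity to any single prompt wording.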
6.2.1.2 Generative VLMs (BLIP/BLIP-2 and the Q-Former)
BLIP-2 introduces the Querying Transformer (Q-Former) to bridge the vision-language representation gap between a frozen large-scale image encoder and a frozen large language model (LLM). Its core is a lightweight query transformer with learnable query embeddings (query tokens) that act as a bridge carrying visual information into the LLM's semantic space.
The Q-Former is pretrained in two stages. Stage one is representation learning: the image encoder (e.g., ViT-G/14) is frozen and only the Q-Former parameters are optimized, jointly under three objectives: image-text contrastive learning (ITC), image-text matching (ITM), and image-grounded text generation (ITG). ITC aligns the latent representations of images and text, ITM classifies whether an image-text pair is related, and ITG trains the Q-Former to generate text descriptions from visual features.
The query embeddings $Q \in \mathbb{R}^{N_q \times d}$ are learnable parameters that interact with the frozen image encoder's output through cross-attention. The visual features $Z \in \mathbb{R}^{N_{\text{patch}} \times d_{\text{vit}}}$ extracted by the image encoder serve as keys and values, and the queries progressively extract visual information through the cross-attention layers of a multi-layer Transformer:
$$\text{Attention}(Q, Z, Z) = \text{softmax}\!\left(\frac{Q W_Q (Z W_K)^{T}}{\sqrt{d_k}}\right) Z W_V$$
This query-based compression turns a variable-length set of image features into a fixed number of query tokens, resolving the dimensional mismatch an LLM faces when consuming high-dimensional visual features.
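A minimal sketch of the compression step, assuming illustrative dimensions; `QueryCompressor` is a hypothetical name, and a real Q-Former block also interleaves self-attention and a feed-forward network:

```python
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """N_q learnable queries cross-attend to variable-length patch features."""
    def __init__(self, n_queries: int = 32, d: int = 768, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: [B, N_patch, d] frozen image-encoder features (keys and values)
        q = self.queries.unsqueeze(0).expand(z.size(0), -1, -1)
        out, _ = self.cross_attn(q, z, z)   # queries attend to all patches
        return out                          # [B, n_queries, d], fixed length

compressor = QueryCompressor()
short = compressor(torch.randn(2, 196, 768))   # 224x224 input -> 196 patches
long = compressor(torch.randn(2, 576, 768))    # higher resolution -> more patches
```

Whatever the number of input patches, the output length is fixed by the number of queries.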
Stage two is generative learning: the Q-Former output is connected to a frozen large language model (e.g., Flan-T5). The visual representation extracted by the Q-Former is projected through a fully connected layer into the LLM's embedding dimension and prepended to the input as soft visual prompts. Only the Q-Former and the projection layer are trained in this stage, optimized with a language-modeling loss for vision-to-language generation:
$$\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}, Q_{\text{output}})$$
where $Q_{\text{output}}$ is the visual representation produced by the Q-Former. Freezing the LLM preserves its rich language knowledge, while the lightweight Q-Former injects visual perception.
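In tensor terms, the soft-prompt prefixing step is a concatenation along the sequence axis. This sketch uses toy dimensions, with random tensors standing in for the projected Q-Former output and the LLM's token embeddings:

```python
import torch

B, n_q, T, d_llm = 2, 32, 10, 512
visual_prefix = torch.randn(B, n_q, d_llm)   # P(Q_output): projected Q-Former output
text_embeds = torch.randn(B, T, d_llm)       # Embed(x_text) from the frozen LLM
llm_input = torch.cat([visual_prefix, text_embeds], dim=1)  # [B, n_q + T, d_llm]
# The LM loss is taken only over the T text positions; the n_q prefix
# positions condition generation but are never themselves predicted.
```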
For visual question answering, the Q-Former encodes the question text jointly with the image queries: the question embeddings are concatenated with the query tokens and fed into the Q-Former, and the resulting joint vision-language representation is projected and passed to the LLM, which autoregressively generates the answer sequence.
6.2.1.3 Projection Layer Design (MLP Adapter vs Transformer Connector)
In a vision-language model, the projection layer carries the key responsibility of mapping visual features into the language model's embedding space. Different projection architectures trade off expressiveness, computational efficiency, and parameter count.
Linear projection applies a single matrix $W \in \mathbb{R}^{d_{\text{vit}} \times d_{\text{llm}}}$ that maps the visual feature $v$ directly into the LLM embedding space:
$$h = vW$$
This has the smallest parameter count ($d_{\text{vit}} \cdot d_{\text{llm}}$) but limited expressiveness, and it cannot model complex vision-language alignment relationships.
A multi-layer perceptron (MLP) adapter introduces a non-linear transformation to strengthen the mapping. A typical design is a two-layer MLP with a GELU activation and dropout regularization in between:
$$h = \text{GELU}(v W_1 + b_1) W_2 + b_2$$
where $W_1 \in \mathbb{R}^{d_{\text{vit}} \times d_{\text{hidden}}}$, $W_2 \in \mathbb{R}^{d_{\text{hidden}} \times d_{\text{llm}}}$, and $d_{\text{hidden}}$ is typically $4\times$ $d_{\text{llm}}$. The GELU non-linearity lets the model learn more complex feature transformations. On knowledge-intensive tasks such as ScienceQA this architecture achieves higher accuracy than linear projection, at the cost of more parameters and compute.
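The two-layer adapter maps directly onto `nn.Sequential`. The sketch uses toy dimensions; real models would use, e.g., the ViT feature width for `d_vit` and the LLM embedding width for `d_llm`:

```python
import torch
import torch.nn as nn

d_vit, d_llm = 64, 128           # toy sizes; the structure is what matters
d_hidden = 4 * d_llm             # hidden width, 4x d_llm as described above
mlp_adapter = nn.Sequential(
    nn.Linear(d_vit, d_hidden),  # W1, b1
    nn.GELU(),                   # non-linearity a pure linear projection lacks
    nn.Dropout(0.1),
    nn.Linear(d_hidden, d_llm),  # W2, b2
)
h = mlp_adapter(torch.randn(2, 16, d_vit))   # [B, N_patch, d_llm]
```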
The C-Abstractor (a query-based compressor) uses a Transformer-style query mechanism that compresses visual features through learnable query tokens. Cross-attention first compresses the variable-length image patch features into a fixed number of visual queries, which are then refined by self-attention layers:
$$Q_{\text{compressed}} = \text{CrossAttn}(Q_{\text{query}}, V_{\text{patches}}, V_{\text{patches}})$$
$$H = \text{SelfAttn}(Q_{\text{compressed}})$$
$$h = \text{MLP}(H)$$
This design is particularly suited to high-resolution images and long video sequences: the query mechanism compresses the features and keeps the LLM's context length from growing out of control.
Comparative experiments on the ScienceQA dataset show that a single linear projection, constrained by its parameter budget, struggles to build fine-grained vision-concept mappings and typically stays below 60% accuracy. A two-layer MLP adapter captures the more complex relations between visual features and scientific concepts through its non-linearity, raising accuracy by roughly 5-8 percentage points. The C-Abstractor performs best on tasks requiring fine-grained visual reasoning, but its training stability is sensitive to the number of queries and the learning rate.
6.2.1.4 Multimodal Instruction Tuning (InstructBLIP/LLaVA-1.5)
Multimodal instruction tuning aims to give a vision-language model the ability to follow natural-language instructions, so that it can perform diverse multimodal tasks. LLaVA-1.5 proposes a visual instruction tuning framework: it builds high-quality (image, instruction, response) triples and transfers the pretrained aligned representations to multi-turn dialogue.
Data formatting follows the instruction-tuning paradigm. Each training sample contains an image $x_{\text{img}}$, an instruction $x_{\text{instr}}$, and a target response $y$. Instructions use diverse templates spanning description, reasoning, question answering, and other task types. To support interleaved image-text conversations with multiple images, the model introduces a special image token <image> as a placeholder marking where each image appears in the text sequence.
Resolution-adaptive processing is the key technique for handling high-resolution input. Dynamic padding preserves the original aspect ratio, zero-padding images of different sizes into a uniform batch. Image partitioning splits a high-resolution image into several lower-resolution tiles, encodes each tile independently, and concatenates the results to retain fine-grained visual detail:
$$V_{\text{global}} = \text{ViT}(\text{Resize}(x_{\text{img}}, 224 \times 224))$$
$$\{V_{\text{local}}^{i}\}_{i=1}^{M} = \text{ViT}(\text{Crop}(x_{\text{img}}, M\ \text{patches}))$$
$$V_{\text{fused}} = \text{Concat}([V_{\text{global}}, V_{\text{local}}^{1}, \ldots, V_{\text{local}}^{M}])$$
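The fusion of a resized global view with independently encoded tiles can be sketched as follows. Here `vit` is a stand-in callable returning `[N_tok, d]` token features per image, and the 2x2 grid is an illustrative choice, not a fixed part of any model:

```python
import torch
import torch.nn.functional as F

def encode_highres(img: torch.Tensor, vit, grid: int = 2, tile: int = 224) -> torch.Tensor:
    # img: [3, H, W] with H = W = grid * tile
    small = F.interpolate(img.unsqueeze(0), size=(tile, tile), mode="bilinear",
                          align_corners=False).squeeze(0)
    feats = [vit(small)]                               # V_global from the resized image
    for i in range(grid):
        for j in range(grid):                          # V_local from each crop
            feats.append(vit(img[:, i*tile:(i+1)*tile, j*tile:(j+1)*tile]))
    return torch.cat(feats, dim=0)                     # Concat([V_global, V_local^1, ...])

fake_vit = lambda x: torch.ones(4, 8)                  # stub encoder: 4 tokens of dim 8
fused = encode_highres(torch.randn(3, 448, 448), fake_vit)
```

With a 2x2 grid, the fused sequence holds tokens from one global view plus four local tiles.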
The LLaVA-1.5 architecture uses ViT-L/14 or ViT-L/14-336 as the vision encoder, connected to a Vicuna or LLaMA-2 language model through a two-layer MLP projection. Training has two stages: stage one freezes the LLM and trains only the projection layer for an initial vision-language alignment; stage two unfreezes the LLM and jointly optimizes the projection layer and the language model to improve multimodal instruction following.
InstructBLIP builds on BLIP-2 with an instruction-aware Q-Former. The query embeddings encode not only visual information: instruction features extracted by the text encoder condition the visual feature extraction. This instruction-aware visual encoding lets the model attend to different image regions depending on the question, improving visual question answering accuracy.
Multi-image dialogue is enabled by extending the context window and the position encoding. Each image is encoded independently, and its visual tokens are placed in the input sequence in conversation order. Improved RoPE position encodings or the ALiBi mechanism handle long-sequence dependencies, supporting in-context understanding of up to 4-8 images.
Part II: Structured Pseudocode
Algorithm 1: CLIP Dual-Encoder Contrastive Learning
algorithm
\begin{algorithm}
\caption{CLIP Contrastive Pretraining with Dual Encoders}
\begin{algorithmic}[1]
\Require Image encoder $E_v$, Text encoder $E_t$, Temperature $\tau$
\Require Dataset $\mathcal{D} = \{(x_i^{img}, x_i^{text})\}_{i=1}^N$
\State Initialize $\tau \gets 0.07$ (learnable)
\For{each batch $\mathcal{B} = \{(x_i^{img}, x_i^{text})\}_{i=1}^{B}$}
\State // Encode modalities
\For{$i \gets 1$ \textbf{to} $B$}
\State $v_i \gets \text{Normalize}(E_v(x_i^{img})) \in \mathbb{R}^d$
\State $t_i \gets \text{Normalize}(E_t(x_i^{text})) \in \mathbb{R}^d$
\EndFor
\State // Compute similarity matrix
\State $S \gets [v_i \cdot t_j]_{i,j=1}^{B} \in \mathbb{R}^{B \times B}$
\State $S \gets S / \tau$
\State // Symmetric InfoNCE loss
\State $\mathcal{L}_{i2t} \gets -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(S_{i,i})}{\sum_{j=1}^{B} \exp(S_{i,j})}$
\State $\mathcal{L}_{t2i} \gets -\frac{1}{B} \sum_{j=1}^{B} \log \frac{\exp(S_{j,j})}{\sum_{i=1}^{B} \exp(S_{i,j})}$
\State $\mathcal{L}_{\text{total}} \gets \frac{1}{2}(\mathcal{L}_{i2t} + \mathcal{L}_{t2i})$
\State // Backpropagation
\State $\theta \gets \theta - \eta \nabla_{\theta} \mathcal{L}_{\text{total}}$
\State $\tau \gets \tau - \eta_{\tau} \nabla_{\tau} \mathcal{L}_{\text{total}}$
\EndFor
\State \Return Trained encoders $E_v, E_t$, learned $\tau$
\end{algorithmic}
\end{algorithm}
Algorithm 2: Two-Stage Q-Former Pretraining
algorithm
\begin{algorithm}
\caption{Q-Former Two-Stage Pretraining for BLIP-2}
\begin{algorithmic}[1]
\Require Frozen image encoder $E_{\text{img}}$ (ViT-G/14), Frozen LLM $\mathcal{M}_{\text{llm}}$
\Require Learnable queries $Q \in \mathbb{R}^{N_q \times d}$, Q-Former $\mathcal{Q}$, Projection layer $P$
\Require Dataset $\mathcal{D} = \{(x^{\text{img}}, x^{\text{text}})\}$
\State // Stage 1: Representation Learning
\For{each batch $(X^{\text{img}}, X^{\text{text}})$}
\State $Z \gets E_{\text{img}}(X^{\text{img}}) \in \mathbb{R}^{B \times N_p \times d_v}$ \Comment{Freeze}
\State // Image-Text Contrastive Learning (ITC)
\State $Q_{\text{out}} \gets \mathcal{Q}(Q, Z)$ \Comment{Cross-attention with queries}
\State $\mathcal{L}_{\text{itc}} \gets \text{InfoNCE}(Q_{\text{out}}, E_{\text{text}}(X^{\text{text}}))$
\State // Image-Text Matching (ITM)
\State $s_{\text{match}} \gets \text{BinaryClassifier}(\text{Concat}(Q_{\text{out}}, X^{\text{text}}))$
\State $\mathcal{L}_{\text{itm}} \gets \text{BCE}(s_{\text{match}}, y_{\text{match}})$
\State // Image-Grounded Text Generation (ITG)
\State $\hat{X}^{\text{text}} \gets \text{Decoder}(\mathcal{Q}(Q, Z))$
\State $\mathcal{L}_{\text{itg}} \gets -\sum_{t} \log P(x_t | x_{<t}, Q_{\text{out}})$
\State $\mathcal{L}_{\text{stage1}} \gets \mathcal{L}_{\text{itc}} + \mathcal{L}_{\text{itm}} + \mathcal{L}_{\text{itg}}$
\State Update $\mathcal{Q}$ parameters only
\EndFor
\State // Stage 2: Generative Learning with Frozen LLM
\For{each batch $(X^{\text{img}}, X^{\text{text}})$}
\State $Z \gets E_{\text{img}}(X^{\text{img}})$ \Comment{Freeze}
\State $Q_{\text{out}} \gets \mathcal{Q}(Q, Z)$
\State $H_{\text{visual}} \gets P(Q_{\text{out}}) \in \mathbb{R}^{B \times N_q \times d_{\text{llm}}}$
\State // Prefix visual embeddings to LLM
\State $X^{\text{input}} \gets [H_{\text{visual}}; \text{Embed}(X^{\text{text}})]$
\State $\mathcal{L}_{\text{lm}} \gets -\sum_{t} \log \mathcal{M}_{\text{llm}}(x_t | x_{<t}, X^{\text{input}})$
\State Update $\mathcal{Q}, P$ parameters only
\EndFor
\State \Return Trained Q-Former $\mathcal{Q}$, Projection $P$
\end{algorithmic}
\end{algorithm}
Algorithm 3: Projection-Layer Forward-Pass Comparison
algorithm
\begin{algorithm}
\caption{Projection Layer Architectures: Linear vs MLP vs C-Abstractor}
\begin{algorithmic}[1]
\Require Input visual features $V \in \mathbb{R}^{B \times N \times d_{\text{in}}}$
\Require Target dimension $d_{\text{out}}$, Hidden dim $d_{\text{hidden}} = 4 \times d_{\text{out}}$
\Function{LinearProjection}{$V, W$}
\State $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$
\State \Return $V \cdot W$ \Comment{Simple matrix multiplication}
\EndFunction
\Function{MLPAdapter}{$V, W_1, W_2, b_1, b_2$}
\State $W_1 \in \mathbb{R}^{d_{\text{in}} \times d_{\text{hidden}}}$, $W_2 \in \mathbb{R}^{d_{\text{hidden}} \times d_{\text{out}}}$
\State $H \gets \text{GELU}(V \cdot W_1 + b_1)$
\State $H \gets \text{Dropout}(H, p=0.1)$
\State \Return $H \cdot W_2 + b_2$
\EndFunction
\Function{CAbstractor}{$V, Q_{\text{query}}, W_Q, W_K, W_V, W_{\text{proj}}$}
\State $Q_{\text{query}} \in \mathbb{R}^{N_q \times d_k}$ \Comment{Learnable queries}
\State // Cross-attention compression
\State $A \gets \text{softmax}\left(\frac{Q_{\text{query}} W_Q (V W_K)^T}{\sqrt{d_k}}\right)$
\State $H_{\text{compressed}} \gets A \cdot (V W_V) \in \mathbb{R}^{N_q \times d_v}$
\State // Self-attention refinement
\State $H_{\text{refined}} \gets \text{SelfAttention}(H_{\text{compressed}})$
\State \Return $H_{\text{refined}} \cdot W_{\text{proj}}$
\EndFunction
\State // Usage in VLM pipeline
\State $V_{\text{patches}} \gets \text{ViT}(x_{\text{img}}) \in \mathbb{R}^{N_p \times d_{\text{vit}}}$
\State // Method selection based on architecture
\If{architecture == "linear"}
\State $H_{\text{llm}} \gets \text{LinearProjection}(V_{\text{patches}}, W)$
\ElsIf{architecture == "mlp\_adapter"}
\State $H_{\text{llm}} \gets \text{MLPAdapter}(V_{\text{patches}}, W_1, W_2, b_1, b_2)$
\ElsIf{architecture == "c\_abstractor"}
\State $H_{\text{llm}} \gets \text{CAbstractor}(V_{\text{patches}}, Q_{\text{query}}, \ldots)$
\EndIf
\State \Return $H_{\text{llm}}$ \Comment{Ready for LLM consumption}
\end{algorithmic}
\end{algorithm}
Algorithm 4: Multimodal Instruction Tuning with Multi-Image Processing
algorithm
\begin{algorithm}
\caption{Multimodal Instruction Tuning with Interleaved Image-Text Conversations}
\begin{algorithmic}[1]
\Require Vision encoder $E_v$, Projection layer $\pi$, LLM $\mathcal{M}$
\Require Instruction dataset $\mathcal{D} = \{(I_1, \ldots, I_m, \text{Instr}, \text{Resp})\}$
\Require Special tokens: $\text{[IMG]}$ placeholder, $\text{[EOI]}$ end-of-image
\Function{ProcessMultiImage}{$I_1, \ldots, I_m, \text{Instr}$}
\State $V_{\text{all}} \gets []$
\For{$i \gets 1$ \textbf{to} $m$}
\State // Dynamic padding or partitioning
\If{$\text{resolution}(I_i) > 336 \times 336$}
\State $\{I_i^j\}_{j=1}^K \gets \text{Partition}(I_i, K \text{ patches})$
\State $V_{\text{local}} \gets \text{Concat}_{j=1}^K(E_v(I_i^j))$
\State $V_{\text{global}} \gets E_v(\text{Resize}(I_i, 336 \times 336))$
\State $V_i \gets \text{Concat}(V_{\text{global}}, V_{\text{local}})$
\Else
\State $V_i \gets E_v(I_i)$
\EndIf
\State $V_i \gets \pi(V_i)$ \Comment{Project to LLM space}
\State $V_{\text{all}}.\text{append}(V_i)$
\EndFor
\State // Interleave visual tokens with text
\State $T_{\text{processed}} \gets \text{Replace}(\text{Instr}, \text{"<image>"}, V_{\text{all}})$
\State \Return $T_{\text{processed}}$
\EndFunction
\For{each sample $(\{I_k\}, \text{Instr}, \text{Resp})$}
\State $X_{\text{input}} \gets \text{ProcessMultiImage}(\{I_k\}, \text{Instr})$
\State // Teacher forcing for autoregressive generation
\State $X_{\text{full}} \gets [X_{\text{input}}; \text{Embed}(\text{Resp})]$
\State $\mathcal{L} \gets -\sum_{t=|\text{Instr}|+1}^{|\text{Instr}|+|\text{Resp}|} \log \mathcal{M}(x_t | x_{<t})$
\State // Gradient update (stage 2: unfreeze LLM)
\State $\theta_{\pi}, \theta_{\mathcal{M}} \gets \text{AdamW}(\nabla_{\theta} \mathcal{L})$
\EndFor
\Function{Inference}{$I_{\text{query}}, \text{Instr}_{\text{new}}, \mathcal{M}$}
\State $X \gets \text{ProcessMultiImage}([I_{\text{query}}], \text{Instr}_{\text{new}})$
\State $\text{Response} \gets \emptyset$
\For{$t \gets 1$ \textbf{to} $\text{max\_length}$}
\State $p_t \gets \mathcal{M}(\cdot | X, \text{Response})$
\State $x_t \gets \arg\max_{v \in \mathcal{V}} p_t(v)$
\State $\text{Response}.\text{append}(x_t)$
\If{$x_t == \text{[EOS]}$}
\State \textbf{break}
\EndIf
\EndFor
\State \Return $\text{Response}$
\EndFunction
\end{algorithmic}
\end{algorithm}
Part III: Python Implementations
Script 1: CLIP Dual-Encoder Training and Zero-Shot Classification System
Python
"""
Script 1: CLIP Dual Encoder Training with COCO Captions & Zero-Shot ImageNet Evaluation
============================================================================================
This script implements CLIP (Contrastive Language-Image Pretraining) with:
- Dual encoder architecture (Image Encoder + Text Encoder)
- InfoNCE loss with learnable temperature parameter
- Training on COCO Captions dataset
- Zero-shot evaluation on ImageNet-1K with accuracy visualization
Usage:
1. Prepare COCO Captions dataset in ./data/coco/
2. Prepare ImageNet-1K validation set in ./data/imagenet/val/
3. Run: python clip_training.py --batch_size 256 --epochs 30 --lr 1e-4
Dependencies: torch, torchvision, transformers, datasets, pillow, tqdm, numpy, matplotlib
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, models
from transformers import BertTokenizer, BertModel
from datasets import load_dataset
from PIL import Image
import os
import json
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
from typing import List, Tuple, Optional
# =============================================================================
# 1. Architecture Components
# =============================================================================
class ImageEncoder(nn.Module):
"""
Image Encoder based on ResNet50 or ViT.
Extracts global image features and projects to shared embedding space.
"""
def __init__(self, embed_dim: int = 512, backbone: str = "resnet50"):
super().__init__()
self.embed_dim = embed_dim
if backbone == "resnet50":
    # ResNet50 backbone; drop the final FC to get pooled 2048-d features
    resnet = models.resnet50(weights="IMAGENET1K_V1")
    self.backbone = nn.Sequential(*list(resnet.children())[:-1])
    self.projection = nn.Linear(2048, embed_dim)
elif backbone == "vit":
    # ViT-B/16 backbone; replace the classifier head with Identity so the
    # forward pass returns the 768-d CLS representation, not class logits
    from torchvision.models import vit_b_16
    vit = vit_b_16(weights="IMAGENET1K_V1")
    vit.heads = nn.Identity()
    self.backbone = vit
    self.projection = nn.Linear(768, embed_dim)
else:
    raise ValueError(f"Unsupported backbone: {backbone}")
self.norm = nn.LayerNorm(embed_dim)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Args:
x: Input images [B, 3, H, W]
Returns:
Normalized image embeddings [B, embed_dim]
"""
# ResNet yields [B, 2048, 1, 1]; ViT (with Identity head) yields [B, 768]
features = self.backbone(x)
features = features.flatten(1)
projected = self.projection(features)
normalized = F.normalize(self.norm(projected), dim=-1)
return normalized
class TextEncoder(nn.Module):
"""
Text Encoder based on Transformer (BERT).
Encodes text descriptions into shared embedding space.
"""
def __init__(self, embed_dim: int = 512, model_name: str = "bert-base-uncased"):
super().__init__()
self.tokenizer = BertTokenizer.from_pretrained(model_name)
self.transformer = BertModel.from_pretrained(model_name)
self.projection = nn.Linear(768, embed_dim)
self.norm = nn.LayerNorm(embed_dim)
def forward(self, text: List[str]) -> torch.Tensor:
"""
Args:
text: List of text descriptions
Returns:
Normalized text embeddings [B, embed_dim]
"""
# Tokenize
encoded = self.tokenizer(
text,
padding=True,
truncation=True,
max_length=77,
return_tensors="pt"
).to(next(self.parameters()).device)
# Get CLS token representation
outputs = self.transformer(**encoded)
cls_output = outputs.last_hidden_state[:, 0, :] # [B, 768]
projected = self.projection(cls_output)
normalized = F.normalize(self.norm(projected), dim=-1)
return normalized
class CLIPModel(nn.Module):
"""
Complete CLIP model with dual encoders and contrastive learning.
"""
def __init__(self, embed_dim: int = 512, temperature_init: float = 0.07):
super().__init__()
self.image_encoder = ImageEncoder(embed_dim=embed_dim, backbone="resnet50")
self.text_encoder = TextEncoder(embed_dim=embed_dim)
# Learnable temperature parameter (log scale for stability)
self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1.0 / temperature_init))
def forward(self, images: torch.Tensor, texts: List[str]) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""
Args:
images: Batch of images [B, 3, H, W]
texts: List of text descriptions
Returns:
image_features: [B, embed_dim]
text_features: [B, embed_dim]
logit_scale: scalar
"""
image_features = self.image_encoder(images)
text_features = self.text_encoder(texts)
# Constrain temperature
logit_scale = torch.clamp(self.logit_scale.exp(), min=0.001, max=100.0)
return image_features, text_features, logit_scale
def compute_loss(self, image_features: torch.Tensor, text_features: torch.Tensor, logit_scale: torch.Tensor) -> torch.Tensor:
"""
Compute symmetric InfoNCE loss (contrastive loss).
"""
batch_size = image_features.shape[0]
# Compute similarity matrix [B, B]
logits = logit_scale * (image_features @ text_features.t())
# Labels are diagonal (positive pairs)
labels = torch.arange(batch_size, device=logits.device)
# Symmetric loss: image-to-text + text-to-image
loss_i2t = F.cross_entropy(logits, labels)
loss_t2i = F.cross_entropy(logits.t(), labels)
loss = (loss_i2t + loss_t2i) / 2
return loss
# =============================================================================
# 2. Dataset Implementation
# =============================================================================
class COCOCaptionsDataset(Dataset):
"""
COCO Captions dataset loader for CLIP training.
Returns (image, caption) pairs.
"""
def __init__(self, root_dir: str, split: str = "train", transform=None):
self.root_dir = root_dir
self.split = split
self.transform = transform or self._default_transform()
# Load COCO annotations
ann_file = os.path.join(root_dir, f"annotations/captions_{split}2017.json")
with open(ann_file, 'r') as f:
self.coco_data = json.load(f)
# Build image_id to filename mapping
self.id_to_filename = {
img['id']: img['file_name']
for img in self.coco_data['images']
}
# Create list of (image_path, caption) pairs
self.samples = []
for ann in self.coco_data['annotations']:
img_id = ann['image_id']
caption = ann['caption']
filename = self.id_to_filename[img_id]
img_path = os.path.join(root_dir, f"{split}2017", filename)
self.samples.append((img_path, caption))
def _default_transform(self):
return transforms.Compose([
transforms.Resize(256),
transforms.RandomCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
)
])
def __len__(self):
return len(self.samples)
def __getitem__(self, idx):
img_path, caption = self.samples[idx]
try:
image = Image.open(img_path).convert('RGB')
except Exception as e:
# Return a placeholder if image loading fails
print(f"Error loading {img_path}: {e}")
image = Image.new('RGB', (224, 224), color='black')
if self.transform:
image = self.transform(image)
return image, caption
# =============================================================================
# 3. Zero-Shot Evaluation on ImageNet
# =============================================================================
class ZeroShotImageNetEvaluator:
"""
Zero-shot classification on ImageNet-1K using CLIP.
Constructs prompt templates for each class and computes similarity.
"""
def __init__(self, model: CLIPModel, device: str = "cuda", templates: Optional[List[str]] = None):
self.model = model
self.device = device
# Default prompt templates (80 templates from CLIP paper)
self.templates = templates or [
"a photo of a {}.",
"a blurry photo of a {}.",
"a black and white photo of a {}.",
"a low resolution photo of a {}.",
"a photo of a small {}.",
"a photo of a big {}.",
"a photo of the {}.",
]
# Load ImageNet class names
self.class_names = self._load_imagenet_classes()
def _load_imagenet_classes(self) -> List[str]:
"""
Load ImageNet 1000 class names.
"""
# Placeholder - in practice, load from imagenet_class_index.json
# For demonstration, using dummy classes
return [f"class_{i}" for i in range(1000)]
def _encode_text_classes(self) -> torch.Tensor:
"""
Encode all ImageNet classes with prompt templates.
Returns text features [1000, embed_dim].
"""
text_features_list = []
print("Encoding ImageNet classes with prompt templates...")
for class_name in tqdm(self.class_names):
# Generate prompts for this class
texts = [template.format(class_name) for template in self.templates]
# Encode all templates and average
with torch.no_grad():
class_features = self.model.text_encoder(texts)
class_features = class_features.mean(dim=0) # Average over templates
class_features = F.normalize(class_features.unsqueeze(0), dim=-1)
text_features_list.append(class_features)
text_features = torch.cat(text_features_list, dim=0) # [1000, embed_dim]
return text_features
def evaluate(self, dataloader: DataLoader) -> Tuple[float, dict]:
"""
Run zero-shot classification and compute accuracy.
Returns:
top1_acc: Top-1 accuracy percentage
metrics: Dictionary containing detailed metrics
"""
self.model.eval()
text_features = self._encode_text_classes().to(self.device)
all_predictions = []
all_top5 = []
all_targets = []
print("Evaluating on ImageNet validation set...")
with torch.no_grad():
    for images, targets in tqdm(dataloader):
        images = images.to(self.device)
        # Encode images
        image_features = self.model.image_encoder(images)
        # Compute similarity with all classes
        similarity = 100.0 * (image_features @ text_features.t())  # [B, 1000]
        # Accumulate top-1 and top-5 predictions per batch
        predictions = similarity.argmax(dim=1)
        top5 = similarity.topk(5, dim=1).indices
        all_predictions.extend(predictions.cpu().numpy())
        all_top5.append(top5.cpu().numpy())
        all_targets.extend(targets.numpy())
# Compute accuracy over the full validation set
all_predictions = np.array(all_predictions)
all_targets = np.array(all_targets)
top1_acc = (all_predictions == all_targets).mean() * 100
top5_preds = np.concatenate(all_top5, axis=0)  # [N, 5]
top5_acc = np.any(top5_preds == all_targets[:, None], axis=1).mean() * 100
metrics = {
"top1_accuracy": top1_acc,
"top5_accuracy": top5_acc,
"num_samples": len(all_targets)
}
return top1_acc, metrics
def visualize_confusion_matrix(self, dataloader: DataLoader, num_classes: int = 10, save_path: str = "confusion_matrix.png"):
"""
Visualize confusion matrix for subset of classes.
"""
from sklearn.metrics import confusion_matrix
import seaborn as sns
self.model.eval()
text_features = self._encode_text_classes().to(self.device)
all_preds = []
all_targets = []
with torch.no_grad():
for images, targets in dataloader:
images = images.to(self.device)
image_features = self.model.image_encoder(images)
similarity = image_features @ text_features.t()
preds = similarity.argmax(dim=1)
all_preds.extend(preds.cpu().numpy())
all_targets.extend(targets.numpy())
# Compute confusion matrix for first num_classes
cm = confusion_matrix(all_targets, all_preds, labels=list(range(num_classes)))
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'Confusion Matrix (Top {num_classes} Classes)')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.savefig(save_path, dpi=300)
print(f"Confusion matrix saved to {save_path}")
# =============================================================================
# 4. Training Loop with Visualization
# =============================================================================
class CLIPTrainer:
"""
Complete training pipeline for CLIP with logging and visualization.
"""
def __init__(self, model: CLIPModel, device: str = "cuda", lr: float = 1e-4, weight_decay: float = 0.2):
self.model = model.to(device)
self.device = device
# Optimizer with weight decay
params = [
{"params": model.image_encoder.parameters(), "lr": lr},
{"params": model.text_encoder.parameters(), "lr": lr},
{"params": [model.logit_scale], "lr": lr * 10, "weight_decay": 0} # Higher LR for temperature
]
self.optimizer = torch.optim.AdamW(params, weight_decay=weight_decay, betas=(0.9, 0.98), eps=1e-6)
self.scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(self.optimizer, T_max=100)
self.train_losses = []
self.val_accuracies = []
def train_epoch(self, dataloader: DataLoader) -> float:
self.model.train()
total_loss = 0.0
num_batches = 0
pbar = tqdm(dataloader, desc="Training")
for images, texts in pbar:
images = images.to(self.device)
self.optimizer.zero_grad()
# Forward pass
image_features, text_features, logit_scale = self.model(images, texts)
# Compute loss
loss = self.model.compute_loss(image_features, text_features, logit_scale)
# Backward pass
loss.backward()
# Gradient clipping
torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
self.optimizer.step()
total_loss += loss.item()
num_batches += 1
pbar.set_postfix({
"loss": loss.item(),
"temp": logit_scale.item()
})
avg_loss = total_loss / num_batches
self.train_losses.append(avg_loss)
return avg_loss
def validate(self, evaluator: ZeroShotImageNetEvaluator, dataloader: DataLoader) -> Tuple[float, dict]:
acc, metrics = evaluator.evaluate(dataloader)
self.val_accuracies.append(acc)
return acc, metrics
def plot_training_curves(self, save_path: str = "training_curves.png"):
"""
Plot training loss and validation accuracy curves.
"""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# Loss curve
ax1.plot(self.train_losses, 'b-', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Training Loss')
ax1.set_title('Training Loss Over Time')
ax1.grid(True, alpha=0.3)
# Accuracy curve
if self.val_accuracies:
ax2.plot(self.val_accuracies, 'r-', marker='o', linewidth=2)
ax2.axhline(y=40.0, color='g', linestyle='--', label='Target 40%')
ax2.set_xlabel('Validation Step')
ax2.set_ylabel('ImageNet Top-1 Accuracy (%)')
ax2.set_title('Zero-Shot Classification Accuracy')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(save_path, dpi=300)
print(f"Training curves saved to {save_path}")
def save_checkpoint(self, path: str, epoch: int, metrics: dict):
checkpoint = {
'epoch': epoch,
'model_state_dict': self.model.state_dict(),
'optimizer_state_dict': self.optimizer.state_dict(),
'logit_scale': self.model.logit_scale.item(),
'metrics': metrics
}
torch.save(checkpoint, path)
print(f"Checkpoint saved to {path}")
# =============================================================================
# 5. Main Execution
# =============================================================================
def main():
import argparse
parser = argparse.ArgumentParser(description="Train CLIP on COCO Captions")
parser.add_argument("--batch_size", type=int, default=256, help="Batch size for training")
parser.add_argument("--epochs", type=int, default=30, help="Number of training epochs")
parser.add_argument("--lr", type=float, default=1e-4, help="Learning rate")
parser.add_argument("--embed_dim", type=int, default=512, help="Embedding dimension")
parser.add_argument("--data_dir", type=str, default="./data/coco", help="COCO dataset directory")
parser.add_argument("--imagenet_dir", type=str, default="./data/imagenet", help="ImageNet directory for eval")
args = parser.parse_args()
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Initialize model
model = CLIPModel(embed_dim=args.embed_dim)
print(f"Model initialized with {sum(p.numel() for p in model.parameters())/1e6:.2f}M parameters")
# Setup datasets
train_dataset = COCOCaptionsDataset(args.data_dir, split="train")
train_loader = DataLoader(
train_dataset,
batch_size=args.batch_size,
shuffle=True,
num_workers=4,
pin_memory=True,
drop_last=True
)
# Initialize trainer
trainer = CLIPTrainer(model, device=device, lr=args.lr)
# Training loop
best_acc = 0.0
for epoch in range(args.epochs):
print(f"\n=== Epoch {epoch+1}/{args.epochs} ===")
# Train
avg_loss = trainer.train_epoch(train_loader)
print(f"Average training loss: {avg_loss:.4f}, Temperature: {model.logit_scale.exp().item():.4f}")
# Validate every 5 epochs (if ImageNet available)
if (epoch + 1) % 5 == 0 and os.path.exists(args.imagenet_dir):
from torchvision.datasets import ImageNet
val_transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
imagenet_val = ImageNet(args.imagenet_dir, split="val", transform=val_transform)
val_loader = DataLoader(imagenet_val, batch_size=128, shuffle=False, num_workers=4)
evaluator = ZeroShotImageNetEvaluator(model, device=device)
acc, metrics = trainer.validate(evaluator, val_loader)
print(f"ImageNet Zero-Shot Top-1 Accuracy: {acc:.2f}%")
print(f"ImageNet Zero-Shot Top-5 Accuracy: {metrics['top5_accuracy']:.2f}%")
# Save best model
if acc > best_acc:
best_acc = acc
trainer.save_checkpoint(f"clip_best_epoch{epoch+1}.pt", epoch+1, metrics)
if acc > 40.0:
print("✓ Achieved target accuracy > 40%!")
# Final visualizations
trainer.plot_training_curves("clip_training_curves.png")
print(f"\nTraining completed. Best accuracy: {best_acc:.2f}%")
if __name__ == "__main__":
main()
Script 2: BLIP-2 Q-Former Visual Question Answering System
Python
"""
Script 2: BLIP-2 Q-Former Implementation for Visual Question Answering
============================================================================
This script implements BLIP-2's Q-Former architecture with:
- Frozen ViT-G/14 image encoder
- Query Transformer with learnable query tokens
- Frozen Flan-T5 language model
- Two-stage pretraining (Representation + Generative)
- VQA v2 evaluation with accuracy > 70%
Usage:
1. Prepare VQA v2 dataset in ./data/vqa/
2. Run: python blip2_qformer.py --stage pretrain --epochs 10
3. Run: python blip2_qformer.py --stage finetune --epochs 5 --checkpoint pretrain_final.pt
Dependencies: torch, torchvision, transformers, datasets, tqdm, numpy, matplotlib, pillow
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from transformers import (
T5ForConditionalGeneration,
T5Tokenizer,
ViTModel,
ViTConfig,
AutoTokenizer
)
from typing import Optional, Tuple, List, Dict
import json
import os
from PIL import Image
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
# =============================================================================
# 1. Q-Former Architecture Components
# =============================================================================
class QFormerAttention(nn.Module):
"""
Multi-head attention with cross-attention capability for Q-Former.
Supports self-attention (queries attend to themselves) and cross-attention
(queries attend to image features).
"""
def __init__(self, dim: int, num_heads: int = 8, dropout: float = 0.1):
super().__init__()
self.num_heads = num_heads
self.head_dim = dim // num_heads
self.scale = self.head_dim ** -0.5
self.q_proj = nn.Linear(dim, dim)
self.k_proj = nn.Linear(dim, dim)
self.v_proj = nn.Linear(dim, dim)
self.out_proj = nn.Linear(dim, dim)
self.dropout = nn.Dropout(dropout)
def forward(self, queries, kv=None, attention_mask=None):
"""
Args:
queries: [B, N_q, D] (learnable queries)
kv: [B, N_kv, D] (image features for cross-attention)
attention_mask: [B, N_q, N_kv]
"""
batch_size, n_queries = queries.shape[:2]
# If kv is None, self-attention
if kv is None:
kv = queries
# Project to Q, K, V
Q = self.q_proj(queries).view(batch_size, n_queries, self.num_heads, self.head_dim).transpose(1, 2)
K = self.k_proj(kv).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
V = self.v_proj(kv).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
# Attention scores
scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale # [B, H, N_q, N_kv]
if attention_mask is not None:
scores = scores.masked_fill(attention_mask.unsqueeze(1) == 0, float('-inf'))
attn_weights = F.softmax(scores, dim=-1)
attn_weights = self.dropout(attn_weights)
# Apply attention to values
attn_output = torch.matmul(attn_weights, V) # [B, H, N_q, D_head]
attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, n_queries, -1)
output = self.out_proj(attn_output)
return output, attn_weights
class QFormerLayer(nn.Module):
"""
Single layer of Q-Former combining self-attention and cross-attention.
Architecture: Self-Attn -> Cross-Attn -> FFN with residual connections.
"""
def __init__(self, dim: int, num_heads: int = 8, dropout: float = 0.1, drop_path: float = 0.0):
super().__init__()
# Self-attention (queries attend to queries)
self.self_attn = QFormerAttention(dim, num_heads, dropout)
self.self_attn_norm = nn.LayerNorm(dim)
# Cross-attention (queries attend to image features)
self.cross_attn = QFormerAttention(dim, num_heads, dropout)
self.cross_attn_norm = nn.LayerNorm(dim)
self.query_norm = nn.LayerNorm(dim)
# Feed-forward network
self.ffn = nn.Sequential(
nn.Linear(dim, dim * 4),
nn.GELU(),
nn.Dropout(dropout),
nn.Linear(dim * 4, dim),
nn.Dropout(dropout)
)
self.ffn_norm = nn.LayerNorm(dim)
        self.drop_path = nn.Dropout(drop_path) if drop_path > 0 else nn.Identity()  # simple stand-in; true DropPath (stochastic depth) drops whole residual branches
def forward(self, queries, image_features, self_attn_mask=None, cross_attn_mask=None):
"""
Args:
queries: [B, N_q, D] - learnable query embeddings
image_features: [B, N_patches, D] - frozen image encoder output
"""
# Self-attention on queries
residual = queries
queries = self.self_attn_norm(queries)
attn_out, _ = self.self_attn(queries, kv=None, attention_mask=self_attn_mask)
queries = residual + self.drop_path(attn_out)
# Cross-attention to image features
residual = queries
queries = self.cross_attn_norm(queries)
        image_features = self.query_norm(image_features)  # despite the name, this normalizes the image (key/value) features
attn_out, _ = self.cross_attn(queries, kv=image_features, attention_mask=cross_attn_mask)
queries = residual + self.drop_path(attn_out)
# FFN
residual = queries
queries = self.ffn_norm(queries)
ffn_out = self.ffn(queries)
queries = residual + self.drop_path(ffn_out)
return queries
class QFormer(nn.Module):
"""
Querying Transformer (Q-Former) that compresses image features into
fixed number of query tokens through alternating self- and cross-attention.
"""
def __init__(
self,
num_queries: int = 32,
hidden_dim: int = 768,
num_layers: int = 12,
num_heads: int = 12,
dropout: float = 0.1
):
super().__init__()
self.num_queries = num_queries
self.hidden_dim = hidden_dim
# Learnable query tokens (the key innovation of Q-Former)
self.query_tokens = nn.Parameter(torch.randn(1, num_queries, hidden_dim))
nn.init.normal_(self.query_tokens, std=0.02)
# Transformer layers
self.layers = nn.ModuleList([
QFormerLayer(hidden_dim, num_heads, dropout)
for _ in range(num_layers)
])
self.norm = nn.LayerNorm(hidden_dim)
def forward(self, image_features):
"""
Args:
image_features: [B, N_patches, D] from frozen image encoder
Returns:
query_output: [B, N_queries, D] condensed visual representation
"""
batch_size = image_features.shape[0]
# Expand queries for batch
queries = self.query_tokens.expand(batch_size, -1, -1)
# Pass through layers
for layer in self.layers:
queries = layer(queries, image_features)
queries = self.norm(queries)
return queries
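# -----------------------------------------------------------------------------
# Optional sanity check (an illustrative sketch, not part of BLIP-2): the key
# property of Q-Former is that a FIXED set of learnable queries cross-attends
# to a VARIABLE number of patch features, so the output always has the same
# length. Plain nn.MultiheadAttention stands in for the layers defined above.
# -----------------------------------------------------------------------------
def _query_compression_demo():
    import torch
    import torch.nn as nn
    d, n_queries = 64, 32
    queries = nn.Parameter(torch.randn(1, n_queries, d))
    cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
    shapes = []
    for n_patches in (196, 576):  # 224px and 384px ViT-16 patch grids
        patches = torch.randn(2, n_patches, d)           # [B, N_patches, D]
        out, _ = cross_attn(queries.expand(2, -1, -1), patches, patches)
        shapes.append(tuple(out.shape))                  # always (2, 32, 64)
    return shapes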
# =============================================================================
# 2. BLIP-2 Model with Frozen Encoders
# =============================================================================
class BLIP2Model(nn.Module):
"""
Complete BLIP-2 model combining:
- Frozen ViT-G/14 image encoder
- Trainable Q-Former
- Projection layer to LLM space
- Frozen Flan-T5 text decoder
"""
def __init__(
self,
image_encoder_name: str = "google/vit-large-patch16-224",
llm_name: str = "google/flan-t5-base",
num_queries: int = 32,
qformer_hidden_dim: int = 768,
freeze_image_encoder: bool = True,
freeze_llm: bool = True
):
super().__init__()
# Image encoder (ViT-G/14 or ViT-Large)
print(f"Loading image encoder: {image_encoder_name}")
self.image_encoder = ViTModel.from_pretrained(image_encoder_name)
self.image_dim = self.image_encoder.config.hidden_size
if freeze_image_encoder:
for param in self.image_encoder.parameters():
param.requires_grad = False
self.image_encoder.eval()
print("Image encoder frozen.")
# Q-Former
self.qformer = QFormer(
num_queries=num_queries,
hidden_dim=qformer_hidden_dim,
num_layers=12,
num_heads=12
)
        # Language model (Flan-T5)
        print(f"Loading LLM: {llm_name}")
        self.llm = T5ForConditionalGeneration.from_pretrained(llm_name)
        self.tokenizer = AutoTokenizer.from_pretrained(llm_name)
        if freeze_llm:
            for param in self.llm.parameters():
                param.requires_grad = False
            self.llm.eval()
            print("LLM frozen.")
        # Projection layer to LLM embedding space (read the hidden size from
        # the LLM config instead of hard-coding 768)
        self.llm_dim = self.llm.config.d_model
        self.proj_to_llm = nn.Linear(qformer_hidden_dim, self.llm_dim)
# Query tokenizer for text encoding in Q-Former (for ITC/ITM tasks)
self.query_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def encode_image(self, images: torch.Tensor) -> torch.Tensor:
"""
Encode images through frozen ViT and Q-Former.
"""
with torch.no_grad() if not self.image_encoder.training else torch.enable_grad():
vit_outputs = self.image_encoder(images)
image_features = vit_outputs.last_hidden_state # [B, N_patches+1, D]
# Q-Former compresses variable-length features to fixed queries
query_output = self.qformer(image_features) # [B, N_queries, D]
return query_output
def project_to_llm(self, query_output: torch.Tensor) -> torch.Tensor:
"""
Project Q-Former output to LLM embedding space.
"""
return self.proj_to_llm(query_output) # [B, N_queries, LLM_dim]
def generate_answer(
self,
images: torch.Tensor,
questions: List[str],
max_length: int = 50
) -> List[str]:
"""
Generate answers for visual questions.
"""
batch_size = images.shape[0]
# Encode images
query_output = self.encode_image(images) # [B, N_queries, D]
visual_embeds = self.project_to_llm(query_output) # [B, N_queries, LLM_dim]
# Encode questions
question_tokens = self.tokenizer(
questions,
padding=True,
truncation=True,
max_length=128,
return_tensors="pt"
).to(images.device)
        # Look up question token embeddings. generate() runs the T5 encoder once
        # over the combined visual + text sequence, so pre-encoding the question
        # here would push already-encoded states through the encoder a second time.
        with torch.no_grad() if not self.llm.training else torch.enable_grad():
            question_embeds = self.llm.get_input_embeddings()(question_tokens.input_ids)
        # Concatenate visual embeddings (soft prompts) with question embeddings
        combined_embeds = torch.cat([
            visual_embeds,
            question_embeds
        ], dim=1)  # [B, N_queries + seq_len, D]
# Create attention mask for combined sequence
visual_attention = torch.ones(
batch_size, self.qformer.num_queries,
device=images.device
)
combined_attention = torch.cat([
visual_attention,
question_tokens.attention_mask
], dim=1)
# Generate with LLM
outputs = self.llm.generate(
inputs_embeds=combined_embeds,
attention_mask=combined_attention,
max_length=max_length,
num_beams=5,
early_stopping=True
)
answers = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
return answers
def forward_vqa_loss(
self,
images: torch.Tensor,
questions: List[str],
answers: List[str]
) -> torch.Tensor:
"""
Compute language modeling loss for VQA training.
"""
# Encode images
query_output = self.encode_image(images)
visual_embeds = self.project_to_llm(query_output)
# Prepare inputs: Question as input, Answer as target
question_tokens = self.tokenizer(
questions,
padding=True,
truncation=True,
max_length=128,
return_tensors="pt"
).to(images.device)
# Prepare answer labels
answer_tokens = self.tokenizer(
answers,
padding=True,
truncation=True,
max_length=50,
return_tensors="pt"
).to(images.device)
        # Look up question token embeddings; the T5 encoder runs once over the
        # combined sequence inside self.llm(...), so do not pre-encode here
        question_embeds = self.llm.get_input_embeddings()(question_tokens.input_ids)
        # Concatenate visual soft prompts + question embeddings
        combined_embeds = torch.cat([visual_embeds, question_embeds], dim=1)
visual_mask = torch.ones(images.shape[0], self.qformer.num_queries, device=images.device)
combined_mask = torch.cat([visual_mask, question_tokens.attention_mask], dim=1)
# Decode with answer as target
outputs = self.llm(
inputs_embeds=combined_embeds,
attention_mask=combined_mask,
labels=answer_tokens.input_ids,
decoder_attention_mask=answer_tokens.attention_mask
)
return outputs.loss
# =============================================================================
# 3. VQA Dataset and Data Loading
# =============================================================================
class VQADataset(Dataset):
"""
VQA v2 Dataset loader.
Returns (image, question, answer) tuples.
"""
def __init__(
self,
image_dir: str,
question_file: str,
annotation_file: str,
transform=None
):
self.image_dir = image_dir
self.transform = transform or self._default_transform()
# Load questions
with open(question_file, 'r') as f:
q_data = json.load(f)
self.questions = {q['question_id']: q for q in q_data['questions']}
# Load annotations
with open(annotation_file, 'r') as f:
a_data = json.load(f)
self.annotations = {a['question_id']: a for a in a_data['annotations']}
self.qa_pairs = []
for qid in self.questions.keys():
question = self.questions[qid]['question']
image_id = self.questions[qid]['image_id']
answers = self.annotations[qid]['answers']
# Use most common answer
answer_counts = {}
for ans in answers:
ans_text = ans['answer'].lower()
answer_counts[ans_text] = answer_counts.get(ans_text, 0) + 1
best_answer = max(answer_counts.items(), key=lambda x: x[1])[0]
            # Infer the split prefix from the image directory (train2014 / val2014)
            # so the same loader works for both splits; hard-coding "train2014"
            # would yield an empty validation dataset
            split_name = os.path.basename(os.path.normpath(image_dir))
            image_name = f"COCO_{split_name}_{image_id:012d}.jpg"
            image_path = os.path.join(image_dir, image_name)
if os.path.exists(image_path):
self.qa_pairs.append({
'image_path': image_path,
'question': question,
'answer': best_answer,
'question_id': qid
})
def _default_transform(self):
return transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
)
])
def __len__(self):
return len(self.qa_pairs)
def __getitem__(self, idx):
item = self.qa_pairs[idx]
try:
image = Image.open(item['image_path']).convert('RGB')
except Exception:
image = Image.new('RGB', (224, 224), color='black')
if self.transform:
image = self.transform(image)
return image, item['question'], item['answer'], item['question_id']
# =============================================================================
# 4. Two-Stage Training Pipeline
# =============================================================================
class BLIP2Trainer:
"""
Trainer implementing BLIP-2's two-stage pretraining:
Stage 1: Representation learning (ITC, ITM, ITG objectives)
Stage 2: Generative learning with frozen LLM
"""
def __init__(
self,
model: BLIP2Model,
device: str = "cuda",
lr: float = 1e-4,
stage: str = "generative" # "representation" or "generative"
):
self.model = model.to(device)
self.device = device
self.stage = stage
# Optimizer: only train Q-Former and projection layer
trainable_params = list(model.qformer.parameters()) + list(model.proj_to_llm.parameters())
self.optimizer = torch.optim.AdamW(trainable_params, lr=lr, weight_decay=0.05)
self.scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
self.optimizer, T_0=2000, T_mult=2
)
self.train_losses = []
self.val_accuracies = []
def train_generative_stage(self, dataloader: DataLoader, epochs: int):
"""
Stage 2: Train Q-Former to generate answers with frozen LLM.
"""
self.model.train()
# Keep encoders frozen
self.model.image_encoder.eval()
self.model.llm.eval()
for epoch in range(epochs):
total_loss = 0.0
num_batches = 0
pbar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs} [Generative]")
for images, questions, answers, _ in pbar:
images = images.to(self.device)
self.optimizer.zero_grad()
# Compute VQA loss
loss = self.model.forward_vqa_loss(images, questions, answers)
loss.backward()
torch.nn.utils.clip_grad_norm_(
list(self.model.qformer.parameters()) +
list(self.model.proj_to_llm.parameters()),
1.0
)
self.optimizer.step()
self.scheduler.step()
total_loss += loss.item()
num_batches += 1
pbar.set_postfix({"loss": loss.item()})
avg_loss = total_loss / num_batches
self.train_losses.append(avg_loss)
print(f"Epoch {epoch+1} - Average Loss: {avg_loss:.4f}")
def evaluate_vqa(
self,
dataloader: DataLoader,
max_samples: Optional[int] = None
    ) -> Tuple[float, Dict, List[Dict]]:
        """
        Evaluate VQA accuracy with exact string matching.
        Returns (exact-match accuracy, metrics dict, per-sample results).
        """
self.model.eval()
exact_matches = 0
total = 0
all_results = []
with torch.no_grad():
for i, (images, questions, answers, qids) in enumerate(tqdm(dataloader, desc="Evaluating")):
if max_samples and i * dataloader.batch_size >= max_samples:
break
images = images.to(self.device)
# Generate predictions
predictions = self.model.generate_answer(images, questions)
for pred, true, qid in zip(predictions, answers, qids):
pred_clean = pred.lower().strip()
true_clean = true.lower().strip()
# Exact match
if pred_clean == true_clean:
exact_matches += 1
# VQA score (0 or 1 based on string match)
vqa_score = 1.0 if pred_clean == true_clean else 0.0
all_results.append({
'question_id': qid,
'answer': pred,
'gt_answer': true,
'score': vqa_score
})
total += 1
accuracy = exact_matches / total * 100 if total > 0 else 0.0
metrics = {
'exact_match_accuracy': accuracy,
'total_samples': total,
'correct': exact_matches
}
return accuracy, metrics, all_results
def plot_training_progress(self, save_path: str = "blip2_training.png"):
"""
Visualize training loss and validation accuracy.
"""
fig, ax1 = plt.subplots(figsize=(10, 6))
color = 'tab:blue'
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Training Loss', color=color)
ax1.plot(self.train_losses, color=color, linewidth=2, marker='o')
ax1.tick_params(axis='y', labelcolor=color)
ax1.grid(True, alpha=0.3)
if self.val_accuracies:
ax2 = ax1.twinx()
color = 'tab:red'
ax2.set_ylabel('VQA Accuracy (%)', color=color)
ax2.plot(self.val_accuracies, color=color, linewidth=2, marker='s')
ax2.axhline(y=70.0, color='green', linestyle='--', label='Target 70%')
ax2.legend()
ax2.tick_params(axis='y', labelcolor=color)
plt.title('BLIP-2 Q-Former Training Progress')
plt.tight_layout()
plt.savefig(save_path, dpi=300)
print(f"Training plot saved to {save_path}")
def save_model(self, path: str, epoch: int, metrics: dict):
torch.save({
'epoch': epoch,
'qformer_state_dict': self.model.qformer.state_dict(),
'proj_state_dict': self.model.proj_to_llm.state_dict(),
'optimizer_state_dict': self.optimizer.state_dict(),
'metrics': metrics
}, path)
print(f"Model checkpoint saved to {path}")
# =============================================================================
# 5. Visualization and Analysis
# =============================================================================
def visualize_attention_maps(model: BLIP2Model, image_path: str, question: str, save_path: str = "attention_vis.png"):
"""
Visualize cross-attention weights from Q-Former queries to image patches.
"""
model.eval()
# Load and preprocess image
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
image = Image.open(image_path).convert('RGB')
image_tensor = transform(image).unsqueeze(0).to(next(model.parameters()).device)
# Hook to capture attention weights
attention_weights = []
def hook_fn(module, input, output):
# Capture attention from the last layer
attention_weights.append(output[1] if isinstance(output, tuple) else None)
# Register hook on last cross-attention layer
handle = model.qformer.layers[-1].cross_attn.register_forward_hook(hook_fn)
with torch.no_grad():
# Forward pass
vit_out = model.image_encoder(image_tensor)
image_features = vit_out.last_hidden_state[:, 1:, :] # Remove CLS, keep patches
# Get attention
query_out = model.qformer(image_features)
handle.remove()
# Visualize
if attention_weights and attention_weights[0] is not None:
        # [B, H, N_q, N_patches] -> average over heads and queries -> [N_patches]
        attn = attention_weights[0].mean(dim=1).mean(dim=1).squeeze(0).cpu().numpy()
        # Reshape to the ViT patch grid (14x14 for 224/16)
        h, w = 14, 14
        attn_map = attn[:h * w].reshape(h, w)
# Upsample to original image size
from scipy.ndimage import zoom
attn_resized = zoom(attn_map, (image.size[1]/h, image.size[0]/w), order=1)
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(image)
plt.title(f'Original Image\nQ: {question}')
plt.axis('off')
plt.subplot(1, 2, 2)
plt.imshow(image)
plt.imshow(attn_resized, alpha=0.6, cmap='hot')
plt.title('Q-Former Cross-Attention')
plt.axis('off')
plt.tight_layout()
plt.savefig(save_path, dpi=300)
print(f"Attention visualization saved to {save_path}")
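# -----------------------------------------------------------------------------
# Illustrative helper (an assumption, not required by the pipeline): the same
# scipy.ndimage.zoom call used above, shown in isolation. order=1 performs
# bilinear interpolation from the 14x14 patch grid up to image resolution.
# -----------------------------------------------------------------------------
def _attention_upsample_demo(img_h: int = 224, img_w: int = 224):
    import numpy as np
    from scipy.ndimage import zoom
    h, w = 14, 14
    attn_map = np.random.rand(h, w)                  # stand-in attention grid
    resized = zoom(attn_map, (img_h / h, img_w / w), order=1)
    return resized.shape                             # (img_h, img_w)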
# =============================================================================
# 6. Main Execution
# =============================================================================
def main():
import argparse
parser = argparse.ArgumentParser(description="Train BLIP-2 Q-Former")
parser.add_argument("--stage", type=str, default="generative", choices=["generative"])
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--batch_size", type=int, default=8)
parser.add_argument("--lr", type=float, default=1e-4)
parser.add_argument("--data_dir", type=str, default="./data/vqa")
parser.add_argument("--checkpoint", type=str, default=None)
args = parser.parse_args()
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Initialize model
model = BLIP2Model(
image_encoder_name="google/vit-large-patch16-224",
llm_name="google/flan-t5-base",
num_queries=32,
freeze_image_encoder=True,
freeze_llm=True
)
# Load checkpoint if provided
if args.checkpoint:
ckpt = torch.load(args.checkpoint, map_location=device)
model.qformer.load_state_dict(ckpt['qformer_state_dict'])
model.proj_to_llm.load_state_dict(ckpt['proj_state_dict'])
print(f"Loaded checkpoint from {args.checkpoint}")
# Setup dataset
train_dataset = VQADataset(
image_dir=os.path.join(args.data_dir, "train2014"),
question_file=os.path.join(args.data_dir, "v2_OpenEnded_mscoco_train2014_questions.json"),
annotation_file=os.path.join(args.data_dir, "v2_mscoco_train2014_annotations.json")
)
train_loader = DataLoader(
train_dataset,
batch_size=args.batch_size,
shuffle=True,
num_workers=4,
pin_memory=True
)
# Setup trainer
trainer = BLIP2Trainer(model, device=device, lr=args.lr, stage=args.stage)
# Train
if args.stage == "generative":
trainer.train_generative_stage(train_loader, args.epochs)
# Validation
val_dataset = VQADataset(
image_dir=os.path.join(args.data_dir, "val2014"),
question_file=os.path.join(args.data_dir, "v2_OpenEnded_mscoco_val2014_questions.json"),
annotation_file=os.path.join(args.data_dir, "v2_mscoco_val2014_annotations.json")
)
val_loader = DataLoader(val_dataset, batch_size=args.batch_size, shuffle=False)
print("\nEvaluating on validation set...")
accuracy, metrics, _ = trainer.evaluate_vqa(val_loader, max_samples=1000)
print(f"\nValidation VQA Accuracy: {accuracy:.2f}%")
print(f"Correct: {metrics['correct']}/{metrics['total_samples']}")
if accuracy > 70.0:
print("✓ Achieved target accuracy > 70%!")
# Save model
trainer.save_model("blip2_qformer_final.pt", args.epochs, metrics)
trainer.plot_training_progress("blip2_training.png")
# Example visualization
if len(val_dataset) > 0:
sample_img, sample_q, _, _ = val_dataset[0]
visualize_attention_maps(
model,
val_dataset.qa_pairs[0]['image_path'],
sample_q,
"attention_sample.png"
)
if __name__ == "__main__":
main()
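One caveat on Script 2's evaluation: `evaluate_vqa` scores predictions by exact match against the single most frequent ground-truth answer. The official VQA v2 metric is softer, rewarding annotator agreement: an answer scores min(#matching human answers / 3, 1) over the ten annotations. A minimal sketch of that metric, ignoring VQA's answer-normalization rules (`vqa_soft_score` is a hypothetical helper, not part of the script above):

```python
def vqa_soft_score(prediction: str, human_answers: list) -> float:
    """Official VQA soft accuracy: full credit once >= 3 annotators agree."""
    pred = prediction.lower().strip()
    matches = sum(1 for a in human_answers if a.lower().strip() == pred)
    return min(matches / 3.0, 1.0)

print(vqa_soft_score("yes", ["yes"] * 10))  # 1.0
print(vqa_soft_score("2", ["2", "2", "3", "3", "3", "3", "3", "3", "3", "3"]))
```

Swapping this in for the exact-match check typically raises reported accuracy, since partially agreed answers earn fractional credit.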
Script 3: Projection Layer Architecture Comparison (Science QA)
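Before the code, a back-of-envelope comparison of the trainable-parameter budgets of the three connector designs (a rough sketch; dimensions are assumptions matching the demo configuration below, ViT-base d_v = 768 into GPT-2 d_l = 768, LayerNorm parameters omitted):

```python
d_v, d_l = 768, 768                                  # ViT-base -> GPT-2 hidden sizes
# 1) Linear projection (single Linear, bias=False)
linear_params = d_v * d_l
# 2) MLP adapter: fc1 (d_v -> 4*d_l) + fc2 (4*d_l -> d_l), both with biases
h = 4 * d_l
mlp_params = (d_v * h + h) + (h * d_l + d_l)
# 3) C-Abstractor: 64 learnable queries + 2 MultiheadAttention layers
#    (in_proj 3*d^2 + out_proj d^2, with biases) + final projection
n_q, n_layers = 64, 2
mha_params = n_layers * (4 * d_v * d_v + 4 * d_v)
cab_params = n_q * d_v + mha_params + (d_v * d_l + d_l)
print(f"linear={linear_params:,} mlp={mlp_params:,} c_abstractor={cab_params:,}")
```

Roughly 0.59M vs 4.7M vs 5.4M parameters: even the heaviest connector is tiny next to a frozen LLM, which is the point of these adapter designs.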
"""
Script 3: Projection Layer Architecture Comparison on Science QA
=================================================================
This script compares three projection approaches:
1. Linear Projection (single layer)
2. MLP Adapter (2-layer with GELU activation)
3. C-Abstractor (Query-based compression with Transformer)
Evaluates on Science QA dataset measuring:
- Accuracy on multimodal science questions
- Training time per epoch
- Convergence speed
- Parameter efficiency
Usage:
1. Prepare Science QA dataset in ./data/science_qa/
2. Run: python projection_comparison.py --method all --epochs 20
Dependencies: torch, torchvision, transformers, datasets, pandas, seaborn, matplotlib, tqdm, pillow
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, models
from transformers import ViTModel, GPT2LMHeadModel, GPT2Tokenizer
from typing import Optional, Dict, List, Tuple
import json
import os
import time
from PIL import Image
from tqdm import tqdm
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from dataclasses import dataclass
import numpy as np
# =============================================================================
# 1. Projection Layer Implementations
# =============================================================================
class LinearProjection(nn.Module):
"""
Simple linear projection from vision to language space.
Baseline approach with minimal parameters.
"""
def __init__(self, in_dim: int, out_dim: int):
super().__init__()
self.projection = nn.Linear(in_dim, out_dim, bias=False)
self.norm = nn.LayerNorm(out_dim)
def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
"""
Args:
visual_features: [B, N, D_in] or [B, D_in]
Returns:
projected: [B, N, D_out] or [B, D_out]
"""
projected = self.projection(visual_features)
normalized = self.norm(projected)
return normalized
class MLPAdapter(nn.Module):
"""
Two-layer MLP adapter with GELU activation and dropout.
Similar to LLaVA's MLP projection layer.
"""
def __init__(self, in_dim: int, out_dim: int, hidden_dim: Optional[int] = None, dropout: float = 0.1):
super().__init__()
if hidden_dim is None:
hidden_dim = out_dim * 4 # Expansion ratio of 4
self.fc1 = nn.Linear(in_dim, hidden_dim)
self.activation = nn.GELU()
self.dropout1 = nn.Dropout(dropout)
self.fc2 = nn.Linear(hidden_dim, out_dim)
self.dropout2 = nn.Dropout(dropout)
self.norm = nn.LayerNorm(out_dim)
def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
x = self.fc1(visual_features)
x = self.activation(x)
x = self.dropout1(x)
x = self.fc2(x)
x = self.dropout2(x)
x = self.norm(x)
return x
class CAbstractor(nn.Module):
"""
C-Abstractor: Query-based visual feature compression.
Uses learnable queries to compress variable-length visual tokens
to fixed number of visual abstracts.
"""
def __init__(
self,
in_dim: int,
out_dim: int,
num_queries: int = 64,
num_layers: int = 2,
num_heads: int = 8
):
super().__init__()
self.num_queries = num_queries
self.query_dim = in_dim
# Learnable query tokens
self.queries = nn.Parameter(torch.randn(1, num_queries, in_dim))
nn.init.normal_(self.queries, std=0.02)
# Cross-attention layers for compression
self.layers = nn.ModuleList([
nn.MultiheadAttention(in_dim, num_heads, batch_first=True)
for _ in range(num_layers)
])
self.norms = nn.ModuleList([nn.LayerNorm(in_dim) for _ in range(num_layers)])
# Projection to LLM dimension
self.proj = nn.Linear(in_dim, out_dim)
self.final_norm = nn.LayerNorm(out_dim)
def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
"""
Args:
visual_features: [B, N_patches, D_in]
Returns:
compressed: [B, num_queries, D_out]
"""
batch_size = visual_features.shape[0]
# Expand queries
queries = self.queries.expand(batch_size, -1, -1)
# Iterative cross-attention refinement
current = queries
for layer, norm in zip(self.layers, self.norms):
# Cross-attention: queries attend to visual features
attn_out, _ = layer(current, visual_features, visual_features)
current = norm(current + attn_out)
# Project to LLM dimension
output = self.proj(current)
output = self.final_norm(output)
return output
# =============================================================================
# 2. Multimodal LLM with Switchable Projection
# =============================================================================
class MultimodalLLM(nn.Module):
"""
Multimodal LLM supporting different projection architectures.
Uses frozen ViT and frozen LLM with trainable projection layer.
"""
def __init__(
self,
projection_type: str = "linear", # "linear", "mlp", "c_abstractor"
vit_model: str = "google/vit-base-patch16-224",
        llm_model: str = "gpt2",  # GPT-2 for the demo; swap in LLaMA-2 or similar in practice
freeze_vit: bool = True,
freeze_llm: bool = True,
num_queries: int = 64
):
super().__init__()
self.projection_type = projection_type
# Load frozen ViT
from transformers import ViTModel
self.vit = ViTModel.from_pretrained(vit_model)
vit_dim = self.vit.config.hidden_size
if freeze_vit:
for param in self.vit.parameters():
param.requires_grad = False
self.vit.eval()
        # Load the frozen LLM first so the projection targets its true hidden
        # size. (Hard-coding llm_dim = 4096 for LLaMA-2 7B while actually
        # loading GPT-2, whose hidden size is 768, would cause a dimension
        # mismatch at runtime when visual and text embeddings are concatenated.)
        # In practice, use LLaMA-2 or similar.
        from transformers import GPT2LMHeadModel, GPT2Tokenizer
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.llm = GPT2LMHeadModel.from_pretrained("gpt2")
        llm_dim = self.llm.config.n_embd
        if freeze_llm:
            for param in self.llm.parameters():
                param.requires_grad = False
            self.llm.eval()
        # Initialize projection based on type
        if projection_type == "linear":
            self.projection = LinearProjection(vit_dim, llm_dim)
        elif projection_type == "mlp":
            self.projection = MLPAdapter(vit_dim, llm_dim, hidden_dim=llm_dim * 4)
        elif projection_type == "c_abstractor":
            self.projection = CAbstractor(vit_dim, llm_dim, num_queries=num_queries)
        else:
            raise ValueError(f"Unknown projection type: {projection_type}")
self.num_trainable_params = sum(p.numel() for p in self.projection.parameters() if p.requires_grad)
def forward(
self,
images: torch.Tensor,
questions: List[str],
answers: Optional[List[str]] = None
) -> Dict[str, torch.Tensor]:
"""
Forward pass with optional teacher forcing for training.
"""
batch_size = images.shape[0]
# Extract visual features
with torch.no_grad() if not self.vit.training else torch.enable_grad():
vit_out = self.vit(images)
visual_features = vit_out.last_hidden_state # [B, N_patches+1, D]
# Project to LLM space
visual_embeds = self.projection(visual_features) # [B, N_vis, D_llm]
# Encode questions
question_tokens = self.tokenizer(
questions,
padding=True,
truncation=True,
max_length=128,
return_tensors="pt"
).to(images.device)
# Get question embeddings
with torch.no_grad() if not self.llm.training else torch.enable_grad():
question_embeds = self.llm.transformer.wte(question_tokens.input_ids)
        # Concatenate visual + question embeddings. The C-Abstractor already
        # compresses to a fixed number of query tokens; Linear/MLP keep the
        # full patch sequence, so both branches concatenate the same way.
        if visual_embeds.dim() == 2:  # pooled features -> add a sequence axis
            visual_embeds = visual_embeds.unsqueeze(1)
        combined_embeds = torch.cat([visual_embeds, question_embeds], dim=1)
        visual_mask = torch.ones(batch_size, visual_embeds.shape[1], device=images.device)
        combined_mask = torch.cat([visual_mask, question_tokens.attention_mask], dim=1)
# Forward through LLM
if answers is not None:
# Training mode with labels
answer_tokens = self.tokenizer(
answers,
padding=True,
truncation=True,
max_length=128,
return_tensors="pt"
).to(images.device)
            # Append answer token embeddings so labels align with the inputs:
            # GPT-2 requires labels and input embeddings of equal length
            answer_embeds = self.llm.transformer.wte(answer_tokens.input_ids)
            combined_embeds = torch.cat([combined_embeds, answer_embeds], dim=1)
            combined_mask = torch.cat([combined_mask, answer_tokens.attention_mask], dim=1)
            # Labels: -100 masks visual/question positions and padding, so the
            # loss is computed on answer tokens only
            labels = answer_tokens.input_ids.clone()
            labels[labels == self.tokenizer.pad_token_id] = -100
            visual_len = visual_embeds.shape[1]
            question_len = question_embeds.shape[1]
            prefix_padding = torch.full((batch_size, visual_len + question_len), -100,
                                        dtype=torch.long, device=images.device)
            full_labels = torch.cat([prefix_padding, labels], dim=1)
            outputs = self.llm(
                inputs_embeds=combined_embeds,
                attention_mask=combined_mask,
                labels=full_labels
            )
            return {"loss": outputs.loss, "logits": outputs.logits}
else:
# Inference mode
outputs = self.llm(
inputs_embeds=combined_embeds,
attention_mask=combined_mask
)
return {"logits": outputs.logits}
def generate_answer(self, images: torch.Tensor, questions: List[str], max_length: int = 50) -> List[str]:
"""
Generate answers autoregressively.
"""
self.eval()
batch_size = images.shape[0]
with torch.no_grad():
# Encode image and question
vit_out = self.vit(images)
visual_features = vit_out.last_hidden_state
visual_embeds = self.projection(visual_features)
if visual_embeds.dim() == 2:
visual_embeds = visual_embeds.unsqueeze(1)
question_tokens = self.tokenizer(questions, padding=True, truncation=True,
max_length=128, return_tensors="pt").to(images.device)
question_embeds = self.llm.transformer.wte(question_tokens.input_ids)
combined_embeds = torch.cat([visual_embeds, question_embeds], dim=1)
            # Build the attention mask for the combined sequence and generate.
            # With inputs_embeds (and no input_ids), generate() returns only the
            # newly generated token ids, so decode the output directly instead
            # of slicing off a prompt prefix.
            visual_mask = torch.ones(batch_size, visual_embeds.shape[1],
                                     dtype=torch.long, device=images.device)
            combined_mask = torch.cat([visual_mask, question_tokens.attention_mask], dim=1)
            outputs = self.llm.generate(
                inputs_embeds=combined_embeds,
                attention_mask=combined_mask,
                max_new_tokens=max_length,
                num_beams=3,
                early_stopping=True,
                pad_token_id=self.tokenizer.eos_token_id
            )
        answers = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
        return answers
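# -----------------------------------------------------------------------------
# Self-contained illustration (not part of the model): label positions set to
# -100 are ignored by the cross-entropy loss, which is how forward() above
# excludes the visual soft prompt and question tokens from the answer loss.
# -----------------------------------------------------------------------------
def _ignore_index_demo():
    import torch
    import torch.nn.functional as F
    torch.manual_seed(0)
    logits = torch.randn(1, 4, 10)                   # [B, T, vocab]
    labels = torch.tensor([[-100, -100, 3, 7]])      # supervise last two steps only
    loss_all = F.cross_entropy(logits.view(-1, 10), labels.view(-1),
                               ignore_index=-100)
    loss_tail = F.cross_entropy(logits[0, 2:], labels[0, 2:])
    return torch.isclose(loss_all, loss_tail).item()  # True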
# =============================================================================
# 3. Science QA Dataset
# =============================================================================
class ScienceQADataset(Dataset):
"""
Science QA dataset loader.
Supports multimodal questions with images and multiple-choice answers.
"""
def __init__(self, data_dir: str, split: str = "train", transform=None):
self.data_dir = data_dir
self.split = split
self.transform = transform or self._default_transform()
# Load data
data_file = os.path.join(data_dir, f"{split}.json")
with open(data_file, 'r') as f:
self.data = json.load(f)
self.samples = []
for item in self.data:
question = item.get('question', '')
choices = item.get('choices', [])
answer = item.get('answer', 0) # Index of correct choice
image_path = item.get('image', None)
if image_path:
image_path = os.path.join(data_dir, image_path)
if os.path.exists(image_path):
self.samples.append({
'question': question,
'choices': choices,
'answer_idx': answer,
'answer_text': choices[answer] if choices and answer < len(choices) else str(answer),
'image_path': image_path,
'lecture': item.get('lecture', ''),
'solution': item.get('solution', '')
})
def _default_transform(self):
return transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
def format_question(self, sample: Dict) -> str:
"""
Format question with choices for multiple choice answering.
"""
question = sample['question']
choices = sample['choices']
# Format: "Question: ... Choices: A) ... B) ... Answer:"
choice_str = " ".join([f"{chr(65+i)}) {c}" for i, c in enumerate(choices)])
formatted = f"Question: {question} Choices: {choice_str} Answer:"
return formatted
def __len__(self):
return len(self.samples)
def __getitem__(self, idx):
sample = self.samples[idx]
# Load image
try:
image = Image.open(sample['image_path']).convert('RGB')
except Exception:
image = Image.new('RGB', (224, 224), color='black')
if self.transform:
image = self.transform(image)
question_text = self.format_question(sample)
answer_text = sample['answer_text']
return image, question_text, answer_text, sample['answer_idx']
# =============================================================================
# 4. Training and Evaluation Framework
# =============================================================================
class ProjectionComparator:
"""
Compare different projection architectures on Science QA.
Tracks accuracy, training time, and convergence metrics.
"""
def __init__(self, device: str = "cuda", batch_size: int = 4):
self.device = device
self.batch_size = batch_size
self.results = {}
def train_model(
self,
projection_type: str,
train_loader: DataLoader,
val_loader: DataLoader,
epochs: int = 20,
lr: float = 2e-4
    ) -> Tuple[Dict, "MultimodalLLM"]:
"""
Train a specific projection architecture and return metrics.
"""
print(f"\n{'='*60}")
print(f"Training with {projection_type.upper()} projection")
print(f"{'='*60}")
# Initialize model
model = MultimodalLLM(projection_type=projection_type).to(self.device)
# Optimizer (only projection parameters)
optimizer = torch.optim.AdamW(
model.projection.parameters(),
lr=lr,
weight_decay=0.01,
betas=(0.9, 0.999)
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
# Training metrics
train_losses = []
val_accuracies = []
epoch_times = []
for epoch in range(epochs):
start_time = time.time()
# Training
model.train()
epoch_loss = 0.0
num_batches = 0
pbar = tqdm(train_loader, desc=f"{projection_type} Epoch {epoch+1}/{epochs}")
for images, questions, answers, _ in pbar:
images = images.to(self.device)
optimizer.zero_grad()
outputs = model(images, questions, answers)
loss = outputs['loss']
loss.backward()
torch.nn.utils.clip_grad_norm_(model.projection.parameters(), 1.0)
optimizer.step()
epoch_loss += loss.item()
num_batches += 1
pbar.set_postfix({"loss": loss.item()})
avg_loss = epoch_loss / num_batches
train_losses.append(avg_loss)
# Validation
val_acc = self.evaluate(model, val_loader)
val_accuracies.append(val_acc)
epoch_time = time.time() - start_time
epoch_times.append(epoch_time)
scheduler.step()
print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Val Acc={val_acc:.2f}%, Time={epoch_time:.2f}s")
# Store results
results = {
'projection_type': projection_type,
'train_losses': train_losses,
'val_accuracies': val_accuracies,
'epoch_times': epoch_times,
'final_accuracy': val_accuracies[-1],
'avg_epoch_time': np.mean(epoch_times),
'total_params': model.num_trainable_params,
'convergence_epoch': self._find_convergence_epoch(val_accuracies)
}
return results, model
def evaluate(self, model: MultimodalLLM, dataloader: DataLoader) -> float:
"""
Evaluate model accuracy on validation set.
"""
model.eval()
correct = 0
total = 0
with torch.no_grad():
for images, questions, answers, answer_indices in tqdm(dataloader, desc="Evaluating"):
images = images.to(self.device)
# Generate predictions
preds = model.generate_answer(images, questions, max_length=20)
# Check accuracy (exact match)
for pred, true in zip(preds, answers):
if pred.strip().lower() == true.strip().lower():
correct += 1
total += 1
accuracy = correct / total * 100 if total > 0 else 0.0
return accuracy
def _find_convergence_epoch(self, accuracies: List[float], threshold: float = 0.5) -> int:
"""
Find epoch where model converges (accuracy improvement < threshold).
"""
for i in range(1, len(accuracies)):
if abs(accuracies[i] - accuracies[i-1]) < threshold:
return i
return len(accuracies)
def compare_all(
self,
train_loader: DataLoader,
val_loader: DataLoader,
epochs: int = 20
) -> pd.DataFrame:
"""
Train and compare all three projection architectures.
"""
projection_types = ["linear", "mlp", "c_abstractor"]
for proj_type in projection_types:
results, _ = self.train_model(proj_type, train_loader, val_loader, epochs)
self.results[proj_type] = results
# Create comparison dataframe
comparison_data = []
for proj_type, res in self.results.items():
comparison_data.append({
'Architecture': proj_type.upper(),
'Final Accuracy (%)': res['final_accuracy'],
'Avg Epoch Time (s)': res['avg_epoch_time'],
'Trainable Params': res['total_params'],
'Convergence Epoch': res['convergence_epoch']
})
df = pd.DataFrame(comparison_data)
return df
def plot_comparison(self, save_path: str = "projection_comparison.png"):
"""
Visualize comparison of different projection architectures.
"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
colors = {'linear': 'blue', 'mlp': 'red', 'c_abstractor': 'green'}
# Plot 1: Validation Accuracy over Epochs
ax1 = axes[0, 0]
for proj_type, res in self.results.items():
ax1.plot(res['val_accuracies'], label=proj_type.upper(),
color=colors[proj_type], marker='o', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Validation Accuracy (%)')
ax1.set_title('Accuracy Comparison: Linear vs MLP vs C-Abstractor')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Plot 2: Training Loss Curves
ax2 = axes[0, 1]
for proj_type, res in self.results.items():
ax2.plot(res['train_losses'], label=proj_type.upper(),
color=colors[proj_type], linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Training Loss')
ax2.set_title('Training Loss Comparison')
ax2.legend()
ax2.grid(True, alpha=0.3)
# Plot 3: Training Time per Epoch
ax3 = axes[1, 0]
time_data = [res['avg_epoch_time'] for res in self.results.values()]
labels = [k.upper() for k in self.results.keys()]
bars = ax3.bar(labels, time_data, color=[colors[k.lower()] for k in self.results.keys()])
ax3.set_ylabel('Average Epoch Time (seconds)')
ax3.set_title('Training Efficiency Comparison')
for bar, val in zip(bars, time_data):
height = bar.get_height()
ax3.text(bar.get_x() + bar.get_width()/2., height,
f'{val:.1f}s', ha='center', va='bottom')
# Plot 4: Parameter Efficiency vs Accuracy
ax4 = axes[1, 1]
for proj_type, res in self.results.items():
ax4.scatter(res['total_params']/1e6, res['final_accuracy'],
s=200, label=proj_type.upper(), color=colors[proj_type], alpha=0.7)
ax4.set_xlabel('Trainable Parameters (Millions)')
ax4.set_ylabel('Final Accuracy (%)')
ax4.set_title('Parameter Efficiency Trade-off')
ax4.legend()
ax4.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(save_path, dpi=300, bbox_inches='tight')
print(f"\nComparison plot saved to {save_path}")
# =============================================================================
# 5. Main Execution
# =============================================================================
def main():
import argparse
parser = argparse.ArgumentParser(description="Compare Projection Architectures on Science QA")
parser.add_argument("--method", type=str, default="all",
choices=["linear", "mlp", "c_abstractor", "all"])
parser.add_argument("--epochs", type=int, default=20)
parser.add_argument("--batch_size", type=int, default=4)
parser.add_argument("--lr", type=float, default=2e-4)
parser.add_argument("--data_dir", type=str, default="./data/science_qa")
args = parser.parse_args()
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Setup datasets
train_dataset = ScienceQADataset(args.data_dir, split="train")
val_dataset = ScienceQADataset(args.data_dir, split="val")
train_loader = DataLoader(train_dataset, batch_size=args.batch_size,
shuffle=True, num_workers=2)
val_loader = DataLoader(val_dataset, batch_size=args.batch_size,
shuffle=False, num_workers=2)
print(f"Loaded {len(train_dataset)} training samples and {len(val_dataset)} validation samples")
# Initialize comparator
comparator = ProjectionComparator(device=device, batch_size=args.batch_size)
if args.method == "all":
# Compare all methods
results_df = comparator.compare_all(train_loader, val_loader, args.epochs)
print("\n" + "="*80)
print("COMPARISON RESULTS")
print("="*80)
print(results_df.to_string(index=False))
print("="*80)
# Generate plots
comparator.plot_comparison("projection_comparison.png")
else:
# Train single method
results, model = comparator.train_model(
args.method, train_loader, val_loader, args.epochs, args.lr
)
print(f"\nFinal Results for {args.method}:")
print(f"Accuracy: {results['final_accuracy']:.2f}%")
print(f"Avg Epoch Time: {results['avg_epoch_time']:.2f}s")
print(f"Trainable Params: {results['total_params']:,}")
# Save model
torch.save(model.projection.state_dict(), f"projection_{args.method}_final.pt")
if __name__ == "__main__":
main()
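The parameter-vs-accuracy trade-off plotted by `ProjectionComparator` can be sanity-checked in isolation. A minimal sketch comparing the trainable-parameter counts of a linear projector against a two-layer MLP projector; the dimensions (1024-d vision features into a 4096-d LLM hidden size) are illustrative assumptions, not values taken from the script:

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

vis_dim, llm_dim = 1024, 4096  # illustrative: ViT-L/14 features -> 7B LLM hidden size

linear_proj = nn.Linear(vis_dim, llm_dim)
mlp_proj = nn.Sequential(
    nn.Linear(vis_dim, vis_dim),  # hidden_dim defaults to vis_dim, as in MLPProjection
    nn.GELU(),
    nn.Linear(vis_dim, llm_dim),
)

print(f"linear: {count_params(linear_proj):,}")  # 4,198,400
print(f"mlp:    {count_params(mlp_proj):,}")     # 5,248,000
```

The MLP adds only ~25% more parameters here, which is why LLaVA-1.5 style systems often prefer it over a single linear layer.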
Script 4: Multimodal Instruction-Tuned Dialogue System (LLaVA-1.5 Style)
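The loss computation in this script relies on one standard instruction-tuning detail: prompt tokens are masked with `-100` in the labels, so cross-entropy is computed only on the assistant response (see `compute_conversation_loss`). A minimal sketch with made-up toy token IDs:

```python
import torch

# Toy token IDs (made up for illustration)
prompt_ids = torch.tensor([1, 5, 9, 2])    # e.g. "USER: ... ASSISTANT:"
response_ids = torch.tensor([7, 8, 3])     # the target answer tokens
input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)  # shape [1, 7]

# Labels: a copy of input_ids with the prompt span set to -100, which the
# HF causal-LM loss (ignore_index=-100) skips entirely
labels = input_ids.clone()
labels[:, :prompt_ids.shape[0]] = -100

print(labels.tolist())  # [[-100, -100, -100, -100, 7, 8, 3]]
```

Padding positions get the same `-100` treatment, so neither the prompt nor padding contributes gradient signal.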
"""
Script 4: Multimodal Instruction Tuning with Interleaved Image-Text Conversations
===================================================================================
Implementation of LLaVA-1.5 style multimodal dialogue system with:
- Visual instruction tuning on multi-turn conversations
- Support for interleaved image-text inputs (multiple images per conversation)
- Dynamic image resolution handling (padding and partitioning)
- Vicuna/LLaMA-2 based language model with MLP projection
Features:
- Multi-image context understanding
- Instruction following with visual grounding
- Conversation history management
- Interactive demo mode
Usage:
1. Prepare instruction tuning data in JSON format
2. Run training: python multimodal_chat.py --mode train --epochs 3
3. Run interactive demo: python multimodal_chat.py --mode demo --checkpoint ./best_model.pt
Dependencies: torch, torchvision, transformers, pillow, gradio (optional), tqdm, matplotlib
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from transformers import (
    LlamaForCausalLM,
    AutoTokenizer,
    CLIPImageProcessor,
    CLIPVisionModel
)
from typing import List, Dict, Tuple, Optional, Union
import json
import os
from PIL import Image
from dataclasses import dataclass, field
from tqdm import tqdm
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import copy
# =============================================================================
# 1. Conversation and Data Structures
# =============================================================================
@dataclass
class ConversationTurn:
"""Single turn in a multimodal conversation."""
role: str # "user" or "assistant"
content: str
images: List[str] = field(default_factory=list) # List of image paths
@dataclass
class MultimodalConversation:
"""Complete conversation with multiple turns and images."""
turns: List[ConversationTurn] = field(default_factory=list)
conversation_id: str = ""
def to_plain_text(self, image_token: str = "<image>") -> str:
"""
Convert conversation to text format with image tokens inserted.
"""
text_parts = []
for turn in self.turns:
role_marker = "USER:" if turn.role == "user" else "ASSISTANT:"
content = turn.content
# Insert image tokens
for img_path in turn.images:
content = content.replace(f"[IMAGE:{img_path}]", image_token)
text_parts.append(f"{role_marker} {content}")
return "\n".join(text_parts) + "\nASSISTANT:"
# =============================================================================
# 2. Visual Encoder and Projection
# =============================================================================
class LLaVAVisualEncoder(nn.Module):
"""
Visual encoder based on CLIP ViT with dynamic resolution support.
Supports both standard resizing and image partitioning for high-res inputs.
"""
def __init__(
self,
vision_tower: str = "openai/clip-vit-large-patch14-336",
select_layer: int = -2, # Layer to extract features from
select_feature: str = "patch", # "patch" or "cls_patch"
image_size: int = 336,
patch_size: int = 14
):
super().__init__()
self.image_processor = CLIPImageProcessor.from_pretrained(vision_tower)
self.vision_tower = CLIPVisionModel.from_pretrained(vision_tower)
self.select_layer = select_layer
self.select_feature = select_feature
self.image_size = image_size
self.patch_size = patch_size
self.num_patches = (image_size // patch_size) ** 2
# Freeze vision tower
for param in self.vision_tower.parameters():
param.requires_grad = False
self.vision_tower.eval()
self.hidden_size = self.vision_tower.config.hidden_size
def process_images(
self,
images: Union[List[Image.Image], List[torch.Tensor]],
return_tensors: str = "pt"
) -> torch.Tensor:
"""
Process images with dynamic padding to maintain aspect ratio.
"""
if isinstance(images[0], torch.Tensor):
# Already tensors, just stack
return torch.stack(images)
# PIL Images - apply preprocessing
processed = []
for img in images:
# Dynamic resize maintaining aspect ratio
w, h = img.size
scale = self.image_size / max(w, h)
new_w, new_h = int(w * scale), int(h * scale)
img_resized = img.resize((new_w, new_h), Image.BICUBIC)
# Create square image with padding
img_square = Image.new('RGB', (self.image_size, self.image_size), (255, 255, 255))
left = (self.image_size - new_w) // 2
top = (self.image_size - new_h) // 2
img_square.paste(img_resized, (left, top))
# Convert to tensor
img_tensor = self.image_processor(img_square, return_tensors=return_tensors)['pixel_values'][0]
processed.append(img_tensor)
return torch.stack(processed)
def encode_images(self, pixel_values: torch.Tensor) -> torch.Tensor:
"""
Encode images through vision tower.
Returns patch features.
"""
with torch.no_grad():
outputs = self.vision_tower(
pixel_values=pixel_values,
output_hidden_states=True,
return_dict=True
)
# Select specific layer
image_features = outputs.hidden_states[self.select_layer]
if self.select_feature == "patch":
# Remove CLS token, keep only patch tokens
image_features = image_features[:, 1:]
elif self.select_feature == "cls_patch":
# Keep both CLS and patches
pass
return image_features
def forward(self, images: torch.Tensor) -> torch.Tensor:
return self.encode_images(images)
class MLPProjection(nn.Module):
"""
Two-layer MLP projection from vision space to LLM space.
Standard architecture from LLaVA-1.5.
"""
def __init__(self, vision_dim: int, llm_dim: int, hidden_dim: Optional[int] = None):
super().__init__()
if hidden_dim is None:
hidden_dim = vision_dim
self.linear_1 = nn.Linear(vision_dim, hidden_dim, bias=True)
self.activation = nn.GELU()
self.linear_2 = nn.Linear(hidden_dim, llm_dim, bias=True)
def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
"""
Args:
vision_features: [B, N, D_vis]
Returns:
projected: [B, N, D_llm]
"""
x = self.linear_1(vision_features)
x = self.activation(x)
x = self.linear_2(x)
return x
# =============================================================================
# 3. Multimodal LLM with Interleaved Image-Text Support
# =============================================================================
class MultimodalChatModel(nn.Module):
"""
Complete multimodal chat model supporting interleaved image-text conversations.
Combines frozen visual encoder, MLP projection, and LLM with visual instruction tuning.
"""
def __init__(
self,
llm_model: str = "lmsys/vicuna-7b-v1.5", # or "meta-llama/Llama-2-7b-chat-hf"
vision_tower: str = "openai/clip-vit-large-patch14-336",
multimodal_projector: Optional[nn.Module] = None
):
super().__init__()
# Load tokenizer
self.tokenizer = AutoTokenizer.from_pretrained(llm_model, use_fast=False)
self.tokenizer.pad_token = self.tokenizer.eos_token
self.tokenizer.padding_side = "right"
# Load LLM
self.llm = LlamaForCausalLM.from_pretrained(
llm_model,
torch_dtype=torch.float16,
device_map="auto"
)
self.llm_dim = self.llm.config.hidden_size
# Visual components
self.vision_encoder = LLaVAVisualEncoder(vision_tower)
# Projection layer
if multimodal_projector is None:
self.projector = MLPProjection(
self.vision_encoder.hidden_size,
self.llm_dim,
hidden_dim=self.vision_encoder.hidden_size
)
else:
self.projector = multimodal_projector
        # Image token handling: register <image> as a real token so the
        # tokenizer yields a dedicated ID (a hardcoded ID like 32000 is
        # model-specific and out of range for the base vocabulary)
        self.image_token = "<image>"
        self.tokenizer.add_tokens([self.image_token], special_tokens=True)
        self.image_token_id = self.tokenizer.convert_tokens_to_ids(self.image_token)
        self.llm.resize_token_embeddings(len(self.tokenizer))
        # Special tokens for multi-image separation
        self.im_start_token = "<im_start>"
        self.im_end_token = "<im_end>"
def prepare_inputs_with_images(
self,
conversations: List[MultimodalConversation],
images_list: List[List[Image.Image]]
) -> Dict[str, torch.Tensor]:
"""
Prepare input tensors with interleaved image and text embeddings.
Args:
conversations: List of conversation objects
images_list: List of image lists corresponding to each conversation
"""
batch_size = len(conversations)
        all_input_ids = []
        all_labels = []
        max_length = 0
# Process each conversation
for conv, images in zip(conversations, images_list):
# Encode images
if images:
pixel_values = self.vision_encoder.process_images(images).to(self.llm.device)
image_features = self.vision_encoder(pixel_values) # [num_images, num_patches, vis_dim]
image_embeds = self.projector(image_features) # [num_images, num_patches, llm_dim]
# Flatten image embeddings: [num_images * num_patches, llm_dim]
image_embeds_flat = image_embeds.view(-1, image_embeds.shape[-1])
else:
image_embeds_flat = None
# Tokenize text
prompt_text = conv.to_plain_text(self.image_token)
prompt_tokens = self.tokenizer(
prompt_text,
return_tensors="pt",
truncation=True,
max_length=2048
)
input_ids = prompt_tokens['input_ids'][0]
# Replace image tokens with placeholder embeddings
# In practice, we prepare the input_embeds directly
image_token_positions = (input_ids == self.image_token_id).nonzero(as_tuple=True)[0]
# Create input embeddings
input_embeds = self.llm.get_input_embeddings()(input_ids.to(self.llm.device))
# Insert image embeddings at image token positions
if image_embeds_flat is not None and len(image_token_positions) > 0:
# For simplicity, replace consecutive image tokens with image patches
# In full implementation, need careful handling of positions
pass
            # Labels: clone of input_ids; a full implementation would mask
            # all tokens before the final "ASSISTANT:" marker with -100 so
            # the loss covers only the assistant response
            labels = input_ids.clone()
            all_input_ids.append(input_ids)
            all_labels.append(labels)
            max_length = max(max_length, len(input_ids))
        # Pad batch (labels padded with -100 so padding is ignored by the loss)
        batch_input_ids = torch.full((batch_size, max_length), self.tokenizer.pad_token_id, dtype=torch.long)
        batch_attention_mask = torch.zeros((batch_size, max_length), dtype=torch.long)
        batch_labels = torch.full((batch_size, max_length), -100, dtype=torch.long)
        for i, (ids, lbls) in enumerate(zip(all_input_ids, all_labels)):
            batch_input_ids[i, :len(ids)] = ids
            batch_attention_mask[i, :len(ids)] = 1
            batch_labels[i, :len(ids)] = lbls
        return {
            "input_ids": batch_input_ids.to(self.llm.device),
            "attention_mask": batch_attention_mask.to(self.llm.device),
            "labels": batch_labels.to(self.llm.device),
            "images": list(images_list)  # Keep raw images for embedding injection
        }
def forward(
self,
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
images: Optional[List[List[Image.Image]]] = None,
labels: Optional[torch.Tensor] = None
) -> Dict[str, torch.Tensor]:
"""
Forward pass supporting multiple images per sample.
"""
# Get text embeddings
inputs_embeds = self.llm.get_input_embeddings()(input_ids)
# Inject visual embeddings
if images is not None:
for batch_idx, sample_images in enumerate(images):
if sample_images:
# Process and encode images
pixel_values = self.vision_encoder.process_images(sample_images).to(self.llm.device)
image_features = self.vision_encoder(pixel_values)
image_embeds = self.projector(image_features)
                    # For simplicity, prepend visual embeddings by overwriting
                    # the leading text positions; a full implementation inserts
                    # them at the <image> token positions and extends the sequence
                    sample_visual = image_embeds.view(-1, image_embeds.shape[-1])
                    num_vis = min(sample_visual.shape[0], inputs_embeds.shape[1])
                    inputs_embeds = inputs_embeds.clone()  # avoid in-place edits on the autograd graph
                    inputs_embeds[batch_idx] = torch.cat([
                        sample_visual[:num_vis].to(inputs_embeds.dtype),  # match LLM dtype (fp16)
                        inputs_embeds[batch_idx, num_vis:]
                    ], dim=0)
# Forward through LLM
outputs = self.llm(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
labels=labels,
return_dict=True
)
return outputs
@torch.no_grad()
def generate_response(
self,
conversation: MultimodalConversation,
images: List[Image.Image],
max_new_tokens: int = 512,
temperature: float = 0.2,
top_p: float = 0.9
) -> str:
"""
Generate response for a conversation with multiple images.
"""
self.eval()
# Prepare inputs
prompt_text = conversation.to_plain_text(self.image_token)
# Encode images
if images:
pixel_values = self.vision_encoder.process_images(images).to(self.llm.device)
image_features = self.vision_encoder(pixel_values)
image_embeds = self.projector(image_features)
# Flatten: [total_visual_tokens, llm_dim]
image_embeds = image_embeds.view(-1, image_embeds.shape[-1])
else:
image_embeds = None
# Encode prompt text
input_ids = self.tokenizer(prompt_text, return_tensors="pt")['input_ids'].to(self.llm.device)
inputs_embeds = self.llm.get_input_embeddings()(input_ids)
# Prepend image embeddings
if image_embeds is not None:
inputs_embeds = torch.cat([image_embeds.unsqueeze(0), inputs_embeds], dim=1)
# Adjust attention mask
visual_attention = torch.ones(1, image_embeds.shape[0], device=self.llm.device)
text_attention = torch.ones_like(input_ids)
attention_mask = torch.cat([visual_attention, text_attention], dim=1)
else:
attention_mask = torch.ones_like(input_ids)
# Generate
output_ids = self.llm.generate(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
num_beams=1,
use_cache=True
)
        # Decode: when generate() receives only inputs_embeds (no input_ids),
        # the returned ids contain just the newly generated tokens, so no
        # prompt-length slicing is needed
        response = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return response.strip()
# =============================================================================
# 4. Instruction Tuning Dataset
# =============================================================================
class MultimodalInstructionDataset(Dataset):
"""
Dataset for multimodal instruction tuning with interleaved image-text conversations.
Supports formats like LLaVA-Instruct or custom multimodal dialogue data.
"""
def __init__(
self,
data_path: str,
image_dir: str,
transform: Optional[transforms.Compose] = None
):
self.image_dir = Path(image_dir)
# Load conversation data
with open(data_path, 'r') as f:
self.data = json.load(f)
self.samples = []
for item in self.data:
# Parse conversation format
conversation = self._parse_conversation(item)
self.samples.append({
'conversation': conversation,
'image_paths': item.get('images', []),
'id': item.get('id', '')
})
def _parse_conversation(self, item: Dict) -> MultimodalConversation:
"""
Parse JSON item into MultimodalConversation.
"""
turns = []
for turn_data in item.get('conversations', []):
turn = ConversationTurn(
role=turn_data['from'], # 'human' or 'gpt' mapped to user/assistant
content=turn_data['value'],
images=turn_data.get('images', [])
)
# Map roles
if turn.role == 'human':
turn.role = 'user'
elif turn.role == 'gpt':
turn.role = 'assistant'
turns.append(turn)
return MultimodalConversation(
turns=turns,
conversation_id=item.get('id', '')
)
def __len__(self):
return len(self.samples)
def __getitem__(self, idx) -> Tuple[MultimodalConversation, List[Image.Image], str]:
"""
Returns conversation, loaded images, and target response.
"""
sample = self.samples[idx]
conversation = sample['conversation']
# Load images
images = []
for img_path in sample['image_paths']:
full_path = self.image_dir / img_path
if full_path.exists():
try:
img = Image.open(full_path).convert('RGB')
images.append(img)
except Exception as e:
print(f"Error loading image {full_path}: {e}")
# Get target (last assistant response)
target = ""
for turn in reversed(conversation.turns):
if turn.role == 'assistant':
target = turn.content
break
return conversation, images, target
# =============================================================================
# 5. Training Pipeline with Visualization
# =============================================================================
class MultimodalTrainer:
"""
Trainer for multimodal instruction tuning with conversation history.
"""
def __init__(
self,
model: MultimodalChatModel,
device: str = "cuda",
lr: float = 2e-5,
warmup_steps: int = 100
):
self.model = model
self.device = device
# Only train projection layer and LLM LoRA (if applied)
trainable_params = list(model.projector.parameters())
self.optimizer = torch.optim.AdamW(trainable_params, lr=lr, weight_decay=0.0)
self.scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
self.optimizer, T_max=1000, eta_min=lr/10
)
self.train_losses = []
self.val_metrics = []
def compute_conversation_loss(
self,
conversation: MultimodalConversation,
images: List[Image.Image],
target_response: str
) -> torch.Tensor:
"""
Compute language modeling loss for a single conversation turn.
"""
# Prepare inputs with images
prompt_text = conversation.to_plain_text(self.model.image_token)
# Tokenize full sequence (prompt + target)
full_text = prompt_text + " " + target_response + self.model.tokenizer.eos_token
tokens = self.model.tokenizer(
full_text,
return_tensors="pt",
truncation=True,
max_length=2048
).to(self.device)
# Create labels: mask prompt part, only compute loss on target
labels = tokens['input_ids'].clone()
# Find target start position (after prompt)
prompt_tokens = self.model.tokenizer(prompt_text, return_tensors="pt")['input_ids']
prompt_len = prompt_tokens.shape[1]
# Mask prompt tokens in labels
labels[:, :prompt_len] = -100
# Forward
outputs = self.model(
input_ids=tokens['input_ids'],
attention_mask=tokens['attention_mask'],
images=[images] if images else None,
labels=labels
)
return outputs.loss
def train_epoch(self, dataloader: DataLoader) -> float:
"""
Train for one epoch over all conversations.
"""
self.model.train()
# Keep encoders frozen
self.model.vision_encoder.eval()
self.model.llm.eval()
total_loss = 0.0
num_batches = 0
pbar = tqdm(dataloader, desc="Training")
for conversations, images_list, targets in pbar:
# Process each item in batch individually due to variable length
batch_loss = 0.0
for conv, imgs, target in zip(conversations, images_list, targets):
self.optimizer.zero_grad()
loss = self.compute_conversation_loss(conv, imgs, target)
loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.projector.parameters(), 1.0)
self.optimizer.step()
batch_loss += loss.item()
avg_batch_loss = batch_loss / len(conversations)
total_loss += avg_batch_loss
num_batches += 1
pbar.set_postfix({"loss": avg_batch_loss})
self.scheduler.step()
epoch_loss = total_loss / num_batches
self.train_losses.append(epoch_loss)
return epoch_loss
def evaluate_dialogue_quality(
self,
dataloader: DataLoader,
num_samples: int = 50
) -> Dict[str, float]:
"""
Evaluate response quality using metrics like BLEU or exact match.
"""
self.model.eval()
exact_matches = 0
total = 0
with torch.no_grad():
for i, (conversations, images_list, targets) in enumerate(tqdm(dataloader, desc="Evaluating")):
if i >= num_samples:
break
for conv, imgs, target in zip(conversations, images_list, targets):
# Generate response
pred_response = self.model.generate_response(conv, imgs, max_new_tokens=100)
# Simple exact match (can be replaced with BLEU/ROUGE)
if target.strip().lower() in pred_response.lower():
exact_matches += 1
total += 1
accuracy = exact_matches / total * 100 if total > 0 else 0.0
metrics = {
'response_accuracy': accuracy,
'total_evaluated': total
}
self.val_metrics.append(metrics)
return metrics
def visualize_conversation(
self,
conversation: MultimodalConversation,
images: List[Image.Image],
save_path: str = "conversation_vis.png"
):
"""
Visualize a conversation with images and model response.
"""
# Get model response
response = self.model.generate_response(conversation, images)
# Create figure with images and text
num_images = len(images)
fig = plt.figure(figsize=(15, 5 * (num_images + 1)))
# Display images
for i, img in enumerate(images):
ax = plt.subplot(num_images + 1, 2, 2*i + 1)
plt.imshow(img)
plt.axis('off')
plt.title(f"Image {i+1}")
# Display conversation
ax_text = plt.subplot(num_images + 1, 2, (2*num_images + 1, 2*num_images + 2))
conversation_text = conversation.to_plain_text("<image>")
full_text = f"{conversation_text}\n\nModel Response:\n{response}"
ax_text.text(0.1, 0.5, full_text, fontsize=10, verticalalignment='center',
family='monospace', wrap=True)
ax_text.axis('off')
plt.tight_layout()
plt.savefig(save_path, dpi=150, bbox_inches='tight')
print(f"Conversation visualization saved to {save_path}")
def plot_training_curves(self, save_path: str = "multimodal_training.png"):
"""
Plot training loss and validation metrics.
"""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# Loss curve
ax1.plot(self.train_losses, 'b-', marker='o')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Training Loss')
ax1.set_title('Instruction Tuning Loss')
ax1.grid(True, alpha=0.3)
# Accuracy curve
if self.val_metrics:
accs = [m['response_accuracy'] for m in self.val_metrics]
ax2.plot(accs, 'r-', marker='s')
ax2.set_xlabel('Evaluation Step')
ax2.set_ylabel('Response Accuracy (%)')
ax2.set_title('Dialogue Quality')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(save_path, dpi=300)
# =============================================================================
# 6. Interactive Demo Interface
# =============================================================================
class MultimodalChatInterface:
"""
Interactive interface for multimodal chat demo.
Supports multi-image upload and conversation history.
"""
def __init__(self, model: MultimodalChatModel, device: str = "cuda"):
self.model = model
self.device = device
self.conversation_history = MultimodalConversation()
def add_user_message(self, text: str, image_paths: List[str]):
"""Add user turn with text and optional images."""
turn = ConversationTurn(
role="user",
content=text,
images=image_paths
)
self.conversation_history.turns.append(turn)
def get_model_response(self) -> str:
"""Generate assistant response based on history."""
# Load images from last user turn
last_turn = self.conversation_history.turns[-1]
images = []
for img_path in last_turn.images:
if os.path.exists(img_path):
images.append(Image.open(img_path).convert('RGB'))
# Generate
response = self.model.generate_response(
self.conversation_history,
images,
max_new_tokens=512
)
# Add to history
assistant_turn = ConversationTurn(role="assistant", content=response)
self.conversation_history.turns.append(assistant_turn)
return response
def clear_history(self):
"""Reset conversation."""
self.conversation_history = MultimodalConversation()
def run_gradio_demo(self, share: bool = False):
"""
Launch Gradio demo interface.
"""
try:
import gradio as gr
except ImportError:
print("Gradio not installed. Install with: pip install gradio")
return
        def respond(message, history, image_files):
            # gr.File returns file objects whose .name attribute is the temp path
            paths = [f.name for f in image_files] if image_files else []
            self.add_user_message(message, paths)
            response = self.get_model_response()
            # gr.Chatbot expects the full history as (user, assistant) pairs
            history = (history or []) + [(message, response)]
            return history
# Create interface
with gr.Blocks() as demo:
gr.Markdown("# Multimodal Chat with LLaVA-1.5\nUpload images and ask questions!")
with gr.Row():
with gr.Column():
image_input = gr.File(
label="Upload Images",
file_count="multiple",
file_types=["image"]
)
chatbot = gr.Chatbot(height=500)
msg = gr.Textbox(label="Message")
clear = gr.Button("Clear")
with gr.Column():
gr.Markdown("### Instructions")
gr.Markdown("""
1. Upload one or more images
2. Type your question or instruction
3. The model will respond based on all provided images
4. You can continue the conversation with follow-up questions
""")
msg.submit(
respond,
[msg, chatbot, image_input],
[chatbot]
)
clear.click(self.clear_history, None, chatbot, queue=False)
demo.launch(share=share)
# =============================================================================
# 7. Main Execution
# =============================================================================
def main():
import argparse
parser = argparse.ArgumentParser(description="Multimodal Instruction Tuning")
parser.add_argument("--mode", type=str, default="train", choices=["train", "demo", "eval"])
parser.add_argument("--data_path", type=str, default="./data/llava_instruct.json")
parser.add_argument("--image_dir", type=str, default="./data/images")
parser.add_argument("--checkpoint", type=str, default=None)
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--batch_size", type=int, default=1) # Small batch for demo
parser.add_argument("--lr", type=float, default=2e-5)
parser.add_argument("--share", action="store_true", help="Share Gradio demo")
args = parser.parse_args()
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Initialize model
print("Initializing model...")
model = MultimodalChatModel(
llm_model="lmsys/vicuna-7b-v1.5" # Use smaller for testing: "gpt2"
)
if args.checkpoint:
print(f"Loading checkpoint: {args.checkpoint}")
ckpt = torch.load(args.checkpoint, map_location=device)
model.projector.load_state_dict(ckpt['projector_state_dict'])
if args.mode == "train":
# Setup dataset
dataset = MultimodalInstructionDataset(args.data_path, args.image_dir)
dataloader = DataLoader(
dataset,
batch_size=args.batch_size,
shuffle=True,
collate_fn=lambda x: (
[item[0] for item in x], # conversations
[item[1] for item in x], # images lists
[item[2] for item in x] # targets
)
)
# Train
trainer = MultimodalTrainer(model, device=device, lr=args.lr)
for epoch in range(args.epochs):
print(f"\nEpoch {epoch+1}/{args.epochs}")
loss = trainer.train_epoch(dataloader)
print(f"Average loss: {loss:.4f}")
# Save checkpoint
torch.save({
'epoch': epoch,
'projector_state_dict': model.projector.state_dict(),
'optimizer_state_dict': trainer.optimizer.state_dict(),
'loss': loss
}, f"multimodal_epoch{epoch+1}.pt")
trainer.plot_training_curves("multimodal_training.png")
elif args.mode == "demo":
# Interactive demo
interface = MultimodalChatInterface(model, device)
interface.run_gradio_demo(share=args.share)
elif args.mode == "eval":
# Evaluation mode
dataset = MultimodalInstructionDataset(args.data_path, args.image_dir)
        # Custom collate: default_collate cannot batch PIL images or conversation dataclasses
        dataloader = DataLoader(
            dataset, batch_size=1, shuffle=False,
            collate_fn=lambda x: ([i[0] for i in x], [i[1] for i in x], [i[2] for i in x])
        )
trainer = MultimodalTrainer(model, device=device)
metrics = trainer.evaluate_dialogue_quality(dataloader)
print(f"\nEvaluation Results:")
print(f"Response Accuracy: {metrics['response_accuracy']:.2f}%")
# Visualize examples
for i, (conv, imgs, _) in enumerate(dataset):
if i >= 3:
break
trainer.visualize_conversation(conv, imgs, f"eval_sample_{i}.png")
if __name__ == "__main__":
main()
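The conversation serialization used throughout the script (`MultimodalConversation.to_plain_text`) can be reproduced in isolation. A minimal sketch using the same `USER:`/`ASSISTANT:` markers and the trailing `ASSISTANT:` generation cue; the turn contents are invented examples:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

def to_prompt(turns: List[Turn]) -> str:
    """Serialize a conversation, ending with the assistant generation cue."""
    parts = []
    for t in turns:
        marker = "USER:" if t.role == "user" else "ASSISTANT:"
        parts.append(f"{marker} {t.content}")
    return "\n".join(parts) + "\nASSISTANT:"

turns = [
    Turn("user", "<image> What is in the picture?"),
    Turn("assistant", "A cat sitting on a chair."),
    Turn("user", "What color is the cat?"),
]
print(to_prompt(turns))
# USER: <image> What is in the picture?
# ASSISTANT: A cat sitting on a chair.
# USER: What color is the cat?
# ASSISTANT:
```

Ending the prompt with a bare `ASSISTANT:` is what cues the causal LM to continue with the next assistant turn during generation.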