[Paper Reproduction] (CLIP) Text Can Be Paired with Images Too

📝 Personal homepage 🌹: Eternity._


❀ (CLIP) Text Can Be Paired with Images Too

Overview
Algorithm Overview
Demo
Core Logic
Usage
Deployment
References

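Overview

A modality is a form in which data is presented, such as images, text, audio, or point clouds. Multimodal learning trains a model to process and fuse several of these data forms at once, which can improve both its prediction accuracy and its robustness. Take autonomous driving as an example: to perceive the surrounding traffic environment completely and accurately, a vehicle usually carries several kinds of sensors, such as cameras and LiDAR. Camera images contain rich texture detail but convey depth imprecisely; LiDAR point clouds, by contrast, capture the 3D outline of the surroundings accurately, although the points are often sparse. Feeding both modalities into the model together can substantially strengthen its perception of the environment. CLIP, the model reproduced in this post, applies the same idea to two modalities: images and text.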


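Algorithm Overview

At its core, CLIP uses contrastive learning so that the model learns to tell positive (matching) image-text pairs from negative (non-matching) ones. To do this, CLIP uses a dual-encoder architecture with two dedicated encoders: an image encoder, which can be built on either a convolutional neural network (CNN, such as ResNet) or the more recent Vision Transformer (ViT), and a text encoder built on the Transformer. Each encoder turns its input into a feature representation, and a linear projection then maps these representations into a shared multimodal embedding space. During training, CLIP optimizes the image encoder and the text encoder jointly to maximize the cosine similarity between the embeddings of the N correctly matched image-text pairs in a batch, while minimizing it for the N² − N mismatched pairs. This similarity then serves as the measure of how well an image and a text match.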
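The objective described above can be written out as a symmetric cross-entropy over the batch. The block below is a sketch rather than the authors' exact formulation: it borrows the symbols I_e, T_e (L2-normalized embeddings) and t (learned temperature parameter) from the pseudocode in the Core Logic section, and the names "rows" and "cols" for the two loss terms are labels chosen here for illustration.

```latex
\begin{align}
s_{ij} &= e^{t}\,\langle I_{e,i},\, T_{e,j} \rangle, \qquad i, j = 1, \dots, N \\
\mathcal{L}_{\text{rows}} &= -\frac{1}{N}\sum_{i=1}^{N} \log\frac{\exp(s_{ii})}{\sum_{j=1}^{N}\exp(s_{ij})} \\
\mathcal{L}_{\text{cols}} &= -\frac{1}{N}\sum_{j=1}^{N} \log\frac{\exp(s_{jj})}{\sum_{i=1}^{N}\exp(s_{ij})} \\
\mathcal{L} &= \tfrac{1}{2}\left(\mathcal{L}_{\text{rows}} + \mathcal{L}_{\text{cols}}\right)
\end{align}
```

Because both embeddings have unit length, the inner product is exactly the cosine similarity. The diagonal terms s_ii belong to the N matched pairs, so lowering the loss raises their similarity relative to every mismatched pair in the same row and column.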

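Demo

Core Logic

Pass the images and texts through the image encoder and the text encoder to obtain the features I_f and T_f;
Then apply linear projections to map these features into the multimodal embedding space, giving the vectors I_e and T_e;
Finally, compute the pairwise similarities between images and texts, and the cross-entropy loss over them.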

python

# NumPy-style pseudocode: the encoders and helper functions are not defined here
# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l]       - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t             - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I)  # [n, d_i]
T_f = text_encoder(T)   # [n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t) / 2
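The pseudocode leaves `l2_normalize`, `cross_entropy_loss`, and the two encoders undefined, so it cannot be run as written. The following is a minimal runnable sketch of the same computation in NumPy. It makes several assumptions: the helper implementations are my own, random arrays stand in for the encoder outputs I_f and T_f, the function name `clip_loss` is hypothetical, and the starting value t = log(1/0.07) is used only as an example.

```python
import numpy as np


def l2_normalize(x, axis=1):
    """Scale every vector along `axis` to unit L2 norm."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def cross_entropy_loss(logits, labels, axis):
    """Mean cross-entropy with the softmax taken along `axis`.

    axis=1 treats each row as one example, axis=0 treats each column as one.
    """
    if axis == 0:
        logits = logits.T
    # Numerically stable log-softmax over the class dimension
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def clip_loss(I_f, T_f, W_i, W_t, t):
    """Symmetric contrastive loss following the pseudocode step by step."""
    I_e = l2_normalize(I_f @ W_i, axis=1)   # [n, d_e]
    T_e = l2_normalize(T_f @ W_t, axis=1)   # [n, d_e]
    logits = I_e @ T_e.T * np.exp(t)        # [n, n] scaled cosine similarities
    labels = np.arange(logits.shape[0])     # pair i matches pair i
    loss_i = cross_entropy_loss(logits, labels, axis=0)
    loss_t = cross_entropy_loss(logits, labels, axis=1)
    return (loss_i + loss_t) / 2, logits


# Random stand-ins for encoder outputs and projection weights (illustration only)
rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 4, 8, 6, 5
I_f = rng.normal(size=(n, d_i))
T_f = rng.normal(size=(n, d_t))
W_i = rng.normal(size=(d_i, d_e))
W_t = rng.normal(size=(d_t, d_e))

loss, logits = clip_loss(I_f, T_f, W_i, W_t, t=np.log(1 / 0.07))
print(logits.shape, float(loss))
```

With random inputs the loss is large because nothing has been trained. If the image and text embeddings of each pair coincide and different pairs are orthogonal, the loss drops close to zero, which is the behavior training pushes toward.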

Usage

Edit the text descriptions and the images to compute the similarity between them.

python

import os

# Work around duplicate OpenMP runtime errors on some setups; set before importing torch
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

import clip
import matplotlib.pyplot as plt
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

model, preprocess = clip.load("ViT-B/32", device=device)

# Map each image's base file name to the text that describes it
descriptions = {
    "cat": "a type of pet",
    "guitar": "musician always use",
}

original_images = []
images = []
texts = []

# Load every .png/.jpg file in ./images that has an entry in `descriptions`
for filename in sorted(os.listdir("./images")):
    name, ext = os.path.splitext(filename)
    if ext.lower() not in (".png", ".jpg") or name not in descriptions:
        continue
    image = Image.open(os.path.join("./images", filename)).convert("RGB")
    original_images.append(image)
    images.append(preprocess(image))
    texts.append(descriptions[name])

# Inputs must be on the same device as the model
image_input = torch.stack(images).to(device)
text_tokens = clip.tokenize(["This is " + desc for desc in texts]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input).float()
    text_features = model.encode_text(text_tokens).float()

# L2-normalize so that the dot product below is the cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T  # [texts, images]
count = len(texts)

plt.figure(figsize=(20, 14))
plt.imshow(similarity, vmin=0.1, vmax=1.0)
# plt.colorbar()
plt.yticks(range(count), texts, fontsize=18)
plt.xticks([])
# Draw each source image above its column of the similarity matrix
for i, image in enumerate(original_images):
    plt.imshow(image, extent=(i - 0.5, i + 0.5, -1.6, -0.6), origin="lower")
# Write each similarity value into its cell
for x in range(similarity.shape[1]):
    for y in range(similarity.shape[0]):
        plt.text(x, y, f"{similarity[y, x]:.2f}", ha="center", va="center", size=12)

for side in ["left", "top", "right", "bottom"]:
    plt.gca().spines[side].set_visible(False)

plt.xlim([-0.5, count - 0.5])
plt.ylim([count + 0.5, -2])

plt.title("Cosine similarity between text and image features", size=20)
plt.show()
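Raw cosine similarities tend to sit close together, so it can help to turn each image's column into a probability distribution over the texts. The sketch below does this with a scaled softmax. The function name `best_text_per_image` is hypothetical, the similarity values are made up for illustration, and the scale of 100 mirrors the factor used in the CLIP repository's README example rather than a value read from the loaded model.

```python
import numpy as np


def best_text_per_image(similarity, texts, logit_scale=100.0):
    """Pick the most likely text for each image.

    `similarity` has shape [num_texts, num_images], the same layout as the
    matrix computed in the script above.
    """
    logits = logit_scale * np.asarray(similarity, dtype=np.float64)
    # Softmax over the text axis (axis 0), shifted for numerical stability
    logits = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=0, keepdims=True)
    best = probs.argmax(axis=0)
    return [(texts[row], float(probs[row, col])) for col, row in enumerate(best)]


# Made-up similarity values, for illustration only
similarity = np.array([[0.28, 0.15],
                       [0.14, 0.27]])
texts = ["a type of pet", "musician always use"]

results = best_text_per_image(similarity, texts)
for image_index, (text, prob) in enumerate(results):
    print(f"image {image_index}: '{text}' (p={prob:.3f})")
```

The softmax is taken per image, so the probabilities in one column sum to 1 and only compare the candidate texts against each other; they say nothing about whether any of the texts is a good description in absolute terms.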

Deployment

bash

# Create the environment with the following commands
conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
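After installation, a quick check can confirm that the packages are importable before running the usage script. This is a small sketch: the list of package names is taken from the commands above, `missing_packages` is a hypothetical helper, and `clip.available_models()` is called only when everything is found.

```python
import importlib.util

# Top-level module names installed by the commands above
REQUIRED = ["torch", "torchvision", "ftfy", "regex", "tqdm", "clip"]


def missing_packages(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    import clip
    print("Available CLIP models:", clip.available_models())
```

If the model list prints and includes "ViT-B/32", the environment should be ready for the usage script.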

References

CLIP code repository on GitHub: https://github.com/openai/CLIP
