1. Performance Metrics in Evaluating Stable Diffusion Models
Note sources:
1. Performance Metrics in Evaluating Stable Diffusion Models
2. Denoising Diffusion Probabilistic Models
3. A simple explanation of the Inception Score
4. What is the inception score (IS)?
5. Kullback–Leibler divergence
6. Inception Score (IS) 与 Fréchet Inception Distance (FID)
7. Using CLIP Score to evaluated images
The figure below is from: Wikipedia
1.1 Inception Score (IS): Evaluating Realism Through Classification
IS takes a distinctive approach: each generated image is passed through a pre-trained image classifier, and the score assesses how confidently and unambiguously the classifier can assign it to a class.
Higher IS scores reflect greater realism and coherence in the generated images, and indicate that the model captures the essence of real images well.
Prerequisites
(1)Pre-trained Inception v3 Network: This model is used to classify the generated images.
(2)Generated Images: A diverse set of images generated by the Stable Diffusion model based on various text prompts.
Steps to Calculate Inception Score
(1)Generate Images
Use the Stable Diffusion model to generate a large number of images from diverse text prompts. The more diverse the text prompts, the better the evaluation.
(2)Preprocess Images
Ensure that the images are correctly sized (typically 299x299 pixels) and normalized to the format expected by the Inception v3 network.
(3)Pass Images Through Inception v3
Feed each generated image into the Inception v3 network to obtain the predicted label distribution p(y|x), i.e. a probability distribution over classes for each image (x: image, y: label).
(4)Compute Marginal Distribution
Calculate the marginal distribution p(y) over all generated images.
(5)Calculate KL Divergence
Compute the Kullback–Leibler (KL) divergence between the conditional distribution p(y|x) of each generated image and the marginal distribution p(y) over all generated images, then average the KL divergences across all images.
(The KL divergence is a measure of how similar/different two probability distributions are.)
The figure below is from: Kullback–Leibler divergence
The KL divergence measures how different two probability distributions are; by computing it, we can tell how similar two distributions actually are.
The greater the difference between the two distributions, the larger the KL divergence.
The smaller the difference, the smaller the KL divergence.
If the two distributions are identical, the KL divergence is 0.
The figure below shows how the KL divergence changes as the two distributions vary:
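For reference, for discrete distributions P and Q over the same set of classes, the KL divergence used here is the standard definition:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$
In the IS computation, P is the per-image distribution p(y|x) and Q is the marginal p(y).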
(6)Exponentiation
The Inception Score is the exponentiation of the average KL divergence.
To get the final score, we average the KL divergences over all of our images and take the exponential of that average (the exponential simply stretches the score onto a larger scale, making improvements easier to see). The result is the Inception Score!
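Putting the steps together, the Inception Score can be written as
$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim p_g}\big[D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y)\big)\big]\Big)$$
where p_g is the distribution of generated images, p(y|x) is the classifier's label distribution for a single generated image x, and p(y) is the marginal label distribution over all generated images.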
Summary of the calculation:
(1)Use the Inception v3 network to obtain the class probability distribution p(y|x) of each generated image.
(2)Compute the marginal distribution p(y) over all generated images.
(3)Compute the KL divergence between each generated image's distribution p(y|x) and the overall distribution p(y); this gives one KL value per image.
(4)Sum these KL values and take their average.
(5)Exponentiate the value from (4) to obtain the final Inception Score.
A large KL divergence means that an individual generated image is of high quality and is easily assigned to a distinct class by the classifier.
A large IS means the generated images are not only diverse but also of high quality.
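As a toy illustration of these five steps (separate from the full evaluation script below), here is a minimal sketch using made-up class probabilities for four images over three classes:
python
import numpy as np

# Made-up p(y|x) for 4 "generated images" over 3 classes; each row sums to 1
p_yx = np.array([
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
    [0.34, 0.33, 0.33],
])
p_y = p_yx.mean(axis=0)                                  # step (2): marginal p(y)
kl = (p_yx * (np.log(p_yx) - np.log(p_y))).sum(axis=1)   # step (3): D_KL(p(y|x) || p(y)) per image
inception_score = np.exp(kl.mean())                      # steps (4)+(5): exponentiate the average KL
print(inception_score)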
To compute IS for generated images, the code below is adapted from: python实现Inception Score代码(读取自己生成的图片); see also: sbarratt/inception-score-pytorch
python
import torch
from torch import nn
from torch.nn import functional as F
import numpy as np
from torchvision.models.inception import inception_v3
from PIL import Image
import os
from scipy.stats import entropy
import argparse
from tqdm import tqdm
'''
(1)Generate Images
(2)Preprocess Images
Ensure that the images are correctly sized (typically 299x299 pixels)
and normalized to the format expected by the Inception v3 network.
(3)Compute predicted label distributions p(y|x)
Pass Images Through Inception v3 to obtain the predicted label distributions p(y|x).
This provides a probability distribution over classes for each image
(4)Compute Marginal Distribution p(y)
Calculate the marginal distribution p(y) over all generated images.
(5)Calculate KL Divergence
D_KL(p(y|x)|p(y))
(6)Average the KL divergences across all images.
Expectation(D_KL(p(y|x)|p(y)))
(7)Exponentiation
Exp(Expectation(D_KL(p(y|x)|p(y))))
'''
# (1) python Inception_score.py --input_image_dir path_of_your_generated_images
# Argument parser setup
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--input_image_dir', type=str, default='./input_images', help='Directory containing input images')
parser.add_argument('--batch_size', type=int, default=1, help='Batch size for processing images')
parser.add_argument('--device', type=str, choices=["cuda:0", "cpu"], default="cuda:0", help='Device for computation')
args = parser.parse_args()
# (2) Preprocess images: Normalization
# Inception v3 model preprocessing constants
mean_inception = [0.485, 0.456, 0.406] # Mean for normalization
std_inception = [0.229, 0.224, 0.225] # Standard deviation for normalization
# image -> array
def imread(filename):
"""
Loads an image file and converts it into a (height, width, 3) uint8 numpy array.
Args:
filename (str): Path to the image file.
Returns:
np.ndarray: Image data in a (height, width, 3) format.
"""
return np.asarray(Image.open(filename), dtype=np.uint8)[..., :3]
# calculate IS
def inception_score(batch_size=args.batch_size, resize=True, splits=1):
    """
    Computes the Inception Score for images in the specified directory.
    Args:
        batch_size (int): Number of images to process in each batch.
        resize (bool): Whether to resize images to the input size of the Inception model.
        splits (int): Number of subsets.
    Returns:
        tuple: Maximum Inception Score and average Inception Score.
    """
    device = torch.device(args.device)  # Set computation device (CPU or GPU)
    # Load pre-trained Inception v3 model
    inception_model = inception_v3(pretrained=True, transform_input=False).to(device)
    inception_model.eval()  # Set model to evaluation mode
    # Ensure that the images are correctly sized (typically 299x299 pixels)
    # and normalized to the format expected by the Inception v3 network.
    up = nn.Upsample(size=(299, 299), mode='bilinear', align_corners=False).to(device)

    # calculate p(y|x)
    # y: class label, x: an image
    def get_pred(x):
        """
        Computes class probabilities using the Inception model.
        Args:
            x (torch.Tensor): Batch of images.
        Returns:
            np.ndarray: Class probabilities for each image.
        """
        if resize:
            x = up(x)  # Resize images if needed
        x = inception_model(x)  # Get model predictions (logits)
        return F.softmax(x, dim=1).data.cpu().numpy()  # Apply softmax and move to CPU
    print('Computing predictions using Inception v3 model')
    files = read_dir()  # Get list of image files
    N = len(files)
    # Store p(y|x) of each image:
    # initialize a numpy array to hold predictions for all images.
    # N is the number of generated images;
    # 1000 corresponds to the number of output classes in the Inception v3 model.
    # Each row will store the prediction (class probabilities) for one image.
    preds = np.zeros((N, 1000))  # Array to store predictions
    # Adjust batch size if it's larger than the number of images
    if batch_size > N:
        print('Warning: Batch size is larger than the number of images. Setting batch size to data size.')
        batch_size = N
    # Process images in batches
    for i in tqdm(range(0, N, batch_size)):  # Loop over the image indices in steps of batch_size
        start = i  # Start index for the current batch
        end = min(i + batch_size, N)  # End index, ensuring it doesn't exceed the number of images
        # For each file in the current batch, read the image and convert it to a float32 numpy array
        images = np.array([imread(f).astype(np.float32) for f in files[start:end]])
        # Rearrange the dimensions to (n_images, 3, height, width)
        # and scale pixel values to the [0, 1] range
        images = images.transpose((0, 3, 1, 2)) / 255
        # Apply the ImageNet mean/std normalization expected by Inception v3
        # (uses the mean_inception and std_inception constants defined above)
        images = (images - np.array(mean_inception).reshape(1, 3, 1, 1)) / np.array(std_inception).reshape(1, 3, 1, 1)
        # Convert the NumPy array to a PyTorch FloatTensor and move it to the specified device
        batch = torch.from_numpy(images).type(torch.FloatTensor).to(device)
        # Compute class probabilities for the current batch using the Inception model
        # and store them in the preds array at the indices of the current batch
        preds[start:end] = get_pred(batch)  # Store predictions for the current batch
    # Ensure that the batch size is greater than 0 to avoid invalid batch processing
    assert batch_size > 0
    # Ensure that the total number of images is greater than the batch size
    # to allow for meaningful splitting and processing
    assert N > batch_size
    # Compute the Inception Score using KL Divergence
    print('Computing KL Divergence')
    # The split_scores list gathers the Inception Score of each subset;
    # these are then combined to obtain the final score.
    split_scores = []  # Initialize an empty list to store the score of each split
    for k in range(splits):
        part = preds[k * (N // splits): (k + 1) * (N // splits), :]  # Split predictions into equal parts
        # p(y): marginal probability, obtained by averaging the predictions in the split
        py = np.mean(part, axis=0)
        # Compute the KL Divergence of each image's prediction against the marginal probability
        # D_KL(p(y|x)|p(y))
        scores = [entropy(pyx, py) for pyx in part]
        # Exponentiate the average KL divergence of this split: Exp(Expectation(D_KL(p(y|x)|p(y))))
        split_scores.append(np.exp(np.mean(scores)))
    return np.max(split_scores), np.mean(split_scores)  # Return the maximum and average Inception Scores over the splits
def read_dir():
    """
    Recursively reads all image files from the specified directory.
    Returns:
        list: List of file paths.
    """
    dirPath = args.input_image_dir  # Get the directory path from command-line arguments
    allFiles = []  # Initialize an empty list to store file paths
    if os.path.isdir(dirPath):  # Check if the specified path is a directory
        # Walk through the directory tree
        for root, _, files in os.walk(dirPath):
            for file in files:
                # For each file, construct the full path and add it to the list
                allFiles.append(os.path.join(root, file))
    else:
        # Print an error message if the specified path is not a directory
        print('Error: Specified path is not a directory.')
    return allFiles  # Return the list of file paths
# Splitting the Data: The splits parameter allows dividing the predictions into multiple subsets.
# This is helpful for reducing the variance in the final Inception Score.
if __name__ == '__main__':
    max_is, avg_is = inception_score(splits=1)  # Compute Inception Scores
    print(f'MAX IS: {max_is:.4f}')
    print(f'Average IS: {avg_is:.4f}')
Six images were generated with a pre-trained model for this test; in practice, computing IS requires a large number of images (e.g. 50,000).
The IS results are shown in the figure below; the larger the IS, the better the quality and the greater the diversity of the generated images.
Limitations of IS
(1) If you are generating something that is not present in the classifier's training data (e.g. sharks are not in ILSVRC 2014), you may always get a low IS despite generating high-quality images, because such images are never assigned to a distinct class.
(2) If the classifier network cannot detect the features relevant to your notion of image quality, poor-quality images may still receive high scores.
1.2 Fréchet inception distance (FID): Assessing Image Distribution Similarity
Differences between IS and FID
Unlike the earlier inception score (IS), which evaluates only the distribution of generated images,
the FID compares the distribution of generated images with the distribution of a set of real images ("ground truth").
Under Inception V3's "world view", any data that does not look like ImageNet is regarded as unrealistic, and there is no guarantee it will yield a sharp prediction distribution. To evaluate a generative model more faithfully, we therefore need a more direct way of measuring the distance between the real distribution and the generated samples; FID measures exactly this distance between generated samples and real-world samples. --- Quoted from: Inception Score (IS) 与 Fréchet Inception Distance (FID)
FID
FID is a cornerstone metric that measures the distance between the distributions of generated and real images.
Lower FID scores signify a closer match between generated images and real-world images, and thus a model that better mimics the real data distribution.
The figure below is from: Fréchet inception distance
(1)Generating Images with Prompts
Use your diffusion model to generate images from text prompts.
(2)Extract Features
Pass both the generated images and a set of reference images through a pre-trained Inception network to extract feature vectors. Usually, the Inception v3 model is used for this purpose.
(3)Compute FID Score
Calculate the FID score between the feature distributions of the generated images and the reference images.
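Concretely, the two sets of Inception feature vectors are each summarized by their mean and covariance, (μ_g, Σ_g) for the generated images and (μ_r, Σ_r) for the reference images, and the FID is the Fréchet distance between the Gaussians fitted to them:
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$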
Code reference: mseitzer/pytorch-fid; its two main files are the InceptionV3 wrapper and the FID Score computation.
After installing it, you can run it directly as a module to compute the score:
powershell
pip install pytorch-fid
The generated images serve as the sample dataset, while the ImageNet dataset itself serves as the reference dataset:
powershell
python -m pytorch_fid path/to/dataset1 path/to/dataset2
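If you prefer to call it from Python rather than the command line, a minimal sketch along the following lines should work; it assumes the layout of mseitzer/pytorch-fid, where fid_score.py exposes calculate_fid_given_paths (check the signature of your installed version):
python
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths  # assumed location in mseitzer/pytorch-fid

device = "cuda" if torch.cuda.is_available() else "cpu"
# The two folders play the roles of dataset1 and dataset2 in the command above
fid_value = calculate_fid_given_paths(
    ["path/to/dataset1", "path/to/dataset2"],
    batch_size=50,
    device=device,
    dims=2048,  # dimensionality of the Inception v3 pool3 features
)
print(f"FID: {fid_value:.4f}")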
1.3 CLIP Score
Text-guided image generation uses models such as StableDiffusionPipeline to generate images from textual prompts, and the results can then be evaluated with CLIP scores.
CLIP scores measure how well an image matches its caption: higher scores signify better compatibility between the image and the associated text.
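A common convention (used, for example, by the CLIPScore metric) is to report a scaled cosine similarity between the CLIP image embedding E_I and the text embedding E_c:
$$\mathrm{CLIPScore}(I, c) = \max\big(100\cdot\cos(E_I, E_c),\,0\big)$$
Scaling conventions differ between implementations; the code further below reports the raw cosine similarity.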
Practical Implementation
(1)Generating Images with Prompts
StableDiffusionPipeline generates images from multiple prompts, producing a diverse set of images aligned with the given textual cues.
(2)Computing CLIP Scores
After generating images, the CLIP scores are calculated to quantify the compatibility between each image and its corresponding prompt.
(3)Comparative Evaluation
Comparing Different Checkpoints: generate images with different checkpoints, calculate CLIP scores for each set, and compare them to assess the performance differences between the versions. For example, comparing the v1-4 and v1-5 checkpoints revealed improved performance in the latter.
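As a rough, illustrative sketch of such a comparison (not the article's code): it assumes the diffusers StableDiffusionPipeline API and OpenAI's clip package, and the checkpoint names and prompts are placeholders to adapt to your own setup.
python
import torch
import clip
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

prompts = ["a photo of an astronaut riding a horse",
           "an oil painting of a fox in the snow"]

def prompt_clip_score(image, text):
    # Cosine similarity between the CLIP embeddings of a PIL image and a text prompt
    image_input = clip_preprocess(image).unsqueeze(0).to(device)
    text_input = clip.tokenize([text]).to(device)
    with torch.no_grad():
        img_feat = clip_model.encode_image(image_input)
        txt_feat = clip_model.encode_text(text_input)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

def average_clip_score(checkpoint):
    # Generate one image per prompt with the given checkpoint and score it against its prompt
    pipe = StableDiffusionPipeline.from_pretrained(checkpoint).to(device)
    return sum(prompt_clip_score(pipe(p).images[0], p) for p in prompts) / len(prompts)

for ckpt in ["CompVis/stable-diffusion-v1-4", "runwayml/stable-diffusion-v1-5"]:
    print(ckpt, average_clip_score(ckpt))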
The following site can score an image directly against its corresponding text: taesiri/CLIPScore
Code reference 1: CLIP Score for PyTorch
Install PyTorch
powershell
pip install torch # Choose a version that suits your GPU
Install CLIP
powershell
pip install git+https://github.com/openai/CLIP.git
Install clip-score from PyPI
powershell
pip install clip-score
Usage
powershell
python -m clip_score path/to/image path/to/text
Code reference 2: Using CLIP Score to evaluated images
powershell
pip install -U torch torchvision
pip install -U git+https://github.com/openai/CLIP.git
python
import torch
import clip
from PIL import Image
def get_clip_score(image_path, text):
    # Load the pre-trained CLIP model and the image
    model, preprocess = clip.load('ViT-B/32')
    image = Image.open(image_path)
    # Preprocess the image and tokenize the text
    image_input = preprocess(image).unsqueeze(0)
    text_input = clip.tokenize([text])
    # Move the inputs to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    image_input = image_input.to(device)
    text_input = text_input.to(device)
    model = model.to(device)
    # Generate embeddings for the image and text
    with torch.no_grad():
        image_features = model.encode_image(image_input)
        text_features = model.encode_text(text_input)
    # Normalize the features
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Calculate the cosine similarity to get the CLIP score
    clip_score = torch.matmul(image_features, text_features.T).item()
    return clip_score
image_path = "path/to/your/image.jpg"
text = "your text description"
score = get_clip_score(image_path, text)
print(f"CLIP Score: {score}")