Stable-Diffusion1.5

SD 1.5 weights: https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main

SDXL weights: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main

SD pipeline code in the diffusers library: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion

SDXL pipeline code in the diffusers library: https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion_xl

Reference: 深入浅出完整解析Stable Diffusion（SD）核心基础知识 - 知乎 (zhihu.com)

1.VAE

2.Unet

3.CLIP Text Encoder

Stable Diffusion is, overall, an end-to-end model built from three core components: a VAE (Variational Auto-Encoder), a U-Net, and the CLIP Text Encoder.

1.VAE

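A VAE (Variational Auto-Encoder) is a generative model based on the Encoder-Decoder architecture. The VAE Encoder converts an input image into a low-dimensional latent feature, which serves as the input to the U-Net. The VAE Decoder reconstructs a pixel-level image from the low-dimensional latent feature.

In Stable Diffusion, the VAE mainly handles image compression and image reconstruction.

Given an input of size H×W×C, the VAE Encoder encodes it into a low-dimensional latent feature of size h×w×c, where f=H/h=W/w is the VAE's downsampling factor. Conversely, the VAE Decoder applies the same factor as an **upsampling factor** to reconstruct a pixel-level image from the low-dimensional latent feature.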
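
As a small worked example of the relationship f=H/h=W/w, the sketch below computes the latent shape for a given image size. It assumes f = 8 and 4 latent channels, the values used by the SD 1.5 VAE; `latent_shape` is a hypothetical helper written for this note, not a library function.

```python
def latent_shape(height, width, f=8, latent_channels=4):
    """Return (h, w, c) of the latent for an H x W input.

    f=8 and latent_channels=4 are the assumed SD 1.5 VAE values.
    """
    if height % f or width % f:
        raise ValueError("height and width must be divisible by f")
    return height // f, width // f, latent_channels


h, w, c = latent_shape(512, 512)
print(h, w, c)  # 64 64 4

# Ratio of input values to latent values for a 3-channel RGB image
compression = (512 * 512 * 3) / (h * w * c)
print(compression)  # 48.0
```

Under the same assumptions, a 1024×1024 image maps to a 128×128×4 latent.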

Figure: structure of the VAE Encoder and VAE Decoder

# VAE compression and reconstruction demo
import cv2
import torch
import numpy as np
from diffusers import AutoencoderKL

# Load the VAE: it can be loaded on its own by pointing subfolder at the "vae" directory.
VAE = AutoencoderKL.from_pretrained("/path/to/stable-diffusion-v1-5", subfolder="vae")
VAE.to("cuda", dtype=torch.float16)

# Read and resize the image with OpenCV
raw_image = cv2.imread("catwoman.png")
raw_image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB)
raw_image = cv2.resize(raw_image, (1024, 1024))

# Convert the image to float and normalize it to [-1, 1]
image = raw_image.astype(np.float32) / 127.5 - 1.0

# Rearrange the array dimensions to PyTorch's (N, C, H, W) layout
image = image.transpose(2, 0, 1)
image = image[None, :, :, :]

# Convert to a PyTorch tensor
image = torch.from_numpy(image).to("cuda", dtype=torch.float16)

# Compress the image into a latent feature and reconstruct it
with torch.inference_mode():
    # Compress and reconstruct with the VAE
    latent = VAE.encode(image).latent_dist.sample()
    rec_image = VAE.decode(latent).sample

    # Post-processing: map from [-1, 1] back to [0, 1]
    rec_image = (rec_image / 2 + 0.5).clamp(0, 1)
    rec_image = rec_image.cpu().permute(0, 2, 3, 1).numpy()

    # De-normalize to 8-bit pixel values
    rec_image = (rec_image * 255).round().astype("uint8")
    rec_image = rec_image[0]

    # Save the reconstructed image
    cv2.imwrite("reconstructed_catwoman.png", cv2.cvtColor(rec_image, cv2.COLOR_RGB2BGR))

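OpenAI's open-source **consistency decoder (consistency-decoder)** can be used as the VAE for Stable Diffusion 1.x and 2.x. Because the consistency-decoder model is large (FP32: 2.49G, FP16: 1.2G), reconstruction takes much longer than with the native SD VAE, and at high resolutions (such as 1024x1024) its results are not clearly better than the native SD VAE, so it is best kept as a supplementary backup model.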

import torch
from diffusers import StableDiffusionPipeline, ConsistencyDecoderVAE

# Load the consistency decoder in the same dtype as the pipeline (float16).
# The original snippet read pipe.torch_dtype before pipe existed and imported the wrong pipeline class.
vae = ConsistencyDecoderVAE.from_pretrained("/path/to/consistency-decoder", torch_dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(
    "/path/to/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16
).to("cuda")

pipe("horse", generator=torch.manual_seed(0)).images

2.Unet

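The U-Net is the core of the model: it predicts the noise residual and, together with a sampling method (a scheduler such as DDPM, DDIM, or DPM++), reconstructs the input feature matrix, gradually turning random Gaussian noise into the latent feature of an image.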
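
To make the interplay between the predicted noise and the sampling method concrete, here is a minimal scalar sketch of one deterministic DDIM update (eta = 0). It is an illustration under stated assumptions, not the diffusers implementation: `ddim_step` is a hypothetical helper, the cumulative alpha values are made-up numbers, and in SD the U-Net would supply `eps_pred` for a whole latent tensor.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a scalar latent value."""
    # Estimate the clean latent x_0 from the noisy latent and the predicted noise
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise the estimate to the previous (less noisy) timestep
    x_prev = math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred
    return x_prev, x0_pred


# Build a noisy value from a known x_0 and noise, then step with a perfect noise prediction
x0, eps = 0.7, -1.3
alpha_bar_t, alpha_bar_prev = 0.5, 0.8  # illustrative values only
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps
x_prev, x0_pred = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
print(x0_pred, x_prev)  # x0_pred matches x0 when the noise prediction is exact
```

With an imperfect noise prediction the estimate of x_0 is only approximate, which is why the update is repeated over many timesteps.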

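In the CrossAttention structure, Q is generated from the latent features flowing through the U-Net (which originate from the VAE's latent space), while K and V are generated from the text features.

Self-attention: q, k, and v are all generated from the same feature, and attention is computed over them.

Cross-attention: one feature generates q, another feature generates k and v, and attention is then computed.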
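
The difference between the two can be sketched with plain Python lists. This is a dependency-free toy, assuming scaled dot-product attention and leaving out the learned q/k/v projection matrices that a real U-Net block applies; the feature values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain Python lists.

    queries: n_q vectors of size d; keys: n_k vectors of size d;
    values: n_k vectors of size d_v. Returns n_q vectors of size d_v.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy features: 4 latent positions and 3 text tokens, both of dimension 2
latent_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
text_feats = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]

# Self-attention: q, k, v all come from the latent features
self_out = attention(latent_feats, latent_feats, latent_feats)
# Cross-attention: q from the latent features, k and v from the text features
cross_out = attention(latent_feats, text_feats, text_feats)

print(len(self_out), len(cross_out))  # 4 4
```

In both cases the output has one row per query, so cross-attention keeps the spatial layout of the latent while mixing in information from the text tokens.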

3.CLIP Text Encoder

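CLIP is a multimodal model based on contrastive learning, consisting mainly of a Text Encoder and an Image Encoder. The Text Encoder extracts text features and can be a text transformer of the kind commonly used in NLP; the Image Encoder extracts image features and can be a CNN or Vision Transformer model (such as ResNet or ViT). CLIP was trained directly on a dataset of 400 million image-text pairs to learn the correspondence between images and text.

CLIP training:

During training, CLIP randomly draws an image and its label text from the training set, uses the Image Encoder and the Text Encoder to extract an embedding vector from each, and then compares the two embeddings with cosine similarity to judge whether the text and the image match. The gradients are backpropagated and the model is optimized step by step.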
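
In practice CLIP contrasts every image with every text in a batch rather than one pair at a time: the matching pairs are positives and all other combinations are negatives. The sketch below shows that symmetric contrastive loss on toy data. The embeddings are made up, `clip_loss` is a hypothetical helper, and the fixed temperature of 0.07 is an assumption (CLIP learns this value during training).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i and text i form the positive pair."""
    n = len(image_embs)
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [[cosine_similarity(img, txt) / temperature for txt in text_embs]
              for img in image_embs]
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    loss_t2i = sum(cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy embeddings: matched pairs point in similar directions
image_embs = [[1.0, 0.1], [0.0, 1.0]]
text_embs = [[0.9, 0.0], [0.1, 1.1]]

matched = clip_loss(image_embs, text_embs)
mismatched = clip_loss(image_embs, list(reversed(text_embs)))
print(matched < mismatched)  # True
```

Minimizing this loss pulls the embeddings of matching image-text pairs together and pushes non-matching ones apart.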

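As described above, CLIP consists of a Text Encoder and an Image Encoder; Stable Diffusion mainly uses the Text Encoder. The CLIP Text Encoder encodes the input text prompt into Text Embeddings (the semantic information of the text), which enter Stable Diffusion through the CrossAttention modules of the U-Net described earlier and act as the condition that controls and guides the content of the generated image. SD currently uses the Text Encoder from CLIP ViT-L/14.

The Text Encoder in CLIP ViT-L/14 is a pure Transformer model made up of 12 CLIPEncoderLayer modules, with 123M parameters; its structure is shown in the figure below. The feature dimension is 768 and the number of tokens is 77, so the output Text Embeddings have shape 77x768.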

from transformers import CLIPTextModel, CLIPTokenizer

# Load the CLIP Text Encoder and tokenizer
text_encoder = CLIPTextModel.from_pretrained("/path/to/stable-diffusion-v1-5", subfolder="text_encoder").to("cuda")
text_tokenizer = CLIPTokenizer.from_pretrained("/path/to/stable-diffusion-v1-5", subfolder="tokenizer")

# Tokenize the prompt fed to SD to obtain the corresponding token ids
prompt = "1girl,beautiful"
text_token_ids = text_tokenizer(
    prompt,
    padding="max_length",
    max_length=text_tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt"
).input_ids

print("text_token_ids' shape:",text_token_ids.shape)
print("text_token_ids:",text_token_ids)

# Feed the token ids into the CLIP Text Encoder to get the 77x768 Text Embeddings.
# The encoder returns a tuple-like output, so [0] selects the 77x768 Text Embeddings.
text_embeddings = text_encoder(text_token_ids.to("cuda"))[0]
print("text_embeddings' shape:",text_embeddings.shape)
print(text_embeddings)

---------------- Output ----------------
text_token_ids' shape: torch.Size([1, 77])
text_token_ids: tensor([[49406,   272,  1611,   267,  1215, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407]])
text_embeddings' shape: torch.Size([1, 77, 768])
tensor([[[-0.3884,  0.0229, -0.0522,  ..., -0.4899, -0.3066,  0.0675],
         [-0.8425, -1.1393,  1.2756,  ..., -0.2595,  1.6293, -0.7857],
         [ 0.1753, -0.9846,  0.1879,  ...,  0.0664, -1.4935, -1.2614],
         ...,
         [ 0.2039, -0.7296, -0.3212,  ...,  0.6748, -0.5813, -0.7323],
         [ 0.1921, -0.7344, -0.3045,  ...,  0.6803, -0.5852, -0.7230],
         [ 0.2114, -0.6436, -0.3047,  ...,  0.6624, -0.5575, -0.7586]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward0>)

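In general, the features from the last layer of the CLIP Text Encoder are used as the input to the CrossAttention modules. Experience accumulated by the open-source community suggests, however, that features from the penultimate layer of the CLIP Text Encoder are a good choice for anime-style content, while features from the last layer suit photorealistic content. This reminds Rocky of SRGAN and perceptual loss, which likewise got their best results from intermediate-layer VGG features; such shared patterns across the AI field turn up in unexpected places and are part of its charm.

Because CLIP was trained with a maximum of 77 tokens, during SD inference a prompt with more than 77 tokens is truncated back to 77x768, and a prompt with fewer than 77 tokens is padded to 77x768. If fully convolutional networks freed the input image size from restrictions, this CLIP setting does something similar for the input text: a prompt of any length is accepted (even an empty one), although tokens beyond the limit are discarded. Whether the text is very long or empty, the result is a feature matrix of the same shape.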
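
The padding and truncation rule can be sketched without loading a tokenizer. The ids 49406 and 49407 are the start and end/padding ids visible in the run output above; `pad_or_truncate` is a hypothetical helper that mimics the tokenizer call with `padding="max_length"` and `truncation=True`, not the tokenizer's actual code.

```python
BOS_ID = 49406  # start-of-text id, as in the run output above
EOS_ID = 49407  # end-of-text id, also used for padding in the run output above
MAX_LENGTH = 77


def pad_or_truncate(content_ids, max_length=MAX_LENGTH):
    """Wrap content token ids in BOS/EOS and force the result to max_length."""
    # Leave room for the BOS and EOS tokens
    kept = content_ids[: max_length - 2]
    ids = [BOS_ID] + kept + [EOS_ID]
    # Pad short prompts up to max_length
    ids += [EOS_ID] * (max_length - len(ids))
    return ids


short = pad_or_truncate([272, 1611, 267, 1215])   # "1girl,beautiful" from the example above
empty = pad_or_truncate([])                        # empty prompt
long = pad_or_truncate(list(range(1000, 1200)))    # 200 made-up content ids

print(len(short), len(empty), len(long))  # 77 77 77
print(short[:6])  # [49406, 272, 1611, 267, 1215, 49407]
```

Every prompt therefore yields exactly 77 token ids, which the Text Encoder turns into a 77x768 matrix.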

In SD training, CLIP's overall performance is generally sufficient for downstream tasks, so the CLIP Text Encoder parameters are frozen and do not need to be retrained.

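4.SD training process:

When training Stable Diffusion, the VAE and CLIP parts are both frozen, so official training of the SD model family mainly trains the U-Net.

Dataset: each sample is an image plus its description (prompt).

As shown in the figure above, the U-Net predicts the noise (noise1) that was added to the latent features, and the loss is the mean squared error (MSE) between noise1 and the actual noise, not a cross-entropy loss. Training repeats this add-noise and predict-noise step many times, each time with a randomly sampled timestep.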

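import torch
import torch.nn.functional as F

# Assumes vae, text_encoder, text_tokenizer, unet, noise_scheduler, optimizer
# and train_dataloader have already been created; only the U-Net is trained.
for step, batch in enumerate(train_dataloader):
    with torch.no_grad():
        # Encode the image into latent space
        latents = vae.encode(batch["image"]).latent_dist.sample()
        latents = latents * vae.config.scaling_factor  # rescale latents
        # Extract the text embeddings (the text encoder is frozen, so no gradients are needed)
        text_input_ids = text_tokenizer(
            batch["text"],
            padding="max_length",
            max_length=text_tokenizer.model_max_length,
            truncation=True,
            return_tensors="pt"
        ).input_ids
        text_embeddings = text_encoder(text_input_ids.to(latents.device))[0]

    # Randomly sample noise
    noise = torch.randn_like(latents)
    bsz = latents.shape[0]
    # Randomly sample a timestep for each image in the batch
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (bsz,), device=latents.device)
    timesteps = timesteps.long()

    # Add the noise to the latents, i.e. the forward diffusion process
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Predict the noise and compute the loss
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
    loss = F.mse_loss(model_pred.float(), noise.float(), reduction="mean")

    # Backpropagate and update the U-Net (optimizer is an assumed name, not part of the original snippet)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()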
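
What `noise_scheduler.add_noise` computes in the loop above can be sketched in plain Python. This is a toy version under stated assumptions: it uses the closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, and the "scaled_linear" beta values (0.00085 to 0.012 over 1000 steps) are the ones that appear in the scheduler configuration later in these notes. The helper names are hypothetical.

```python
import math


def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_train_timesteps=1000):
    """Betas spaced linearly in sqrt space, then squared ("scaled_linear")."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    step = (hi - lo) / (num_train_timesteps - 1)
    return [(lo + i * step) ** 2 for i in range(num_train_timesteps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for all s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, noise, t, alpha_bars):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise, elementwise."""
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    return [a * x + b * n for x, n in zip(x0, noise)]


alpha_bars = cumulative_alphas(scaled_linear_betas())
latent = [0.5, -1.0, 2.0]   # toy latent values
noise = [0.3, 0.1, -0.7]    # stands in for torch.randn_like(latents)

early = add_noise(latent, noise, 0, alpha_bars)     # almost the clean latent
late = add_noise(latent, noise, 999, alpha_bars)    # almost pure noise
print(early, late)
```

Because the noisy latent at any timestep has this closed form, training can jump straight to a random timestep instead of adding noise one step at a time.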

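5.SD inference

5.1 Text-to-image

Random noise serves as the initial latent feature and is denoised step by step; after n steps the resulting latent feature is decoded by the VAE Decoder.

5.2 Image-to-image

The input image is first encoded into a latent feature by the VAE Encoder and partially noised; this latent feature is then denoised step by step, and after n steps it is decoded by the VAE Decoder.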
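
How much noise is added before denoising starts is usually controlled by a `strength` parameter, which these notes do not cover. The sketch below shows how such a parameter can map to the number of denoising steps; it follows the logic of the diffusers image-to-image pipeline as understood here, and `img2img_schedule` is a hypothetical helper.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (start_index, steps_to_run) for an image-to-image run.

    strength in [0, 1]: 1.0 noises the image completely (like text-to-image),
    smaller values keep more of the input image and run fewer denoising steps.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return t_start, num_inference_steps - t_start


print(img2img_schedule(50, 1.0))  # (0, 50)
print(img2img_schedule(50, 0.5))  # (25, 25)
```

With strength 0.5 and 50 scheduled steps, the first 25 steps are skipped and only the remaining 25 are run, so the result stays closer to the input image.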

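Example 1:
# Import torch and the diffusers library
import torch
from diffusers import StableDiffusionPipeline

# Initialize the SD model and load the pretrained weights
pipe = StableDiffusionPipeline.from_pretrained("/path/to/stable-diffusion-v1-5")

# If the GPU has less than 10 GB of memory, load the model in float16 precision instead
# (the older revision="fp16" argument is deprecated; torch_dtype alone selects half precision)
# pipe = StableDiffusionPipeline.from_pretrained("/path/to/stable-diffusion-v1-5", torch_dtype=torch.float16)

# Use the GPU for acceleration
pipe.to("cuda")

# Now the pipeline can be run
prompt = "a photograph of an astronaut riding a horse"

image = pipe(prompt).images[0]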




Example 2:
# Load the text-to-image pipeline
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", # one of the available versions
    torch_dtype=torch.float16).to("cuda")
# Input text, also called the prompt
prompts = [
    "a photograph of an astronaut riding a horse",
    "A cute otter in a rainbow whirlpool holding shells, watercolor",
    "An avocado armchair",
    "A white dog wearing sunglasses"]
generator = torch.Generator("cuda").manual_seed(42) # fix the random seed for reproducibility
# Run inference
images = pipe(
    prompts,
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=7.5,
    negative_prompt=None,
    num_images_per_prompt=1,
    generator=generator).images
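
The `guidance_scale=7.5` argument above refers to classifier-free guidance: at every denoising step the U-Net predicts the noise twice, once for the prompt and once for the empty (or negative) prompt, and the two predictions are combined. A minimal sketch of that combination on toy numbers, with `apply_guidance` as a hypothetical helper:

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale=7.5):
    """Classifier-free guidance: push the prediction away from the unconditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_text)]


noise_uncond = [0.0, 1.0, -2.0]  # toy prediction for the empty / negative prompt
noise_text = [1.0, 1.0, 0.0]     # toy prediction for the actual prompt

guided = apply_guidance(noise_uncond, noise_text)
print(guided)  # [7.5, 1.0, 13.0]
```

A scale of 1.0 returns the prompt-conditioned prediction unchanged; larger values follow the prompt more strongly, typically at some cost in diversity.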

Besides loading the SD weights as a whole, the weights of the individual SD components can also be loaded separately:

from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, LMSDiscreteScheduler

# Load the VAE on its own
vae = AutoencoderKL.from_pretrained("/path/to/stable-diffusion-v1-5", subfolder="vae")

# Load the CLIP text encoder and tokenizer on their own
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Load the U-Net on its own
unet = UNet2DConditionModel.from_pretrained("/path/to/stable-diffusion-v1-5", subfolder="unet")

# Create the scheduler on its own
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)


# This part was written by GPT and has not been verified by running it.
# It reuses vae, tokenizer, text_encoder, unet and scheduler loaded in the block above.
import torch
import numpy as np
from PIL import Image

# Move the models to the GPU if available and switch them to evaluation mode
device = "cuda" if torch.cuda.is_available() else "cpu"
vae = vae.to(device).eval()
text_encoder = text_encoder.to(device).eval()
unet = unet.to(device).eval()

# Text input
prompt = "A fantasy landscape with mountains and a river"

# Tokenize the prompt, padding/truncating to the model's maximum length (77 tokens)
inputs = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
input_ids = inputs["input_ids"].to(device)

# Generate the text embeddings with the text encoder
with torch.no_grad():
    text_embeddings = text_encoder(input_ids)[0]

# Initialize the scheduler
scheduler.set_timesteps(50)

# Generate the initial noise (a 64x64 latent decodes to a 512x512 image), scaled as the scheduler expects
latents = torch.randn((1, unet.config.in_channels, 64, 64), device=device)
latents = latents * scheduler.init_noise_sigma

# Denoise step by step (this simplified loop does not use classifier-free guidance)
for t in scheduler.timesteps:
    # Scale the model input as the scheduler requires, then predict the noise residual
    latent_model_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # Let the scheduler compute the latents for the previous (less noisy) timestep
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# Undo the latent scaling and decode the latents into an image
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample

# Convert the image to PIL format
image = (image / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()
image = (image * 255).round().astype(np.uint8)
image = Image.fromarray(image[0])

# Show the image
image.show()

6.Reference for SD model acceleration methods:

深入浅出完整解析Stable Diffusion（SD）核心基础知识 - 知乎 (zhihu.com)

Section 8