【Hugging Face】How to Use Hugging Face Diffusers

Preface

In the previous post we looked at how to use Transformers, one of Hugging Face's core libraries. Today we continue by exploring another core library, Diffusers. If you are interested in the earlier posts in this series, you can find them in the back catalog.

Overview

Diffusers is Hugging Face's open-source, one-stop toolbox for diffusion models. It wraps state-of-the-art diffusion papers and weights into simple, composable APIs, so with just a few lines of code you can do text-to-image, image-to-image, audio generation, video generation, 3D generation, molecule generation, data augmentation, and more; it also supports training and fine-tuning your own models.

Official Diffusers documentation: huggingface.co/docs/diffus...

Core Diffusers APIs

  • Pipelines: high-level, task-oriented classes designed for easy deployment; they let you quickly generate samples with popular pretrained diffusion models (a loaded pipeline exposes the pieces below as attributes, as sketched after this list)
  • Models: the network architectures, such as a UNet, needed when training new diffusion models
  • Schedulers: the techniques used during inference to generate images from noise, and also to produce the noised images used during training
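
As a quick orientation, the sketch below loads a pipeline and inspects the model and scheduler it is built from. It is a minimal example assuming the stable-diffusion-v1-5/stable-diffusion-v1-5 checkpoint used later in this post; any Stable Diffusion checkpoint exposes the same attributes.

python
from diffusers import DiffusionPipeline

# Load a pretrained pipeline; the checkpoint choice here is just an example.
pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True
)

# The pipeline is a thin container around its components.
print(type(pipeline).__name__)            # resolves to StableDiffusionPipeline
print(type(pipeline.unet).__name__)       # the model, UNet2DConditionModel
print(type(pipeline.scheduler).__name__)  # the scheduler, e.g. PNDMScheduler
print(type(pipeline.vae).__name__)        # the VAE, AutoencoderKL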

Installation

Install PyTorch

CPU-only version:

bash
$ pip install torch torchvision torchaudio

NVIDIA GPU (CUDA) version; more installation options are listed on the PyTorch website: pytorch.org/

bash
$ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

Install Diffusers

bash
# Accelerate speeds up model loading, for both inference and training
# Transformers is required to run the most popular diffusion models (such as Stable Diffusion)
$ uv add diffusers accelerate transformers
# or
$ pip install diffusers accelerate transformers

Basic Usage

The figure here (AI-generated) gives an overview of the Diffusers text-to-image workflow and helps in understanding the overall process.

Pipelines

In Diffusers, a Pipeline is the assembly line that glues together the model, the scheduler, and the processing components. Let's start with the default way to use a pipeline, taking the sd-dreambooth-library/disco-diffusion-style text-to-image model as an example. Create a new Colab code cell:

python
from diffusers import DiffusionPipeline
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline; this downloads the pretrained model on first use
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
pipeline.to(device)
# Prompt: English works best; Chinese prompts give noticeably worse results
prompt = "A cyberpunk-style building"
# Generate the image
image = pipeline(
    prompt,
    num_inference_steps=50, # number of inference steps
    guidance_scale=7.5,     # guidance scale
).images[0]
# Display the image
display(image)

Run the cell; the result looks like this:

Diffusers ships with many more pipeline types. Below are some of the image-generation pipelines; if you are interested, the docs cover many more (an image-to-image sketch follows this list).

  • Text-To-Image: text-to-image, StableDiffusionPipeline
  • Image-To-Image: image-to-image, StableDiffusionImg2ImgPipeline
  • In-Painting: masked inpainting, StableDiffusionInpaintPipeline
  • Upscale Image: super-resolution (4x upscaling), StableDiffusionUpscalePipeline
  • Pix-To-Pix: instruction-based image editing, StableDiffusionInstructPix2PixPipeline
  • Depth-To-Image: depth-guided generation, StableDiffusionDepth2ImgPipeline
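
As an illustration of one of these specialized pipelines, here is a minimal image-to-image sketch. It assumes the SD 1.5 base checkpoint and reuses an init-image URL from the Hugging Face documentation dataset that also appears later in this post.

python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/autopipeline-text2img.png"
)
# strength controls how far the result may drift from the init image (0 = keep, 1 = ignore)
image = pipe("turn it into a watercolor painting", image=init_image, strength=0.6).images[0]
image.save("img2img.png")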

DiffusionPipeline

When you instantiate a pipeline with DiffusionPipeline, the model weights are downloaded automatically if they are not already cached.

Diffusers contains many pipeline classes; DiffusionPipeline is the simplest way to run inference with a pretrained diffusion system. It is an end-to-end bundle of a model and a scheduler, and it can be used directly for many tasks.

A pipeline is loaded with the from_pretrained() method:

python
from diffusers import DiffusionPipeline
# Load the model; use_safetensors selects the safe, code-free weight format and is recommended by default
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)
print(pipeline)

DiffusionPipeline downloads and caches all of the modeling, tokenization, and scheduling components. Printing the pipeline shows that the Stable Diffusion pipeline is composed of components such as UNet2DConditionModel and PNDMScheduler; DiffusionPipeline is the generic entry point, and under the hood it resolves to StableDiffusionPipeline.

Generating an image with the pipeline:

python
image = pipeline("An image of a squirrel in Picasso style").images[0]
display(image)

Run the cell; the generated image looks like this:

GPU Acceleration

A pipeline can be moved to the GPU just like any other PyTorch module:

python
import torch
from diffusers import DiffusionPipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)
# Run the pipeline on the GPU (falls back to CPU if CUDA is unavailable)
pipeline.to(device)

Pipeline Parameters

Parameter 1: Precision

By default, DiffusionPipeline runs 50 inference steps at full float32 precision. To speed up generation you can lower the precision to float16 or reduce the number of inference steps.

python
import torch
from diffusers import DiffusionPipeline
model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True)

Parameter 2: Prompt

The prompt is the key description of what you want the image to contain.

python
# Prompt
prompt = "portrait photo of a old warrior chief"
image = pipeline(prompt).images[0]

Parameter 3: Random Seed

To make sure we can reproduce the same image and iterate on it, we use a Generator with a fixed seed.

python
import torch
generator = torch.Generator("cuda").manual_seed(0)
image = pipeline(prompt, generator=generator).images[0]

Parameter 4: Inference Steps

Stable Diffusion models use PNDMScheduler by default, which typically needs ~50 inference steps, but higher-performance schedulers such as DPMSolverMultistepScheduler only need 20 or 25 steps.

python
from diffusers import DiffusionPipeline
from diffusers import DPMSolverMultistepScheduler
import torch
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True)
# Swap in the scheduler
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
image = pipeline(
    prompt,
    num_inference_steps=25, # number of inference steps
).images[0]

Parameter 5: Negative Prompt

Positive and negative prompts together control what you do and do not want in the generated image.

python
negative_prompt = "low quality, bad anatomy, deformed, blurry"
image = pipeline(
    prompt,
    negative_prompt=negative_prompt, # negative prompt
).images[0]

Parameter 6: Image Width and Height

Width and height control the size of the generated image.

python
# Image size settings
height = 512
width = 512
image = pipeline(
    prompt,
    height=height, # image height
    width=width,   # image width
).images[0]

Parameter 7: Guidance Scale

The guidance scale controls how closely the output follows the prompt: the higher the value, the more literally the prompt is followed, while values at or near 1 effectively disable classifier-free guidance, so the prompt has little influence.

  • 1.0-3.0: very weak guidance; the prompt is largely ignored and generation is mostly free-form
  • 3.0-10.0: a balance between prompt adherence and creativity; 7.0-7.5 is the usual sweet spot, and the official examples use 7.5
  • 10.0-15.0: use when the prompt must be reproduced very precisely
python
image = pipeline(
    prompt,
    guidance_scale=7.5, # guidance scale
).images[0]
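
To get a feel for the effect, the quick sweep below renders the same prompt and seed at several guidance_scale values. It is a sketch that reuses the pipeline and prompt defined above; make_image_grid comes from diffusers.utils.

python
import torch
from diffusers.utils import make_image_grid

# Keep the seed fixed for every run so that only guidance_scale changes
images = []
for gs in [1.0, 3.0, 7.5, 12.0]:
    generator = torch.Generator("cuda").manual_seed(0)
    images.append(pipeline(prompt, guidance_scale=gs, generator=generator).images[0])
make_image_grid(images, rows=1, cols=4)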

enable_attention_slicing (Saving Memory)

Keep everything else the same and just add enable_attention_slicing:

python
from diffusers import DiffusionPipeline
from diffusers import DPMSolverMultistepScheduler
import torch
from diffusers.utils import make_image_grid
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True)
pipeline.to(device)
# Enable attention slicing to reduce memory usage
pipeline.enable_attention_slicing()
# Swap in the scheduler
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
prompt = "portrait photo of a old warrior chief"
def get_inputs(batch_size=1):
    generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)]
    prompts = batch_size * [prompt]
    num_inference_steps = 20
    return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}
images = pipeline(**get_inputs(batch_size=8)).images
make_image_grid(images, 2, 4)

Eight images are generated in one batch without hitting an OOM error.

VAE

Swapping in a better VAE is a memory and speed optimization. The VAE compresses the high-resolution pixel space (512×512×3 ≈ 786k values) into a low-dimensional latent space (64×64×4 ≈ 16k values), so the diffusion model does its noising/denoising in a much smaller, faster space and only decodes back to a real image at the end.
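
The compression ratio is easy to check directly. The sketch below, which assumes the stabilityai/sd-vae-ft-mse VAE used in the next snippet, encodes a dummy 512×512 RGB tensor and prints the latent shape.

python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
x = torch.randn(1, 3, 512, 512)  # a fake image batch in pixel space
with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()
print(latents.shape)  # torch.Size([1, 4, 64, 64]) -> 8x smaller in each spatial dimension

The next snippet swaps this fine-tuned VAE into the Stable Diffusion pipeline: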

python
from diffusers import DiffusionPipeline, AutoencoderKL
from diffusers.utils import make_image_grid
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline and swap in the fine-tuned VAE
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to(device)
pipeline.vae = vae
pipeline.to(device)
prompt = "portrait photo of a old warrior chief"
def get_inputs(batch_size=1):
    generator = [torch.Generator(device).manual_seed(i) for i in range(batch_size)]
    prompts = batch_size * [prompt]
    num_inference_steps = 20
    return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}
images = pipeline(**get_inputs(batch_size=8)).images
make_image_grid(images, 2, 4)

AutoPipeline

The AutoPipeline classes are designed to simplify the variety of pipelines in Diffusers. They are generic, task-first pipelines: you pick the task (AutoPipelineForText2Image, AutoPipelineForImage2Image, or AutoPipelineForInpainting) without needing to know the specific pipeline class, and AutoPipeline automatically detects the correct pipeline class to use.

AutoPipelineForText2Image text-to-image example

python
from diffusers import AutoPipelineForText2Image
import torch
pipe_txt2img = AutoPipelineForText2Image.from_pretrained(
    "dreamlike-art/dreamlike-photoreal-2.0", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
prompt = "cinematic photo of Godzilla eating sushi with a cat in a izakaya, 35mm photograph, film, professional, 4k, highly detailed"
generator = torch.Generator(device="cpu").manual_seed(37)
image = pipe_txt2img(prompt, generator=generator).images[0]
image

AutoPipelineForImage2Image image-to-image example

python
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image
import torch
pipe_img2img = AutoPipelineForImage2Image.from_pretrained(
    "dreamlike-art/dreamlike-photoreal-2.0", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
pipe_img2img.enable_model_cpu_offload()
# remove the following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipe_img2img.enable_xformers_memory_efficient_attention()
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/autopipeline-text2img.png")
prompt = "cinematic photo of Godzilla eating burgers with a cat in a fast food restaurant, 35mm photograph, film, professional, 4k, highly detailed"
generator = torch.Generator(device="cpu").manual_seed(53)
image = pipe_img2img(prompt, image=init_image, generator=generator).images[0]
image

AutoPipelineForInpainting inpainting example

python
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image
import torch
pipeline = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/autopipeline-img2img.png")
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/autopipeline-mask.png")
prompt = "cinematic photo of a owl, 35mm photograph, film, professional, 4k, highly detailed"
generator = torch.Generator(device="cpu").manual_seed(38)
image = pipeline(prompt, image=init_image, mask_image=mask_image, generator=generator, strength=0.4).images[0]
image

Models

If you are not sure which model class to use, the AutoModel API can pick the right one automatically.

In Diffusers, generating an image with a diffusion model is the reverse of the noise-adding process; the model is the learnable network that actually predicts the noise (or compresses and reconstructs the image).

The four most commonly used kinds of models in Diffusers are:

  • UNet (conditional/unconditional): learns to map "current noisy image → noise residual" (or to predict the clean image directly); typical classes are UNet2DConditionModel and UNet2DModel, and this is the default backbone
  • VAE / AutoencoderKL: compresses the high-dimensional pixel space into a low-dimensional latent space to save compute, then decodes at inference time; typical class AutoencoderKL
  • Text encoder / CLIP: encodes the prompt into a conditioning vector for conditional diffusion models; typical class CLIPTextModel
  • ControlNet / LoRA: injects extra conditions or fine-tunes without changing the original UNet weights; typical examples are ControlNetModel and LoRA adapters (loaded with load_lora_weights)

Official image-denoising example

The official example below walks through a full denoising loop. I modified the image-display part; the rest of the code matches the official version.

python
from diffusers import UNet2DModel, DDPMScheduler
import torch
import tqdm
import PIL.Image
import numpy as np
from diffusers.utils import make_image_grid
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the model
model_id = "google/ddpm-cat-256"
model = UNet2DModel.from_pretrained(model_id)
# Move the model to the GPU
model.to(device)
# Load the scheduler
scheduler = DDPMScheduler.from_pretrained(model_id)
# Fix the random seed for reproducibility
torch.manual_seed(0)
# Create a random noise tensor with the shape the model expects
noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
# Move the noisy sample to the GPU
noisy_sample = noisy_sample.to(device)
sample = noisy_sample
# Helper function to convert tensor to PIL Image
def convert_to_pil_image(sample_tensor):
    image_processed = sample_tensor.cpu().permute(0, 2, 3, 1)
    image_processed = (image_processed + 1.0) * 127.5
    image_processed = image_processed.numpy().astype(np.uint8)
    image_pil = PIL.Image.fromarray(image_processed[0])
    return image_pil
# Run the denoising loop to generate a cat
images = []
for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)):
    # 1. predict noise residual
    with torch.no_grad():
        residual = model(sample, t).sample
    # 2. compute less noisy image and set x_t -> x_t-1
    sample = scheduler.step(residual, t, sample).prev_sample
    # 3. optionally look at image
    if (i + 1) % 50 == 0:
        pil_image = convert_to_pil_image(sample)
        images.append(pil_image) # Append the PIL image to the list
# Display the collected images after the loop
make_image_grid(images, 5, 4)

When it finishes, we get snapshots of the whole denoising process. The final image is not especially pretty, but it gives a good sense of how the picture gradually emerges.

Conditional / Unconditional Models

  • UNet2DModel: an unconditional 2D UNet that only sees the noisy image and the timestep.
  • UNet2DConditionModel: a conditional 2D UNet that additionally takes a conditioning vector (text, depth, pose, and so on); it is the core network of Stable Diffusion, ControlNet, and others.

Models are initialized with the from_pretrained() method, which also caches the weights locally, so loading the same model again is faster.

python
from diffusers import DDPMScheduler, UNet2DModel
repo_id = "google/ddpm-cat-256"
scheduler = DDPMScheduler.from_pretrained(repo_id)
model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True)
print(model.config)

If you are not sure which model class to use, AutoModel works as well:

python
from diffusers import AutoModel
repo_id = "google/ddpm-cat-256"
model = AutoModel.from_pretrained(repo_id, use_safetensors=True)
print(model.config)

The printed model config looks like this:

Model parameters (read out in the sketch after this list):

  • sample_size: the height and width of the input sample
  • in_channels: the number of input channels of the input sample
  • down_block_types and up_block_types: the types of downsampling and upsampling blocks used to build the U-Net architecture
  • block_out_channels: the output channels of the downsampling blocks, also used in reverse order as the input channels of the upsampling blocks
  • layers_per_block: the number of ResNet blocks inside each U-Net block
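
A small sketch of reading these fields directly (assuming the UNet2DModel loaded from google/ddpm-cat-256 above), which is often handier than printing the whole config:

python
cfg = model.config
print(cfg.sample_size, cfg.in_channels)              # 256, 3 for google/ddpm-cat-256
print(cfg.down_block_types)
print(cfg.up_block_types)
print(cfg.block_out_channels, cfg.layers_per_block)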

The unconditional denoising case is covered by the official example above; here it is worth walking through the conditional generation process as well.

python
from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, UniPCMultistepScheduler
from tqdm.auto import tqdm
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "CompVis/stable-diffusion-v1-4"
# vae
vae = AutoencoderKL.from_pretrained(model_name, subfolder="vae", use_safetensors=True)
# Tokenizer
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
# Text encoder
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder", use_safetensors=True)
# UNet model
unet = UNet2DConditionModel.from_pretrained(model_name, subfolder="unet", use_safetensors=True)
# Scheduler
scheduler = UniPCMultistepScheduler.from_pretrained(model_name, subfolder="scheduler")
# Move the components to the GPU to speed up inference
vae.to(device)
text_encoder.to(device)
unet.to(device)
prompt = ["a photograph of an astronaut riding a horse"]
height = 512  # default height of Stable Diffusion
width = 512  # default width of Stable Diffusion
num_inference_steps = 25  # Number of denoising steps
guidance_scale = 7.5  # Scale for classifier-free guidance
# Seed generator to create the initial latent noise
generator = torch.Generator(device).manual_seed(0) # Ensure generator is on the correct device
batch_size = len(prompt)
# Tokenize the prompt
text_input = tokenizer(
    prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
)
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(device))[0]
max_length = text_input.input_ids.shape[-1]
# Embeddings for the unconditional (empty) prompt, padded to the same length
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]
# Concatenate the unconditional and conditional embeddings into one batch to avoid two forward passes
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
print(device)
# Height and width are divided by 8 because the VAE downsamples three times (2^3 = 8)
latents = torch.randn(
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    generator=generator,
    device=device,
)
# Scale the initial noise by the scheduler's expected sigma
latents = latents * scheduler.init_noise_sigma
# Set the timesteps and run the denoising loop
scheduler.set_timesteps(num_inference_steps)
for t in tqdm(scheduler.timesteps):
    # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)
    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
    # perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample
# Decode the latents back into an image
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample
image = (image / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image

In this example, CLIPTextModel is the text encoder: it turns tokens into embedding vectors that condition the diffusion model. It consumes prompts that have been tokenized by CLIPTokenizer, which connects back to what we covered about transformers last time. This is the manual way to do conditional generation; using a Pipeline hides all of these steps. Here is the resulting image:

ControlNetModel

ControlNetModel gives much finer control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, or pose keypoints.

python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth"  # can also be a local path
controlnet = ControlNetModel.from_single_file(url)
url = "https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors"  # can also be a local path
pipeline = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=controlnet)
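
Loading is only half the story; here is a minimal usage sketch. The conditioning image path is a hypothetical placeholder; for this canny checkpoint you would pass a Canny edge map of the target scene, and you would normally move the pipeline to the GPU first.

python
import torch
from diffusers.utils import load_image

pipeline = pipeline.to("cuda")
# Hypothetical local file: a Canny edge map prepared beforehand (e.g. with OpenCV)
canny_image = load_image("./canny_edges.png")
image = pipeline(
    "a futuristic city street at night",
    image=canny_image,       # the ControlNet conditioning image
    num_inference_steps=25,
).images[0]
image.save("controlnet_out.png")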

Schedulers

If you have used the Stable Diffusion WebUI, you have probably heard the term "sampler"; schedulers are exactly the samplers of Stable Diffusion. The scheduler is a very important component: different schedulers make different trade-offs between denoising speed and quality. A pipeline's default scheduler is PNDMScheduler.
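
To see which schedulers can be swapped into a given pipeline, every scheduler exposes a compatibles list. The short sketch below assumes a pipeline loaded as in the earlier examples.

python
from diffusers import EulerDiscreteScheduler

# List the scheduler classes compatible with the current pipeline,
# then swap one in from the existing scheduler's config.
print(pipeline.scheduler.compatibles)
pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)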

Euler scheduler

python
from diffusers import DiffusionPipeline, EulerDiscreteScheduler
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline; this downloads the pretrained model on first use
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# Swap in the scheduler
pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
pipeline.to(device)
# Prompt: English works best; Chinese prompts give noticeably worse results
prompt = "A cyberpunk-style building"
# Generate the image
image = pipeline(
    prompt,
    num_inference_steps=50, # number of inference steps
    guidance_scale=7.5,     # guidance scale
).images[0]
# Display the image
display(image)

Euler Ancestral (Euler a) scheduler

python
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline; this downloads the pretrained model on first use
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# Swap in the scheduler
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config)
pipeline.to(device)
# Prompt: English works best; Chinese prompts give noticeably worse results
prompt = "A cyberpunk-style building"
# Generate the image
image = pipeline(
    prompt,
    num_inference_steps=50, # number of inference steps
    guidance_scale=7.5,     # guidance scale
).images[0]
# Display the image
display(image)

DPM++ 2M scheduler

python
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline; this downloads the pretrained model on first use
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# Swap in the scheduler
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
pipeline.to(device)
# Prompt: English works best; Chinese prompts give noticeably worse results
prompt = "A cyberpunk-style building"
# Generate the image
image = pipeline(
    prompt,
    num_inference_steps=50, # number of inference steps
    guidance_scale=7.5,     # guidance scale
).images[0]
# Display the image
display(image)

DPM++ 2M Karras scheduler (recommended)

python
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline; this downloads the pretrained model on first use
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# Swap in the scheduler, using the Karras sigma schedule
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, use_karras_sigmas=True)
pipeline.to(device)
# Prompt: English works best; Chinese prompts give noticeably worse results
prompt = "A cyberpunk-style building"
# Generate the image
image = pipeline(
    prompt,
    num_inference_steps=50, # number of inference steps
    guidance_scale=7.5,     # guidance scale
).images[0]
# Display the image
display(image)

Training a Diffusion Model

Step 1: Training configuration

Create a TrainingConfig dataclass to hold the training parameters:

python
from dataclasses import dataclass
@dataclass
class TrainingConfig:
    image_size = 128  # the generated image resolution
    train_batch_size = 16
    eval_batch_size = 16  # how many images to sample during evaluation
    num_epochs = 50
    gradient_accumulation_steps = 1
    learning_rate = 1e-4
    lr_warmup_steps = 500
    save_image_epochs = 10
    save_model_epochs = 30
    mixed_precision = "fp16"  # `no` for float32, `fp16` for automatic mixed precision
    output_dir = "ddpm-butterflies-128"  # the model name locally and on the HF Hub
    push_to_hub = True  # whether to upload the saved model to the HF Hub
    hub_model_id = "<your-username>/<my-awesome-model>"  # the name of the repository to create on the HF Hub
    hub_private_repo = None
    overwrite_output_dir = True  # overwrite the old model when re-running the notebook
    seed = 0
config = TrainingConfig()

Step 2: Load the dataset and prepare the training data

Load the training dataset:

python
from datasets import load_dataset
config.dataset_name = "huggan/smithsonian_butterflies_subset"
dataset = load_dataset(config.dataset_name, split="train")

Visualize a few images from the dataset:

python
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, 4, figsize=(16, 4))
for i, image in enumerate(dataset[:4]["image"]):
    axs[i].imshow(image)
    axs[i].set_axis_off()
fig.show()

Resize and normalize the images:

python
from torchvision import transforms
preprocess = transforms.Compose(
    [
        transforms.Resize((config.image_size, config.image_size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)

Convert the images to RGB and apply the preprocessing on the fly:

python
def transform(examples):
    images = [preprocess(image.convert("RGB")) for image in examples["image"]]
    return {"images": images}
dataset.set_transform(transform)

Wrap the dataset in a DataLoader for batching during training:

python
import torch
train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=config.train_batch_size, shuffle=True)

Step 3: Create the UNet2DModel

python
from diffusers import UNet2DModel
model = UNet2DModel(
    sample_size=config.image_size,  # the target image resolution
    in_channels=3,  # the number of input channels, 3 for RGB images
    out_channels=3,  # the number of output channels
    layers_per_block=2,  # how many ResNet layers to use per UNet block
    block_out_channels=(128, 128, 256, 256, 512, 512),  # the number of output channels for each UNet block
    down_block_types=(
        "DownBlock2D",  # a regular ResNet downsampling block
        "DownBlock2D",
        "DownBlock2D",
        "DownBlock2D",
        "AttnDownBlock2D",  # a ResNet downsampling block with spatial self-attention
        "DownBlock2D",
    ),
    up_block_types=(
        "UpBlock2D",  # a regular ResNet upsampling block
        "AttnUpBlock2D",  # a ResNet upsampling block with spatial self-attention
        "UpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
    ),
)
sample_image = dataset[0]["images"].unsqueeze(0)
print("Input shape:", sample_image.shape)
print("Output shape:", model(sample_image, timestep=0).sample.shape)

Step 4: Create a scheduler

python
import torch
from PIL import Image
from diffusers import DDPMScheduler
# Noise scheduler
noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
noise = torch.randn(sample_image.shape)
timesteps = torch.LongTensor([50])
# Add noise to the image with the scheduler (the forward diffusion process)
noisy_image = noise_scheduler.add_noise(sample_image, noise, timesteps)
# Visualize the noisy image
Image.fromarray(((noisy_image.permute(0, 2, 3, 1) + 1.0) * 127.5).type(torch.uint8).numpy()[0])

Compute the training loss (the model is trained to predict the added noise):

python
import torch.nn.functional as F
noise_pred = model(noisy_image, timesteps).sample
loss = F.mse_loss(noise_pred, noise)

Step 5: Train the model

Create the optimizer and learning-rate scheduler:

python
from diffusers.optimization import get_cosine_schedule_with_warmup
optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=config.lr_warmup_steps,
    num_training_steps=(len(train_dataloader) * config.num_epochs),
)

Define an evaluation helper that samples some images during training:

python
from diffusers import DDPMPipeline
from diffusers.utils import make_image_grid
import os
def evaluate(config, epoch, pipeline):
    # Sample some images from random noise (this is the backward diffusion process).
    # The default pipeline output type is `List[PIL.Image]`
    images = pipeline(
        batch_size=config.eval_batch_size,
        generator=torch.Generator(device='cpu').manual_seed(config.seed), # Use a separate torch generator to avoid rewinding the random state of the main training loop
    ).images
    # Make a grid out of the images
    image_grid = make_image_grid(images, rows=4, cols=4)
    # Save the images
    test_dir = os.path.join(config.output_dir, "samples")
    os.makedirs(test_dir, exist_ok=True)
    image_grid.save(f"{test_dir}/{epoch:04d}.png")

Then write the training loop with Accelerate:

python
from accelerate import Accelerator
from huggingface_hub import create_repo, upload_folder
from tqdm.auto import tqdm
from pathlib import Path
import os
def train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler):
    # Initialize accelerator and tensorboard logging
    accelerator = Accelerator(
        mixed_precision=config.mixed_precision,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        log_with="tensorboard",
        project_dir=os.path.join(config.output_dir, "logs"),
    )
    if accelerator.is_main_process:
        if config.output_dir is not None:
            os.makedirs(config.output_dir, exist_ok=True)
        if config.push_to_hub:
            repo_id = create_repo(
                repo_id=config.hub_model_id or Path(config.output_dir).name, exist_ok=True
            ).repo_id
        accelerator.init_trackers("train_example")
    # Prepare everything
    # There is no specific order to remember, you just need to unpack the
    # objects in the same order you gave them to the prepare method.
    model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, lr_scheduler
    )
    global_step = 0
    # Now you train the model
    for epoch in range(config.num_epochs):
        progress_bar = tqdm(total=len(train_dataloader), disable=not accelerator.is_local_main_process)
        progress_bar.set_description(f"Epoch {epoch}")
        for step, batch in enumerate(train_dataloader):
            clean_images = batch["images"]
            # Sample noise to add to the images
            noise = torch.randn(clean_images.shape, device=clean_images.device)
            bs = clean_images.shape[0]
            # Sample a random timestep for each image
            timesteps = torch.randint(
                0, noise_scheduler.config.num_train_timesteps, (bs,), device=clean_images.device,
                dtype=torch.int64
            )
            # Add noise to the clean images according to the noise magnitude at each timestep
            # (this is the forward diffusion process)
            noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)
            with accelerator.accumulate(model):
                # Predict the noise residual
                noise_pred = model(noisy_images, timesteps, return_dict=False)[0]
                loss = F.mse_loss(noise_pred, noise)
                accelerator.backward(loss)
                if accelerator.sync_gradients:
                    accelerator.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
            progress_bar.update(1)
            logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0], "step": global_step}
            progress_bar.set_postfix(**logs)
            accelerator.log(logs, step=global_step)
            global_step += 1
        # After each epoch you optionally sample some demo images with evaluate() and save the model
        if accelerator.is_main_process:
            pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)
            if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
                evaluate(config, epoch, pipeline)
            if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
                if config.push_to_hub:
                    upload_folder(
                        repo_id=repo_id,
                        folder_path=config.output_dir,
                        commit_message=f"Epoch {epoch}",
                        ignore_patterns=["step_*", "epoch_*"],
                    )
                else:
                    pipeline.save_pretrained(config.output_dir)

Launch training:

python
from accelerate import notebook_launcher
args = (config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler)
notebook_launcher(train_loop, args, num_processes=1)

Step 6: Check the results

python
import glob
sample_images = sorted(glob.glob(f"{config.output_dir}/samples/*.png"))
Image.open(sample_images[-1])
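
Once training has finished and the pipeline has been saved locally (for example with push_to_hub=False, so save_pretrained() writes to config.output_dir), you can reload it and sample fresh images; a small sketch:

python
import torch
from diffusers import DDPMPipeline

# Load the trained pipeline from the local output directory
pipeline = DDPMPipeline.from_pretrained(config.output_dir).to("cuda")
images = pipeline(batch_size=4, generator=torch.Generator(device="cpu").manual_seed(0)).images
images[0]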

Training the full 50 epochs takes quite a while, so for testing I only ran 5 epochs; here is the result after 5 epochs.

A friendly note

See the original post: 【Hugging Face】How to Use Hugging Face Diffusers

This article is mirrored from my WeChat Official Account "程序员小溪". This is just a mirror; for timely updates please follow the account, where I share my learning notes from time to time.
