【Hugging Face】How to Use Hugging Face Diffusers

Preface

In the previous post we looked at Transformers, one of Hugging Face's core libraries. Today we continue with another core library, Diffusers. Readers interested in earlier posts can find them in the back catalog.

Overview

Diffusers is Hugging Face's open-source, one-stop toolbox for diffusion models. It wraps state-of-the-art diffusion papers and weights into simple, composable APIs, so a few lines of code are enough for text-to-image, image-to-image, audio, video, 3D, and molecule generation, data augmentation, and more. It also supports training and fine-tuning your own models.

Official Diffusers documentation: huggingface.co/docs/diffus...

Core Diffusers APIs

  • Pipelines: high-level, task-oriented classes designed for easy deployment, letting you quickly generate samples from popular pretrained diffusion models
  • Models: the network architectures used when training new diffusion models, such as the UNet
  • Schedulers: the techniques used to generate images from noise during inference, and to add noise to images during training

Installation

Installing PyTorch

CPU-only version

shell
$ pip install torch torchvision torchaudio

To install the NVIDIA GPU version (see the PyTorch website for more installation options: pytorch.org/):

shell
$ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

Installing Diffusers

shell
# Accelerate speeds up model loading, for both inference and training
# Transformers is required to run the most popular diffusion models such as Stable Diffusion
$ uv add diffusers accelerate transformers
# or
$ pip install diffusers accelerate transformers

Basic Usage

The following is an AI-generated overview of the Diffusers text-to-image workflow; it helps illustrate the overall process.

Pipelines

In Diffusers, a Pipeline is the assembly line that glues together models, schedulers, and processing components. Let's start with the default way of using a pipeline, taking the sd-dreambooth-library/disco-diffusion-style text-to-image model as an example. Create a new Colab code cell:

python
from diffusers import DiffusionPipeline
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline; this downloads the pretrained model on first use
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
pipeline.to(device)
# Prompt: English works best, Chinese prompts tend to give poor results
prompt = "A cyberpunk-style building"
# Generate the image
image = pipeline(
    prompt, 
    num_inference_steps=50, # number of denoising steps
    guidance_scale=7.5, # guidance scale
).images[0]
# Display the image
display(image)

Click run; the result looks like this:

Diffusers ships many more pipeline types. Below are some of the image-generation pipelines (a minimal image-to-image sketch follows the list); interested readers can explore further:

  • Text-To-Image: text-to-image generation, StableDiffusionPipeline
  • Image-To-Image: image-to-image generation, StableDiffusionImg2ImgPipeline
  • In-Painting: masked repainting, StableDiffusionInpaintPipeline
  • Upscale Image: super-resolution (4x upscaling), StableDiffusionUpscalePipeline
  • Pix-To-Pix: instruction-based image editing, StableDiffusionInstructPix2PixPipeline
  • Depth-To-Image: depth-guided generation, StableDiffusionDepth2ImgPipeline
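
For example, here is a minimal image-to-image sketch. It is only a sketch under assumptions: it reuses the stable-diffusion-v1-5/stable-diffusion-v1-5 checkpoint and the sample image URL that appear later in this article, and the strength value (how strongly the input image is repainted) is just a reasonable starting point.

python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

device = "cuda" if torch.cuda.is_available() else "cpu"
pipeline = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True
).to(device)
# Sample input image (same one used in the AutoPipeline section below); replace with your own
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/autopipeline-text2img.png").resize((512, 512))
image = pipeline(
    "A cyberpunk-style building",
    image=init_image,
    strength=0.75,      # 0-1: how much of the input image is repainted
    guidance_scale=7.5,
).images[0]
display(image)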

DiffusionPipeline

When you instantiate a model with DiffusionPipeline, the weights are downloaded automatically if they are not already cached.

Diffusers contains many pipeline classes, and DiffusionPipeline is the simplest way to run inference with a pretrained diffusion system. It is an end-to-end system bundling a model and a scheduler, and many tasks can be handled with DiffusionPipeline directly.

A pipeline loads its model with the from_pretrained() method:

python
from diffusers import DiffusionPipeline
# Load the model; use_safetensors loads the safe, code-free safetensors format and is a sensible default
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)
print(pipeline)

DiffusionPipeline downloads and caches all the modeling, tokenization, and scheduling components. Printing the pipeline shows that a Stable Diffusion pipeline is composed of components such as UNet2DConditionModel and PNDMScheduler; DiffusionPipeline is a generic entry point, and under the hood the object is actually a StableDiffusionPipeline.
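
As a quick check, you can poke at the loaded components directly; this is a small sketch based on the attributes shown in the printout above (pipeline.components is the dictionary Diffusers uses to share components between pipelines):

python
# Inspect the components the pipeline was assembled from
print(type(pipeline))                      # StableDiffusionPipeline
print(pipeline.unet.__class__.__name__)    # UNet2DConditionModel
print(pipeline.scheduler)                  # PNDMScheduler and its config
print(list(pipeline.components.keys()))    # vae, text_encoder, tokenizer, unet, scheduler, ...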

Generating an image with the pipeline

python
image = pipeline("An image of a squirrel in Picasso style").images[0]
display(image)

Click run; the generated image looks like this:

GPU Acceleration

The pipeline can be moved to a GPU just like any other PyTorch module:

python
import torch
from diffusers import DiffusionPipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)
# Run the pipeline on the GPU when available
pipeline.to(device)

Pipeline Parameters

Parameter 1: precision

By default, DiffusionPipeline runs 50 inference steps in full float32 precision. To speed up generation you can lower the precision to float16 or reduce the number of inference steps:

python
import torch
from diffusers import DiffusionPipeline
model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True)

Parameter 2: prompt

The prompt is the key description of what you want to generate:

python
# Prompt
prompt = "portrait photo of a old warrior chief"
image = pipeline(prompt).images[0]

Parameter 3: random seed

To make sure we can reproduce the same image and keep improving it, we use a Generator with a fixed seed:

python
import torch
generator = torch.Generator("cuda").manual_seed(0)
image = pipeline(prompt, generator=generator).images[0]

Parameter 4: number of inference steps

The Stable Diffusion model uses PNDMScheduler by default, which typically needs around 50 inference steps, whereas better-performing schedulers such as DPMSolverMultistepScheduler only need 20 or 25 steps.

python
from diffusers import DiffusionPipeline
from diffusers import DPMSolverMultistepScheduler
import torch
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True)
# Swap in the scheduler
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
prompt = "portrait photo of a old warrior chief"
image = pipeline(
    prompt, 
    num_inference_steps=25, # number of inference steps
).images[0]

Parameter 5: negative prompt

Positive and negative prompts control what you do and do not want in the image:

python
negative_prompt = "low quality, bad anatomy, deformed, blurry"
image = pipeline(
    prompt,
    negative_prompt=negative_prompt, # negative prompt
).images[0]

Parameter 6: image width and height

Width and height control the size of the generated image:

python
# Set the output image size
height = 512
width = 512
image = pipeline(
    prompt,
    height=height, # image height
    width=width,   # image width
).images[0]

Parameter 7: guidance scale

The guidance scale controls how strongly the output follows the prompt: the larger the value, the more closely the image matches the prompt, and at values close to 1 the prompt is effectively ignored.

  • 1.0-3.0: very weak guidance; near 1.0 the prompt is essentially ignored and generation is almost free-form
  • 3.0-10.0: a balance between prompt adherence and creativity; 7.0-7.5 is the usual sweet spot and the official examples use 7.5
  • 10.0-15.0: when the output must follow the prompt very closely (may reduce diversity and image quality)
python
image = pipeline(
    prompt, 
    guidance_scale=7.5, # guidance scale
).images[0]

enable_attention_slicing (saving memory)

Keep everything else the same and just add enable_attention_slicing:

python
from diffusers import DiffusionPipeline
from diffusers import DPMSolverMultistepScheduler
import torch
from diffusers.utils import make_image_grid
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True)
pipeline.to(device)
# Enable attention slicing to reduce memory usage
pipeline.enable_attention_slicing() 
# Swap in the scheduler
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
prompt = "portrait photo of a old warrior chief"
def get_inputs(batch_size=1):
    generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)]
    prompts = batch_size * [prompt]
    num_inference_steps = 20
    return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}
images = pipeline(**get_inputs(batch_size=8)).images
make_image_grid(images, 2, 4)

Eight images are generated in one batch without hitting an out-of-memory (OOM) error.

VAE

The VAE is what makes latent diffusion memory-efficient and fast: it compresses the high-resolution pixel space (512×512×3 ≈ 786k values) into a low-dimensional latent space (64×64×4 ≈ 16k values), so the diffusion model adds and removes noise in a much smaller, faster space, and the result is decoded back into a real image at the end. Swapping in a fine-tuned VAE such as stabilityai/sd-vae-ft-mse can also improve the quality of the decoded images.

python
from diffusers import DiffusionPipeline, AutoencoderKL
from diffusers.utils import make_image_grid
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to(device)
pipeline.vae = vae
pipeline.to(device)
prompt = "portrait photo of a old warrior chief"
def get_inputs(batch_size=1):
    generator = [torch.Generator(device).manual_seed(i) for i in range(batch_size)]
    prompts = batch_size * [prompt]
    num_inference_steps = 20
    return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}
images = pipeline(**get_inputs(batch_size=8)).images
make_image_grid(images, 2, 4)

AutoPipeline

The AutoPipeline classes are designed to simplify the many pipelines in Diffusers. They are generic, task-first pipelines: you focus on the task (AutoPipelineForText2Image, AutoPipelineForImage2Image, AutoPipelineForInpainting) without needing to know the specific pipeline class, and AutoPipeline automatically detects the correct one to use.

AutoPipelineForText2Image: text-to-image example

python
from diffusers import AutoPipelineForText2Image
import torch
pipe_txt2img = AutoPipelineForText2Image.from_pretrained(
    "dreamlike-art/dreamlike-photoreal-2.0", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
prompt = "cinematic photo of Godzilla eating sushi with a cat in a izakaya, 35mm photograph, film, professional, 4k, highly detailed"
generator = torch.Generator(device="cpu").manual_seed(37)
image = pipe_txt2img(prompt, generator=generator).images[0]
image

AutoPipelineForImage2Image: image-to-image example

python
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image
import torch
pipe_img2img = AutoPipelineForImage2Image.from_pretrained(
    "dreamlike-art/dreamlike-photoreal-2.0", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
pipe_img2img.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipe_img2img.enable_xformers_memory_efficient_attention()
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/autopipeline-text2img.png")
prompt = "cinematic photo of Godzilla eating burgers with a cat in a fast food restaurant, 35mm photograph, film, professional, 4k, highly detailed"
generator = torch.Generator(device="cpu").manual_seed(53)
image = pipe_img2img(prompt, image=init_image, generator=generator).images[0]
image

AutoPipelineForInpainting: inpainting example

python
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image
import torch
pipeline = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/autopipeline-img2img.png")
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/autopipeline-mask.png")
prompt = "cinematic photo of a owl, 35mm photograph, film, professional, 4k, highly detailed"
generator = torch.Generator(device="cpu").manual_seed(38)
image = pipeline(prompt, image=init_image, mask_image=mask_image, generator=generator, strength=0.4).images[0]
image

Models

If you are not sure which model class to use, the AutoModel API can pick the right one automatically.

In Diffusers, generating an image with a diffusion model is the reverse of the noising (diffusion) process; the model is the learnable network that actually predicts the noise (or compresses and reconstructs the image).

The four most commonly used model families in Diffusers are:

  • UNet (conditional/unconditional): learns to map "current noisy image → noise residual" (or to predict the clean image directly); typical classes are UNet2DConditionModel and UNet2DModel, and this is the default backbone
  • VAE/AutoencoderKL: compresses the high-dimensional pixel space into a low-dimensional latent space to save compute, then decodes it back at inference time; typical class AutoencoderKL
  • Text Encoder/CLIP: encodes the prompt into conditioning vectors for conditional diffusion models; typical class CLIPTextModel
  • ControlNet/LoRA: injects extra conditioning or fine-tunes behavior without touching the original UNet weights; typical examples are ControlNetModel and LoRA adapters loaded with load_lora_weights() (see the sketch after this list)
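
The article does not cover LoRA in detail, but as a rough sketch, Stable Diffusion pipelines expose load_lora_weights(); the repository name below is a hypothetical placeholder, so point it at a real LoRA trained for your base model:

python
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
# "some-user/some-sd15-lora" is a placeholder repo id, not a real repository
pipeline.load_lora_weights("some-user/some-sd15-lora")
image = pipeline("A cyberpunk-style building", num_inference_steps=25).images[0]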

Official image-denoising example

Let's walk through the full denoising process using the official example. I modified the image-display part; the rest of the code follows the official version.

python
from diffusers import UNet2DModel, DDPMScheduler
import torch
import tqdm
import PIL.Image
import numpy as np
from diffusers.utils import make_image_grid
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the model
model_id = "google/ddpm-cat-256"
model = UNet2DModel.from_pretrained(model_id)
# Move the model to the GPU if available
model.to(device)
# Load the scheduler
scheduler = DDPMScheduler.from_pretrained(model_id)
# Fix the random seed for reproducibility
torch.manual_seed(0)
# Create a random noise sample with the shape the model expects
noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
# Move the noise to the same device as the model
noisy_sample = noisy_sample.to(device)
sample = noisy_sample
# Helper function to convert tensor to PIL Image
def convert_to_pil_image(sample_tensor):
    image_processed = sample_tensor.cpu().permute(0, 2, 3, 1)
    image_processed = (image_processed + 1.0) * 127.5
    image_processed = image_processed.numpy().astype(np.uint8)
    image_pil = PIL.Image.fromarray(image_processed[0])
    return image_pil
# Start denoising toward a cat image
images = []
for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)):
    # 1. predict noise residual
    with torch.no_grad():
        residual = model(sample, t).sample
    # 2. compute less noisy image and set x_t -> x_t-1
    sample = scheduler.step(residual, t, sample).prev_sample
    # 3. optionally look at image
    if (i + 1) % 50 == 0:
        pil_image = convert_to_pil_image(sample)
        images.append(pil_image) # Append the PIL image to the list
# Display the collected images after the loop
make_image_grid(images, 5, 4)

When the run finishes we get a grid showing the whole denoising process. The final result is not particularly pretty, but it gives a good sense of how the image gradually emerges.

Conditional and unconditional models

  • UNet2DModel: an unconditional 2D UNet that only sees the noisy image and the timestep.
  • UNet2DConditionModel: a conditional 2D UNet that additionally takes conditioning vectors (text, depth, pose, and so on); it is the core network behind Stable Diffusion, ControlNet, and similar models.

Models are initialized with the from_pretrained() method, which also caches the weights locally so that the next load is faster.

python
from diffusers import DDPMScheduler, UNet2DModel
repo_id = "google/ddpm-cat-256"
scheduler = DDPMScheduler.from_pretrained(repo_id)
model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True)
print(model.config)

If you are not sure which model class to use, AutoModel works as well:

python
from diffusers import AutoModel
repo_id = "google/ddpm-cat-256"
model = AutoModel.from_pretrained(repo_id, use_safetensors=True)
print(model.config)

Printing the model shows its configuration.

Key model configuration parameters (the short snippet after this list shows how to read them):

  • sample_size: the height and width of the input sample
  • in_channels: the number of input channels of the input sample
  • down_block_types and up_block_types: the types of downsampling and upsampling blocks used to build the U-Net architecture
  • block_out_channels: the output channel counts of the downsampling blocks, also used in reverse order as the input channel counts of the upsampling blocks
  • layers_per_block: the number of ResNet blocks in each U-Net block
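
Reusing the google/ddpm-cat-256 model loaded above, these fields can be read straight off model.config (the actual values depend on the checkpoint):

python
# Read individual configuration fields (values depend on the checkpoint)
print(model.config.sample_size)         # input/output resolution
print(model.config.in_channels)         # number of input channels
print(model.config.down_block_types)    # downsampling block types
print(model.config.up_block_types)      # upsampling block types
print(model.config.block_out_channels)  # output channels per block
print(model.config.layers_per_block)    # ResNet layers per block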

The unconditional denoising example is the official sample shown above; here it is worth understanding the conditional (text-guided) generation process as well.

python
from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, UniPCMultistepScheduler
from tqdm.auto import tqdm
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "CompVis/stable-diffusion-v1-4"
# VAE
vae = AutoencoderKL.from_pretrained(model_name, subfolder="vae", use_safetensors=True)
# Tokenizer
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
# Text encoder
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder", use_safetensors=True)
# UNet
unet = UNet2DConditionModel.from_pretrained(model_name, subfolder="unet", use_safetensors=True)
# Scheduler
scheduler = UniPCMultistepScheduler.from_pretrained(model_name, subfolder="scheduler")
# Move the components to the device for faster inference
vae.to(device)
text_encoder.to(device)
unet.to(device)
prompt = ["a photograph of an astronaut riding a horse"]
height = 512  # default height of Stable Diffusion
width = 512  # default width of Stable Diffusion
num_inference_steps = 25  # Number of denoising steps
guidance_scale = 7.5  # Scale for classifier-free guidance
# Seed generator to create the initial latent noise
generator = torch.Generator(device).manual_seed(0) # Ensure generator is on the correct device
batch_size = len(prompt)
# Tokenize the prompt
text_input = tokenizer(
    prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
)
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(device))[0]
max_length = text_input.input_ids.shape[-1]
# Embeddings for the unconditional (empty) prompt, padded to the same length
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]
# Concatenate the unconditional and conditional embeddings into one batch to avoid two forward passes
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
# Height and width are divided by 8 because the VAE downsamples by a factor of 2 three times (2^3 = 8)
latents = torch.randn(
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    generator=generator,
    device=device,
)
# Scale the initial noise by the scheduler's initial noise sigma
latents = latents * scheduler.init_noise_sigma
# Set the timesteps and run the denoising loop
scheduler.set_timesteps(num_inference_steps)
for t in tqdm(scheduler.timesteps):
    # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)
    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
    # perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample
# Scale the latents back by the VAE scaling factor and decode them into an image
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample
image = (image / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image

In this example, CLIPTextModel is the text encoder: it encodes the tokens into vectors that condition the diffusion model's generation. CLIPTextModel takes prompts that have been tokenized by CLIPTokenizer, which ties back to the Transformers library we covered last time. This is conditional image generation implemented by hand; a Pipeline simplifies the whole flow enormously. Finally, take a look at the generated result.
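
As an additional sanity check, reusing the tokenizer and text_encoder loaded above, you can inspect the shape of the conditioning tensor the UNet receives; for the SD v1.x CLIP text encoder it should be one 768-dimensional vector per token, padded to 77 tokens:

python
# Tokenize a prompt and look at the conditioning tensor shape
sample_input = tokenizer(
    ["a photograph of an astronaut riding a horse"],
    padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt",
)
with torch.no_grad():
    sample_embeddings = text_encoder(sample_input.input_ids.to(device))[0]
print(sample_embeddings.shape)  # expected: torch.Size([1, 77, 768])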

ControlNetModel

ControlNetModel gives you finer control over text-to-image generation by conditioning the model on extra inputs such as edge maps, depth maps, segmentation maps, and pose keypoints.

python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth"  # can also be a local path
controlnet = ControlNetModel.from_single_file(url)
url = "https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors"  # can also be a local path
pipeline = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=controlnet)
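
A rough usage sketch: ControlNet pipelines take the conditioning image through the image argument. The snippet below builds a Canny edge map with OpenCV (an extra dependency not used elsewhere in this article) from the sample image already used above, and the prompt is just an example, so treat this as an assumption-laden sketch rather than the official recipe:

python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers.utils import load_image

device = "cuda" if torch.cuda.is_available() else "cpu"
pipeline.to(device)
# Reuse the sample image from the AutoPipeline section and turn it into a Canny edge map
source = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/autopipeline-text2img.png")
edges = cv2.Canny(np.array(source), 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))
image = pipeline(
    "a futuristic city, highly detailed",   # example prompt
    image=canny_image,                      # the ControlNet conditioning image
    num_inference_steps=25,
).images[0]
display(image)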

Schedulers

If you have used the Stable Diffusion web UI, you have probably heard the term "sampler"; a scheduler in Diffusers is exactly that sampler. Schedulers are an important part of the system: different schedulers trade off denoising speed against quality in different ways, and the pipeline's default scheduler is PNDMScheduler.
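
Before swapping schedulers, you can list the scheduler classes that are compatible with the currently loaded pipeline via the scheduler's compatibles attribute:

python
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# Scheduler classes that can be swapped in with from_config
print(pipeline.scheduler.compatibles)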

Euler scheduler

python
from diffusers import DiffusionPipeline, EulerDiscreteScheduler
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline; this downloads the pretrained model on first use
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# Swap in the scheduler
pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
pipeline.to(device)
# Prompt: English works best, Chinese prompts tend to give poor results
prompt = "A cyberpunk-style building"
# Generate the image
image = pipeline(
    prompt, 
    num_inference_steps=50, # number of denoising steps
    guidance_scale=7.5, # guidance scale
).images[0]
# Display the image
display(image)

Euler Ancestral (Euler a) scheduler

python
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline; this downloads the pretrained model on first use
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# Swap in the scheduler
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config)
pipeline.to(device)
# Prompt: English works best, Chinese prompts tend to give poor results
prompt = "A cyberpunk-style building"
# Generate the image
image = pipeline(
    prompt, 
    num_inference_steps=50, # number of denoising steps
    guidance_scale=7.5, # guidance scale
).images[0]
# Display the image
display(image)

DPM++ 2M scheduler

python
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline; this downloads the pretrained model on first use
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# Swap in the scheduler
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
pipeline.to(device)
# Prompt: English works best, Chinese prompts tend to give poor results
prompt = "A cyberpunk-style building"
# Generate the image
image = pipeline(
    prompt, 
    num_inference_steps=50, # number of denoising steps
    guidance_scale=7.5, # guidance scale
).images[0]
# Display the image
display(image)

DPM++ 2M Karras scheduler (recommended)

python
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pipeline; this downloads the pretrained model on first use
pipeline = DiffusionPipeline.from_pretrained("sd-dreambooth-library/disco-diffusion-style")
# Swap in the scheduler with Karras sigmas enabled
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, use_karras_sigmas=True)
pipeline.to(device)
# Prompt: English works best, Chinese prompts tend to give poor results
prompt = "A cyberpunk-style building"
# Generate the image
image = pipeline(
    prompt, 
    num_inference_steps=50, # number of denoising steps
    guidance_scale=7.5, # guidance scale
).images[0]
# Display the image
display(image)

Training a Diffusion Model

Step 1: training configuration

Create a TrainingConfig class to hold the training hyperparameters:

python
from dataclasses import dataclass
@dataclass
class TrainingConfig:
    image_size = 128  # the generated image resolution
    train_batch_size = 16
    eval_batch_size = 16  # how many images to sample during evaluation
    num_epochs = 50
    gradient_accumulation_steps = 1
    learning_rate = 1e-4
    lr_warmup_steps = 500
    save_image_epochs = 10
    save_model_epochs = 30
    mixed_precision = "fp16"  # `no` for float32, `fp16` for automatic mixed precision
    output_dir = "ddpm-butterflies-128"  # the model name locally and on the HF Hub
    push_to_hub = True  # whether to upload the saved model to the HF Hub
    hub_model_id = "<your-username>/<my-awesome-model>"  # the name of the repository to create on the HF Hub
    hub_private_repo = None
    overwrite_output_dir = True  # overwrite the old model when re-running the notebook
    seed = 0
config = TrainingConfig()

Step 2: load the dataset and prepare the training data

Load the training dataset:

python
from datasets import load_dataset
config.dataset_name = "huggan/smithsonian_butterflies_subset"
dataset = load_dataset(config.dataset_name, split="train")

Visualize a few images from the dataset:

python
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, 4, figsize=(16, 4))
for i, image in enumerate(dataset[:4]["image"]):
    axs[i].imshow(image)
    axs[i].set_axis_off()
fig.show()

Define preprocessing to resize the images:

python
from torchvision import transforms
preprocess = transforms.Compose(
    [
        transforms.Resize((config.image_size, config.image_size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)

Convert the images to RGB and apply the preprocessing on the fly:

python
def transform(examples):
    images = [preprocess(image.convert("RGB")) for image in examples["image"]]
    return {"images": images}
dataset.set_transform(transform)

Optionally visualize the images again to confirm the resizing, then wrap the dataset in a DataLoader for training:

python
import torch
train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=config.train_batch_size, shuffle=True)

Step 3: create a UNet2DModel

python
from diffusers import UNet2DModel
model = UNet2DModel(
    sample_size=config.image_size,  # the target image resolution
    in_channels=3,  # the number of input channels, 3 for RGB images
    out_channels=3,  # the number of output channels
    layers_per_block=2,  # how many ResNet layers to use per UNet block
    block_out_channels=(128, 128, 256, 256, 512, 512),  # the number of output channels for each UNet block
    down_block_types=(
        "DownBlock2D",  # a regular ResNet downsampling block
        "DownBlock2D",
        "DownBlock2D",
        "DownBlock2D",
        "AttnDownBlock2D",  # a ResNet downsampling block with spatial self-attention
        "DownBlock2D",
    ),
    up_block_types=(
        "UpBlock2D",  # a regular ResNet upsampling block
        "AttnUpBlock2D",  # a ResNet upsampling block with spatial self-attention
        "UpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
    ),
)
sample_image = dataset[0]["images"].unsqueeze(0)
print("Input shape:", sample_image.shape)
print("Output shape:", model(sample_image, timestep=0).sample.shape)

Step 4: create a scheduler

python
import torch
from PIL import Image
from diffusers import DDPMScheduler
# Create the DDPM noise scheduler
noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
noise = torch.randn(sample_image.shape)
timesteps = torch.LongTensor([50])
# Use the scheduler to add noise to the image (the forward diffusion process)
noisy_image = noise_scheduler.add_noise(sample_image, noise, timesteps)
# Visualize the noisy image
Image.fromarray(((noisy_image.permute(0, 2, 3, 1) + 1.0) * 127.5).type(torch.uint8).numpy()[0])

Compute the training loss: the model predicts the noise that was added, and the loss is the mean-squared error between the predicted and true noise.

python
import torch.nn.functional as F
noise_pred = model(noisy_image, timesteps).sample
loss = F.mse_loss(noise_pred, noise)

Step 5: train the model

Create the optimizer and learning-rate scheduler:

python
from diffusers.optimization import get_cosine_schedule_with_warmup
optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=config.lr_warmup_steps,
    num_training_steps=(len(train_dataloader) * config.num_epochs),
)

Create an evaluation function that samples demo images and saves them as a grid:

python
from diffusers import DDPMPipeline
from diffusers.utils import make_image_grid
import os
def evaluate(config, epoch, pipeline):
    # Sample some images from random noise (this is the backward diffusion process).
    # The default pipeline output type is `List[PIL.Image]`
    images = pipeline(
        batch_size=config.eval_batch_size,
        generator=torch.Generator(device='cpu').manual_seed(config.seed), # Use a separate torch generator to avoid rewinding the random state of the main training loop
    ).images
    # Make a grid out of the images
    image_grid = make_image_grid(images, rows=4, cols=4)
    # Save the images
    test_dir = os.path.join(config.output_dir, "samples")
    os.makedirs(test_dir, exist_ok=True)
    image_grid.save(f"{test_dir}/{epoch:04d}.png")

Then define the training loop with Accelerate:

python
from accelerate import Accelerator
from huggingface_hub import create_repo, upload_folder
from tqdm.auto import tqdm
from pathlib import Path
import os
def train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler):
    # Initialize accelerator and tensorboard logging
    accelerator = Accelerator(
        mixed_precision=config.mixed_precision,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        log_with="tensorboard",
        project_dir=os.path.join(config.output_dir, "logs"),
    )
    if accelerator.is_main_process:
        if config.output_dir is not None:
            os.makedirs(config.output_dir, exist_ok=True)
        if config.push_to_hub:
            repo_id = create_repo(
                repo_id=config.hub_model_id or Path(config.output_dir).name, exist_ok=True
            ).repo_id
        accelerator.init_trackers("train_example")
    # Prepare everything
    # There is no specific order to remember, you just need to unpack the
    # objects in the same order you gave them to the prepare method.
    model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, lr_scheduler
    )
    global_step = 0
    # Now you train the model
    for epoch in range(config.num_epochs):
        progress_bar = tqdm(total=len(train_dataloader), disable=not accelerator.is_local_main_process)
        progress_bar.set_description(f"Epoch {epoch}")
        for step, batch in enumerate(train_dataloader):
            clean_images = batch["images"]
            # Sample noise to add to the images
            noise = torch.randn(clean_images.shape, device=clean_images.device)
            bs = clean_images.shape[0]
            # Sample a random timestep for each image
            timesteps = torch.randint(
                0, noise_scheduler.config.num_train_timesteps, (bs,), device=clean_images.device,
                dtype=torch.int64
            )
            # Add noise to the clean images according to the noise magnitude at each timestep
            # (this is the forward diffusion process)
            noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)
            with accelerator.accumulate(model):
                # Predict the noise residual
                noise_pred = model(noisy_images, timesteps, return_dict=False)[0]
                loss = F.mse_loss(noise_pred, noise)
                accelerator.backward(loss)
                if accelerator.sync_gradients:
                    accelerator.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
            progress_bar.update(1)
            logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0], "step": global_step}
            progress_bar.set_postfix(**logs)
            accelerator.log(logs, step=global_step)
            global_step += 1
        # After each epoch you optionally sample some demo images with evaluate() and save the model
        if accelerator.is_main_process:
            pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)
            if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
                evaluate(config, epoch, pipeline)
            if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
                if config.push_to_hub:
                    upload_folder(
                        repo_id=repo_id,
                        folder_path=config.output_dir,
                        commit_message=f"Epoch {epoch}",
                        ignore_patterns=["step_*", "epoch_*"],
                    )
                else:
                    pipeline.save_pretrained(config.output_dir)

Launch the training:

python
from accelerate import notebook_launcher
args = (config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler)
notebook_launcher(train_loop, args, num_processes=1)

Step 6: check the results

python
import glob
sample_images = sorted(glob.glob(f"{config.output_dir}/samples/*.png"))
Image.open(sample_images[-1])

Training for 50 epochs takes quite a long time, so I only ran 5 epochs for testing; here is the result after 5 epochs.

A Friendly Note

See the original post: 【Hugging Face】How to Use Hugging Face Diffusers

This article is mirrored from my WeChat Official Account "程序员小溪". For timely updates, please follow the account, where I post my learning notes from time to time.
