StreamMultiDiffusion is a real-time, interactive, multi-text-to-image generation framework driven by user-specified region-based text prompts.

The demo video shows what this project finally makes possible: generating large-scale images under fine-grained regional prompt control. Previously this was simply impractical. When every trial takes an hour, you cannot sample repeatedly to pick the best generation, or steer the process to match your intent. This project cuts the latency from an hour down to about a minute, making the technique usable for creators!
Related Links

Project: github.com/ironjr/Stre... Paper: arxiv.org/abs/2403.09... Paper: StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control
Paper Abstract

The enormous success of diffusion models in text-to-image synthesis has made them promising candidates for the next generation of end-user applications for image generation and editing. Previous works have focused on improving the usability of diffusion models by reducing inference time, or on increasing user interactivity by allowing new, fine-grained controls such as region-based text prompts. However, we empirically find that integrating both branches of work is nontrivial, limiting the potential of diffusion models. To resolve this incompatibility, we present StreamMultiDiffusion, the first real-time region-based text-to-image generation framework. By stabilizing fast inference techniques and restructuring the model into a newly proposed multi-prompt stream batch architecture, we achieve panorama generation ×10 faster than existing solutions, and a generation speed of 1.57 FPS for region-based text-to-image synthesis on a single RTX 2080 Ti GPU. Our solution opens up a new paradigm of interactive image generation named semantic palette, where high-quality images are generated in real time from given multiple hand-drawn regions that encode prescribed semantic meanings (e.g., eagle, girl).
- Interactive image generation from scratch with fine-grained region control. In other words, you paint images with meanings.
- Prompt separation. Generating two or more objects at once is no longer plagued by unintended content mixing!
- Real-time image painting and editing. You can draw on top of any uploaded photo or artwork you like.
Main Modules

- model.StableMultiDiffusionPipeline: for single-call generation (possibly more convenient for CLI users).
- model.StreamMultiDiffusion: for streaming applications, such as the main figure of this README page. Minimal examples of possible applications are provided below; a quick orientation sketch follows this list.
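As a quick orientation, the sketch below contrasts how the two entry points are driven. Treat it as illustrative only: the argument names are taken from the full examples later on this page, and both constructors are shown with a minimal subset of arguments, assuming the remaining ones have defaults.

```python
import torch
from model import StableMultiDiffusionPipeline, StreamMultiDiffusion

device = torch.device('cuda:0')

# Single-call generation: one call returns a finished PIL image.
pipe = StableMultiDiffusionPipeline(device)
# image = pipe(prompts, negative_prompts, masks=masks, height=768, width=768)

# Streaming generation: register prompt-mask layers once, then pull frames.
# (Remaining constructor arguments assumed to have defaults; see the full example below.)
stream = StreamMultiDiffusion(device, height=768, width=512)
# stream.update_single_layer(idx=0, prompt='...', mask=mask)
# frame = stream()  # every call yields the next frame
```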
Streaming Generation Process

With multi-prompt stream batch, a modification to the original stream batch architecture, this multi-prompt text-to-image generation process can be streamed to produce images indefinitely.
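To make the scheduling idea concrete, here is a minimal toy sketch of stream batching (not the project's actual implementation): latents for consecutive frames sit at staggered denoising stages and advance together in a single batched call, so a finished image comes out of every iteration rather than every `steps` iterations.

```python
import torch

steps = 4  # denoising steps per image

# One queue entry per in-flight frame: (latent, remaining_steps).
# Entries are staggered so that exactly one frame finishes per iteration.
queue = [(torch.randn(4, 64, 64), t) for t in range(1, steps + 1)]

def denoise_step(batch):
    # Stand-in for a single batched UNet call that advances every
    # latent in the batch by one denoising step.
    return batch * 0.99

for _ in range(8):  # stream 8 frames; use `while True` to stream forever
    batch = torch.stack([latent for latent, _ in queue])
    batch = denoise_step(batch)  # ONE model call serves all stages
    queue = [(batch[i], t - 1) for i, (_, t) in enumerate(queue)]
    finished, _ = queue.pop(0)  # its step counter just reached zero
    queue.append((torch.randn(4, 64, 64), steps))  # enqueue a fresh frame
    # decoding `finished` here would yield the next streamed image
```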
Results
Code
```python
import torch
from util import seed_everything, Streamer
from model import StreamMultiDiffusion
# The following packages are imported only for loading the images.
import torchvision.transforms as T
import requests
import time
import imageio # This is not included in our requirements.txt!
from functools import reduce
from io import BytesIO
from PIL import Image
seed = 2024
device = 0
height = 768
width = 512
# Load the module.
device = torch.device(f'cuda:{device}')
smd = StreamMultiDiffusion(
    device,
    hf_key='ironjr/BlazingDriveV11m',
    sd_version='1.5',
    height=height,
    width=width,
    cfg_type='none',
    autoflush=True,
    use_tiny_vae=True,
    mask_type='continuous',
    bootstrap_steps=2,
    bootstrap_mix_steps=1.5,
    seed=seed,
)
# Load the masks.
masks = []
for i in range(1, 3):
    url = f'https://raw.githubusercontent.com/ironjr/StreamMultiDiffusion/main/assets/zeus/prompt_p{i}.png'
    response = requests.get(url)
    mask = Image.open(BytesIO(response.content)).convert('RGBA')
    mask = (T.ToTensor()(mask)[-1:] > 0.5).float()
    masks.append(mask)
# In this example, background is simply set as non-marked regions.
background = reduce(torch.logical_and, [m == 0 for m in masks])
# Register a background, prompts, and masks (this can be called multiple times).
smd.update_background(Image.new(size=(width, height), mode='RGB', color=(255, 255, 255)))
smd.update_single_layer(
    idx=0,
    prompt='a photo of Mount Olympus',
    negative_prompt='worst quality, bad quality, normal quality, cropped, framed',
    mask=background,
    mask_strength=1.0,
    mask_std=0.0,
    prompt_strength=1.0,
)
smd.update_single_layer(
    idx=1,
    prompt='1girl, looking at viewer, lifts arm, smile, happy, Greek goddess Athena',
    negative_prompt='worst quality, bad quality, normal quality, cropped, framed',
    mask=masks[0],
    mask_strength=1.0,
    mask_std=0.0,
    prompt_strength=1.0,
)
smd.update_single_layer(
    idx=2,
    prompt='a small, sitting owl',
    negative_prompt='worst quality, bad quality, normal quality, cropped, framed',
    mask=masks[1],
    mask_strength=1.0,
    mask_std=0.0,
    prompt_strength=1.0,
)
# Generate images... forever.
# while True:
#     image = smd()
#     image.save(f'{str(int(time.time() % 100000))}.png')  # This will fill up your hard drive quite quickly.
#     display(image)  # If `from IPython.display import display` is called.
#
# You can also intercept the process mid-generation by updating the background, prompts, or masks, e.g.:
# smd.update_single_layer(
#     idx=2,
#     prompt='a small, sitting owl',
#     negative_prompt='worst quality, bad quality, normal quality, cropped, framed',
#     mask=masks[1],
#     mask_strength=1.0,
#     mask_std=0.0,
#     prompt_strength=1.0,
# )
# Or make a video/gif from your generation stream (requires `imageio`)
frames = []
for _ in range(50):
    image = smd()
    frames.append(image)
imageio.mimsave('my_beautiful_creation.gif', frames, loop=0)
```
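If you want a rough throughput number on your own hardware (the paper reports 1.57 FPS for region-based generation on an RTX 2080 Ti), a simple wall-clock measurement around the streaming call is enough. This sketch reuses `smd` and the `time` import from the example above:

```python
n_frames = 30
start = time.perf_counter()
for _ in range(n_frames):
    _ = smd()  # one streamed frame per call
print(f'{n_frames / (time.perf_counter() - start):.2f} FPS')
```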
Region-Based Multi-Text-to-Image Generation
Generation of images of arbitrary size from an arbitrary number of prompt-mask pairs is supported.
Results
Code
```python
import torch
from model import StableMultiDiffusionPipeline
from util import seed_everything
# The following packages are imported only for loading the images.
import torchvision.transforms as T
import requests
from functools import reduce
from io import BytesIO
from PIL import Image
seed = 2024
device = 0
# Load the module.
seed_everything(seed)
device = torch.device(f'cuda:{device}')
smd = StableMultiDiffusionPipeline(
    device,
    hf_key='ironjr/BlazingDriveV11m',
    sd_version='1.5',
)
# Load prompts.
prompts = [
    # Background prompt.
    '1girl, 1boy, times square',
    # Foreground prompts.
    '1boy, looking at viewer, brown hair, casual shirt',
    '1girl, looking at viewer, pink hair, leather jacket',
]
negative_prompts = [
    '',
    '1girl',  # (Optional) The first foreground prompt is a boy, so we don't want a girl.
    '1boy',  # (Optional) The second foreground prompt is a girl, so we don't want a boy.
]
negative_prompt_prefix = 'worst quality, bad quality, normal quality, cropped, framed'
negative_prompts = [negative_prompt_prefix + ', ' + p for p in negative_prompts]
# Load masks.
masks = []
for i in range(1, 3):
    url = f'https://raw.githubusercontent.com/ironjr/StreamMultiDiffusion/main/assets/timessquare/timessquare_{i}.png'
    response = requests.get(url)
    mask = Image.open(BytesIO(response.content)).convert('RGBA')
    mask = (T.ToTensor()(mask)[-1:] > 0.5).float()
    masks.append(mask)
# In this example, background is simply set as non-marked regions.
background = reduce(torch.logical_and, [m == 0 for m in masks])
masks = torch.stack([background] + masks, dim=0).float()
height, width = masks.shape[-2:] # (768, 768) in this example.
# Sample an image.
image = smd(
    prompts,
    negative_prompts,
    masks=masks,
    mask_strengths=1,
    mask_stds=0,
    height=height,
    width=width,
    bootstrap_steps=2,
)
image.save('my_beautiful_creation.png')
```
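Before sampling, it can help to sanity-check that the prompt-mask pairs tile the canvas as intended. The following small sketch collapses the stacked mask tensor from the example above into a grayscale label map; it uses only packages already imported there, and overlapping regions resolve arbitrarily:

```python
# Collapse the (num_layers, 1, H, W) mask stack into one label map:
# 0 = background, higher values = foreground layers.
labels = masks.argmax(dim=0).float()
vis = (labels / (len(masks) - 1) * 255).byte()
Image.fromarray(vis.squeeze(0).numpy(), mode='L').save('mask_layout.png')
```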
Larger Region-Based Multi-Text-to-Image Generation
The code below takes roughly an hour to run with the original MultiDiffusion pipeline using a 50-step DDIM sampler; we have brought this down to about one minute.
Results
Code
```python
import torch
from model import StableMultiDiffusionPipeline
from util import seed_everything
# The following packages are imported only for loading the images.
import torchvision.transforms as T
import requests
from functools import reduce
from io import BytesIO
from PIL import Image
seed = 2024
device = 0
# Load the module.
seed_everything(seed)
device = torch.device(f'cuda:{device}')
smd = StableMultiDiffusionPipeline(device)
# Load prompts.
prompts = [
    # Background prompt.
    'clear deep blue sky',
    # Foreground prompts.
    'summer mountains',
    'the sun',
    'the moon',
    'a giant waterfall',
    'a giant waterfall',
    'clean deep blue lake',
    'a large tree',
    'a large tree',
]
negative_prompts = ['worst quality, bad quality, normal quality, cropped, framed'] * len(prompts)
# Load masks.
masks = []
for i in range(1, 9):
    url = f'https://raw.githubusercontent.com/ironjr/StreamMultiDiffusion/main/assets/irworobongdo/irworobongdo_{i}.png'
    response = requests.get(url)
    mask = Image.open(BytesIO(response.content)).convert('RGBA')
    mask = (T.ToTensor()(mask)[-1:] > 0.5).float()
    masks.append(mask)
# In this example, background is simply set as non-marked regions.
background = reduce(torch.logical_and, [m == 0 for m in masks])
masks = torch.stack([background] + masks, dim=0).float()
height, width = masks.shape[-2:] # (768, 1920) in this example.
# Sample an image.
image = smd(
    prompts,
    negative_prompts,
    masks=masks,
    mask_strengths=1,
    mask_stds=0,
    height=height,
    width=width,
    bootstrap_steps=2,
)
image.save('my_beautiful_creation.png')
```
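To check the roughly one-minute figure on your own machine, you can wrap the sampling call in a timer. A minimal sketch reusing `smd` and the arguments from the example above:

```python
import time

start = time.perf_counter()
image = smd(
    prompts,
    negative_prompts,
    masks=masks,
    mask_strengths=1,
    mask_stds=0,
    height=height,
    width=width,
    bootstrap_steps=2,
)
print(f'Sampled a {width}x{height} image in {time.perf_counter() - start:.1f} s')
```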