This post shows how to deploy Stable Diffusion cost-effectively on AWS Inferentia chips. Compared with Nvidia GPU instances of equivalent performance, AWS Inferentia can cut inference costs by up to 70%. If you would rather test a GPU instance, see "Quickly deploy the Stable Diffusion WebUI on AWS [AWS EC2 GPU instances]".
Environment
| Name | Version |
|---|---|
| Ubuntu Server | Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230817 |
| Python3 | 3.10.x |
| Stable Diffusion | pretrained_sd2_512_inference |
It used to be assumed that AI image generation required Nvidia GPU cards, but AWS Inferentia can do it as well, and for experienced AI developers it can deliver better results at a better price/performance ratio. In addition, the EC2 instance I chose has a 128 GB SSD and a public IP (for convenient remote SSH access and for serving Stable Diffusion directly to the outside world).
Installation
For reference, see AWS Neuron, PyTorch Neuron (torch-neuronx) Setup, and pretrained_sd2_512_inference.
Create an inf2 instance
Launch an Inferentia instance with a public IP. Here I chose the inf2.8xlarge instance type together with the Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230817 image, which comes with the required drivers pre-installed. Select inf2.8xlarge, enable public access, and set the disk size to 128 GB.
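If you prefer to script the launch instead of using the console, the boto3 sketch below shows an equivalent API call. The AMI ID, key pair, subnet, and security group are placeholders that you would replace with values from your own account; the console workflow above achieves the same result.

```python
# Hypothetical launch script: every ID below is a placeholder for your own account.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # pick a region with Inf2 capacity

response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",        # placeholder: Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04)
    InstanceType="inf2.8xlarge",
    KeyName="my-key-pair",                  # placeholder key pair
    MinCount=1,
    MaxCount=1,
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "AssociatePublicIpAddress": True,   # public IP for remote SSH / Jupyter access
        "SubnetId": "subnet-xxxxxxxx",      # placeholder subnet
        "Groups": ["sg-xxxxxxxx"],          # placeholder security group
    }],
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",
        "Ebs": {"VolumeSize": 128, "VolumeType": "gp3"},  # 128 GB root disk
    }],
)
print(response["Instances"][0]["InstanceId"])
```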
Install Stable Diffusion
Update the Neuron packages to the latest versions
bash
# Update OS packages
sudo apt-get update -y
# Update OS headers
sudo apt-get install linux-headers-$(uname -r) -y
# Install git
sudo apt-get install git -y
# Update Neuron driver
sudo apt-get install aws-neuronx-dkms=2.* -y
# Update Neuron Runtime
sudo apt-get install aws-neuronx-collectives=2.* -y
sudo apt-get install aws-neuronx-runtime-lib=2.* -y
# Update Neuron Tools
sudo apt-get install aws-neuronx-tools=2.* -y
# Add PATH
export PATH=/opt/aws/neuron/bin:$PATH
Install the dependencies, including Jupyter Notebook
bash
# Activate Python venv
source /opt/aws_neuron_venv_pytorch/bin/activate
# Install Jupyter notebook kernel
pip install ipykernel
python3.8 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels
# Set pip repository pointing to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
# Update Neuron Compiler and Framework
python -m pip install --upgrade neuronx-cc==2.* torch-neuronx torchvision
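As a quick sanity check that the Neuron compiler and torch-neuronx were installed correctly, you can trace a trivial model inside the activated venv. This smoke test is my own addition rather than part of the official setup steps:

```python
# Minimal smoke test: trace a tiny model with torch_neuronx and run it on a NeuronCore.
import torch
import torch_neuronx

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU()).eval()
example = torch.rand(1, 4)

traced = torch_neuronx.trace(model, example)   # compiles for Neuron; takes a moment
print(traced(example).shape)                   # expected: torch.Size([1, 8])
```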
Enable remote access to Jupyter Notebook
bash
# Generate the configuration file
jupyter-lab --generate-config
# Set a password
jupyter server password
# Start jupyter-lab and allow remote access
jupyter-lab --ip 0.0.0.0 --port 8888 --no-browser
Note: configure the AWS security group to open port 8888 and allow access only from your own IP.
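If you manage security groups from code, the boto3 sketch below adds that rule; the security group ID and the CIDR are placeholders for your own values.

```python
# Hypothetical example: allow only your own IP to reach Jupyter on port 8888.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-xxxxxxxx",                  # placeholder: the instance's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8888,
        "ToPort": 8888,
        "IpRanges": [{"CidrIp": "203.0.113.10/32", "Description": "my IP only"}],
    }],
)
```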
Open Jupyter Notebook in your browser
Click into a Python 3 notebook
Install the Python dependencies
python
!pip install diffusers==0.14.0 transformers==4.30.2 accelerate==0.16.0 safetensors==0.3.1 matplotlib
Import the dependencies
python
import os
os.environ["NEURON_FUSE_SOFTMAX"] = "1"
import torch
import torch.nn as nn
import torch_neuronx
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import image as mpimg
import time
import copy
from IPython.display import clear_output
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.models.unet_2d_condition import UNet2DConditionOutput
from diffusers.models.cross_attention import CrossAttention
# Define datatype
DTYPE = torch.bfloat16
clear_output(wait=False)
Define the Python helper classes and functions
python
class UNetWrap(nn.Module):
    """Wraps the UNet so the traced module returns a plain tuple."""
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states, return_dict=False)
        return out_tuple


class NeuronUNet(nn.Module):
    """Drop-in replacement for the pipeline's UNet that calls the compiled wrapper."""
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        sample = self.unetwrap(sample, timestep.to(dtype=DTYPE).expand((sample.shape[0],)), encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)


class NeuronTextEncoder(nn.Module):
    """Drop-in replacement for the CLIP text encoder that calls the compiled encoder."""
    def __init__(self, text_encoder):
        super().__init__()
        self.neuron_text_encoder = text_encoder
        self.config = text_encoder.config
        self.dtype = text_encoder.dtype
        self.device = text_encoder.device

    def forward(self, emb, attention_mask=None):
        return [self.neuron_text_encoder(emb)['last_hidden_state']]


# Optimized attention, used to replace CrossAttention.get_attention_scores
def get_attention_scores(self, query, key, attn_mask):
    dtype = query.dtype

    if self.upcast_attention:
        query = query.float()
        key = key.float()

    # Check for square matmuls
    if query.size() == key.size():
        attention_scores = custom_badbmm(
            key,
            query.transpose(-1, -2)
        )

        if self.upcast_softmax:
            attention_scores = attention_scores.float()

        attention_probs = attention_scores.softmax(dim=1).permute(0, 2, 1)
        attention_probs = attention_probs.to(dtype)

    else:
        attention_scores = custom_badbmm(
            query,
            key.transpose(-1, -2)
        )

        if self.upcast_softmax:
            attention_scores = attention_scores.float()

        attention_probs = attention_scores.softmax(dim=-1)
        attention_probs = attention_probs.to(dtype)

    return attention_probs


# Batched matmul followed by the fixed attention scale (0.125 = 1/sqrt(64))
def custom_badbmm(a, b):
    bmm = torch.bmm(a, b)
    scaled = bmm * 0.125
    return scaled


# Replacement decode_latents that upcasts to float32 before the compiled VAE decoder
def decode_latents(self, latents):
    latents = latents.to(torch.float)
    latents = 1 / self.vae.config.scaling_factor * latents
    image = self.vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.cpu().permute(0, 2, 3, 1).float().numpy()
    return image
Compile the models with Neuron so they can run on the Inferentia chips. The first compilation takes several minutes, so be patient...
python
# For saving compiler artifacts
COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_512'
# Model ID for SD version pipeline
model_id = "stabilityai/stable-diffusion-2-1-base"
# --- Compile UNet and save ---
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
# Replace original cross-attention module with custom cross-attention module for better performance
CrossAttention.get_attention_scores = get_attention_scores
# Apply double wrapper to deal with custom return type
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
# Only keep the model being compiled in RAM to minimize memory pressure
unet = copy.deepcopy(pipe.unet.unetwrap)
del pipe
# Compile unet
sample_1b = torch.randn([1, 4, 64, 64], dtype=DTYPE)
timestep_1b = torch.tensor(999, dtype=DTYPE).expand((1,))
encoder_hidden_states_1b = torch.randn([1, 77, 1024], dtype=DTYPE)
example_inputs = sample_1b, timestep_1b, encoder_hidden_states_1b
unet_neuron = torch_neuronx.trace(
unet,
example_inputs,
compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'),
compiler_args=["--model-type=unet-inference", "--enable-fast-loading-neuron-binaries"]
)
# Enable asynchronous and lazy loading to speed up model load
torch_neuronx.async_load(unet_neuron)
torch_neuronx.lazy_load(unet_neuron)
# save compiled unet
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
torch.jit.save(unet_neuron, unet_filename)
# delete unused objects
del unet
del unet_neuron
# --- Compile CLIP text encoder and save ---
# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
text_encoder = copy.deepcopy(pipe.text_encoder)
del pipe
# Apply the wrapper to deal with custom return type
text_encoder = NeuronTextEncoder(text_encoder)
# Compile text encoder
# This is used for indexing a lookup table in torch.nn.Embedding,
# so using random numbers may give errors (out of range).
emb = torch.tensor([[49406, 18376, 525, 7496, 49407, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0]])
text_encoder_neuron = torch_neuronx.trace(
text_encoder.neuron_text_encoder,
emb,
compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'),
compiler_args=["--enable-fast-loading-neuron-binaries"]
)
# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(text_encoder_neuron)
# Save the compiled text encoder
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
torch.jit.save(text_encoder_neuron, text_encoder_filename)
# delete unused objects
del text_encoder
del text_encoder_neuron
# --- Compile VAE decoder and save ---
# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
decoder = copy.deepcopy(pipe.vae.decoder)
del pipe
# Compile vae decoder
decoder_in = torch.randn([1, 4, 64, 64], dtype=torch.float32)
decoder_neuron = torch_neuronx.trace(
decoder,
decoder_in,
compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder'),
compiler_args=["--enable-fast-loading-neuron-binaries"]
)
# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(decoder_neuron)
# Save the compiled vae decoder
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
torch.jit.save(decoder_neuron, decoder_filename)
# delete unused objects
del decoder
del decoder_neuron
# --- Compile VAE post_quant_conv and save ---
# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
post_quant_conv = copy.deepcopy(pipe.vae.post_quant_conv)
del pipe
# Compile vae post_quant_conv
post_quant_conv_in = torch.randn([1, 4, 64, 64], dtype=torch.float32)
post_quant_conv_neuron = torch_neuronx.trace(
post_quant_conv,
post_quant_conv_in,
compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv'),
)
# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(post_quant_conv_neuron)
# Save the compiled vae post_quant_conv
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')
torch.jit.save(post_quant_conv_neuron, post_quant_conv_filename)
# delete unused objects
del post_quant_conv
del post_quant_conv_neuron
When compilation succeeds, you will see a Compiler status PASS message.
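Before moving on, it can be useful to confirm that all four compiled artifacts were actually written to disk. This quick check is optional and not part of the reference notebook:

```python
# Optional check: make sure every compiled model.pt exists before loading them.
import os

COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_512'
for name in ['unet', 'text_encoder', 'vae_decoder', 'vae_post_quant_conv']:
    path = os.path.join(COMPILER_WORKDIR_ROOT, name, 'model.pt')
    print(path, 'OK' if os.path.exists(path) else 'MISSING')
```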
Load the compiled models and generate images
python
# --- Load all compiled models ---
COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_512'
model_id = "stabilityai/stable-diffusion-2-1-base"
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Replaces StableDiffusionPipeline's decode_latents method with our custom decode_latents method defined above.
StableDiffusionPipeline.decode_latents = decode_latents
# Load the compiled UNet onto two neuron cores.
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
device_ids = [0,1]
pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)
# Load other compiled models onto a single neuron core.
pipe.text_encoder = NeuronTextEncoder(pipe.text_encoder)
pipe.text_encoder.neuron_text_encoder = torch.jit.load(text_encoder_filename)
pipe.vae.decoder = torch.jit.load(decoder_filename)
pipe.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)
# Run pipeline
prompt = ["a photo of an astronaut riding a horse on mars",
"sonic on the moon",
"elvis playing guitar while eating a hotdog",
"saved by the bell",
"engineers eating lunch at the opera",
"panda eating bamboo on a plane",
"A digital illustration of a steampunk flying machine in the sky with cogs and mechanisms, 4k, detailed, trending in artstation, fantasy vivid colors",
"kids playing soccer at the FIFA World Cup"
]
# First do a warmup run so all the asynchronous loads can finish
image_warmup = pipe(prompt[0]).images[0]
plt.title("Image")
plt.xlabel("X pixel scaling")
plt.ylabel("Y pixels scaling")
total_time = 0
for x in prompt:
    start_time = time.time()
    image = pipe(x).images[0]
    total_time = total_time + (time.time() - start_time)
    image.save("image.png")
    image = mpimg.imread("image.png")
    # clear_output(wait=True)
    plt.imshow(image)
    plt.show()
print("Average time: ", np.round((total_time/len(prompt)), 2), "seconds")
In our example we passed in a set of prompts, and the pipeline drew an image for each of them.
The prompt a photo of an astronaut riding a horse on mars generated an astronaut.
When we change the prompt to Portrait of renaud sechan, pen and ink, intricate line drawings, by craig mullins, ruan jia, kentaro miura, greg rutkowski, loundraw, the generated image changes accordingly.
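With the compiled pipeline loaded, you can also tune the standard diffusers generation parameters per call. A minimal sketch, assuming the pipe object from the previous cell is still in memory:

```python
# Generate a single image with explicit inference settings (pipe from the cell above).
prompt = "Portrait of renaud sechan, pen and ink, intricate line drawings, by craig mullins, ruan jia, kentaro miura, greg rutkowski, loundraw"

image = pipe(
    prompt,
    num_inference_steps=25,                # more steps: finer detail, slower generation
    guidance_scale=7.5,                    # how strongly to follow the prompt
    negative_prompt="blurry, low quality", # things to steer away from
).images[0]
image.save("portrait.png")
```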
About AWS custom silicon
AWS has developed several families of high-performance, purpose-built chips for cloud-native workloads. They are deployed at large scale across AWS and have been very well received. The four main families are:
| Name | Website | Function |
|---|---|---|
| Nitro | AWS Nitro System | Virtualization, security, networking, and storage offload; mainly improves overall EC2 I/O performance, virtualization efficiency, and security |
| Graviton | AWS Graviton Processors | General-purpose server CPUs based on the ARM64 instruction set, widely used in the server space and now holding the largest share of the ARM server market (roughly 20% lower unit price than x86_64 with better performance, about 34% better price/performance); nearly every AWS customer uses Graviton-backed managed services or self-managed workloads on Graviton instances |
| Inferentia | AWS Inferentia | Accelerator purpose-built for deep learning inference, and the chip used in this post |
| Trainium | AWS Trainium | Accelerator purpose-built for deep learning training |
In machine learning, Trainium and Inferentia make a perfect pair, handling demanding model training and inference workloads respectively. Because they are purpose-built for large-scale machine learning, they deliver higher efficiency, lower cost, and better power efficiency. Most mainstream models can be adapted with the Neuron SDK and then run AI/ML tasks efficiently.
The inf2.8xlarge used here is built on the latest (2023) second-generation Inferentia chip. The Inf2 instance family delivers up to 2.3 petaflops of DL performance, up to 384 GB of total accelerator memory, and 9.8 TB/s of memory bandwidth. The AWS Neuron SDK integrates natively with popular machine learning frameworks such as PyTorch and TensorFlow, so users can keep their existing frameworks and application code when deploying on Inf2. Developers can use Inf2 instances through the AWS Deep Learning AMI, AWS Deep Learning Containers, or managed services such as Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker.
At the heart of Amazon EC2 Inf2 instances are AWS Inferentia2 devices, each containing two NeuronCores-v2. Each NeuronCore-v2 is an independent heterogeneous compute unit with four main engines: Tensor, Vector, Scalar, and GPSIMD. The Tensor engine is optimized for matrix operations. The Scalar engine is optimized for element-wise operations such as ReLU (rectified linear unit). The Vector engine is optimized for non-element-wise vector operations, including batch normalization and pooling. The diagram below shows the inner workings of the AWS Inferentia2 device architecture.
AWS Inferentia2 supports a wide range of data types, including FP32, TF32, BF16, FP16, and UINT8, so users can choose the most suitable data type for their workload. It also supports the new configurable FP8 (cFP8) data type, which is particularly relevant for large models because it reduces a model's memory footprint and I/O requirements.
AWS Inferentia2 embeds general-purpose digital signal processors (DSPs) that enable dynamic execution, so control-flow operators do not need to be unrolled or executed on the host. AWS Inferentia2 also supports dynamic input shapes, which is critical for models whose input tensor size is not known in advance, such as models that process text.
AWS Inferentia2 supports custom operators written in C++. Neuron Custom C++ Operators let users write C++ custom operators that run natively on NeuronCores. Existing CPU custom operators can be migrated to Neuron and new experimental operators implemented using the standard PyTorch custom operator programming interface, all without deep knowledge of the NeuronCore hardware.
Inf2 instances are the first inference-optimized instances on Amazon EC2 to support distributed inference over direct, ultra-high-speed links between chips (NeuronLink v2). NeuronLink v2 uses collective communications operators such as all-reduce to run high-performance inference pipelines across all chips.
Neuron SDK
AWS Neuron is an SDK that optimizes the performance of complex neural network models executed on AWS Inferentia and Trainium. It includes a deep learning compiler, a runtime, and tools that integrate natively with popular frameworks such as TensorFlow and PyTorch, and it comes pre-installed in the AWS Deep Learning AMIs and Deep Learning Containers so customers can quickly start running high-performance, cost-effective inference.
The Neuron compiler accepts machine learning models in multiple formats (TensorFlow, PyTorch, XLA HLO) and optimizes them to run on Neuron devices. It is invoked from within the machine learning framework, where the model is handed to the compiler by the Neuron framework plugin. The resulting compiler artifact is called a NEFF file (Neuron Executable File Format), which the Neuron runtime then loads onto the Neuron device.
The Neuron runtime consists of a kernel driver and C/C++ libraries that provide APIs for accessing Inferentia and Trainium Neuron devices. The Neuron ML framework plugins for TensorFlow and PyTorch use the Neuron runtime to load and run models on NeuronCores. The runtime loads compiled deep learning models (also known as Neuron Executable File Format, or NEFF, files) onto Neuron devices and is optimized for high throughput and low latency.
Inf2 instances are a good fit for popular applications such as text summarization, code generation, video and image generation, speech recognition, personalization, and more.
References: