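Ascend DrivingSDK is an operator and model acceleration library for autonomous-driving workloads, built on the Ascend NPU platform. It provides a set of high-performance operators and model acceleration interfaces and supports the PyTorch framework.
modulated_deform_conv2d in Ascend DrivingSDK is one of its few fused operators: a single kernel performs the whole Deformable Convolution computation. However, the 910B uses an architecture in which vector cores and cube cores are separate, so synchronization between the two is expensive. 910B-series chips have between 96 MB and 192 MB of L2 cache, enabled by default, so the inputs of modulated_deform_conv2d mostly sit in L2 cache.
At the Ascend C level, modulated_deform_conv2d is backed by two operators; v2 is optimized for 3x3 convolutions. Oddly, the authors chose to add a new operator rather than a new kernel. For reasons unknown, the two versions take their parameters in different orders.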
op | PreProcess | ComputeWeight | ComputeBilinearInterpolation | ProcessCube | Output |
---|---|---|---|---|---|
deformable_conv2d | Caches the convolution-window indices of one row and reuses them $H_o$ times | Vector computation of the weights for $W_o k_h k_w$ points | Loads $4C_i$ values in a single copy, then interpolates | Computes the result for $W_o$ outputs per call | Keeps im2col for $\frac{\partial L}{\partial W}$ |
deformable_conv2d_v2 | / | Vector computation of the weights for $k_h k_w$ points | $k_h k_w$ load instructions of $C_i$ values each, then interpolates | Computes 128 output points per call | / |
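v2 outperforms v1, probably because v2 has a longer interval between memory-copy synchronizations (9:4). Global memory access has high latency: v1 copies $4C_i$ values with a single instruction, whereas v2 loads $C_i$ values per instruction but synchronizes only after 9 instructions. v1 caches im2col to save computation, but the saving is small, because computing $\frac{\partial L}{\partial \Delta \mathbf{m}_n}$ still needs the bilinear-interpolation result.
Both operators are functionally incomplete: for example, they do not support the half type, deform_group, or bias. The v2 operator is even more bare-bones. With no documentation and no fool-proofing, using them inevitably takes some trial and error. It is hard to imagine that this is industrial-grade code, let alone automotive-grade. The only upside is that, like AMD, it is open source, with users expected to track down and fix problems themselves.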
ModulatedDeformConv2dFunction
Converts the inputs to NHWC layout.
python
class ModulatedDeformConv2dFunction(Function):
@staticmethod
@custom_fwd(cast_inputs=torch.float32)
# pylint: disable=huawei-too-many-arguments
def forward(
ctx,
x: torch.Tensor,
offset: torch.Tensor,
mask: torch.Tensor,
weight: torch.Tensor,
bias: Optional[nn.Parameter] = None,
stride: Union[int, Tuple[int, ...]] = 1,
padding: Union[int, Tuple[int, ...]] = 0,
dilation: Union[int, Tuple[int, ...]] = 1,
groups: int = 1,
deformable_groups: int = 1,
):
ctx.kernel_size = [weight.size(2), weight.size(3)]
ctx.stride = _pair(stride)
ctx.padding = _pair(padding)
ctx.dilation = _pair(dilation)
ctx.groups = groups
ctx.deformable_groups = deformable_groups
nhwc_x = x.permute(0, 2, 3, 1).contiguous()
nhwc_offset = offset.permute(0, 2, 3, 1).contiguous()
nhwc_weight = weight.permute(0, 2, 3, 1).contiguous()
nhwc_mask = mask.permute(0, 2, 3, 1).contiguous()
out, offset_output = mx_driving._C.modulated_deformable_conv2d(
nhwc_x,
nhwc_offset,
nhwc_mask,
nhwc_weight,
None,
ctx.kernel_size,
ctx.stride,
ctx.padding,
ctx.dilation,
ctx.groups,
ctx.deformable_groups,
False,
)
ctx.save_for_backward(nhwc_x, nhwc_offset, nhwc_weight, nhwc_mask, offset_output)
return out
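To make the layout change concrete, here is a minimal pure-Python sketch of how `permute(0, 2, 3, 1).contiguous()` moves each element in flat memory. The helper name `nchw_to_nhwc` is hypothetical and is not part of DrivingSDK; it only illustrates the index arithmetic.

```python
def nchw_to_nhwc(data, n, c, h, w):
    """Reorder a flat NCHW list into a flat NHWC list (illustration only)."""
    out = [0] * len(data)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi  # NCHW flat offset
                    dst = ((ni * h + hi) * w + wi) * c + ci  # NHWC flat offset
                    out[dst] = data[src]
    return out


# Two channels of three pixels each end up interleaved per pixel.
print(nchw_to_nhwc([0, 1, 2, 10, 11, 12], n=1, c=2, h=1, w=3))
```

With channels innermost, all $C_i$ values of one pixel are contiguous, which is what lets the kernel fetch a pixel's channels with one copy.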
ModulatedDeformConv2dFunction.backward
Transposes $\frac{\partial L}{\partial Y}$ to $N\times H_o \times C_o \times W_o$.
python
@staticmethod
@once_differentiable
@custom_bwd
# pylint: disable=huawei-too-many-arguments,too-many-return-values
def backward(ctx, grad_out):
nhwc_x, nhwc_offset, nhwc_weight, nhwc_mask, offset_output = ctx.saved_tensors
nhwc_grad_out = grad_out.permute(0, 2, 1, 3).contiguous()
grad_x, grad_weight, _, grad_offset, grad_mask = mx_driving._C.modulated_deformable_conv2d_backward(
nhwc_x,
nhwc_offset,
nhwc_mask,
nhwc_weight,
None,
offset_output,
nhwc_grad_out,
ctx.kernel_size,
ctx.stride,
ctx.padding,
ctx.dilation,
ctx.groups,
ctx.deformable_groups,
False,
)
return (
grad_x,
grad_offset,
grad_mask,
grad_weight,
None,
None,
None,
None,
None,
None,
)
modulated_deformable_conv2d
[Diagram: modulated_deformable_conv2d calls DeformableConv2dV2 when group=1, else DeformableConv2d]
TORCH_CHECK_NPU checks that every input tensor is stored on an NPU device.
cpp
std::tuple<at::Tensor, at::Tensor> modulated_deformable_conv2d(const at::Tensor& input, const at::Tensor& offset,
const at::Tensor& mask, const at::Tensor& weight, const c10::optional<at::Tensor>& bias_opt,
at::IntArrayRef kernel_size, at::IntArrayRef stride, at::IntArrayRef padding, at::IntArrayRef dilation,
int64_t groups, int64_t deformable_groups, int64_t with_bias)
{
TORCH_CHECK_NPU(input);
TORCH_CHECK_NPU(offset);
TORCH_CHECK_NPU(mask);
TORCH_CHECK_NPU(weight);
Checks the dimensions and parameters.
cpp
TORCH_CHECK(input.dim() == INPUT_DIM, "input must to be a 4D Tensor, but got: ", input.dim());
TORCH_CHECK(offset.dim() == INPUT_DIM, "offset has to be a 4D Tensor, but got: ", offset.dim());
TORCH_CHECK(mask.dim() == INPUT_DIM, "mask has to be a 4D Tensor, but got: ", mask.dim());
TORCH_CHECK(weight.dim() == INPUT_DIM, "weight has to be a 4D Tensor, but got: ", weight.dim());
TORCH_CHECK(stride[0] > 0 && stride[1] > 0, "stride must be greater than 0");
TORCH_CHECK(kernel_size[0] > 0 && kernel_size[1] > 0, "kernel_size must be greater than 0");
TORCH_CHECK(dilation[0] > 0 && dilation[1] > 0, "dilation must be greater than 0");
c10::value_or_else is deprecated; std::optional::value_or is recommended instead.
Safely handles the optional bias_opt.
cpp
const at::Tensor& bias = c10::value_or_else(bias_opt, [] { return at::Tensor(); });
uint32_t n = static_cast<uint32_t>(input.size(0));
uint32_t c_in = static_cast<uint32_t>(input.size(3));
uint32_t h_in = static_cast<uint32_t>(input.size(1));
uint32_t w_in = static_cast<uint32_t>(input.size(2));
uint32_t h_out = static_cast<uint32_t>(offset.size(1));
uint32_t w_out = static_cast<uint32_t>(offset.size(2));
uint32_t c_out = static_cast<uint32_t>(weight.size(0));
uint32_t kh = static_cast<uint32_t>(weight.size(1));
uint32_t kw = static_cast<uint32_t>(weight.size(2));
TORCH_CHECK(kh == kernel_size[0] && kw == kernel_size[1], "kernel size mismatch");
TORCH_CHECK(mask.size(-1) == kh * kw, "The shape of the mask is invalid");
TORCH_CHECK(groups > 0, "groups must be greater than 0");
TORCH_CHECK(c_out % groups == 0, "weight's out channel should be divided by groups");
TORCH_CHECK(c_in % groups == 0, "input's channel should be divided by groups");
bool modulated = true;
For an ungrouped convolution whose input channel count is 256 or 512, DeformableConv2dV2 is called; otherwise DeformableConv2d is called. The two operators take their arguments in different orders.
DeformableConv2dV2 has two outputs: output has shape $N \times H_o \times W_o \times C_o$, and offset_output has shape $N\times H_o W_o \times k_h k_w \times C_i$.
DeformableConv2d has two outputs: output has shape $N \times H_o \times C_o \times W_o$, and offset_output has shape $N\times H_o \times W_o \times G \times \frac{k_h k_w C_i}{G}$.
Note: DeformableConv2dV2 requires $k_h k_w = 9$, but no such condition is checked here.
cpp
if ((groups == 1) && ((c_in == CHANNEL_256) || (c_in == CHANNEL_512))) {
at::Tensor output = at::empty({n, h_out, w_out, c_out}, input.options());
at::Tensor offset_output = at::empty({n, h_out * w_out, kh * kw, c_in}, input.options());
EXEC_NPU_CMD(aclnnDeformableConv2dV2, input, offset, mask, weight, bias, kernel_size, stride, padding, dilation,
groups, deformable_groups, modulated, with_bias, output, offset_output);
output = output.permute({0, 3, 1, 2});
return std::tie(output, offset_output);
} else {
at::Tensor output = at::empty({n, h_out, c_out, w_out}, input.options());
at::Tensor offset_output = at::empty({n, h_out, w_out, groups, kh * kw * c_in / groups}, input.options());
EXEC_NPU_CMD(aclnnDeformableConv2d, input, weight, bias, offset, mask, kernel_size, stride, padding, dilation,
groups, deformable_groups, modulated, with_bias, output, offset_output);
output = output.permute({0, 2, 1, 3});
return std::tie(output, offset_output);
}
}
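The dispatch rule and the two sets of output shapes can be summarized in a short Python sketch. The function names and the `require_3x3` flag are hypothetical: the flag models the guard that the note above says is missing, and it is not present in the source.

```python
CHANNEL_256, CHANNEL_512 = 256, 512


def select_kernel(groups, c_in, kh, kw, require_3x3=False):
    """Mirror the dispatch condition; require_3x3 is a hypothetical extra guard."""
    use_v2 = groups == 1 and c_in in (CHANNEL_256, CHANNEL_512)
    if require_3x3:
        use_v2 = use_v2 and kh * kw == 9
    return "DeformableConv2dV2" if use_v2 else "DeformableConv2d"


def output_shapes(kernel, n, h_out, w_out, c_in, c_out, kh, kw, groups):
    """Return (output shape, offset_output shape) as allocated by the wrapper."""
    if kernel == "DeformableConv2dV2":
        return (n, h_out, w_out, c_out), (n, h_out * w_out, kh * kw, c_in)
    return ((n, h_out, c_out, w_out),
            (n, h_out, w_out, groups, kh * kw * c_in // groups))


# A 5x5 kernel with 256 input channels still reaches V2 unless the guard is on.
print(select_kernel(1, 256, 5, 5), select_kernel(1, 256, 5, 5, require_3x3=True))
```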
DeformableConv2dKernel::Process
[Diagram: DeformableConv2dKernel::Process calls PreProcess, ProcessVector, ProcessCube]
cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::Process()
{
PreProcess();
for (uint32_t taskIdx = start_; taskIdx < end_; ++taskIdx) {
ProcessVector(taskIdx);
ProcessCube(taskIdx);
}
mm_.End();
}
DeformableConv2dKernel::PreProcess
All VectorCores cooperate to compute the indices needed for one output row, similar to an Allgather pattern.
TBuf::Get obtains a Tensor of a specified length, or of the full length, from a TBuf.
auxH and auxW each have size $W_o k_h k_w$ and store the convolution-window coordinates $(h_i, w_i)$ for one output row.
$$\begin{aligned} w_{i\_start} &= w_o \cdot s - p \\ h_{i\_start} &= h_o \cdot s - p \end{aligned}$$
Because auxH and auxW are computed once and then reused as indices for many output rows, auxH holds offsets relative to the window and does not include $h_o \cdot s$.
auxStart_ is the first index that the current core has to process.
cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::PreProcess()
{
LocalTensor<float> auxH = auxHBuf_.Get<float>();
LocalTensor<float> auxW = auxWBuf_.Get<float>();
uint32_t idx = 0;
for (int32_t w = auxStart_; w < auxEnd_; ++w) {
for (int32_t i = 0; i < kH_; ++i) {
for (int32_t j = 0; j < kW_; ++j) {
auxW.SetValue(idx, static_cast<float>(w * strideW_ - padW_ + j * dilationW_));
auxH.SetValue(idx, static_cast<float>(-padH_ + i * dilationH_));
++idx;
}
}
}
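As a rough illustration of what the triple loop produces, the following Python sketch builds the same two tables for a whole output row on a single core (the real kernel splits the `w` range across cores via `auxStart_` and `auxEnd_`). The function name `build_aux` is hypothetical.

```python
def build_aux(w_out, k_h, k_w, stride_w, pad_h, pad_w, dil_h, dil_w):
    """Window coordinates for one output row; aux_h omits the h_o * stride term."""
    aux_h, aux_w = [], []
    for w in range(w_out):
        for i in range(k_h):
            for j in range(k_w):
                aux_w.append(float(w * stride_w - pad_w + j * dil_w))
                aux_h.append(float(-pad_h + i * dil_h))
    return aux_h, aux_w


aux_h, aux_w = build_aux(w_out=2, k_h=3, k_w=3, stride_w=1,
                         pad_h=1, pad_w=1, dil_h=1, dil_w=1)
print(aux_w[:9])  # x coordinates of the first 3x3 window
print(aux_h[:9])  # relative y coordinates, shared by every output row
```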
GlobalTensor::operator[] returns a new GlobalTensor shifted by the given offset.
valRptTimes_ is the repeat count used when copying $C_i$ values.
Each core copies the indices it computed locally to auxWGm_ and auxHGm_.
cpp
DataCopyPad(auxWGm_[auxStart_ * kernelSize_], auxW,
{1, static_cast<uint16_t>(B32_BYTE_SIZE * (auxEnd_ - auxStart_) * kernelSize_), 0, 0});
DataCopyPad(auxHGm_[auxStart_ * kernelSize_], auxH,
{1, static_cast<uint16_t>(B32_BYTE_SIZE * (auxEnd_ - auxStart_) * kernelSize_), 0, 0});
SyncAll();
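After synchronization, the complete set of indices for one output row is copied back from global memory.
Note: one concern here is the high latency of the two global-memory accesses. Kernels are usually 3x3, so when $W_o$ is small, having every core compute the indices itself may be cheaper than computing them cooperatively.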
DataCopy(auxW, auxWGm_, {1, rowOffsetBlk_, 0, 0});
DataCopy(auxH, auxHGm_, {1, rowOffsetBlk_, 0, 0});
Zeroes feature.
cpp
LocalTensor<float> feature = featureBuf_.Get<float>();
Duplicate<float, false>(feature, 0.f, MASK_PLACEHOLDER, 4 * valRptTimes_, 1, 8);
}
DeformableConv2dKernel::ProcessVector
[Diagram: DeformableConv2dKernel::ProcessVector calls CopyInOffset, ComputeWeight, ComputeBilinearInterpolation]
cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::ProcessVector(uint32_t taskIdx)
{
uint32_t batch = taskIdx / hOut_;
srcOffset_ = batch * hIn_ * wIn_ * cIn_;
dstOffset_ = taskIdx * rowIn_;
LocalTensor<float> offset = offsetBuf_.Get<float>();
LocalTensor<float> auxW = auxWBuf_.Get<float>();
LocalTensor<float> auxH = auxHBuf_.Get<float>();
LocalTensor<int32_t> offsetInt = offsetIntBuf_.Get<int32_t>();
LocalTensor<float> weight = weightBuf_.Get<float>();
LocalTensor<float> feature = featureBuf_.Get<float>();
LocalTensor<float> mask;
if (modulated) {
mask = maskBuf_.Get<float>();
}
LocalTensor<float> offsetOutput = offsetOutputBuf_.Get<float>();
DeformableConv2dKernel::CopyInOffset copies one row of $\Delta p_n$ and $\Delta m$ and deinterleaves $\Delta p_n$.
DeformableConv2dKernel::ComputeWeight computes the sampling positions and the interpolation weights.
cpp
CopyInOffset(taskIdx, offset, mask);
ComputeWeight(taskIdx, auxW, auxH, offset, offsetInt, weight, mask);
SetFlag<HardEvent::V_MTE2>(calEvt_);
WaitFlag<HardEvent::V_MTE2>(calEvt_);
SetFlag<HardEvent::MTE3_V>(0);
SetFlag<HardEvent::MTE3_V>(1);
uint8_t ping = 0;
DeformableConv2dKernel::ComputeBilinearInterpolation loads, computes, and stores.
cpp
for (uint32_t w = 0; w < wOut_; ++w) {
WaitFlag<HardEvent::MTE3_V>(ping);
ComputeBilinearInterpolation(w, offset, offsetInt, feature, weight, offsetOutput[ping * kwIn_]);
SetFlag<HardEvent::MTE3_V>(ping);
ping = 1 - ping;
}
WaitFlag<HardEvent::MTE3_V>(0);
WaitFlag<HardEvent::MTE3_V>(1);
}
DeformableConv2dKernel::CopyInOffset
cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::CopyInOffset(
uint32_t taskIdx, const LocalTensor<float>& offset, const LocalTensor<float>& mask)
{
uint32_t offsetIdx = taskIdx * rowOffset_ * 2;
DataCopy(offset, offsetGm_[offsetIdx], {1, doubleRowOffsetBlk_, 0, 0});
if (modulated) {
DataCopy(mask, maskGm_[taskIdx * rowOffset_], {1, rowOffsetBlk_, 0, 0});
}
SetFlag<HardEvent::MTE2_V>(copyEvt_);
WaitFlag<HardEvent::MTE2_V>(copyEvt_);
uint64_t cnt;
GatherMask(offset[2 * alignedRowOffset_], offset, 2, false, MASK_PLACEHOLDER, gatherParams_, cnt);
GatherMask(offset[3 * alignedRowOffset_], offset, 1, false, MASK_PLACEHOLDER, gatherParams_, cnt);
SetVectorMask<float>(FULL_MASK, FULL_MASK);
}
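The deinterleaving step can be pictured with a small Python sketch. It assumes the offsets are stored as interleaved $(\Delta y, \Delta x)$ pairs and that the two GatherMask patterns pick the even and odd elements respectively; the helper name `deinterleave` is hypothetical.

```python
def deinterleave(offsets):
    """Split [dy0, dx0, dy1, dx1, ...] into separate dy and dx lists."""
    dy = offsets[0::2]  # even positions
    dx = offsets[1::2]  # odd positions
    return dy, dx


dy, dx = deinterleave([0.1, 0.2, 0.3, 0.4])
print(dy, dx)
```

Separating the two components lets the later vector additions treat all x coordinates, then all y coordinates, as contiguous runs.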
DeformableConv2dKernel::ComputeWeight
offset is an intermediate variable, and offsetInt and weight are outputs, yet all of them are passed as const references.
cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::ComputeWeight(uint32_t taskIdx,
const LocalTensor<float>& auxW, const LocalTensor<float>& auxH, const LocalTensor<float>& offset,
const LocalTensor<int32_t>& offsetInt, const LocalTensor<float>& weight, const LocalTensor<float>& mask)
{
offset has size $4\times W_o k_h k_w$ and is used as scratch space.
A Copy instruction fetches $x_i$.
h is $y_o$; adding $y_o \cdot s$ to auxH gives the actual coordinate $y_i$.
The first half of offset holds the convolution-window indices $p + p_n$.
cpp
int32_t h = taskIdx % hOut_;
Copy<float, false>(offset, auxW, MASK_PLACEHOLDER, rptTimes_, {1, 1, 8, 8});
Adds<float, false>(offset[alignedRowOffset_], auxH, float(h * strideH_), MASK_PLACEHOLDER, rptTimes_, {1, 1, 8, 8});
Because the memory is contiguous, a single add instruction computes the floating-point coordinates $p + p_n+\Delta p_n$.
cpp
Add<float, false>(
offset, offset, offset[2 * alignedRowOffset_], MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 1, 8, 8, 8});
offsetInt holds the integer coordinates $(y_1, x_1)$, and the second half of offset stores the top-left coordinates as floats.
cpp
Cast<int32_t, float, false>(
offsetInt, offset, RoundMode::CAST_FLOOR, MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 8, 8});
Cast<float, int32_t, false>(
offset[2 * alignedRowOffset_], offsetInt, RoundMode::CAST_NONE, MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 8, 8});
The first half becomes the differences $y-y_1$ and $x-x_1$.
weight is filled with 1, so the second half becomes $y_2-y$ and $x_2-x$.
cpp
Sub<float, false>(
offset, offset, offset[2 * alignedRowOffset_], MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 1, 8, 8, 8}); // lw, lh
Duplicate<float, false>(weight, 1.f, MASK_PLACEHOLDER, 2 * rptTimes_, 1, 8);
Sub<float, false>(
offset[2 * alignedRowOffset_], weight, offset, MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 1, 8, 8, 8}); // hw, hh
Multiplying the two dimensions together gives the weights of the 4 points:
$$\begin{aligned} w_{11} &= (x_2 -x)(y_2 -y) \\ w_{21} &=(x -x_1)(y_2 -y) \\ w_{12} &=(x_2 -x)(y -y_1)\\ w_{22} &=(x -x_1)(y -y_1) \end{aligned}$$
cpp
Mul<float, false>(weight, offset[2 * alignedRowOffset_], offset[3 * alignedRowOffset_], MASK_PLACEHOLDER, rptTimes_,
{1, 1, 1, 8, 8, 8}); // hw * hh
Mul<float, false>(weight[alignedRowOffset_], offset, offset[3 * alignedRowOffset_], MASK_PLACEHOLDER, rptTimes_,
{1, 1, 1, 8, 8, 8}); // lw * hh
Mul<float, false>(weight[2 * alignedRowOffset_], offset[alignedRowOffset_], offset[2 * alignedRowOffset_],
MASK_PLACEHOLDER, rptTimes_, {1, 1, 1, 8, 8, 8}); // hw * lh
Mul<float, false>(weight[3 * alignedRowOffset_], offset, offset[alignedRowOffset_], MASK_PLACEHOLDER, rptTimes_,
{1, 1, 1, 8, 8, 8}); // lh * lw
Multiplies the modulation weight $\Delta m$ into the 4 interpolation weights.
weight has shape $4\times W_o k_h k_w$ and mask has shape $W_o k_h k_w$; because their lengths differ, 4 calls are needed.
The copy-pasted code comments were never cleaned up.
cpp
if (modulated) {
Mul<float, false>(weight, weight, mask, MASK_PLACEHOLDER, rptTimes_, {1, 1, 1, 8, 8, 8});
Mul<float, false>(weight[alignedRowOffset_], weight[alignedRowOffset_], mask, MASK_PLACEHOLDER, rptTimes_,
{1, 1, 1, 8, 8, 8}); // lw * hh
Mul<float, false>(weight[2 * alignedRowOffset_], weight[2 * alignedRowOffset_], mask, MASK_PLACEHOLDER,
rptTimes_, {1, 1, 1, 8, 8, 8}); // hw * lh
Mul<float, false>(weight[3 * alignedRowOffset_], weight[3 * alignedRowOffset_], mask, MASK_PLACEHOLDER,
rptTimes_, {1, 1, 1, 8, 8, 8}); // lh * lw
}
}
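For a single sampling point, the whole ComputeWeight pipeline reduces to a few scalar operations. The Python sketch below follows the same steps (floor, fractional part, one minus the fraction, products, optional modulation); the name `bilinear_weights` is hypothetical and the variable names follow the `lw`, `lh`, `hw`, `hh` comments in the kernel.

```python
import math


def bilinear_weights(y, x, mask=1.0):
    """Return ((y1, x1), (w11, w21, w12, w22)) for one sampling point."""
    y1, x1 = math.floor(y), math.floor(x)
    lh, lw = y - y1, x - x1        # y - y1 and x - x1
    hh, hw = 1.0 - lh, 1.0 - lw    # y2 - y and x2 - x, since y2 = y1 + 1
    w11 = hw * hh
    w21 = lw * hh
    w12 = hw * lh
    w22 = lw * lh
    return (y1, x1), tuple(mask * w for w in (w11, w21, w12, w22))


corner, weights = bilinear_weights(1.25, 2.5)
print(corner, weights, sum(weights))
```

Without modulation the four weights always sum to 1; with a modulation value they sum to that value, which is a convenient sanity check when comparing against the kernel.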
DeformableConv2dKernel::ComputeBilinearInterpolation
offset is not used.
cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::ComputeBilinearInterpolation(uint32_t w,
const LocalTensor<float>& offset, const LocalTensor<int32_t>& offsetInt, const LocalTensor<float>& feature,
const LocalTensor<float>& weight, const LocalTensor<float>& offsetOutput)
{
First zero offsetOutput, whose shape is $k_h k_w C_i$.
cpp
Duplicate<float, false>(offsetOutput, 0.f, MASK_PLACEHOLDER, kernelSize_ * valRptTimes_, 1, 8);
uint8_t ping = 0;
uint32_t kernelOffset = w * kernelSize_;
SetFlag<HardEvent::V_MTE2>(0);
SetFlag<HardEvent::V_MTE2>(1);
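$x_o$ is passed in.
pw and ph are indices into the arrays.
gmOffset is the one-dimensional offset of the input point.
SetFlag is a synchronization instruction between different pipelines within the same core.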
cpp
#pragma bisheng auto_sync parallel
for (uint32_t kIdx = 0; kIdx < kernelSize_; ++kIdx) {
uint32_t pw = kIdx + kernelOffset;
uint32_t ph = pw + alignedRowOffset_;
int32_t w0 = offsetInt.GetValue(pw);
int32_t h0 = offsetInt.GetValue(ph);
int32_t w1 = w0 + 1;
int32_t h1 = h0 + 1;
uint32_t outOffset = kIdx * cIn_;
uint32_t ftOffset = ping * featureOffset_;
WaitFlag<HardEvent::V_MTE2>(ping);
For each input point $(y, x)$, if $(y_1,x_1), (y_1,x_2), (y_2,x_1), (y_2,x_2)$ all lie inside the image, the 4 points are loaded in one copy.
Axpy multiplies the source elements by a scalar and accumulates the result into the destination elements.
cpp
if (0 < h1 && h1 < hIn_) {
if (0 < w1 && w1 < wIn_) {
uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
DataCopy(feature[ftOffset], xGm_[gmOffset], cpQuadValParams_);
SetFlag<HardEvent::MTE2_V>(copyEvt_);
WaitFlag<HardEvent::MTE2_V>(copyEvt_);
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
} else if (w1 == 0) {
uint64_t gmOffset = srcOffset_ + (h0 * wIn_) * cIn_;
DataCopy(feature[ftOffset + cIn_], xGm_[gmOffset], cpColDoubleValParams_);
SetFlag<HardEvent::MTE2_V>(copyEvt_);
WaitFlag<HardEvent::MTE2_V>(copyEvt_);
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
} else if (w1 == wIn_) {
uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
DataCopy(feature[ftOffset], xGm_[gmOffset], cpColDoubleValParams_);
SetFlag<HardEvent::MTE2_V>(copyEvt_);
WaitFlag<HardEvent::MTE2_V>(copyEvt_);
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
}
} else if (h1 == 0) {
if (0 < w1 && w1 < wIn_) {
uint64_t gmOffset = srcOffset_ + w0 * cIn_;
DataCopy(feature[ftOffset + 2 * cIn_], xGm_[gmOffset], cpRowDoubleValParams_);
SetFlag<HardEvent::MTE2_V>(copyEvt_);
WaitFlag<HardEvent::MTE2_V>(copyEvt_);
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
} else if (w1 == 0) {
uint64_t gmOffset = srcOffset_;
DataCopy(feature[ftOffset + 3 * cIn_], xGm_[gmOffset], cpOneValParams_);
SetFlag<HardEvent::MTE2_V>(copyEvt_);
WaitFlag<HardEvent::MTE2_V>(copyEvt_);
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],
weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
} else if (w1 == wIn_) {
uint64_t gmOffset = srcOffset_ + w0 * cIn_;
DataCopy(feature[ftOffset + 2 * cIn_], xGm_[gmOffset], cpOneValParams_);
SetFlag<HardEvent::MTE2_V>(copyEvt_);
WaitFlag<HardEvent::MTE2_V>(copyEvt_);
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],
weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
}
} else if (h1 == hIn_) {
if (0 < w1 && w1 < wIn_) {
uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
DataCopy(feature[ftOffset], xGm_[gmOffset], cpRowDoubleValParams_);
SetFlag<HardEvent::MTE2_V>(copyEvt_);
WaitFlag<HardEvent::MTE2_V>(copyEvt_);
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
} else if (w1 == 0) {
uint64_t gmOffset = srcOffset_ + (h0 * wIn_) * cIn_;
DataCopy(feature[ftOffset + cIn_], xGm_[gmOffset], cpOneValParams_);
SetFlag<HardEvent::MTE2_V>(copyEvt_);
WaitFlag<HardEvent::MTE2_V>(copyEvt_);
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),
MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
} else if (w1 == wIn_) {
uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;
DataCopy(feature[ftOffset], xGm_[gmOffset], cpOneValParams_);
SetFlag<HardEvent::MTE2_V>(copyEvt_);
WaitFlag<HardEvent::MTE2_V>(copyEvt_);
PipeBarrier<PIPE_V>();
Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),
MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});
}
}
SetFlag<HardEvent::V_MTE2>(ping);
ping = 1 - ping;
}
The $k_h k_w C_i$ interpolated values are copied out in segments to the global-memory buffer offsetOutputGm_, whose shape is $G \times W_o\times \frac{k_h k_w C_i}{G}$.
cpp
SetFlag<HardEvent::V_MTE3>(calEvt_);
WaitFlag<HardEvent::V_MTE3>(calEvt_);
for (uint32_t i = 0; i < groups_; ++i) {
DataCopy(offsetOutputGm_[dstOffset_ + rowInPerGroup_ * i], offsetOutput[i * cInPerGroup_], cpOffsetOutParams_);
}
dstOffset_ += kwInPerGroup_;
WaitFlag<HardEvent::V_MTE2>(0);
WaitFlag<HardEvent::V_MTE2>(1);
}
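The long chain of boundary branches implements zero padding: any corner outside the image simply contributes nothing. A compact Python sketch of the same behaviour, written for clarity rather than speed, is shown below. The name `sample_bilinear` and the nested-list image format are assumptions made for this illustration.

```python
import math


def sample_bilinear(img, y, x):
    """Bilinear sample of an H x W x C nested list; out-of-range corners count as 0."""
    h_in, w_in, c = len(img), len(img[0]), len(img[0][0])
    y1, x1 = math.floor(y), math.floor(x)
    ly, lx = y - y1, x - x1
    out = [0.0] * c
    for yy, wy in ((y1, 1.0 - ly), (y1 + 1, ly)):
        for xx, wx in ((x1, 1.0 - lx), (x1 + 1, lx)):
            if 0 <= yy < h_in and 0 <= xx < w_in:  # skip corners outside the image
                for ch in range(c):
                    out[ch] += wy * wx * img[yy][xx][ch]
    return out


img = [[[1.0], [2.0]], [[3.0], [4.0]]]
print(sample_bilinear(img, 0.5, 0.5))   # all four corners valid
print(sample_bilinear(img, -0.5, 0.0))  # top row falls outside the image
```

The kernel distinguishes the cases explicitly so that each one can use a copy descriptor of the right size (four, two, or one pixel), whereas this sketch checks every corner individually.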
DeformableConv2dKernel::ProcessCube
SetTensorB sets the right-hand matrix B of the matmul; here B is marked as transposed.
The Batch Matmul is implemented as a loop: weight has shape $G\times \frac{C_o}{G} \times \frac{k_h k_w C_i}{G}$, im2col has shape $G \times W_o\times \frac{k_h k_w C_i}{G}$, and the output has shape $G\times \frac{C_o}{G}\times W_o$.
As a result, the combined output across cores has shape $N\times H_o \times C_o \times W_o$.
cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dKernel<modulated>::ProcessCube(uint32_t taskIdx)
{
uint64_t aOffset = 0;
uint64_t bOffset = taskIdx * rowIn_;
uint64_t cOffset = taskIdx * rowOut_;
for (uint32_t i = 0; i < groups_; ++i) {
mm_.SetTensorA(weightGm_[aOffset]);
mm_.SetTensorB(offsetOutputGm_[bOffset], true);
mm_.template IterateAll<false>(yGm_[cOffset]);
aOffset += kernelPerGroup_;
bOffset += rowInPerGroup_;
cOffset += rowOutPerGroup_;
}
}
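The per-group loop amounts to one small matrix product per group with B used transposed. A minimal pure-Python sketch of that shape arithmetic follows; `grouped_matmul` is a hypothetical name and nested lists stand in for the global-memory tensors.

```python
def grouped_matmul(weight, im2col):
    """weight[g][co][k] times im2col[g][wo][k] transposed gives out[g][co][wo]."""
    out = []
    for w_g, col_g in zip(weight, im2col):  # one matmul per group
        out.append([[sum(a * b for a, b in zip(w_row, col_row))
                     for col_row in col_g]
                    for w_row in w_g])
    return out


weight = [[[1, 2]], [[3, 4]]]                 # G=2, C_o/G=1, K=2
im2col = [[[1, 0], [0, 1]], [[1, 1], [2, 0]]]  # G=2, W_o=2, K=2
print(grouped_matmul(weight, im2col))
```

Stacking the groups along the second axis gives $C_o \times W_o$ for one output row, which is why the operator's output comes out as $N\times H_o \times C_o \times W_o$ and has to be permuted afterwards.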
DeformableConv2dV2Kernel::Process
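[Diagram: DeformableConv2dV2Kernel::Process calls ProcessVector, ProcessCube]
DeformableConv2dV2Kernel::ProcessVector generates one row of the im2col matrix per call. Once cubeTileTaskCount_ rows have accumulated, DeformableConv2dV2Kernel::ProcessCube is called to perform the convolution.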
cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::Process()
{
for (int32_t taskIdx = start_; taskIdx < end_; taskIdx++) {
ProcessVector(taskIdx);
int32_t innerCubeTaskIdx = (taskIdx - start_) % cubeTileTaskCount_;
bool startCubeFlag = (innerCubeTaskIdx == cubeTileTaskCount_ - 1) || (taskIdx == end_ - 1);
if (startCubeFlag) {
ProcessCube(taskIdx, innerCubeTaskIdx);
}
}
mm_.End();
}
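To see when the cube stage fires, the loop condition can be replayed in Python. The helper `cube_trigger_points` is hypothetical; it returns the `(taskIdx, innerCubeTaskIdx)` pairs at which `ProcessCube` would be called.

```python
def cube_trigger_points(start, end, tile):
    """Replay the startCubeFlag logic for tasks in [start, end)."""
    triggers = []
    for task in range(start, end):
        inner = (task - start) % tile
        if inner == tile - 1 or task == end - 1:  # full tile, or last partial tile
            triggers.append((task, inner))
    return triggers


print(cube_trigger_points(0, 10, 4))
```

With 10 tasks and a tile of 4, the cube stage runs after tasks 3 and 7 with full tiles and once more after task 9 with a partial tile of 2 rows.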
DeformableConv2dV2Kernel::ProcessVector
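[Diagram: DeformableConv2dV2Kernel::ProcessVector calls CopyInFeature]
Each call handles the input that corresponds to one point of the convolution's output feature map, i.e. one row of the im2col matrix. The values inside one kernel window are flattened and written to img2colMatGm_.
taskIdx is decoded into the corresponding (n, h_out, w_out).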
cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::ProcessVector(uint32_t taskIdx)
{
int16_t batchIdx = taskIdx / (featureMapSize_);
int16_t hOutIdx = (taskIdx % (featureMapSize_)) / wOut_;
int16_t wOutIdx = taskIdx % wOut_;
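The decoding is ordinary mixed-radix arithmetic. The Python sketch below assumes that `featureMapSize_` equals `h_out * w_out`, which the excerpt does not state explicitly; `decode_task` is a hypothetical name.

```python
def decode_task(task_idx, h_out, w_out):
    """Map a flat task index to (n, h_out_idx, w_out_idx)."""
    feature_map_size = h_out * w_out  # assumed meaning of featureMapSize_
    n = task_idx // feature_map_size
    h = (task_idx % feature_map_size) // w_out
    w = task_idx % w_out
    return n, h, w


print(decode_task(27, h_out=4, w_out=5))
```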
Loads the $\Delta p$ and $\Delta m$ belonging to the current taskIdx into local memory. Each copy moves only 18 or 9 elements, which is very small.
cpp
// CopyIn Offset
DataCopy(copyInOffsetLocal_, offsetGm_[taskIdx * OFFSET_SIZE], OFFSET_ALIGNED_SIZE);
SetFlag<HardEvent::MTE2_V>(copyInOffsetEventID);
if (modulated) {
DataCopy(maskLocal_, maskGm_[taskIdx * X_OFFSET_SIZE], X_OFFSET_ALIGNED_SIZE);
SetFlag<HardEvent::MTE2_V>(copyInMaskEventID);
}
WaitFlag<HardEvent::MTE2_V>(copyInOffsetEventID);
Separates the interleaved $(\Delta y, \Delta x)$ pairs into the independent xOffsetLocal_ and yOffsetLocal_ buffers.
Adding the convolution-window coordinates gives $p_n +\Delta p_n$.
cpp
GatherMask(xOffsetLocal_, copyInOffsetLocal_, 1, true, maskForGatherMask_, {1, 1, 8, 0}, cnt_);
GatherMask(yOffsetLocal_, copyInOffsetLocal_, 2, true, maskForGatherMask_, {1, 1, 8, 0}, cnt_);
Add(xOffsetLocal_, xOffsetLocal_, constKHIdxLocal_, X_OFFSET_ALIGNED_SIZE);
Add(yOffsetLocal_, yOffsetLocal_, constKWIdxLocal_, X_OFFSET_ALIGNED_SIZE);
The floating-point coordinates $(i+\Delta h_i, j+\Delta w_i)$ are floored to obtain the coordinates in the four directions needed for bilinear interpolation.
The fractional offsets are then computed.
cpp
Floor(topPosLocal_, xOffsetLocal_, X_OFFSET_ALIGNED_SIZE);
Floor(leftPosLocal_, yOffsetLocal_, X_OFFSET_ALIGNED_SIZE);
Adds(bottomPosLocal_, topPosLocal_, 1.0f, X_OFFSET_ALIGNED_SIZE);
Adds(rightPosLocal_, leftPosLocal_, 1.0f, X_OFFSET_ALIGNED_SIZE);
fracH and fracW are the interpolation weights along a single direction, $y-y_1$ and $x-x_1$.
cpp
Sub(fracHLocal_, xOffsetLocal_, topPosLocal_, X_OFFSET_ALIGNED_SIZE);
Sub(fracWLocal_, yOffsetLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE);
Subtracting the kernel radius from the output-point coordinates locates the start of the corresponding input region. This assumes $s_h =1, s_w=1$, yet the operator entry point enforces no such condition.
The top-left corner $(h_0,w_0)$ of the convolution window is given by:
$$\begin{aligned} h_0 &= h_o \cdot s_h - p_h \\ w_0 &= w_o \cdot s_w - p_w \end{aligned}$$
Adding the relative coordinates gives the coordinates of every point in the window, $p + p_n +\Delta p_n$.
topPosLocal_ and leftPosLocal_ are $(y_1, x_1)$.
cpp
// global position
Adds(topPosLocal_, topPosLocal_, hOutIdx - kH_ / 2 + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
Adds(leftPosLocal_, leftPosLocal_, wOutIdx - kW_ / 2 + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
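A quick numeric check shows when the shortcut `o - k / 2` agrees with the general formula. In the Python sketch below both function names are hypothetical; `v2_origin` mirrors the expression used in the kernel with integer division.

```python
def window_origin(o, stride, pad):
    """General formula: o * stride - pad."""
    return o * stride - pad


def v2_origin(o, k):
    """Shortcut used by the V2 kernel: output index minus the kernel radius."""
    return o - k // 2


# Agreement for stride 1 with pad equal to the kernel radius.
print(all(window_origin(o, 1, 1) == v2_origin(o, 3) for o in range(8)))
# Disagreement as soon as the stride is 2.
print(window_origin(1, 2, 1), v2_origin(1, 3))
```

Under this reading the shortcut also implies a padding of $\lfloor k/2 \rfloor$, not only a stride of 1: with zero padding the two formulas differ even at stride 1.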
Computes the one-dimensional memory offsets of the 4 groups of interpolation points:
$$\begin{aligned} \mathrm{offset}_1 &=(y_1\cdot W_i + x_1)C_i\\ \mathrm{offset}_2 &=(y_1\cdot W_i + x_2)C_i\\ \mathrm{offset}_3 &=(y_2\cdot W_i + x_1)C_i\\ \mathrm{offset}_4 &=(y_2\cdot W_i + x_2)C_i \end{aligned}$$
The 4 variables topLeftOffsetLocal_, topRightOffsetLocal_, bottomLeftOffsetLocal_, and bottomRightOffsetLocal_ are contiguous in memory, so a single instruction can process all of them.
cpp
// global Offset
Muls(topPosLocal_, topPosLocal_, wOut_ + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
Add(topLeftOffsetLocal_, topPosLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE); // global (h * wOut + w)
Add(topRightOffsetLocal_, topPosLocal_, rightPosLocal_, X_OFFSET_ALIGNED_SIZE);
Add(bottomLeftOffsetLocal_, bottomPosLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE);
Add(bottomRightOffsetLocal_, bottomPosLocal_, rightPosLocal_, X_OFFSET_ALIGNED_SIZE);
Muls(topLeftOffsetLocal_, topLeftOffsetLocal_, cIn_ + 0.0f, 4 * X_OFFSET_ALIGNED_SIZE);
Adds(topLeftOffsetLocal_, topLeftOffsetLocal_, batchIdx * featureMapElementsSize_ + 0.0f,
4 * X_OFFSET_ALIGNED_SIZE); // global offset
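The four offsets are plain NHWC address arithmetic. A small Python sketch, with the hypothetical name `corner_offsets` and a `batch_base` argument standing in for `batchIdx * featureMapElementsSize_`:

```python
def corner_offsets(y1, x1, w_in, c_in, batch_base=0):
    """Flat NHWC offsets of the four corners (y1,x1), (y1,x2), (y2,x1), (y2,x2)."""
    y2, x2 = y1 + 1, x1 + 1
    corners = ((y1, x1), (y1, x2), (y2, x1), (y2, x2))
    return tuple(batch_base + (y * w_in + x) * c_in for y, x in corners)


print(corner_offsets(1, 2, w_in=4, c_in=8))
```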
CompareScalar compares each element of a tensor against a scalar; each result is written to the corresponding bit of the output.
The four variables topPosLocal_, bottomPosLocal_, leftPosLocal_, and rightPosLocal_ are contiguous in memory, each of size X_OFFSET_ALIGNED_SIZE. The code uses the literal 64 as the length.
Evidently, because of address-alignment constraints, the 36 valid elements are padded to 64.
inGlobalLocal_ has size IN_GLOBAL_BUF_SIZE * sizeof(uint32_t) and records, for the 4 groups of points, whether each is within bounds in the two directions.
inGlobalLocal_ is of type uint32_t; each CompareScalar processes 64 elements and stores its result in the first 2 elements of each segment of inGlobalLocal_.
The comparisons are $0 \le y_1,\enspace 0 \le y_2,\enspace 0 \le x_1,\enspace 0 \le x_2$ and $y_1< H_i,\enspace y_2 < H_i,\enspace x_1 < W_i,\enspace x_2 < W_i$.
cpp
// in global flag
CompareScalar(inGlobalLocal_.ReinterpretCast<uint8_t>(), topPosLocal_, 0.0f, CMPMODE::GE, 64);
CompareScalar(inGlobalLocal_[8].ReinterpretCast<uint8_t>(), bottomPosLocal_, 0.0f, CMPMODE::GE, 64);
CompareScalar(inGlobalLocal_[16].ReinterpretCast<uint8_t>(), leftPosLocal_, 0.0f, CMPMODE::GE, 64);
CompareScalar(inGlobalLocal_[24].ReinterpretCast<uint8_t>(), rightPosLocal_, 0.0f, CMPMODE::GE, 64);
CompareScalar(inGlobalLocal_[32].ReinterpretCast<uint8_t>(), topPosLocal_, featureMapSize_ + 0.0f, CMPMODE::LT, 64);
CompareScalar(
inGlobalLocal_[40].ReinterpretCast<uint8_t>(), bottomPosLocal_, featureMapSize_ + 0.0f, CMPMODE::LT, 64);
CompareScalar(inGlobalLocal_[48].ReinterpretCast<uint8_t>(), leftPosLocal_, wOut_ + 0.0f, CMPMODE::LT, 64);
CompareScalar(inGlobalLocal_[56].ReinterpretCast<uint8_t>(), rightPosLocal_, wOut_ + 0.0f, CMPMODE::LT, 64);
Merges the results of the two directions, i.e. $0 \le y_1 < H_i,\enspace 0 \le y_2 < H_i,\enspace 0 \le x_1 < W_i,\enspace 0 \le x_2 < W_i$.
cpp
And(inGlobalLocal_[32].ReinterpretCast<uint16_t>(), inGlobalLocal_.ReinterpretCast<uint16_t>(),
inGlobalLocal_[32].ReinterpretCast<uint16_t>(), 64);
Computes which $(y_1, x_1)$ and $(y_2, x_2)$ are valid.
cpp
And(inGlobalLocal_.ReinterpretCast<uint16_t>(), inGlobalLocal_[32].ReinterpretCast<uint16_t>(),
inGlobalLocal_[48].ReinterpretCast<uint16_t>(), 32); // TopLeft, BottomRight
Computes which $(y_1, x_2)$ and $(y_2, x_1)$ are valid.
cpp
And(inGlobalLocal_[16].ReinterpretCast<uint16_t>(), inGlobalLocal_[32].ReinterpretCast<uint16_t>(),
inGlobalLocal_[56].ReinterpretCast<uint16_t>(), 16); // TopRight
And(inGlobalLocal_[24].ReinterpretCast<uint16_t>(), inGlobalLocal_[40].ReinterpretCast<uint16_t>(),
inGlobalLocal_[48].ReinterpretCast<uint16_t>(), 16); // BottomLeft
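Select picks elements according to the bit values of selMask (the mask used for selection).
Out-of-bounds positions in the 4 groups of points are set to -1.0f, so that the later copy can simply skip them or treat them as 0.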
cpp
Select(topLeftOffsetLocal_, inGlobalLocal_.ReinterpretCast<uint16_t>(), topLeftOffsetLocal_, -1.0f,
SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
Select(bottomRightOffsetLocal_, inGlobalLocal_[8].ReinterpretCast<uint16_t>(), bottomRightOffsetLocal_, -1.0f,
SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
Select(topRightOffsetLocal_, inGlobalLocal_[16].ReinterpretCast<uint16_t>(), topRightOffsetLocal_, -1.0f,
SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
Select(bottomLeftOffsetLocal_, inGlobalLocal_[24].ReinterpretCast<uint16_t>(), bottomLeftOffsetLocal_, -1.0f,
SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
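The compare, and, and select sequence boils down to replacing the offset of every out-of-range corner with a sentinel. The Python sketch below shows that effect element by element; `invalidate_out_of_bounds` is a hypothetical name, and the real kernel works on packed bit masks rather than Python booleans.

```python
def invalidate_out_of_bounds(offsets, ys, xs, h_in, w_in):
    """Keep an offset when its (y, x) is inside the image, else replace it by -1.0."""
    result = []
    for off, y, x in zip(offsets, ys, xs):
        inside = (0 <= y < h_in) and (0 <= x < w_in)
        result.append(off if inside else -1.0)
    return result


print(invalidate_out_of_bounds([0.0, 8.0, 16.0], ys=[0, -1, 1], xs=[0, 1, 4],
                               h_in=2, w_in=4))
```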
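A synchronization must be inserted so that the scalar unit waits for the vector unit.
oneSubFracHLocal_ and oneSubFracWLocal_ are contiguous in memory.
Computes the one-dimensional interpolation weights $y_2 - y$ and $x_2 - x$.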
cpp
SetFlag<HardEvent::V_S>(V_SEventID);
WaitFlag<HardEvent::V_S>(V_SEventID);
Muls(oneSubFracHLocal_, fracHLocal_, -1.0f, 2 * X_OFFSET_ALIGNED_SIZE);
Adds(oneSubFracHLocal_, oneSubFracHLocal_, 1.0f, 2 * X_OFFSET_ALIGNED_SIZE); // 1-fracH, 1-fracW
The modulation weight is multiplied into the one-dimensional weights along H only, $\Delta m(y -y_1)$ and $\Delta m(y_2 -y)$ (both Mul calls use length X_OFFSET_ALIGNED_SIZE), so each of the 4 interpolation weights formed later carries a single factor of $\Delta m$.
cpp
if (modulated) {
WaitFlag<HardEvent::MTE2_V>(copyInMaskEventID);
Mul(fracHLocal_, fracHLocal_, maskLocal_, X_OFFSET_ALIGNED_SIZE);
Mul(oneSubFracHLocal_, oneSubFracHLocal_, maskLocal_, X_OFFSET_ALIGNED_SIZE);
}
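Brcb takes an input tensor and, on each iteration, takes 8 values from it and fills them into 8 datablocks (32 Bytes each) of the result tensor, one value per datablock.
Multiplying the interpolation coefficients with the input requires broadcasting along the lowest dimension. In the computation below the two operands differ in length, so each coefficient is broadcast to $\frac{C_i}{8}$.
fracHBroadcastLocal_ occupies $9\times \frac{C_i}{8}\times \mathrm{block}$.
brcbParams_ sets the element stride to $\frac{C_i}{64}$ blocks and the iteration stride to $\frac{C_i}{8}$ blocks. In other words, $C_i$ is split into eight equal parts; the datablock at each split position holds a valid value and the remaining positions do not.
The total span is $16\times \frac{C_i}{64}\times\mathrm{block} = 2C_i$.
Each element of fracHLocal_ fills one datablock of fracHBroadcastLocal_, and adjacent elements are 8 datablocks apart, i.e. $\frac{C_i}{64}$.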
cpp
// Broadcast
Brcb(fracHBroadcastLocal_, fracHLocal_, 2, brcbParams_);
Brcb(fracWBroadcastLocal_, fracWLocal_, 2, brcbParams_);
Brcb(oneSubFracHBroadcastLocal_, oneSubFracHLocal_, 2, brcbParams_);
Brcb(oneSubFracWBroadcastLocal_, oneSubFracWLocal_, 2, brcbParams_);
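The strided broadcast described above can be modelled on a flat Python list. This is a simplified sketch of the `Brcb` semantics as the text describes them, not the Ascend C implementation; the helper names `brcb` and `brcb_for_channels` are hypothetical, and $C_i$ is assumed to be a multiple of 64.

```python
BLOCK_ELEMS = 8  # fp32 elements per 32-byte datablock


def brcb(dst, src, repeat_times, dst_blk_stride, dst_rep_stride):
    """Simplified model of Brcb on a flat list of fp32 values.

    Each iteration reads 8 source values and fills 8 datablocks, one value
    per datablock. Both strides are counted in datablocks.
    """
    assert len(src) >= repeat_times * 8
    for rep in range(repeat_times):
        for i in range(8):
            block = rep * dst_rep_stride + i * dst_blk_stride
            start = block * BLOCK_ELEMS
            assert start + BLOCK_ELEMS <= len(dst)
            dst[start:start + BLOCK_ELEMS] = [src[rep * 8 + i]] * BLOCK_ELEMS
    return dst


def brcb_for_channels(src16, c_in):
    """Use the strides from the text: C_i/64 blocks between elements and
    C_i/8 blocks between iterations, with two iterations (16 values)."""
    assert c_in % 64 == 0 and len(src16) == 16
    dst = [None] * (2 * c_in)  # None marks slots that Brcb does not touch
    return brcb(dst, src16, 2, c_in // 64, c_in // 8)
```

For `c_in = 128` every second datablock is filled and the ones in between stay untouched; those gaps are what the following `Copy` step fills.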
`DATA_BLOCK_SIZE` is 8, `FOUR_CORNERS` is 4, and `X_OFFSET_SIZE` is 9.
`maskForBroadcast_` equals `dataBlockPerInputChannel_ - DATA_BLOCK_SIZE`.
A single `Copy` instruction broadcasts the data of the first datablock to the other blocks within $C_i$; the overall shape is $4\times 9\times C_i$.
The number of blocks copied per iteration is:
$$\begin{aligned} N &= \left\lceil\frac{\mathrm{Mask}}{8}\right\rceil \\ &= \left\lceil\frac{\frac{C_i}{8}-8}{8}\right\rceil \\ &= \left\lceil\frac{C_i}{64}\right\rceil-1 \end{aligned}$$
The `srcRepeatSize` and `dstRepeatSize` parameters are set to $\frac{C_i}{64}$.
In the first broadcast step adjacent elements are $\frac{C_i}{64}$ apart, so each group of interpolation weights has a valid length of $\frac{9C_i}{8}$.
cpp
Copy(fracHBroadcastLocal_[DATA_BLOCK_SIZE], fracHBroadcastLocal_, maskForBroadcast_, FOUR_CORNERS * X_OFFSET_SIZE,
copyParams_);
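The block count per `Copy` iteration can be checked numerically. The sketch below only reproduces the arithmetic of the formula above; the function names are hypothetical, and reading `dataBlockPerInputChannel_` as $\frac{C_i}{8}$ is an assumption taken from that formula.

```python
def blocks_per_copy_iteration(c_in):
    """Datablocks written per Copy iteration, N = ceil(Mask / 8)."""
    assert c_in % 8 == 0 and c_in >= 64
    data_block_per_input_channel = c_in // 8   # assumed meaning of dataBlockPerInputChannel_
    mask = data_block_per_input_channel - 8    # maskForBroadcast_
    return -(-mask // 8)                       # integer ceil(mask / 8)


def closed_form(c_in):
    """The simplified expression ceil(C_i / 64) - 1."""
    return -(-c_in // 64) - 1
```

For $C_i = 256$ both expressions give 3, which is consistent with a segment of $\frac{C_i}{64} = 4$ datablocks whose first block was already written by `Brcb`.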
`DeformableConv2dV2Kernel::CopyInFeature` loads the input and interpolates it according to `topLeftOffsetLocal_` and `fracHBroadcastLocal_`.
The result in `outFeatureLocal_` is then copied to global memory.
cpp
CopyInFeature();
SetFlag<HardEvent::V_MTE3>(copyOutEventID);
WaitFlag<HardEvent::V_MTE3>(copyOutEventID);
DataCopyPad(img2colMatGm_[taskIdx * elementsCountPerTask_], outFeatureLocal_,
{1, static_cast<uint32_t>(elementsCountPerTask_ * FP32_BYTE_SIZE), 0, 0, 0});
}
DeformableConv2dV2Kernel::CopyInFeature
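The function takes no parameters, which hides the variables it depends on.
Values such as `topLeft0` are `int32_t` and should be compared against an integer rather than `-1.0f`.
The code is fully unrolled; it looks like it could be written as a `for` loop, as in V1.
After the channels of the 9 input points are loaded, they are multiplied by the weights.
`topLeftWeightLocal_` is $\Delta m \cdot w_{11}=\Delta m(y_2 - y)(x_2 - x)$.
Only the first $\frac{9C_i}{8}$ elements of `topLeftWeightLocal_` are valid.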
cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::CopyInFeature()
{
int32_t topLeft0 = topLeftOffsetLocal_.GetValue(0);
int32_t topLeft1 = topLeftOffsetLocal_.GetValue(1);
int32_t topLeft2 = topLeftOffsetLocal_.GetValue(2);
int32_t topLeft3 = topLeftOffsetLocal_.GetValue(3);
int32_t topLeft4 = topLeftOffsetLocal_.GetValue(4);
int32_t topLeft5 = topLeftOffsetLocal_.GetValue(5);
int32_t topLeft6 = topLeftOffsetLocal_.GetValue(6);
int32_t topLeft7 = topLeftOffsetLocal_.GetValue(7);
int32_t topLeft8 = topLeftOffsetLocal_.GetValue(8);
(topLeft0 == -1.0f) ? Duplicate(topLeftFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
DataCopy(topLeftFeatureLocal_[0 * cIn_], xGm_[topLeft0], cIn_);
(topLeft1 == -1.0f) ? Duplicate(topLeftFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
DataCopy(topLeftFeatureLocal_[1 * cIn_], xGm_[topLeft1], cIn_);
(topLeft2 == -1.0f) ? Duplicate(topLeftFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
DataCopy(topLeftFeatureLocal_[2 * cIn_], xGm_[topLeft2], cIn_);
(topLeft3 == -1.0f) ? Duplicate(topLeftFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
DataCopy(topLeftFeatureLocal_[3 * cIn_], xGm_[topLeft3], cIn_);
(topLeft4 == -1.0f) ? Duplicate(topLeftFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
DataCopy(topLeftFeatureLocal_[4 * cIn_], xGm_[topLeft4], cIn_);
(topLeft5 == -1.0f) ? Duplicate(topLeftFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
DataCopy(topLeftFeatureLocal_[5 * cIn_], xGm_[topLeft5], cIn_);
(topLeft6 == -1.0f) ? Duplicate(topLeftFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
DataCopy(topLeftFeatureLocal_[6 * cIn_], xGm_[topLeft6], cIn_);
(topLeft7 == -1.0f) ? Duplicate(topLeftFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
DataCopy(topLeftFeatureLocal_[7 * cIn_], xGm_[topLeft7], cIn_);
(topLeft8 == -1.0f) ? Duplicate(topLeftFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
DataCopy(topLeftFeatureLocal_[8 * cIn_], xGm_[topLeft8], cIn_);
Mul(topLeftWeightLocal_, oneSubFracHBroadcastLocal_, oneSubFracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);
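The nine unrolled load statements follow one pattern, which a loop expresses directly. The sketch below is a Python reference model of that pattern, not Ascend C code; `gather_corner` and its parameters are hypothetical names, and the sentinel is compared as an integer.

```python
def gather_corner(x_flat, offsets, c_in, invalid=-1):
    """Collect the channel vectors of the sampling points for one corner.

    x_flat  : input feature map flattened in NHWC order
    offsets : element offset of each point's first channel, or `invalid`
              when the point lies outside the image
    """
    feature = []
    for off in offsets:
        if off == invalid:                          # integer comparison, no float literal
            feature.extend([0.0] * c_in)            # stands in for Duplicate
        else:
            assert 0 <= off and off + c_in <= len(x_flat)
            feature.extend(x_flat[off:off + c_in])  # stands in for DataCopy
    return feature
```

Each of the four corners would call the same helper with its own offset list, which is the structure the unrolled code repeats four times.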
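`Mul` sets `src1BlkStride` to 0, which implements a multiplication with low-dimensional broadcast: each datablock of `topLeftWeightLocal_` is multiplied with 8 consecutive datablocks of `topLeftFeatureLocal_`.
`src1RepStride` is 1.
`repeatTimes_` equals $\frac{9C_i}{8\times 8}$, so $9C_i$ elements are processed in total.
To multiply all 9 points, the weights have to be laid out in segments of length `ci/DATA_SIZE_PER_REPEAT`.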
cpp
SetFlag<HardEvent::MTE3_V>(MTE3_VEventID);
WaitFlag<HardEvent::MTE3_V>(MTE3_VEventID);
SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
Mul(outFeatureLocal_, topLeftFeatureLocal_, topLeftWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});
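The effect of the stride parameters `{1, 1, 0, 8, 8, 1}` can be traced with a small Python model. It is a sketch under stated assumptions: the mask is taken to cover all 64 elements of a repeat, $C_i$ is a multiple of 64, and the function names are hypothetical.

```python
BLOCK = 8              # fp32 elements per datablock
BLOCKS_PER_REPEAT = 8  # datablocks processed by one repeat


def mul_broadcast(feature, weight, repeat_times):
    """Model of Mul with params {1, 1, 0, 8, 8, 1}.

    Within a repeat, dst and src0 advance one datablock at a time while
    src1 (block stride 0) keeps reading the same datablock. Between
    repeats, dst and src0 advance 8 datablocks and src1 advances 1.
    """
    assert len(feature) >= repeat_times * BLOCKS_PER_REPEAT * BLOCK
    assert len(weight) >= repeat_times * BLOCK
    out = [0.0] * (repeat_times * BLOCKS_PER_REPEAT * BLOCK)
    for rep in range(repeat_times):
        w_block = weight[rep * BLOCK:(rep + 1) * BLOCK]
        for blk in range(BLOCKS_PER_REPEAT):
            base = (rep * BLOCKS_PER_REPEAT + blk) * BLOCK
            for e in range(BLOCK):
                out[base + e] = feature[base + e] * w_block[e]
    return out


def weighted_points(feature, point_weights, c_in):
    """Lay out one weight per point as C_i/8 repeated values, then multiply."""
    assert c_in % 64 == 0
    weight = []
    for w in point_weights:
        weight.extend([w] * (c_in // 8))
    return mul_broadcast(feature, weight, len(point_weights) * c_in // 64)
```

With this layout every channel of point `p` is scaled by `point_weights[p]`, which is why only $\frac{C_i}{8}$ weight elements per point are needed.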
cpp
int32_t topRight0 = topRightOffsetLocal_.GetValue(0);
int32_t topRight1 = topRightOffsetLocal_.GetValue(1);
int32_t topRight2 = topRightOffsetLocal_.GetValue(2);
int32_t topRight3 = topRightOffsetLocal_.GetValue(3);
int32_t topRight4 = topRightOffsetLocal_.GetValue(4);
int32_t topRight5 = topRightOffsetLocal_.GetValue(5);
int32_t topRight6 = topRightOffsetLocal_.GetValue(6);
int32_t topRight7 = topRightOffsetLocal_.GetValue(7);
int32_t topRight8 = topRightOffsetLocal_.GetValue(8);
(topRight0 == -1.0f) ? Duplicate(topRightFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
DataCopy(topRightFeatureLocal_[0 * cIn_], xGm_[topRight0], cIn_);
(topRight1 == -1.0f) ? Duplicate(topRightFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
DataCopy(topRightFeatureLocal_[1 * cIn_], xGm_[topRight1], cIn_);
(topRight2 == -1.0f) ? Duplicate(topRightFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
DataCopy(topRightFeatureLocal_[2 * cIn_], xGm_[topRight2], cIn_);
(topRight3 == -1.0f) ? Duplicate(topRightFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
DataCopy(topRightFeatureLocal_[3 * cIn_], xGm_[topRight3], cIn_);
(topRight4 == -1.0f) ? Duplicate(topRightFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
DataCopy(topRightFeatureLocal_[4 * cIn_], xGm_[topRight4], cIn_);
(topRight5 == -1.0f) ? Duplicate(topRightFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
DataCopy(topRightFeatureLocal_[5 * cIn_], xGm_[topRight5], cIn_);
(topRight6 == -1.0f) ? Duplicate(topRightFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
DataCopy(topRightFeatureLocal_[6 * cIn_], xGm_[topRight6], cIn_);
(topRight7 == -1.0f) ? Duplicate(topRightFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
DataCopy(topRightFeatureLocal_[7 * cIn_], xGm_[topRight7], cIn_);
(topRight8 == -1.0f) ? Duplicate(topRightFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
DataCopy(topRightFeatureLocal_[8 * cIn_], xGm_[topRight8], cIn_);
Mul(topRightWeightLocal_, oneSubFracHBroadcastLocal_, fracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);
SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
MulAddDst(outFeatureLocal_, topRightFeatureLocal_, topRightWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});
int32_t bottomLeft0 = bottomLeftOffsetLocal_.GetValue(0);
int32_t bottomLeft1 = bottomLeftOffsetLocal_.GetValue(1);
int32_t bottomLeft2 = bottomLeftOffsetLocal_.GetValue(2);
int32_t bottomLeft3 = bottomLeftOffsetLocal_.GetValue(3);
int32_t bottomLeft4 = bottomLeftOffsetLocal_.GetValue(4);
int32_t bottomLeft5 = bottomLeftOffsetLocal_.GetValue(5);
int32_t bottomLeft6 = bottomLeftOffsetLocal_.GetValue(6);
int32_t bottomLeft7 = bottomLeftOffsetLocal_.GetValue(7);
int32_t bottomLeft8 = bottomLeftOffsetLocal_.GetValue(8);
(bottomLeft0 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
DataCopy(bottomLeftFeatureLocal_[0 * cIn_], xGm_[bottomLeft0], cIn_);
(bottomLeft1 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
DataCopy(bottomLeftFeatureLocal_[1 * cIn_], xGm_[bottomLeft1], cIn_);
(bottomLeft2 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
DataCopy(bottomLeftFeatureLocal_[2 * cIn_], xGm_[bottomLeft2], cIn_);
(bottomLeft3 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
DataCopy(bottomLeftFeatureLocal_[3 * cIn_], xGm_[bottomLeft3], cIn_);
(bottomLeft4 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
DataCopy(bottomLeftFeatureLocal_[4 * cIn_], xGm_[bottomLeft4], cIn_);
(bottomLeft5 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
DataCopy(bottomLeftFeatureLocal_[5 * cIn_], xGm_[bottomLeft5], cIn_);
(bottomLeft6 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
DataCopy(bottomLeftFeatureLocal_[6 * cIn_], xGm_[bottomLeft6], cIn_);
(bottomLeft7 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
DataCopy(bottomLeftFeatureLocal_[7 * cIn_], xGm_[bottomLeft7], cIn_);
(bottomLeft8 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
DataCopy(bottomLeftFeatureLocal_[8 * cIn_], xGm_[bottomLeft8], cIn_);
Mul(bottomLeftWeightLocal_, oneSubFracWBroadcastLocal_, fracHBroadcastLocal_, 9 * dataBlockPerInputChannel_);
SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
MulAddDst(
outFeatureLocal_, bottomLeftFeatureLocal_, bottomLeftWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});
int32_t bottomRight0 = bottomRightOffsetLocal_.GetValue(0);
int32_t bottomRight1 = bottomRightOffsetLocal_.GetValue(1);
int32_t bottomRight2 = bottomRightOffsetLocal_.GetValue(2);
int32_t bottomRight3 = bottomRightOffsetLocal_.GetValue(3);
int32_t bottomRight4 = bottomRightOffsetLocal_.GetValue(4);
int32_t bottomRight5 = bottomRightOffsetLocal_.GetValue(5);
int32_t bottomRight6 = bottomRightOffsetLocal_.GetValue(6);
int32_t bottomRight7 = bottomRightOffsetLocal_.GetValue(7);
int32_t bottomRight8 = bottomRightOffsetLocal_.GetValue(8);
(bottomRight0 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[0 * cIn_], 0.0f, cIn_) :
DataCopy(bottomRightFeatureLocal_[0 * cIn_], xGm_[bottomRight0], cIn_);
(bottomRight1 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[1 * cIn_], 0.0f, cIn_) :
DataCopy(bottomRightFeatureLocal_[1 * cIn_], xGm_[bottomRight1], cIn_);
(bottomRight2 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[2 * cIn_], 0.0f, cIn_) :
DataCopy(bottomRightFeatureLocal_[2 * cIn_], xGm_[bottomRight2], cIn_);
(bottomRight3 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[3 * cIn_], 0.0f, cIn_) :
DataCopy(bottomRightFeatureLocal_[3 * cIn_], xGm_[bottomRight3], cIn_);
(bottomRight4 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[4 * cIn_], 0.0f, cIn_) :
DataCopy(bottomRightFeatureLocal_[4 * cIn_], xGm_[bottomRight4], cIn_);
(bottomRight5 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[5 * cIn_], 0.0f, cIn_) :
DataCopy(bottomRightFeatureLocal_[5 * cIn_], xGm_[bottomRight5], cIn_);
(bottomRight6 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[6 * cIn_], 0.0f, cIn_) :
DataCopy(bottomRightFeatureLocal_[6 * cIn_], xGm_[bottomRight6], cIn_);
(bottomRight7 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[7 * cIn_], 0.0f, cIn_) :
DataCopy(bottomRightFeatureLocal_[7 * cIn_], xGm_[bottomRight7], cIn_);
(bottomRight8 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[8 * cIn_], 0.0f, cIn_) :
DataCopy(bottomRightFeatureLocal_[8 * cIn_], xGm_[bottomRight8], cIn_);
Mul(bottomRightWeightLocal_, fracHBroadcastLocal_, fracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);
SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);
MulAddDst(
outFeatureLocal_, bottomRightFeatureLocal_, bottomRightWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});
}
DeformableConv2dV2Kernel::ProcessCube
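`innerCubeTaskIdx` is the index of the last element. The starting index is assumed to be 0, which gives `cubeTaskCount`, the number of im2col rows.
`elementsCountPerTask_` is $k_h k_w C_i$.
`aOffset` and `cOffset` are the starting offsets of the current core in matrices A and C, respectively.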
cpp
template<bool modulated>
__aicore__ inline void DeformableConv2dV2Kernel<modulated>::ProcessCube(
uint32_t taskIdx, const int32_t& innerCubeTaskIdx)
{
int32_t cubeTaskCount = innerCubeTaskIdx + 1;
uint64_t aOffset = (taskIdx - innerCubeTaskIdx) * elementsCountPerTask_;
uint64_t cOffset = (taskIdx - innerCubeTaskIdx) * cOut_;
`SetTensorA` sets the left matrix A of the matmul.
`SetTensorB` sets the right matrix B; the `true` argument marks it as transposed.
`SetSingleShape` sets the single-core Matmul shape `singleMIn`, `singleNIn`, `singleKIn`, in elements.
`IterateAll` computes a C matrix of size `singleCoreM * singleCoreN`. The iteration order can be adjusted through the tiling parameter `iterateOrder`.
img2col has shape $128\times k_h k_w C_i$, weight has shape $C_o \times k_h k_w C_i$, and the output has shape $128\times C_o$.
cpp
mm_.SetTensorA(img2colMatGm_[aOffset]);
mm_.SetTensorB(weightGm_, true);
mm_.SetSingleShape(cubeTaskCount, cOut_, elementsCountPerTask_);
mm_.template IterateAll<false>(yGm_[cOffset]);
}
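The offset arithmetic and the matmul shapes of `ProcessCube` can be written down as a small reference model. This is a sketch for checking the indexing, not the Ascend C Matmul API; `cube_task` and `matmul_bt` are hypothetical helpers operating on flat row-major lists.

```python
def cube_task(task_idx, inner_cube_task_idx, elements_per_task, c_out):
    """Return (M, aOffset, cOffset) for one matmul call."""
    assert 0 <= inner_cube_task_idx <= task_idx
    rows = inner_cube_task_idx + 1                # cubeTaskCount
    first_task = task_idx - inner_cube_task_idx   # first im2col row of this batch
    return rows, first_task * elements_per_task, first_task * c_out


def matmul_bt(a, b, m, n, k):
    """C[m x n] = A[m x k] @ B[n x k]^T on flat row-major lists."""
    assert len(a) >= m * k and len(b) >= n * k
    return [
        sum(a[i * k + p] * b[j * k + p] for p in range(k))
        for i in range(m)
        for j in range(n)
    ]
```

For a batch that ends at `task_idx = 255` with `inner_cube_task_idx = 127`, the model yields 128 rows starting at im2col row 128, matching the $128\times k_h k_w C_i$ shape described above.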