modulated_deform_conv2d in Ascend DrivingSDK (Part 1)

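Ascend DrivingSDK is an operator and model acceleration library for autonomous-driving workloads, built on the Ascend NPU platform. It provides a set of high-performance operators and model acceleration interfaces and supports the PyTorch framework.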

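modulated_deform_conv2d is one of the few fused operators in Ascend DrivingSDK: a single kernel carries out the whole Deformable Convolution. The fusion comes at a price, however, because the 910B uses separate vector cores and cube cores, and synchronizing the two is expensive. The 910B series also has a large L2 cache, 96 MB to 192 MB, which is enabled by default, so the inputs of modulated_deform_conv2d mostly reside in L2.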

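At the Ascend C level, modulated_deform_conv2d is backed by two operators; v2 is optimized for 3x3 convolutions. Oddly, the optimization was added as a new operator rather than as a new kernel of the existing one, and, for reasons unknown, the two versions take their parameters in different orders.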

op	PreProcess	ComputeWeight	ComputeBilinearInterpolation	ProcessCube	Output
deformable_conv2d	Caches the convolution-window indices of one row and reuses them $H_o$ times	Vectorized weight computation for $W_o k_h k_w$ points	Loads $4C_i$ values per copy, then interpolates	Computes $W_o$ outputs per call	Keeps the im2col buffer for $\frac{\partial L}{\partial W}$
deformable_conv2d_v2	/	Vectorized weight computation for $k_h k_w$ points	$k_h k_w$ load instructions of $C_i$ values each, then interpolates	Computes 128 output points per call	/

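v2 outperforms v1, most likely because v2 synchronizes its memory copies less often (9:4). Global memory latency is high: v1 copies $4C_i$ values with a single instruction, whereas v2 loads only $C_i$ values per instruction but synchronizes only after 9 such instructions. v1 caches the im2col buffer to save computation, but the saving is modest, because computing $\frac{\partial L}{\partial \Delta \mathbf{m}_n}$ still needs the bilinear interpolation results.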

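Both operators are functionally incomplete: they lack support for the half type, for deform_group, for bias, and so on, and v2 is the more rudimentary of the two. None of this is documented or guarded against, so using them takes some trial and error. It is hard to believe this is industrial-grade code, let alone automotive-grade. Its one merit is that, like AMD's software, it is open source, leaving users to track down and fix problems themselves.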

ModulatedDeformConv2dFunction

The forward pass first converts the inputs to NHWC layout.

class ModulatedDeformConv2dFunction(Function):

    @staticmethod

    @custom_fwd(cast_inputs=torch.float32)

    # pylint: disable=huawei-too-many-arguments

    def forward(

        ctx,

        x: torch.Tensor,

        offset: torch.Tensor,

        mask: torch.Tensor,

        weight: torch.Tensor,

        bias: Optional[nn.Parameter] = None,

        stride: Union[int, Tuple[int, ...]] = 1,

        padding: Union[int, Tuple[int, ...]] = 0,

        dilation: Union[int, Tuple[int, ...]] = 1,

        groups: int = 1,

        deformable_groups: int = 1,

    ):

        ctx.kernel_size = [weight.size(2), weight.size(3)]

        ctx.stride = _pair(stride)

        ctx.padding = _pair(padding)

        ctx.dilation = _pair(dilation)

        ctx.groups = groups

        ctx.deformable_groups = deformable_groups

        nhwc_x = x.permute(0, 2, 3, 1).contiguous()

        nhwc_offset = offset.permute(0, 2, 3, 1).contiguous()

        nhwc_weight = weight.permute(0, 2, 3, 1).contiguous()

        nhwc_mask = mask.permute(0, 2, 3, 1).contiguous()

        out, offset_output = mx_driving._C.modulated_deformable_conv2d(

            nhwc_x,

            nhwc_offset,

            nhwc_mask,

            nhwc_weight,

            None,

            ctx.kernel_size,

            ctx.stride,

            ctx.padding,

            ctx.dilation,

            ctx.groups,

            ctx.deformable_groups,

            False,

        )

        ctx.save_for_backward(nhwc_x, nhwc_offset, nhwc_weight, nhwc_mask, offset_output)

        return out
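
The layout conversion in forward can be followed on shapes alone. The sketch below is illustrative only: permute_shape is a hypothetical helper (not part of mx_driving or PyTorch) and the tensor sizes are made up. It shows why the C++ side later reads the channel count from dimension 3 and the kernel size from dimensions 1 and 2 of weight.

```python
def permute_shape(shape, dims):
    """Return the shape that tensor.permute(*dims) would produce."""
    return tuple(shape[d] for d in dims)

# Hypothetical sizes: N=2, C_i=256, H=W=32, C_o=128, 3x3 kernel, one group.
x_shape = (2, 256, 32, 32)        # N, C_i, H_i, W_i
weight_shape = (128, 256, 3, 3)   # C_o, C_i/G, k_h, k_w
offset_shape = (2, 18, 32, 32)    # N, 2*k_h*k_w, H_o, W_o

nhwc = (0, 2, 3, 1)
print(permute_shape(x_shape, nhwc))       # (2, 32, 32, 256)
print(permute_shape(weight_shape, nhwc))  # (128, 3, 3, 256)
print(permute_shape(offset_shape, nhwc))  # (2, 32, 32, 18)
```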

ModulatedDeformConv2dFunction.backward

The backward pass permutes $\frac{\partial L}{\partial Y}$ to $N \times H_o \times C_o \times W_o$.

    @staticmethod

    @once_differentiable

    @custom_bwd

    # pylint: disable=huawei-too-many-arguments,too-many-return-values

    def backward(ctx, grad_out):

        nhwc_x, nhwc_offset, nhwc_weight, nhwc_mask, offset_output = ctx.saved_tensors

        nhwc_grad_out = grad_out.permute(0, 2, 1, 3).contiguous()

        grad_x, grad_weight, _, grad_offset, grad_mask = mx_driving._C.modulated_deformable_conv2d_backward(

            nhwc_x,

            nhwc_offset,

            nhwc_mask,

            nhwc_weight,

            None,

            offset_output,

            nhwc_grad_out,

            ctx.kernel_size,

            ctx.stride,

            ctx.padding,

            ctx.dilation,

            ctx.groups,

            ctx.deformable_groups,

            False,

        )

        return (

            grad_x,

            grad_offset,

            grad_mask,

            grad_weight,

            None,

            None,

            None,

            None,

            None,

            None,

        )
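
Two details of backward are easy to check in isolation: the permutation applied to grad_out, and the fact that the returned tuple has one slot per forward argument. The following is a rough sketch with hypothetical names (FORWARD_ARGS, pack_grads) and made-up sizes, not code from the SDK.

```python
# Forward arguments after ctx, in declaration order.
FORWARD_ARGS = ("x", "offset", "mask", "weight", "bias",
                "stride", "padding", "dilation", "groups", "deformable_groups")

def pack_grads(grad_x, grad_offset, grad_mask, grad_weight):
    """Place each gradient in the slot of its forward argument; the rest get None."""
    grads = {"x": grad_x, "offset": grad_offset, "mask": grad_mask, "weight": grad_weight}
    return tuple(grads.get(name) for name in FORWARD_ARGS)

# grad_out arrives as N x C_o x H_o x W_o; permute(0, 2, 1, 3) swaps C_o and H_o.
grad_out_shape = (2, 128, 32, 32)
print(tuple(grad_out_shape[d] for d in (0, 2, 1, 3)))  # (2, 32, 128, 32)
print(len(pack_grads("gx", "go", "gm", "gw")))         # 10
```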

modulated_deformable_conv2d

[Call graph: modulated_deformable_conv2d → DeformableConv2dV2 when group = 1, otherwise → DeformableConv2d]

TORCH_CHECK_NPU verifies that each input tensor resides on an NPU device.

std::tuple<at::Tensor, at::Tensor> modulated_deformable_conv2d(const at::Tensor& input, const at::Tensor& offset,

    const at::Tensor& mask, const at::Tensor& weight, const c10::optional<at::Tensor>& bias_opt,

    at::IntArrayRef kernel_size, at::IntArrayRef stride, at::IntArrayRef padding, at::IntArrayRef dilation,

    int64_t groups, int64_t deformable_groups, int64_t with_bias)

{

    TORCH_CHECK_NPU(input);

    TORCH_CHECK_NPU(offset);

    TORCH_CHECK_NPU(mask);

    TORCH_CHECK_NPU(weight);

Next, the dimensions and parameters are validated.

    TORCH_CHECK(input.dim() == INPUT_DIM, "input must to be a 4D Tensor, but got: ", input.dim());

    TORCH_CHECK(offset.dim() == INPUT_DIM, "offset has to be a 4D Tensor, but got: ", offset.dim());

    TORCH_CHECK(mask.dim() == INPUT_DIM, "mask has to be a 4D Tensor, but got: ", mask.dim());

    TORCH_CHECK(weight.dim() == INPUT_DIM, "weight has to be a 4D Tensor, but got: ", weight.dim());

    TORCH_CHECK(stride[0] > 0 && stride[1] > 0, "stride must be greater than 0");

    TORCH_CHECK(kernel_size[0] > 0 && kernel_size[1] > 0, "kernel_size must be greater than 0");

    TORCH_CHECK(dilation[0] > 0 && dilation[1] > 0, "dilation must be greater than 0");

c10::value_or_else is deprecated; std::optional::value_or is the recommended replacement.

The optional bias_opt is unwrapped safely.

    const at::Tensor& bias = c10::value_or_else(bias_opt, [] { return at::Tensor(); });

    uint32_t n = static_cast<uint32_t>(input.size(0));

    uint32_t c_in = static_cast<uint32_t>(input.size(3));

    uint32_t h_in = static_cast<uint32_t>(input.size(1));

    uint32_t w_in = static_cast<uint32_t>(input.size(2));

    uint32_t h_out = static_cast<uint32_t>(offset.size(1));

    uint32_t w_out = static_cast<uint32_t>(offset.size(2));

    uint32_t c_out = static_cast<uint32_t>(weight.size(0));

    uint32_t kh = static_cast<uint32_t>(weight.size(1));

    uint32_t kw = static_cast<uint32_t>(weight.size(2));

    TORCH_CHECK(kh == kernel_size[0] && kw == kernel_size[1], "kernel size mismatch");

    TORCH_CHECK(mask.size(-1) == kh * kw, "The shape of the mask is invalid");

    TORCH_CHECK(groups > 0, "groups must be greater than 0");

    TORCH_CHECK(c_out % groups == 0, "weight's out channel should be divided by groups");

    TORCH_CHECK(c_in % groups == 0, "input's channel should be divided by groups");

    bool modulated = true;

If the convolution is ungrouped and the number of input channels is 256 or 512, DeformableConv2dV2 is called; otherwise DeformableConv2d is called. The two operators take their parameters in different orders.

DeformableConv2dV2 has two outputs: output has shape $N \times H_o \times W_o \times C_o$ and offset_output has shape $N \times H_o W_o \times k_h k_w \times C_i$.
DeformableConv2d has two outputs: output has shape $N \times H_o \times C_o \times W_o$ and offset_output has shape $N \times H_o \times W_o \times G \times \frac{k_h k_w C_i}{G}$.

Note: DeformableConv2dV2 requires $k_h k_w = 9$, but no such condition is checked here.

    if ((groups == 1) && ((c_in == CHANNEL_256) || (c_in == CHANNEL_512))) {

        at::Tensor output = at::empty({n, h_out, w_out, c_out}, input.options());

        at::Tensor offset_output = at::empty({n, h_out * w_out, kh * kw, c_in}, input.options());

        EXEC_NPU_CMD(aclnnDeformableConv2dV2, input, offset, mask, weight, bias, kernel_size, stride, padding, dilation,

            groups, deformable_groups, modulated, with_bias, output, offset_output);

        output = output.permute({0, 3, 1, 2});

        return std::tie(output, offset_output);

    } else {

        at::Tensor output = at::empty({n, h_out, c_out, w_out}, input.options());

        at::Tensor offset_output = at::empty({n, h_out, w_out, groups, kh * kw * c_in / groups}, input.options());

        EXEC_NPU_CMD(aclnnDeformableConv2d, input, weight, bias, offset, mask, kernel_size, stride, padding, dilation,

            groups, deformable_groups, modulated, with_bias, output, offset_output);

        output = output.permute({0, 2, 1, 3});

        return std::tie(output, offset_output);

    }

}
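
The dispatch rule and the two output layouts can be summarized in a few lines. This is a sketch in Python rather than the SDK's C++; select_operator_guarded is a hypothetical stricter variant that adds the 3x3 check the text notes is missing, and the constant values are taken from the prose (256 and 512).

```python
CHANNEL_256, CHANNEL_512 = 256, 512

def select_operator(groups, c_in):
    """Mirror of the branch condition in modulated_deformable_conv2d."""
    if groups == 1 and c_in in (CHANNEL_256, CHANNEL_512):
        return "DeformableConv2dV2"
    return "DeformableConv2d"

def select_operator_guarded(groups, c_in, kh, kw):
    """Hypothetical variant: fall back to v1 unless the kernel is 3x3."""
    if (kh, kw) == (3, 3):
        return select_operator(groups, c_in)
    return "DeformableConv2d"

def output_shapes(op, n, h_out, w_out, c_out, c_in, kh, kw, groups):
    """Shapes of (output, offset_output) as allocated before the kernel launch."""
    if op == "DeformableConv2dV2":
        return (n, h_out, w_out, c_out), (n, h_out * w_out, kh * kw, c_in)
    return (n, h_out, c_out, w_out), (n, h_out, w_out, groups, kh * kw * c_in // groups)

print(select_operator(1, 256))                # DeformableConv2dV2
print(select_operator_guarded(1, 256, 5, 5))  # DeformableConv2d
```

With the unguarded rule, a 5x5 kernel with 256 input channels would still be routed to v2.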

DeformableConv2dKernel::Process

[Call graph: DeformableConv2dKernel::Process → PreProcess, ProcessVector, ProcessCube]

template<bool modulated>

__aicore__ inline void DeformableConv2dKernel<modulated>::Process()

{

    PreProcess();

    for (uint32_t taskIdx = start_; taskIdx < end_; ++taskIdx) {

        ProcessVector(taskIdx);

        ProcessCube(taskIdx);

    }

    mm_.End();

}

DeformableConv2dKernel::PreProcess

All vector cores cooperate to compute the indices needed for one output row, in an Allgather-like pattern.
TBuf::Get returns a Tensor of a given length from a TBuf, or one spanning the whole buffer.
auxH and auxW each hold $W_o k_h k_w$ elements and store the convolution-window coordinates $(h_i, w_i)$ for one output row. The window origin is

$$\begin{aligned} w_{i\_start} &= w_o \cdot s - p \\ h_{i\_start} &= h_o \cdot s - p \end{aligned}$$

Because auxH and auxW are computed once and reused for every output row, auxH stores only the offset within the window; $h_o \cdot s$ is not added yet.
auxStart_ is the first index handled by the current core.

template<bool modulated>

__aicore__ inline void DeformableConv2dKernel<modulated>::PreProcess()

{

    LocalTensor<float> auxH = auxHBuf_.Get<float>();

    LocalTensor<float> auxW = auxWBuf_.Get<float>();

    uint32_t idx = 0;

    for (int32_t w = auxStart_; w < auxEnd_; ++w) {

        for (int32_t i = 0; i < kH_; ++i) {

            for (int32_t j = 0; j < kW_; ++j) {

                auxW.SetValue(idx, static_cast<float>(w * strideW_ - padW_ + j * dilationW_));

                auxH.SetValue(idx, static_cast<float>(-padH_ + i * dilationH_));

                ++idx;

            }

        }

    }
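
To make the contents of auxW and auxH concrete, here is a plain-Python transcription of the triple loop. build_aux is a hypothetical name, and the parameter values in the example (3x3 kernel, stride 1, padding 1, dilation 1) are chosen only for illustration.

```python
def build_aux(w_start, w_end, kh, kw, stride_w, pad_w, pad_h, dil_w, dil_h):
    """Window coordinates for output columns [w_start, w_end), kernel-major per column."""
    aux_w, aux_h = [], []
    for w in range(w_start, w_end):
        for i in range(kh):
            for j in range(kw):
                aux_w.append(float(w * stride_w - pad_w + j * dil_w))  # absolute column
                aux_h.append(float(-pad_h + i * dil_h))                # row relative to h_o * s
    return aux_w, aux_h

aux_w, aux_h = build_aux(0, 2, 3, 3, 1, 1, 1, 1, 1)
print(aux_w[:9])  # [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
print(aux_h[:9])  # [-1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```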

GlobalTensor::operator[] returns a new GlobalTensor shifted by the given offset.
valRptTimes_ is the number of repeats needed to copy $C_i$ elements.

Each core copies the indices it computed locally to auxWGm_ and auxHGm_.

    DataCopyPad(auxWGm_[auxStart_ * kernelSize_], auxW,

        {1, static_cast<uint16_t>(B32_BYTE_SIZE * (auxEnd_ - auxStart_) * kernelSize_), 0, 0});

    DataCopyPad(auxHGm_[auxStart_ * kernelSize_], auxH,

        {1, static_cast<uint16_t>(B32_BYTE_SIZE * (auxEnd_ - auxStart_) * kernelSize_), 0, 0});

    SyncAll();

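After the synchronization, every core copies the complete index table for one output row back from global memory.

Note: one concern is that the two round trips to global memory have high latency. Kernels are usually 3x3, so when $W_o$ is small it may be cheaper for each core to compute the whole table itself than to compute it cooperatively.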

    DataCopy(auxW, auxWGm_, {1, rowOffsetBlk_, 0, 0});

    DataCopy(auxH, auxHGm_, {1, rowOffsetBlk_, 0, 0});
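
The compute-slice, publish, synchronize, read-all sequence can be simulated sequentially. The sketch below assumes an even split of the output columns across cores (split_range is hypothetical; the SDK's actual tiling is not shown in this article) and uses a Python list in place of auxWGm_.

```python
def split_range(total, num_cores):
    """Assumed even split of [0, total) across cores; the real tiling may differ."""
    base, extra = divmod(total, num_cores)
    bounds, start = [], 0
    for core in range(num_cores):
        end = start + base + (1 if core < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def gather_row_indices(w_out, kh, kw, stride_w, pad_w, dil_w, num_cores):
    kernel_size = kh * kw
    global_aux_w = [None] * (w_out * kernel_size)      # stands in for auxWGm_
    for start, end in split_range(w_out, num_cores):   # one iteration per core
        local = [float(w * stride_w - pad_w + j * dil_w)
                 for w in range(start, end) for _ in range(kh) for j in range(kw)]
        global_aux_w[start * kernel_size:end * kernel_size] = local  # publish the slice
    # SyncAll() would sit here; afterwards every core reads the full row.
    return list(global_aux_w)

print(split_range(5, 3))  # [(0, 2), (2, 4), (4, 5)]
```

Calling gather_row_indices with num_cores=1 corresponds to the alternative the note mentions, where each core computes the whole table on its own.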

feature is then cleared to zero.

    LocalTensor<float> feature = featureBuf_.Get<float>();

    Duplicate<float, false>(feature, 0.f, MASK_PLACEHOLDER, 4 * valRptTimes_, 1, 8);

}

DeformableConv2dKernel::ProcessVector

[Call graph: DeformableConv2dKernel::ProcessVector → CopyInOffset, ComputeWeight, ComputeBilinearInterpolation]

template<bool modulated>

__aicore__ inline void DeformableConv2dKernel<modulated>::ProcessVector(uint32_t taskIdx)

{

    uint32_t batch = taskIdx / hOut_;

    srcOffset_ = batch * hIn_ * wIn_ * cIn_;

    dstOffset_ = taskIdx * rowIn_;

    LocalTensor<float> offset = offsetBuf_.Get<float>();

    LocalTensor<float> auxW = auxWBuf_.Get<float>();

    LocalTensor<float> auxH = auxHBuf_.Get<float>();

    LocalTensor<int32_t> offsetInt = offsetIntBuf_.Get<int32_t>();

    LocalTensor<float> weight = weightBuf_.Get<float>();

    LocalTensor<float> feature = featureBuf_.Get<float>();

    LocalTensor<float> mask;

    if (modulated) {

        mask = maskBuf_.Get<float>();

    }

    LocalTensor<float> offsetOutput = offsetOutputBuf_.Get<float>();

DeformableConv2dKernel::CopyInOffset copies one row of $\Delta p_n$ and $\Delta m$ and de-interleaves $\Delta p_n$.
DeformableConv2dKernel::ComputeWeight computes the sampling positions and interpolation weights.

    CopyInOffset(taskIdx, offset, mask);

    ComputeWeight(taskIdx, auxW, auxH, offset, offsetInt, weight, mask);

    SetFlag<HardEvent::V_MTE2>(calEvt_);

    WaitFlag<HardEvent::V_MTE2>(calEvt_);

    SetFlag<HardEvent::MTE3_V>(0);

    SetFlag<HardEvent::MTE3_V>(1);

    uint8_t ping = 0;

DeformableConv2dKernel::ComputeBilinearInterpolation loads the features, interpolates, and stores the result.

    for (uint32_t w = 0; w < wOut_; ++w) {

        WaitFlag<HardEvent::MTE3_V>(ping);

        ComputeBilinearInterpolation(w, offset, offsetInt, feature, weight, offsetOutput[ping * kwIn_]);

        SetFlag<HardEvent::MTE3_V>(ping);

        ping = 1 - ping;

    }

    WaitFlag<HardEvent::MTE3_V>(0);

    WaitFlag<HardEvent::MTE3_V>(1);

}

DeformableConv2dKernel::CopyInOffset


template<bool modulated>

__aicore__ inline void DeformableConv2dKernel<modulated>::CopyInOffset(

    uint32_t taskIdx, const LocalTensor<float>& offset, const LocalTensor<float>& mask)

{

    uint32_t offsetIdx = taskIdx * rowOffset_ * 2;

    DataCopy(offset, offsetGm_[offsetIdx], {1, doubleRowOffsetBlk_, 0, 0});

    if (modulated) {

        DataCopy(mask, maskGm_[taskIdx * rowOffset_], {1, rowOffsetBlk_, 0, 0});

    }

    SetFlag<HardEvent::MTE2_V>(copyEvt_);

    WaitFlag<HardEvent::MTE2_V>(copyEvt_);

    uint64_t cnt;

    GatherMask(offset[2 * alignedRowOffset_], offset, 2, false, MASK_PLACEHOLDER, gatherParams_, cnt);

    GatherMask(offset[3 * alignedRowOffset_], offset, 1, false, MASK_PLACEHOLDER, gatherParams_, cnt);

    SetVectorMask<float>(FULL_MASK, FULL_MASK);

}
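
The two GatherMask calls amount to splitting an interleaved array into its even and odd elements. The sketch below rests on my reading that built-in pattern 1 keeps the first element of every pair and pattern 2 the second, and that the offsets are stored as $(\Delta y, \Delta x)$ pairs; deinterleave is a hypothetical helper, not an Ascend C API.

```python
def deinterleave(offsets):
    """Split [dy0, dx0, dy1, dx1, ...] into separate dy and dx lists."""
    dy = offsets[0::2]  # assumed pattern 1: first element of every pair
    dx = offsets[1::2]  # assumed pattern 2: second element of every pair
    return dy, dx

dy, dx = deinterleave([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
print(dy)  # [0.1, 0.3, 0.5]
print(dx)  # [0.2, 0.4, 0.6]
```

In the kernel the pattern-2 result lands in the third quarter of offset and the pattern-1 result in the fourth, so that a single Add can later combine them with the auxW and auxH halves.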

DeformableConv2dKernel::ComputeWeight

offset is scratch space and offsetInt and weight are outputs, yet all of them are passed as const references.

template<bool modulated>

__aicore__ inline void DeformableConv2dKernel<modulated>::ComputeWeight(uint32_t taskIdx,

    const LocalTensor<float>& auxW, const LocalTensor<float>& auxH, const LocalTensor<float>& offset,

    const LocalTensor<int32_t>& offsetInt, const LocalTensor<float>& weight, const LocalTensor<float>& mask)

{

offset holds $4 \times W_o k_h k_w$ elements and serves as temporary storage.

A Copy instruction fetches $x_i$.
h is $y_o$; adding $y_o \cdot s$ to auxH gives the actual coordinate $y_i$.
The first half of offset now holds the convolution-window positions $p + p_n$.

    int32_t h = taskIdx % hOut_;

    Copy<float, false>(offset, auxW, MASK_PLACEHOLDER, rptTimes_, {1, 1, 8, 8});

    Adds<float, false>(offset[alignedRowOffset_], auxH, float(h * strideH_), MASK_PLACEHOLDER, rptTimes_, {1, 1, 8, 8});

Because the buffers are contiguous, a single Add instruction computes the floating-point coordinates $p + p_n + \Delta p_n$.

    Add<float, false>(

        offset, offset, offset[2 * alignedRowOffset_], MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 1, 8, 8, 8});

offsetInt receives the integer coordinates $(y_1, x_1)$, and the second half of offset stores the top-left coordinates as floats.

    Cast<int32_t, float, false>(

        offsetInt, offset, RoundMode::CAST_FLOOR, MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 8, 8});

    Cast<float, int32_t, false>(

        offset[2 * alignedRowOffset_], offsetInt, RoundMode::CAST_NONE, MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 8, 8});

The first half becomes the differences $y - y_1$ and $x - x_1$.
weight is filled with 1, so, since $y_2 = y_1 + 1$ and $x_2 = x_1 + 1$, the second half becomes $y_2 - y$ and $x_2 - x$.

    Sub<float, false>(

        offset, offset, offset[2 * alignedRowOffset_], MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 1, 8, 8, 8}); // lw, lh

    Duplicate<float, false>(weight, 1.f, MASK_PLACEHOLDER, 2 * rptTimes_, 1, 8);

    Sub<float, false>(

        offset[2 * alignedRowOffset_], weight, offset, MASK_PLACEHOLDER, 2 * rptTimes_, {1, 1, 1, 8, 8, 8}); // hw, hh

Multiplying the two directions gives the weights of the four corner points:
$$\begin{aligned} w_{11} &= (x_2 - x)(y_2 - y) \\ w_{21} &= (x - x_1)(y_2 - y) \\ w_{12} &= (x_2 - x)(y - y_1) \\ w_{22} &= (x - x_1)(y - y_1) \end{aligned}$$

    Mul<float, false>(weight, offset[2 * alignedRowOffset_], offset[3 * alignedRowOffset_], MASK_PLACEHOLDER, rptTimes_,

        {1, 1, 1, 8, 8, 8}); // hw * hh

    Mul<float, false>(weight[alignedRowOffset_], offset, offset[3 * alignedRowOffset_], MASK_PLACEHOLDER, rptTimes_,

        {1, 1, 1, 8, 8, 8}); // lw * hh

    Mul<float, false>(weight[2 * alignedRowOffset_], offset[alignedRowOffset_], offset[2 * alignedRowOffset_],

        MASK_PLACEHOLDER, rptTimes_, {1, 1, 1, 8, 8, 8}); // hw * lh

    Mul<float, false>(weight[3 * alignedRowOffset_], offset, offset[alignedRowOffset_], MASK_PLACEHOLDER, rptTimes_,

        {1, 1, 1, 8, 8, 8}); // lh * lw

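The modulation scalar $\Delta m$ is multiplied into the four interpolation weights.
weight has shape $4 \times W_o k_h k_w$ while mask has shape $W_o k_h k_w$; because the lengths differ, Mul has to be called four times.

The code comments in this block are leftovers that were never removed.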

    if (modulated) {

        Mul<float, false>(weight, weight, mask, MASK_PLACEHOLDER, rptTimes_, {1, 1, 1, 8, 8, 8});

        Mul<float, false>(weight[alignedRowOffset_], weight[alignedRowOffset_], mask, MASK_PLACEHOLDER, rptTimes_,

            {1, 1, 1, 8, 8, 8}); // lw * hh

        Mul<float, false>(weight[2 * alignedRowOffset_], weight[2 * alignedRowOffset_], mask, MASK_PLACEHOLDER,

            rptTimes_, {1, 1, 1, 8, 8, 8}); // hw * lh

        Mul<float, false>(weight[3 * alignedRowOffset_], weight[3 * alignedRowOffset_], mask, MASK_PLACEHOLDER,

            rptTimes_, {1, 1, 1, 8, 8, 8}); // lh * lw

    }

}
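
For a single sampling point, the whole of ComputeWeight reduces to a handful of scalar operations. The sketch below follows the same order of steps (floor, subtract, one-minus, products, modulation) with a hypothetical function name; the kernel performs them on whole rows with vector instructions.

```python
import math

def bilinear_weights(x, y, mask=1.0):
    """Return the integer corner (x1, y1) and the modulated weights (w11, w21, w12, w22)."""
    x1, y1 = math.floor(x), math.floor(y)
    lw, lh = x - x1, y - y1        # x - x1, y - y1
    hw, hh = 1.0 - lw, 1.0 - lh    # x2 - x, y2 - y
    w11 = hw * hh
    w21 = lw * hh
    w12 = hw * lh
    w22 = lw * lh
    return (x1, y1), tuple(w * mask for w in (w11, w21, w12, w22))

print(bilinear_weights(1.25, 2.5))       # ((1, 2), (0.375, 0.125, 0.375, 0.125))
print(bilinear_weights(1.25, 2.5, 0.5))  # ((1, 2), (0.1875, 0.0625, 0.1875, 0.0625))
```

Without modulation the four weights sum to 1; with it they sum to the mask value.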

DeformableConv2dKernel::ComputeBilinearInterpolation

The offset parameter is unused.

template<bool modulated>

__aicore__ inline void DeformableConv2dKernel<modulated>::ComputeBilinearInterpolation(uint32_t w,

    const LocalTensor<float>& offset, const LocalTensor<int32_t>& offsetInt, const LocalTensor<float>& feature,

    const LocalTensor<float>& weight, const LocalTensor<float>& offsetOutput)

{

offsetOutput, of shape $k_h k_w C_i$, is first cleared to zero.

    Duplicate<float, false>(offsetOutput, 0.f, MASK_PLACEHOLDER, kernelSize_ * valRptTimes_, 1, 8);

    uint8_t ping = 0;

    uint32_t kernelOffset = w * kernelSize_;

    SetFlag<HardEvent::V_MTE2>(0);

    SetFlag<HardEvent::V_MTE2>(1);

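The argument w is $x_o$.
pw and ph are indices into the arrays.
gmOffset is the flat offset of the input point.
SetFlag is a synchronization instruction between different pipelines within one core.

The Ascend C best practices recommend moving data in blocks that are as large as possible.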

#pragma bisheng auto_sync parallel

    for (uint32_t kIdx = 0; kIdx < kernelSize_; ++kIdx) {

        uint32_t pw = kIdx + kernelOffset;

        uint32_t ph = pw + alignedRowOffset_;

        int32_t w0 = offsetInt.GetValue(pw);

        int32_t h0 = offsetInt.GetValue(ph);

        int32_t w1 = w0 + 1;

        int32_t h1 = h0 + 1;

        uint32_t outOffset = kIdx * cIn_;

        uint32_t ftOffset = ping * featureOffset_;

        WaitFlag<HardEvent::V_MTE2>(ping);

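For each input point $(y, x)$, if $(y_1, x_1)$, $(y_1, x_2)$, $(y_2, x_1)$ and $(y_2, x_2)$ all lie inside the image, the four points are loaded in one copy. When some corners fall outside, the remaining branches load only the in-bounds ones.
Axpy multiplies the source elements by a scalar and accumulates the products into the destination.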

        if (0 < h1 && h1 < hIn_) {

            if (0 < w1 && w1 < wIn_) {

                uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;

                DataCopy(feature[ftOffset], xGm_[gmOffset], cpQuadValParams_);

                SetFlag<HardEvent::MTE2_V>(copyEvt_);

                WaitFlag<HardEvent::MTE2_V>(copyEvt_);

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),

                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),

                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],

                    weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],

                    weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

            } else if (w1 == 0) {

                uint64_t gmOffset = srcOffset_ + (h0 * wIn_) * cIn_;

                DataCopy(feature[ftOffset + cIn_], xGm_[gmOffset], cpColDoubleValParams_);

                SetFlag<HardEvent::MTE2_V>(copyEvt_);

                WaitFlag<HardEvent::MTE2_V>(copyEvt_);

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),

                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],

                    weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

            } else if (w1 == wIn_) {

                uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;

                DataCopy(feature[ftOffset], xGm_[gmOffset], cpColDoubleValParams_);

                SetFlag<HardEvent::MTE2_V>(copyEvt_);

                WaitFlag<HardEvent::MTE2_V>(copyEvt_);

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),

                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],

                    weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

            }

        } else if (h1 == 0) {

            if (0 < w1 && w1 < wIn_) {

                uint64_t gmOffset = srcOffset_ + w0 * cIn_;

                DataCopy(feature[ftOffset + 2 * cIn_], xGm_[gmOffset], cpRowDoubleValParams_);

                SetFlag<HardEvent::MTE2_V>(copyEvt_);

                WaitFlag<HardEvent::MTE2_V>(copyEvt_);

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],

                    weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],

                    weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

            } else if (w1 == 0) {

                uint64_t gmOffset = srcOffset_;

                DataCopy(feature[ftOffset + 3 * cIn_], xGm_[gmOffset], cpOneValParams_);

                SetFlag<HardEvent::MTE2_V>(copyEvt_);

                WaitFlag<HardEvent::MTE2_V>(copyEvt_);

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 3 * cIn_],

                    weight.GetValue(ph + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

            } else if (w1 == wIn_) {

                uint64_t gmOffset = srcOffset_ + w0 * cIn_;

                DataCopy(feature[ftOffset + 2 * cIn_], xGm_[gmOffset], cpOneValParams_);

                SetFlag<HardEvent::MTE2_V>(copyEvt_);

                WaitFlag<HardEvent::MTE2_V>(copyEvt_);

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + 2 * cIn_],

                    weight.GetValue(pw + 2 * alignedRowOffset_), MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

            }

        } else if (h1 == hIn_) {

            if (0 < w1 && w1 < wIn_) {

                uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;

                DataCopy(feature[ftOffset], xGm_[gmOffset], cpRowDoubleValParams_);

                SetFlag<HardEvent::MTE2_V>(copyEvt_);

                WaitFlag<HardEvent::MTE2_V>(copyEvt_);

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),

                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),

                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

            } else if (w1 == 0) {

                uint64_t gmOffset = srcOffset_ + (h0 * wIn_) * cIn_;

                DataCopy(feature[ftOffset + cIn_], xGm_[gmOffset], cpOneValParams_);

                SetFlag<HardEvent::MTE2_V>(copyEvt_);

                WaitFlag<HardEvent::MTE2_V>(copyEvt_);

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset + cIn_], weight.GetValue(ph),

                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

            } else if (w1 == wIn_) {

                uint64_t gmOffset = srcOffset_ + (h0 * wIn_ + w0) * cIn_;

                DataCopy(feature[ftOffset], xGm_[gmOffset], cpOneValParams_);

                SetFlag<HardEvent::MTE2_V>(copyEvt_);

                WaitFlag<HardEvent::MTE2_V>(copyEvt_);

                PipeBarrier<PIPE_V>();

                Axpy<float, float, false>(offsetOutput[outOffset], feature[ftOffset], weight.GetValue(pw),

                    MASK_PLACEHOLDER, valRptTimes_, {1, 1, 8, 8});

            }

        }

        SetFlag<HardEvent::V_MTE2>(ping);

        ping = 1 - ping;

    }

The interpolated $k_h k_w C_i$ values are copied out group by group into the global buffer offsetOutputGm_, whose per-row shape is $G \times W_o \times \frac{k_h k_w C_i}{G}$.

    SetFlag<HardEvent::V_MTE3>(calEvt_);

    WaitFlag<HardEvent::V_MTE3>(calEvt_);
    for (uint32_t i = 0; i < groups_; ++i) {

        DataCopy(offsetOutputGm_[dstOffset_ + rowInPerGroup_ * i], offsetOutput[i * cInPerGroup_], cpOffsetOutParams_);

    }

    dstOffset_ += kwInPerGroup_;

    WaitFlag<HardEvent::V_MTE2>(0);

    WaitFlag<HardEvent::V_MTE2>(1);

}
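
The long chain of branches implements one idea: accumulate weight times feature vector for every corner that lies inside the image and skip the others. A compact reference version for one sampling point is sketched below, assuming an NHWC image stored as nested lists; sample_bilinear is a hypothetical name, and the unmodulated weights are recomputed inline rather than read from the weight buffer.

```python
import math

def sample_bilinear(feature, y, x):
    """feature is an H x W x C nested list (one NHWC image); returns C values."""
    h_in, w_in, c_in = len(feature), len(feature[0]), len(feature[0][0])
    y1, x1 = math.floor(y), math.floor(x)
    ly, lx = y - y1, x - x1
    corners = [
        (y1,     x1,     (1 - ly) * (1 - lx)),
        (y1,     x1 + 1, (1 - ly) * lx),
        (y1 + 1, x1,     ly * (1 - lx)),
        (y1 + 1, x1 + 1, ly * lx),
    ]
    out = [0.0] * c_in                       # like Duplicate(offsetOutput, 0)
    for h, w, weight in corners:
        if 0 <= h < h_in and 0 <= w < w_in:  # out-of-image corners contribute zero
            for c in range(c_in):            # like Axpy: out += weight * feature[h][w]
                out[c] += weight * feature[h][w][c]
    return out

feature = [[[1.0], [2.0]], [[3.0], [4.0]]]  # 2 x 2 image, one channel
print(sample_bilinear(feature, 0.5, 0.5))   # [2.5]
print(sample_bilinear(feature, -0.5, 0.0))  # [0.5]
```

The kernel unrolls the bounds test into nine cases so that each case can issue a single DataCopy covering exactly the in-bounds corners.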

DeformableConv2dKernel::ProcessCube

The second argument of SetTensorB marks the right-hand matrix B of the matmul as transposed.

The batched matmul is implemented as a loop over groups: weight has shape $G \times \frac{C_o}{G} \times \frac{k_h k_w C_i}{G}$, im2col has shape $G \times W_o \times \frac{k_h k_w C_i}{G}$, and the output has shape $G \times \frac{C_o}{G} \times W_o$.

Across cores, the output therefore has shape $N \times H_o \times C_o \times W_o$.

template<bool modulated>

__aicore__ inline void DeformableConv2dKernel<modulated>::ProcessCube(uint32_t taskIdx)

{

    uint64_t aOffset = 0;

    uint64_t bOffset = taskIdx * rowIn_;

    uint64_t cOffset = taskIdx * rowOut_;

    for (uint32_t i = 0; i < groups_; ++i) {

        mm_.SetTensorA(weightGm_[aOffset]);

        mm_.SetTensorB(offsetOutputGm_[bOffset], true);

        mm_.template IterateAll<false>(yGm_[cOffset]);

        aOffset += kernelPerGroup_;

        bOffset += rowInPerGroup_;

        cOffset += rowOutPerGroup_;

    }

}
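
The per-group product is $A B^\top$ with A taken from weight and B from the im2col buffer. A small pure-Python sketch with made-up numbers shows how the shapes line up; grouped_matmul is a hypothetical helper, not the Matmul API used by the kernel.

```python
def grouped_matmul(weight, im2col):
    """weight: [G][Co/G][K], im2col: [G][Wo][K] -> out: [G][Co/G][Wo] (A @ B^T per group)."""
    out = []
    for a, b in zip(weight, im2col):  # one matmul per group, as in the loop over groups_
        out.append([[sum(p * q for p, q in zip(row_a, row_b)) for row_b in b]
                    for row_a in a])
    return out

weight = [[[1, 0], [0, 1]], [[1, 1], [2, 0]]]                  # G=2, Co/G=2, K=2
im2col = [[[1, 2], [3, 4], [5, 6]], [[1, 1], [0, 2], [2, 2]]]  # G=2, Wo=3, K=2
print(grouped_matmul(weight, im2col))
# [[[1, 3, 5], [2, 4, 6]], [[2, 2, 4], [2, 0, 4]]]
```

Concatenating the groups along the first two axes gives a $C_o \times W_o$ block per output row, which is where the $H_o \times C_o \times W_o$ ordering comes from.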

DeformableConv2dV2Kernel::Process

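[Call graph: DeformableConv2dV2Kernel::Process → ProcessVector, ProcessCube]

Each call to DeformableConv2dV2Kernel::ProcessVector produces one row of the im2col matrix. Once cubeTileTaskCount_ rows have accumulated, or the last task has been reached, DeformableConv2dV2Kernel::ProcessCube performs the convolution.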

template<bool modulated>

__aicore__ inline void DeformableConv2dV2Kernel<modulated>::Process()

{

    for (int32_t taskIdx = start_; taskIdx < end_; taskIdx++) {

        ProcessVector(taskIdx);

        int32_t innerCubeTaskIdx = (taskIdx - start_) % cubeTileTaskCount_;

        bool startCubeFlag = (innerCubeTaskIdx == cubeTileTaskCount_ - 1) || (taskIdx == end_ - 1);

        if (startCubeFlag) {

            ProcessCube(taskIdx, innerCubeTaskIdx);

        }

    }

    mm_.End();

}
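
The batching rule is easiest to see by listing the tasks at which ProcessCube fires. The sketch below reproduces the loop's index arithmetic with a hypothetical function name and arbitrary example values.

```python
def cube_launches(start, end, tile):
    """Return (taskIdx, innerCubeTaskIdx) pairs at which ProcessCube would run."""
    launches = []
    for task in range(start, end):
        inner = (task - start) % tile
        if inner == tile - 1 or task == end - 1:  # tile is full, or this is the last task
            launches.append((task, inner))
    return launches

print(cube_launches(0, 10, 4))  # [(3, 3), (7, 3), (9, 1)]
```

The last launch carries inner = 1, which tells ProcessCube that the final tile holds only two rows.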

DeformableConv2dV2Kernel::ProcessVector

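[Call graph: DeformableConv2dV2Kernel::ProcessVector → CopyInFeature]

Each call handles the inputs that correspond to one point of the output feature map, that is, one row of the im2col matrix: the values sampled in one kernel window are flattened and written to img2colMatGm_.

taskIdx is first decoded into (n, h_out, w_out).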

template<bool modulated>

__aicore__ inline void DeformableConv2dV2Kernel<modulated>::ProcessVector(uint32_t taskIdx)

{

    int16_t batchIdx = taskIdx / (featureMapSize_);

    int16_t hOutIdx = (taskIdx % (featureMapSize_)) / wOut_;

    int16_t wOutIdx = taskIdx % wOut_;
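
# The decoding is ordinary mixed-radix arithmetic. A sketch in Python, assuming
# featureMapSize_ equals H_o * W_o (inferred from how it is used, not stated in
# the source); decode_task is a hypothetical name:

```python
def decode_task(task_idx, h_out, w_out):
    """Map a flat task index to (batch, output row, output column)."""
    feature_map_size = h_out * w_out  # assumed meaning of featureMapSize_
    batch_idx = task_idx // feature_map_size
    h_idx = (task_idx % feature_map_size) // w_out
    w_idx = task_idx % w_out
    return batch_idx, h_idx, w_idx

print(decode_task(27, 4, 5))  # (1, 1, 2)
```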

The $\Delta p$ and $\Delta m$ for the current taskIdx are loaded into local memory. Each copy moves only 18 or 9 elements, which is too few to be efficient.

    // CopyIn Offset

    DataCopy(copyInOffsetLocal_, offsetGm_[taskIdx * OFFSET_SIZE], OFFSET_ALIGNED_SIZE);

    SetFlag<HardEvent::MTE2_V>(copyInOffsetEventID);

    if (modulated) {

        DataCopy(maskLocal_, maskGm_[taskIdx * X_OFFSET_SIZE], X_OFFSET_ALIGNED_SIZE);

        SetFlag<HardEvent::MTE2_V>(copyInMaskEventID);

    }

    WaitFlag<HardEvent::MTE2_V>(copyInOffsetEventID);

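The interleaved $(\Delta y, \Delta x)$ pairs are split into the separate buffers xOffsetLocal_ and yOffsetLocal_. Despite its name, xOffsetLocal_ appears to carry the vertical component: it is added to constKHIdxLocal_ and later floored into topPosLocal_.

Adding the convolution-window coordinates gives $p_n + \Delta p_n$.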

    GatherMask(xOffsetLocal_, copyInOffsetLocal_, 1, true, maskForGatherMask_, {1, 1, 8, 0}, cnt_);

    GatherMask(yOffsetLocal_, copyInOffsetLocal_, 2, true, maskForGatherMask_, {1, 1, 8, 0}, cnt_);

    Add(xOffsetLocal_, xOffsetLocal_, constKHIdxLocal_, X_OFFSET_ALIGNED_SIZE);

    Add(yOffsetLocal_, yOffsetLocal_, constKWIdxLocal_, X_OFFSET_ALIGNED_SIZE);

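Rounding the floating-point coordinates $(i + \Delta h_i, j + \Delta w_i)$ down, and adding 1, gives the coordinates in the four directions (top, left, bottom, right) needed for bilinear interpolation.

The fractional offsets follow.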

    Floor(topPosLocal_, xOffsetLocal_, X_OFFSET_ALIGNED_SIZE);

    Floor(leftPosLocal_, yOffsetLocal_, X_OFFSET_ALIGNED_SIZE);

    Adds(bottomPosLocal_, topPosLocal_, 1.0f, X_OFFSET_ALIGNED_SIZE);

    Adds(rightPosLocal_, leftPosLocal_, 1.0f, X_OFFSET_ALIGNED_SIZE);

fracH and fracW are the interpolation weights along a single direction, $y - y_1$ and $x - x_1$.

    Sub(fracHLocal_, xOffsetLocal_, topPosLocal_, X_OFFSET_ALIGNED_SIZE);

    Sub(fracWLocal_, yOffsetLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE);

Subtracting the kernel radius from the output coordinate locates the start of the corresponding input region. This assumes $s_h = 1, s_w = 1$ (and, comparing with the formula below, a padding of $\lfloor k/2 \rfloor$), yet the operator entry point enforces no such condition.

In general, the top-left corner $(h_0, w_0)$ of the convolution window is
$$\begin{aligned} h_0 &= h_o \cdot s_h - p_h \\ w_0 &= w_o \cdot s_w - p_w \end{aligned}$$

Adding it to the relative coordinates gives the coordinates of every point in the window, $p + p_n + \Delta p_n$.
topPosLocal_ and leftPosLocal_ hold $(y_1, x_1)$.

    // global position

    Adds(topPosLocal_, topPosLocal_, hOutIdx - kH_ / 2 + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);

    Adds(leftPosLocal_, leftPosLocal_, wOutIdx - kW_ / 2 + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);
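
# A quick numeric check of when the shortcut "output index minus kernel radius"
# agrees with the general window-origin formula. Both function names are
# hypothetical; the values are arbitrary:

```python
def window_origin_general(o, stride, pad):
    """General formula: o * s - p."""
    return o * stride - pad

def window_origin_v2(o, k):
    """Shortcut used by the v2 kernel: o - k / 2 (integer division)."""
    return o - k // 2

# Agreement for a 3x3 kernel with stride 1 and padding 1 ...
print(window_origin_general(5, 1, 1), window_origin_v2(5, 3))  # 4 4
# ... but not once the stride changes.
print(window_origin_general(5, 2, 1), window_origin_v2(5, 3))  # 9 4
```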

The flat memory offsets of the four groups of interpolation points are:
$$\begin{aligned} \mathrm{offset}_1 &= (y_1 \cdot W_i + x_1) C_i \\ \mathrm{offset}_2 &= (y_1 \cdot W_i + x_2) C_i \\ \mathrm{offset}_3 &= (y_2 \cdot W_i + x_1) C_i \\ \mathrm{offset}_4 &= (y_2 \cdot W_i + x_2) C_i \end{aligned}$$
topLeftOffsetLocal_, topRightOffsetLocal_, bottomLeftOffsetLocal_ and bottomRightOffsetLocal_ are contiguous in memory, so a single instruction can process all four. Note that the code multiplies by wOut_ where the formula has $W_i$, which only agrees when the input and output widths coincide.

    // global Offset

    Muls(topPosLocal_, topPosLocal_, wOut_ + 0.0f, 2 * X_OFFSET_ALIGNED_SIZE);

    Add(topLeftOffsetLocal_, topPosLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE); // global (h * wOut + w)

    Add(topRightOffsetLocal_, topPosLocal_, rightPosLocal_, X_OFFSET_ALIGNED_SIZE);

    Add(bottomLeftOffsetLocal_, bottomPosLocal_, leftPosLocal_, X_OFFSET_ALIGNED_SIZE);

    Add(bottomRightOffsetLocal_, bottomPosLocal_, rightPosLocal_, X_OFFSET_ALIGNED_SIZE);

    Muls(topLeftOffsetLocal_, topLeftOffsetLocal_, cIn_ + 0.0f, 4 * X_OFFSET_ALIGNED_SIZE);

    Adds(topLeftOffsetLocal_, topLeftOffsetLocal_, batchIdx * featureMapElementsSize_ + 0.0f,

        4 * X_OFFSET_ALIGNED_SIZE); // global offset
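
For one sampling point the four offsets work out as follows. This is a scalar sketch with a hypothetical function name; feature_map_elems stands in for featureMapElementsSize_, which I assume to be the element count of one image, and the sizes in the example are made up.

```python
def corner_offsets(y1, x1, w_in, c_in, batch_idx, feature_map_elems):
    """Flat NHWC offsets of the top-left, top-right, bottom-left, bottom-right corners."""
    y2, x2 = y1 + 1, x1 + 1
    base = batch_idx * feature_map_elems
    return tuple(base + (y * w_in + x) * c_in
                 for y, x in ((y1, x1), (y1, x2), (y2, x1), (y2, x2)))

# 4 x 4 image with 8 channels, so one image holds 4 * 4 * 8 = 128 elements.
print(corner_offsets(1, 2, 4, 8, 0, 128))  # (48, 56, 80, 88)
print(corner_offsets(1, 2, 4, 8, 1, 128))  # (176, 184, 208, 216)
```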

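CompareScalar compares each element of a tensor against a scalar and writes the result to the corresponding bit of the output.
topPosLocal_, bottomPosLocal_, leftPosLocal_ and rightPosLocal_ are contiguous in memory, each of size X_OFFSET_ALIGNED_SIZE. The length is hard-coded as 64 here.

Evidently, because of address alignment, the 36 valid elements are padded to 64.
inGlobalLocal_ occupies IN_GLOBAL_BUF_SIZE * sizeof(uint32_t) bytes and records, for the four groups of points, whether they are in bounds along each direction.
inGlobalLocal_ is of type uint32_t; each CompareScalar processes 64 elements and stores its 64 result bits in the first two elements of its segment.

The comparisons are $0 \le y_1,\ 0 \le y_2,\ 0 \le x_1,\ 0 \le x_2$ and $y_1 < H_i,\ y_2 < H_i,\ x_1 < W_i,\ x_2 < W_i$. Since topPosLocal_ and bottomPosLocal_ were already multiplied by wOut_ in the previous step, their upper bound is featureMapSize_ rather than the height itself.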

    // in global flag

    CompareScalar(inGlobalLocal_.ReinterpretCast<uint8_t>(), topPosLocal_, 0.0f, CMPMODE::GE, 64);

    CompareScalar(inGlobalLocal_[8].ReinterpretCast<uint8_t>(), bottomPosLocal_, 0.0f, CMPMODE::GE, 64);

    CompareScalar(inGlobalLocal_[16].ReinterpretCast<uint8_t>(), leftPosLocal_, 0.0f, CMPMODE::GE, 64);

    CompareScalar(inGlobalLocal_[24].ReinterpretCast<uint8_t>(), rightPosLocal_, 0.0f, CMPMODE::GE, 64);

    CompareScalar(inGlobalLocal_[32].ReinterpretCast<uint8_t>(), topPosLocal_, featureMapSize_ + 0.0f, CMPMODE::LT, 64);

    CompareScalar(

        inGlobalLocal_[40].ReinterpretCast<uint8_t>(), bottomPosLocal_, featureMapSize_ + 0.0f, CMPMODE::LT, 64);

    CompareScalar(inGlobalLocal_[48].ReinterpretCast<uint8_t>(), leftPosLocal_, wOut_ + 0.0f, CMPMODE::LT, 64);

    CompareScalar(inGlobalLocal_[56].ReinterpretCast<uint8_t>(), rightPosLocal_, wOut_ + 0.0f, CMPMODE::LT, 64);
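
To see why 64 compared elements fit into two uint32_t values, the bit packing can be modelled in a few lines of Python. This is a simplified sketch, not the hardware behaviour: `compare_scalar_ge` is a hypothetical name, and the bit order inside each byte (least significant bit first) is an assumption.

```python
def compare_scalar_ge(values, scalar):
    """Pack the results of (v >= scalar) into bytes, one bit per element.

    Assumption: element i goes to bit (i % 8) of byte (i // 8).
    """
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v >= scalar:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


values = [-1.0, 0.0, 2.0] + [5.0] * 61   # 64 elements, only the first is negative
bits = compare_scalar_ge(values, 0.0)    # 64 bits -> 8 bytes -> two uint32_t
```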

Combine the results of the two comparisons, i.e. $0 \le y_1 < H_i,\enspace 0 \le y_2 < H_i,\enspace 0 \le x_1 < W_i,\enspace 0 \le x_2 < W_i$.

cpp 复制代码

    And(inGlobalLocal_[32].ReinterpretCast<uint16_t>(), inGlobalLocal_.ReinterpretCast<uint16_t>(),

        inGlobalLocal_[32].ReinterpretCast<uint16_t>(), 64);

Compute the validity of $(y_1, x_1)$ and $(y_2, x_2)$.

cpp 复制代码

    And(inGlobalLocal_.ReinterpretCast<uint16_t>(), inGlobalLocal_[32].ReinterpretCast<uint16_t>(),

        inGlobalLocal_[48].ReinterpretCast<uint16_t>(), 32); // TopLeft, BottomRight

Compute the validity of $(y_1, x_2)$ and $(y_2, x_1)$.

cpp 复制代码

    And(inGlobalLocal_[16].ReinterpretCast<uint16_t>(), inGlobalLocal_[32].ReinterpretCast<uint16_t>(),

        inGlobalLocal_[56].ReinterpretCast<uint16_t>(), 16); // TopRight

    And(inGlobalLocal_[24].ReinterpretCast<uint16_t>(), inGlobalLocal_[40].ReinterpretCast<uint16_t>(),

        inGlobalLocal_[48].ReinterpretCast<uint16_t>(), 16); // BottomLeft
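
The sequence of And calls can be summarised with integer bitmasks. The sketch below is an illustration only; `in_range_mask` and `corner_masks` are hypothetical names. Note one deliberate simplification: the kernel compares row positions that were already multiplied by the width, so its upper bound is featureMapSize_, while the sketch compares unscaled rows against the height, which is equivalent if featureMapSize_ equals height times width (an assumption).

```python
def in_range_mask(values, upper):
    """Bit i is set when 0 <= values[i] < upper (lower AND upper bound check)."""
    ge = sum(1 << i for i, v in enumerate(values) if v >= 0)
    lt = sum(1 << i for i, v in enumerate(values) if v < upper)
    return ge & lt


def corner_masks(top, bottom, left, right, height, width):
    """A corner is valid only if both its row and its column are in range."""
    t, b = in_range_mask(top, height), in_range_mask(bottom, height)
    l, r = in_range_mask(left, width), in_range_mask(right, width)
    return {
        "top_left": t & l,
        "top_right": t & r,
        "bottom_left": b & l,
        "bottom_right": b & r,
    }


masks = corner_masks(top=[-1, 0, 3], bottom=[0, 1, 4],
                     left=[-1, 2, 3], right=[0, 3, 4],
                     height=4, width=4)
```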

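Select picks elements according to the bit values of selMask (the selection mask).

Out-of-bounds positions of the four groups of points are set to -1.0f, so that the later copy can skip them or treat them as 0.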

cpp 复制代码

    Select(topLeftOffsetLocal_, inGlobalLocal_.ReinterpretCast<uint16_t>(), topLeftOffsetLocal_, -1.0f,

        SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);

    Select(bottomRightOffsetLocal_, inGlobalLocal_[8].ReinterpretCast<uint16_t>(), bottomRightOffsetLocal_, -1.0f,

        SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);

    Select(topRightOffsetLocal_, inGlobalLocal_[16].ReinterpretCast<uint16_t>(), topRightOffsetLocal_, -1.0f,

        SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);

    Select(bottomLeftOffsetLocal_, inGlobalLocal_[24].ReinterpretCast<uint16_t>(), bottomLeftOffsetLocal_, -1.0f,

        SELMODE::VSEL_TENSOR_SCALAR_MODE, 16);
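
The effect of Select in tensor-scalar mode can be sketched as follows. `select_or_sentinel` is a hypothetical helper, and the mask is modelled as a Python integer whose bit i belongs to element i (an assumption about bit order).

```python
def select_or_sentinel(offsets, mask_bits, sentinel=-1.0):
    """Keep offsets[i] where bit i of mask_bits is set, otherwise use the sentinel."""
    return [off if (mask_bits >> i) & 1 else sentinel
            for i, off in enumerate(offsets)]


selected = select_or_sentinel([1472.0, 1536.0, 2112.0], mask_bits=0b101)
```

Using a negative sentinel is safe because a valid flat offset is never negative.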

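A synchronization point is needed here so that the scalar unit waits for the vector unit.
oneSubFracHLocal_ and oneSubFracWLocal_ are contiguous in memory.

Compute the one-dimensional interpolation weights $y_2 - y$ and $x_2 - x$.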

cpp 复制代码

    SetFlag<HardEvent::V_S>(V_SEventID);

    WaitFlag<HardEvent::V_S>(V_SEventID);

    Muls(oneSubFracHLocal_, fracHLocal_, -1.0f, 2 * X_OFFSET_ALIGNED_SIZE);

    Adds(oneSubFracHLocal_, oneSubFracHLocal_, 1.0f, 2 * X_OFFSET_ALIGNED_SIZE); // 1-fracH, 1-fracW

Multiply the modulation scalar into the interpolation weights. The code only scales the two vertical weights, giving $\Delta m(y -y_1)$ and $\Delta m(y_2 -y)$; the horizontal weights $x - x_1$ and $x_2 - x$ are left unchanged, so each later product of a vertical and a horizontal weight contains $\Delta m$ exactly once.

cpp 复制代码

    if (modulated) {

        WaitFlag<HardEvent::MTE2_V>(copyInMaskEventID);

        Mul(fracHLocal_, fracHLocal_, maskLocal_, X_OFFSET_ALIGNED_SIZE);

        Mul(oneSubFracHLocal_, oneSubFracHLocal_, maskLocal_, X_OFFSET_ALIGNED_SIZE);

    }
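
The weight computation up to this point can be written out for a single sampling point. This is a sketch with hypothetical names; it follows the Mul calls above in folding the modulation value into the vertical weights only, and it pairs the weights the same way the later Mul calls in CopyInFeature do.

```python
def corner_weights(frac_h, frac_w, mask=1.0):
    """Bilinear weights of the four corners, with the modulation applied once.

    frac_h = y - y1, frac_w = x - x1; mask is the modulation scalar.
    """
    one_sub_h = 1.0 - frac_h          # y2 - y
    one_sub_w = 1.0 - frac_w          # x2 - x
    frac_h_m = frac_h * mask          # modulation on the vertical weights only
    one_sub_h_m = one_sub_h * mask
    return {
        "top_left": one_sub_h_m * one_sub_w,
        "top_right": one_sub_h_m * frac_w,
        "bottom_left": frac_h_m * one_sub_w,
        "bottom_right": frac_h_m * frac_w,
    }


w = corner_weights(frac_h=0.25, frac_w=0.5, mask=0.5)
```

The four weights sum to the modulation value, which is a quick sanity check that the mask was not applied twice.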

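Brcb takes an input tensor and, in each iteration, reads 8 values from it and fills them into 8 datablocks (32 bytes each) of the result tensor, one value per datablock.

Multiplying the interpolation coefficients with the input requires a broadcast along the lowest dimension. In the computation below the two operands have different lengths, so each coefficient is broadcast to $\frac{C_i}{8}$ elements.
fracHBroadcastLocal_ has size $9\times \frac{C_i}{8}\times \mathrm{block}$.
brcbParams_ sets the stride between elements to $\frac{C_i}{64}$ blocks and the stride between iterations to $\frac{C_i}{8}$ blocks. In other words, each span of $C_i$ is divided into eight equal parts; the datablock at the start of each part holds a valid value, and the other positions are not valid yet.

The span covered is $16\times \frac{C_i}{64}\times\mathrm{block} = 2C_i$.
Each element of fracHLocal_ fills one datablock of fracHBroadcastLocal_. Each iteration handles 8 elements, and adjacent elements are $\frac{C_i}{64}$ datablocks apart.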

cpp 复制代码

    // Broadcast

    Brcb(fracHBroadcastLocal_, fracHLocal_, 2, brcbParams_);

    Brcb(fracWBroadcastLocal_, fracWLocal_, 2, brcbParams_);

    Brcb(oneSubFracHBroadcastLocal_, oneSubFracHLocal_, 2, brcbParams_);

    Brcb(oneSubFracWBroadcastLocal_, oneSubFracWLocal_, 2, brcbParams_);
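
The strided layout produced by Brcb is easier to see in a small model. The function below is a simplified emulation under stated assumptions: `brcb` and its parameter names are hypothetical, a datablock holds 8 float32 values, and positions that Brcb does not write are marked with `None`.

```python
BLOCK_ELEMS = 8  # float32 elements in one 32-byte datablock


def brcb(src, repeat_times, dst_blk_stride, dst_rep_stride, dst_len):
    """Emulate Brcb: every source element fills one whole datablock of dst.

    Strides are given in datablocks, as described in the text above.
    """
    dst = [None] * dst_len
    for rep in range(repeat_times):
        for k in range(8):                       # 8 source elements per iteration
            value = src[rep * 8 + k]
            block = rep * dst_rep_stride + k * dst_blk_stride
            start = block * BLOCK_ELEMS
            dst[start:start + BLOCK_ELEMS] = [value] * BLOCK_ELEMS
    return dst


c_in = 128                                       # example channel count
src = [float(i) for i in range(16)]
dst = brcb(src, repeat_times=2,
           dst_blk_stride=c_in // 64,            # C_i / 64 blocks between elements
           dst_rep_stride=c_in // 8,             # C_i / 8 blocks between iterations
           dst_len=2 * c_in)
```

With $C_i = 128$ every coefficient owns two datablocks, of which only the first is filled; the following Copy step is what fills the rest.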

DATA_BLOCK_SIZE is 8, FOUR_CORNERS is 4, and X_OFFSET_SIZE is 9.
maskForBroadcast_ equals dataBlockPerInputChannel_ - DATA_BLOCK_SIZE.

A single Copy instruction broadcasts the data of the first datablock to the other blocks within $C_i$, giving a shape of $4\times 9\times C_i$.

The number of blocks copied per iteration is:
$$
\begin{aligned}
N &= \left\lceil\frac{\mathrm{Mask}}{8}\right\rceil \\
&= \left\lceil\frac{\frac{C_i}{8}-8}{8}\right\rceil \\
&= \left\lceil\frac{C_i}{64}\right\rceil-1
\end{aligned}
$$
The srcRepeatSize and dstRepeatSize parameters are set to $\frac{C_i}{64}$.

In the first broadcast step adjacent elements are $\frac{C_i}{64}$ apart, so the valid part of each group of interpolation weights has length $\frac{9C_i}{8}$.

cpp 复制代码

    Copy(fracHBroadcastLocal_[DATA_BLOCK_SIZE], fracHBroadcastLocal_, maskForBroadcast_, FOUR_CORNERS * X_OFFSET_SIZE,

        copyParams_);
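
The block count formula can be checked numerically. The sketch assumes, as the formula above does, that dataBlockPerInputChannel_ is $C_i / 8$ and that $C_i$ is a multiple of 8 and at least 64; the function names are hypothetical.

```python
def blocks_per_copy_iteration(c_in):
    """ceil(mask / 8) with mask = C_i / 8 - 8, using integer arithmetic only."""
    mask = c_in // 8 - 8              # maskForBroadcast_ as described in the text
    return -(-mask // 8)              # ceiling division


def closed_form(c_in):
    """ceil(C_i / 64) - 1, the simplified expression."""
    return -(-c_in // 64) - 1


checked = [c for c in range(64, 1025, 8)
           if blocks_per_copy_iteration(c) == closed_form(c)]
```

For $C_i = 64$ the result is 0: the single datablock written by Brcb already covers the whole segment, so nothing is left to copy.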

DeformableConv2dV2Kernel::CopyInFeature loads the input according to topLeftOffsetLocal_ and fracHBroadcastLocal_ and performs the interpolation.

The result in outFeatureLocal_ is then copied to global memory.

cpp 复制代码

    CopyInFeature();

    SetFlag<HardEvent::V_MTE3>(copyOutEventID);

    WaitFlag<HardEvent::V_MTE3>(copyOutEventID);

    DataCopyPad(img2colMatGm_[taskIdx * elementsCountPerTask_], outFeatureLocal_,

        {1, static_cast<uint32_t>(elementsCountPerTask_ * FP32_BYTE_SIZE), 0, 0, 0});

}

DeformableConv2dV2Kernel::CopyInFeature

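The function takes no parameters, which makes it hard to see which variables it depends on.
Values such as topLeft0 are int32_t and should be compared against an integer rather than -1.0f.

The code is fully unrolled; it seems it could be written as a for loop, as in V1.

After the channels of the 9 input points are loaded, they are multiplied by the weights.
topLeftWeightLocal_ holds $\Delta m \cdot w_{11}=\Delta m(y_2 - y)(x_2 -x)$.
Only the first $\frac{9C_i}{8}$ elements of topLeftWeightLocal_ are valid.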

cpp 复制代码

template<bool modulated>

__aicore__ inline void DeformableConv2dV2Kernel<modulated>::CopyInFeature()

{

    int32_t topLeft0 = topLeftOffsetLocal_.GetValue(0);

    int32_t topLeft1 = topLeftOffsetLocal_.GetValue(1);

    int32_t topLeft2 = topLeftOffsetLocal_.GetValue(2);

    int32_t topLeft3 = topLeftOffsetLocal_.GetValue(3);

    int32_t topLeft4 = topLeftOffsetLocal_.GetValue(4);

    int32_t topLeft5 = topLeftOffsetLocal_.GetValue(5);

    int32_t topLeft6 = topLeftOffsetLocal_.GetValue(6);

    int32_t topLeft7 = topLeftOffsetLocal_.GetValue(7);

    int32_t topLeft8 = topLeftOffsetLocal_.GetValue(8);

    (topLeft0 == -1.0f) ? Duplicate(topLeftFeatureLocal_[0 * cIn_], 0.0f, cIn_) :

                          DataCopy(topLeftFeatureLocal_[0 * cIn_], xGm_[topLeft0], cIn_);

    (topLeft1 == -1.0f) ? Duplicate(topLeftFeatureLocal_[1 * cIn_], 0.0f, cIn_) :

                          DataCopy(topLeftFeatureLocal_[1 * cIn_], xGm_[topLeft1], cIn_);

    (topLeft2 == -1.0f) ? Duplicate(topLeftFeatureLocal_[2 * cIn_], 0.0f, cIn_) :

                          DataCopy(topLeftFeatureLocal_[2 * cIn_], xGm_[topLeft2], cIn_);

    (topLeft3 == -1.0f) ? Duplicate(topLeftFeatureLocal_[3 * cIn_], 0.0f, cIn_) :

                          DataCopy(topLeftFeatureLocal_[3 * cIn_], xGm_[topLeft3], cIn_);

    (topLeft4 == -1.0f) ? Duplicate(topLeftFeatureLocal_[4 * cIn_], 0.0f, cIn_) :

                          DataCopy(topLeftFeatureLocal_[4 * cIn_], xGm_[topLeft4], cIn_);

    (topLeft5 == -1.0f) ? Duplicate(topLeftFeatureLocal_[5 * cIn_], 0.0f, cIn_) :

                          DataCopy(topLeftFeatureLocal_[5 * cIn_], xGm_[topLeft5], cIn_);

    (topLeft6 == -1.0f) ? Duplicate(topLeftFeatureLocal_[6 * cIn_], 0.0f, cIn_) :

                          DataCopy(topLeftFeatureLocal_[6 * cIn_], xGm_[topLeft6], cIn_);

    (topLeft7 == -1.0f) ? Duplicate(topLeftFeatureLocal_[7 * cIn_], 0.0f, cIn_) :

                          DataCopy(topLeftFeatureLocal_[7 * cIn_], xGm_[topLeft7], cIn_);

    (topLeft8 == -1.0f) ? Duplicate(topLeftFeatureLocal_[8 * cIn_], 0.0f, cIn_) :

                          DataCopy(topLeftFeatureLocal_[8 * cIn_], xGm_[topLeft8], cIn_);

    Mul(topLeftWeightLocal_, oneSubFracHBroadcastLocal_, oneSubFracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);

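Mul sets src1BlkStride to 0, which implements a multiplication with broadcast along the lowest dimension: each datablock of topLeftWeightLocal_ is multiplied with 8 consecutive datablocks of topLeftFeatureLocal_.
src1RepStride is 1.
repeatTimes_ equals $\frac{9C_i}{8\times 8}$, so $9C_i$ elements are processed in total.

To multiply all 9 points, the weights have to be laid out in segments of length ci/DATA_SIZE_PER_REPEAT.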

cpp 复制代码

    SetFlag<HardEvent::MTE3_V>(MTE3_VEventID);

    WaitFlag<HardEvent::MTE3_V>(MTE3_VEventID);

    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);

    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);

    Mul(outFeatureLocal_, topLeftFeatureLocal_, topLeftWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});
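
The stride parameters `{1, 1, 0, 8, 8, 1}` can be read as (dstBlkStride, src0BlkStride, src1BlkStride, dstRepStride, src0RepStride, src1RepStride). The following Python model shows how a block stride of 0 on the second operand reuses one weight datablock for 8 feature datablocks. It is a simplified emulation: `mul_repeat` is a hypothetical name, and a full mask of 64 float32 elements per repeat is assumed.

```python
BLOCK = 8  # float32 elements per datablock


def mul_repeat(src0, src1, repeat_times, params):
    """Element-wise multiply with per-block and per-repeat strides (in datablocks)."""
    dst_blk, s0_blk, s1_blk, dst_rep, s0_rep, s1_rep = params
    dst = [0.0] * len(src0)
    for r in range(repeat_times):
        for b in range(8):                        # 8 datablocks per repeat
            d = (r * dst_rep + b * dst_blk) * BLOCK
            a = (r * s0_rep + b * s0_blk) * BLOCK
            w = (r * s1_rep + b * s1_blk) * BLOCK  # s1_blk == 0: same block reused
            for e in range(BLOCK):
                dst[d + e] = src0[a + e] * src1[w + e]
    return dst


features = [1.0] * 128                 # two repeats of 64 elements
weights = [2.0] * 8 + [3.0] * 8        # one weight datablock per repeat
out = mul_repeat(features, weights, 2, (1, 1, 0, 8, 8, 1))
```

Each repeat advances the weights by only one datablock (src1RepStride = 1) while the features advance by eight, which is why the weights must be stored once per 64 feature elements.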

cpp 复制代码

    int32_t topRight0 = topRightOffsetLocal_.GetValue(0);

    int32_t topRight1 = topRightOffsetLocal_.GetValue(1);

    int32_t topRight2 = topRightOffsetLocal_.GetValue(2);

    int32_t topRight3 = topRightOffsetLocal_.GetValue(3);

    int32_t topRight4 = topRightOffsetLocal_.GetValue(4);

    int32_t topRight5 = topRightOffsetLocal_.GetValue(5);

    int32_t topRight6 = topRightOffsetLocal_.GetValue(6);

    int32_t topRight7 = topRightOffsetLocal_.GetValue(7);

    int32_t topRight8 = topRightOffsetLocal_.GetValue(8);

    (topRight0 == -1.0f) ? Duplicate(topRightFeatureLocal_[0 * cIn_], 0.0f, cIn_) :

                           DataCopy(topRightFeatureLocal_[0 * cIn_], xGm_[topRight0], cIn_);

    (topRight1 == -1.0f) ? Duplicate(topRightFeatureLocal_[1 * cIn_], 0.0f, cIn_) :

                           DataCopy(topRightFeatureLocal_[1 * cIn_], xGm_[topRight1], cIn_);

    (topRight2 == -1.0f) ? Duplicate(topRightFeatureLocal_[2 * cIn_], 0.0f, cIn_) :

                           DataCopy(topRightFeatureLocal_[2 * cIn_], xGm_[topRight2], cIn_);

    (topRight3 == -1.0f) ? Duplicate(topRightFeatureLocal_[3 * cIn_], 0.0f, cIn_) :

                           DataCopy(topRightFeatureLocal_[3 * cIn_], xGm_[topRight3], cIn_);

    (topRight4 == -1.0f) ? Duplicate(topRightFeatureLocal_[4 * cIn_], 0.0f, cIn_) :

                           DataCopy(topRightFeatureLocal_[4 * cIn_], xGm_[topRight4], cIn_);

    (topRight5 == -1.0f) ? Duplicate(topRightFeatureLocal_[5 * cIn_], 0.0f, cIn_) :

                           DataCopy(topRightFeatureLocal_[5 * cIn_], xGm_[topRight5], cIn_);

    (topRight6 == -1.0f) ? Duplicate(topRightFeatureLocal_[6 * cIn_], 0.0f, cIn_) :

                           DataCopy(topRightFeatureLocal_[6 * cIn_], xGm_[topRight6], cIn_);

    (topRight7 == -1.0f) ? Duplicate(topRightFeatureLocal_[7 * cIn_], 0.0f, cIn_) :

                           DataCopy(topRightFeatureLocal_[7 * cIn_], xGm_[topRight7], cIn_);

    (topRight8 == -1.0f) ? Duplicate(topRightFeatureLocal_[8 * cIn_], 0.0f, cIn_) :

                           DataCopy(topRightFeatureLocal_[8 * cIn_], xGm_[topRight8], cIn_);

    Mul(topRightWeightLocal_, oneSubFracHBroadcastLocal_, fracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);

    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);

    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);

    MulAddDst(outFeatureLocal_, topRightFeatureLocal_, topRightWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});

    int32_t bottomLeft0 = bottomLeftOffsetLocal_.GetValue(0);

    int32_t bottomLeft1 = bottomLeftOffsetLocal_.GetValue(1);

    int32_t bottomLeft2 = bottomLeftOffsetLocal_.GetValue(2);

    int32_t bottomLeft3 = bottomLeftOffsetLocal_.GetValue(3);

    int32_t bottomLeft4 = bottomLeftOffsetLocal_.GetValue(4);

    int32_t bottomLeft5 = bottomLeftOffsetLocal_.GetValue(5);

    int32_t bottomLeft6 = bottomLeftOffsetLocal_.GetValue(6);

    int32_t bottomLeft7 = bottomLeftOffsetLocal_.GetValue(7);

    int32_t bottomLeft8 = bottomLeftOffsetLocal_.GetValue(8);

    (bottomLeft0 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[0 * cIn_], 0.0f, cIn_) :

                             DataCopy(bottomLeftFeatureLocal_[0 * cIn_], xGm_[bottomLeft0], cIn_);

    (bottomLeft1 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[1 * cIn_], 0.0f, cIn_) :

                             DataCopy(bottomLeftFeatureLocal_[1 * cIn_], xGm_[bottomLeft1], cIn_);

    (bottomLeft2 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[2 * cIn_], 0.0f, cIn_) :

                             DataCopy(bottomLeftFeatureLocal_[2 * cIn_], xGm_[bottomLeft2], cIn_);

    (bottomLeft3 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[3 * cIn_], 0.0f, cIn_) :

                             DataCopy(bottomLeftFeatureLocal_[3 * cIn_], xGm_[bottomLeft3], cIn_);

    (bottomLeft4 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[4 * cIn_], 0.0f, cIn_) :

                             DataCopy(bottomLeftFeatureLocal_[4 * cIn_], xGm_[bottomLeft4], cIn_);

    (bottomLeft5 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[5 * cIn_], 0.0f, cIn_) :

                             DataCopy(bottomLeftFeatureLocal_[5 * cIn_], xGm_[bottomLeft5], cIn_);

    (bottomLeft6 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[6 * cIn_], 0.0f, cIn_) :

                             DataCopy(bottomLeftFeatureLocal_[6 * cIn_], xGm_[bottomLeft6], cIn_);

    (bottomLeft7 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[7 * cIn_], 0.0f, cIn_) :

                             DataCopy(bottomLeftFeatureLocal_[7 * cIn_], xGm_[bottomLeft7], cIn_);

    (bottomLeft8 == -1.0f) ? Duplicate(bottomLeftFeatureLocal_[8 * cIn_], 0.0f, cIn_) :

                             DataCopy(bottomLeftFeatureLocal_[8 * cIn_], xGm_[bottomLeft8], cIn_);

    Mul(bottomLeftWeightLocal_, oneSubFracWBroadcastLocal_, fracHBroadcastLocal_, 9 * dataBlockPerInputChannel_);

    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);

    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);

    MulAddDst(

        outFeatureLocal_, bottomLeftFeatureLocal_, bottomLeftWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});

    int32_t bottomRight0 = bottomRightOffsetLocal_.GetValue(0);

    int32_t bottomRight1 = bottomRightOffsetLocal_.GetValue(1);

    int32_t bottomRight2 = bottomRightOffsetLocal_.GetValue(2);

    int32_t bottomRight3 = bottomRightOffsetLocal_.GetValue(3);

    int32_t bottomRight4 = bottomRightOffsetLocal_.GetValue(4);

    int32_t bottomRight5 = bottomRightOffsetLocal_.GetValue(5);

    int32_t bottomRight6 = bottomRightOffsetLocal_.GetValue(6);

    int32_t bottomRight7 = bottomRightOffsetLocal_.GetValue(7);

    int32_t bottomRight8 = bottomRightOffsetLocal_.GetValue(8);

    (bottomRight0 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[0 * cIn_], 0.0f, cIn_) :

                              DataCopy(bottomRightFeatureLocal_[0 * cIn_], xGm_[bottomRight0], cIn_);

    (bottomRight1 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[1 * cIn_], 0.0f, cIn_) :

                              DataCopy(bottomRightFeatureLocal_[1 * cIn_], xGm_[bottomRight1], cIn_);

    (bottomRight2 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[2 * cIn_], 0.0f, cIn_) :

                              DataCopy(bottomRightFeatureLocal_[2 * cIn_], xGm_[bottomRight2], cIn_);

    (bottomRight3 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[3 * cIn_], 0.0f, cIn_) :

                              DataCopy(bottomRightFeatureLocal_[3 * cIn_], xGm_[bottomRight3], cIn_);

    (bottomRight4 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[4 * cIn_], 0.0f, cIn_) :

                              DataCopy(bottomRightFeatureLocal_[4 * cIn_], xGm_[bottomRight4], cIn_);

    (bottomRight5 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[5 * cIn_], 0.0f, cIn_) :

                              DataCopy(bottomRightFeatureLocal_[5 * cIn_], xGm_[bottomRight5], cIn_);

    (bottomRight6 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[6 * cIn_], 0.0f, cIn_) :

                              DataCopy(bottomRightFeatureLocal_[6 * cIn_], xGm_[bottomRight6], cIn_);

    (bottomRight7 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[7 * cIn_], 0.0f, cIn_) :

                              DataCopy(bottomRightFeatureLocal_[7 * cIn_], xGm_[bottomRight7], cIn_);

    (bottomRight8 == -1.0f) ? Duplicate(bottomRightFeatureLocal_[8 * cIn_], 0.0f, cIn_) :

                              DataCopy(bottomRightFeatureLocal_[8 * cIn_], xGm_[bottomRight8], cIn_);

    Mul(bottomRightWeightLocal_, fracHBroadcastLocal_, fracWBroadcastLocal_, 9 * dataBlockPerInputChannel_);

    SetFlag<HardEvent::MTE2_V>(copyInFeatureEventID);

    WaitFlag<HardEvent::MTE2_V>(copyInFeatureEventID);

    MulAddDst(

        outFeatureLocal_, bottomRightFeatureLocal_, bottomRightWeightLocal_, mask_, repeatTimes_, {1, 1, 0, 8, 8, 1});

}
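
As noted above, the unrolled body looks like it could be expressed as a loop. The following Python sketch shows the computation in loop form. It is not the kernel code: `interpolate_window` and its argument layout are hypothetical, weights are taken as one scalar per point instead of a broadcast buffer, and an offset of -1 is treated as a zero contribution.

```python
CORNERS = ("top_left", "top_right", "bottom_left", "bottom_right")


def interpolate_window(x, offsets, weights, c_in):
    """Accumulate weight * feature over the four corners of every window point.

    x:        flat NHWC feature data
    offsets:  offsets[corner][k] is a flat offset, or -1 when out of bounds
    weights:  weights[corner][k] is the (modulated) bilinear weight
    """
    n_points = len(offsets["top_left"])
    out = [0.0] * (n_points * c_in)
    for corner in CORNERS:
        for k in range(n_points):
            off = int(offsets[corner][k])
            if off == -1:              # integer comparison against the sentinel
                continue               # out of bounds contributes zero
            w = weights[corner][k]
            for c in range(c_in):
                out[k * c_in + c] += w * x[off + c]
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]       # 2x2 map, 2 channels
offsets = {"top_left": [0], "top_right": [2], "bottom_left": [-1], "bottom_right": [6]}
weights = {name: [0.25] for name in CORNERS}
result = interpolate_window(x, offsets, weights, c_in=2)
```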

DeformableConv2dV2Kernel::ProcessCube

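innerCubeTaskIdx is the index of the last element. The starting index is assumed to be 0, so the number of im2col rows, cubeTaskCount, follows directly.
elementsCountPerTask_ is $k_h k_w C_i$.
aOffset and cOffset are the starting offsets of the current core in matrices A and C, respectively.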

cpp 复制代码

template<bool modulated>

__aicore__ inline void DeformableConv2dV2Kernel<modulated>::ProcessCube(

    uint32_t taskIdx, const int32_t& innerCubeTaskIdx)

{

    int32_t cubeTaskCount = innerCubeTaskIdx + 1;

    uint64_t aOffset = (taskIdx - innerCubeTaskIdx) * elementsCountPerTask_;

    uint64_t cOffset = (taskIdx - innerCubeTaskIdx) * cOut_;

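SetTensorA sets the left matrix A of the matrix multiplication.
SetTensorB sets the right matrix B of the matrix multiplication.
SetSingleShape sets the single-core Matmul shape singleMIn, singleNIn, singleKIn, in units of elements.

IterateAll computes a C matrix of size singleCoreM * singleCoreN. The iteration order can be adjusted through the tiling parameter iterateOrder.

img2col has shape $128\times k_h k_w C_i$, weight has shape $C_o \times k_h k_w C_i$, and the output has shape $128\times C_o$. Because weight is stored as $C_o \times k_h k_w C_i$, SetTensorB is called with the transpose flag set to true.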

cpp 复制代码

    mm_.SetTensorA(img2colMatGm_[aOffset]);

    mm_.SetTensorB(weightGm_, true);

    mm_.SetSingleShape(cubeTaskCount, cOut_, elementsCountPerTask_);

    mm_.template IterateAll<false>(yGm_[cOffset]);

}
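
The offsets and the transposed multiplication in ProcessCube can be sketched with plain lists. This is an illustration under assumptions: `process_cube` is a hypothetical function, matrices are stored row-major in flat lists, and the result is returned together with its starting offset instead of being written to global memory.

```python
def process_cube(img2col, weight, task_idx, inner_idx, k_elems, c_out):
    """Multiply the last (inner_idx + 1) im2col rows with weight transposed.

    img2col: flat list, one row of k_elems values per task
    weight:  flat list of shape (c_out, k_elems)
    """
    rows = inner_idx + 1                              # cubeTaskCount
    a_off = (task_idx - inner_idx) * k_elems          # first row of this batch of tasks
    c_off = (task_idx - inner_idx) * c_out            # matching position in the output
    y = []
    for m in range(rows):
        a_row = img2col[a_off + m * k_elems: a_off + (m + 1) * k_elems]
        for n in range(c_out):
            b_row = weight[n * k_elems:(n + 1) * k_elems]   # row n of weight = column n of B^T
            y.append(sum(a * b for a, b in zip(a_row, b_row)))
    return c_off, y


img2col = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]             # three tasks, k_elems = 2
weight = [1.0, 0.0, 0.0, 1.0]                        # identity, c_out = 2
c_off, y = process_cube(img2col, weight, task_idx=2, inner_idx=1, k_elems=2, c_out=2)
```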


References: