CANN ops-nn: Convolution and Pooling Operator Implementation

CANN organization: https://atomgit.com/cann
ops-nn repository: https://atomgit.com/cann/ops-nn

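Introduction

Convolution and pooling are core operations in computer vision models. From the classic AlexNet and VGG to the more recent ResNet and EfficientNet, convolutional networks have been highly successful in image classification, object detection, and semantic segmentation. The efficiency of the convolution and pooling operators largely determines the training and inference performance of these models.

The conv and pooling directories of the ops-nn project provide convolution and pooling operators covering 2D/3D convolution, a range of pooling operations, and their backward passes. This article walks through how these operators work and how they are optimized.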

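Convolution Operators

Mathematical Definition of Convolution

2D convolution (shown for stride 1, no padding, bias omitted) is defined as:

Output[n][c_out][h][w] = Σ_{c_in} Σ_{i} Σ_{j} Input[n][c_in][h+i][w+j] * Weight[c_out][c_in][i][j]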

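where:

n: batch index
c_in / c_out: input / output channel index
h, w: spatial position in the output feature map
i, j: spatial position within the kernel

Key parameters:

kernel_size: kernel size (for example 3×3 or 5×5)
stride: step between adjacent windows; a larger stride yields a smaller output
padding: values (usually zeros) added around the input border; suitable padding keeps the output the same size as the input
dilation: spacing between kernel taps, which enlarges the receptive field without adding parameters
groups: grouped convolution, which splits the channels into groups to reduce parameters and computation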
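
Together, these parameters determine the output size along each spatial dimension. The following is a minimal sketch of the commonly used formula, assuming floor rounding and symmetric padding; the helper name ConvOutputSize is illustrative and not part of ops-nn.

```cpp
#include <stdexcept>

// Output extent along one spatial dimension (floor mode, symmetric padding).
// Illustrative helper; not an ops-nn API.
inline int ConvOutputSize(int in_size, int kernel, int stride,
                          int padding, int dilation) {
    if (in_size <= 0 || kernel <= 0 || stride <= 0 || padding < 0 || dilation <= 0) {
        throw std::invalid_argument("invalid convolution parameters");
    }
    // A dilated kernel spans dilation * (kernel - 1) + 1 input elements.
    const int effective_kernel = dilation * (kernel - 1) + 1;
    const int padded = in_size + 2 * padding;
    if (padded < effective_kernel) {
        throw std::invalid_argument("kernel larger than padded input");
    }
    return (padded - effective_kernel) / stride + 1;
}
```

For example, a 224-wide input with a 7×7 kernel, stride 2, and padding 3 gives an output width of 112, while a 3×3 kernel with stride 1 and padding 1 leaves the size unchanged.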

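ops-nn Convolution Operators

Conv2dV2:

Standard 2D convolution, supporting:

Arbitrary kernel_size and stride
Several padding modes (same, valid, custom)
Grouped convolution (the groups parameter)
Dilated convolution (the dilation parameter)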

Conv2dV2(input,      // [N, C_in, H_in, W_in]
         weight,     // [C_out, C_in/groups, K_h, K_w]
         bias,       // [C_out]
         output,     // [N, C_out, H_out, W_out]
         stride,
         padding,
         dilation,
         groups);

Conv3dV2:

3D convolution, used for video analysis, 3D medical imaging, and similar workloads:

Conv3dV2(input,      // [N, C_in, D_in, H_in, W_in]
         weight,     // [C_out, C_in/groups, K_d, K_h, K_w]
         output);    // [N, C_out, D_out, H_out, W_out]

ConvolutionForward:

General convolution interface covering 1D/2D/3D convolution and transposed convolution:

ConvolutionForward(input,
                   weight,
                   bias,
                   stride,
                   padding,
                   dilation,
                   transposed,  // whether this is a transposed convolution
                   output_padding,
                   groups,
                   output);

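DeformableConv2d:

Deformable convolution, which adjusts its sampling positions dynamically:

DeformableConv2d(input,
                 weight,
                 offset,     // offsets that control the sampling positions
                 mask,       // modulation scalars that weight each sample
                 output);

Deformable convolution is widely used in object detection, for example in DCN-based detector backbones; the deformable attention in Deformable DETR builds on the same idea of learned sampling offsets.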

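Convolution Implementation Methods

1. Direct Convolution

The most intuitive approach computes the output directly from the mathematical definition (the loop nest here assumes no padding and a dilation of 1):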

for (int n = 0; n < batch; n++) {
    for (int c_out = 0; c_out < out_channels; c_out++) {
        for (int h = 0; h < H_out; h++) {
            for (int w = 0; w < W_out; w++) {
                float sum = 0;
                for (int c_in = 0; c_in < in_channels; c_in++) {
                    for (int kh = 0; kh < K_h; kh++) {
                        for (int kw = 0; kw < K_w; kw++) {
                            int h_in = h * stride + kh;
                            int w_in = w * stride + kw;
                            sum += input[n][c_in][h_in][w_in] * 
                                   weight[c_out][c_in][kh][kw];
                        }
                    }
                }
                output[n][c_out][h][w] = sum + bias[c_out];
            }
        }
    }
}

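Characteristics:

Simple to implement and easy to understand
Small memory footprint
Poor performance, so it is not suited to large convolutions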
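
The loop nest above leaves out padding, dilation, and groups. As a reference for checking optimized kernels, here is a sketch of direct convolution on flat NCHW buffers that handles all three. The function name Conv2dDirect and its signature are assumptions made for illustration, not an ops-nn interface, and the code assumes C_in and C_out are divisible by groups.

```cpp
#include <cstddef>
#include <vector>

// Reference direct convolution on flat NCHW float buffers.
// Illustrative sketch; Conv2dDirect is not an ops-nn API.
//   input : [N, C_in, H, W]
//   weight: [C_out, C_in / groups, K_h, K_w]
//   bias  : [C_out], or empty for no bias
std::vector<float> Conv2dDirect(const std::vector<float>& input,
                                const std::vector<float>& weight,
                                const std::vector<float>& bias,
                                int N, int C_in, int H, int W,
                                int C_out, int K_h, int K_w,
                                int stride, int padding, int dilation, int groups) {
    const int H_out = (H + 2 * padding - dilation * (K_h - 1) - 1) / stride + 1;
    const int W_out = (W + 2 * padding - dilation * (K_w - 1) - 1) / stride + 1;
    const int cin_g = C_in / groups;    // input channels per group
    const int cout_g = C_out / groups;  // output channels per group
    std::vector<float> output(static_cast<std::size_t>(N) * C_out * H_out * W_out);

    for (int n = 0; n < N; ++n) {
        for (int co = 0; co < C_out; ++co) {
            const int c_base = (co / cout_g) * cin_g;  // first input channel of this group
            for (int ho = 0; ho < H_out; ++ho) {
                for (int wo = 0; wo < W_out; ++wo) {
                    float sum = bias.empty() ? 0.0f : bias[co];
                    for (int ci = 0; ci < cin_g; ++ci) {
                        for (int kh = 0; kh < K_h; ++kh) {
                            for (int kw = 0; kw < K_w; ++kw) {
                                const int hi = ho * stride - padding + kh * dilation;
                                const int wi = wo * stride - padding + kw * dilation;
                                if (hi < 0 || hi >= H || wi < 0 || wi >= W) {
                                    continue;  // zero padding contributes nothing
                                }
                                const std::size_t in_idx =
                                    ((static_cast<std::size_t>(n) * C_in + c_base + ci) * H + hi) * W + wi;
                                const std::size_t w_idx =
                                    ((static_cast<std::size_t>(co) * cin_g + ci) * K_h + kh) * K_w + kw;
                                sum += input[in_idx] * weight[w_idx];
                            }
                        }
                    }
                    const std::size_t out_idx =
                        ((static_cast<std::size_t>(n) * C_out + co) * H_out + ho) * W_out + wo;
                    output[out_idx] = sum;
                }
            }
        }
    }
    return output;
}
```

A scalar reference like this is slow, but it is useful as ground truth when validating Im2col, Winograd, or hardware-specific implementations.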

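2. Im2col + GEMM

This approach converts convolution into matrix multiplication and is the most widely used implementation.

Im2col transform:

The input feature map is unfolded into a matrix in which each row holds one receptive field:

Input:  [N, C_in, H_in, W_in]
Kernel: [C_out, C_in, K_h, K_w]

After Im2col:
  Input matrix:  [N * H_out * W_out, C_in * K_h * K_w]
  Weight matrix: [C_out, C_in * K_h * K_w]

Matrix multiply: Output = Input_col @ Weight^T
Result: [N * H_out * W_out, C_out]

Im2col implementation: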

void Im2col(const float* input,
            float* col,
            int N, int C, int H, int W,
            int K_h, int K_w,
            int stride, int padding) {
    int H_out = (H + 2*padding - K_h) / stride + 1;
    int W_out = (W + 2*padding - K_w) / stride + 1;
    
    for (int n = 0; n < N; n++) {
        for (int h_out = 0; h_out < H_out; h_out++) {
            for (int w_out = 0; w_out < W_out; w_out++) {
                int out_idx = (n * H_out * W_out + h_out * W_out + w_out);
                
                for (int c = 0; c < C; c++) {
                    for (int kh = 0; kh < K_h; kh++) {
                        for (int kw = 0; kw < K_w; kw++) {
                            int h_in = h_out * stride + kh - padding;
                            int w_in = w_out * stride + kw - padding;
                            
                            int col_idx = out_idx * (C * K_h * K_w) +
                                         c * K_h * K_w + kh * K_w + kw;
                            
                            if (h_in >= 0 && h_in < H && w_in >= 0 && w_in < W) {
                                // input is a flat NCHW buffer, so compute the linear offset
                                col[col_idx] = input[((n * C + c) * H + h_in) * W + w_in];
                            } else {
                                col[col_idx] = 0;  // padding
                            }
                        }
                    }
                }
            }
        }
    }
}

GEMM computation:

// Call an optimized matrix multiplication (MatMul is a placeholder name)
MatMul(output_col, input_col, weight_T,
       N * H_out * W_out,  // M
       C_out,              // N
       C_in * K_h * K_w);  // K

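Trade-offs:

Pros: reuses highly optimized matrix multiplication, so performance is high
Cons: the Im2col buffer needs extra memory (up to about K_h * K_w times the input size at stride 1), trading space for time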
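
To make the memory cost concrete, the sketch below computes the size of the Im2col buffer and of the original input for float32 data. Both helper names are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>

// Bytes needed by the Im2col buffer for float32 data.
// Illustrative helper; not an ops-nn API.
inline std::size_t Im2colBufferBytes(int N, int C, int H, int W,
                                     int K_h, int K_w, int stride, int padding) {
    const int H_out = (H + 2 * padding - K_h) / stride + 1;
    const int W_out = (W + 2 * padding - K_w) / stride + 1;
    const std::size_t rows = static_cast<std::size_t>(N) * H_out * W_out;
    const std::size_t cols = static_cast<std::size_t>(C) * K_h * K_w;
    return rows * cols * sizeof(float);
}

// Bytes occupied by the original float32 input tensor.
inline std::size_t InputBytes(int N, int C, int H, int W) {
    return static_cast<std::size_t>(N) * C * H * W * sizeof(float);
}
```

For a single 64×56×56 input with a 3×3 kernel, stride 1, and padding 1, the buffer takes 7,225,344 bytes against 802,816 bytes for the input, a 9× expansion. This is why practical implementations often run Im2col tile by tile rather than materializing the whole matrix.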

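3. Winograd Fast Convolution

The Winograd algorithm speeds up small convolutions (such as 3×3) by reducing the number of multiplications.

Principle:

For Winograd F(2×2, 3×3), which produces a 2×2 output tile with a 3×3 kernel:

Direct convolution: 2×2×3×3 = 36 multiplications
Winograd: 4×4 = 16 multiplications

Computation flow:

1. Input transform: B^T d B
2. Weight transform: G g G^T
3. Element-wise product: M = (B^T d B) ⊙ (G g G^T)
4. Output transform: A^T M A

Here d is a 4×4 input tile, g is the 3×3 kernel, and B, G, and A are fixed transform matrices.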
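
The idea is easiest to see in one dimension. The sketch below implements F(2, 3), which produces two outputs of a 3-tap filter from four inputs using 4 multiplications instead of 6; the 2D F(2×2, 3×3) case nests this transform along rows and columns. The function names are illustrative only, and the direct version is included purely for comparison.

```cpp
#include <array>

// 1D Winograd F(2, 3): two outputs of a 3-tap filter from four inputs,
// using 4 multiplications instead of 6. Illustrative sketch only.
inline std::array<float, 2> WinogradF23(const std::array<float, 4>& d,
                                        const std::array<float, 3>& g) {
    // Filter transform (G g); can be precomputed once per filter.
    const float u0 = g[0];
    const float u1 = 0.5f * (g[0] + g[1] + g[2]);
    const float u2 = 0.5f * (g[0] - g[1] + g[2]);
    const float u3 = g[2];
    // Input transform (B^T d) followed by the element-wise products.
    const float m0 = (d[0] - d[2]) * u0;
    const float m1 = (d[1] + d[2]) * u1;
    const float m2 = (d[2] - d[1]) * u2;
    const float m3 = (d[1] - d[3]) * u3;
    // Output transform (A^T m).
    return {m0 + m1 + m2, m1 - m2 - m3};
}

// Direct computation for comparison: y[i] = sum over k of d[i + k] * g[k].
inline std::array<float, 2> DirectF23(const std::array<float, 4>& d,
                                      const std::array<float, 3>& g) {
    return {d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]};
}
```

The saving comes at the cost of extra additions and some loss of numerical precision, which is one reason the technique is usually limited to small tiles.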

When to use it:

Small kernels (3×3, 5×5)
stride = 1
Compute-bound workloads

In some cases ops-nn selects the Winograd algorithm automatically.

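4. FFT Convolution

For large kernels, convolution can be accelerated with the FFT:

Convolution theorem: f * g = IFFT(FFT(f) ⊙ FFT(g))

Flow:

// 1. Forward FFT (zero-pad input and weight to a common size first)
auto input_fft = FFT2D(input);
auto weight_fft = FFT2D(weight);

// 2. Element-wise product in the frequency domain
//    (CNN "convolution" is cross-correlation, so flip the kernel or use the conjugate)
auto output_fft = input_fft * weight_fft;

// 3. Inverse FFT
auto output = IFFT2D(output_fft);

When to use it:

Large kernels (11×11 and above)
Large feature maps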

Convolution Optimization Techniques

1. Tiling:

Split the input, output, and weights into blocks and process them tile by tile:

// Tile sizes
const int TILE_H = 16;
const int TILE_W = 16;
const int TILE_C_OUT = 32;

for (int h = 0; h < H_out; h += TILE_H) {
    for (int w = 0; w < W_out; w += TILE_W) {
        for (int c = 0; c < C_out; c += TILE_C_OUT) {
            // Process one tile
            ConvTile(input, weight, output,
                    h, w, c,
                    TILE_H, TILE_W, TILE_C_OUT);
        }
    }
}

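2. Data layout (NCHW vs NHWC):

NCHW: [batch, channel, height, width]
  - Each channel plane is contiguous in memory
  - Contiguous access along the width dimension

NHWC: [batch, height, width, channel]
  - All channels of a pixel are contiguous in memory
  - Well suited to vectorizing over channels on some hardware

Choose the layout that matches the hardware.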
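
The difference between the two layouts comes down to how a logical index (n, c, h, w) maps to a flat memory offset. A minimal sketch follows, with helper names chosen for this illustration only.

```cpp
#include <cstddef>

// Flat offset of element (n, c, h, w) in an NCHW tensor: w varies fastest.
inline std::size_t OffsetNCHW(int n, int c, int h, int w, int C, int H, int W) {
    return ((static_cast<std::size_t>(n) * C + c) * H + h) * W + w;
}

// Flat offset of the same element in an NHWC tensor: c varies fastest.
inline std::size_t OffsetNHWC(int n, int c, int h, int w, int C, int H, int W) {
    return ((static_cast<std::size_t>(n) * H + h) * W + w) * C + c;
}
```

In NCHW, stepping to the next channel jumps H * W elements, whereas in NHWC it moves by a single element, so an inner loop over channels reads contiguous memory only in NHWC.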

3. Multi-core parallelism:

// Parallelize over batch or output channels (OpenMP used for illustration)
#pragma omp parallel for
for (int c_out = 0; c_out < C_out; c_out++) {
    ComputeOutputChannel(input, weight, output, c_out);
}

4. Low-precision computation:

Run the convolution in INT8 or FP16:

// Quantized convolution
QuantizedConv2d(input_int8,
                weight_int8,
                scale_input,
                scale_weight,
                output_int8);
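
The arithmetic inside a quantized convolution can be sketched as follows, assuming symmetric per-tensor quantization with INT32 accumulation. This is one common scheme rather than a description of how ops-nn quantizes, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric per-tensor quantization: q = clamp(round(x / scale), -127, 127).
inline std::int8_t QuantizeInt8(float x, float scale) {
    const float q = std::round(x / scale);
    return static_cast<std::int8_t>(std::max(-127.0f, std::min(127.0f, q)));
}

// INT8 dot product with INT32 accumulation, rescaled to float at the end.
// This is the inner operation of a quantized convolution.
inline float QuantizedDot(const std::vector<std::int8_t>& a,
                          const std::vector<std::int8_t>& b,
                          float scale_a, float scale_b) {
    std::int32_t acc = 0;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    }
    return static_cast<float>(acc) * scale_a * scale_b;
}
```

Accumulating in INT32 avoids overflow of the INT8 products, and the two scales are applied once per output rather than once per multiplication.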

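Pooling Operators

Mathematical Definition of Pooling

Pooling is a downsampling operation that reduces the spatial size of a feature map. In the formulas here, k is the window size and s is the stride.

MaxPooling:

Output[n][c][h][w] = max(Input[n][c][h*s:h*s+k, w*s:w*s+k])

Takes the maximum value within each window.

AvgPooling:

Output[n][c][h][w] = mean(Input[n][c][h*s:h*s+k, w*s:w*s+k])

Takes the mean of the values within each window.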

ops-nn Pooling Operators

MaxPoolV3:

2D max pooling:

MaxPoolV3(input,         // [N, C, H_in, W_in]
          output,        // [N, C, H_out, W_out]
          kernel_size,   // [k_h, k_w]
          stride,
          padding,
          dilation);

MaxPool3d:

3D max pooling:

MaxPool3d(input,         // [N, C, D_in, H_in, W_in]
          output,        // [N, C, D_out, H_out, W_out]
          kernel_size,   // [k_d, k_h, k_w]
          stride,
          padding,
          dilation);

AvgPool3D:

3D average pooling:

AvgPool3D(input,
          output,
          kernel_size,
          stride,
          padding,
          count_include_pad);  // whether padded elements count toward the average

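AdaptiveMaxPool3d:

Adaptive pooling, where the caller specifies the output size:

AdaptiveMaxPool3d(input,
                  output,
                  output_size);  // target output size

Adaptive pooling derives the window position and extent for each output element from the input and output sizes.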
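
One common convention for deriving the windows (used, for example, by PyTorch) is start = floor(i * in / out) and end = ceil((i + 1) * in / out) for output index i. Whether ops-nn follows exactly this rule is an assumption here, and the helper name is illustrative.

```cpp
#include <utility>

// Window [start, end) along one dimension for output index out_idx, using
// start = floor(i * in / out) and end = ceil((i + 1) * in / out).
// Illustrative helper; the convention is assumed, not taken from ops-nn.
inline std::pair<int, int> AdaptivePoolWindow(int out_idx, int in_size, int out_size) {
    const int start = (out_idx * in_size) / out_size;
    const int end = ((out_idx + 1) * in_size + out_size - 1) / out_size;
    return {start, end};
}
```

With an input of length 10 and an output of length 4, the windows are [0, 3), [2, 5), [5, 8), and [7, 10). When the input size is not a multiple of the output size, the windows can overlap and differ in length, so there is no single fixed kernel_size and stride.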

Pooling Implementation

MaxPooling implementation:

__aicore__ void MaxPoolKernel::Compute() {
    for (int n = 0; n < batch; n++) {
        for (int c = 0; c < channels; c++) {
            for (int h = 0; h < H_out; h++) {
                for (int w = 0; w < W_out; w++) {
                    float max_val = -INFINITY;
                    
                    // Iterate over the pooling window
                    for (int kh = 0; kh < K_h; kh++) {
                        for (int kw = 0; kw < K_w; kw++) {
                            int h_in = h * stride + kh;
                            int w_in = w * stride + kw;
                            
                            if (h_in < H_in && w_in < W_in) {
                                float val = input[n][c][h_in][w_in];
                                max_val = max(max_val, val);
                            }
                        }
                    }
                    
                    output[n][c][h][w] = max_val;
                }
            }
        }
    }
}

AvgPooling implementation:

__aicore__ void AvgPoolKernel::Compute() {
    for (int n = 0; n < batch; n++) {
        for (int c = 0; c < channels; c++) {
            for (int h = 0; h < H_out; h++) {
                for (int w = 0; w < W_out; w++) {
                    float sum = 0;
                    int count = 0;
                    
                    // Iterate over the pooling window
                    for (int kh = 0; kh < K_h; kh++) {
                        for (int kw = 0; kw < K_w; kw++) {
                            int h_in = h * stride + kh;
                            int w_in = w * stride + kw;
                            
                            if (h_in < H_in && w_in < W_in) {
                                sum += input[n][c][h_in][w_in];
                                count++;
                            }
                        }
                    }
                    
                    output[n][c][h][w] = sum / count;
                }
            }
        }
    }
}

MaxPooling with Indices:

Saves the position of each maximum for use in the backward pass:

MaxPoolWithArgmax(input,
                  output,       // maximum values
                  indices,      // indices of the maximum values
                  kernel_size,
                  stride);

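Pooling Optimizations

1. Vectorization:

// Process elements in batches with vector instructions
// (LoadVector, MaxVector, and StoreVector are placeholder names)
for (int w = 0; w < W_out; w += VEC_SIZE) {
    // Batched load
    LoadVector(input_vec, input_ptr + w * stride, VEC_SIZE);
    
    // Vectorized max/mean
    MaxVector(output_vec, input_vec, VEC_SIZE);
    
    // Batched store
    StoreVector(output_ptr + w, output_vec, VEC_SIZE);
}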

2. Parallelization:

// Parallelize over output positions
#pragma omp parallel for collapse(3)
for (int n = 0; n < batch; n++) {
    for (int c = 0; c < channels; c++) {
        for (int h = 0; h < H_out; h++) {
            // Process one row
            PoolRow(input, output, n, c, h);
        }
    }
}

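3. Hardware acceleration:

Use the dedicated reduction instructions of the AI Core:

// Use the hardware ReduceMax instruction (simplified call; the actual Ascend C
// interface operates on LocalTensor arguments and takes further parameters)
AscendC::ReduceMax(output, input, window_size);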

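Backward Pass

Convolution Backward

Convolution has two main gradients to compute in the backward pass:

Gradient with respect to the input (Conv2dBackpropInput):

Conv2dBackpropInput(grad_output,  // ∂L/∂Y
                    weight,        // W
                    grad_input);   // ∂L/∂X

This is in effect a transposed convolution.

Gradient with respect to the weights (Conv2dBackpropFilter):

Conv2dBackpropFilter(input,        // X
                     grad_output,   // ∂L/∂Y
                     grad_weight);  // ∂L/∂W

In the Im2col formulation (Y = X_col @ W^T): ∂L/∂W = (∂L/∂Y)^T @ X_col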
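
Both gradients can be written as short loop nests. The sketch below covers the simplest case, a single channel with no padding on flat row-major buffers, so that the index arithmetic stays visible. The function names and signatures are assumptions for illustration and do not mirror the ops-nn interfaces.

```cpp
#include <cstddef>
#include <vector>

// Single-channel 2D convolution gradients without padding, on flat row-major
// buffers. Illustrative sketch; these helpers are not ops-nn APIs.
//   x: [H, W], weight: [K, K], dy: [H_out, W_out], H_out = (H - K) / stride + 1

// dL/dW[kh][kw] = sum over (h, w) of x[h*stride + kh][w*stride + kw] * dy[h][w]
std::vector<float> ConvBackwardFilter(const std::vector<float>& x,
                                      const std::vector<float>& dy,
                                      int H, int W, int K, int stride) {
    const int H_out = (H - K) / stride + 1;
    const int W_out = (W - K) / stride + 1;
    std::vector<float> dw(static_cast<std::size_t>(K) * K, 0.0f);
    for (int h = 0; h < H_out; ++h) {
        for (int w = 0; w < W_out; ++w) {
            for (int kh = 0; kh < K; ++kh) {
                for (int kw = 0; kw < K; ++kw) {
                    dw[kh * K + kw] +=
                        x[(h * stride + kh) * W + (w * stride + kw)] * dy[h * W_out + w];
                }
            }
        }
    }
    return dw;
}

// dL/dX: each dy[h][w] is scattered back through the kernel to the inputs it read.
std::vector<float> ConvBackwardInput(const std::vector<float>& weight,
                                     const std::vector<float>& dy,
                                     int H, int W, int K, int stride) {
    const int H_out = (H - K) / stride + 1;
    const int W_out = (W - K) / stride + 1;
    std::vector<float> dx(static_cast<std::size_t>(H) * W, 0.0f);
    for (int h = 0; h < H_out; ++h) {
        for (int w = 0; w < W_out; ++w) {
            for (int kh = 0; kh < K; ++kh) {
                for (int kw = 0; kw < K; ++kw) {
                    dx[(h * stride + kh) * W + (w * stride + kw)] +=
                        weight[kh * K + kw] * dy[h * W_out + w];
                }
            }
        }
    }
    return dx;
}
```

The input-gradient loop is the scatter form of a transposed convolution: the same index mapping as the forward pass, with reads and writes swapped.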

ops-nn implementation:

ConvolutionBackward(grad_output,
                    input,
                    weight,
                    grad_input,     // output
                    grad_weight,    // output
                    grad_bias,      // output
                    output_mask);   // selects which gradients to compute

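Pooling Backward

MaxPooling backward:

// Only the argmax positions receive a nonzero gradient
MaxPoolGradWithArgmax(grad_output,
                      indices,      // argmax positions saved by the forward pass
                      grad_input);

Implementation (grad_input must be zero-initialized beforehand):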

for (int i = 0; i < grad_output.size(); i++) {
    int max_idx = indices[i];
    grad_input[max_idx] += grad_output[i];
}
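
To show how the forward and backward passes fit together, here is a sketch of max pooling on a single [H, W] plane that records the argmax indices and then routes gradients through them. It assumes no padding, and PoolResult, MaxPoolWithIndices, and MaxPoolBackward are names made up for this example.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

struct PoolResult {
    std::vector<float> output;  // [H_out, W_out] maximum values
    std::vector<int> indices;   // flat input index of each maximum
};

// Forward max pooling on one [H, W] plane without padding (illustrative).
PoolResult MaxPoolWithIndices(const std::vector<float>& input,
                              int H, int W, int K, int stride) {
    const int H_out = (H - K) / stride + 1;
    const int W_out = (W - K) / stride + 1;
    PoolResult r;
    r.output.assign(static_cast<std::size_t>(H_out) * W_out,
                    -std::numeric_limits<float>::infinity());
    r.indices.assign(r.output.size(), -1);
    for (int h = 0; h < H_out; ++h) {
        for (int w = 0; w < W_out; ++w) {
            const int o = h * W_out + w;
            for (int kh = 0; kh < K; ++kh) {
                for (int kw = 0; kw < K; ++kw) {
                    const int idx = (h * stride + kh) * W + (w * stride + kw);
                    if (input[idx] > r.output[o]) {
                        r.output[o] = input[idx];
                        r.indices[o] = idx;
                    }
                }
            }
        }
    }
    return r;
}

// Backward: route each output gradient to the recorded argmax position.
std::vector<float> MaxPoolBackward(const std::vector<float>& grad_output,
                                   const std::vector<int>& indices,
                                   int input_size) {
    std::vector<float> grad_input(static_cast<std::size_t>(input_size), 0.0f);
    for (std::size_t i = 0; i < grad_output.size(); ++i) {
        if (indices[i] >= 0) {
            grad_input[indices[i]] += grad_output[i];  // += handles overlapping windows
        }
    }
    return grad_input;
}
```

Accumulating with += matters when stride is smaller than the kernel size, because one input element can then be the maximum of several overlapping windows.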

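AvgPooling backward:

// The gradient is split evenly across all positions in the pooling window
AvgPool3dGrad(grad_output,
              input_size,
              kernel_size,
              stride,
              grad_input);

Implementation (2D case; grad_input must be zero-initialized, and no padding is assumed, so every window holds K_h * K_w elements):

for (int h_out = 0; h_out < H_out; h_out++) {
    for (int w_out = 0; w_out < W_out; w_out++) {
        float grad_value = grad_output[h_out][w_out] / (K_h * K_w);
        for (int kh = 0; kh < K_h; kh++) {
            for (int kw = 0; kw < K_w; kw++) {
                int h_in = h_out * stride + kh;
                int w_in = w_out * stride + kw;
                grad_input[h_in][w_in] += grad_value;
            }
        }
    }
}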

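Application Scenarios

Classification Networks

Classification networks such as ResNet and VGG stack convolution blocks; a typical ResNet block looks like this (pseudocode):

# ResNet block
residual = x
x = conv2d(x, 64, kernel_size=3, stride=1)
x = batch_norm(x)
x = relu(x)
x = conv2d(x, 64, kernel_size=3, stride=1)
x = batch_norm(x)
x = x + residual  # residual connection
x = relu(x)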

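Object Detection

Detection networks such as YOLO and Faster R-CNN use multi-scale convolutional features (pseudocode):

# FPN structure
p5 = conv2d(c5, 256, kernel_size=1)
p4 = upsample(p5) + conv2d(c4, 256, kernel_size=1)
p3 = upsample(p4) + conv2d(c3, 256, kernel_size=1)

# Detection heads
detections = [
    detect_head(p3),  # small objects
    detect_head(p4),  # medium objects
    detect_head(p5),  # large objects
]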

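Semantic Segmentation

Segmentation networks such as DeepLab use dilated convolution (pseudocode):

# Dilated convolutions enlarge the receptive field
x1 = conv2d(x, 256, kernel_size=3, dilation=1)
x2 = conv2d(x, 256, kernel_size=3, dilation=2)
x3 = conv2d(x, 256, kernel_size=3, dilation=4)
x4 = conv2d(x, 256, kernel_size=3, dilation=8)

# ASPP-style module: concatenate the parallel branches
x = concat([x1, x2, x3, x4])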
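
Concatenating the branches requires all of them to have the same spatial size, which means each branch needs padding matched to its dilation. The sketch below computes that padding for an odd kernel at stride 1 and checks it against the usual output-size formula; both helper names are illustrative.

```cpp
// Padding that keeps the spatial size unchanged for an odd kernel at stride 1.
// Illustrative helper; not an ops-nn API.
inline int SamePadding(int kernel, int dilation) {
    return dilation * (kernel - 1) / 2;
}

// Output extent along one dimension (floor mode, symmetric padding).
inline int OutputSize(int in_size, int kernel, int stride, int padding, int dilation) {
    return (in_size + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1;
}
```

For a 3×3 kernel the required padding equals the dilation, so the four branches above would use padding 1, 2, 4, and 8 respectively.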

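Summary

Convolution and pooling are foundational operations in computer vision. ops-nn provides convolution and pooling operators that cover many variants and optimization techniques.

This article covered:

The mathematics of convolution and pooling
Several convolution implementation methods (direct convolution, Im2col, Winograd, FFT)
How pooling is implemented and optimized
How the backward pass is computed
Applications in vision tasks

Optimizing convolution operators is a systems problem that requires weighing algorithm choice, data layout, and hardware characteristics together. Developers are advised to:

Understand where each implementation method applies
Choose the algorithm according to the kernel size
Make full use of hardware acceleration
Balance accuracy against performance

Knowing how convolution and pooling operators are implemented and optimized is a key skill for building high-performance vision models. The implementations in the ops-nn project are a valuable learning resource and reference for developers.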