IREE Compilation Pipeline (1)

IREE currently supports MHLO (or XLA), Torch Tensor, and TOSA as inputs. A series of passes compiles the input into IREE's VM bytecode intermediate artifact, in which hardware-specific code is compiled into corresponding Executables that are stored inside the VM bytecode for the host to call. For example, CUDA compute code is lowered to PTX, which the CUDA runtime then JIT-compiles into an executable cubin kernel inside the IREE runtime.

The entry point of IREE compilation is IREEVMTransformPassPipeline, which is in turn divided into several stages: InputConversionPassPipeline, CommonInputConversionPassPipeline, ABI::TransformPassPipeline, Flow::FlowTransformPassPipeline, Stream::StreamTransformPassPipeline (CUDA backend only), HAL::HALTransformPassPipeline, and VM::VMTransformPassPipeline.

1 InputConversionPassPipeline

Its main purpose is to uniformly lower the different inputs (MHLO or XLA, Torch Tensor, TOSA) into the linalg dialect plus the builtin arith, scf, and tensor dialects. Taking MHLO input as an example, the passes in InputConversionPassPipeline and their main roles are listed below.

  • mhlo::createLegalizeControlFlowPass

    Legalizes the TF 1.0 control-flow primitives (see http://download.tensorflow.org/paper/white_paper_tf_control_flow_implementation_2017_11_1.pdf) into HLO control-flow ops.

  • createTopLevelSCFToCFGPass

    Converts control flow expressed as top-level structured control flow into a lower-level control-flow graph (CFG) of basic blocks, as sketched below.
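
    A minimal sketch of the conversion (hypothetical IR, not taken from actual pass output):

    // before: structured control flow (scf.if)
    func.func @select(%cond: i1, %a: i32, %b: i32) -> i32 {
      %0 = scf.if %cond -> (i32) {
        scf.yield %a : i32
      } else {
        scf.yield %b : i32
      }
      return %0 : i32
    }
    // after: a CFG of basic blocks (cf dialect)
    func.func @select(%cond: i1, %a: i32, %b: i32) -> i32 {
      cf.cond_br %cond, ^bb1, ^bb2
    ^bb1:
      cf.br ^bb3(%a : i32)
    ^bb2:
      cf.br ^bb3(%b : i32)
    ^bb3(%0: i32):
      return %0 : i32
    }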

  • createMHLOToMHLOPreprocessingPass

  • mlir::createCanonicalizerPass

  • mlir::createShapeToShapeLowering

    shape.num_elements转换成shape.reduce

  • mlir::createConvertShapeToStandardPass

    shape dialect lower成arith dialectscf dialecttensor dialect。比如

    func.func @test(%arg0: tensor<1x?xf32>, %arg1: tensor<?xf32>) -> index {
      %c1 = arith.constant 1 : index
      %c0 = arith.constant 0 : index
      %0 = shape.dim %arg0, %c1 : tensor<1x?xf32>, index -> index
      %1 = shape.dim %arg1, %c0 : tensor<?xf32>, index -> index
      %2 = shape.add %0, %1 : index, index -> index
      return %2 : index
    }

    is converted into:

    func.func @test(%arg0: tensor<1x?xf32>, %arg1: tensor<?xf32>) -> index {
      %c1 = arith.constant 1 : index
      %c0 = arith.constant 0 : index
      %c1_0 = arith.constant 1 : index
      %c1_1 = arith.constant 1 : index
      %0 = tensor.dim %arg0, %c1_1 : tensor<1x?xf32>
      %1 = tensor.from_elements %c1_0, %0 : tensor<2xindex>
      %2 = tensor.cast %1 : tensor<2xindex> to tensor<2xindex>
      %3 = tensor.dim %arg0, %c1 : tensor<1x?xf32>
      %c0_2 = arith.constant 0 : index
      %4 = tensor.dim %arg1, %c0_2 : tensor<?xf32>
      %5 = tensor.from_elements %4 : tensor<1xindex>
      %6 = tensor.cast %5 : tensor<1xindex> to tensor<1xindex>
      %7 = tensor.dim %arg1, %c0 : tensor<?xf32>
      %8 = arith.addi %3, %7 : index
      return %8 : index
    }
  • mlir::createCanonicalizerPass

  • mlir::createInlinerPass

    Inlines calls to callable operations and deletes dead callables. For example:

    func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = call @add(%arg0, %arg1) : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
      return %0 : tensor<1xf32>
    }
    func.func private @add(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = mhlo.add %arg0, %arg1 : tensor<1xf32>
      return %0 : tensor<1xf32>
    }

    After being inlined, the private add function is deleted:

    func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = mhlo.add %arg0, %arg1 : tensor<1xf32>
      return %0 : tensor<1xf32>
    }
  • IREE::Util::createDemoteI64ToI32Pass

  • IREE::Util::createDemoteF64ToF32Pass

  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

  • mhlo::createLegalizeShapeComputationsPass

    scalar tensor op转换成scalar op + fromElements op。比如

    func.func @test(%arg0: f32, %arg1: f32) -> tensor<1xf32> {
      %0 = tensor.from_elements %arg0 : tensor<1xf32>
      %1 = tensor.from_elements %arg1 : tensor<1xf32>
      %2 = mhlo.add %0, %1 : tensor<1xf32>
      return %2 : tensor<1xf32>
    }

    is converted into:

    func.func @test(%arg0: f32, %arg1: f32) -> tensor<1xf32> {
      %0 = arith.addf %arg0, %arg1 : f32
      %1 = tensor.from_elements %0 : tensor<1xf32>
      return %1 : tensor<1xf32>
    }
  • createConvertMHLOToLinalgExtPass

    mhlo::sortmhlo.scattermhlo.fftmhlo.reversemhlo.topk转换到IREE::LinalgExt dialect,同时将在IREE::LinalgExt dialect区域内部的mhlo op转换成linalg dialectmhlo.return则转换成iree_linalg_ext.yield。比如,

    func.func @test(%arg0: tensor<10xf32>) -> tensor<10xf32> {
      %0 = "mhlo.sort"(%arg0) ({
      ^bb0(%arg1: tensor<f32>, %arg2: tensor<f32>):
        %1 = mhlo.compare  GT, %arg1, %arg2 : (tensor<f32>, tensor<f32>) -> tensor<i1>
        mhlo.return %1 : tensor<i1>
      }) {dimension = 0 : i64} : (tensor<10xf32>) -> tensor<10xf32>
      return %0 : tensor<10xf32>
    }

    is converted into:

    func.func @test(%arg0: tensor<10xf32>) -> tensor<10xf32> {
      %0 = iree_linalg_ext.sort dimension(0) outs(%arg0 : tensor<10xf32>) {
      ^bb0(%arg1: f32, %arg2: f32):
        %1 = arith.cmpf ogt, %arg1, %arg2 : f32
        iree_linalg_ext.yield %1 : i1
      } -> tensor<10xf32>
      return %0 : tensor<10xf32>
    }
  • createMHLOToLinalgOnTensorsPass

    Converts the remaining outer mhlo ops into the linalg dialect. For example:

    func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = mhlo.add %arg0, %arg1 : tensor<1xf32>
      return %0 : tensor<1xf32>
    }

    is converted into:

    func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = linalg.init_tensor [1] : tensor<1xf32>
      %1 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<1xf32>, tensor<1xf32>) outs(%0 : tensor<1xf32>) {
      ^bb0(%arg2: f32, %arg3: f32, %arg4: f32):
        %2 = arith.addf %arg2, %arg3 : f32
        linalg.yield %2 : f32
      } -> tensor<1xf32>
      return %1 : tensor<1xf32>
    }
  • mlir::createReconcileUnrealizedCastsPass

    Eliminates unrealized conversion cast ops.

    The algorithm works as follows (a minimal sketch is given after this list):

    • If an unrealized conversion cast is a dead node (it has no users, or all of its users are themselves unrealized conversion casts), the dead node is deleted directly;
    • If it is a live node (it has at least one user that is not an unrealized conversion cast), all of its descendant casts are traversed: if every traversed unrealized conversion cast has a result type identical to this op's input type (i.e., no cast with real type-conversion semantics occurs), all of the traversed unrealized conversion casts are folded into this op's input; otherwise a "live unrealized conversion cast" error is reported.
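
    A minimal sketch of the two cases (!foo.box is a hypothetical dialect type):

    // dead chain: %1 has no users, %0's only user is a cast, so both are erased
    %0 = builtin.unrealized_conversion_cast %arg0 : i64 to !foo.box
    %1 = builtin.unrealized_conversion_cast %0 : !foo.box to i64
    // live round-trip: %3 has a real user, but its result type (index) matches
    // the original input type, so uses of %3 are replaced with %arg1 and both
    // casts fold away
    %2 = builtin.unrealized_conversion_cast %arg1 : index to !foo.box
    %3 = builtin.unrealized_conversion_cast %2 : !foo.box to index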
  • mlir::createCanonicalizerPass

  • createVerifyCompilerMHLOInputLegality

    Verifies that the program is legal.

2 CommonInputConversionPassPipeline

Its main purpose is to lower the IREE::Input dialect into the IREE::Util, IREE::Flow, and IREE::HAL dialects. It consists of the following passes:

  • createIREEImportPublicPass

    IREE::Input dialect转换成IREE::UtilIREE::FlowIREE::HAL dialect,并转换func的属性和signature中输入输出类型。比如,

    iree_input.global private mutable @param  : tensor<1x2xf32>
    func.func @run(%arg0: tensor<1x2xf32>) {
      %0 = iree_input.global.load @param : tensor<1x2xf32>
      %1 = iree_input.tensor.clone %0 : tensor<1x2xf32>
      iree_input.global.store %1, @param : tensor<1x2xf32>
      return
    }

    is converted into (iree_input.global.load --> util.global.load, iree_input.global.store --> util.global.store, iree_input.tensor.clone --> flow.tensor.clone):

    util.global private mutable @param : tensor<1x2xf32>
    func.func @run(%arg0: tensor<1x2xf32>) {
      %param = util.global.load @param : tensor<1x2xf32>
      %0 = flow.tensor.clone %param : tensor<1x2xf32>
      util.global.store %0, @param : tensor<1x2xf32>
      return
    }
  • createImportMLProgramPass

    ml_program dialect转换到IREE::Util dialect

  • createSanitizeModuleNamesPass

    module name中的.替换为_,以符合mlir identifiers的命名规范。

    module @iree.module {
      func.func @test(%arg0: f32, %arg1: f32) -> f32 {
        %0 = arith.addf %arg0, %arg1 : f32
        return %0 : f32
      }
    }

    is converted into:

    module @iree_module {
      func.func @test(%arg0: f32, %arg1: f32) -> f32 {
        %0 = arith.addf %arg0, %arg1 : f32
        return %0 : f32
      }
    }

3 ABI::TransformPassPipeline

Its main purpose is to normalize the parameters of externally imported interfaces and of the interfaces this module exports to standard scalar types or the hal.buffer_view type (hal.buffer_view corresponds to a tensor).

  • createWrapEntryPointsPass

    external func生成一个内部函数,函数中调用原始的external func,同时将public func的函数体包装成一个新的函数,原public func中调用该函数。该 pass 最终的目的是将外部导入的接口和本 module 导出到外部的接口参数统一成标准标量类型或hal.buffer_viewhal.buffer_view对应 tensor 类型)。

    // external/imported func
    func.func private @add(tensor<f32>, tensor<f32>) -> tensor<f32>
    
    // public/exported func
    func.func @test(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> {
      %0 = call @add(%arg0, %arg1) : (tensor<f32>, tensor<f32>) -> tensor<f32>
      return %0 : tensor<f32>
    }

    is converted into:

    func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
    func.func private @_add(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> {
      %0 = hal.tensor.export %arg0 : tensor<f32> -> !hal.buffer_view
      %1 = hal.tensor.export %arg1 : tensor<f32> -> !hal.buffer_view
      %2 = call @add(%0, %1) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
      %3 = hal.tensor.import %2 : !hal.buffer_view -> tensor<f32>
      return %3 : tensor<f32>
    }
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<f32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<f32>
      %2 = call @_test(%0, %1) : (tensor<f32>, tensor<f32>) -> tensor<f32>
      %3 = hal.tensor.export %2 : tensor<f32> -> !hal.buffer_view
      return %3 : !hal.buffer_view
    }
    func.func private @_test(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> {
      %0 = call @_add(%arg0, %arg1) : (tensor<f32>, tensor<f32>) -> tensor<f32>
      return %0 : tensor<f32>
    }
  • mlir::createInlinerPass

    WrapEntryPointsPass中生成的 wrap 函数内联起来。最终转换成,

    func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = call @add(%arg0, %arg1) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
      return %0 : !hal.buffer_view
    }
  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

  • mlir::createSymbolDCEPass

4 Flow::FlowTransformPassPipeline

Its main purpose is to perform a series of peephole optimizations, such as converting 1x1 conv2d into matmul, tiling, op fusion, and so on, and finally to split the workload into flow.executable ops.

  • IREE::Util::createDemoteF64ToF32Pass

    Narrows F64 types to F32, as sketched below.
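
    A minimal sketch of the effect (hypothetical IR):

    // before: f64 throughout
    func.func @scale(%arg0: f64) -> f64 {
      %cst = arith.constant 2.0 : f64
      %0 = arith.mulf %arg0, %cst : f64
      return %0 : f64
    }
    // after: every f64 demoted to f32
    func.func @scale(%arg0: f32) -> f32 {
      %cst = arith.constant 2.0 : f32
      %0 = arith.mulf %arg0, %cst : f32
      return %0 : f32
    }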

  • IREE::Flow::createConvertConv2D1x1ToMatmulPass

    Converts a 1x1 linalg.conv_2d_nhwc_hwcf into linalg.matmul. For example:

    // func.func @conv(%input : tensor<1x2x2x3xf32>, %filter: tensor<1x1x3x4xf32>) -> tensor<1x2x2x4xf32> {
    //   %0 = mhlo.convolution(%input, %filter)
    //             dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
    //             window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
    //             {batch_group_count = 1 : i64, feature_group_count = 1 : i64}
    //           : (tensor<1x2x2x3xf32>, tensor<1x1x3x4xf32>) -> tensor<1x2x2x4xf32>
    //   return %0 : tensor<1x2x2x4xf32>
    // }
    func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x2x2x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<1x1x3x4xf32>
      %2 = linalg.init_tensor [1, 2, 2, 4] : tensor<1x2x2x4xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32>
      %4 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x2x2x3xf32>, tensor<1x1x3x4xf32>) outs(%3 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32>
      %5 = hal.tensor.export %4 : tensor<1x2x2x4xf32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }

    is converted into:

    func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x2x2x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<1x1x3x4xf32>
      %2 = linalg.init_tensor [1, 2, 2, 4] : tensor<1x2x2x4xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32>
      %4 = tensor.collapse_shape %0 [[0, 1, 2], [3]] : tensor<1x2x2x3xf32> into tensor<4x3xf32>
      %5 = tensor.collapse_shape %1 [[0, 1, 2], [3]] : tensor<1x1x3x4xf32> into tensor<3x4xf32>
      %6 = tensor.collapse_shape %3 [[0, 1, 2], [3]] : tensor<1x2x2x4xf32> into tensor<4x4xf32>
      %7 = linalg.matmul ins(%4, %5 : tensor<4x3xf32>, tensor<3x4xf32>) outs(%6 : tensor<4x4xf32>) -> tensor<4x4xf32>
      %8 = tensor.expand_shape %7 [[0, 1, 2], [3]] : tensor<4x4xf32> into tensor<1x2x2x4xf32>
      %9 = hal.tensor.export %8 : tensor<1x2x2x4xf32> -> !hal.buffer_view
      return %9 : !hal.buffer_view
    }
  • IREE::Flow::createConvertConv2DToImg2ColPass

    conv2d转换成img2col。默认不开启。

    // %0 = mhlo.convolution(%input, %filter)
    //               dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
    //               window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
    //               {batch_group_count = 1 : i64, feature_group_count = 1 : i64}
    //             : (tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) -> tensor<1x3x3x4xf32>
    func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
      %2 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %4 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %5 = hal.tensor.export %4 : tensor<1x3x3x4xf32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }

    is converted into:

    func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
      %2 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %4 = linalg.init_tensor [1, 3, 3, 2, 2, 3] : tensor<1x3x3x2x2x3xf32>
      %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1 + d3, d2 + d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%0 : tensor<1x4x4x3xf32>) outs(%4 : tensor<1x3x3x2x2x3xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<1x3x3x2x2x3xf32>
      %6 = tensor.collapse_shape %5 [[0, 1, 2], [3, 4, 5]] : tensor<1x3x3x2x2x3xf32> into tensor<9x12xf32>
      %7 = tensor.collapse_shape %1 [[0, 1, 2], [3]] : tensor<2x2x3x4xf32> into tensor<12x4xf32>
      %8 = tensor.collapse_shape %3 [[0, 1, 2], [3]] : tensor<1x3x3x4xf32> into tensor<9x4xf32>
      %9 = linalg.matmul ins(%6, %7 : tensor<9x12xf32>, tensor<12x4xf32>) outs(%8 : tensor<9x4xf32>) -> tensor<9x4xf32>
      %10 = tensor.expand_shape %9 [[0, 1, 2], [3]] : tensor<9x4xf32> into tensor<1x3x3x4xf32>
      %11 = hal.tensor.export %10 : tensor<1x3x3x4xf32> -> !hal.buffer_view
      return %11 : !hal.buffer_view
    }
  • IREE::Flow::createDetachElementwiseFromNamedOpsPass

    buffer = linalg.generic_op + linalg.named_payload_op转换成tmp_buffer = linalg.named_payload_op; buffer = linalg.generic_op + tmp_buffer,主要目的是将上游的generic opnamed_payload_op分隔开,使得named_payload_op的结果写到一块新的 buffer。

    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
      %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<1x3x3x4xf32>
      
      %3 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x3x3x4xf32>) outs(%4 : tensor<1x3x3x4xf32>) {
      ^bb0(%arg3: f32, %arg4: f32):
        %8 = arith.addf %arg3, %arg3 : f32
        linalg.yield %8 : f32
      } -> tensor<1x3x3x4xf32>
      
      %6 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%5 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %7 = hal.tensor.export %6 : tensor<1x3x3x4xf32> -> !hal.buffer_view
      return %7 : !hal.buffer_view
    }

    is converted into:

    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
      %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<1x3x3x4xf32>
      
      %3 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x3x3x4xf32>) outs(%4 : tensor<1x3x3x4xf32>) {
      ^bb0(%arg3: f32, %arg4: f32):
        %11 = arith.addf %arg3, %arg3 : f32
        linalg.yield %11 : f32
      } -> tensor<1x3x3x4xf32>
      
      %6 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %7 = linalg.fill ins(%cst : f32) outs(%6 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %8 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%7 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
    
      %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%8, %5 : tensor<1x3x3x4xf32>, tensor<1x3x3x4xf32>) outs(%7 : tensor<1x3x3x4xf32>) {
      ^bb0(%arg3: f32, %arg4: f32, %arg5: f32):
        %11 = arith.addf %arg3, %arg4 : f32
        linalg.yield %11 : f32
      } -> tensor<1x3x3x4xf32>
      %10 = hal.tensor.export %9 : tensor<1x3x3x4xf32> -> !hal.buffer_view
      return %10 : !hal.buffer_view
    }
  • IREE::Flow::createVerifyInputLegalityPass

    Verifies that the program is legal.

  • IREE::Flow::createConvertLinalgMatmulToMmt4DPass

    Tiles a 2-D linalg.matmul into linalg.mmt4d. Disabled by default; it can be enabled with the --iree-flow-mmt4d-target-options="enable_generic_slow arch=cuda" option. For example:

    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
      %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %4 = linalg.matmul ins(%0, %1 : tensor<128x256xf32>, tensor<256x256xf32>) outs(%3 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %5 = hal.tensor.export %4 : tensor<128x256xf32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }

    is converted into:

    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
      %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %4 = tensor.expand_shape %0 [[0, 1], [2, 3]] : tensor<128x256xf32> into tensor<16x8x128x2xf32>
      %5 = tensor.expand_shape %1 [[0, 1], [2, 3]] : tensor<256x256xf32> into tensor<128x2x64x4xf32>
      %6 = tensor.expand_shape %3 [[0, 1], [2, 3]] : tensor<128x256xf32> into tensor<16x8x64x4xf32>
      %7 = linalg.init_tensor [16, 128, 8, 2] : tensor<16x128x8x2xf32>
      %8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%4 : tensor<16x8x128x2xf32>) outs(%7 : tensor<16x128x8x2xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<16x128x8x2xf32>
      %9 = linalg.init_tensor [64, 128, 4, 2] : tensor<64x128x4x2xf32>
      %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d1, d3, d0, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%5 : tensor<128x2x64x4xf32>) outs(%9 : tensor<64x128x4x2xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<64x128x4x2xf32>
      %11 = linalg.init_tensor [16, 64, 8, 4] : tensor<16x64x8x4xf32>
      %12 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%6 : tensor<16x8x64x4xf32>) outs(%11 : tensor<16x64x8x4xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<16x64x8x4xf32>
      // 16 x (128x8x2) @ 64 x (128x4x2) => 16 x 64 x sum_{128}(8x2 * (4x2)^T)
      %13 = linalg.mmt4d {comment = "generic tiling parameters, as no known kernel was matched for this matmul and target"} ins(%8, %10 : tensor<16x128x8x2xf32>, tensor<64x128x4x2xf32>) outs(%12 : tensor<16x64x8x4xf32>) -> tensor<16x64x8x4xf32>
      %14 = linalg.init_tensor [16, 8, 64, 4] : tensor<16x8x64x4xf32>
      %15 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%13 : tensor<16x64x8x4xf32>) outs(%14 : tensor<16x8x64x4xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<16x8x64x4xf32>
      %16 = tensor.collapse_shape %15 [[0, 1], [2, 3]] : tensor<16x8x64x4xf32> into tensor<128x256xf32>
      %17 = hal.tensor.export %16 : tensor<128x256xf32> -> !hal.buffer_view
      return %17 : !hal.buffer_view
    }
  • IREE::Flow::createPadLinalgOpsToIntegerMultiplePass

    Pads the M, N, and K dimensions of a matmul to integer multiples of paddingSize, which defaults to 4 (a sketch is given below).
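
    A minimal sketch of the effect (hypothetical IR; %zero is an assumed f32 zero constant, and the tensor.pad form is an assumption rather than verbatim pass output):

    // before: M=3, K=5, N=7 are not multiples of 4
    %0 = linalg.matmul ins(%a, %b : tensor<3x5xf32>, tensor<5x7xf32>)
                       outs(%c : tensor<3x7xf32>) -> tensor<3x7xf32>
    // after: operands zero-padded up to multiples of 4, result sliced back
    %pa = tensor.pad %a low[0, 0] high[1, 3] {
    ^bb0(%i: index, %j: index):
      tensor.yield %zero : f32
    } : tensor<3x5xf32> to tensor<4x8xf32>
    // ... %b is padded to tensor<8x8xf32> and %c to tensor<4x8xf32> likewise ...
    %pm = linalg.matmul ins(%pa, %pb : tensor<4x8xf32>, tensor<8x8xf32>)
                        outs(%pc : tensor<4x8xf32>) -> tensor<4x8xf32>
    %1 = tensor.extract_slice %pm[0, 0] [3, 7] [1, 1] : tensor<4x8xf32> to tensor<3x7xf32>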

  • mlir::createLinalgNamedOpConversionPass

    depth_multiplier=1linalg.depthwise_conv_2d_nhwc_hwcm转换成linalg.depthwise_conv_2d_nhwc_hwc,将depth_multiplier=1linalg.depthwise_conv_2d_nhwc_hwcm_q转换成linalg.depthwise_conv_2d_nhwc_hwc_q

depth_multiplier的作用见 https://www.tensorflow.org/api_docs/python/tf/keras/layers/DepthwiseConv2D
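
    A minimal sketch (hypothetical IR): with depth_multiplier = 1, the unit multiplier dimension of the filter and the output is collapsed away:

    // before: filter is HWCM with M = 1, output is NHWCM
    %0 = linalg.depthwise_conv_2d_nhwc_hwcm
           {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
           ins(%in, %f : tensor<1x4x4x8xf32>, tensor<2x2x8x1xf32>)
           outs(%out : tensor<1x3x3x8x1xf32>) -> tensor<1x3x3x8x1xf32>
    // after: the unit dim is dropped (filter HWC, output NHWC)
    %1 = linalg.depthwise_conv_2d_nhwc_hwc
           {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
           ins(%in, %f2 : tensor<1x4x4x8xf32>, tensor<2x2x8xf32>)
           outs(%out2 : tensor<1x3x3x8xf32>) -> tensor<1x3x3x8xf32>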

  • IREE::Flow::createExpandTensorShapesPass

    Expands each dynamic tensor into the dual form tensor + dynamic dims; one benefit of doing so is that the dynamic dimensions can directly participate in computation and derivation. For example:

    // func.func private @add(%arg0 : tensor<?x2xf32>, %arg1 : tensor<?x2xf32>) -> tensor<?x2xf32>
    // iree_input.global private mutable @param : tensor<?x2xf32>
    // func.func @run(%arg0 : tensor<?x2xf32>) -> tensor<?x2xf32> {
    //   %0 = iree_input.global.load @param : tensor<?x2xf32>
    //   %1 = call @add(%0, %arg0) : (tensor<?x2xf32>, tensor<?x2xf32>) -> tensor<?x2xf32>
    //   iree_input.global.store %1, @param : tensor<?x2xf32>
    //   return %1 : tensor<?x2xf32>
    // }
    func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
    util.global private mutable @param : tensor<?x2xf32>
    func.func @run(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %c0 = arith.constant 0 : index
      %param = util.global.load @param : tensor<?x2xf32>
      %dim = tensor.dim %param, %c0 : tensor<?x2xf32>
      %0 = hal.tensor.export %param : tensor<?x2xf32>{%dim} -> !hal.buffer_view
      %1 = call @add(%0, %arg0) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
      %2 = hal.buffer_view.dim<%1 : !hal.buffer_view>[0] : index
      %3 = hal.tensor.import %1 : !hal.buffer_view -> tensor<?x2xf32>{%2}
      util.global.store %3, @param : tensor<?x2xf32>
      return %1 : !hal.buffer_view
    }

    is converted into:

    func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
    util.global private mutable @param : tensor<?x2xf32>
    util.global private mutable @param__d0 : index
    func.func @run(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %c0 = arith.constant 0 : index
      %param = util.global.load @param : tensor<?x2xf32>
      %param__d0 = util.global.load @param__d0 : index
      %0 = flow.tensor.tie_shape %param : tensor<?x2xf32>{%param__d0}
      %dim = tensor.dim %0, %c0 : tensor<?x2xf32>
      %1 = hal.tensor.export %0 : tensor<?x2xf32>{%dim} -> !hal.buffer_view
      %2 = call @add(%1, %arg0) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
      %3 = hal.buffer_view.dim<%2 : !hal.buffer_view>[0] : index
      %4 = hal.tensor.import %2 : !hal.buffer_view -> tensor<?x2xf32>{%3}
      util.global.store %4, @param : tensor<?x2xf32>
      util.global.store %3, @param__d0 : index
      return %2 : !hal.buffer_view
    }

To be continued...
