IREE Compilation Flow (1)

IREE currently supports taking MHLO or XLA, Torch Tensor, and TOSA as input. A series of passes compiles the input into IREE's own VM bytecode artifact. Hardware-specific code is compiled into corresponding Executables that are stored inside the VM bytecode for the host to invoke. For example, CUDA compute code is lowered to PTX, which IREE's runtime hands to the CUDA runtime to JIT-compile into an executable cubin kernel.

The entry point of IREE compilation is IREEVMTransformPassPipeline, which is further divided into several stages: InputConversionPassPipeline, CommonInputConversionPassPipeline, ABI::TransformPassPipeline, Flow::FlowTransformPassPipeline, Stream::StreamTransformPassPipeline (CUDA backend only), HAL::HALTransformPassPipeline, and VM::VMTransformPassPipeline.

1 InputConversionPassPipeline

Its main job is to lower the different inputs (MHLO or XLA, Torch Tensor, and TOSA) into a common set of dialects: the linalg dialect, the builtin arith dialect, the scf dialect, and the tensor dialect. Taking MHLO input as an example, the passes in InputConversionPassPipeline and their main roles are listed below.

  • mhlo::createLegalizeControlFlowPass

    Normalizes the TF 1.0 control-flow primitives (http://download.tensorflow.org/paper/white_paper_tf_control_flow_implementation_2017_11_1.pdf) into HLO control-flow ops.

  • createTopLevelSCFToCFGPass

    Converts the top-level structured-control-flow representation into a lower-level control-flow graph (CFG) of basic blocks, as sketched below.
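
    A hand-written sketch (not actual pass output; recent MLIR spells the branch ops cf.br/cf.cond_br, older versions std.br/std.cond_br): structured control flow such as

    mlir
    func.func @select(%cond: i1, %a: f32, %b: f32) -> f32 {
      %0 = scf.if %cond -> (f32) {
        scf.yield %a : f32
      } else {
        scf.yield %b : f32
      }
      return %0 : f32
    }

    becomes an explicit CFG of basic blocks:

    mlir
    func.func @select(%cond: i1, %a: f32, %b: f32) -> f32 {
      cf.cond_br %cond, ^bb1, ^bb2
    ^bb1:
      cf.br ^bb3(%a : f32)
    ^bb2:
      cf.br ^bb3(%b : f32)
    ^bb3(%0: f32):
      return %0 : f32
    }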

  • createMHLOToMHLOPreprocessingPass

    Applies a collection of mhlo-to-mhlo preprocessing rewrites that normalize ops into the forms the later conversion passes expect.

  • mlir::createCanonicalizerPass

    Applies the registered canonicalization patterns and folders repeatedly until a fixed point is reached; a hand-written example follows.
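
    mlir
    // Before canonicalization:
    func.func @fold() -> i32 {
      %c1 = arith.constant 1 : i32
      %c2 = arith.constant 2 : i32
      %0 = arith.addi %c1, %c2 : i32
      return %0 : i32
    }

    // After canonicalization: the addi is constant-folded.
    func.func @fold() -> i32 {
      %c3_i32 = arith.constant 3 : i32
      return %c3_i32 : i32
    }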

  • mlir::createShapeToShapeLowering

    shape.num_elements转换成shape.reduce

  • mlir::createConvertShapeToStandardPass

    shape dialect lower成arith dialectscf dialecttensor dialect。比如

    python 复制代码
    func.func @test(%arg0: tensor<1x?xf32>, %arg1: tensor<?xf32>) -> index {
      %c1 = arith.constant 1 : index
      %c0 = arith.constant 0 : index
      %0 = shape.dim %arg0, %c1 : tensor<1x?xf32>, index -> index
      %1 = shape.dim %arg1, %c0 : tensor<?xf32>, index -> index
      %2 = shape.add %0, %1 : index, index -> index
      return %2 : index
    }

    is converted into:

    mlir
    func.func @test(%arg0: tensor<1x?xf32>, %arg1: tensor<?xf32>) -> index {
        %c1 = arith.constant 1 : index
        %c0 = arith.constant 0 : index
        %c1_0 = arith.constant 1 : index
        %c1_1 = arith.constant 1 : index
        %0 = tensor.dim %arg0, %c1_1 : tensor<1x?xf32>
        %1 = tensor.from_elements %c1_0, %0 : tensor<2xindex>
        %2 = tensor.cast %1 : tensor<2xindex> to tensor<2xindex>
        %3 = tensor.dim %arg0, %c1 : tensor<1x?xf32>
        %c0_2 = arith.constant 0 : index
        %4 = tensor.dim %arg1, %c0_2 : tensor<?xf32>
        %5 = tensor.from_elements %4 : tensor<1xindex>
        %6 = tensor.cast %5 : tensor<1xindex> to tensor<1xindex>
        %7 = tensor.dim %arg1, %c0 : tensor<?xf32>
        %8 = arith.addi %3, %7 : index
        return %8 : index
      }
  • mlir::createCanonicalizerPass

  • mlir::createInlinerPass

    Inlines calls to callable operations and deletes dead callables. For example:

    mlir
    func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = call @add(%arg0, %arg1) : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
      return %0 : tensor<1xf32>
    }
    func.func private @add(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = mhlo.add %arg0, %arg1 : tensor<1xf32>
      return %0 : tensor<1xf32>
    }

    After inlining, the private add function is deleted:

    mlir
    func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = mhlo.add %arg0, %arg1 : tensor<1xf32>
      return %0 : tensor<1xf32>
    }
  • IREE::Util::createDemoteI64ToI32Pass

    Demotes i64 types and values in the module to i32.

  • IREE::Util::createDemoteF64ToF32Pass

    Demotes f64 types and values in the module to f32; a sketch of the demotion follows.
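
    A hand-written sketch of the effect of the two demotion passes: every i64/f64 type in the module (including tensor element types) is rewritten to i32/f32.

    mlir
    // Before:
    func.func @test(%arg0: tensor<2xf64>) -> tensor<2xf64> {
      %0 = mhlo.add %arg0, %arg0 : tensor<2xf64>
      return %0 : tensor<2xf64>
    }

    // After DemoteF64ToF32:
    func.func @test(%arg0: tensor<2xf32>) -> tensor<2xf32> {
      %0 = mhlo.add %arg0, %arg0 : tensor<2xf32>
      return %0 : tensor<2xf32>
    }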

  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

    Common subexpression elimination: merges identical side-effect-free ops, as sketched below.
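
    A hand-written example:

    mlir
    // Before CSE: %0 and %1 compute the same value.
    func.func @test(%a: f32, %b: f32) -> f32 {
      %0 = arith.addf %a, %b : f32
      %1 = arith.addf %a, %b : f32
      %2 = arith.mulf %0, %1 : f32
      return %2 : f32
    }

    // After CSE: the duplicate addf is merged.
    func.func @test(%a: f32, %b: f32) -> f32 {
      %0 = arith.addf %a, %b : f32
      %1 = arith.mulf %0, %0 : f32
      return %1 : f32
    }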

  • mhlo::createLegalizeShapeComputationsPass

    scalar tensor op转换成scalar op + fromElements op。比如

    python 复制代码
    func.func @test(%arg0: f32, %arg1: f32) -> tensor<1xf32> {
      %0 = tensor.from_elements %arg0 : tensor<1xf32>
      %1 = tensor.from_elements %arg1 : tensor<1xf32>
      %2 = mhlo.add %0, %1 : tensor<1xf32>
      return %2 : tensor<1xf32>
    }

    is converted into:

    mlir
    func.func @test(%arg0: f32, %arg1: f32) -> tensor<1xf32> {
      %0 = arith.addf %arg0, %arg1 : f32
      %1 = tensor.from_elements %0 : tensor<1xf32>
      return %1 : tensor<1xf32>
    }
  • createConvertMHLOToLinalgExtPass

    mhlo::sortmhlo.scattermhlo.fftmhlo.reversemhlo.topk转换到IREE::LinalgExt dialect,同时将在IREE::LinalgExt dialect区域内部的mhlo op转换成linalg dialectmhlo.return则转换成iree_linalg_ext.yield。比如,

    python 复制代码
    func.func @test(%arg0: tensor<10xf32>) -> tensor<10xf32> {
      %0 = "mhlo.sort"(%arg0) ({
      ^bb0(%arg1: tensor<f32>, %arg2: tensor<f32>):
        %1 = mhlo.compare  GT, %arg1, %arg2 : (tensor<f32>, tensor<f32>) -> tensor<i1>
        mhlo.return %1 : tensor<i1>
      }) {dimension = 0 : i64} : (tensor<10xf32>) -> tensor<10xf32>
      return %0 : tensor<10xf32>
    }

    is converted into:

    mlir
    func.func @test(%arg0: tensor<10xf32>) -> tensor<10xf32> {
      %0 = iree_linalg_ext.sort dimension(0) outs(%arg0 : tensor<10xf32>) {
      ^bb0(%arg1: f32, %arg2: f32):
        %1 = arith.cmpf ogt, %arg1, %arg2 : f32
        iree_linalg_ext.yield %1 : i1
      } -> tensor<10xf32>
      return %0 : tensor<10xf32>
    }
  • createMHLOToLinalgOnTensorsPass

    Converts the remaining outer mhlo ops into the linalg dialect. For example,

    mlir
    func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = mhlo.add %arg0, %arg1 : tensor<1xf32>
      return %0 : tensor<1xf32>
    }

    is converted into:

    mlir
    func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = linalg.init_tensor [1] : tensor<1xf32>
      %1 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<1xf32>, tensor<1xf32>) outs(%0 : tensor<1xf32>) {
      ^bb0(%arg2: f32, %arg3: f32, %arg4: f32):
        %2 = arith.addf %arg2, %arg3 : f32
        linalg.yield %2 : f32
      } -> tensor<1xf32>
      return %1 : tensor<1xf32>
    }
  • mlir::createReconcileUnrealizedCastsPass

    Eliminates unrealized conversion cast ops.

    The algorithm works as follows:

    • If an unrealized conversion cast is dead (it has no users, or all of its users are themselves unrealized conversion casts), the dead node is deleted directly;
    • If it is live (at least one user is not an unrealized conversion cast), its users are traversed; if every reachable unrealized conversion cast has a result type equal to this op's input type (i.e. no real type cast happens end to end), all of the visited unrealized conversion casts are folded into this op's input, otherwise a live unrealized conversion cast error is reported. A minimal example follows the list.
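
    A minimal hand-written example of the folding case:

    mlir
    // Before: the casts perform no real type change end to end (i64 -> i32 -> i64),
    // so the whole chain folds to the original value.
    %0 = builtin.unrealized_conversion_cast %arg0 : i64 to i32
    %1 = builtin.unrealized_conversion_cast %0 : i32 to i64
    "some.use"(%1) : (i64) -> ()

    // After:
    "some.use"(%arg0) : (i64) -> ()
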
  • mlir::createCanonicalizerPass

  • createVerifyCompilerMHLOInputLegality

    Verifies that the program is legal (i.e. that conversion left no illegal ops behind).

2 CommonInputConversionPassPipeline

Its main job is to lower the IREE::Input dialect into the IREE::Util, IREE::Flow, and IREE::HAL dialects. It consists of the following passes:

  • createIREEImportPublicPass

    IREE::Input dialect转换成IREE::UtilIREE::FlowIREE::HAL dialect,并转换func的属性和signature中输入输出类型。比如,

    python 复制代码
    iree_input.global private mutable @param  : tensor<1x2xf32>
    func.func @run(%arg0: tensor<1x2xf32>) {
      %0 = iree_input.global.load @param : tensor<1x2xf32>
      %1 = iree_input.tensor.clone %0 : tensor<1x2xf32>
      iree_input.global.store %1, @param : tensor<1x2xf32>
      return
    }

    is converted into (iree_input.global.load --> util.global.load, iree_input.global.store --> util.global.store, iree_input.tensor.clone --> flow.tensor.clone):

    mlir
    util.global private mutable @param : tensor<1x2xf32>
    func.func @run(%arg0: tensor<1x2xf32>) {
      %param = util.global.load @param : tensor<1x2xf32>
      %0 = flow.tensor.clone %param : tensor<1x2xf32>
      util.global.store %0, @param : tensor<1x2xf32>
      return
    }
  • createImportMLProgramPass

    ml_program dialect转换到IREE::Util dialect

  • createSanitizeModuleNamesPass

    module name中的.替换为_,以符合mlir identifiers的命名规范。

    python 复制代码
    module @iree.module {
      func.func @test(%arg0: f32, %arg1: f32) -> f32 {
        %0 = arith.addf %arg0, %arg1 : f32
        return %0 : f32
      }
    }

    is converted into:

    mlir
    module @iree_module {
      func.func @test(%arg0: f32, %arg1: f32) -> f32 {
        %0 = arith.addf %arg0, %arg1 : f32
        return %0 : f32
      }
    }

3 ABI::TransformPassPipeline

Its main job is to normalize the parameters of externally imported interfaces and of the interfaces this module exports so that they use standard scalar types or the hal.buffer_view type (hal.buffer_view corresponds to a tensor).

  • createWrapEntryPointsPass

    For each external func, generates an internal function that calls the original external func; at the same time, wraps the body of each public func into a new function that the original public func then calls. The ultimate goal of this pass is to normalize the parameters of imported and exported interfaces to standard scalar types or hal.buffer_view (hal.buffer_view corresponds to the tensor type).

    mlir
    // external/imported func
    func.func private @add(tensor<f32>, tensor<f32>) -> tensor<f32>
    
    // public/exported func
    func.func @test(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> {
      %0 = call @add(%arg0, %arg1) : (tensor<f32>, tensor<f32>) -> tensor<f32>
      return %0 : tensor<f32>
    }

    is converted into:

    mlir
    func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
    func.func private @_add(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> {
      %0 = hal.tensor.export %arg0 : tensor<f32> -> !hal.buffer_view
      %1 = hal.tensor.export %arg1 : tensor<f32> -> !hal.buffer_view
      %2 = call @add(%0, %1) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
      %3 = hal.tensor.import %2 : !hal.buffer_view -> tensor<f32>
      return %3 : tensor<f32>
    }
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<f32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<f32>
      %2 = call @_test(%0, %1) : (tensor<f32>, tensor<f32>) -> tensor<f32>
      %3 = hal.tensor.export %2 : tensor<f32> -> !hal.buffer_view
      return %3 : !hal.buffer_view
    }
    func.func private @_test(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> {
      %0 = call @_add(%arg0, %arg1) : (tensor<f32>, tensor<f32>) -> tensor<f32>
      return %0 : tensor<f32>
    }
  • mlir::createInlinerPass

    WrapEntryPointsPass中生成的 wrap 函数内联起来。最终转换成,

    python 复制代码
    func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = call @add(%arg0, %arg1) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
      return %0 : !hal.buffer_view
    }
  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

  • mlir::createSymbolDCEPass

    Deletes dead symbols, e.g. private functions that are never referenced, as sketched below.
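
    A hand-written example:

    mlir
    // Before: @unused is private and never referenced.
    func.func private @unused(%arg0: f32) -> f32 {
      return %arg0 : f32
    }
    func.func @main(%arg0: f32) -> f32 {
      return %arg0 : f32
    }

    // After SymbolDCE: @unused is deleted.
    func.func @main(%arg0: f32) -> f32 {
      return %arg0 : f32
    }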

4 Flow::FlowTransformPassPipeline

Its main job is to run a series of peephole optimizations, such as converting a 1x1 conv2d into matmul, tiling, and op fusion, finally splitting the workload into flow.executable ops.

  • IREE::Util::createDemoteF64ToF32Pass

    Narrows f64 types to f32.

  • IREE::Flow::createConvertConv2D1x1ToMatmulPass

    Converts a 1x1 linalg.conv_2d_nhwc_hwcf into linalg.matmul.

    mlir
    // func.func @conv(%input : tensor<1x2x2x3xf32>, %filter: tensor<1x1x3x4xf32>) -> tensor<1x2x2x4xf32> {
    //   %0 = mhlo.convolution(%input, %filter)
    //             dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
    //             window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
    //             {batch_group_count = 1 : i64, feature_group_count = 1 : i64}
    //           : (tensor<1x2x2x3xf32>, tensor<1x1x3x4xf32>) -> tensor<1x2x2x4xf32>
    //   return %0 : tensor<1x2x2x4xf32>
    // }
    func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x2x2x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<1x1x3x4xf32>
      %2 = linalg.init_tensor [1, 2, 2, 4] : tensor<1x2x2x4xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32>
      %4 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x2x2x3xf32>, tensor<1x1x3x4xf32>) outs(%3 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32>
      %5 = hal.tensor.export %4 : tensor<1x2x2x4xf32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }

    is converted into:

    mlir
    func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x2x2x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<1x1x3x4xf32>
      %2 = linalg.init_tensor [1, 2, 2, 4] : tensor<1x2x2x4xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32>
      %4 = tensor.collapse_shape %0 [[0, 1, 2], [3]] : tensor<1x2x2x3xf32> into tensor<4x3xf32>
      %5 = tensor.collapse_shape %1 [[0, 1, 2], [3]] : tensor<1x1x3x4xf32> into tensor<3x4xf32>
      %6 = tensor.collapse_shape %3 [[0, 1, 2], [3]] : tensor<1x2x2x4xf32> into tensor<4x4xf32>
      %7 = linalg.matmul ins(%4, %5 : tensor<4x3xf32>, tensor<3x4xf32>) outs(%6 : tensor<4x4xf32>) -> tensor<4x4xf32>
      %8 = tensor.expand_shape %7 [[0, 1, 2], [3]] : tensor<4x4xf32> into tensor<1x2x2x4xf32>
      %9 = hal.tensor.export %8 : tensor<1x2x2x4xf32> -> !hal.buffer_view
      return %9 : !hal.buffer_view
    }
  • IREE::Flow::createConvertConv2DToImg2ColPass

    conv2d转换成img2col。默认不开启。

    python 复制代码
    // %0 = mhlo.convolution(%input, %filter)
    //               dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
    //               window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
    //               {batch_group_count = 1 : i64, feature_group_count = 1 : i64}
    //             : (tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) -> tensor<1x3x3x4xf32>
    func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
      %2 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %4 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %5 = hal.tensor.export %4 : tensor<1x3x3x4xf32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }

    is converted into:

    mlir
    func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
      %2 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %4 = linalg.init_tensor [1, 3, 3, 2, 2, 3] : tensor<1x3x3x2x2x3xf32>
      %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1 + d3, d2 + d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%0 : tensor<1x4x4x3xf32>) outs(%4 : tensor<1x3x3x2x2x3xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<1x3x3x2x2x3xf32>
      %6 = tensor.collapse_shape %5 [[0, 1, 2], [3, 4, 5]] : tensor<1x3x3x2x2x3xf32> into tensor<9x12xf32>
      %7 = tensor.collapse_shape %1 [[0, 1, 2], [3]] : tensor<2x2x3x4xf32> into tensor<12x4xf32>
      %8 = tensor.collapse_shape %3 [[0, 1, 2], [3]] : tensor<1x3x3x4xf32> into tensor<9x4xf32>
      %9 = linalg.matmul ins(%6, %7 : tensor<9x12xf32>, tensor<12x4xf32>) outs(%8 : tensor<9x4xf32>) -> tensor<9x4xf32>
      %10 = tensor.expand_shape %9 [[0, 1, 2], [3]] : tensor<9x4xf32> into tensor<1x3x3x4xf32>
      %11 = hal.tensor.export %10 : tensor<1x3x3x4xf32> -> !hal.buffer_view
      return %11 : !hal.buffer_view
    }
  • IREE::Flow::createDetachElementwiseFromNamedOpsPass

    buffer = linalg.generic_op + linalg.named_payload_op转换成tmp_buffer = linalg.named_payload_op; buffer = linalg.generic_op + tmp_buffer,主要目的是将上游的generic opnamed_payload_op分隔开,使得named_payload_op的结果写到一块新的 buffer。

    python 复制代码
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
      %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<1x3x3x4xf32>
      
      %3 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x3x3x4xf32>) outs(%4 : tensor<1x3x3x4xf32>) {
      ^bb0(%arg3: f32, %arg4: f32):
        %8 = arith.addf %arg3, %arg3 : f32
        linalg.yield %8 : f32
      } -> tensor<1x3x3x4xf32>
      
      %6 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%5 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %7 = hal.tensor.export %6 : tensor<1x3x3x4xf32> -> !hal.buffer_view
      return %7 : !hal.buffer_view
    }

    is converted into:

    mlir
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
      %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<1x3x3x4xf32>
      
      %3 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x3x3x4xf32>) outs(%4 : tensor<1x3x3x4xf32>) {
      ^bb0(%arg3: f32, %arg4: f32):
        %11 = arith.addf %arg3, %arg3 : f32
        linalg.yield %11 : f32
      } -> tensor<1x3x3x4xf32>
      
      %6 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %7 = linalg.fill ins(%cst : f32) outs(%6 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %8 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%7 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
    
      %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%8, %5 : tensor<1x3x3x4xf32>, tensor<1x3x3x4xf32>) outs(%7 : tensor<1x3x3x4xf32>) {
      ^bb0(%arg3: f32, %arg4: f32, %arg5: f32):
        %11 = arith.addf %arg3, %arg4 : f32
        linalg.yield %11 : f32
      } -> tensor<1x3x3x4xf32>
      %10 = hal.tensor.export %9 : tensor<1x3x3x4xf32> -> !hal.buffer_view
      return %10 : !hal.buffer_view
    }
  • IREE::Flow::createVerifyInputLegalityPass

    Verifies that the program is legal.

  • IREE::Flow::createConvertLinalgMatmulToMmt4DPass

    Tiles a 2-D linalg.matmul into linalg.mmt4d. Disabled by default; it can be enabled with the --iree-flow-mmt4d-target-options="enable_generic_slow arch=cuda" option.

    mlir
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
      %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %4 = linalg.matmul ins(%0, %1 : tensor<128x256xf32>, tensor<256x256xf32>) outs(%3 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %5 = hal.tensor.export %4 : tensor<128x256xf32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }

    is converted into:

    mlir
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
      %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %4 = tensor.expand_shape %0 [[0, 1], [2, 3]] : tensor<128x256xf32> into tensor<16x8x128x2xf32>
      %5 = tensor.expand_shape %1 [[0, 1], [2, 3]] : tensor<256x256xf32> into tensor<128x2x64x4xf32>
      %6 = tensor.expand_shape %3 [[0, 1], [2, 3]] : tensor<128x256xf32> into tensor<16x8x64x4xf32>
      %7 = linalg.init_tensor [16, 128, 8, 2] : tensor<16x128x8x2xf32>
      %8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%4 : tensor<16x8x128x2xf32>) outs(%7 : tensor<16x128x8x2xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<16x128x8x2xf32>
      %9 = linalg.init_tensor [64, 128, 4, 2] : tensor<64x128x4x2xf32>
      %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d1, d3, d0, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%5 : tensor<128x2x64x4xf32>) outs(%9 : tensor<64x128x4x2xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<64x128x4x2xf32>
      %11 = linalg.init_tensor [16, 64, 8, 4] : tensor<16x64x8x4xf32>
      %12 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%6 : tensor<16x8x64x4xf32>) outs(%11 : tensor<16x64x8x4xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<16x64x8x4xf32>
      // 16 x (128x8x2) @ 64 x (128x4x2) => 16 x 64 x sum_{128}(8x2 * (4x2)^T)
      %13 = linalg.mmt4d {comment = "generic tiling parameters, as no known kernel was matched for this matmul and target"} ins(%8, %10 : tensor<16x128x8x2xf32>, tensor<64x128x4x2xf32>) outs(%12 : tensor<16x64x8x4xf32>) -> tensor<16x64x8x4xf32>
      %14 = linalg.init_tensor [16, 8, 64, 4] : tensor<16x8x64x4xf32>
      %15 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%13 : tensor<16x64x8x4xf32>) outs(%14 : tensor<16x8x64x4xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<16x8x64x4xf32>
      %16 = tensor.collapse_shape %15 [[0, 1], [2, 3]] : tensor<16x8x64x4xf32> into tensor<128x256xf32>
      %17 = hal.tensor.export %16 : tensor<128x256xf32> -> !hal.buffer_view
      return %17 : !hal.buffer_view
    }
  • IREE::Flow::createPadLinalgOpsToIntegerMultiplePass

    Pads the M, N, and K dimensions of a matmul up to an integer multiple of paddingSize, which defaults to 4. A sketch of the idea follows.
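
    A hand-written sketch of the idea on a 3x5 @ 5x2 matmul (M=3, N=2, K=5): every dimension is zero-padded up to the next multiple of 4, the matmul runs on the padded operands, and the original 3x2 result is sliced back out. The op spelling (tensor.pad vs. the older linalg.pad_tensor) depends on the MLIR version.

    mlir
    func.func @pad_matmul(%a: tensor<3x5xf32>, %b: tensor<5x2xf32>,
                          %c: tensor<3x2xf32>) -> tensor<3x2xf32> {
      %cst = arith.constant 0.000000e+00 : f32
      %a_pad = tensor.pad %a low[0, 0] high[1, 3] {
      ^bb0(%i: index, %j: index):
        tensor.yield %cst : f32
      } : tensor<3x5xf32> to tensor<4x8xf32>
      %b_pad = tensor.pad %b low[0, 0] high[3, 2] {
      ^bb0(%i: index, %j: index):
        tensor.yield %cst : f32
      } : tensor<5x2xf32> to tensor<8x4xf32>
      %c_pad = tensor.pad %c low[0, 0] high[1, 2] {
      ^bb0(%i: index, %j: index):
        tensor.yield %cst : f32
      } : tensor<3x2xf32> to tensor<4x4xf32>
      %mm = linalg.matmul ins(%a_pad, %b_pad : tensor<4x8xf32>, tensor<8x4xf32>)
                          outs(%c_pad : tensor<4x4xf32>) -> tensor<4x4xf32>
      %res = tensor.extract_slice %mm[0, 0] [3, 2] [1, 1] : tensor<4x4xf32> to tensor<3x2xf32>
      return %res : tensor<3x2xf32>
    }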

  • mlir::createLinalgNamedOpConversionPass

    depth_multiplier=1linalg.depthwise_conv_2d_nhwc_hwcm转换成linalg.depthwise_conv_2d_nhwc_hwc,将depth_multiplier=1linalg.depthwise_conv_2d_nhwc_hwcm_q转换成linalg.depthwise_conv_2d_nhwc_hwc_q

    See https://www.tensorflow.org/api_docs/python/tf/keras/layers/DepthwiseConv2D for what depth_multiplier does.
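
    A hand-written sketch of the rewrite (shapes chosen for illustration): when the trailing depth-multiplier dimension of the filter and result is 1, it is collapsed away and the simpler hwc variant is used.

    mlir
    // Before: filter is HxWxCxM with M = 1, result is NxHxWxCxM.
    %0 = linalg.depthwise_conv_2d_nhwc_hwcm
           {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
           ins(%input, %filter : tensor<1x4x4x3xf32>, tensor<2x2x3x1xf32>)
           outs(%init : tensor<1x3x3x3x1xf32>) -> tensor<1x3x3x3x1xf32>

    // After (sketch): the unit dimension is collapsed and the hwc variant is used.
    %f = tensor.collapse_shape %filter [[0], [1], [2, 3]]
           : tensor<2x2x3x1xf32> into tensor<2x2x3xf32>
    %o = tensor.collapse_shape %init [[0], [1], [2], [3, 4]]
           : tensor<1x3x3x3x1xf32> into tensor<1x3x3x3xf32>
    %1 = linalg.depthwise_conv_2d_nhwc_hwc
           {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
           ins(%input, %f : tensor<1x4x4x3xf32>, tensor<2x2x3xf32>)
           outs(%o : tensor<1x3x3x3xf32>) -> tensor<1x3x3x3xf32>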

  • IREE::Flow::createExpandTensorShapesPass

    Expands a dynamic tensor into a tensor + dynamic dims pair; one benefit of doing so is that the dynamic dimensions can participate directly in computation and inference. For example,

    mlir
    // func.func private @add(%arg0 : tensor<?x2xf32>, %arg1 : tensor<?x2xf32>) -> tensor<?x2xf32>
    // iree_input.global private mutable @param : tensor<?x2xf32>
    // func.func @run(%arg0 : tensor<?x2xf32>) -> tensor<?x2xf32> {
    //   %0 = iree_input.global.load @param : tensor<?x2xf32>
    //   %1 = call @add(%0, %arg0) : (tensor<?x2xf32>, tensor<?x2xf32>) -> tensor<?x2xf32>
    //   iree_input.global.store %1, @param : tensor<?x2xf32>
    //   return %1 : tensor<?x2xf32>
    // }
    func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
    util.global private mutable @param : tensor<?x2xf32>
    func.func @run(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %c0 = arith.constant 0 : index
      %param = util.global.load @param : tensor<?x2xf32>
      %dim = tensor.dim %param, %c0 : tensor<?x2xf32>
      %0 = hal.tensor.export %param : tensor<?x2xf32>{%dim} -> !hal.buffer_view
      %1 = call @add(%0, %arg0) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
      %2 = hal.buffer_view.dim<%1 : !hal.buffer_view>[0] : index
      %3 = hal.tensor.import %1 : !hal.buffer_view -> tensor<?x2xf32>{%2}
      util.global.store %3, @param : tensor<?x2xf32>
      return %1 : !hal.buffer_view
    }

    is converted into:

    mlir
    func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
    util.global private mutable @param : tensor<?x2xf32>
    util.global private mutable @param__d0 : index
    func.func @run(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %c0 = arith.constant 0 : index
      %param = util.global.load @param : tensor<?x2xf32>
      %param__d0 = util.global.load @param__d0 : index
      %0 = flow.tensor.tie_shape %param : tensor<?x2xf32>{%param__d0}
      %dim = tensor.dim %0, %c0 : tensor<?x2xf32>
      %1 = hal.tensor.export %0 : tensor<?x2xf32>{%dim} -> !hal.buffer_view
      %2 = call @add(%1, %arg0) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
      %3 = hal.buffer_view.dim<%2 : !hal.buffer_view>[0] : index
      %4 = hal.tensor.import %2 : !hal.buffer_view -> tensor<?x2xf32>{%3}
      util.global.store %4, @param : tensor<?x2xf32>
      util.global.store %3, @param__d0 : index
      return %2 : !hal.buffer_view
    }

To be continued...
