IREE Compilation Flow (1)

IREE currently supports taking MHLO or XLA, Torch Tensor, and TOSA as input. A series of passes compiles the input into IREE's own VM bytecode artifact. Hardware-specific code is compiled into corresponding Executables that are stored inside the VM bytecode for the host to invoke. For example, CUDA compute code is lowered to PTX, which IREE's runtime hands to the CUDA runtime to JIT-compile into an executable cubin kernel.

The entry point of IREE compilation is IREEVMTransformPassPipeline, which is further divided into several stages: InputConversionPassPipeline, CommonInputConversionPassPipeline, ABI::TransformPassPipeline, Flow::FlowTransformPassPipeline, Stream::StreamTransformPassPipeline (CUDA backend only), HAL::HALTransformPassPipeline, and VM::VMTransformPassPipeline.

1 InputConversionPassPipeline

Its main job is to lower the different inputs (MHLO or XLA, Torch Tensor, and TOSA) into a common set of dialects: the linalg dialect, the builtin arith dialect, the scf dialect, and the tensor dialect. Taking MHLO input as an example, the passes in InputConversionPassPipeline and their main roles are listed below.

  • mhlo::createLegalizeControlFlowPass

    Normalizes the TF 1.0 control-flow primitives (http://download.tensorflow.org/paper/white_paper_tf_control_flow_implementation_2017_11_1.pdf) into HLO control-flow ops.

  • createTopLevelSCFToCFGPass

    Converts the top-level structured-control-flow representation into a lower-level control-flow graph (CFG) of basic blocks, as sketched below.
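
    A hand-written sketch (not actual pass output; recent MLIR spells the branch ops cf.br/cf.cond_br, older versions std.br/std.cond_br): structured control flow such as

    mlir
    func.func @select(%cond: i1, %a: f32, %b: f32) -> f32 {
      %0 = scf.if %cond -> (f32) {
        scf.yield %a : f32
      } else {
        scf.yield %b : f32
      }
      return %0 : f32
    }

    becomes an explicit CFG of basic blocks:

    mlir
    func.func @select(%cond: i1, %a: f32, %b: f32) -> f32 {
      cf.cond_br %cond, ^bb1, ^bb2
    ^bb1:
      cf.br ^bb3(%a : f32)
    ^bb2:
      cf.br ^bb3(%b : f32)
    ^bb3(%0: f32):
      return %0 : f32
    }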

  • createMHLOToMHLOPreprocessingPass

    Applies a collection of mhlo-to-mhlo preprocessing rewrites that normalize ops into the forms the later conversion passes expect.

  • mlir::createCanonicalizerPass

    Applies the registered canonicalization patterns and folders repeatedly until a fixed point is reached; a hand-written example follows.
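
    mlir
    // Before canonicalization:
    func.func @fold() -> i32 {
      %c1 = arith.constant 1 : i32
      %c2 = arith.constant 2 : i32
      %0 = arith.addi %c1, %c2 : i32
      return %0 : i32
    }

    // After canonicalization: the addi is constant-folded.
    func.func @fold() -> i32 {
      %c3_i32 = arith.constant 3 : i32
      return %c3_i32 : i32
    }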

  • mlir::createShapeToShapeLowering

    shape.num_elements转换成shape.reduce

  • mlir::createConvertShapeToStandardPass

    shape dialect lower成arith dialectscf dialecttensor dialect。比如

    python 复制代码
    func.func @test(%arg0: tensor<1x?xf32>, %arg1: tensor<?xf32>) -> index {
      %c1 = arith.constant 1 : index
      %c0 = arith.constant 0 : index
      %0 = shape.dim %arg0, %c1 : tensor<1x?xf32>, index -> index
      %1 = shape.dim %arg1, %c0 : tensor<?xf32>, index -> index
      %2 = shape.add %0, %1 : index, index -> index
      return %2 : index
    }

    is converted into:

    mlir
    func.func @test(%arg0: tensor<1x?xf32>, %arg1: tensor<?xf32>) -> index {
        %c1 = arith.constant 1 : index
        %c0 = arith.constant 0 : index
        %c1_0 = arith.constant 1 : index
        %c1_1 = arith.constant 1 : index
        %0 = tensor.dim %arg0, %c1_1 : tensor<1x?xf32>
        %1 = tensor.from_elements %c1_0, %0 : tensor<2xindex>
        %2 = tensor.cast %1 : tensor<2xindex> to tensor<2xindex>
        %3 = tensor.dim %arg0, %c1 : tensor<1x?xf32>
        %c0_2 = arith.constant 0 : index
        %4 = tensor.dim %arg1, %c0_2 : tensor<?xf32>
        %5 = tensor.from_elements %4 : tensor<1xindex>
        %6 = tensor.cast %5 : tensor<1xindex> to tensor<1xindex>
        %7 = tensor.dim %arg1, %c0 : tensor<?xf32>
        %8 = arith.addi %3, %7 : index
        return %8 : index
      }
  • mlir::createCanonicalizerPass

  • mlir::createInlinerPass

    Inlines calls to callable operations and deletes dead callables. For example:

    mlir
    func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = call @add(%arg0, %arg1) : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
      return %0 : tensor<1xf32>
    }
    func.func private @add(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = mhlo.add %arg0, %arg1 : tensor<1xf32>
      return %0 : tensor<1xf32>
    }

    After inlining, the private add function is deleted:

    mlir
    func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = mhlo.add %arg0, %arg1 : tensor<1xf32>
      return %0 : tensor<1xf32>
    }
  • IREE::Util::createDemoteI64ToI32Pass

    Demotes i64 types and values in the module to i32.

  • IREE::Util::createDemoteF64ToF32Pass

    Demotes f64 types and values in the module to f32; a sketch of the demotion follows.
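
    A hand-written sketch of the effect of the two demotion passes: every i64/f64 type in the module (including tensor element types) is rewritten to i32/f32.

    mlir
    // Before:
    func.func @test(%arg0: tensor<2xf64>) -> tensor<2xf64> {
      %0 = mhlo.add %arg0, %arg0 : tensor<2xf64>
      return %0 : tensor<2xf64>
    }

    // After DemoteF64ToF32:
    func.func @test(%arg0: tensor<2xf32>) -> tensor<2xf32> {
      %0 = mhlo.add %arg0, %arg0 : tensor<2xf32>
      return %0 : tensor<2xf32>
    }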

  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

    Common subexpression elimination: merges identical side-effect-free ops, as sketched below.
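
    A hand-written example:

    mlir
    // Before CSE: %0 and %1 compute the same value.
    func.func @test(%a: f32, %b: f32) -> f32 {
      %0 = arith.addf %a, %b : f32
      %1 = arith.addf %a, %b : f32
      %2 = arith.mulf %0, %1 : f32
      return %2 : f32
    }

    // After CSE: the duplicate addf is merged.
    func.func @test(%a: f32, %b: f32) -> f32 {
      %0 = arith.addf %a, %b : f32
      %1 = arith.mulf %0, %0 : f32
      return %1 : f32
    }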

  • mhlo::createLegalizeShapeComputationsPass

    scalar tensor op转换成scalar op + fromElements op。比如

    python 复制代码
    func.func @test(%arg0: f32, %arg1: f32) -> tensor<1xf32> {
      %0 = tensor.from_elements %arg0 : tensor<1xf32>
      %1 = tensor.from_elements %arg1 : tensor<1xf32>
      %2 = mhlo.add %0, %1 : tensor<1xf32>
      return %2 : tensor<1xf32>
    }

    is converted into:

    mlir
    func.func @test(%arg0: f32, %arg1: f32) -> tensor<1xf32> {
      %0 = arith.addf %arg0, %arg1 : f32
      %1 = tensor.from_elements %0 : tensor<1xf32>
      return %1 : tensor<1xf32>
    }
  • createConvertMHLOToLinalgExtPass

    mhlo::sortmhlo.scattermhlo.fftmhlo.reversemhlo.topk转换到IREE::LinalgExt dialect,同时将在IREE::LinalgExt dialect区域内部的mhlo op转换成linalg dialectmhlo.return则转换成iree_linalg_ext.yield。比如,

    python 复制代码
    func.func @test(%arg0: tensor<10xf32>) -> tensor<10xf32> {
      %0 = "mhlo.sort"(%arg0) ({
      ^bb0(%arg1: tensor<f32>, %arg2: tensor<f32>):
        %1 = mhlo.compare  GT, %arg1, %arg2 : (tensor<f32>, tensor<f32>) -> tensor<i1>
        mhlo.return %1 : tensor<i1>
      }) {dimension = 0 : i64} : (tensor<10xf32>) -> tensor<10xf32>
      return %0 : tensor<10xf32>
    }

    is converted into:

    mlir
    func.func @test(%arg0: tensor<10xf32>) -> tensor<10xf32> {
      %0 = iree_linalg_ext.sort dimension(0) outs(%arg0 : tensor<10xf32>) {
      ^bb0(%arg1: f32, %arg2: f32):
        %1 = arith.cmpf ogt, %arg1, %arg2 : f32
        iree_linalg_ext.yield %1 : i1
      } -> tensor<10xf32>
      return %0 : tensor<10xf32>
    }
  • createMHLOToLinalgOnTensorsPass

    Converts the remaining outer mhlo ops into the linalg dialect. For example,

    mlir
    func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = mhlo.add %arg0, %arg1 : tensor<1xf32>
      return %0 : tensor<1xf32>
    }

    is converted into:

    mlir
    func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
      %0 = linalg.init_tensor [1] : tensor<1xf32>
      %1 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<1xf32>, tensor<1xf32>) outs(%0 : tensor<1xf32>) {
      ^bb0(%arg2: f32, %arg3: f32, %arg4: f32):
        %2 = arith.addf %arg2, %arg3 : f32
        linalg.yield %2 : f32
      } -> tensor<1xf32>
      return %1 : tensor<1xf32>
    }
  • mlir::createReconcileUnrealizedCastsPass

    Eliminates unrealized conversion cast ops.

    The algorithm works as follows:

    • If an unrealized conversion cast is dead (it has no users, or all of its users are themselves unrealized conversion casts), the dead node is deleted directly;
    • If it is live (at least one user is not an unrealized conversion cast), its users are traversed; if every reachable unrealized conversion cast has a result type equal to this op's input type (i.e. no real type cast happens end to end), all of the visited unrealized conversion casts are folded into this op's input, otherwise a live unrealized conversion cast error is reported. A minimal example follows the list.
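
    A minimal hand-written example of the folding case:

    mlir
    // Before: the casts perform no real type change end to end (i64 -> i32 -> i64),
    // so the whole chain folds to the original value.
    %0 = builtin.unrealized_conversion_cast %arg0 : i64 to i32
    %1 = builtin.unrealized_conversion_cast %0 : i32 to i64
    "some.use"(%1) : (i64) -> ()

    // After:
    "some.use"(%arg0) : (i64) -> ()
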
  • mlir::createCanonicalizerPass

  • createVerifyCompilerMHLOInputLegality

    Verifies that the program is legal (i.e. that conversion left no illegal ops behind).

2 CommonInputConversionPassPipeline

Its main job is to lower the IREE::Input dialect into the IREE::Util, IREE::Flow, and IREE::HAL dialects. It consists of the following passes:

  • createIREEImportPublicPass

    IREE::Input dialect转换成IREE::UtilIREE::FlowIREE::HAL dialect,并转换func的属性和signature中输入输出类型。比如,

    python 复制代码
    iree_input.global private mutable @param  : tensor<1x2xf32>
    func.func @run(%arg0: tensor<1x2xf32>) {
      %0 = iree_input.global.load @param : tensor<1x2xf32>
      %1 = iree_input.tensor.clone %0 : tensor<1x2xf32>
      iree_input.global.store %1, @param : tensor<1x2xf32>
      return
    }

    is converted into (iree_input.global.load --> util.global.load, iree_input.global.store --> util.global.store, iree_input.tensor.clone --> flow.tensor.clone):

    mlir
    util.global private mutable @param : tensor<1x2xf32>
    func.func @run(%arg0: tensor<1x2xf32>) {
      %param = util.global.load @param : tensor<1x2xf32>
      %0 = flow.tensor.clone %param : tensor<1x2xf32>
      util.global.store %0, @param : tensor<1x2xf32>
      return
    }
  • createImportMLProgramPass

    ml_program dialect转换到IREE::Util dialect

  • createSanitizeModuleNamesPass

    module name中的.替换为_,以符合mlir identifiers的命名规范。

    python 复制代码
    module @iree.module {
      func.func @test(%arg0: f32, %arg1: f32) -> f32 {
        %0 = arith.addf %arg0, %arg1 : f32
        return %0 : f32
      }
    }

    is converted into:

    mlir
    module @iree_module {
      func.func @test(%arg0: f32, %arg1: f32) -> f32 {
        %0 = arith.addf %arg0, %arg1 : f32
        return %0 : f32
      }
    }

3 ABI::TransformPassPipeline

Its main job is to normalize the parameters of externally imported interfaces and of the interfaces this module exports so that they use standard scalar types or the hal.buffer_view type (hal.buffer_view corresponds to a tensor).

  • createWrapEntryPointsPass

    For each external func, generates an internal function that calls the original external func; at the same time, wraps the body of each public func into a new function that the original public func then calls. The ultimate goal of this pass is to normalize the parameters of imported and exported interfaces to standard scalar types or hal.buffer_view (hal.buffer_view corresponds to the tensor type).

    mlir
    // external/imported func
    func.func private @add(tensor<f32>, tensor<f32>) -> tensor<f32>
    
    // public/exported func
    func.func @test(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> {
      %0 = call @add(%arg0, %arg1) : (tensor<f32>, tensor<f32>) -> tensor<f32>
      return %0 : tensor<f32>
    }

    is converted into:

    mlir
    func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
    func.func private @_add(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> {
      %0 = hal.tensor.export %arg0 : tensor<f32> -> !hal.buffer_view
      %1 = hal.tensor.export %arg1 : tensor<f32> -> !hal.buffer_view
      %2 = call @add(%0, %1) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
      %3 = hal.tensor.import %2 : !hal.buffer_view -> tensor<f32>
      return %3 : tensor<f32>
    }
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<f32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<f32>
      %2 = call @_test(%0, %1) : (tensor<f32>, tensor<f32>) -> tensor<f32>
      %3 = hal.tensor.export %2 : tensor<f32> -> !hal.buffer_view
      return %3 : !hal.buffer_view
    }
    func.func private @_test(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> {
      %0 = call @_add(%arg0, %arg1) : (tensor<f32>, tensor<f32>) -> tensor<f32>
      return %0 : tensor<f32>
    }
  • mlir::createInlinerPass

    WrapEntryPointsPass中生成的 wrap 函数内联起来。最终转换成,

    python 复制代码
    func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = call @add(%arg0, %arg1) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
      return %0 : !hal.buffer_view
    }
  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

  • mlir::createSymbolDCEPass

    Deletes dead symbols, e.g. private functions that are never referenced, as sketched below.
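
    A hand-written example:

    mlir
    // Before: @unused is private and never referenced.
    func.func private @unused(%arg0: f32) -> f32 {
      return %arg0 : f32
    }
    func.func @main(%arg0: f32) -> f32 {
      return %arg0 : f32
    }

    // After SymbolDCE: @unused is deleted.
    func.func @main(%arg0: f32) -> f32 {
      return %arg0 : f32
    }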

4 Flow::FlowTransformPassPipeline

Its main job is to run a series of peephole optimizations, such as converting a 1x1 conv2d into matmul, tiling, and op fusion, finally splitting the workload into flow.executable ops.

  • IREE::Util::createDemoteF64ToF32Pass

    Narrows f64 types to f32.

  • IREE::Flow::createConvertConv2D1x1ToMatmulPass

    Converts a 1x1 linalg.conv_2d_nhwc_hwcf into linalg.matmul.

    mlir
    // func.func @conv(%input : tensor<1x2x2x3xf32>, %filter: tensor<1x1x3x4xf32>) -> tensor<1x2x2x4xf32> {
    //   %0 = mhlo.convolution(%input, %filter)
    //             dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
    //             window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
    //             {batch_group_count = 1 : i64, feature_group_count = 1 : i64}
    //           : (tensor<1x2x2x3xf32>, tensor<1x1x3x4xf32>) -> tensor<1x2x2x4xf32>
    //   return %0 : tensor<1x2x2x4xf32>
    // }
    func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x2x2x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<1x1x3x4xf32>
      %2 = linalg.init_tensor [1, 2, 2, 4] : tensor<1x2x2x4xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32>
      %4 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x2x2x3xf32>, tensor<1x1x3x4xf32>) outs(%3 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32>
      %5 = hal.tensor.export %4 : tensor<1x2x2x4xf32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }

    is converted into:

    mlir
    func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x2x2x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<1x1x3x4xf32>
      %2 = linalg.init_tensor [1, 2, 2, 4] : tensor<1x2x2x4xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32>
      %4 = tensor.collapse_shape %0 [[0, 1, 2], [3]] : tensor<1x2x2x3xf32> into tensor<4x3xf32>
      %5 = tensor.collapse_shape %1 [[0, 1, 2], [3]] : tensor<1x1x3x4xf32> into tensor<3x4xf32>
      %6 = tensor.collapse_shape %3 [[0, 1, 2], [3]] : tensor<1x2x2x4xf32> into tensor<4x4xf32>
      %7 = linalg.matmul ins(%4, %5 : tensor<4x3xf32>, tensor<3x4xf32>) outs(%6 : tensor<4x4xf32>) -> tensor<4x4xf32>
      %8 = tensor.expand_shape %7 [[0, 1, 2], [3]] : tensor<4x4xf32> into tensor<1x2x2x4xf32>
      %9 = hal.tensor.export %8 : tensor<1x2x2x4xf32> -> !hal.buffer_view
      return %9 : !hal.buffer_view
    }
  • IREE::Flow::createConvertConv2DToImg2ColPass

    conv2d转换成img2col。默认不开启。

    python 复制代码
    // %0 = mhlo.convolution(%input, %filter)
    //               dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
    //               window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
    //               {batch_group_count = 1 : i64, feature_group_count = 1 : i64}
    //             : (tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) -> tensor<1x3x3x4xf32>
    func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
      %2 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %4 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %5 = hal.tensor.export %4 : tensor<1x3x3x4xf32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }

    is converted into:

    mlir
    func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
      %2 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %4 = linalg.init_tensor [1, 3, 3, 2, 2, 3] : tensor<1x3x3x2x2x3xf32>
      %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1 + d3, d2 + d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%0 : tensor<1x4x4x3xf32>) outs(%4 : tensor<1x3x3x2x2x3xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<1x3x3x2x2x3xf32>
      %6 = tensor.collapse_shape %5 [[0, 1, 2], [3, 4, 5]] : tensor<1x3x3x2x2x3xf32> into tensor<9x12xf32>
      %7 = tensor.collapse_shape %1 [[0, 1, 2], [3]] : tensor<2x2x3x4xf32> into tensor<12x4xf32>
      %8 = tensor.collapse_shape %3 [[0, 1, 2], [3]] : tensor<1x3x3x4xf32> into tensor<9x4xf32>
      %9 = linalg.matmul ins(%6, %7 : tensor<9x12xf32>, tensor<12x4xf32>) outs(%8 : tensor<9x4xf32>) -> tensor<9x4xf32>
      %10 = tensor.expand_shape %9 [[0, 1, 2], [3]] : tensor<9x4xf32> into tensor<1x3x3x4xf32>
      %11 = hal.tensor.export %10 : tensor<1x3x3x4xf32> -> !hal.buffer_view
      return %11 : !hal.buffer_view
    }
  • IREE::Flow::createDetachElementwiseFromNamedOpsPass

    buffer = linalg.generic_op + linalg.named_payload_op转换成tmp_buffer = linalg.named_payload_op; buffer = linalg.generic_op + tmp_buffer,主要目的是将上游的generic opnamed_payload_op分隔开,使得named_payload_op的结果写到一块新的 buffer。

    python 复制代码
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
      %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<1x3x3x4xf32>
      
      %3 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x3x3x4xf32>) outs(%4 : tensor<1x3x3x4xf32>) {
      ^bb0(%arg3: f32, %arg4: f32):
        %8 = arith.addf %arg3, %arg3 : f32
        linalg.yield %8 : f32
      } -> tensor<1x3x3x4xf32>
      
      %6 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%5 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %7 = hal.tensor.export %6 : tensor<1x3x3x4xf32> -> !hal.buffer_view
      return %7 : !hal.buffer_view
    }

    is converted into:

    mlir
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
      %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<1x3x3x4xf32>
      
      %3 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x3x3x4xf32>) outs(%4 : tensor<1x3x3x4xf32>) {
      ^bb0(%arg3: f32, %arg4: f32):
        %11 = arith.addf %arg3, %arg3 : f32
        linalg.yield %11 : f32
      } -> tensor<1x3x3x4xf32>
      
      %6 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
      %7 = linalg.fill ins(%cst : f32) outs(%6 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
      %8 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%7 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
    
      %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%8, %5 : tensor<1x3x3x4xf32>, tensor<1x3x3x4xf32>) outs(%7 : tensor<1x3x3x4xf32>) {
      ^bb0(%arg3: f32, %arg4: f32, %arg5: f32):
        %11 = arith.addf %arg3, %arg4 : f32
        linalg.yield %11 : f32
      } -> tensor<1x3x3x4xf32>
      %10 = hal.tensor.export %9 : tensor<1x3x3x4xf32> -> !hal.buffer_view
      return %10 : !hal.buffer_view
    }
  • IREE::Flow::createVerifyInputLegalityPass

    Verifies that the program is legal.

  • IREE::Flow::createConvertLinalgMatmulToMmt4DPass

    Tiles a 2-D linalg.matmul into linalg.mmt4d. Disabled by default; it can be enabled with the --iree-flow-mmt4d-target-options="enable_generic_slow arch=cuda" option.

    mlir
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
      %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %4 = linalg.matmul ins(%0, %1 : tensor<128x256xf32>, tensor<256x256xf32>) outs(%3 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %5 = hal.tensor.export %4 : tensor<128x256xf32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }

    is converted into:

    mlir
    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
      %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %4 = tensor.expand_shape %0 [[0, 1], [2, 3]] : tensor<128x256xf32> into tensor<16x8x128x2xf32>
      %5 = tensor.expand_shape %1 [[0, 1], [2, 3]] : tensor<256x256xf32> into tensor<128x2x64x4xf32>
      %6 = tensor.expand_shape %3 [[0, 1], [2, 3]] : tensor<128x256xf32> into tensor<16x8x64x4xf32>
      %7 = linalg.init_tensor [16, 128, 8, 2] : tensor<16x128x8x2xf32>
      %8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%4 : tensor<16x8x128x2xf32>) outs(%7 : tensor<16x128x8x2xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<16x128x8x2xf32>
      %9 = linalg.init_tensor [64, 128, 4, 2] : tensor<64x128x4x2xf32>
      %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d1, d3, d0, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%5 : tensor<128x2x64x4xf32>) outs(%9 : tensor<64x128x4x2xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<64x128x4x2xf32>
      %11 = linalg.init_tensor [16, 64, 8, 4] : tensor<16x64x8x4xf32>
      %12 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%6 : tensor<16x8x64x4xf32>) outs(%11 : tensor<16x64x8x4xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<16x64x8x4xf32>
      // 16 x (128x8x2) @ 64 x (128x4x2) => 16 x 64 x sum_{128}(8x2 * (4x2)^T)
      %13 = linalg.mmt4d {comment = "generic tiling parameters, as no known kernel was matched for this matmul and target"} ins(%8, %10 : tensor<16x128x8x2xf32>, tensor<64x128x4x2xf32>) outs(%12 : tensor<16x64x8x4xf32>) -> tensor<16x64x8x4xf32>
      %14 = linalg.init_tensor [16, 8, 64, 4] : tensor<16x8x64x4xf32>
      %15 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%13 : tensor<16x64x8x4xf32>) outs(%14 : tensor<16x8x64x4xf32>) {
      ^bb0(%arg2: f32, %arg3: f32):
        linalg.yield %arg2 : f32
      } -> tensor<16x8x64x4xf32>
      %16 = tensor.collapse_shape %15 [[0, 1], [2, 3]] : tensor<16x8x64x4xf32> into tensor<128x256xf32>
      %17 = hal.tensor.export %16 : tensor<128x256xf32> -> !hal.buffer_view
      return %17 : !hal.buffer_view
    }
  • IREE::Flow::createPadLinalgOpsToIntegerMultiplePass

    Pads the M, N, and K dimensions of a matmul up to an integer multiple of paddingSize, which defaults to 4. A sketch of the idea follows.
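
    A hand-written sketch of the idea on a 3x5 @ 5x2 matmul (M=3, N=2, K=5): every dimension is zero-padded up to the next multiple of 4, the matmul runs on the padded operands, and the original 3x2 result is sliced back out. The op spelling (tensor.pad vs. the older linalg.pad_tensor) depends on the MLIR version.

    mlir
    func.func @pad_matmul(%a: tensor<3x5xf32>, %b: tensor<5x2xf32>,
                          %c: tensor<3x2xf32>) -> tensor<3x2xf32> {
      %cst = arith.constant 0.000000e+00 : f32
      %a_pad = tensor.pad %a low[0, 0] high[1, 3] {
      ^bb0(%i: index, %j: index):
        tensor.yield %cst : f32
      } : tensor<3x5xf32> to tensor<4x8xf32>
      %b_pad = tensor.pad %b low[0, 0] high[3, 2] {
      ^bb0(%i: index, %j: index):
        tensor.yield %cst : f32
      } : tensor<5x2xf32> to tensor<8x4xf32>
      %c_pad = tensor.pad %c low[0, 0] high[1, 2] {
      ^bb0(%i: index, %j: index):
        tensor.yield %cst : f32
      } : tensor<3x2xf32> to tensor<4x4xf32>
      %mm = linalg.matmul ins(%a_pad, %b_pad : tensor<4x8xf32>, tensor<8x4xf32>)
                          outs(%c_pad : tensor<4x4xf32>) -> tensor<4x4xf32>
      %res = tensor.extract_slice %mm[0, 0] [3, 2] [1, 1] : tensor<4x4xf32> to tensor<3x2xf32>
      return %res : tensor<3x2xf32>
    }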

  • mlir::createLinalgNamedOpConversionPass

    depth_multiplier=1linalg.depthwise_conv_2d_nhwc_hwcm转换成linalg.depthwise_conv_2d_nhwc_hwc,将depth_multiplier=1linalg.depthwise_conv_2d_nhwc_hwcm_q转换成linalg.depthwise_conv_2d_nhwc_hwc_q

    See https://www.tensorflow.org/api_docs/python/tf/keras/layers/DepthwiseConv2D for what depth_multiplier does.
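
    A hand-written sketch of the rewrite (shapes chosen for illustration): when the trailing depth-multiplier dimension of the filter and result is 1, it is collapsed away and the simpler hwc variant is used.

    mlir
    // Before: filter is HxWxCxM with M = 1, result is NxHxWxCxM.
    %0 = linalg.depthwise_conv_2d_nhwc_hwcm
           {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
           ins(%input, %filter : tensor<1x4x4x3xf32>, tensor<2x2x3x1xf32>)
           outs(%init : tensor<1x3x3x3x1xf32>) -> tensor<1x3x3x3x1xf32>

    // After (sketch): the unit dimension is collapsed and the hwc variant is used.
    %f = tensor.collapse_shape %filter [[0], [1], [2, 3]]
           : tensor<2x2x3x1xf32> into tensor<2x2x3xf32>
    %o = tensor.collapse_shape %init [[0], [1], [2], [3, 4]]
           : tensor<1x3x3x3x1xf32> into tensor<1x3x3x3xf32>
    %1 = linalg.depthwise_conv_2d_nhwc_hwc
           {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
           ins(%input, %f : tensor<1x4x4x3xf32>, tensor<2x2x3xf32>)
           outs(%o : tensor<1x3x3x3xf32>) -> tensor<1x3x3x3xf32>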

  • IREE::Flow::createExpandTensorShapesPass

    Expands a dynamic tensor into a tensor + dynamic dims pair; one benefit of doing so is that the dynamic dimensions can participate directly in computation and inference. For example,

    mlir
    // func.func private @add(%arg0 : tensor<?x2xf32>, %arg1 : tensor<?x2xf32>) -> tensor<?x2xf32>
    // iree_input.global private mutable @param : tensor<?x2xf32>
    // func.func @run(%arg0 : tensor<?x2xf32>) -> tensor<?x2xf32> {
    //   %0 = iree_input.global.load @param : tensor<?x2xf32>
    //   %1 = call @add(%0, %arg0) : (tensor<?x2xf32>, tensor<?x2xf32>) -> tensor<?x2xf32>
    //   iree_input.global.store %1, @param : tensor<?x2xf32>
    //   return %1 : tensor<?x2xf32>
    // }
    func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
    util.global private mutable @param : tensor<?x2xf32>
    func.func @run(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %c0 = arith.constant 0 : index
      %param = util.global.load @param : tensor<?x2xf32>
      %dim = tensor.dim %param, %c0 : tensor<?x2xf32>
      %0 = hal.tensor.export %param : tensor<?x2xf32>{%dim} -> !hal.buffer_view
      %1 = call @add(%0, %arg0) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
      %2 = hal.buffer_view.dim<%1 : !hal.buffer_view>[0] : index
      %3 = hal.tensor.import %1 : !hal.buffer_view -> tensor<?x2xf32>{%2}
      util.global.store %3, @param : tensor<?x2xf32>
      return %1 : !hal.buffer_view
    }

    is converted into:

    mlir
    func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
    util.global private mutable @param : tensor<?x2xf32>
    util.global private mutable @param__d0 : index
    func.func @run(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %c0 = arith.constant 0 : index
      %param = util.global.load @param : tensor<?x2xf32>
      %param__d0 = util.global.load @param__d0 : index
      %0 = flow.tensor.tie_shape %param : tensor<?x2xf32>{%param__d0}
      %dim = tensor.dim %0, %c0 : tensor<?x2xf32>
      %1 = hal.tensor.export %0 : tensor<?x2xf32>{%dim} -> !hal.buffer_view
      %2 = call @add(%1, %arg0) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
      %3 = hal.buffer_view.dim<%2 : !hal.buffer_view>[0] : index
      %4 = hal.tensor.import %2 : !hal.buffer_view -> tensor<?x2xf32>{%3}
      util.global.store %4, @param : tensor<?x2xf32>
      util.global.store %3, @param__d0 : index
      return %2 : !hal.buffer_view
    }

To be continued...
