IREE 目前支持将 MHLO 或 XLA、Torch Tensor 和 TOSA 作为输入,经过一系列 passes 编译生成 IREE 定义的 VM bytecode 中间产物,其中硬件相关代码会编译成相应的 Executable,保存在 VM bytecode 中供 host 进行调用。例如 CUDA 相关的计算代码会被lower 成 PTX 代码,在 IREE 的 runtime 中再被 CUDA 的 runtime 以 JIT 的方式编译成可执行的 cubin kernel。
IREE 编译的入口是IREEVMTransformPassPipeline,IREEVMTransformPassPipeline又被分成InputConversionPassPipeline、CommonInputConversionPassPipeline、ABI::TransformPassPipeline、Flow::FlowTransformPassPipeline、Stream::StreamTransformPassPipeline(仅 CUDA 后端)、HAL::HALTransformPassPipeline、VM::VMTransformPassPipeline等几个阶段。
1 InputConversionPassPipeline
主要作用是将不同的输入(MHLO 或 XLA、Torch Tensor 和 TOSA)统一 lower 成 linalg dialect 和 builtin 的 arith dialect、scf dialect 和 tensor dialect。以 MHLO 输入为例,列举了 InputConversionPassPipeline 中各个 pass 以及它们的主要作用。
-
mhlo::createLegalizeControlFlowPass将TF1.0中的控制流原语(http://download.tensorflow.org/paper/white_paper_tf_control_flow_implementation_2017_11_1.pdf)规范化成HLO中的控制流算子。
-
createTopLevelSCFToCFGPass将顶层的structured control flow表示的控制流图转换成更底层基础块的控制流图(CFG)。
-
createMHLOToMHLOPreprocessingPass -
mlir::createCanonicalizerPass -
mlir::createShapeToShapeLowering将
shape.num_elements转换成shape.reduce。 -
mlir::createConvertShapeToStandardPass将
shape dialectlower成arith dialect、scf dialect和tensor dialect。比如pythonfunc.func @test(%arg0: tensor<1x?xf32>, %arg1: tensor<?xf32>) -> index { %c1 = arith.constant 1 : index %c0 = arith.constant 0 : index %0 = shape.dim %arg0, %c1 : tensor<1x?xf32>, index -> index %1 = shape.dim %arg1, %c0 : tensor<?xf32>, index -> index %2 = shape.add %0, %1 : index, index -> index return %2 : index }转换成
pythonfunc.func @test(%arg0: tensor<1x?xf32>, %arg1: tensor<?xf32>) -> index { %c1 = arith.constant 1 : index %c0 = arith.constant 0 : index %c1_0 = arith.constant 1 : index %c1_1 = arith.constant 1 : index %0 = tensor.dim %arg0, %c1_1 : tensor<1x?xf32> %1 = tensor.from_elements %c1_0, %0 : tensor<2xindex> %2 = tensor.cast %1 : tensor<2xindex> to tensor<2xindex> %3 = tensor.dim %arg0, %c1 : tensor<1x?xf32> %c0_2 = arith.constant 0 : index %4 = tensor.dim %arg1, %c0_2 : tensor<?xf32> %5 = tensor.from_elements %4 : tensor<1xindex> %6 = tensor.cast %5 : tensor<1xindex> to tensor<1xindex> %7 = tensor.dim %arg1, %c0 : tensor<?xf32> %8 = arith.addi %3, %7 : index return %8 : index } -
mlir::createCanonicalizerPass -
mlir::createInlinerPass内联
calls和callable operations,并删除dead callables。比如:pythonfunc.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> { %0 = call @add(%arg0, %arg1) : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32> return %0 : tensor<1xf32> } func.func private @add(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> { %0 = mhlo.add %arg0, %arg1 : tensor<1xf32> return %0 : tensor<1xf32> }私有的add函数被内联之后删除,
pythonfunc.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> { %0 = mhlo.add %arg0, %arg1 : tensor<1xf32> return %0 : tensor<1xf32> } -
IREE::Util::createDemoteI64ToI32Pass -
IREE::Util::createDemoteF64ToF32Pass -
mlir::createCanonicalizerPass -
mlir::createCSEPass -
mhlo::createLegalizeShapeComputationsPass把
scalar tensor op转换成scalar op + fromElements op。比如pythonfunc.func @test(%arg0: f32, %arg1: f32) -> tensor<1xf32> { %0 = tensor.from_elements %arg0 : tensor<1xf32> %1 = tensor.from_elements %arg1 : tensor<1xf32> %2 = mhlo.add %0, %1 : tensor<1xf32> return %2 : tensor<1xf32> }转换成:
pythonfunc.func @test(%arg0: f32, %arg1: f32) -> tensor<1xf32> { %0 = arith.addf %arg0, %arg1 : f32 %1 = tensor.from_elements %0 : tensor<1xf32> return %1 : tensor<1xf32> } -
createConvertMHLOToLinalgExtPass将
mhlo::sort、mhlo.scatter、mhlo.fft、mhlo.reverse、mhlo.topk转换到IREE::LinalgExt dialect,同时将在IREE::LinalgExt dialect区域内部的mhlo op转换成linalg dialect,mhlo.return则转换成iree_linalg_ext.yield。比如,pythonfunc.func @test(%arg0: tensor<10xf32>) -> tensor<10xf32> { %0 = "mhlo.sort"(%arg0) ({ ^bb0(%arg1: tensor<f32>, %arg2: tensor<f32>): %1 = mhlo.compare GT, %arg1, %arg2 : (tensor<f32>, tensor<f32>) -> tensor<i1> mhlo.return %1 : tensor<i1> }) {dimension = 0 : i64} : (tensor<10xf32>) -> tensor<10xf32> return %0 : tensor<10xf32> }转换成,
pythonfunc.func @test(%arg0: tensor<10xf32>) -> tensor<10xf32> { %0 = iree_linalg_ext.sort dimension(0) outs(%arg0 : tensor<10xf32>) { ^bb0(%arg1: f32, %arg2: f32): %1 = arith.cmpf ogt, %arg1, %arg2 : f32 iree_linalg_ext.yield %1 : i1 } -> tensor<10xf32> return %0 : tensor<10xf32> } -
createMHLOToLinalgOnTensorsPass将外层剩余的mhlo op转换到linalg dialect。比如
pythonfunc.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> { %0 = mhlo.add %arg0, %arg1 : tensor<1xf32> return %0 : tensor<1xf32> }转换成,
pythonfunc.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> { %0 = linalg.init_tensor [1] : tensor<1xf32> %1 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<1xf32>, tensor<1xf32>) outs(%0 : tensor<1xf32>) { ^bb0(%arg2: f32, %arg3: f32, %arg4: f32): %2 = arith.addf %arg2, %arg3 : f32 linalg.yield %2 : f32 } -> tensor<1xf32> return %1 : tensor<1xf32> } -
mlir::createReconcileUnrealizedCastsPass消除
unrealized conversion cast操作。算法过程描述:
- 如果
unrealized conversion cast是dead节点(没有user或所有users也都是unrealized conversion cast),则直接删除该dead节点; - 如果是
live节点(至少有一个非unrealized conversion cast的user),则遍历其所有子节点,如果其子节点中所有unrealized conversion cast的result type与该 op 的input type相同(即不存在真实意义的type cast操作),则将所有遍历到的unrealized conversion cast都折叠成该 op 的输入,否则报错live unrealized conversion cast。
- 如果
-
mlir::createCanonicalizerPass -
createVerifyCompilerMHLOInputLegality验证program是否合法。
2 CommonInputConversionPassPipeline
主要作用是将IREE::Input dialect lower成IREE::Util、IREE::Flow和IREE::HAL dialect,包括以下几个passes:
-
createIREEImportPublicPass将
IREE::Input dialect转换成IREE::Util、IREE::Flow和IREE::HAL dialect,并转换func的属性和signature中输入输出类型。比如,pythoniree_input.global private mutable @param : tensor<1x2xf32> func.func @run(%arg0: tensor<1x2xf32>) { %0 = iree_input.global.load @param : tensor<1x2xf32> %1 = iree_input.tensor.clone %0 : tensor<1x2xf32> iree_input.global.store %1, @param : tensor<1x2xf32> return }转换成(
iree_input.global.load-->util.global.load,iree_input.global.store-->util.global.store,iree_input.tensor.clone-->flow.tensor.clone):pythonutil.global private mutable @param : tensor<1x2xf32> func.func @run(%arg0: tensor<1x2xf32>) { %param = util.global.load @param : tensor<1x2xf32> %0 = flow.tensor.clone %param : tensor<1x2xf32> util.global.store %0, @param : tensor<1x2xf32> return } -
createImportMLProgramPass将
ml_program dialect转换到IREE::Util dialect。 -
createSanitizeModuleNamesPass将
module name中的.替换为_,以符合mlir identifiers的命名规范。pythonmodule @iree.module { func.func @test(%arg0: f32, %arg1: f32) -> f32 { %0 = arith.addf %arg0, %arg1 : f32 return %0 : f32 } }转换成,
pythonmodule @iree_module { func.func @test(%arg0: f32, %arg1: f32) -> f32 { %0 = arith.addf %arg0, %arg1 : f32 return %0 : f32 } }
3 ABI::TransformPassPipeline
主要作用是将外部导入的接口和本 module 导出到外部的接口参数统一成标准标量类型或hal.buffer_view类型(hal.buffer_view对应tensor)。
-
createWrapEntryPointsPass给
external func生成一个内部函数,函数中调用原始的external func,同时将public func的函数体包装成一个新的函数,原public func中调用该函数。该 pass 最终的目的是将外部导入的接口和本 module 导出到外部的接口参数统一成标准标量类型或hal.buffer_view(hal.buffer_view对应 tensor 类型)。python// external/imported func func.func private @add(tensor<f32>, tensor<f32>) -> tensor<f32> // public/exported func func.func @test(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> { %0 = call @add(%arg0, %arg1) : (tensor<f32>, tensor<f32>) -> tensor<f32> return %0 : tensor<f32> }转换成,
pythonfunc.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} func.func private @_add(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> { %0 = hal.tensor.export %arg0 : tensor<f32> -> !hal.buffer_view %1 = hal.tensor.export %arg1 : tensor<f32> -> !hal.buffer_view %2 = call @add(%0, %1) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view %3 = hal.tensor.import %2 : !hal.buffer_view -> tensor<f32> return %3 : tensor<f32> } func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} { %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<f32> %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<f32> %2 = call @_test(%0, %1) : (tensor<f32>, tensor<f32>) -> tensor<f32> %3 = hal.tensor.export %2 : tensor<f32> -> !hal.buffer_view return %3 : !hal.buffer_view } func.func private @_test(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> { %0 = call @_add(%arg0, %arg1) : (tensor<f32>, tensor<f32>) -> tensor<f32> return %0 : tensor<f32> } -
mlir::createInlinerPass将
WrapEntryPointsPass中生成的 wrap 函数内联起来。最终转换成,pythonfunc.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} { %0 = call @add(%arg0, %arg1) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view return %0 : !hal.buffer_view } -
mlir::createCanonicalizerPass -
mlir::createCSEPass -
mlir::createSymbolDCEPass
4 Flow::FlowTransformPassPipeline
主要作用是执行一系列窥孔优化,比如 1x1 的 conv2d 转换成 matmul、tiling、op fusion 等,最终将 workload 拆分成 flow.executable。
-
IREE::Util::createDemoteF64ToF32Pass将F64类型窄化为F32。
-
IREE::Flow::createConvertConv2D1x1ToMatmulPass将 1x1 的
linalg.conv_2d_nhwc_hwcf转换成linalg.matmul。python// func.func @conv(%input : tensor<1x2x2x3xf32>, %filter: tensor<1x1x3x4xf32>) -> tensor<1x2x2x4xf32> { // %0 = mhlo.convolution(%input, %filter) // dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f], // window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]} // {batch_group_count = 1 : i64, feature_group_count = 1 : i64} // : (tensor<1x2x2x3xf32>, tensor<1x1x3x4xf32>) -> tensor<1x2x2x4xf32> // return %0 : tensor<1x2x2x4xf32> // } func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} { %cst = arith.constant 0.000000e+00 : f32 %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x2x2x3xf32> %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<1x1x3x4xf32> %2 = linalg.init_tensor [1, 2, 2, 4] : tensor<1x2x2x4xf32> %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32> %4 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x2x2x3xf32>, tensor<1x1x3x4xf32>) outs(%3 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32> %5 = hal.tensor.export %4 : tensor<1x2x2x4xf32> -> !hal.buffer_view return %5 : !hal.buffer_view }转换成,
pythonfunc.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} { %cst = arith.constant 0.000000e+00 : f32 %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x2x2x3xf32> %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<1x1x3x4xf32> %2 = linalg.init_tensor [1, 2, 2, 4] : tensor<1x2x2x4xf32> %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32> %4 = tensor.collapse_shape %0 [[0, 1, 2], [3]] : tensor<1x2x2x3xf32> into tensor<4x3xf32> %5 = tensor.collapse_shape %1 [[0, 1, 2], [3]] : tensor<1x1x3x4xf32> into tensor<3x4xf32> %6 = tensor.collapse_shape %3 [[0, 1, 2], [3]] : tensor<1x2x2x4xf32> into tensor<4x4xf32> %7 = linalg.matmul ins(%4, %5 : tensor<4x3xf32>, tensor<3x4xf32>) outs(%6 : tensor<4x4xf32>) -> tensor<4x4xf32> %8 = tensor.expand_shape %7 [[0, 1, 2], [3]] : tensor<4x4xf32> into tensor<1x2x2x4xf32> %9 = hal.tensor.export %8 : tensor<1x2x2x4xf32> -> !hal.buffer_view return %9 : !hal.buffer_view } -
IREE::Flow::createConvertConv2DToImg2ColPass将
conv2d转换成img2col。默认不开启。python// %0 = mhlo.convolution(%input, %filter) // dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f], // window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]} // {batch_group_count = 1 : i64, feature_group_count = 1 : i64} // : (tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) -> tensor<1x3x3x4xf32> func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} { %cst = arith.constant 0.000000e+00 : f32 %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32> %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32> %2 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32> %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32> %4 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32> %5 = hal.tensor.export %4 : tensor<1x3x3x4xf32> -> !hal.buffer_view return %5 : !hal.buffer_view }转换成,
pythonfunc.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} { %cst = arith.constant 0.000000e+00 : f32 %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32> %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32> %2 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32> %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32> %4 = linalg.init_tensor [1, 3, 3, 2, 2, 3] : tensor<1x3x3x2x2x3xf32> %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1 + d3, d2 + d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%0 : tensor<1x4x4x3xf32>) outs(%4 : tensor<1x3x3x2x2x3xf32>) { ^bb0(%arg2: f32, %arg3: f32): linalg.yield %arg2 : f32 } -> tensor<1x3x3x2x2x3xf32> %6 = tensor.collapse_shape %5 [[0, 1, 2], [3, 4, 5]] : tensor<1x3x3x2x2x3xf32> into tensor<9x12xf32> %7 = tensor.collapse_shape %1 [[0, 1, 2], [3]] : tensor<2x2x3x4xf32> into tensor<12x4xf32> %8 = tensor.collapse_shape %3 [[0, 1, 2], [3]] : tensor<1x3x3x4xf32> into tensor<9x4xf32> %9 = linalg.matmul ins(%6, %7 : tensor<9x12xf32>, tensor<12x4xf32>) outs(%8 : tensor<9x4xf32>) -> tensor<9x4xf32> %10 = tensor.expand_shape %9 [[0, 1, 2], [3]] : tensor<9x4xf32> into tensor<1x3x3x4xf32> %11 = hal.tensor.export %10 : tensor<1x3x3x4xf32> -> !hal.buffer_view return %11 : !hal.buffer_view } -
IREE::Flow::createDetachElementwiseFromNamedOpsPass将
buffer = linalg.generic_op + linalg.named_payload_op转换成tmp_buffer = linalg.named_payload_op; buffer = linalg.generic_op + tmp_buffer,主要目的是将上游的generic op和named_payload_op分隔开,使得named_payload_op的结果写到一块新的 buffer。pythonfunc.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} { %cst = arith.constant 0.000000e+00 : f32 %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32> %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32> %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<1x3x3x4xf32> %3 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32> %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32> %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x3x3x4xf32>) outs(%4 : tensor<1x3x3x4xf32>) { ^bb0(%arg3: f32, %arg4: f32): %8 = arith.addf %arg3, %arg3 : f32 linalg.yield %8 : f32 } -> tensor<1x3x3x4xf32> %6 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%5 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32> %7 = hal.tensor.export %6 : tensor<1x3x3x4xf32> -> !hal.buffer_view return %7 : !hal.buffer_view }转换成,
pythonfunc.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} { %cst = arith.constant 0.000000e+00 : f32 %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32> %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32> %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<1x3x3x4xf32> %3 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32> %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32> %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x3x3x4xf32>) outs(%4 : tensor<1x3x3x4xf32>) { ^bb0(%arg3: f32, %arg4: f32): %11 = arith.addf %arg3, %arg3 : f32 linalg.yield %11 : f32 } -> tensor<1x3x3x4xf32> %6 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32> %7 = linalg.fill ins(%cst : f32) outs(%6 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32> %8 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%7 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32> %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%8, %5 : tensor<1x3x3x4xf32>, tensor<1x3x3x4xf32>) outs(%7 : tensor<1x3x3x4xf32>) { ^bb0(%arg3: f32, %arg4: f32, %arg5: f32): %11 = arith.addf %arg3, %arg4 : f32 linalg.yield %11 : f32 } -> tensor<1x3x3x4xf32> %10 = hal.tensor.export %9 : tensor<1x3x3x4xf32> -> !hal.buffer_view return %10 : !hal.buffer_view } -
IREE::Flow::createVerifyInputLegalityPass验证program是否合法
-
IREE::Flow::createConvertLinalgMatmulToMmt4DPass将 2d 的
linalg.matmul tiling成linalg.mmt4d。默认不开启,可通过--iree-flow-mmt4d-target-options="enable_generic_slow arch=cuda选项开启。pythonfunc.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} { %cst = arith.constant 0.000000e+00 : f32 %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32> %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32> %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32> %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32> %4 = linalg.matmul ins(%0, %1 : tensor<128x256xf32>, tensor<256x256xf32>) outs(%3 : tensor<128x256xf32>) -> tensor<128x256xf32> %5 = hal.tensor.export %4 : tensor<128x256xf32> -> !hal.buffer_view return %5 : !hal.buffer_view }转换成,
pythonfunc.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} { %cst = arith.constant 0.000000e+00 : f32 %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32> %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32> %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32> %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32> %4 = tensor.expand_shape %0 [[0, 1], [2, 3]] : tensor<128x256xf32> into tensor<16x8x128x2xf32> %5 = tensor.expand_shape %1 [[0, 1], [2, 3]] : tensor<256x256xf32> into tensor<128x2x64x4xf32> %6 = tensor.expand_shape %3 [[0, 1], [2, 3]] : tensor<128x256xf32> into tensor<16x8x64x4xf32> %7 = linalg.init_tensor [16, 128, 8, 2] : tensor<16x128x8x2xf32> %8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%4 : tensor<16x8x128x2xf32>) outs(%7 : tensor<16x128x8x2xf32>) { ^bb0(%arg2: f32, %arg3: f32): linalg.yield %arg2 : f32 } -> tensor<16x128x8x2xf32> %9 = linalg.init_tensor [64, 128, 4, 2] : tensor<64x128x4x2xf32> %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d1, d3, d0, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%5 : tensor<128x2x64x4xf32>) outs(%9 : tensor<64x128x4x2xf32>) { ^bb0(%arg2: f32, %arg3: f32): linalg.yield %arg2 : f32 } -> tensor<64x128x4x2xf32> %11 = linalg.init_tensor [16, 64, 8, 4] : tensor<16x64x8x4xf32> %12 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%6 : tensor<16x8x64x4xf32>) outs(%11 : tensor<16x64x8x4xf32>) { ^bb0(%arg2: f32, %arg3: f32): linalg.yield %arg2 : f32 } -> tensor<16x64x8x4xf32> // 16 x (128x8x2) @ 64 x (128x4x2) => 16 x 64 x sum_{128}(8x2 * (4x2)^T) %13 = linalg.mmt4d {comment = "generic tiling parameters, as no known kernel was matched for this matmul and target"} ins(%8, %10 : tensor<16x128x8x2xf32>, tensor<64x128x4x2xf32>) outs(%12 : tensor<16x64x8x4xf32>) -> tensor<16x64x8x4xf32> %14 = linalg.init_tensor [16, 8, 64, 4] : tensor<16x8x64x4xf32> %15 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%13 : tensor<16x64x8x4xf32>) outs(%14 : tensor<16x8x64x4xf32>) { ^bb0(%arg2: f32, %arg3: f32): linalg.yield %arg2 : f32 } -> tensor<16x8x64x4xf32> %16 = tensor.collapse_shape %15 [[0, 1], [2, 3]] : tensor<16x8x64x4xf32> into tensor<128x256xf32> %17 = hal.tensor.export %16 : tensor<128x256xf32> -> !hal.buffer_view return %17 : !hal.buffer_view } -
IREE::Flow::createPadLinalgOpsToIntegerMultiplePass将 matmul 的 M、N 和 K 扩充到
paddingSize的整数倍,paddingSize默认为 4。 -
mlir::createLinalgNamedOpConversionPass将
depth_multiplier=1的linalg.depthwise_conv_2d_nhwc_hwcm转换成linalg.depthwise_conv_2d_nhwc_hwc,将depth_multiplier=1的linalg.depthwise_conv_2d_nhwc_hwcm_q转换成linalg.depthwise_conv_2d_nhwc_hwc_q。
depth_multiplier的作用见 https://www.tensorflow.org/api_docs/python/tf/keras/layers/DepthwiseConv2D 。
-
IREE::Flow::createExpandTensorShapesPass
将dynamic tensor扩充为tensor + dynamic dim的对偶形式,这么做的一个好处是动态维度可以直接参与计算和推导。比如python// func.func private @add(%arg0 : tensor<?x2xf32>, %arg1 : tensor<?x2xf32>) -> tensor<?x2xf32> // iree_input.global private mutable @param : tensor<?x2xf32> // func.func @run(%arg0 : tensor<?x2xf32>) -> tensor<?x2xf32> { // %0 = iree_input.global.load @param : tensor<?x2xf32> // %1 = call @add(%0, %arg0) : (tensor<?x2xf32>, tensor<?x2xf32>) -> tensor<?x2xf32> // iree_input.global.store %1, @param : tensor<?x2xf32> // return %1 : tensor<?x2xf32> // } func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} util.global private mutable @param : tensor<?x2xf32> func.func @run(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} { %c0 = arith.constant 0 : index %param = util.global.load @param : tensor<?x2xf32> %dim = tensor.dim %param, %c0 : tensor<?x2xf32> %0 = hal.tensor.export %param : tensor<?x2xf32>{%dim} -> !hal.buffer_view %1 = call @add(%0, %arg0) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view %2 = hal.buffer_view.dim<%1 : !hal.buffer_view>[0] : index %3 = hal.tensor.import %1 : !hal.buffer_view -> tensor<?x2xf32>{%2} util.global.store %3, @param : tensor<?x2xf32> return %1 : !hal.buffer_view }转换成,
pythonfunc.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} util.global private mutable @param : tensor<?x2xf32> util.global private mutable @param__d0 : index func.func @run(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} { %c0 = arith.constant 0 : index %param = util.global.load @param : tensor<?x2xf32> %param__d0 = util.global.load @param__d0 : index %0 = flow.tensor.tie_shape %param : tensor<?x2xf32>{%param__d0} %dim = tensor.dim %0, %c0 : tensor<?x2xf32> %1 = hal.tensor.export %0 : tensor<?x2xf32>{%dim} -> !hal.buffer_view %2 = call @add(%1, %arg0) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view %3 = hal.buffer_view.dim<%2 : !hal.buffer_view>[0] : index %4 = hal.tensor.import %2 : !hal.buffer_view -> tensor<?x2xf32>{%3} util.global.store %4, @param : tensor<?x2xf32> util.global.store %3, @param__d0 : index return %2 : !hal.buffer_view }
未完待续...