IREE currently accepts MHLO/XLA, Torch Tensor, and TOSA as input. A series of passes compiles the input into IREE's own VM bytecode intermediate artifact. Hardware-specific code is compiled into corresponding Executables that are embedded in the VM bytecode for the host to invoke. For example, CUDA compute code is lowered to PTX, which the CUDA runtime inside the IREE runtime then JIT-compiles into an executable cubin kernel.
The entry point of IREE compilation is `IREEVMTransformPassPipeline`, which is further divided into several stages: `InputConversionPassPipeline`, `CommonInputConversionPassPipeline`, `ABI::TransformPassPipeline`, `Flow::FlowTransformPassPipeline`, `Stream::StreamTransformPassPipeline` (CUDA backend only), `HAL::HALTransformPassPipeline`, and `VM::VMTransformPassPipeline`.
1 InputConversionPassPipeline
Its main job is to lower the different inputs (MHLO/XLA, Torch Tensor, and TOSA) into a uniform representation based on the `linalg` dialect plus the built-in `arith`, `scf`, and `tensor` dialects. Taking MHLO input as an example, the passes in InputConversionPassPipeline and their main roles are listed below.
- `mhlo::createLegalizeControlFlowPass`: canonicalizes the TF 1.0 control-flow primitives (http://download.tensorflow.org/paper/white_paper_tf_control_flow_implementation_2017_11_1.pdf) into HLO control-flow ops.
- `createTopLevelSCFToCFGPass`: converts top-level structured control flow into a lower-level control-flow graph (CFG) over basic blocks.
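As an illustration (a minimal sketch of my own, not taken from the IREE sources), an `scf.if` lowered to basic blocks looks roughly like this:

```mlir
// Before: structured control flow.
func.func @select(%pred: i1, %a: f32, %b: f32) -> f32 {
  %0 = scf.if %pred -> (f32) {
    scf.yield %a : f32
  } else {
    scf.yield %b : f32
  }
  return %0 : f32
}

// After: the same logic expressed as a CFG over basic blocks.
func.func @select(%pred: i1, %a: f32, %b: f32) -> f32 {
  cf.cond_br %pred, ^bb1, ^bb2
^bb1:
  cf.br ^bb3(%a : f32)
^bb2:
  cf.br ^bb3(%b : f32)
^bb3(%0: f32):
  return %0 : f32
}
```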
- `createMHLOToMHLOPreprocessingPass`
- `mlir::createCanonicalizerPass`
- `mlir::createShapeToShapeLowering`: converts `shape.num_elements` into `shape.reduce`.
- `mlir::createConvertShapeToStandardPass`: lowers the `shape` dialect into the `arith`, `scf`, and `tensor` dialects. For example:

```mlir
func.func @test(%arg0: tensor<1x?xf32>, %arg1: tensor<?xf32>) -> index {
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %0 = shape.dim %arg0, %c1 : tensor<1x?xf32>, index -> index
  %1 = shape.dim %arg1, %c0 : tensor<?xf32>, index -> index
  %2 = shape.add %0, %1 : index, index -> index
  return %2 : index
}
```

is converted into:

```mlir
func.func @test(%arg0: tensor<1x?xf32>, %arg1: tensor<?xf32>) -> index {
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %c1_0 = arith.constant 1 : index
  %c1_1 = arith.constant 1 : index
  %0 = tensor.dim %arg0, %c1_1 : tensor<1x?xf32>
  %1 = tensor.from_elements %c1_0, %0 : tensor<2xindex>
  %2 = tensor.cast %1 : tensor<2xindex> to tensor<2xindex>
  %3 = tensor.dim %arg0, %c1 : tensor<1x?xf32>
  %c0_2 = arith.constant 0 : index
  %4 = tensor.dim %arg1, %c0_2 : tensor<?xf32>
  %5 = tensor.from_elements %4 : tensor<1xindex>
  %6 = tensor.cast %5 : tensor<1xindex> to tensor<1xindex>
  %7 = tensor.dim %arg1, %c0 : tensor<?xf32>
  %8 = arith.addi %3, %7 : index
  return %8 : index
}
```
- `mlir::createCanonicalizerPass`
- `mlir::createInlinerPass`: inlines calls and callable operations and deletes dead callables. For example:

```mlir
func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
  %0 = call @add(%arg0, %arg1) : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
  return %0 : tensor<1xf32>
}
func.func private @add(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
  %0 = mhlo.add %arg0, %arg1 : tensor<1xf32>
  return %0 : tensor<1xf32>
}
```

The private `add` function is inlined and then deleted:

```mlir
func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
  %0 = mhlo.add %arg0, %arg1 : tensor<1xf32>
  return %0 : tensor<1xf32>
}
```
- `IREE::Util::createDemoteI64ToI32Pass`
- `IREE::Util::createDemoteF64ToF32Pass`
- `mlir::createCanonicalizerPass`
- `mlir::createCSEPass`
- `mhlo::createLegalizeShapeComputationsPass`: converts scalar tensor ops into a scalar op plus a `from_elements` op. For example:

```mlir
func.func @test(%arg0: f32, %arg1: f32) -> tensor<1xf32> {
  %0 = tensor.from_elements %arg0 : tensor<1xf32>
  %1 = tensor.from_elements %arg1 : tensor<1xf32>
  %2 = mhlo.add %0, %1 : tensor<1xf32>
  return %2 : tensor<1xf32>
}
```

is converted into:

```mlir
func.func @test(%arg0: f32, %arg1: f32) -> tensor<1xf32> {
  %0 = arith.addf %arg0, %arg1 : f32
  %1 = tensor.from_elements %0 : tensor<1xf32>
  return %1 : tensor<1xf32>
}
```
- `createConvertMHLOToLinalgExtPass`: converts `mhlo.sort`, `mhlo.scatter`, `mhlo.fft`, `mhlo.reverse`, and `mhlo.topk` into the `IREE::LinalgExt` dialect, converts the mhlo ops inside `IREE::LinalgExt` regions into the `linalg` dialect, and converts `mhlo.return` into `iree_linalg_ext.yield`. For example:

```mlir
func.func @test(%arg0: tensor<10xf32>) -> tensor<10xf32> {
  %0 = "mhlo.sort"(%arg0) ({
  ^bb0(%arg1: tensor<f32>, %arg2: tensor<f32>):
    %1 = mhlo.compare GT, %arg1, %arg2 : (tensor<f32>, tensor<f32>) -> tensor<i1>
    mhlo.return %1 : tensor<i1>
  }) {dimension = 0 : i64} : (tensor<10xf32>) -> tensor<10xf32>
  return %0 : tensor<10xf32>
}
```

is converted into:

```mlir
func.func @test(%arg0: tensor<10xf32>) -> tensor<10xf32> {
  %0 = iree_linalg_ext.sort dimension(0) outs(%arg0 : tensor<10xf32>) {
  ^bb0(%arg1: f32, %arg2: f32):
    %1 = arith.cmpf ogt, %arg1, %arg2 : f32
    iree_linalg_ext.yield %1 : i1
  } -> tensor<10xf32>
  return %0 : tensor<10xf32>
}
```
- `createMHLOToLinalgOnTensorsPass`: converts the remaining outer-level mhlo ops into the `linalg` dialect. For example:

```mlir
func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
  %0 = mhlo.add %arg0, %arg1 : tensor<1xf32>
  return %0 : tensor<1xf32>
}
```

is converted into:

```mlir
func.func @test(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> {
  %0 = linalg.init_tensor [1] : tensor<1xf32>
  %1 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%arg0, %arg1 : tensor<1xf32>, tensor<1xf32>) outs(%0 : tensor<1xf32>) {
  ^bb0(%arg2: f32, %arg3: f32, %arg4: f32):
    %2 = arith.addf %arg2, %arg3 : f32
    linalg.yield %2 : f32
  } -> tensor<1xf32>
  return %1 : tensor<1xf32>
}
```
- `mlir::createReconcileUnrealizedCastsPass`: eliminates `unrealized_conversion_cast` operations. The algorithm works as follows:
  - if an `unrealized_conversion_cast` is a dead node (it has no users, or all of its users are themselves `unrealized_conversion_cast` ops), delete the dead node directly;
  - if it is a live node (it has at least one user that is not an `unrealized_conversion_cast`), traverse all of its descendants; if the result type of every `unrealized_conversion_cast` among the descendants equals this op's input type (i.e. no type cast with real effect occurs), fold all traversed `unrealized_conversion_cast` ops into this op's input; otherwise report a `live unrealized conversion cast` error.
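As a minimal illustration of the folding case (my own sketch, not from the IREE sources): a pair of casts that round-trips between the same types cancels out, and the users of the second cast are rewired to the original value.

```mlir
// Before: %0/%1 round-trip index -> i64 -> index, and %1 has a real (non-cast) user.
func.func @cast_roundtrip(%arg0: index) -> index {
  %0 = builtin.unrealized_conversion_cast %arg0 : index to i64
  %1 = builtin.unrealized_conversion_cast %0 : i64 to index
  return %1 : index
}

// After reconcile-unrealized-casts: both casts fold away into the original input.
func.func @cast_roundtrip(%arg0: index) -> index {
  return %arg0 : index
}
```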
- `mlir::createCanonicalizerPass`
- `createVerifyCompilerMHLOInputLegality`: verifies that the program is legal.
2 CommonInputConversionPassPipeline
Its main job is to lower the `IREE::Input` dialect into the `IREE::Util`, `IREE::Flow`, and `IREE::HAL` dialects. It consists of the following passes:
- `createIREEImportPublicPass`: converts the `IREE::Input` dialect into the `IREE::Util`, `IREE::Flow`, and `IREE::HAL` dialects, and converts func attributes and the input/output types of the signature. For example:

```mlir
iree_input.global private mutable @param : tensor<1x2xf32>
func.func @run(%arg0: tensor<1x2xf32>) {
  %0 = iree_input.global.load @param : tensor<1x2xf32>
  %1 = iree_input.tensor.clone %0 : tensor<1x2xf32>
  iree_input.global.store %1, @param : tensor<1x2xf32>
  return
}
```

is converted into (`iree_input.global.load` --> `util.global.load`, `iree_input.global.store` --> `util.global.store`, `iree_input.tensor.clone` --> `flow.tensor.clone`):

```mlir
util.global private mutable @param : tensor<1x2xf32>
func.func @run(%arg0: tensor<1x2xf32>) {
  %param = util.global.load @param : tensor<1x2xf32>
  %0 = flow.tensor.clone %param : tensor<1x2xf32>
  util.global.store %0, @param : tensor<1x2xf32>
  return
}
```
- `createImportMLProgramPass`: converts the `ml_program` dialect into the `IREE::Util` dialect.
- `createSanitizeModuleNamesPass`: replaces `.` in module names with `_` to conform to the naming rules for MLIR identifiers. For example:

```mlir
module @iree.module {
  func.func @test(%arg0: f32, %arg1: f32) -> f32 {
    %0 = arith.addf %arg0, %arg1 : f32
    return %0 : f32
  }
}
```

is converted into:

```mlir
module @iree_module {
  func.func @test(%arg0: f32, %arg1: f32) -> f32 {
    %0 = arith.addf %arg0, %arg1 : f32
    return %0 : f32
  }
}
```
3 ABI::TransformPassPipeline
Its main job is to unify the parameters of externally imported interfaces and of the interfaces this module exports to the outside into standard scalar types or the `hal.buffer_view` type (`hal.buffer_view` corresponds to a tensor).
- `createWrapEntryPointsPass`: generates an internal function for each external func that calls the original external func, and wraps the body of each public func into a new function which the original public func then calls. The ultimate purpose of this pass is to unify the parameters of externally imported interfaces and of exported interfaces into standard scalar types or `hal.buffer_view` (`hal.buffer_view` corresponds to the tensor type). For example:

```mlir
// external/imported func
func.func private @add(tensor<f32>, tensor<f32>) -> tensor<f32>

// public/exported func
func.func @test(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> {
  %0 = call @add(%arg0, %arg1) : (tensor<f32>, tensor<f32>) -> tensor<f32>
  return %0 : tensor<f32>
}
```

is converted into:

```mlir
func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
func.func private @_add(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> {
  %0 = hal.tensor.export %arg0 : tensor<f32> -> !hal.buffer_view
  %1 = hal.tensor.export %arg1 : tensor<f32> -> !hal.buffer_view
  %2 = call @add(%0, %1) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
  %3 = hal.tensor.import %2 : !hal.buffer_view -> tensor<f32>
  return %3 : tensor<f32>
}
func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<f32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<f32>
  %2 = call @_test(%0, %1) : (tensor<f32>, tensor<f32>) -> tensor<f32>
  %3 = hal.tensor.export %2 : tensor<f32> -> !hal.buffer_view
  return %3 : !hal.buffer_view
}
func.func private @_test(%arg0: tensor<f32>, %arg1: tensor<f32>) -> tensor<f32> {
  %0 = call @_add(%arg0, %arg1) : (tensor<f32>, tensor<f32>) -> tensor<f32>
  return %0 : tensor<f32>
}
```
- `mlir::createInlinerPass`: inlines the wrapper functions generated by `WrapEntryPointsPass`. The final result is:

```mlir
func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %0 = call @add(%arg0, %arg1) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
  return %0 : !hal.buffer_view
}
```
- `mlir::createCanonicalizerPass`
- `mlir::createCSEPass`
- `mlir::createSymbolDCEPass`
4 Flow::FlowTransformPassPipeline
Its main job is to perform a series of peephole optimizations, such as converting 1x1 conv2d into matmul, tiling, op fusion, and so on, and finally to split the workload into `flow.executable`s.
- `IREE::Util::createDemoteF64ToF32Pass`: narrows F64 types to F32.
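A minimal sketch of the effect (my own illustration, not IREE's actual output): every f64 type in the IR is rewritten to f32.

```mlir
// Before demotion: the computation is expressed in f64.
func.func @test(%arg0: tensor<4xf64>, %arg1: tensor<4xf64>) -> tensor<4xf64> {
  %0 = arith.addf %arg0, %arg1 : tensor<4xf64>
  return %0 : tensor<4xf64>
}

// After IREE::Util::createDemoteF64ToF32Pass: the same IR with f64 narrowed to f32.
func.func @test(%arg0: tensor<4xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {
  %0 = arith.addf %arg0, %arg1 : tensor<4xf32>
  return %0 : tensor<4xf32>
}
```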
- `IREE::Flow::createConvertConv2D1x1ToMatmulPass`: converts a 1x1 `linalg.conv_2d_nhwc_hwcf` into `linalg.matmul`. For example:

```mlir
// func.func @conv(%input : tensor<1x2x2x3xf32>, %filter: tensor<1x1x3x4xf32>) -> tensor<1x2x2x4xf32> {
//   %0 = mhlo.convolution(%input, %filter)
//          dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
//          window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
//          {batch_group_count = 1 : i64, feature_group_count = 1 : i64}
//        : (tensor<1x2x2x3xf32>, tensor<1x1x3x4xf32>) -> tensor<1x2x2x4xf32>
//   return %0 : tensor<1x2x2x4xf32>
// }
func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x2x2x3xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<1x1x3x4xf32>
  %2 = linalg.init_tensor [1, 2, 2, 4] : tensor<1x2x2x4xf32>
  %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32>
  %4 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x2x2x3xf32>, tensor<1x1x3x4xf32>) outs(%3 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32>
  %5 = hal.tensor.export %4 : tensor<1x2x2x4xf32> -> !hal.buffer_view
  return %5 : !hal.buffer_view
}
```

is converted into:

```mlir
func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x2x2x3xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<1x1x3x4xf32>
  %2 = linalg.init_tensor [1, 2, 2, 4] : tensor<1x2x2x4xf32>
  %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x2x2x4xf32>) -> tensor<1x2x2x4xf32>
  %4 = tensor.collapse_shape %0 [[0, 1, 2], [3]] : tensor<1x2x2x3xf32> into tensor<4x3xf32>
  %5 = tensor.collapse_shape %1 [[0, 1, 2], [3]] : tensor<1x1x3x4xf32> into tensor<3x4xf32>
  %6 = tensor.collapse_shape %3 [[0, 1, 2], [3]] : tensor<1x2x2x4xf32> into tensor<4x4xf32>
  %7 = linalg.matmul ins(%4, %5 : tensor<4x3xf32>, tensor<3x4xf32>) outs(%6 : tensor<4x4xf32>) -> tensor<4x4xf32>
  %8 = tensor.expand_shape %7 [[0, 1, 2], [3]] : tensor<4x4xf32> into tensor<1x2x2x4xf32>
  %9 = hal.tensor.export %8 : tensor<1x2x2x4xf32> -> !hal.buffer_view
  return %9 : !hal.buffer_view
}
```
- `IREE::Flow::createConvertConv2DToImg2ColPass`: converts conv2d into img2col. Disabled by default. For example:

```mlir
// %0 = mhlo.convolution(%input, %filter)
//        dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
//        window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
//        {batch_group_count = 1 : i64, feature_group_count = 1 : i64}
//      : (tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) -> tensor<1x3x3x4xf32>
func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
  %2 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
  %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
  %4 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
  %5 = hal.tensor.export %4 : tensor<1x3x3x4xf32> -> !hal.buffer_view
  return %5 : !hal.buffer_view
}
```

is converted into:

```mlir
func.func @conv(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
  %2 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
  %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
  %4 = linalg.init_tensor [1, 3, 3, 2, 2, 3] : tensor<1x3x3x2x2x3xf32>
  %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1 + d3, d2 + d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%0 : tensor<1x4x4x3xf32>) outs(%4 : tensor<1x3x3x2x2x3xf32>) {
  ^bb0(%arg2: f32, %arg3: f32):
    linalg.yield %arg2 : f32
  } -> tensor<1x3x3x2x2x3xf32>
  %6 = tensor.collapse_shape %5 [[0, 1, 2], [3, 4, 5]] : tensor<1x3x3x2x2x3xf32> into tensor<9x12xf32>
  %7 = tensor.collapse_shape %1 [[0, 1, 2], [3]] : tensor<2x2x3x4xf32> into tensor<12x4xf32>
  %8 = tensor.collapse_shape %3 [[0, 1, 2], [3]] : tensor<1x3x3x4xf32> into tensor<9x4xf32>
  %9 = linalg.matmul ins(%6, %7 : tensor<9x12xf32>, tensor<12x4xf32>) outs(%8 : tensor<9x4xf32>) -> tensor<9x4xf32>
  %10 = tensor.expand_shape %9 [[0, 1, 2], [3]] : tensor<9x4xf32> into tensor<1x3x3x4xf32>
  %11 = hal.tensor.export %10 : tensor<1x3x3x4xf32> -> !hal.buffer_view
  return %11 : !hal.buffer_view
}
```
- `IREE::Flow::createDetachElementwiseFromNamedOpsPass`: rewrites `buffer = linalg.generic_op + linalg.named_payload_op` into `tmp_buffer = linalg.named_payload_op; buffer = linalg.generic_op + tmp_buffer`. The main purpose is to separate the upstream generic op from the named_payload_op, so that the result of the named_payload_op is written to a fresh buffer. For example:

```mlir
func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
  %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<1x3x3x4xf32>
  %3 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
  %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
  %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x3x3x4xf32>) outs(%4 : tensor<1x3x3x4xf32>) {
  ^bb0(%arg3: f32, %arg4: f32):
    %8 = arith.addf %arg3, %arg3 : f32
    linalg.yield %8 : f32
  } -> tensor<1x3x3x4xf32>
  %6 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%5 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
  %7 = hal.tensor.export %6 : tensor<1x3x3x4xf32> -> !hal.buffer_view
  return %7 : !hal.buffer_view
}
```

is converted into:

```mlir
func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x4x4x3xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2x2x3x4xf32>
  %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<1x3x3x4xf32>
  %3 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
  %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
  %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%2 : tensor<1x3x3x4xf32>) outs(%4 : tensor<1x3x3x4xf32>) {
  ^bb0(%arg3: f32, %arg4: f32):
    %11 = arith.addf %arg3, %arg3 : f32
    linalg.yield %11 : f32
  } -> tensor<1x3x3x4xf32>
  %6 = linalg.init_tensor [1, 3, 3, 4] : tensor<1x3x3x4xf32>
  %7 = linalg.fill ins(%cst : f32) outs(%6 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
  %8 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%0, %1 : tensor<1x4x4x3xf32>, tensor<2x2x3x4xf32>) outs(%7 : tensor<1x3x3x4xf32>) -> tensor<1x3x3x4xf32>
  %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%8, %5 : tensor<1x3x3x4xf32>, tensor<1x3x3x4xf32>) outs(%7 : tensor<1x3x3x4xf32>) {
  ^bb0(%arg3: f32, %arg4: f32, %arg5: f32):
    %11 = arith.addf %arg3, %arg4 : f32
    linalg.yield %11 : f32
  } -> tensor<1x3x3x4xf32>
  %10 = hal.tensor.export %9 : tensor<1x3x3x4xf32> -> !hal.buffer_view
  return %10 : !hal.buffer_view
}
```
- `IREE::Flow::createVerifyInputLegalityPass`: verifies that the program is legal.
- `IREE::Flow::createConvertLinalgMatmulToMmt4DPass`: tiles a 2-D `linalg.matmul` into `linalg.mmt4d`. Disabled by default; it can be enabled with the `--iree-flow-mmt4d-target-options="enable_generic_slow arch=cuda"` option. For example:

```mlir
func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
  %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
  %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
  %4 = linalg.matmul ins(%0, %1 : tensor<128x256xf32>, tensor<256x256xf32>) outs(%3 : tensor<128x256xf32>) -> tensor<128x256xf32>
  %5 = hal.tensor.export %4 : tensor<128x256xf32> -> !hal.buffer_view
  return %5 : !hal.buffer_view
}
```

is converted into:

```mlir
func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
  %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
  %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
  %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
  %4 = tensor.expand_shape %0 [[0, 1], [2, 3]] : tensor<128x256xf32> into tensor<16x8x128x2xf32>
  %5 = tensor.expand_shape %1 [[0, 1], [2, 3]] : tensor<256x256xf32> into tensor<128x2x64x4xf32>
  %6 = tensor.expand_shape %3 [[0, 1], [2, 3]] : tensor<128x256xf32> into tensor<16x8x64x4xf32>
  %7 = linalg.init_tensor [16, 128, 8, 2] : tensor<16x128x8x2xf32>
  %8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%4 : tensor<16x8x128x2xf32>) outs(%7 : tensor<16x128x8x2xf32>) {
  ^bb0(%arg2: f32, %arg3: f32):
    linalg.yield %arg2 : f32
  } -> tensor<16x128x8x2xf32>
  %9 = linalg.init_tensor [64, 128, 4, 2] : tensor<64x128x4x2xf32>
  %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d1, d3, d0, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%5 : tensor<128x2x64x4xf32>) outs(%9 : tensor<64x128x4x2xf32>) {
  ^bb0(%arg2: f32, %arg3: f32):
    linalg.yield %arg2 : f32
  } -> tensor<64x128x4x2xf32>
  %11 = linalg.init_tensor [16, 64, 8, 4] : tensor<16x64x8x4xf32>
  %12 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%6 : tensor<16x8x64x4xf32>) outs(%11 : tensor<16x64x8x4xf32>) {
  ^bb0(%arg2: f32, %arg3: f32):
    linalg.yield %arg2 : f32
  } -> tensor<16x64x8x4xf32>
  // 16 x (128x8x2) @ 64 x (128x4x2) => 16 x 64 x sum_{128}(8x2 * (4x2)^T)
  %13 = linalg.mmt4d {comment = "generic tiling parameters, as no known kernel was matched for this matmul and target"} ins(%8, %10 : tensor<16x128x8x2xf32>, tensor<64x128x4x2xf32>) outs(%12 : tensor<16x64x8x4xf32>) -> tensor<16x64x8x4xf32>
  %14 = linalg.init_tensor [16, 8, 64, 4] : tensor<16x8x64x4xf32>
  %15 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%13 : tensor<16x64x8x4xf32>) outs(%14 : tensor<16x8x64x4xf32>) {
  ^bb0(%arg2: f32, %arg3: f32):
    linalg.yield %arg2 : f32
  } -> tensor<16x8x64x4xf32>
  %16 = tensor.collapse_shape %15 [[0, 1], [2, 3]] : tensor<16x8x64x4xf32> into tensor<128x256xf32>
  %17 = hal.tensor.export %16 : tensor<128x256xf32> -> !hal.buffer_view
  return %17 : !hal.buffer_view
}
```
- `IREE::Flow::createPadLinalgOpsToIntegerMultiplePass`: pads the M, N, and K dimensions of matmuls up to multiples of `paddingSize`, which defaults to 4.
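A minimal sketch of the idea (my own illustration; the shapes and the exact op sequence are assumptions, not IREE's actual output): a matmul whose M/N/K are not multiples of 4 has its operands zero-padded up to the next multiple, and the original result is sliced back out afterwards.

```mlir
// M = 3, K = 5, N = 6 are padded up to 4, 8, 8 (multiples of paddingSize = 4).
func.func @pad_matmul(%a: tensor<3x5xf32>, %b: tensor<5x6xf32>) -> tensor<3x6xf32> {
  %cst = arith.constant 0.0 : f32
  %a_pad = tensor.pad %a low[0, 0] high[1, 3] {
  ^bb0(%i: index, %j: index):
    tensor.yield %cst : f32
  } : tensor<3x5xf32> to tensor<4x8xf32>
  %b_pad = tensor.pad %b low[0, 0] high[3, 2] {
  ^bb0(%i: index, %j: index):
    tensor.yield %cst : f32
  } : tensor<5x6xf32> to tensor<8x8xf32>
  %init = linalg.init_tensor [4, 8] : tensor<4x8xf32>
  %fill = linalg.fill ins(%cst : f32) outs(%init : tensor<4x8xf32>) -> tensor<4x8xf32>
  // The matmul runs on the padded shapes.
  %mm = linalg.matmul ins(%a_pad, %b_pad : tensor<4x8xf32>, tensor<8x8xf32>) outs(%fill : tensor<4x8xf32>) -> tensor<4x8xf32>
  // Slice the original 3x6 result back out.
  %res = tensor.extract_slice %mm[0, 0] [3, 6] [1, 1] : tensor<4x8xf32> to tensor<3x6xf32>
  return %res : tensor<3x6xf32>
}
```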
- `mlir::createLinalgNamedOpConversionPass`: converts `linalg.depthwise_conv_2d_nhwc_hwcm` with `depth_multiplier=1` into `linalg.depthwise_conv_2d_nhwc_hwc`, and `linalg.depthwise_conv_2d_nhwc_hwcm_q` with `depth_multiplier=1` into `linalg.depthwise_conv_2d_nhwc_hwc_q`. See https://www.tensorflow.org/api_docs/python/tf/keras/layers/DepthwiseConv2D for what `depth_multiplier` does.
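A rough sketch of the rewrite (my own illustration; the exact reshape sequence the pass emits may differ): when the channel-multiplier dimension is 1, the filter and the output drop that unit dimension and the `hwc` variant is used.

```mlir
// Before: depthwise conv with channel multiplier M = 1
// (input NxHxWxC, filter HxWxCx1, output NxOHxOWxCx1).
func.func @dw_conv(%input: tensor<1x4x4x3xf32>, %filter: tensor<2x2x3x1xf32>, %init: tensor<1x3x3x3x1xf32>) -> tensor<1x3x3x3x1xf32> {
  %0 = linalg.depthwise_conv_2d_nhwc_hwcm {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%input, %filter : tensor<1x4x4x3xf32>, tensor<2x2x3x1xf32>) outs(%init : tensor<1x3x3x3x1xf32>) -> tensor<1x3x3x3x1xf32>
  return %0 : tensor<1x3x3x3x1xf32>
}

// After: the unit multiplier dimension is collapsed away and the hwc form is used.
func.func @dw_conv(%input: tensor<1x4x4x3xf32>, %filter: tensor<2x2x3x1xf32>, %init: tensor<1x3x3x3x1xf32>) -> tensor<1x3x3x3x1xf32> {
  %f = tensor.collapse_shape %filter [[0], [1], [2, 3]] : tensor<2x2x3x1xf32> into tensor<2x2x3xf32>
  %o = tensor.collapse_shape %init [[0], [1], [2], [3, 4]] : tensor<1x3x3x3x1xf32> into tensor<1x3x3x3xf32>
  %0 = linalg.depthwise_conv_2d_nhwc_hwc {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%input, %f : tensor<1x4x4x3xf32>, tensor<2x2x3xf32>) outs(%o : tensor<1x3x3x3xf32>) -> tensor<1x3x3x3xf32>
  %1 = tensor.expand_shape %0 [[0], [1], [2], [3, 4]] : tensor<1x3x3x3xf32> into tensor<1x3x3x3x1xf32>
  return %1 : tensor<1x3x3x3x1xf32>
}
```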
- `IREE::Flow::createExpandTensorShapesPass`: expands a dynamic tensor into the paired form tensor + dynamic dims; one benefit is that the dynamic dimensions can directly participate in computation and inference. For example:

```mlir
// func.func private @add(%arg0 : tensor<?x2xf32>, %arg1 : tensor<?x2xf32>) -> tensor<?x2xf32>
// iree_input.global private mutable @param : tensor<?x2xf32>
// func.func @run(%arg0 : tensor<?x2xf32>) -> tensor<?x2xf32> {
//   %0 = iree_input.global.load @param : tensor<?x2xf32>
//   %1 = call @add(%0, %arg0) : (tensor<?x2xf32>, tensor<?x2xf32>) -> tensor<?x2xf32>
//   iree_input.global.store %1, @param : tensor<?x2xf32>
//   return %1 : tensor<?x2xf32>
// }
func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
util.global private mutable @param : tensor<?x2xf32>
func.func @run(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %c0 = arith.constant 0 : index
  %param = util.global.load @param : tensor<?x2xf32>
  %dim = tensor.dim %param, %c0 : tensor<?x2xf32>
  %0 = hal.tensor.export %param : tensor<?x2xf32>{%dim} -> !hal.buffer_view
  %1 = call @add(%0, %arg0) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
  %2 = hal.buffer_view.dim<%1 : !hal.buffer_view>[0] : index
  %3 = hal.tensor.import %1 : !hal.buffer_view -> tensor<?x2xf32>{%2}
  util.global.store %3, @param : tensor<?x2xf32>
  return %1 : !hal.buffer_view
}
```

is converted into:

```mlir
func.func private @add(!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub}
util.global private mutable @param : tensor<?x2xf32>
util.global private mutable @param__d0 : index
func.func @run(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %c0 = arith.constant 0 : index
  %param = util.global.load @param : tensor<?x2xf32>
  %param__d0 = util.global.load @param__d0 : index
  %0 = flow.tensor.tie_shape %param : tensor<?x2xf32>{%param__d0}
  %dim = tensor.dim %0, %c0 : tensor<?x2xf32>
  %1 = hal.tensor.export %0 : tensor<?x2xf32>{%dim} -> !hal.buffer_view
  %2 = call @add(%1, %arg0) : (!hal.buffer_view, !hal.buffer_view) -> !hal.buffer_view
  %3 = hal.buffer_view.dim<%2 : !hal.buffer_view>[0] : index
  %4 = hal.tensor.import %2 : !hal.buffer_view -> tensor<?x2xf32>{%3}
  util.global.store %4, @param : tensor<?x2xf32>
  util.global.store %3, @param__d0 : index
  return %2 : !hal.buffer_view
}
```
To be continued...