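MIGraphX Lowering Module: An In-Depth Analysis

This document analyzes how the Lowering (backend adaptation) pass in AMD MIGraphX works, covering the implementation differences and core mechanisms across the GPU, CPU, and Ref backends.

1. Overview

Lowering is the pivotal pass in the MIGraphX compilation pipeline that connects the stages before and after it: it converts platform-independent intermediate representation (IR) operators into backend-specific operators. It is the dividing line between graph optimization and code generation:

Input: generic operators that have gone through several rounds of graph optimization (such as convolution, dot, softmax)
Output: backend-specific operators (such as gpu::miopen_op, dnnl::convolution, ref::op)
Side effects: inserts memory allocations (allocate), data transfers (hip::copy_to_gpu), and deferred-compilation markers (gpu::precompile_op)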

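Backend	Core Lowering task
GPU	Map to MIOpen/rocBLAS/hipBLASLt, or mark for JIT compilation; insert device memory allocations and H2D/D2H copies
CPU	Map to oneDNN (DNNL) operators or custom CPU kernels; pre-fusion matching for some operators
Ref	Wrap directly as `ref_op`, or replace with a reference implementation (such as `ref_gemm`, `ref_softmax`)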

2. Where Lowering sits in the compilation pipeline

2.1 GPU Target Pipeline

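[Figure: GPU target pipeline. Stages in the order shown: general graph optimization (optimize_module) → fusion passes (fuse_attention / fuse_mlir / fuse_ck) → layout_convolution → prefuse_ops → lowering → eliminate_contiguous → compile_miopen → fuse_ops → compile_hipblaslt → compile_ops → memory management (replace_allocate / memory_coloring)]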

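Lowering sits after the fusion optimizations and before compilation and execution. Its output directly determines which later passes get involved:

gpu::miopen_op is generated → compiled by compile_miopen
gpu::precompile_op is generated → JIT-compiled by compile_ops
gpu::hipblaslt_op is generated → compiled by compile_hipblaslt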

2.2 CPU / Ref Target Pipeline

[Figure: CPU / Ref target pipeline. Stages in the order shown: general graph optimization → auto_contiguous → lowering → eliminate_contiguous → fuse_ops → write_literals / memory_coloring]

Lowering for CPU/Ref is comparatively simple; it is usually followed directly by fuse_ops and the memory-management passes.

3. GPU Lowering in depth

3.1 Overall architecture

[Figure: GPU Lowering architecture.
Input (generic IR module): convolution, dot, pooling, softmax, pointwise (produced by fusion), neg, if, @param.
gpu::lowering (miopen_apply): init() builds apply_map; iterate over all instructions; dispatch on an apply_map hit, a has_compiler_for() hit, the target=cpu fallback, or no match (kept as-is); handle the prefill attribute; copy_params() adds host↔device copies.
Output (GPU-specific IR): gpu::miopen_op, gpu::hipblaslt_op / rocblas_gemm, gpu::pooling or gpu::precompile_op, gpu::precompile_op, gpu::sub (0 - x), gpu::loop / gpu::if, hip::copy_from_gpu, hip::copy_to_gpu.]

3.2 `miopen_apply`: the core executor of GPU Lowering

// src/targets/gpu/lowering.cpp
struct miopen_apply
{
    module* mod              = nullptr;  // module currently being processed
    module_pass_manager* mpm = nullptr;  // module pass manager (used to create submodules)
    const lowering* pass     = nullptr;  // lowering pass configuration
    std::unordered_map<std::string, std::function<instruction_ref(instruction_ref)>> apply_map{};
    bool offload_copy = false;           // whether to handle H2D/D2H copies automatically
    bool compute_fp32 = false;           // flag for FP32 compute in rocBLAS

    void init();     // build apply_map
    void apply();    // main loop: visit and replace instructions
    void copy_params() const;  // handle copies for parameters and return values
};

3.3 The four replacement paths

For each instruction, apply() decides in the following priority order:

void apply()
{
    init();
    for(auto it = mod->begin(); it != mod->end(); it++)
    {
        auto s     = it->get_shape();
        auto attrs = it->get_operator().attributes();

        // Path 1: apply_map hit (MIOpen / rocBLAS / GPU-specific)
        if(apply_map.count(it->name()) > 0)
        {
            check_shape(s, apply_map.at(it->name())(it));
        }
        // Path 2: a JIT compiler supports this operator
        else if(has_compiler_for(it->name()))
        {
            check_shape(s, insert_precompile_op(it));
        }
        // Path 3: custom-target operator (e.g. the target=cpu fallback)
        else if(attrs.contains("target"))
        {
            check_shape(s, insert_custom_op(it, attrs));
        }
        // Path 4: no match, keep the instruction as-is

        // Additionally: handle the prefill attribute (fill-initialize the output)
        if(attrs.contains("prefill"))
        {
            insert_fill(it, attrs.at("prefill"));
        }
    }
    copy_params();
}
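
The priority order above can be mimicked with a small standalone sketch. Everything in it is hypothetical and independent of MIGraphX: `toy_instruction`, `classify`, `needs_prefill`, and the two name sets merely stand in for `apply_map`, `has_compiler_for()`, and the `attributes()` lookup, to show that an earlier path shadows a later one and that the prefill check is separate from the chain.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for an IR instruction: a name plus string attributes.
struct toy_instruction
{
    std::string name;
    std::map<std::string, std::string> attrs;
};

enum class path { library, jit, custom_target, unchanged };

// Mirrors the if / else-if chain: the first matching path wins.
inline path classify(const toy_instruction& ins,
                     const std::set<std::string>& library_ops,
                     const std::set<std::string>& jit_ops)
{
    if(library_ops.count(ins.name) > 0)
        return path::library;
    if(jit_ops.count(ins.name) > 0)
        return path::jit;
    if(ins.attrs.count("target") > 0)
        return path::custom_target;
    return path::unchanged;
}

// The prefill check runs regardless of which path was taken above.
inline bool needs_prefill(const toy_instruction& ins)
{
    return ins.attrs.count("prefill") > 0;
}
```

In this toy model, an operator listed in both sets is classified as `path::library`, and a `"target"` attribute only matters when neither set contains the operator name.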

3.4 Path 1: apply_map, mapping to precompiled libraries

The apply_map built by init() defines the mapping from operators to replacement functions:

Registration method	Operators handled	Replacement target	Key logic
`add_convolution_op()`	`convolution`, `quant_convolution`, `convolution_backwards`	`gpu::miopen_op`	Wraps the MIOpen convolution descriptor
`add_gemm_op<op::dot>()`	`dot`, `quant_dot`	`rocblas_gemm` or `gpu::hipblaslt_op`	FP8 inputs always go to hipBLASLt; otherwise architectures such as gfx90 default to rocBLAS
`add_pooling_op()`	`pooling`	`gpu::pooling` or `gpu::precompile_op`	Conditional: data type, mode, padding symmetry
`add_generic_op()`	`contiguous`	`gpu::contiguous`	Direct mapping; attributes are not copied
`add_extend_op()`	`argmax`, `argmin`, `logsoftmax`, `multinomial`, `nonzero`, `reverse`, `prefix_scan_sum`, `rnn_var_sl_*`	`gpu::argmax` etc.	Copies the original operator's attributes
`add_neg_op()`	`neg`	`gpu::sub`	`0 - input`; inserts an all-zero literal
`add_if_op()`	`if`	original operator + `hip::copy_from_gpu`	The condition is copied from GPU to CPU
`add_loop_op()`	`loop`	`gpu::loop`	Iteration variables are copied to the CPU; output memory is allocated
`add_nms_op()`	`nonmaxsuppression`	original operator + `hip::copy_from_gpu` + `hip::copy_to_gpu`	NMS runs on the CPU
`add_lrn_op()`	`lrn`	same as above	LRN runs on the CPU (when MIOpen does not cover it)
`add_convolution_backwards_op()`	`convolution_backwards`	same as above	CPU fallback
`add_select_module_op()`	`select_module`	original operator + `allocate`	Allocates the submodule outputs
`add_reshape_lazy_op()`	`reshape`	`gpu::contiguous` + `reshape_lazy` + `gpu::contiguous`	Lazy reshape; the surrounding contiguous ops can be eliminated later
`add_concat_past_present_op()`	`concat_past_present`	`gpu::precompile_op`	LLM KV-cache concatenation
`add_scan_slice_op()`	`scan_slice`	original operator + `hip::copy_from_gpu`	Slice runs on the CPU

How the GEMM path is selected in detail:

// (abridged: the setup of refs and gemm_op is omitted here)
template <typename Op>
void add_gemm_op(const std::string& name)
{
    apply_map.emplace(name, [=](instruction_ref ins) {
        bool has_fp8_inputs = std::any_of(ins->inputs().begin(), ins->inputs().end(),
            [](auto i) { return contains(fp8_types{}.get(), i->get_shape().type()); });

        // Choose the GEMM provider. rocBLAS is used (for non-FP8 inputs only) when:
        // 1. the user explicitly selects rocblas
        // 2. the hardware does not support hipblaslt
        // 3. the architecture defaults to rocblas (gfx90 and similar)
        if(not has_fp8_inputs and
           ((string_value_of(MIGRAPHX_SET_GEMM_PROVIDER{}) == "rocblas") or
            not hipblaslt_supported() or gpu::gfx_default_rocblas()))
        {
            return mod->replace_instruction(ins, rocblas_gemm<Op>{Op{}, 1, 0, compute_fp32}, refs);
        }
        // Otherwise use hipBLASLt
        return mod->replace_instruction(ins,
            make_op("gpu::hipblaslt_op", {{"op", to_value(gemm_op)}}), ...);
    });
}
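
The selection rule can be isolated as a pure function, which makes the precedence of the FP8 check easier to see. This is a minimal sketch with hypothetical names (`gemm_env`, `choose_gemm_provider`); the four fields only summarize the checks made in the excerpt above and are not MIGraphX APIs.

```cpp
#include <cassert>
#include <string>

enum class gemm_provider { rocblas, hipblaslt };

// Hypothetical summary of the conditions evaluated in add_gemm_op.
struct gemm_env
{
    bool has_fp8_inputs;        // any input uses an FP8 type
    std::string provider_env;   // value of MIGRAPHX_SET_GEMM_PROVIDER ("" if unset)
    bool hipblaslt_supported;   // hipBLASLt is usable on this device
    bool arch_defaults_rocblas; // the architecture defaults to rocBLAS
};

inline gemm_provider choose_gemm_provider(const gemm_env& e)
{
    const bool prefer_rocblas = e.provider_env == "rocblas" ||
                                !e.hipblaslt_supported ||
                                e.arch_defaults_rocblas;
    // FP8 inputs bypass every rocBLAS preference.
    if(!e.has_fp8_inputs && prefer_rocblas)
        return gemm_provider::rocblas;
    return gemm_provider::hipblaslt;
}
```

Under this model, setting the provider to "rocblas" has no effect once any input is FP8, which matches the `not has_fp8_inputs and (...)` structure of the original condition.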

Pooling condition checks:

// (abridged: op_val and kdims are derived from ins in the omitted part)
static bool use_miopen_pooling(instruction_ref ins)
{
    // 1. Disabled via environment variable
    if(enabled(MIGRAPHX_DISABLE_MIOPEN_POOLING{})) return false;
    // 2. Type is not float/half
    if(not contains({shape::float_type, shape::half_type}, ins->get_shape().type())) return false;
    // 3. count_include_pad combined with average mode is not supported
    auto mode = op_val.at("mode").to<op::pooling_mode>();
    if(op_val.at("count_include_pad").to<bool>() and mode == op::pooling_mode::average)
        return false;
    // 4. lpnorm mode is not supported
    if(mode == op::pooling_mode::lpnorm) return false;
    // 5. Padding must be symmetric
    auto op_padding = op_val.at("padding").to_vector<size_t>();
    return std::equal(op_padding.begin(), op_padding.begin() + kdims,
                      op_padding.begin() + kdims, op_padding.end());
}
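
The final symmetry check compares the first half of the padding vector with the second half. A minimal standalone sketch, assuming the padding is laid out as all leading pads followed by all trailing pads (the helper name `is_symmetric_padding` and the explicit size check are additions for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: [begin_0 .. begin_{k-1}, end_0 .. end_{k-1}].
inline bool is_symmetric_padding(const std::vector<std::size_t>& padding, std::size_t kdims)
{
    if(padding.size() != 2 * kdims)
        return false; // this sketch only handles the explicit begin/end layout
    const auto mid = padding.begin() + static_cast<std::ptrdiff_t>(kdims);
    // Four-iterator std::equal also returns false if the ranges differ in length.
    return std::equal(padding.begin(), mid, mid, padding.end());
}
```

For a 2D pooling, `{1, 2, 1, 2}` is symmetric (top equals bottom, left equals right), while `{1, 2, 0, 2}` is not.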

3.5 Path 2: JIT deferred-compilation marker

For operators not covered by apply_map but for which a JIT compiler exists:

instruction_ref insert_precompile_op(instruction_ref ins) const
{
    auto output = insert_allocation(ins, ins->get_shape());
    std::vector<instruction_ref> refs = ins->inputs();
    refs.push_back(output);

    return mod->replace_instruction(
        ins,
        make_op("gpu::precompile_op", {{"op", to_value(ins->get_operator())}}),
        refs,
        ins->module_inputs());  // keep the submodule references (the fused kernel logic)
}

Design intent: the Lowering stage does no actual compilation. It only packs the operator information and submodules into precompile_op; the later compile_ops pass then performs parallel JIT compilation and tuning in one place.

3.6 Path 3: CPU fallback

For operators marked with "target": "cpu", Lowering generates a chain of GPU↔CPU data transfers plus CPU execution:

instruction_ref insert_custom_op(instruction_ref ins, const value& attrs) const
{
    const auto& custom_op = ins->get_operator();
    if(attrs.at("target") == "cpu")
    {
        auto inputs = ins->inputs();
        auto output = inputs.back();
        std::vector<instruction_ref> cpu_inputs;

        // 1. Copy all inputs GPU → CPU
        std::transform(inputs.begin(), inputs.end(), std::back_inserter(cpu_inputs),
            [&](auto in) { return mod->insert_instruction(ins, make_op("hip::copy_from_gpu"), in); });

        // 2. Synchronize the stream
        cpu_inputs.front() = mod->insert_instruction(ins, make_op("hip::sync_stream"), cpu_inputs);

        // 3. Run the original operator on the CPU
        auto cpu_out = mod->insert_instruction(ins, custom_op, cpu_inputs);

        // 4. Copy the result CPU → GPU
        auto gpu_out = mod->insert_instruction(ins, make_op("hip::copy_to_gpu"), cpu_out, output);
        return mod->replace_instruction(ins, gpu_out);
    }
    return ins;  // other targets: leave the instruction unchanged
}

3.7 Injecting memory allocations

Every GPU operator needs an output buffer. Lowering injects allocate instructions uniformly through insert_allocation():

instruction_ref insert_allocation(instruction_ref ins, const shape& s) const
{
    return mod->insert_instruction(ins, make_op("allocate", {{"shape", to_value(s)}}));
}

Typical instruction structure after replacement:

Before:  %out = convolution(%input, %weight)
After:   %alloc = allocate(shape={...})
         %out = gpu::miopen_op(%input, %weight, %alloc)
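
The before/after structure can be reproduced on a toy IR. The sketch below is purely illustrative and assumes a hypothetical `toy_ins` record in which inputs are indices of earlier instructions; it is not how MIGraphX stores a module, but it shows the same rewrite: insert an `allocate` just before the operator and pass it as the trailing argument.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy IR: each instruction names its op and refers to inputs by index.
struct toy_ins
{
    std::string op;
    std::vector<std::size_t> inputs;
};

// Rewrites every "convolution" into allocate + gpu::miopen_op, with the new
// buffer appended as the last input. Returns the lowered program.
inline std::vector<toy_ins> lower(const std::vector<toy_ins>& prog)
{
    std::vector<toy_ins> out;
    std::vector<std::size_t> remap(prog.size()); // old index -> new index
    for(std::size_t i = 0; i < prog.size(); ++i)
    {
        toy_ins ins = prog[i];
        for(auto& in : ins.inputs)
            in = remap[in]; // inputs always precede their users
        if(ins.op == "convolution")
        {
            out.push_back({"allocate", {}});
            ins.op = "gpu::miopen_op";
            ins.inputs.push_back(out.size() - 1);
        }
        out.push_back(ins);
        remap[i] = out.size() - 1;
    }
    return out;
}
```

Lowering the four-instruction program `@param, @literal, convolution(0, 1), relu(2)` yields five instructions, with `allocate` at index 2 and `gpu::miopen_op(0, 1, 2)` at index 3; the `relu` input is remapped to index 3.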

3.8 offload_copy: automatic host-device data transfer

When lowering.offload_copy = true (root module), Lowering automatically handles the copies for parameters and return values:

void copy_params() const
{
    if(not offload_copy) return;

    // Parameters: Host → Device
    for(auto ins : iterator_for(*mod))
    {
        if(ins->name() != "@param") continue;
        if(ins->outputs().empty()) continue;  // parameters with no users need no copy

        auto pos = std::next(ins);
        auto a   = insert_allocation(pos, ins->get_shape());
        auto c   = mod->insert_instruction(pos, make_op("hip::copy_to_gpu"), ins, a);
        mod->replace_instruction(ins, c);  // replace the param with its copy on the GPU
    }

    // Return values: Device → Host
    auto ret = std::prev(mod->end());
    if(ret->name() == "@return")
    {
        for(const auto& in : ret->inputs())
        {
            auto p_output = mod->insert_instruction(ret, make_op("hip::copy_from_gpu"), in);
            instruction::replace_argument(ret, in, p_output);
        }
    }
}

This lets users pass parameters in from CPU memory and receive results in CPU memory, without managing GPU memory by hand.

3.9 Where `get_operator().attributes()` comes from and how it works

In the main loop of miopen_apply::apply(), the operator's attributes are read for every instruction processed:

auto attrs = it->get_operator().attributes();
if(apply_map.count(it->name()) > 0) { ... }
else if(has_compiler_for(it->name())) { ... }
else if(attrs.contains("target")) { insert_custom_op(it, attrs); }
if(attrs.contains("prefill")) { insert_fill(it, attrs.at("prefill")); }

The "target" and "prefill" attributes used here do not appear out of nowhere: concrete operator classes inject them by defining an attributes() method.

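3.9.1 The `attributes()` interface

attributes() is a member function exposed by the type-erased operation wrapper (defined at src/include/migraphx/operation.hpp:552):

struct operation {
    // ... other interface functions ...
    value attributes() const;  // returns a key-value dictionary of operator metadata
};

Each concrete operator class (such as convolution, split_fused_reduce, relu) may define attributes() const to return a custom value dictionary. The pass system uses these attributes for:

Graph visualization (fillcolor controls the GraphViz node color)
Attribute normalization (normalize_padding and normalize_axes indicate which attributes need dimension alignment)
Pass behavior control (target, prefill, pointwise, reduce, and so on)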

3.9.2 The `"target"` attribute: backend marker for custom operators

Where it is set: src/api/api.cpp:360-363

template <class CustomOp>
struct custom_operation
{
    value attributes() const
    {
        return {
            {"custom_op", true},
            {"target", op.runs_on_offload_target() ? "gpu" : "cpu"}
        };
    }
    // ...
};

When it applies: when a user registers a custom operator through the C API (migraphx_experimental_custom_op) or the Python API, MIGraphX wraps it as custom_operation<CustomOp>. runs_on_offload_target() is implemented by the user: returning true means the operator can run directly on the GPU (for example a custom HIP kernel), and returning false means it must run on the CPU.

Handling in Lowering:

else if(attrs.contains("target"))
{
    check_shape(s, insert_custom_op(it, attrs));
}

target == "cpu": takes the CPU fallback path of insert_custom_op() (the D2H → CPU execution → H2D chain described in section 3.6)
target == "gpu": the original operator is kept and handled later by the GPU execution path

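3.9.3 The `"prefill"` attribute: output-buffer initialization marker

Where it is set: src/split_reduce.cpp:56

struct split_fused_reduce
{
    value attributes() const { return {{"prefill", 0}}; }
    // ...
};

When it applies: when the fuse_pointwise_reduce pass fuses a "pointwise operator + reduction operator" pair and the reduction is too large, it calls split_reduce to split it into multiple partial reductions, which eventually produces the split_fused_reduce operator. Because this operator uses a "reduce-then-atomic-add" strategy, the output buffer must be zeroed first; otherwise the atomic adds would accumulate onto uninitialized garbage values.

Handling in Lowering:

void insert_fill(instruction_ref ins, value v) const
{
    instruction_ref alloc = instruction::get_output_alias(ins, true);
    if(alloc == ins) return;
    auto fill = mod->insert_instruction(ins, make_op("hip::fill", {{"value", v}}), alloc);
    instruction::replace_argument(ins, alloc, fill);
}

Find the operator's output alias (that is, the allocate instruction)
Insert hip::fill(value=0) before the operator to prefill the output buffer with 0
Change the operator's output argument from alloc to fill (which ensures fill runs before the operator)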
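
Why the zero fill matters can be shown with a single-threaded toy version of a split reduction. The helpers below (`accumulate_partials`, `split_reduce_sum`) are hypothetical and use a plain `+=` where a GPU kernel would use an atomic add; `stale` models whatever value the output buffer held before the kernel ran.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each chunk computes a partial sum and adds it into the shared output slot.
// The plain += stands in for an atomic add. chunk must be greater than zero.
inline void accumulate_partials(const std::vector<int>& data, std::size_t chunk, int& out)
{
    for(std::size_t start = 0; start < data.size(); start += chunk)
    {
        int partial = 0;
        for(std::size_t i = start; i < data.size() && i < start + chunk; ++i)
            partial += data[i];
        out += partial;
    }
}

// Runs the split reduction on a buffer whose previous content is `stale`,
// optionally zero-filling it first (the "prefill" step).
inline int split_reduce_sum(const std::vector<int>& data, std::size_t chunk, int stale, bool prefill)
{
    int out = stale;
    if(prefill)
        out = 0;
    accumulate_partials(data, chunk, out);
    return out;
}
```

With input `{1, 2, 3, 4, 5}` and a stale buffer value of 42, the prefilled run returns 15, while the run without prefill returns 57 because the partial sums are added on top of the stale value.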

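3.9.4 Summary of other common attributes

Attribute key	Where it is set (examples)	Purpose
`"pointwise"`	`src/include/migraphx/op/unary.hpp:62` `src/include/migraphx/op/binary.hpp:60`	Marks pointwise operators; the `fuse_pointwise` pass relies on it to identify fusible chains
`"reduce"`	`src/include/migraphx/op/reduce_op.hpp:95`	Marks reduction operators; used for graph analysis and scheduling
`"normalize_padding"`	`src/include/migraphx/op/convolution.hpp:77` `src/include/migraphx/op/pooling.hpp:135`	Tells the `normalize_ops` pass to align the dimensions of the `padding` attribute
`"normalize_axes"`	`src/include/migraphx/op/reduce_op.hpp:93` `src/include/migraphx/op/slice.hpp:96`	Tells the `normalize_ops` pass to convert negative axes in the `axes` attribute to non-negative ones
`"fillcolor"`	Almost all operators	Node background color for GraphViz visualization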

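3.9.5 Design takeaways

The attributes() mechanism is the key channel for loosely coupled communication between operators and passes in MIGraphX:

Operators do not call passes directly; they "declare" their characteristics through attributes
Passes do not hard-code operator lists; they read attributes and branch on them
New operators need no changes to Lowering: as long as attributes() returns "target" or "prefill", Lowering recognizes and handles them automatically

This makes the GPU Lowering decision chain (apply_map → has_compiler_for → target, followed by the separate prefill check) extensible: a new operator only has to declare its attributes correctly to fit into the lowering flow.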

4. CPU Lowering in depth

4.1 Overall architecture

CPU Lowering is carried out by cpu_apply. Its core task is to map generic operators to oneDNN (DNNL) operators:

// src/targets/cpu/lowering.cpp
struct cpu_apply
{
    module* modl;
    std::unordered_map<std::string, std::function<instruction_ref(instruction_ref)>> apply_map{};

    void init();   // register the DNNL mappings
    void apply();  // run fusion matching first, then visit and replace
};

4.2 oneDNN operator mapping

void init()
{
    // Binary operators
    extend_dnnl_algos("dnnl::binary", {
        {"add", "binary_add"}, {"div", "binary_div"},
        {"max", "binary_max"}, {"min", "binary_min"}, {"mul", "binary_mul"}
    });

    // Elementwise activations
    extend_dnnl_algos("dnnl::eltwise", {
        {"abs", "eltwise_abs"}, {"elu", "eltwise_elu"}, {"exp", "eltwise_exp"},
        {"log", "eltwise_log"}, {"relu", "eltwise_relu"},
        {"sqrt", "eltwise_sqrt"}, {"tanh", "eltwise_tanh"}
    });

    // Reductions
    extend_dnnl_algos("dnnl::reduction", {
        {"reduce_max", "reduction_max"}, {"reduce_mean", "reduction_mean"},
        {"reduce_min", "reduction_min"}, {"reduce_sum", "reduction_sum"}
    });

    // Whole-operator mappings
    extend_op("concat", "dnnl::concat");
    extend_op("contiguous", "dnnl::reorder");
    extend_op("convolution", "dnnl::convolution");
    extend_op("dot", "dnnl::dot");  // may be replaced when ZenDNN is enabled
    extend_op("gather", "cpu::gather");
    extend_op("logsoftmax", "dnnl::logsoftmax");
    extend_op("lrn", "dnnl::lrn");
    extend_op("softmax", "dnnl::softmax");

    // Custom CPU implementations (no DNNL counterpart)
    extend_op("im2col", "cpu::im2col", false);
    extend_op("leaky_relu", "cpu::leaky_relu", false);
    extend_op("pad", "cpu::pad", false);
    extend_op("rnn_var_sl_last_output", "cpu::rnn_var_sl_last_output", false);
}

4.3 Pre-fusion matching

Unlike GPU Lowering, CPU Lowering runs graph pattern-matching fusion before the replacement:

void apply()
{
    init();

    // Step 1: fusion matching
    match::find_matches(*modl,
        fuse_match(match::gelu_erf(),
                   make_op("dnnl::eltwise", {{"algo", "eltwise_gelu_erf"}}), {"x"}),
        fuse_match(match::gelu_tanh(),
                   make_op("dnnl::eltwise", {{"algo", "eltwise_gelu_tanh"}}), {"x"}),
        fuse_match(match::layernorm(), make_op("dnnl::layernorm"), {"x"}));

    // Step 2: regular lowering
    for(auto it : iterator_for(*modl))
    {
        // Skip FP8 (not yet supported by oneDNN)
        if(std::any_of(it->inputs().begin(), it->inputs().end(),
            [](const auto& i) { return contains(fp8_types{}.get(), i->get_shape().type()); }))
            continue;

        if(it->name() == "pooling") { apply_pooling(it); }
        else if(it->name() == "reshape") { apply_reshape(it); }
        else if(apply_map.count(it->name()) > 0) { apply_map.at(it->name())(it); }
    }
}

Fusion patterns matched:

gelu_erf() → dnnl::eltwise (eltwise_gelu_erf)
gelu_tanh() → dnnl::eltwise (eltwise_gelu_tanh)
layernorm() → dnnl::layernorm

4.4 Conditional lowering

CPU Lowering applies conditions to some operators:

Pooling: mapped only when the oneDNN constraints are met

// (abridged: op and v, the operator and its value, are set up in the omitted part)
instruction_ref apply_pooling(instruction_ref ins) const
{
    if(has_op("dnnl::pooling") and
       ins->get_shape().type() == shape::type_t::float_type and
       not v["ceil_mode"].to<bool>() and
       v["mode"].to<op::pooling_mode>() != op::pooling_mode::lpnorm)
    {
        return replace(ins, make_op("dnnl::pooling", op.to_value()));
    }
    return ins;  // conditions not met, keep as-is
}

Reshape: converted to reshape_lazy surrounded by dnnl::reorder (the reorders can be eliminated later)

instruction_ref apply_reshape(instruction_ref ins) const
{
    // reshape → contiguous + reshape_lazy + contiguous (contiguous is dnnl::reorder here)
    auto before_contig = modl->insert_instruction(ins, make_op("dnnl::reorder"), ...);
    auto new_reshape   = modl->insert_instruction(ins, make_op("reshape_lazy"), before_contig);
    return modl->replace_instruction(ins, make_op("dnnl::reorder"), ...);
}

5. Ref Lowering in depth

The Ref (Reference) backend is the simplest Lowering implementation and is used for correctness verification.

5.1 Core logic

// src/targets/ref/lowering.cpp
struct ref_apply
{
    module* mod;
    std::unordered_map<std::string, std::function<void(instruction_ref)>> apply_map{};

    void init()
    {
        // A few operators need specialized implementations
        apply_map["dot"]        = extend_op<ref_gemm, op::dot>();
        apply_map["quant_dot"]  = extend_op<ref_quant_gemm, op::quant_dot>();
        apply_map["im2col"]     = extend_op<ref_im2col, op::im2col>();
        apply_map["logsoftmax"] = extend_op<ref_softmax<op::logsoftmax>, op::logsoftmax>();
        apply_map["lrn"]        = extend_op<ref_lrn, op::lrn>();
        apply_map["pad"]        = extend_op<ref_pad, op::pad>();
        apply_map["softmax"]    = extend_op<ref_softmax<op::softmax>, op::softmax>();
        apply_map["rnn_var_sl_last_output"] = extend_op<ref_rnn_var_sl_last_output, ...>();
    }

    void apply()
    {
        init();
        for(auto it : iterator_for(*mod))
        {
            if(apply_map.count(it->name()) > 0)
            {
                apply_map.at(it->name())(it);
            }
            else if(is_context_free(it->get_operator()))
            {
                // Most operators are wrapped directly as ref_op
                mod->replace_instruction(it, ref_op{it->get_operator()}, it->inputs());
            }
        }
    }
};

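5.2 Design characteristics

Characteristic	Description
Minimal mapping	Most operators are wrapped directly as `ref_op`, which calls the original operator's `compute()`
No external dependencies	Does not depend on libraries such as MIOpen, oneDNN, or rocBLAS
Pure CPU execution	All kernels are implemented as C++ loops (`par_dfor`, `par_for`)
Used for verification	Results are compared against the GPU/CPU backends to verify correctness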

6. Comparison of Lowering across the three backends

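[Figure: Lowering in the three backends, all starting from generic IR operators (conv / dot / softmax / pooling ...).
GPU Lowering: rich apply_map (MIOpen/rocBLAS/hipBLASLt); JIT deferred-compilation marker (precompile_op); memory allocation injection (allocate); H2D/D2H copy injection (offload_copy); CPU fallback support.
CPU Lowering: mainly oneDNN mapping; pre-fusion matching (gelu/layernorm); custom CPU kernels (im2col/pad); reshape → reorder; FP8 skipped.
Ref Lowering: generic ref_op wrapping; a few specialized implementations (gemm/softmax/lrn); no external dependencies.]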

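Dimension	GPU	CPU	Ref
Core mapping target	MIOpen / rocBLAS / hipBLASLt / JIT	oneDNN (DNNL)	`ref_op` wrapper
Fusion support	Fusion completes before lowering (fuse_* passes)	Pre-fusion matching inside lowering (gelu/layernorm)	No fusion
Memory management	Injects `allocate` instructions	Injects `allocate` instructions	No explicit allocation injection
Data transfer	Supports automatic H2D/D2H via `offload_copy`	None (pure CPU)	None (pure CPU)
Fallback	CPU fallback (`target=cpu`)	None	None
Deferred compilation	`precompile_op` marker	None	None
Conditional logic	pooling type/mode checks, GEMM provider selection	pooling conditions, FP8 skip, reshape handling	`is_context_free` check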

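7. Summary of key design patterns

7.1 Apply map pattern

All three backends register operator replacement strategies in an unordered_map from operator name to callback (the Ref backend's callbacks return void rather than instruction_ref):

std::unordered_map<std::string, std::function<instruction_ref(instruction_ref)>> apply_map;

apply_map.emplace("convolution", [=](instruction_ref ins) {
    // replacement logic...
});

// Look up at execution time
if(apply_map.count(it->name()) > 0) {
    apply_map.at(it->name())(it);
}

Advantage: supporting a new operator only requires adding an entry in init(); the main loop does not change.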
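
The registry pattern itself is easy to reproduce outside MIGraphX. In this minimal sketch the callbacks map an operator name to a replacement name instead of rewriting instructions; `rewrite_fn`, `make_apply_map`, and `lower_names` are hypothetical names chosen for the illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using rewrite_fn = std::function<std::string(const std::string&)>;

// Hypothetical registry with the same shape as init(): one entry per operator name.
inline std::unordered_map<std::string, rewrite_fn> make_apply_map()
{
    std::unordered_map<std::string, rewrite_fn> m;
    m.emplace("convolution", [](const std::string&) { return std::string("gpu::miopen_op"); });
    m.emplace("neg", [](const std::string&) { return std::string("gpu::sub"); });
    // Several entries can share one generic rule.
    for(const char* name : {"argmax", "argmin"})
        m.emplace(name, [](const std::string& n) { return "gpu::" + n; });
    return m;
}

// The main loop stays the same no matter how many entries are registered.
inline std::vector<std::string> lower_names(const std::vector<std::string>& ops)
{
    const auto apply_map = make_apply_map();
    std::vector<std::string> out;
    for(const auto& name : ops)
    {
        auto it = apply_map.find(name);
        out.push_back(it != apply_map.end() ? it->second(name) : name);
    }
    return out;
}
```

Names without an entry pass through unchanged, which corresponds to the "no match, keep as-is" path of the real pass.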

7.2 Deferred compilation pattern (GPU only)

// Lowering stage: mark only, do not compile
insert_precompile_op(ins) → "gpu::precompile_op"

// compile_ops stage: compile everything together
for(auto ins : iterator_for(m)) {
    if(ins->name() == "gpu::precompile_op") {
        cm.add_plan(ctx, preop, ins, &m);  // create a compilation plan
    }
}
cm.compile(m);  // parallel compilation + tuning

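7.3 Shape check pattern

After every replacement, an assertion checks that the output shape is unchanged:

void check_shape(shape x, instruction_ref i) {
    assert(x == i->get_shape());
}

This guards against a lowering step changing the output shape of an instruction. Because it is an assert, the check is only active in builds where assertions are enabled.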

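7.4 Operator attribute inheritance pattern

add_extend_op() passes the original operator's to_value() in full to the backend operator:

// (abridged: the setup of refs is omitted here)
void add_extend_op(const std::string& op_name, const std::string& gpu_name) {
    apply_map.emplace(op_name, [=](instruction_ref ins) {
        auto&& op = ins->get_operator();
        return mod->replace_instruction(ins, make_op(gpu_name, op.to_value()), refs);
    });
}

This ensures the operator's attributes (padding, stride, mode, axis, and so on, whichever the operator defines) are not lost.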

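8. Environment variables and configuration

Environment variable / option	Effect	Backend affected
`MIGRAPHX_SET_GEMM_PROVIDER=rocblas`	Forces rocBLAS instead of hipBLASLt (except for FP8 inputs)	GPU
`MIGRAPHX_DISABLE_MIOPEN_POOLING`	Disables MIOpen pooling and uses JIT instead	GPU
`MIGRAPHX_DISABLE_MIOPEN_FUSION`	Disables MIOpen fusion	GPU
`offload_copy` (compile option)	Automatic H2D/D2H copies	GPU
`fast_math` (compile option)	Enables fast math approximations	GPU
`MIGRAPHX_ENABLE_ZENDNN`	Enables ZenDNN in place of oneDNN	CPU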

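9. Summary

Lowering is the central hub of backend adaptation in MIGraphX. Its design reflects the following key ideas:

Layered decisions: the four-level path of GPU Lowering (precompiled libraries → JIT → CPU fallback → keep as-is) allows flexible backend selection
Deferred compilation: precompile_op decouples compilation latency from graph transformation and supports parallel compilation and automatic tuning
Unified memory management: insert_allocation() injects device memory allocations automatically, which simplifies operator implementations
Attribute inheritance: add_extend_op() ensures operator parameters are passed through lowering intact
Conditional downgrade: operators such as Pooling and GEMM pick a path dynamically based on hardware capability, data type, and operator attributes

The key to understanding Lowering is that it is not a simple "name replacement" but a comprehensive backend adaptation process that includes path selection, memory planning, data transfer, and deferred-compilation marking.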
