hipcc 编译 amd gpu kernel 和 打包与解包的流程实验

1, hip cuda kernel 编译概观

编译的文件流:

.hip kernel --(clang++)--> .o

.o --(lld)--> .out

.out --(clang-offload-bundler)--> .hipfb

2,示例 hipcc -###

代码:

cpp 复制代码
__global__ void WWWWW()
{
	((int*)0x8888888)[3] = 0x77777;
}

操作过程:

bash 复制代码
$ hipcc -### param_00.hip --cuda-device-only --offload-arch=gfx906 
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.0.2 24012 af27734ed982b52a9f1be0f035ac91726fc697e4)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-6.0.2/llvm/bin
Configuration file: /opt/rocm-6.0.2/lib/llvm/bin/clang++.cfg
clang: warning: argument unused during compilation: '--rtlib=compiler-rt' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-unwindlib=libgcc' [-Wunused-command-line-argument]
 "/opt/rocm-6.0.2/lib/llvm/bin/clang-17" \
 "-cc1" "-triple" "amdgcn-amd-amdhsa" "-aux-triple" \
 "x86_64-unknown-linux-gnu" "-emit-obj" "-disable-free" \
 "-clear-ast-before-backend" "-disable-llvm-verifier" \
 "-discard-value-names" "-main-file-name" "param_00.hip" \
 "-mrelocation-model" "pic" "-pic-level" "2" "-fhalf-no-semantic-interposition" \
 "-mframe-pointer=none" "-fno-rounding-math" "-mconstructor-aliases" "-aux-target-cpu" \
 "x86-64" "-fcuda-is-device" "-mllvm" "-amdgpu-internalize-symbols" "-fcuda-allow-variadic-functions" \
 "-fvisibility=hidden" "-fapply-global-visibility-to-externs" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/hip.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/ocml.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/ockl.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_daz_opt_off.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_finite_only_off.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_wavefrontsize64_on.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_isa_version_906.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_abi_version_500.bc" \
 "-target-cpu" "gfx906" \
 "-debugger-tuning=gdb" "-resource-dir" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0" \
 "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include/cuda_wrappers" \
 "-idirafter" "/opt/rocm-6.0.2/lib/llvm/bin/../../../include" "-include" \
 "__clang_hip_runtime_wrapper.h" "-c-isystem" "/opt/rocm-6.0.2/llvm/include/gpu-none-llvm" \
 "-isystem" "/opt/rocm-6.0.2/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" \
 "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include" \
 "-internal-isystem" "/usr/local/include" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" \
 "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" \
 "-internal-externc-isystem" "/usr/include" "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include" \
 "-internal-isystem" "/usr/local/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" \
 "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" \
 "-internal-externc-isystem" "/usr/include" "-O3" "-fdeprecated-macro" "-fno-autolink" \
 "-fdebug-compilation-dir=/home/hipper/ex_amd_gpu_compiler/ex/parameters_hip/param_00" \
 "-ferror-limit" "19" "-fhip-new-launch-api" "-fgnuc-version=4.2.1" "-fcxx-exceptions" "-fexceptions" \
 "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-mllvm" "-amdgpu-early-inline-all=true" "-mllvm" \
 "-amdgpu-function-calls=false" "-cuid=3e1885b9958b336f" "-fcuda-allow-variadic-functions" "-faddrsig" \
 "-D__GCC_HAVE_DWARF2_CFI_ASM=1" \
 "-o" "/tmp/param_00-gfx906-a7e858.o" \
 "-x" "hip" "param_00.hip"




 "/opt/rocm-6.0.2/llvm/bin/lld" \
 "-flavor" "gnu" "-m" "elf64_amdgpu" \
 "--no-undefined" "-shared" \
 "-plugin-opt=-amdgpu-internalize-symbols" \
 "-plugin-opt=mcpu=gfx906" \
 "-plugin-opt=O3" "--lto-CGO3" \
 "-plugin-opt=-amdgpu-early-inline-all=true" \
 "-plugin-opt=-amdgpu-function-calls=false" \
 "--whole-archive" \
 "-o" "/tmp/param_00-gfx906-65b179.out" \
 "/tmp/param_00-gfx906-a7e858.o" \
 "--no-whole-archive"



 "/opt/rocm-6.0.2/llvm/bin/clang-offload-bundler" \
 "-type=o" "-bundle-align=4096" \
 "-targets=host-x86_64-unknown-linux,hipv4-amdgcn-amd-amdhsa--gfx906" \
 "-input=/dev/null" \
 "-input=/tmp/param_00-gfx906-65b179.out" \
 "-output=param_00.hip-hip-amdgcn-amd-amdhsa.hipfb"

涉及到了三个命令:

clang++ -o xxx.o

lld -o xxx.out

clang-offload-bundler -output=xxx.hipfb

3,分析中间文件

3.1 clang++ 编译生成的 .o

.o 是一个elf文件

这个.o 是使用自己编译出来的clang++ 编译的,

其中,将 cp /opt/rocm/bin/clang++.cfg local_amdgpu/llvm/bin/

稍作路径调整,编译生成 .o :

cpp 复制代码
$ "../../../../local_amdgpu/bin/clang-19" \
 "-cc1" "-triple" "amdgcn-amd-amdhsa" "-aux-triple" \
 "x86_64-unknown-linux-gnu" "-emit-obj" "-disable-free" \
 "-clear-ast-before-backend" "-disable-llvm-verifier" \
 "-discard-value-names" "-main-file-name" "param_00.hip" \
 "-mrelocation-model" "pic" "-pic-level" "2" "-fhalf-no-semantic-interposition" \
 "-mframe-pointer=none" "-fno-rounding-math" "-mconstructor-aliases" "-aux-target-cpu" \
 "x86-64" "-fcuda-is-device" "-mllvm" "-amdgpu-internalize-symbols" "-fcuda-allow-variadic-functions" \
 "-fvisibility=hidden" "-fapply-global-visibility-to-externs" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/hip.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/ocml.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/ockl.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_daz_opt_off.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_finite_only_off.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_wavefrontsize64_on.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_isa_version_906.bc" \
 "-mlink-builtin-bitcode" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/lib/amdgcn/bitcode/oclc_abi_version_500.bc" \
 "-target-cpu" "gfx906" \
 "-debugger-tuning=gdb" "-resource-dir" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0" \
 "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include/cuda_wrappers" \
 "-idirafter" "/opt/rocm-6.0.2/lib/llvm/bin/../../../include" "-include" \
 "__clang_hip_runtime_wrapper.h" "-c-isystem" "/opt/rocm-6.0.2/llvm/include/gpu-none-llvm" \
 "-isystem" "/opt/rocm-6.0.2/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" \
 "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include" \
 "-internal-isystem" "/usr/local/include" \
 "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" \
 "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" \
 "-internal-externc-isystem" "/usr/include" "-internal-isystem" "/opt/rocm-6.0.2/lib/llvm/lib/clang/17.0.0/include" \
 "-internal-isystem" "/usr/local/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" \
 "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" \
 "-internal-externc-isystem" "/usr/include" "-O3" "-fdeprecated-macro" "-fno-autolink" \
 "-fdebug-compilation-dir=/home/hipper/ex_amd_gpu_compiler/ex/parameters_hip/param_00" \
 "-ferror-limit" "19" "-fhip-new-launch-api" "-fgnuc-version=4.2.1" "-fcxx-exceptions" "-fexceptions" \
 "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-mllvm" "-amdgpu-early-inline-all=true" "-mllvm" \
 "-amdgpu-function-calls=false" "-cuid=3e1885b9958b336f" "-fcuda-allow-variadic-functions" "-faddrsig" \
 "-D__GCC_HAVE_DWARF2_CFI_ASM=1" \
 "-o" "./param_00-gfx906-a7e858.o" \
 "-x" "hip" "param_00.hip"

3.2 lld 链接得到的.out 文件

也属于 elf 文件,但是是DYN (Shared object file) 类型,不再是 relocationable 类型。

其生成命令也是稍作了路径改动:

bash 复制代码
"../../../../local_amdgpu/bin/lld" \
 "-flavor" "gnu" "-m" "elf64_amdgpu" \
 "--no-undefined" "-shared" \
 "-plugin-opt=-amdgpu-internalize-symbols" \
 "-plugin-opt=mcpu=gfx906" \
 "-plugin-opt=O3" "--lto-CGO3" \
 "-plugin-opt=-amdgpu-early-inline-all=true" \
 "-plugin-opt=-amdgpu-function-calls=false" \
 "--whole-archive" \
 "-o" "./param_00-gfx906-65b179.out" \
 "./param_00-gfx906-a7e858.o" \
 "--no-whole-archive"

3.3 hipfb 文件

这个类型的文件 是由 clang-offload-bundler 打包而成,这里没有什么新意,对 clang-offload-bundler做一个介绍:

clang-offload-bundler 是一个工具,它是 Clang/LLVM 编译器工具链的一部分,用于支持在多种设备上进行异构计算。这个工具主要用于处理和打包不同目标设备(如 CPU、GPU、FPGA 等)的代码,以便在一个单一的程序中支持多种计算设备。这种技术通常用于加速应用程序,特别是在需要大量并行处理的场景中。

3.3.1 生成 hipfb 文件的方法

bash 复制代码
 "../../../../local_amdgpu/bin/clang-offload-bundler" \
 "-type=o" "-bundle-align=4096" \
 "-targets=host-x86_64-unknown-linux,hipv4-amdgcn-amd-amdhsa--gfx906" \
 "-input=/dev/null" \
 "-input=./param_00-gfx906-65b179.out" \
 "-output=param_00.hip-hip-amdgcn-amd-amdhsa.hipfb"

3.3.2 主要功能和作用

  1. 打包和解包代码对象:
  • clang-offload-bundler 能够将针对不同设备的代码对象(如 CPU 和 GPU 的代码)打包到一个单一的文件中。这使得管理和分发针对异构计算平台的应用程序变得更加简单。
  • 同样,它也可以从这种打包的文件中提取特定目标设备的代码对象,以便在适当的设备上执行。
  1. 支持异构编程:
  • 在异构编程中,开发者可能需要将程序的不同部分优化并编译到不同的硬件平台上。clang-offload-bundler 通过管理这些不同的代码段,简化了构建和部署过程。
  1. 简化编译和链接流程:
  • 在使用 OpenMP 或 CUDA 等并行编程模型时,clang-offload-bundler 能够处理主机代码和加速器代码之间的交互,包括数据传输和执行控制。这样,开发者可以更专注于代码的并行部分,而不是底层的数据管理和设备控制。
  1. 提高性能和可移植性:
  • 通过允许代码针对特定硬件进行优化,clang-offload-bundler 帮助提高应用程序的性能。同时,它也支持代码的可移植性,因为同一个应用程序可以针对多种硬件平台进行编译和打包。

3.3.3 使用场景

  • 并行计算应用:在需要大量计算资源的应用中,如科学计算、图像处理、机器学习等,clang-offload-bundler 可以帮助开发者有效地利用多种计算资源。
  • 开发跨平台应用:对于需要在多种硬件设备上运行的软件,如桌面和移动设备,或者 CPU 和 GPU,clang-offload-bundler 提供了一种统一的方式来处理不同平台的代码。

意义:clang-offload-bundler 还是挺强大的,用于支持和简化异构计算环境中的编程和部署过程。它通过管理针对不同硬件平台的代码,使得开发高性能并行应用程序变得更加高效和可行。

3.3.4 解析 hipfb的方法

bash 复制代码
clang-offload-bundler -type=o -targets=hip-amdgcn-amd-amdhsa--gfx906 -input=param_00.hip-hip-amdgcn-amd-amdhsa.hipfb -output=device_output.o -unbundle

于是又得到了 bundle前的out 文件:

甚至连文件大小都一样:

相关推荐
编程重生之路14 天前
今年2024的1024文章
ai·ai编程·编译器·cursor·1024程序员节
Thanks_ks15 天前
【第五章·选择控制结构】第一节:生活中与计算机中的问题求解方法
算法·编译器·机器语言·分治策略·c 语言程序设计·计算机程序·程序设计语言
Eloudy22 天前
函数地址对齐 __attribute__((aligned(64))) 编译器选项 -falign-functions=4
算法·编译器
Trouvaille ~2 个月前
【C++篇】C++类与对象深度解析(六):全面剖析拷贝省略、RVO、NRVO优化策略
c++·c++20·编译原理·编译器·类和对象·rvo·nrvo
EleganceJiaBao3 个月前
【Story】编译器的基础概念与类型分类
java·c语言·c++·python·gnu·编译器·gcc
怜渠客3 个月前
VS2022快速搭建OLLVM
c++·ide·编译器
张一西3 个月前
ARM学习(31)编译器对overlay方式的支持
arm开发·编译器·overlay·动态加载·bank·armcc·armclang
harykali4 个月前
静态分析学习笔记02:程序中间表示(IR)
编译器
剑海风云4 个月前
GraalVM简介及使用
java·jvm·graalvm·编译器·本地镜像
黑不溜秋的4 个月前
LLVM-编译器结构
llvm