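From "It Runs" to "It Runs Fast": A Full Retrospective on Low-Level Optimization of Heterogeneous Math Libraries and AI Operators
The motivation for this article is simple: we were running AI models on a domestically produced accelerator card. They ran from the start, but one look at the performance curves showed we were still far from "usable". So we spent the better part of a year working through the math libraries and operator optimizations layer by layer, from GEMM to convolution and from TensorCore to operator generators. This article is the retrospective of that whole process.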
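1. Start by understanding GEMM: tiling strategies for matrix multiplication
Matrix multiplication is the "breathing" of AI computation: training and inference perform it constantly. But handing a large matrix to the hardware in one piece is not realistic; it has to be split up.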
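Our approach is to cut the large matrix into regular tiles and distribute the work across thread blocks and threads. Each thread block computes one sub-matrix (denoted MT0 × MT1), the thread blocks are organized as a two-dimensional grid, and each thread handles a finer-grained piece (TT0 × TT1):
MT0 = WG0 × TT0
MT1 = WG1 × TT1
Here WG0 × WG1 is the layout of threads inside one thread block. The total number of thread blocks in the grid is determined by the problem size:
totalWorkGroups0 = ceil(SizeM ÷ MT0)
totalWorkGroups1 = ceil(SizeN ÷ MT1)
Adjusting MT and TT controls the load on each thread block and each thread, so that the accelerator's parallel compute capacity is fully used.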
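To make the arithmetic concrete, here is a minimal sketch of the two formulas in C. The struct and function names (`TileConfig`, `GridShape`, `plan_grid`) are hypothetical and only mirror the symbols used above; they are not taken from any real library.

```c
#include <assert.h>

/* Hypothetical container for the tiling parameters named in the text. */
typedef struct {
    int wg0, wg1;   /* threads per thread block along dimension 0 / 1 */
    int tt0, tt1;   /* elements per thread along dimension 0 / 1      */
} TileConfig;

typedef struct {
    int mt0, mt1;               /* tile covered by one thread block */
    int total_wg0, total_wg1;   /* thread blocks along M and N      */
} GridShape;

/* Integer ceiling division for positive operands. */
int ceil_div(int a, int b) {
    assert(a > 0 && b > 0);
    return (a + b - 1) / b;
}

/* Apply MT = WG * TT and totalWorkGroups = ceil(Size / MT). */
GridShape plan_grid(TileConfig cfg, int size_m, int size_n) {
    GridShape g;
    g.mt0 = cfg.wg0 * cfg.tt0;
    g.mt1 = cfg.wg1 * cfg.tt1;
    g.total_wg0 = ceil_div(size_m, g.mt0);
    g.total_wg1 = ceil_div(size_n, g.mt1);
    return g;
}
```

For example, a 16 × 16 thread layout with 4 × 4 elements per thread gives a 64 × 64 tile; a 1000 × 256 problem then needs 16 × 4 thread blocks, and the last row of blocks is only partially filled.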
[Figure: the M×N matrix partitioned into MT0 × MT1 tiles and mapped onto a 4 × 4 thread-block grid, (0,0) through (3,3).]
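In one sentence: tile sizes are not picked by gut feeling. They have to be tuned jointly against the hardware's cache size, register count, and number of compute units. Tiles that are too small waste parallelism; tiles that are too large spill registers. It is a delicate balancing act.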
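One side of that balance can be estimated on paper. The sketch below is a rough model of my own, not a vendor formula: it assumes a thread holding a TT0 × TT1 tile needs one accumulator per output element plus one value per row and per column, and it counts elements rather than physical registers. The function names and the budget parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Rough per-thread demand for a TT0 x TT1 tile, counted in elements:
   one accumulator per output plus one operand per row and per column. */
int elements_per_thread(int tt0, int tt1) {
    assert(tt0 > 0 && tt1 > 0);
    return tt0 * tt1 + tt0 + tt1;
}

/* True when the estimate stays inside a caller-supplied budget. */
bool tile_fits(int tt0, int tt1, int budget) {
    return elements_per_thread(tt0, tt1) <= budget;
}
```

Under this model a 4 × 4 tile needs 24 elements, while an 8 × 8 tile needs 80, so doubling each edge more than triples the demand. The real limit depends on the hardware and on what else the kernel keeps live.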
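2. Vector outer product: reorder the loops and double memory-access efficiency
This is a pit I fell into myself. At first I used the conventional "row times column" inner product, which gives the most intuitive code: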
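```c
// Inner-product form (row × column): k is the innermost loop,
// so each C[m][n] is finished before moving on to the next element.
for (int m = 0; m < M; m++)
    for (int n = 0; n < N; n++)
        for (int k = 0; k < K; k++)
            C[m][n] += A[m][k] * B[k][n];
```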
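The problem is the compute-to-memory-access ratio. In the inner-product form, each innermost iteration performs 2 floating-point operations (one multiply, one add) but touches memory 4 times (load A, load B, load C, store C). Over the whole loop nest that is 2MNK floating-point operations against 4MNK memory accesses, a ratio of only 0.5. Every floating-point operation costs 2 memory accesses, so the kernel is entirely memory-bound.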
Switching to the outer-product formulation changes the picture:
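```c
// Outer-product form (column × row): k is the outermost loop,
// so each k adds one rank-1 update A[:, k] * B[k, :] to all of C.
for (int k = 0; k < K; k++)
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++)
            C[m][n] += A[m][k] * B[k][n];
```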
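Each k iteration loads (M + N) values, one column of A and one row of B, and performs M × N multiply-adds, so the ratio is about M×N / (M+N) multiply-adds per load, or 2MN / (M+N) floating-point operations per access. This assumes the part of C being accumulated stays in registers instead of going back to memory on every update. When M and N are both reasonably large, the ratio is far above 0.5, and the compute units are finally kept fed.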
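The M×N / (M+N) figure can be checked by counting. The following is a small sketch, assuming a 4 × 4 tile held in a local array that stands in for registers; the tile size, the `Counts` struct, and the function name are illustrative choices of mine.

```c
#include <assert.h>

enum { TM = 4, TN = 4 };   /* illustrative register-tile size (assumption) */

typedef struct { long loads; long macs; } Counts;

/* Computes a TM x TN tile of C = A * B as K rank-1 updates.
   A is TM x K, B is K x TN, C is TM x TN, all row-major.
   Returns how many operand loads and multiply-adds were performed. */
Counts outer_product_tile(const double *A, const double *B, double *C, int K) {
    double acc[TM][TN] = {{0.0}};   /* would live in registers on a device */
    Counts cnt = {0, 0};
    assert(K > 0);
    for (int k = 0; k < K; k++) {
        double a[TM], b[TN];
        for (int m = 0; m < TM; m++) { a[m] = A[m * K + k]; cnt.loads++; }
        for (int n = 0; n < TN; n++) { b[n] = B[k * TN + n]; cnt.loads++; }
        for (int m = 0; m < TM; m++)
            for (int n = 0; n < TN; n++) {
                acc[m][n] += a[m] * b[n];   /* one multiply-add */
                cnt.macs++;
            }
    }
    for (int m = 0; m < TM; m++)
        for (int n = 0; n < TN; n++)
            C[m * TN + n] = acc[m][n];
    return cnt;
}
```

With a 4 × 4 tile every k step costs 8 loads for 16 multiply-adds, which is the 4×4 / (4+4) = 2 ratio predicted by the formula. Larger tiles push the ratio higher until registers run out.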
[Figure: inner product vs. outer product. Inner product: one row of A × one column of B → one element of C. Outer product: one column of A × one row of B = a rank-1 matrix (one layer of C); accumulating K of them gives the complete C matrix.]
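Key insight: on GPUs and accelerator cards, compute throughput is often not the bottleneck; memory access is. The outer product turns "load once, compute once" into "load once, compute a whole layer of C", which greatly reduces memory-access pressure.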
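3. Data prefetching: run memory access and compute at the same time
Reordering loops alone is not enough. Memory latency on an accelerator card easily runs to several hundred cycles. If the compute units sit idle waiting for data, it is like a traffic jam: however good the engine, the car goes nowhere.
The core idea of data prefetching is simple: on top of loop unrolling, issue the memory instructions early. Concretely: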
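- Software prefetch issues the memory instructions for the next few loop iterations ahead of time, pulling the data into the L1 cache early
- By the time the current iteration needs that data it is already in the cache, so the hit rate rises sharply
- Memory access and compute overlap in time, which hides the memory latency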
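The same idea can be shown in plain C as a software-pipelined loop. This sketch is a CPU analogue, not accelerator code: instead of prefetching into L1 it loads the operands of the next iteration into extra local variables before finishing the current multiply-add. The function name is hypothetical, and a dot product is used only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with a one-iteration-deep software pipeline: the operands
   for iteration i+1 are loaded before the multiply-add for iteration i. */
double dot_prefetched(const double *x, const double *y, size_t n) {
    if (n == 0) return 0.0;
    double sum = 0.0;
    double cur_x = x[0], cur_y = y[0];      /* prologue: first loads */
    for (size_t i = 0; i + 1 < n; i++) {
        double next_x = x[i + 1];           /* issue the next loads early */
        double next_y = y[i + 1];
        sum += cur_x * cur_y;               /* compute can overlap the loads */
        cur_x = next_x;
        cur_y = next_y;
    }
    sum += cur_x * cur_y;                   /* epilogue: last element */
    return sum;
}
```

The two `next_*` variables are the price of the overlap: they hold data that is not needed yet, which is where the extra register use comes from.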
[Figure: timeline comparison. Without prefetch: load batch 1 → compute batch 1 → load batch 2 → compute batch 2. With prefetch: load batch 1 → compute batch 1 while batches 2 and 3 are prefetched into L1 → compute batch 2 (data already in L1).]
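The cost is that prefetching occupies some extra registers. This is a trade-off too: registers for time. It usually pays off, because registers on an accelerator card are relatively plentiful while memory latency is the real performance killer.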
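4. Shared memory: the "express lane" inside a thread block
4.1 Restructuring the data flow with LDS
The accelerator's memory hierarchy is a pyramid: global memory is the largest but slowest, the L1/L2 caches sit in the middle, and registers are the fastest but scarcest. Between them sits one more key layer: LDS (Local Data Share, i.e. shared memory), a fast, software-managed scratchpad for each thread block.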
[Figure: memory hierarchy: Global Memory (large, slow) → L2 Cache → L1 Cache → LDS shared memory (software-managed) → registers (fastest).]
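Originally every thread reads its data from global memory, and several threads may read the same block of data repeatedly, wasting bandwidth. The idea behind LDS is to let the threads in a thread block share data: a few threads first load it from device memory into LDS, and then all threads in the block read directly from LDS.
The typical data flow in GEMM: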
[Figure: A matrix (global memory) → preload → A tile (LDS); B matrix (global memory) → preload → B tile (LDS); both tiles feed the multiply-add computation.]
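The preload-then-compute flow can be emulated on a CPU to see where the reuse comes from. This is a sketch under stated assumptions: the local arrays `lds_a` and `lds_b` merely stand in for LDS, the loops over tiles stand in for thread blocks, and `TILE` and `gemm_staged` are names I made up for illustration.

```c
#include <assert.h>

enum { TILE = 4 };   /* illustrative tile edge, not a tuned value */

/* C = A * B for n x n row-major matrices, with n a multiple of TILE.
   lds_a and lds_b stand in for the LDS tiles shared by one thread block. */
void gemm_staged(const double *A, const double *B, double *C, int n) {
    assert(n > 0 && n % TILE == 0);
    for (int bi = 0; bi < n; bi += TILE) {
        for (int bj = 0; bj < n; bj += TILE) {   /* one "thread block" per C tile */
            double acc[TILE][TILE] = {{0.0}};
            for (int bk = 0; bk < n; bk += TILE) {
                double lds_a[TILE][TILE], lds_b[TILE][TILE];
                /* Preload: fetch each element from "global memory" once. */
                for (int i = 0; i < TILE; i++)
                    for (int k = 0; k < TILE; k++) {
                        lds_a[i][k] = A[(bi + i) * n + (bk + k)];
                        lds_b[i][k] = B[(bk + i) * n + (bj + k)];
                    }
                /* A GPU kernel would place a barrier here before reading. */
                for (int i = 0; i < TILE; i++)
                    for (int j = 0; j < TILE; j++)
                        for (int k = 0; k < TILE; k++)
                            acc[i][j] += lds_a[i][k] * lds_b[k][j];
            }
            for (int i = 0; i < TILE; i++)
                for (int j = 0; j < TILE; j++)
                    C[(bi + i) * n + (bj + j)] = acc[i][j];
        }
    }
}
```

Each staged element is read from the source matrix once per tile step and then reused TILE times from the local copy, which is the bandwidth saving the section describes.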
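4.2 Double buffering: getting rid of synchronization waits
Once LDS is in use, there are data dependencies between warps: threads can only read safely after all threads have finished writing their data to LDS. That requires a synchronization instruction, __syncthreads(), before and after each loop iteration, and these barriers stall the pipeline.
Double buffering solves this neatly: use two LDS regions and alternate reads and writes between them.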
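- While warps are reading data from Buffer A, another set of threads writes the next round of data into Buffer B
- Reads and writes are fully separated, which removes the barrier that used to precede each write
- Only the barrier after the reads is still needed; it also guarantees that the newly written buffer is complete before it is consumed in the next round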
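The cost is a second copy of the shared-memory space, which may reduce occupancy. In most cases, though, the gain from removing the synchronization wait far outweighs the loss in occupancy.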
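The buffer-swapping pattern itself is easy to sketch in sequential C. The example below is an emulation under assumptions of my own: two small arrays play the role of the two LDS regions, the computation is reduced to a dot product to stay short, and the barrier is only marked with a comment because a single CPU thread needs none. `TK`, `stage`, and `dot_double_buffered` are hypothetical names.

```c
#include <assert.h>

enum { TK = 4 };   /* elements per staged chunk (assumption) */

/* Copy one TK-wide chunk of x and y into a staging buffer (the "LDS write"). */
void stage(const double *x, const double *y, int chunk,
           double bx[TK], double by[TK]) {
    for (int i = 0; i < TK; i++) {
        bx[i] = x[chunk * TK + i];
        by[i] = y[chunk * TK + i];
    }
}

/* Dot product over chunks * TK elements using two staging buffers. */
double dot_double_buffered(const double *x, const double *y, int chunks) {
    assert(chunks > 0);
    double bx[2][TK], by[2][TK];
    double sum = 0.0;
    int cur = 0;
    stage(x, y, 0, bx[cur], by[cur]);              /* prologue fills buffer 0 */
    for (int c = 0; c < chunks; c++) {
        int nxt = 1 - cur;
        if (c + 1 < chunks)
            stage(x, y, c + 1, bx[nxt], by[nxt]);  /* write the idle buffer */
        for (int i = 0; i < TK; i++)               /* read the active buffer */
            sum += bx[cur][i] * by[cur][i];
        /* On a GPU, the single per-iteration barrier would sit here. */
        cur = nxt;                                 /* swap buffer roles */
    }
    return sum;
}
```

Because the write always targets the buffer that nobody is reading, nothing has to wait before the write starts; the only ordering point left is the one at the end of the iteration.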
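4.3 PAD padding: avoiding bank conflicts
LDS is divided into banks. If threads in the same warp access different addresses in the same bank at the same time, a bank conflict occurs and the accesses are serialized.
The PAD strategy inserts extra padding into the data layout so that the conflicting threads land in different banks.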
[Figure: without PAD, thread 0 and thread 1 both access Bank 0 (conflict); with PAD, thread 0 accesses Bank 0 and thread 1 accesses Bank 0 + PAD (no conflict).]
In short, it is "trading space for parallelism": spend a little more LDS space and get conflict-free parallel access in return.
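A few lines of C are enough to see why one extra word per row helps. This sketch assumes 32 banks of 4-byte words and 32 threads accessing together; those numbers are illustrative and vary by hardware. It models every thread t reading the first element of row t, which is the column-wise access pattern that typically triggers conflicts.

```c
#include <assert.h>

enum { NUM_BANKS = 32, LANES = 32 };   /* assumed bank and thread counts */

/* Worst-case number of threads hitting one bank when thread t reads
   element (row t, column 0) of a tile whose rows are row_stride words apart. */
int max_bank_conflict(int row_stride) {
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    assert(row_stride > 0);
    for (int t = 0; t < LANES; t++) {
        int word = t * row_stride;      /* word address of element (t, 0) */
        int bank = word % NUM_BANKS;
        hits[bank]++;
        if (hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

With a row stride of 32 words all 32 threads land in bank 0. Padding each row to 33 words spreads them over 32 different banks, at the cost of one unused word per row.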
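5. TensorCore instructions: hardware units built specifically for matrix multiplication
The accelerator includes dedicated tensor compute units (Tensor Cores) that break large matrix operations into small blocks and execute them in parallel. At the hardware level they are driven by the MMAC instruction family.
Taking FP64 as an example, the v_mmac_16x16x4_f64 instruction operates at warp granularity and performs an M×N×K = 16×16×4 matrix operation, completing 16 × 16 × 4 = 1024 multiply-add operations in a single instruction cycle.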
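To pin down what a 16×16×4 block operation computes, here is a scalar reference in C. It is only my reading of the shape given above, D += A (16×4) × B (4×16); the function name and the row-major layouts are assumptions, and the sketch says nothing about how the real instruction distributes operands across the threads of a warp.

```c
#include <assert.h>

enum { MM = 16, MN = 16, MK = 4 };   /* the M, N, K of the block shape */

/* Scalar reference for one 16x16x4 block multiply-accumulate.
   A is MM x MK, B is MK x MN, D is MM x MN, all row-major.
   Returns the number of multiply-adds performed. */
int mmac_16x16x4_ref(const double *A, const double *B, double *D) {
    int macs = 0;
    for (int m = 0; m < MM; m++)
        for (int n = 0; n < MN; n++)
            for (int k = 0; k < MK; k++) {
                D[m * MN + n] += A[m * MK + k] * B[k * MN + n];
                macs++;
            }
    return macs;
}
```

The returned count is 1024, matching the figure in the text: the hardware instruction replaces this entire triple loop.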
[Figure: Tensor Core computation. Matrix A (16×4) and matrix B (4×16) feed v_mmac_16x16x4_f64, which produces matrix C (16×16) in a single instruction.]
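Key point: TensorCore does not replace the CU; the two work together. Ordinary scalar and vector computation runs on the CU, while matrix-multiply-heavy computation runs on TensorCore, each doing what it does best.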
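To make the 16×16×4 arithmetic concrete, here is a minimal Python sketch that emulates the math of one such matrix multiply-accumulate and counts the operations. The function name `mmac_16x16x4` is a hypothetical stand-in; the sketch models only the arithmetic, not how operands are laid out across the registers of a warp, which the text above does not specify.

```python
M, N, K = 16, 16, 4  # shape handled by one instruction, as stated in the text


def mmac_16x16x4(a, b, c):
    """Emulate C (16x16) += A (16x4) @ B (4x16); return the number of MACs."""
    macs = 0
    for i in range(M):
        for j in range(N):
            for k in range(K):
                c[i][j] += a[i][k] * b[k][j]
                macs += 1
    return macs


a = [[1.0] * K for _ in range(M)]
b = [[2.0] * N for _ in range(K)]
c = [[0.0] * N for _ in range(M)]
mac_count = mmac_16x16x4(a, b, c)
print(mac_count)  # 16 * 16 * 4 multiply-accumulates
```

A larger GEMM tile is then covered by issuing this operation repeatedly along K and across the M×N sub-tiles.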
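6. Convolution Operator Optimization: The Most Labor-Intensive Part
Convolution is the core operation of CNNs and the main battleground for performance optimization. We applied four optimizations:
6.1 BlockSwizzle: Reordering Thread-Block Scheduling
The conventional GEMM mapping unfolds the convolution input into a 2D matrix and then computes it with the matrix-multiply scheme described earlier. After unfolding, however, the data accessed by adjacent thread blocks is not adjacent in Global Memory, which lowers the L2 Cache hit rate and increases device-memory traffic.
The idea behind BlockSwizzle is to reorder the execution order of thread blocks so that the data handled by adjacent thread blocks is also adjacent in device memory.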
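The reordering can be sketched as a remapping from a linear block id to tile coordinates. The exact mapping used in our kernels is not reproduced here; the sketch below uses a common grouped ordering as an assumption, and `naive_map`, `swizzled_map`, `group_m`, and `tiles_touched` are illustrative names only. It compares how many distinct input tiles a window of 16 concurrently running blocks touches under each ordering.

```python
def naive_map(pid, num_m, num_n):
    """Row-major order: consecutive block ids walk along the N dimension."""
    return pid // num_n, pid % num_n


def swizzled_map(pid, num_m, num_n, group_m):
    """Grouped order: consecutive ids stay inside a band of group_m rows."""
    ids_per_group = group_m * num_n
    first_m = (pid // ids_per_group) * group_m
    rows = min(group_m, num_m - first_m)  # the last band may be shorter
    local = pid % ids_per_group
    return first_m + local % rows, local // rows


def tiles_touched(coords):
    """Distinct A row-tiles plus distinct B column-tiles needed by these blocks."""
    return len({m for m, _ in coords}) + len({n for _, n in coords})


num_m, num_n, window = 8, 8, 16
naive = [naive_map(p, num_m, num_n) for p in range(window)]
swz = [swizzled_map(p, num_m, num_n, group_m=4) for p in range(window)]
print(tiles_touched(naive), tiles_touched(swz))
```

Fewer distinct tiles per window means more of the loads issued by neighboring blocks can be served from L2 rather than from device memory.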
[Figure: Regular mapping (Block1 to Block6) compared with BlockSwizzle (Block1 to Block4, with data reuse between adjacent blocks).]
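Measured results:
| Metric | Change |
|---|---|
| L2 Cache hit rate | Up by about 9% |
| Kernel execution time | Down to about 78% of the original |
| Compute unit load | Down by about 8% |
Validated across a large set of convolution parameters, the overall performance gain is about 19%. The more Blocks there are along the M dimension, the more data is reused and the larger the benefit.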
[Figure: BlockSwizzle performance comparison, regular mapping vs. BlockSwizzle. Y-axis: relative performance (%).]
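6.2 Precomputation: Turning Boundary Checks into Table Lookups
One detail in convolution is easy to overlook: as the input tensor is traversed along the RSC dimensions of the filter, the address-offset increments are constant for a fixed problem size. The NPQ and RSC index ranges assigned to each thread block are fixed as well.
The precomputation optimization computes the offset increments and boundary masks during initialization and stores them in tables. The main loop no longer performs complex boundary calculations; a table lookup plus an addition is enough.
This has two benefits:
- The offset table removes arithmetic instructions: the main loop is left with only Loads and additions, which is easier for the compiler to optimize and keeps the instruction pipeline flowing
- The mask table reduces branches and pipeline stalls, making control flow more stable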
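The two tables can be illustrated with a small single-image sketch. It assumes an HWC input layout with a flat index of `(h * W + w) * C + c`; the function names and the bitmask encoding (one bit per filter tap) are illustrative choices, not the layout of the actual kernel.

```python
def build_offset_deltas(W, C, R, S):
    """Address step between consecutive (r, s, c) positions; fixed per problem size."""
    offsets = [(r * W + s) * C + c
               for r in range(R) for s in range(S) for c in range(C)]
    return [nxt - cur for cur, nxt in zip(offsets, offsets[1:])]


def build_mask(H, W, R, S, p, q, stride, pad):
    """Bit (r * S + s) is set when filter tap (r, s) lands inside the input."""
    mask = 0
    for r in range(R):
        for s in range(S):
            h, w = p * stride - pad + r, q * stride - pad + s
            if 0 <= h < H and 0 <= w < W:
                mask |= 1 << (r * S + s)
    return mask


def conv_point(x, wgt, W, C, R, S, p, q, stride, pad, deltas, mask):
    """Main loop: only a mask-bit test, a load, and an address addition."""
    addr = ((p * stride - pad) * W + (q * stride - pad)) * C
    acc, k = 0.0, 0
    for tap in range(R * S):
        valid = (mask >> tap) & 1  # on a GPU this would be a predicate
        for _ in range(C):
            if valid:
                acc += x[addr] * wgt[k]
            if k < len(deltas):
                addr += deltas[k]
            k += 1
    return acc


H = W = 5; C = 2; R = S = 3
x = [float(i) for i in range(H * W * C)]
wgt = [float(i % 3) for i in range(R * S * C)]
deltas = build_offset_deltas(W, C, R, S)
corner = conv_point(x, wgt, W, C, R, S, 0, 0, 1, 1, deltas,
                    build_mask(H, W, R, S, 0, 0, 1, 1))
```

Both tables depend only on the problem size and the output position, so they can be filled once before the main loop and shared across its iterations.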
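6.3 Handling Bank Conflicts: Swizzle Reordering
When reading from shared memory, different threads of the same warp that access different addresses in the same Bank are serialized.
Our approach is to cut each data block into 8×8 Tiles and apply a Swizzle reordering to every Tile, so that the threads in one transaction can access all 32 Banks in parallel without conflicts.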
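The effect of such a reordering can be checked with a small bank model. The sketch below assumes 32 banks of 4-byte words and an 8×8 Tile whose elements are 16-byte chunks (4 words each), and it uses the widely used XOR permutation `col ^ row`; the permutation in the actual kernel may differ, and all names here are illustrative.

```python
NUM_BANKS = 32        # 4-byte banks, as in the text
WORDS_PER_CHUNK = 4   # assumption: each tile element is a 16-byte chunk
TILE = 8              # 8x8 tile of chunks


def chunk_banks(row, col, swizzle):
    """Banks covered by chunk (row, col) of a row-major 8x8 tile."""
    phys_col = col ^ row if swizzle else col  # XOR swizzle within the row
    first_word = (row * TILE + phys_col) * WORDS_PER_CHUNK
    return [(first_word + i) % NUM_BANKS for i in range(WORDS_PER_CHUNK)]


def max_conflict(col, swizzle):
    """Worst-case hits on one bank when 8 threads read column `col`, one row each."""
    hits = [0] * NUM_BANKS
    for row in range(TILE):
        for bank in chunk_banks(row, col, swizzle):
            hits[bank] += 1
    return max(hits)


print([max_conflict(c, swizzle=False) for c in range(TILE)])
print([max_conflict(c, swizzle=True) for c in range(TILE)])
```

Because XOR with a fixed row index is a permutation of the column indices, every chunk still has a unique slot in its row; only the column-wise access pattern changes.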
[Figure: Before reordering, thread 0 and thread 1 both hit Bank0 (conflict). After Swizzle reordering within Tile0, thread 0 maps to Bank0 and thread 1 maps to Bank1 (no conflict).]
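6.4 Special-Case Convolution, 3×3 R1S1: A Dedicated Kernel for a Dedicated Case
As mentioned earlier, the IGEMM (Implicit GEMM) scheme is highly general, but for certain parameter combinations, especially a 3×3 filter with stride 1, it suffers from severe redundant memory access.
For example, computing a 5×5 input with a 3×3 filter at stride 1 theoretically needs only 5×5 = 25 input values to produce the 3×3 output. In the IGEMM scheme, however, unfolding the convolution into a matrix multiply results in 9×3×3 = 81 global memory accesses, more than 3× the theoretical figure.
The gap grows with the filter size and varies with the stride. When device-memory bandwidth is the limit, this redundant traffic becomes the performance bottleneck.
Our core idea comes down to two points:
- One-time transfer: load all input data needed by an output block from Global Memory into LDS once, with no duplicates
- Two-level data reuse:
  - LDS level: read the data for every sliding position of the filter from LDS, avoiding repeated global memory accesses
  - VGPR level: when computing different filter positions, keep the overlapping input data in vector registers to further reduce LDS accesses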
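The 25-versus-81 count and the two reuse levels can be reproduced with a toy memory model. This is a simplified sketch under stated assumptions: `CountingMemory` is a hypothetical helper that counts reads, the "IGEMM-style" path ignores any caching, and the register level is modeled as one row band held in a Python list.

```python
class CountingMemory:
    """Flat memory that counts how many element reads it serves."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0

    def load(self, i):
        self.reads += 1
        return self.data[i]


def conv_igemm_style(gmem, w, H, W, R, S):
    """Every output re-reads its R*S inputs from global memory."""
    out = []
    for p in range(H - R + 1):
        for q in range(W - S + 1):
            out.append(sum(gmem.load((p + r) * W + q + s) * w[r * S + s]
                           for r in range(R) for s in range(S)))
    return out


def conv_staged(gmem, w, H, W, R, S):
    """Stage the input into LDS once, then reuse a row band held in registers."""
    lds = CountingMemory(gmem.load(i) for i in range(H * W))  # one-time transfer
    out = []
    for p in range(H - R + 1):
        # Register level: rows p..p+R-1 are read from LDS once per output row.
        regs = [[lds.load((p + r) * W + col) for col in range(W)] for r in range(R)]
        for q in range(W - S + 1):
            out.append(sum(regs[r][q + s] * w[r * S + s]
                           for r in range(R) for s in range(S)))
    return out, lds.reads


x, w = list(range(25)), [1] * 9
g1, g2 = CountingMemory(x), CountingMemory(x)
out_igemm = conv_igemm_style(g1, w, 5, 5, 3, 3)
out_staged, lds_reads = conv_staged(g2, w, 5, 5, 3, 3)
print(g1.reads, g2.reads, lds_reads)
```

In this toy case the staged version needs 45 LDS reads, where reading every tap from LDS without register reuse would need 81.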
[Figure: Data flow for the 3×3 case. Global Memory (input feature map) → LDS via a one-time, duplicate-free transfer → VGPR with the RS loop unrolled (all 9 sliding positions read from LDS), keeping overlapping data for direct reuse → compute output.]
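7. DOT, the Operator Generator: No More Hand-Written Kernels
As the number of operator variants keeps growing (different data types such as FP16/FP32/FP64, different filter sizes, strides, and channel counts), hand-writing every Kernel becomes completely unsustainable.
DOT is an operator code generator with a layered architecture: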
[Figure: DOT layered architecture]
- Calling layer: ConvDriver, MhaDriver
- Device layer: ConvBuilder, MhaBuilder, UniversalAdapter
- Builder layer: MainLoopBuilder, PostOpBuilder, DirectConvBuilder
- Component layer: Swizzle, ThreadLayout, Coord, MacComputer, LdsReorder, Fragment, MhaBody
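The workflow is straightforward: the user defines the problem specification (data type, convolution parameters, available devices, and so on), and DOT automatically enumerates candidates in the search space, evaluates the performance of each Kernel, selects the best one, and generates the code.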
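The enumerate, evaluate, select, generate loop can be sketched in a few lines. This is not DOT's API: every name below (`enumerate_candidates`, `search`, `emit_kernel_name`, the config keys, the LDS budget check) is hypothetical, and the `evaluate` callback stands in for whatever measurement the real tool performs, such as timing a compiled kernel.

```python
from itertools import product


def enumerate_candidates(block_ms, block_ns, warp_counts,
                         lds_limit_bytes, elem_bytes, block_k=32):
    """Yield tile configurations whose A and B tiles fit in the LDS budget."""
    for bm, bn, warps in product(block_ms, block_ns, warp_counts):
        lds_bytes = (bm + bn) * block_k * elem_bytes
        if lds_bytes <= lds_limit_bytes:
            yield {"block_m": bm, "block_n": bn, "warps": warps}


def search(problem, candidates, evaluate):
    """Return the candidate with the lowest cost reported by `evaluate`."""
    return min(candidates, key=lambda cfg: evaluate(problem, cfg))


def emit_kernel_name(problem, cfg):
    """Stand-in for code generation: derive a kernel identifier."""
    return "conv_{}_bm{}_bn{}_w{}".format(
        problem["dtype"], cfg["block_m"], cfg["block_n"], cfg["warps"])


problem = {"dtype": "fp16"}
candidates = list(enumerate_candidates(
    (32, 64, 128), (64, 128), (2, 4), lds_limit_bytes=12288, elem_bytes=2))


def toy_cost(problem, cfg):
    # Toy cost model for illustration only; a real run would benchmark kernels.
    return abs(cfg["block_m"] - 64) + abs(cfg["block_n"] - 128) + cfg["warps"]


best = search(problem, candidates, toy_cost)
print(emit_kernel_name(problem, best))
```

Pruning infeasible configurations before evaluation keeps the expensive step, measuring each candidate, limited to configurations that could actually run.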
[Figure: DOT workflow. User defines the problem → enumerate the search space → evaluate candidate Kernels → select the best configuration → generate code.]
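All generated operators are registered under the generator namespace and serve as template parameters of the Builders. The template parameters cover Block/Warp Shape, CodeGen driver configuration, and more.
In one sentence: DOT is essentially an "operator search engine". It hands the tuning work to the tool, and people only need to define the constraints and the objective.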
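8. Summary
The optimization path for the whole heterogeneous math library and AI operator development can be summarized in four layers:
| Layer | Optimization techniques | Core idea |
|---|---|---|
| Algorithm | Vector outer product, tiling strategy | Raise the compute-to-memory-access ratio and reduce redundant memory access |
| Memory access | Data prefetching, LDS restructuring, double buffering, PAD padding | Hide memory latency, eliminate Bank conflicts, reduce synchronization waits |
| Hardware | TensorCore instructions, BlockSwizzle | Make full use of dedicated compute units and raise the Cache hit rate |
| Engineering | Operator generator DOT | Automatically search for the best Kernel configuration instead of tuning by hand |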
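Every layer follows the same principle: never let the compute units sit idle. Compute as soon as data arrives, and by the time that finishes the next batch of data is already on its way. The essence of GPU/accelerator programming is managing the "data pipeline" that runs from device memory to the compute units.
The biggest lesson from this journey: there is no silver bullet for performance optimization, but there is a combination key. A single technique may yield only a few percentage points, yet when several are stacked, quantitative change turns into qualitative change.
This article is compiled from internal technical training material; specific performance figures and product codenames have been anonymized.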