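Pitfall Log for a Chinese-Made GPGPU: From Inter-Card Topology to Software Stack, How the New-Generation Accelerator Really Performs
Anyone who trains large models knows that when inter-card bandwidth falls short, multi-card parallelism turns into a disaster. I recently got my hands on the new-generation product of a Chinese-made GPGPU accelerator. I ran it end to end through interconnect topology, training and inference performance, and the software ecosystem, and this post is the write-up.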
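1. Start with the Hardware
The new-generation card brings several key upgrades:
| Dimension | Previous generation | New generation |
|---|---|---|
| FP64 | Not supported | Supported (the only Chinese-made card that does) |
| Memory capacity | 64GB | 64GB |
| Memory bandwidth | ~900GB/s | ~1.8TB/s |
| CPU-GPU interconnect | PCIe 4.0 x16 | PCIe 5.0 x16 |
| GPU-GPU interconnect | PCIe 4.0 | Proprietary high-speed link |
| Power | 350W | 800W |
| Form factor | PCIe full-height, full-length | PCIe & OAM |
The gap between the two generations is large. The previous generation topped out at FP32, which essentially ruled it out for scientific computing. The new generation adds FP64 and doubles memory bandwidth.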
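8-card server node specs (built on the new-generation card):
- CPU: 2 × in-house x86 processors
- GPU: 8 × new-generation accelerator cards
- Total memory: 512GB
- Aggregate memory bandwidth: ~14.4 TB/s
- Card-to-card interconnect: all-to-all topology through in-house switch chips
- GPU-GPU P2P bandwidth per card: ~448 GB/s (7 links × 64 GB/s, bidirectional)
- Total intra-node bandwidth: ~3584 GB/s
- Bisection bandwidth: ~1792 GB/s
- Inter-node interconnect: 8 × 200G IB / RoCE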
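The node-level totals follow directly from the per-card numbers. Here is a quick sanity check in Python. It assumes the bisection cut splits the 8 cards 4/4, and that the switch fabric is non-blocking, so every card in one half can drive its full 448 GB/s across the cut. The `CardSpec` and `node_totals` names exist only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardSpec:
    mem_gb: int          # memory capacity per card, GB
    mem_bw_tbs: float    # memory bandwidth per card, TB/s
    p2p_bw_gbs: int      # total bidirectional GPU-GPU bandwidth per card, GB/s


def node_totals(card: CardSpec, n_cards: int) -> dict:
    """Aggregate per-card specs into node-level figures."""
    return {
        "mem_gb": n_cards * card.mem_gb,
        "mem_bw_tbs": round(n_cards * card.mem_bw_tbs, 2),
        "total_link_bw_gbs": n_cards * card.p2p_bw_gbs,
        # Even 4/4 split; assumes a non-blocking switch fabric, so each card in
        # one half can push its full per-card bandwidth across the cut.
        "bisection_bw_gbs": (n_cards // 2) * card.p2p_bw_gbs,
    }


new_gen = CardSpec(mem_gb=64, mem_bw_tbs=1.8, p2p_bw_gbs=448)
print(node_totals(new_gen, 8))
# {'mem_gb': 512, 'mem_bw_tbs': 14.4, 'total_link_bw_gbs': 3584, 'bisection_bw_gbs': 1792}
```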
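2. Interconnect Topology: Fullmesh vs. Switch Clos, How Big Is the Gap?
2.1 The Awkward Spot of an 8-Card Fullmesh
Many industry designs use an 8-card Fullmesh, where every card is wired directly to every other card. The upside is simplicity. The downside is that any given pair of cards shares exactly one P2P link.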
[Figure: 8-card Fullmesh topology, GPU0 to GPU7]
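A single link typically delivers 50-60 GB/s, and that is all the bidirectional P2P bandwidth any card pair gets. An 8-card full mesh has C(8,2) = 28 links, so total link bandwidth is roughly 1400-1700 GB/s (about 1568 GB/s at 56 GB/s per link). If you split the node into two halves of 4 cards, 4 × 4 = 16 links cross the cut. That leaves bisection bandwidth at only about 800-960 GB/s (about 896 GB/s at 56 GB/s per link).
In large-model AllReduce with the Ring algorithm, busBW runs straight into this bisection ceiling.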
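The Fullmesh numbers come from simple counting. Below is a minimal sketch. It assumes one link per card pair and takes 56 GB/s as a representative per-link figure from the 50-60 GB/s range. `fullmesh_bw` is an illustrative helper, not a vendor tool.

```python
from math import comb


def fullmesh_bw(n_cards: int, link_bw_gbs: int) -> dict:
    """Link counts and bandwidths for a full mesh with one link per card pair."""
    links = comb(n_cards, 2)               # every pair of cards gets one link
    half = n_cards // 2
    cut_links = half * (n_cards - half)    # links crossing an even split
    return {
        "links": links,
        "total_link_bw_gbs": links * link_bw_gbs,
        "bisection_bw_gbs": cut_links * link_bw_gbs,
        "pair_bw_gbs": link_bw_gbs,        # a single pair is limited to its one link
    }


print(fullmesh_bw(8, 56))
# {'links': 28, 'total_link_bw_gbs': 1568, 'bisection_bw_gbs': 896, 'pair_bw_gbs': 56}
```

Plugging in 50 and 60 GB/s gives the range endpoints: 1400/800 GB/s and 1680/960 GB/s for total and bisection bandwidth.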
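2.2 How Switch Clos Breaks the Deadlock
The new-generation design skips the Fullmesh. Instead it places in-house switch chips between the 8 cards and connects everything through a Clos fabric: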
[Figure: Switch Clos all-to-all topology, GPU0 to GPU7 connected through switch chips 1 to 7]
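Each card attaches 7 links to the switch chips, and each link carries 64 GB/s bidirectional. Per-card P2P bandwidth is therefore 7 × 64 = 448 GB/s. Aggregated across the 7 switch chips:
| Metric | Fullmesh | Switch Clos | Gain |
|---|---|---|---|
| P2P links per card pair | 1 | 7 | 7× |
| P2P bidirectional bandwidth | ~56GB/s | 448GB/s | 8× |
| Total link bandwidth | ~1568GB/s | 3584GB/s | 2.3× |
| Bisection bandwidth | ~896GB/s | 1792GB/s | 2× |
| AllReduce busBW | ~112GB/s | 224GB/s | 2× |
In practice, this means AllReduce traffic is no longer capped by bisection bandwidth. The Ring algorithm can drive all 7 rings concurrently within the 8 cards.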
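Two entries in this table are easy to misread.

- **The 8× P2P gain.** It is 8× rather than 7× because the per-link rate also rises, from ~56 GB/s to 64 GB/s (7 × 64 / 56 = 8).
- **The busBW row.** It matches the common rule of thumb that ring AllReduce busBW approaches each card's unidirectional bandwidth, which is half of the 448 GB/s bidirectional figure. Treat that as an assumption of this sketch, not a vendor formula.

The Fullmesh busBW value is copied from the table rather than derived.

```python
N_CARDS = 8
LINKS_PER_CARD = 7
LINK_BW_CLOS = 64        # GB/s, bidirectional, per link to a switch chip
LINK_BW_FULLMESH = 56    # GB/s, representative value from the 50-60 GB/s range

per_card_clos = LINKS_PER_CARD * LINK_BW_CLOS   # 448 GB/s

fullmesh = {
    "p2p_links": 1,
    "p2p_bw": LINK_BW_FULLMESH,
    "total_link_bw": 28 * LINK_BW_FULLMESH,     # C(8, 2) = 28 links
    "bisection_bw": 16 * LINK_BW_FULLMESH,      # 4 x 4 links cross an even split
    "allreduce_busbw": 112,                     # copied from the table, not derived
}
clos = {
    "p2p_links": LINKS_PER_CARD,
    "p2p_bw": per_card_clos,
    "total_link_bw": N_CARDS * per_card_clos,
    "bisection_bw": (N_CARDS // 2) * per_card_clos,
    # Assumption: ring busBW ~ per-card unidirectional bandwidth (half of bidirectional)
    "allreduce_busbw": per_card_clos // 2,
}


def improvement(old: dict, new: dict) -> dict:
    """Ratio new/old for every metric, rounded to one decimal place."""
    return {k: round(new[k] / old[k], 1) for k in old}


print(improvement(fullmesh, clos))
# {'p2p_links': 7.0, 'p2p_bw': 8.0, 'total_link_bw': 2.3, 'bisection_bw': 2.0, 'allreduce_busbw': 2.0}
```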
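2.3 Measured Comparison under Tensor Parallelism for Large Models
Tensor Parallelism (TP) is the most communication-hungry parallel strategy in large-model training. At TP=8, every group of 8 cards runs AllReduce.
| TP size | Metric | Fullmesh-style designs | Switch Clos |
|---|---|---|---|
| TP=2 | Rings | 1 | 7 |
| TP=2 | AR BusBW | ~28-56 GB/s | 224 GB/s |
| TP=4 | Rings | 1-3 | 7 |
| TP=4 | AR BusBW | ~28-56 GB/s | 224 GB/s |
| TP=4 | AR AlgBW | ~18-37 GB/s | ~149 GB/s |
| TP=8 | Rings | 1-3 | 7 |
| TP=8 | AR BusBW | ~28-168 GB/s | 224 GB/s |
| TP=8 | AR AlgBW | ~16-96 GB/s | 128 GB/s |
With 7 rings against 1, the algorithm-bandwidth gap at TP=8 grows to several times. For models with hundreds of billions of parameters, that gap decides whether training runs at all.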
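The AlgBW rows follow from the BusBW rows through the AllReduce correction factor used by nccl-tests: busBW = algBW × 2(n-1)/n, where n is the number of ranks (the TP size here). Here is a small converter; the function name is illustrative.

```python
def allreduce_algbw(busbw_gbs: float, n_ranks: int) -> float:
    """Convert AllReduce bus bandwidth to algorithm bandwidth.

    nccl-tests defines busBW = algBW * 2(n-1)/n for AllReduce,
    so algBW = busBW * n / (2(n-1)).
    """
    if n_ranks < 2:
        raise ValueError("AllReduce needs at least 2 ranks")
    return busbw_gbs * n_ranks / (2 * (n_ranks - 1))


for tp in (2, 4, 8):
    print(f"TP={tp}: algBW = {allreduce_algbw(224, tp):.1f} GB/s")
# TP=2: algBW = 224.0 GB/s
# TP=4: algBW = 149.3 GB/s
# TP=8: algBW = 128.0 GB/s
```

At TP=2 the factor is exactly 1, so AlgBW equals BusBW. At TP=8, a busBW of 28 GB/s corresponds to an algBW of 16 GB/s.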
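3. Multi-Node Scale-Out: Why PXN Matters
However strong the intra-node interconnect is, cross-node traffic still goes over IB/RoCE. There is an easy-to-miss pitfall here: PCIe uplink bandwidth.
If the scale-out path runs PCIe → PCIe Switch → NIC, the unidirectional bandwidth of PCIe 5.0 x16 becomes a hard cap. Multi-node AllReduce busBW gets stuck at ~64 GB/s.
The new-generation design supports PXN, a name borrowed from NCCL, where it stands for "PCI × NVLink". With PXN, cross-node AllReduce traffic bypasses the PCIe Switch. It travels from the card over its direct links to the switch chips and then to the NIC: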
[Figure: PXN cross-node path. Node 1 and Node 2 each contain Accelerator 1, Accelerator 2, a switch-chip group, NIC 1 and NIC 2; the two nodes connect through an IB/RoCE switch]
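The key point of PXN is that the multi-node AllReduce bottleneck moves. It shifts from the PCIe uplink rate to the card-to-switch-chip rate, which raises the theoretical ceiling from ~64 GB/s to ~224 GB/s.
| Scale | Without PXN (theoretical ceiling) | With PXN (theoretical ceiling) |
|---|---|---|
| 2 nodes, 16 cards | ~64 GB/s BusBW | ~224 GB/s BusBW |
| 4 nodes, 32 cards | ~64 GB/s BusBW | ~224 GB/s BusBW |
| 128 nodes, 1024 cards | ~64 GB/s BusBW | ~224 GB/s BusBW |
At multi-node scale this gap is decisive. Without PXN, the AllReduce ceiling stays at ~64 GB/s no matter how many nodes you add.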
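4. Training Performance: A Lap with Real Models
4.1 Cross-Generation Comparison
For this test I kept the model, node and hyperparameters identical and ran full-parameter training. The chart compares the new generation's single-node throughput (TGS) against the previous generation: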
[Chart: Training throughput, new vs. previous generation (new generation = 100% baseline); categories: LLaMA-class, MoE-class, text-to-image, text-to-video; y-axis: relative performance (%)]
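Across model types, the new generation's training throughput generally reaches 300%+ of the previous generation's, a substantial generational gain.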
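TGS is usually read as tokens per GPU per second, which normalizes throughput by card count. That reading is an assumption here, since the post does not define it.

Note also that the chart and the text use different baselines. The chart treats the new generation as 100%, while the text quotes the previous generation as the baseline. Converting between the two is a reciprocal: the previous generation sitting at 33% or below on the chart says the same thing as the new one reaching 300%+.

The numbers in the example are made up for illustration.

```python
def tgs(tokens: int, n_gpus: int, seconds: float) -> float:
    """Tokens per GPU per second (assumed meaning of TGS)."""
    return tokens / (n_gpus * seconds)


def flip_baseline(pct: float) -> float:
    """Turn 'A is pct% of B' into 'B is x% of A'."""
    return 100.0 * 100.0 / pct


# Hypothetical run: 8 cards, 4,096,000 tokens in 100 s.
print(tgs(4_096_000, 8, 100.0))   # 5120.0
print(flip_baseline(25.0))        # 400.0
```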
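4.2 Comparison with a Mainstream International Product
Against a mainstream international training card (same parallel strategy, same precision):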
[Chart: Performance relative to the mainstream international product (that product = 100%); models: LLaMA-7B, LLaMA-13B, LLaMA-70B, MoE-8x7B, MoE-32B; y-axis: performance ratio (%)]
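Bottom line: the card reaches roughly 95%-105% of the mainstream international product. At matching precision, it is on par.
Against another international flagship, MoE models land at roughly 75%-85% and dense models at 65%-105%. Optimization is still ongoing.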
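5. Inference Performance: MoE and Dense Separately
Inference splits into two scenarios:
5.1 Large-Model Inference
The test conditions were DeepSeek 671B in BF16, with TPOT ≤ 100ms and TTFT ≤ 30s. Concurrent request capacity of the new-generation product under those limits:
- MoE models: 150%-300% of the previous-generation domestic solution
- Dense models: 350%-500% of the previous-generation domestic solution
Against a mainstream international inference card as the baseline, MoE models reach about 75%-85% and dense models about 65%-105%.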
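Concurrency measured "under TPOT ≤ 100 ms and TTFT ≤ 30 s" is typically found with a sweep. You raise the number of concurrent requests step by step and keep the largest level at which both latency limits still hold.

Below is a minimal sketch of that selection step. The sweep values are invented for illustration. The post does not say whether the limits apply to mean, median or p99 latency, so the sketch leaves that choice to whoever fills in the samples.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    concurrency: int   # number of concurrent requests in this sweep step
    ttft_s: float      # time to first token, seconds
    tpot_ms: float     # time per output token, milliseconds


def max_concurrency(samples, ttft_limit_s=30.0, tpot_limit_ms=100.0) -> int:
    """Largest concurrency whose measured latencies satisfy both limits (0 if none)."""
    ok = [s.concurrency for s in samples
          if s.ttft_s <= ttft_limit_s and s.tpot_ms <= tpot_limit_ms]
    return max(ok, default=0)


# Hypothetical sweep results, for illustration only (not measured data).
sweep = [
    Sample(16, 2.1, 45.0),
    Sample(32, 4.8, 62.0),
    Sample(64, 11.5, 88.0),
    Sample(128, 27.0, 131.0),
]
print(max_concurrency(sweep))  # 64: at 128 requests TPOT exceeds 100 ms
```

Comparing two systems then comes down to the ratio of their `max_concurrency` values under the same limits.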
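5.2 Inference Latency for Small and Mid-Size Models
With 512 input tokens and 1024 output tokens, the new-generation card's Prefill/Decode performance relative to a mainstream international inference card: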
[Chart: Prefill performance ratio for small and mid-size models (competitor = 100%); models: 1.5B, 7B, 8B, 14B, 32B, 70B; y-axis: performance ratio (%)]
The Prefill advantage is most pronounced on small models and narrows toward parity as model size grows. Decode follows a similar trend.
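For a 512-in / 1024-out workload, Decode usually dominates wall-clock time, because every output token after the first costs one TPOT. Here is a rough model with hypothetical latency numbers. It ignores queueing and batching effects.

```python
def e2e_latency_s(ttft_s: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate request latency: first token at TTFT, then one TPOT per extra token."""
    return ttft_s + (output_tokens - 1) * tpot_ms / 1000.0


# Hypothetical: 0.2 s TTFT, 20 ms TPOT, 1024 output tokens.
print(round(e2e_latency_s(0.2, 20.0, 1024), 3))  # 20.66
```

With these numbers, Prefill (the TTFT term) is under 1% of the total. For end-to-end latency in this scenario, the Decode ratio therefore weighs far more than the Prefill ratio.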
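6. Software Stack: From DTK to OpenDAS
However strong the hardware is, it is wasted if the software falls short. This solution's software stack comes in three tiers, DTK, DAS and OpenDAS, layered as follows: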
[Figure: software stack, from hardware up]
- Hardware: accelerator card
- Runtime: HIP Runtime, ROCk Kernel Driver
- Base library layer (DTK): BLAS (cuBLAS-compatible), FFT (cuFFT-compatible), RNG (cuRAND-compatible), Sparse (cuSPARSE-compatible), DNN (cuDNN-compatible), NCCL/RCCL-compatible, Thrust-compatible, CUB-compatible
- Compilation/optimization layer: AI compilers (Triton, XLA, TVM, AITemplate)
- Framework layer: PyTorch, TensorFlow, PaddlePaddle, Jittor, DeepSpeed, vLLM, LMDeploy, Megatron-LM, Colossal-AI, SGLang
- Application layer: model zoo, developer community, OpenDAS open-source suite, DAS platform
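6.1 DTK: The CUDA Compatibility Layer
DTK (DCU Toolkit) takes the HIP (Heterogeneous-compute Interface for Portability) route. HIP is an open-source C++ runtime API and kernel language for heterogeneous parallel programming. Code written against it can target CUDA, ROCm and DTK platforms.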
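| CUDA ecosystem | DTK ecosystem | Purpose |
|---|---|---|
| cuBLAS | rocBLAS/hipBLAS | Basic matrix operations |
| cuDNN | MIOpen | Deep learning primitives |
| NCCL | RCCL | Collective communication |
| cuRAND | hipRAND | Random number generation |
| cuSPARSE | hipSPARSE | Sparse matrices |
| cuFFT | rocFFT/hipFFT | Fast Fourier transform |
| CUB | hipCUB | Basic algorithm primitives |
| Thrust | rocThrust | Parallel algorithms library |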
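On the runtime side, most of the CUDA-to-HIP mapping is a mechanical prefix swap, for example:

- `cudaMalloc` to `hipMalloc`
- `cudaMemcpy` to `hipMemcpy`
- `cuda_runtime.h` to `hip/hip_runtime.h`

The hipify tools (hipify-perl, hipify-clang) automate this, along with many library-specific cases. The toy below only illustrates the idea and is no substitute for those tools. `toy_hipify` is a made-up name.

```python
import re

# Only rename identifiers that start with "cuda" followed by an uppercase letter
# (cudaMalloc, cudaError_t, cudaMemcpyHostToDevice, ...), at a word boundary.
RUNTIME_PREFIX = re.compile(r"\bcuda(?=[A-Z])")
HEADER_MAP = {"cuda_runtime.h": "hip/hip_runtime.h"}


def toy_hipify(src: str) -> str:
    """Rewrite a tiny subset of CUDA runtime usage to HIP (illustration only)."""
    for cuda_hdr, hip_hdr in HEADER_MAP.items():
        src = src.replace(f"<{cuda_hdr}>", f"<{hip_hdr}>")
    return RUNTIME_PREFIX.sub("hip", src)


cuda_src = "#include <cuda_runtime.h>\ncudaMalloc(&p, n);\ncudaFree(p);"
print(toy_hipify(cuda_src))
# #include <hip/hip_runtime.h>
# hipMalloc(&p, n);
# hipFree(p);
```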
Terminology mapping:
| NVIDIA | DCU | Description |
|---|---|---|
| SM (Streaming Multiprocessor) | CU (Compute Unit) | Compute unit |
| Warp (32 threads) | Wavefront (64 threads) | Group of threads the hardware executes together |
| Thread | Work-item / Thread | Unit of execution |
| Block | Work-group / Block | Thread block |
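If you already write CUDA, HIP has almost no learning curve. Kernel syntax is the same and the runtime API is highly compatible. DTK currently supports the CUDA 10.2 and 11.8 APIs and works with CMake builds, so migrating a project is cheap.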
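The one difference that does bite is wavefront width. The terminology table lists 64 threads per wavefront versus 32 per warp. CUDA code that hardcodes 32, for example shuffle-reduction offsets starting at 16, silently reduces only half a wavefront. HIP exposes the width as the device-side `warpSize` variable, and offsets should be derived from it.

The Python below only simulates the lane arithmetic of a shuffle-down reduction. It does not call HIP.

```python
def shfl_down_reduce(values: list[int], start_offset: int) -> int:
    """Simulate `v += shfl_down(v, offset)` for offset = start_offset, ..., 1.

    Every lane reads values from before the current step, as on hardware.
    A lane whose source lane is out of range reads its own value.
    Returns lane 0, which holds the reduction result.
    """
    lanes = list(values)
    n = len(lanes)
    offset = start_offset
    while offset >= 1:
        lanes = [lanes[i] + (lanes[i + offset] if i + offset < n else lanes[i])
                 for i in range(n)]
        offset //= 2
    return lanes[0]


WAVEFRONT = 64                                  # DCU wavefront width from the table
wave = [1] * WAVEFRONT                          # every lane contributes 1

print(shfl_down_reduce(wave, 16))               # 32: offsets hardcoded for a 32-lane warp
print(shfl_down_reduce(wave, WAVEFRONT // 2))   # 64: offsets derived from the wavefront width
```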
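6.2 DAS: The Full-Stack AI Platform
DAS (DCU AI Software Stack) sits on top of DTK and integrates a large set of AI frameworks and tools:
- Training frameworks: PyTorch, DeepSpeed, Megatron-LM, Colossal-AI, FastMoE
- Inference frameworks: vLLM, LMDeploy, TGI, SGLang, Llama.cpp, FastLLM
- Fine-tuning tools: LLaMA-Factory, HF PEFT, xFormers, FlashAttention
- AI4Science: AlphaFold2, DeepMD-kit, UniFold, OpenFold, and others
PyTorch API coverage is 98%+ and vLLM API coverage is 99%+, so support for mainstream models is already fairly mature.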
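6.3 The OpenDAS Open-Source Ecosystem
OpenDAS open-sources the whole toolchain, including:
- AI compilers (Triton, XLA, TVM, AITemplate)
- Inference frameworks (ONNX Runtime, MIGraphX)
- PyTorch suite (TorchVision, TorchAudio, etc.)
- The full PaddlePaddle suite
- AI4Science suite
- MMCV family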
6.4 Scientific Computing Ecosystem
This part is easy to overlook, but the solution has built up a solid base in scientific computing:
76.2745098039%);}#mermaid-svg-mRNM2ggkCAWg8qNy .edge-depth-6{stroke-width:-4;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-6 line{stroke:hsl(210, 100%, 86.2745098039%);stroke-width:3;}#mermaid-svg-mRNM2ggkCAWg8qNy .disabled,#mermaid-svg-mRNM2ggkCAWg8qNy .disabled circle,#mermaid-svg-mRNM2ggkCAWg8qNy .disabled text{fill:lightgray;}#mermaid-svg-mRNM2ggkCAWg8qNy .disabled text{fill:#efefef;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-7 rect,#mermaid-svg-mRNM2ggkCAWg8qNy .section-7 path,#mermaid-svg-mRNM2ggkCAWg8qNy .section-7 circle,#mermaid-svg-mRNM2ggkCAWg8qNy .section-7 polygon,#mermaid-svg-mRNM2ggkCAWg8qNy .section-7 path{fill:hsl(90, 100%, 76.2745098039%);}#mermaid-svg-mRNM2ggkCAWg8qNy .section-7 text{fill:black;}#mermaid-svg-mRNM2ggkCAWg8qNy .node-icon-7{font-size:40px;color:black;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-edge-7{stroke:hsl(90, 100%, 76.2745098039%);}#mermaid-svg-mRNM2ggkCAWg8qNy .edge-depth-7{stroke-width:-7;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-7 line{stroke:hsl(270, 100%, 86.2745098039%);stroke-width:3;}#mermaid-svg-mRNM2ggkCAWg8qNy .disabled,#mermaid-svg-mRNM2ggkCAWg8qNy .disabled circle,#mermaid-svg-mRNM2ggkCAWg8qNy .disabled text{fill:lightgray;}#mermaid-svg-mRNM2ggkCAWg8qNy .disabled text{fill:#efefef;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-8 rect,#mermaid-svg-mRNM2ggkCAWg8qNy .section-8 path,#mermaid-svg-mRNM2ggkCAWg8qNy .section-8 circle,#mermaid-svg-mRNM2ggkCAWg8qNy .section-8 polygon,#mermaid-svg-mRNM2ggkCAWg8qNy .section-8 path{fill:hsl(150, 100%, 76.2745098039%);}#mermaid-svg-mRNM2ggkCAWg8qNy .section-8 text{fill:black;}#mermaid-svg-mRNM2ggkCAWg8qNy .node-icon-8{font-size:40px;color:black;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-edge-8{stroke:hsl(150, 100%, 76.2745098039%);}#mermaid-svg-mRNM2ggkCAWg8qNy .edge-depth-8{stroke-width:-10;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-8 line{stroke:hsl(330, 100%, 86.2745098039%);stroke-width:3;}#mermaid-svg-mRNM2ggkCAWg8qNy .disabled,#mermaid-svg-mRNM2ggkCAWg8qNy .disabled 
circle,#mermaid-svg-mRNM2ggkCAWg8qNy .disabled text{fill:lightgray;}#mermaid-svg-mRNM2ggkCAWg8qNy .disabled text{fill:#efefef;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-9 rect,#mermaid-svg-mRNM2ggkCAWg8qNy .section-9 path,#mermaid-svg-mRNM2ggkCAWg8qNy .section-9 circle,#mermaid-svg-mRNM2ggkCAWg8qNy .section-9 polygon,#mermaid-svg-mRNM2ggkCAWg8qNy .section-9 path{fill:hsl(180, 100%, 76.2745098039%);}#mermaid-svg-mRNM2ggkCAWg8qNy .section-9 text{fill:black;}#mermaid-svg-mRNM2ggkCAWg8qNy .node-icon-9{font-size:40px;color:black;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-edge-9{stroke:hsl(180, 100%, 76.2745098039%);}#mermaid-svg-mRNM2ggkCAWg8qNy .edge-depth-9{stroke-width:-13;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-9 line{stroke:hsl(0, 100%, 86.2745098039%);stroke-width:3;}#mermaid-svg-mRNM2ggkCAWg8qNy .disabled,#mermaid-svg-mRNM2ggkCAWg8qNy .disabled circle,#mermaid-svg-mRNM2ggkCAWg8qNy .disabled text{fill:lightgray;}#mermaid-svg-mRNM2ggkCAWg8qNy .disabled text{fill:#efefef;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-10 rect,#mermaid-svg-mRNM2ggkCAWg8qNy .section-10 path,#mermaid-svg-mRNM2ggkCAWg8qNy .section-10 circle,#mermaid-svg-mRNM2ggkCAWg8qNy .section-10 polygon,#mermaid-svg-mRNM2ggkCAWg8qNy .section-10 path{fill:hsl(210, 100%, 76.2745098039%);}#mermaid-svg-mRNM2ggkCAWg8qNy .section-10 text{fill:black;}#mermaid-svg-mRNM2ggkCAWg8qNy .node-icon-10{font-size:40px;color:black;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-edge-10{stroke:hsl(210, 100%, 76.2745098039%);}#mermaid-svg-mRNM2ggkCAWg8qNy .edge-depth-10{stroke-width:-16;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-10 line{stroke:hsl(30, 100%, 86.2745098039%);stroke-width:3;}#mermaid-svg-mRNM2ggkCAWg8qNy .disabled,#mermaid-svg-mRNM2ggkCAWg8qNy .disabled circle,#mermaid-svg-mRNM2ggkCAWg8qNy .disabled text{fill:lightgray;}#mermaid-svg-mRNM2ggkCAWg8qNy .disabled text{fill:#efefef;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-root rect,#mermaid-svg-mRNM2ggkCAWg8qNy .section-root path,#mermaid-svg-mRNM2ggkCAWg8qNy .section-root 
circle,#mermaid-svg-mRNM2ggkCAWg8qNy .section-root polygon{fill:hsl(240, 100%, 46.2745098039%);}#mermaid-svg-mRNM2ggkCAWg8qNy .section-root text{fill:#ffffff;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-root span{color:#ffffff;}#mermaid-svg-mRNM2ggkCAWg8qNy .section-2 span{color:#ffffff;}#mermaid-svg-mRNM2ggkCAWg8qNy .icon-container{height:100%;display:flex;justify-content:center;align-items:center;}#mermaid-svg-mRNM2ggkCAWg8qNy .edge{fill:none;}#mermaid-svg-mRNM2ggkCAWg8qNy .mindmap-node-label{dy:1em;alignment-baseline:middle;text-anchor:middle;dominant-baseline:middle;text-align:center;}#mermaid-svg-mRNM2ggkCAWg8qNy :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;} 科学计算生态
分子动力学
GROMACS
NAMD
LAMMPS
Amber
OpenMM
计算化学
CP2K
VASP
NWChem
PWMAT
FHI-aims
工业仿真
OpenFOAM
OpenCFD
RapidCFD
GESTS
气象环境
SD3
CALPUFF
NAQPMS
LICOM3
计算物理
QUDA
Chroma
PIConGPU
生信基因
BarraCUDA
Blast
BWA
DeepVariant
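Molecular dynamics, computational chemistry, industrial simulation, weather and environment, computational physics, bioinformatics and genomics: together these cover nearly all of the mainstream scientific computing packages.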
7. CUDA → HIP Migration in Practice
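How much code actually changes when moving from CUDA to HIP? Here is a minimal double-precision vector add (y = x + y, i.e., DAXPY with a = 1):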
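CUDA version:

```cpp
// Grid-stride loop: each thread starts at its global index and advances by
// the total number of threads in the grid, so every element up to n is covered.
__global__ void add(int n, double *x, double *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride) {
        y[i] = x[i] + y[i];
    }
}
```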
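HIP version:

```cpp
#include <hip/hip_runtime.h>  // HIP runtime header (nvcc includes the CUDA one implicitly)

// Same grid-stride loop as the CUDA version; the kernel body is unchanged.
__global__ void add(int n, double *x, double *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride) {
        y[i] = x[i] + y[i];
    }
}
```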
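The two kernels are identical. Kernel syntax carries over unchanged (the HIP source typically just includes `<hip/hip_runtime.h>`), and host-side Runtime API calls need only a straightforward name mapping, e.g., `cudaMalloc` → `hipMalloc` and `cudaMemcpy` → `hipMemcpy`. DTK also ships a tool called GPUFusion that runs automated compile tests on existing CUDA programs; many scientific packages, including Amber, NAMD, GROMACS, MAGMA, and XGBoost, pass these compile tests.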
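The host-side "name mapping" can be illustrated with a toy renamer. This is a minimal sketch assuming plain textual substitution over a small hand-picked table; `hipify` and `kRenames` are hypothetical names for this example, and it says nothing about how DTK or GPUFusion work internally. In the HIP ecosystem, the real tools for this job are hipify-perl and hipify-clang, which cover far more APIs and handle context that simple text replacement misses.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy CUDA -> HIP renamer (hypothetical helper, for illustration only).
// Plain substring replacement over a small table; unlike real tools it
// does not respect identifier boundaries or parse the code.
std::string hipify(std::string src) {
    static const std::vector<std::pair<std::string, std::string>> kRenames = {
        {"cuda_runtime.h", "hip/hip_runtime.h"},
        {"cudaMemcpyHostToDevice", "hipMemcpyHostToDevice"},
        {"cudaMemcpyDeviceToHost", "hipMemcpyDeviceToHost"},
        {"cudaDeviceSynchronize", "hipDeviceSynchronize"},
        {"cudaMemcpy", "hipMemcpy"},
        {"cudaMalloc", "hipMalloc"},
        {"cudaFree", "hipFree"},
    };
    for (const auto& rule : kRenames) {
        const std::string& from = rule.first;
        const std::string& to = rule.second;
        // An empty pattern would match at every position and never advance.
        assert(!from.empty());
        std::size_t pos = 0;
        // Replace every occurrence, then skip past the inserted text so the
        // replacement itself is never rescanned.
        while ((pos = src.find(from, pos)) != std::string::npos) {
            src.replace(pos, from.size(), to);
            pos += to.size();
        }
    }
    // Kernel code (__global__, blockIdx, threadIdx) and <<<...>>> launches
    // are left untouched because HIP accepts the same syntax.
    return src;
}
```

For example, `hipify("cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);")` returns `"hipMemcpy(d, h, n, hipMemcpyHostToDevice);"`, while a launch such as `add<<<b, t>>>(n, x, y);` passes through unchanged. Because matching is purely textual, an identifier like `mycudaMalloc` would also be rewritten; token- or AST-based tools avoid that.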
8. Summary
Having run the whole stack myself, I see three core strengths in this domestic GPGPU solution:
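- Leading interconnect topology: an in-house switch chip plus a Clos all-to-all fabric delivers 448 GB/s P2P bandwidth and 1792 GB/s bisection bandwidth, a 2-8× advantage over Fullmesh designs. The benefit is clearest for tensor parallelism (TP) in large models and for multi-node scale-out.
- Good software ecosystem compatibility: the HIP programming model plus the DTK math libraries provide source-level CUDA compatibility, mainstream frameworks such as PyTorch and vLLM reach 98%+ coverage, and migration cost is low.
- Full-scenario coverage: from AI training and inference to scientific computing (molecular dynamics, computational chemistry, weather and environment), and from a single card to 10,000-card clusters, one solution covers the full range.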
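The bandwidth figures in the first point follow from the node spec's per-card link bandwidth. As a worked check, assume each of the $N = 8$ cards attaches to the switch fabric with one link of $b \approx 448$ GB/s and that the switch is non-blocking, so any half of the cards can talk to the other half at full rate ($N$ and $b$ are symbols introduced here for the calculation):

$$
\begin{aligned}
B_{\text{node}} &= N \cdot b = 8 \times 448\ \text{GB/s} = 3584\ \text{GB/s} \\
B_{\text{bisection}} &= \frac{N}{2} \cdot b = 4 \times 448\ \text{GB/s} = 1792\ \text{GB/s}
\end{aligned}
$$

Under the same assumption, a single card pair can use the full $b$ for P2P traffic through the switch, whereas in a Fullmesh each pair is limited to the one direct link between them.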
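There are still areas to watch: inference performance on dense models is still climbing, and support for some niche frameworks will take more time. At the current pace of iteration, though, these gaps are closing quickly.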
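This article is compiled from publicly available technical materials; specific performance figures have been anonymized and are for reference only.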