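Operator Development Engineer Skill Tree

Here is the real, industry version of the operator development engineer skill tree. It is not the bootcamp version, but the capability tiers that kernel teams at NVIDIA / ByteDance / Huawei Ascend / DeepSeek / OpenAI infra / AMD / Meta actually look for.

First, get one thing straight:

An operator development engineer is, at its core:

a role at the intersection of AI × HPC kernels × compilers × computer architecture

The job is not writing models or PyTorch code; it is writing high-performance kernels + schedules + hardware mappings.

Below is the full skill tree, from the bottom layer to the top.

Layer 1: Hard-core fundamentals (these must be solid, or operator optimization is out of reach)

1) Computer architecture (this is your strong suit)

Core knowledge: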

memory hierarchy

cache / shared memory / register file

SIMD / SIMT / warp

pipeline / ILP / TLP

roofline model

memory-bandwidth bound vs compute bound

You must be able to:

look at a kernel and tell:

whether the bottleneck is in

global memory

shared memory

latency hiding

occupancy
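To make the roofline judgment concrete, here is a minimal Python sketch. The peak numbers (100 TFLOP/s, 1 TB/s) describe a hypothetical machine, not any specific GPU, and the byte counts assume every operand crosses the memory bus exactly once. All function names are made up for illustration.

```python
# Roofline sketch. PEAK_GFLOPS / PEAK_GBPS are ASSUMED values for a
# hypothetical accelerator, chosen only to make the arithmetic easy.
PEAK_GFLOPS = 100_000.0   # 100 TFLOP/s
PEAK_GBPS = 1_000.0       # 1 TB/s

def attainable_gflops(flops, bytes_moved, peak_gflops=PEAK_GFLOPS, peak_gbps=PEAK_GBPS):
    """Roofline bound: min(compute roof, arithmetic intensity * bandwidth roof)."""
    intensity = flops / bytes_moved            # FLOP per byte
    return min(peak_gflops, intensity * peak_gbps)

def classify(flops, bytes_moved, peak_gflops=PEAK_GFLOPS, peak_gbps=PEAK_GBPS):
    """Compare arithmetic intensity with the ridge point of the roofline."""
    ridge = peak_gflops / peak_gbps            # FLOP/byte where the two roofs meet
    return "compute bound" if flops / bytes_moved >= ridge else "memory bound"

def vector_add_cost(n, elem_bytes=4):
    """c = a + b: n additions; read a, read b, write c."""
    return n, 3 * n * elem_bytes

def gemm_cost(m, n, k, elem_bytes=2):
    """C = A @ B: 2*m*n*k FLOPs; idealized traffic touches each matrix once."""
    return 2 * m * n * k, elem_bytes * (m * k + k * n + m * n)

if __name__ == "__main__":
    for name, cost in [("vector add", vector_add_cost(1 << 20)),
                       ("GEMM 4096^3", gemm_cost(4096, 4096, 4096))]:
        print(name, classify(*cost), round(attainable_gflops(*cost), 1), "GFLOP/s")
```

Under these assumptions the vector add sits far left of the ridge point (1/12 FLOP per byte) while the large GEMM sits far right of it, which is the kind of call you should be able to make by inspection.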

2) Parallel computing models (GPU / AI accelerators)

Must master:

CUDA execution model

thread / warp / block

SIMT divergence

memory coalescing

bank conflict

tensor core execution model
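A quick way to build intuition for memory coalescing is to count how many memory segments one warp touches. The sketch below is a simplified model, assuming 32 threads per warp, 4-byte elements, and 128-byte segments; real hardware has finer-grained behavior, so treat the numbers as illustrative only.

```python
WARP_SIZE = 32  # threads per warp (assumed, matches the usual CUDA model)

def segments_touched(stride_elems, elem_bytes=4, segment_bytes=128, base=0):
    """Count distinct memory segments hit when thread t reads element t*stride."""
    addresses = [base + t * stride_elems * elem_bytes for t in range(WARP_SIZE)]
    return len({addr // segment_bytes for addr in addresses})

if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        print(f"stride {stride:2d}: {segments_touched(stride)} segment(s) per warp load")
```

With unit stride the whole warp is served by one segment; with a stride of 32 elements every thread lands in its own segment, so the same load moves 32 times as much data for the same useful bytes.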

For the AI chip track you also need:

DMA orchestration

on-chip SRAM scheduling

dataflow mapping（WS/OS/RS）

3) Linear algebra (computational structure, not mathematical derivation)

Focus on:

GEMM blocking

tensor contraction

reduce pattern

einsum mapping

You should be able to see at a glance:

this op → GEMM

this op → reduce + elementwise
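As a small illustration of that pattern matching, the pure-Python sketch below writes a linear layer as a GEMM plus an elementwise bias, and a row-wise softmax as two reductions interleaved with two elementwise steps. The function names are illustrative, and the code is about structure, not speed.

```python
import math

def linear(x, w, b):
    """Linear layer: GEMM (x is M x K, w is K x N) followed by an elementwise bias add."""
    m, k, n = len(x), len(w), len(w[0])
    return [[sum(x[i][p] * w[p][j] for p in range(k)) + b[j] for j in range(n)]
            for i in range(m)]

def softmax_rows(x):
    """Row-wise softmax: reduce(max) -> elementwise exp -> reduce(sum) -> elementwise divide."""
    out = []
    for row in x:
        row_max = max(row)                           # reduce
        exps = [math.exp(v - row_max) for v in row]  # elementwise
        total = sum(exps)                            # reduce
        out.append([e / total for e in exps])        # elementwise
    return out

if __name__ == "__main__":
    print(linear([[1, 2]], [[1, 0], [0, 1]], [10, 20]))
    print(softmax_rows([[0.0, 0.0], [1.0, 2.0]]))
```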

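Layer 2: Kernel implementation ability (the core of the job)

1) GPU kernel programming (be fluent in at least one)

Industry mainstream:

CUDA (the most hard-core)

Triton (widely used in LLM infra)

CUTLASS (NVIDIA ecosystem)

You should be able to write: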

tiled matmul

shared memory blocking

warp-level mma

double buffering

async copy pipeline
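Double buffering and async copy pipelines are easier to reason about with a timing model than with prose. The sketch below is a back-of-the-envelope model, not a kernel: it assumes each tile has a fixed load time and compute time, that loads and compute can overlap perfectly, and that two buffers are used in ping-pong fashion. The names and numbers are hypothetical.

```python
def serial_time(num_tiles, t_load, t_compute):
    """Single buffer: every tile is loaded, then computed, with no overlap."""
    return num_tiles * (t_load + t_compute)

def double_buffered_time(num_tiles, t_load, t_compute):
    """Two buffers: load of tile i+1 overlaps compute of tile i (idealized)."""
    if num_tiles == 0:
        return 0
    # prologue load + steady state bounded by the slower stage + final compute
    return t_load + (num_tiles - 1) * max(t_load, t_compute) + t_compute

def pingpong_schedule(num_tiles):
    """Per step: (tile being computed, buffer it lives in, (next tile, prefetch buffer))."""
    steps = []
    for i in range(num_tiles):
        nxt = (i + 1, (i + 1) % 2) if i + 1 < num_tiles else None
        steps.append((i, i % 2, nxt))
    return steps

if __name__ == "__main__":
    print(serial_time(8, 3, 4), double_buffered_time(8, 3, 4))
    for step in pingpong_schedule(3):
        print(step)
```

In this model the steady-state cost per tile drops from load + compute to max(load, compute), which is the usual argument for paying the extra shared memory that a second buffer costs.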

2) Loop schedule design (the core of DL compilers)

You must understand:

loop tiling

loop reorder

fusion

vectorization

unrolling

tensorization

Tools:

TVM TensorIR

MLIR Linalg

Halide

Triton IR

The essential skill:

turning

GEMM

into:

an optimal loop nest
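Here is what "GEMM as a loop nest" looks like before and after tiling and reordering, as a pure-Python sketch. The tile sizes are arbitrary placeholders; on real hardware they would be chosen to fit registers, shared memory, or on-chip SRAM, and the inner loops would be vectorized or mapped to a matrix instruction.

```python
def matmul_naive(A, B):
    """Reference loop nest in i, j, k order."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, tm=2, tn=2, tk=2):
    """Same computation after tiling (i0, j0, k0) and reordering the inner loops to i, k, j."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tm):              # outer loops walk over tiles
        for j0 in range(0, N, tn):
            for k0 in range(0, K, tk):
                for i in range(i0, min(i0 + tm, M)):      # inner loops stay inside one tile
                    for k in range(k0, min(k0 + tk, K)):
                        a = A[i][k]                        # reused across the j loop
                        for j in range(j0, min(j0 + tn, N)):
                            C[i][j] += a * B[k][j]
    return C

if __name__ == "__main__":
    A = [[i + k for k in range(5)] for i in range(3)]   # 3 x 5
    B = [[k * j for j in range(4)] for k in range(5)]   # 5 x 4
    assert matmul_tiled(A, B) == matmul_naive(A, B)
    print(matmul_tiled(A, B))
```

The min(...) bounds handle shapes that are not multiples of the tile size; a production schedule would usually split those edge tiles out or pad instead.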

3) Memory layout design (where newcomers are weakest)

In industry, a lot of time goes into:

NHWC vs NCHW

COL32

tensorcore fragment layout

swizzle layout

padding layout

Goal:

eliminate:

bank conflict

uncoalesced access

L2 thrashing
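The padding trick for bank conflicts can be checked with a few lines of arithmetic. The sketch below assumes a simplified shared memory model with 32 banks of 4-byte words, where word w lives in bank w % 32; it models a warp reading one column of a row-major tile.

```python
NUM_BANKS = 32  # assumed: 32 banks, one 4-byte word per bank per cycle

def conflict_degree(row_stride_words, num_threads=32):
    """Thread t reads tile[t][0] of a row-major tile; return the worst-case
    number of threads that hit the same bank (1 means conflict-free)."""
    hits = {}
    for t in range(num_threads):
        bank = (t * row_stride_words) % NUM_BANKS
        hits[bank] = hits.get(bank, 0) + 1
    return max(hits.values())

if __name__ == "__main__":
    print("32 x 32 tile, column read:", conflict_degree(32), "-way conflict")
    print("32 x 33 tile (1 word of padding):", conflict_degree(33), "-way conflict")
```

With a row stride of 32 words every thread maps to bank 0; padding each row by one word shifts consecutive rows onto consecutive banks, so the column read becomes conflict-free in this model.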

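Layer 3: Operator-level optimization (the core value of the role)

1) Multiple implementation strategies per operator

Taking Conv as an example: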

direct conv

implicit GEMM

winograd

fft

tensorcore conv

What operator development has to deliver:

automatic kernel selection based on shape
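Two of the Conv strategies above compute the same result in different ways, which a tiny sketch can show. It covers a single-channel 2-D convolution (cross-correlation, as DL frameworks define it) with stride 1 and no padding, and lowers it to a matrix product through an explicit im2col. Implicit GEMM follows the same index mapping but does not materialize the im2col matrix; that part is not shown here. The function names are illustrative.

```python
def conv2d_direct(x, w):
    """Direct convolution: slide the R x S filter over the H x W input."""
    H, W, R, S = len(x), len(x[0]), len(w), len(w[0])
    return [[sum(x[p + r][q + s] * w[r][s] for r in range(R) for s in range(S))
             for q in range(W - S + 1)]
            for p in range(H - R + 1)]

def im2col(x, R, S):
    """One row per output pixel, one column per filter tap: shape (P*Q) x (R*S)."""
    H, W = len(x), len(x[0])
    return [[x[p + r][q + s] for r in range(R) for s in range(S)]
            for p in range(H - R + 1) for q in range(W - S + 1)]

def conv2d_gemm(x, w):
    """Same result as a GEMM: (P*Q x R*S) times (R*S x 1), then reshape to P x Q."""
    R, S = len(w), len(w[0])
    cols = im2col(x, R, S)
    w_flat = [w[r][s] for r in range(R) for s in range(S)]
    flat = [sum(a * b for a, b in zip(row, w_flat)) for row in cols]
    Q = len(x[0]) - S + 1
    return [flat[i:i + Q] for i in range(0, len(flat), Q)]

if __name__ == "__main__":
    x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    w = [[1, 0], [0, 1]]
    print(conv2d_direct(x, w), conv2d_gemm(x, w))
```

Which variant wins depends on the shape (filter size, channel count, batch), which is why selection has to be automatic.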

2) Kernel fusion (a core skill in the LLM era)

You must be able to design:

fused matmul + bias + gelu

fused attention

fused rmsnorm

fused kv-cache update

The underlying goal:

reduce global memory traffic
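The traffic saving from fusion can be estimated with simple byte counting. The sketch below compares three separate kernels (matmul, bias add, GELU) with one fused kernel that applies bias and GELU in the matmul epilogue. It is an idealized model: it assumes 2-byte elements, that each kernel reads its inputs and writes its output exactly once, and it ignores caches.

```python
def unfused_bytes(m, n, k, elem=2):
    """Global memory traffic for matmul, then bias add, then GELU as separate kernels."""
    matmul = elem * (m * k + k * n) + elem * m * n   # read A and B, write C
    bias = elem * (m * n + n) + elem * m * n         # read C and bias, write C
    gelu = elem * m * n + elem * m * n               # read C, write C
    return matmul + bias + gelu

def fused_bytes(m, n, k, elem=2):
    """One kernel: read A, B and bias, apply bias + GELU in registers, write C once."""
    return elem * (m * k + k * n) + elem * n + elem * m * n

if __name__ == "__main__":
    m, n, k = 4096, 4096, 128
    u, f = unfused_bytes(m, n, k), fused_bytes(m, n, k)
    print(f"unfused {u / 1e6:.1f} MB, fused {f / 1e6:.1f} MB, ratio {u / f:.2f}x")
```

The intermediate M x N tensor makes four extra round trips in the unfused version, so the saving grows as K shrinks relative to M and N.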

3) Autotuning & Dispatch

A real system is not:

a single kernel

but rather:

kernel registry

shape dispatch

autotuning search

Layer 4: Hardware mapping (the direction where you can aim for the top)

You should be able to do:

loop → dataflow mapping

For example:

weight stationary

output stationary

row stationary

You need to understand:

tile = communication unit

tile = NoC packet

At AI chip companies this is a core competency.
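One way to see that a dataflow is just a loop order plus a choice of what stays put: the sketch below runs the same small matmul with an output-stationary and a weight-stationary loop nest and counts how often weights are fetched and partial sums are written. It models a single processing element with one local register, which is a heavy simplification of a real array, and it does not cover row stationary. Names and counters are illustrative.

```python
def output_stationary(A, W):
    """Partial sum for C[i][j] stays in the PE; weights stream past it."""
    M, K, N = len(A), len(W), len(W[0])
    C = [[0] * N for _ in range(M)]
    stats = {"weight_loads": 0, "psum_writes": 0}
    for i in range(M):
        for j in range(N):
            acc = 0                          # stationary operand: the output
            for k in range(K):
                acc += A[i][k] * W[k][j]
                stats["weight_loads"] += 1
            C[i][j] = acc
            stats["psum_writes"] += 1        # written back once per output
    return C, stats

def weight_stationary(A, W):
    """W[k][j] is loaded once and held; partial sums move in and out instead."""
    M, K, N = len(A), len(W), len(W[0])
    C = [[0] * N for _ in range(M)]
    stats = {"weight_loads": 0, "psum_writes": 0}
    for k in range(K):
        for j in range(N):
            w = W[k][j]                      # stationary operand: the weight
            stats["weight_loads"] += 1
            for i in range(M):
                C[i][j] += A[i][k] * w
                stats["psum_writes"] += 1    # partial sum updated every step
    return C, stats

if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    W = [[1, 0], [0, 1]]
    print(output_stationary(A, W))
    print(weight_stationary(A, W))
```

Both nests produce the same C; what changes is which operand generates traffic, and that traffic is what ends up as tiles moving across the on-chip network.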

Layer 5: Toolchain (essential in industry)

Profiling:

nsys

nvprof (legacy; superseded by nsys and ncu)

ncu

rocprof

IR:

MLIR

TVM

Triton IR

Disassembly:

cuobjdump

nvdisasm

SASS analysis

Layer 6: New core skills of the LLM era (in strong demand from 2024 on)

FlashAttention structure

PagedAttention

KV-cache layout

MoE routing kernels

quantization kernels (int8/fp8)
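The structural idea behind FlashAttention-style kernels is a streaming softmax: keys and values are processed block by block while a running max, a running normalizer, and a rescaled accumulator are carried along, so the full score row is never stored. The sketch below shows this for a single query vector in pure Python. It omits the 1/sqrt(d) scaling, masking, and all tiling over queries, and the function names are illustrative.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_naive(q, keys, values):
    """Reference: materialize all scores, softmax, then the weighted sum of values."""
    scores = [_dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    d = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) / z for c in range(d)]

def attention_streaming(q, keys, values, block=2):
    """Online softmax: one pass over key/value blocks with O(d) running state."""
    d = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * d
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        scores = [_dot(q, k) for k in kb]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)   # rescales old state; 0.0 on the first block
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(probs)
        acc = [acc[c] * scale + sum(p * v[c] for p, v in zip(probs, vb)) for c in range(d)]
        running_max = new_max
    return [a / running_sum for a in acc]

if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[1, 0], [0, 1], [1, 1], [2, -1], [0.5, 0.5]]
    values = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
    print(attention_naive(q, keys, values))
    print(attention_streaming(q, keys, values))
```

The two functions agree up to floating-point rounding; the streaming version is the one that maps onto tiles held in shared memory or on-chip SRAM.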

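Layer 7: What interviews actually test (this is very close to reality)

Write by hand:

a tiled matmul kernel

Explain:

why shared memory should be double-buffered

Analyze:

the memory access pattern of a kernel

Design:

a fused operator

Judge:

whether an operator is compute bound or memory bound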

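Layer 8: A very realistic capability ladder (as used inside the industry)

L1: can write a CUDA matmul (newcomer)

L2: can do tiling + shared memory (hireable)

L3: can do Tensor Cores + fusion (strong engineer)

L4: can design a new operator kernel family (senior)

L5: can design compiler schedules + dataflow (core architect)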