TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

TVM performs both graph-level and operator-level optimizations.

Graph-level Optimization

At the graph level, TVM applies operator fusion, constant folding, static memory pre-allocation (a static memory planning pass), and data layout transformation.
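As a rough illustration of one of these passes, here is a minimal sketch of constant folding on a toy expression tree. The tuple-based representation and the name `fold_constants` are assumptions made for this example; they are not TVM's actual IR or API.

```python
import operator

# Toy expression graph (hypothetical, not TVM's IR): a node is a numeric
# constant, a variable name (str), or a tuple (op, lhs, rhs).
OPS = {"add": operator.add, "mul": operator.mul}

def fold_constants(node):
    """Recursively evaluate sub-expressions whose inputs are all constants."""
    if not isinstance(node, tuple):
        return node  # constant or variable: nothing to fold
    op, lhs, rhs = node
    lhs, rhs = fold_constants(lhs), fold_constants(rhs)
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return OPS[op](lhs, rhs)  # computed once at compile time
    return (op, lhs, rhs)

# x + (2 * 3): the multiplication does not depend on any runtime input.
expr = ("add", "x", ("mul", 2, 3))
folded = fold_constants(expr)
print(folded)
```

The part of the graph that depends only on constants is evaluated ahead of time, so it costs nothing at inference time.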

Operator Fusion

I want to emphasize operator fusion here. TVM classifies operators into four types:

  • injective (one-to-one map, e.g., add)
  • reduction (e.g., sum)
  • complex-out-fusable (element-wise ops can be fused to its output, e.g., conv2d)
  • opaque (cannot be fused, e.g., sort)

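TVM fuses as many operators as these categories allow: multiple injective operators can be fused into one, a reduction can be fused with its injective input operators, and element-wise operators can be fused onto the output of a complex-out-fusable operator.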
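The fusion rules for the four operator types can be sketched as a greedy pass over a linear chain of operators. This is only a simplified illustration: the functions `can_fuse` and `fuse_chain` are hypothetical, real graphs are DAGs rather than chains, and treating `bn` as injective is an assumption that holds for inference-time batch normalization.

```python
INJECTIVE = "injective"
REDUCTION = "reduction"
COMPLEX_OUT = "complex-out-fusable"
OPAQUE = "opaque"

def can_fuse(group_kind, next_kind):
    """Return the kind of the fused group, or None if fusion is not allowed."""
    if group_kind == INJECTIVE and next_kind == INJECTIVE:
        return INJECTIVE
    if group_kind == INJECTIVE and next_kind == REDUCTION:
        return REDUCTION      # a reduction absorbs its injective inputs
    if group_kind == COMPLEX_OUT and next_kind == INJECTIVE:
        return COMPLEX_OUT    # element-wise op fused onto the output
    return None               # opaque ops and all other pairs stay separate

def fuse_chain(ops):
    """Greedily group a linear chain of (name, kind) operators."""
    groups = []
    kind = None
    for name, op_kind in ops:
        fused = can_fuse(kind, op_kind) if groups else None
        if fused is not None:
            groups[-1].append(name)
            kind = fused
        else:
            groups.append([name])
            kind = op_kind
    return groups

chain = [
    ("conv2d", COMPLEX_OUT), ("bn", INJECTIVE), ("relu", INJECTIVE),
    ("sort", OPAQUE), ("scale", INJECTIVE), ("sum", REDUCTION),
]
groups = fuse_chain(chain)
print(groups)
```

Here conv2d, bn, and relu end up in one kernel, sort stays alone, and scale is folded into the sum reduction, so six operators become three kernels and the intermediate results no longer need to be written to memory.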

These graph-level optimizations are common among deep learning compilers.

Operator-level Optimization

TVM separates the schedule from the compute definition, so the same computation can be mapped to different devices by changing only the schedule. TVM introduces three schedule primitives: special memory scope, tensorization, and latency hiding.

  • Special memory scope: makes the most of the shared memory on GPUs.
  • Tensorization: decomposes a large computation into micro-kernels that map onto hardware tensor instructions, much as vectorization maps loops onto SIMD instructions.
  • Latency hiding: overlaps computation with memory operations. On CPUs this is achieved through multi-threading or hardware prefetching; GPUs rely on rapid context switching among many warps of threads.
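To make the separation of compute and schedule more concrete, here is a minimal pure-Python sketch. Both functions compute the same matrix product (the compute definition), but the second reorganizes the loops into tiles and hands each tile to a fixed-size micro-kernel, which is the idea behind tensorization. The `micro_kernel` function is a hypothetical stand-in for a hardware tensor intrinsic; none of this is TVM's API.

```python
def micro_kernel(a, b, c, i0, j0, k0, t):
    """Hypothetical t x t x t multiply-accumulate unit (stand-in for a
    hardware tensor intrinsic)."""
    for i in range(i0, i0 + t):
        for j in range(j0, j0 + t):
            for k in range(k0, k0 + t):
                c[i][j] += a[i][k] * b[k][j]

def matmul_naive(a, b, n):
    """Default schedule: three plain nested loops."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_tensorized(a, b, n, t):
    """Same computation, different schedule: tile the loops by t and
    delegate each tile to the micro-kernel."""
    assert n % t == 0, "this sketch assumes n is a multiple of the tile size"
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, t):
        for j0 in range(0, n, t):
            for k0 in range(0, n, t):
                micro_kernel(a, b, c, i0, j0, k0, t)
    return c

n = 4
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
same = matmul_naive(a, b, n) == matmul_tensorized(a, b, n, t=2)
print(same)
```

The result does not change with the schedule; only the loop structure does, which is why the schedule can be tuned per device without touching the compute definition.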

Automating Optimization

Finding the optimal schedule parameters is very important. TVM proposes an ML-based cost model, a gradient tree boosting model based on XGBoost, that predicts the cost of a candidate configuration from features of its lowered loop program. These features include the memory access count and the reuse ratio of each memory buffer, as well as a one-hot encoding of loop annotations such as "vectorize", "unroll", and "parallel". As shown in the following graph, the data collected from measured runs can be used to train the model again, so the model is updated periodically as more configurations are explored.
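The feature vector described above might be assembled roughly as follows. This is a sketch under stated assumptions: the dictionary keys (`access_count`, `reuse_ratio`, `annotation`) and the function names are invented for illustration, and the example values are made up. In TVM such a vector would be fed to the gradient boosted tree model, which is omitted here.

```python
ANNOTATIONS = ["vectorize", "unroll", "parallel"]

def one_hot(annotation):
    """One-hot encode a loop annotation; unknown or missing gives all zeros."""
    return [1 if annotation == name else 0 for name in ANNOTATIONS]

def loop_features(loops):
    """Flatten per-loop features into a single vector for the cost model.

    Each loop is a dict with hypothetical keys:
    access_count, reuse_ratio, annotation.
    """
    features = []
    for loop in loops:
        features.append(loop["access_count"])
        features.append(loop["reuse_ratio"])
        features.extend(one_hot(loop["annotation"]))
    return features

# Illustrative two-level loop nest (values are made up).
program = [
    {"access_count": 1024, "reuse_ratio": 0.5, "annotation": "parallel"},
    {"access_count": 64, "reuse_ratio": 4.0, "annotation": "vectorize"},
]
vec = loop_features(program)
print(vec)
```

Each measured configuration then contributes one (feature vector, measured cost) pair to the training data for the next model update.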

Consequently, TVM lowers the threshold for writing a relatively high-performance kernel. I think two points deserve further study: the schedule primitives and the prediction model.
