Triton Learning Notes

Ref

  1. Triton examples

Triton

Getting started with Add

```py
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr,  # *Pointer* to first input vector.
               y_ptr,  # *Pointer* to second input vector.
               z_ptr,  # *Pointer* to output vector.
               N,      # Size of the vector.
               BLOCK_SIZE: tl.constexpr,  # Number of elements each program processes.
               ):
    # There are multiple 'programs' processing different data.
    # We identify which program we are here:
    pid = tl.program_id(axis=0)
    # Offsets is the list of elements this program instance will act on.
    # e.g. if BLOCK_SIZE is 32 these would be
    # [0:32], [32:64], [64:96], etc., using `pid` to find the starting index.
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Create a mask to guard memory operations against out-of-bounds access.
    mask = offsets < N
    # Load x and y, using the mask.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    z = x + y
    # Write z back to HBM.
    tl.store(z_ptr + offsets, z, mask=mask)
```

As you can see, each `pid` is responsible for one `BLOCK_SIZE`-sized chunk. Many program instances are launched in parallel; each one computes its own offsets, processes its region, and stores the result back to HBM.
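The pid/offsets/mask logic can be illustrated with a pure-Python sketch that needs no GPU. The helper name `add_block` and the concrete sizes below are illustrative, not part of Triton:

```python
def add_block(x, y, z, pid, BLOCK_SIZE):
    """Simulate one Triton program instance operating on one block."""
    N = len(x)
    block_start = pid * BLOCK_SIZE                           # same as in the kernel
    offsets = range(block_start, block_start + BLOCK_SIZE)   # tl.arange analogue
    for i in offsets:
        if i < N:                 # the mask: guard against out-of-bounds access
            z[i] = x[i] + y[i]

N, BLOCK_SIZE = 10, 4
x = list(range(N))
y = [10 * v for v in x]
z = [None] * N
num_programs = -(-N // BLOCK_SIZE)   # ceiling division: 3 programs cover 10 elements
for pid in range(num_programs):      # on a GPU these would run in parallel
    add_block(x, y, z, pid, BLOCK_SIZE)

print(z)  # each element is x[i] + y[i]
```

Note how the last program (pid=2) covers offsets 8..11, but the mask keeps it from touching indices 10 and 11, which are out of bounds.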

Next, we launch it:

```py
def add(x: torch.Tensor, y: torch.Tensor):
    # Preallocate the output.
    z = torch.empty_like(x)
    N = z.numel()
    # The grid can be a static tuple, or a callable that returns a tuple.
    # Here it evaluates to (triton.cdiv(N, BLOCK_SIZE),).
    grid = lambda meta: (triton.cdiv(N, meta['BLOCK_SIZE']), )
    add_kernel[grid](x, y, z, N, BLOCK_SIZE=1024)
    return z
```
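The grid lambda above uses `triton.cdiv`, which is ceiling division: the number of `BLOCK_SIZE`-sized programs needed to cover all `N` elements. A quick sketch of the equivalent arithmetic (pure Python, no GPU needed; the local `cdiv` below is a stand-in for `triton.cdiv`):

```python
def cdiv(n, b):
    # Ceiling division: smallest k such that k * b >= n.
    return (n + b - 1) // b

print(cdiv(1024, 1024))  # 1 program: exactly one full block
print(cdiv(1025, 1024))  # 2 programs: the second block is almost entirely masked off
print(cdiv(2048, 1024))  # 2 programs: two full blocks
```

This is why the mask in the kernel matters: whenever `N` is not a multiple of `BLOCK_SIZE`, the last program's offsets run past the end of the tensors.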

Although we pass torch.Tensor objects here, the @triton.jit decorator automatically converts them into the pointer arguments that the kernel signature expects.
