cublasGemmEx Testing and Profiling
- 1. Preparation
- 2. Using NCU to Obtain GPU Peak Performance
  - A. Collect the metrics, filtering out those not supported on the current GPU
  - B. Compute the Theoretical Hardware Performance
- 3. Computing the Theoretical Performance of GEMM
- 4. cublasGemmEx Test
This article demonstrates how to use the cublasGemmEx API, how to compute the theoretical FLOPs of a GEMM, and how to use NCU to obtain both the GPU's peak compute throughput and the measured throughput.
1. Preparation
- Lock the clocks (an NVML-based check of the locked clocks is sketched after this list)
```bash
nvidia-smi -q -d SUPPORTED_CLOCKS
nvidia-smi -pm 1
nvidia-smi -q -d CLOCK
nvidia-smi -lgc 2115
nvidia-smi -lmc 7501
nvidia-smi -q -d CLOCK
```
- Get the Tensor Core-related metrics
```bash
cat /usr/local/NVIDIA-Nsight-Compute/sections/SpeedOfLight_HierarchicalTensorRooflineChart.section | grep "sm__ops_path_tensor_" | grep "peak_sustained" | sort | uniq | awk -F: '{print $2}' | sed 's/ //g' | sed 's/\"//g'
```
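As an optional sanity check, a minimal NVML sketch (assuming the nvml.h header shipped with the CUDA toolkit and linking against -lnvidia-ml; not part of the original commands) can confirm that the SM and memory clocks are actually locked at the requested values:

```cpp
#include <nvml.h>
#include <cstdio>

// Query the current SM and memory clocks via NVML to confirm that
// `nvidia-smi -lgc` / `-lmc` took effect.
int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        unsigned int sm_mhz = 0, mem_mhz = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);   // current SM clock (MHz)
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem_mhz); // current memory clock (MHz)
        printf("SM clock: %u MHz, memory clock: %u MHz\n", sm_mhz, mem_mhz);
    }
    nvmlShutdown();
    return 0;
}
```

Compile with, for example, `g++ check_clocks.cpp -I/usr/local/cuda/include -lnvidia-ml` (file name is illustrative).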
2. Using NCU to Obtain GPU Peak Performance
A. Collect the metrics, filtering out those not supported on the current GPU
```bash
tee ncu_get_gpu_peak_sustained.cu<<-'EOF'
#include <iostream>
#include <cuda_runtime.h>
// Empty kernel: NCU only needs one kernel launch to report the
// peak_sustained metrics; the kernel body itself does not matter.
__global__ void kernel2(float *d_in, float *d_out) {
}
int main() {
float *d_in;
float *d_out;
// Launch geometry for an RTX 3060: 28 SMs, 4 SM sub-partitions, warp size 32.
// Adjust sm_count for other GPUs (or query it via cudaGetDeviceProperties).
int sm_count=28;
int smsp_count=4;
int warpsize=32;
int total_count=sm_count*smsp_count*warpsize;
cudaMalloc((void**)&d_in, total_count * sizeof(float));
cudaMalloc((void**)&d_out, total_count * sizeof(float));
kernel2<<<sm_count, warpsize*smsp_count>>>(d_in, d_out);
cudaDeviceSynchronize();
cudaFree(d_in);
cudaFree(d_out);
return 0;
}
EOF
/usr/local/cuda/bin/nvcc -std=c++17 -lineinfo ncu_get_gpu_peak_sustained.cu -o ncu_get_gpu_peak_sustained
/usr/local/NVIDIA-Nsight-Compute/ncu --clock-control=none --metrics \
sm__sass_thread_inst_executed_op_hfma_pred_on.sum.peak_sustained,\
sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained,\
sm__inst_executed_pipe_tensor.sum.peak_sustained,\
sm__ops_path_tensor_src_int8.sum.peak_sustained,\
sm__cycles_elapsed.avg.per_second,\
dram__bytes.sum.peak_sustained,\
dram__cycles_elapsed.avg.per_second,\
lts__lts2xbar_cycles_active.sum.peak_sustained,\
lts__cycles_elapsed.avg.per_second,\
l1tex__lsu_writeback_active_mem_lg.sum.peak_sustained,\
l1tex__cycles_elapsed.avg.per_second,\
sm__ops_path_tensor_op_bgmma_src_int1.sum.peak_sustained,\
sm__ops_path_tensor_op_bmma_src_int1.sum.peak_sustained,\
sm__ops_path_tensor_op_hgmma_src_bf16_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hgmma_src_bf16_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_hgmma_src_fp16_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hgmma_src_fp16_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_hgmma_src_tf32_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hgmma_src_tf32_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_bf16_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_bf16_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp16_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_tf32_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_tf32_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_igmma_src_int8_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_igmma_src_int8_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_imma_src_int8_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_imma_src_int8_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_bf16_dst_fp32.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_bf16_tf32_dst_fp32.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp16.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp32.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16.sum.peak_sustained,\
sm__ops_path_tensor_src_fp64.sum.peak_sustained,\
sm__ops_path_tensor_src_fp8_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_fp8_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_int1.sum.peak_sustained,\
sm__ops_path_tensor_src_int4_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_int4_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_int4.sum.peak_sustained,\
sm__ops_path_tensor_src_int8_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_int8_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_int8.sum.peak_sustained,\
sm__ops_path_tensor_src_tf32_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_tf32_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_tf32_dst_fp32.sum.peak_sustained ./ncu_get_gpu_peak_sustained | grep -v "n/a"
```
Output:
```text
------------------------------------------------------------------------------ ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------------------------------------------------------ ----------- ------------
dram__bytes.sum.peak_sustained byte/cycle 48
dram__cycles_elapsed.avg.per_second Ghz 7.06
l1tex__cycles_elapsed.avg.per_second Ghz 1.89
l1tex__lsu_writeback_active_mem_lg.sum.peak_sustained 28
lts__cycles_elapsed.avg.per_second Ghz 1.79
lts__lts2xbar_cycles_active.sum.peak_sustained 18
sm__cycles_elapsed.avg.per_second Ghz 1.89
sm__inst_executed_pipe_tensor.sum.peak_sustained inst/cycle 28
sm__ops_path_tensor_src_bf16_dst_fp32.sum.peak_sustained 1/cycle 28,672
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_off.sum.peak_sustained 1/cycle 14,336
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_on.sum.peak_sustained 1/cycle 28,672
sm__ops_path_tensor_src_fp16_dst_fp16.sum.peak_sustained 1/cycle 57,344
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained 1/cycle 28,672
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_on.sum.peak_sustained 1/cycle 57,344
sm__ops_path_tensor_src_fp16_dst_fp32.sum.peak_sustained 1/cycle 28,672
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_off.sum.peak_sustained 1/cycle 14,336
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_on.sum.peak_sustained 1/cycle 28,672
sm__ops_path_tensor_src_fp64.sum.peak_sustained 1/cycle 98.87
sm__ops_path_tensor_src_int1.sum.peak_sustained 1/cycle 458,752
sm__ops_path_tensor_src_int4.sum.peak_sustained 1/cycle 229,376
sm__ops_path_tensor_src_int4_sparsity_off.sum.peak_sustained 1/cycle 114,688
sm__ops_path_tensor_src_int4_sparsity_on.sum.peak_sustained 1/cycle 229,376
sm__ops_path_tensor_src_int8.sum.peak_sustained 1/cycle 114,688
sm__ops_path_tensor_src_int8_sparsity_off.sum.peak_sustained 1/cycle 57,344
sm__ops_path_tensor_src_int8_sparsity_on.sum.peak_sustained 1/cycle 114,688
sm__ops_path_tensor_src_tf32_dst_fp32.sum.peak_sustained 1/cycle 14,336
sm__ops_path_tensor_src_tf32_dst_fp32_sparsity_off.sum.peak_sustained 1/cycle 7,168
sm__ops_path_tensor_src_tf32_dst_fp32_sparsity_on.sum.peak_sustained 1/cycle 14,336
sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained inst/cycle 3,584
sm__sass_thread_inst_executed_op_hfma_pred_on.sum.peak_sustained inst/cycle 1,792
------------------------------------------------------------------------------ ----------- ------------
```
B. Compute the Theoretical Hardware Performance
```text
* DRAM bandwidth = dram__bytes.sum.peak_sustained * dram__cycles_elapsed.avg.per_second = 48 byte/cycle (192-bit bus * 2 / 8) * 7.06 GHz (memory clock) = 338.88 GB/s
* L1 bandwidth = l1tex__lsu_writeback_active_mem_lg.sum.peak_sustained * l1tex__cycles_elapsed.avg.per_second * 128 = 28 * 1.89 * 128 = 6773.76 GB/s
* L2 bandwidth = lts__lts2xbar_cycles_active.sum.peak_sustained * lts__cycles_elapsed.avg.per_second * 32 = 18 * 1.79 * 32 = 1031.04 GB/s
* FFMA peak throughput = sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained * 2 * sm__cycles_elapsed.avg.per_second = 3584 * 2 * 1.89 = 13547.52 GFLOPS
* HFMA peak throughput = sm__sass_thread_inst_executed_op_hfma_pred_on.sum.peak_sustained * 2 * sm__cycles_elapsed.avg.per_second = 1792 * 2 * 1.89 = 6773.76 GFLOPS
* Tensor Core FP16 (FP16 accumulate) peak throughput = sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained * sm__cycles_elapsed.avg.per_second = 28672 * 1.89 = 54190.08 GFLOPS
* Tensor Core INT8 peak throughput = sm__ops_path_tensor_src_int8_sparsity_off.sum.peak_sustained * sm__cycles_elapsed.avg.per_second = 57344 * 1.89 = 108380.16 GOPS
* FFMA-to-DRAM arithmetic intensity = 13547.52 GFLOPS / 338.88 GB/s = 39.98 FLOP/byte
* Tensor Core FP16-to-DRAM arithmetic intensity = 54190.08 GFLOPS / 338.88 GB/s = 159.91 FLOP/byte
```
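For reference, the same arithmetic as a small standalone C++ program. It is only a sketch: the metric values are hard-coded from the NCU output above (RTX 3060), together with the 128-byte L1 writeback and 32-byte L2 transaction sizes assumed in the formulas.

```cpp
#include <cstdio>

int main() {
    // peak_sustained / per_second values copied from the NCU output above (RTX 3060)
    const double dram_bytes_per_cycle = 48.0;    // dram__bytes.sum.peak_sustained
    const double dram_ghz             = 7.06;    // dram__cycles_elapsed.avg.per_second
    const double l1_wb_per_cycle      = 28.0;    // l1tex__lsu_writeback_active_mem_lg.sum.peak_sustained
    const double l1_ghz               = 1.89;    // l1tex__cycles_elapsed.avg.per_second
    const double l2_xbar_per_cycle    = 18.0;    // lts__lts2xbar_cycles_active.sum.peak_sustained
    const double l2_ghz               = 1.79;    // lts__cycles_elapsed.avg.per_second
    const double ffma_per_cycle       = 3584.0;  // sm__sass_thread_inst_executed_op_ffma_pred_on...
    const double tc_fp16_per_cycle    = 28672.0; // sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off...
    const double sm_ghz               = 1.89;    // sm__cycles_elapsed.avg.per_second

    const double dram_gbps      = dram_bytes_per_cycle * dram_ghz;   // GB/s
    const double l1_gbps        = l1_wb_per_cycle * l1_ghz * 128.0;  // 128 bytes per L1 writeback
    const double l2_gbps        = l2_xbar_per_cycle * l2_ghz * 32.0; // 32 bytes per L2-to-xbar transfer
    const double ffma_gflops    = ffma_per_cycle * 2.0 * sm_ghz;     // 2 FLOPs per FMA
    const double tc_fp16_gflops = tc_fp16_per_cycle * sm_ghz;        // metric already counts FLOPs

    printf("DRAM %.2f GB/s, L1 %.2f GB/s, L2 %.2f GB/s\n", dram_gbps, l1_gbps, l2_gbps);
    printf("FFMA %.2f GFLOPS, Tensor Core FP16 %.2f GFLOPS\n", ffma_gflops, tc_fp16_gflops);
    printf("FFMA/DRAM intensity %.2f FLOP/byte, TC FP16/DRAM intensity %.2f FLOP/byte\n",
           ffma_gflops / dram_gbps, tc_fp16_gflops / dram_gbps);
    return 0;
}
```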
3. Computing the Theoretical Performance of GEMM
```text
* Theoretical FLOPs = (2*8192*8192*8192)/1000000000 = 1099.51 GFLOPs
* Theoretical minimum data to load = 8192*8192*2 (matrices A and B) * 2 (bytes per half) /1024/1024/1024 = 0.25 GB
* Theoretical arithmetic intensity = 1099.5/0.25 = 4398.0 FLOP/byte
* If the M*N output is computed in 128*128 tiles:
- A tile = 128*8192, B tile = 8192*128, C tile = 128*128
- Total number of tiles = power(8192/128, 2) = 4096
- FLOPs per tile = 128*128*8192*2/1000000000 = 0.2684 GFLOPs; total FLOPs = 0.2684*4096 = 1099.36 GFLOPs
- Data loaded per tile = 128*8192*2*2/1024/1024/1024 = 0.0039 GB; total data = 0.0039*4096 = 15.9744 GB
- Arithmetic intensity: 1099.36/15.9744 = 68.82 FLOP/byte
- Compute-only theoretical time for FP16 on an RTX 3060 (s) = 1099.36 GFLOPs / 54190.08 GFLOPS = 0.02028 s
- Total time to load all data from DRAM on an RTX 3060 (s) = 15.9744 GB / 338.88 GB/s = 0.0471 s [> compute time]
- Total time to load all data from L2 on an RTX 3060 (s) = 15.9744 GB / 1031.04 GB/s = 0.01549 s [< compute time], so compute utilization can reach 100%
- Performance achievable when loading everything from DRAM = 1099.36 GFLOPs / MAX(0.0471 s, 0.02028 s) = 1099.36/0.0471 = 23340.97 GFLOPS
- So the actual performance should land between 23340.97 GFLOPS and 54190.08 GFLOPS
- Assuming a 70% L2 hit rate, the estimated total load time is (0.7*15.9744)/1031.04 + ((1-0.7)*15.9744)/338.88 = 0.02498 s -> 1099.36 GFLOPs / 0.02498 s -> 44009.60 GFLOPS, i.e. 81% of peak
```
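The same back-of-the-envelope estimate as a short C++ sketch; the tile size, peak throughput, bandwidths, and the 70% L2 hit rate are the RTX 3060 assumptions used above, so the numbers it prints should match the list, not the real hardware exactly.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    const double M = 8192, N = 8192, K = 8192;
    const double tile = 128;                     // 128x128 C tiles
    const double bytes_per_elem = 2;             // FP16
    const double peak_gflops = 54190.08;         // Tensor Core FP16->FP16 peak (RTX 3060, locked clocks)
    const double dram_gbps   = 338.88;
    const double l2_gbps     = 1031.04;
    const double l2_hit      = 0.7;              // assumed L2 hit rate

    const double tiles          = (M / tile) * (N / tile);
    const double flops_per_tile = tile * tile * K * 2 / 1e9;                          // GFLOPs
    const double bytes_per_tile = (tile * K + K * tile) * bytes_per_elem / (1 << 30); // GB (A tile + B tile)
    const double total_gflop    = flops_per_tile * tiles;
    const double total_gb       = bytes_per_tile * tiles;

    const double t_compute = total_gflop / peak_gflops;
    const double t_load    = l2_hit * total_gb / l2_gbps + (1 - l2_hit) * total_gb / dram_gbps;
    const double t_total   = std::max(t_compute, t_load);

    printf("arithmetic intensity : %.2f FLOP/byte\n", total_gflop / total_gb);
    printf("estimated GFLOPS     : %.2f (%.1f%% of peak)\n",
           total_gflop / t_total, 100.0 * total_gflop / t_total / peak_gflops);
    return 0;
}
```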
4. cublasGemmEx Test
```bash
tee cublas_demo.cu<<-'EOF'
#include <stdint.h>
#include "cublas_v2.h"
#include "cuda_fp16.h"
#include <cuda.h>
#include <iostream>
#define HGEMM_UNLIKELY(x) __builtin_expect(!!(x), 0)
#define HGEMM_CHECK_CUBLAS_ERROR(_expr_) \
do { \
cublasStatus_t _ret_ = _expr_; \
if (HGEMM_UNLIKELY(_ret_ != CUBLAS_STATUS_SUCCESS)) { \
size_t _rt_version_ = cublasGetCudartVersion(); \
printf("CUBLAS API error = %04d, runtime version: %zu", static_cast<int>(_ret_), _rt_version_); \
exit(EXIT_FAILURE); \
} \
} while (0)
cublasHandle_t getCublasTensorOpHandle() {
cublasHandle_t handle = nullptr;
HGEMM_CHECK_CUBLAS_ERROR(cublasCreate(&handle));
HGEMM_CHECK_CUBLAS_ERROR(cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH));
return handle;
}
void cublasTensorOp(half *A, half *B, half *C, size_t M, size_t N, size_t K) {
static cublasHandle_t handle = getCublasTensorOpHandle();
static half alpha = 1.0;
static half beta = 0.0;
// cuBLAS is column-major. With (CUBLAS_OP_T, CUBLAS_OP_N, N, M, K) and leading
// dimensions (K, K, N), both input operands are read K-contiguously and the
// N x M column-major result occupies the same memory as the M x N row-major C,
// which is why cuBLAS selects a "*_tn" kernel. The inputs stay uninitialized in
// this demo, so only kernel selection and timing matter, not numerical results.
HGEMM_CHECK_CUBLAS_ERROR(cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, N, M, K, &alpha, B, CUDA_R_16F, K, A,
CUDA_R_16F, K, &beta, C, CUDA_R_16F, N, CUBLAS_COMPUTE_16F,
CUBLAS_GEMM_DEFAULT));
}
#define CUDA_CHECK(status) \
{ \
cudaError_t error = status; \
if (error != cudaSuccess) { \
std::cerr << "Got bad cuda status: " << cudaGetErrorString(error) \
<< " at line: " << __LINE__ << std::endl; \
exit(EXIT_FAILURE); \
} \
}
struct GpuTimer
{
cudaStream_t _stream_id;
cudaEvent_t _start;
cudaEvent_t _stop;
GpuTimer() : _stream_id(0)
{
CUDA_CHECK(cudaEventCreate(&_start));
CUDA_CHECK(cudaEventCreate(&_stop));
}
~GpuTimer()
{
CUDA_CHECK(cudaEventDestroy(_start));
CUDA_CHECK(cudaEventDestroy(_stop));
}
void start(cudaStream_t stream_id = 0)
{
_stream_id = stream_id;
CUDA_CHECK(cudaEventRecord(_start, _stream_id));
}
void stop()
{
CUDA_CHECK(cudaEventRecord(_stop, _stream_id));
}
float elapsed_millis()
{
float elapsed = 0.0;
CUDA_CHECK(cudaEventSynchronize(_stop));
CUDA_CHECK(cudaEventElapsedTime(&elapsed, _start, _stop));
return elapsed;
}
};
#define MATRIX_M 8192
#define MATRIX_N 8192
#define MATRIX_K 8192
int main()
{
// Matrix A
half *a_host, *a_device;
CUDA_CHECK(cudaMallocHost(&a_host, MATRIX_M * MATRIX_K * sizeof(half)));
CUDA_CHECK(cudaMalloc(&a_device, MATRIX_M * MATRIX_K * sizeof(half)));
// Matrix B
half *b_host, *b_device;
CUDA_CHECK(cudaMallocHost(&b_host, MATRIX_K * MATRIX_N * sizeof(half)));
CUDA_CHECK(cudaMalloc(&b_device, MATRIX_K * MATRIX_N * sizeof(half)));
// Matrix C
half *c_host, *c_device;
CUDA_CHECK(cudaMallocHost(&c_host, MATRIX_M * MATRIX_N * sizeof(half)));
CUDA_CHECK(cudaMalloc(&c_device, MATRIX_M * MATRIX_N * sizeof(half)));
// Warm-up call: creates the cuBLAS handle and lets the heuristics pick a kernel before timing
cublasTensorOp(a_device, b_device, c_device, MATRIX_M, MATRIX_N,MATRIX_K);
CUDA_CHECK(cudaDeviceSynchronize());
GpuTimer timer;
int iterations=2;
timer.start();
for(int i=0;i<iterations;i++)
{
cublasTensorOp(a_device, b_device, c_device, MATRIX_M, MATRIX_N,MATRIX_K);
}
CUDA_CHECK(cudaDeviceSynchronize());
timer.stop();
float elapsed_ms = timer.elapsed_millis();
std::cout << " Avg runtime: " << elapsed_ms/double(iterations) << " ms" << std::endl;
double avg_runtime_s = double(elapsed_ms) / double(iterations) / 1000.0; // average seconds per GEMM
double gflops=(2.0*MATRIX_M*MATRIX_N*MATRIX_K)/ double(1.0e9) / avg_runtime_s;
std::cout << " GFLOPs: " << gflops << std::endl;
return 0;
}
EOF
# Compile
/usr/local/cuda/bin/nvcc -std=c++17 -o cublas_demo cublas_demo.cu -lcublas
# Run
./cublas_demo
# Output: GFLOPs: 50638.6
# Check the kernel name
/usr/local/cuda/bin/nsys profile --stats=true -t cuda,nvtx ./cublas_demo
# Output: ampere_h1688gemm_128x128_ldg8_stages_32x1_tn
# Inspect the SASS instructions
/usr/local/cuda/bin/cuda-gdb --args ./cublas_demo
b ampere_h1688gemm_128x128_ldg8_stages_32x1_tn
r
disas
# Output: 0x00007fffa3255c00 in ampere_h1688gemm_128x128_ldg8_stages_32x1_tn<<<(64,64,1),(128,1,1)>>> ()
# Collect detailed metrics with NCU
/usr/local/NVIDIA-Nsight-Compute/ncu --set full --section SpeedOfLight_HierarchicalTensorRooflineChart --target-processes all --clock-control=none \
--print-details all --export cublas_demo_report -f ./cublas_demo
# Collect a subset of metrics with NCU
/usr/local/NVIDIA-Nsight-Compute/ncu --clock-control=none --metrics \
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.min.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.per_second,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.per_second,\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed,\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active,\
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed,\
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,\
sm__cycles_elapsed.avg.per_second,\
sm__cycles_elapsed ./cublas_demo
# Other useful commands
/usr/local/cuda/bin/cuobjdump /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 -ltext | grep gemm | /usr/local/cuda/bin/cu++filt
/usr/local/cuda/bin/cuobjdump /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 -ltext | grep "gemm" | grep "sm_80" | /usr/local/cuda/bin/cu++filt | grep "__half, __half, __half"
/usr/local/cuda/bin/cuobjdump --dump-sass --gpu-architecture sm_80 /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 | /usr/local/cuda/bin/cu++filt > template.txt
```
Output:
```text
------------------------------------------------------------------------------------ ----------- -----------------
Metric Name Metric Unit Metric Value
------------------------------------------------------------------------------------ ----------- -----------------
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed % 77.00
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed % 48.84
sm__cycles_elapsed.avg cycle 40,803,640.79
sm__cycles_elapsed.max cycle 40,805,689
sm__cycles_elapsed.min cycle 40,801,923
sm__cycles_elapsed.sum cycle 1,142,501,942
sm__cycles_elapsed.avg.per_second Ghz 1.92 # SM clock frequency
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.pct_of_peak_sustained_elapsed % 94.44
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.per_second 1/ns 1,857.40
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.min.pct_of_peak_sustained_elapsed % 93.80
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum 1,099,511,627,776 # op count = 8192*8192*8192*2, matching the theoretical HMMA op count
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.pct_of_peak_sustained_elapsed % 93.98 # utilization
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained 1/cycle 28,672 # peak throughput = peak_sustained * SM clock = 28672 * 1.92 = 55050.24 GFLOPS
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.per_second 1/ns 51,754.47 # measured throughput (GFLOPS)
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active % 93.64
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed % 93.98
------------------------------------------------------------------------------------ ----------- -----------------
```
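As a cross-check of the figures reported in this run: the measured Tensor Core rate of 51,754.47 GFLOPS divided by the peak of 28672 * 1.92 = 55,050.24 GFLOPS gives roughly 94%, consistent with the reported pct_of_peak_sustained_elapsed of 93.98%; the end-to-end rate measured by cublas_demo was 50,638.6 GFLOPs, about 92% of this peak. Both are above the ~81% roofline estimate from Section 3, which suggests the cuBLAS kernel achieves better data reuse than the 70% L2 hit rate assumed there.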