cublasGemmEx Testing and Profiling

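This article demonstrates how to use the cublasGemmEx API, how to calculate the theoretical throughput of a GEMM, and how to use Nsight Compute (NCU) to obtain the GPU's peak throughput and the measured throughput.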

1. Preparation

  • Lock the clocks
nvidia-smi -q -d SUPPORTED_CLOCKS
nvidia-smi -pm 1
nvidia-smi -q -d CLOCK
nvidia-smi -lgc 2115
nvidia-smi -lmc 7501
nvidia-smi -q -d CLOCK
  • List the Tensor Core related metrics
cat /usr/local/NVIDIA-Nsight-Compute/sections/SpeedOfLight_HierarchicalTensorRooflineChart.section | grep "sm__ops_path_tensor_" | grep "peak_sustained" | sort | uniq | awk -F: '{print $2}' | sed 's/ //g'  | sed 's/\"//g'

2. Using NCU to obtain the GPU's peak performance

A. Collect the metrics, filtering out those the current GPU does not support

tee ncu_get_gpu_peak_sustained.cu<<-'EOF'
#include <iostream>
#include <cuda_runtime.h>

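// Empty kernel: the *.peak_sustained metrics queried below describe hardware limits,
// so any kernel launch should be enough for NCU to report them.
__global__ void kernel2(float *d_in, float *d_out) {

}

int main() {
    float *d_in;
    float *d_out;
    int sm_count=28;    // number of SMs on the GPU used here (RTX 3060)
    int smsp_count=4;   // sub-partitions per SM
    int warpsize=32;
    int total_count=sm_count*smsp_count*warpsize;
    cudaMalloc((void**)&d_in, total_count * sizeof(float));
    cudaMalloc((void**)&d_out, total_count * sizeof(float));
    // One block per SM, one warp per sub-partition
    kernel2<<<sm_count, warpsize*smsp_count>>>(d_in, d_out);
    cudaDeviceSynchronize();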
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
EOF
/usr/local/cuda/bin/nvcc -std=c++17 -lineinfo ncu_get_gpu_peak_sustained.cu -o ncu_get_gpu_peak_sustained
/usr/local/NVIDIA-Nsight-Compute/ncu  --clock-control=none --metrics \
sm__sass_thread_inst_executed_op_hfma_pred_on.sum.peak_sustained,\
sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained,\
sm__inst_executed_pipe_tensor.sum.peak_sustained,\
sm__ops_path_tensor_src_int8.sum.peak_sustained,\
sm__cycles_elapsed.avg.per_second,\
dram__bytes.sum.peak_sustained,\
dram__cycles_elapsed.avg.per_second,\
lts__lts2xbar_cycles_active.sum.peak_sustained,\
lts__cycles_elapsed.avg.per_second,\
l1tex__lsu_writeback_active_mem_lg.sum.peak_sustained,\
l1tex__cycles_elapsed.avg.per_second,\
sm__ops_path_tensor_op_bgmma_src_int1.sum.peak_sustained,\
sm__ops_path_tensor_op_bmma_src_int1.sum.peak_sustained,\
sm__ops_path_tensor_op_hgmma_src_bf16_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hgmma_src_bf16_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_hgmma_src_fp16_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hgmma_src_fp16_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_hgmma_src_tf32_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hgmma_src_tf32_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_bf16_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_bf16_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp16_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_tf32_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_hmma_src_tf32_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_igmma_src_int8_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_igmma_src_int8_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_op_imma_src_int8_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_op_imma_src_int8_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_bf16_dst_fp32.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_bf16_tf32_dst_fp32.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp16.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp32.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16.sum.peak_sustained,\
sm__ops_path_tensor_src_fp64.sum.peak_sustained,\
sm__ops_path_tensor_src_fp8_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_fp8_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_int1.sum.peak_sustained,\
sm__ops_path_tensor_src_int4_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_int4_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_int4.sum.peak_sustained,\
sm__ops_path_tensor_src_int8_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_int8_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_int8.sum.peak_sustained,\
sm__ops_path_tensor_src_tf32_dst_fp32_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_tf32_dst_fp32_sparsity_on.sum.peak_sustained,\
sm__ops_path_tensor_src_tf32_dst_fp32.sum.peak_sustained ./ncu_get_gpu_peak_sustained | grep -v "n/a"

Output:

 ------------------------------------------------------------------------------ ----------- ------------
 Metric Name                                                                    Metric Unit Metric Value
 ------------------------------------------------------------------------------ ----------- ------------
 dram__bytes.sum.peak_sustained                                                  byte/cycle           48
 dram__cycles_elapsed.avg.per_second                                                    Ghz         7.06
 l1tex__cycles_elapsed.avg.per_second                                                   Ghz         1.89
 l1tex__lsu_writeback_active_mem_lg.sum.peak_sustained                                                28
 lts__cycles_elapsed.avg.per_second                                                     Ghz         1.79
 lts__lts2xbar_cycles_active.sum.peak_sustained                                                       18
 sm__cycles_elapsed.avg.per_second                                                      Ghz         1.89
 sm__inst_executed_pipe_tensor.sum.peak_sustained                                inst/cycle           28
 sm__ops_path_tensor_src_bf16_dst_fp32.sum.peak_sustained                           1/cycle       28,672
 sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_off.sum.peak_sustained              1/cycle       14,336
 sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_on.sum.peak_sustained               1/cycle       28,672
 sm__ops_path_tensor_src_fp16_dst_fp16.sum.peak_sustained                           1/cycle       57,344
 sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained              1/cycle       28,672
 sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_on.sum.peak_sustained               1/cycle       57,344
 sm__ops_path_tensor_src_fp16_dst_fp32.sum.peak_sustained                           1/cycle       28,672
 sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_off.sum.peak_sustained              1/cycle       14,336
 sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_on.sum.peak_sustained               1/cycle       28,672
 sm__ops_path_tensor_src_fp64.sum.peak_sustained                                    1/cycle        98.87
 sm__ops_path_tensor_src_int1.sum.peak_sustained                                    1/cycle      458,752
 sm__ops_path_tensor_src_int4.sum.peak_sustained                                    1/cycle      229,376
 sm__ops_path_tensor_src_int4_sparsity_off.sum.peak_sustained                       1/cycle      114,688
 sm__ops_path_tensor_src_int4_sparsity_on.sum.peak_sustained                        1/cycle      229,376
 sm__ops_path_tensor_src_int8.sum.peak_sustained                                    1/cycle      114,688
 sm__ops_path_tensor_src_int8_sparsity_off.sum.peak_sustained                       1/cycle       57,344
 sm__ops_path_tensor_src_int8_sparsity_on.sum.peak_sustained                        1/cycle      114,688
 sm__ops_path_tensor_src_tf32_dst_fp32.sum.peak_sustained                           1/cycle       14,336
 sm__ops_path_tensor_src_tf32_dst_fp32_sparsity_off.sum.peak_sustained              1/cycle        7,168
 sm__ops_path_tensor_src_tf32_dst_fp32_sparsity_on.sum.peak_sustained               1/cycle       14,336
 sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained                inst/cycle        3,584
 sm__sass_thread_inst_executed_op_hfma_pred_on.sum.peak_sustained                inst/cycle        1,792
 ------------------------------------------------------------------------------ ----------- ------------

B. Compute the theoretical hardware performance

* DRAM bandwidth = dram__bytes.sum.peak_sustained * dram__cycles_elapsed.avg.per_second = 48 (192 bit * 2 / 8) * 7.06 (GHz) = 338.88 GB/s
* L1 bandwidth = l1tex__lsu_writeback_active_mem_lg.sum.peak_sustained * l1tex__cycles_elapsed.avg.per_second * 128 (bytes) = 28 * 1.89 * 128 = 6773.76 GB/s
* L2 bandwidth = lts__lts2xbar_cycles_active.sum.peak_sustained * lts__cycles_elapsed.avg.per_second * 32 (bytes) = 18 * 1.79 * 32 = 1031.04 GB/s
* FFMA peak throughput = sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained * 2 * sm__cycles_elapsed.avg.per_second = 3584 * 2 * 1.89 = 13547.52 GFLOPS
* HFMA peak throughput = sm__sass_thread_inst_executed_op_hfma_pred_on.sum.peak_sustained * 2 * sm__cycles_elapsed.avg.per_second = 1792 * 2 * 1.89 = 6773.76 GFLOPS
* Tensor Core FP16 (FP16 accumulate) peak throughput = sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained * sm__cycles_elapsed.avg.per_second = 28672 * 1.89 = 54190.08 GFLOPS
* Tensor Core INT8 peak throughput = sm__ops_path_tensor_src_int8_sparsity_off.sum.peak_sustained * sm__cycles_elapsed.avg.per_second = 57344 * 1.89 = 108380.16 GOPS
* FFMA compute-to-DRAM-bandwidth ratio = 13547.52 GFLOPS / 338.88 GB/s = 39.98 OP/byte
* Tensor Core FP16 compute-to-DRAM-bandwidth ratio = 54190.08 GFLOPS / 338.88 GB/s = 159.91 OP/byte

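3. Computing the theoretical GEMM performance

* Theoretical work = (2*8192*8192*8192)/1000000000 = 1099.51 GFLOP
* Theoretical amount of data to load (A and B, 2 bytes per FP16 element) = 8192*8192*2*2/1024/1024/1024 = 0.25 GB
* Theoretical arithmetic intensity = 1099.5/0.25 = 4398.0 OP/byte
* If M*N is computed in 128*128 tiles:
 - A tile = 128*8192, B tile = 8192*128, C tile = 128*128
 - Total number of tiles = power(8192/128,2) = 4096
 - Work per tile = 128*128*8192*2/1000000000 = 0.2684 GFLOP; total work = 0.2684*4096 = 1099.36 GFLOP (slightly below 1099.51 because the per-tile value is rounded)
 - Data loaded per tile = 128*8192*2*2/1024/1024/1024 = 0.0039 GB; total data = 0.0039*4096 = 15.9744 GB
 - Arithmetic intensity: 1099.36/15.9744 = 68.82 OP/byte
 - Pure FP16 compute time on the RTX 3060 = 1099.36 GFLOP / 54190.08 GFLOPS = 0.02028 s
 - Total time to load the data from DRAM on the RTX 3060 = 15.9744 GB / 338.88 GB/s = 0.0471 s [> compute time]
 - Total time to load the data from L2 on the RTX 3060 = 15.9744 GB / 1031.04 GB/s = 0.01549 s [< compute time], so compute utilization can reach 100%
 - Attainable performance on the RTX 3060 when all data is loaded from DRAM = 1099.36 GFLOP / MAX(0.0471 s, 0.02028 s) = 1099.36/0.0471 = 23340.97 GFLOPS
 - The actual performance therefore lies between 23340.97 GFLOPS and 54190.08 GFLOPS
 - With an L2 cache hit rate of 70%, the estimated total load time is (0.7*15.9744)/1031.04 + ((1-0.7)*15.9744)/338.88 = 0.02498 s, which gives 1099.36 GFLOP / 0.02498 s = 44009.60 GFLOPS, or 81% of peak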

4. cublasGemmEx test

tee cublas_demo.cu<<-'EOF'
#include <stdint.h>
#include <cstdio>    // printf
#include <cstdlib>   // exit, EXIT_FAILURE
#include "cublas_v2.h"
#include "cuda_fp16.h"
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>

#define HGEMM_UNLIKELY(x) __builtin_expect(!!(x), 0)

#define HGEMM_CHECK_CUBLAS_ERROR(_expr_)                                                                  \
    do {                                                                                                  \
        cublasStatus_t _ret_ = _expr_;                                                                    \
        if (HGEMM_UNLIKELY(_ret_ != CUBLAS_STATUS_SUCCESS)) {                                             \
            size_t _rt_version_ = cublasGetCudartVersion();                                               \
            printf("CUBLAS API error = %04d, runtime version: %zu\n", static_cast<int>(_ret_), _rt_version_); \
            exit(EXIT_FAILURE);                                                                           \
        }                                                                                                 \
    } while (0)
    
cublasHandle_t getCublasTensorOpHandle() {
    cublasHandle_t handle = nullptr;
    HGEMM_CHECK_CUBLAS_ERROR(cublasCreate(&handle));
    HGEMM_CHECK_CUBLAS_ERROR(cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH));

    return handle;
}

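// Computes C = A * B in FP16, where A is M x K row-major, B is K x N column-major and
// C is M x N row-major. cuBLAS is column-major, so the call below computes
// C^T = B^T * A^T on the same buffers: B is passed first with CUBLAS_OP_T, and the
// row-major A is read as a column-major K x M matrix with CUBLAS_OP_N.
void cublasTensorOp(half *A, half *B, half *C, size_t M, size_t N, size_t K) {
    static cublasHandle_t handle = getCublasTensorOpHandle();
    static half alpha = 1.0;
    static half beta = 0.0;

    HGEMM_CHECK_CUBLAS_ERROR(cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, N, M, K, &alpha, B, CUDA_R_16F, K, A,
                                          CUDA_R_16F, K, &beta, C, CUDA_R_16F, N, CUBLAS_COMPUTE_16F,
                                          CUBLAS_GEMM_DEFAULT));
}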

#define CUDA_CHECK(status)                                              \
  {                                                                     \
    cudaError_t error = status;                                         \
    if (error != cudaSuccess) {                                         \
      std::cerr << "Got bad cuda status: " << cudaGetErrorString(error) \
                << " at line: " << __LINE__ << std::endl;               \
      exit(EXIT_FAILURE);                                               \
    }                                                                   \
  }

struct GpuTimer
{
    cudaStream_t _stream_id;
    cudaEvent_t _start;
    cudaEvent_t _stop;
    GpuTimer() : _stream_id(0)
    {
        CUDA_CHECK(cudaEventCreate(&_start));
        CUDA_CHECK(cudaEventCreate(&_stop));
    }
    ~GpuTimer()
    {
        CUDA_CHECK(cudaEventDestroy(_start));
        CUDA_CHECK(cudaEventDestroy(_stop));
    }
    void start(cudaStream_t stream_id = 0)
    {
        _stream_id = stream_id;
        CUDA_CHECK(cudaEventRecord(_start, _stream_id));
    }
    void stop()
    {
        CUDA_CHECK(cudaEventRecord(_stop, _stream_id));
    }
    float elapsed_millis()
    {
        float elapsed = 0.0;
        CUDA_CHECK(cudaEventSynchronize(_stop));
        CUDA_CHECK(cudaEventElapsedTime(&elapsed, _start, _stop));
        return elapsed;
    }
};


#define MATRIX_M 8192
#define MATRIX_N 8192
#define MATRIX_K 8192

int main()
{
    // Matrix A
    half *a_host, *a_device;
    CUDA_CHECK(cudaMallocHost(&a_host, MATRIX_M * MATRIX_K * sizeof(half)));
    CUDA_CHECK(cudaMalloc(&a_device, MATRIX_M * MATRIX_K * sizeof(half)));

    // Matrix B
    half *b_host, *b_device;
    CUDA_CHECK(cudaMallocHost(&b_host, MATRIX_K * MATRIX_N * sizeof(half)));
    CUDA_CHECK(cudaMalloc(&b_device, MATRIX_K * MATRIX_N * sizeof(half)));

    // Matrix C
    half *c_host, *c_device;
    CUDA_CHECK(cudaMallocHost(&c_host, MATRIX_M * MATRIX_N * sizeof(half)));
    CUDA_CHECK(cudaMalloc(&c_device, MATRIX_M * MATRIX_N * sizeof(half)));
    
    cublasTensorOp(a_device, b_device, c_device, MATRIX_M, MATRIX_N,MATRIX_K);    
    CUDA_CHECK(cudaDeviceSynchronize());

    GpuTimer timer;
    int iterations=2;
    timer.start();
    for(int i=0;i<iterations;i++)
    {
        cublasTensorOp(a_device, b_device, c_device, MATRIX_M, MATRIX_N,MATRIX_K);
    }
    CUDA_CHECK(cudaDeviceSynchronize());
    timer.stop();
    
    float elapsed_ms = timer.elapsed_millis();
    std::cout << "  Avg runtime: " << elapsed_ms/double(iterations) << " ms" << std::endl;
    double avg_runtime_s = double(elapsed_ms) / double(iterations)/ 1000.0;    // average time per GEMM, in seconds
    double gflops=(2.0*MATRIX_M*MATRIX_N*MATRIX_K)/ double(1.0e9) / avg_runtime_s;
    std::cout << "  GFLOPs: " << gflops << std::endl;
    return 0;
}
EOF

# Compile
/usr/local/cuda/bin/nvcc  -std=c++17 -o cublas_demo cublas_demo.cu -lcublas

# Run
./cublas_demo

# Output: GFLOPs: 50638.6

# Find the kernel name
/usr/local/cuda/bin/nsys profile --stats=true -t cuda,nvtx ./cublas_demo

# Output: ampere_h1688gemm_128x128_ldg8_stages_32x1_tn

# Inspect the SASS instructions (the b / r / disas lines are typed at the cuda-gdb prompt)
/usr/local/cuda/bin/cuda-gdb --args ./cublas_demo
b ampere_h1688gemm_128x128_ldg8_stages_32x1_tn
r
disas

# Output: 0x00007fffa3255c00 in ampere_h1688gemm_128x128_ldg8_stages_32x1_tn<<<(64,64,1),(128,1,1)>>> ()

# Collect detailed metrics with NCU
/usr/local/NVIDIA-Nsight-Compute/ncu --set full --section SpeedOfLight_HierarchicalTensorRooflineChart --target-processes all --clock-control=none \
                --print-details all --export cublas_demo_report -f ./cublas_demo

# Collect a selected subset of metrics with NCU
/usr/local/NVIDIA-Nsight-Compute/ncu --clock-control=none --metrics \
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.min.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.per_second,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.per_second,\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed,\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active,\
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed,\
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,\
sm__cycles_elapsed.avg.per_second,\
sm__cycles_elapsed ./cublas_demo

# Other commands
/usr/local/cuda/bin/cuobjdump /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12  -ltext | grep gemm | /usr/local/cuda/bin/cu++filt
/usr/local/cuda/bin/cuobjdump /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12  -ltext | grep "gemm" | grep "sm_80" | /usr/local/cuda/bin/cu++filt | grep "__half, __half, __half"
/usr/local/cuda/bin/cuobjdump --dump-sass --gpu-architecture sm_80 /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 |  /usr/local/cuda/bin/cu++filt > template.txt

Output:

------------------------------------------------------------------------------------ ----------- -----------------
Metric Name                                                                          Metric Unit      Metric Value
------------------------------------------------------------------------------------ ----------- -----------------
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed                               %             77.00
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed                                         %             48.84
sm__cycles_elapsed.avg                                                                     cycle     40,803,640.79
sm__cycles_elapsed.max                                                                     cycle        40,805,689
sm__cycles_elapsed.min                                                                     cycle        40,801,923
sm__cycles_elapsed.sum                                                                     cycle     1,142,501,942
sm__cycles_elapsed.avg.per_second                                                            Ghz              1.92  # SM clock
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.pct_of_peak_sustained_elapsed           %             94.44
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.per_second                           1/ns          1,857.40
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.min.pct_of_peak_sustained_elapsed           %             93.80
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum                                           1,099,511,627,776  # operation count = 8192*8192*8192*2, matching the theoretical HMMA operation count
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.pct_of_peak_sustained_elapsed           %             93.98  # utilization
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained                    1/cycle            28,672  # peak throughput = peak_sustained * SM clock = 28672*1.92 = 55050.24 GFLOPS
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.per_second                           1/ns         51,754.47  # measured throughput (GFLOPS)
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active                                 %             93.64
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed                                %             93.98
------------------------------------------------------------------------------------ ----------- -----------------