【CUDA】Prefix sum (scan) with thrust

In the previous article we saw that, when using the CUDA-provided API (thrust) for the prefix-sum scan, the first run was slower than the hand-written shared-memory version; the guess was that global memory is involved.

First, let's look at the call:

cpp
thrust::inclusive_scan(thrust::device, d_x, d_x + N, d_x);
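
For context, here is a minimal, self-contained sketch of how such a call is typically driven from the host. The array size, the fill value, and the result check are illustrative assumptions, not taken from the benchmark in the previous article.

cpp
#include <thrust/scan.h>
#include <thrust/fill.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  const int N = 1 << 20;                    // illustrative size only
  float* d_x = nullptr;
  cudaMalloc(&d_x, N * sizeof(float));
  thrust::fill(thrust::device, d_x, d_x + N, 1.0f);

  // In-place inclusive prefix sum on the CUDA backend.
  thrust::inclusive_scan(thrust::device, d_x, d_x + N, d_x);

  float last = 0.0f;
  cudaMemcpy(&last, d_x + N - 1, sizeof(float), cudaMemcpyDeviceToHost);
  std::printf("last element = %f (expected %d)\n", last, N);

  cudaFree(d_x);
  return 0;
}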

The first argument specifies the execution policy (here, the device backend). Based on the number and types of the arguments, overload resolution selects the following function in scan.h:

cpp
template <typename DerivedPolicy, typename InputIterator, typename OutputIterator>
_CCCL_HOST_DEVICE OutputIterator inclusive_scan(
  const thrust::detail::execution_policy_base<DerivedPolicy>& exec,
  InputIterator first,
  InputIterator last,
  OutputIterator result);
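
As a hedged aside (not part of the original trace): the exec argument is only a tag that selects the backend, so the same entry point routes to the host system when a host policy is passed. For example:

cpp
#include <thrust/scan.h>
#include <thrust/execution_policy.h>
#include <vector>

int main()
{
  std::vector<int> x{1, 2, 3, 4};
  // thrust::host routes this call to the host system's overload instead of the
  // CUDA one discussed below; the algorithm name and arguments stay the same.
  thrust::inclusive_scan(thrust::host, x.begin(), x.end(), x.begin());
  // x now holds {1, 3, 6, 10}.
  return 0;
}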

The CUDA-backend implementation of this overload lives in thrust/thrust/system/cuda/detail/scan.h.
Note: the path may differ on your installation; you can run find . -name xx under /usr/local/ to locate the corresponding file.

cpp
template <typename Derived, typename InputIt, typename OutputIt>
_CCCL_HOST_DEVICE OutputIt
inclusive_scan(thrust::cuda_cub::execution_policy<Derived>& policy, InputIt first, InputIt last, OutputIt result)
{
  return thrust::cuda_cub::inclusive_scan(policy, first, last, result, thrust::plus<>{});
}

This forwards the call with the binary operation fixed to thrust::plus<>{}, i.e. ordinary addition (other associative operators can also be passed explicitly; see the sketch below).
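
Aside, not from the original call path: since the operator is just a parameter, the public inclusive_scan overload that takes a binary operator lets you scan with other associative operations. A small sketch computing a running maximum:

cpp
#include <thrust/scan.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>

int main()
{
  int h[] = {3, 1, 4, 1, 5};
  thrust::device_vector<int> v(h, h + 5);
  // Pass the operator explicitly: running maximum instead of running sum.
  thrust::inclusive_scan(v.begin(), v.end(), v.begin(), thrust::maximum<int>{});
  // v now holds {3, 3, 4, 4, 5}.
  return 0;
}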

Execution then continues in this overload in the same file:

cpp
template <typename Derived, typename InputIt, typename OutputIt, typename ScanOp>
_CCCL_HOST_DEVICE OutputIt inclusive_scan(
  thrust::cuda_cub::execution_policy<Derived>& policy, InputIt first, InputIt last, OutputIt result, ScanOp scan_op)
{
  using diff_t           = typename thrust::iterator_traits<InputIt>::difference_type;
  diff_t const num_items = thrust::distance(first, last);
  return thrust::cuda_cub::inclusive_scan_n(policy, first, num_items, result, scan_op);
}

Following inclusive_scan_n one level further down, we reach the main execution logic in inclusive_scan_n_impl. (The THRUST_INDEX_TYPE_DISPATCH2 macro used below selects the 32-bit or 64-bit Dispatch instantiation depending on whether num_items fits in 32 bits, and exposes the appropriately typed count under the name num_items_fixed.)

cpp
_CCCL_EXEC_CHECK_DISABLE
template <typename Derived, typename InputIt, typename Size, typename OutputIt, typename ScanOp>
_CCCL_HOST_DEVICE OutputIt inclusive_scan_n_impl(
  thrust::cuda_cub::execution_policy<Derived>& policy, InputIt first, Size num_items, OutputIt result, ScanOp scan_op)
{
  using AccumT     = typename thrust::iterator_traits<InputIt>::value_type;
  using Dispatch32 = cub::DispatchScan<InputIt, OutputIt, ScanOp, cub::NullType, std::int32_t, AccumT>;
  using Dispatch64 = cub::DispatchScan<InputIt, OutputIt, ScanOp, cub::NullType, std::int64_t, AccumT>;

  cudaStream_t stream = thrust::cuda_cub::stream(policy);
  cudaError_t status;

  // Determine temporary storage requirements:
  size_t tmp_size = 0;
  {
    THRUST_INDEX_TYPE_DISPATCH2(
      status,
      Dispatch32::Dispatch,
      Dispatch64::Dispatch,
      num_items,
      (nullptr, tmp_size, first, result, scan_op, cub::NullType{}, num_items_fixed, stream));
    thrust::cuda_cub::throw_on_error(
      status,
      "after determining tmp storage "
      "requirements for inclusive_scan");
  }

  // Run scan:
  {
    // Allocate temporary storage:
    thrust::detail::temporary_array<std::uint8_t, Derived> tmp{policy, tmp_size};
    THRUST_INDEX_TYPE_DISPATCH2(
      status,
      Dispatch32::Dispatch,
      Dispatch64::Dispatch,
      num_items,
      (tmp.data().get(), tmp_size, first, result, scan_op, cub::NullType{}, num_items_fixed, stream));
    thrust::cuda_cub::throw_on_error(status, "after dispatching inclusive_scan kernel");
    thrust::cuda_cub::throw_on_error(
      thrust::cuda_cub::synchronize_optional(policy), "inclusive_scan failed to synchronize");
  }

  return result + num_items;
}

As we can see, thrust ultimately hands the work to cub's DispatchScan, and the cub device-wide scan keeps its intermediate results in the temporary storage allocated above, i.e. in global memory. That is presumably why its time did not beat the hand-written algorithm that uses shared memory.
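
For comparison, the same two-phase pattern (first query the temporary-storage size with a null pointer, then allocate that scratch space in global memory and launch) shows up when calling cub directly. The following is a hedged sketch using cub::DeviceScan::InclusiveSum; the size N and the fill step are illustrative, and it is equivalent in spirit to, not copied from, what the thrust path above ends up doing:

cpp
#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>

int main()
{
  const int N = 1 << 20;                 // illustrative size only
  int *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, N * sizeof(int));
  cudaMalloc(&d_out, N * sizeof(int));
  cudaMemset(d_in, 0, N * sizeof(int));  // placeholder fill

  // Phase 1: a null temp pointer asks the dispatch layer how many bytes it needs.
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, N);

  // Phase 2: allocate the scratch space (global memory) and run the scan.
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, N);
  cudaDeviceSynchronize();

  cudaFree(d_temp);
  cudaFree(d_out);
  cudaFree(d_in);
  return 0;
}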
