CUDA-GDB (11): Debugging Walkthrough Examples

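[11.1. Example: bitreverse](#11.1. Example: bitreverse)

[11.1.1. Walking Through the Code](#11.1.1. Walking Through the Code)

[11.2. Example: autostep](#11.2. Example: autostep)

[11.2.1. Debugging with Autosteps](#11.2.1. Debugging with Autosteps)

[11.3. Example: MPI CUDA Application](#11.3. Example: MPI CUDA Application)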

This post walks through three CUDA-GDB debugging examples:

▶ Example: bitreverse
▶ Example: autostep
▶ Example: MPI CUDA Application

11.1. Example: bitreverse

This section walks through CUDA-GDB by debugging a sample application called bitreverse, which reverses the bits of each 8-bit value in a data set.

Source Code

```
1  #include <stdio.h>
2  #include <stdlib.h>
3
4  // Simple 8-bit bit reversal Compute test
5
6  #define N 256
7
8  __global__ void bitreverse(void *data) {
9     unsigned int *idata = (unsigned int*)data;
10    extern __shared__ int array[];
11
12    array[threadIdx.x] = idata[threadIdx.x];
13
14    array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |
15                         ((0x0f0f0f0f & array[threadIdx.x]) << 4);
16    array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) |
17                         ((0x33333333 & array[threadIdx.x]) << 2);
18    array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) |
19                         ((0x55555555 & array[threadIdx.x]) << 1);
20
21    idata[threadIdx.x] = array[threadIdx.x];
22 }
23
24 int main(void) {
25     void *d = NULL; int i;
26     unsigned int idata[N], odata[N];
27
28     for (i = 0; i < N; i++)
29         idata[i] = (unsigned int)i;
30
31     cudaMalloc((void**)&d, sizeof(int)*N);
32     cudaMemcpy(d, idata, sizeof(int)*N,
33                cudaMemcpyHostToDevice);
34
35     bitreverse<<<1, N, N*sizeof(int)>>>(d);
36
37     cudaMemcpy(odata, d, sizeof(int)*N,
38                cudaMemcpyDeviceToHost);
39
40     for (i = 0; i < N; i++)
41         printf("%u -> %u\n", idata[i], odata[i]);
42
43     cudaFree((void*)d);
44     return 0;
45 }
```

11.1.1. Walking Through the Code

First, compile the bitreverse.cu CUDA application for debugging by entering the following command at a shell prompt:
```
$ nvcc -g -G bitreverse.cu -o bitreverse
```
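This command assumes that the source file is named bitreverse.cu and that no additional compiler flags are required for compilation. See also Debug Compilation.

Start the CUDA debugger by entering the following command at a shell prompt: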
```
$ cuda-gdb bitreverse
```

Set breakpoints. Here we set both a host breakpoint (main) and a GPU breakpoint (bitreverse). We also set a breakpoint at a particular line in the device function (bitreverse.cu:21).

```
(cuda-gdb) break main
Breakpoint 1 at 0x18e1: file bitreverse.cu, line 25.
(cuda-gdb) break bitreverse
Breakpoint 2 at 0x18a1: file bitreverse.cu, line 8.
(cuda-gdb) break 21
Breakpoint 3 at 0x18ac: file bitreverse.cu, line 21.
```

Run the CUDA application. It executes until it reaches the first breakpoint (main) set in the previous step.

```
(cuda-gdb) run
Starting program: /Users/CUDA_User1/docs/bitreverse
Reading symbols for shared libraries
..++........................................................... done

Breakpoint 1, main () at bitreverse.cu:25
25  void *d = NULL; int i;
```

At this point, commands can be entered to advance execution or to print the program state. For this walkthrough, we continue until the device kernel is launched.

```
(cuda-gdb) continue
Continuing.
Reading symbols for shared libraries .. done
Reading symbols for shared libraries .. done
[Context Create of context 0x80f200 on Device 0]
[Launch of CUDA Kernel 0 (bitreverse<<<(1,1,1),(256,1,1)>>>) on Device 0]
Breakpoint 3 at 0x8667b8: file bitreverse.cu, line 21.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Breakpoint 2, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x110000) at bitreverse.cu:9
9   unsigned int *idata = (unsigned int*)data;
```

CUDA-GDB has detected that a CUDA device kernel has been reached. The debugger prints the CUDA thread that currently has focus.

Use the info cuda threads command to check the CUDA thread of focus, and switch between the host thread and the CUDA threads:

```
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx ThreadIdx Count         Virtual PC      Filename  Line

Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0) (255,0,0)   256 0x0000000000866400 bitreverse.cu     9
(cuda-gdb) thread
[Current thread is 1 (process 16738)]
(cuda-gdb) thread 1
[Switching to thread 1 (process 16738)]
#0  0x000019d5 in main () at bitreverse.cu:34
34    bitreverse<<<1, N, N*sizeof(int)>>>(d);
(cuda-gdb) backtrace
#0  0x000019d5 in main () at bitreverse.cu:34
(cuda-gdb) info cuda kernels
Kernel Dev Grid   SMs Mask GridDim  BlockDim        Name Args
     0   0    1 0x00000001 (1,1,1) (256,1,1) bitreverse data=0x110000
(cuda-gdb) cuda kernel 0
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
9    unsigned int *idata = (unsigned int*)data;
(cuda-gdb) backtrace
#0   bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x110000) at bitreverse.cu:9
```

Corroborate this information by printing the block and thread indexes:

```
(cuda-gdb) print blockIdx
$1 = {x = 0, y = 0}
(cuda-gdb) print threadIdx
$2 = {x = 0, y = 0, z = 0}
```

The grid and block dimensions can also be printed:

```
(cuda-gdb) print gridDim
$3 = {x = 1, y = 1}
(cuda-gdb) print blockDim
$4 = {x = 256, y = 1, z = 1}
```

Advance kernel execution and verify some data:

```
(cuda-gdb) next
12       array[threadIdx.x] = idata[threadIdx.x];
(cuda-gdb) next
14       array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |
(cuda-gdb) next
16       array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) |
(cuda-gdb) next
18       array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) |
(cuda-gdb) next

Breakpoint 3, bitreverse <<<(1,1),(256,1,1)>>> (data=0x100000) at bitreverse.cu:21
21             idata[threadIdx.x] = array[threadIdx.x];
(cuda-gdb) print array[0]@12
$7 = {0, 128, 64, 192, 32, 160, 96, 224, 16, 144, 80, 208}
(cuda-gdb) print/x array[0]@12
$8 = {0x0, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0, 0x10, 0x90, 0x50, 0xd0}

(cuda-gdb) print &data
$9 = (@global void * @parameter *) 0x10
(cuda-gdb) print *(@global void * @parameter *) 0x10
$10 = (@global void * @parameter) 0x100000
```

The resulting output depends on the current contents of the memory location.
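The values printed for array[0]@12 can be cross-checked on the host. The following is a minimal sketch in plain C, assuming the same three mask-and-shift steps as lines 14 to 19 of the kernel. The helper name reverse8 is ours and does not appear in the sample. It uses unsigned arithmetic, which gives the same results as the kernel's int array for the inputs 0 to 255 used here.

```c
#include <stdint.h>

/* Host-only reference for the kernel's three mask-and-shift steps.
   reverse8 is a hypothetical helper name, not part of the sample.
   Each byte of v is bit-reversed independently; the sample only
   feeds values 0..255, so only the low byte matters here. */
uint32_t reverse8(uint32_t v) {
    v = ((0xf0f0f0f0u & v) >> 4) | ((0x0f0f0f0fu & v) << 4); /* swap nibbles       */
    v = ((0xccccccccu & v) >> 2) | ((0x33333333u & v) << 2); /* swap bit pairs     */
    v = ((0xaaaaaaaau & v) >> 1) | ((0x55555555u & v) << 1); /* swap adjacent bits */
    return v;
}
```

For inputs 0 through 11 this yields 0, 128, 64, 192, 32, 160, 96, 224, 16, 144, 80, 208, which matches the debugger output above. reverse8(170) gives 85, since 0xAA becomes 0x55.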

Since thread (0,0,0) reverses the value 0, switch to a different thread to show more interesting data:

```
(cuda-gdb) cuda thread 170
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (170,0,0), device 0, sm 0, warp 5, lane 10]
```

Delete the breakpoints and continue the program to completion:

```
(cuda-gdb) delete breakpoints
Delete all breakpoints? (y or n) y
(cuda-gdb) continue
Continuing.

Program exited normally.
(cuda-gdb)
```

11.2. Example: autostep

This section shows how to use the autostep command and demonstrates how it helps increase the precision of memory error reporting.

Source Code

```
1  #define NUM_BLOCKS 8
2  #define THREADS_PER_BLOCK 64
3
4  __global__ void example(int **data) {
5    int value1, value2, value3, value4, value5;
6    int idx1, idx2, idx3;
7
8    idx1 = blockIdx.x * blockDim.x;
9    idx2 = threadIdx.x;
10   idx3 = idx1 + idx2;
11   value1 = *(data[idx1]);
12   value2 = *(data[idx2]);
13   value3 = value1 + value2;
14   value4 = value1 * value2;
15   value5 = value3 + value4;
16   *(data[idx3]) = value5;
17   *(data[idx1]) = value3;
18   *(data[idx2]) = value4;
19   idx1 = idx2 = idx3 = 0;
20 }
21
22 int main(int argc, char *argv[]) {
23   int *host_data[NUM_BLOCKS * THREADS_PER_BLOCK];
24   int **dev_data;
25   const int zero = 0;
26
27   /* Allocate an integer for each thread in each block */
28   for (int block = 0; block < NUM_BLOCKS; block++) {
29     for (int thread = 0; thread < THREADS_PER_BLOCK; thread++) {
30       int idx = thread + block * THREADS_PER_BLOCK;
31       cudaMalloc(&host_data[idx], sizeof(int));
32       cudaMemcpy(host_data[idx], &zero, sizeof(int),
33                  cudaMemcpyHostToDevice);
34     }
35   }
36
37   /* This inserts an error into block 3, thread 39*/
38   host_data[3*THREADS_PER_BLOCK  + 39] = NULL;
39
40   /* Copy the array of pointers to the device */
41   cudaMalloc((void**)&dev_data,  sizeof(host_data));
42   cudaMemcpy(dev_data, host_data, sizeof(host_data), cudaMemcpyHostToDevice);
43
44   /* Execute example */
45   example <<< NUM_BLOCKS, THREADS_PER_BLOCK >>> (dev_data);
46   cudaThreadSynchronize();
47 }
```

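In this small example, we have an array of pointers to integers, and we want to perform some operations on those integers. Suppose, however, that one of the pointers is NULL, as set on line 38. Accessing the integer that corresponds to block 3, thread 39 will then raise CUDA_EXCEPTION_10 "Device Illegal Address". The exception should occur at line 16, where the kernel tries to write to that value.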

11.2.1. Debugging with Autosteps

Compile the example and start CUDA-GDB as usual. We begin by running the program:

```
(cuda-gdb) run
Starting program: /home/jitud/cudagdb_test/autostep_ex/example
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5688700 (LWP 9083)]
[Context Create of context 0x617270 on Device 0]
[Launch of CUDA Kernel 0 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (1,0,0), thread (0,0,0), device 0, sm 1, warp 0, lane 0]
0x0000000000796f60 in example (data=0x200300000) at example.cu:17
17        *(data[idx1]) = value3;
```

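As expected, we received a CUDA_EXCEPTION_10. However, the reported thread is block 1, thread 0, and the reported line is 17. Because CUDA_EXCEPTION_10 is a global error, no thread information is reported with it, so we would have to inspect all 512 threads manually.

Set autosteps. To get more precise information, we reason that because CUDA_EXCEPTION_10 is a memory access error, it must occur in code that accesses memory. That happens on lines 11, 12, 16, 17, and 18, so we set two autostep windows covering those lines: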
```
(cuda-gdb) autostep 11 for 2 lines
Breakpoint 1 at 0x796d18: file example.cu, line 11.
Created autostep of length 2 lines
(cuda-gdb) autostep 16 for 3 lines
Breakpoint 2 at 0x796e90: file example.cu, line 16.
Created autostep of length 3 lines
```

Finally, we run the program again with these autosteps in place:

```
(cuda-gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
[Termination of CUDA Kernel 0 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]
Starting program: /home/jitud/cudagdb_test/autostep_ex/example
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5688700 (LWP 9089)]
[Context Create of context 0x617270 on Device 0]
[Launch of CUDA Kernel 1 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]
[Switching focus to CUDA kernel 1, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Current focus set to CUDA kernel 1, grid 1, block (3,0,0), thread (32,0,0), device 0, sm 1, warp 3, lane 0]
Autostep precisely caught exception at example.cu:16 (0x796e90)
```

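This time we correctly caught the exception at line 16. Although CUDA_EXCEPTION_10 is a global error, we have now narrowed it down to a warp error, so we know that the thread that raised the exception must be in the same warp as block 3, thread 32.

In this example, setting two autosteps and re-running the program was enough to narrow the scope of the error from 512 threads down to 32.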
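The narrowing can also be reasoned out by replaying the kernel's index arithmetic on the host. The sketch below is plain C under stated assumptions: a warp size of 32, threads numbered consecutively within a block, and hypothetical names (struct fault, find_faults) that are not part of the sample. It reports which thread would dereference the NULL slot, on which source line, and which warp of the block contains that thread.

```c
#include <stddef.h>

#define NUM_BLOCKS 8
#define THREADS_PER_BLOCK 64
#define WARP_SIZE 32                              /* assumption: 32 threads per warp */
#define BAD_INDEX (3 * THREADS_PER_BLOCK + 39)    /* slot set to NULL on line 38     */

/* Hypothetical record describing one faulting access. */
struct fault {
    int block;                /* blockIdx.x of the faulting thread           */
    int thread;               /* threadIdx.x of the faulting thread          */
    int warp;                 /* warp position within the block              */
    int first_thread_in_warp; /* threadIdx.x of the first thread in the warp */
    int line;                 /* first kernel source line that would fault   */
};

/* Replays lines 8-10 of the kernel for every (block, thread) pair and
   records each pair whose access would go through the NULL slot.
   Returns the number of faults found; at most max entries are stored. */
size_t find_faults(struct fault *out, size_t max) {
    size_t n = 0;
    for (int block = 0; block < NUM_BLOCKS; block++) {
        for (int thread = 0; thread < THREADS_PER_BLOCK; thread++) {
            int idx1 = block * THREADS_PER_BLOCK;  /* line 8  */
            int idx2 = thread;                     /* line 9  */
            int idx3 = idx1 + idx2;                /* line 10 */
            int line = 0;

            /* Memory accesses in source order: data[idx1] is first used on
               line 11, data[idx2] on line 12, data[idx3] on line 16. */
            if (idx1 == BAD_INDEX)      line = 11;
            else if (idx2 == BAD_INDEX) line = 12;
            else if (idx3 == BAD_INDEX) line = 16;
            if (line == 0)
                continue;

            if (n < max) {
                out[n].block = block;
                out[n].thread = thread;
                out[n].warp = thread / WARP_SIZE;
                out[n].first_thread_in_warp = (thread / WARP_SIZE) * WARP_SIZE;
                out[n].line = line;
            }
            n++;
        }
    }
    return n;
}
```

With the constants from the example, it finds a single faulting access: block 3, thread 39, at line 16, in the warp whose first thread is 32. That is consistent with the focus that autostep reported. The warp field here is a position within the block and need not match the hardware warp number shown in the debugger output.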

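11.3. Example: MPI CUDA Application

For debugging large-scale MPI CUDA applications, NVIDIA recommends the parallel debuggers supplied by its partners Allinea and TotalView. Both are excellent parallel debuggers with extended support for CUDA. For debugging smaller applications, or for debugging just a few processes in a large application, CUDA-GDB can be used.

If the cluster nodes have xterm support, launch CUDA-GDB the same way you would launch gdb with your job launcher. For example:

```
$ mpirun -np 4 -host nv1,nv2 xterm -e cuda-gdb a.out
```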

You may have to export the DISPLAY variable so that the xterm finds its way back to your display. For example:

```
$ mpirun -np 4 -host nv1,nv2 -x DISPLAY=host.nvidia.com:0 xterm -e cuda-gdb a.out
```

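Job launchers have different ways of exporting environment variables to the cluster nodes. Consult your job launcher documentation for details.

When your cluster environment does not support xterm, you can insert a spin loop in your program, ssh to the compute node(s), and attach to the MPI processes. Somewhere near the start of your program, add a code snippet similar to the following:

```
{
    /* Requires <stdio.h> and <unistd.h>. */
    volatile int i = 0;   /* set to 1 from the debugger to leave the loop */
    char host[256];
    gethostname(host, sizeof(host));   /* fill in the node name before printing it */
    printf("PID %d on node %s is ready for attach\n",
            getpid(), host);
    fflush(stdout);
    while (0 == i) {
        sleep(5);
    }
}
```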

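Recompile and launch the application. After it starts, ssh to the node(s) of interest and attach to the process with CUDA-GDB. Set the variable i to 1 (for example, with set var i = 1) to break out of the loop:

```
$ mpirun -np 2 -host nv1,nv2 a.out
PID 20060 on node nv1 is ready for attach
PID 5488 on node nv2 is ready for attach
```

```
$ ssh nv1
[nv1]$ cuda-gdb --pid 20060
```

```
$ ssh nv2
[nv2]$ cuda-gdb --pid 5488
```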

For larger applications, you can make the spin loop conditional on the MPI rank, obtained with the MPI_Comm_rank function.
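One possible shape for a rank-conditional spin loop is sketched below in plain C. The helper name wait_for_attach and its parameters are hypothetical. The caller is assumed to have obtained rank from MPI_Comm_rank, so the sketch itself does not call MPI. Only the process whose rank matches target_rank announces itself and waits; all other ranks return immediately.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: parks only the process whose rank equals
   target_rank until a debugger sets *release to a nonzero value.
   rank is assumed to come from MPI_Comm_rank in the caller.
   Returns 1 if this process was the target, 0 otherwise. */
int wait_for_attach(int rank, int target_rank, volatile int *release) {
    char host[256];

    if (rank != target_rank)
        return 0;                       /* other ranks keep running */

    if (gethostname(host, sizeof(host)) != 0)
        snprintf(host, sizeof(host), "unknown");
    host[sizeof(host) - 1] = '\0';      /* guard against a truncated name */

    printf("PID %d (rank %d) on node %s is ready for attach\n",
           (int)getpid(), rank, host);
    fflush(stdout);

    /* volatile forces a fresh read each iteration, so a write made
       from the debugger is seen even in an optimized build. */
    while (*release == 0)
        sleep(5);
    return 1;
}
```

A caller might declare a volatile int flag initialized to 0 and call wait_for_attach(rank, 0, &flag) right after MPI_Comm_rank. After attaching, setting the flag to a nonzero value from the debugger releases the waiting rank.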

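For devices with compute capability below 6.0, the software preemption workaround described in Multiple Debuggers does not work with MPI applications. For those GPUs, make sure each MPI rank targets a unique GPU.

If CUDA_VISIBLE_DEVICES is set, it may cause problems with the GPU selection logic in the MPI application. It may also prevent CUDA IPC from working between GPUs on a node.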