Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA

Passing parameters by reference

Now we will see how to write the same program by passing parameters by reference. To do that, we first have to modify the kernel function for the addition of two variables. The modified kernel, which takes its parameters by reference, is shown here:

#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
//Kernel function to add two variables; parameters are passed by reference
__global__ void gpuAdd(int *d_a, int *d_b, int *d_c)
{
  *d_c = *d_a + *d_b;
}

Instead of using the integer variables d_a and d_b as inputs to the kernel, the pointers to these variables on the device, *d_a and *d_b, are taken as inputs. The result of the addition is stored at the memory location pointed to by the third integer pointer, d_c. Device memory for the pointers passed by reference to this kernel should be allocated with the cudaMalloc function. The main function for this code is shown here:

int main(void)
{
  //Defining host and device variables
  int h_a, h_b, h_c;
  int *d_a, *d_b, *d_c;
  //Initializing host variables
  h_a = 1;
  h_b = 4;
  //Allocating memory for device pointers
  cudaMalloc((void**)&d_a, sizeof(int));
  cudaMalloc((void**)&d_b, sizeof(int));
  cudaMalloc((void**)&d_c, sizeof(int));
  //Copying values of host variables to device memory
  cudaMemcpy(d_a, &h_a, sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, &h_b, sizeof(int), cudaMemcpyHostToDevice);
  //Calling kernel with one thread and one block; parameters are passed by reference
  gpuAdd <<<1, 1>>> (d_a, d_b, d_c);
  //Copying result from device memory back to host
  cudaMemcpy(&h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
  printf("Passing Parameter by Reference Output: %d + %d = %d\n", h_a, h_b, h_c);
  //Freeing device memory
  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_c);
  return 0;
}

h_a, h_b, and h_c are variables in host memory, and they are defined like normal C variables. On the other hand, d_a, d_b, and d_c are pointers that reside in host memory but point to device memory. They are allocated memory from the host by using the cudaMalloc function. The values of h_a and h_b are copied to the device memory pointed to by d_a and d_b by using the cudaMemcpy function, with the direction of data transfer set from host to device. Then, in the kernel call, these three device pointers are passed to the kernel as parameters. The kernel computes the addition and stores the result at the memory location pointed to by d_c. The result is copied back to host memory by using cudaMemcpy again, but this time with the direction of data transfer set from device to host. The output of the program is as follows:
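Passing Parameter by Reference Output: 1 + 4 = 5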

The memory used by the three device pointers is freed by using cudaFree at the end of the program. A sample memory map of the host and the device will look similar to the following:
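The following is a representative memory map reconstructed from the code above; the device-side locations are managed by CUDA, so the values shown for the device pointers are illustrative placeholders, not actual addresses:

Host memory                          Device memory
h_a = 1
h_b = 4
h_c = 5
d_a = (device address of *d_a)       *d_a = 1
d_b = (device address of *d_b)       *d_b = 4
d_c = (device address of *d_c)       *d_c = 5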

As you can see from the table, d_a, d_b, and d_c reside on the host and point to values in device memory. While passing parameters by reference to kernels, you should take care that all pointers point to device memory only. If this is not the case, the program may crash.

While using device pointers and passing them to kernels, there are some restrictions that the programmer has to follow. Device pointers that are allocated memory using cudaMalloc can only be used to read from or write to device memory. They can be passed as parameters to device functions, but they should not be used to read or write memory from host functions. To put it simply, device pointers should be used to read and write device memory from device functions, and host pointers should be used to read and write host memory from host functions. That is why, in this book, you will always find the device pointers prefixed with d_ in kernel functions.
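As a minimal sketch of this restriction, reusing the d_a pointer from the program above, dereferencing a device pointer in host code is invalid, while cudaMemcpy is the correct way to read that memory from the host:

//Assuming d_a has already been allocated with cudaMalloc, as in the program above
int h_value;
//WRONG: dereferencing a device pointer in host code is undefined behavior and may crash
//h_value = *d_a;
//CORRECT: copy the value from device memory into host memory with cudaMemcpy
cudaMemcpy(&h_value, d_a, sizeof(int), cudaMemcpyDeviceToHost);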

To summarize, in this section, concepts related to CUDA programming were explained in detail by taking a two-variable addition program as an example. After this section, you should be familiar with basic CUDA programming concepts and the terminology associated with CUDA programs. In the next section, you will learn how threads are executed on the device.