Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA

Summary

This chapter explained how to launch a kernel with multiple blocks, each containing multiple threads, and showed how to choose these two launch parameters when the total number of threads is large. It also explained the hierarchical memory architecture available to CUDA programs: memory closest to the executing thread is the fastest, and memory gets progressively slower the farther away it is. When multiple threads need to communicate with one another, CUDA provides shared memory, through which threads within the same block can exchange data. When multiple threads access the same memory location, their accesses must be synchronized; otherwise, the final result will not be as expected. We also saw how atomic operations accomplish this synchronization. If some parameters remain constant throughout a kernel's execution, they can be stored in constant memory to speed up access. When a CUDA program exhibits a particular memory access pattern, such as spatial locality, texture memory should be used to improve performance. To summarize, improving the performance of a CUDA program means reducing traffic to the slow memories; done efficiently, this can yield a drastic improvement in performance.
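As a minimal sketch of how these pieces fit together (the kernel name scaledDot, the block size of 256, and the scale factor are illustrative choices, not taken from the chapter), the following program computes a scaled dot product: each block reduces partial products in fast shared memory, synchronizes with __syncthreads(), reads a fixed scale factor from constant memory, and commits its block total to the global result with atomicAdd.

#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024
#define BLOCK_SIZE 256

// A value that stays fixed for the whole kernel launch lives in constant memory.
__constant__ float d_scale;

// Each block accumulates partial products in shared memory, synchronizes with
// __syncthreads(), then commits a single atomicAdd to the global result.
__global__ void scaledDot(const float *a, const float *b, float *result)
{
    __shared__ float partial[BLOCK_SIZE];      // visible to all threads in this block

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (tid < N) ? d_scale * a[tid] * b[tid] : 0.0f;
    __syncthreads();                           // all products written before reducing

    // Tree reduction in shared memory; each step halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(result, partial[0]);         // avoids a race between blocks
}

int main(void)
{
    float h_a[N], h_b[N], h_result = 0.0f, scale = 0.5f;
    for (int i = 0; i < N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_result;
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, &h_result, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_scale, &scale, sizeof(float));

    scaledDot<<<(N + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_a, d_b, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    printf("Result: %f (expected %f)\n", h_result, scale * 2.0f * N);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_result);
    return 0;
}

Note that each block performs only one atomicAdd instead of one per thread; keeping most of the accumulation in shared memory and touching global memory once per block reflects the chapter's advice to reduce traffic to the slow memories.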

The next chapter discusses the concept of CUDA streams, which is similar to multitasking in CPU programs, along with how to measure the performance of CUDA programs. It also demonstrates the use of CUDA in simple image processing applications.