
Texture memory
Texture memory is another read-only memory that can accelerate a program and reduce memory bandwidth when data is read in certain access patterns. Like constant memory, it is also cached on chip. This memory was originally designed for graphics rendering, but it can also be used for general-purpose computing applications. It is very effective when an application's memory accesses exhibit a great deal of spatial locality, which means that each thread is likely to read from locations near those read by neighboring threads. This is common in image processing applications, where we work on 4-point connectivity and 8-point connectivity. A two-dimensional spatially local memory access pattern across threads may look something like this:
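To make this pattern concrete, the following is a minimal, hypothetical kernel fragment (the texture reference tex2DRef and the width and height parameters are illustrative and not part of the example that follows later in this section), in which each thread samples its own pixel and its 4-connected neighbors through a two-dimensional texture. Neighboring threads request overlapping locations, which the texture cache can serve efficiently:
// Hypothetical 2-D texture reference, used only to illustrate spatial locality
texture<float, 2, cudaReadModeElementType> tex2DRef;
__global__ void neighborhood_average(float *d_out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
    {
        // Each thread reads its own pixel and its 4-connected neighbors;
        // out-of-range coordinates are clamped by the default address mode
        float sum = tex2D(tex2DRef, x, y)
                  + tex2D(tex2DRef, x - 1, y) + tex2D(tex2DRef, x + 1, y)
                  + tex2D(tex2DRef, x, y - 1) + tex2D(tex2DRef, x, y + 1);
        d_out[y * width + x] = sum / 5.0f;
    }
}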

The general global memory cache cannot capture this spatial locality, which results in a lot of memory traffic to global memory. Texture memory is designed for this kind of access pattern: data is read from memory only once and then cached, so subsequent fetches are much faster. Texture memory supports one-, two-, and three-dimensional fetch operations. Using texture memory in a CUDA program is not trivial, especially for those who are not expert programmers. In this section, a simple example of how to copy array values using texture memory is explained. The kernel function that uses texture memory is as follows:
#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#define NUM_THREADS 10
#define N 10

// Define texture reference for 1-D fetches
// (the texture reference API shown here is deprecated in recent CUDA
// releases in favor of texture objects, but is kept for this example)
texture<float, 1, cudaReadModeElementType> textureRef;

__global__ void gpu_texture_memory(int n, float *d_out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        // Fetch the element at coordinate idx through the texture cache
        float temp = tex1D(textureRef, float(idx));
        d_out[idx] = temp;
    }
}
The portion of texture memory to be fetched is specified by a texture reference. In the code, it is defined using the texture API, which takes three template arguments. The first argument indicates the data type of the texture elements; in this example, it is float. The second argument indicates the dimensionality of the texture reference, which can be one-dimensional, two-dimensional, and so on; here, it is one-dimensional. The third argument specifies the read mode, and it is optional. Make sure that this texture reference is declared as a static global variable and that it is not passed as a parameter to any function. In the kernel function, the element at the thread index is read from this texture reference and copied to the d_out global memory pointer. Here, we are not exploiting any spatial locality, as this example is only meant to show you how to use texture memory from CUDA programs. Spatial locality will be explained in the next chapter, when we look at some image processing applications with CUDA.
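Other element types, dimensionalities, and read modes can be combined in the same way. As a hypothetical illustration (this declaration is not used in this example), a texture of 8-bit elements can be declared so that its values are converted to normalized floats in the range [0, 1] when fetched:
// Hypothetical: 8-bit elements read back as normalized floats (not used here)
texture<unsigned char, 1, cudaReadModeNormalizedFloat> texByteRef;
The main function for this example is shown as follows: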
int main()
{
    // Calculate the number of blocks to launch
    int num_blocks = N / NUM_THREADS + ((N % NUM_THREADS) ? 1 : 0);
    float *d_out;
    // Allocate space on the device for the result
    cudaMalloc((void**)&d_out, sizeof(float) * N);
    // Allocate space on the host for the result
    float *h_out = (float*)malloc(sizeof(float) * N);
    float h_in[N];
    for (int i = 0; i < N; i++)
    {
        h_in[i] = float(i);
    }
    // Define a CUDA array and copy the input data into it
    cudaArray *cu_Array;
    cudaMallocArray(&cu_Array, &textureRef.channelDesc, N, 1);
    cudaMemcpyToArray(cu_Array, 0, 0, h_in, sizeof(float) * N, cudaMemcpyHostToDevice);
    // Bind the texture reference to the CUDA array
    cudaBindTextureToArray(textureRef, cu_Array);
    gpu_texture_memory <<<num_blocks, NUM_THREADS>>> (N, d_out);
    // Copy the result back to the host
    cudaMemcpy(h_out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
    printf("Use of texture memory on GPU: \n");
    // Print the result
    for (int i = 0; i < N; i++)
    {
        printf("Value copied via texture memory at index %d is: %f\n", i, h_out[i]);
    }
    // Unbind the texture before freeing the array it is bound to
    cudaUnbindTexture(textureRef);
    cudaFreeArray(cu_Array);
    free(h_out);
    cudaFree(d_out);
    return 0;
}
In the main function, after declaring and allocating memory for the host and device arrays, the host array is initialized with values from zero to nine. This example shows the first use of CUDA arrays. They are similar to normal arrays, but they are dedicated to textures: they are opaque memory layouts optimized for texture fetches, they are read-only for kernel functions, and they can be written from the host by using the cudaMemcpyToArray function, as shown in the preceding code. The second and third arguments of that function are the width and height offsets, which are taken as 0, 0, meaning that we start copying from the top-left corner.
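Note that cudaMemcpyToArray has since been deprecated in the CUDA runtime. On newer toolkits, the equivalent copy can be written with cudaMemcpy2DToArray; a minimal sketch, assuming the same cu_Array and h_in as in the preceding code, looks like this:
// Deprecation-safe variant: copy one row of N floats into the CUDA array,
// with width and height offsets of 0 and a source pitch of the full row
cudaMemcpy2DToArray(cu_Array, 0, 0, h_in, sizeof(float) * N,
    sizeof(float) * N, 1, cudaMemcpyHostToDevice);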
The cudaBindTextureToArray function binds the texture reference to this CUDA array, so that texture fetches through the reference read from the array, starting from the top-left corner. After binding the texture reference, the kernel is called; it reads through this texture reference and writes the result to the output array in device memory. After the kernel finishes, the output array is copied back to the host for display on the console. When we are done with texture memory, the texture must be unbound, which is done by using the cudaUnbindTexture function; this should happen before the cudaFreeArray function frees the memory used by the CUDA array. The output of the program displayed on the console is shown as follows:

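As noted in the kernel listing, the texture reference API used throughout this example is deprecated in recent CUDA toolkits in favor of texture objects. The following is a minimal sketch of the same copy using the texture object API, assuming the same cu_Array, num_blocks, NUM_THREADS, N, and d_out as above:
// Kernel variant that takes a texture object instead of a global reference
__global__ void gpu_texture_object(cudaTextureObject_t texObj, int n, float *d_out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        d_out[idx] = tex1D<float>(texObj, float(idx));
    }
}

// Host side: describe the resource (our CUDA array) and the texture,
// create the object, launch, and destroy the object when done
cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeArray;
resDesc.res.array.array = cu_Array;
cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;
cudaTextureObject_t texObj = 0;
cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);
gpu_texture_object <<<num_blocks, NUM_THREADS>>> (texObj, N, d_out);
cudaDestroyTextureObject(texObj);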
This section concludes our discussion of memory architecture in CUDA. When the memories available in CUDA are used judiciously, according to the needs of your application, they can improve the performance of a program drastically. You need to look carefully at the memory access patterns of all the threads in your application and then select the memory that suits them best. The last section of this chapter briefly describes a more complex CUDA program that uses all of the concepts we have covered up to this point.