Vector Addition¶
In this section, we focus on vector addition and show how to implement it on both the CPU and the GPU using CUDA. Vector addition is a fundamental operation in scientific computing and serves as an excellent first example for understanding parallel programming concepts.
CPU Vector Addition¶
A typical CPU vector addition function iterates over each element of the vectors and adds them sequentially. This method is straightforward but does not leverage parallelism, leading to inefficiencies for large vectors.
// CPU function that adds two vectors
void vector_add_cpu(float *a, float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)
    {
        out[i] = a[i] + b[i];
    }
}
- Parameters:
  - a, b: Input vectors of size n.
  - out: Output vector to store the result.
  - n: Number of elements in the vectors.
- Operation:
  - A simple for loop adds corresponding elements of a and b, storing the result in out.
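A minimal call sketch (the array contents here are just an example, assuming a, b, and out each hold n = 4 floats):
// Example usage of vector_add_cpu
float a[4]   = {1.0f, 2.0f, 3.0f, 4.0f};
float b[4]   = {10.0f, 20.0f, 30.0f, 40.0f};
float out[4];
vector_add_cpu(a, b, out, 4);  // out is now {11.0f, 22.0f, 33.0f, 44.0f}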
GPU Vector Addition with CUDA¶
To harness the parallel computing capabilities of GPUs, we convert the CPU function into a CUDA kernel. This allows us to perform vector addition concurrently using multiple threads.
- Steps for GPU Implementation:
- Memory Allocation on Host and Device
- Initialization of Host Data
- Data Transfer from Host to Device
- Thread Block Configuration
- Kernel Execution
- Data Transfer from Device to Host
- Memory Deallocation
1. Memory Allocation on Host and Device
Allocate memory for the vectors on both the host (CPU) and the device (GPU).
Host and Device Memory Allocation:
// Initialize pointers for host memory
float *h_a, *h_b, *h_out;
// Allocate host memory
h_a = (float*)malloc(sizeof(float) * N);
h_b = (float*)malloc(sizeof(float) * N);
h_out = (float*)malloc(sizeof(float) * N);
// Initialize pointers for device memory
float *d_a, *d_b, *d_out;
// Allocate device memory
cudaMalloc((void**)&d_a, sizeof(float) * N);
cudaMalloc((void**)&d_b, sizeof(float) * N);
cudaMalloc((void**)&d_out, sizeof(float) * N);
Explanation:
- Host variables (h_ prefix): Pointers to memory allocated on the CPU.
- Device variables (d_ prefix): Pointers to memory allocated on the GPU.
- cudaMalloc allocates memory on the GPU device.
2. Initialization of Host Data
Initialize the input vectors h_a and h_b on the host.
// Initialize host arrays
for (int i = 0; i < N; i++)
{
    h_a[i] = 1.0f; // or any desired value
    h_b[i] = 2.0f; // or any desired value
}
3. Data Transfer from Host to Device
Copy the input data from host memory to device memory.
// Transfer data from host to device memory
cudaMemcpy(d_a, h_a, sizeof(float) * N, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, sizeof(float) * N, cudaMemcpyHostToDevice);
- cudaMemcpy is used to transfer data between host and device.
- The fourth parameter specifies the direction of the transfer.
4. Thread Block Configuration
Configure the execution parameters, defining the number of threads per block and blocks per grid.
// Define the number of threads per block and blocks per grid
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
- threadsPerBlock: Number of threads within each block (commonly a multiple of 32 due to warp size).
- blocksPerGrid: Total number of blocks required to process all elements.
- The calculation ensures all elements are covered even if N is not a multiple of threadsPerBlock.
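For example, if N = 1,000,000 and threadsPerBlock = 256, then blocksPerGrid = (1,000,000 + 255) / 256 = 3907; the last block contains 3907 * 256 - 1,000,000 = 192 surplus threads, which the kernel's boundary check leaves idle.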
5. Kernel Execution
Invoke the CUDA kernel function to perform vector addition on the GPU.
// Execute the CUDA kernel function
vector_add_cuda<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, N);
CUDA Kernel Function:
// GPU kernel function that adds two vectors
__global__ void vector_add_cuda(float *a, float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Ensure we do not access out-of-bounds memory
    if (i < n)
    {
        out[i] = a[i] + b[i];
    }
}
- Index Calculation:
  - i = blockIdx.x * blockDim.x + threadIdx.x; computes the global thread index.
- Boundary Check:
  - Ensures that threads beyond the vector size do not perform invalid memory accesses.
- Memory Access:
  - Each thread adds one element from vectors a and b, storing the result in out.
6. Data Transfer from Device to Host
Copy the result from device memory back to host memory.
// Transfer data back to host memory
cudaMemcpy(h_out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
7. Memory Deallocation
Free the allocated memory on both the host and device to prevent memory leaks.
Device Memory Deallocation:
// Deallocate device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_out);
Host Memory Deallocation:
// Deallocate host memory
free(h_a);
free(h_b);
free(h_out);
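For reference, here is a minimal end-to-end host program that assembles the seven steps above into one file. It is a sketch, not a definitive implementation: it assumes N = 1,000,000, initializes the inputs to the constants used earlier, and omits error checking for brevity.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 1000000

// GPU kernel function that adds two vectors (same as above)
__global__ void vector_add_cuda(float *a, float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        out[i] = a[i] + b[i];
    }
}

int main(void)
{
    // 1. Allocate host and device memory
    float *h_a = (float*)malloc(sizeof(float) * N);
    float *h_b = (float*)malloc(sizeof(float) * N);
    float *h_out = (float*)malloc(sizeof(float) * N);
    float *d_a, *d_b, *d_out;
    cudaMalloc((void**)&d_a, sizeof(float) * N);
    cudaMalloc((void**)&d_b, sizeof(float) * N);
    cudaMalloc((void**)&d_out, sizeof(float) * N);

    // 2. Initialize host data
    for (int i = 0; i < N; i++)
    {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // 3. Copy the inputs from host to device
    cudaMemcpy(d_a, h_a, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(float) * N, cudaMemcpyHostToDevice);

    // 4. Configure threads and blocks, then 5. launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vector_add_cuda<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, N);

    // 6. Copy the result from device back to host
    cudaMemcpy(h_out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
    printf("h_out[0] = %f\n", h_out[0]); // expected: 3.000000

    // 7. Free device and host memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    free(h_a);
    free(h_b);
    free(h_out);
    return 0;
}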
GPU Vector Addition with 2D Thread Blocks¶
In some cases, using 2D thread blocks can be beneficial, especially when working with 2D data structures like matrices. Here, we demonstrate how to adapt vector addition to use 2D thread blocks and map them to a 1D data structure.
Thread Configuration and Kernel Execution
Define 2D Thread Blocks:
// Thread block dimensions
dim3 threadsPerBlock(16, 16); // 256 threads per block
dim3 blocksPerGrid((N + threadsPerBlock.x * threadsPerBlock.y - 1) / (threadsPerBlock.x * threadsPerBlock.y));
Kernel Execution:
// Execute the CUDA kernel function with 2D thread blocks
vector_add_cuda_2d<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, N);
CUDA Kernel Function with 2D Thread Blocks
// GPU kernel function that adds two vectors using 2D thread blocks
__global__ void vector_add_cuda_2d(float *a, float *b, float *out, int n)
{
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int threadIdInBlock = threadIdx.y * blockDim.x + threadIdx.x;
    int i = blockId * (blockDim.x * blockDim.y) + threadIdInBlock;
    if (i < n)
    {
        out[i] = a[i] + b[i];
    }
}
- Block and Thread Indexing:
  - blockId calculates a unique block ID in a potentially 2D grid.
  - threadIdInBlock computes a unique thread ID within a 2D block.
- Global Thread Index:
  - Combines blockId and threadIdInBlock to get the global index i.
- Mapping to 1D Data:
  - Even though threads are organized in 2D, they operate on 1D data.
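For example, with the 16 × 16 blocks defined above launched on a one-dimensional grid (so blockIdx.y = 0 and blockId = blockIdx.x), a thread at threadIdx = (3, 5) in block 2 gets threadIdInBlock = 5 * 16 + 3 = 83 and global index i = 2 * 256 + 83 = 595.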
Advantages of Using 2D Thread Blocks
- Better Utilization: Can lead to better resource utilization on the GPU.
- Mapping to Data Structures: Easier mapping when dealing with 2D or 3D data.
- Memory Coalescing: May improve memory access patterns for certain data layouts.
Performance Considerations¶
- Thread Configuration:
  - Choosing the right number of threads per block and blocks per grid is crucial for optimal performance.
  - Factors such as warp size, occupancy, and hardware limitations influence this choice.
- Memory Coalescing:
  - Accessing global memory in a coalesced manner improves performance.
  - Ensure that consecutive threads access consecutive memory addresses.
- Error Checking:
  - Always check for errors after CUDA API calls to catch issues early (see the sketch after this list).
  - Use cudaGetLastError() and cudaDeviceSynchronize() for debugging.
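A minimal sketch of one common error-checking pattern (the CUDA_CHECK macro name is our own, not part of this section): wrap runtime API calls in a helper that reports the file and line of a failure, and check kernel launches afterwards with cudaGetLastError() plus cudaDeviceSynchronize().
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative helper: abort with a message if a CUDA runtime call fails
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage with the calls from this section:
// CUDA_CHECK(cudaMalloc((void**)&d_a, sizeof(float) * N));
// CUDA_CHECK(cudaMemcpy(d_a, h_a, sizeof(float) * N, cudaMemcpyHostToDevice));
//
// Kernel launches return no error code directly, so check them afterwards:
// vector_add_cuda<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, N);
// CUDA_CHECK(cudaGetLastError());       // reports launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran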