Vector Addition¶
In this section, we focus on vector addition and show how to implement it on both the CPU and the GPU using CUDA. Vector addition is a fundamental operation in scientific computing and serves as an excellent first example for understanding parallel programming concepts.
CPU Vector Addition¶
A typical CPU vector addition function iterates over each element of the vectors and adds them sequentially. This method is straightforward but does not leverage parallelism, leading to inefficiencies for large vectors.
// CPU function that adds two vectors
void vector_add_cpu(float *a, float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)
    {
        out[i] = a[i] + b[i];
    }
}
- Parameters:
  - a, b: Input vectors of size n.
  - out: Output vector to store the result.
  - n: Number of elements in the vectors.
- Operation:
  - A simple for loop adds corresponding elements of a and b, storing the result in out.
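A minimal call sketch (the array contents here are just an example, assuming a, b, and out each hold n = 4 floats):
// Example usage of vector_add_cpu
float a[4]   = {1.0f, 2.0f, 3.0f, 4.0f};
float b[4]   = {10.0f, 20.0f, 30.0f, 40.0f};
float out[4];
vector_add_cpu(a, b, out, 4);  // out is now {11.0f, 22.0f, 33.0f, 44.0f}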
GPU Vector Addition with CUDA¶
To harness the parallel computing capabilities of GPUs, we convert the CPU function into a CUDA kernel. This allows us to perform vector addition concurrently using multiple threads.
- Steps for GPU Implementation:
- Memory Allocation on Host and Device
- Initialization of Host Data
- Data Transfer from Host to Device
- Thread Block Configuration
- Kernel Execution
- Data Transfer from Device to Host
- Memory Deallocation
1. Memory Allocation on Host and Device
Allocate memory for the vectors on both the host (CPU) and the device (GPU).
Host and Device Memory Allocation:
// Initialize pointers for host memory
float *h_a, *h_b, *h_out;
// Allocate host memory
h_a = (float*)malloc(sizeof(float) * N);
h_b = (float*)malloc(sizeof(float) * N);
h_out = (float*)malloc(sizeof(float) * N);
// Initialize pointers for device memory
float *d_a, *d_b, *d_out;
// Allocate device memory
cudaMalloc((void**)&d_a, sizeof(float) * N);
cudaMalloc((void**)&d_b, sizeof(float) * N);
cudaMalloc((void**)&d_out, sizeof(float) * N);
Explanation:
- Host variables (h_ prefix): Pointers to memory allocated on the CPU.
- Device variables (d_ prefix): Pointers to memory allocated on the GPU.
- cudaMalloc allocates memory on the GPU device.
2. Initialization of Host Data
Initialize the input vectors h_a and h_b on the host.
// Initialize host arrays
for (int i = 0; i < N; i++)
{
    h_a[i] = 1.0f; // or any desired value
    h_b[i] = 2.0f; // or any desired value
}
3. Data Transfer from Host to Device
Copy the input data from host memory to device memory.
// Transfer data from host to device memory
cudaMemcpy(d_a, h_a, sizeof(float) * N, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, sizeof(float) * N, cudaMemcpyHostToDevice);
- cudaMemcpy is used to transfer data between host and device.
- The fourth parameter specifies the direction of the transfer.
4. Thread Block Configuration
Configure the execution parameters, defining the number of threads per block and blocks per grid.
// Define the number of threads per block and blocks per grid
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
- threadsPerBlock: Number of threads within each block (commonly a multiple of 32 due to warp size).
- blocksPerGrid: Total number of blocks required to process all elements.
- The calculation ensures all elements are covered even if N is not a multiple of threadsPerBlock.
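For example, if N = 1,000,000 and threadsPerBlock = 256, then blocksPerGrid = (1,000,000 + 255) / 256 = 3907; the last block contains 3907 * 256 - 1,000,000 = 192 surplus threads, which the kernel's boundary check leaves idle.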
5. Kernel Execution
Invoke the CUDA kernel function to perform vector addition on the GPU.
// Execute the CUDA kernel function
vector_add_cuda<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, N);
CUDA Kernel Function:
// GPU kernel function that adds two vectors
__global__ void vector_add_cuda(float *a, float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Ensure we do not access out-of-bounds memory
    if (i < n)
    {
        out[i] = a[i] + b[i];
    }
}
- Index Calculation:
  - i = blockIdx.x * blockDim.x + threadIdx.x; computes the global thread index.
- Boundary Check:
  - Ensures that threads beyond the vector size do not perform invalid memory accesses.
- Memory Access:
  - Each thread adds one element from vectors a and b, storing the result in out.
6. Data Transfer from Device to Host
Copy the result from device memory back to host memory.
// Transfer data back to host memory
cudaMemcpy(h_out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
7. Memory Deallocation
Free the allocated memory on both the host and device to prevent memory leaks.
Device Memory Deallocation:
// Deallocate device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_out);
Host Memory Deallocation:
// Deallocate host memory
free(h_a);
free(h_b);
free(h_out);
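For reference, here is a minimal end-to-end host program that assembles the seven steps above into one file. It is a sketch, not a definitive implementation: it assumes N = 1,000,000, initializes the inputs to the constants used earlier, and omits error checking for brevity.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 1000000

// GPU kernel function that adds two vectors (same as above)
__global__ void vector_add_cuda(float *a, float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        out[i] = a[i] + b[i];
    }
}

int main(void)
{
    // 1. Allocate host and device memory
    float *h_a = (float*)malloc(sizeof(float) * N);
    float *h_b = (float*)malloc(sizeof(float) * N);
    float *h_out = (float*)malloc(sizeof(float) * N);
    float *d_a, *d_b, *d_out;
    cudaMalloc((void**)&d_a, sizeof(float) * N);
    cudaMalloc((void**)&d_b, sizeof(float) * N);
    cudaMalloc((void**)&d_out, sizeof(float) * N);

    // 2. Initialize host data
    for (int i = 0; i < N; i++)
    {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // 3. Copy the inputs from host to device
    cudaMemcpy(d_a, h_a, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(float) * N, cudaMemcpyHostToDevice);

    // 4. Configure threads and blocks, then 5. launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vector_add_cuda<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, N);

    // 6. Copy the result from device back to host
    cudaMemcpy(h_out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
    printf("h_out[0] = %f\n", h_out[0]); // expected: 3.000000

    // 7. Free device and host memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    free(h_a);
    free(h_b);
    free(h_out);
    return 0;
}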
GPU Vector Addition with 2D Thread Blocks¶
In some cases, using 2D thread blocks can be beneficial, especially when working with 2D data structures like matrices. Here, we demonstrate how to adapt vector addition to use 2D thread blocks and map them to a 1D data structure.
Thread Configuration and Kernel Execution
Define 2D Thread Blocks:
// Thread block dimensions
dim3 threadsPerBlock(16, 16); // 256 threads per block
dim3 blocksPerGrid((N + threadsPerBlock.x * threadsPerBlock.y - 1) / (threadsPerBlock.x * threadsPerBlock.y));
Kernel Execution:
// Execute the CUDA kernel function with 2D thread blocks
vector_add_cuda_2d<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, N);
CUDA Kernel Function with 2D Thread Blocks
// GPU kernel function that adds two vectors using 2D thread blocks
__global__ void vector_add_cuda_2d(float *a, float *b, float *out, int n)
{
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int threadIdInBlock = threadIdx.y * blockDim.x + threadIdx.x;
    int i = blockId * (blockDim.x * blockDim.y) + threadIdInBlock;
    if (i < n)
    {
        out[i] = a[i] + b[i];
    }
}
- Block and Thread Indexing:
  - blockId calculates a unique block ID in a potentially 2D grid.
  - threadIdInBlock computes a unique thread ID within a 2D block.
- Global Thread Index:
  - Combines blockId and threadIdInBlock to get the global index i.
- Mapping to 1D Data:
  - Even though threads are organized in 2D, they operate on 1D data.
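For example, with the 16 × 16 blocks defined above launched on a one-dimensional grid (so blockIdx.y = 0 and blockId = blockIdx.x), a thread at threadIdx = (3, 5) in block 2 gets threadIdInBlock = 5 * 16 + 3 = 83 and global index i = 2 * 256 + 83 = 595.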
Advantages of Using 2D Thread Blocks
- Better Utilization: Can lead to better resource utilization on the GPU.
- Mapping to Data Structures: Easier mapping when dealing with 2D or 3D data.
- Memory Coalescing: May improve memory access patterns for certain data layouts.
Performance Considerations¶
- Thread Configuration:
  - Choosing the right number of threads per block and blocks per grid is crucial for optimal performance.
  - Factors such as warp size, occupancy, and hardware limitations influence this choice.
- Memory Coalescing:
  - Accessing global memory in a coalesced manner improves performance.
  - Ensure that consecutive threads access consecutive memory addresses.
- Error Checking:
  - Always check for errors after CUDA API calls to catch issues early (see the sketch after this list).
  - Use cudaGetLastError() and cudaDeviceSynchronize() for debugging.
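A minimal sketch of one common error-checking pattern (the CUDA_CHECK macro name is our own, not part of this section): wrap runtime API calls in a helper that reports the file and line of a failure, and check kernel launches afterwards with cudaGetLastError() plus cudaDeviceSynchronize().
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative helper: abort with a message if a CUDA runtime call fails
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage with the calls from this section:
// CUDA_CHECK(cudaMalloc((void**)&d_a, sizeof(float) * N));
// CUDA_CHECK(cudaMemcpy(d_a, h_a, sizeof(float) * N, cudaMemcpyHostToDevice));
//
// Kernel launches return no error code directly, so check them afterwards:
// vector_add_cuda<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, N);
// CUDA_CHECK(cudaGetLastError());       // reports launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran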