Vector Addition

In this section, we delve into basic vector operations with a focus on vector addition, illustrating how to implement it efficiently on both CPU and GPU using CUDA. Vector addition is a fundamental operation in scientific computing and serves as an excellent example to understand parallel programming concepts.

CPU Vector Addition

A typical CPU vector addition function iterates over each element of the vectors and adds them sequentially. This method is straightforward but does not leverage parallelism, leading to inefficiencies for large vectors.

// CPU function that adds two vectors
void vector_add_cpu(float *a, float *b, float *out, int n) 
{
    for (int i = 0; i < n; i++)
    {
        out[i] = a[i] + b[i];
    }
}

Explanation:

  • Parameters:

    • a, b: Input vectors of size n.
    • out: Output vector to store the result.
    • n: Number of elements in the vectors.
  • Operation:

    • A simple for loop adds corresponding elements of a and b, storing the result in out.
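
A minimal driver for this CPU version might look as follows; the element count of 1000 and the initial values are illustrative choices rather than part of the original example.

// Minimal CPU-only driver (illustrative sketch)
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 1000; // illustrative element count
    float *a   = (float*)malloc(sizeof(float) * n);
    float *b   = (float*)malloc(sizeof(float) * n);
    float *out = (float*)malloc(sizeof(float) * n);

    for (int i = 0; i < n; i++)
    {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    vector_add_cpu(a, b, out, n);
    printf("out[0] = %f\n", out[0]); // expected 3.000000

    free(a);
    free(b);
    free(out);
    return 0;
}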

GPU Vector Addition with CUDA

To harness the parallel computing capabilities of GPUs, we convert the CPU function into a CUDA kernel. This allows us to perform vector addition concurrently using multiple threads.

Steps for GPU Implementation:

  1. Memory Allocation on Host and Device
  2. Initialization of Host Data
  3. Data Transfer from Host to Device
  4. Thread Block Configuration
  5. Kernel Execution
  6. Data Transfer from Device to Host
  7. Memory Deallocation

1. Memory Allocation on Host and Device

Allocate memory for the vectors on both the host (CPU) and the device (GPU). Here, N denotes the number of elements in each vector (for example, a constant defined at the top of the program).

Host Memory Allocation:

// Initialize pointers for host memory
float *h_a, *h_b, *h_out;

// Allocate host memory
h_a   = (float*)malloc(sizeof(float) * N);
h_b   = (float*)malloc(sizeof(float) * N);
h_out = (float*)malloc(sizeof(float) * N);

Device Memory Allocation:

// Initialize pointers for device memory
float *d_a, *d_b, *d_out;

// Allocate device memory
cudaMalloc((void**)&d_a, sizeof(float) * N);
cudaMalloc((void**)&d_b, sizeof(float) * N);
cudaMalloc((void**)&d_out, sizeof(float) * N);

Explanation:

  • Host Variables (h_ prefix): Pointers to memory allocated on the CPU.
  • Device Variables (d_ prefix): Pointers to memory allocated on the GPU.
  • cudaMalloc allocates memory on the GPU device.
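
Both malloc and cudaMalloc can fail, so it is good practice to check their results. The following is a minimal sketch; the exit-on-failure policy is an illustrative assumption, and it presumes <stdio.h> and <stdlib.h> are included.

// Check whether device allocation succeeded (illustrative error handling)
cudaError_t err = cudaMalloc((void**)&d_a, sizeof(float) * N);
if (err != cudaSuccess)
{
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}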

2. Initialization of Host Data

Initialize the input vectors h_a and h_b on the host.

// Initialize host arrays
for (int i = 0; i < N; i++)
{
    h_a[i] = 1.0f; // or any desired value
    h_b[i] = 2.0f; // or any desired value
}

3. Data Transfer from Host to Device

Copy the input data from host memory to device memory.

// Transfer data from host to device memory
cudaMemcpy(d_a, h_a, sizeof(float) * N, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, sizeof(float) * N, cudaMemcpyHostToDevice);

Explanation:

  • cudaMemcpy is used to transfer data between host and device.
  • The fourth parameter specifies the direction of the transfer.
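
For example, cudaMemcpyHostToDevice copies data from CPU memory to GPU memory, while cudaMemcpyDeviceToHost (used in step 6) copies results back in the opposite direction.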


4. Thread Block Configuration

Configure the execution parameters, defining the number of threads per block and blocks per grid.

// Define the number of threads per block and blocks per grid
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

Explanation:

  • threadsPerBlock: Number of threads within each block (commonly a multiple of 32 due to warp size).
  • blocksPerGrid: Total number of blocks required to process all elements.
  • The calculation ensures all elements are covered even if N is not a multiple of threadsPerBlock.
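
For example, with N = 1,000,000 and threadsPerBlock = 256, blocksPerGrid = (1,000,000 + 255) / 256 = 3907; the last block then contains some threads whose indices fall beyond N, and the kernel's boundary check leaves them idle.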

5. Kernel Execution

Invoke the CUDA kernel function to perform vector addition on the GPU.

// Execute the CUDA kernel function
vector_add_cuda<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, N);

CUDA Kernel Function:

// GPU kernel function that adds two vectors
__global__ void vector_add_cuda(float *a, float *b, float *out, int n) 
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Ensure we do not access out-of-bounds memory
    if (i < n)
    {
        out[i] = a[i] + b[i];
    }
}

Explanation:

  • Index Calculation:
    • i = blockIdx.x * blockDim.x + threadIdx.x;
    • Calculates the global thread index.
  • Boundary Check:
    • Ensures that threads beyond the vector size do not perform invalid memory accesses.
  • Memory Access:
    • Each thread adds one element from vectors a and b, storing the result in out.
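
For example, with blockDim.x = 256, the thread with blockIdx.x = 2 and threadIdx.x = 5 computes i = 2 * 256 + 5 = 517 and therefore adds element 517 of the two vectors.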

6. Data Transfer from Device to Host

Copy the result from device memory back to host memory.

// Transfer data back to host memory
cudaMemcpy(h_out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
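
With the result back on the host, a quick verification loop can confirm the addition. This check is an illustrative addition that assumes the 1.0f and 2.0f initialization used above.

// Verify the result on the host (illustrative check)
for (int i = 0; i < N; i++)
{
    if (h_out[i] != 3.0f) // 1.0f + 2.0f
    {
        fprintf(stderr, "Mismatch at index %d: %f\n", i, h_out[i]);
        break;
    }
}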

7. Memory Deallocation

Free the allocated memory on both the host and device to prevent memory leaks.

Device Memory Deallocation:

// Deallocate device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_out);

Host Memory Deallocation:

// Deallocate host memory
free(h_a);
free(h_b);
free(h_out);
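
For reference, the steps above can be combined into a single program. The following sketch uses N = 1 << 20 and omits error checking for brevity; both choices are illustrative rather than part of the original listings.

// Complete example combining steps 1-7 (sketch; N and the initial values are illustrative)
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N (1 << 20) // number of elements (illustrative choice)

// GPU kernel function that adds two vectors
__global__ void vector_add_cuda(float *a, float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        out[i] = a[i] + b[i];
    }
}

int main(void)
{
    // 1. Memory allocation on host and device
    float *h_a, *h_b, *h_out;
    float *d_a, *d_b, *d_out;
    h_a   = (float*)malloc(sizeof(float) * N);
    h_b   = (float*)malloc(sizeof(float) * N);
    h_out = (float*)malloc(sizeof(float) * N);
    cudaMalloc((void**)&d_a, sizeof(float) * N);
    cudaMalloc((void**)&d_b, sizeof(float) * N);
    cudaMalloc((void**)&d_out, sizeof(float) * N);

    // 2. Initialization of host data
    for (int i = 0; i < N; i++)
    {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // 3. Data transfer from host to device
    cudaMemcpy(d_a, h_a, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(float) * N, cudaMemcpyHostToDevice);

    // 4. Thread block configuration
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // 5. Kernel execution
    vector_add_cuda<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, N);

    // 6. Data transfer from device to host
    cudaMemcpy(h_out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
    printf("h_out[0] = %f\n", h_out[0]); // expected 3.000000

    // 7. Memory deallocation
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    free(h_a);
    free(h_b);
    free(h_out);
    return 0;
}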

GPU Vector Addition with 2D Thread Blocks

In some cases, using 2D thread blocks can be beneficial, especially when working with 2D data structures like matrices. Here, we demonstrate how to adapt vector addition to use 2D thread blocks and map them to a 1D data structure.

Thread Configuration and Kernel Execution

Define 2D Thread Blocks:

// Thread block dimensions
dim3 threadsPerBlock(16, 16); // 256 threads per block
dim3 blocksPerGrid((N + threadsPerBlock.x * threadsPerBlock.y - 1) / (threadsPerBlock.x * threadsPerBlock.y));

Kernel Execution:

// Execute the CUDA kernel function with 2D thread blocks
vector_add_cuda_2d<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, N);

CUDA Kernel Function with 2D Thread Blocks

// GPU kernel function that adds two vectors using 2D thread blocks
__global__ void vector_add_cuda_2d(float *a, float *b, float *out, int n) 
{
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int threadIdInBlock = threadIdx.y * blockDim.x + threadIdx.x;
    int i = blockId * (blockDim.x * blockDim.y) + threadIdInBlock;

    if (i < n)
    {
        out[i] = a[i] + b[i];
    }
}

Explanation:

  • Block and Thread Indexing:
    • blockId calculates a unique block ID in a potentially 2D grid.
    • threadIdInBlock computes a unique thread ID within a 2D block.
  • Global Thread Index:
    • Combines blockId and threadIdInBlock to get a global index i.
  • Mapping to 1D Data:
    • Even though threads are organized in 2D, they operate on 1D data.
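
For example, with a 16 × 16 block and a one-dimensional grid, the thread at threadIdx = (3, 2) in block blockIdx.x = 10 gets threadIdInBlock = 2 * 16 + 3 = 35 and i = 10 * 256 + 35 = 2595.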

Advantages of Using 2D Thread Blocks

  • Better Utilization: For naturally two-dimensional problems, a 2D layout can make fuller use of the GPU's threads and blocks.
  • Mapping to Data Structures: Easier mapping when dealing with 2D or 3D data.
  • Memory Coalescing: May improve memory access patterns for certain data layouts.

Performance Considerations

  • Thread Configuration:
    • Choosing the right number of threads per block and blocks per grid is crucial for optimal performance.
    • Factors such as warp size, occupancy, and hardware limitations influence this choice.
  • Memory Coalescing:
    • Accessing global memory in a coalesced manner improves performance.
    • Ensure that consecutive threads access consecutive memory addresses.
  • Error Checking:
    • Always check for errors after CUDA API calls to catch issues early.
    • Use cudaGetLastError() and cudaDeviceSynchronize() for debugging, as in the sketch below.
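
A simple error-checking helper illustrates the last point. The macro below is a common pattern sketched for illustration, not code taken from a specific library; its name and policy are assumptions.

// Simple error-checking helper (common pattern; macro name is an illustrative choice)
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err = (call);                                             \
        if (err != cudaSuccess) {                                             \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                      \
                    __FILE__, __LINE__, cudaGetErrorString(err));             \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void**)&d_a, sizeof(float) * N));
// vector_add_cuda<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, N);
// CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised during kernel execution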