Matrix Multiplication¶
Matrix multiplication is a fundamental operation in scientific computing, especially in areas such as machine learning and computer graphics. In this guide, we implement matrix multiplication on both the CPU and the GPU using CUDA, focusing on how parallelization improves performance for large matrices.
Overview of Matrix Multiplication¶
Matrix multiplication produces each element of the output matrix as the dot product of a row of the first matrix with a column of the second. In our example, we store each matrix in a 1D array in row-major order, which lets the same indexing scheme serve both the CPU and GPU implementations.
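With row-major storage, the element at position (row, col) of an N x N matrix sits at index row * N + col in the flat array. A minimal accessor illustrating the indexing scheme (the helper is our own, not part of the example program):
// Row-major indexing: element (row, col) of an N x N matrix
// is stored at offset row * N + col in the 1D array.
float get_element(const float *m, int row, int col, int N)
{
    return m[row * N + col];
}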
CPU Matrix Multiplication¶
The CPU implementation iterates over each element of the output matrix and computes the dot product of the corresponding row of the first matrix and column of the second.
// CPU function for matrix multiplication
float *matrix_mul(float *h_a, float *h_b, float *h_c, int width)
{
    for (int row = 0; row < width; ++row)
    {
        for (int col = 0; col < width; ++col)
        {
            float single_entry = 0;
            for (int i = 0; i < width; ++i)
            {
                single_entry += h_a[row * width + i] * h_b[i * width + col];
            }
            h_c[row * width + col] = single_entry;
        }
    }
    return h_c;
}
Explanation:
- Loop Structure:
- The outer two loops iterate over each row and column, while the innermost loop calculates the dot product for the current element.
- Memory Layout:
- The 2D matrix is stored in a 1D array to simplify addressing when moving to the GPU.
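For an N x N matrix this triple loop performs on the order of N^3 multiply-add operations on a single core; at N = 1024 that is already roughly one billion operations, which is what makes offloading the work to the GPU attractive for large matrices.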
GPU Matrix Multiplication with CUDA¶
The GPU implementation of matrix multiplication leverages CUDA’s parallelism by assigning each element computation to a thread. Here’s a breakdown of the steps involved in implementing matrix multiplication on the GPU:
1. Memory Allocation on Host and Device
Allocate memory for matrices on both the host (CPU) and the device (GPU).
// Host memory allocation
float *a = (float*)malloc(sizeof(float) * (N * N));
float *b = (float*)malloc(sizeof(float) * (N * N));
float *c = (float*)malloc(sizeof(float) * (N * N));
// Device memory allocation
float *d_a, *d_b, *d_c;
cudaMalloc((void**)&d_a, sizeof(float) * (N * N));
cudaMalloc((void**)&d_b, sizeof(float) * (N * N));
cudaMalloc((void**)&d_c, sizeof(float) * (N * N));
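cudaMalloc, like the other CUDA runtime calls used below, returns a cudaError_t. A minimal sketch of how these return values could be checked during development (the macro name is our own; it relies on <cstdio> and <cstdlib>):
// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess)                                      \
        {                                                            \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)
// Usage: CUDA_CHECK(cudaMalloc((void**)&d_a, sizeof(float) * (N * N)));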
2. Initialize Host Data
Initialize the elements of matrices a and b on the host.
for (int i = 0; i < N * N; i++)
{
    a[i] = 1.0f;
    b[i] = 2.0f;
}
3. Transfer Data from Host to Device
Transfer the initialized data from the host to the device.
cudaMemcpy(d_a, a, sizeof(float) * (N * N), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, sizeof(float) * (N * N), cudaMemcpyHostToDevice);
4. Define Grid and Block Dimensions
Define the grid and block dimensions to organize the threads. Here, we use 32x32 threads per block, i.e. 1024 threads per block, which is the maximum block size supported on current NVIDIA GPUs.
int blockSize = 32;
dim3 dimBlock(blockSize, blockSize, 1);
dim3 dimGrid((N + blockSize - 1) / blockSize, (N + blockSize - 1) / blockSize, 1);
- dimBlock: Specifies a 32x32 block of threads.
- dimGrid: Defines the number of blocks needed to cover the entire matrix, rounding up for incomplete blocks.
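For example, with N = 1000 and blockSize = 32, dimGrid works out to (1000 + 31) / 32 = 32 blocks in each dimension, so the launch covers a 1024x1024 grid of threads; the extra threads along the right and bottom edges fall outside the matrix and are filtered out by the boundary check in the kernel.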
5. Launch the Kernel
Invoke the CUDA kernel for matrix multiplication on the GPU.
matrix_mul<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, N);
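A kernel launch is asynchronous and does not return an error code directly. One way to surface launch problems during development is to query the last error and then synchronize; a sketch:
// Check that the launch itself was accepted, then wait for the kernel to finish.
cudaError_t launch_err = cudaGetLastError();
if (launch_err != cudaSuccess)
{
    printf("Kernel launch failed: %s\n", cudaGetErrorString(launch_err));
}
cudaDeviceSynchronize();
The cudaMemcpy in step 7 also waits for the kernel to complete before copying, so the explicit synchronization here is mainly a debugging aid.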
6. CUDA Kernel for Matrix Multiplication
The CUDA kernel function calculates the matrix product in parallel by assigning each thread to compute one element of the output matrix.
__global__ void matrix_mul(float* d_a, float* d_b, float* d_c, int width)
{
    // Each thread computes one element of the output matrix. Taking the row
    // from the y index and the column from the x index keeps threads of a
    // warp on consecutive columns, so accesses to d_b and d_c are coalesced.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width)
    {
        float single_entry = 0;
        // Dot product of row of d_a with column of d_b
        for (int i = 0; i < width; ++i)
        {
            single_entry += d_a[row * width + i] * d_b[i * width + col];
        }
        d_c[row * width + col] = single_entry;
    }
}
Explanation:
- Index Calculation:
- Each thread calculates its unique row and column indices based on blockIdx and threadIdx.
- Boundary Check:
- Ensures that threads only process valid elements within the matrix dimensions.
- Matrix Multiplication:
- Each thread computes a single element of the result matrix by performing the dot product for its assigned position.
7. Transfer Data Back to Host
Copy the result from the device memory back to the host memory.
cudaMemcpy(c, d_c, sizeof(float) * (N * N), cudaMemcpyDeviceToHost);
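Because a was filled with 1.0f and b with 2.0f, every element of the product should equal 2.0f * N. That gives a quick host-side sanity check (a sketch; fabsf comes from <math.h>):
// Verify the GPU result against the value expected from the chosen inputs.
float expected = 2.0f * N;
for (int i = 0; i < N * N; i++)
{
    if (fabsf(c[i] - expected) > 1e-3f)
    {
        printf("Mismatch at index %d: got %f, expected %f\n", i, c[i], expected);
        break;
    }
}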
8. Free Memory
Release the allocated memory on both the device and the host.
// Free device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
// Free host memory
free(a);
free(b);
free(c);
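Assuming the host code and the kernel live in a single source file named matrix_mul.cu (the file name is illustrative), the program can be compiled with NVIDIA's nvcc compiler:
nvcc matrix_mul.cu -o matrix_mul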