Matrix Multiplication¶
Script
- "Matrix multiplication is a fundamental operation widely used in scientific computing and machine learning. This session focuses on optimizing matrix multiplication for large datasets using CUDA to leverage the parallel processing power of GPUs."
- "In matrix multiplication, each element of the output matrix is calculated as the dot product of a row from the first matrix and a column from the second. In GPU implementations, matrices are typically stored as 1D arrays to simplify memory addressing and access."
- "This is a CPU implementation of matrix multiplication. The code iterates over rows and columns of the output matrix, calculating each element as the dot product of the corresponding row and column from the input matrices. While straightforward, this approach doesn’t utilize parallelism and becomes inefficient for large matrices."
- "GPU implementations of matrix multiplication follow a systematic process, starting with memory allocation, data initialization, and transferring data to the GPU. Kernel execution handles the parallel computation, and results are transferred back to the host before memory is deallocated. Each step is designed to maximize the GPU’s parallel processing capabilities."
- "Memory allocation is the first step. Host memory is allocated using standard malloc, while GPU memory is allocated using cudaMalloc. Separate memory spaces for the host and device allow for efficient data transfer and computation."
- "Input matrices are initialized on the host and transferred to the GPU using cudaMemcpy. This ensures that the GPU has access to the data it needs for computation, a critical step in CUDA programming."
- "To achieve optimal parallelism, we configure a 32x32 block size, resulting in 1,024 threads per block. Grid dimensions are calculated to ensure all elements of the matrices are covered, even when the matrix size isn’t divisible by the block size. This configuration maximizes GPU utilization."
- "In the CUDA kernel, each thread computes a single element of the output matrix. The thread’s unique row and column indices are calculated using blockIdx and threadIdx. A boundary check ensures that threads operate only within valid memory ranges, preventing out-of-bounds errors."
- "Once the kernel is launched with the configured grid and block dimensions, the GPU performs the computations. The resulting output matrix is then transferred back to the host using cudaMemcpy, making it available for further processing or analysis."
- "Memory management is a critical aspect of GPU programming. After computations are complete, GPU memory is freed using cudaFree, while host memory is deallocated using free. Proper cleanup prevents memory leaks and ensures efficient use of resources."
- "Achieving optimal performance in GPU matrix multiplication requires careful consideration of thread block size and memory access patterns. Aligning memory access for coalescing minimizes latency, and using tools like cudaGetLastError helps identify and debug issues during kernel execution."
- "In summary, GPU-based matrix multiplication with CUDA provides a highly efficient solution for handling large matrices. By leveraging parallelism, optimizing memory management, and configuring kernels effectively, we can achieve significant speedups compared to CPU implementations. CUDA’s power lies in its ability to scale computations across thousands of threads, making it indispensable for scientific and machine learning applications."
Matrix multiplication is a fundamental operation in scientific computing, especially in areas such as machine learning and computer graphics. In this guide, we will look into matrix multiplication on both CPU and GPU using CUDA, focusing on optimizing performance for large matrices through parallelization.
Overview of Matrix Multiplication¶
In matrix multiplication, each element of the output matrix is the dot product of a row from the first matrix and a column from the second. In our example, matrices are stored as 1D arrays in row-major order, which makes it easy to share the same addressing scheme between the CPU and GPU implementations.
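As a quick illustration of this row-major layout, the element at position (row, col) of a width x width matrix lives at offset row * width + col in the flat array. The helper functions below are a minimal sketch of this indexing; the names get_element and set_element are illustrative and not part of the original code.
// Row-major indexing into a width x width matrix stored as a 1D array (illustrative sketch)
float get_element(const float *mat, int row, int col, int width)
{
    return mat[row * width + col];
}

void set_element(float *mat, int row, int col, int width, float value)
{
    mat[row * width + col] = value;
}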
CPU Matrix Multiplication¶
The CPU implementation of matrix multiplication iterates through each element in the matrix and computes the dot product of corresponding rows and columns.
// CPU function for matrix multiplication
float *matrix_mul(float *h_a, float *h_b, float *h_c, int width)
{
    for (int row = 0; row < width; ++row)
    {
        for (int col = 0; col < width; ++col)
        {
            // Dot product of row 'row' of h_a with column 'col' of h_b.
            float single_entry = 0;
            for (int i = 0; i < width; ++i)
            {
                single_entry += h_a[row * width + i] * h_b[i * width + col];
            }
            h_c[row * width + col] = single_entry;
        }
    }
    return h_c;
}
Explanation:
- Loop Structure: The outer two loops iterate over each row and column, while the innermost loop calculates the dot product for the current element.
- Memory Layout: The 2D matrix is stored in a 1D array to simplify addressing when moving to the GPU.
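As a quick illustration of how this function might be called, here is a minimal host-side driver. It is a sketch only: the matrix size N = 4 is an assumption, and the constant fill values 1.0f and 2.0f mirror the initialization used in the GPU example later in this guide (assumes <stdlib.h> is included).
// Illustrative driver for the CPU version (N and fill values are assumptions)
int N = 4;
float *h_a = (float*)malloc(sizeof(float) * (N * N));
float *h_b = (float*)malloc(sizeof(float) * (N * N));
float *h_c = (float*)malloc(sizeof(float) * (N * N));
for (int i = 0; i < N * N; i++)
{
    h_a[i] = 1.0f;   // every element of A is 1
    h_b[i] = 2.0f;   // every element of B is 2
}
matrix_mul(h_a, h_b, h_c, N);   // each entry of h_c is now 2.0f * N
free(h_a);
free(h_b);
free(h_c);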
GPU Matrix Multiplication with CUDA¶
The GPU implementation of matrix multiplication leverages CUDA’s parallelism by assigning each element computation to a thread. Here’s a breakdown of the steps involved in implementing matrix multiplication on the GPU:
1. Memory Allocation on Host and Device
Allocate memory for matrices on both the host (CPU) and the device (GPU).
// Host memory allocation
float *a = (float*)malloc(sizeof(float) * (N * N));
float *b = (float*)malloc(sizeof(float) * (N * N));
float *c = (float*)malloc(sizeof(float) * (N * N));
// Device memory allocation
float *d_a, *d_b, *d_c;
cudaMalloc((void**)&d_a, sizeof(float) * (N * N));
cudaMalloc((void**)&d_b, sizeof(float) * (N * N));
cudaMalloc((void**)&d_c, sizeof(float) * (N * N));
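Each cudaMalloc call returns a cudaError_t status that can optionally be checked to catch allocation failures early. As a minimal sketch, the allocation of d_a above could be written more defensively like this (assumes <stdio.h> and <stdlib.h> are included):
// Optional error check on a device allocation (minimal sketch)
cudaError_t err = cudaMalloc((void**)&d_a, sizeof(float) * (N * N));
if (err != cudaSuccess)
{
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}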
2. Initialize Host Data
Initialize the elements of matrices a and b on the host.
for (int i = 0; i < N * N; i++)
{
    a[i] = 1.0f;
    b[i] = 2.0f;
}
3. Transfer Data from Host to Device
Transfer the initialized data from the host to the device.
cudaMemcpy(d_a, a, sizeof(float) * (N * N), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, sizeof(float) * (N * N), cudaMemcpyHostToDevice);
4. Configure Thread Blocks and Grid
Define the grid and block dimensions to organize the threads. Here, we use a 32x32 thread block, i.e. 1,024 threads per block (the maximum allowed per block on current NVIDIA GPUs), which keeps the GPU well occupied.
int blockSize = 32;
dim3 dimBlock(blockSize, blockSize, 1);
dim3 dimGrid((N + blockSize - 1) / blockSize, (N + blockSize - 1) / blockSize, 1);
Explanation:
- dimBlock: Specifies a 32x32 block of threads.
- dimGrid: Defines the number of blocks needed to cover the entire matrix, rounding up for incomplete blocks.
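For example, with a hypothetical matrix size of N = 1000 and blockSize = 32, (1000 + 31) / 32 = 32, so the grid is 32x32 blocks; the launch then spawns 1024 x 1024 = 1,048,576 threads to cover the 1,000,000 output elements, and the extra threads are filtered out by the kernel's boundary check.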
5. Launch the Kernel
Invoke the CUDA kernel for matrix multiplication on the GPU.
matrix_mul<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, N);
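A kernel launch does not return a status directly, so, as the script notes, cudaGetLastError can be used to surface launch problems. The snippet below is an optional sketch; the cudaDeviceSynchronize call is an added assumption so that errors raised during execution are also reported before results are copied back (assumes <stdio.h> is included).
// Optional post-launch error checking (minimal sketch)
cudaError_t launch_err = cudaGetLastError();   // catches invalid launch configurations
if (launch_err != cudaSuccess)
{
    fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(launch_err));
}
cudaDeviceSynchronize();   // block until the kernel finishes so execution errors surface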
6. CUDA Kernel for Matrix Multiplication
The CUDA kernel function calculates the matrix product in parallel by assigning each thread to compute one element of the output matrix.
__global__ void matrix_mul(float* d_a, float* d_b, float* d_c, int width)
{
    // Each thread derives the row and column of the output element it owns
    // from its block and thread indices.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;

    // Boundary check: threads that fall outside the matrix do nothing.
    if (row < width && col < width)
    {
        // Dot product of row 'row' of d_a with column 'col' of d_b.
        float single_entry = 0;
        for (int i = 0; i < width; ++i)
        {
            single_entry += d_a[row * width + i] * d_b[i * width + col];
        }
        d_c[row * width + col] = single_entry;
    }
}
Explanation:
- Index Calculation: Each thread calculates its unique row and column indices based on blockIdx and threadIdx.
- Boundary Check: Ensures that threads only process valid elements within the matrix dimensions.
- Matrix Multiplication: Each thread computes a single element of the result matrix by performing the dot product for its assigned position.
7. Transfer Data Back to Host
Copy the result from the device memory back to the host memory.
cudaMemcpy(c, d_c, sizeof(float) * (N * N), cudaMemcpyDeviceToHost);
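Since every element of a was initialized to 1.0f and every element of b to 2.0f, each entry of the result should equal 2.0f * N. A simple host-side check, shown here as a sketch (assumes <stdio.h> is included), is:
// Verify the result: with a filled with 1.0f and b with 2.0f,
// every output element should be 2.0f * N.
int errors = 0;
for (int i = 0; i < N * N; i++)
{
    if (c[i] != 2.0f * N)
    {
        errors++;
    }
}
printf("%s (%d mismatches)\n", errors == 0 ? "PASSED" : "FAILED", errors);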
8. Free Memory
Release the allocated memory on both the device and the host.
// Free device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
// Free host memory
free(a);
free(b);
free(c);