Using Shared Memory for Matrix Multiplication¶
Script
- "Shared memory is a high-speed, low-latency memory available on GPUs. It allows threads within a block to share data efficiently, reducing the need for costly global memory accesses. This session focuses on using shared memory to optimize matrix multiplication on GPUs."
- "Matrix multiplication calculates each element of the output matrix as a dot product of a row from matrix A and a column from matrix B. However, global memory accesses during this process can be slow and inefficient, making shared memory an attractive alternative for frequently accessed data."
- "Shared memory provides faster access times compared to global memory, reduces redundant data fetches, and enables data reuse within a thread block. These advantages make shared memory ideal when multiple threads need access to the same data during computation."
- "In this approach, a single row of matrix A is loaded into shared memory. Each thread in the block can then access this data directly from shared memory instead of repeatedly fetching it from global memory. Synchronization ensures all threads load the data before computations begin."
- "This approach stores both a row from matrix A and a column from matrix B in shared memory. By caching data for both matrices, we completely avoid accessing global memory during computation within the block. This technique is particularly useful for reducing memory traffic in more complex data access patterns."
- "Tiled matrix multiplication divides matrices into smaller tiles that fit into shared memory. Each thread block processes one tile at a time, loading it into shared memory, performing partial computations, and then moving to the next tile. This approach minimizes global memory accesses and makes efficient use of shared memory."
- "This CUDA kernel implements tiled matrix multiplication. Each thread block loads a tile of data from matrices A and B into shared memory. The threads within the block then compute partial results by iterating over the tiles. These partial results are accumulated to form the final matrix product."
- "Tiling improves performance by reducing global memory accesses and maximizing memory bandwidth utilization. It ensures that threads work with smaller chunks of data in shared memory, improving GPU utilization and computational efficiency."
- "To use shared memory effectively, keep usage within the hardware limits, typically 48 KB per block on NVIDIA GPUs (more can be opted into on recent architectures). Use synchronization carefully with __syncthreads() to avoid race conditions. Experiment with tile sizes to optimize performance, and use profiling tools like NVIDIA Nsight Compute to evaluate memory and computation efficiency."
- "In summary, shared memory significantly enhances the efficiency of matrix multiplication on GPUs by reducing global memory accesses. Storing single rows or columns is beneficial for simpler access patterns, while tiled matrix multiplication unlocks the full potential of GPUs for large-scale computations, achieving superior performance through data reuse and parallelism."
In this section, we will explore how to use shared memory in CUDA for optimizing matrix multiplication on GPUs. Shared memory offers significantly higher bandwidth than global memory, making it ideal for operations like matrix multiplication, where repeated access to the same data is common.
By storing parts of matrices in shared memory, we can reduce the number of global memory accesses, thus improving performance. Here, we will discuss two approaches to matrix multiplication using shared memory and also cover a more advanced approach called tiled matrix multiplication.
Basic Approaches for Using Shared Memory¶
In matrix multiplication on the GPU, we want each thread to compute a single element of the resulting matrix. However, to compute that element, a thread needs data from an entire row of one input matrix and an entire column of the other. Using shared memory allows us to cache parts of these rows and columns, thereby speeding up access.
Option 1: Storing a Single Row in Shared Memory¶
In this approach, we store only a single row of matrix A in shared memory. This allows each element of the row to be accessed quickly by all threads responsible for calculating elements in the same row of the product matrix C.
__global__ void matrix_mul(float *a, float *b, float *c, int width)
{
    // Assumes width == TILE_DIM, so one block spans a full row of A.
    __shared__ float aTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float single_entry = 0.0f;
    // Each thread loads one element of its row of A into shared memory.
    aTile[threadIdx.y][threadIdx.x] = a[row * TILE_DIM + threadIdx.x];
    __syncwarp(); // Each row of aTile is written and read by the same warp
                  // (blockDim.x == warp size), so a warp-level sync suffices
    for (int i = 0; i < width; i++)
    {
        single_entry += aTile[threadIdx.y][i] * b[i * width + col];
    }
    c[row * width + col] = single_entry;
}
In this code:
- The row of matrix A is loaded into shared memory (aTile), allowing faster access for subsequent calculations.
- Each thread computes a single element in the resulting matrix by iterating over the corresponding row from A (in shared memory) and the relevant column from B (from global memory).
Option 2: Storing Rows and Columns in Shared Memory¶
In this approach, both the row of matrix A and the column of matrix B are stored in shared memory. This minimizes global memory accesses further, allowing faster computation for all elements within the same block.
__global__ void matrix_mul(float *a, float *b, float *c, int width)
{
    // Assumes width == TILE_DIM, so one block spans full rows and columns.
    __shared__ float aTile[TILE_DIM][TILE_DIM];
    __shared__ float bTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    aTile[threadIdx.y][threadIdx.x] = a[row * TILE_DIM + threadIdx.x];
    bTile[threadIdx.y][threadIdx.x] = b[threadIdx.y * width + col];
    __syncthreads(); // Ensure all data is loaded into shared memory
    float single_entry = 0.0f;
    for (int i = 0; i < width; i++)
    {
        single_entry += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
    }
    c[row * width + col] = single_entry;
}
In this code:
- aTile and bTile are used to store the current block's rows and columns of matrices A and B, respectively.
- Each thread iterates over the row and column elements stored in shared memory, allowing all threads within a block to compute their results more efficiently.
Block Matrix Multiplication (Tiled Matrix Multiplication)¶
For larger matrices, a common optimization technique is tiled matrix multiplication, where both input matrices are divided into smaller blocks (tiles) that fit into shared memory. This approach loads one tile of each matrix into shared memory at a time, performs partial matrix multiplication for that tile, and accumulates the results. This process is repeated across all tiles needed to compute the final result.
__global__ void matrix_mul(const float *d_a, const float *d_b,
                           float *d_c, int width)
{
    // Assumes width is a multiple of TILE_WIDTH.
    __shared__ float a_block[TILE_WIDTH][TILE_WIDTH];
    __shared__ float b_block[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = by * TILE_WIDTH + ty;
    int col = bx * TILE_WIDTH + tx;
    float single_entry = 0.0f;
    for (int i = 0; i < width / TILE_WIDTH; ++i)
    {
        // Load tiles of A and B into shared memory
        a_block[ty][tx] = d_a[row * width + (i * TILE_WIDTH + tx)];
        b_block[ty][tx] = d_b[col + (i * TILE_WIDTH + ty) * width];
        __syncthreads(); // Ensure all threads have loaded their tile
        // Perform the partial matrix multiplication for this tile
        for (int j = 0; j < TILE_WIDTH; ++j)
        {
            single_entry += a_block[ty][j] * b_block[j][tx];
        }
        __syncthreads(); // Synchronize before loading the next tile
    }
    d_c[row * width + col] = single_entry;
}
In this code:
- The matrix is divided into tiles of size TILE_WIDTH × TILE_WIDTH.
- Each tile of A and B is loaded into shared memory as a_block and b_block.
- The threads within each block compute a partial sum using these tiles and then accumulate this sum across all tiles to form the final result.
Benefits of Tiled Matrix Multiplication¶
- Reduced Global Memory Access: By loading blocks of data into shared memory, we significantly reduce the number of times we access global memory, which is slower.
- Improved Memory Bandwidth Utilization: Shared memory provides much higher bandwidth, enabling faster data reuse within the block.
- Parallelism Optimization: Tiling allows threads within a block to cooperatively load data, making the best use of parallelism and shared memory.
Summary¶
By using shared memory, we can reduce the memory latency and global-memory bandwidth requirements of matrix multiplication on the GPU. Each approach has its own use case:
- Single Row/Column in Shared Memory: Best suited for cases where a limited amount of data needs frequent access.
- Both Rows and Columns in Shared Memory: Useful when threads in a block repeatedly read from both input matrices, eliminating global memory accesses during the accumulation itself.
- Tiled Matrix Multiplication: The most efficient for large matrices, allowing for optimal memory usage and computational efficiency.
Utilizing shared memory in these ways allows for high-performance matrix multiplications, unlocking the potential of GPUs for large-scale computations.