Using Shared Memory for Matrix Multiplication

In this section, we will explore how to use shared memory in CUDA for optimizing matrix multiplication on GPUs. Shared memory offers significantly higher bandwidth than global memory, making it ideal for operations like matrix multiplication, where repeated access to the same data is common.

By storing parts of matrices in shared memory, we can reduce the number of global memory accesses, thus improving performance. Here, we will discuss two approaches to matrix multiplication using shared memory and also cover a more advanced approach called tiled matrix multiplication.

Basic Approaches for Using Shared Memory

In matrix multiplication on the GPU, each thread typically computes a single element of the resulting matrix. To produce that one element, however, the thread needs an entire row of one input matrix and an entire column of the other. Using shared memory allows us to cache parts of these rows and columns, thereby speeding up access.
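
For reference, here is a minimal sketch of the naive version, where every operand is read directly from global memory. It assumes square width × width matrices stored in row-major order; the kernel name and bounds check are our additions for illustration:

__global__ void matrix_mul_naive(const float *a, const float *b,
                                 float *c, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < width && col < width)
    {
        float single_entry = 0.0f;
        // Every operand of a and b is fetched from global memory
        for (int i = 0; i < width; i++)
        {
            single_entry += a[row * width + i] * b[i * width + col];
        }
        c[row * width + col] = single_entry;
    }
}

In this version, each element of a is re-read from global memory by every thread in its row of the grid; that redundancy is exactly what the shared memory variants below remove.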

Option 1: Storing a Single Row in Shared Memory

In this approach, we stage the rows of matrix A used by the block in shared memory, while B is still read from global memory. Each row element can then be accessed quickly by all threads responsible for calculating elements in the same row of the product matrix C.

__global__ void matrix_mul(float *a, float *b, float *c, int width)
{
    // Simplifying assumptions: blockDim == (TILE_DIM, TILE_DIM),
    // width == TILE_DIM (one tile spans the inner dimension), and
    // TILE_DIM == warpSize, so each tile row is handled by one warp
    __shared__ float aTile[TILE_DIM][TILE_DIM];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float single_entry = 0.0f;

    // Each thread loads one element of its row of A into shared memory
    aTile[threadIdx.y][threadIdx.x] = a[row * width + threadIdx.x];
    __syncwarp(); // Each tile row is written and read by a single warp,
                  // so a warp-level sync is sufficient

    for (int i = 0; i < width; i++)
    {
        single_entry += aTile[threadIdx.y][i] * b[i * width + col];
    }

    c[row * width + col] = single_entry;
}

In this code:

  • The row of matrix A is loaded into shared memory (aTile), allowing faster access for subsequent calculations.
  • Each thread computes a single element in the resulting matrix by iterating over the corresponding row from A (in shared memory) and the relevant column from B (from global memory).

Option 2: Storing Rows and Columns in Shared Memory

In this approach, both the rows of matrix A and the columns of matrix B used by a block are stored in shared memory. This reduces global memory accesses further, allowing faster computation for all elements within the same block.

__global__ void matrix_mul(float *a, float *b, float *c, int width)
{
    // Same simplifying assumptions as above: blockDim == (TILE_DIM, TILE_DIM)
    // and width == TILE_DIM, so the tiles cover the whole inner dimension
    __shared__ float aTile[TILE_DIM][TILE_DIM];
    __shared__ float bTile[TILE_DIM][TILE_DIM];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    aTile[threadIdx.y][threadIdx.x] = a[row * width + threadIdx.x];
    bTile[threadIdx.y][threadIdx.x] = b[threadIdx.y * width + col];

    __syncthreads(); // bTile elements are read by threads in other warps,
                     // so a full block-level barrier is required

    float single_entry = 0.0f;
    for (int i = 0; i < width; i++)
    {
        single_entry += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
    }

    c[row * width + col] = single_entry;
}

In this code:

  • aTile holds the block's rows of A, and bTile holds the block's columns of B.
  • Because each thread reads tile elements written by threads in other warps, a full __syncthreads() barrier is needed before the multiplication loop; the warp-level sync from Option 1 would not be safe here.
  • Each thread then iterates over a row of aTile and a column of bTile, so every operand in the inner loop comes from shared memory.
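
Both options are deliberately simplified: a single block tile must span the whole inner dimension, which restricts them to small widths. Below is a minimal host-side launch sketch under those assumptions; the device buffers d_a, d_b, and d_c are hypothetical names for already-allocated and populated device memory:

// TILE_DIM == warpSize == 32 is assumed, as in the kernels above
dim3 block(TILE_DIM, TILE_DIM);
dim3 grid(width / TILE_DIM, width / TILE_DIM); // (1, 1) when width == TILE_DIM
matrix_mul<<<grid, block>>>(d_a, d_b, d_c, width);
cudaDeviceSynchronize(); // wait for the kernel before using the result

The tiled version in the next section removes this width restriction by marching over the inner dimension one tile at a time.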

Block Matrix Multiplication (Tiled Matrix Multiplication)

For larger matrices, a common optimization technique is tiled matrix multiplication, where both input matrices are divided into smaller blocks (tiles) that fit into shared memory. This approach loads one tile of each matrix into shared memory at a time, performs partial matrix multiplication for that tile, and accumulates the results. This process is repeated across all tiles needed to compute the final result.

__global__ void matrix_mul(const float *d_a, const float *d_b, 
                           float *d_c, int width)
{
    // Shared-memory tiles; assumes blockDim == (TILE_WIDTH, TILE_WIDTH)
    __shared__ float a_block[TILE_WIDTH][TILE_WIDTH];
    __shared__ float b_block[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int row = by * TILE_WIDTH + ty;
    int col = bx * TILE_WIDTH + tx;

    float single_entry = 0.0f;

    // March across the tiles covering one row of A and one column of B;
    // assumes width is an exact multiple of TILE_WIDTH
    for (int i = 0; i < width / TILE_WIDTH; ++i)
    {
        // Load tiles of A and B into shared memory
        a_block[ty][tx] = d_a[row * width + (i * TILE_WIDTH + tx)];
        b_block[ty][tx] = d_b[col + (i * TILE_WIDTH + ty) * width];

        __syncthreads(); // Ensure all threads have loaded their tile

        // Perform the partial matrix multiplication for this tile
        for (int j = 0; j < TILE_WIDTH; ++j)
        {
            single_entry += a_block[ty][j] * b_block[j][tx];
        }

        __syncthreads(); // Synchronize before loading the next tile
    }

    d_c[row * width + col] = single_entry;
}

In this code:

  • The matrix is divided into tiles of size TILE_WIDTH × TILE_WIDTH.
  • Each tile of A and B is loaded into shared memory as a_block and b_block.
  • The threads within each block compute a partial sum using these tiles and then accumulate this sum across all tiles to form the final result.
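
Below is a sketch of host code that could drive this kernel. The wrapper multiply_on_gpu is a hypothetical name of our own; TILE_WIDTH is assumed to be defined as a compile-time constant (e.g. 16), width is assumed to be an exact multiple of TILE_WIDTH, and error checking is omitted for brevity:

// Hypothetical host wrapper around the tiled kernel above
void multiply_on_gpu(const float *h_a, const float *h_b, float *h_c, int width)
{
    size_t bytes = (size_t)width * width * sizeof(float);
    float *d_a, *d_b, *d_c;

    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // One thread per output element, grouped into TILE_WIDTH x TILE_WIDTH blocks
    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH);
    matrix_mul<<<grid, block>>>(d_a, d_b, d_c, width);

    // cudaMemcpy implicitly waits for the kernel to finish
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}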

Benefits of Tiled Matrix Multiplication

  • Reduced Global Memory Access: By loading blocks of data into shared memory, we significantly reduce the number of times we access global memory, which is slower.
  • Improved Memory Bandwidth Utilization: Shared memory provides much higher bandwidth, enabling faster data reuse within the block.
  • Parallelism Optimization: Tiling allows threads within a block to cooperatively load data, making the best use of parallelism and shared memory.
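
To make the first point concrete: without tiling, each element of A is read from global memory once per output element it contributes to, i.e. roughly width times; with tiles of side TILE_WIDTH, it is loaded once per tile pass, i.e. width / TILE_WIDTH times. Global memory traffic for the inputs therefore drops by roughly a factor of TILE_WIDTH.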

Summary

By using shared memory, we can substantially reduce the global memory traffic and bandwidth requirements of matrix multiplication on the GPU (the arithmetic work itself is unchanged). Each approach has its own use case:

  • Single Row in Shared Memory: The simplest option, best suited for cases where a small amount of data is accessed repeatedly.
  • Rows and Columns in Shared Memory: Keeps both input operands in shared memory, avoiding repeated global memory accesses for either matrix at the cost of twice the shared memory.
  • Tiled Matrix Multiplication: The most practical for large matrices, since only fixed-size tiles need to fit in shared memory regardless of matrix size.

Utilizing shared memory in these ways allows for high-performance matrix multiplications, unlocking the potential of GPUs for large-scale computations.