Using Shared Memory for Matrix Multiplication in CUDA

1. Why is shared memory beneficial for matrix multiplication on GPUs?

  • A. Shared memory has higher bandwidth and lower latency than global memory
  • B. Shared memory is larger than global memory
  • C. Shared memory persists between kernel launches
  • D. Shared memory can be used by both the CPU and GPU
Answer: A. Shared memory has higher bandwidth and lower latency than global memory
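For reference, here is a sketch of the baseline that the shared-memory approaches below improve on: a kernel that reads every operand straight from global memory. The square N x N row-major layout and the one-thread-per-output-element mapping are assumptions made to keep the sketch short.

```cuda
// Naive kernel (sketch): every A and B element is fetched from global memory.
// Assumes square N x N matrices in row-major order, one thread per C element.
__global__ void matMulNaive(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];  // 2N global reads per output element
        C[row * N + col] = sum;
    }
}
```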

Basic Approaches for Using Shared Memory

2. In the first shared memory approach, why is only a single row of matrix A stored in shared memory?

  • A. To reduce memory usage on the device
  • B. To allow quick access by threads working on the same row of the result matrix
  • C. To ensure that each thread has unique data
  • D. To avoid using global memory altogether
Answer: B. To allow quick access by threads working on the same row of the result matrix
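A minimal sketch of what this first approach can look like; the one-block-per-row mapping, the MAX_N bound, and the strided loads are illustrative assumptions, not necessarily the exact kernel the question refers to.

```cuda
// First approach (sketch): each block computes one row of C, so the matching
// row of A is staged once in shared memory and reused by every thread.
// Assumes row-major N x N matrices with N <= MAX_N and one block per row.
#define MAX_N 1024

__global__ void matMulRowShared(const float *A, const float *B, float *C, int N)
{
    __shared__ float Arow[MAX_N];            // one full row of A

    int row = blockIdx.x;                    // one block per row of C

    // Cooperatively copy the row from global to shared memory.
    for (int k = threadIdx.x; k < N; k += blockDim.x)
        Arow[k] = A[row * N + k];
    __syncthreads();                         // the row must be complete before use

    // Each thread produces one or more elements of this row of C.
    for (int col = threadIdx.x; col < N; col += blockDim.x) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += Arow[k] * B[k * N + col]; // A now comes from shared memory
        C[row * N + col] = sum;
    }
}
```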

3. True or False: In the second shared memory approach, both a row of matrix A and a column of matrix B are stored in shared memory to reduce global memory accesses.

Answer: True
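The original kernel for this second approach is not reproduced in the quiz, so the sketch below only illustrates the staging pattern the question describes: one block per output element, with the needed row of A and column of B copied into shared memory before a shared-memory reduction. The block-per-element mapping, the 256-thread block size, and MAX_N are all assumptions; the real data-reuse payoff comes from the tiled scheme in the next section.

```cuda
// Second approach (illustrative sketch only): stage both a row of A and a
// column of B in shared memory. Assumes one block per C element, a
// power-of-two block size of 256 threads, and N <= MAX_N.
#define MAX_N 1024

__global__ void matMulRowColShared(const float *A, const float *B, float *C, int N)
{
    __shared__ float Arow[MAX_N];
    __shared__ float Bcol[MAX_N];
    __shared__ float partial[256];           // one slot per thread (blockDim.x == 256 assumed)

    int row = blockIdx.y;
    int col = blockIdx.x;

    // Cooperative, strided copy of the row and the column into shared memory.
    for (int k = threadIdx.x; k < N; k += blockDim.x) {
        Arow[k] = A[row * N + k];
        Bcol[k] = B[k * N + col];
    }
    __syncthreads();

    // Each thread accumulates a strided slice of the dot product from shared memory.
    float sum = 0.0f;
    for (int k = threadIdx.x; k < N; k += blockDim.x)
        sum += Arow[k] * Bcol[k];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction of the per-thread partial sums.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        C[row * N + col] = partial[0];
}
```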

Tiled Matrix Multiplication

4. What is the primary purpose of dividing matrices into tiles in tiled matrix multiplication?

  • A. To simplify the matrix multiplication algorithm
  • B. To reduce the number of required multiplications
  • C. To load data into shared memory in manageable chunks, reducing global memory accesses
  • D. To improve readability of the kernel code
Answer: C. To load data into shared memory in manageable chunks, reducing global memory accesses
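For concreteness, here is a common form of the tiled kernel (a sketch, not necessarily the exact code the quiz is based on). TILE_WIDTH = 16, square matrices, and N being a multiple of TILE_WIDTH are simplifying assumptions. The two __syncthreads() calls and the loop over i are the details questions 5 and 6 ask about.

```cuda
// Tiled kernel (sketch): assumes square N x N row-major matrices with
// N a multiple of TILE_WIDTH, and TILE_WIDTH x TILE_WIDTH thread blocks.
#define TILE_WIDTH 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float Atile[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Btile[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float sum = 0.0f;
    // Loop over the tiles of A and B needed for this C element (question 6).
    for (int i = 0; i < N / TILE_WIDTH; ++i) {
        // Each thread loads one element of the current A tile and one of the B tile.
        Atile[threadIdx.y][threadIdx.x] = A[row * N + i * TILE_WIDTH + threadIdx.x];
        Btile[threadIdx.y][threadIdx.x] = B[(i * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();   // wait until the whole tile is in shared memory (question 5)

        // Partial dot product using only shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += Atile[threadIdx.y][k] * Btile[k][threadIdx.x];
        __syncthreads();   // make sure the tile is no longer needed before overwriting it
    }
    C[row * N + col] = sum;
}
```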

5. In the tiled matrix multiplication kernel, what does __syncthreads() achieve after loading each tile?

  • A. It allows threads in different blocks to synchronize
  • B. It ensures that all threads within the block have loaded their portion of the tile into shared memory before proceeding
  • C. It prevents overwriting of shared memory
  • D. It initiates data transfer from global to shared memory
Answer: B. It ensures that all threads within the block have loaded their portion of the tile into shared memory before proceeding

6. Why do we need a loop over i in the tiled matrix multiplication kernel?

  • A. To initialize shared memory before each kernel launch
  • B. To load each portion (tile) of matrices A and B into shared memory in steps, reducing global memory loads
  • C. To handle matrix dimensions that are not multiples of the tile width
  • D. To compute the result matrix on the host
Answer: B. To load each portion (tile) of matrices A and B into shared memory in steps, reducing global memory loads

Benefits of Tiled Matrix Multiplication

7. Which of the following is NOT a benefit of using tiled matrix multiplication?

  • A. Reduced number of accesses to global memory
  • B. Improved use of shared memory bandwidth
  • C. Easier to debug than standard matrix multiplication
  • D. Enhanced performance for large matrices
Answer: C. Easier to debug than standard matrix multiplication
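A rough count, assuming the TILE_WIDTH = 16 sketch above, makes the first benefit concrete. The naive kernel issues about \( 2N^3 \) global-memory reads for \( N \times N \) matrices (N reads of A and N reads of B per output element), while the tiled kernel loads each tile element once per phase, for about \( 2N^3 / T \) reads with tile width \( T \):

\[
\frac{\text{global reads (naive)}}{\text{global reads (tiled)}} \approx \frac{2N^3}{2N^3 / T} = T,
\]

so a 16-wide tile cuts global-memory traffic by roughly a factor of 16.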