Using Shared Memory for Matrix Multiplication in CUDA¶
1. Why is shared memory beneficial for matrix multiplication on GPUs?
- A. Shared memory has higher bandwidth and lower latency than global memory
- B. Shared memory is larger than global memory
- C. Shared memory persists between kernel launches
- D. Shared memory can be used by both the CPU and GPU
Click to reveal the answer
Answer: A. Shared memory has higher bandwidth and lower latency than global memory

Basic Approaches for Using Shared Memory¶
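Questions 2 and 3 below refer to two basic kernels. As a reference point, here is a minimal sketch of the first approach, in which each block computes one row of C and stages the corresponding row of A in shared memory. The kernel name, the square N x N layout, and the assumption that blockDim.x == N (which limits N to 1024 here) are illustrative choices, not necessarily the lesson's exact code.

```cuda
// Approach 1 (sketch): one block per row of C; the row of A that every
// thread in the block needs is loaded into shared memory once, while
// B is still read from global memory.
__global__ void matmul_row_shared(const float *A, const float *B,
                                  float *C, int N)
{
    extern __shared__ float rowA[];   // one row of A (N floats)

    int row = blockIdx.x;             // this block's row of C
    int col = threadIdx.x;            // this thread's column of C

    // Each thread loads one element of the shared row.
    rowA[col] = A[row * N + col];
    __syncthreads();                  // the row must be complete before use

    // Dot product of the shared row of A with column `col` of B.
    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += rowA[k] * B[k * N + col];

    C[row * N + col] = sum;
}
```

A launch would look like `matmul_row_shared<<<N, N, N * sizeof(float)>>>(d_A, d_B, d_C, N);`. The second approach referred to in question 3 additionally keeps a column of B in shared memory, cutting global memory reads further.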
2. In the first shared memory approach, why is only a single row of matrix A stored in shared memory?
- A. To reduce memory usage on the device
- B. To allow quick access by threads working on the same row of the result matrix
- C. To ensure that each thread has unique data
- D. To avoid using global memory altogether
Click to reveal the answer
Answer: B. To allow quick access by threads working on the same row of the result matrix

3. True or False: In the second shared memory approach, both a row of matrix A and a column of matrix B are stored in shared memory to reduce global memory accesses.
Click to reveal the answer
Answer: True

Tiled Matrix Multiplication¶
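Questions 4 through 6 refer to the structure of a tiled kernel. For reference, here is a minimal sketch; TILE_WIDTH = 16, the variable names, and the assumption that N is a multiple of TILE_WIDTH are illustrative choices rather than the lesson's exact code.

```cuda
#define TILE_WIDTH 16

// Tiled matrix multiplication (sketch): C = A * B for square N x N matrices,
// assuming N is a multiple of TILE_WIDTH. Each block computes one
// TILE_WIDTH x TILE_WIDTH tile of C.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float sum = 0.0f;

    // Loop over i: step through the tiles of A (across its row) and
    // B (down its column), one TILE_WIDTH-wide chunk at a time.
    for (int i = 0; i < N / TILE_WIDTH; ++i) {
        // Each thread loads one element of the current tiles of A and B.
        tileA[threadIdx.y][threadIdx.x] = A[row * N + i * TILE_WIDTH + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(i * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();   // the whole tile must be in shared memory before use

        // Partial dot product using only shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();   // finish with this tile before it is overwritten
    }

    C[row * N + col] = sum;
}
```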
4. What is the primary purpose of dividing matrices into tiles in tiled matrix multiplication?
- A. To simplify the matrix multiplication algorithm
- B. To reduce the number of required multiplications
- C. To load data into shared memory in manageable chunks, reducing global memory accesses
- D. To improve readability of the kernel code
Click to reveal the answer
Answer: C. To load data into shared memory in manageable chunks, reducing global memory accesses

5. In the tiled matrix multiplication kernel, what does __syncthreads() achieve after loading each tile?
- A. It allows threads in different blocks to synchronize
- B. It ensures that all threads within the block have loaded their portion of the tile into shared memory before proceeding
- C. It prevents overwriting of shared memory
- D. It initiates data transfer from global to shared memory
Click to reveal the answer
Answer: B. It ensures that all threads within the block have loaded their portion of the tile into shared memory before proceeding

6. Why do we need a loop over i in the tiled matrix multiplication kernel?
- A. To initialize shared memory before each kernel launch
- B. To load each portion (tile) of matrices A and B into shared memory in steps, reducing global memory loads
- C. To handle matrix dimensions that are not multiples of the tile width
- D. To compute the result matrix on the host
Click to reveal the answer
Answer: B. To load each portion (tile) of matrices A and B into shared memory in steps, reducing global memory loads

Benefits of Tiled Matrix Multiplication¶
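To make the saving behind question 7 concrete: in the tiled sketch above, each element of A and B is read from global memory once per tile pass instead of once per multiply-add, which cuts global memory reads by roughly a factor of TILE_WIDTH. A hypothetical host-side launch fragment for that sketch (it assumes d_A, d_B, and d_C are device buffers already holding the inputs, and N a multiple of TILE_WIDTH):

```cuda
// One thread per element of a TILE_WIDTH x TILE_WIDTH output tile,
// one block per tile of C.
dim3 block(TILE_WIDTH, TILE_WIDTH);
dim3 grid(N / TILE_WIDTH, N / TILE_WIDTH);
matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize();   // wait for the kernel before using C
```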
7. Which of the following is NOT a benefit of using tiled matrix multiplication?
- A. Reduced number of accesses to global memory
- B. Improved use of shared memory bandwidth
- C. Easier to debug than standard matrix multiplication
- D. Enhanced performance for large matrices