Using Shared Memory for Matrix Multiplication in CUDA

1. Why is shared memory beneficial for matrix multiplication on GPUs?

  • A. Shared memory has higher bandwidth and lower latency than global memory
  • B. Shared memory is larger than global memory
  • C. Shared memory persists between kernel launches
  • D. Shared memory can be used by both the CPU and GPU
Answer: A. Shared memory has higher bandwidth and lower latency than global memory
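For reference, here is a sketch of the baseline that the shared-memory approaches below improve on: a kernel that reads every operand straight from global memory. The square N x N row-major layout and the one-thread-per-output-element mapping are assumptions made to keep the sketch short.

```cuda
// Naive kernel (sketch): every A and B element is fetched from global memory.
// Assumes square N x N matrices in row-major order, one thread per C element.
__global__ void matMulNaive(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];  // 2N global reads per output element
        C[row * N + col] = sum;
    }
}
```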

Basic Approaches for Using Shared Memory

2. In the first shared memory approach, why is only a single row of matrix A stored in shared memory?

  • A. To reduce memory usage on the device
  • B. To allow quick access by threads working on the same row of the result matrix
  • C. To ensure that each thread has unique data
  • D. To avoid using global memory altogether
Answer: B. To allow quick access by threads working on the same row of the result matrix
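A minimal sketch of what this first approach can look like; the one-block-per-row mapping, the MAX_N bound, and the strided loads are illustrative assumptions, not necessarily the exact kernel the question refers to.

```cuda
// First approach (sketch): each block computes one row of C, so the matching
// row of A is staged once in shared memory and reused by every thread.
// Assumes row-major N x N matrices with N <= MAX_N and one block per row.
#define MAX_N 1024

__global__ void matMulRowShared(const float *A, const float *B, float *C, int N)
{
    __shared__ float Arow[MAX_N];            // one full row of A

    int row = blockIdx.x;                    // one block per row of C

    // Cooperatively copy the row from global to shared memory.
    for (int k = threadIdx.x; k < N; k += blockDim.x)
        Arow[k] = A[row * N + k];
    __syncthreads();                         // the row must be complete before use

    // Each thread produces one or more elements of this row of C.
    for (int col = threadIdx.x; col < N; col += blockDim.x) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += Arow[k] * B[k * N + col]; // A now comes from shared memory
        C[row * N + col] = sum;
    }
}
```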

3. True or False: In the second shared memory approach, both a row of matrix A and a column of matrix B are stored in shared memory to reduce global memory accesses.

Answer: True
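The original kernel for this second approach is not reproduced in the quiz, so the sketch below only illustrates the staging pattern the question describes: one block per output element, with the needed row of A and column of B copied into shared memory before a shared-memory reduction. The block-per-element mapping, the 256-thread block size, and MAX_N are all assumptions; the real data-reuse payoff comes from the tiled scheme in the next section.

```cuda
// Second approach (illustrative sketch only): stage both a row of A and a
// column of B in shared memory. Assumes one block per C element, a
// power-of-two block size of 256 threads, and N <= MAX_N.
#define MAX_N 1024

__global__ void matMulRowColShared(const float *A, const float *B, float *C, int N)
{
    __shared__ float Arow[MAX_N];
    __shared__ float Bcol[MAX_N];
    __shared__ float partial[256];           // one slot per thread (blockDim.x == 256 assumed)

    int row = blockIdx.y;
    int col = blockIdx.x;

    // Cooperative, strided copy of the row and the column into shared memory.
    for (int k = threadIdx.x; k < N; k += blockDim.x) {
        Arow[k] = A[row * N + k];
        Bcol[k] = B[k * N + col];
    }
    __syncthreads();

    // Each thread accumulates a strided slice of the dot product from shared memory.
    float sum = 0.0f;
    for (int k = threadIdx.x; k < N; k += blockDim.x)
        sum += Arow[k] * Bcol[k];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction of the per-thread partial sums.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        C[row * N + col] = partial[0];
}
```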

Tiled Matrix Multiplication

4. What is the primary purpose of dividing matrices into tiles in tiled matrix multiplication?

  • A. To simplify the matrix multiplication algorithm
  • B. To reduce the number of required multiplications
  • C. To load data into shared memory in manageable chunks, reducing global memory accesses
  • D. To improve readability of the kernel code
Answer: C. To load data into shared memory in manageable chunks, reducing global memory accesses
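For concreteness, here is a common form of the tiled kernel (a sketch, not necessarily the exact code the quiz is based on). TILE_WIDTH = 16, square matrices, and N being a multiple of TILE_WIDTH are simplifying assumptions. The two __syncthreads() calls and the loop over i are the details questions 5 and 6 ask about.

```cuda
// Tiled kernel (sketch): assumes square N x N row-major matrices with
// N a multiple of TILE_WIDTH, and TILE_WIDTH x TILE_WIDTH thread blocks.
#define TILE_WIDTH 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float Atile[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Btile[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float sum = 0.0f;
    // Loop over the tiles of A and B needed for this C element (question 6).
    for (int i = 0; i < N / TILE_WIDTH; ++i) {
        // Each thread loads one element of the current A tile and one of the B tile.
        Atile[threadIdx.y][threadIdx.x] = A[row * N + i * TILE_WIDTH + threadIdx.x];
        Btile[threadIdx.y][threadIdx.x] = B[(i * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();   // wait until the whole tile is in shared memory (question 5)

        // Partial dot product using only shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += Atile[threadIdx.y][k] * Btile[k][threadIdx.x];
        __syncthreads();   // make sure the tile is no longer needed before overwriting it
    }
    C[row * N + col] = sum;
}
```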

5. In the tiled matrix multiplication kernel, what does __syncthreads() achieve after loading each tile?

  • A. It allows threads in different blocks to synchronize
  • B. It ensures that all threads within the block have loaded their portion of the tile into shared memory before proceeding
  • C. It prevents overwriting of shared memory
  • D. It initiates data transfer from global to shared memory
Answer: B. It ensures that all threads within the block have loaded their portion of the tile into shared memory before proceeding

6. Why do we need a loop over i in the tiled matrix multiplication kernel?

  • A. To initialize shared memory before each kernel launch
  • B. To load each portion (tile) of matrices A and B into shared memory in steps, reducing global memory loads
  • C. To handle matrix dimensions that are not multiples of the tile width
  • D. To compute the result matrix on the host
Answer: B. To load each portion (tile) of matrices A and B into shared memory in steps, reducing global memory loads

Benefits of Tiled Matrix Multiplication

7. Which of the following is NOT a benefit of using tiled matrix multiplication?

  • A. Reduced number of accesses to global memory
  • B. Improved use of shared memory bandwidth
  • C. Easier to debug than standard matrix multiplication
  • D. Enhanced performance for large matrices
Answer: C. Easier to debug than standard matrix multiplication
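A rough count, assuming the TILE_WIDTH = 16 sketch above, makes the first benefit concrete. The naive kernel issues about \( 2N^3 \) global-memory reads for \( N \times N \) matrices (N reads of A and N reads of B per output element), while the tiled kernel loads each tile element once per phase, for about \( 2N^3 / T \) reads with tile width \( T \):

\[
\frac{\text{global reads (naive)}}{\text{global reads (tiled)}} \approx \frac{2N^3}{2N^3 / T} = T,
\]

so a 16-wide tile cuts global-memory traffic by roughly a factor of 16.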