Matrix Multiplication

This document explores various OpenMP offloading constructs used to parallelize matrix multiplication in both C/C++ and Fortran, focusing on how different compute and data mapping clauses affect GPU execution. Each option showcases a unique approach to efficiently map data and utilize GPU threads.


Matrix Multiplication in C/C++

Option 1: #pragma omp target parallel for collapse(N)

In this option, we use target parallel for with collapse(2), which merges the outer row and col loops into a single iteration space executed in parallel across GPU threads. Because row, col, and i are declared inside the for statements (and sum inside the loop body), each thread automatically gets its own private copies, avoiding data hazards.
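
To make the effect of collapse(2) concrete, here is a hypothetical hand-collapsed version of the same loop nest: the single k loop below is the n*n-element iteration space that the directive creates and divides among GPU threads. It is shown for illustration only; the directive-based version follows.

// collapse_illustration.c -- hypothetical, for illustration only.
void Matrix_Multiplication_collapsed(float *a, float *b, float *c, int n) {
    for (int k = 0; k < n * n; k++) {
        int row = k / n;   // recover the outer loop index
        int col = k % n;   // recover the inner loop index
        float sum = 0;
        for (int i = 0; i < n; i++) {
            sum += a[row * n + i] * b[i * n + col];
        }
        c[row * n + col] = sum;
    }
}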

// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *c, int n) {
    #pragma omp target parallel for collapse(2)
    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            float sum = 0;
            for (int i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}

Compilation Output Explanation:

  • target parallel for: Offloads the loop nest and parallelizes it across the threads of a single team.
  • collapse(2): Merges the row and col loops so that their combined iterations are distributed across threads efficiently.
  • The loop variables row, col, and i (and sum, declared in the loop body) are implicitly private to each thread, ensuring correct calculations.
Matrix_Multiplication:
     14, #omp target parallel for
         14, Generating "nvkernel_Matrix_Multiplication_F1L14_2" GPU kernel
         16, Loop parallelized across threads(128), schedule(static)
main:
     51, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
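
The map(tofrom:...) line above refers to the host driver, which this document does not show; the listing style matches the NVIDIA HPC SDK's -Minfo=mp output (e.g., nvc -mp=gpu -Minfo=mp Matrix_Multiplication.c). Below is a minimal sketch of such a main, where the size N and the explicit data region are our assumptions rather than code from the original:

// main.c -- hypothetical driver; the original main is not shown here.
#include <stdio.h>
#include <stdlib.h>
#define N 1024

void Matrix_Multiplication(float *a, float *b, float *c, int n);

int main(void) {
    float *a = malloc(sizeof(float) * N * N);
    float *b = malloc(sizeof(float) * N * N);
    float *c = malloc(sizeof(float) * N * N);
    for (int i = 0; i < N * N; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    // A data region like this yields a map(tofrom:a[:N*N],c[:N*N],b[:N*N])
    // line of the form seen in the listing above.
    #pragma omp target data map(tofrom: a[:N*N], b[:N*N], c[:N*N])
    {
        Matrix_Multiplication(a, b, c, N);
    }
    printf("c[0] = %f (expect %f)\n", c[0], (float)(2 * N));
    free(a); free(b); free(c);
    return 0;
}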

Option 2: #pragma omp target teams distribute parallel for collapse(N)

This option introduces teams distribute to create multiple teams. Each team executes a portion of the matrix multiplication, with the threads within each team working on the calculations in parallel. This approach scales efficiently across the GPU by dividing the work among multiple teams.

// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *c, int n) {
    #pragma omp target teams distribute parallel for collapse(2)
    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            float sum = 0;
            for (int i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}

Compilation Output Explanation:

  • target teams distribute parallel for: Creates a league of teams and distributes the collapsed iterations across them, with the threads of each team working in parallel; this improves GPU utilization over a single team.
  • collapse(2): Collapses the row and col loops into one iteration space.
  • Loop variables declared inside the loops are implicitly private to each thread, preventing race conditions.
Matrix_Multiplication:
     13, #omp target teams distribute parallel for
         13, Generating "nvkernel_Matrix_Multiplication_F1L13_2" GPU kernel
             Loop parallelized across teams and threads(128), schedule(static)
main:
     50, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])

Option 3: #pragma omp target teams distribute parallel for num_teams(N) collapse(N)

This version adds num_teams(5), which requests at most five teams; each team processes a share of the collapsed loop iterations. Capping the league size gives control over how much of the GPU the kernel occupies and can be used to match the launch to the available hardware resources. A small runtime check of the resulting league size is sketched below, followed by the variant itself.
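
The check uses the standard omp_get_num_teams() and omp_get_team_num() routines; it is a hypothetical helper, not part of the original example, and only confirms how many teams the runtime actually created.

// teams_check.c -- hypothetical sketch, not part of the original example.
#include <stdio.h>
#include <omp.h>

int main(void) {
    int teams = 0;
    // Every team's initial thread sees the same league size, so letting
    // team 0 alone record it avoids any need for synchronization.
    #pragma omp target teams num_teams(5) map(tofrom: teams)
    {
        if (omp_get_team_num() == 0)
            teams = omp_get_num_teams();
    }
    printf("league size at run time: %d\n", teams);
    return 0;
}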

// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *c, int n) {
    #pragma omp target teams distribute parallel for num_teams(5) collapse(2)
    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            float sum = 0;
            for (int i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}

Compilation Output Explanation:

  • num_teams(5): Requests up to five teams, allowing fine-grained control over GPU resource allocation.
  • collapse(2): Combines the row and col loops for more efficient parallelization.
  • Loop variables declared inside the loops are implicitly private to each thread, maintaining thread safety.
Matrix_Multiplication:
     13, #omp target teams distribute parallel for num_teams(5)
         13, Generating "nvkernel_Matrix_Multiplication_F1L13_2" GPU kernel
             Loop parallelized across teams and threads(128), schedule(static)
main:
     50, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
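
Whichever of these C variants is used, it is worth validating the offloaded result against a host reference. The helper below is a hypothetical sketch of ours, not code from the original document:

// verify.c -- hypothetical host-side check, applicable to any option above.
#include <math.h>
#include <stdio.h>

// Sequential reference on the host; returns 1 if the GPU result c matches.
int Verify_Result(const float *a, const float *b, const float *c, int n) {
    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            float sum = 0;
            for (int i = 0; i < n; i++)
                sum += a[row * n + i] * b[i * n + col];
            // Tolerate rounding differences between host and device.
            if (fabsf(sum - c[row * n + col]) > 1e-3f * fmaxf(fabsf(sum), 1.0f)) {
                printf("mismatch at (%d, %d)\n", row, col);
                return 0;
            }
        }
    }
    return 1;
}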

Matrix Multiplication in Fortran

The following sections describe similar options in Fortran, using OpenMP offloading constructs to enable parallel execution on GPUs.

Option 1: !$omp target loop collapse(N)

subroutine Matrix_Multiplication(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n, n), b(n, n)
  real(8), intent(out) :: c(n, n)
  integer :: row, col, i
  real(8) :: sum

  !$omp target loop collapse(2) private(row, col, i)
  do row = 1, n
     do col = 1, n
        sum = 0.0
        do i = 1, n
           sum = sum + a(row, i) * b(i, col)
        end do
        c(row, col) = sum
     end do
  end do
  !$omp end target loop
end subroutine

Compilation Output Explanation:

  • target loop: Offloads the computation to the GPU, leaving the parallelization strategy to the compiler.
  • collapse(2): Collapses the row and col loops for efficient parallel execution.
  • private(row, col, i): Isolates each thread's loop variables; the listing shows that the compiler also makes sum implicitly private and generates implicit maps that match the declared intents (map(to:) for the intent(in) arrays, map(from:) for the intent(out) result).
Matrix_Multiplication:
     16, #omp target loop
         16, Generating "nvkernel_Matrix_Multiplication_F1L16_2" GPU kernel
             Generating NVIDIA GPU code
           16, Loop parallelized across teams, threads(128) collapse(2) /* blockIdx.x threadIdx.x */
           18,   /* blockIdx.x threadIdx.x collapsed */
           20, Loop run sequentially 
         16, Generating Multicore code
           16, Loop parallelized across threads
     16, Generating implicit map(to:a[:],b[:]) 
         Generating implicit map(from:c[:]) 
     18, Generating implicit private(sum)
     20, Loop is parallelizable
main:
     53, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])

Option 2: !$omp target teams loop collapse(N)

subroutine Matrix_Multiplication(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n, n), b(n, n)
  real(8), intent(out) :: c(n, n)
  integer :: row, col, i
  real(8) :: sum

  !$omp target teams loop collapse(2) private(row, col, i)
  do row = 1, n
     do col = 1, n
        sum = 0.0
        do i = 1, n
           sum = sum + a(row, i) * b(i, col)
        end do
        c(row, col) = sum
     end do
  end do
  !$omp end target teams loop
end subroutine

Compilation Output Explanation:

  • target teams loop: Executes across a league of teams, ideal for distributing large workloads.
  • collapse(2): Merges row and col loops to improve parallelization.
  • private(row, col, i): Ensures thread safety by giving each thread isolated copies of the loop variables.
Matrix_Multiplication:
     16, #omp target teams loop
         16, Generating "nvkernel_Matrix_Multiplication_F1L16_2" GPU kernel
             Generating NVIDIA GPU code
           16, Loop parallelized across teams, threads(128) collapse(2) /* blockIdx.x threadIdx.x */
           18,   /* blockIdx.x threadIdx.x collapsed */
           20, Loop run sequentially 
         16, Generating Multicore code
           16, Loop parallelized across threads
     16, Generating implicit map(to:a[:],b[:]) 
         Generating implicit map(from:c[:]) 
     18, Generating implicit private(sum)
     20, Loop is parallelizable
main:
     53, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])

Option 3: !$omp target teams loop num_teams(N) collapse(N)

subroutine Matrix_Multiplication(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n, n), b(n, n)
  real(8), intent(out) :: c(n, n)
  integer :: row, col, i
  real(8) :: sum

  !$omp target teams loop num_teams(5) collapse(2) private(row, col, i)
  do row = 1, n
     do col = 1, n
        sum = 0.0
        do i = 1, n
           sum = sum + a(row, i) * b(i, col)
        end do
        c(row, col) = sum
     end do
  end do
  !$omp end target teams loop
end subroutine

Compilation Output Explanation:

  • num_teams(5): Requests up to five teams for distributed execution.
  • collapse(2): Combines the row and col loops for improved load balancing.
  • private(row, col, i): Prevents race conditions by giving each thread its own loop variables.
Matrix_Multiplication:
     16, #omp target teams loop num_teams(5)
         16, Generating "nvkernel_Matrix_Multiplication_F1L16_2" GPU kernel
             Generating NVIDIA GPU code
           16, Loop parallelized across teams(5), threads(128) collapse(2) /* blockIdx.x threadIdx.x */
           18,   /* blockIdx.x threadIdx.x collapsed */
           20, Loop run sequentially 
         16, Generating Multicore code
           16, Loop parallelized across threads
     16, Generating implicit map(to:a[:],b[:]) 
         Generating implicit map(from:c[:]) 
     18, Generating implicit private(sum)
     20, Loop is parallelizable
main:
     53, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])

Summary

These options for matrix multiplication using OpenMP offloading provide several ways to leverage GPUs for efficient parallel computation:

  • Collapse Clauses: collapse(2) merges the row and col loops into a single iteration space, exposing more parallelism and balancing the workload.
  • Teams and Threads: Variants such as target parallel for and target teams distribute let users control how work is distributed across threads and teams, optimizing GPU resource use.
  • Data Mapping: Each example relies on data mapping to manage memory transfers between host and device, which is crucial for performance on heterogeneous systems; an explicit device-resident data region is sketched after this summary.

Each approach provides unique trade-offs in resource utilization, parallelism, and memory handling, enabling developers to tailor code to the specific architecture and problem size.
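
For the data-mapping point in particular, code that launches the kernel repeatedly benefits from keeping the matrices resident on the device rather than remapping them on every call. The sketch below is a hypothetical usage of target enter data / target exit data under that assumption (Multiply_Many and reps are our invention, not part of the original examples):

// data_region.c -- hypothetical sketch: keep the matrices resident on the
// device across repeated calls instead of remapping them at every launch.
void Matrix_Multiplication(float *a, float *b, float *c, int n);

void Multiply_Many(float *a, float *b, float *c, int n, int reps) {
    // Copy the inputs once; allocate c on the device with no initial copy.
    #pragma omp target enter data map(to: a[:n*n], b[:n*n]) map(alloc: c[:n*n])
    for (int r = 0; r < reps; r++)
        Matrix_Multiplication(a, b, c, n);  // reuses the resident buffers
    // Bring the final result back and release the device memory.
    #pragma omp target exit data map(from: c[:n*n]) map(release: a[:n*n], b[:n*n])
}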