Matrix Multiplication¶
This document explores various OpenMP offloading constructs used to parallelize matrix multiplication in both C/C++ and Fortran, focusing on how different compute and data mapping clauses affect GPU execution. Each option showcases a unique approach to efficiently map data and utilize GPU threads.
Matrix Multiplication in C/C++¶
Option 1: #pragma omp target parallel for collapse(N)¶
In this option, we utilize target parallel for with collapse(2), which merges the outer row and col loops so their combined iteration space is executed in parallel across GPU threads. The private clause gives each thread its own copies of row, col, and i to avoid data hazards; note that these variables must be declared before the directive so the clause can reference them.
// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *c, int n) {
    int row, col, i; /* declared before the directive so the private clause can name them */
#pragma omp target parallel for collapse(2) private(row, col, i)
    for (row = 0; row < n; row++) {
        for (col = 0; col < n; col++) {
            float sum = 0;
            for (i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}
Compilation Output Explanation:¶
- target parallel for: Offloads the outer loop and parallelizes it across GPU threads.
- collapse(2): Collapses the row and col loops to distribute the work across threads efficiently.
- private(row, col, i): Private copies of row, col, and i for each thread ensure correct calculations.
Matrix_Multiplication:
14, #omp target parallel for
14, Generating "nvkernel_Matrix_Multiplication_F1L14_2" GPU kernel
16, Loop parallelized across threads(128), schedule(static)
main:
51, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
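The map(tofrom:...) line reported for main comes from the host code that calls the kernel. As a point of reference, a minimal driver might look like the sketch below; the matrix size N, the initialization, and the explicit target data region are assumptions rather than part of the original source.
// Hypothetical host driver (sketch, not from the original source)
#include <stdlib.h>

#define N 1024 /* assumed matrix dimension */

void Matrix_Multiplication(float *a, float *b, float *c, int n);

int main(void) {
    float *a = malloc(sizeof(float) * N * N);
    float *b = malloc(sizeof(float) * N * N);
    float *c = malloc(sizeof(float) * N * N);
    /* ... fill a and b with input data ... */

    /* Map all three matrices for the duration of the offloaded call; the
       target region inside Matrix_Multiplication then finds them present. */
    #pragma omp target data map(tofrom: a[0:N*N], b[0:N*N], c[0:N*N])
    {
        Matrix_Multiplication(a, b, c, N);
    }

    free(a);
    free(b);
    free(c);
    return 0;
}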
Option 2: #pragma omp target teams distribute parallel for collapse(N)¶
This option introduces teams distribute to create a league of teams. Each team executes a portion of the matrix multiplication, with the threads inside each team working on the calculations in parallel. This approach scales across the GPU by dividing the work among teams as well as threads.
// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *c, int n) {
    int row, col, i; /* declared before the directive so the private clause can name them */
#pragma omp target teams distribute parallel for collapse(2) private(row, col, i)
    for (row = 0; row < n; row++) {
        for (col = 0; col < n; col++) {
            float sum = 0;
            for (i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}
Compilation Output Explanation:¶
- target teams distribute parallel for: Creates a league of teams whose threads work on portions of the matrix, improving GPU utilization.
- collapse(2): Collapses the row and col loops.
- private(row, col, i): Ensures each thread has its own copies of the variables, preventing race conditions.
Matrix_Multiplication:
13, #omp target teams distribute parallel for
13, Generating "nvkernel_Matrix_Multiplication_F1L13_2" GPU kernel
Loop parallelized across teams and threads(128), schedule(static)
main:
50, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
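When no num_teams clause is given, the compiler and runtime choose the league size. A small probe like the sketch below (an illustration, not part of the original example) can report what was actually launched, using the standard omp_get_num_teams and omp_get_num_threads routines.
// Hypothetical probe: report the league and team sizes the runtime picks
#include <stdio.h>
#include <omp.h>

int main(void) {
    int teams = 0, threads = 0;
    #pragma omp target teams map(from: teams, threads)
    {
        if (omp_get_team_num() == 0) {
            teams = omp_get_num_teams();
            #pragma omp parallel
            {
                if (omp_get_thread_num() == 0)
                    threads = omp_get_num_threads();
            }
        }
    }
    printf("teams = %d, threads per team = %d\n", teams, threads);
    return 0;
}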
Option 3: #pragma omp target teams distribute parallel for num_teams(N) collapse(N)¶
This version adds num_teams(5), requesting a league of five teams, each of which processes a different part of the collapsed iteration space. Capping the number of teams gives fine-grained control on systems where you want to match the league size to the available hardware resources.
// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *c, int n) {
    int row, col, i; /* declared before the directive so the private clause can name them */
#pragma omp target teams distribute parallel for num_teams(5) collapse(2) private(row, col, i)
    for (row = 0; row < n; row++) {
        for (col = 0; col < n; col++) {
            float sum = 0;
            for (i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}
Compilation Output Explanation:¶
- num_teams(5): Requests five teams, allowing fine-grained control over GPU resource allocation.
- collapse(2): Combines the row and col loops for more efficient parallelization.
- private(row, col, i): Maintains thread safety by giving each thread separate copies of the loop variables.
Matrix_Multiplication:
13, #omp target teams distribute parallel for num_teams(5)
13, Generating "nvkernel_Matrix_Multiplication_F1L13_2" GPU kernel
Loop parallelized across teams and threads(128), schedule(static)
main:
50, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
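The hard-coded num_teams(5) can also be computed from the problem size. The variant below is a sketch (the function name and the one-team-per-128-iterations heuristic are assumptions); it ties the league size to the collapsed n*n iteration space and the 128 threads per team reported in the output above.
// Hypothetical variant: derive the league size from the problem size
void Matrix_Multiplication_sized(float *a, float *b, float *c, int n) {
    int row, col, i;
    int teams = (n * n + 127) / 128; /* assumed heuristic: one team per 128 collapsed iterations */
#pragma omp target teams distribute parallel for num_teams(teams) collapse(2) private(row, col, i)
    for (row = 0; row < n; row++) {
        for (col = 0; col < n; col++) {
            float sum = 0;
            for (i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}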
Matrix Multiplication in Fortran¶
The following sections describe similar options in Fortran, using OpenMP offloading constructs to enable parallel execution on GPUs.
Option 1: !$omp target loop collapse(N)¶
subroutine Matrix_Multiplication(a, b, c, n)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n, n), b(n, n)
    real(8), intent(out) :: c(n, n)
    integer :: row, col, i
    real(8) :: sum

    !$omp target loop collapse(2) private(row, col, i)
    do row = 1, n
        do col = 1, n
            sum = 0.0d0
            do i = 1, n
                sum = sum + a(row, i) * b(i, col)
            end do
            c(row, col) = sum
        end do
    end do
    !$omp end target loop
end subroutine
Compilation Output Explanation:¶
- target loop: Offloads the computation to the GPU.
- collapse(2): Collapses the row and col loops for efficient parallel execution.
- private(row, col, i): Isolates each thread's variables, ensuring no data conflicts.
Matrix_Multiplication:
16, #omp target loop
16, Generating "nvkernel_Matrix_Multiplication_F1L16_2" GPU kernel
Generating NVIDIA GPU code
16, Loop parallelized across teams, threads(128) collapse(2) /* blockIdx.x threadIdx.x */
18, /* blockIdx.x threadIdx.x collapsed */
20, Loop run sequentially
16, Generating Multicore code
16, Loop parallelized across threads
16, Generating implicit map(to:a[:],b[:])
Generating implicit map(from:c[:])
18, Generating implicit private(sum)
20, Loop is parallelizable
main:
53, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
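As in the C examples, the map(tofrom:...) line for main comes from the caller. A minimal Fortran driver might look like the sketch below; the matrix size, the random initialization, and the explicit target data region are assumptions, not taken from the original source.
! Hypothetical host driver (sketch, not from the original source)
program main
    implicit none
    integer, parameter :: n = 1024   ! assumed matrix dimension
    real(8), allocatable :: a(:, :), b(:, :), c(:, :)

    allocate(a(n, n), b(n, n), c(n, n))
    call random_number(a)
    call random_number(b)

    ! Keep the matrices resident on the device across the offloaded call
    !$omp target data map(tofrom: a, b, c)
    call Matrix_Multiplication(a, b, c, n)
    !$omp end target data

    deallocate(a, b, c)
end program main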
Option 2: !$omp target teams loop collapse(N)¶
subroutine Matrix_Multiplication(a, b, c, n)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n, n), b(n, n)
    real(8), intent(out) :: c(n, n)
    integer :: row, col, i
    real(8) :: sum

    !$omp target teams loop collapse(2) private(row, col, i)
    do row = 1, n
        do col = 1, n
            sum = 0.0d0
            do i = 1, n
                sum = sum + a(row, i) * b(i, col)
            end do
            c(row, col) = sum
        end do
    end do
    !$omp end target teams loop
end subroutine
Compilation Output Explanation:¶
- target teams loop: Executes across a league of teams, ideal for distributing large workloads.
- collapse(2): Merges the row and col loops to improve parallelization.
- private(row, col, i): Ensures thread safety by creating isolated copies.
Matrix_Multiplication:
16, #omp target teams loop
16, Generating "nvkernel_Matrix_Multiplication_F1L16_2" GPU kernel
Generating NVIDIA GPU code
16, Loop parallelized across teams, threads(128) collapse(2) /* blockIdx.x threadIdx.x */
18, /* blockIdx.x threadIdx.x collapsed */
20, Loop run sequentially
16, Generating Multicore code
16, Loop parallelized across threads
16, Generating implicit map(to:a[:],b[:])
Generating implicit map(from:c[:])
18, Generating implicit private(sum)
20, Loop is parallelizable
main:
53, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
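Here too the runtime picks the league size when num_teams is absent. A small probe (an illustration, not part of the original example) can report it via omp_get_num_teams from the omp_lib module.
! Hypothetical probe: report how many teams the runtime launches
program probe_teams
    use omp_lib
    implicit none
    integer :: nteams

    nteams = 0
    !$omp target teams map(from: nteams)
    if (omp_get_team_num() == 0) nteams = omp_get_num_teams()
    !$omp end target teams
    print *, 'teams launched: ', nteams
end program probe_teams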
Option 3: !$omp target teams loop num_teams(N) collapse(N)¶
subroutine Matrix_Multiplication(a, b, c, n)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n, n), b(n, n)
    real(8), intent(out) :: c(n, n)
    integer :: row, col, i
    real(8) :: sum

    !$omp target teams loop num_teams(5) collapse(2) private(row, col, i)
    do row = 1, n
        do col = 1, n
            sum = 0.0d0
            do i = 1, n
                sum = sum + a(row, i) * b(i, col)
            end do
            c(row, col) = sum
        end do
    end do
    !$omp end target teams loop
end subroutine
Compilation Output Explanation:¶
- num_teams(5): Requests five teams for distributed execution.
- collapse(2): Combines the row and col loops for improved load balancing.
- private(row, col, i): Prevents race conditions by giving each thread its own copies of the loop variables.
Matrix_Multiplication:
16, #omp target teams loop num_teams(5)
16, Generating "nvkernel_Matrix_Multiplication_F1L16_2" GPU kernel
Generating NVIDIA GPU code
16, Loop parallelized across teams(5), threads(128) collapse(2) /* blockIdx.x threadIdx.x */
18, /* blockIdx.x threadIdx.x collapsed */
20, Loop run sequentially
16, Generating Multicore code
16, Loop parallelized across threads
16, Generating implicit map(to:a[:],b[:])
Generating implicit map(from:c[:])
18, Generating implicit private(sum)
20, Loop is parallelizable
main:
53, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
Summary¶
These options for matrix multiplication using OpenMP offloading provide several ways to leverage GPUs for efficient parallel computation:
- Collapse Clauses: Collapsing the loops with collapse(2) merges the two outer loops into a single iteration space, increasing the parallelism available to the GPU and balancing workloads.
- Teams and Threads: Variants like target parallel for and target teams distribute let users control how work is divided across threads and teams, optimizing GPU resource use.
- Data Mapping: Each example relies on data mapping to manage memory transfers between host and device, which is crucial for performance on heterogeneous systems (see the sketch below).
Each approach provides unique trade-offs in resource utilization, parallelism, and memory handling, enabling developers to tailor code to the specific architecture and problem size.
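One such trade-off concerns transfer volume. Since a and b are read-only inputs and c is write-only output (exactly what the Fortran compiler inferred with its implicit map(to:a,b) and map(from:c)), the map(tofrom:...) on all three matrices can be tightened with one-way maps. The variant below is a sketch; the function name is hypothetical.
// Hypothetical variant: one-way maps avoid two unnecessary transfers
// compared with map(tofrom:...) on all three matrices
void Matrix_Multiplication_mapped(float *a, float *b, float *c, int n) {
    int row, col, i;
#pragma omp target teams distribute parallel for collapse(2) private(row, col, i) \
        map(to: a[0:n*n], b[0:n*n]) map(from: c[0:n*n])
    for (row = 0; row < n; row++) {
        for (col = 0; col < n; col++) {
            float sum = 0;
            for (i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}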