Matrix Multiplication¶
Script
- "In this presentation, we’ll explore how OpenMP Offloading can be used to execute matrix multiplication on GPUs. We’ll examine examples in both C/C++ and Fortran, discussing various approaches to parallelizing this operation, optimizing GPU resource utilization, and efficiently managing data transfers."
- "Our first option uses the target parallel for construct with collapse(2). The collapse clause merges the row and column loops, enabling GPU threads to handle computations for multiple rows and columns in parallel. This approach maximizes thread usage, making it suitable for handling large matrices."
- "This is the compiler output for Option 1. The compiler generates a GPU kernel for matrix multiplication, parallelizing the row and column loops across GPU threads. The map clause facilitates memory transfers between the CPU and GPU, ensuring arrays a, b, and c are appropriately transferred."
- "In Option 2, we introduce the teams distribute construct, which creates a league of teams. Each team processes a subset of the matrix, with threads within the team computing the necessary products and sums. This hierarchical structure provides more control over workload distribution across GPU cores, improving efficiency for larger computations."
- "Option 3 extends the previous example by specifying num_teams(5), which sets the number of teams to five. This adjustment allows us to explicitly control the distribution of workloads across GPU compute units. By fine-tuning the number of teams, we can optimize resource utilization and improve performance for large matrices."
- "In Fortran, we can implement similar strategies for offloading. Option 1 uses the target loop collapse(2) construct, parallelizing the row and column loops for efficient execution across GPU threads. This approach ensures that computations are evenly distributed across available resources."
- "Option 2 in Fortran introduces the teams loop construct, which organizes GPU threads into teams. Each team is responsible for processing a portion of the matrix, improving load distribution and making the approach well-suited for larger matrices where team-based execution provides significant performance gains."
- "In the final Fortran example, we specify num_teams(5) to create five teams on the GPU. This directive allows us to balance the workload across multiple compute units, optimizing performance for systems with extensive parallel processing capabilities."
- "To summarize, we explored three options for offloading matrix multiplication to GPUs in both C/C++ and Fortran. Option 1 utilizes collapse(2) for thread-level parallelism, offering simplicity and efficiency. Option 2 incorporates teams for hierarchical parallelism, enhancing workload distribution. Option 3 further refines this by specifying the number of teams, allowing for fine-tuned resource allocation and improved performance on GPUs."
- "In conclusion, OpenMP Offloading provides a powerful and flexible framework for matrix multiplication. By combining constructs like target, teams, parallel, and collapse, we can optimize GPU utilization, ensure balanced workload distribution, and manage data transfers effectively. These techniques demonstrate how OpenMP enables high-performance computing on heterogeneous systems."
This document explores various OpenMP offloading constructs used to parallelize matrix multiplication in both C/C++ and Fortran, focusing on how different compute and data mapping clauses affect GPU execution. Each option showcases a unique approach to efficiently map data and utilize GPU threads.
Matrix Multiplication in C/C++¶
Option 1: #pragma omp target parallel for collapse(N)¶
In this option, we utilize target parallel for with collapse(2), which merges the outer row and col loops, allowing them to be executed in parallel across GPU threads. The private clause gives each thread its own copies of row, col, and i to avoid data hazards.
// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *c, int n) {
    int row, col, i;  // declared before the directive so private() can reference them
#pragma omp target parallel for collapse(2) private(row, col, i)
    for (row = 0; row < n; row++) {
        for (col = 0; col < n; col++) {
            float sum = 0;
            for (i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}
Compilation Output Explanation:¶
- target parallel for: Offloads the outer loop and parallelizes it across GPU threads.
- collapse(2): Collapses the row and col loops to distribute the work across threads efficiently.
- private(row, col, i): Private copies of row, col, and i for each thread ensure correct calculations.
Matrix_Multiplication:
14, #omp target parallel for
14, Generating "nvkernel_Matrix_Multiplication_F1L14_2" GPU kernel
16, Loop parallelized across threads(128), schedule(static)
main:
51, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
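The map line under main: indicates that the arrays are mapped explicitly at the call site rather than inside the function. As a rough sketch of what such a driver might look like (N, the initialization values, and the compile line are assumptions, not taken from the original source), consider:

// matmul_driver.c: hypothetical driver sketch, not part of the original example.
// Possible build line with NVHPC (assumed): nvc -mp=gpu -Minfo=mp Matrix_Multiplication.c matmul_driver.c
#include <stdio.h>
#include <stdlib.h>

#define N 1024

void Matrix_Multiplication(float *a, float *b, float *c, int n);

int main(void) {
    float *a = malloc(sizeof(float) * N * N);
    float *b = malloc(sizeof(float) * N * N);
    float *c = malloc(sizeof(float) * N * N);
    for (int i = 0; i < N * N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // An explicit map like this in main is one way to produce the
    // "Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])" feedback shown above.
    #pragma omp target data map(tofrom: a[0:N*N], b[0:N*N], c[0:N*N])
    {
        Matrix_Multiplication(a, b, c, N);
    }

    printf("c[0] = %f (expected %f)\n", c[0], 2.0f * N);
    free(a); free(b); free(c);
    return 0;
}

With the arrays already present on the device, the implicit maps generated for the target region inside Matrix_Multiplication become no-ops, so repeated calls would not re-transfer the data.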
Option 2: #pragma omp target teams distribute parallel for collapse(N)¶
This option introduces teams distribute to create multiple teams. Each team executes a portion of the matrix multiplication, with threads within teams working on the calculations in parallel. This approach scales efficiently across the GPU by dividing work between multiple teams.
// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *c, int n) {
    int row, col, i;
#pragma omp target teams distribute parallel for collapse(2) private(row, col, i)
    for (row = 0; row < n; row++) {
        for (col = 0; col < n; col++) {
            float sum = 0;
            for (i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}
Compilation Output Explanation:¶
- target teams distribute parallel for: Creates a league of teams whose threads work on portions of the matrix, improving GPU utilization.
- collapse(2): Collapses the row and col loops.
- private(row, col, i): Ensures each thread has its own copies of the loop variables, preventing race conditions.
Matrix_Multiplication:
13, #omp target teams distribute parallel for
13, Generating "nvkernel_Matrix_Multiplication_F1L13_2" GPU kernel
Loop parallelized across teams and threads(128), schedule(static)
main:
50, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
Option 3: #pragma omp target teams distribute parallel for num_teams(N) collapse(N)¶
This version adds num_teams(5), which requests a league of five teams, each processing a different part of the collapsed iteration space. Capping the team count gives explicit control over how work is spread across the GPU's compute units, which can help match the league size to the available hardware resources.
// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *c, int n) {
    int row, col, i;
#pragma omp target teams distribute parallel for num_teams(5) collapse(2) private(row, col, i)
    for (row = 0; row < n; row++) {
        for (col = 0; col < n; col++) {
            float sum = 0;
            for (i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}
Compilation Output Explanation:¶
- num_teams(5): Creates five teams, allowing fine-grained control over GPU resource allocation.
- collapse(2): Combines the row and col loops for more efficient parallelization.
- private(row, col, i): Maintains thread safety by allocating separate copies of each variable.
Matrix_Multiplication:
13, #omp target teams distribute parallel for num_teams(5)
13, Generating "nvkernel_Matrix_Multiplication_F1L13_2" GPU kernel
Loop parallelized across teams and threads(128), schedule(static)
main:
50, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
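A detail worth noting: the OpenMP specification treats num_teams as an upper bound, so the runtime may create fewer teams than requested. The following small, self-contained sketch (hypothetical, not from the original code) reports the league size the runtime actually launches for num_teams(5):

// check_teams.c: hypothetical sketch for inspecting the league size.
#include <omp.h>
#include <stdio.h>

int main(void) {
    int teams = 0;
    // map(from: teams) copies the value observed on the device back to the host.
    #pragma omp target teams num_teams(5) map(from: teams)
    {
        if (omp_get_team_num() == 0)   // one team records the league size
            teams = omp_get_num_teams();
    }
    printf("Runtime launched %d team(s)\n", teams);
    return 0;
}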
Matrix Multiplication in Fortran¶
The following sections describe similar options in Fortran, using OpenMP offloading constructs to enable parallel execution on GPUs.
Option 1: !$omp target loop collapse(N)¶
subroutine Matrix_Multiplication(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n, n), b(n, n)
  real(8), intent(out) :: c(n, n)
  integer :: row, col, i
  real(8) :: sum  ! implicitly privatized by the compiler (see feedback below)

  !$omp target loop collapse(2) private(row, col, i)
  do row = 1, n
    do col = 1, n
      sum = 0.0
      do i = 1, n
        sum = sum + a(row, i) * b(i, col)
      end do
      c(row, col) = sum
    end do
  end do
  !$omp end target loop
end subroutine
Compilation Output Explanation:¶
- target loop: Offloads the computation to the GPU.
- collapse(2): Collapses the row and col loops for efficient parallel execution.
- private(row, col, i): Isolates each thread's loop variables, ensuring no data conflicts.
Matrix_Multiplication:
16, #omp target loop
16, Generating "nvkernel_Matrix_Multiplication_F1L16_2" GPU kernel
Generating NVIDIA GPU code
16, Loop parallelized across teams, threads(128) collapse(2) /* blockIdx.x threadIdx.x */
18, /* blockIdx.x threadIdx.x collapsed */
20, Loop run sequentially
16, Generating Multicore code
16, Loop parallelized across threads
16, Generating implicit map(to:a[:],b[:])
Generating implicit map(from:c[:])
18, Generating implicit private(sum)
20, Loop is parallelizable
main:
53, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
Option 2: !$omp target teams loop collapse(N)¶
subroutine Matrix_Multiplication(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n, n), b(n, n)
  real(8), intent(out) :: c(n, n)
  integer :: row, col, i
  real(8) :: sum

  !$omp target teams loop collapse(2) private(row, col, i)
  do row = 1, n
    do col = 1, n
      sum = 0.0
      do i = 1, n
        sum = sum + a(row, i) * b(i, col)
      end do
      c(row, col) = sum
    end do
  end do
  !$omp end target teams loop
end subroutine
Compilation Output Explanation:¶
- target teams loop: Executes across a league of teams, ideal for distributing large workloads.
- collapse(2): Merges the row and col loops to improve parallelization.
- private(row, col, i): Ensures thread safety by creating isolated copies.
Matrix_Multiplication:
16, #omp target teams loop
16, Generating "nvkernel_Matrix_Multiplication_F1L16_2" GPU kernel
Generating NVIDIA GPU code
16, Loop parallelized across teams, threads(128) collapse(2) /* blockIdx.x threadIdx.x */
18, /* blockIdx.x threadIdx.x collapsed */
20, Loop run sequentially
16, Generating Multicore code
16, Loop parallelized across threads
16, Generating implicit map(to:a[:],b[:])
Generating implicit map(from:c[:])
18, Generating implicit private(sum)
20, Loop is parallelizable
main:
53, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
Option 3: !$omp target teams loop num_teams(N) collapse(N)¶
subroutine Matrix_Multiplication(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n, n), b(n, n)
  real(8), intent(out) :: c(n, n)
  integer :: row, col, i
  real(8) :: sum

  !$omp target teams loop num_teams(5) collapse(2) private(row, col, i)
  do row = 1, n
    do col = 1, n
      sum = 0.0
      do i = 1, n
        sum = sum + a(row, i) * b(i, col)
      end do
      c(row, col) = sum
    end do
  end do
  !$omp end target teams loop
end subroutine
Compilation Output Explanation:¶
- num_teams(5): Creates five teams for distributed execution.
- collapse(2): Combines the row and col loops for improved load balancing.
- private(row, col, i): Prevents race conditions by giving each thread its own copies of the loop variables.
Matrix_Multiplication:
16, #omp target teams loop num_teams(5)
16, Generating "nvkernel_Matrix_Multiplication_F1L16_2" GPU kernel
Generating NVIDIA GPU code
16, Loop parallelized across teams(5), threads(128) collapse(2) /* blockIdx.x threadIdx.x */
18, /* blockIdx.x threadIdx.x collapsed */
20, Loop run sequentially
16, Generating Multicore code
16, Loop parallelized across threads
16, Generating implicit map(to:a[:],b[:])
Generating implicit map(from:c[:])
18, Generating implicit private(sum)
20, Loop is parallelizable
main:
53, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
Summary¶
These options for matrix multiplication using OpenMP offloading provide several ways to leverage GPUs for efficient parallel computation:
- Collapse Clauses: By collapsing loops with collapse(2), the code merges outer loops into one, maximizing parallelism and balancing workloads.
- Teams and Threads: Variants like target parallel for and target teams distribute allow users to control workload distribution across threads and teams, optimizing GPU resource use.
- Data Mapping: Each example uses data mapping to efficiently manage memory transfers between host and device, which is crucial for performance on heterogeneous systems.
Each approach provides unique trade-offs in resource utilization, parallelism, and memory handling, enabling developers to tailor code to the specific architecture and problem size.
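To make the data-mapping point concrete, here is a hedged sketch (run_steps and steps are illustrative names, assuming the C Matrix_Multiplication from the examples above) using OpenMP's unstructured mapping directives to keep the matrices resident on the device across repeated multiplications:

// mapping_sketch.c: hypothetical, assuming the C Matrix_Multiplication above.
void Matrix_Multiplication(float *a, float *b, float *c, int n);

// Keeps a, b, and c resident on the device for all iterations, so host/device
// traffic happens only at the enter and exit points, not on every call.
void run_steps(float *a, float *b, float *c, int n, int steps) {
    #pragma omp target enter data map(to: a[0:n*n], b[0:n*n]) map(alloc: c[0:n*n])
    for (int step = 0; step < steps; step++) {
        Matrix_Multiplication(a, b, c, n);  // operates on the resident device copies
    }
    #pragma omp target exit data map(from: c[0:n*n]) map(delete: a[0:n*n], b[0:n*n])
}

For a single call, a structured target data region (or the implicit maps alone) is sufficient; enter/exit data pays off when the kernel is launched many times on the same buffers.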