Vector Addition

Script
  • "In this session, we’ll explore how to offload vector addition to the GPU using OpenMP Offloading. We’ll look at implementations in both C/C++ and Fortran, demonstrating how various OpenMP constructs—such as target, teams, parallel, and loop—can be combined to control parallelism effectively on GPUs."
  • "The first option uses the target parallel for construct, which directly offloads the loop to the GPU. This approach is straightforward, as it maps the input arrays a and b to the device and ensures the result, c, is copied back to the host. However, it parallelizes only at the thread level, offering limited control over GPU resources."
  • "Here is the compilation output for Option 1. The compiler generates a GPU kernel, parallelizing loop iterations across threads. The map(tofrom:) clause efficiently handles data movement between the host and device for the arrays a, b, and c. The feedback confirms successful parallelization and GPU kernel generation."
  • "In Option 2, we introduce the teams distribute construct, which creates a league of teams on the GPU. Each team processes a portion of the loop, providing more granular control over parallelism. This hierarchical approach distributes work not just across threads but also across teams, enhancing resource utilization on the GPU."
  • "Option 3 builds upon the previous example by specifying num_teams(5), allowing us to control the number of teams explicitly. This added parameter lets us balance the workload across teams, which can optimize performance for large datasets or specific GPU architectures where team-based workload distribution is critical."
  • "In Fortran, the first option mirrors the C/C++ example, using the target parallel do construct to offload the loop to the GPU. This construct distributes loop iterations across threads, enabling basic parallelism on the GPU."
  • "Option 2 in Fortran introduces the teams distribute parallel do construct. By adding teams, this approach distributes the workload more effectively, leveraging the GPU’s hierarchical structure to ensure better utilization of its resources."
  • "In the final Fortran example, we specify num_teams(5), just like in C/C++. This directive provides precise control over the number of teams, allowing developers to balance the workload across the GPU for improved performance. Fine-tuning the number of teams is particularly beneficial for optimizing execution on large datasets."
  • "To summarize, we explored three options for implementing vector addition in both C/C++ and Fortran. Option 1 uses the simplest construct, target parallel for, enabling thread-level parallelism. Option 2 adds teams to distribute work across the GPU hierarchically. Option 3 further refines this by specifying the number of teams, allowing for fine-tuned workload balancing. Each approach demonstrates how OpenMP Offloading constructs can be combined to maximize performance on GPUs."

This article explores several ways to offload vector addition to the GPU using OpenMP Offloading. We demonstrate various compute constructs and data mapping options in both C/C++ and Fortran. Each option showcases how different combinations of target, teams, parallel, and loop constructs can be used to control parallelism on the GPU, offering flexibility in workload distribution and data handling.


Vector Addition in C/C++

In this C/C++ example, we illustrate three approaches to offloading vector addition to the GPU with varying compute constructs.

Option 1: Basic Offloading with target parallel for

The simplest approach to vector addition uses #pragma omp target combined with #pragma omp parallel for, which offloads the parallelized loop directly to the GPU.

#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
#pragma omp parallel for
for (int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}
Compilation Output:
  • Construct: target parallel for directly distributes iterations among GPU threads.
  • Loop Scheduling: Loop parallelized across threads, with 128 threads per block.
  • Data Mapping: Variables a, b, and c are mapped with tofrom, ensuring data is copied back to the host.
nvc -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.c
Vector_Addition:
     13, #omp target parallel for
         13, Generating "nvkernel_Vector_Addition_F1L13_2" GPU kernel
         19, Loop parallelized across threads(128), schedule(static)
main:
     49, Generating map(tofrom:a[:N],c[:N],b[:N])

This method is straightforward but limits control over GPU resources, as it parallelizes only at the thread level without forming teams.
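For readers who want to run Option 1 end to end, here is a minimal, self-contained C sketch built around the same directives. The array size, initialization values, and verification loop are illustrative additions rather than part of the original example; it compiles with the same nvc -mp=gpu -gpu=cc80 command shown above.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1000000;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));

    /* Initialize the inputs on the host. */
    for (int i = 0; i < n; i++)
    {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* Option 1: offload the loop and parallelize it across GPU threads. */
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        c[i] = a[i] + b[i];
    }

    /* Verify the result on the host. */
    int errors = 0;
    for (int i = 0; i < n; i++)
        if (c[i] != 3.0) errors++;
    printf("%s (%d errors)\n", errors == 0 ? "PASSED" : "FAILED", errors);

    free(a); free(b); free(c);
    return 0;
}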


Option 2: Using target teams distribute parallel for

Adding the teams construct creates a league of teams, with each team executing a subset of the loop. This approach is beneficial when there is a need to control execution across multiple thread groups.

#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
#pragma omp teams distribute parallel for
for (int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}
Compilation Output:
  • Construct: target teams distribute parallel for creates teams and distributes iterations across them.
  • Loop Scheduling: Loop parallelized across teams and threads.
  • Data Mapping: The arrays are mapped with tofrom to handle data movement.
nvc -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.c
Vector_Addition:
     13, #omp target teams distribute parallel for
         13, Generating "nvkernel_Vector_Addition_F1L13_2" GPU kernel
         19, Loop parallelized across teams and threads(128), schedule(static)
main:
     49, Generating map(tofrom:a[:N],c[:N],b[:N])

This setup provides greater flexibility for dividing work across multiple teams, enabling more control over GPU resources.
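As a rough way to see the league of teams at work, the sketch below records which team computed each element and prints a sample on the host. The array size, the team_of bookkeeping array, and the sampling stride are illustrative additions; omp_get_team_num() is a standard OpenMP runtime routine that may be called inside a teams region.

#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], b[N], c[N];
    int team_of[N];                        /* which team computed each element */

    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    #pragma omp target map(to: a[0:N], b[0:N]) map(from: c[0:N], team_of[0:N])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < N; i++)
    {
        c[i] = a[i] + b[i];
        team_of[i] = omp_get_team_num();   /* record the executing team */
    }

    /* Print which team handled the first element of every 100-element chunk. */
    for (int i = 0; i < N; i += 100)
        printf("c[%4d] = %.1f computed by team %d\n", i, c[i], team_of[i]);

    return 0;
}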


Option 3: Specifying num_teams with target teams distribute parallel for

To further control parallelism, we can set the number of teams with num_teams. This option is useful for adjusting the workload balance and optimizing GPU utilization.

#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
#pragma omp teams distribute parallel for num_teams(5)
for (int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}

Compilation Output:

  • Construct: target teams distribute parallel for num_teams(5) specifies five teams, each with 128 threads.
  • Loop Scheduling: Loop parallelized across teams and threads.
  • Data Mapping: Arrays a, b, and c are mapped as before.
nvc -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.c
Vector_Addition:
     13, #omp target teams distribute parallel for num_teams(5)
         13, Generating "nvkernel_Vector_Addition_F1L13_2" GPU kernel
         19, Loop parallelized across teams and threads(128), schedule(static)
main:
     49, Generating map(tofrom:a[:N],c[:N],b[:N])

This approach allows fine-tuning the number of teams to balance the load and optimize performance on the target device.
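Whether a particular team count actually helps is best confirmed by measurement. The sketch below uses omp_get_wtime() on the host to compare the default launch with num_teams(5); the problem size, the warm-up region, and the specific comparison are illustrative assumptions rather than part of the original example, and the timings include the data transfers implied by the map clauses.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 24;                      /* illustrative problem size */
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Warm up the device so initialization cost is not charged to the first kernel. */
    #pragma omp target
    { }

    /* Default launch: the runtime chooses the number of teams. */
    double t0 = omp_get_wtime();
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
    double t_default = omp_get_wtime() - t0;

    /* Same kernel restricted to five teams. */
    t0 = omp_get_wtime();
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp teams distribute parallel for num_teams(5)
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
    double t_five = omp_get_wtime() - t0;

    printf("default teams: %.6f s   num_teams(5): %.6f s\n", t_default, t_five);
    free(a); free(b); free(c);
    return 0;
}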


Vector Addition in Fortran

Below are similar options for vector addition in Fortran, illustrating how to use target, teams, and parallel constructs along with data mapping clauses.

Option 1: Simple target parallel do

The simplest approach in Fortran uses !$omp target with parallel do, enabling parallel execution of the loop on the GPU.

!$omp target map(to:a, b) map(from:c)  
!$omp parallel do
do i = 1, n
   c(i) = a(i) + b(i)
end do
!$omp end parallel do  
!$omp end target

Compilation Output:

  • Construct: target parallel do directly maps data and distributes the loop across threads on the GPU.
  • Data Mapping: Arrays a and b are mapped to the device, and c is mapped back to the host.
nvfortran -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.f90
vector_addition:
     11, !$omp target
         11, Generating "nvkernel_vector_addition_mod_vector_addition__F1L11_2" GPU kernel
     11, Generating map(to:a(:)) 
         Generating map(from:c(:)) 
         Generating map(to:b(:))

Option 2: Using teams distribute parallel do

Adding teams and distribute allows more granular control by creating a league of teams, where each team is responsible for a portion of the loop.

!$omp target map(to:a, b) map(from:c)
!$omp teams distribute parallel do private(i)
do i = 1, n
   c(i) = a(i) + b(i)
end do
!$omp end teams distribute parallel do  
!$omp end target

Compilation Output:

  • Construct: target teams distribute parallel do creates teams on the GPU, with each team handling a part of the loop.
  • Data Mapping: Similar to previous examples, arrays are mapped appropriately for data transfer.
nvfortran -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.f90
vector_addition:
     11, !$omp target teams distribute parallel do
         11, Generating "nvkernel_vector_addition_mod_vector_addition__F1L11_2" GPU kernel
     11, Generating map(to:a(:)) 
         Generating map(from:c(:)) 
         Generating map(to:b(:))

Option 3: teams distribute parallel do with num_teams

Setting num_teams(5) specifies five teams for distributing the work, which can enhance performance on large datasets by balancing the load across teams.

!$omp target map(to:a, b) map(from:c)
!$omp teams distribute parallel do private(i) num_teams(5)
do i = 1, n
   c(i) = a(i) + b(i)
end do
!$omp end teams distribute parallel do  
!$omp end target

Compilation Output:

  • Construct: target teams distribute parallel do with num_teams(5) enhances control over GPU resources.
  • Data Mapping: The same mapping approach applies, with a and b copied to the device and c copied back to the host.
nvfortran -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.f90
vector_addition:
     11, !$omp target teams distribute parallel do num_teams(5)
         11, Generating "nvkernel_vector_addition_mod_vector_addition__F1L11_2" GPU kernel
     11, Generating map(to:a(:))
         Generating map(from:c(:))
         Generating map(to:b(:))

Summary of Options

Option               Construct                                            Description
Option 1 (C/C++)     target parallel for                                  Basic parallelism across threads.
Option 2 (C/C++)     target teams distribute parallel for                 Adds teams, distributing work across teams and threads.
Option 3 (C/C++)     target teams distribute parallel for num_teams(5)    Specifies the number of teams for better workload balance.
Option 1 (Fortran)   target parallel do                                   Direct parallelization across threads on the GPU.
Option 2 (Fortran)   target teams distribute parallel do                  Adds teams, distributing work across teams and threads.
Option 3 (Fortran)   target teams distribute parallel do num_teams(5)     Specifies the number of teams for optimized parallelism.

This table summarizes the OpenMP offloading options for C/C++ and Fortran, detailing progressively more complex parallel constructs:

  • Option 1: Basic parallelism across threads (parallel for or parallel do).
  • Option 2: Adds teams, distributing work across multiple teams and threads.
  • Option 3: Specifies a set number of teams (e.g., num_teams(5)) for improved load balancing.

Each option demonstrates how OpenMP Offloading allows fine-grained control over GPU execution.
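Finally, the introduction also mentions the loop construct, which none of the examples above use. It is a descriptive alternative to teams distribute parallel for that leaves the mapping of iterations to teams and threads up to the compiler. Below is a minimal C sketch; the array names and size are illustrative.

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Descriptive form: the compiler decides how to spread iterations
       across teams and threads on the device. */
    #pragma omp target teams loop map(to: a[0:N], b[0:N]) map(from: c[0:N])
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %.1f\n", c[0]);
    return 0;
}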