Vector Addition¶
This article explores several ways to offload vector addition to the GPU using OpenMP Offloading. We demonstrate various compute constructs and data mapping options in both C/C++ and Fortran. Each option showcases how different combinations of the target, teams, parallel, and loop constructs can be used to control parallelism on the GPU, offering flexibility in workload distribution and data handling.
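Of these, the loop construct is the only one not demonstrated in the options below. For reference, here is a brief sketch (not one of this article's examples) of the same vector addition written with the combined target teams loop construct; the arrays a, b, c and size n mirror the later examples, and the construct leaves the binding of iterations to teams and threads up to the implementation.

// Sketch only (not one of this article's examples): the loop construct
// lets the implementation choose how iterations map to teams and threads.
#pragma omp target teams loop map(to: a[0:n], b[0:n]) map(from: c[0:n])
for (int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}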
Vector Addition in C/C++¶
In this C/C++ example, we illustrate three approaches to offloading vector addition to the GPU with varying compute constructs.
Option 1: Basic Offloading with target parallel for¶
The simplest approach to vector addition uses #pragma omp target together with #pragma omp parallel for, which offloads the parallelized loop directly to the GPU.
#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
#pragma omp parallel for
for (int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}
Compilation Output:¶
- Construct: target parallel for directly distributes iterations among GPU threads.
- Loop Scheduling: Loop parallelized across threads, with 128 threads per block.
- Data Mapping: Variables a, b, and c are mapped with tofrom, ensuring data is copied back to the host.
nvc -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.c
Vector_Addition:
13, #omp target parallel for
13, Generating "nvkernel_Vector_Addition_F1L13_2" GPU kernel
19, Loop parallelized across threads(128), schedule(static)
main:
49, Generating map(tofrom:a[:N],c[:N],b[:N])
This method is straightforward but limits control over GPU resources, as it parallelizes only at the thread level without forming teams.
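To try Option 1 end to end, the sketch below wraps the offloaded loop in a minimal, self-contained program. The routine and size names (Vector_Addition, N) echo the compiler output above, but the allocation, initialization, and verification code is illustrative scaffolding rather than the article's original source.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Illustrative reconstruction of the offloaded routine; the pragmas match Option 1.
void Vector_Addition(const float *a, const float *b, float *c, int n)
{
#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
#pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        c[i] = a[i] + b[i];
    }
}

int main(void)
{
    const int N = 1 << 20;                      // assumed problem size
    float *a = malloc(N * sizeof(float));
    float *b = malloc(N * sizeof(float));
    float *c = malloc(N * sizeof(float));

    for (int i = 0; i < N; i++)                 // initialize host data
    {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }

    Vector_Addition(a, b, c, N);

    int errors = 0;                             // verify on the host
    for (int i = 0; i < N; i++)
        if (fabsf(c[i] - 3.0f * (float)i) > 1.0e-3f)
            errors++;

    printf("%s\n", errors == 0 ? "PASSED" : "FAILED");

    free(a);
    free(b);
    free(c);
    return errors != 0;
}

Built with the same nvc -mp=gpu -gpu=cc80 command shown above, the PASSED/FAILED check confirms that the results mapped back from the device are correct.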
Option 2: Using target teams distribute parallel for¶
Adding the teams and distribute constructs creates a league of teams, with each team executing a subset of the loop. This approach is beneficial when there is a need to control execution across multiple thread groups.
#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
#pragma omp teams distribute parallel for
for (int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}
Compilation Output:¶
- Construct: target teams distribute parallel for creates teams and distributes iterations across them.
- Loop Scheduling: Loop parallelized across teams and threads.
- Data Mapping: The arrays are mapped with tofrom to handle data movement.
nvc -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.c
Vector_Addition:
13, #omp target teams distribute parallel for
13, Generating "nvkernel_Vector_Addition_F1L13_2" GPU kernel
19, Loop parallelized across teams and threads(128), schedule(static)
main:
49, Generating map(tofrom:a[:N],c[:N],b[:N])
This setup provides greater flexibility for dividing work across multiple teams, enabling more control over GPU resources.
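To see how the league is actually shaped at run time, the standard OpenMP API routines omp_get_num_teams(), omp_get_team_num(), omp_get_num_threads(), and omp_get_thread_num() can be called inside the target region. The diagnostic below is a sketch rather than part of the original example; it assumes the compiler supports printf in device code, as nvc does for NVIDIA GPUs, and the team count chosen without a num_teams clause is implementation-defined.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // Diagnostic only: report how the runtime shapes the league of teams.
#pragma omp target teams
#pragma omp parallel
    {
        if (omp_get_thread_num() == 0)          // one line per team
            printf("team %d of %d, %d threads in this team\n",
                   omp_get_team_num(), omp_get_num_teams(),
                   omp_get_num_threads());
    }
    return 0;
}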
Option 3: Specifying num_teams with target teams distribute parallel for¶
To further control parallelism, we can set the number of teams with num_teams. This option is useful for adjusting the workload balance and optimizing GPU utilization.
#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
#pragma omp teams distribute parallel for num_teams(5)
for (int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}
Compilation Output:¶
- Construct: target teams distribute parallel for num_teams(5) specifies five teams, each with 128 threads.
- Loop Scheduling: Loop parallelized across teams and threads.
- Data Mapping: Arrays a, b, and c are mapped as before.
nvc -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.c
Vector_Addition:
13, #omp target teams distribute parallel for num_teams(5)
13, Generating "nvkernel_Vector_Addition_F1L13_2" GPU kernel
19, Loop parallelized across teams and threads(128), schedule(static)
main:
49, Generating map(tofrom:a[:N],c[:N],b[:N])
This approach allows fine-tuning the number of teams to balance the load and optimize performance on the target device.
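num_teams is often paired with the thread_limit clause, which caps the number of threads per team, and the team count can also be computed from the problem size instead of being hard-coded. The variant below sketches that idea; the block size of 128 mirrors the threads-per-team value reported by the compiler above, but both the clauses and the values are illustrative choices rather than part of the original example.

// Illustrative: derive the team count so each team covers one 128-iteration block.
int block = 128;
int teams = (n + block - 1) / block;

#pragma omp target teams distribute parallel for num_teams(teams) thread_limit(block) \
        map(to: a[0:n], b[0:n]) map(from: c[0:n])
for (int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}

Whether this beats the compiler's default league size depends on the device and the problem size, so it is worth profiling both configurations.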
Vector Addition in Fortran¶
Below are similar options for vector addition in Fortran, illustrating how to use the target, teams, and parallel constructs along with data mapping clauses.
Option 1: Simple target parallel do¶
The simplest approach in Fortran uses !$omp target with parallel do, enabling parallel execution of the loop on the GPU.
!$omp target map(to: a, b) map(from: c)
!$omp parallel do
do i = 1, n
    c(i) = a(i) + b(i)
end do
!$omp end parallel do
!$omp end target
Compilation Output:¶
- Construct: target parallel do directly maps data and distributes the loop across threads on the GPU.
- Data Mapping: Arrays a and b are mapped to the device, and c is mapped back to the host.
nvfortran -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.f90
vector_addition:
11, !$omp target
11, Generating "nvkernel_vector_addition_mod_vector_addition__F1L11_2" GPU kernel
11, Generating map(to:a(:))
Generating map(from:c(:))
Generating map(to:b(:))
Option 2: Using teams distribute parallel do¶
Adding teams and distribute allows more granular control by creating a league of teams, where each team is responsible for a portion of the loop.
!$omp target map(to: a, b) map(from: c)
!$omp teams distribute parallel do private(i)
do i = 1, n
    c(i) = a(i) + b(i)
end do
!$omp end teams distribute parallel do
!$omp end target
Compilation Output:¶
- Construct: target teams distribute parallel do creates teams on the GPU, with each team handling a part of the loop.
- Data Mapping: Similar to previous examples, arrays are mapped appropriately for data transfer.
nvfortran -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.f90
vector_addition:
11, !$omp target teams distribute parallel do
11, Generating "nvkernel_vector_addition_mod_vector_addition__F1L11_2" GPU kernel
11, Generating map(to:a(:))
Generating map(from:c(:))
Generating map(to:b(:))
Option 3: teams distribute parallel do with num_teams¶
Setting num_teams(5) specifies five teams for distributing the work, which can enhance performance on large datasets by balancing the load across teams.
!$omp target map(to: a, b) map(from: c)
!$omp teams distribute parallel do private(i) num_teams(5)
do i = 1, n
    c(i) = a(i) + b(i)
end do
!$omp end teams distribute parallel do
!$omp end target
Compilation Output:¶
- Construct: target teams distribute parallel do with num_teams(5) enhances control over GPU resources.
- Data Mapping: The same mapping approach applies, with a and b copied to the device and c copied back to the host.
nvfortran -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.f90
vector_addition:
11, !$omp target teams distribute parallel do num_teams(5)
11, Generating "nvkernel_vector_addition_mod_vector_addition__F1L11_2" GPU kernel
11, Generating map(to:a(:))
Generating map(from:c(:))
Generating map(to:b(:))
Summary of Options¶
| Option | Construct | Description |
|---|---|---|
| Option 1 (C/C++) | target parallel for | Basic parallelism across threads. |
| Option 2 (C/C++) | target teams distribute parallel for | Adds teams, distributing work across teams and threads. |
| Option 3 (C/C++) | target teams distribute parallel for num_teams(5) | Specifies the number of teams for better workload balance. |
| Option 1 (Fortran) | target parallel do | Direct parallelization across threads on the GPU. |
| Option 2 (Fortran) | target teams distribute parallel do | Adds teams, distributing work across teams and threads. |
| Option 3 (Fortran) | target teams distribute parallel do num_teams(5) | Specifies number of teams for optimized parallelism. |
This table provides a summary of OpenMP offloading options for C/C++ and Fortran, detailing progressively complex parallel constructs:
- Option 1: Basic parallelism across threads (parallel for or parallel do).
- Option 2: Adds teams, distributing work across multiple teams and threads.
- Option 3: Specifies a set number of teams (e.g., num_teams(5)) for improved load balancing.
Each option demonstrates how OpenMP Offloading allows fine-grained control over GPU execution.
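Beyond choosing constructs, it is often useful to confirm at run time that a target region really executed on the GPU instead of falling back to the host. The short check below uses the standard OpenMP routines omp_get_num_devices() and omp_is_initial_device(); it is a general-purpose sketch, not part of the examples above.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // How many offload devices does the runtime see?
    printf("devices available: %d\n", omp_get_num_devices());

    int on_host = 1;
#pragma omp target map(from: on_host)
    {
        // Returns 0 when executing on a non-host device.
        on_host = omp_is_initial_device();
    }
    printf("target region ran on the %s\n", on_host ? "host" : "GPU");
    return 0;
}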