
Vector Addition

This article explores several ways to offload vector addition to the GPU using OpenMP Offloading. We demonstrate various compute constructs and data mapping options in both C/C++ and Fortran. Each option showcases how different combinations of target, teams, parallel, and loop constructs can be used to control parallelism on the GPU, offering flexibility in workload distribution and data handling.


Vector Addition in C/C++

In this C/C++ example, we illustrate three approaches to offloading vector addition to the GPU with varying compute constructs.

Option 1: Basic Offloading with target parallel for

The simplest approach to vector addition uses #pragma omp target parallel for, which offloads the parallelized loop directly to the GPU.

#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
#pragma omp parallel for
for (int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}
Compilation Output:
  • Construct: target parallel for directly distributes iterations among GPU threads.
  • Loop Scheduling: Loop parallelized across threads, with 128 threads per block.
  • Data Mapping: The compiler report shows a, b, and c mapped tofrom in main, ensuring the result is copied back to the host.
nvc -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.c
Vector_Addition:
     13, #omp target parallel for
         13, Generating "nvkernel_Vector_Addition_F1L13_2" GPU kernel
         19, Loop parallelized across threads(128), schedule(static)
main:
     49, Generating map(tofrom:a[:N],c[:N],b[:N])

This method is straightforward but limits control over GPU resources, as it parallelizes only at the thread level without forming teams.
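
For context, a complete, compilable version of this first option might look like the sketch below. It is not the article's exact source: the compiler output suggests the offloaded loop lives in a separate Vector_Addition function, whereas here everything is folded into main for brevity, and the vector length, initialization values, and host-side check are assumptions added so the example can be built and run on its own.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));

    /* Initialize the input vectors on the host. */
    for (int i = 0; i < n; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* Offload the loop: a and b are copied to the device, c is copied back. */
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }

    /* Host-side verification of the result. */
    int errors = 0;
    for (int i = 0; i < n; i++) {
        if (c[i] != 3.0) errors++;
    }
    printf("%s\n", errors == 0 ? "PASSED" : "FAILED");

    free(a);
    free(b);
    free(c);
    return 0;
}

Compiling this with the nvc command shown above should report the same kind of target parallel for kernel for the offloaded loop.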


Option 2: Using target teams distribute parallel for

Adding the teams construct creates a league of teams, with each team executing a subset of the loop. This approach is beneficial when there is a need to control execution across multiple thread groups.

#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
#pragma omp teams distribute parallel for
for (int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}
Compilation Output:
  • Construct: target teams distribute parallel for creates teams and distributes iterations across them.
  • Loop Scheduling: Loop parallelized across teams and threads.
  • Data Mapping: The arrays are mapped with tofrom to handle data movement.
nvc -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.c
Vector_Addition:
     13, #omp target teams distribute parallel for
         13, Generating "nvkernel_Vector_Addition_F1L13_2" GPU kernel
         19, Loop parallelized across teams and threads(128), schedule(static)
main:
     49, Generating map(tofrom:a[:N],c[:N],b[:N])

This setup provides greater flexibility for dividing work across multiple teams, enabling more control over GPU resources.
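
To observe the league of teams at run time, you can query the OpenMP runtime from inside a target region. The sketch below is illustrative rather than part of the original example; it assumes omp.h and an offload-capable compiler, and prints the number of teams and the threads per team that the runtime actually chose.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int num_teams = 0;
    int threads_per_team = 0;

    /* Each team's initial thread executes this block; team 0 records the league size. */
    #pragma omp target teams map(from: num_teams, threads_per_team)
    {
        if (omp_get_team_num() == 0) {
            num_teams = omp_get_num_teams();
        }
        /* Spawn the threads of each team and record the team size once. */
        #pragma omp parallel
        {
            if (omp_get_team_num() == 0 && omp_get_thread_num() == 0) {
                threads_per_team = omp_get_num_threads();
            }
        }
    }

    printf("teams: %d, threads per team: %d\n", num_teams, threads_per_team);
    return 0;
}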


Option 3: Specifying num_teams with target teams distribute parallel for

To further control parallelism, we can set the number of teams with num_teams. This option is useful for adjusting the workload balance and optimizing GPU utilization.

#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
#pragma omp teams distribute parallel for num_teams(5)
for (int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}

Compilation Output:

  • Construct: target teams distribute parallel for num_teams(5) specifies five teams, each with 128 threads.
  • Loop Scheduling: Loop parallelized across teams and threads.
  • Data Mapping: Arrays a, b, and c are mapped as before.
nvc -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.c
Vector_Addition:
     13, #omp target teams distribute parallel for num_teams(5)
         13, Generating "nvkernel_Vector_Addition_F1L13_2" GPU kernel
         19, Loop parallelized across teams and threads(128), schedule(static)
main:
     49, Generating map(tofrom:a[:N],c[:N],b[:N])

This approach allows fine-tuning the number of teams to balance the load and optimize performance on the target device.
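
If you also want to cap the threads within each team, the standard thread_limit clause can be combined with num_teams. The variant below is an illustrative extension of the listing above, not part of the original example; both clauses act as requests (upper bounds), and the runtime may choose smaller values.

#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
#pragma omp teams distribute parallel for num_teams(5) thread_limit(128)
for (int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}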


Vector Addition in Fortran

Below are similar options for vector addition in Fortran, illustrating how to use target, teams, and parallel constructs along with data mapping clauses.

Option 1: Simple target parallel do

The simplest approach in Fortran uses !$omp target with parallel do, enabling parallel execution of the loop on the GPU.

!$omp target map(to:a, b) map(from:c)  
!$omp parallel do
do i = 1, n
   c(i) = a(i) + b(i)
end do
!$omp end parallel do  
!$omp end target

Compilation Output:

  • Construct: target parallel do directly maps data and distributes the loop across threads on the GPU.
  • Data Mapping: Arrays a and b are mapped to the device, and c is mapped back to the host.
nvfortran -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.f90
vector_addition:
     11, !$omp target
         11, Generating "nvkernel_vector_addition_mod_vector_addition__F1L11_2" GPU kernel
     11, Generating map(to:a(:)) 
         Generating map(from:c(:)) 
         Generating map(to:b(:))
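
For reference, a complete standalone Fortran program built around this first option could look like the sketch below. It is illustrative rather than the article's exact source (which appears to live in a module, judging by the kernel name above); the array size, initialization values, and verification step are assumptions added so the example can be built and run on its own with nvfortran -mp=gpu.

program vector_addition
   implicit none
   integer, parameter :: n = 1000000
   real, allocatable  :: a(:), b(:), c(:)
   integer :: i

   allocate(a(n), b(n), c(n))
   ! Initialize the input vectors on the host
   a = 1.0
   b = 2.0

   !$omp target map(to:a, b) map(from:c)
   !$omp parallel do
   do i = 1, n
      c(i) = a(i) + b(i)
   end do
   !$omp end parallel do
   !$omp end target

   ! Host-side verification of the result
   if (all(abs(c - 3.0) < 1.0e-6)) then
      print *, 'PASSED'
   else
      print *, 'FAILED'
   end if

   deallocate(a, b, c)
end program vector_addition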

Option 2: Using teams distribute parallel do

Adding teams and distribute allows more granular control by creating a league of teams, where each team is responsible for a portion of the loop.

!$omp target map(to:a, b) map(from:c)
!$omp teams distribute parallel do private(i)
do i = 1, n
   c(i) = a(i) + b(i)
end do
!$omp end teams distribute parallel do  
!$omp end target

Compilation Output:

  • Construct: target teams distribute parallel do creates teams on the GPU, with each team handling a part of the loop.
  • Data Mapping: Similar to previous examples, arrays are mapped appropriately for data transfer.
nvfortran -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.f90
vector_addition:
     11, !$omp target teams distribute parallel do
         11, Generating "nvkernel_vector_addition_mod_vector_addition__F1L11_2" GPU kernel
     11, Generating map(to:a(:)) 
         Generating map(from:c(:)) 
         Generating map(to:b(:))

Option 3: teams distribute parallel do with num_teams

Setting num_teams(5) requests five teams for distributing the work, which lets you tune how the iterations are balanced across teams.

!$omp target map(to:a, b) map(from:c)
!$omp teams distribute parallel do private(i) num_teams(5)
do i = 1, n
   c(i) = a(i) + b(i)
end do
!$omp end teams distribute parallel do  
!$omp end target

Compilation Output:

  • Construct: target teams distribute parallel do with num_teams(5) enhances control over GPU resources.
  • Data Mapping: The same mapping approach applies, with a and b copied to the device and c copied back to the host.
nvfortran -mp=gpu -gpu=cc80 -Minfo=mp,accel vector-addition.f90
vector_addition:
     11, !$omp target teams distribute parallel do num_teams(5)
         11, Generating "nvkernel_vector_addition_mod_vector_addition__F1L11_2" GPU kernel
     11, Generating map(to:a(:))
         Generating map(from:c(:))
         Generating map(to:b(:))
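
To confirm how many teams the runtime actually creates for a num_teams(5) request, you can query omp_get_num_teams from inside a target teams region. The snippet below is an illustrative check, not part of the original example.

program check_teams
   use omp_lib
   implicit none
   integer :: nteams

   nteams = 0
   ! Each team's initial thread runs this region; team 0 records the league size
   !$omp target teams num_teams(5) map(from:nteams)
   if (omp_get_team_num() == 0) then
      nteams = omp_get_num_teams()
   end if
   !$omp end target teams

   print *, 'teams created on the device: ', nteams
end program check_teams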

Summary of Options

Option              Construct                                           Description
Option 1 (C/C++)    target parallel for                                 Basic parallelism across threads.
Option 2 (C/C++)    target teams distribute parallel for                Adds teams, distributing work across teams and threads.
Option 3 (C/C++)    target teams distribute parallel for num_teams(5)   Specifies the number of teams for better workload balance.
Option 1 (Fortran)  target parallel do                                  Direct parallelization across threads on the GPU.
Option 2 (Fortran)  target teams distribute parallel do                 Adds teams, distributing work across teams and threads.
Option 3 (Fortran)  target teams distribute parallel do num_teams(5)    Specifies the number of teams for optimized parallelism.

This table summarizes the OpenMP offloading options for C/C++ and Fortran, detailing progressively more complex parallel constructs:

  • Option 1: Basic parallelism across threads (parallel for or parallel do).
  • Option 2: Adds teams, distributing work across multiple teams and threads.
  • Option 3: Specifies a set number of teams (e.g., num_teams(5)) for improved load balancing.

Each option demonstrates how OpenMP Offloading allows fine-grained control over GPU execution.