Programming in OpenMP Offloading¶
In this article, we explore how to write a simple OpenMP Offloading program. OpenMP Offloading extends the familiar OpenMP model to support heterogeneous computing by enabling code execution on devices like GPUs. We start with a basic "Hello World" program, illustrating how to move computation from the CPU to the GPU.
Basic OpenMP Offloading Example in C/C++¶
The following example demonstrates a simple "Hello World" program in C/C++.
```c
// hello-world.c
#include <stdio.h>
#include <omp.h>

void Print_Hello_World()
{
    for(int i = 0; i < 5; i++)
    {
        printf("Hello World!\n");
    }
}

int main()
{
    Print_Hello_World();
    return 0;
}
```
Compiling the serial code for CPU execution:

```bash
nvc -mp=multicore -Minfo=mp,accel hello-world.c
```

If we compile the above program, it will execute sequentially on the CPU, since we have not added any OpenMP Offloading directives. However, our goal is to offload part or all of this computation to the GPU.
Offloading Code to the GPU¶
To offload computation to the GPU, we need to use the `#pragma omp target` directive. This directive instructs the compiler to execute the enclosed code block on the GPU.
```c
// hello-world.c
#include <stdio.h>
#include <omp.h>

void Print_Hello_World()
{
    #pragma omp target
    for(int i = 0; i < 5; i++)
    {
        printf("Hello World!\n");
    }
}

int main()
{
    Print_Hello_World();
    return 0;
}
```
Compiling for multicore architecture¶
```bash
nvc -mp=multicore -Minfo=mp,accel hello-world.c
```

- `-mp=multicore`: Enables OpenMP parallelization and targets the multicore CPU architecture.
- `-Minfo=mp,accel`: Provides compiler feedback on OpenMP parallelization (mp) and accelerator offloading (accel). Even though this build targets a multicore CPU, the accel feedback becomes useful when the same source is compiled for GPU offloading.
Compiling for GPU architecture¶
```bash
nvc -mp=gpu -gpu=cc80 -Minfo=mp,accel hello-world.c
```

- `#pragma omp target`: Offloads the enclosed code block to the GPU.
- `-mp=gpu`: Enables OpenMP parallelization with GPU offloading.
- `-gpu=cc80`: Targets GPUs with compute capability 8.0 (e.g., NVIDIA A100).
- `-Minfo=mp,accel`: Provides feedback on OpenMP and acceleration optimizations for the GPU.
Compilation Output Analysis¶
When compiled for both multicore and GPU architectures:
- The compiler generates code for the selected target, so the same source remains compatible with both CPU and GPU.
- When targeting the GPU, the compiler feedback reports the generation of GPU kernels, which are the computational units executed on the GPU.
This simple example demonstrates how OpenMP Offloading allows a single source code to execute on both multicore CPUs and GPUs by modifying compiler flags.
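Beyond reading the compiler feedback, a simple way to check at runtime where the code actually executed is to query the OpenMP device routines; the file name and structure below are illustrative, not part of the original example:

```c
// check-offload.c (illustrative sketch)
#include <stdio.h>
#include <omp.h>

int main()
{
    // Number of non-host devices visible to the OpenMP runtime
    printf("Devices available: %d\n", omp_get_num_devices());

    int on_host = 1;

    // omp_is_initial_device() returns nonzero when the region
    // runs on the host (e.g., after falling back from the GPU).
    #pragma omp target map(from: on_host)
    {
        on_host = omp_is_initial_device();
    }

    printf("Target region ran on the %s\n", on_host ? "host CPU" : "GPU");
    return 0;
}
```

Compiled with the GPU flags shown above, this reports whether the target region executed on the GPU or fell back to the host CPU.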
Fortran Example with OpenMP Offloading¶
The OpenMP Offloading model can also be used in Fortran. Below is the "Hello World" example in Fortran, demonstrating how to offload computation to the GPU using similar constructs.
```fortran
subroutine Print_Hello_World()
    integer :: i

    !$omp target
    do i = 1, 5
        print *, "Hello World"
    end do
    !$omp end target
end subroutine Print_Hello_World

program main
    use omp_lib
    implicit none

    call Print_Hello_World()
end program main
```
Compiling for multicore architecture¶
```bash
nvfortran -mp=multicore -Minfo=mp,accel hello-world.f90
```
Compiling for GPU architecture¶
```bash
nvfortran -mp=gpu -gpu=cc80 -Minfo=mp,accel hello-world.f90
```
In Fortran:

- The `!$omp target` directive specifies that the code block should be offloaded to the GPU.
- The syntax and behavior mirror the C/C++ example, allowing a similar approach across both languages.
Note: The Fortran compiler may provide limited output regarding target device execution compared to C/C++. However, when properly configured, the code will execute on the specified target device (e.g., GPU).
Technical Considerations in OpenMP Offloading¶
OpenMP Offloading offers significant flexibility for heterogeneous computing, but achieving optimal performance often requires additional constructs and clauses:
- Thread Handling Constructs: OpenMP provides constructs such as `teams`, `parallel`, and `distribute` that define how work is distributed across GPU threads.
- Mapping Data: The `map` clause enables data transfer between the CPU and GPU, which is critical for ensuring that the necessary data is available on the device.
- Reduction and Synchronization: The `reduction` and `collapse` clauses facilitate parallel computation of complex operations (e.g., sums) and handle nested loops, respectively. A combined example is sketched after this list.
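As an illustration of these clauses working together, here is a minimal vector-addition sketch; the file name, array size, and initialization values are assumptions for this example, not part of the original code:

```c
// vector-add.c (illustrative sketch)
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main()
{
    static double a[N], b[N], c[N];

    // Initialize the input arrays on the host
    for (int i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    // Copy a and b to the device, bring c back to the host,
    // and distribute the loop iterations across GPU threads.
    #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[0] = %f, c[N-1] = %f\n", c[0], c[N - 1]);
    return 0;
}
```

Here `teams distribute parallel for` creates a league of teams and spreads the iterations over their threads, while the `map` clauses make the host arrays available on the device and copy the result back.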
Key OpenMP Offloading Constructs¶
The table below outlines some essential OpenMP Offloading constructs used for controlling computation and data transfer between the host and device.
| OpenMP API | Description |
|---|---|
| `#pragma omp target` | Offloads execution to a target device (e.g., GPU). |
| `#pragma omp loop` | Specifies a loop to be parallelized on the target device. |
| `reduction(op:list)` | Applies a reduction operation (e.g., sum, max) across a list of variables, allowing for parallel updates. |
| `collapse(n)` | Combines n nested loops into a single logical loop, simplifying nested loop execution on the device. |
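To show how these constructs combine, the following sketch sums a 2D array on the device using `target teams loop`, `collapse`, and `reduction`; the matrix dimensions and variable names are assumptions for illustration:

```c
// matrix-sum.c (illustrative sketch)
#include <stdio.h>
#include <omp.h>

#define ROWS 1000
#define COLS 1000

int main()
{
    static double m[ROWS][COLS];
    double sum = 0.0;

    // Initialize the matrix on the host
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = 1.0;

    // Collapse the two loops into one logical loop, offload it,
    // and combine the partial sums with a parallel reduction.
    #pragma omp target teams loop collapse(2) reduction(+:sum) map(to: m)
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            sum += m[i][j];
        }
    }

    printf("sum = %f (expected %f)\n", sum, (double)ROWS * COLS);
    return 0;
}
```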
By understanding these constructs, developers can effectively structure code for heterogeneous computing. In future articles, we will explore more advanced constructs and optimizations for parallel execution on GPUs.
Summary¶
This introduction to OpenMP Offloading demonstrates how to transfer code execution from the CPU to the GPU. Using the `#pragma omp target` directive, a single source code can be compiled to execute on both multicore CPUs and GPUs.
OpenMP Offloading also provides robust constructs for thread control, data transfer, and parallel reduction, making it a powerful tool for developing high-performance applications on heterogeneous systems.