Programming in OpenMP Offloading¶
In this article, we explore how to write a simple OpenMP Offloading program. OpenMP Offloading extends the familiar OpenMP model to support heterogeneous computing by enabling code execution on devices like GPUs. We start with a basic "Hello World" program, illustrating how to move computation from the CPU to the GPU.
Basic OpenMP Offloading Example in C/C++¶
The following example demonstrates a simple "Hello World" program in C/C++.
```c
// hello-world.c
#include <stdio.h>
#include <omp.h>

void Print_Hello_World()
{
    for(int i = 0; i < 5; i++)
    {
        printf("Hello World!\n");
    }
}

int main()
{
    Print_Hello_World();
    return 0;
}
```
Compiling the serial code for CPU execution:

```bash
nvc -mp=multicore -Minfo=mp,accel hello-world.c
```

If we compile the above program, it will execute sequentially on the CPU, since we have not added any OpenMP Offloading directives. However, our goal is to offload part or all of this computation to the GPU.
Offloading Code to the GPU¶
To offload computation to the GPU, we need to use the `#pragma omp target` directive. This directive instructs the compiler to execute the enclosed code block on the GPU.
```c
// hello-world.c
#include <stdio.h>
#include <omp.h>

void Print_Hello_World()
{
    #pragma omp target
    for(int i = 0; i < 5; i++)
    {
        printf("Hello World!\n");
    }
}

int main()
{
    Print_Hello_World();
    return 0;
}
```
Compiling for multicore architecture¶
```bash
nvc -mp=multicore -Minfo=mp,accel hello-world.c
```

- `-mp=multicore`: Enables OpenMP parallelization and targets the multicore CPU architecture.
- `-Minfo=mp,accel`: Provides compiler feedback on OpenMP parallelization (mp) and accelerator offloading (accel). Even though this build targets a multicore CPU, the accel feedback becomes useful when the same source is compiled for GPU offloading.
Compiling for GPU architecture¶
```bash
nvc -mp=gpu -gpu=cc80 -Minfo=mp,accel hello-world.c
```

- `#pragma omp target`: Offloads the enclosed code block to the GPU.
- `-mp=gpu`: Enables OpenMP parallelization with GPU offloading.
- `-gpu=cc80`: Targets GPUs with compute capability 8.0 (e.g., NVIDIA A100).
- `-Minfo=mp,accel`: Provides feedback on OpenMP and acceleration optimizations for the GPU.
Compilation Output Analysis¶
When compiled for both multicore and GPU architectures:
- The compiler generates code for the selected target, so the same source remains compatible with both CPU and GPU.
- When targeting the GPU, the compiler feedback reports the generation of GPU kernels, which are the computational units executed on the GPU.
This simple example demonstrates how OpenMP Offloading allows a single source code to execute on both multicore CPUs and GPUs by modifying compiler flags.
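Beyond reading the compiler feedback, a simple way to check at runtime where the code actually executed is to query the OpenMP device routines; the file name and structure below are illustrative, not part of the original example:

```c
// check-offload.c (illustrative sketch)
#include <stdio.h>
#include <omp.h>

int main()
{
    // Number of non-host devices visible to the OpenMP runtime
    printf("Devices available: %d\n", omp_get_num_devices());

    int on_host = 1;

    // omp_is_initial_device() returns nonzero when the region
    // runs on the host (e.g., after falling back from the GPU).
    #pragma omp target map(from: on_host)
    {
        on_host = omp_is_initial_device();
    }

    printf("Target region ran on the %s\n", on_host ? "host CPU" : "GPU");
    return 0;
}
```

Compiled with the GPU flags shown above, this reports whether the target region executed on the GPU or fell back to the host CPU.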
Fortran Example with OpenMP Offloading¶
The OpenMP Offloading model can also be used in Fortran. Below is the "Hello World" example in Fortran, demonstrating how to offload computation to the GPU using similar constructs.
```fortran
subroutine Print_Hello_World()
    integer :: i

    !$omp target
    do i = 1, 5
        print *, "Hello World"
    end do
    !$omp end target
end subroutine Print_Hello_World

program main
    use omp_lib
    implicit none

    call Print_Hello_World()
end program main
```
Compiling for multicore architecture¶
```bash
nvfortran -mp=multicore -Minfo=mp,accel hello-world.f90
```
Compiling for GPU architecture¶
```bash
nvfortran -mp=gpu -gpu=cc80 -Minfo=mp,accel hello-world.f90
```
In Fortran:

- The `!$omp target` directive specifies that the code block should be offloaded to the GPU.
- The syntax and behavior mirror the C/C++ example, allowing a similar approach across both languages.
Note: The Fortran compiler may provide limited output regarding target device execution compared to C/C++. However, when properly configured, the code will execute on the specified target device (e.g., GPU).
Technical Considerations in OpenMP Offloading¶
OpenMP Offloading offers significant flexibility for heterogeneous computing, but achieving optimal performance often requires additional constructs and clauses:
- Thread Handling Constructs: OpenMP provides constructs such as `teams`, `parallel`, and `distribute` that define how work is distributed across GPU threads.
- Mapping Data: The `map` clause enables data transfer between the CPU and GPU, which is critical for ensuring that the necessary data is available on the device.
- Reduction and Synchronization: The `reduction` and `collapse` clauses facilitate parallel computation of complex operations (e.g., sums) and handle nested loops, respectively. A combined example is sketched after this list.
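As an illustration of these clauses working together, here is a minimal vector-addition sketch; the file name, array size, and initialization values are assumptions for this example, not part of the original code:

```c
// vector-add.c (illustrative sketch)
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main()
{
    static double a[N], b[N], c[N];

    // Initialize the input arrays on the host
    for (int i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    // Copy a and b to the device, bring c back to the host,
    // and distribute the loop iterations across GPU threads.
    #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[0] = %f, c[N-1] = %f\n", c[0], c[N - 1]);
    return 0;
}
```

Here `teams distribute parallel for` creates a league of teams and spreads the iterations over their threads, while the `map` clauses make the host arrays available on the device and copy the result back.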
Key OpenMP Offloading Constructs¶
The table below outlines some essential OpenMP Offloading constructs used for controlling computation and data transfer between the host and device.
| OpenMP API | Description |
|---|---|
| `#pragma omp target` | Offloads execution to a target device (e.g., GPU). |
| `#pragma omp loop` | Specifies a loop to be parallelized on the target device. |
| `reduction(op:list)` | Applies a reduction operation (e.g., sum, max) across a list of variables, allowing for parallel updates. |
| `collapse(n)` | Combines n nested loops into a single logical loop, simplifying nested loop execution on the device. |
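To show how these constructs combine, the following sketch sums a 2D array on the device using `target teams loop`, `collapse`, and `reduction`; the matrix dimensions and variable names are assumptions for illustration:

```c
// matrix-sum.c (illustrative sketch)
#include <stdio.h>
#include <omp.h>

#define ROWS 1000
#define COLS 1000

int main()
{
    static double m[ROWS][COLS];
    double sum = 0.0;

    // Initialize the matrix on the host
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = 1.0;

    // Collapse the two loops into one logical loop, offload it,
    // and combine the partial sums with a parallel reduction.
    #pragma omp target teams loop collapse(2) reduction(+:sum) map(to: m)
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            sum += m[i][j];
        }
    }

    printf("sum = %f (expected %f)\n", sum, (double)ROWS * COLS);
    return 0;
}
```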
By understanding these constructs, developers can effectively structure code for heterogeneous computing. In future articles, we will explore more advanced constructs and optimizations for parallel execution on GPUs.
Summary¶
This introduction to OpenMP Offloading demonstrates how to transfer code execution from the CPU to the GPU. Using the `#pragma omp target` directive, a single source code can be compiled to execute on both multicore CPUs and GPUs.
OpenMP Offloading also provides robust constructs for thread control, data transfer, and parallel reduction, making it a powerful tool for developing high-performance applications on heterogeneous systems.