OpenMP Offloading: Compute Constructs

This section provides an in-depth overview of OpenMP Offloading compute constructs, which enable parallel execution across CPUs and accelerators (e.g., GPUs). These constructs offer flexible, high-level parallelism and data-mapping options, allowing fine-grained control over computation and data handling across devices. Key constructs include target, target data, teams, and distribute, along with combined forms of these directives for maximum performance.


Key Compute Constructs in OpenMP Offloading

Target Construct

The target construct offloads execution to a device (e.g., a GPU). Variables used in the region are mapped to the device's data environment, either implicitly or through map clauses, and the enclosed code runs on that device.

  • C/C++: #pragma omp target [clauses]
  • Fortran: !$omp target [clauses]

The target construct is versatile, serving as the foundation for most offloading constructs by establishing a device context. It can work with other constructs like parallel, teams, and simd to optimize computation.
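
As a minimal sketch (the function scale_vector and its arguments are placeholders, not from any particular codebase), a bare target region maps its operands and runs the loop sequentially on the device's initial thread; parallelism only appears once teams or parallel is added:

    // Offload a vector scale; a bare target region runs on the device's
    // initial thread, with no parallelism until teams/parallel are added.
    void scale_vector(const double *a, double *b, int n, double s)
    {
        #pragma omp target map(to: a[0:n]) map(from: b[0:n])
        for (int i = 0; i < n; i++)
            b[i] = s * a[i];
    }

If no device is available, implementations typically fall back to executing the region on the host.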

Target data Construct

The target data construct defines a device data environment for a specified region, allowing data to reside on the device throughout the construct’s scope.

  • Usage: Maps data to the device without running computation directly, reducing the need for repeated transfers when several target regions use the same data, as the sketch below illustrates.

  • C/C++: #pragma omp target data [clauses]

  • Fortran: !$omp target data [clauses]
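
A minimal sketch of this pattern, assuming a placeholder array x of length n: the target data region keeps x resident on the device while two separate kernels update it, with no host round trip in between.

    void two_pass(double *x, int n)
    {
        #pragma omp target data map(tofrom: x[0:n])
        {
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < n; i++)
                x[i] *= 2.0;        // first kernel: uses the device copy

            #pragma omp target teams distribute parallel for
            for (int i = 0; i < n; i++)
                x[i] += 1.0;        // second kernel: no transfer in between
        }                           // x is copied back to the host here
    }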

Teams Construct

The teams construct creates a "league" of teams on the target device. Within each team, an initial thread executes the region code. The number of teams and the maximum number of threads per team can be bounded with clauses such as num_teams and thread_limit.

  • Purpose: Optimizes parallelism on devices with hierarchical thread structures (e.g., GPUs).

  • C/C++: #pragma omp target teams [clauses]

  • Fortran: !$omp target teams [clauses]

Each team executes the region independently; with a bare teams construct, every team's initial thread runs the whole block, so the construct is usually combined with distribute or parallel to divide the workload among teams and among the threads within them.
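
For illustration (the clause values 4 and 64 are arbitrary), this sketch requests at most four teams with at most 64 threads each; because there is no distribute or parallel inside, every team's initial thread would execute the block redundantly:

    void teams_demo(void)
    {
        #pragma omp target teams num_teams(4) thread_limit(64)
        {
            // Executed once by each team's initial thread; real codes
            // usually split work here with distribute and/or parallel.
        }
    }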

Distribute Construct

The distribute construct divides a loop’s iterations among the initial threads of the teams in a league, so that each team executes a subset of the iterations.

  • Usage: Often used with target and teams to allocate work across teams, which pays off for large loops whose iterations are independent.

  • C/C++: #pragma omp target teams distribute [clauses]

  • Fortran: !$omp target teams distribute [clauses]

On its own, distribute assigns each team's initial thread a chunk of the loop; adding an inner parallel for lets the threads of each team share that chunk. This makes distribute an essential construct in GPU programming, where loop iterations must be spread across many teams.
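
A minimal sketch, with init_array and its arguments as placeholders: each team's initial thread receives a chunk of the n iterations, and how the chunks are formed is implementation-defined unless a dist_schedule clause is given.

    void init_array(double *x, int n)
    {
        // Each team's initial thread initializes its chunk of x;
        // threads within a team stay idle without an inner parallel for.
        #pragma omp target teams distribute map(from: x[0:n])
        for (int i = 0; i < n; i++)
            x[i] = (double)i;
    }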


Combining Directives for Enhanced Performance

OpenMP Offloading constructs can be combined to increase performance, with each combination catering to different parallelization needs. For instance:

  • target teams: Offloads code to the device and launches a league of teams to execute it.
  • target teams distribute: Distributes loop iterations across the teams' initial threads on the device.
  • target teams distribute parallel for: Offloads and distributes iterations in parallel across threads within teams.
  • target teams distribute parallel for simd: Adds vectorization for enhanced performance, parallelizing work across teams, threads, and SIMD (single instruction, multiple data) lanes, as in the sketch below.
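
As a hedged sketch of the fully combined form, consider a saxpy-style kernel (the name and signature are illustrative): iterations are split across teams, shared among each team's threads, and vectorized across SIMD lanes.

    void saxpy(float a, const float *x, float *y, int n)
    {
        #pragma omp target teams distribute parallel for simd \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

Unless num_teams, thread_limit, or num_threads clauses pin them down, the implementation chooses the team and thread counts.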

OpenMP Offloading Hierarchy

[Figure: Hierarchy of OpenMP Offloading Constructs]


Detailed Syntax and Examples for Common Constructs

This table introduces the basic constructs for offloading code and data from the host (CPU) to a device (such as a GPU). Each construct’s syntax for both C/C++ and Fortran is shown along with a description:

C/C++ API | Fortran API | Description
#pragma omp target [clauses] | !$omp target [clauses] | Transfers code and data to the device.
#pragma omp target teams [clauses] | !$omp target teams [clauses] | Creates a league of teams on the device, each executed by its initial thread.
#pragma omp target teams distribute | !$omp target teams distribute | Distributes loop iterations among the teams' initial threads on the device.
#pragma omp target teams distribute parallel for | !$omp target teams distribute parallel do | Offloads code, distributes iterations across teams, and executes them in parallel within each team.
#pragma omp target teams distribute parallel for simd | !$omp target teams distribute parallel do simd | Adds SIMD vectorization to parallelized team execution.
  • target directive: Transfers execution from the host to the target device.
  • target teams directive: Establishes a hierarchy of teams on the device, where each team has a single initial thread.
  • Combined directives (e.g., target teams distribute parallel for): These combine device offloading with team distribution and parallel execution, enabling efficient utilization of device resources; a small runnable probe of the team hierarchy follows this list.
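
This probe assumes the implementation supports printf from device code (common, but not guaranteed): each team's initial thread reports its team number through the standard runtime calls omp_get_team_num and omp_get_num_teams.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        // Up to 4 teams; the implementation may create fewer.
        #pragma omp target teams num_teams(4)
        printf("hello from team %d of %d\n",
               omp_get_team_num(), omp_get_num_teams());
        return 0;
    }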

Loop and Parallel Constructs

This table focuses on constructs that allow parallelization of loops and concurrent execution on the target device:

C/C++ | Fortran | Description
#pragma omp target parallel [clauses] | !$omp target parallel [clauses] | Creates a team of threads on the device.
#pragma omp target parallel for | !$omp target parallel do | Parallelizes a loop across the device team's threads.
#pragma omp target parallel loop | !$omp target parallel loop | Parallelizes a loop using the loop construct, leaving the schedule to the implementation.
#pragma omp target teams loop | !$omp target teams loop | Offloads a loop and creates teams whose threads execute its iterations.
  • target parallel directive: Creates a single team of threads on the device to execute the region in parallel.
  • parallel for and parallel loop directives: Parallelize loops on the target device, allowing concurrent execution across loop iterations (sketched after this list).
  • teams loop directive: Offloads loops to teams of threads on the device, supporting fine-grained parallelism in iterative calculations.
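
Hedged sketches of two of these forms, with placeholder function names (note that the loop construct requires an OpenMP 5.0-aware compiler):

    void double_all(double *x, int n)
    {
        // One team of threads on the device shares the n iterations.
        #pragma omp target parallel for map(tofrom: x[0:n])
        for (int i = 0; i < n; i++)
            x[i] *= 2.0;
    }

    void halve_all(double *x, int n)
    {
        // loop lets the implementation pick how teams and threads
        // cover the iterations (OpenMP 5.0+).
        #pragma omp target teams loop map(tofrom: x[0:n])
        for (int i = 0; i < n; i++)
            x[i] *= 0.5;
    }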

SIMD and Combined Constructs

The table highlights constructs that introduce SIMD (Single Instruction, Multiple Data) execution along with device offloading and parallelization:

C/C++ | Fortran | Description
#pragma omp target simd | !$omp target simd | Offloads a loop and vectorizes it for SIMD execution.
#pragma omp target parallel for simd | !$omp target parallel do simd | Combines parallel execution with SIMD vectorization.
#pragma omp target teams distribute simd | !$omp target teams distribute simd | Distributes work among teams with SIMD execution.
#pragma omp target teams distribute parallel for simd | !$omp target teams distribute parallel do simd | Fully combines offloading, team distribution, parallelization, and SIMD.
  • simd directive: Vectorizes loop iterations for SIMD execution, optimizing throughput by applying operations across multiple data points simultaneously.
  • Combined SIMD with parallel and teams distribute: Enables fully optimized execution, combining offloading, team distribution, parallelization, and SIMD, and providing the highest potential speedup on devices that support these features; a reduction sketch follows this list.
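
As a final sketch (dot_product and its arguments are illustrative), a reduction combines naturally with the fully combined form; the explicit map(tofrom: sum) keeps the example portable to OpenMP 4.5 compilers:

    double dot_product(const double *x, const double *y, int n)
    {
        double sum = 0.0;
        // Partial sums from teams, threads, and SIMD lanes are combined
        // by the reduction clause; sum is mapped back to the host.
        #pragma omp target teams distribute parallel for simd \
                reduction(+: sum) map(to: x[0:n], y[0:n]) map(tofrom: sum)
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];
        return sum;
    }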

Explanation of Constructs and Usage

  • target parallel: Offloads code to the device and creates a team of threads to execute the region.
  • target teams distribute parallel for: Combines the target, teams, distribute, and parallel for constructs to distribute loop iterations across multiple teams, with each team executing its subset in parallel.
  • simd: The simd construct vectorizes loop iterations for more efficient execution on hardware that supports SIMD-style execution, such as CPU vector units or GPU lanes. It is used in combination with parallel for or distribute to achieve maximum throughput.

These constructs allow for flexible, hierarchical parallelism on heterogeneous hardware, making it easier to scale code across CPUs and accelerators. OpenMP Offloading's compute constructs enable fine-grained control over work distribution, data transfer, and SIMD execution, making it an ideal choice for high-performance applications on multi-threaded and multi-device architectures.