OpenACC Directives

Script
  • "OpenACC provides several constructs that define parallel regions and control how work is distributed among threads. The main constructs include parallel, kernels, loop, and routine, each offering different levels of control and automation for parallel execution in C, C++, and Fortran."
  • "The parallel construct allows developers to define regions where parallel execution should occur. This construct offers a high level of control, as the programmer specifies which parts of the code should be parallelized. Here, a simple #pragma acc parallel directive starts parallel execution of the region; adding a loop directive distributes the loop’s work across threads, allowing concurrent execution."
  • "The kernels construct differs from parallel by allowing the compiler to analyze and automatically determine which operations can safely run in parallel. This reduces the need for manual intervention and helps prevent potential errors, making it easier for developers to implement parallelism."
  • "While both parallel and kernels constructs enable parallel execution, parallel requires explicit directives from the programmer, providing greater control. In contrast, kernels allows the compiler to manage parallelization, simplifying the process and ensuring safe execution."
  • "The loop construct in OpenACC is essential for parallelizing loop iterations on GPUs and other accelerators. It provides clauses like gang, worker, and vector, which control how work is distributed among threads, and others like collapse and reduction to optimize multi-dimensional loops and safely perform accumulation tasks across threads."
  • "The routine construct enables the use of functions within parallel regions, marking them as compatible for parallel execution. This is particularly useful for modular code design, allowing complex computations within functions to run in parallel when called in larger parallel regions."
  • "In summary, OpenACC’s primary constructs—parallel, kernels, loop, and routine—provide flexible, powerful ways to specify parallel execution. Each construct offers different levels of control, from explicit parallelization with parallel to compiler-managed execution with kernels, optimizing code for high-performance computing environments."

OpenACC provides several constructs for defining parallel regions and managing work distribution among threads. The primary compute constructs in OpenACC are parallel, kernels, loop, and routine. These constructs allow developers to control parallel execution in designated code regions for C/C++ and Fortran. Below is an overview of each construct, including usage examples for both languages.


Parallel Construct

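The parallel construct defines a region that is executed in parallel on the accelerator. The programmer is responsible for stating how work inside the region is shared: code that is not covered by a loop directive is executed redundantly by every gang, so in the examples below each gang runs the whole loop. This construct is more hands-on, requiring the programmer to indicate which parts of the code should run in parallel and to ensure that doing so is safe.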

  • C/C++ Example:
#pragma acc parallel 
for(int i = 0; i < 5; i++) {                                
    printf("Hello World!\n");
}
  • Fortran Example:
!$acc parallel
do i = 1, 5
   print *, "hello world"
end do
!$acc end parallel
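
The "Hello World" loops above contain no loop directive, so they show only how a parallel region is opened. As a contrast, the following minimal sketch pairs parallel with a loop directive so that the iterations are divided among gangs. The function name vec_add and the data clauses are illustrative assumptions, not part of the original example. Without an OpenACC-enabled compiler the pragmas are ignored and the code runs serially with the same result.

```c
/* Hypothetical helper: c[i] = a[i] + b[i] for i in [0, n). */
void vec_add(const float *a, const float *b, float *c, int n)
{
    /* parallel opens the region on the accelerator; the data clauses
       say which array sections move to and from the device. */
    #pragma acc parallel copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        /* loop asserts that the iterations are independent, so they
           may be shared among gangs instead of repeated by each one. */
        #pragma acc loop
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }
}
```

The same effect can be written with the combined form #pragma acc parallel loop, which is the spelling used in the loop examples later on this page.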

Kernels Construct

The kernels construct defines a region where the compiler will analyze and automatically parallelize operations it deems safe for concurrent execution. This construct offers a "safer" approach by letting the compiler manage parallelization, requiring less manual intervention from the programmer.

  • C/C++ Example:
#pragma acc kernels 
for(int i = 0; i < 5; i++) {                                
    printf("Hello World!\n");
}
  • Fortran Example:
!$acc kernels
do i = 1, 5
   print *, "hello world"
end do
!$acc end kernels
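
To illustrate the compiler-driven analysis on loops that do real work, here is a minimal sketch of a kernels region containing two loops. The function name scale_and_shift is hypothetical. The restrict qualifiers are an assumption that the arrays do not overlap; compilers typically need that guarantee (or an independent clause on the loop) before they will parallelize pointer-based loops under kernels.

```c
/* Hypothetical helper: y = 2*x, then z = y + 1, element by element. */
void scale_and_shift(const float *restrict x, float *restrict y,
                     float *restrict z, int n)
{
    #pragma acc kernels copyin(x[0:n]) copyout(y[0:n], z[0:n])
    {
        /* The compiler analyzes each loop separately and decides
           whether, and how, to parallelize it. */
        for (int i = 0; i < n; i++) {
            y[i] = 2.0f * x[i];
        }
        for (int i = 0; i < n; i++) {
            z[i] = y[i] + 1.0f;
        }
    }
}
```

Compilers that support OpenACC can usually report what they did with each loop (for example through an optimization-feedback flag), which is the practical way to check whether a kernels region was actually parallelized.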

Difference between parallel and kernels

While both parallel and kernels constructs initiate parallel computation, there is an important distinction between them:

  • parallel Construct: This construct requires the programmer to be explicit about which code regions should be parallelized. It offers more control but also demands careful coding to ensure correctness.

  • kernels Construct: This construct allows the compiler to determine and apply safe parallelization strategies automatically. It’s generally safer and simpler for the programmer, as the compiler implicitly handles most aspects of parallelization, reducing the risk of errors.
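
The practical consequence of this distinction shows up on loops whose iterations depend on each other. The sketch below is an illustrative assumption (the function name running_total is hypothetical): it places a running-total loop under kernels, where the compiler is expected to detect the dependence and keep the loop sequential.

```c
/* Hypothetical helper: in-place running total, a[i] = a[0] + ... + a[i]. */
void running_total(int *restrict a, int n)
{
    /* Each iteration reads the value written by the previous one, so
       the iterations are NOT independent. Under kernels the compiler is
       expected to notice this and run the loop sequentially. Writing
       "parallel loop" here instead would assert independence and could
       silently produce wrong results. */
    #pragma acc kernels copy(a[0:n])
    for (int i = 1; i < n; i++) {
        a[i] += a[i - 1];
    }
}
```

With parallel, the correctness of such a loop is the programmer's responsibility; with kernels, the cost of the extra safety is that a conservative compiler may also leave genuinely parallel loops sequential.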


Loop Construct

The loop construct in OpenACC is a directive that allows developers to parallelize loop iterations on accelerator devices, such as GPUs. By using the loop construct, you can specify how the iterations of a loop should be distributed across different levels of parallelism on an accelerator. This is a crucial part of optimizing code to achieve high performance in parallel computing, as it allows better control over how work is divided among threads and blocks.

  • C/C++ Example:
// Use either "parallel loop" (shown) or "kernels loop" on the directive line.
#pragma acc parallel loop
for (int i = 0; i < N; i++) {
    // Loop code here
}
  • Fortran Example:
! Use either "parallel loop" (shown) or "kernels loop" on the directive line.
!$acc parallel loop
do i = 1, N
    ! Loop code
end do

The loop construct in OpenACC offers several clauses that help control and optimize parallelism within loops:

  • gang: Distributes iterations across multiple "gangs," which are roughly equivalent to thread blocks in CUDA terminology. The gang clause goes on the loop directive; the number of gangs is set separately with num_gangs on the enclosing parallel construct.
    • Usage: gang on the loop, num_gangs(N) on parallel
  • worker: Further divides work within each gang among "workers" (roughly, groups of threads within a block in CUDA). The number of workers per gang is set with num_workers on the enclosing parallel construct.
    • Usage: worker on the loop, num_workers(N) on parallel
  • vector: Splits the workload even further within each worker, enabling SIMD (Single Instruction, Multiple Data) execution across the vector lanes of each worker. The vector length, or the number of SIMD lanes per worker, is set with vector_length on the enclosing parallel construct.
    • Usage: vector on the loop, vector_length(N) on parallel
  • collapse: Combines (or "collapses") nested loops into a single loop for parallelization, allowing for more efficient handling of multi-dimensional data.
  • reduction: Specifies reduction operations (such as sum, max, etc.) across the loop iterations, useful for accumulating results across parallel threads in a safe manner.
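
The clauses above can be sketched in C as follows. This is an illustrative example under stated assumptions: the function names saxpy and matrix_sum are hypothetical, and the values 64 and 128 are arbitrary choices rather than tuned settings. Without an OpenACC-enabled compiler the pragmas are ignored and both functions run serially with the same results.

```c
/* Hypothetical helper: y = alpha * x + y with an explicit gang/vector mapping. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    /* num_gangs and vector_length size the parallel region; gang and
       vector say how the loop iterations map onto those resources. */
    #pragma acc parallel loop gang vector num_gangs(64) vector_length(128) \
        copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}

/* Hypothetical helper: sum of a rows x cols matrix stored row-major. */
double matrix_sum(const double *m, int rows, int cols)
{
    double sum = 0.0;
    /* collapse(2) merges both loops into one iteration space;
       reduction(+:sum) gives each thread a private partial sum and
       combines the partial sums when the loop finishes. */
    #pragma acc parallel loop collapse(2) reduction(+:sum) \
        copyin(m[0:rows*cols])
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            sum += m[r * cols + c];
        }
    }
    return sum;
}
```

In practice it is common to start with a plain parallel loop and let the compiler pick the mapping, adding gang, worker, and vector only when profiling suggests the default is a poor fit.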

Routine Construct

The routine construct allows the definition of functions that can be safely called within parallel regions. It specifies that a function is compatible with parallel execution. This can be especially useful when complex computations are encapsulated within functions that need to be called inside a parallel region.

  • C/C++ Example:
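#pragma acc routine seq
extern int simplecompute(int a);

// Returns the remainder of a divided by 2 (0 for even, 1 for odd, when a >= 0).
#pragma acc routine seq
int simplecompute(int a) {
    return a % 2;
}

void maincompute(int *x, int N) {
    // copyout states how much of x must be transferred back from the device.
    #pragma acc parallel loop copyout(x[0:N])
    for (int i = 0; i < N; i++) {
        x[i] = simplecompute(i);
    }
}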
  • Fortran Example:
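module compute_mod
    implicit none
contains

    integer function simplecompute(a)
        ! In Fortran the routine directive goes in the specification part.
        !$acc routine seq
        integer, intent(in) :: a
        simplecompute = mod(a, 2)
    end function simplecompute

    subroutine maincompute(x, N)
        integer, intent(in) :: N
        ! Explicit shape x(N) lets the compiler determine the transfer size.
        integer, intent(inout) :: x(N)
        integer :: i

        !$acc parallel loop
        do i = 1, N
            x(i) = simplecompute(i)
        end do
        !$acc end parallel loop
    end subroutine maincompute

end module compute_mod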

In this example:

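  • simplecompute: Marked with the routine seq directive (#pragma acc routine seq in C/C++, !$acc routine seq in Fortran), so the compiler generates a device version that each calling thread executes sequentially.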
  • maincompute: Runs a parallel loop, calling simplecompute for each index in x.

Summary

The OpenACC compute constructs (parallel, kernels, loop, and routine) provide flexible mechanisms for specifying parallel execution, with varying levels of control and automation. Understanding how to apply these constructs allows developers to effectively harness parallel computing power with minimal changes to their existing codebases. Each construct serves a specific purpose:

  • parallel: Provides explicit parallelization control to the programmer.
  • kernels: Relies on the compiler to determine safe parallel execution, reducing manual intervention.
  • loop: Parallelizes loops in conjunction with compute constructs.
  • routine: Enables function calls within parallel regions, facilitating modular and reusable code in parallel contexts.

By leveraging these constructs, OpenACC enables powerful parallelism that is both easy to implement and highly portable across different high-performance computing architectures.