OpenACC Directives¶
OpenACC provides several constructs for defining parallel regions and managing work distribution among threads. The primary compute constructs in OpenACC are parallel, kernels, loop, and routine. These constructs allow developers to control parallel execution in designated code regions for C/C++ and Fortran. Below is an overview of each construct, including usage examples for both languages.
Parallel Construct¶
The parallel construct defines a region where parallel execution should occur. In this region, work is explicitly distributed among threads. This construct is more hands-on, requiring the programmer to indicate which parts of the code should run in parallel.
- C/C++ Example:
#pragma acc parallel
for(int i = 0; i < 5; i++) {
printf("Hello World!\n");
}
- Fortran Example:
!$acc parallel
do i = 1, 5
print *, "hello world"
end do
!$acc end parallel
Kernels Construct¶
The kernels construct defines a region where the compiler will analyze and automatically parallelize operations it deems safe for concurrent execution. This construct offers a "safer" approach by letting the compiler manage parallelization, requiring less manual intervention from the programmer.
- C/C++ Example:
#pragma acc kernels
for(int i = 0; i < 5; i++) {
printf("Hello World!\n");
}
- Fortran Example:
!$acc kernels
do i = 1, 5
print *, "hello world"
end do
!$acc end kernels
Difference between parallel and kernels¶
While both parallel and kernels constructs initiate parallel computation, there is an important distinction between them:
- parallel Construct: This construct requires the programmer to be explicit about which code regions should be parallelized. It offers more control but also demands careful coding to ensure correctness.
- kernels Construct: This construct allows the compiler to determine and apply safe parallelization strategies automatically. It's generally safer and simpler for the programmer, as the compiler implicitly handles most aspects of parallelization, reducing the risk of errors.
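To make the distinction concrete, the sketch below applies both constructs to the same vector-addition loop. It is a minimal illustration rather than code from a particular application; the array size N, the array names, and the initialization values are assumptions chosen for the example.
#include <stdio.h>

#define N 1000

int main(void) {
    float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = 2.0f * i;
    }

    /* parallel: the programmer asserts that the iterations are independent,
       and the compiler generates parallel code on that assumption. */
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    /* kernels: the compiler analyzes the region and parallelizes only
       what it can prove to be safe for concurrent execution. */
    #pragma acc kernels
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[0] = %f, c[%d] = %f\n", c[0], N - 1, c[N - 1]);
    return 0;
}
Both versions compute the same result here; the difference is in who takes responsibility for proving that the loop is safe to parallelize.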
Loop Construct¶
The loop construct in OpenACC is a directive that allows developers to parallelize loop iterations on accelerator devices, such as GPUs. By using the loop construct, you can specify how the iterations of a loop should be distributed across different levels of parallelism on an accelerator. This is a crucial part of optimizing code to achieve high performance in parallel computing, as it allows better control over how work is divided among threads and blocks.
- C/C++ Example:
// Either combined directive can be used here:
//   #pragma acc parallel loop   or   #pragma acc kernels loop
#pragma acc parallel loop
for (int i = 0; i < N; i++) {
    // Loop code here
}
- Fortran Example:
! Either combined directive can be used here:
!   !$acc parallel loop    or    !$acc kernels loop
!$acc parallel loop
do i = 1, N
   ! Loop code
end do
The loop construct in OpenACC offers several clauses that help control and optimize parallelism within loops:
- gang: Distributes iterations across multiple "gangs," which are equivalent to thread blocks in CUDA terminology. The number of gangs is controlled on the parallel (or combined parallel loop) directive.
  - Usage: num_gangs(N)
- worker: Further divides the work within each gang into "workers" (similar to threads within a block in CUDA). The number of workers per gang is likewise set on the parallel directive.
  - Usage: num_workers(N)
- vector: Splits the workload even further within each worker, enabling SIMD (Single Instruction, Multiple Data) execution across the vector lanes of each worker. The vector length, i.e. the number of SIMD lanes per worker, is also set on the parallel directive.
  - Usage: vector_length(N)
- collapse: Combines (or "collapses") tightly nested loops into a single loop for parallelization, allowing for more efficient handling of multi-dimensional data.
  - Usage: collapse(N), where N is the number of nested loops to merge
- reduction: Specifies reduction operations (such as sum, max, etc.) across the loop iterations, useful for accumulating results across parallel threads in a safe manner.
  - Usage: reduction(+:sum), reduction(max:m), etc.
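As a rough illustration of how these clauses combine, the sketch below sums over a two-dimensional index space. It is not taken from a particular application; the loop bounds, the sizing values in num_gangs and vector_length, and the variable names are assumptions chosen for the example.
#include <stdio.h>

int main(void) {
    const int N = 1000, M = 1000;
    double sum = 0.0;

    /* collapse(2) fuses the i and j loops into one iteration space;
       gang and vector distribute that space across the device, with
       num_gangs(256) and vector_length(128) sizing those levels
       (arbitrary values for illustration); reduction(+:sum) combines
       the partial sums from all threads safely. */
    #pragma acc parallel loop collapse(2) gang vector num_gangs(256) vector_length(128) reduction(+:sum)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            sum += (double)i * (double)j;
        }
    }

    printf("sum = %.0f\n", sum);
    return 0;
}
In practice, it is often best to start with just collapse and reduction and let the compiler choose the gang, worker, and vector sizes, tuning num_gangs, num_workers, and vector_length only after profiling.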
Routine Construct¶
The routine construct allows the definition of functions that can be safely called within parallel regions. It specifies that a function is compatible with parallel execution. This can be especially useful when complex computations are encapsulated within functions that need to be called inside a parallel region.
- C/C++ Example:
#pragma acc routine seq
extern int simplecompute(int a);

#pragma acc routine seq
int simplecompute(int a) {       /* takes the value directly, not a pointer */
    return a % 2;
}

void maincompute(int *x, int N) {
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        x[i] = simplecompute(i); /* each iteration calls the routine on the device */
    }
}
- Fortran Example:
integer function simplecompute(a)
  !$acc routine seq
  integer, intent(in) :: a
  simplecompute = mod(a, 2)
end function simplecompute

subroutine maincompute(x, N)
  integer, intent(inout) :: x(*)
  integer, intent(in) :: N
  integer :: i
  ! Explicit interface so the routine directive is visible at the call site
  interface
    integer function simplecompute(a)
      !$acc routine seq
      integer, intent(in) :: a
    end function simplecompute
  end interface
  !$acc parallel loop
  do i = 1, N
    x(i) = simplecompute(i)
  end do
  !$acc end parallel loop
end subroutine maincompute
In this example:
- simplecompute: Declared with !$acc routine seq to run sequentially on each element.
- maincompute: Runs a parallel loop, calling simplecompute for each index in x.
Summary¶
The OpenACC compute constructs (parallel, kernels, loop, and routine) provide flexible mechanisms for specifying parallel execution, with varying levels of control and automation. Understanding how to apply these constructs allows developers to effectively harness parallel computing power with minimal changes to their existing codebases. Each construct serves a specific purpose:
- parallel: Provides explicit parallelization control to the programmer.
- kernels: Relies on the compiler to determine safe parallel execution, reducing manual intervention.
- loop: Parallelizes loops in conjunction with compute constructs.
- routine: Enables function calls within parallel regions, facilitating modular and reusable code in parallel contexts.
By leveraging these constructs, OpenACC enables powerful parallelism that is both easy to implement and highly portable across different high-performance computing architectures.