Basics in OpenACC¶
This article demonstrates a basic example of parallel programming with OpenACC, showing how to run a simple "Hello World"-style loop in parallel in both C/C++ and Fortran. We’ll start with a basic serial version of the code and then enhance it with OpenACC directives, enabling the loop to execute in parallel on the GPU.
Hello World in C/C++¶
The following example presents a simple “Hello World” function written in C/C++.
C/C++ Serial Version¶
In this serial version, the loop in the Print_Hello_World function is not parallelized, meaning each iteration of the loop is executed sequentially on the CPU. Here’s the code for Hello_World.c:
For C/C++
// Hello_World.c
#include <stdio.h>

void Print_Hello_World()
{
    for(int i = 0; i < 5; i++)
    {
        printf("Hello World!\n");
    }
}
Compilation
nvc -fast -acc -Minfo=accel,all -gpu=cc80 Hello_World.c
Compilation output
Print_Hello_World:
6, Loop not vectorized/parallelized: contains call
main:
14, Print_Hello_World inlined, size=5 (inline) file Hello_World.c (4)
6, Loop not vectorized/parallelized: contains call
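The compilation output also refers to a main function into which Print_Hello_World is inlined. The article’s listing shows only the function itself; a minimal driver consistent with that output might look like this (assumed for illustration):
// Hello_World.c (continued) -- a minimal, assumed main; the compiler
// output above shows Print_Hello_World being inlined into main.
int main()
{
    Print_Hello_World();   // the loop runs sequentially on the CPU
    return 0;
}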
C/C++ Parallel Version with OpenACC¶
By adding the OpenACC kernels directive, we can instruct the compiler to parallelize the for loop in Print_Hello_World, which enables the code to execute on the GPU. Here’s the updated code:
For C/C++
// Hello_World_OpenACC.c
#include <stdio.h>

void Print_Hello_World()
{
    #pragma acc kernels
    for(int i = 0; i < 5; i++)
    {
        printf("Hello World!\n");
    }
}
Here, the #pragma acc kernels directive tells the compiler to analyze and potentially parallelize the code within the scope of this directive.
Compilation
nvc -fast -acc -Minfo=accel,all -gpu=cc80 Hello_World_OpenACC.c
Compilation output
Print_Hello_World:
8, Loop is parallelizable
Generating NVIDIA GPU code
8, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
In this output, the compiler confirms that the for loop is parallelizable and generates GPU code. The terms gang and vector (here with 32 threads per vector) indicate how OpenACC assigns work on the GPU: gang and vector correspond to CUDA’s “blocks” and “threads,” allowing the GPU to divide the work among multiple processing units efficiently.
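Running the resulting binary prints the message once per iteration. Because the iterations now execute in parallel on the GPU, the five lines may appear in any order (the binary name and the run below are illustrative):
./a.out
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!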
Note: In this example, we used the kernels directive, which asks the compiler to find parallelizable regions on its own. For finer control, the loop directive can be combined with kernels to make the parallelization explicit, which is often useful in more complex loops.
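As a sketch of that finer control (illustrative, not taken from the article), the same function can pin down the mapping the compiler chose automatically above:
// Hello_World_OpenACC_loop.c -- illustrative variant with an explicit
// loop directive; gang ~ CUDA block, vector(32) ~ 32 CUDA threads.
#include <stdio.h>

void Print_Hello_World()
{
    #pragma acc kernels
    {
        #pragma acc loop gang vector(32)
        for(int i = 0; i < 5; i++)
        {
            printf("Hello World!\n");
        }
    }
}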
Hello World in Fortran¶
The following example illustrates the Fortran counterpart. Because Fortran’s standard I/O is not supported inside GPU kernels (see the note below), the loop accumulates a sum instead of printing, and the result is printed on the host afterwards.
Fortran Serial Version¶
In the initial version, the Print_Hello_World subroutine is written in a serial format, meaning each iteration of the loop executes sequentially on the CPU. Here’s the code for Hello_World.f90:
For Fortran:
!! Hello_World.f90
subroutine Print_Hello_World()
    integer :: i, temp = 0
    do i = 1, 5
        temp = temp + i
    end do
    print *, "Result: ", temp
end subroutine Print_Hello_World
Compilation
nvfortran -fast -acc -Minfo=accel -gpu=cc80 Hello_World.f90
Compilation output
print_hello_world:
3, Loop not vectorized/parallelized: contains call
The output shows that the loop within Print_Hello_World is not parallelized by default: the compiler recognizes it but has no directive to act upon.
Fortran Parallel Version with OpenACC¶
Now, let’s add OpenACC to parallelize the loop in Print_Hello_World. Here’s the modified code:
!! Hello_World_OpenACC.f90
subroutine Print_Hello_World()
    integer :: i, temp = 0
    !$acc kernels
    do i = 1, 5
        temp = temp + i
    end do
    !$acc end kernels
    print *, "Result: ", temp
end subroutine Print_Hello_World
The !$acc kernels directive tells the compiler to attempt parallelization of the enclosed loop.
Compilation
nvfortran -fast -acc -Minfo=accel -gpu=cc80 Hello_World_OpenACC.f90
Compilation Output
print_hello_world:
4, Loop is parallelizable
Generating NVIDIA GPU code
4, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
In the output, the compiler indicates that the loop has been parallelized for GPU execution. As in C/C++, gang and vector refer to the GPU execution configuration: a gang represents a group of threads, and vector(32) specifies that each gang runs 32 threads. Note that temp is accumulated across iterations; for a simple sum like this, the compiler typically detects an implicit reduction and combines the partial results safely.
Fortran's Limited I/O on GPU: Fortran doesn’t support print statements inside GPU kernels, as its standard I/O model is not GPU-compatible. This is why the example accumulates a result inside the loop and prints it on the host, after the kernels region ends.
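After compiling, running the program prints the sum computed on the GPU: with temp starting at zero, the loop adds 1 + 2 + 3 + 4 + 5. A sample run (the binary name and the exact spacing of Fortran’s list-directed output may vary):
./a.out
 Result:  15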
Key Concepts and Considerations¶
- Data Movement: In both examples, the kernels directive instructs the compiler to parallelize the loop. It lets OpenACC identify parallelizable code regions, offload them to the GPU, and generate the implicit data transfers between host and device; these transfers can also be spelled out explicitly, as in the sketch after this list.
- Execution Model: The parallel execution model in OpenACC is organized into gangs, workers, and vectors, similar to CUDA’s blocks, warps, and threads. The compiler chooses a configuration based on the loop’s structure and the target GPU architecture.
  - In our example, vector(32) specifies that each gang has 32 threads.
  - The compiler makes these choices automatically, but developers can also specify the values manually for more control (see the sketch after this list).
- Comparison to Serial Execution: In the initial serial versions of both the C/C++ and Fortran examples, each loop iteration executes sequentially on the CPU.
  - By adding OpenACC directives, the loop runs on the GPU instead, allowing multiple iterations to execute in parallel, which can lead to significant performance improvements, especially with larger workloads.
- Compiler Feedback: The compiler output provides insight into the parallelization process, such as whether the loop was parallelized and what GPU code was generated. These messages help identify bottlenecks or cases where additional optimization is needed.
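To make these points concrete, the sketch below (assumed for this article, not one of its examples) applies them to a slightly larger loop: explicit copyin/copy clauses replace the implicit host-device transfers the compiler would otherwise generate, and num_gangs/vector_length override its automatic execution configuration:
// saxpy_acc.c -- illustrative sketch, not part of the article's examples.
#include <stdio.h>

#define N 1024

int main()
{
    float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // copyin: x is copied host -> device before the region;
    // copy:   y is copied in both directions, since it is read and written.
    // num_gangs/vector_length fix the execution configuration explicitly.
    #pragma acc parallel loop num_gangs(32) vector_length(32) \
                copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; i++)
    {
        y[i] = 2.0f * x[i] + y[i];   // one iteration per GPU thread
    }

    printf("y[0] = %f\n", y[0]);     // expected: 4.000000
    return 0;
}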
Summary¶
By following these steps, you can successfully parallelize basic code examples using OpenACC, enabling computation on the GPU and improving performance. This simple “Hello World” example is a foundation for understanding OpenACC’s directives and how they interact with GPU architecture.