Basics in OpenACC

This article demonstrates a basic “Hello World” example of parallel programming with OpenACC in both C/C++ and Fortran. We’ll start with a serial version of the code and then enhance it with OpenACC directives, enabling the loop to execute in parallel on the GPU.


Hello World in C/C++

The following example presents a simple “Hello World” function written in C/C++.

C/C++ Serial Version

In this serial version, the loop in the Print_Hello_World function is not parallelized, meaning each iteration of the loop is executed sequentially on the CPU. Here’s the code for Hello_World.c:

For C/C++

// Hello_World.c
#include <stdio.h>

void Print_Hello_World()
{
  for (int i = 0; i < 5; i++)
    {
      printf("Hello World!\n");
    }
}

int main()
{
  Print_Hello_World();
  return 0;
}

When we compile this code with the nvc compiler (part of the NVIDIA HPC SDK), the compiler reports whether it could parallelize the code. In the command below, -fast enables aggressive optimizations, -acc turns on OpenACC directive processing, -Minfo=accel,all prints diagnostics about acceleration and other optimizations, and -gpu=cc80 targets GPUs with compute capability 8.0 (for example, the NVIDIA A100).

Compilation

nvc -fast -acc -Minfo=accel,all -gpu=cc80 Hello_World.c

Compilation output

Print_Hello_World:
      6, Loop not vectorized/parallelized: contains call
main:
     14, Print_Hello_World inlined, size=5 (inline) file Hello_World.c (4)
           6, Loop not vectorized/parallelized: contains call

The output shows that the compiler did not parallelize the loop. The message "Loop not vectorized/parallelized: contains call" means the loop body contains a function call (printf), which blocks automatic vectorization; and since no OpenACC directive was given, nothing is offloaded to the GPU.
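
By contrast, a loop whose body is pure arithmetic can often be auto-vectorized by -fast even without any OpenACC directive. The following sketch is purely illustrative; the Scale function is hypothetical and not part of the article’s example:

// scale.c: hypothetical call-free loop; -fast can typically auto-vectorize it.
void Scale(float *a, int n)
{
  for (int i = 0; i < n; i++)
    {
      a[i] = a[i] * 2.0f;  /* pure arithmetic, no function call in the body */
    }
}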

C/C++ Parallel Version with OpenACC

By adding the OpenACC kernels directive, we can instruct the compiler to parallelize the for loop in Print_Hello_World, which enables the code to execute on the GPU. Here’s the updated code:

For C/C++

// Hello_World_OpenACC.c
#include <stdio.h>

void Print_Hello_World()
{
  // Offload this loop to the GPU via the kernels directive.
#pragma acc kernels
  for (int i = 0; i < 5; i++)
    {
      printf("Hello World!\n");
    }
}

int main()
{
  Print_Hello_World();
  return 0;
}

Here, the #pragma acc kernels directive tells the compiler to analyze and potentially parallelize the code within the scope of this directive.
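
Before relying on GPU execution, it can be worth confirming that the OpenACC runtime actually sees a device. The sketch below is a minimal check, assuming the NVIDIA HPC SDK: acc_get_num_devices and the acc_device_nvidia device type come from the OpenACC runtime API declared in openacc.h.

// check_device.c: minimal sketch that asks the OpenACC runtime how many GPUs it sees.
// Compile with: nvc -acc check_device.c
#include <stdio.h>
#include <openacc.h>

int main()
{
  int n = acc_get_num_devices(acc_device_nvidia);  // number of visible NVIDIA GPUs
  printf("OpenACC sees %d NVIDIA device(s)\n", n);
  return 0;
}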

Compilation

nvc -fast -acc -Minfo=accel,all -gpu=cc80 Hello_World_OpenACC.c

Compilation output

Print_Hello_World:
      8, Loop is parallelizable
         Generating NVIDIA GPU code
          8, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */

In this output, the compiler confirms that the for loop is parallelizable and generates GPU code. The gang and vector annotations show how OpenACC maps work onto the GPU: gang corresponds to a CUDA thread block (blockIdx.x) and vector to the threads within a block (threadIdx.x); here the compiler chose a vector length of 32 threads.

Note: In this example, we used the kernels directive, which asks the compiler to find parallelizable regions on its own. For finer control, the loop directive can be combined with kernels to mark a specific loop for parallelization, which is often useful in more complex code.
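
For example, a variant of Print_Hello_World using the combined kernels loop directive might look like the following sketch; the optional independent clause asserts to the compiler that the iterations do not depend on each other:

// Hello_World_OpenACC_loop.c: illustrative variant with an explicit loop directive.
#include <stdio.h>

void Print_Hello_World()
{
#pragma acc kernels loop independent
  for (int i = 0; i < 5; i++)
    {
      printf("Hello World!\n");
    }
}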


Hello World in Fortran

The following example illustrates the same Hello World function, but written in Fortran.

Fortran Serial Version

In the initial version, the Print_Hello_World subroutine is serial: each iteration of the loop executes sequentially on the CPU. Here’s the code for Hello_World.f90:

For Fortran

!! Hello_World.f90
subroutine Print_Hello_World()
  integer :: i
  do i = 1, 5
     print *, "Hello World!"
  end do
end subroutine Print_Hello_World

We compile this Fortran code as follows:

Compilation

nvfortran -fast -acc -Minfo=accel -gpu=cc80 Hello_World.f90

Compilation output

print_hello_world:
      3, Loop not vectorized/parallelized: contains call

The output shows that the loop within Print_Hello_World is not parallelized: the "contains call" message means the loop body contains a procedure call (the print statement), and without an OpenACC directive the compiler makes no attempt to offload it.

Fortran Parallel Version with OpenACC

Now, let’s parallelize the loop in Print_Hello_World with OpenACC. Because Fortran’s print cannot execute on the GPU (see the note below), the loop body is changed to compute a simple sum inside the kernels region; the result is printed on the CPU afterwards. Here’s the modified code:

!! Hello_World_OpenACC.f90
subroutine Print_Hello_World()
  integer :: i, temp
  temp = 0  ! initialize the accumulator before the offloaded loop
  !$acc kernels
  do i = 1, 5
     temp = temp + i
  end do
  !$acc end kernels
  print *, "Result: ", temp  ! printed on the CPU, after the kernels region
end subroutine Print_Hello_World

Here, the !$acc kernels directive, closed by !$acc end kernels, tells the compiler to analyze and parallelize the enclosed region.

Compilation

nvfortran -fast -acc -Minfo=accel -gpu=cc80 Hello_World_OpenACC.f90

Compilation output

print_hello_world:
      4, Loop is parallelizable
         Generating NVIDIA GPU code
          4, !$acc loop gang, vector(32) ! blockidx%x threadidx%x

In the output, the compiler indicates that the loop has been parallelized for GPU execution. As in the C/C++ example, gang and vector describe the execution configuration: each gang corresponds to a thread block, and vector(32) gives each gang a vector length of 32 threads.

Fortran’s Limited I/O on GPU: Fortran’s print statement cannot execute inside GPU device code, because Fortran’s standard I/O is not supported on the device. This is why the OpenACC version computes the sum inside the kernels region and prints the result from the CPU afterwards.


Key Concepts and Considerations

  1. The kernels Directive: In both examples, the kernels directive instructs the compiler to parallelize the loop. It lets OpenACC automatically identify parallelizable code regions, manage the required data movement between host and device, and offload the work to the GPU.

  2. Execution Model: The parallel execution model in OpenACC is organized into gangs, workers, and vectors, similar to CUDA’s blocks, warps, and threads. The compiler chooses an optimal configuration based on the loop’s structure and target GPU architecture.

    • In our example, vector(32) specifies that each gang has 32 threads.
    • The compiler makes these choices automatically, but developers can also specify values manually for finer control (see the sketch after this list).
  3. Comparison to Serial Execution: In the initial serial versions of both C/C++ and Fortran examples, each loop iteration is executed sequentially on the CPU.

    • By adding OpenACC directives, the loop is transformed to run on the GPU, allowing parallel execution of multiple iterations, which can lead to significant performance improvements, especially with larger workloads.
  4. Compiler Feedback: The compiler output provides insights into the parallelization process, such as whether the loop was parallelized and the specific GPU code generated. These messages can help identify bottlenecks or cases where additional optimization may be needed.
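
To illustrate points 2 and 3, the hypothetical C sketch below mirrors the Fortran sum example: the combined parallel loop directive carries explicit num_gangs and vector_length clauses in place of the compiler’s automatic choices, and reduction(+:temp) tells the compiler to combine the per-iteration partial sums safely.

// sum_reduction.c: illustrative sketch of manual tuning clauses and a reduction.
#include <stdio.h>

int main()
{
  int temp = 0;

  // num_gangs and vector_length override the compiler's automatic launch
  // configuration; reduction(+:temp) accumulates safely across iterations.
#pragma acc parallel loop num_gangs(1) vector_length(32) reduction(+:temp)
  for (int i = 1; i <= 5; i++)
    {
      temp += i;
    }

  printf("Result: %d\n", temp);  // prints 15, computed by the offloaded loop
  return 0;
}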


Summary

By following these steps, you can successfully parallelize basic code examples using OpenACC, enabling computation on the GPU and improving performance. This simple “Hello World” example is a foundation for understanding OpenACC’s directives and how they interact with GPU architecture.