Basics in OpenACC¶
This article demonstrates a basic example of parallel programming with OpenACC, showing how to run a simple "Hello World"-style loop in parallel in both C/C++ and Fortran. We’ll start with a basic serial version of the code and then enhance it with OpenACC directives, enabling the loop to execute in parallel on the GPU.
Hello World in C/C++¶
The following example presents a simple “Hello World” function written in C/C++.
C/C++ Serial Version¶
In this serial version, the loop in the Print_Hello_World function is not parallelized, meaning each iteration of the loop is executed sequentially on the CPU. Here’s the code for Hello_World.c:
For C/C++
// Hello_World.c
#include <stdio.h>

void Print_Hello_World()
{
    for(int i = 0; i < 5; i++)
    {
        printf("Hello World!\n");
    }
}
Compilation
nvc -fast -acc -Minfo=accel,all -gpu=cc80 Hello_World.c
Compilation output
Print_Hello_World:
6, Loop not vectorized/parallelized: contains call
main:
14, Print_Hello_World inlined, size=5 (inline) file Hello_World.c (4)
6, Loop not vectorized/parallelized: contains call
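The compilation output also refers to a main function into which Print_Hello_World is inlined. The article’s listing shows only the function itself; a minimal driver consistent with that output might look like this (assumed for illustration):
// Hello_World.c (continued) -- a minimal, assumed main; the compiler
// output above shows Print_Hello_World being inlined into main.
int main()
{
    Print_Hello_World();   // the loop runs sequentially on the CPU
    return 0;
}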
C/C++ Parallel Version with OpenACC¶
By adding the OpenACC kernels directive, we can instruct the compiler to parallelize the for loop in Print_Hello_World, which enables the code to execute on the GPU. Here’s the updated code:
For C/C++
// Hello_World_OpenACC.c
#include <stdio.h>

void Print_Hello_World()
{
    #pragma acc kernels
    for(int i = 0; i < 5; i++)
    {
        printf("Hello World!\n");
    }
}
Here, the #pragma acc kernels directive tells the compiler to analyze and potentially parallelize the code within the scope of this directive.
Compilation
nvc -fast -acc -Minfo=accel,all -gpu=cc80 Hello_World_OpenACC.c
Compilation output
Print_Hello_World:
8, Loop is parallelizable
Generating NVIDIA GPU code
8, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
In this output, the compiler confirms that the for loop is parallelizable and generates GPU code. The terms gang and vector (here with 32 threads per vector) indicate how OpenACC assigns work on the GPU: gang and vector correspond to CUDA’s “blocks” and “threads,” allowing the GPU to divide the work among multiple processing units efficiently.
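Running the resulting binary prints the message once per iteration. Because the iterations now execute in parallel on the GPU, the five lines may appear in any order (the binary name and the run below are illustrative):
./a.out
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!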
Note: In this example, we used the kernels directive, which asks the compiler to find parallelizable regions on its own. For finer control, the loop directive can be combined with kernels to make the parallelization explicit, which is often useful in more complex loops.
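As a sketch of that finer control (illustrative, not taken from the article), the same function can pin down the mapping the compiler chose automatically above:
// Hello_World_OpenACC_loop.c -- illustrative variant with an explicit
// loop directive; gang ~ CUDA block, vector(32) ~ 32 CUDA threads.
#include <stdio.h>

void Print_Hello_World()
{
    #pragma acc kernels
    {
        #pragma acc loop gang vector(32)
        for(int i = 0; i < 5; i++)
        {
            printf("Hello World!\n");
        }
    }
}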
Hello World in Fortran¶
The following example illustrates the Fortran counterpart. Because Fortran’s standard I/O is not supported inside GPU kernels (see the note below), the loop accumulates a sum instead of printing, and the result is printed on the host afterwards.
Fortran Serial Version¶
In the initial version, the Print_Hello_World subroutine is written in a serial format, meaning each iteration of the loop executes sequentially on the CPU. Here’s the code for Hello_World.f90:
For Fortran:
!! Hello_World.f90
subroutine Print_Hello_World()
    integer :: i, temp = 0
    do i = 1, 5
        temp = temp + i
    end do
    print *, "Result: ", temp
end subroutine Print_Hello_World
Compilation
nvfortran -fast -acc -Minfo=accel -gpu=cc80 Hello_World.f90
Compilation output
print_hello_world:
3, Loop not vectorized/parallelized: contains call
The output shows that the loop within Print_Hello_World is not parallelized by default: the compiler recognizes it but has no directive to act upon.
Fortran Parallel Version with OpenACC¶
Now, let’s add OpenACC to parallelize the loop in Print_Hello_World. Here’s the modified code:
!! Hello_World_OpenACC.f90
subroutine Print_Hello_World()
    integer :: i, temp = 0
    !$acc kernels
    do i = 1, 5
        temp = temp + i
    end do
    !$acc end kernels
    print *, "Result: ", temp
end subroutine Print_Hello_World
The !$acc kernels directive tells the compiler to attempt parallelization of the enclosed loop.
Compilation
nvfortran -fast -acc -Minfo=accel -gpu=cc80 Hello_World_OpenACC.f90
Compilation Output
print_hello_world:
4, Loop is parallelizable
Generating NVIDIA GPU code
4, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
In the output, the compiler indicates that the loop has been parallelized for GPU execution. As in C/C++, gang and vector refer to the GPU execution configuration: a gang represents a group of threads, and vector(32) specifies that each gang runs 32 threads. Note that temp is accumulated across iterations; for a simple sum like this, the compiler typically detects an implicit reduction and combines the partial results safely.
Fortran's Limited I/O on GPU: Fortran doesn’t support print statements inside GPU kernels, as its standard I/O model is not GPU-compatible. This is why the example accumulates a result inside the loop and prints it on the host, after the kernels region ends.
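After compiling, running the program prints the sum computed on the GPU: with temp starting at zero, the loop adds 1 + 2 + 3 + 4 + 5. A sample run (the binary name and the exact spacing of Fortran’s list-directed output may vary):
./a.out
 Result:  15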
Key Concepts and Considerations¶
- Data Movement: In both examples, the kernels directive instructs the compiler to parallelize the loop. It lets OpenACC identify parallelizable code regions, offload them to the GPU, and generate the implicit data transfers between host and device; these transfers can also be spelled out explicitly, as in the sketch after this list.
- Execution Model: The parallel execution model in OpenACC is organized into gangs, workers, and vectors, similar to CUDA’s blocks, warps, and threads. The compiler chooses a configuration based on the loop’s structure and the target GPU architecture.
  - In our example, vector(32) specifies that each gang has 32 threads.
  - The compiler makes these choices automatically, but developers can also specify the values manually for more control (see the sketch after this list).
- Comparison to Serial Execution: In the initial serial versions of both the C/C++ and Fortran examples, each loop iteration executes sequentially on the CPU.
  - By adding OpenACC directives, the loop runs on the GPU instead, allowing multiple iterations to execute in parallel, which can lead to significant performance improvements, especially with larger workloads.
- Compiler Feedback: The compiler output provides insight into the parallelization process, such as whether the loop was parallelized and what GPU code was generated. These messages help identify bottlenecks or cases where additional optimization is needed.
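To make these points concrete, the sketch below (assumed for this article, not one of its examples) applies them to a slightly larger loop: explicit copyin/copy clauses replace the implicit host-device transfers the compiler would otherwise generate, and num_gangs/vector_length override its automatic execution configuration:
// saxpy_acc.c -- illustrative sketch, not part of the article's examples.
#include <stdio.h>

#define N 1024

int main()
{
    float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // copyin: x is copied host -> device before the region;
    // copy:   y is copied in both directions, since it is read and written.
    // num_gangs/vector_length fix the execution configuration explicitly.
    #pragma acc parallel loop num_gangs(32) vector_length(32) \
                copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; i++)
    {
        y[i] = 2.0f * x[i] + y[i];   // one iteration per GPU thread
    }

    printf("y[0] = %f\n", y[0]);     // expected: 4.000000
    return 0;
}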
Summary¶
By following these steps, you can successfully parallelize basic code examples using OpenACC, enabling computation on the GPU and improving performance. This simple “Hello World” example is a foundation for understanding OpenACC’s directives and how they interact with GPU architecture.