Vector Addition¶
Script
- "This article demonstrates how to implement a basic vector addition operation in both C/C++ and Fortran using OpenACC. We’ll explore both serial and parallel versions, using OpenACC directives to enable GPU-based computation and optimize data transfer."
- "In the serial C/C++ version, vector addition is performed sequentially on the CPU. Each element in arrays a and b is added, and the result is stored in array c. This serial execution is straightforward but not optimized for large-scale data processing."
- "To parallelize vector addition on the GPU, we add OpenACC directives. The kernels directive offloads the loop to the GPU, and the loop clause signals parallelization. Data transfer is managed with copyin for input arrays and copyout for the output array c, enabling efficient data handling between the CPU and GPU."
- "In Fortran, the serial version of vector addition is similar to C/C++. Here, the loop iterates sequentially over each element, storing the result in the output vector c. Like C/C++, this sequential approach is suitable for small data sizes but lacks the parallel efficiency required for larger computations."
- "The parallel version in Fortran uses the !$acc kernels directive, which offloads the loop computation to the GPU. copyin and copyout manage data transfer for input and output arrays, respectively. By parallelizing the loop, the computation can now benefit from the GPU’s processing power."
- "OpenACC provides explicit control over data movement with clauses like copyin and copyout, optimizing memory usage on the GPU. Loop parallelization follows a structured hierarchy of gang, worker, and vector, automatically determined by the compiler. In C/C++, using the restrict qualifier prevents pointer aliasing, enhancing efficiency."
- "In summary, parallelizing vector addition with OpenACC is a straightforward example of the model’s power and simplicity. Using OpenACC directives, we achieve efficient GPU-based computation and data handling, which improves performance and scalability for larger datasets."
This section explains how to implement a vector addition operation in both C/C++ and Fortran using OpenACC. Vector addition is a fundamental operation in parallel computing and serves as a simple, introductory example of parallelizing code with OpenACC. In this example, we’ll see both a serial version of the code and a parallelized version using OpenACC directives, along with explanations of key concepts like data transfer and memory allocation on the GPU.
Vector Addition in C/C++ (OpenACC)¶
The following code demonstrates a serial version of a vector addition function in C/C++.
Serial Version in C/C++¶
Below is the code for the serial version of vector addition, Vector_Addition.c. In this function, two input vectors a and b are added element-wise to produce the result vector c.
// Vector_Addition.c
float *Vector_Addition(float *a, float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
    {
        c[i] = a[i] + b[i];
    }
    return c;
}
In this serial version, the for loop iterates over each element of the arrays a, b, and c. Each iteration computes c[i] = a[i] + b[i], and the result is stored back in c.
Parallel Version with OpenACC¶
To parallelize this function, we add OpenACC directives, enabling the computation to run on a GPU. OpenACC provides directives like kernels or parallel, which instruct the compiler to parallelize specific parts of the code.
Below is the OpenACC-enabled version of the vector addition function, Vector_Addition_OpenACC.c.
// Vector_Addition_OpenACC.c
void Vector_Addition(float *a, float *b, float *restrict c, int n)
{
    #pragma acc kernels loop copyin(a[:n], b[:n]) copyout(c[:n])
    for (int i = 0; i < n; i++)
    {
        c[i] = a[i] + b[i];
    }
}
In this version:
- `#pragma acc kernels`: This directive tells the compiler to offload the following code block to the GPU, parallelizing it if possible.
- `loop` clause: Adding the `loop` clause after the `kernels` directive tells the compiler that it should attempt to parallelize the loop itself.
- Data clauses (`copyin`, `copyout`): `copyin(a[:n], b[:n])` transfers the input arrays `a` and `b` from the host (CPU) to the device (GPU); `copyout(c[:n])` transfers the result array `c` back from the device to the host once the computation is complete.
Note: the compiler must be able to rule out pointer aliasing before it can parallelize the loop, so we use the `restrict` qualifier on the `c` pointer. This tells the compiler that `c` does not overlap with any other pointer, allowing it to update the array safely in parallel.
Vector Addition in Fortran¶
Similarly, we can perform vector addition in Fortran. Below are the serial and parallelized versions of the code.
Serial Version in Fortran¶
The following code implements a serial version of vector addition in Fortran, Vector_Addition.f90.
!! Vector_Addition.f90
module Vector_Addition_Mod
  implicit none
contains
  subroutine Vector_Addition(a, b, c, n)
    !! Input vectors
    real(8), intent(in), dimension(:) :: a
    real(8), intent(in), dimension(:) :: b
    !! Output vector
    real(8), intent(out), dimension(:) :: c
    integer, intent(in) :: n
    integer :: i
    do i = 1, n
      c(i) = a(i) + b(i)
    end do
  end subroutine Vector_Addition
end module Vector_Addition_Mod
Fortran Array Indexing: In Fortran, array indexing starts at `1` by default, unlike C/C++, where indexing starts at `0`.
Parallel Version with OpenACC in Fortran¶
The following code demonstrates a parallel version of the vector addition function in Fortran, Vector_Addition_OpenACC.f90.
!! Vector_Addition_OpenACC.f90
module Vector_Addition_Mod
  implicit none
contains
  subroutine Vector_Addition(a, b, c, n)
    !! Input vectors
    real(8), intent(in), dimension(:) :: a
    real(8), intent(in), dimension(:) :: b
    !! Output vector
    real(8), intent(out), dimension(:) :: c
    integer, intent(in) :: n
    integer :: i
    !$acc kernels loop copyin(a(1:n), b(1:n)) copyout(c(1:n))
    do i = 1, n
      c(i) = a(i) + b(i)
    end do
    !$acc end kernels loop
  end subroutine Vector_Addition
end module Vector_Addition_Mod
In this parallelized version:
- `!$acc kernels`: The `kernels` directive tells the compiler to offload this code region to the GPU, where it will attempt to parallelize the loop.
- Data clauses: `copyin(a(1:n), b(1:n))` copies the host arrays `a` and `b` to the device before entering the loop; `copyout(c(1:n))` ensures that the computed values in `c` are copied back to the host after the loop finishes.
Safety of the `kernels` directive: The `kernels` directive is particularly safe because it lets the compiler analyze the code for dependencies. If the compiler detects a dependency it cannot resolve, it executes the affected code sequentially, ensuring correctness.
Key Concepts and Considerations¶
- Data Movement: OpenACC provides explicit data movement between the CPU and GPU through clauses like `copyin`, `copyout`, and `create`. `copyin` moves data to the GPU, `copyout` moves data back to the CPU, and `create` allocates GPU memory without transferring any data. This example uses `copyin` and `copyout` to handle the input and output data transfers.
- Loop Parallelization: The `loop` clause in OpenACC specifies that the loop should be parallelized across available GPU threads. The compiler determines the configuration of gangs, workers, and vectors, allowing the computation to take advantage of the GPU's parallel architecture.
- Avoiding Pointer Aliasing: The `restrict` keyword in C/C++ tells the compiler that different pointers do not alias the same memory, which is especially useful in functions that modify arrays in place. This allows the compiler to safely parallelize the loop without concerns about overlapping memory.
- Comparison with Serial Execution:
  - In serial execution, the `for` loop in C/C++ or `do` loop in Fortran iterates over each element sequentially on the CPU.
  - In parallel execution with OpenACC, the loop is offloaded to the GPU, where many threads compute different elements simultaneously, significantly reducing computation time for large arrays.
- Compiler Feedback: Compiling the code with `-Minfo=all` provides useful feedback, such as whether the loop was successfully parallelized and which GPU resources were assigned to it. This feedback helps in understanding the level of optimization and identifying potential performance bottlenecks.
By implementing OpenACC in this vector addition example, we can see how simple directives enable parallel computation on the GPU, effectively accelerating the performance of the code. This foundational example can be extended to more complex operations, showing the power and simplicity of OpenACC for parallel programming.
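For reference, assuming the NVIDIA HPC SDK is the toolchain in use (the exact compiler names and flags depend on your environment), the examples above could be built with OpenACC enabled and compiler feedback requested like this:

```shell
# C version: enable OpenACC and report how each loop was handled
nvc -acc -Minfo=all Vector_Addition_OpenACC.c -o vec_add

# Fortran version
nvfortran -acc -Minfo=all Vector_Addition_OpenACC.f90 -o vec_add_f
```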
