Vector Addition¶
Script
- "This article demonstrates how to implement a basic vector addition operation in both C/C++ and Fortran using OpenACC. We’ll explore both serial and parallel versions, using OpenACC directives to enable GPU-based computation and optimize data transfer."
- "In the serial C/C++ version, vector addition is performed sequentially on the CPU. Each element in arrays a and b is added, and the result is stored in array c. This serial execution is straightforward but not optimized for large-scale data processing."
- "To parallelize vector addition on the GPU, we add OpenACC directives. The kernels directive offloads the loop to the GPU, and the loop clause signals parallelization. Data transfer is managed with copyin for input arrays and copyout for the output array c, enabling efficient data handling between the CPU and GPU."
- "In Fortran, the serial version of vector addition is similar to C/C++. Here, the loop iterates sequentially over each element, storing the result in the output vector c. Like C/C++, this sequential approach is suitable for small data sizes but lacks the parallel efficiency required for larger computations."
- "The parallel version in Fortran uses the !$acc kernels directive, which offloads the loop computation to the GPU. copyin and copyout manage data transfer for input and output arrays, respectively. By parallelizing the loop, the computation can now benefit from the GPU’s processing power."
- "OpenACC provides explicit control over data movement with clauses like copyin and copyout, optimizing memory usage on the GPU. Loop parallelization follows a structured hierarchy of gang, worker, and vector, automatically determined by the compiler. In C/C++, using the restrict qualifier prevents pointer aliasing, enhancing efficiency."
- "In summary, parallelizing vector addition with OpenACC is a straightforward example of the model’s power and simplicity. Using OpenACC directives, we achieve efficient GPU-based computation and data handling, which improves performance and scalability for larger datasets."
This section explains how to implement a vector addition operation in both C/C++ and Fortran using OpenACC. Vector addition is a fundamental operation in parallel computing and serves as a simple, introductory example of parallelizing code with OpenACC. In this example, we’ll see both a serial version of the code and a parallelized version using OpenACC directives, along with explanations of key concepts like data transfer and memory allocation on the GPU.
Vector Addition in C/C++ (OpenACC)¶
The following code demonstrates a serial version of a vector addition function in C/C++.
Serial Version in C/C++¶
Below is the code for the serial version of vector addition, Vector_Addition.c. In this function, two input vectors a and b are added element-wise to produce the result vector c.
// Vector_Addition.c
float *Vector_Addition(float *a, float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
    {
        c[i] = a[i] + b[i];
    }
    return c;
}
In this serial version, the for loop iterates over each element of the arrays a, b, and c. Each iteration computes c[i] = a[i] + b[i], and the result is stored back in c.
Parallel Version with OpenACC¶
To parallelize this function, we add OpenACC directives, enabling the computation to run on a GPU. OpenACC provides directives like kernels or parallel, which instruct the compiler to parallelize specific parts of the code.
Below is the OpenACC-enabled version of the vector addition function, Vector_Addition_OpenACC.c.
// Vector_Addition_OpenACC.c
void Vector_Addition(float *a, float *b, float *restrict c, int n)
{
    #pragma acc kernels loop copyin(a[:n], b[:n]) copyout(c[:n])
    for (int i = 0; i < n; i++)
    {
        c[i] = a[i] + b[i];
    }
}
In this version:
- `#pragma acc kernels`: This directive tells the compiler to offload the following code block to the GPU, parallelizing it if possible.
- `loop` clause: Adding the `loop` clause after the `kernels` directive tells the compiler that it should attempt to parallelize the loop itself.
- Data clauses (`copyin`, `copyout`): `copyin(a[:n], b[:n])` transfers the input arrays `a` and `b` from the host (CPU) to the device (GPU); `copyout(c[:n])` transfers the result array `c` back from the device to the host once the computation is complete.
Note: the compiler must be able to rule out pointer aliasing before it can parallelize the loop, so we use the `restrict` qualifier on the `c` pointer. This tells the compiler that `c` does not overlap with any other pointer, allowing it to update the array safely in parallel.
Vector Addition in Fortran¶
Similarly, we can perform vector addition in Fortran. Below are the serial and parallelized versions of the code.
Serial Version in Fortran¶
The following code implements a serial version of vector addition in Fortran, Vector_Addition.f90.
!! Vector_Addition.f90
module Vector_Addition_Mod
  implicit none
contains
  subroutine Vector_Addition(a, b, c, n)
    !! Input vectors
    real(8), intent(in), dimension(:) :: a
    real(8), intent(in), dimension(:) :: b
    !! Output vector
    real(8), intent(out), dimension(:) :: c
    integer, intent(in) :: n
    integer :: i
    do i = 1, n
      c(i) = a(i) + b(i)
    end do
  end subroutine Vector_Addition
end module Vector_Addition_Mod
Fortran Array Indexing: In Fortran, array indexing starts at `1` by default, unlike C/C++, where indexing starts at `0`.
Parallel Version with OpenACC in Fortran¶
The following code demonstrates a parallel version of the vector addition function in Fortran, Vector_Addition_OpenACC.f90.
!! Vector_Addition_OpenACC.f90
module Vector_Addition_Mod
  implicit none
contains
  subroutine Vector_Addition(a, b, c, n)
    !! Input vectors
    real(8), intent(in), dimension(:) :: a
    real(8), intent(in), dimension(:) :: b
    !! Output vector
    real(8), intent(out), dimension(:) :: c
    integer, intent(in) :: n
    integer :: i
    !$acc kernels loop copyin(a(1:n), b(1:n)) copyout(c(1:n))
    do i = 1, n
      c(i) = a(i) + b(i)
    end do
    !$acc end kernels loop
  end subroutine Vector_Addition
end module Vector_Addition_Mod
In this parallelized version:
- `!$acc kernels`: The `kernels` directive tells the compiler to offload this code region to the GPU, where it will attempt to parallelize the loop.
- Data clauses: `copyin(a(1:n), b(1:n))` copies the host arrays `a` and `b` to the device before entering the loop; `copyout(c(1:n))` ensures that the computed values in `c` are copied back to the host after the loop finishes.
Safety of the `kernels` directive: The `kernels` directive is particularly safe because it lets the compiler analyze the code for dependencies. If the compiler detects a dependency it cannot resolve, it executes the affected code sequentially, ensuring correctness.
Key Concepts and Considerations¶
- Data Movement: OpenACC provides explicit data movement between the CPU and GPU through clauses like `copyin`, `copyout`, and `create`. `copyin` moves data to the GPU, `copyout` moves data back to the CPU, and `create` allocates GPU memory without transferring any data. This example uses `copyin` and `copyout` to handle the input and output data transfers.
- Loop Parallelization: The `loop` clause in OpenACC specifies that the loop should be parallelized across available GPU threads. The compiler determines the configuration of gangs, workers, and vectors, allowing the computation to take advantage of the GPU's parallel architecture.
- Avoiding Pointer Aliasing: The `restrict` keyword in C/C++ tells the compiler that different pointers do not alias the same memory, which is especially useful in functions that modify arrays in place. This allows the compiler to safely parallelize the loop without concerns about overlapping memory.
- Comparison with Serial Execution:
  - In serial execution, the `for` loop in C/C++ or `do` loop in Fortran iterates over each element sequentially on the CPU.
  - In parallel execution with OpenACC, the loop is offloaded to the GPU, where many threads compute different elements simultaneously, significantly reducing computation time for large arrays.
- Compiler Feedback: Compiling the code with `-Minfo=all` provides useful feedback, such as whether the loop was successfully parallelized and which GPU resources were assigned to it. This feedback helps in understanding the level of optimization and identifying potential performance bottlenecks.
By implementing OpenACC in this vector addition example, we can see how simple directives enable parallel computation on the GPU, effectively accelerating the performance of the code. This foundational example can be extended to more complex operations, showing the power and simplicity of OpenACC for parallel programming.
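For reference, assuming the NVIDIA HPC SDK is the toolchain in use (the exact compiler names and flags depend on your environment), the examples above could be built with OpenACC enabled and compiler feedback requested like this:

```shell
# C version: enable OpenACC and report how each loop was handled
nvc -acc -Minfo=all Vector_Addition_OpenACC.c -o vec_add

# Fortran version
nvfortran -acc -Minfo=all Vector_Addition_OpenACC.f90 -o vec_add_f
```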
