Vector Addition
Script
- "This article demonstrates how to implement a basic vector addition operation in both C/C++ and Fortran using OpenACC. We’ll explore both serial and parallel versions, using OpenACC directives to enable GPU-based computation and optimize data transfer."
- "In the serial C/C++ version, vector addition is performed sequentially on the CPU. Each element in arrays a and b is added, and the result is stored in array c. This serial execution is straightforward but not optimized for large-scale data processing."
- "To parallelize vector addition on the GPU, we add OpenACC directives. The kernels directive offloads the loop to the GPU, and combining it with loop asks the compiler to parallelize that loop. Data transfer is managed with copyin for the input arrays a and b and copyout for the output array c, so only the data that is needed moves between the CPU and GPU."
- "In Fortran, the serial version of vector addition is similar to C/C++. Here, the loop iterates sequentially over each element, storing the result in the output vector c. Like C/C++, this sequential approach is suitable for small data sizes but lacks the parallel efficiency required for larger computations."
- "The parallel version in Fortran uses the !$acc kernels directive, which offloads the loop computation to the GPU. copyin and copyout manage data transfer for input and output arrays, respectively. By parallelizing the loop, the computation can now benefit from the GPU’s processing power."
- "OpenACC provides explicit control over data movement with clauses like copyin and copyout, which avoids unnecessary transfers between host and GPU memory. Loop parallelization follows a hierarchy of gang, worker, and vector levels, which the compiler chooses automatically unless told otherwise. In C/C++, the restrict qualifier tells the compiler that a pointer is not aliased, so it can parallelize the loop safely."
- "In summary, parallelizing vector addition with OpenACC is a straightforward example of the model’s power and simplicity. Using OpenACC directives, we achieve efficient GPU-based computation and data handling, which improves performance and scalability for larger datasets."
This section explains how to implement a vector addition operation in both C/C++ and Fortran using OpenACC. Vector addition is a fundamental operation in parallel computing and serves as a simple, introductory example of parallelizing code with OpenACC. In this example, we’ll see both a serial version of the code and a parallelized version using OpenACC directives, along with explanations of key concepts like data transfer and memory allocation on the GPU.
Vector Addition in C/C++ (OpenACC)
The following code demonstrates a serial version of a vector addition function in C/C++.
Serial Version in C/C++
Below is the code for the serial version of vector addition, Vector_Addition.c. In this function, two input vectors a and b are added element-wise to produce the result vector c.
// Vector_Addition.c
float * Vector_Addition(float *a, float *b, float *c, int n)
{
    for(int i = 0; i < n; i++)
    {
        c[i] = a[i] + b[i];
    }
    return c;
}
In this serial version, the for loop iterates over each element of the arrays a, b, and c. Each iteration computes c[i] = a[i] + b[i], and the result is stored in c.
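To check the serial function, one could call it from a small test helper. The sketch below repeats the function from Vector_Addition.c and adds run_serial_demo, a hypothetical helper that is not part of the original source. It fills a and b with values whose sums are exact in single precision and counts the elements of c that differ from the expected result.

```c
#include <assert.h>
#include <stdlib.h>

/* Serial vector addition, as in Vector_Addition.c. */
float *Vector_Addition(float *a, float *b, float *c, int n)
{
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
    return c;
}

/* Hypothetical helper (not in the original source): fills a and b,
   calls Vector_Addition, and returns the number of elements of c that
   differ from the expected sum, or -1 if an allocation fails. */
int run_serial_demo(int n)
{
    assert(n > 0);

    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    float *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a);
        free(b);
        free(c);
        return -1;
    }

    /* Small integers are exactly representable as float. */
    for (int i = 0; i < n; i++) {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }

    Vector_Addition(a, b, c, n);

    int mismatches = 0;
    for (int i = 0; i < n; i++) {
        if (c[i] != 3.0f * (float)i) {
            mismatches++;
        }
    }

    free(a);
    free(b);
    free(c);
    return mismatches;
}
```

Calling run_serial_demo(1000) from main should return 0 when every element of c equals the expected sum.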
Parallel Version with OpenACC
To parallelize this function, we add OpenACC directives, enabling the computation to run on a GPU. OpenACC provides directives like kernels or parallel, which instruct the compiler to parallelize specific parts of the code.
Below is the OpenACC-enabled version of the vector addition function, Vector_Addition_OpenACC.c.
// Vector_Addition_OpenACC.c
void Vector_Addition(float *a, float *b, float *restrict c, int n)
{
    #pragma acc kernels loop copyin(a[:n], b[:n]) copyout(c[:n])
    for(int i = 0; i < n; i++)
    {
        c[i] = a[i] + b[i];
    }
}
In this version:
- #pragma acc kernels: This directive tells the compiler to offload the following code block to the GPU and to parallelize it where it can determine that doing so is safe.
- loop: Combining loop with kernels (kernels loop) applies the directive to the loop that immediately follows and tells the compiler to attempt to parallelize that loop.
- Data clauses (copyin, copyout): copyin(a[:n], b[:n]) transfers the input arrays a and b from the host (CPU) to the device (GPU) before the loop runs. copyout(c[:n]) allocates c on the device and transfers the result array back to the host once the computation is complete.
Note: To parallelize the loop, the compiler must know that writes to c cannot affect a or b. The restrict qualifier on the c pointer is the programmer's promise that c does not overlap with the other pointers, which lets the compiler treat the loop iterations as independent.
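The text mentions parallel as an alternative to kernels. A minimal sketch of that variant follows, assuming a hypothetical function name Vector_Addition_Parallel that does not appear in the original source. With parallel loop the programmer, rather than the compiler, asserts that the iterations are independent, so the restrict and const qualifiers here serve as documentation of that promise. The assert on n is an added sanity check, not part of the OpenACC technique.

```c
#include <assert.h>

/* Hypothetical variant of Vector_Addition using the combined
   "parallel loop" directive instead of "kernels loop". */
void Vector_Addition_Parallel(const float *restrict a,
                              const float *restrict b,
                              float *restrict c, int n)
{
    assert(n >= 0);

    /* Inputs are copied to the device, the result is copied back.
       Without an OpenACC-enabled compiler the pragma is ignored and
       the loop simply runs serially on the CPU. */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

With the NVIDIA HPC compilers this would typically be built with something like nvc -acc, and with GCC with -fopenacc. The exact options depend on the installed toolchain and target GPU, so treat these as assumptions to verify locally.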
Vector Addition in Fortran
Similarly, we can perform vector addition in Fortran. Below are the serial and parallelized versions of the code.
Serial Version in Fortran
The following code implements a serial version of vector addition in Fortran, Vector_Addition.f90.
!! Vector_Addition.f90
module Vector_Addition_Mod
  implicit none
contains
  subroutine Vector_Addition(a, b, c, n)
    !! Input vectors
    real(8), intent(in), dimension(:) :: a
    real(8), intent(in), dimension(:) :: b
    !! Output vector
    real(8), intent(out), dimension(:) :: c
    !! Number of elements to add
    integer, intent(in) :: n
    integer :: i
    do i = 1, n
      c(i) = a(i) + b(i)
    end do
  end subroutine Vector_Addition
end module Vector_Addition_Mod
Fortran Array Indexing: In Fortran, array indexing starts at 1 by default, unlike C/C++ where indexing starts at 0.
Parallel Version with OpenACC in Fortran
The following code demonstrates a parallel version of the vector addition function in Fortran, Vector_Addition_OpenACC.f90.
!! Vector_Addition_OpenACC.f90
module Vector_Addition_Mod
  implicit none
contains
  subroutine Vector_Addition(a, b, c, n)
    !! Input vectors
    real(8), intent(in), dimension(:) :: a
    real(8), intent(in), dimension(:) :: b
    !! Output vector
    real(8), intent(out), dimension(:) :: c
    !! Number of elements to add
    integer, intent(in) :: n
    integer :: i
    !$acc kernels loop copyin(a(1:n), b(1:n)) copyout(c(1:n))
    do i = 1, n
      c(i) = a(i) + b(i)
    end do
    !$acc end kernels loop
  end subroutine Vector_Addition
end module Vector_Addition_Mod
In this parallelized version:
- !$acc kernels: The kernels directive tells the compiler to offload this code region to the GPU, where it will attempt to parallelize the loop.
- Data clauses: copyin(a(1:n), b(1:n)) copies data from the host arrays a and b to the device before entering the loop. copyout(c(1:n)) ensures that the computed values in c are copied back to the host after the loop finishes.
Safety of the kernels Directive: The kernels directive is a conservative choice because it leaves the dependency analysis to the compiler. If the compiler detects a dependency between loop iterations, it does not parallelize that loop and executes it sequentially instead, which preserves correctness.
Key Concepts and Considerations
- Data Movement:
  - OpenACC allows explicit data movement between the CPU and GPU through clauses like copyin, copyout, and create. copyin moves data to the GPU, copyout moves data back to the CPU, and create allocates GPU memory without transferring any data. This example uses copyin and copyout to handle input and output data transfers.
- Loop Parallelization:
  - The loop directive in OpenACC specifies that the loop should be parallelized across the available GPU threads. Unless the programmer specifies otherwise, the compiler chooses how to map the iterations to gangs, workers, and vectors, allowing the computation to take advantage of the GPU's parallel architecture.
- Avoiding Pointer Aliasing:
  - The restrict keyword in C/C++ is a promise from the programmer that the memory accessed through a pointer is not accessed through any other pointer in the function. With this guarantee the compiler does not have to assume that writing to the output array could change the input arrays, so it can parallelize the loop without concerns about overlapping memory.
- Comparison with Serial Execution:
  - In serial execution, the for loop in C/C++ or the do loop in Fortran iterates over each element sequentially on the CPU.
  - In parallel execution with OpenACC, the loop is offloaded to the GPU, where many threads compute different elements simultaneously. For large arrays this can reduce computation time considerably, although the cost of copying the arrays to and from the GPU also counts toward the total run time.
- Compiler Feedback:
  - Compiling the code with -Minfo=all (an option of the NVIDIA HPC compilers) provides useful feedback, such as whether the loop was parallelized and how it was mapped to the GPU. This feedback helps in understanding what the compiler did and in identifying potential performance bottlenecks.
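The data clauses and loop levels described in the list above can be sketched together in one function. The example below is an illustration under stated assumptions, not code from the original source: Scaled_Vector_Addition, scale, and tmp are hypothetical names. It uses create for a scratch array that only needs to exist on the device, keeps all arrays resident across two loops with a data region, and requests gang and vector parallelism explicitly instead of leaving the mapping to the compiler.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical example: c = scale * (a + b), computed in two steps
   through a scratch array tmp. Returns 0 on success, -1 if the
   scratch allocation fails. */
int Scaled_Vector_Addition(const float *restrict a,
                           const float *restrict b,
                           float *restrict c, float scale, int n)
{
    assert(n > 0);

    float *restrict tmp = malloc(n * sizeof *tmp);
    if (!tmp) {
        return -1;
    }

    /* copyin:  a and b are only read on the device.
       create:  tmp is allocated on the device, never transferred.
       copyout: c is allocated on the device and copied back at the end. */
    #pragma acc data copyin(a[0:n], b[0:n]) create(tmp[0:n]) copyout(c[0:n])
    {
        /* present: the arrays are already on the device, so no
           transfer happens at the start of each loop. */
        #pragma acc parallel loop gang vector present(a[0:n], b[0:n], tmp[0:n])
        for (int i = 0; i < n; i++) {
            tmp[i] = a[i] + b[i];
        }

        #pragma acc parallel loop gang vector present(tmp[0:n], c[0:n])
        for (int i = 0; i < n; i++) {
            c[i] = scale * tmp[i];
        }
    }

    free(tmp);
    return 0;
}
```

Compiled without OpenACC support, the pragmas are ignored and tmp acts as an ordinary host scratch buffer, so the function produces the same result on the CPU.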
By adding OpenACC directives to this vector addition example, we can see how a few simple directives enable parallel computation on the GPU and accelerate the code. This foundational example can be extended to more complex operations, showing the power and simplicity of OpenACC for parallel programming.