Using Cache and Unified Memory in OpenACC
This article discusses two advanced memory management features in OpenACC: the cache directive and unified memory. These techniques help improve performance by optimizing data access patterns between host (CPU) and device (GPU) memory.
OpenACC cache Directive
The OpenACC cache directive is designed to store frequently accessed data in the GPU's shared memory, which has lower latency than global memory. By caching portions of an array, applications can reduce the time spent on memory accesses, especially in cases where data elements are reused multiple times by threads on the GPU.
Caching can be applied to an entire array or to specific sections of an array, depending on the available shared memory. If the array section is too large for shared memory, the GPU will revert to loading from global memory, which can slow down performance. It is therefore recommended to cache smaller, frequently accessed portions of an array to maximize the performance benefits.
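Whether the whole array or only a slice should be cached depends on its size. As a minimal sketch (the smooth function and its names are hypothetical, not taken from this article's examples), a small read-only coefficient array touched by every iteration can be cached in its entirety:

// Hypothetical sketch: caching an entire small coefficient array.
// The 3-element array w fits easily in shared memory, so the whole
// array can be cached rather than just a section of it.
void smooth(const float *restrict in, float *restrict out, int n)
{
    const float w[3] = {0.25f, 0.5f, 0.25f};
    #pragma acc parallel loop copyin(in[0:n], w[0:3]) copyout(out[0:n])
    for (int i = 1; i < n - 1; i++)
    {
        #pragma acc cache(w[0:3])   /* cache all of w; every iteration reads it */
        out[i] = w[0] * in[i - 1] + w[1] * in[i] + w[2] * in[i + 1];
    }
}

Here the three weights fit comfortably in shared memory, so there is no risk of the fallback to global memory described above.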
Figure 1: Illustration of cache concept
In Figure 1, an example demonstrates caching specific elements of an array in shared memory. Here, each thread needs A(i-1), A(i), and A(i+1). Caching a three-element section starting at i-1 (written a[i-1:3] in OpenACC's C/C++ start:length notation) lets each thread read these values from shared memory rather than global memory, reducing memory access time.
Example of cache in C/C++
Below is an example of using the cache directive in C/C++ for a vector addition function. The function caches a three-element section of the a array to improve access speed.
// Vector_Cache_OpenACC.c
void Vector_Addition(float *a, float *b, float *restrict c, int n)
{
    #pragma acc kernels loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n - 2; i++)
    {
        // Cache three elements of a, starting at index i, in shared memory
        #pragma acc cache(a[i:3])
        c[i] = a[i] + a[i+1] + a[i+2] + b[i];
    }
}
- cache(a[i:3]) tells the compiler to store three elements of array a, starting at index i, in shared memory.
- The cached elements are a[i], a[i+1], and a[i+2], which are reused across neighboring loop iterations.
Keeping the cached section this small ensures it fits within the GPU's limited shared memory. For more details, see the full example, Vector_Cache_OpenACC.c.
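Assuming the NVIDIA HPC SDK compilers used later in this article, the cache example can be compiled with a command along these lines (use nvfortran analogously for the Fortran version below):

nvc -fast -acc -Minfo=all Vector_Cache_OpenACC.c

The -Minfo=all option makes the compiler report how it parallelized the loop, including any cached references.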
Example of cache in Fortran
Here’s the Fortran version of the vector addition function with the cache directive applied to a portion of array a.
subroutine Vector_Addition(a, b, c, n)
  real(8), intent(in), dimension(:) :: a
  real(8), intent(in), dimension(:) :: b
  real(8), intent(out), dimension(:) :: c
  integer, intent(in) :: n
  integer :: i
  !$acc kernels loop copyin(a(1:n), b(1:n)) copyout(c(1:n))
  do i = 1, n-2
    ! Cache the three-element section a(i), a(i+1), a(i+2) in shared memory
    !$acc cache(a(i:i+2))
    c(i) = a(i) + a(i+1) + a(i+2) + b(i)
  end do
  !$acc end kernels
end subroutine Vector_Addition
cache(a(i:i+2)) caches the three-element section a(i), a(i+1), a(i+2) within each loop iteration. Note that Fortran array sections use lower:upper bounds, unlike the start:length form in C/C++, so a(i:3) would actually denote elements i through 3. The Fortran example is functionally similar to the C/C++ code above. See Vector_Cache_OpenACC.f90 for the complete code.

Unified Memory in OpenACC
Unified memory, also known as managed memory, provides a unified address space shared between the CPU and GPU. This allows data to be migrated automatically between host and device, simplifying memory management. In OpenACC, unified memory is enabled using the managed compiler flag. When unified memory is used, explicit data clauses (copyin, copyout, etc.) become unnecessary, as memory transfers are managed automatically by the runtime.
Unified memory is particularly useful for applications with complex memory access patterns or cases where data might need to be frequently accessed by both the host and device. This reduces the complexity of managing memory locations and transfers manually.
Figure 2: Overview of Unified Memory
Figure 2 illustrates how unified memory works, enabling seamless memory access across CPU and GPU. The managed memory model automatically migrates data as needed, maintaining consistency without requiring manual data directives in the code.
Unified Memory Example in C/C++
Below is an example of a vector addition function in C/C++ using unified memory. Because the managed flag enables unified memory, no data clauses are needed. For more details, see the full example, Vector_Addition_Managed_OpenACC.c.
// Vector_Addition_Managed_OpenACC.c
void Vector_Addition(float *a, float *b, float *c, int n)
{
    // No data clauses: unified (managed) memory handles host-device transfers
    #pragma acc kernels loop independent
    for (int i = 0; i < n; i++)
    {
        c[i] = a[i] + b[i];
    }
}
In this example:

- The independent clause informs the compiler that there are no dependencies between loop iterations, allowing them to execute in parallel.
- The restrict keyword can be omitted, because the independent clause already asserts that iterations do not interfere with one another, so the compiler no longer needs the aliasing hint.
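As a minimal sketch of how this function might be called (the main routine below is an assumption for illustration, not part of the original example), plain malloc allocations suffice; the managed memory model migrates data between host and device as each side touches it:

// Hypothetical driver: with managed memory enabled at compile time,
// ordinary heap allocations are shared between host and device.
#include <stdio.h>
#include <stdlib.h>

void Vector_Addition(float *a, float *b, float *c, int n);

int main(void)
{
    int n = 1 << 20;
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    float *c = malloc(n * sizeof *c);

    for (int i = 0; i < n; i++) {   // initialized on the host
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }

    Vector_Addition(a, b, c, n);    // runs on the GPU; no explicit copies

    printf("c[10] = %f (expected 30.0)\n", c[10]);  // read back on the host

    free(a);
    free(b);
    free(c);
    return 0;
}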
To compile this code with unified memory enabled, use the following command (with recent NVIDIA HPC SDK releases, -gpu=managed is the equivalent of -ta=tesla:managed):

nvc -fast -acc -ta=tesla:managed -Minfo=all Vector_Addition_Managed_OpenACC.c
Unified Memory Example in Fortran
The unified memory concept in Fortran is similar. Here’s the Fortran version of the vector addition function using unified memory; again, the copy clauses are omitted. For more details, see the full example, Vector_Addition_Managed_OpenACC.f90.
subroutine Vector_Addition(a, b, c, n)
  real(8), intent(in), dimension(:) :: a
  real(8), intent(in), dimension(:) :: b
  real(8), intent(out), dimension(:) :: c
  integer, intent(in) :: n
  integer :: i
  ! No data clauses: unified (managed) memory handles host-device transfers
  !$acc kernels loop
  do i = 1, n
    c(i) = a(i) + b(i)
  end do
  !$acc end kernels
end subroutine Vector_Addition
Omitting the copyin and copyout clauses is enough when using unified memory. To enable unified memory in Fortran, compile the code with the managed flag:

nvfortran -fast -acc -ta=tesla:managed -Minfo=all Vector_Addition_Managed_OpenACC.f90
Summary
Using the cache directive and unified memory can greatly simplify memory management and improve performance in GPU applications:
- cache directive: Stores frequently accessed data in the GPU’s shared memory to reduce latency, which is beneficial in high-reuse scenarios.
- Unified memory: Eliminates the need for explicit data management between CPU and GPU, allowing OpenACC to handle data migration automatically. This approach simplifies code while maintaining performance, especially for complex memory usage patterns.

Both methods offer powerful ways to optimize data access, and each has its own strengths depending on the specific application requirements.