Using Cache and Unified Memory in OpenACC

This article discusses two advanced memory management features in OpenACC: the cache directive and unified memory. These techniques help improve performance by optimizing data access patterns between the host (CPU) and device (GPU) memory.


OpenACC cache Directive

The OpenACC cache directive stores frequently accessed data in the GPU's shared memory, which has much lower latency than global memory. By caching portions of an array, applications can reduce the time spent on memory accesses, especially when the same data elements are reused by multiple GPU threads.

Caching can be applied to an entire array or to specific sections of an array, depending on the available shared memory. If the array section is too large for shared memory, the GPU will revert to loading from global memory, which can slow down performance. It is therefore recommended to cache smaller, frequently accessed portions of an array to maximize the performance benefits.

Figure 1: Illustration of the cache concept

Figure 1 illustrates caching a section of an array in shared memory. In the depicted stencil, each thread needs A(i-1), A(i), and A(i+1); caching the three-element section that covers these values lets each thread read them from fast shared memory rather than global memory, reducing memory access time.

Example of cache in C/C++

Below is an example of using the cache directive in C/C++ for a vector addition function. The function caches a 3-element section of the a array to improve access speed.

// Vector_Cache_OpenACC.c
void Vector_Addition(float *a, float *b, float *restrict c, int n) 
{
#pragma acc kernels loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
  for(int i = 0; i < n-2; i++)  // last iteration (i = n-3) reads a[n-1]
    {
#pragma acc cache(a[i:3])
      c[i] = a[i] + a[i+1] + a[i+2] +  b[i];
    }
}

In this example:

  • cache(a[i:3]) tells the compiler to store three elements of array a, starting from index i, in shared memory.
  • The cached elements are a[i], a[i+1], and a[i+2], which are reused in each loop iteration.

Because only a small, frequently reused section is cached, this approach works well even on GPUs with limited shared memory. For more details, see the full example, Vector_Cache_OpenACC.c.
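
For context, a minimal host-side driver for this function might look as follows. This is an illustrative sketch, not part of Vector_Cache_OpenACC.c; the array size and initialization values are assumptions:

// Hypothetical driver for the cache example (illustrative only)
#include <stdio.h>
#include <stdlib.h>

void Vector_Addition(float *a, float *b, float *restrict c, int n);

int main(void)
{
  int n = 1000000;
  float *a = malloc(n * sizeof(float));
  float *b = malloc(n * sizeof(float));
  float *c = malloc(n * sizeof(float));

  for (int i = 0; i < n; i++)   // initialize inputs on the host
    {
      a[i] = 1.0f;
      b[i] = 2.0f;
    }

  Vector_Addition(a, b, c, n);  // the kernels region inside runs on the GPU

  printf("c[0] = %f\n", c[0]);  // expect 1 + 1 + 1 + 2 = 5
  free(a); free(b); free(c);
  return 0;
}

Compiling with, for example, nvc -fast -acc -Minfo=all prints the compiler's feedback on how the loop was scheduled, including any messages about the cached array section.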

Example of cache in Fortran

Here’s the Fortran version of the vector addition function with the cache directive applied to a portion of array a.

subroutine Vector_Addition(a, b, c, n)
  real(8), intent(in), dimension(:) :: a
  real(8), intent(in), dimension(:) :: b
  real(8), intent(out), dimension(:) :: c
  integer, intent(in) :: n
  integer :: i
  !$acc kernels loop copyin(a(1:n), b(1:n)) copyout(c(1:n)) 
  do i = 1, n-2
     !$acc cache(a(i:i+2))
     c(i) = a(i) + a(i+1) + a(i+2) + b(i)
  end do
  !$acc end kernels
end subroutine Vector_Addition

In this example, cache(a(i:i+2)) caches the three elements a(i), a(i+1), and a(i+2) within each loop iteration. Note that Fortran array sections use lower:upper bounds, whereas the C/C++ form uses start:length. The Fortran example is functionally equivalent to the C/C++ code above. See Vector_Cache_OpenACC.f90 for the complete code.
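
Assuming the NVIDIA HPC compilers used elsewhere in this article, the Fortran version can be built with, for example:

nvfortran -fast -acc -Minfo=all Vector_Cache_OpenACC.f90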


Unified Memory in OpenACC

Unified memory, also known as managed memory, provides a single address space shared between the CPU and GPU. Data is migrated automatically between host and device as needed, which simplifies memory management. In OpenACC with the NVIDIA HPC compilers, unified memory is enabled with the managed compiler option, and explicit data clauses (copyin, copyout, etc.) become unnecessary because memory transfers are handled automatically by the runtime.

Unified memory is particularly useful for applications with complex memory access patterns or cases where data might need to be frequently accessed by both the host and device. This reduces the complexity of managing memory locations and transfers manually.
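
As a sketch of this pattern, the following hypothetical C fragment alternates between host and device access to the same array. Compiled with the managed option (see the compile commands below), it needs no data directives in either phase; the allocation, size, and values are illustrative assumptions:

// Hypothetical sketch: alternating host/device access with managed memory
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  int n = 1000;
  float *x = malloc(n * sizeof(float)); // becomes a managed allocation under the managed option

  for (int i = 0; i < n; i++)           // phase 1: host writes
    x[i] = (float)i;

#pragma acc kernels loop                // phase 2: device updates; no copyin/copyout
  for (int i = 0; i < n; i++)
    x[i] = 2.0f * x[i];

  printf("x[10] = %f\n", x[10]);        // phase 3: host reads the result directly
  free(x);
  return 0;
}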

Figure 2: Overview of unified memory

Figure 2 illustrates how unified memory works, enabling seamless memory access across CPU and GPU. The managed memory model automatically migrates data as needed, maintaining consistency without requiring manual data directives in the code.

Unified Memory Example in C/C++

Below is an example of a vector addition function in C/C++ using unified memory. The managed flag enables unified memory, so we don’t need to specify data clauses. For more details, see the full example, Vector_Addition_Managed_OpenACC.c.

// Vector_Addition_Managed_OpenACC.c
void Vector_Addition(float *a, float *b, float *c, int n) 
{
#pragma acc kernels loop independent 
  for(int i = 0; i < n; i++)
    {
      c[i] = a[i] + b[i];
    }
}

In this example:

  • The independent clause informs the compiler that there are no dependencies between loop iterations, so they can safely be executed in parallel.
  • The restrict keyword can be omitted here because the independent clause already asserts that iterations do not interfere with one another; managed memory automates data movement but does not change C's aliasing rules.

To compile this code with unified memory enabled, use the following command (recent NVIDIA HPC SDK releases also accept -gpu=managed in place of the -ta suboption):

nvc -fast -acc -ta=tesla:managed -Minfo=all Vector_Addition_Managed_OpenACC.c
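
Called from a host driver, the function then needs no data directives anywhere. The following driver is a hypothetical sketch, not part of the original example:

// Hypothetical driver for the managed example (illustrative only)
#include <stdio.h>
#include <stdlib.h>

void Vector_Addition(float *a, float *b, float *c, int n);

int main(void)
{
  int n = 1000000;
  float *a = malloc(n * sizeof(float)); // plain mallocs: the managed option
  float *b = malloc(n * sizeof(float)); // maps them to CUDA managed memory
  float *c = malloc(n * sizeof(float));

  for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

  Vector_Addition(a, b, c, n);          // no copyin/copyout anywhere

  printf("c[0] = %f\n", c[0]);          // expect 3.0; the data migrates back automatically
  free(a); free(b); free(c);
  return 0;
}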

Unified Memory Example in Fortran

The unified memory concept in Fortran is similar. Here’s the Fortran version of the vector addition function using unified memory. In this case, we also omit the copy clauses. For more details, see the full example, Vector_Addition_Managed_OpenACC.f90.

subroutine Vector_Addition(a, b, c, n)                                   
  real(8), intent(in), dimension(:) :: a                                 
  real(8), intent(in), dimension(:) :: b                                 
  real(8), intent(out), dimension(:) :: c                                
  integer :: i, n                                                        
  !$acc kernels loop                                                     
  do i = 1, n                                                            
     c(i) = a(i) + b(i)                                                  
  end do                                                                 
  !$acc end kernels                                                      
end subroutine Vector_Addition 

In Fortran, simply omitting the copyin and copyout clauses is enough when using unified memory. To enable it, compile the code with the managed option:

nvfortran -fast -acc -ta=tesla:managed -Minfo=all Vector_Addition_Managed_OpenACC.f90


Summary

Using the cache directive and unified memory can greatly simplify memory management and improve performance in GPU applications:

  • cache directive: Stores frequently accessed data in the GPU’s shared memory to reduce latency, which is beneficial in high-reuse scenarios.
  • Unified memory: Eliminates the need for explicit data management between CPU and GPU, allowing OpenACC to handle data migration automatically. This approach simplifies code while maintaining performance, especially for complex memory usage patterns.

Both methods offer powerful ways to optimize data access; which one is more effective depends on the specific application's requirements.