Using Cache and Unified Memory in OpenMP Offloading

This section explores two advanced memory management features in OpenMP Offloading: the use_device_ptr clause (for cache-like reuse of device-resident data) and unified memory. Both features improve performance by reducing redundant data transfers between the host (CPU) and the device (GPU).


OpenMP use_device_ptr for Shared Memory

While OpenMP doesn’t have a direct cache directive like OpenACC, you can achieve similar results with the use_device_ptr clause of the target data construct, which exposes the device address of a mapped pointer. This allows data to be accessed directly in device memory, reducing latency for frequently accessed data.

use_device_ptr can also be combined with explicit device memory allocations to keep portions of arrays resident close to the GPU cores, making data reuse in iterative computations more efficient, as sketched below.
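
As an illustration of this pattern, here is a minimal sketch (a hypothetical example; omp_target_alloc, omp_target_memcpy, and omp_target_free are the standard OpenMP device memory routines) that keeps a buffer resident in device memory and reuses it in a kernel via is_device_ptr:

#include <omp.h>
#include <stdio.h>

int main(void)
{
  enum { N = 1024 };
  float host_a[N];
  for (int i = 0; i < N; i++) host_a[i] = (float)i;

  int dev  = omp_get_default_device();
  int host = omp_get_initial_device();

  /* Allocate a buffer that stays in device memory across kernels. */
  float *dev_a = (float *)omp_target_alloc(N * sizeof(float), dev);

  /* Copy the host data into the device-resident buffer once. */
  omp_target_memcpy(dev_a, host_a, N * sizeof(float), 0, 0, dev, host);

  float sum = 0.0f;
  /* is_device_ptr(dev_a): the pointer is already a device address, so no mapping. */
  #pragma omp target teams distribute parallel for reduction(+:sum) is_device_ptr(dev_a)
  for (int i = 0; i < N; i++)
    sum += dev_a[i];

  printf("sum = %f\n", sum);
  omp_target_free(dev_a, dev);
  return 0;
}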

Example of use_device_ptr in C/C++

Below is an example of using use_device_ptr in C/C++ for a vector addition function. The use_device_ptr clause on the target data construct exposes the device address of array a, and is_device_ptr on the target construct tells the compiler that a is already a device pointer, improving memory access speed for the repeated reads of a.

#include <omp.h>

void Vector_Addition(float *a, float *b, float *c, int n)
{
  /* use_device_ptr(a) makes a refer to its device address inside this region. */
  #pragma omp target data map(to: a[0:n], b[0:n]) map(from: c[0:n]) use_device_ptr(a)
  {
    /* is_device_ptr(a) tells the compiler a is already a device pointer. */
    #pragma omp target teams distribute parallel for is_device_ptr(a)
    for(int i = 0; i < n-2; i++) {
      c[i] = a[i] + a[i+1] + a[i+2] + b[i];
    }
  }
}

In this example:

  • use_device_ptr(a) on the target data construct makes a refer to its device address within the region, and is_device_ptr(a) on the target construct tells the compiler that a needs no further mapping or translation.
  • Each iteration reuses neighboring elements of a (a[i], a[i+1], a[i+2]), so the GPU handles the data stored in a without additional copying, reducing memory latency for the repeated accesses.
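
For reference, this example can be compiled with the NVIDIA compiler using the same flags as the unified-memory examples below, minus the managed option, since all data is mapped explicitly (the file name here is illustrative):

nvc -mp=gpu -gpu=cc80 -Minfo=accel,mp Vector_Addition_use_device_ptr.c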

Unified Memory in OpenMP Offloading

Unified memory, or managed memory, allows the CPU and GPU to share a single address space, with the runtime handling data migration automatically. In OpenMP, unified memory can be enabled through compiler flags, simplifying memory management and eliminating the need for explicit data mappings such as map(tofrom: ...).

Unified memory is particularly advantageous for applications where data is frequently accessed by both the CPU and GPU, as it automatically manages data location and movement, reducing the complexity of manual memory management.
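
As a minimal sketch of this access pattern, assuming the code is compiled with -gpu=managed so that ordinary heap allocations become managed memory, the same array can be touched alternately by the CPU and the GPU with no map clauses at all:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const int n = 1 << 20;
  /* With -gpu=managed, this ordinary heap allocation is managed memory. */
  float *x = (float *)malloc(n * sizeof(float));

  for (int i = 0; i < n; i++)            /* CPU writes first */
    x[i] = 1.0f;

  #pragma omp target teams distribute parallel for
  for (int i = 0; i < n; i++)            /* migrated to the GPU on demand */
    x[i] *= 2.0f;

  x[0] += 1.0f;                          /* CPU touches the data again */

  #pragma omp target teams distribute parallel for
  for (int i = 0; i < n; i++)            /* migrated back to the GPU */
    x[i] += 1.0f;

  printf("x[0] = %f\n", x[0]);           /* prints 4.000000; no map clauses anywhere */
  free(x);
  return 0;
}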


Example of Unified Memory in C/C++

In the following example, unified memory is enabled, allowing the CPU and GPU to access data seamlessly without manual mapping directives. For more details, see the full example on Unified_Memory_OpenMP.c.

#include <omp.h>

/* With unified (managed) memory, a, b, and c need no explicit map clauses. */
void Vector_Addition(float *a, float *b, float *c, int n)
{
  #pragma omp target teams distribute parallel for
  for(int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
  }
}

  • The arrays a, b, and c are managed in unified memory: no manual mapping is necessary, since the runtime migrates the data between the CPU and GPU as needed (see the driver sketch below).
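
A possible driver for this routine (a sketch, assuming compilation with -gpu=managed so the malloc'd arrays are migrated automatically):

#include <stdio.h>
#include <stdlib.h>

void Vector_Addition(float *a, float *b, float *c, int n);

int main(void)
{
  const int n = 1000000;
  float *a = (float *)malloc(n * sizeof(float));
  float *b = (float *)malloc(n * sizeof(float));
  float *c = (float *)malloc(n * sizeof(float));

  for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

  Vector_Addition(a, b, c, n);   /* no explicit host-device transfers */

  printf("c[0] = %f\n", c[0]);   /* expected: 3.000000 */
  free(a); free(b); free(c);
  return 0;
}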

To enable unified memory with OpenMP, use compiler flags like -gpu=managed (for NVIDIA compilers) or equivalent flags for your specific setup.

Compiling for Unified Memory
nvc -mp=gpu -gpu=cc80,managed -Minfo=accel,mp Vector_Addition_OpenMP.c

Unified Memory Example in Fortran

In Fortran, unified memory is managed similarly. Below is an example using OpenMP with unified memory for vector addition in Fortran. For full details, see the example Unified_Memory_OpenMP.f90.

subroutine Vector_Addition(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in), dimension(:) :: a
  real(8), intent(in), dimension(:) :: b
  real(8), intent(out), dimension(:) :: c
  integer :: i

  !$omp target teams distribute parallel do
  do i = 1, n
     c(i) = a(i) + b(i)
  end do

end subroutine Vector_Addition

In this example:

  • No target enter data or target exit data directives are needed; the runtime handles memory coherence between the CPU and GPU. Unified memory automatically ensures that a, b, and c are available on both the CPU and the GPU.

As with the C example, enable unified memory with compiler flags such as -gpu=managed (for NVIDIA compilers) or the equivalent flag for your toolchain.

Compiling for Unified Memory
nvfortran -mp=gpu -gpu=cc80,managed -Minfo=accel,mp Vector_Addition_OpenMP.f90

Summary

Using use_device_ptr and unified memory in OpenMP Offloading can simplify memory management and improve performance in GPU applications:

  • use_device_ptr: Exposes the device address of mapped data so it can be accessed directly in device memory, similar to caching, optimizing data reuse in parallel computations.
  • Unified Memory: Enables automatic memory management across CPU and GPU, eliminating manual data transfer and simplifying the code for complex memory patterns.

Both methods are essential tools in OpenMP Offloading, providing efficient ways to manage memory access and optimize performance on GPU-accelerated systems.