Using Cache and Unified Memory in OpenMP Offloading

Script
  • In this section, we’ll explore two key memory management techniques: the use_device_ptr clause, which provides direct access to device memory, and unified memory, which simplifies data sharing between the CPU and GPU. These tools are invaluable for optimizing performance and streamlining memory management in heterogeneous computing systems.
  • This clause allows us to access device memory directly, bypassing unnecessary data transfers. This is especially useful for iterative computations where data reuse is critical. By reducing data transfer overhead, we can significantly improve performance in GPU-accelerated systems.
  • In this snippet, we use use_device_ptr to enable direct GPU memory access. By doing so, we avoid redundant data transfers, improving computational efficiency. This technique is particularly effective for applications requiring repeated data access in device memory.
  • Unified memory creates a shared memory space between the CPU and GPU, automatically managing data movement. This approach is ideal for reducing developer overhead, especially in complex systems where both host and device frequently access the same data.
  • Here’s an example of unified memory in C++. With unified memory, data such as arrays a, b, and c is automatically managed. There’s no need for explicit mapping or memory regions, simplifying the development process while ensuring efficient memory handling.
  • Now, let’s see how this works in Fortran. In this example, unified memory ensures that data transfer between the CPU and GPU is handled seamlessly. The developer doesn’t need to add directives for data mapping, making the code cleaner and easier to maintain.
  • use_device_ptr provides precise control over device memory, making it ideal for fine-grained optimizations. In contrast, unified memory simplifies development by automating data transfers between the host and device. While both approaches have distinct advantages, the choice depends on your application’s specific needs.
  • For performance optimization, use use_device_ptr to directly access device memory. For simplified coding, unified memory eliminates the need for manual data transfers. Both methods are versatile and can be applied in C/C++ or Fortran, making them essential tools in GPU-accelerated computing.

This section explores two advanced memory management features in OpenMP Offloading: the use_device_ptr clause (for cache-like behavior) and unified memory. These features help improve performance by optimizing memory access between the host (CPU) and device (GPU), enhancing data transfer efficiency and accessibility.


OpenMP use_device_ptr for Device Memory Access

While OpenMP doesn’t have a direct cache directive like OpenACC, you can achieve similar results using the use_device_ptr clause with mapped or device-allocated data to control how data is accessed on the device. This technique allows parts of arrays to be accessed directly in device memory, reducing latency for frequently accessed data.

use_device_ptr can be combined with device memory allocations to keep portions of arrays resident in device memory, close to the GPU cores, making data reuse in iterative computations more efficient.

Example of use_device_ptr in C/C++

Below is an example of using use_device_ptr in C/C++ for a vector addition with a small stencil over array a. The target data region maps the arrays to the device, the use_device_ptr(a) clause makes a refer to its device address inside that region, and is_device_ptr(a) passes that device pointer into the kernel, so the repeated accesses to a are served directly from device memory.

#include <omp.h>

void Vector_Addition(float *a, float *b, float *c, int n)
{
  #pragma omp target data map(to: a[0:n], b[0:n]) map(from: c[0:n])
  {
    /* Within this region, a refers to the device copy of the array. */
    #pragma omp target data use_device_ptr(a)
    {
      #pragma omp target teams distribute parallel for is_device_ptr(a)
      for (int i = 0; i < n - 2; i++) {
        c[i] = a[i] + a[i+1] + a[i+2] + b[i];
      }
    }
  }
}

In this example:

  • use_device_ptr(a) on the inner target data construct makes the pointer a hold the device address of the mapped array for the enclosed code.
  • is_device_ptr(a) on the target construct tells the compiler that a is already a device pointer, so the GPU works on the data in a without additional copying, reducing memory latency for the repeated accesses. A more typical use of use_device_ptr, handing the device address to a runtime routine, is sketched below.
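
The following is a minimal, hedged sketch of that more common pattern: use_device_ptr exposes the device address of a mapped array to host code so it can be passed to an OpenMP runtime routine such as omp_target_memcpy (or to a vendor library that expects device pointers). The function name Device_Copy and the scratch buffer are illustrative only, not part of the original example.

#include <omp.h>

/* Illustrative sketch: copy a mapped array into a device-only scratch
   buffer using the raw device pointers obtained via use_device_ptr. */
void Device_Copy(float *a, float *scratch, int n)
{
  int dev = omp_get_default_device();

  #pragma omp target data map(to: a[0:n]) map(alloc: scratch[0:n])
  {
    #pragma omp target data use_device_ptr(a, scratch)
    {
      /* a and scratch now hold device addresses; this copy runs
         device-to-device, with no traffic through host memory. */
      omp_target_memcpy(scratch, a, n * sizeof(float),
                        0, 0, dev, dev);
    }
  }
}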

Unified Memory in OpenMP Offloading

Unified memory, or managed memory, allows the CPU and GPU to share a unified address space, automatically handling data migration. In OpenMP, unified memory can be enabled through compiler flags, simplifying memory management and eliminating the need for explicit data mappings such as map(tofrom: ...).

Unified memory is particularly advantageous for applications where data is frequently accessed by both the CPU and GPU, as it automatically manages data location and movement, reducing the complexity of manual memory management.


Example of Unified Memory in C/C++

In the following example, unified memory is enabled, allowing the CPU and GPU to access data seamlessly without manual mapping directives. For more details, see the full example on Unified_Memory_OpenMP.c.

#include <omp.h>
void Vector_Addition(float *a, float *b, float *c, int n) 
{
  #pragma omp target teams distribute parallel for
  for(int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
  }
}

  • The arrays a, b, and c are expected to reside in unified (managed) memory, so no explicit map clauses are needed; both the CPU and GPU can access the data as required.

To enable unified memory with OpenMP, use compiler flags like -gpu=managed (for NVIDIA compilers) or equivalent flags for your specific setup.
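
As a hedged illustration of what "no manual mapping" means in practice, the driver below assumes compilation with nvc -mp=gpu -gpu=managed, under which plain malloc allocations are placed in managed memory; the target region in Vector_Addition can then read and write the arrays without any map clauses. The driver itself (main, the array size, the initial values) is illustrative and not part of the original example.

#include <stdio.h>
#include <stdlib.h>

void Vector_Addition(float *a, float *b, float *c, int n);  /* defined above */

int main(void)
{
  int n = 1 << 20;
  /* With -gpu=managed, these heap allocations live in managed (unified)
     memory and migrate automatically between CPU and GPU. */
  float *a = (float *) malloc(n * sizeof(float));
  float *b = (float *) malloc(n * sizeof(float));
  float *c = (float *) malloc(n * sizeof(float));

  for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

  Vector_Addition(a, b, c, n);   /* offloaded; no map clauses needed */

  printf("c[0] = %f\n", c[0]);   /* expected: 3.000000 */

  free(a); free(b); free(c);
  return 0;
}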

Compiling for Unified Memory
nvc -mp=gpu -gpu=cc80,managed -Minfo=accel,mp Vector_Addition_OpenMP.c

Unified Memory Example in Fortran

In Fortran, unified memory is managed similarly. Below is an example using OpenMP with unified memory for vector addition in Fortran. For full details, see the example Unified_Memory_OpenMP.f90.

subroutine Vector_Addition(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in),  dimension(:) :: a
  real(8), intent(in),  dimension(:) :: b
  real(8), intent(out), dimension(:) :: c
  integer :: i

  !$omp target teams distribute parallel do
  do i = 1, n
     c(i) = a(i) + b(i)
  end do

end subroutine Vector_Addition

In this example:

  • No target enter data or target exit data directives are needed; the runtime handles memory coherence between the CPU and GPU, and unified memory automatically ensures that a, b, and c are available on both sides.

As in the C/C++ case, enable unified memory with -gpu=managed (for NVIDIA compilers) or the equivalent flag for your toolchain.

Compiling for Unified Memory
nvfortran -mp=gpu -gpu=cc80,managed -Minfo=accel,mp Vector_Addition_OpenMP.f90

Summary

Using use_device_ptr and unified memory in OpenMP Offloading can simplify memory management and improve performance in GPU applications:

  • use_device_ptr: Allows data to be accessed directly in device memory, similar to caching, optimizing data reuse in parallel computations.
  • Unified Memory: Enables automatic memory management across CPU and GPU, eliminating manual data transfer and simplifying the code for complex memory patterns.

Both methods are essential tools in OpenMP Offloading, providing efficient ways to manage memory access and optimize performance on GPU-accelerated systems.