
Data Clauses in OpenACC

OpenACC provides data clauses to efficiently manage data transfer between the host (CPU) and device (GPU). These clauses are essential for specifying how and when data should be allocated, transferred, or retained on the device. Understanding these clauses can optimize memory use and minimize unnecessary data movement, which is critical for performance on GPUs.


Main data clauses in OpenACC:

  • copy(list): The copy clause allocates memory on the GPU, copies data from the host to the device upon entering a data region, and copies data back from the device to the host when exiting the region. This clause is useful for data that the device both reads and modifies, so that the host sees the updated values after the region ends.

    void Vector_Addition(float *a, float *b, float *restrict c, int n) 
    {
      #pragma acc data copyin(a[0:n], b[0:n]) copy(c[0:n])
      {
        #pragma acc kernels loop
        for(int i = 0; i < n; i++)
        {
          c[i] = a[i] + b[i];
        }
      }
    }
    

  • copyin(list): The copyin clause allocates memory on the device and transfers data from the host to the device upon entering the data region. No data is copied back to the host when exiting the region. This clause is ideal for input data that is only read by the device.

    void Vector_Addition(float *a, float *b, float *restrict c, int n) 
    {
      #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
      {
        #pragma acc kernels loop
        for(int i = 0; i < n; i++)
        {
          c[i] = a[i] + b[i];
        }
      } 
    }
    

  • copyout(list): The copyout clause allocates memory on the device and copies data from the device to the host when exiting the region. It does not transfer data from the host to the device at the start of the region. This clause is used for output data that is generated on the device and needs to be available on the host after computation.

    void Vector_Addition(float *a, float *b, float *restrict c, int n) 
    {
      #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
      {
        #pragma acc kernels loop
        for(int i = 0; i < n; i++)
        {
          c[i] = a[i] + b[i];
        }
      }
    }
    

  • create(list): The create clause allocates memory on the device without any data transfer from the host. No data is copied back to the host upon exiting the region. This clause is useful for variables that are only used within the GPU and do not need to retain values from the host.

    void Vector_Addition(float *a, float *b, float *restrict c, int n) 
    {
      #pragma acc data copyin(a[0:n], b[0:n]) create(c[0:n])
      {
        #pragma acc kernels loop
        for(int i = 0; i < n; i++)
        {
          c[i] = a[i] + b[i];
        }
        // create performs no transfers, so copy the result back explicitly.
        #pragma acc update self(c[0:n])
      }
    }
    

  • present(list): The present clause indicates that the data is already allocated and available on the device, typically because it was placed there by a parent data region or an earlier enter data directive. This clause prevents re-allocation or copying, which can save memory and improve performance in nested or complex regions. If the data is not actually on the device, the program fails at run time.

  • deviceptr(list): The deviceptr clause is used for device pointers (e.g., pointers that reference memory on the GPU allocated by other methods, such as CUDA). This clause allows OpenACC to recognize device pointers and use them directly without duplicating data.

  • copyin and copyout example (vector addition): Below is a practical example using copyin and copyout to add two vectors on the GPU.

    // For C/C++
    #pragma acc kernels loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for(int i = 0; i < n; i++)
    {
        c[i] = a[i] + b[i];
    }
    
    !! For Fortran
    !$acc kernels loop copyin(a(1:n), b(1:n)) copyout(c(1:n))
    DO i = 1, N
        c(i) = a(i) + b(i)
    END DO
    !$acc end kernels loop
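The present clause described above has no example of its own, so here is a minimal sketch of one common pattern: a caller owns the data region and a helper function only asserts that the arrays are already on the device. The function names scale_add and run_twice are hypothetical and chosen only for illustration.

```c
/* Hypothetical helper: assumes x and y were already placed on the
   device by the caller's data region. */
void scale_add(const float *x, float *y, float alpha, int n)
{
    /* present() performs no allocation and no transfer; it only
       checks that device copies of x and y exist. */
    #pragma acc parallel loop present(x[0:n], y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}

/* Hypothetical caller: one data region covers both helper calls, so
   x is sent once and y is sent once and returned once. */
void run_twice(const float *x, float *y, float alpha, int n)
{
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        scale_add(x, y, alpha, n);
        scale_add(x, y, alpha, n);
    }
}
```

Calling scale_add outside any data region that contains x and y would be expected to fail at run time in an OpenACC build, since present does not fall back to copying.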
    

Unstructured vs. Structured Data Regions

OpenACC provides structured and unstructured data regions to manage memory across regions of code.

  • Unstructured Data Region: This type of region is initiated and exited explicitly, allowing memory allocation across multiple functions and code regions. Data remains on the device until it is explicitly deallocated. For example:

    #pragma acc enter data copyin(a[0:N], b[0:N]) create(c[0:N])
    #pragma acc parallel loop
    for(int i = 0; i < N; i++)
    {
        c[i] = a[i] + b[i];
    }
    #pragma acc exit data copyout(c[0:N]) delete(a[0:N], b[0:N])
    

  • Structured Data Region: A structured data region is limited to a single scope within a function. Data allocation and deallocation occur automatically at the start and end of the scope.

    #pragma acc data copyin(a[0:N], b[0:N]) copyout(c[0:N])
    {
      #pragma acc parallel loop
      for(int i = 0; i < N; i++)
      {
        c[i] = a[i] + b[i];
      }
    }
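To illustrate how an unstructured data region can span several functions, the following sketch splits the enter data and exit data directives across separate helpers. The names vec_create, vec_double, and vec_release_device are hypothetical, and the initial values are arbitrary.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate and fill a host array, then create its
   device copy. The device copy outlives this function. */
float *vec_create(int n)
{
    float *v = (float *)malloc((size_t)n * sizeof(float));
    if (v == NULL) {
        return NULL;
    }
    for (int i = 0; i < n; i++) {
        v[i] = (float)i;
    }
    #pragma acc enter data copyin(v[0:n])
    return v;
}

/* Works on the device copy created by vec_create. */
void vec_double(float *v, int n)
{
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; i++) {
        v[i] *= 2.0f;
    }
}

/* Copy the results back to the host and free the device copy.
   The host array remains valid; the caller releases it with free(). */
void vec_release_device(float *v, int n)
{
    #pragma acc exit data copyout(v[0:n])
}
```

A structured data region could not express this lifetime, because the allocation and the release happen in different scopes.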
    

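cache Directive: The cache directive is a hint placed at the top of a loop body. It asks the compiler to keep the listed array elements or subarrays in the fastest available memory on the device (for example, a software-managed cache) for the duration of the loop body. It helps most when each iteration reuses nearby elements; the compiler is free to ignore the hint.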

For C/C++:

#pragma acc parallel loop 
for (int i = 0; i < N; i++)
{
  #pragma acc cache(a[i])
  a[i] = a[i] * 2.0;
}

For Fortran:

!$acc parallel loop
DO i = 1, N
  !$acc cache(a(i))
  a(i) = a(i) * 2.0
END DO
!$acc end parallel loop
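The cache directive tends to matter more when iterations read neighbouring elements than when each element is touched once. As a minimal sketch under that assumption, the hypothetical function smooth3 below averages three adjacent values and asks for that three-element window to be cached. Whether this improves performance depends on the compiler and hardware.

```c
/* Hypothetical three-point average. Only the interior elements of out
   are written, so out uses copy (not copyout) to preserve its end points. */
void smooth3(const float *restrict in, float *restrict out, int n)
{
    #pragma acc parallel loop copyin(in[0:n]) copy(out[0:n])
    for (int i = 1; i < n - 1; i++) {
        /* Hint: keep in[i-1], in[i], in[i+1] in fast memory.
           The directive must appear at the top of the loop body. */
        #pragma acc cache(in[i-1:3])
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    }
}
```

The subarray in[i-1:3] uses the C form start:length, so it names three elements beginning at index i-1.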

By using these data clauses effectively, OpenACC allows fine-grained control over memory management on the GPU, helping to optimize data transfer and memory usage, which is crucial for achieving high performance in GPU-based applications.
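As a sketch of how the clauses combine, the hypothetical function sum_then_square below runs two loops inside one structured data region. The inputs use copyin, the intermediate array tmp uses create because the host never needs it, and only the final result uses copyout. The function name and the two-step computation are assumptions made for illustration.

```c
#include <stdlib.h>

/* Hypothetical example: out[i] = (a[i] + b[i])^2, computed in two steps.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int sum_then_square(const float *a, const float *b, float *out, int n)
{
    /* Host-side handle for the scratch array; its contents are never
       transferred in either direction. */
    float *tmp = (float *)malloc((size_t)n * sizeof(float));
    if (tmp == NULL) {
        return -1;
    }

    #pragma acc data copyin(a[0:n], b[0:n]) create(tmp[0:n]) copyout(out[0:n])
    {
        /* Step 1: the intermediate sum stays on the device. */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++) {
            tmp[i] = a[i] + b[i];
        }

        /* Step 2: reuses tmp without any host round trip. */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++) {
            out[i] = tmp[i] * tmp[i];
        }
    }

    free(tmp);
    return 0;
}
```

Putting each loop in its own data region would instead move the intermediate values to the host and back between the two steps.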

Summary

OpenACC provides data clauses to manage memory transfer efficiently between the host (CPU) and device (GPU). These clauses allow developers to control data allocation, movement, and retention on the GPU, optimizing performance by minimizing unnecessary data transfer.

  • copy(list): Allocates memory on the GPU, copies data from host to device at the start, and back to the host upon exiting the data region. Suitable for data that needs to be available on both host and device.

  • copyin(list): Allocates memory on the GPU and transfers data from host to device only when entering the data region. Ideal for input data that the device reads without modifying.

  • copyout(list): Allocates memory on the GPU and copies data back to the host when exiting the data region. Used for data that is generated on the device and needed on the host after computation.

  • create(list): Allocates memory on the GPU without transferring data from the host. Data remains on the device and isn’t copied back to the host upon exiting the region, suitable for temporary device-only data.

  • present(list): Assumes that data is already allocated on the GPU from a parent region, avoiding re-allocation and redundant transfers, which is beneficial in nested or complex regions.

  • deviceptr(list): Used for GPU pointers allocated outside OpenACC (e.g., through CUDA), allowing OpenACC to use them directly without duplicating memory.

Structured vs. Unstructured Data Regions

  • Unstructured Data Region: Allows flexible memory allocation across multiple functions or regions and keeps data on the device until explicitly deallocated.

  • Structured Data Region: Limited to a function scope, where data is allocated and deallocated automatically at the start and end of the scope.

cache Directive

The cache directive hints that frequently accessed array elements should be kept in the fastest available memory on the device, which can reduce memory latency in parallel loops.

By effectively using these data clauses, OpenACC enables precise memory control on the GPU, enhancing performance by reducing data transfer overhead, which is essential for high-performance GPU applications.