Data Clauses in OpenACC

Script

"Data clauses in OpenACC manage data transfer between the CPU and GPU, allowing developers to optimize memory use and reduce unnecessary data movement. These clauses provide granular control over data allocation and transfer, which is essential for performance in GPU-based applications."
"Three primary data clauses in OpenACC include copy, copyin, and copyout. The copy clause transfers data to and from the device, while copyin only transfers data from host to device and copyout transfers data from device to host. These clauses allow efficient handling of data required on both host and device."
"Additional clauses such as create, present, and deviceptr help further control data usage. Create allocates memory on the device without transferring data, present assumes the data is already available on the device, and deviceptr allows OpenACC to work with device pointers created outside of OpenACC, like in CUDA."
"OpenACC offers structured and unstructured data regions for flexible memory management. Structured data regions are scoped within functions and handle allocation automatically, while unstructured regions span multiple functions or regions and require explicit data management, giving more control over memory."
"The cache clause in OpenACC is used to cache frequently accessed data on the device, which helps reduce memory latency in loops and is beneficial for performance in complex computations."
"OpenACC data clauses enable precise data transfer control with options like copy, copyin, and copyout for managing transfers, and create, present, and deviceptr for advanced data handling. Structured and unstructured data regions allow varying levels of control, and the cache clause optimizes frequently accessed data, enhancing GPU performance."

OpenACC provides data clauses to manage data transfer between the host (CPU) and the device (GPU). These clauses specify how and when data is allocated, transferred, or retained on the device. Choosing the right clause for each variable reduces device memory use and avoids unnecessary data movement, which is critical for performance on GPUs.

Main data clauses in OpenACC

copy(list): The copy clause allocates memory on the GPU, copies data from the host to the device upon entering a data region, and copies data back from the device to the host when exiting the region. This clause suits data that the device both reads and modifies and that the host needs afterwards. In the example below, c is only written on the device, so copyout would also work; copy is used here to illustrate the clause.
```
void Vector_Addition(float *a, float *b, float *restrict c, int n) 
{
  #pragma acc data copyin(a[0:n], b[0:n]) copy(c[0:n])
  {
    #pragma acc kernels loop
    for(int i = 0; i < n; i++)
    {
      c[i] = a[i] + b[i];
    }
  }
}
```
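
The copy clause matters most when the device both reads and overwrites the same array. The following is a minimal sketch of that case; the function name scale_in_place and its parameters are hypothetical and chosen only for illustration.

```c
/* Hypothetical helper: multiplies every element of x by factor, in place.
   x is read and then overwritten on the device, so it is listed in copy:
   host-to-device on entry to the data region, device-to-host on exit. */
void scale_in_place(float *restrict x, float factor, int n)
{
  #pragma acc data copy(x[0:n])
  {
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
    {
      x[i] = x[i] * factor;
    }
  }
}
```

Using copyin here would leave the host copy unchanged, and copyout would make the device read uninitialized values, which is why copy is the fitting choice for in-place updates.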

copyin(list): The copyin clause allocates memory on the device and transfers data from the host to the device upon entering the data region. No data is copied back to the host when exiting the region. This clause is ideal for input data that is only read by the device.

```
void Vector_Addition(float *a, float *b, float *restrict c, int n)
{
  #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
  {
    #pragma acc kernels loop
    for(int i = 0; i < n; i++)
    {
      c[i] = a[i] + b[i];
    }
  }
}
```

copyout(list): The copyout clause allocates memory on the device and copies data from the device to the host when exiting the region. It does not transfer data from the host to the device at the start of the region. This clause is used for output data that is generated on the device and needs to be available on the host after computation.
```
void Vector_Addition(float *a, float *b, float *restrict c, int n) 
{
  #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
  {
    #pragma acc kernels loop
    for(int i = 0; i < n; i++)
    {
      c[i] = a[i] + b[i];
    }
  }
}
```

create(list): The create clause allocates memory on the device without any data transfer from the host. No data is copied back to the host upon exiting the region. This clause is useful for variables that are only used on the GPU and do not need to retain values from the host.

```
void Vector_Addition(float *a, float *b, float *restrict c, int n)
{
  // A variable may appear in only one data clause per directive,
  // so c is listed in create alone (not in create and copyout together).
  #pragma acc data copyin(a[0:n], b[0:n]) create(c[0:n])
  {
    #pragma acc kernels loop
    for(int i = 0; i < n; i++)
    {
      c[i] = a[i] + b[i];
    }
    // create never copies back, so fetch the result explicitly.
    #pragma acc update self(c[0:n])
  }
}
```
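
A typical use of create is a scratch buffer that only the device touches. The following is a minimal sketch under that assumption; the function sum_then_square and the caller-supplied tmp buffer are hypothetical names introduced for illustration.

```c
/* Hypothetical example: out[i] = (a[i] + b[i])^2, using tmp as device-only
   scratch space. tmp is listed in create, so it is allocated on the device
   but never transferred in either direction. */
void sum_then_square(const float *a, const float *b, float *restrict out,
                     float *restrict tmp, int n)
{
  #pragma acc data copyin(a[0:n], b[0:n]) create(tmp[0:n]) copyout(out[0:n])
  {
    /* First kernel: fill the scratch buffer on the device. */
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
    {
      tmp[i] = a[i] + b[i];
    }

    /* Second kernel: consume the scratch buffer, still on the device. */
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
    {
      out[i] = tmp[i] * tmp[i];
    }
  }
}
```

When this runs on a GPU, the host copy of tmp should be treated as undefined after the call, since the intermediate values are never copied back.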

present(list): The present clause indicates that the data is already allocated and available on the device, typically because it was allocated by a parent data region. This clause prevents re-allocation or copying, which can save memory and improve performance in nested or complex regions. If the data is not actually present on the device, the program fails with a runtime error.
deviceptr(list): The deviceptr clause is used for device pointers (e.g., pointers that reference memory on the GPU allocated by other methods, such as CUDA). This clause tells OpenACC that the pointer already holds a device address, so it is used directly without allocating or copying data.
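
The present clause is most useful when a compute region lives in a different function from the data region that owns the arrays. The following is a minimal sketch of that arrangement; the function names add_on_device and vector_add are hypothetical.

```c
/* Hypothetical helper: assumes the caller has already placed a, b, and c
   on the device. present asserts this and performs no allocation or copy. */
static void add_on_device(const float *a, const float *b,
                          float *restrict c, int n)
{
  #pragma acc parallel loop present(a[0:n], b[0:n], c[0:n])
  for (int i = 0; i < n; i++)
  {
    c[i] = a[i] + b[i];
  }
}

/* The parent data region owns the device copies for the whole call. */
void vector_add(const float *a, const float *b, float *restrict c, int n)
{
  #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
  {
    add_on_device(a, b, c, n);
  }
}
```

Calling add_on_device outside of any data region that covers its arguments would be expected to fail at runtime on a GPU build, which makes present a cheap way to catch accidental transfers.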

copyin and copyout Example: Vector Addition

Below is a practical example using copyin and copyout to add two vectors on the GPU.

For C/C++:
```
#pragma acc kernels loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
for(int i = 0; i < n; i++)
{
    c[i] = a[i] + b[i];
}
```

For Fortran:
```
!$acc kernels loop copyin(a(1:n), b(1:n)) copyout(c(1:n))
DO i = 1, n
    c(i) = a(i) + b(i)
END DO
!$acc end kernels loop
```

Unstructured vs. Structured Data Regions

OpenACC provides structured and unstructured data regions to manage memory across regions of code.

Unstructured Data Region: This type of region is initiated and exited explicitly, allowing memory allocation across multiple functions and code regions. Data remains on the device until it is explicitly deallocated. For example:
```
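#pragma acc enter data copyin(a[0:N], b[0:N]) create(c[0:N])
// The arrays are already on the device, so state that explicitly.
#pragma acc parallel loop present(a[0:N], b[0:N], c[0:N])
for(int i = 0; i < N; i++)
{
    c[i] = a[i] + b[i];
}
// Copy the result back and release the inputs, using the same shapes as on entry.
#pragma acc exit data copyout(c[0:N]) delete(a[0:N], b[0:N])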
```
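
Because an unstructured region is not tied to a scope, its entry and exit can sit in different functions. The following is a minimal sketch of that lifecycle; the function names device_setup, device_add, and device_teardown are hypothetical.

```c
/* Hypothetical setup step: places the inputs on the device and allocates
   the output there. The data stays resident after this function returns. */
void device_setup(const float *a, const float *b, float *c, int n)
{
  #pragma acc enter data copyin(a[0:n], b[0:n]) create(c[0:n])
}

/* Hypothetical compute step: relies on the arrays being resident already. */
void device_add(const float *a, const float *b, float *restrict c, int n)
{
  #pragma acc parallel loop present(a[0:n], b[0:n], c[0:n])
  for (int i = 0; i < n; i++)
  {
    c[i] = a[i] + b[i];
  }
}

/* Hypothetical teardown step: copies the result back and frees device memory. */
void device_teardown(const float *a, const float *b, float *c, int n)
{
  #pragma acc exit data copyout(c[0:n]) delete(a[0:n], b[0:n])
}
```

A caller would invoke the three functions in order, and could call device_add many times between setup and teardown without any further host-device transfers.
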
Structured Data Region: A structured data region is limited to a single scope within a function. Data allocation and deallocation occur automatically at the start and end of the scope.
```
#pragma acc data copyin(a[0:N], b[0:N]) copyout(c[0:N])
{
  #pragma acc parallel loop
  for(int i = 0; i < N; i++)
  {
    c[i] = a[i] + b[i];
  }
}
```

cache Directive: The cache directive (often loosely called the cache clause) is placed at the top of a loop body and requests that the listed array elements or subarrays be kept in fast on-device cache memory for the duration of the loop. This can reduce memory latency when the same data is accessed repeatedly; how much it helps depends on the compiler and the access pattern.

For C/C++:
```
#pragma acc parallel loop copy(a[0:N])
for (int i = 0; i < N; i++)
{
  #pragma acc cache(a[i])
  a[i] = a[i] * 2.0;
}
```

For Fortran:
```
!$acc parallel loop copy(a(1:N))
DO i = 1, N
  !$acc cache(a(i))
  a(i) = a(i) * 2.0
END DO
!$acc end parallel loop
```
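
Caching tends to matter more when neighbouring iterations reuse the same elements, as in a stencil. The following is a minimal sketch of a three-point average that asks for a small window of the input to be cached; the function name smooth3 is hypothetical, and whether the hint has any effect depends on the compiler.

```c
/* Hypothetical example: three-point moving average over in, written to out.
   Boundary elements are copied unchanged on the host. */
void smooth3(const float *restrict in, float *restrict out, int n)
{
  if (n < 3)
  {
    /* Too short for a stencil: just copy the input through. */
    for (int i = 0; i < n; i++)
    {
      out[i] = in[i];
    }
    return;
  }

  out[0] = in[0];
  out[n - 1] = in[n - 1];

  /* copy (not copyout) so the host-set boundary values survive the region. */
  #pragma acc parallel loop copyin(in[0:n]) copy(out[0:n])
  for (int i = 1; i < n - 1; i++)
  {
    /* Request the 3-element window starting at in[i-1] in fast memory. */
    #pragma acc cache(in[i-1:3])
    out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
  }
}
```

Each interior input element is read by three different iterations here, so this access pattern is the kind the cache directive is aimed at.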

By using these data clauses effectively, OpenACC allows fine-grained control over memory management on the GPU, helping to optimize data transfer and memory usage, which is crucial for achieving high performance in GPU-based applications.

Summary

OpenACC provides data clauses to manage memory transfer efficiently between the host (CPU) and device (GPU). These clauses allow developers to control data allocation, movement, and retention on the GPU, optimizing performance by minimizing unnecessary data transfer.

copy(list): Allocates memory on the GPU, copies data from host to device at the start, and back to the host upon exiting the data region. Suitable for data that needs to be available on both host and device.
copyin(list): Allocates memory on the GPU and transfers data from host to device only when entering the data region. Ideal for input data that the device reads without modifying.
copyout(list): Allocates memory on the GPU and copies data back to the host when exiting the data region. Used for data that is generated on the device and needed on the host after computation.
create(list): Allocates memory on the GPU without transferring data from the host. Data remains on the device and isn’t copied back to the host upon exiting the region, suitable for temporary device-only data.
present(list): Assumes that data is already allocated on the GPU from a parent region, avoiding re-allocation and redundant transfers, which is beneficial in nested or complex regions.
deviceptr(list): Used for GPU pointers allocated outside OpenACC (e.g., through CUDA), allowing OpenACC to manage them without duplicating memory.

Structured vs. Unstructured Data Regions

Unstructured Data Region: Allows flexible memory allocation across multiple functions or regions and keeps data on the device until explicitly deallocated.
Structured Data Region: Limited to a function scope, where data is allocated and deallocated automatically at the start and end of the scope.

Cache Directive

The cache directive requests that frequently accessed array elements be kept in fast on-device cache memory for the duration of a loop, reducing memory latency in parallel loops.

By effectively using these data clauses, OpenACC enables precise memory control on the GPU, enhancing performance by reducing data transfer overhead, which is essential for high-performance GPU applications.