Tuning and Optimization

In this article, we will study how to optimize OpenACC code.

OpenACC provides three levels of parallelism, gang, worker, and vector, which are similar to CUDA's thread block, warp, and thread. Using them well helps achieve good occupancy on the GPU. However, OpenACC targets both GPUs and CPUs, so what a gang, worker, and vector correspond to varies with the underlying architecture. Table 1 below gives a quick overview of how gang, worker, and vector map onto different platforms.

| Platform | Gang | Worker | Vector |
| --- | --- | --- | --- |
| Multicore CPU | Whole CPU (NUMA domain) | Core | SIMD vector |
| Manycore CPU (e.g., Xeon Phi) | NUMA domain (whole chip or quadrant) | Core | SIMD vector |
| NVIDIA GPU (CUDA) | Thread block | Warp | Thread |
| AMD GPU (OpenCL) | Workgroup | Wavefront | Thread |

Table 1: Overview of different architectures and the roles of gang, worker, and vector
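
To see how these levels compose, here is a small sketch of our own (the routine name `Matrix_Scale`, the sizes, and the clause placement are illustrative, not taken from the text): gangs are distributed over the outer loop, while workers and vector lanes share the inner loop.

// Illustrative sketch (not from the original text): gangs take rows,
// workers and vector lanes share the columns of each row.
void Matrix_Scale(float *restrict a, int rows, int cols, float s)
{
#pragma acc parallel loop gang copy(a[0:rows*cols])
  for (int i = 0; i < rows; i++)
    {
#pragma acc loop worker vector
      for (int j = 0; j < cols; j++)
        {
          a[i * cols + j] *= s;
        }
    }
}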

Furthermore, OpenACC provides two primary compute constructs, kernels and parallel, as we discussed earlier. The clauses that control these parallelism levels differ between the two constructs, as can be seen in Table 2.

| Level | OpenACC Kernels | OpenACC Parallel |
| --- | --- | --- |
| Threads | vector(expression) | vector_length(expression) |
| Warps | worker(expression) | num_workers(expression) |
| Thread blocks | gang(expression) | num_gangs(expression) |
| Device (nvidia or radeon) | device_type(device name) | device_type(device name) |

Table 2: Clauses for the kernels and parallel constructs
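
The `device_type` clause from Table 2 lets one directive carry architecture-specific values. The sketch below is our own illustration (the function name `Vector_Addition_Tuned` and the vector lengths are placeholders, not tuned recommendations):

// Illustrative only: per-architecture tuning with device_type.
// The vector_length values are placeholders, not tuned choices.
void Vector_Addition_Tuned(float *a, float *b, float *restrict c, int n)
{
#pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n]) \
            device_type(nvidia) vector_length(256) \
            device_type(radeon) vector_length(64)
  for (int i = 0; i < n; i++)
    {
      c[i] = a[i] + b[i];
    }
}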

Now let's consider a simple example where we can tune the thread blocks used in a computation. In CUDA, we explicitly specify the thread-block layout at kernel launch. OpenACC does not expose the launch configuration in the same way, but we can use clauses to control the thread blocks. By default, the OpenACC compiler chooses what it considers the best thread-block configuration for the target architecture. However, this default does not always work well for every problem. The example below shows the simple case of the default configuration chosen by the compiler.

Gang, Worker, Vector Clauses:

// function that adds two vectors
// no gang/worker/vector clauses: the compiler picks the launch configuration
void Vector_Addition(float *a, float *b, float *restrict c, int n) 
{
#pragma acc kernels loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
  for(int i = 0; i < n; i++)
    {
      c[i] = a[i] + b[i];
    }
}

###### Profiling Output ######
Vector_Addition:
     12, Generating copyin(a[:n]) [if not already present]
         Generating copyout(c[:n]) [if not already present]
         Generating copyin(b[:n]) [if not already present]
     13, Loop is parallelizable
         Generating Tesla code
         13, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
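
The feedback shown above is the compiler's own report. Assuming the NVIDIA HPC (formerly PGI) compiler, it can be produced with the `-Minfo=accel` flag; the source file name below is a placeholder:

nvc -acc -Minfo=accel vector_addition.c -o vector_addition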

Now assume that you are not happy with the default thread blocks because of poor performance, and you want to specify your own optimized configuration. To do that, we can use the OpenACC clauses, as the examples below show.

// function that adds two vectors
// explicit configuration on the combined kernels loop directive
void Vector_Addition(float *a, float *b, float *restrict c, int n) 
{
#pragma acc kernels loop gang(5) worker(32) vector(32) copyin(a[0:n], b[0:n]) copyout(c[0:n])
  for(int i = 0; i < n; i++)
    {
      c[i] = a[i] + b[i];
    }
}
//////////// or /////////////////////////
// function that adds two vectors
// equivalent form: separate kernels region and loop directive
void Vector_Addition(float *a, float *b, float *restrict c, int n) 
{
#pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
#pragma acc loop gang(5) worker(32) vector(32)
  for(int i = 0; i < n; i++)
    {
      c[i] = a[i] + b[i];
    }
}

The compiler feedback below shows the effect of adding the `OpenACC` clauses. As we can see, the loop is now explicitly mapped across `thread blocks` (gang), `warps` (worker), and `threads` (vector).

###### Profiling Output ######
Vector_Addition:
     12, Generating copyin(a[:n]) [if not already present]
         Generating copyout(c[:n]) [if not already present]
         Generating copyin(b[:n]) [if not already present]
     13, Loop is parallelizable
         Generating Tesla code
         13, #pragma acc loop gang(5), worker(32), vector(32) /* blockIdx.x threadIdx.y threadIdx.x */
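
In CUDA terms, `gang(5)` requests 5 thread blocks (blockIdx.x), while `worker(32)` together with `vector(32)` gives each block a 32 × 32 arrangement of threads (threadIdx.y × threadIdx.x), i.e., 32 × 32 = 1024 threads per block, which is the maximum number of threads per block on NVIDIA GPUs.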

The following example is similar to the one shown earlier, but uses the acc parallel construct instead of acc kernels.

With default threads:

// function that adds two vectors
// no clauses: the compiler picks the launch configuration
void Vector_Addition(float *a, float *b, float *restrict c, int n) 
{
#pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
  for(int i = 0; i < n; i++)
    {
      c[i] = a[i] + b[i];
    }
}

###### Profiling Output ######
Vector_Addition:
     15, Generating copyin(a[:n]) [if not already present]
         Generating copyout(c[:n]) [if not already present]
         Generating copyin(b[:n]) [if not already present]
         Generating Tesla code
         18, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Using num_gangs(), num_workers(), and vector_length():

// function that adds two vectors
// explicit configuration via the parallel construct's clauses
void Vector_Addition(float *a, float *b, float *restrict c, int n) 
{
#pragma acc parallel loop num_gangs(5) num_workers(32) vector_length(32) copyin(a[0:n], b[0:n]) copyout(c[0:n])
  for(int i = 0; i < n; i++)
    {
      c[i] = a[i] + b[i];
    }
}

###### Profiling Output ######
Vector_Addition:
     16, Generating copyin(a[:n]) [if not already present]
         Generating copyout(c[:n]) [if not already present]
         Generating copyin(b[:n]) [if not already present]
         Generating Tesla code
         18, #pragma acc loop gang(5), worker(32), vector(32) /* blockIdx.x threadIdx.y threadIdx.x */
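
Finally, a minimal host driver (our own sketch; the size `n` and the initial values are arbitrary choices) that exercises Vector_Addition and checks one element could look like this:

#include <stdio.h>
#include <stdlib.h>

void Vector_Addition(float *a, float *b, float *restrict c, int n);

// Minimal host driver (illustrative sketch). Assumes one of the
// Vector_Addition definitions above is compiled into the same program.
int main(void)
{
  int n = 1 << 20;                       // 1M elements; arbitrary choice
  float *a = (float *)malloc(n * sizeof(float));
  float *b = (float *)malloc(n * sizeof(float));
  float *c = (float *)malloc(n * sizeof(float));

  for (int i = 0; i < n; i++)
    {
      a[i] = 1.0f;
      b[i] = 2.0f;
    }

  Vector_Addition(a, b, c, n);

  printf("c[0] = %f (expected 3.000000)\n", c[0]);

  free(a);
  free(b);
  free(c);
  return 0;
}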