Article 4

In this article, we will study how to profile OpenACC code written in C/C++ and Fortran, and how to analyze the important traces and metrics of the application so that it can be optimized.

Profiling is essential for finding out where the code spends most of its time. A profile gives us detailed information about the functions, their time consumption, the memory transfers, and the time spent in API calls. A few tools exist for profiling OpenACC code.

Among these, we will go through how to use the PGI compiler for profiling. PGI profiling is already part of the Nvidia HPC SDK.

The PGI compiler for OpenACC reports the parallelization strategy and the data movement information at compile time. This applies to both GPUs and CPUs.

To see this compile-time information for the whole code, use:

pgcc -fast -Minfo=all -ta=tesla -acc Vector_Addition_OpenACC.c
pgfortran -fast -Minfo=all -ta=tesla -acc Vector_Addition_OpenACC.f90

To see only the information about the accelerator kernels, use:

pgcc -fast -Minfo=accel -ta=tesla -acc Vector_Addition_OpenACC.c
pgfortran -fast -Minfo=accel -ta=tesla -acc Vector_Addition_OpenACC.f90

Command line profiling:

The following steps walk through profiling from the command line:

The first step is simply to compile the entire code:

pgcc -fast -Minfo=all -ta=tesla -acc Vector_Addition_OpenACC.c

Then, if you do not know what to look for in the profile, type the following command to list the options that pgprof provides:

pgprof --help

For example, the following commands print specific profiles:

GPU kernel execution profile

  pgprof --print-gpu-summary ./a.out
  pgprof --print-gpu-trace ./a.out

CUDA API execution profile

  pgprof --print-api-summary ./a.out
  pgprof --print-api-trace ./a.out

OpenACC execution profile

  pgprof --print-openacc-trace ./a.out
  pgprof --print-openacc-summary ./a.out

CPU execution profile

  pgprof --cpu-profiling-mode flat ./a.out

Visual Profiling:

Sometimes we would also like to view the profile in the visual profiler, especially to see the application's communication and computation time, because most of the time those are the quantities we should examine and try to reduce. The steps below show how to visualize the profiled data using pgprof.

First, we create an output file that pgprof can open:

pgprof -o profiled-output.pgprof --cpu-profiling-mode flat ./a.out

Then, to open the file, we need to start the pgprof GUI.
Once pgprof is open, we can easily load the profiled-output.pgprof file.
Figure 1 shows an example of the pgprof GUI.

Figure 1: Example of pgprof GUI profiling

The PGI compiler supports a few important environment variables; these are set in the shell before the program is run:

PGI_ACC_DEBUG
  runtime debugging
ACC_NOTIFY
  writes out a line for each kernel launch and data movement
  options: 1 - kernel launches; 2 - data transfers; 4 - synchronous operations; 8 - region entry/exit; 16 - data allocation/free
PGI_ACC_TIME
  lightweight profiler that prints a summary of the program
PGI_ACC_SYNCHRONOUS
  disables asynchronous operations

Example usage:
for csh: setenv ACC_NOTIFY 1
for bash: export ACC_NOTIFY=1