Profiling in OpenACC

Script

"Profiling is a critical step to ensure that your code is optimized for the target hardware. Even if your program seems efficient, only through profiling can you confirm that resources such as CPU, GPU, and memory are fully utilized. The NVIDIA HPC SDK offers powerful tools, accessible both via the command line and a graphical interface, to help developers fine-tune their applications for OpenACC."
"The command-line profiling tools in the NVIDIA HPC SDK use environment variables to customize the profiling detail level. For example, setting NVCOMPILER_ACC_TIME allows developers to monitor specific aspects of execution, such as kernel launches or memory transfers. With NVCOMPILER_ACC_NOTIFY, you can observe real-time execution and data transfer activities, as shown in this example output."
"The Nsight Systems GUI provides an intuitive timeline view to analyze CPU and GPU interactions. On HPC systems like MeluXina, you need to SSH with X forwarding, allocate a GPU node, load the required modules, and then profile your application. The timeline report can be opened in the Nsight UI for detailed performance visualization."

Profiling is an essential step to ensure that the chosen algorithm uses the computational resources of the target architecture efficiently. A program may appear to perform well, but only profiling can confirm how well it actually uses the CPU, GPU, memory bandwidth, and vector units.

Using the NVIDIA HPC SDK, we can profile OpenACC code effectively through both command-line and GUI tools.

Command Line

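export NVCOMPILER_ACC_NOTIFY=[]
[1]: kernel launches
[2]: data transfers
[4]: region entry/exit
[8]: wait operations or synchronizations with the device
[16]: device memory allocates and deallocates

The value is a bit mask: add up the numbers of the event types you want reported. Setting export NVCOMPILER_ACC_NOTIFY=3 (1 + 2) reports kernel launches and data transfers as the program runs. Setting export NVCOMPILER_ACC_TIME=1 instead prints a summary of kernel and data transfer times when the program exits.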
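
The value 3 used above is simply the sum of two of the listed bit values. A minimal sketch of how such a value can be composed in the shell is given below; the NOTIFY_* names and ALL_EVENTS are illustrative shell variables introduced here, and only NVCOMPILER_ACC_NOTIFY itself is read by the OpenACC runtime.

```shell
#!/bin/sh
# Bit values as listed above. The NOTIFY_* names are hypothetical helpers,
# not variables that the NVIDIA runtime knows about.
NOTIFY_KERNELS=1
NOTIFY_TRANSFERS=2
NOTIFY_REGIONS=4
NOTIFY_WAITS=8
NOTIFY_ALLOCS=16

# Kernel launches + data transfers: 1 + 2 = 3
NVCOMPILER_ACC_NOTIFY=$((NOTIFY_KERNELS + NOTIFY_TRANSFERS))
export NVCOMPILER_ACC_NOTIFY
echo "NVCOMPILER_ACC_NOTIFY=$NVCOMPILER_ACC_NOTIFY"

# Every event type at once: 1 + 2 + 4 + 8 + 16 = 31
ALL_EVENTS=$((NOTIFY_KERNELS + NOTIFY_TRANSFERS + NOTIFY_REGIONS + NOTIFY_WAITS + NOTIFY_ALLOCS))
echo "all events: $ALL_EVENTS"

# With the variable exported, run the OpenACC binary as usual, for example:
# ./a.out
```

Because each event type has its own bit, any combination can be selected by adding the corresponding values.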

Profiling: Compilation

NVIDIA HPC SDK

// compilation 
Vector_Addition:
     12, Generating NVIDIA GPU code
     14, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     12, Generating implicit copyin(a[:n]) [if not already present]
         Generating implicit copyout(c[:n]) [if not already present]
         Generating implicit copyin(b[:n]) [if not already present]

//execution        
[u100@mel2041 Unified-memory]$ ./a.out 
This program does the addition of two vectors 
upload CUDA data  file=/Vector-addition-openacc.c function=Vector_Addition line=12 device=0 threadid=1 variable=b bytes=400
upload CUDA data  file=/Vector-addition-openacc.c function=Vector_Addition line=12 device=0 threadid=1 variable=a bytes=400
launch CUDA kernel  file=/Vector-addition-openacc.c function=Vector_Addition line=12 device=0 threadid=1 num_gangs=1 num_workers=1 vector_length=128 grid=1 block=128
download CUDA data  file=/Vector-addition-openacc.c function=Vector_Addition line=17 device=0 threadid=1 variable=c bytes=400
PASSED
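
For longer runs the notify messages quickly become hard to read by eye. The sketch below shows one way to summarize them with grep and awk, using the four messages from the run above as sample input. It assumes that in a real run the messages are written to stderr and could be captured with something like `./a.out 2> notify.log`; the file and variable names are illustrative.

```shell
#!/bin/sh
# Sample messages copied from the run above. For a real run, capturing them
# with `./a.out 2> notify.log` is an assumption (messages expected on stderr).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
upload CUDA data  file=/Vector-addition-openacc.c function=Vector_Addition line=12 device=0 threadid=1 variable=b bytes=400
upload CUDA data  file=/Vector-addition-openacc.c function=Vector_Addition line=12 device=0 threadid=1 variable=a bytes=400
launch CUDA kernel  file=/Vector-addition-openacc.c function=Vector_Addition line=12 device=0 threadid=1 num_gangs=1 num_workers=1 vector_length=128 grid=1 block=128
download CUDA data  file=/Vector-addition-openacc.c function=Vector_Addition line=17 device=0 threadid=1 variable=c bytes=400
EOF

# Count each kind of event.
UPLOADS=$(grep -c '^upload CUDA data' "$LOG")
LAUNCHES=$(grep -c '^launch CUDA kernel' "$LOG")
DOWNLOADS=$(grep -c '^download CUDA data' "$LOG")

# Sum the bytes= field over all upload lines.
UPLOAD_BYTES=$(awk '/^upload CUDA data/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^bytes=/) { split($i, kv, "="); total += kv[2] }
} END { print total + 0 }' "$LOG")

echo "uploads=$UPLOADS launches=$LAUNCHES downloads=$DOWNLOADS upload_bytes=$UPLOAD_BYTES"
rm -f "$LOG"
```

For the sample above this reports two uploads (a and b, 800 bytes in total), one kernel launch, and one download (c), which matches the implicit copyin and copyout clauses in the compiler feedback.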

GUI

The Nsight Systems GUI is organized into views. Together, the views allow you to analyze and visualize the performance of your application. The Timeline View shows the CPU and GPU activity that occurred while your application was being profiled. Multiple timelines can be open simultaneously in different tabs. The following figure shows a Timeline View for an OpenACC application.

To visualize the performance of your application, connect to the HPC machine with X11 forwarding enabled (ssh -X); otherwise, the GUI application cannot open a window on your display. For example, on MeluXina, you should do the following.

GUI login

$ ssh -X meluxina

$ salloc -A p200117 --res p200117-openacc-2 --partition=gpu --qos default -N 1 -t 00:30:00 srun --forward-x --pty bash -l
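
Before launching any GUI tool on the compute node, it can help to confirm that X forwarding survived both the ssh and the srun step. A minimal sketch, assuming that a non-empty DISPLAY variable indicates a forwarded display; check_display is a hypothetical helper name, not part of the MeluXina tooling.

```shell
#!/bin/sh
# check_display: hypothetical helper that reports whether DISPLAY is set.
# A non-empty DISPLAY is taken here as a sign that X forwarding is active.
check_display() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "X forwarding looks active (DISPLAY=$DISPLAY)"
        return 0
    fi
    echo "DISPLAY is not set; GUI windows cannot be opened from this shell"
    return 1
}

check_display || echo "Reconnect with ssh -X and request the allocation again."
```

If the check fails on the compute node but succeeds on the login node, the forwarding option of the srun step is the first thing to look at.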

We also need to add a few extra modules to open a GUI application. On MeluXina, we need to add the following modules:

Required modules

module load NVHPC/22.7
module load CUDA/11.7.0
module load Mesa/22.0.3-GCCcore-11.3.0        
module load Qt5/5.15.5-GCCcore-11.3.0

Once the required modules are loaded, compile your application and profile it with nsys profile, which writes the report timeline.nsys-rep. Finally, run nsys-ui from the command line to open the GUI and load timeline.nsys-rep.

Compilation and GUI

```
[u100@mel2073 Vector-addition]$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel Vector-addition.c
nvc-Warning-CUDA_HOME has been deprecated. Please use NVHPC_CUDA_HOME instead.
[u100@mel2073 Vector-addition]$ nsys profile -o timeline ./a.out
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
This program does the addition of two vectors 
Please specify the vector size = 10000
PASSED
Generating '/tmp/nsys-report-6c02.qdstrm'
[1/1] [========================100%] timeline.nsys-rep

// Open the GUI application  and load timeline.nsys-rep
$ nsys-ui &
```
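
The profiling step above can also be wrapped in a small script, for example when no X forwarding is available. This is a sketch under stated assumptions rather than a prescribed workflow: it assumes that the installed Nsight Systems version accepts the `--trace=cuda,openacc` and `--force-overwrite=true` options and provides the `nsys stats` subcommand, and the REPORT variable is an illustrative name.

```shell
#!/bin/sh
# Report name, matching the -o timeline option used above.
REPORT=timeline

if command -v nsys >/dev/null 2>&1 && [ -x ./a.out ]; then
    # Restrict tracing to CUDA and OpenACC activity and replace any
    # existing timeline.nsys-rep (options assumed to be supported).
    nsys profile --trace=cuda,openacc --force-overwrite=true -o "$REPORT" ./a.out
    # Print text summaries of the report without opening the GUI.
    nsys stats "$REPORT.nsys-rep"
else
    echo "nsys or ./a.out not found; load the modules and compile first"
fi
```

The resulting timeline.nsys-rep can still be copied to a workstation and opened in nsys-ui later.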