Profiling Options for CUDA and HIP¶
Script
- "Profiling is a crucial process for optimizing CUDA and HIP applications. It helps developers identify bottlenecks such as compute or memory limitations and ensures efficient utilization of computational resources like CPUs, GPUs, and memory bandwidth. Key areas include measuring execution time, analyzing system-wide performance, and calculating occupancy for GPU optimization."
- "Both CUDA and HIP use event-based APIs for measuring execution time. In CUDA, you create events with cudaEventCreate, record them with cudaEventRecord, and measure elapsed time using cudaEventElapsedTime. HIP offers a similar workflow with its hipEvent_t API. These tools allow precise measurement of kernel execution times."
- "CUDA provides powerful tools like Nsight Compute for kernel-level profiling and Nsight Systems for system-wide analysis of CPU-GPU interactions. Similarly, HIP offers rocprof for collecting traces and performance metrics, and rocminfo for system configuration details. These tools are essential for optimizing applications at both the kernel and resource levels."
Time Measurement¶
CUDA¶
In CUDA, execution time can be measured using CUDA events. Events are declared with the cudaEvent_t type, for example cudaEvent_t start, stop;, initialized with cudaEventCreate, and recorded at a point in the stream with cudaEventRecord. The elapsed time between two recorded events is then obtained with cudaEventElapsedTime.
CUDA API
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
Then timings are measured:
CUDA API
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float time;
cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
Example:
Example
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
// Device function call
matrix_mul<<<Grid_dim, Block_dim>>>(d_a, d_b, d_c, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float time;
cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
std::cout << "Time taken for the GPU kernel: " << time << " ms" << std::endl;
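Every CUDA runtime call in the example returns a cudaError_t that is silently discarded. A common practice is to wrap each call in a checking macro; the sketch below uses a hypothetical CUDA_CHECK name (our own helper, not part of the CUDA API):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// CUDA_CHECK is a hypothetical helper macro (not part of the CUDA API):
// it aborts with file/line information whenever a runtime call fails.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",           \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Usage with the timing code above:
//   CUDA_CHECK(cudaEventCreate(&start));
//   CUDA_CHECK(cudaEventRecord(start));
//   ...
//   CUDA_CHECK(cudaEventElapsedTime(&time, start, stop));
```

The same pattern applies to HIP, checking hipError_t against hipSuccess and formatting with hipGetErrorString.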
HIP¶
In ROCm, execution time can be measured using HIP events. Events are declared with the hipEvent_t type, for example hipEvent_t start, stop;, initialized with hipEventCreate, and recorded with hipEventRecord. The elapsed time between two recorded events is then obtained with hipEventElapsedTime.
HIP API
hipEvent_t start, stop;
hipEventCreate(&start);
hipEventCreate(&stop);
hipEventRecord(start, 0);
Then timings are measured:
HIP API
hipEventRecord(stop);
hipEventSynchronize(stop);
float time;
hipEventElapsedTime(&time, start, stop);
hipEventDestroy(start);
hipEventDestroy(stop);
Example:
Example
hipEvent_t start, stop;
hipEventCreate(&start);
hipEventCreate(&stop);
hipEventRecord(start);
// Device function call
hipLaunchKernelGGL(matrix_mul, Grid_dim, Block_dim, 0, 0, d_a, d_b, d_c, N);
hipEventRecord(stop);
hipEventSynchronize(stop);
float time;
hipEventElapsedTime(&time, start, stop);
hipEventDestroy(start);
hipEventDestroy(stop);
std::cout << "Time taken for the GPU kernel: " << time << " ms" << std::endl;
System-Wide Performance Analysis¶
CUDA¶
NVIDIA provides tools for CUDA profiling:
- Nsight Compute: A kernel profiler with CLI and GUI support.
- Nsight Graphics: A graphics application debugger and profiler.
- Nsight Systems: A system-wide performance analysis tool.
Examples and usage instructions are detailed in NVIDIA documentation.
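As a sketch of typical command-line usage (the report names here are arbitrary choices, not defaults; see the Nsight documentation for the full option sets):

```shell
$ nsys profile -o timeline ./a.out   # system-wide trace, open in Nsight Systems GUI
$ ncu -o kernels ./a.out             # per-kernel metrics, open in Nsight Compute GUI
```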
HIP¶
ROCm provides profiling tools for HIP applications:
- rocprof: A profiler to collect traces, analyze kernel calls, and compute performance metrics.
Example usage:
rocprof CLI
$ rocprof --hip-trace ./a.out
The output (a results.csv file plus a JSON trace that can be opened in Perfetto or chrome://tracing) provides detailed execution traces and kernel timings.
- rocminfo: Displays device and system configuration information.
- rocprof metrics: hardware performance counters are requested through an input file containing pmc: lines and passed with the -i option; this gathers metrics such as memory throughput and compute-unit utilization alongside kernel execution times. Available counters can be listed with rocprof --list-basic and rocprof --list-derived.
Example with metrics:
rocprof Metrics
$ echo "pmc: SQ_WAVES, VALUUtilization" > metrics.txt
$ rocprof -i metrics.txt ./a.out
Occupancy Calculation¶
CUDA¶
Occupancy is the ratio of active warps to the maximum number of warps supported per multiprocessor. The CUDA runtime provides cudaOccupancyMaxActiveBlocksPerMultiprocessor to compute the number of active blocks for a given kernel and block size, from which the occupancy follows.
Example:
CUDA Occupancy
int numBlocks, device;
int blockSize = 256;  // example block size; MyKernel is a __global__ kernel defined elsewhere
cudaDeviceProp prop;
cudaGetDevice(&device);
cudaGetDeviceProperties(&prop, device);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks, MyKernel, blockSize, 0);
int activeWarps = numBlocks * blockSize / prop.warpSize;
int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;
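The runtime can also suggest a block size that maximizes occupancy via cudaOccupancyMaxPotentialBlockSize. A sketch, assuming a trivial MyKernel for illustration:

```cuda
#include <cuda_runtime.h>
#include <iostream>

__global__ void MyKernel(int *data) { /* ... */ }

int main() {
    int minGridSize = 0;  // minimum grid size to reach full device occupancy
    int blockSize   = 0;  // suggested block size
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, 0);
    std::cout << "Suggested block size: " << blockSize
              << " (min grid size " << minGridSize << ")" << std::endl;
    return 0;
}
```

HIP offers the analogous hipOccupancyMaxPotentialBlockSize with the same argument order.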
HIP¶
HIP provides the same mechanism through hipOccupancyMaxActiveBlocksPerMultiprocessor and hipGetDeviceProperties. Note that on AMD GPUs prop.warpSize reports the wavefront size (64 on CDNA architectures, 32 on RDNA), so the formula below carries over unchanged.
Example:
HIP Occupancy
int numBlocks, device;
int blockSize = 256;  // example block size; MyKernel is a __global__ kernel defined elsewhere
hipDeviceProp_t prop;
hipGetDevice(&device);
hipGetDeviceProperties(&prop, device);
hipOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks, MyKernel, blockSize, 0);
int activeWarps = numBlocks * blockSize / prop.warpSize;
int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;
Notes¶
- CUDA and HIP profiling APIs are conceptually similar but use their respective libraries and tools.
- Refer to the CUDA Documentation and ROCm Documentation for the latest updates.