Skip to content

Profiling Options for CUDA and HIP

Script
  • "Profiling is a crucial process for optimizing CUDA and HIP applications. It helps developers identify bottlenecks such as compute or memory limitations and ensures efficient utilization of computational resources like CPUs, GPUs, and memory bandwidth. Key areas include measuring execution time, analyzing system-wide performance, and calculating occupancy for GPU optimization."
  • "Both CUDA and HIP use event-based APIs for measuring execution time. In CUDA, you create events with cudaEventCreate, record them with cudaEventRecord, and measure elapsed time using cudaEventElapsedTime. HIP offers a similar workflow with its hipEvent_t API. These tools allow precise measurement of kernel execution times."
  • "CUDA provides powerful tools like Nsight Compute for kernel-level profiling and Nsight Systems for system-wide analysis of CPU-GPU interactions. Similarly, HIP offers rocprof for collecting traces and performance metrics and rocminfo for system configuration details. These tools are essential for optimizing applications at both kernel and resource levels."

Time Measurement

CUDA

In CUDA, the execution time can be measured by using the CUDA events. CUDA API events are created using cudaEvent_t, for example, cudaEvent_t start, stop;. Events are initialized by cudaEventCreate, and execution time can be recorded with cudaEventRecord. Timing is measured using cudaEventElapsedTime.

CUDA API
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

Then timings are measured:

CUDA API
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float time;
cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

Example:

Example
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);

// Device function call 
matrix_mul<<<Grid_dim, Block_dim>>>(d_a, d_b, d_c, N);

cudaEventRecord(stop);
cudaEventSynchronize(stop);
float time;
cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

std::cout << "Time taken for the GPU kernel: " << time << " ms" << std::endl;

HIP

In ROCm, execution time can be measured using HIP events. HIP API events are created with hipEvent_t, for example, hipEvent_t start, stop;. Events are initialized using hipEventCreate and recorded with hipEventRecord. Timing is calculated using hipEventElapsedTime.

HIP API
hipEvent_t start, stop;
hipEventCreate(&start);
hipEventCreate(&stop);
hipEventRecord(start, 0);

Then timings are measured:

HIP API
hipEventRecord(stop);
hipEventSynchronize(stop);
float time;
hipEventElapsedTime(&time, start, stop);
hipEventDestroy(start);
hipEventDestroy(stop);

Example:

Example
hipEvent_t start, stop;
hipEventCreate(&start);
hipEventCreate(&stop);
hipEventRecord(start);

// Device function call
hipLaunchKernelGGL(matrix_mul, Grid_dim, Block_dim, 0, 0, d_a, d_b, d_c, N);

hipEventRecord(stop);
hipEventSynchronize(stop);
float time;
hipEventElapsedTime(&time, start, stop);
hipEventDestroy(start);
hipEventDestroy(stop);

std::cout << "Time taken for the GPU kernel: " << time << " ms" << std::endl;

System-Wide Performance Analysis

CUDA

NVIDIA provides tools for CUDA profiling:

  • Nsight Compute: A kernel profiler with CLI and GUI support.
  • Nsight Graphics: A graphics application debugger and profiler.
  • Nsight Systems: A system-wide performance analysis tool.

Examples and usage instructions are detailed in NVIDIA documentation.


HIP

ROCm provides profiling tools for HIP applications:

  • rocprof: A profiler to collect traces, analyze kernel calls, and compute performance metrics.

Example usage:

rocprof CLI
$ rocprof --hip-trace ./a.out

Output provides detailed execution traces, kernel timings, and system-wide performance metrics.

  • rocminfo: Displays device and system configuration information.
  • rocprof Metrics: Use rocprof --metrics to gather performance metrics like memory throughput, kernel execution times, and utilization.

Example with metrics:

rocprof Metrics
$ rocprof --metrics gpu-clock,kernel-time ./a.out

Occupancy Calculation

CUDA

The CUDA Occupancy Calculator computes the ratio of active warps to the maximum warps supported per multiprocessor.

Example:

CUDA Occupancy
int numBlocks, device;
cudaDeviceProp prop;
cudaGetDevice(&device);
cudaGetDeviceProperties(&prop, device);

cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks, MyKernel, blockSize, 0);

int activeWarps = numBlocks * blockSize / prop.warpSize;
int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;

HIP

HIP uses a similar mechanism to calculate occupancy. Instead of CUDA APIs, you use ROCm-specific APIs.

Example:

HIP Occupancy
int numBlocks, device;
hipDeviceProp_t prop;
hipGetDevice(&device);
hipGetDeviceProperties(&prop, device);

hipOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks, MyKernel, blockSize, 0);

int activeWarps = numBlocks * blockSize / prop.warpSize;
int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;

Notes