Introduction to CUDA and HIP: Thread Organization, Memory Model, Warps, and Wavefronts

CUDA (for NVIDIA GPUs) and HIP (for AMD GPUs and cross-platform portability) offer flexible programming models that expose parallelism by organizing work into grids, blocks, and individual threads. Understanding the thread hierarchy and memory models of these frameworks is essential for writing high-performance GPU code.

Thread Organization in CUDA and HIP

CUDA and HIP structure threads hierarchically in grids, blocks, and individual threads, which can be organized in 1D, 2D, or 3D arrays. This organization enables flexible parallelism across various types of computations.

  • CUDA vs. HIP Syntax: While CUDA is specific to NVIDIA GPUs, HIP provides portability across NVIDIA and AMD GPUs with similar syntax and functionality.

Example Kernel Launch:

  • CUDA: VecAddition<<<1, N>>>(A, B, C);
  • HIP: hipLaunchKernelGGL(VecAddition, dim3(1), dim3(N), 0, 0, A, B, C);

Both CUDA and HIP provide built-in variables to manage grid and block dimensions:

  • gridDim.x/y/z: Number of blocks in each dimension within the grid.
  • blockIdx.x/y/z: Index of the current block within the grid.
  • blockDim.x/y/z: Number of threads in each dimension within the block.
  • threadIdx.x/y/z: Index of the current thread within the block.
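
For reference, a minimal sketch of the VecAddition kernel that the launches shown above assume could look like the following (the float element type, the device pointers d_A/d_B/d_C, and the single-block launch over N elements are assumptions for illustration):

__global__ void VecAddition(const float* A, const float* B, float* C)
{
  // With a single block of N threads, threadIdx.x is the element index
  int i = threadIdx.x;
  C[i] = A[i] + B[i];
}

// CUDA launch:  VecAddition<<<1, N>>>(d_A, d_B, d_C);
// HIP launch:   hipLaunchKernelGGL(VecAddition, dim3(1), dim3(N), 0, 0, d_A, d_B, d_C);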

Executing HIP and CUDA Code: Required Libraries

In GPU programming, the correct libraries are essential to compile and run code on specific hardware. Here are the basic libraries to include for HIP and CUDA programming.

For HIP Code Execution (AMD or NVIDIA GPUs)

To execute HIP code, include the following libraries:

#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>

These libraries provide the functions and types required for HIP programming, enabling support for both AMD and NVIDIA GPUs when using HIP as an abstraction layer.

For CUDA Code Execution (NVIDIA GPUs)

To execute CUDA code, include the following libraries:

#include <cuda_runtime.h>
#include <cuda_runtime_api.h>

These libraries enable core CUDA functionality, supporting GPU programming specifically for NVIDIA devices.
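
Assuming the source files are named program.cu (CUDA) and program.cpp (HIP) — the file names are placeholders — typical compile commands are:

nvcc -o program program.cu     (NVIDIA's CUDA compiler)
hipcc -o program program.cpp   (HIP compiler, targeting AMD GPUs or NVIDIA GPUs via the CUDA backend)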

Memory Model in CUDA and HIP

CUDA and HIP share a similar memory hierarchy, optimized for different access patterns and lifetimes:

Memory     Location   Cached   Device Access   Scope                  Lifetime
--------   --------   ------   -------------   --------------------   -----------
Register   On-chip    N/A      R/W             One thread             Thread
Local      DRAM       Yes      R/W             One thread             Thread
Shared     On-chip    N/A      R/W             All threads in block   Block
Global     DRAM       *        R/W             All threads + host     Application
Constant   DRAM       Yes      R                All threads + host     Application
Texture    DRAM       Yes      R                All threads + host     Application
Note (*): On recent compute capabilities, global memory accesses are cached in L2 by default.
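
As a brief illustration of these scopes, the sketch below stages data in a __shared__ buffer (visible to all threads of a block) before reading it back; the kernel name, the fixed tile size of 256, and the assumption that blockDim.x is at most 256 are illustrative choices:

__global__ void reverseInBlock(const float* in, float* out)
{
  __shared__ float tile[256];                     // shared memory: block scope, block lifetime

  int i = blockIdx.x * blockDim.x + threadIdx.x;  // i lives in a register: thread scope
  tile[threadIdx.x] = in[i];                      // global memory -> shared memory
  __syncthreads();                                // wait until all threads in the block have written

  out[i] = tile[blockDim.x - 1 - threadIdx.x];    // reverse the elements within each block
}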


Warps in CUDA and Wavefronts in HIP

What are Warps and Wavefronts?

In CUDA, warps are groups of 32 threads executed in lock-step. In HIP on AMD GPUs, the equivalent unit is the wavefront, which is 64 threads wide on GCN/CDNA architectures (RDNA GPUs default to 32-wide wavefronts). Both concepts represent groups of threads scheduled to execute the same instruction simultaneously.

Key Differences

Feature                  CUDA Warps          HIP Wavefronts
----------------------   ----------------    ----------------
Threads per group        32 threads          64 threads
Divergence granularity   32-thread groups    64-thread groups
Memory coalescing        32-thread access    64-thread access
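
Rather than hard-coding 32 or 64, the group width can be queried at runtime. A minimal sketch using the CUDA runtime is shown below; the HIP calls hipGetDeviceProperties and hipDeviceProp_t are direct analogues, and the built-in device-side variable warpSize reports the same value inside kernels:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);                           // properties of device 0
  printf("Warp/wavefront size: %d threads\n", prop.warpSize);
  return 0;
}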

Warp and Wavefront Divergence

Divergence occurs when threads within a warp or wavefront follow different execution paths (e.g., branching with if statements). This divergence causes threads to execute serially, reducing efficiency. Optimizing code to minimize divergence within warps and wavefronts improves performance.
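
As a hedged illustration (both kernels are hypothetical), the first kernel below diverges because even and odd lanes of the same warp take different branches, while the second branches on the warp/wavefront index so that all threads in a group follow the same path:

__global__ void divergentBranch(float* data)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i % 2 == 0)                  // neighbouring lanes of one warp take different paths
    data[i] *= 2.0f;
  else
    data[i] += 1.0f;
}

__global__ void warpUniformBranch(float* data)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if ((i / warpSize) % 2 == 0)     // every thread of a warp/wavefront takes the same path
    data[i] *= 2.0f;
  else
    data[i] += 1.0f;
}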


Practical Example: Thread Layout in CUDA and HIP

Here’s a sample CUDA/HIP program demonstrating thread organization. This example organizes threads in a 2D block and computes a 1D global index for each thread.

#include <stdio.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
//#include <hip/hip_runtime.h>
//#include <hip/hip_runtime_api.h>

__global__ void helloGPU()
{
  // Flatten the 2D thread index within the block into a single 1D index
  int i = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
  printf("Hello from thread %d\n", i);
  __syncthreads();
}

int main()
{
  dim3 comp_G(1, 1, 1);    // grid: one block
  dim3 comp_B(8, 8, 1);    // block: 8 x 8 = 64 threads

  helloGPU<<<comp_G, comp_B>>>();   // CUDA (hipcc also accepts this triple-chevron syntax)
  cudaDeviceSynchronize();          // CUDA: wait for the kernel and flush printf output
  cudaDeviceReset();                // CUDA

  // hipLaunchKernelGGL(helloGPU, comp_G, comp_B, 0, 0);  // HIP
  // hipDeviceSynchronize();                              // HIP
  // hipDeviceReset();                                    // HIP

  return 0;
}

Best Practices for CUDA and HIP

  1. Choosing Thread Configurations: Multi-dimensional grids and blocks in both CUDA and HIP offer flexibility in parallelizing computations. Choose configurations based on the data structure of the application.
  2. Optimizing Occupancy: High occupancy improves performance but must be balanced against resource use per thread. CUDA’s Occupancy Calculator provides helpful recommendations, and HIP offers similar tools for AMD GPUs.
  3. Minimizing Divergence: Structure code to reduce divergence within warps (32 threads in CUDA) or wavefronts (64 threads in HIP), avoiding excessive branching within these groups.
  4. Memory Access Patterns: Align memory accesses with the warp/wavefront size to improve memory coalescing and reduce latency. Optimized access patterns enhance data throughput.
  5. Avoiding Bank Conflicts: Shared memory accesses within a warp or wavefront can lead to bank conflicts, reducing performance. Proper data layout within shared memory can mitigate this.
  6. Error Handling: Always check for errors after kernel launches and API calls to catch issues early in both CUDA and HIP (a minimal error-checking sketch follows this list).
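
A minimal error-checking sketch for the last point, using the CUDA runtime API (the HIP functions hipGetLastError and hipGetErrorString are direct analogues; the macro name is illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

#define CHECK(call)                                              \
  do {                                                           \
    cudaError_t err = (call);                                    \
    if (err != cudaSuccess) {                                    \
      fprintf(stderr, "GPU error %s at %s:%d\n",                 \
              cudaGetErrorString(err), __FILE__, __LINE__);      \
    }                                                            \
  } while (0)

// Usage after a kernel launch:
//   myKernel<<<grid, block>>>(...);
//   CHECK(cudaGetLastError());        // errors from the launch itself
//   CHECK(cudaDeviceSynchronize());   // errors raised during kernel execution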

Computing Global Thread Indexes

Both CUDA and HIP threads can be structured as 1D, 2D, or 3D arrays. However, in some applications it is useful to map these threads onto a single linear index. The examples below show various conversions of CUDA thread organizations into a single-dimensional index.

// 1D grid of 1D blocks
__device__ int getGlobalIdx_1D_1D()
{
  return blockIdx.x * blockDim.x + threadIdx.x;
}

// 1D grid of 2D blocks
__device__ int getGlobalIdx_1D_2D()
{
  return blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
}

// 1D grid of 3D blocks
__device__ int getGlobalIdx_1D_3D()
{
  return blockIdx.x * blockDim.x * blockDim.y * blockDim.z 
         + threadIdx.z * blockDim.y * blockDim.x 
         + threadIdx.y * blockDim.x + threadIdx.x;
}

// 2D grid of 1D blocks
__device__ int getGlobalIdx_2D_1D()
{
  int blockId = blockIdx.y * gridDim.x + blockIdx.x;
  return blockId * blockDim.x + threadIdx.x;
}

// 2D grid of 2D blocks
__device__ int getGlobalIdx_2D_2D()
{
  int blockId = blockIdx.x + blockIdx.y * gridDim.x;
  return blockId * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x;
}

// 3D grid of 3D blocks
__device__ int getGlobalIdx_3D_3D()
{
  int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
  return blockId * (blockDim.x * blockDim.y * blockDim.z) 
         + threadIdx.z * (blockDim.x * blockDim.y) 
         + threadIdx.y * blockDim.x + threadIdx.x;
}
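
As a short usage sketch (the kernel name, output array, and launch dimensions are illustrative), a kernel can call one of these helpers to obtain a unique linear index for each thread:

__global__ void fillWithGlobalIndex(int* out)
{
  int gid = getGlobalIdx_2D_2D();   // unique 1D index across the whole 2D grid of 2D blocks
  out[gid] = gid;
}

// Example launch: a 4 x 4 grid of 8 x 8 blocks covers 4*4*8*8 = 1024 threads:
//   fillWithGlobalIndex<<<dim3(4, 4, 1), dim3(8, 8, 1)>>>(d_out);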


Example Program: Printing Thread Layout

The following CUDA-only program demonstrates the same idea by printing thread layout information from a kernel function. The threads are organized in a 2D block, and the kernel computes a single-dimensional index for each thread.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void helloCUDA()
{
  // Converting 2D thread structure into a 1D thread structure 
  int i = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
  printf("Hello from thread %d\n", i);
  __syncthreads();       
}

int main()
{
  // Thread organization (2D block)
  dim3 comp_G(1, 1, 1);    
  dim3 comp_B(8, 8, 1); 

  helloCUDA<<<comp_G, comp_B>>>();

  cudaDeviceSynchronize();   // wait for the kernel and flush printf output
  cudaDeviceReset();

  return 0;
}

Summary

CUDA and HIP provide powerful tools for parallel programming on NVIDIA and AMD GPUs, with HIP also able to target NVIDIA hardware. Understanding thread organization, memory models, and the distinction between warps and wavefronts helps optimize performance across these platforms. By applying best practices such as minimizing divergence and optimizing occupancy, developers can write high-performance, portable GPU applications.