
Introduction to CUDA and HIP: Thread Organization, Memory Model, Warps, and Wavefronts

Script
  • "In this session, we’ll explore essential concepts in CUDA and HIP for GPU programming. We’ll focus on how threads are organized, the hierarchical memory model, and the execution groups—warps in CUDA and wavefronts in HIP—used by NVIDIA and AMD GPUs."
  • "CUDA and HIP organize threads hierarchically into grids, blocks, and threads. This hierarchy allows computations to be distributed efficiently across GPU cores. Threads can be structured in 1D, 2D, or 3D arrays, providing flexibility to parallelize diverse workloads, such as image processing or scientific simulations."
  • "In CUDA, kernels are launched using the syntax <<>>, while HIP uses the more general hipLaunchKernelGGL function. Both frameworks provide built-in variables, such as gridDim, blockIdx, blockDim, and threadIdx, which dynamically manage thread and block dimensions during execution."
  • "CUDA and HIP support a hierarchical memory model optimized for specific access patterns. Registers are private to each thread, ensuring fast access for frequently used data. Shared memory allows threads within a block to share data efficiently, while global memory provides access across all threads and the host. Constant memory is optimized for read-only data shared across threads. Understanding this hierarchy is critical to optimizing GPU performance."
  • "In CUDA, threads are grouped into warps of 32 threads, while in HIP, wavefronts consist of 64 threads. These groups execute in lock-step, meaning that divergence within a warp or wavefront—such as conditional branching—can significantly reduce performance. It’s essential to write code that minimizes divergence and aligns memory access for maximum efficiency."
  • "This example demonstrates how threads are organized in 2D blocks. Each thread calculates a unique 1D index using the block and thread dimensions. This conversion is especially useful when mapping higher-dimensional data, like matrices, into linear memory for efficient computation on the GPU."
  • "To optimize CUDA and HIP applications, start by carefully selecting thread configurations to maximize GPU occupancy. Minimize divergence within warps or wavefronts to ensure consistent execution. Optimize memory access by aligning data with warp sizes for coalesced access, and avoid bank conflicts in shared memory. Lastly, always implement robust error handling to catch and debug kernel failures."
  • "Computing global thread indexes is a critical step when working with multi-dimensional grids. This example shows how to calculate a thread’s global index in a 2D grid, making it easier to map higher-dimensional data structures into linear memory. This is particularly useful in applications like matrix multiplication or image processing."
  • "Printing thread indexes within a kernel, as shown here, is a helpful debugging technique. By examining the output, you can verify thread organization, ensure that indexing calculations are correct, and confirm that threads are accessing the intended data."
  • "To summarize, CUDA and HIP provide flexible models for GPU programming, leveraging hierarchical thread organization and memory models for efficient parallel computation. Key optimization strategies include minimizing warp and wavefront divergence, aligning memory access patterns for coalescing, and selecting configurations that maximize GPU occupancy. By following these best practices, you can write high-performance and portable GPU applications."

CUDA (for NVIDIA GPUs) and HIP (for AMD and cross-platform compatibility) offer flexible models for GPU programming that allow parallel processing by organizing threads into grids, blocks, and individual threads. Understanding the thread structure and memory models of these frameworks is essential for writing high-performance GPU code.

Thread Organization in CUDA and HIP

CUDA and HIP structure threads hierarchically in grids, blocks, and individual threads, which can be organized in 1D, 2D, or 3D arrays. This organization enables flexible parallelism across various types of computations.

  • CUDA vs. HIP Syntax: While CUDA is specific to NVIDIA GPUs, HIP provides portability across NVIDIA and AMD GPUs with similar syntax and functionality.

Example Kernel Launch:

  • CUDA: VecAddition<<<1, N>>>(A, B, C);
  • HIP: hipLaunchKernelGGL(VecAddition, dim3(1), dim3(N), 0, 0, A, B, C);
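
Both launches assume a vector-addition kernel along the lines of the minimal sketch below; the kernel body and the pointer types are illustrative assumptions, since the original example only shows the launch line.

// Minimal sketch of the VecAddition kernel assumed by the launches above.
// Launched with one block of N threads, so threadIdx.x indexes the element.
__global__ void VecAddition(const float *A, const float *B, float *C)
{
  int i = threadIdx.x;
  C[i] = A[i] + B[i];
}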

Both CUDA and HIP provide built-in variables to manage grid and block dimensions (see the sketch after this list):

  • gridDim.x/y/z: Number of blocks in each dimension within the grid.
  • blockIdx.x/y/z: Index of the current block within the grid.
  • blockDim.x/y/z: Number of threads in each dimension within the block.
  • threadIdx.x/y/z: Index of the current thread within the block.
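
A common pattern that uses all four variables is the grid-stride loop, sketched below. The kernel name scaleArray and its arguments are illustrative, not part of the original material.

// Grid-stride loop: each thread starts at its global index and advances by the
// total number of threads in the grid, so any array length n is covered.
__global__ void scaleArray(float *data, int n, float factor)
{
  int start  = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global index
  int stride = gridDim.x  * blockDim.x;                 // total threads launched in x
  for (int i = start; i < n; i += stride)
      data[i] *= factor;
}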

Executing HIP and CUDA Code: Required Libraries

In GPU programming, including the correct runtime headers is essential to compile and run code on specific hardware. Here are the basic headers to include for HIP and CUDA programming.

For HIP Code Execution (AMD or NVIDIA GPUs)

To execute HIP code, include the following header files:

#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>

These headers provide the functions and types required for HIP programming, enabling support for both AMD and NVIDIA GPUs when HIP is used as an abstraction layer.

For CUDA Code Execution (NVIDIA GPUs)

To execute CUDA code, include the following header files:

#include <cuda_runtime.h>
#include <cuda_runtime_api.h>

These headers expose the core CUDA runtime API, supporting GPU programming specifically for NVIDIA devices.

Memory Model in CUDA and HIP

CUDA and HIP share a similar memory hierarchy, optimized for different access patterns and lifetimes:

Memory   | Location | Cached | Device Access | Scope                | Lifetime
---------|----------|--------|---------------|----------------------|------------
Register | On-chip  | N/A    | R/W           | One thread           | Thread
Local    | DRAM     | Yes    | R/W           | One thread           | Thread
Shared   | On-chip  | N/A    | R/W           | All threads in block | Block
Global   | DRAM     | *      | R/W           | All threads + host   | Application
Constant | DRAM     | Yes    | R             | All threads + host   | Application
Texture  | DRAM     | Yes    | R             | All threads + host   | Application

Note: The latest compute capabilities cache global memory in L2 by default.
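
As a rough illustration of these scopes, the sketch below reverses each 256-thread block of a float array; the kernel name reverseBlock, the fixed block size, and the scale symbol are assumptions made for this example.

__constant__ float scale;   // constant memory: read-only in kernels, set from the host
                            // (e.g., with cudaMemcpyToSymbol or hipMemcpyToSymbol)

// Assumes blockDim.x == 256 and an array length that is a multiple of 256.
__global__ void reverseBlock(const float *in, float *out)
{
  __shared__ float tile[256];                        // shared memory: one tile per block
  int gid = blockIdx.x * blockDim.x + threadIdx.x;   // gid lives in a register (per-thread)
  tile[threadIdx.x] = in[gid];                       // global -> shared
  __syncthreads();                                   // wait until the whole tile is loaded
  out[gid] = tile[blockDim.x - 1 - threadIdx.x] * scale;   // shared -> global, scaled
}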


Warps in CUDA and Wavefronts in HIP

What are Warps and Wavefronts?

In CUDA, warps are groups of 32 threads executed in lock-step, while in HIP (primarily for AMD GPUs), wavefronts consist of 64 threads. Both concepts represent groups of threads scheduled to execute the same instruction simultaneously.
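
If you need the group size at run time, both runtimes report it through the device properties. A minimal CUDA sketch is shown below; HIP offers hipGetDeviceProperties and hipDeviceProp_t with the same warpSize field.

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);                           // properties of device 0
  printf("Threads per warp/wavefront: %d\n", prop.warpSize);   // 32 on NVIDIA, 64 on most AMD GPUs
  return 0;
}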

Key Differences

Feature              | CUDA Warps       | HIP Wavefronts
---------------------|------------------|------------------
Threads per Group    | 32 threads       | 64 threads
Divergence Threshold | 32-thread groups | 64-thread groups
Memory Coalescing    | 32-thread access | 64-thread access

Warp and Wavefront Divergence

Divergence occurs when threads within a warp or wavefront follow different execution paths (e.g., branching with if statements). This divergence causes threads to execute serially, reducing efficiency. Optimizing code to minimize divergence within warps and wavefronts improves performance.
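
As a rough sketch of the difference, compare a branch on the thread index, which splits every warp or wavefront, with a branch that is uniform within each group; the kernel branchExample and its arguments are illustrative.

__global__ void branchExample(const float *a, const float *b, float *data)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes the launch covers the array exactly

  // Divergent: even and odd lanes of the same warp/wavefront take different
  // branches, so the hardware runs the two paths one after the other.
  if (i % 2 == 0)
      data[i] = a[i] + b[i];
  else
      data[i] = a[i] - b[i];

  // Warp-uniform alternative: every thread in a given warp/wavefront evaluates
  // the condition identically (warpSize is a built-in device variable), so
  // neither path is serialized within a group.
  if ((i / warpSize) % 2 == 0)
      data[i] = a[i] + b[i];
  else
      data[i] = a[i] - b[i];
}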


Practical Example: Thread Layout in CUDA and HIP

Here’s a sample CUDA/HIP program demonstrating thread organization. This example organizes threads in a 2D block and computes a 1D global index for each thread.

#include <stdio.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
//#include <hip/hip_runtime.h>
//#include <hip/hip_runtime_api.h>

__global__ void helloGPU()
{
  // Convert the 2D thread position within the block into a single 1D index
  int i = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
  printf("Hello from thread %d\n", i);
  __syncthreads();
}

int main()
{
  dim3 comp_G(1, 1, 1);    // grid: 1 block
  dim3 comp_B(8, 8, 1);    // block: 8 x 8 = 64 threads

  helloGPU<<<comp_G, comp_B>>>();  // CUDA
  cudaDeviceReset();               // CUDA

  // helloGPU<<<comp_G, comp_B>>>();  // HIP
  // hipDeviceReset();                // HIP

  return 0;
}

Best Practices for CUDA and HIP

  1. Choosing Thread Configurations: Multi-dimensional grids and blocks in both CUDA and HIP offer flexibility in parallelizing computations. Choose configurations based on the data structure of the application.

  2. Optimizing Occupancy: High occupancy improves performance but must be balanced with resource use per thread. CUDA’s Occupancy Calculator provides helpful recommendations, and HIP offers similar resources for AMD GPUs.

  3. Minimizing Divergence: Structure code to reduce divergence within warps (32 threads in CUDA) or wavefronts (64 threads in HIP), avoiding excessive branching within these groups.

  4. Memory Access Patterns: Align memory accesses with the warp/wavefront size to improve coalescing and reduce latency. Optimized access patterns increase data throughput.

  5. Avoiding Bank Conflicts: Shared memory accesses within a warp or wavefront can cause bank conflicts, reducing performance. Proper data layout in shared memory mitigates this.

  6. Error Handling: Always check for errors after kernel launches and API calls to catch issues early in both CUDA and HIP (see the error-checking sketch after this list).
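
A minimal error-checking sketch, reusing the helloGPU launch from the example above; the CUDA calls shown have direct HIP counterparts (hipGetLastError, hipDeviceSynchronize, hipGetErrorString).

helloGPU<<<comp_G, comp_B>>>();

cudaError_t err = cudaGetLastError();            // catches launch-configuration errors
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();                   // catches errors raised during execution
if (err != cudaSuccess)
    printf("Kernel execution failed: %s\n", cudaGetErrorString(err));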


Computing Global Thread Indexes

Both CUDA and HIP threads can be structured as 1D, 2D, or 3D arrays. However, in some applications it is useful to flatten these multi-dimensional indexes into a single 1D index for easy addressing. The examples below show various conversions of CUDA/HIP thread organizations into a single-dimensional global index.

// 1D grid of 1D blocks
__device__ int getGlobalIdx_1D_1D()
{
  return blockIdx.x * blockDim.x + threadIdx.x;
}

// 1D grid of 2D blocks
__device__ int getGlobalIdx_1D_2D()
{
  return blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
}

// 1D grid of 3D blocks
__device__ int getGlobalIdx_1D_3D()
{
  return blockIdx.x * blockDim.x * blockDim.y * blockDim.z 
         + threadIdx.z * blockDim.y * blockDim.x 
         + threadIdx.y * blockDim.x + threadIdx.x;
}

// 2D grid of 1D blocks
__device__ int getGlobalIdx_2D_1D()
{
  int blockId = blockIdx.y * gridDim.x + blockIdx.x;
  return blockId * blockDim.x + threadIdx.x;
}

// 2D grid of 2D blocks
__device__ int getGlobalIdx_2D_2D()
{
  int blockId = blockIdx.x + blockIdx.y * gridDim.x;
  return blockId * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x;
}

// 3D grid of 3D blocks
__device__ int getGlobalIdx_3D_3D()
{
  int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
  return blockId * (blockDim.x * blockDim.y * blockDim.z) 
         + threadIdx.z * (blockDim.x * blockDim.y) 
         + threadIdx.y * blockDim.x + threadIdx.x;
}
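
For example, with gridDim = (4, 2), blockDim = (8, 8), blockIdx = (1, 1), and threadIdx = (3, 2), getGlobalIdx_2D_2D() gives blockId = 1 + 1 * 4 = 5 and a global index of 5 * 64 + 2 * 8 + 3 = 339. Inside a kernel, the helpers can be used as in the hypothetical sketch below (fillIndices and out are illustrative names).

// Each thread writes its own flattened global index into the output array.
__global__ void fillIndices(int *out)
{
  int idx = getGlobalIdx_2D_2D();
  out[idx] = idx;
}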

Example Program: Printing Thread Layout

The following program demonstrates the concept by printing thread layout information from a CUDA kernel function. The threads are organized in a 2D block, and the kernel function calculates a single-dimensional index for each thread.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void helloCUDA()
{
  // Converting 2D thread structure into a 1D thread structure 
  int i = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
  printf("Hello from thread %d\n", i);
  __syncthreads();       
}

int main()
{
  // Thread organization (2D block)
  dim3 comp_G(1, 1, 1);    
  dim3 comp_B(8, 8, 1); 

  helloCUDA<<<comp_G, comp_B>>>();

  cudaDeviceReset();

  return 0;
}

Summary

CUDA and HIP provide powerful tools for parallel programming on NVIDIA and AMD GPUs. Understanding thread organization, memory models, and the distinction between warps and wavefronts helps to optimize performance across these platforms. By applying best practices like minimizing divergence and optimizing occupancy, developers can write high-performance, portable GPU applications.