Introduction to CUDA and HIP: Thread Organization, Memory Model, Warps, and Wavefronts¶
CUDA (for NVIDIA GPUs) and HIP (for AMD and cross-platform compatibility) offer flexible models for GPU programming that allow parallel processing by organizing threads into grids, blocks, and individual threads. Understanding the thread structure and memory models of these frameworks is essential for writing high-performance GPU code.
Thread Organization in CUDA and HIP¶
CUDA and HIP structure threads hierarchically in grids, blocks, and individual threads, which can be organized in 1D, 2D, or 3D arrays. This organization enables flexible parallelism across various types of computations.
- CUDA vs. HIP Syntax: While CUDA is specific to NVIDIA GPUs, HIP provides portability across NVIDIA and AMD GPUs with similar syntax and functionality.
Example Kernel Launch:
- CUDA:
VecAddition<<<1, N>>>(A, B, C);
- HIP:
hipLaunchKernelGGL(VecAddition, dim3(1), dim3(N), 0, 0, A, B, C);
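For reference, the kernel being launched might be defined as follows; the pointer types and the use of threadIdx.x as the element index are assumptions, since the original only shows the launch:
// Hypothetical definition of the VecAddition kernel used in the launches above.
// With a single block of N threads, each thread handles one element.
__global__ void VecAddition(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;     // element index within the single block
    C[i] = A[i] + B[i];
}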
Both CUDA and HIP provide built-in variables to manage grid and block dimensions:
- gridDim.x/y/z: Number of blocks in each dimension within the grid.
- blockIdx.x/y/z: Index of the current block within the grid.
- blockDim.x/y/z: Number of threads in each dimension within the block.
- threadIdx.x/y/z: Index of the current thread within the block.
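For illustration, here is a short sketch (hypothetical kernel name) that combines these built-in variables into a global 1D index, with a bounds check for the last, possibly partially filled block:
// Hypothetical sketch: each thread computes its global 1D index from the
// built-in variables and processes one element of a length-n array.
__global__ void scaleArray(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // block offset + thread offset
    if (i < n)                                      // the last block may have idle threads
        data[i] *= 2.0f;
}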
Executing HIP and CUDA Code: Required Headers¶
In GPU programming, including the correct runtime headers is essential to compile and run code on specific hardware. The basic headers to include for HIP and CUDA programming are listed below.
For HIP Code Execution (AMD or NVIDIA GPUs)¶
To compile and run HIP code, include the following headers:
#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
For CUDA Code Execution (NVIDIA GPUs)¶
To compile and run CUDA code, include the following headers:
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
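When a single source file should build with either toolchain, one possible approach (an illustrative sketch, not from the original) is to select the headers with the __HIPCC__ macro, which the HIP compiler defines:
// Illustrative sketch: pick the runtime headers based on the compiler in use.
// __HIPCC__ is defined when compiling with hipcc; otherwise fall back to CUDA.
#ifdef __HIPCC__
#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
#else
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#endif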
Memory Model in CUDA and HIP¶
CUDA and HIP share a similar memory hierarchy, optimized for different access patterns and lifetimes:
Memory | Location | Cached | Device Access | Scope | Lifetime |
---|---|---|---|---|---|
Register | On-chip | N/A | R/W | One thread | Thread |
Local | DRAM | Yes | R/W | One thread | Thread |
Shared | On-chip | N/A | R/W | All threads in block | Block |
Global | DRAM | * | R/W | All threads + host | Application |
Constant | DRAM | Yes | R | All threads + host | Application |
Texture | DRAM | Yes | R | All threads + host | Application |
Note: On recent compute capabilities, global memory accesses are cached in L2 by default (marked * in the table).
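As a rough illustration of where these memory spaces appear in device code, consider the hypothetical sketch below (the host would set coeff with cudaMemcpyToSymbol or hipMemcpyToSymbol; the block size of 256 is an assumption to match the tile):
__constant__ float coeff;                            // constant memory: read-only on the device
__global__ void scaleThroughShared(const float* in, float* out, int n)
{
    __shared__ float tile[256];                      // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in a register (per thread)
    if (i < n)
        tile[threadIdx.x] = in[i];                   // read from global memory into shared memory
    __syncthreads();                                 // make the tile visible to the whole block
    if (i < n)
        out[i] = tile[threadIdx.x] * coeff;          // write result back to global memory
}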
Warps in CUDA and Wavefronts in HIP¶
What are Warps and Wavefronts?¶
In CUDA, warps are groups of 32 threads executed in lock-step, while in HIP (primarily for AMD GPUs), wavefronts consist of 64 threads. Both concepts represent groups of threads scheduled to execute the same instruction simultaneously.
Key Differences¶
Feature | CUDA Warps | HIP Wavefronts |
---|---|---|
Threads per Group | 32 threads | 64 threads |
Divergence Granularity | 32-thread groups | 64-thread groups |
Memory Coalescing | 32-thread access | 64-thread access |
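Rather than hard-coding 32 or 64, the group width can be queried at runtime. A minimal CUDA sketch is shown below; the HIP analogue uses hipDeviceProp_t and hipGetDeviceProperties:
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // properties of device 0
    printf("Warp/wavefront size: %d threads\n", prop.warpSize);
    return 0;
}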
Warp and Wavefront Divergence¶
Divergence occurs when threads within a warp or wavefront follow different execution paths (e.g., branching with if statements). This divergence causes threads to execute serially, reducing efficiency. Optimizing code to minimize divergence within warps and wavefronts improves performance.
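The sketch below (hypothetical kernels) contrasts a branch that diverges within a warp/wavefront with one that stays uniform inside each group:
// Divergent: even and odd lanes of the same warp/wavefront take different
// branches, so the two paths are executed one after the other.
__global__ void divergentBranch(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

// Uniform: the condition depends only on blockIdx.x, so every thread in a
// given warp/wavefront follows the same path and no serialization occurs.
__global__ void uniformBranch(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}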
Practical Example: Thread Layout in CUDA and HIP¶
Here’s a sample CUDA/HIP program demonstrating thread organization. This example organizes threads in a 2D block and computes a 1D global index for each thread.
#include <stdio.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
//#include <hip/hip_runtime.h>
//#include <hip/hip_runtime_api.h>
__global__ void helloGPU()
{
    // Flatten the 2D thread index within the block into a single 1D index
    int i = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
    printf("Hello from thread %d\n", i);
    __syncthreads();
}
int main()
{
    dim3 comp_G(1, 1, 1);   // grid: a single block
    dim3 comp_B(8, 8, 1);   // block: 8 x 8 = 64 threads
    helloGPU<<<comp_G, comp_B>>>(); // CUDA
    cudaDeviceReset();              // CUDA: also flushes the device printf buffer
    // helloGPU<<<comp_G, comp_B>>>(); // HIP (hipcc also supports the <<< >>> syntax)
    // hipDeviceReset();               // HIP
    return 0;
}
Best Practices for CUDA and HIP¶
- Choosing Thread Configurations: Multi-dimensional grids and blocks in both CUDA and HIP offer flexibility in parallelizing computations. Choose configurations based on the data structure of the application.
- Optimizing Occupancy: High occupancy improves performance but must be balanced against resource use per thread. CUDA's Occupancy Calculator provides helpful recommendations, and HIP offers similar resources for AMD GPUs.
- Minimizing Divergence: Structure code to reduce divergence within warps (32 threads in CUDA) or wavefronts (64 threads in HIP), avoiding excessive branching within these groups.
- Memory Access Patterns: Align memory accesses with the warp/wavefront size to improve memory coalescing and reduce latency. Optimized access patterns enhance data throughput.
- Avoiding Bank Conflicts: Shared memory accesses within warps or wavefronts can lead to bank conflicts, reducing performance. Proper data alignment within shared memory can mitigate this.
- Error Handling: Always check for errors after kernel launches and API calls to catch issues early in both CUDA and HIP; see the sketch after this list.
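Here is a minimal error-checking sketch in CUDA (the kernel name and launch size are placeholders; HIP mirrors this with hipGetLastError, hipDeviceSynchronize, and hipGetErrorString):
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummyKernel() { }

int main()
{
    dummyKernel<<<1, 32>>>();
    cudaError_t err = cudaGetLastError();       // catches launch-configuration errors
    if (err != cudaSuccess)
        printf("Launch error: %s\n", cudaGetErrorString(err));
    err = cudaDeviceSynchronize();              // catches errors raised during execution
    if (err != cudaSuccess)
        printf("Execution error: %s\n", cudaGetErrorString(err));
    return 0;
}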
Computing Global Thread Indexes¶
Both CUDA and HIP threads can be structured as 1D, 2D, or 3D arrays. However, in some applications it's useful to map these threads onto a single flat index for easy addressing. The examples below show various conversions of CUDA thread organizations into a single-dimensional index.
// 1D grid of 1D blocks
__device__ int getGlobalIdx_1D_1D()
{
return blockIdx.x * blockDim.x + threadIdx.x;
}
// 1D grid of 2D blocks
__device__ int getGlobalIdx_1D_2D()
{
return blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
}
// 1D grid of 3D blocks
__device__ int getGlobalIdx_1D_3D()
{
return blockIdx.x * blockDim.x * blockDim.y * blockDim.z
+ threadIdx.z * blockDim.y * blockDim.x
+ threadIdx.y * blockDim.x + threadIdx.x;
}
// 2D grid of 1D blocks
__device__ int getGlobalIdx_2D_1D()
{
int blockId = blockIdx.y * gridDim.x + blockIdx.x;
return blockId * blockDim.x + threadIdx.x;
}
// 2D grid of 2D blocks
__device__ int getGlobalIdx_2D_2D()
{
int blockId = blockIdx.x + blockIdx.y * gridDim.x;
return blockId * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x;
}
// 3D grid of 3D blocks
__device__ int getGlobalIdx_3D_3D()
{
int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
return blockId * (blockDim.x * blockDim.y * blockDim.z)
+ threadIdx.z * (blockDim.x * blockDim.y)
+ threadIdx.y * blockDim.x + threadIdx.x;
}
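A short usage sketch (hypothetical kernel and launch) shows one of these helpers replacing inline index arithmetic:
// Each thread writes its own global index; with a 2x2 grid of 8x8 blocks
// the indices cover 0..255, so out must hold at least 256 ints.
__global__ void fillWithIndex(int* out)
{
    int idx = getGlobalIdx_2D_2D();
    out[idx] = idx;
}
// Example launch: fillWithIndex<<<dim3(2, 2), dim3(8, 8)>>>(d_out);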
Example Program: Printing Thread Layout¶
The following program demonstrates the concept by printing thread layout information from a CUDA kernel function. The threads are organized in a 2D block, and the kernel function calculates a single-dimensional index for each thread.
#include <stdio.h>
__global__ void helloCUDA()
{
// Converting 2D thread structure into a 1D thread structure
int i = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
printf("Hello from thread %d\n", i);
__syncthreads();
}
int main()
{
// Thread organization (2D block)
dim3 comp_G(1, 1, 1);
dim3 comp_B(8, 8, 1);
helloCUDA<<<comp_G, comp_B>>>();
cudaDeviceReset();
return 0;
}
Summary¶
CUDA and HIP provide powerful tools for parallel programming on NVIDIA and AMD GPUs, respectively. Understanding thread organization, memory models, and the distinction between warps and wavefronts helps to optimize performance across these platforms. By applying best practices like minimizing divergence and optimizing occupancy, developers can write high-performance, portable GPU applications.