Introduction to Key CUDA and HIP APIs and Qualifiers
In this section, we will explore essential CUDA APIs and qualifiers used to convert C/C++ code into GPU CUDA code. CUDA programming involves utilizing both the CPU (referred to as the "Host") and the GPU (referred to as the "Device") for high-performance parallel computing. We will also introduce HIP (Heterogeneous-compute Interface for Portability), which allows code to run on both NVIDIA and AMD GPUs with minimal modifications, enhancing cross-platform portability.
Key CUDA and HIP Terminology
- Host: Refers to the CPU or the main system where the program is run.
- Device: Refers to the GPU where the parallel processing takes place.
CUDA and HIP provide various function qualifiers, variable qualifiers, and thread qualifiers to manage code execution between the host and device. The following sections provide an overview of these qualifiers and their usage in CUDA and HIP programming.
Function Type Qualifiers in CUDA and HIP
CUDA and HIP provide function type qualifiers to define where a function will execute and where it can be called from. These qualifiers play a crucial role in organizing code execution between the CPU and GPU.
| Qualifier | Description |
|---|---|
| `__device__` | Functions with this qualifier execute only on the device (GPU) and are callable only from other device functions. |
| `__global__` | Functions with this qualifier execute on the device (GPU) but are callable from the host (CPU); they are the kernel functions. |
| `__host__` | Functions with this qualifier execute only on the host (CPU) and are callable only from the host. |
| `__forceinline__` | A directive that forces the compiler to inline a function, optimizing performance for frequently used, small functions. |
| `__noinline__` | A directive that prevents the compiler from inlining a function; useful for debugging or for limiting code bloat in certain parts of the program. |
Function Qualifier Summary
| Qualifier | Executed on | Callable from |
|---|---|---|
| `__device__` | Device | Device |
| `__global__` | Device | Host |
| `__host__` | Host | Host |
| `__forceinline__` / `__noinline__` | Device | Device |
In HIP, the function qualifiers remain similar: `__global__`, `__device__`, and `__host__` work in much the same way as in CUDA, providing consistency across platforms.
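CUDA and HIP also allow `__host__` and `__device__` to be combined on a single function, so the compiler emits both a CPU and a GPU version from the same source. A minimal sketch (the function names here are illustrative, not from any library):

```cpp
#include <cuda_runtime.h>

// Compiled for both host and device: callable from main() on the CPU
// and from kernels on the GPU.
__host__ __device__ float square(float x) { return x * x; }

// __forceinline__ combines with __device__ to request aggressive
// inlining of a small, frequently called device helper.
__device__ __forceinline__ float scaledSquare(float x, float s) {
    return square(x) * s;
}
```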
Example of Function Qualifiers
```cpp
#include <stdio.h>

// Runs on the GPU; callable only from other device code.
__device__ int deviceFunction() {
    return 42;
}

// Kernel: runs on the GPU, launched from the host.
__global__ void kernelFunction() {
    int result = deviceFunction();
    printf("Result from device function: %d\n", result);
}

int main() {
    kernelFunction<<<1, 1>>>();    // launch 1 block of 1 thread
    cudaDeviceSynchronize();       // wait for the kernel to finish
    return 0;
}
```
In this example:

- `deviceFunction` is a `__device__` function, callable only from the GPU.
- `kernelFunction` is a `__global__` function (kernel), callable from the CPU.
- The `cudaDeviceSynchronize()` function ensures the GPU completes all work before the CPU continues.
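Since HIP keeps the same qualifiers, porting this example is mostly mechanical. Here is a sketch of the HIP equivalent, assuming a ROCm toolchain (`hipcc`): the qualifiers are untouched, and only the header and the synchronization call change.

```cpp
#include <hip/hip_runtime.h>
#include <stdio.h>

__device__ int deviceFunction() {
    return 42;
}

__global__ void kernelFunction() {
    int result = deviceFunction();
    printf("Result from device function: %d\n", result);
}

int main() {
    kernelFunction<<<1, 1>>>();   // HIP supports the same triple-chevron launch
    hipDeviceSynchronize();       // HIP counterpart of cudaDeviceSynchronize()
    return 0;
}
```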
Variable Type Qualifiers in CUDA and HIP
CUDA and HIP provide variable qualifiers to define the scope and lifetime of variables, specifying where variables reside and their accessibility across threads. These qualifiers are essential in managing memory usage effectively and optimizing data access patterns within the GPU.
| Qualifier | Description |
|---|---|
| `__device__` | Variables reside in global memory, accessible by both the device and the host (via the runtime). |
| `__constant__` | Variables reside in constant memory, accessible by all threads within a grid and by the host. Constant memory is optimized for fast read-only access. |
| `__shared__` | Variables reside in shared memory and are accessible by all threads within a single block, providing fast, low-latency access for intra-block computations. |
Explanation of CUDA and HIP Variable Qualifiers

- `__device__`: Used for variables stored in global memory. These variables are accessible from both the host (CPU) and the device (GPU) through the runtime API. Global memory offers a large storage capacity but has higher latency than shared memory, making it suitable for data that needs to be accessed by multiple threads across different blocks.
- `__constant__`: Variables defined with this qualifier reside in constant memory, a type of read-only memory. Constant memory is optimized for fast read access, as it is cached within each multiprocessor. This qualifier is useful for values that are frequently read by multiple threads but do not change, such as configuration parameters.
- `__shared__`: Used for variables stored in shared memory, which is accessible by all threads within the same block. Shared memory has much lower latency than global memory and can significantly improve performance in computations where threads in the same block frequently access the same data.
Understanding and effectively using these qualifiers can optimize memory usage in CUDA and HIP programs, enhancing data access efficiency and overall GPU performance.
Example of Variable Qualifiers
```cpp
#include <stdio.h>

__device__   int deviceVar   = 100;  // global memory, visible to all blocks
__constant__ int constantVar = 50;   // cached, read-only constant memory

__global__ void kernelFunction() {
    __shared__ int sharedVar;        // shared across threads in a block
    sharedVar = 10;                  // every thread writes the same value here;
                                     // real code would assign from one thread
                                     // and call __syncthreads() before reading
    printf("Device Var: %d, Constant Var: %d, Shared Var: %d\n",
           deviceVar, constantVar, sharedVar);
}

int main() {
    kernelFunction<<<1, 10>>>();     // 1 block of 10 threads
    cudaDeviceSynchronize();
    return 0;
}
```
In this example:

- `deviceVar` is a `__device__` variable in global memory, accessible by all threads in all blocks.
- `constantVar` is a `__constant__` variable, optimized for read-only access across threads.
- `sharedVar` is a `__shared__` variable, specific to each block and shared by all threads within that block.
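The tables above say that `__device__` and `__constant__` variables are reachable from the host "via the runtime"; concretely, that means symbol-based copies rather than ordinary assignment. A minimal sketch using `cudaMemcpyToSymbol` and `cudaMemcpyFromSymbol` (HIP provides `hipMemcpyToSymbol`/`hipMemcpyFromSymbol` analogues, though the exact symbol-passing convention can vary by HIP version):

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

__device__   int deviceVar   = 100;
__constant__ int constantVar = 50;

int main() {
    int newConstant = 75;
    // The host updates constant memory through the runtime API;
    // device code can only read __constant__ variables.
    cudaMemcpyToSymbol(constantVar, &newConstant, sizeof(int));

    int readBack = 0;
    // The host reads a __device__ (global-memory) variable back.
    cudaMemcpyFromSymbol(&readBack, deviceVar, sizeof(int));
    printf("deviceVar as seen from the host: %d\n", readBack);
    return 0;
}
```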
CUDA and HIP Thread Qualifiers
CUDA and HIP thread qualifiers provide essential information about the organization of threads and blocks in a kernel. These built-in variables help manage and optimize the parallel execution of CUDA and HIP applications.
| Qualifier | Type | Description |
|---|---|---|
| `gridDim` | `dim3` | Dimensions (size) of the grid in each dimension (x, y, z). |
| `blockDim` | `dim3` | Dimensions (size) of the block in each dimension (x, y, z). |
| `blockIdx` | `uint3` | Block index within the grid, used to identify the block's position in a multi-block grid configuration. |
| `threadIdx` | `uint3` | Thread index within a block, used to identify the position of a thread within its block. |
| `warpSize` | `int` | Size of a warp or wavefront: typically 32 threads in CUDA and 64 threads in HIP (on AMD GPUs). |
Explanation of CUDA and HIP Thread Qualifiers

- `gridDim`: Specifies the dimensions of the grid, indicating the number of blocks in the x, y, and z dimensions. It helps in organizing the grid structure for larger problem sizes that require multiple blocks.
- `blockDim`: Defines the dimensions of each block in terms of the number of threads in the x, y, and z dimensions. It is essential for identifying the block layout and for indexing calculations within a block.
- `blockIdx`: Provides the block index within the grid, which is crucial in multi-block configurations. It allows each block to identify its unique position in the grid, useful for dividing work across multiple blocks.
- `threadIdx`: Provides the thread index within a block, identifying each thread's position relative to other threads in the same block. It is commonly used for calculating thread-specific operations within a block.
- `warpSize`: Defines the size of a warp or wavefront:
  - In CUDA (NVIDIA GPUs), `warpSize` is typically 32 threads.
  - In HIP (AMD GPUs), `warpSize` is typically 64 threads; in AMD's architecture, the warp equivalent is called a wavefront.
Understanding these thread qualifiers is essential for designing efficient CUDA and HIP applications, as they provide direct access to thread and block configurations at runtime. Code written in terms of `warpSize` will adapt to the platform's specific value, allowing better cross-platform compatibility.
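Putting these built-ins together, the canonical pattern is computing a unique global index per thread so a single kernel launch covers a whole array. A minimal sketch (the kernel name and sizes are illustrative):

```cpp
#include <stdio.h>

__global__ void scaleArray(float *data, int n) {
    // Global index = block offset within the grid + thread offset within the block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {            // guard: the last block may be only partially used
        data[idx] *= 2.0f;
    }
}

int main() {
    const int n = 1000;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    scaleArray<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```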
Summary
- Thread Hierarchy and Indexing: Each thread can determine its unique position in the grid using `blockIdx`, `threadIdx`, `blockDim`, and `gridDim`. This allows a single kernel to work on large data arrays, where each thread processes a unique piece of data.
- Memory Access: Memory access patterns are crucial for performance. Using shared memory (via `__shared__`) within a block can drastically reduce memory latency for data shared by multiple threads; see the sketch after this summary.
- Warp/Wavefront-Level Execution: In CUDA, a warp is a set of 32 threads that execute instructions in lockstep, while in HIP on AMD GPUs, a wavefront is a set of 64 threads. Efficient programming involves minimizing divergence within warps and wavefronts.
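To make the shared-memory point concrete, here is a hedged sketch of a block-level sum: threads stage values in `__shared__` storage, synchronize with `__syncthreads()`, and then combine them with a tree reduction, touching global memory only once per thread plus one final write per block.

```cpp
#define BLOCK 256

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK];      // one slot per thread in the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                   // make all writes visible before reading

    // Tree reduction: the active half adds the other half; stride halves each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];     // thread 0 publishes the block's partial sum
}
```

For strides of at least a warp's width, entire warps take the same branch in the reduction loop, which also keeps divergence low.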