Introduction to Key CUDA and HIP APIs and Qualifiers
In this section, we will explore essential CUDA APIs and qualifiers used to convert C/C++ code into GPU CUDA code. CUDA programming involves utilizing both the CPU (referred to as the "Host") and the GPU (referred to as the "Device") for high-performance parallel computing. We will also introduce HIP (Heterogeneous-compute Interface for Portability), which allows code to run on both NVIDIA and AMD GPUs with minimal modifications, enhancing cross-platform portability.
Key CUDA and HIP Terminology
- Host: Refers to the CPU or the main system where the program is run.
- Device: Refers to the GPU where the parallel processing takes place.
CUDA and HIP provide various function qualifiers, variable qualifiers, and thread qualifiers to manage code execution between the host and device. The following sections provide an overview of these qualifiers and their usage in CUDA and HIP programming.
Function Type Qualifiers in CUDA and HIP
CUDA and HIP provide function type qualifiers to define where a function will execute and where it can be called from. These qualifiers play a crucial role in organizing code execution between the CPU and GPU.
| Qualifier | Description |
|---|---|
| `__device__` | Functions with this qualifier execute only on the device (GPU) and are callable only from other device functions. |
| `__global__` | Functions with this qualifier execute on the device (GPU) but are callable from the host (CPU); they are the kernel functions. |
| `__host__` | Functions with this qualifier execute only on the host (CPU) and are callable only from the host. |
| `__forceinline__` | A directive that forces the compiler to inline a function, optimizing performance for frequently used, small functions. |
| `__noinline__` | A directive that prevents the compiler from inlining a function; useful for debugging or for limiting code bloat in certain parts of the program. |
Function Qualifier Summary
| Qualifier | Executed on | Callable from |
|---|---|---|
| `__device__` | Device | Device |
| `__global__` | Device | Host |
| `__host__` | Host | Host |
| `__forceinline__` / `__noinline__` | Device | Device |
In HIP, the function qualifiers remain similar: `__global__`, `__device__`, and `__host__` work in much the same way as in CUDA, providing consistency across platforms.
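CUDA and HIP also allow `__host__` and `__device__` to be combined on a single function, so the compiler emits both a CPU and a GPU version from the same source. A minimal sketch (the function names here are illustrative, not from any library):

```cpp
#include <cuda_runtime.h>

// Compiled for both host and device: callable from main() on the CPU
// and from kernels on the GPU.
__host__ __device__ float square(float x) { return x * x; }

// __forceinline__ combines with __device__ to request aggressive
// inlining of a small, frequently called device helper.
__device__ __forceinline__ float scaledSquare(float x, float s) {
    return square(x) * s;
}
```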
Example of Function Qualifiers
```cpp
#include <stdio.h>

// Runs on the GPU; callable only from other device code.
__device__ int deviceFunction() {
    return 42;
}

// Kernel: runs on the GPU, launched from the host.
__global__ void kernelFunction() {
    int result = deviceFunction();
    printf("Result from device function: %d\n", result);
}

int main() {
    kernelFunction<<<1, 1>>>();    // launch 1 block of 1 thread
    cudaDeviceSynchronize();       // wait for the kernel to finish
    return 0;
}
```
In this example:

- `deviceFunction` is a `__device__` function, callable only from the GPU.
- `kernelFunction` is a `__global__` function (kernel), callable from the CPU.
- The `cudaDeviceSynchronize()` function ensures the GPU completes all work before the CPU continues.
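Since HIP keeps the same qualifiers, porting this example is mostly mechanical. Here is a sketch of the HIP equivalent, assuming a ROCm toolchain (`hipcc`): the qualifiers are untouched, and only the header and the synchronization call change.

```cpp
#include <hip/hip_runtime.h>
#include <stdio.h>

__device__ int deviceFunction() {
    return 42;
}

__global__ void kernelFunction() {
    int result = deviceFunction();
    printf("Result from device function: %d\n", result);
}

int main() {
    kernelFunction<<<1, 1>>>();   // HIP supports the same triple-chevron launch
    hipDeviceSynchronize();       // HIP counterpart of cudaDeviceSynchronize()
    return 0;
}
```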
Variable Type Qualifiers in CUDA and HIP
CUDA and HIP provide variable qualifiers to define the scope and lifetime of variables, specifying where variables reside and their accessibility across threads. These qualifiers are essential in managing memory usage effectively and optimizing data access patterns within the GPU.
| Qualifier | Description |
|---|---|
| `__device__` | Variables reside in global memory, accessible by both the device and the host (via the runtime). |
| `__constant__` | Variables reside in constant memory, accessible by all threads within a grid and by the host. Constant memory is optimized for fast read-only access. |
| `__shared__` | Variables reside in shared memory and are accessible by all threads within a single block, providing fast, low-latency access for intra-block computations. |
Explanation of CUDA and HIP Variable Qualifiers

- `__device__`: Used for variables stored in global memory. These variables are accessible from both the host (CPU) and the device (GPU) through the runtime API. Global memory offers a large storage capacity but has higher latency than shared memory, making it suitable for data that needs to be accessed by multiple threads across different blocks.
- `__constant__`: Variables defined with this qualifier reside in constant memory, a type of read-only memory. Constant memory is optimized for fast read access, as it is cached within each multiprocessor. This qualifier is useful for values that are frequently read by multiple threads but do not change, such as configuration parameters.
- `__shared__`: Used for variables stored in shared memory, which is accessible by all threads within the same block. Shared memory has much lower latency than global memory and can significantly improve performance in computations where threads in the same block frequently access the same data.
Understanding and effectively using these qualifiers can optimize memory usage in CUDA and HIP programs, enhancing data access efficiency and overall GPU performance.
Example of Variable Qualifiers
```cpp
#include <stdio.h>

__device__   int deviceVar   = 100;  // global memory, visible to all blocks
__constant__ int constantVar = 50;   // cached, read-only constant memory

__global__ void kernelFunction() {
    __shared__ int sharedVar;        // shared across threads in a block
    sharedVar = 10;                  // every thread writes the same value here;
                                     // real code would assign from one thread
                                     // and call __syncthreads() before reading
    printf("Device Var: %d, Constant Var: %d, Shared Var: %d\n",
           deviceVar, constantVar, sharedVar);
}

int main() {
    kernelFunction<<<1, 10>>>();     // 1 block of 10 threads
    cudaDeviceSynchronize();
    return 0;
}
```
In this example:

- `deviceVar` is a `__device__` variable in global memory, accessible by all threads in all blocks.
- `constantVar` is a `__constant__` variable, optimized for read-only access across threads.
- `sharedVar` is a `__shared__` variable, specific to each block and shared by all threads within that block.
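The tables above say that `__device__` and `__constant__` variables are reachable from the host "via the runtime"; concretely, that means symbol-based copies rather than ordinary assignment. A minimal sketch using `cudaMemcpyToSymbol` and `cudaMemcpyFromSymbol` (HIP provides `hipMemcpyToSymbol`/`hipMemcpyFromSymbol` analogues, though the exact symbol-passing convention can vary by HIP version):

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

__device__   int deviceVar   = 100;
__constant__ int constantVar = 50;

int main() {
    int newConstant = 75;
    // The host updates constant memory through the runtime API;
    // device code can only read __constant__ variables.
    cudaMemcpyToSymbol(constantVar, &newConstant, sizeof(int));

    int readBack = 0;
    // The host reads a __device__ (global-memory) variable back.
    cudaMemcpyFromSymbol(&readBack, deviceVar, sizeof(int));
    printf("deviceVar as seen from the host: %d\n", readBack);
    return 0;
}
```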
CUDA and HIP Thread Qualifiers
CUDA and HIP thread qualifiers provide essential information about the organization of threads and blocks in a kernel. These built-in variables help manage and optimize the parallel execution of CUDA and HIP applications.
| Qualifier | Type | Description |
|---|---|---|
| `gridDim` | `dim3` | Dimensions (size) of the grid in each dimension (x, y, z). |
| `blockDim` | `dim3` | Dimensions (size) of the block in each dimension (x, y, z). |
| `blockIdx` | `uint3` | Block index within the grid, used to identify the block's position in a multi-block grid configuration. |
| `threadIdx` | `uint3` | Thread index within a block, used to identify the position of a thread within its block. |
| `warpSize` | `int` | Size of a warp or wavefront: typically 32 threads in CUDA and 64 threads in HIP (on AMD GPUs). |
Explanation of CUDA and HIP Thread Qualifiers

- `gridDim`: Specifies the dimensions of the grid, indicating the number of blocks in the x, y, and z dimensions. It helps in organizing the grid structure for larger problem sizes that require multiple blocks.
- `blockDim`: Defines the dimensions of each block in terms of the number of threads in the x, y, and z dimensions. It is essential for identifying the block layout and for indexing calculations within a block.
- `blockIdx`: Provides the block index within the grid, which is crucial in multi-block configurations. It allows each block to identify its unique position in the grid, useful for dividing work across multiple blocks.
- `threadIdx`: Provides the thread index within a block, identifying each thread's position relative to other threads in the same block. It is commonly used for calculating thread-specific operations within a block.
- `warpSize`: Defines the size of a warp or wavefront:
  - In CUDA (NVIDIA GPUs), `warpSize` is typically 32 threads.
  - In HIP (AMD GPUs), `warpSize` is typically 64 threads; in AMD's architecture, the warp equivalent is called a wavefront.
Understanding these thread qualifiers is essential for designing efficient CUDA and HIP applications, as they provide direct access to thread and block configurations at runtime. Code written in terms of `warpSize` will adapt to the platform's specific value, allowing better cross-platform compatibility.
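Putting these built-ins together, the canonical pattern is computing a unique global index per thread so a single kernel launch covers a whole array. A minimal sketch (the kernel name and sizes are illustrative):

```cpp
#include <stdio.h>

__global__ void scaleArray(float *data, int n) {
    // Global index = block offset within the grid + thread offset within the block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {            // guard: the last block may be only partially used
        data[idx] *= 2.0f;
    }
}

int main() {
    const int n = 1000;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    scaleArray<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```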
Summary
- Thread Hierarchy and Indexing: Each thread can determine its unique position in the grid using `blockIdx`, `threadIdx`, `blockDim`, and `gridDim`. This allows a single kernel to work on large data arrays, where each thread processes a unique piece of data.
- Memory Access: Memory access patterns are crucial for performance. Using shared memory (via `__shared__`) within a block can drastically reduce memory latency for data shared by multiple threads; see the sketch after this summary.
- Warp/Wavefront-Level Execution: In CUDA, a warp is a set of 32 threads that execute instructions in lockstep, while in HIP on AMD GPUs, a wavefront is a set of 64 threads. Efficient programming involves minimizing divergence within warps and wavefronts.
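To make the shared-memory point concrete, here is a hedged sketch of a block-level sum: threads stage values in `__shared__` storage, synchronize with `__syncthreads()`, and then combine them with a tree reduction, touching global memory only once per thread plus one final write per block.

```cpp
#define BLOCK 256

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK];      // one slot per thread in the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                   // make all writes visible before reading

    // Tree reduction: the active half adds the other half; stride halves each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];     // thread 0 publishes the block's partial sum
}
```

For strides of at least a warp's width, entire warps take the same branch in the reduction loop, which also keeps divergence low.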