Hello World from GPU¶
In this article, we will demonstrate how to print "Hello World" from a CUDA program. This simple program highlights the differences between standard C/C++ programming and CUDA programming, emphasizing the distinctions between CPU and GPU computing models.
C/C++ Code Example:¶
The following example demonstrates a traditional "Hello World" program in C/C++. The function c_function simply prints "Hello World!" when called.
// hello-world.c
#include <stdio.h>

void c_function()
{
    printf("Hello World!\n");
}

int main()
{
    c_function();
    return 0;
}
CUDA Code Example:¶
The CUDA version of "Hello World" requires marking the function with the __global__ attribute, indicating it is a kernel function that will execute on the GPU. Additionally, the kernel function is called using the CUDA launch syntax <<<1,1>>>, specifying the execution configuration in terms of blocks and threads.
// hello-world.cu
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void cuda_function()
{
    printf("Hello World from GPU!\n");
    __syncthreads(); // synchronize all threads within the block
}

int main()
{
    cuda_function<<<1,1>>>();
    cudaDeviceSynchronize(); // host waits until the device kernel has finished
    return 0;
}
Compilation Instructions¶
The following steps outline the process to compile and execute the CUDA program.
# Load the CUDA module (compiler)
$ module load cuda/11.0                   # Example: load CUDA version 11.0
$ nvcc -arch=compute_70 hello-world.cu    # Compile for NVIDIA Volta GPUs
$ ./a.out                                 # Run the executable
Note: The -arch=compute_70 flag specifies the compute capability for NVIDIA Volta GPUs. Each NVIDIA GPU has an associated compute capability, which determines which features and optimizations are available.
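If you are unsure which compute capability your GPU supports, it can be queried at runtime; a minimal sketch using the standard cudaGetDeviceProperties API (the file name is illustrative):

// query-cc.cu
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // properties of the first GPU (device 0)
    printf("Device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    return 0;
}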
HIP Code Example (AMD GPUs):¶
HIP provides a programming model similar to CUDA but is designed to also run on AMD GPUs. HIP likewise uses __global__ to mark a kernel function, but the kernel launch syntax is slightly different: hipLaunchKernelGGL takes the execution configuration as explicit arguments.
// hello-world.hip
#include <stdio.h>
#include <hip/hip_runtime.h>

__global__ void hip_function()
{
    printf("Hello World from GPU!\n");
    __syncthreads(); // synchronize all threads within a block
}

int main()
{
    hipLaunchKernelGGL(hip_function, dim3(1), dim3(1), 0, 0); // launch 1 block with 1 thread
    hipDeviceSynchronize(); // ensure the host waits for GPU completion
    return 0;
}
Compilation Instructions for HIP:¶
To compile and execute the HIP program on an AMD GPU:
# Load the ROCm module (for AMD GPUs)
$ module load rocm # Load ROCm (example)
$ hipcc hello-world.hip -o hello.out # Compile the HIP code
$ ./hello.out # Run the executable
HIP Portability: HIP code can be compiled to run on both AMD and NVIDIA GPUs. When targeting NVIDIA GPUs, use hipcc with NVIDIA's CUDA backend enabled.
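On systems with both toolchains installed, the target backend is typically selected with the HIP_PLATFORM environment variable; a sketch, assuming a recent ROCm release (the accepted values have changed across versions, e.g. older releases used nvcc):

# Select the NVIDIA backend before compiling
$ export HIP_PLATFORM=nvidia           # value may differ on older ROCm versions
$ hipcc hello-world.hip -o hello.out   # hipcc now forwards to the CUDA toolchain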
Key Concepts in CUDA and HIP Programming¶
1. Kernel Function Definition (__global__):¶
- Both CUDA and HIP use __global__ to mark functions (kernels) that run on the GPU.
- In this example, cuda_function in CUDA and hip_function in HIP are designated as __global__, indicating they will execute on the GPU.
2. Launching Kernel Functions:¶
- CUDA: The kernel launch syntax is <<<blocks, threads>>>.
- HIP: HIP uses hipLaunchKernelGGL(kernel, grid, block, shared_mem, stream).
- In both examples, <<<1,1>>> and dim3(1), dim3(1) mean launching one block with one thread, mimicking a sequential "Hello World" print.
3. Synchronization (__syncthreads and cudaDeviceSynchronize / hipDeviceSynchronize):¶
- CUDA and HIP use similar synchronization calls to manage the order of operations.
- __syncthreads() within a kernel ensures all threads within a block reach the same point before continuing.
- cudaDeviceSynchronize() in CUDA and hipDeviceSynchronize() in HIP ensure that the CPU (host) waits for GPU (device) completion.
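Because kernel launches are asynchronous, it is also good practice to check for errors after launching and synchronizing; a minimal sketch using the standard CUDA error API (HIP offers the analogous hipGetLastError and hipGetErrorString):

cuda_function<<<1,1>>>();
cudaError_t err = cudaGetLastError();   // catches launch-configuration errors
if (err != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();          // waits, and reports errors raised during execution
if (err != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(err));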
4. Memory Management in CUDA and HIP:¶
- Memory management differs slightly, but both CUDA and HIP allow for similar allocations and transfers between host and device memory. For complex applications, managing memory correctly is essential for performance and functionality.
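A minimal sketch of the typical host-device transfer pattern in CUDA (the HIP equivalents are hipMalloc, hipMemcpy, and hipFree; the array here is illustrative):

int h_data[4] = {1, 2, 3, 4};   // host array
int *d_data;

cudaMalloc((void**)&d_data, sizeof(h_data));                         // allocate device memory
cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);  // host -> device
// ... launch a kernel that operates on d_data ...
cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_data);                                                    // release device memory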
Practical Exploration¶
Experimenting with Block and Thread Counts¶
One of the benefits of GPU programming is the ability to leverage large numbers of threads to achieve parallelism. Try modifying the kernel launch configurations to observe the difference:
For CUDA:
cuda_function<<<2, 2>>>();
For HIP:
hipLaunchKernelGGL(hip_function, dim3(2), dim3(2), 0, 0);
Each of these configurations will launch 4 threads (2 blocks, each with 2 threads), demonstrating how parallelism can be easily scaled up with GPU computing.
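To make the parallelism visible, the kernel body can be modified to print each thread's coordinates using the built-in blockIdx and threadIdx variables:

__global__ void cuda_function()
{
    // Each thread prints its own block and thread index
    printf("Hello World from block %d, thread %d!\n",
           blockIdx.x, threadIdx.x);
}

With <<<2, 2>>>, this prints four lines, one per thread; note that the ordering of the lines is not guaranteed, since the threads run concurrently.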
Understanding HIP's Cross-Platform Capability¶
HIP is designed for portability, allowing the same codebase to be compiled for both NVIDIA and AMD GPUs. By converting existing CUDA code to HIP (known as "hipification"), developers can run applications across different hardware platforms.
Converting the CUDA application to HIP application¶
Hipification is the process of converting CUDA code to HIP, enabling the code to run on both AMD and NVIDIA GPUs. HIP provides tools to ease this transition by automating much of the code translation. Below are the basic steps for hipifying CUDA code:
1. Install ROCm and HIP Tools: Ensure that the AMD ROCm stack, which includes the HIP tools, is installed on your system.
2. Use hipify-perl or hipify-clang: These tools convert CUDA syntax to HIP syntax. For example, running hipify-perl my_cuda_file.cu > my_hip_file.hip converts my_cuda_file.cu into HIP-compliant code in my_hip_file.hip. For larger projects, hipify-clang offers more control and can handle complex conversions.
3. Review the Converted Code: Check the code for compatibility, as certain CUDA-specific features may not translate directly to HIP or may need modification.
4. Compile with hipcc: HIP provides the hipcc compiler to compile HIP code for both AMD and NVIDIA GPUs.
5. Run and Test on Target Hardware: Once compiled, run the program on your desired GPU to verify correct behavior and performance. The full workflow is sketched below.
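A minimal sketch of this workflow applied to the CUDA example above:

$ hipify-perl hello-world.cu > hello-world.hip   # translate CUDA calls to HIP
                                                 # e.g. cudaDeviceSynchronize -> hipDeviceSynchronize
$ hipcc hello-world.hip -o hello.out             # compile the translated code
$ ./hello.out                                    # Hello World from GPU!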
CUDA vs. HIP Basics¶
CUDA and HIP are two popular GPU programming frameworks for accelerating parallel computing tasks on GPUs. CUDA, developed by NVIDIA, specifically targets NVIDIA GPUs, whereas HIP, developed by AMD, is designed to be portable across AMD and NVIDIA GPUs, offering a similar syntax and programming model to CUDA. This document provides an introduction to key CUDA concepts, highlighting how these map to the basics of GPU programming in HIP.
1. Kernel Function Definition (__global__ in CUDA, __global__ in HIP)¶
- CUDA: Kernel functions intended to run on the GPU are marked with the __global__ qualifier. This tells the compiler the function is a "kernel" that will be executed by GPU threads.
  - Example in CUDA: __global__ void cuda_function() { /* Code */ }
- HIP: Similarly, HIP uses __global__ for defining kernel functions. HIP code written for AMD GPUs uses the same keyword, making it easy to port simple CUDA kernels to HIP by changing a few library-specific calls.
2. Launching Kernel Functions (<<<blocks, threads>>> in CUDA and HIP)¶
- CUDA: Kernel functions are launched using the syntax <<<blocks, threads>>>, where:
  - Blocks: groups of threads within the grid.
  - Threads: individual threads within each block.
  - Example in CUDA: cuda_function<<<1, 1>>>(); launches a single block with one thread.
- HIP: The kernel launch syntax is identical in HIP, so the same call hip_function<<<blocks, threads>>>(); works in HIP as well, enabling straightforward parallel execution on AMD GPUs.
3. Synchronization (__syncthreads and cudaDeviceSynchronize in CUDA, hipDeviceSynchronize in HIP)¶
- CUDA: Synchronization is essential to control the execution flow in parallel programming.
  - __syncthreads(); within a CUDA kernel synchronizes all threads in a block.
  - cudaDeviceSynchronize(); on the host ensures the CPU waits until the GPU completes its tasks.
- HIP: HIP provides similar synchronization with hipDeviceSynchronize(); to coordinate between the host and device. For example, hipDeviceSynchronize(); is used on the host to ensure all GPU tasks are complete before moving forward. Forgetting to synchronize can cause unpredictable behavior if the host attempts to access GPU results before computation is complete.
CPU vs. GPU Execution Model¶
In both CUDA and HIP, GPU code (kernels) is designed to execute across thousands of threads in parallel, each running a separate instance of the kernel function. This contrasts with CPU execution, where a function like c_function() runs sequentially on one or a few CPU cores. This parallel structure enables massive data-processing capability, especially for tasks that can be broken down across many threads.
- Parallelism: Each thread processes a part of the data independently, following the SIMT (Single Instruction, Multiple Threads) execution model, as sketched below.
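A minimal sketch of the one-thread-per-element pattern (the kernel name, block size, and data length are illustrative):

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n)                                     // guard: the last block may contain extra threads
        data[i] *= 2.0f;                           // each thread updates exactly one element
}

// Launch with enough 256-thread blocks to cover all n elements:
// scale<<<(n + 255) / 256, 256>>>(d_data, n);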
GPU Architecture and Compute Capability (CUDA) vs. GCN Architecture (HIP)¶
- CUDA: CUDA-enabled GPUs use a hardware architecture of streaming multiprocessors (SMs) that execute multiple threads in parallel. The GPU's compute capability (e.g., compute_70 for Volta or compute_80 for Ampere) specifies architectural features, guiding compilation optimizations for the GPU model.
  - Example: Compiling with -arch=compute_70 ensures that the compiled code is optimized for the Volta architecture.
- HIP: AMD GPUs use the GCN (Graphics Core Next) architecture and its successors, and hipcc (HIP's Clang-based compiler) supports architectural optimizations via --amdgpu-target=<architecture> flags.
  - Example: Using --amdgpu-target=gfx906 for a Radeon VII GPU ensures the code is optimized for that specific AMD architecture.
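For example (the --offload-arch spelling for hipcc is an assumption for newer ROCm releases, which have been replacing --amdgpu-target):

$ nvcc -arch=compute_70 hello-world.cu -o hello_nv           # NVIDIA Volta (compute capability 7.0)
$ hipcc --offload-arch=gfx906 hello-world.hip -o hello_amd   # AMD Radeon VII (gfx906)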
Synchronization and Common Pitfalls¶
- Thread Synchronization: Thread synchronization within kernels (__syncthreads() in both CUDA and HIP) is crucial for avoiding race conditions when threads share data.
- Host Synchronization: The asynchronous nature of GPU execution (between the host and device) necessitates host-level synchronization (cudaDeviceSynchronize() in CUDA, hipDeviceSynchronize() in HIP). Skipping these calls can lead to race conditions where the CPU attempts to access incomplete data, resulting in incorrect or unexpected outputs.
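A common defensive pattern is to wrap every runtime API call in an error check; a minimal sketch of such a helper (the macro name is illustrative, and hip* calls can be wrapped the same way):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage: CUDA_CHECK(cudaDeviceSynchronize());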
Summary¶
CUDA and HIP provide similar programming models for offloading tasks to the GPU, but they target different hardware platforms: CUDA targets NVIDIA GPUs, while HIP targets AMD GPUs and can also target NVIDIA GPUs through its CUDA backend. Key features include:
- Kernel Execution: CUDA and HIP allow kernel functions to run in parallel across thousands of threads.
- Kernel Launch Configuration: Both use the <<<blocks, threads>>> syntax to specify the level of parallelism (HIP additionally offers hipLaunchKernelGGL).
- Synchronization: Required both at the thread level within kernels and at the host level to ensure proper sequencing of CPU-GPU interactions.
- Architectural Optimization: CUDA uses compute capabilities, while HIP uses target specifications like gfx906 to optimize for specific GPU models.
CUDA and HIP enable developers to leverage GPU parallelism, allowing compute-intensive tasks to run efficiently on modern hardware.