Hello World from GPU¶
In this article, we will demonstrate how to print "Hello World" from a CUDA program. This simple program highlights the differences between standard C/C++ programming and CUDA programming, emphasizing the distinctions between CPU and GPU computing models.
C/C++ Code Example:¶
The following example demonstrates a traditional "Hello World" program in C/C++. The function c_function simply prints "Hello World!" when called.
// hello-world.c
#include <stdio.h>

void c_function()
{
    printf("Hello World!\n");
}

int main()
{
    c_function();
    return 0;
}
CUDA Code Example:¶
The CUDA version of "Hello World" requires marking the function with the __global__ attribute, indicating it is a kernel function that will execute on the GPU. Additionally, the kernel function is called using the CUDA launch syntax <<<1,1>>>, specifying the execution configuration in terms of blocks and threads.
// hello-world.cu
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void cuda_function()
{
    printf("Hello World from GPU!\n");
    __syncthreads(); // synchronize all threads within the block
}

int main()
{
    cuda_function<<<1,1>>>();
    cudaDeviceSynchronize(); // host waits until the device kernel has finished
    return 0;
}
Compilation Instructions¶
The following steps outline the process to compile and execute the CUDA program.
# Load the CUDA module (compiler)
$ module load cuda/11.0                   # Example: load CUDA version 11.0
$ nvcc -arch=compute_70 hello-world.cu    # Compile for NVIDIA Volta GPUs
$ ./a.out                                 # Run the executable
Note: The -arch=compute_70 flag specifies the compute capability for NVIDIA Volta GPUs. Each NVIDIA GPU has an associated compute capability, which determines which features and optimizations are available.
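If you are unsure which compute capability your GPU supports, it can be queried at runtime; a minimal sketch using the standard cudaGetDeviceProperties API (the file name is illustrative):

// query-cc.cu
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // properties of the first GPU (device 0)
    printf("Device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    return 0;
}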
HIP Code Example (AMD GPUs):¶
HIP provides a programming model similar to CUDA but is designed to also run on AMD GPUs. HIP likewise uses __global__ to mark a kernel function, but the kernel launch syntax is slightly different: hipLaunchKernelGGL takes the execution configuration as explicit arguments.
// hello-world.hip
#include <stdio.h>
#include <hip/hip_runtime.h>

__global__ void hip_function()
{
    printf("Hello World from GPU!\n");
    __syncthreads(); // synchronize all threads within a block
}

int main()
{
    hipLaunchKernelGGL(hip_function, dim3(1), dim3(1), 0, 0); // launch 1 block with 1 thread
    hipDeviceSynchronize(); // ensure the host waits for GPU completion
    return 0;
}
Compilation Instructions for HIP:¶
To compile and execute the HIP program on an AMD GPU:
# Load the ROCm module (for AMD GPUs)
$ module load rocm # Load ROCm (example)
$ hipcc hello-world.hip -o hello.out # Compile the HIP code
$ ./hello.out # Run the executable
HIP Portability: HIP code can be compiled to run on both AMD and NVIDIA GPUs. When targeting NVIDIA GPUs, use hipcc with NVIDIA's CUDA backend enabled.
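On systems with both toolchains installed, the target backend is typically selected with the HIP_PLATFORM environment variable; a sketch, assuming a recent ROCm release (the accepted values have changed across versions, e.g. older releases used nvcc):

# Select the NVIDIA backend before compiling
$ export HIP_PLATFORM=nvidia           # value may differ on older ROCm versions
$ hipcc hello-world.hip -o hello.out   # hipcc now forwards to the CUDA toolchain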
Key Concepts in CUDA and HIP Programming¶
1. Kernel Function Definition (__global__):¶
- Both CUDA and HIP use __global__ to mark functions (kernels) that run on the GPU.
- In this example, cuda_function in CUDA and hip_function in HIP are designated as __global__, indicating they will execute on the GPU.
2. Launching Kernel Functions:¶
- CUDA: The kernel launch syntax is <<<blocks, threads>>>.
- HIP: HIP uses hipLaunchKernelGGL(kernel, grid, block, shared_mem, stream).
- In both examples, <<<1,1>>> and dim3(1), dim3(1) mean launching one block with one thread, mimicking a sequential "Hello World" print.
3. Synchronization (__syncthreads and cudaDeviceSynchronize / hipDeviceSynchronize):¶
- CUDA and HIP use similar synchronization calls to manage the order of operations.
- __syncthreads() within a kernel ensures all threads within a block reach the same point before continuing.
- cudaDeviceSynchronize() in CUDA and hipDeviceSynchronize() in HIP ensure that the CPU (host) waits for GPU (device) completion.
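Because kernel launches are asynchronous, it is also good practice to check for errors after launching and synchronizing; a minimal sketch using the standard CUDA error API (HIP offers the analogous hipGetLastError and hipGetErrorString):

cuda_function<<<1,1>>>();
cudaError_t err = cudaGetLastError();   // catches launch-configuration errors
if (err != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();          // waits, and reports errors raised during execution
if (err != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(err));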
4. Memory Management in CUDA and HIP:¶
- Memory management differs slightly, but both CUDA and HIP allow for similar allocations and transfers between host and device memory. For complex applications, managing memory correctly is essential for performance and functionality.
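A minimal sketch of the typical host-device transfer pattern in CUDA (the HIP equivalents are hipMalloc, hipMemcpy, and hipFree; the array here is illustrative):

int h_data[4] = {1, 2, 3, 4};   // host array
int *d_data;

cudaMalloc((void**)&d_data, sizeof(h_data));                         // allocate device memory
cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);  // host -> device
// ... launch a kernel that operates on d_data ...
cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_data);                                                    // release device memory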
Practical Exploration¶
Experimenting with Block and Thread Counts¶
One of the benefits of GPU programming is the ability to leverage large numbers of threads to achieve parallelism. Try modifying the kernel launch configurations to observe the difference:
For CUDA:
cuda_function<<<2, 2>>>();
For HIP:
hipLaunchKernelGGL(hip_function, dim3(2), dim3(2), 0, 0);
Each of these configurations will launch 4 threads (2 blocks, each with 2 threads), demonstrating how parallelism can be easily scaled up with GPU computing.
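To make the parallelism visible, the kernel body can be modified to print each thread's coordinates using the built-in blockIdx and threadIdx variables:

__global__ void cuda_function()
{
    // Each thread prints its own block and thread index
    printf("Hello World from block %d, thread %d!\n",
           blockIdx.x, threadIdx.x);
}

With <<<2, 2>>>, this prints four lines, one per thread; note that the ordering of the lines is not guaranteed, since the threads run concurrently.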
Understanding HIP's Cross-Platform Capability¶
HIP is designed for portability, allowing the same codebase to be compiled for both NVIDIA and AMD GPUs. By converting existing CUDA code to HIP (known as "hipification"), developers can run applications across different hardware platforms.
Converting the CUDA application to HIP application¶
Hipification is the process of converting CUDA code to HIP, enabling the code to run on both AMD and NVIDIA GPUs. HIP provides tools to ease this transition by automating much of the code translation. Below are the basic steps for hipifying CUDA code:
1. Install ROCm and HIP Tools: Ensure that the AMD ROCm stack, which includes the HIP tools, is installed on your system.
2. Use hipify-perl or hipify-clang: These tools convert CUDA syntax to HIP syntax. For example, running hipify-perl my_cuda_file.cu > my_hip_file.hip converts my_cuda_file.cu into HIP-compliant code in my_hip_file.hip. For larger projects, hipify-clang offers more control and can handle complex conversions.
3. Review the Converted Code: Check the code for compatibility, as certain CUDA-specific features may not translate directly to HIP or may need modification.
4. Compile with hipcc: HIP provides the hipcc compiler to compile HIP code for both AMD and NVIDIA GPUs.
5. Run and Test on Target Hardware: Once compiled, run the program on your desired GPU to verify correct behavior and performance. The full workflow is sketched below.
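A minimal sketch of this workflow applied to the CUDA example above:

$ hipify-perl hello-world.cu > hello-world.hip   # translate CUDA calls to HIP
                                                 # e.g. cudaDeviceSynchronize -> hipDeviceSynchronize
$ hipcc hello-world.hip -o hello.out             # compile the translated code
$ ./hello.out                                    # Hello World from GPU!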
CUDA vs. HIP Basics¶
CUDA and HIP are two popular GPU programming frameworks for accelerating parallel computing tasks on GPUs. CUDA, developed by NVIDIA, specifically targets NVIDIA GPUs, whereas HIP, developed by AMD, is designed to be portable across AMD and NVIDIA GPUs, offering a similar syntax and programming model to CUDA. This document provides an introduction to key CUDA concepts, highlighting how these map to the basics of GPU programming in HIP.
1. Kernel Function Definition (__global__ in CUDA, __global__ in HIP)¶
- CUDA: Kernel functions intended to run on the GPU are marked with the __global__ qualifier. This tells the compiler the function is a "kernel" that will be executed by GPU threads.
  - Example in CUDA: __global__ void cuda_function() { /* Code */ }
- HIP: Similarly, HIP uses __global__ for defining kernel functions. HIP code written for AMD GPUs uses the same keyword, making it easy to port simple CUDA kernels to HIP by changing a few library-specific calls.
2. Launching Kernel Functions (<<<blocks, threads>>> in CUDA and HIP)¶
- CUDA: Kernel functions are launched using the syntax <<<blocks, threads>>>, where:
  - Blocks: groups of threads within the grid.
  - Threads: individual threads within each block.
  - Example in CUDA: cuda_function<<<1, 1>>>(); launches a single block with one thread.
- HIP: The kernel launch syntax is identical in HIP, so the same call hip_function<<<blocks, threads>>>(); works in HIP as well, enabling straightforward parallel execution on AMD GPUs.
3. Synchronization (__syncthreads and cudaDeviceSynchronize in CUDA, hipDeviceSynchronize in HIP)¶
- CUDA: Synchronization is essential to control the execution flow in parallel programming.
  - __syncthreads(); within a CUDA kernel synchronizes all threads in a block.
  - cudaDeviceSynchronize(); on the host ensures the CPU waits until the GPU completes its tasks.
- HIP: HIP provides similar synchronization with hipDeviceSynchronize(); to coordinate between the host and device. For example, hipDeviceSynchronize(); is used on the host to ensure all GPU tasks are complete before moving forward. Forgetting to synchronize can cause unpredictable behavior if the host attempts to access GPU results before computation is complete.
CPU vs. GPU Execution Model¶
In both CUDA and HIP, GPU code (kernels) is designed to execute across thousands of threads in parallel, each running a separate instance of the kernel function. This contrasts with CPU execution, where a function like c_function() runs sequentially on one or a few CPU cores. This parallel structure enables massive data-processing capability, especially for tasks that can be broken down across many threads.
- Parallelism: Each thread processes a part of the data independently, following the SIMT (Single Instruction, Multiple Threads) execution model, as sketched below.
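A minimal sketch of the one-thread-per-element pattern (the kernel name, block size, and data length are illustrative):

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n)                                     // guard: the last block may contain extra threads
        data[i] *= 2.0f;                           // each thread updates exactly one element
}

// Launch with enough 256-thread blocks to cover all n elements:
// scale<<<(n + 255) / 256, 256>>>(d_data, n);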
GPU Architecture and Compute Capability (CUDA) vs. GCN Architecture (HIP)¶
- CUDA: CUDA-enabled GPUs use a hardware architecture of streaming multiprocessors (SMs) that execute multiple threads in parallel. The GPU's compute capability (e.g., compute_70 for Volta or compute_80 for Ampere) specifies architectural features, guiding compilation optimizations for the GPU model.
  - Example: Compiling with -arch=compute_70 ensures that the compiled code is optimized for the Volta architecture.
- HIP: AMD GPUs use the GCN (Graphics Core Next) architecture and its successors, and hipcc (HIP's Clang-based compiler) supports architectural optimizations via --amdgpu-target=<architecture> flags.
  - Example: Using --amdgpu-target=gfx906 for a Radeon VII GPU ensures the code is optimized for that specific AMD architecture.
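For example (the --offload-arch spelling for hipcc is an assumption for newer ROCm releases, which have been replacing --amdgpu-target):

$ nvcc -arch=compute_70 hello-world.cu -o hello_nv           # NVIDIA Volta (compute capability 7.0)
$ hipcc --offload-arch=gfx906 hello-world.hip -o hello_amd   # AMD Radeon VII (gfx906)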
Synchronization and Common Pitfalls¶
- Thread Synchronization: Thread synchronization within kernels (__syncthreads() in both CUDA and HIP) is crucial for avoiding race conditions when threads share data.
- Host Synchronization: The asynchronous nature of GPU execution (between the host and device) necessitates host-level synchronization (cudaDeviceSynchronize() in CUDA, hipDeviceSynchronize() in HIP). Skipping these calls can lead to race conditions where the CPU attempts to access incomplete data, resulting in incorrect or unexpected outputs.
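A common defensive pattern is to wrap every runtime API call in an error check; a minimal sketch of such a helper (the macro name is illustrative, and hip* calls can be wrapped the same way):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage: CUDA_CHECK(cudaDeviceSynchronize());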
Summary¶
CUDA and HIP provide similar programming models for offloading tasks to the GPU, but they target different hardware platforms: CUDA targets NVIDIA GPUs, while HIP targets AMD GPUs and can also target NVIDIA GPUs through its CUDA backend. Key features include:
- Kernel Execution: CUDA and HIP allow kernel functions to run in parallel across thousands of threads.
- Kernel Launch Configuration: Both use the <<<blocks, threads>>> syntax to specify the level of parallelism (HIP additionally offers hipLaunchKernelGGL).
- Synchronization: Required both at the thread level within kernels and at the host level to ensure proper sequencing of CPU-GPU interactions.
- Architectural Optimization: CUDA uses compute capabilities, while HIP uses target specifications like gfx906 to optimize for specific GPU models.
CUDA and HIP enable developers to leverage GPU parallelism, allowing compute-intensive tasks to run efficiently on modern hardware.