Unified Memory¶
CUDA unified memory introduces a managed memory address space that is shared seamlessly between the CPU (host) and the GPU (device). This functionality, currently available primarily on Linux, simplifies memory management by eliminating the need for explicit data transfer calls such as cudaMemcpy(). Figure 1 illustrates the concept behind CUDA unified memory: CUDA manages memory across the CPU and GPU, making memory operations simpler for developers.
Figure 1: CUDA unified memory concept
Unified memory can be implemented in two main ways in CUDA:
- Using cudaMallocManaged() to allocate managed memory from the host.
- Declaring variables with the __managed__ qualifier, making them globally accessible to both CPU and GPU (see the sketch below).
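The examples in the rest of this section use cudaMallocManaged(), so here is a minimal sketch of the second approach; the kernel name AddAB and the variable Result are illustrative only and do not appear in the later examples:

#include <stdio.h>

// A __managed__ variable lives in unified memory and is directly
// accessible from both host and device code, with no explicit
// allocation or copy calls.
__managed__ int Result = 0;

__global__
void AddAB(int a, int b)
{
    Result = a + b;
}

int main()
{
    AddAB<<< 1, 1 >>>(10, 100);
    // Kernel launches are asynchronous, so wait for the kernel to
    // finish before reading Result on the host.
    cudaDeviceSynchronize();
    printf("A+B = %d\n", Result);
    return 0;
}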
To better understand the benefits of unified memory, let’s look at examples with and without unified memory.
Without Unified Memory¶
When unified memory is not used, explicit memory allocations and transfers are necessary. Below is a common workflow for such a setup:
- Allocate host memory.
- Allocate device memory.
- Initialize values on the host.
- Transfer data from the host to the device.
- Perform computation on the device using a CUDA kernel.
- Transfer results back from the device to the host.
- Free device memory.
- Free host memory.
Example: Without Unified Memory¶
The following code demonstrates matrix addition on the GPU without using unified memory:
#include "stdio.h"
__global__
void AplusB(int *Vector, int a, int b)
{
Vector[threadIdx.x] = Vector[threadIdx.x]
+ a + b + threadIdx.x;
}
int main()
{
int N = 100;
// Allocate the host memory
int *Host_Vector = (int *)malloc(N * sizeof(int));
// Allocate the device memory
int *Device_Vector;
cudaMalloc(&Device_Vector, N * sizeof(int));
// Initialize the host value
for(int i = 0; i < N; i++)
Host_Vector[i] = 100;
// Transfer the host value to device memory location
cudaMemcpy(Device_Vector, Host_Vector, N * sizeof(int), cudaMemcpyHostToDevice);
// Do the computation using the CUDA kernel
AplusB<<< 1, N >>>(Device_Vector, 10, 100);
// Transfer the data from the device to host
cudaMemcpy(Host_Vector, Device_Vector, N * sizeof(int), cudaMemcpyDeviceToHost);
for(int i = 0; i < N; i++)
printf("%d: A+B = %d\n", i, Host_Vector[i]);
// Free the device memory
cudaFree(Device_Vector);
// Free the host memory
free(Host_Vector);
return 0;
}
With Unified Memory¶
With unified memory, we simplify the workflow as follows:
- Allocate unified memory.
- Initialize the value in unified memory.
- Perform computation using a CUDA kernel.
- Free unified memory.
Unified memory reduces the number of steps, especially by eliminating explicit memory copy operations. However, it requires an explicit cudaDeviceSynchronize() call after the kernel launch: in the previous example the blocking cudaMemcpy() guaranteed that the kernel had finished before the host read the results, whereas kernel launches are asynchronous and, with unified memory, nothing else enforces that ordering.
Example: With Unified Memory¶
Here’s the same example rewritten to use unified memory:
#include "stdio.h"
__global__
void AplusB(int *Vector, int a, int b)
{
Vector[threadIdx.x] = Vector[threadIdx.x]
+ a + b + threadIdx.x;
}
int main()
{
int N = 100;
// Allocate the unified memory
int *Unified_Vector;
cudaMallocManaged(&Unified_Vector, N * sizeof(int));
// Initialize the unified memory value
for(int i = 0; i < N; i++)
Unified_Vector[i] = 100;
// Perform the computation using the CUDA kernel
AplusB<<< 1, N >>>(Unified_Vector, 10, 100);
// Synchronize the kernel call
cudaDeviceSynchronize();
for(int i = 0; i < N; i++)
printf("%d: A+B = %d\n", i, Unified_Vector[i]);
// Free the unified memory
cudaFree(Unified_Vector);
return 0;
}
Prefetching Pageable Memory¶
Figure 2: Prefetching Pageable Memory in CUDA
Starting with CUDA 8.0 and the Pascal architecture, managed memory is pageable: pages migrate to the GPU on demand, and they can also be prefetched to the GPU in advance, allowing the data transfer to overlap with computation and reducing page-fault overhead. Figure 2 illustrates the process of pageable memory prefetching.
The cudaMemPrefetchAsync() function performs this prefetching and can provide better performance than cudaMallocManaged() with demand paging alone, because the data is already resident when the kernel accesses it.
Example: Using cudaMemPrefetchAsync()¶
Here’s an example using cudaMemPrefetchAsync() for faster performance:
#include "stdio.h"
__global__
void AplusB(int *Vector, int a, int b)
{
Vector[threadIdx.x] = Vector[threadIdx.x]
+ a + b + threadIdx.x;
}
int main()
{
int N = 100;
// Allocate managed memory
int *Unified_Vector;
cudaMallocManaged(&Unified_Vector, N * sizeof(int));
// Initialize the managed memory value
for(int i = 0; i < N; i++)
Unified_Vector[i] = 100;
int device = -1;
cudaGetDevice(&device);
// Prefetch managed memory to the device
cudaMemPrefetchAsync(Unified_Vector, N * sizeof(int), device, NULL);
// Perform computation using the CUDA kernel
AplusB<<< 1, N >>>(Unified_Vector, 10, 100);
// Prefetch managed memory to the host
cudaMemPrefetchAsync(Unified_Vector, N * sizeof(int), cudaCpuDeviceId, NULL);
// Free the managed memory
cudaFree(Unified_Vector);
return 0;
}
In this code:
- cudaMemPrefetchAsync() prefetches memory to the device before the kernel call, and then back to the host after the kernel has executed.
- Prefetching reduces page-fault overhead: the data is already resident in the target processor's memory when it is accessed, which improves performance.
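To illustrate the overlap mentioned above, here is a rough sketch that issues the prefetches and the kernel on an explicit CUDA stream instead of the NULL (default) stream; the stream setup is an assumption added for illustration and is not part of the example above:

#include <stdio.h>

__global__
void AplusB(int *Vector, int a, int b)
{
    Vector[threadIdx.x] = Vector[threadIdx.x] + a + b + threadIdx.x;
}

int main()
{
    int N = 100;
    int *Unified_Vector;
    cudaMallocManaged(&Unified_Vector, N * sizeof(int));
    for(int i = 0; i < N; i++)
        Unified_Vector[i] = 100;

    int device = -1;
    cudaGetDevice(&device);

    // Create a stream so the prefetches and the kernel are ordered with
    // respect to each other but can overlap with unrelated host work.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Queue the migration to the GPU, then the kernel, on the same stream;
    // the kernel starts only after its data has been prefetched.
    cudaMemPrefetchAsync(Unified_Vector, N * sizeof(int), device, stream);
    AplusB<<< 1, N, 0, stream >>>(Unified_Vector, 10, 100);

    // Queue the migration back to the host, then wait for the stream
    // to drain before the CPU reads the results.
    cudaMemPrefetchAsync(Unified_Vector, N * sizeof(int), cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);

    for(int i = 0; i < N; i++)
        printf("%d: A+B = %d\n", i, Unified_Vector[i]);

    cudaStreamDestroy(stream);
    cudaFree(Unified_Vector);
    return 0;
}

Because all three operations are queued on the same stream, they execute in order on the GPU, while the host thread remains free to do other work until cudaStreamSynchronize() is called.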
Benefits of Using Unified Memory¶
Unified memory in CUDA offers several advantages:
- Simplified Code: Unified memory reduces the need for explicit memory copy operations, making the code easier to write and maintain.
- Automatic Memory Management: CUDA manages memory paging between CPU and GPU, which can reduce programming effort.
- Efficient Resource Utilization: Prefetching and managed memory allow developers to better utilize system and GPU resources.
However, unified memory does not always offer the best performance, especially for applications where data transfers between host and device are infrequent. In such cases, explicit cudaMemcpy() calls may provide better control and efficiency.
By combining unified memory with prefetching, developers can significantly reduce code complexity while maintaining good performance in applications where the CPU and GPU frequently share data.