Unified Memory

CUDA unified memory introduces a managed memory address space that can be shared seamlessly between the CPU (host) and the GPU (device). This functionality, with full demand-paging support currently available primarily on Linux, simplifies memory management by eliminating the need for explicit data transfer calls such as cudaMemcpy(). Figure 1 illustrates the concept behind CUDA unified memory: CUDA manages a single memory space across the CPU and GPU, making memory operations simpler for developers.

Figure 1: CUDA unified memory concept

Unified memory can be implemented in two main ways in CUDA:

  1. Using cudaMallocManaged() to allocate managed memory from the host.
  2. Declaring variables with the __managed__ qualifier, making them globally accessible to both CPU and GPU (a minimal sketch of this approach follows the list).
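
As a minimal sketch of the second approach (the variable and kernel names here are illustrative and not part of the later examples), a file-scope variable declared with __device__ __managed__ can be read and written from both host and device code:

#include <stdio.h>

// File-scope managed variable, accessible from host and device code
__device__ __managed__ int Counter = 0;

__global__
void Increment()
{
  // Each thread atomically increments the managed variable
  atomicAdd(&Counter, 1);
}

int main()
{
  Increment<<< 1, 32 >>>();

  // Synchronize before the host reads the managed variable
  cudaDeviceSynchronize();

  printf("Counter = %d\n", Counter);

  return 0;
}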

To better understand the benefits of unified memory, let’s look at examples with and without unified memory.


Without Unified Memory

When unified memory is not used, explicit memory allocations and transfers are necessary. Below is a common workflow for such a setup:

  1. Allocate host memory.
  2. Allocate device memory.
  3. Initialize values on the host.
  4. Transfer data from the host to the device.
  5. Perform computation on the device using a CUDA kernel.
  6. Transfer results back from the device to the host.
  7. Free device memory.
  8. Free host memory.

Example: Without Unified Memory

The following code performs an element-wise vector addition on the GPU without using unified memory:

#include "stdio.h"

__global__ 
void AplusB(int *Vector, int a, int b) 
{
  Vector[threadIdx.x] = Vector[threadIdx.x] 
    + a + b + threadIdx.x;
}

int main() 
{
  int N = 100;

  // Allocate the host memory 
  int *Host_Vector = (int *)malloc(N * sizeof(int));

  // Allocate the device memory 
  int *Device_Vector;
  cudaMalloc(&Device_Vector, N * sizeof(int));

  // Initialize the host value
  for(int i = 0; i < N; i++)
    Host_Vector[i] = 100;

  // Transfer the host value to device memory location 
  cudaMemcpy(Device_Vector, Host_Vector, N * sizeof(int), cudaMemcpyHostToDevice);

  // Do the computation using the CUDA kernel
  AplusB<<< 1, N >>>(Device_Vector, 10, 100);

  // Transfer the data from the device to host
  cudaMemcpy(Host_Vector, Device_Vector, N * sizeof(int), cudaMemcpyDeviceToHost);

  for(int i = 0; i < N; i++)
    printf("%d: A+B = %d\n", i, Host_Vector[i]); 

  // Free the device memory 
  cudaFree(Device_Vector); 

  // Free the host memory 
  free(Host_Vector);

  return 0;
}

With Unified Memory

With unified memory, we simplify the workflow as follows:

  1. Allocate unified memory.
  2. Initialize the value in unified memory.
  3. Perform computation using a CUDA kernel.
  4. Free unified memory.

Unified memory reduces the number of steps, most notably by eliminating the explicit memory copy operations. However, it requires an additional cudaDeviceSynchronize() call after the kernel launch: kernel launches are asynchronous, and without the implicit synchronization provided by cudaMemcpy(), the host must synchronize explicitly before reading the results.

Example: With Unified Memory

Here’s the same computation rewritten to use unified memory:

#include "stdio.h"

__global__ 
void AplusB(int *Vector, int a, int b) 
{
  Vector[threadIdx.x] = Vector[threadIdx.x] 
    + a + b + threadIdx.x;
}

int main() 
{
  int N = 100;

  // Allocate the unified memory 
  int *Unified_Vector;
  cudaMallocManaged(&Unified_Vector, N * sizeof(int)); 

  // Initialize the unified memory value
  for(int i = 0; i < N; i++)
    Unified_Vector[i] = 100;

  // Perform the computation using the CUDA kernel
  AplusB<<< 1, N >>>(Unified_Vector, 10, 100);

  // Synchronize the kernel call 
  cudaDeviceSynchronize();

  for(int i = 0; i < N; i++)
    printf("%d: A+B = %d\n", i, Unified_Vector[i]); 

  // Free the unified memory 
  cudaFree(Unified_Vector); 

  return 0;
}


Prefetching Pageable Memory

Figure 2: Prefetching Pageable Memory in CUDA

Starting with CUDA 8.0 and the Pascal architecture, managed memory is migrated between host and device on demand, and it can also be prefetched to the GPU ahead of time. Prefetching overlaps the data transfer with other work and reduces page-fault overhead. Figure 2 illustrates the process of prefetching pageable memory.

The cudaMemPrefetchAsync() function issues such a prefetch and typically provides better performance than relying on cudaMallocManaged() alone, because the data is already resident on the device when the kernel runs.

Example: Using cudaMemPrefetchAsync()

Here’s an example using cudaMemPrefetchAsync() for better performance:

#include "stdio.h"

__global__ 
void AplusB(int *Vector, int a, int b) 
{
  Vector[threadIdx.x] = Vector[threadIdx.x] 
    + a + b + threadIdx.x;
}

int main() 
{
  int N = 100;

  // Allocate managed memory 
  int *Unified_Vector;
  cudaMallocManaged(&Unified_Vector, N * sizeof(int)); 

  // Initialize the managed memory value
  for(int i = 0; i < N; i++)
    Unified_Vector[i] = 100;

  int device = -1;
  cudaGetDevice(&device);

  // Prefetch managed memory to the device 
  cudaMemPrefetchAsync(Unified_Vector, N * sizeof(int), device, NULL);

  // Perform computation using the CUDA kernel
  AplusB<<< 1, N >>>(Unified_Vector, 10, 100);

  // Prefetch managed memory back to the host
  cudaMemPrefetchAsync(Unified_Vector, N * sizeof(int), cudaCpuDeviceId, NULL);

  // Wait for the kernel and the prefetch to complete before the host reads the data
  cudaDeviceSynchronize();

  for(int i = 0; i < N; i++)
    printf("%d: A+B = %d\n", i, Unified_Vector[i]);

  // Free the managed memory 
  cudaFree(Unified_Vector); 

  return 0;
}

In this code:

  • cudaMemPrefetchAsync() prefetches memory to the device before the kernel launch, and back to the host afterwards; cudaDeviceSynchronize() then ensures the kernel and the prefetch have completed before the host reads the results.
  • Prefetching reduces page-fault overhead, because the data is already resident in the target memory when it is accessed, which improves performance.
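
Prefetching in this form relies on the GPU supporting concurrent managed access, a Pascal-or-newer capability. As a minimal sketch (separate from the examples above), the corresponding device attribute can be queried at runtime before issuing prefetch calls:

#include <stdio.h>

int main()
{
  int device = -1;
  cudaGetDevice(&device);

  // Query whether the device can coherently access managed memory
  // concurrently with the CPU (Pascal-class on-demand paging)
  int concurrentManagedAccess = 0;
  cudaDeviceGetAttribute(&concurrentManagedAccess,
                         cudaDevAttrConcurrentManagedAccess, device);

  if (concurrentManagedAccess)
    printf("Device %d supports concurrent managed access.\n", device);
  else
    printf("Device %d has limited unified memory support; prefetching may not help.\n", device);

  return 0;
}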

Benefits of Using Unified Memory

Unified memory in CUDA offers several advantages:

  • Simplified Code: Unified memory reduces the need for explicit memory copy operations, making the code easier to write and maintain.
  • Automatic Memory Management: CUDA manages memory paging between CPU and GPU, which can reduce programming effort.
  • Efficient Resource Utilization: Prefetching and managed memory allow developers to better utilize system and GPU resources.

However, unified memory may not always offer the best performance, particularly for workloads where data moves between host and device in large, infrequent bulk transfers. In such cases, explicit cudaMemcpy() calls can provide better control and efficiency.

By utilizing unified memory and prefetching techniques, developers can significantly reduce code complexity and optimize performance in applications with frequent CPU-GPU interactions.