Unified Memory¶
CUDA unified memory introduces a managed memory address space that is shared seamlessly between the CPU (host) and the GPU (device). This functionality, currently available primarily on Linux, simplifies memory management by eliminating the need for explicit data transfer calls such as cudaMemcpy(). Figure 1 illustrates the concept behind CUDA unified memory: CUDA manages memory across the CPU and GPU, making memory operations simpler for developers.
Figure 1: CUDA unified memory concept
Unified memory can be implemented in two main ways in CUDA:
- Using cudaMallocManaged() to allocate managed memory from the host.
- Declaring variables with the __managed__ qualifier, making them globally accessible to both CPU and GPU (see the sketch below).
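The examples in the rest of this section use cudaMallocManaged(), so here is a minimal sketch of the second approach; the kernel name AddAB and the variable Result are illustrative only and do not appear in the later examples:

#include <stdio.h>

// A __managed__ variable lives in unified memory and is directly
// accessible from both host and device code, with no explicit
// allocation or copy calls.
__managed__ int Result = 0;

__global__
void AddAB(int a, int b)
{
    Result = a + b;
}

int main()
{
    AddAB<<< 1, 1 >>>(10, 100);
    // Kernel launches are asynchronous, so wait for the kernel to
    // finish before reading Result on the host.
    cudaDeviceSynchronize();
    printf("A+B = %d\n", Result);
    return 0;
}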
To better understand the benefits of unified memory, let’s look at examples with and without unified memory.
Without Unified Memory¶
When unified memory is not used, explicit memory allocations and transfers are necessary. Below is a common workflow for such a setup:
- Allocate host memory.
- Allocate device memory.
- Initialize values on the host.
- Transfer data from the host to the device.
- Perform computation on the device using a CUDA kernel.
- Transfer results back from the device to the host.
- Free device memory.
- Free host memory.
Example: Without Unified Memory¶
The following code demonstrates matrix addition on the GPU without using unified memory:
#include "stdio.h"
__global__
void AplusB(int *Vector, int a, int b)
{
Vector[threadIdx.x] = Vector[threadIdx.x]
+ a + b + threadIdx.x;
}
int main()
{
int N = 100;
// Allocate the host memory
int *Host_Vector = (int *)malloc(N * sizeof(int));
// Allocate the device memory
int *Device_Vector;
cudaMalloc(&Device_Vector, N * sizeof(int));
// Initialize the host value
for(int i = 0; i < N; i++)
Host_Vector[i] = 100;
// Transfer the host value to device memory location
cudaMemcpy(Device_Vector, Host_Vector, N * sizeof(int), cudaMemcpyHostToDevice);
// Do the computation using the CUDA kernel
AplusB<<< 1, N >>>(Device_Vector, 10, 100);
// Transfer the data from the device to host
cudaMemcpy(Host_Vector, Device_Vector, N * sizeof(int), cudaMemcpyDeviceToHost);
for(int i = 0; i < N; i++)
printf("%d: A+B = %d\n", i, Host_Vector[i]);
// Free the device memory
cudaFree(Device_Vector);
// Free the host memory
free(Host_Vector);
return 0;
}
With Unified Memory¶
With unified memory, we simplify the workflow as follows:
- Allocate unified memory.
- Initialize the value in unified memory.
- Perform computation using a CUDA kernel.
- Free unified memory.
Unified memory reduces the number of steps, especially by eliminating explicit memory copy operations. However, it requires an explicit cudaDeviceSynchronize() call after the kernel launch: in the previous example the blocking cudaMemcpy() guaranteed that the kernel had finished before the host read the results, whereas kernel launches are asynchronous and, with unified memory, nothing else enforces that ordering.
Example: With Unified Memory¶
Here’s the same example rewritten to use unified memory:
#include "stdio.h"
__global__
void AplusB(int *Vector, int a, int b)
{
Vector[threadIdx.x] = Vector[threadIdx.x]
+ a + b + threadIdx.x;
}
int main()
{
int N = 100;
// Allocate the unified memory
int *Unified_Vector;
cudaMallocManaged(&Unified_Vector, N * sizeof(int));
// Initialize the unified memory value
for(int i = 0; i < N; i++)
Unified_Vector[i] = 100;
// Perform the computation using the CUDA kernel
AplusB<<< 1, N >>>(Unified_Vector, 10, 100);
// Synchronize the kernel call
cudaDeviceSynchronize();
for(int i = 0; i < N; i++)
printf("%d: A+B = %d\n", i, Unified_Vector[i]);
// Free the unified memory
cudaFree(Unified_Vector);
return 0;
}
Prefetching Pageable Memory¶
Figure 2: Prefetching Pageable Memory in CUDA
Starting with CUDA 8.0 and the Pascal architecture, managed memory is pageable: pages migrate to the GPU on demand, and they can also be prefetched to the GPU in advance, allowing the data transfer to overlap with computation and reducing page-fault overhead. Figure 2 illustrates the process of pageable memory prefetching.
The cudaMemPrefetchAsync() function performs this prefetching and can provide better performance than cudaMallocManaged() with demand paging alone, because the data is already resident when the kernel accesses it.
Example: Using cudaMemPrefetchAsync()¶
Here’s an example using cudaMemPrefetchAsync() for faster performance:
#include "stdio.h"
__global__
void AplusB(int *Vector, int a, int b)
{
Vector[threadIdx.x] = Vector[threadIdx.x]
+ a + b + threadIdx.x;
}
int main()
{
int N = 100;
// Allocate managed memory
int *Unified_Vector;
cudaMallocManaged(&Unified_Vector, N * sizeof(int));
// Initialize the managed memory value
for(int i = 0; i < N; i++)
Unified_Vector[i] = 100;
int device = -1;
cudaGetDevice(&device);
// Prefetch managed memory to the device
cudaMemPrefetchAsync(Unified_Vector, N * sizeof(int), device, NULL);
// Perform computation using the CUDA kernel
AplusB<<< 1, N >>>(Unified_Vector, 10, 100);
// Prefetch managed memory to the host
cudaMemPrefetchAsync(Unified_Vector, N * sizeof(int), cudaCpuDeviceId, NULL);
// Free the managed memory
cudaFree(Unified_Vector);
return 0;
}
In this code:
- cudaMemPrefetchAsync() prefetches memory to the device before the kernel call, and then back to the host after the kernel has executed.
- Prefetching reduces page-fault overhead: the data is already resident in the target processor's memory when it is accessed, which improves performance.
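To illustrate the overlap mentioned above, here is a rough sketch that issues the prefetches and the kernel on an explicit CUDA stream instead of the NULL (default) stream; the stream setup is an assumption added for illustration and is not part of the example above:

#include <stdio.h>

__global__
void AplusB(int *Vector, int a, int b)
{
    Vector[threadIdx.x] = Vector[threadIdx.x] + a + b + threadIdx.x;
}

int main()
{
    int N = 100;
    int *Unified_Vector;
    cudaMallocManaged(&Unified_Vector, N * sizeof(int));
    for(int i = 0; i < N; i++)
        Unified_Vector[i] = 100;

    int device = -1;
    cudaGetDevice(&device);

    // Create a stream so the prefetches and the kernel are ordered with
    // respect to each other but can overlap with unrelated host work.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Queue the migration to the GPU, then the kernel, on the same stream;
    // the kernel starts only after its data has been prefetched.
    cudaMemPrefetchAsync(Unified_Vector, N * sizeof(int), device, stream);
    AplusB<<< 1, N, 0, stream >>>(Unified_Vector, 10, 100);

    // Queue the migration back to the host, then wait for the stream
    // to drain before the CPU reads the results.
    cudaMemPrefetchAsync(Unified_Vector, N * sizeof(int), cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);

    for(int i = 0; i < N; i++)
        printf("%d: A+B = %d\n", i, Unified_Vector[i]);

    cudaStreamDestroy(stream);
    cudaFree(Unified_Vector);
    return 0;
}

Because all three operations are queued on the same stream, they execute in order on the GPU, while the host thread remains free to do other work until cudaStreamSynchronize() is called.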
Benefits of Using Unified Memory¶
Unified memory in CUDA offers several advantages:
- Simplified Code: Unified memory reduces the need for explicit memory copy operations, making the code easier to write and maintain.
- Automatic Memory Management: CUDA manages memory paging between CPU and GPU, which can reduce programming effort.
- Efficient Resource Utilization: Prefetching and managed memory allow developers to better utilize system and GPU resources.
However, unified memory does not always offer the best performance, especially for applications where data transfers between host and device are infrequent. In such cases, explicit cudaMemcpy() calls may provide better control and efficiency.
By combining unified memory with prefetching, developers can significantly reduce code complexity while maintaining good performance in applications where the CPU and GPU frequently share data.