GPU Architecture and NVIDIA & AMD Technologies Overview¶
GPUs have become essential in high-performance computing due to their highly parallel architecture, which is optimized for handling many tasks concurrently. Below, we explore key elements of GPU architecture, the memory hierarchy, and advanced NVIDIA and AMD technologies.
CPU vs. GPU: Architectural Differences¶
GPUs consist of thousands of smaller cores, each optimized for parallel tasks, making them well-suited for data-intensive applications such as machine learning and scientific simulations. CPUs, in contrast, have fewer, more powerful cores operating at a higher frequency, which are optimized for sequential task execution.
CPU vs GPU architecture source
GPU Core Organization: GPCs, SMs, and TPCs¶
In NVIDIA GPUs, cores are organized into GPU Processing Clusters (GPCs). Each GPC contains several Texture Processing Clusters (TPCs), and each TPC in turn holds one or more Streaming Multiprocessors (SMs).
GPCs, SMs, and TPCs source
SMs execute code in a Single Instruction, Multiple Threads (SIMT) fashion, where each SM can manage thousands of threads running simultaneously. This model is ideal for tasks that require high data parallelism.
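As a minimal illustration of the SIMT model, the CUDA sketch below launches one thread per array element; every thread runs the same kernel body but derives its own index from its block and thread coordinates (the kernel and buffer names are illustrative, not from the original text).

```cpp
#include <cuda_runtime.h>

// Every thread executes the same kernel body (SIMT) but works on a
// different element, selected by its block and thread coordinates.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n) c[i] = a[i] + b[i];                   // same instruction, different data
}

// Host-side launch: enough 256-thread blocks (CTAs) to cover all n elements.
// vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```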
GPU Memory Hierarchy¶
Threads running on an NVIDIA GPU can access several types of memory, each with unique characteristics:
- Global Memory: Resides in DRAM and is accessible by all threads across the GPU. Global memory is essential for large datasets and can be read/written by both host and device using CUDA APIs.
- Local Memory: Used mainly for register spilling and automatic variables that exceed register capacity. Local memory resides in off-chip DRAM and is cached in L1 and L2.
- L1/Shared Memory: On-chip memory within each SM, offering fast access times comparable to registers. Shared memory enables efficient data reuse and is configurable by the programmer. Proper use of shared memory can reduce global memory traffic and improve performance.
- Constant Memory: Located in off-chip memory but cached on-chip. Constant memory is read-only for threads and is typically used for frequently accessed constants.
- Texture Memory: Like constant memory, texture memory is off-chip but cached on-chip. It’s optimized for read-only access with spatial locality, making it suitable for graphics and image-processing tasks.
Each SM has its own set of registers, caches, shared memory, and load/store units to handle local data and manage memory access efficiently.
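To make the role of shared memory concrete, here is a minimal CUDA sketch of a 1D stencil: each block stages a tile of global memory (plus a one-element halo on each side) in shared memory, so neighbour reads hit fast on-chip storage instead of issuing extra global loads. The kernel name, tile size, and launch configuration are illustrative assumptions.

```cpp
#include <cuda_runtime.h>

// Three-point moving average. Assumes a launch with 256 threads per block,
// e.g. blur1d<<<(n + 255) / 256, 256>>>(d_in, d_out, n).
__global__ void blur1d(const float* in, float* out, int n) {
    __shared__ float tile[256 + 2];                  // per-block shared memory + halo
    int gid = blockIdx.x * blockDim.x + threadIdx.x; // global element index
    int lid = threadIdx.x + 1;                       // index inside the tile

    tile[lid] = (gid < n) ? in[gid] : 0.0f;          // one global load per thread
    if (threadIdx.x == 0)                            // left halo element
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)               // right halo element
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                                 // wait until the tile is filled

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```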
SM Architecture and Execution Model¶
Each SM executes instructions in groups of 32 threads called warps; all threads within a warp execute the same instruction in lockstep, which keeps the hardware efficient as long as threads do not diverge. Threads are grouped into Cooperative Thread Arrays (CTAs), the hardware term for thread blocks, which are scheduled onto SMs and execute in parallel. Efficient GPU utilization relies on high occupancy, the ratio of active warps on an SM to the maximum number of warps it supports, which can be estimated using NVIDIA’s CUDA Occupancy Calculator.
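Occupancy can also be queried programmatically from the CUDA runtime. The sketch below asks how many blocks of a chosen size can be resident on one SM for a given kernel and converts that into a warp-based occupancy figure; the saxpy kernel and the block size of 256 are illustrative assumptions.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Kernel used only so the occupancy query has something to inspect.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    int blockSize = 256;          // threads per block (one CTA)
    int blocksPerSM = 0;
    // How many CTAs of this size can be resident on one SM for this kernel?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy: %d of %d warps per SM (%.0f%%)\n",
           activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    return 0;
}
```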
NVIDIA Microarchitectures¶
NVIDIA regularly releases new GPU microarchitectures, each offering improved performance, energy efficiency, and specialized features: Tesla (2006), Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), and Ampere (2020).
Each architecture introduces advancements like tensor cores for AI workloads, improved memory handling, and enhanced CUDA support, making each generation more capable for diverse applications.
Compute Capabilities¶
NVIDIA defines a specific compute capability for each architecture, indicating the feature set available, such as tensor core support and precision options. The target capability is specified at compile time using nvcc flags such as -arch=compute_70. Compute capability determines which CUDA features a program can use, allowing applications to be optimized for specific GPU hardware.
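A program can also discover the compute capability of the installed GPU at run time; the short CUDA sketch below queries device 0 and prints its capability and SM count.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // inspect device 0
    // major/minor form the compute capability, e.g. 7.0 for Volta, 8.0 for A100.
    printf("%s: compute capability %d.%d, %d SMs\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}
```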
Unified Memory¶
NVIDIA introduced Unified Memory with CUDA 6, supported on Kepler-class GPUs and later (with hardware page faulting added in Pascal). Unified Memory lets the CPU and GPU access a shared memory address space and migrates data between them automatically, greatly reducing the need for explicit memory management.
Unified Memory source
Unified Memory is especially helpful in complex applications where memory sharing between CPU and GPU is required, such as in deep learning.
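A minimal sketch of Unified Memory in CUDA: a single managed allocation is written by the CPU, processed by a kernel, and then available to the CPU again without any explicit cudaMemcpy calls (the kernel and buffer names are illustrative).

```cpp
#include <cuda_runtime.h>

__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float)); // one pointer, valid on CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU writes directly, no cudaMemcpy
    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
    cudaDeviceSynchronize();                     // wait before the CPU reads the result

    cudaFree(data);
    return 0;
}
```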
NVLink and NVSwitch¶
NVIDIA's NVLink is a high-speed interconnect technology that enables faster data transfer between GPUs and between GPU and CPU. NVSwitch further extends this by allowing up to 16 GPUs to connect in a single system. NVLink and NVSwitch significantly outperform traditional PCIe-based interconnects, boosting performance for multi-GPU configurations.
Nvidia NVSwitch source
NVLink Performance source
NVLink is ideal for applications requiring high data throughput between GPUs, such as deep learning and scientific simulations. By reducing data transfer bottlenecks, NVLink enables GPUs to work more cohesively on shared workloads.
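When two GPUs are linked, peer-to-peer access lets one GPU address the other's memory directly, and the CUDA runtime routes such transfers over NVLink when it is available (falling back to PCIe otherwise). The sketch below shows the basic peer-access check and enablement between hypothetical devices 0 and 1.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can GPU 0 address GPU 1's memory?
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);        // GPU-to-GPU copies now bypass the host
        // cudaMemcpyPeer(dst_on_gpu0, 0, src_on_gpu1, 1, num_bytes);
    }
    printf("Peer access GPU0 -> GPU1: %s\n", canAccess ? "yes" : "no");
    return 0;
}
```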
Summary¶
GPUs, especially those built on NVIDIA’s advanced architectures, are optimized for high parallelism and data-intensive tasks. The key elements of NVIDIA GPUs include:
- Multi-core Architecture: Thousands of cores grouped into GPCs, SMs, and TPCs enable massive parallelism.
- Memory Hierarchy: A sophisticated memory system that includes global, local, shared, constant, and texture memory to handle diverse data access patterns.
- Warp-based Execution Model: The SIMT model allows efficient handling of data-parallel tasks using warps.
- Unified Memory: Simplifies memory management between CPU and GPU.
- NVLink and NVSwitch: Provide high-speed interconnects for multi-GPU setups.
NVIDIA’s microarchitectures and technologies such as Unified Memory and NVLink demonstrate a commitment to performance and ease of programming, making GPUs a powerful tool in modern computing.
AMD GPU Architecture and Technologies Overview¶
In addition to NVIDIA GPUs, AMD GPUs provide powerful alternatives in high-performance computing, with architectures optimized for parallel processing, memory management, and high data throughput. AMD’s Radeon Instinct and Radeon Pro series GPUs are widely used in scientific computing, AI, and other data-intensive applications.
Core Architecture: Compute Units and Wavefronts¶
AMD GPUs organize their cores into Compute Units (CUs). Each CU consists of multiple Stream Processors (SPs) designed for parallel data processing. GCN-based AMD GPUs execute instructions in groups of 64 threads called wavefronts (comparable to NVIDIA’s 32-thread warps), while the newer RDNA architectures also support a native 32-wide wavefront. This wavefront approach allows AMD GPUs to efficiently handle high-throughput workloads across large datasets.
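The HIP runtime exposes these hardware parameters directly; the short sketch below queries device 0 and prints its CU count and wavefront width (64 on GCN/CDNA parts, 32 on RDNA parts).

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);            // inspect device 0
    // warpSize reports the wavefront width: 64 on GCN/CDNA, 32 on RDNA GPUs.
    printf("%s: %d CUs, wavefront size %d\n",
           prop.name, prop.multiProcessorCount, prop.warpSize);
    return 0;
}
```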
Memory Hierarchy¶
AMD GPUs have a well-defined memory hierarchy tailored for high bandwidth and low-latency access:
- Global Memory: Also known as VRAM, it is off-chip memory shared among all CUs and accessible by both the CPU and GPU.
- L1 and L2 Cache: On-chip caches reduce access times to global memory, enhancing performance for frequently accessed data.
- Local Data Share (LDS): Each CU has an on-chip memory area called the Local Data Share that provides fast access to data shared within a workgroup, similar to NVIDIA’s shared memory.
This hierarchy is critical for managing data flow efficiently, allowing AMD GPUs to handle massive parallel tasks effectively.
Graphics Core Next (GCN) and RDNA Architectures¶
AMD has developed several microarchitectures optimized for compute-intensive tasks:
- Graphics Core Next (GCN): Used in earlier models, GCN architectures provided robust compute performance across AMD’s GPUs, with strong support for both graphics and parallel processing.
- RDNA (Radeon DNA) and RDNA 2: These newer architectures focus on increased efficiency, higher performance per watt, and improved compute capabilities. RDNA incorporates innovations such as high-performance cache hierarchy and enhanced SIMD processing, which are crucial for handling AI and compute workloads.
AMD ROCm Platform¶
AMD’s ROCm (Radeon Open Compute) platform is an open-source software stack designed to facilitate GPU computing. It provides tools and libraries for HPC and machine learning:
- HIP (Heterogeneous-Compute Interface for Portability): A C++ runtime API that allows code portability between AMD and NVIDIA GPUs, enabling developers to write GPU code once and deploy it on multiple architectures (see the sketch after this list).
- MIOpen: A GPU-accelerated library for deep learning, offering optimized routines for convolutional neural networks.
- rocBLAS and the ROCm math libraries: These libraries offer highly optimized routines for linear algebra, FFTs, and other mathematical operations, ideal for scientific computing.
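To illustrate the portability HIP aims for, here is a minimal HIP sketch of a vector addition; the same source compiles with hipcc for AMD GPUs or, on an NVIDIA system, maps onto the CUDA runtime. The kernel, buffer names, and launch configuration are illustrative assumptions.

```cpp
#include <hip/hip_runtime.h>

// Same kernel body as a CUDA version; only the runtime API prefix changes.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void add_on_gpu(const float* ha, const float* hb, float* hc, int n) {
    float *da, *db, *dc;
    size_t bytes = n * sizeof(float);
    hipMalloc((void**)&da, bytes);
    hipMalloc((void**)&db, bytes);
    hipMalloc((void**)&dc, bytes);
    hipMemcpy(da, ha, bytes, hipMemcpyHostToDevice);
    hipMemcpy(db, hb, bytes, hipMemcpyHostToDevice);
    vector_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    hipMemcpy(hc, dc, bytes, hipMemcpyDeviceToHost);  // also synchronizes with the kernel
    hipFree(da); hipFree(db); hipFree(dc);
}
```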
Infinity Fabric and Multi-GPU Scaling¶
Infinity Fabric is AMD’s high-speed interconnect technology, allowing data transfer between GPUs, CPUs, and other components. In multi-GPU systems, Infinity Fabric enables data sharing and collaborative processing, which is essential for large-scale computations in AI and scientific simulations.
Multi-GPU Framework¶
Using ROCm, AMD supports Multi-GPU scaling, allowing seamless cooperation between multiple AMD GPUs in a system. This setup is ideal for applications like deep learning where distributed computation is beneficial.
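At the HIP API level, multi-GPU scaling in its simplest form means enumerating the visible devices and giving each one its own buffer and kernel launch, as in the sketch below; the kernel, the even work split, and the serial per-device synchronization are illustrative simplifications.

```cpp
#include <hip/hip_runtime.h>

__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Run an independent slice of the work on every visible GPU.
void scale_on_all_gpus(float factor, int n_per_gpu) {
    int ngpus = 0;
    hipGetDeviceCount(&ngpus);
    for (int d = 0; d < ngpus; ++d) {
        hipSetDevice(d);                         // subsequent calls target GPU d
        float* buf = nullptr;
        hipMalloc((void**)&buf, n_per_gpu * sizeof(float));
        scale<<<(n_per_gpu + 255) / 256, 256>>>(buf, factor, n_per_gpu);
        hipDeviceSynchronize();                  // a real application would overlap devices
        hipFree(buf);
    }
}
```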
GPU Memory Management: Unified and Heterogeneous Memory¶
AMD’s approach to memory management includes support for Unified Memory under ROCm, allowing coherent memory access between CPU and GPU. This feature simplifies data handling across devices, especially in applications where CPU and GPU need to share large datasets efficiently.
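Under ROCm, managed allocations play the same role as CUDA's Unified Memory. The short sketch below assumes a GPU and driver that support managed memory; the buffer name is illustrative.

```cpp
#include <hip/hip_runtime.h>

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    hipMallocManaged((void**)&data, n * sizeof(float)); // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;         // the CPU writes directly
    // ... launch kernels on 'data', then hipDeviceSynchronize() before reading it back
    hipFree(data);
    return 0;
}
```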
Ray Accelerators and AI-Specific Enhancements¶
AMD’s RDNA 2 architecture introduces Ray Accelerators for improved graphics rendering and parallel processing capabilities. These accelerators enable real-time ray tracing, essential for applications in visual simulations and gaming. For AI, AMD’s CDNA compute architectures used in the Instinct accelerators add Matrix Cores to handle matrix operations efficiently, vital for machine learning workloads.
Summary¶
AMD GPUs, especially through RDNA and ROCm advancements, offer versatile and efficient computing solutions. Key features include:
- Compute Units and Wavefronts: Organize parallel tasks for efficient data handling.
- Memory Hierarchy: A layered memory system with Local Data Share, L1, and L2 caches optimizes data flow.
- ROCm Platform: An open-source software stack supporting HPC and AI with tools like HIP and MIOpen.
- Infinity Fabric: Enables high-speed communication in multi-GPU setups.
- Unified Memory: Eases memory management between CPU and GPU.
- Ray Accelerators: Provide specialized support for graphics and AI enhancements.
With these technologies, AMD GPUs present a powerful alternative for developers looking to leverage parallel processing for scientific computing, AI, and machine learning.