GPU Architecture and NVIDIA & AMD Technologies Overview

Script
  • "GPUs are central to high-performance computing because they can handle thousands of tasks simultaneously. Unlike CPUs, which feature fewer, more powerful cores designed for limited parallelism, GPUs are optimized for massively parallel processing. This makes them ideal for compute-intensive workloads like machine learning and scientific simulations."
  • "NVIDIA GPUs organize their cores into GPU Processing Clusters, or GPCs. Each GPC contains Streaming Multiprocessors, or SMs, which operate in SIMT, or single instruction, multiple threads mode. This structure enables thousands of threads to execute in parallel, maximizing data throughput."
  • "NVIDIA’s GPUs also feature a sophisticated memory hierarchy. Global memory is large and accessible by all threads, while shared memory within each SM offers fast access and supports data reuse. Other memory types, like constant and texture memory, are optimized for specific use cases, further enhancing performance."
  • "Threads in an SM are organized into warps, groups of 32 threads that execute the same instruction simultaneously. These warps are further grouped into Cooperative Thread Arrays, or CTAs, to execute in parallel. Achieving high occupancy, or active warp utilization, is key to maximizing GPU efficiency."
  • "NVIDIA’s architectural evolution, from Tesla to Ampere, has introduced features like tensor cores and advanced memory handling. Compute capabilities define each architecture’s feature set, allowing developers to optimize their applications for specific GPUs."
  • "NVIDIA’s Unified Memory simplifies programming by creating a shared memory space for the CPU and GPU. For multi-GPU setups, NVLink and NVSwitch provide high-speed interconnects, enabling seamless collaboration between GPUs, especially in deep learning applications."
  • "AMD GPUs are organized into Compute Units, which execute threads in wavefronts of 64 threads, similar to NVIDIA’s warps. Their memory hierarchy includes VRAM for global access and Local Data Share for rapid access to data shared within a workgroup, making them efficient for high-throughput tasks."
  • "AMD’s architectures, including GCN and RDNA, are optimized for diverse workloads. The ROCm platform supports cross-platform GPU programming with tools like HIP, while libraries such as MIOpen and rocBLAS enable deep learning and scientific applications."
  • "AMD’s Infinity Fabric interconnects GPUs and CPUs for high-speed data sharing, essential for multi-GPU configurations. Combined with ROCm’s Unified Memory, this enables seamless CPU-GPU collaboration in applications like deep learning, where memory sharing is critical."
  • "In summary, NVIDIA and AMD GPUs offer powerful architectures tailored for high-performance computing. NVIDIA’s GPC and SM structure delivers massive parallelism and NVLink simplifies multi-GPU scaling, while AMD’s Compute Units and ROCm platform enable efficient collaboration across GPUs, making both families essential for modern computational challenges."

GPUs have become essential in high-performance computing due to their highly parallel architecture, which is optimized for handling multiple tasks concurrently. Below, we explore key elements of GPU architecture, memory hierarchy, and advanced NVIDIA technologies.

CPU vs. GPU: Architectural Differences

GPUs consist of thousands of smaller cores, each optimized for parallel tasks, making them well-suited for data-intensive applications such as machine learning and scientific simulations. CPUs, in contrast, have fewer, more powerful cores operating at a higher frequency, which are optimized for sequential task execution.

Figure: CPU vs. GPU architecture.

GPU Core Organization: GPCs, SMs, and TPCs

In NVIDIA GPUs, cores are organized into GPU Processing Clusters (GPCs). Each GPC contains multiple Texture Processor Clusters (TPCs), and each TPC in turn contains one or more Streaming Multiprocessors (SMs).

Figure: GPCs, SMs, and TPCs.

SMs execute code in a Single Instruction, Multiple Threads (SIMT) fashion, where each SM can keep up to a couple of thousand threads resident at once (the exact limit depends on the architecture). This model is ideal for tasks that require high data parallelism.

GPU Memory Hierarchy

Threads running on an NVIDIA GPU can access several types of memory, each with unique characteristics:

  • Global Memory: Resides on DRAM and is accessible by all threads across the GPU. Global memory is essential for large datasets and can be read/written by both host and device using CUDA APIs.

  • Local Memory: Used mainly for register spilling and automatic variables that exceed register capacity. Local memory resides in off-chip DRAM and is cached in L1 and L2.

  • L1/Shared Memory: On-chip memory within each SM, with far lower latency than global memory. Shared memory is explicitly managed by the programmer, and on many architectures the split between L1 cache and shared memory is configurable. It enables efficient data reuse among the threads of a block; proper use of shared memory can reduce global memory traffic and improve performance.

  • Constant Memory: Located in off-chip memory but cached on-chip. Constant memory is read-only for threads and is typically used for frequently accessed constants.

  • Texture Memory: Like constant memory, texture memory is off-chip but cached on-chip. It’s optimized for read-only access with spatial locality, making it suitable for graphics and image-processing tasks.

Each SM has its own set of registers, caches, shared memory, and load/store units to handle local data and manage memory access efficiently.
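
To make the benefit of shared memory concrete, here is a back-of-the-envelope sketch in plain Python (not CUDA) that counts global memory loads for a square matrix multiplication, first with every operand read from global memory and then with tiles staged in shared memory. The function names, the requirement that the matrix size be a multiple of the tile size, and the decision to ignore caches are all simplifying assumptions made for illustration.

```python
def global_loads_naive(n):
    """Global loads for an n x n matrix multiply with no data reuse.

    Each of the n*n output elements reads one row of A and one column
    of B, i.e. 2*n values from global memory.
    """
    return 2 * n * n * n


def global_loads_tiled(n, tile):
    """Global loads when tile x tile blocks are staged in shared memory.

    Assumption: n is a multiple of tile. Each value brought into shared
    memory is reused by `tile` threads, so traffic drops by that factor.
    """
    if tile <= 0 or n % tile != 0:
        raise ValueError("n must be a positive multiple of tile in this model")
    return 2 * n * n * n // tile
```

Under this model, a 1024 x 1024 multiplication with 16 x 16 tiles issues 16 times fewer global loads than the version without reuse. Real kernels will differ because of caching and memory coalescing.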

SM Architecture and Execution Model

Each SM schedules and executes threads in groups of 32, called warps. All threads within a warp execute the same instruction at a time, which is what makes SIMT execution efficient. Threads are organized into blocks, also called Cooperative Thread Arrays (CTAs); each block runs on a single SM, and an SM can host several blocks concurrently. Efficient GPU utilization relies on high occupancy, the ratio of active warps on an SM to the maximum number of warps that SM supports, which can be estimated using NVIDIA’s CUDA Occupancy Calculator.
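
The occupancy ratio can be sketched in a few lines of Python. This is a simplified model, not a replacement for NVIDIA's calculator: it considers only the warp and block limits of an SM and ignores register and shared memory usage, which often limit occupancy in practice. The default limits (64 warps and 32 blocks per SM) are assumed example values; the real limits depend on the compute capability.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_per_block(threads_per_block, warp_size=WARP_SIZE):
    """Number of warps needed for one block (partial warps count as whole)."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    return math.ceil(threads_per_block / warp_size)


def theoretical_occupancy(threads_per_block,
                          max_warps_per_sm=64,   # assumed example limit
                          max_blocks_per_sm=32): # assumed example limit
    """Active warps on an SM divided by the maximum warps the SM supports.

    Only the warp and block limits are modeled; registers and shared
    memory are deliberately ignored in this sketch.
    """
    wpb = warps_per_block(threads_per_block)
    resident_blocks = min(max_blocks_per_sm, max_warps_per_sm // wpb)
    return resident_blocks * wpb / max_warps_per_sm
```

With these assumed limits, blocks of 256 threads reach an occupancy of 1.0, while blocks of 32 threads reach only 0.5, because the block limit is hit before the warp limit.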

NVIDIA Microarchitectures

NVIDIA regularly releases new GPU microarchitectures, each offering improved performance, energy efficiency, and specialized features: Tesla (2006), Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), and Ampere (2020).

Successive architectures have introduced advancements such as Tensor Cores for AI workloads (first available in Volta), improved memory handling, and enhanced CUDA support, making each generation more capable for diverse applications.

Compute Capabilities

NVIDIA assigns each GPU a compute capability, a version number that indicates the available feature set, such as Tensor Core support and precision options. The target compute capability is specified at compile time using nvcc flags like -arch=compute_70 (compute capability 7.0, Volta). Compute capability dictates which CUDA features a program can use, so targeting the right value lets an application take advantage of specific GPU hardware.
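
As a rough illustration of how compute capabilities relate to architectures and compiler flags, the Python sketch below maps each architecture named above to the compute capability of its first GPUs and formats the corresponding flag. The helper names are hypothetical, the table lists only the first compute capability of each generation, and the code does not check whether a given CUDA toolkit still accepts an older target.

```python
# Compute capability (major, minor) of the first GPUs in each architecture.
FIRST_COMPUTE_CAPABILITY = {
    "Tesla": (1, 0),
    "Fermi": (2, 0),
    "Kepler": (3, 0),
    "Maxwell": (5, 0),
    "Pascal": (6, 0),
    "Volta": (7, 0),
    "Turing": (7, 5),
    "Ampere": (8, 0),
}


def arch_flag(major, minor):
    """Format a flag in the style of -arch=compute_70 (hypothetical helper)."""
    return f"-arch=compute_{major}{minor}"


def has_tensor_cores(major, minor):
    """Tensor Cores first appeared with compute capability 7.0 (Volta)."""
    return (major, minor) >= (7, 0)
```

For example, arch_flag(*FIRST_COMPUTE_CAPABILITY["Volta"]) yields the flag quoted in the text, -arch=compute_70.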

Unified Memory

NVIDIA introduced Unified Memory with CUDA 6, supported on Kepler and later GPUs. It allows both CPU and GPU to access a single shared memory address space, simplifying memory management. Starting with the Pascal architecture, hardware page faulting lets data migrate on demand. Unified Memory handles data transfer automatically, reducing the need for explicit copies between host and device.

Figure: Unified Memory.

Unified Memory is especially helpful in complex applications where memory sharing between CPU and GPU is required, such as in deep learning.

NVLink and NVSwitch

NVIDIA's NVLink is a high-speed interconnect technology that enables faster data transfer between GPUs and, on supported platforms, between GPU and CPU. The first-generation NVSwitch extends this by allowing up to 16 GPUs to be fully connected in a single system. NVLink and NVSwitch offer substantially higher bandwidth than traditional PCIe-based interconnects, boosting performance for multi-GPU configurations.

Figure: NVIDIA NVSwitch.

Figure: NVLink performance.

NVLink is ideal for applications requiring high data throughput between GPUs, such as deep learning and scientific simulations. By reducing data transfer bottlenecks, NVLink enables GPUs to work more cohesively on shared workloads.
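
To get a feel for why interconnect bandwidth matters, the following Python sketch estimates how long a transfer takes at a given peak bandwidth. The two bandwidth constants are assumed, approximate per-direction peak figures (PCIe 3.0 x16 and the six NVLink 2.0 links of a V100); sustained throughput is lower in practice, so treat the result as a rough upper bound on the benefit and check vendor specifications for real hardware.

```python
# Assumed approximate peak bandwidths per direction, in GB/s.
PCIE3_X16_GB_PER_S = 16.0      # PCIe 3.0 x16 (approximate)
NVLINK2_V100_GB_PER_S = 150.0  # six NVLink 2.0 links on a V100 (approximate)


def transfer_seconds(num_bytes, bandwidth_gb_per_s):
    """Idealized transfer time: size divided by peak bandwidth.

    Latency and protocol overhead are ignored in this sketch.
    """
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / (bandwidth_gb_per_s * 1e9)


def speedup(num_bytes, slow_gb_per_s, fast_gb_per_s):
    """Ratio of transfer times between a slower and a faster link."""
    return (transfer_seconds(num_bytes, slow_gb_per_s)
            / transfer_seconds(num_bytes, fast_gb_per_s))
```

Under these assumed figures, moving 16 GB takes about one second over PCIe 3.0 x16 and roughly a ninth of that over NVLink.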

Summary

GPUs, especially those built on NVIDIA’s advanced architectures, are optimized for high parallelism and data-intensive tasks. The key elements of NVIDIA GPUs include:

  • Multi-core Architecture: Thousands of cores grouped into SMs, TPCs, and GPCs enable massive parallelism.
  • Memory Hierarchy: A sophisticated memory system that includes global, local, shared, constant, and texture memory to handle diverse data access patterns.
  • Warp-based Execution Model: The SIMT model allows efficient handling of data-parallel tasks using warps.
  • Unified Memory: Simplifies memory management between CPU and GPU.
  • NVLink and NVSwitch: Provide high-speed interconnects for multi-GPU setups.

NVIDIA’s microarchitectures and technologies such as Unified Memory and NVLink demonstrate a commitment to performance and ease of programming, making GPUs a powerful tool in modern computing.


AMD GPU Architecture and Technologies Overview

In addition to NVIDIA GPUs, AMD GPUs provide powerful alternatives in high-performance computing, with architectures optimized for parallel processing, memory management, and high data throughput. AMD’s Radeon Instinct and Radeon Pro series GPUs are widely used in scientific computing, AI, and other data-intensive applications.

Core Architecture: Compute Units and Wavefronts

AMD GPUs organize their cores into Compute Units (CUs). Each CU consists of multiple Stream Processors (SPs), designed for parallel data processing. AMD’s GCN architecture executes instructions in groups of 64 threads called wavefronts (comparable to NVIDIA’s warps of 32 threads); RDNA additionally supports 32-thread wavefronts. This wavefront approach allows AMD GPUs to efficiently handle high-throughput workloads across large datasets.
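
The practical effect of the group width (32 for a warp, 64 for a wavefront) can be sketched with a little arithmetic in Python: threads are issued in whole groups, so a thread count that is not a multiple of the width leaves some lanes idle. The function names below are illustrative and not part of any AMD or NVIDIA API.

```python
import math


def execution_groups(num_threads, group_width):
    """Number of warps or wavefronts needed to cover num_threads."""
    if num_threads <= 0 or group_width <= 0:
        raise ValueError("num_threads and group_width must be positive")
    return math.ceil(num_threads / group_width)


def lane_utilization(num_threads, group_width):
    """Fraction of lanes doing useful work across all issued groups."""
    groups = execution_groups(num_threads, group_width)
    return num_threads / (groups * group_width)
```

For example, 96 threads fill three 32-wide warps completely, but occupy only 75% of the lanes in two 64-wide wavefronts, which is one reason to choose workgroup sizes that are multiples of the hardware group width.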

Memory Hierarchy

AMD GPUs have a well-defined memory hierarchy tailored for high bandwidth and low-latency access:

  • Global Memory: Also known as VRAM, it is off-chip memory shared among all CUs and accessible by both the CPU and GPU.
  • L1 and L2 Cache: On-chip caches reduce access times to global memory, enhancing performance for frequently accessed data.
  • Local Data Share (LDS): Each CU has an on-chip memory area called Local Data Share for fast access to data shared among the threads of a workgroup, similar to NVIDIA’s shared memory.

This hierarchy is critical for managing data flow efficiently, allowing AMD GPUs to handle massive parallel tasks effectively.

Graphics Core Next (GCN) and RDNA Architectures

AMD has developed several microarchitectures optimized for compute-intensive tasks:

  • Graphics Core Next (GCN): Used in earlier models, GCN architectures provided robust compute performance across AMD’s GPUs, with strong support for both graphics and parallel processing.
  • RDNA (Radeon DNA) and RDNA 2: These newer architectures focus on increased efficiency, higher performance per watt, and improved compute capabilities. RDNA incorporates innovations such as a redesigned, high-performance cache hierarchy and enhanced SIMD processing, which benefit both graphics and compute workloads.

AMD ROCm Platform

AMD’s ROCm (Radeon Open Compute) platform is an open-source software stack designed to facilitate GPU computing. It provides tools and libraries for HPC and machine learning:

  • HIP (Heterogeneous-Compute Interface for Portability): A C++ runtime API that allows code portability between AMD and NVIDIA GPUs, enabling developers to write GPU code once and deploy it on multiple architectures.
  • MIOpen: A GPU-accelerated library for deep learning, offering optimized routines for convolutional neural networks.
  • rocBLAS and other ROCm math libraries: These libraries offer highly optimized routines for linear algebra, FFTs, and other mathematical operations, ideal for scientific computing.

Infinity Fabric and Multi-GPU Scaling

Infinity Fabric is AMD’s high-speed interconnect technology, allowing data transfer between GPUs, CPUs, and other components. In multi-GPU systems, Infinity Fabric enables data sharing and collaborative processing, which is essential for large-scale computations in AI and scientific simulations.

Multi-GPU Framework

Using ROCm, AMD supports multi-GPU scaling, allowing multiple AMD GPUs in a system to cooperate on a single workload. This setup is ideal for applications like deep learning where distributed computation is beneficial.

GPU Memory Management: Unified and Heterogeneous Memory

AMD’s approach to memory management includes support for Unified Memory under ROCm, allowing coherent memory access between CPU and GPU. This feature simplifies data handling across devices, especially in applications where CPU and GPU need to share large datasets efficiently.

Ray Accelerators and AI-Specific Enhancements

AMD’s RDNA 2 architecture introduces Ray Accelerators, dedicated hardware units for ray-intersection tests. These accelerators enable real-time ray tracing, essential for applications in visual simulations and gaming. For AI, AMD has developed Matrix Cores, found in its CDNA-based Instinct accelerators, to handle matrix operations efficiently, which is vital for machine learning workloads.

Summary

AMD GPUs, especially through RDNA and ROCm advancements, offer versatile and efficient computing solutions. Key features include:

  • Compute Units and Wavefronts: Organize parallel tasks for efficient data handling.
  • Memory Hierarchy: A layered memory system with Local Data Share, L1, and L2 caches optimizes data flow.
  • ROCm Platform: An open-source software stack supporting HPC and AI with tools like HIP and MIOpen.
  • Infinity Fabric: Enables high-speed communication in multi-GPU setups.
  • Unified Memory: Eases memory management between CPU and GPU.
  • Ray Accelerators and Matrix Cores: Provide specialized hardware for ray tracing and for AI matrix operations, respectively.

With these technologies, AMD GPUs present a powerful alternative for developers looking to leverage parallel processing for scientific computing, AI, and machine learning.