Computer Architecture
Script
- "Parallel processing has a rich history that dates back to the 19th century. Early concepts, like Charles Babbage’s analytical engine, introduced ideas of parallel computation. By 1910, these ideas were being explored to speed up tasks such as multiplication, laying the foundation for modern parallel computing."
- "The Von Neumann architecture, a foundational model for sequential processing, introduced a bottleneck due to limited data flow between memory and the processor. To address this limitation, modern systems incorporate memory banks, enabling parallel I/O operations and reducing the bottleneck’s impact."
- "Parallelization can be achieved through two primary methods. The first is vectorization, which enables parallel operations within a single processor by optimizing data-level parallelism. The second is multiprocessing, where multiple processors work together. Performance in these systems is often measured using benchmarks like LINPACK, which evaluates floating-point operations for solving linear equations."
- "Today’s multicore CPUs integrate multiple processing cores on a single chip, each with its own L2 cache and a shared L3 cache. This architecture ensures efficient memory access, as seen in processors like the dual-core Intel Xeon, which is optimized for modern workloads."
- "Flynn’s taxonomy provides a useful framework for classifying parallel architectures. It ranges from SISD, or single instruction, single data, which is purely sequential, to SIMD, which enables data-level parallelism. MISD is rarely used, but MIMD, or multiple instruction, multiple data, is the basis for multicore and distributed systems."
- "Shared-memory architectures allow multiple processors to access the same memory space. UMA provides equal memory access time for all processors, while NUMA offers faster local memory access. Distributed shared memory combines features of both, balancing speed and flexibility."
- "Distributed memory architectures are key to modern supercomputers. These systems use high-speed networks to connect nodes, with topologies like Fat-Tree and 3D Torus ensuring efficient data flow. Programming models like MPI enable scalable data exchange for applications in large-scale parallel computing."
Parallel processing concepts existed even in the pre-electronic computing era of the 19th century. For instance, Babbage considered parallel processing for speeding up the multiplication of two numbers in his analytical engine (Babbage’s Analytical Engine, 1910).
The Von Neumann architecture, illustrated in Figure 1 (a), established the foundation of sequential processing in high-speed computing. Despite enabling fast computation, it suffers from the Von Neumann bottleneck (Alan Huang, 1984): the limited rate of data transfer between memory and the processor constrains overall performance. Recent designs incorporate multiple memory banks that allow parallel I/O to mitigate this bottleneck.
Parallelization can generally be achieved in two ways: vectorization within a single processor or multiprocessing across multiple processors; a short sketch of both approaches appears after Figure 1. The performance of a computer is often measured by its floating-point throughput, as ranked by benchmarks like LINPACK (Jack J. Dongarra et al., 1979), which measures the time taken to solve a dense system of linear equations.
Figure 1: (a) The Von Neumann architecture; (b) Early dual-core Intel Xeon
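To make the two approaches concrete, here is a minimal C sketch (not from the original text): the first loop is written so that an optimizing compiler can vectorize it within one core, while the second uses OpenMP to spread the same work across cores. The kernel, array size, and compiler flags in the comments are illustrative assumptions.

```c
/* Sketch: the same loop exploited two ways (assumes an OpenMP-capable
 * C compiler, e.g. gcc -O3 -fopenmp vec_mp.c -o vec_mp). */
#include <omp.h>
#include <stdio.h>

#define N 1000000

/* Vectorization: with optimization enabled, most compilers map this loop
 * onto SIMD registers inside a single core (data-level parallelism). */
static void saxpy_vectorized(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Multiprocessing: OpenMP splits the iteration space across cores
 * (thread-level parallelism); each thread may still vectorize its chunk. */
static void saxpy_multithreaded(float a, const float *x, float *y, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy_vectorized(2.0f, x, y, N);
    saxpy_multithreaded(2.0f, x, y, N);
    printf("y[0] = %f, threads available: %d\n", y[0], omp_get_max_threads());
    return 0;
}
```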
Multicore CPU
A multicore processor integrates two or more individual cores onto a single chip. Multicore CPUs have become the norm, and manufacturers continue to raise the number of cores per chip. Figure 1 (b) shows a dual-core Intel Xeon processor from 2005. Modern multicore CPUs have an individual L2 cache for each core and an L3 cache shared across cores. A memory controller manages access to the memory banks in DRAM.
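As a small, hedged aside, the snippet below shows one way software can discover the core count just described; it assumes a Linux or other Unix-like system where the `_SC_NPROCESSORS_ONLN` query is supported.

```c
/* Query the number of online cores via POSIX sysconf (Linux/Unix). */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long cores = sysconf(_SC_NPROCESSORS_ONLN);  /* cores currently online */
    if (cores < 1) {
        perror("sysconf");
        return 1;
    }
    printf("Online cores: %ld\n", cores);
    return 0;
}
```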
According to Flynn’s taxonomy, computer architectures can be classified as:
- SISD (Single Instruction Single Data): A simple, sequential model.
- SIMD (Single Instruction Multiple Data): Executes the same instruction on multiple data elements, achieving data-level parallelism (a minimal intrinsics sketch follows this list).
- MISD (Multiple Instruction Single Data): Rarely used, as it is difficult to map general programs to this architecture.
- MIMD (Multiple Instruction Multiple Data): Allows multiple instructions on multiple data, ideal for multicore CPUs that support thread-level parallelism.
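To make the SIMD category concrete, the sketch below uses x86 SSE intrinsics so that a single instruction adds four floats at once; it assumes an x86-64 compiler and is meant only to illustrate data-level parallelism, not any particular implementation discussed above.

```c
/* SIMD sketch: one SSE instruction operates on four floats at once
 * (assumes x86-64 with SSE; compile e.g. with gcc -O2 simd_add.c). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* single instruction, 4 additions */
        _mm_storeu_ps(&c[i], vc);
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);             /* expected: 9 9 9 9 9 9 9 9 */
    printf("\n");
    return 0;
}
```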
Figure 2: (a) Memory hierarchy; (b) SIMD model
The memory hierarchy in CPUs balances capacity against access time. On-chip caches provide rapid access to frequently used data, while off-chip memory offers larger but slower storage. Figure 2 (a) illustrates a typical memory hierarchy, with CPU cores checking the caches before falling back to main memory.
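The hedged sketch below illustrates why this hierarchy matters in practice: traversing the same array sequentially reuses each fetched cache line fully, while the strided order fetches lines repeatedly across passes and is typically much slower. The array size, stride, and timing method are arbitrary choices for illustration.

```c
/* Cache-effect sketch: sequential vs. strided traversal of the same data
 * (timings depend on the machine; numbers are only illustrative). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M ints (64 MB), larger than typical L3 caches */
#define STRIDE 16     /* 16 ints = 64 bytes: roughly one cache line per step */

int main(void) {
    int *a = malloc((size_t)N * sizeof(int));
    if (!a) return 1;
    for (int i = 0; i < N; i++) a[i] = 1;

    clock_t t0 = clock();
    long sum1 = 0;
    for (int i = 0; i < N; i++) sum1 += a[i];            /* cache-friendly order */
    clock_t t1 = clock();

    long sum2 = 0;
    for (int s = 0; s < STRIDE; s++)                     /* cache-hostile order */
        for (int i = s; i < N; i += STRIDE) sum2 += a[i];
    clock_t t2 = clock();

    printf("sequential: %.3fs  strided: %.3fs  (sums %ld %ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum1, sum2);
    free(a);
    return 0;
}
```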
Shared Memory Architectures
In shared-memory architectures, multiple processors access a shared memory space. These architectures can be further categorized as:
- Uniform Memory Access (UMA): Also known as Symmetric Multi-Processing (SMP), where memory access time is uniform across all processors.
- Non-Uniform Memory Access (NUMA): Memory is divided into segments for each processor, with faster access to local segments and slower access to remote segments.
- Distributed Shared Memory: Combines elements of both UMA and NUMA, allowing processors to access distributed memory with consistency maintained through a memory management unit.
Figure 3: Shared memory architectures
Figure 3 shows examples of these architectures, highlighting the organization and data access patterns of each.
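As a hedged illustration of the shared-memory model, the OpenMP sketch below lets all threads work on a single shared array; the first-touch initialization comment reflects a common NUMA consideration, though actual page placement is operating-system dependent and the sizes here are arbitrary.

```c
/* Shared-memory sketch with OpenMP: all threads see the same array.
 * Compile e.g. with gcc -O2 -fopenmp shared.c (illustrative only). */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000000

int main(void) {
    double *data = malloc(N * sizeof(double));
    if (!data) return 1;

    /* Parallel first touch: on NUMA systems each page is typically placed
     * in the memory local to the thread that first writes it. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        data[i] = 0.0;

    /* All threads read and write the same shared address space. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        data[i] += 1.0;

    printf("data[0]=%.1f, computed with up to %d threads\n",
           data[0], omp_get_max_threads());
    free(data);
    return 0;
}
```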
Distributed Memory Architectures
Modern supercomputers are commonly based on distributed memory architecture, where compute nodes are linked via high-speed networks. Each node has its own processors and memory, and data is exchanged between nodes using message-passing protocols. The efficiency of this architecture depends heavily on the network topology and the programming model.
Popular network topologies include 3D Torus, Fat-Tree, and Dragonfly. Among these, Fat-Tree topology is widely used due to its versatility and high bandwidth. Figure 4 illustrates a distributed memory architecture, where each node communicates with others over a network.
Programming models like MPI (Message Passing Interface) support distributed memory architectures by providing a standardized way for processes to communicate. Libraries such as MVAPICH2, OpenMPI, MPICH, and Intel MPI allow developers to write parallel applications that scale across large supercomputers. While these libraries offer similar functionality, performance may vary based on the network and application design.
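To ground this, here is a minimal, hedged MPI example of the message-passing model: each process (rank) holds its own data, and a collective call combines the partial results. It should build with any of the libraries listed above (e.g. `mpicc reduce.c && mpirun -np 4 ./a.out`), though the file name and process count are arbitrary.

```c
/* Distributed-memory sketch: each MPI process (rank) owns its own memory
 * and exchanges data only through messages / collectives. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    /* Each rank contributes a local value; Allreduce sums them everywhere. */
    double local = (double)(rank + 1);
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum over %d ranks: %.1f\n", size, global);

    MPI_Finalize();
    return 0;
}
```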