Introduction to Parallel Programming
Parallel programming enables computations to execute simultaneously, making it possible to solve larger problems or accelerate computations significantly. This paradigm is essential in high-performance computing (HPC), where complex simulations, large-scale data processing, and real-time applications benefit from dividing work across multiple processors. Parallel programming can broadly be classified into two main models based on the memory architecture they target: Shared Memory Programming and Distributed Memory Programming. GPGPU programming resembles shared memory programming, but it launches many lightweight threads at once and therefore follows the SIMT (Single Instruction, Multiple Threads) model, which provides massive parallel processing capacity.
Shared Memory Programming - OpenMP
OpenMP (Open Multi-Processing) is a widely used API that supports multi-platform shared-memory multiprocessing programming. It is designed to simplify the development of parallel applications by providing compiler directives, runtime library routines, and environment variables to control the execution of parallel code. OpenMP uses thread-level parallelism, allowing programmers to specify which parts of their code should run in parallel, often without significant structural changes to the code.
OpenMP's parallel execution model can be understood as a fork-join process:
- Fork: At the beginning of a parallel section, the program forks, creating a team of threads (usually one thread per available CPU core). These threads perform computations in parallel.
- Join: At the end of the parallel section, the threads synchronize; the additional threads are released, and execution continues on the master thread alone.
In OpenMP:
- Compiler Directives (e.g., `#pragma omp parallel`) are used to specify parallel regions, loop parallelism, and synchronization between threads.
- Runtime Library Routines provide functions for managing threads, setting the number of threads, and controlling parallel execution behavior.
- Environment Variables control runtime behavior, such as the number of threads created, loop scheduling, and more.
In the fork-join model (illustrated in Figure 1), the master thread spawns additional threads to execute code concurrently in the specified parallel region, after which only the master thread continues. This model is flexible, making OpenMP suitable for applications that benefit from dividing a single workload into parallel tasks.
Figure 1: Fork-Join parallel approach (using the OpenMP programming model)
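A minimal sketch of this fork-join behavior in C might look as follows (the printed messages are illustrative; `omp_get_thread_num` and `omp_get_num_threads` are standard OpenMP runtime routines):

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("Before the parallel region: master thread only\n");

    /* Fork: a team of threads is created for this region. */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();        /* this thread's ID   */
        int nthreads = omp_get_num_threads(); /* size of the team   */
        printf("Hello from thread %d of %d\n", id, nthreads);
    }
    /* Join: the team synchronizes; only the master continues. */

    printf("After the parallel region: master thread only\n");
    return 0;
}
```

Compiled with an OpenMP-capable compiler (e.g., `gcc -fopenmp`), the block between the braces runs once per thread, while everything outside it runs on the master thread alone.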
OpenMP can be applied in two primary shared memory models:
- Uniform Memory Access (UMA): All processors access memory at a uniform speed, ensuring a consistent latency between CPUs and memory. This is typical in systems with a single, centralized memory.
- Non-Uniform Memory Access (NUMA): Memory is divided among processors, and each processor accesses its memory region faster than those belonging to other processors. This architecture requires careful management of data locality to avoid delays due to non-local memory access.
Figure 2: Example of shared memory programming: (left) uniform memory access; (right) non-uniform memory access
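On NUMA systems, one common way to manage data locality from OpenMP is the first-touch policy: a memory page is typically placed on the NUMA node of the thread that first writes it (the default behavior on Linux, which the sketch below assumes). Initializing data in parallel with the same schedule as the compute loops keeps each thread's data local:

```c
#include <stdlib.h>

#define N 50000000

int main(void) {
    double *a = malloc(N * sizeof(double));

    /* First touch: each thread writes its own chunk, so the pages it
       touches are allocated on its local NUMA node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        a[i] = 0.0;

    /* The same static schedule later assigns each thread the same chunk,
       so the compute loop reads mostly node-local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        a[i] += 1.0;

    free(a);
    return 0;
}
```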
Distributed Memory Programming Model - Message Passing Interface (MPI)
Unlike shared memory systems, where all processors access the same memory space, the Distributed Memory Model involves multiple processors, each with its own local memory. This model is common in clusters and supercomputers, where each node (or processor) communicates with others by passing messages.
Message Passing Interface (MPI) is the standard library used in distributed memory systems. MPI is particularly effective in Single Program Multiple Data (SPMD) applications, where each process executes the same program independently on different data subsets. This model achieves high levels of parallelism by dividing the workload across processors that communicate only when necessary, making it ideal for applications that need to scale across large clusters.
Key components of the MPI model include:
- Processes and Communication: Each process runs independently with its own memory, and all communication occurs through explicit message-passing functions provided by MPI. This allows precise control over data flow and synchronization between nodes.
- MPI Libraries: MPI provides a set of routines for initiating processes, synchronizing them, and exchanging data. Commonly used routines include `MPI_Send` and `MPI_Recv` for point-to-point communication, and collective operations like `MPI_Bcast` and `MPI_Reduce` for broadcasting and reducing data across processes.
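As a quick sketch of a collective operation, the following program sums one value from every rank onto rank 0 using `MPI_Reduce` (the contributed values are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process contributes its own rank; MPI_Reduce combines the
       contributions with MPI_SUM and delivers the result to rank 0. */
    int contribution = rank, total = 0;
    MPI_Reduce(&contribution, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of all ranks: %d\n", total);

    MPI_Finalize();
    return 0;
}
```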
An example of simple message passing in MPI, as shown in Figure 3, demonstrates two processes exchanging data directly. In practice, MPI enables a high degree of control over communication, which is necessary for efficient distributed computing on large systems.
Figure 3: Simple MPI (message passing interface)
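A minimal SPMD sketch of the exchange in Figure 3 might look like this (run with two processes, e.g., `mpirun -np 2`; the message contents are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        /* Rank 0 sends one integer to rank 1 (message tag 0). */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 blocks until the message from rank 0 arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Both ranks execute the same program; the branch on `rank` is what makes the Single Program Multiple Data style work.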
MPI’s flexibility allows it to run on various hardware setups, from small clusters to massive supercomputers. However, this course does not cover MPI or OpenMP in detail. For those interested in mastering these models, additional courses are recommended.
GPU Programming - General-Purpose Computing on GPUs (GPGPU)
GPGPU programming leverages the parallel architecture of GPUs to accelerate computations beyond traditional graphics tasks. Modern GPUs contain thousands of cores designed for massive parallel processing, making them ideal for scientific computing, machine learning, and simulations.
In GPGPU programming:
- Programming Models: CUDA (for NVIDIA GPUs) and OpenCL (for cross-platform development) provide the APIs to manage memory, transfer data, and execute functions (kernels) on the GPU.
- Kernels and Threads: A kernel is a function that runs on the GPU, operating on multiple data elements in parallel. Each thread on the GPU performs part of the computation, enabling highly parallel execution.
- Memory Management: Efficient GPGPU programming requires careful management of data transfer between CPU (host) and GPU (device) memory, as this transfer can be a bottleneck.
A typical GPGPU program flow includes:
- Transferring data from the CPU to the GPU.
- Launching a kernel to perform parallel computations on the GPU.
- Retrieving the results from the GPU back to the CPU.
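A minimal CUDA C sketch of this three-step flow might look like the following (the kernel, array size, and launch configuration are illustrative; error checking is omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

/* Kernel: each GPU thread doubles one element of the array. */
__global__ void doubleElements(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void) {
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i)
        host[i] = (float)i;

    /* Step 1: transfer data from the CPU (host) to the GPU (device). */
    float *device = NULL;
    cudaMalloc((void **)&device, n * sizeof(float));
    cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);

    /* Step 2: launch the kernel; GPU threads compute in parallel. */
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    doubleElements<<<blocks, threadsPerBlock>>>(device, n);

    /* Step 3: retrieve the results from the GPU back to the CPU. */
    cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device);

    printf("host[1] = %.1f (expected 2.0)\n", host[1]);
    return 0;
}
```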
This approach enables significant speedup over CPU-only processing, but requires specialized knowledge of parallel programming and GPU architecture. GPGPU programming has become essential in fields requiring high computational throughput, such as deep learning, scientific simulations, and data analysis.
Figure 4: GPGPU program flow between CPU and GPU
By understanding and implementing these parallel programming models (shared memory, distributed memory, and GPGPU), developers can achieve significant performance improvements and scalability in computational applications, from scientific computing to artificial intelligence and data analysis.