Matrix Multiplication¶
Script
- "In this presentation, we’ll explore how OpenMP Offloading can be used to execute matrix multiplication on GPUs. We’ll examine examples in both C/C++ and Fortran, discussing various approaches to parallelizing this operation, optimizing GPU resource utilization, and efficiently managing data transfers."
- "Our first option uses the target parallel for construct with collapse(2). The collapse clause merges the row and column loops, enabling GPU threads to handle computations for multiple rows and columns in parallel. This approach maximizes thread usage, making it suitable for handling large matrices."
- "This is the compiler output for Option 1. The compiler generates a GPU kernel for matrix multiplication, parallelizing the row and column loops across GPU threads. The map clause facilitates memory transfers between the CPU and GPU, ensuring arrays a, b, and c are appropriately transferred."
- "In Option 2, we introduce the teams distribute construct, which creates a league of teams. Each team processes a subset of the matrix, with threads within the team computing the necessary products and sums. This hierarchical structure provides more control over workload distribution across GPU cores, improving efficiency for larger computations."
- "Option 3 extends the previous example by specifying num_teams(5), which sets the number of teams to five. This adjustment allows us to explicitly control the distribution of workloads across GPU compute units. By fine-tuning the number of teams, we can optimize resource utilization and improve performance for large matrices."
- "In Fortran, we can implement similar strategies for offloading. Option 1 uses the target loop collapse(2) construct, parallelizing the row and column loops for efficient execution across GPU threads. This approach ensures that computations are evenly distributed across available resources."
- "Option 2 in Fortran introduces the teams loop construct, which organizes GPU threads into teams. Each team is responsible for processing a portion of the matrix, improving load distribution and making the approach well-suited for larger matrices where team-based execution provides significant performance gains."
- "In the final Fortran example, we specify num_teams(5) to create five teams on the GPU. This directive allows us to balance the workload across multiple compute units, optimizing performance for systems with extensive parallel processing capabilities."
- "To summarize, we explored three options for offloading matrix multiplication to GPUs in both C/C++ and Fortran. Option 1 utilizes collapse(2) for thread-level parallelism, offering simplicity and efficiency. Option 2 incorporates teams for hierarchical parallelism, enhancing workload distribution. Option 3 further refines this by specifying the number of teams, allowing for fine-tuned resource allocation and improved performance on GPUs."
- "In conclusion, OpenMP Offloading provides a powerful and flexible framework for matrix multiplication. By combining constructs like target, teams, parallel, and collapse, we can optimize GPU utilization, ensure balanced workload distribution, and manage data transfers effectively. These techniques demonstrate how OpenMP enables high-performance computing on heterogeneous systems."
This document explores various OpenMP offloading constructs used to parallelize matrix multiplication in both C/C++ and Fortran, focusing on how different compute and data mapping clauses affect GPU execution. Each option showcases a unique approach to efficiently map data and utilize GPU threads.
Matrix Multiplication in C/C++¶
Option 1: #pragma omp target parallel for collapse(N)¶
In this option, we utilize target parallel for with collapse(2), which merges the outer row and col loops, allowing them to be executed in parallel across GPU threads. The private clause gives each thread its own copies of row, col, and i to avoid data hazards.
// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *c, int n) {
    int row, col, i;  // declared before the directive so private() can reference them
#pragma omp target parallel for collapse(2) private(row, col, i)
    for (row = 0; row < n; row++) {
        for (col = 0; col < n; col++) {
            float sum = 0;
            for (i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}
Compilation Output Explanation:¶
- target parallel for: Offloads the outer loop and parallelizes it across GPU threads.
- collapse(2): Collapses the row and col loops to distribute the work across threads efficiently.
- private(row, col, i): Private copies of row, col, and i for each thread ensure correct calculations.
Matrix_Multiplication:
14, #omp target parallel for
14, Generating "nvkernel_Matrix_Multiplication_F1L14_2" GPU kernel
16, Loop parallelized across threads(128), schedule(static)
main:
51, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
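The map line under main: indicates that the arrays are mapped explicitly at the call site rather than inside the function. As a rough sketch of what such a driver might look like (N, the initialization values, and the compile line are assumptions, not taken from the original source), consider:

// matmul_driver.c: hypothetical driver sketch, not part of the original example.
// Possible build line with NVHPC (assumed): nvc -mp=gpu -Minfo=mp Matrix_Multiplication.c matmul_driver.c
#include <stdio.h>
#include <stdlib.h>

#define N 1024

void Matrix_Multiplication(float *a, float *b, float *c, int n);

int main(void) {
    float *a = malloc(sizeof(float) * N * N);
    float *b = malloc(sizeof(float) * N * N);
    float *c = malloc(sizeof(float) * N * N);
    for (int i = 0; i < N * N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // An explicit map like this in main is one way to produce the
    // "Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])" feedback shown above.
    #pragma omp target data map(tofrom: a[0:N*N], b[0:N*N], c[0:N*N])
    {
        Matrix_Multiplication(a, b, c, N);
    }

    printf("c[0] = %f (expected %f)\n", c[0], 2.0f * N);
    free(a); free(b); free(c);
    return 0;
}

With the arrays already present on the device, the implicit maps generated for the target region inside Matrix_Multiplication become no-ops, so repeated calls would not re-transfer the data.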
Option 2: #pragma omp target teams distribute parallel for collapse(N)¶
This option introduces teams distribute to create multiple teams. Each team executes a portion of the matrix multiplication, with threads within teams working on the calculations in parallel. This approach scales efficiently across the GPU by dividing work between multiple teams.
// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *c, int n) {
    int row, col, i;
#pragma omp target teams distribute parallel for collapse(2) private(row, col, i)
    for (row = 0; row < n; row++) {
        for (col = 0; col < n; col++) {
            float sum = 0;
            for (i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}
Compilation Output Explanation:¶
- target teams distribute parallel for: Creates a league of teams whose threads work on portions of the matrix, improving GPU utilization.
- collapse(2): Collapses the row and col loops.
- private(row, col, i): Ensures each thread has its own copies of the loop variables, preventing race conditions.
Matrix_Multiplication:
13, #omp target teams distribute parallel for
13, Generating "nvkernel_Matrix_Multiplication_F1L13_2" GPU kernel
Loop parallelized across teams and threads(128), schedule(static)
main:
50, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
Option 3: #pragma omp target teams distribute parallel for num_teams(N) collapse(N)¶
This version adds num_teams(5), which requests a league of five teams, each processing a different part of the collapsed iteration space. Capping the team count gives explicit control over how work is spread across the GPU's compute units, which can help match the league size to the available hardware resources.
// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *c, int n) {
    int row, col, i;
#pragma omp target teams distribute parallel for num_teams(5) collapse(2) private(row, col, i)
    for (row = 0; row < n; row++) {
        for (col = 0; col < n; col++) {
            float sum = 0;
            for (i = 0; i < n; i++) {
                sum += a[row * n + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}
Compilation Output Explanation:¶
- num_teams(5): Creates five teams, allowing fine-grained control over GPU resource allocation.
- collapse(2): Combines the row and col loops for more efficient parallelization.
- private(row, col, i): Maintains thread safety by allocating separate copies of each variable.
Matrix_Multiplication:
13, #omp target teams distribute parallel for num_teams(5)
13, Generating "nvkernel_Matrix_Multiplication_F1L13_2" GPU kernel
Loop parallelized across teams and threads(128), schedule(static)
main:
50, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
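A detail worth noting: the OpenMP specification treats num_teams as an upper bound, so the runtime may create fewer teams than requested. The following small, self-contained sketch (hypothetical, not from the original code) reports the league size the runtime actually launches for num_teams(5):

// check_teams.c: hypothetical sketch for inspecting the league size.
#include <omp.h>
#include <stdio.h>

int main(void) {
    int teams = 0;
    // map(from: teams) copies the value observed on the device back to the host.
    #pragma omp target teams num_teams(5) map(from: teams)
    {
        if (omp_get_team_num() == 0)   // one team records the league size
            teams = omp_get_num_teams();
    }
    printf("Runtime launched %d team(s)\n", teams);
    return 0;
}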
Matrix Multiplication in Fortran¶
The following sections describe similar options in Fortran, using OpenMP offloading constructs to enable parallel execution on GPUs.
Option 1: !$omp target loop collapse(N)¶
subroutine Matrix_Multiplication(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n, n), b(n, n)
  real(8), intent(out) :: c(n, n)
  integer :: row, col, i
  real(8) :: sum  ! implicitly privatized by the compiler (see feedback below)

  !$omp target loop collapse(2) private(row, col, i)
  do row = 1, n
    do col = 1, n
      sum = 0.0
      do i = 1, n
        sum = sum + a(row, i) * b(i, col)
      end do
      c(row, col) = sum
    end do
  end do
  !$omp end target loop
end subroutine
Compilation Output Explanation:¶
- target loop: Offloads the computation to the GPU.
- collapse(2): Collapses the row and col loops for efficient parallel execution.
- private(row, col, i): Isolates each thread's loop variables, ensuring no data conflicts.
Matrix_Multiplication:
16, #omp target loop
16, Generating "nvkernel_Matrix_Multiplication_F1L16_2" GPU kernel
Generating NVIDIA GPU code
16, Loop parallelized across teams, threads(128) collapse(2) /* blockIdx.x threadIdx.x */
18, /* blockIdx.x threadIdx.x collapsed */
20, Loop run sequentially
16, Generating Multicore code
16, Loop parallelized across threads
16, Generating implicit map(to:a[:],b[:])
Generating implicit map(from:c[:])
18, Generating implicit private(sum)
20, Loop is parallelizable
main:
53, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
Option 2: !$omp target teams loop collapse(N)¶
subroutine Matrix_Multiplication(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n, n), b(n, n)
  real(8), intent(out) :: c(n, n)
  integer :: row, col, i
  real(8) :: sum

  !$omp target teams loop collapse(2) private(row, col, i)
  do row = 1, n
    do col = 1, n
      sum = 0.0
      do i = 1, n
        sum = sum + a(row, i) * b(i, col)
      end do
      c(row, col) = sum
    end do
  end do
  !$omp end target teams loop
end subroutine
Compilation Output Explanation:¶
- target teams loop: Executes across a league of teams, ideal for distributing large workloads.
- collapse(2): Merges the row and col loops to improve parallelization.
- private(row, col, i): Ensures thread safety by creating isolated copies.
Matrix_Multiplication:
16, #omp target teams loop
16, Generating "nvkernel_Matrix_Multiplication_F1L16_2" GPU kernel
Generating NVIDIA GPU code
16, Loop parallelized across teams, threads(128) collapse(2) /* blockIdx.x threadIdx.x */
18, /* blockIdx.x threadIdx.x collapsed */
20, Loop run sequentially
16, Generating Multicore code
16, Loop parallelized across threads
16, Generating implicit map(to:a[:],b[:])
Generating implicit map(from:c[:])
18, Generating implicit private(sum)
20, Loop is parallelizable
main:
53, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
Option 3: !$omp target teams loop num_teams(N) collapse(N)¶
subroutine Matrix_Multiplication(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n, n), b(n, n)
  real(8), intent(out) :: c(n, n)
  integer :: row, col, i
  real(8) :: sum

  !$omp target teams loop num_teams(5) collapse(2) private(row, col, i)
  do row = 1, n
    do col = 1, n
      sum = 0.0
      do i = 1, n
        sum = sum + a(row, i) * b(i, col)
      end do
      c(row, col) = sum
    end do
  end do
  !$omp end target teams loop
end subroutine
Compilation Output Explanation:¶
- num_teams(5): Creates five teams for distributed execution.
- collapse(2): Combines the row and col loops for improved load balancing.
- private(row, col, i): Prevents race conditions by giving each thread its own copies of the loop variables.
Matrix_Multiplication:
16, #omp target teams loop num_teams(5)
16, Generating "nvkernel_Matrix_Multiplication_F1L16_2" GPU kernel
Generating NVIDIA GPU code
16, Loop parallelized across teams(5), threads(128) collapse(2) /* blockIdx.x threadIdx.x */
18, /* blockIdx.x threadIdx.x collapsed */
20, Loop run sequentially
16, Generating Multicore code
16, Loop parallelized across threads
16, Generating implicit map(to:a[:],b[:])
Generating implicit map(from:c[:])
18, Generating implicit private(sum)
20, Loop is parallelizable
main:
53, Generating map(tofrom:a[:N*N],c[:N*N],b[:N*N])
Summary¶
These options for matrix multiplication using OpenMP offloading provide several ways to leverage GPUs for efficient parallel computation:
- Collapse Clauses: By collapsing loops with collapse(2), the code merges outer loops into one, maximizing parallelism and balancing workloads.
- Teams and Threads: Variants like target parallel for and target teams distribute allow users to control workload distribution across threads and teams, optimizing GPU resource use.
- Data Mapping: Each example uses data mapping to efficiently manage memory transfers between host and device, which is crucial for performance on heterogeneous systems.
Each approach provides unique trade-offs in resource utilization, parallelism, and memory handling, enabling developers to tailor code to the specific architecture and problem size.
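To make the data-mapping point concrete, here is a hedged sketch (run_steps and steps are illustrative names, assuming the C Matrix_Multiplication from the examples above) using OpenMP's unstructured mapping directives to keep the matrices resident on the device across repeated multiplications:

// mapping_sketch.c: hypothetical, assuming the C Matrix_Multiplication above.
void Matrix_Multiplication(float *a, float *b, float *c, int n);

// Keeps a, b, and c resident on the device for all iterations, so host/device
// traffic happens only at the enter and exit points, not on every call.
void run_steps(float *a, float *b, float *c, int n, int steps) {
    #pragma omp target enter data map(to: a[0:n*n], b[0:n*n]) map(alloc: c[0:n*n])
    for (int step = 0; step < steps; step++) {
        Matrix_Multiplication(a, b, c, n);  // operates on the resident device copies
    }
    #pragma omp target exit data map(from: c[0:n*n]) map(delete: a[0:n*n], b[0:n*n])
}

For a single call, a structured target data region (or the implicit maps alone) is sufficient; enter/exit data pays off when the kernel is launched many times on the same buffers.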