Skip to content

Matrix Multiplication

Script
  • "This article demonstrates how to implement and optimize matrix multiplication in C/C++ and Fortran using OpenACC. We’ll cover the basic approach, then enhance performance with OpenACC clauses like collapse, reduction, and tile to enable efficient GPU acceleration."
  • "The serial version of matrix multiplication in C/C++ uses three nested loops to compute each element in the result matrix c. Here, each row of matrix a is multiplied with each column of matrix b, storing the sum in c. This code is executed sequentially on the CPU."
  • "By adding OpenACC directives, we parallelize matrix multiplication for GPU execution. The kernels directive offloads computation, while collapse(2) merges the outer loops, allowing parallel execution across rows and columns. The reduction clause safely accumulates the sum variable for each element."
  • "In Fortran, the serial matrix multiplication uses similar nested loops to calculate each element in matrix c sequentially on the CPU. Single-dimensional arrays with calculated indices store the matrix elements."
  • "In the parallel Fortran version, the kernels directive offloads computation, with collapse(3) merging all three loops for maximum parallelism. The reduction clause safely accumulates values in sum, ensuring correct results across threads."
  • "The tile clause further optimizes matrix multiplication by splitting loops into smaller 32x32 tiles. This approach fits data into the GPU’s shared memory, reducing latency and boosting performance. The tile clause effectively replaces collapse by managing the iteration space directly."
  • "By using OpenACC clauses like collapse, reduction, and tile, matrix multiplication can be optimized for GPU parallelism. These techniques help minimize memory latency and maximize parallel execution, significantly enhancing performance for large matrices."

This article covers the implementation of matrix multiplication in C/C++ and Fortran, focusing on parallelization using OpenACC. Matrix multiplication is computationally intensive, making it an ideal candidate for GPU acceleration. In this example, we demonstrate several OpenACC clauses such as collapse, reduction, and tile to optimize the operation.


Matrix Multiplication in C/C++

The following example shows a basic matrix multiplication function in C/C++. Here, the matrices a, b, and c are stored as single-dimensional arrays for simplicity. The function multiplies two input matrices a and b and stores the result in matrix c.

Serial Version in C/C++

Below is the serial code for matrix multiplication, Matrix_Multiplication.c. In this code, three nested loops are used to perform the matrix multiplication.

// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *restrict c, int width)   
{ 
  float sum = 0;
  for(int row = 0; row < width ; ++row)                           
    {                                                             
      for(int col = 0; col < width ; ++col)                       
        {                                                         
          for(int i = 0; i < width ; ++i)                         
            {                                                     
              sum += a[row*width+i] * b[i*width+col];      
            }                                                     
          c[row*width+col] = sum;                           
      sum = 0;
        }
    }   
}

Explanation of Matrix Multiplication Code

  • Three nested loops are used to iterate over the rows and columns of the matrices.
  • Inner product calculation: For each element (row, col) in matrix c, we compute the inner product of the row of a and the column of b.
  • Single-dimensional arrays: All matrices are represented as single arrays for simplicity, with indexing calculated as row*width + col.

Parallel Version with OpenACC

To parallelize this function with OpenACC, we use the kernels directive, along with the collapse and reduction clauses. The collapse clause merges the nested loops, allowing the compiler to parallelize all three loops simultaneously. Additionally, the reduction clause is used to accumulate the sum variable within each thread safely.

Here’s the OpenACC-enabled code, Matrix_Multiplication_OpenACC.c:

// Matrix_Multiplication_OpenACC.c
void Matrix_Multiplication(float *a, float *b, float *restrict c, int width)   
{ 
  #pragma acc kernels copyin(a[0:length], b[0:length]) copyout(c[0:length]) collapse(2) reduction(+:sum)
  for(int row = 0; row < width ; ++row)                           
    {                                                             
      for(int col = 0; col < width ; ++col)                       
        {                                                         
          float sum = 0;
          for(int i = 0; i < width ; ++i)                         
            {                                                     
              sum += a[row*width+i] * b[i*width+col];      
            }                                                     
          c[row*width+col] = sum;                           
        }
    }   
}

In this parallelized version:

  • kernels directive: The kernels directive identifies code regions to offload to the GPU.
  • Data clauses: copyin(a[0:length], b[0:length]) and copyout(c[0:length]) specify the data transfer between the CPU and GPU, moving the input matrices a and b to the GPU and bringing the result matrix c back to the CPU.
  • collapse(3) clause: Collapses the three nested loops into a single loop, enabling parallel execution across multiple rows, columns, and inner iterations.
  • reduction(+:sum) clause: Handles the accumulation of sum safely across parallel threads.

Note: The restrict qualifier is applied to the c pointer to avoid aliasing, which allows more efficient memory access.


Matrix Multiplication in Fortran

The Fortran version of the matrix multiplication function is similar to the C/C++ version. The matrices are again stored as single-dimensional arrays, and three nested loops are used for computation.

Serial Version in Fortran

Below is the serial code in Fortran, Matrix_Multiplication.f90:

!! Matrix_Multiplication.f90
module Matrix_Multiplication_Mod  
  implicit none 
contains
  subroutine Matrix_Multiplication(a, b, c, width)
    ! Input vectors
    real(8), intent(in), dimension(:) :: a
    real(8), intent(in), dimension(:) :: b
    real(8), intent(out), dimension(:) :: c
    real(8) :: sum = 0
    integer :: i, row, col, width

    do row = 0, width-1
       do col = 0, width-1
          do i = 0, width-1
             sum = sum + (a((row*width)+i+1) * b((i*width)+col+1))
          end do
          c(row*width+col+1) = sum
          sum = 0
       end do
    end do

  end subroutine Matrix_Multiplication
end module Matrix_Multiplication_Mod

Explanation of Fortran Matrix Multiplication Code

  • Nested Loops: The outer two loops iterate over the rows and columns of the result matrix, c. The inner loop performs the summation, computing the dot product of the row from a and the column from b.

  • Array Indexing: Fortran arrays are indexed from 1 by default, so indexing is carefully handled by adding +1 in the array accesses to align with single-dimensional array storage.

  • Temporary Sum Variable: The sum variable accumulates the product of the row elements of a and column elements of b for each element in c. After each inner loop iteration, sum is assigned to c(row * width + col + 1), then reset to zero.


Parallel Version with OpenACC

The OpenACC version in Fortran follows a similar structure to the C/C++ version, using the kernels directive along with collapse and reduction clauses. Below is the OpenACC-enabled code, Matrix_Multiplication_OpenACC.f90:

!! Matrix_Multiplication_OpenACC.f90
module Matrix_Multiplication_Mod  
  implicit none 
contains
  subroutine Matrix_Multiplication(a, b, c, width)
    ! Input vectors
    real(8), intent(in), dimension(:) :: a
    real(8), intent(in), dimension(:) :: b
    real(8), intent(out), dimension(:) :: c
    real(8) :: sum = 0
    integer :: i, row, col, width, length 
    length = width * width
    !$acc kernels copyin(a(1:length), b(1:length)) copyout(c(1:length))
    !$acc loop collapse(3) reduction(+:sum)
    do row = 0, width-1
       do col = 0, width-1        
          do i = 0, width-1
             sum = sum + (a((row*width)+i+1) * b((i*width)+col+1))
          end do
          c(row*width+col+1) = sum
          sum = 0
       end do
    end do
    !$acc end kernels

  end subroutine Matrix_Multiplication
end module Matrix_Multiplication_Mod

Explanation of OpenACC-Enabled Fortran Code

  • kernels Directive: The kernels directive specifies that the enclosed code block should be offloaded to the GPU.
  • Data Transfer Clauses: The clauses copyin(a(1:length), b(1:length)) and copyout(c(1:length)) manage data transfer between the CPU and GPU. The input matrices a and b are copied to the GPU, and the result matrix c is brought back to the CPU after computation.
  • collapse(3) Clause: By collapsing the three nested loops, OpenACC can parallelize all three loops across GPU threads, enabling efficient parallel computation of rows, columns, and inner loop operations.
  • reduction(+:sum) Clause: The reduction clause ensures the sum variable is safely accumulated across parallel threads without conflicts.
  • Array Indexing:: As in the serial version, the single-dimensional array storage in Fortran requires careful indexing with +1.

Optimizing Nested Loops with the tile Clause

The tile clause provides an additional optimization for matrix multiplication by dividing large matrices into smaller, manageable blocks. This method is efficient because it allows the GPU to cache smaller tiles of data, reducing memory latency. Below is an example of matrix multiplication with the tile clause, Matrix_Multiplication_Tile_OpenACC.c:

C/C++ Version with tile Clause

// Matrix_Multiplication_Tile_OpenACC.c
void Matrix_Multiplication(float *a, float *b, float *restrict c, int width)   
{ 
  int length = width * width;
  float sum = 0;
#pragma acc kernels copyin(a[0:length], b[0:length]), copyout(c[0:length])
#pragma acc loop tile(32,32) reduction(+:sum) 
  for(int row = 0; row < width; ++row)                           
    {                                                             
      for(int col = 0; col < width; ++col)                       
        {                                                         
          for(int i = 0; i < width; ++i)                         
            {                                                     
              sum += a[row * width + i] * b[i * width + col];      
            }                                                     
          c[row * width + col] = sum;
      sum = 0;                            
        }
    }   
}

In this example:

  • tile(32,32) clause: Breaks down the iteration space into 32x32 tiles, which are small enough to fit into the GPU’s shared memory, reducing memory access latency.
  • The tile clause can replace the collapse clause, as both aim to improve efficiency but in different ways.

Fortran Version with tile Clause

Below is the Fortran equivalent with the tile clause, Matrix_Multiplication_Tile_OpenACC.f90:

!! Matrix_Multiplication_Tile_OpenACC.f90
module Matrix_Multiplication_Mod  
  implicit none 
contains
  subroutine Matrix_Multiplication(a, b, c, width)
    ! Input vectors
    real(8), intent(in), dimension(:) :: a
    real(8), intent(in), dimension(:) :: b
    real(8), intent(out), dimension(:) :: c
    real(8) :: sum = 0
    integer :: i, row, col, width, length 
    length = width * width
    !$acc kernels copyin(a(1:length), b(1:length)) copyout(c(1:length))
    !$acc loop tile(32,32) reduction(+:sum)
    do row = 0, width-1
       do col = 0, width-1        
          do i = 0, width-1
             sum = sum + (a((row*width)+i+1) * b((i*width)+col+1))
          end do
          c(row*width+col+1) = sum
          sum = 0
       end do
    end do
    !$acc end kernels

  end subroutine Matrix_Multiplication
end module Matrix_Multiplication_Mod

Note: When using tile, the collapse clause is typically removed, as they serve similar optimization purposes but are not used together.


Summary of Matrix Multiplication with OpenACC

This section covers the parallelization of matrix multiplication using OpenACC, with examples in both C/C++ and Fortran. Matrix multiplication is a computation-heavy operation, making it suitable for GPU acceleration. Here, we explore how to use key OpenACC clauses (collapse, reduction, and tile) to optimize matrix multiplication on the GPU.

  1. Matrix Multiplication Overview:

    • The core matrix multiplication involves three nested loops, where each element in the result matrix c is the inner product of a row from matrix a and a column from matrix b.
    • Arrays are stored in single-dimensional form, and elements are accessed through calculated indices for simplicity.
  2. OpenACC Parallelization:

    • The OpenACC kernels directive is applied to identify code regions for GPU offloading.
    • collapse(3) clause: Combines the three nested loops into a single loop, enabling parallel execution across rows, columns, and inner loop elements.
    • reduction(+:sum) clause: Ensures the summation within each thread is computed correctly without conflicts, allowing safe parallel accumulation of sum.
  3. Fortran Parallelization:

    • The Fortran version mirrors the structure of the C/C++ code, with similar application of the kernels, collapse, and reduction clauses to optimize the matrix multiplication on the GPU.
  4. Optimization with tile Clause:

    • The tile clause can further improve efficiency by breaking the iteration space into smaller 32x32 tiles, which are small enough to fit into the GPU’s shared memory. This reduces memory latency and is particularly useful for large matrices.
    • When using tile, the collapse clause is typically omitted, as tile achieves optimization by reducing memory access bottlenecks differently.

By leveraging these OpenACC clauses, matrix multiplication can be efficiently parallelized, improving performance by taking advantage of the GPU’s parallel processing capabilities and shared memory.