Matrix Multiplication¶
Script
- "This article demonstrates how to implement and optimize matrix multiplication in C/C++ and Fortran using OpenACC. We’ll cover the basic approach, then enhance performance with OpenACC clauses like collapse, reduction, and tile to enable efficient GPU acceleration."
- "The serial version of matrix multiplication in C/C++ uses three nested loops to compute each element in the result matrix c. Here, each row of matrix a is multiplied with each column of matrix b, storing the sum in c. This code is executed sequentially on the CPU."
- "By adding OpenACC directives, we parallelize matrix multiplication for GPU execution. The kernels directive offloads computation, while collapse(2) merges the outer loops, allowing parallel execution across rows and columns. The reduction clause safely accumulates the sum variable for each element."
- "In Fortran, the serial matrix multiplication uses similar nested loops to calculate each element in matrix c sequentially on the CPU. Single-dimensional arrays with calculated indices store the matrix elements."
- "In the parallel Fortran version, the kernels directive offloads computation, with collapse(3) merging all three loops for maximum parallelism. The reduction clause safely accumulates values in sum, ensuring correct results across threads."
- "The tile clause further optimizes matrix multiplication by splitting loops into smaller 32x32 tiles. This approach fits data into the GPU’s shared memory, reducing latency and boosting performance. The tile clause effectively replaces collapse by managing the iteration space directly."
- "By using OpenACC clauses like collapse, reduction, and tile, matrix multiplication can be optimized for GPU parallelism. These techniques help minimize memory latency and maximize parallel execution, significantly enhancing performance for large matrices."
This article covers the implementation of matrix multiplication in C/C++ and Fortran, focusing on parallelization using OpenACC. Matrix multiplication is computationally intensive, making it an ideal candidate for GPU acceleration. In this example, we demonstrate several OpenACC clauses such as collapse
, reduction
, and tile
to optimize the operation.
Matrix Multiplication in C/C++¶
The following example shows a basic matrix multiplication function in C/C++. Here, the matrices a
, b
, and c
are stored as single-dimensional arrays for simplicity. The function multiplies two input matrices a
and b
and stores the result in matrix c
.
Serial Version in C/C++¶
Below is the serial code for matrix multiplication, Matrix_Multiplication.c. In this code, three nested loops are used to perform the matrix multiplication.
// Matrix_Multiplication.c
void Matrix_Multiplication(float *a, float *b, float *restrict c, int width)
{
float sum = 0;
for(int row = 0; row < width ; ++row)
{
for(int col = 0; col < width ; ++col)
{
for(int i = 0; i < width ; ++i)
{
sum += a[row*width+i] * b[i*width+col];
}
c[row*width+col] = sum;
sum = 0;
}
}
}
Explanation of Matrix Multiplication Code¶
- Three nested loops are used to iterate over the rows and columns of the matrices.
- Inner product calculation: For each element
(row, col)
in matrixc
, we compute the inner product of the row ofa
and the column ofb
. - Single-dimensional arrays: All matrices are represented as single arrays for simplicity, with indexing calculated as
row*width + col
.
Parallel Version with OpenACC¶
To parallelize this function with OpenACC, we use the kernels
directive, along with the collapse
and reduction
clauses. The collapse
clause merges the nested loops, allowing the compiler to parallelize all three loops simultaneously. Additionally, the reduction
clause is used to accumulate the sum
variable within each thread safely.
Here’s the OpenACC-enabled code, Matrix_Multiplication_OpenACC.c:
// Matrix_Multiplication_OpenACC.c
void Matrix_Multiplication(float *a, float *b, float *restrict c, int width)
{
#pragma acc kernels copyin(a[0:length], b[0:length]) copyout(c[0:length]) collapse(2) reduction(+:sum)
for(int row = 0; row < width ; ++row)
{
for(int col = 0; col < width ; ++col)
{
float sum = 0;
for(int i = 0; i < width ; ++i)
{
sum += a[row*width+i] * b[i*width+col];
}
c[row*width+col] = sum;
}
}
}
In this parallelized version:
kernels
directive: Thekernels
directive identifies code regions to offload to the GPU.- Data clauses:
copyin(a[0:length], b[0:length])
andcopyout(c[0:length])
specify the data transfer between the CPU and GPU, moving the input matricesa
andb
to the GPU and bringing the result matrixc
back to the CPU. collapse(3)
clause: Collapses the three nested loops into a single loop, enabling parallel execution across multiple rows, columns, and inner iterations.reduction(+:sum)
clause: Handles the accumulation ofsum
safely across parallel threads.
Note: The
restrict
qualifier is applied to thec
pointer to avoid aliasing, which allows more efficient memory access.
Matrix Multiplication in Fortran¶
The Fortran version of the matrix multiplication function is similar to the C/C++ version. The matrices are again stored as single-dimensional arrays, and three nested loops are used for computation.
Serial Version in Fortran¶
Below is the serial code in Fortran, Matrix_Multiplication.f90:
!! Matrix_Multiplication.f90
module Matrix_Multiplication_Mod
implicit none
contains
subroutine Matrix_Multiplication(a, b, c, width)
! Input vectors
real(8), intent(in), dimension(:) :: a
real(8), intent(in), dimension(:) :: b
real(8), intent(out), dimension(:) :: c
real(8) :: sum = 0
integer :: i, row, col, width
do row = 0, width-1
do col = 0, width-1
do i = 0, width-1
sum = sum + (a((row*width)+i+1) * b((i*width)+col+1))
end do
c(row*width+col+1) = sum
sum = 0
end do
end do
end subroutine Matrix_Multiplication
end module Matrix_Multiplication_Mod
Explanation of Fortran Matrix Multiplication Code¶
-
Nested Loops: The outer two loops iterate over the rows and columns of the result matrix,
c
. The inner loop performs the summation, computing the dot product of the row froma
and the column fromb
. -
Array Indexing: Fortran arrays are indexed from 1 by default, so indexing is carefully handled by adding
+1
in the array accesses to align with single-dimensional array storage. -
Temporary Sum Variable: The
sum
variable accumulates the product of the row elements ofa
and column elements ofb
for each element inc
. After each inner loop iteration,sum
is assigned toc(row * width + col + 1)
, then reset to zero.
Parallel Version with OpenACC¶
The OpenACC version in Fortran follows a similar structure to the C/C++ version, using the kernels directive along with collapse and reduction clauses. Below is the OpenACC-enabled code, Matrix_Multiplication_OpenACC.f90:
!! Matrix_Multiplication_OpenACC.f90
module Matrix_Multiplication_Mod
implicit none
contains
subroutine Matrix_Multiplication(a, b, c, width)
! Input vectors
real(8), intent(in), dimension(:) :: a
real(8), intent(in), dimension(:) :: b
real(8), intent(out), dimension(:) :: c
real(8) :: sum = 0
integer :: i, row, col, width, length
length = width * width
!$acc kernels copyin(a(1:length), b(1:length)) copyout(c(1:length))
!$acc loop collapse(3) reduction(+:sum)
do row = 0, width-1
do col = 0, width-1
do i = 0, width-1
sum = sum + (a((row*width)+i+1) * b((i*width)+col+1))
end do
c(row*width+col+1) = sum
sum = 0
end do
end do
!$acc end kernels
end subroutine Matrix_Multiplication
end module Matrix_Multiplication_Mod
Explanation of OpenACC-Enabled Fortran Code¶
kernels
Directive: The kernels directive specifies that the enclosed code block should be offloaded to the GPU.- Data Transfer Clauses: The clauses
copyin(a(1:length), b(1:length))
andcopyout(c(1:length))
manage data transfer between the CPU and GPU. The input matrices a and b are copied to the GPU, and the result matrix c is brought back to the CPU after computation. collapse(3)
Clause: By collapsing the three nested loops, OpenACC can parallelize all three loops across GPU threads, enabling efficient parallel computation of rows, columns, and inner loop operations.reduction(+:sum)
Clause: The reduction clause ensures thesum
variable is safely accumulated across parallel threads without conflicts.- Array Indexing:: As in the serial version, the single-dimensional array storage in Fortran requires careful indexing with
+1
.
Optimizing Nested Loops with the tile
Clause¶
The tile
clause provides an additional optimization for matrix multiplication by dividing large matrices into smaller, manageable blocks. This method is efficient because it allows the GPU to cache smaller tiles of data, reducing memory latency. Below is an example of matrix multiplication with the tile
clause, Matrix_Multiplication_Tile_OpenACC.c:
C/C++ Version with tile
Clause¶
// Matrix_Multiplication_Tile_OpenACC.c
void Matrix_Multiplication(float *a, float *b, float *restrict c, int width)
{
int length = width * width;
float sum = 0;
#pragma acc kernels copyin(a[0:length], b[0:length]), copyout(c[0:length])
#pragma acc loop tile(32,32) reduction(+:sum)
for(int row = 0; row < width; ++row)
{
for(int col = 0; col < width; ++col)
{
for(int i = 0; i < width; ++i)
{
sum += a[row * width + i] * b[i * width + col];
}
c[row * width + col] = sum;
sum = 0;
}
}
}
In this example:
tile(32,32)
clause: Breaks down the iteration space into 32x32 tiles, which are small enough to fit into the GPU’s shared memory, reducing memory access latency.- The
tile
clause can replace thecollapse
clause, as both aim to improve efficiency but in different ways.
Fortran Version with tile
Clause¶
Below is the Fortran equivalent with the tile clause, Matrix_Multiplication_Tile_OpenACC.f90:
!! Matrix_Multiplication_Tile_OpenACC.f90
module Matrix_Multiplication_Mod
implicit none
contains
subroutine Matrix_Multiplication(a, b, c, width)
! Input vectors
real(8), intent(in), dimension(:) :: a
real(8), intent(in), dimension(:) :: b
real(8), intent(out), dimension(:) :: c
real(8) :: sum = 0
integer :: i, row, col, width, length
length = width * width
!$acc kernels copyin(a(1:length), b(1:length)) copyout(c(1:length))
!$acc loop tile(32,32) reduction(+:sum)
do row = 0, width-1
do col = 0, width-1
do i = 0, width-1
sum = sum + (a((row*width)+i+1) * b((i*width)+col+1))
end do
c(row*width+col+1) = sum
sum = 0
end do
end do
!$acc end kernels
end subroutine Matrix_Multiplication
end module Matrix_Multiplication_Mod
Note: When using
tile
, the collapse clause is typically removed, as they serve similar optimization purposes but are not used together.
Summary of Matrix Multiplication with OpenACC¶
This section covers the parallelization of matrix multiplication using OpenACC, with examples in both C/C++ and Fortran. Matrix multiplication is a computation-heavy operation, making it suitable for GPU acceleration. Here, we explore how to use key OpenACC clauses (collapse
, reduction
, and tile
) to optimize matrix multiplication on the GPU.
-
Matrix Multiplication Overview:
- The core matrix multiplication involves three nested loops, where each element in the result matrix
c
is the inner product of a row from matrixa
and a column from matrixb
. - Arrays are stored in single-dimensional form, and elements are accessed through calculated indices for simplicity.
- The core matrix multiplication involves three nested loops, where each element in the result matrix
-
OpenACC Parallelization:
- The OpenACC
kernels
directive is applied to identify code regions for GPU offloading. collapse(3)
clause: Combines the three nested loops into a single loop, enabling parallel execution across rows, columns, and inner loop elements.reduction(+:sum)
clause: Ensures the summation within each thread is computed correctly without conflicts, allowing safe parallel accumulation ofsum
.
- The OpenACC
-
Fortran Parallelization:
- The Fortran version mirrors the structure of the C/C++ code, with similar application of the
kernels
,collapse
, andreduction
clauses to optimize the matrix multiplication on the GPU.
- The Fortran version mirrors the structure of the C/C++ code, with similar application of the
-
Optimization with
tile
Clause:- The
tile
clause can further improve efficiency by breaking the iteration space into smaller 32x32 tiles, which are small enough to fit into the GPU’s shared memory. This reduces memory latency and is particularly useful for large matrices. - When using
tile
, thecollapse
clause is typically omitted, astile
achieves optimization by reducing memory access bottlenecks differently.
- The
By leveraging these OpenACC clauses, matrix multiplication can be efficiently parallelized, improving performance by taking advantage of the GPU’s parallel processing capabilities and shared memory.