In this project, I am exploring different methods to optimize matrix multiplication for square matrices.
I have also set up benchmarks to compare the performance. The benchmarks are implemented using the Google Benchmark library, which provides a robust framework for measuring the performance of the functions under various conditions.
The results from these benchmarks will help in understanding the efficiency gains achieved through the optimization techniques applied.
using Matrix = std::vector<std::vector<int>>;
Matrix multiply(const Matrix &A, const Matrix &B) {
size_t n = A.size();
Matrix C(n, std::vector<int>(n, 0));
for (size_t i = 0; i < n; ++i) {
for (size_t j = 0; j < n; ++j) {
for (size_t k = 0; k < n; ++k) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
return C;
}
This is a normal matrix multiplication function that takes two matrices as input and returns the product of the two matrices. The matrices are represented as vectors of vectors of integers.
using FlatMatrix = std::vector<int>;
FlatMatrix multiply_flat(const FlatMatrix &A, const FlatMatrix &B, size_t n) {
FlatMatrix C(n * n, 0);
for (size_t i = 0; i < n; ++i) {
for (size_t j = 0; j < n; ++j) {
for (size_t k = 0; k < n; ++k) {
C[i * n + j] += A[i * n + k] * B[k * n + j];
}
}
}
return C;
}
This is a flat matrix multiplication function that takes two flat matrices as input and returns the product of the two matrices. The matrices are represented as vectors of integers.
Total number of elements = N * N
Benchmark | Time (ns) | CPU (ns) | Iterations |
---|---|---|---|
BM_MatrixMultiplication_Random/8 | 395 ns | 395 ns | 1,795,230 |
BM_MatrixMultiplication_Random/16 | 2,270 ns | 2,270 ns | 305,677 |
BM_MatrixMultiplication_Random/32 | 16,134 ns | 16,134 ns | 42,623 |
BM_FlatMatrixMultiplication_Random/8 | 201 ns | 201 ns | 3,445,345 |
BM_FlatMatrixMultiplication_Random/16 | 1,452 ns | 1,452 ns | 476,767 |
BM_FlatMatrixMultiplication_Random/32 | 10,411 ns | 10,409 ns | 66,010 |
Link to the matrix multiplication with array
Total number of elements = N
Benchmark | Time (ns) | CPU (ns) | Iterations |
---|---|---|---|
BM_SafeArrayMultiplication_Random/64 | 289 ns | 289 ns | 2,451,525 |
BM_SafeArrayMultiplication_Random/256 | 2,307 ns | 2,301 ns | 308,611 |
BM_SafeArrayMultiplication_Random/1024 | 18,615 ns | 18,585 ns | 36,599 |
- Flattening the Matrix
- Using an Array instead of a Vector
- Fixed Size Array
- 1D is still faster than 2D
- Current Vector implementation is faster than the Array implementation
- Will focus on optimising the Vector implementation instead
MultiplyFlatBlock
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_FlatMatrixMultiplication/32 | 0.012 ms | 0.012 ms | 60464 |
BM_FlatMatrixMultiplication/64 | 0.088 ms | 0.088 ms | 7841 |
BM_FlatMatrixMultiplication/128 | 2.22 ms | 2.22 ms | 314 |
BM_FlatMatrixMultiplication/256 | 16.6 ms | 16.6 ms | 41 |
BM_FlatMatrixMultiplication/512 | 238 ms | 237 ms | 3 |
BM_FlatMatrixMultiplication/1024 | 4133 ms | 4126 ms | 1 |
BM_MultiplyFlatBlock/32/16 | 0.019 ms | 0.019 ms | 37559 |
BM_MultiplyFlatBlock/64/16 | 0.147 ms | 0.147 ms | 4715 |
BM_MultiplyFlatBlock/128/16 | 1.18 ms | 1.18 ms | 591 |
BM_MultiplyFlatBlock/256/16 | 9.52 ms | 9.52 ms | 73 |
BM_MultiplyFlatBlock/512/16 | 88.1 ms | 88.1 ms | 8 |
BM_MultiplyFlatBlock/1024/16 | 1878 ms | 1877 ms | 1 |
BM_MultiplyFlatBlock/32/32 | 0.018 ms | 0.018 ms | 40094 |
BM_MultiplyFlatBlock/64/32 | 0.139 ms | 0.139 ms | 5063 |
BM_MultiplyFlatBlock/128/32 | 1.12 ms | 1.12 ms | 606 |
BM_MultiplyFlatBlock/256/32 | 10.1 ms | 10.1 ms | 67 |
BM_MultiplyFlatBlock/512/32 | 202 ms | 202 ms | 3 |
BM_MultiplyFlatBlock/1024/32 | 2455 ms | 2452 ms | 1 |
BM_MultiplyFlatBlock/32/64 | 0.017 ms | 0.017 ms | 41280 |
BM_MultiplyFlatBlock/64/64 | 0.136 ms | 0.136 ms | 5091 |
BM_MultiplyFlatBlock/128/64 | 1.22 ms | 1.22 ms | 571 |
BM_MultiplyFlatBlock/256/64 | 17.8 ms | 17.8 ms | 39 |
BM_MultiplyFlatBlock/512/64 | 204 ms | 203 ms | 3 |
BM_MultiplyFlatBlock/1024/64 | 2583 ms | 2581 ms | 1 |
- Cache Blocking with different block sizes
- Cache Blocking with block size 16 is the fastest
- Total Speedup at matrix Dim (1024) : 4133 ms / 1878 ms = 2.2x
Benchmark | Time (ms) | CPU (ms) | Iterations |
---|---|---|---|
BM_SIMD/32 | 0.001 | 0.001 | 471,691 |
BM_SIMD/64 | 0.011 | 0.011 | 60,121 |
BM_SIMD/128 | 0.095 | 0.095 | 7,435 |
BM_SIMD/256 | 0.790 | 0.790 | 873 |
BM_SIMD/512 | 9.86 | 9.86 | 70 |
BM_SIMD/1024 | 225 | 225 | 3 |
BM_MultiplyFlatBlock/32/16 | 0.017 | 0.017 | 40,411 |
BM_MultiplyFlatBlock/64/16 | 0.137 | 0.137 | 4,765 |
BM_MultiplyFlatBlock/128/16 | 1.10 | 1.10 | 632 |
BM_MultiplyFlatBlock/256/16 | 8.78 | 8.78 | 79 |
BM_MultiplyFlatBlock/512/16 | 81.6 | 81.6 | 8 |
BM_MultiplyFlatBlock/1024/16 | 1789 | 1789 | 1 |
- Using SIMD instructions to perform 8 multiplications at once
- Using Cache Blocking with block size 16
- Improved performance further by using SIMD instructions
- Total Speedup at matrix Dim (1024) : 1789 ms / 225 ms = 7.9x