In a scientific application I develop we make extensive use of matrix-matrix multiplications, so it is important that these operations are performant. Highly tuned matrix multiplication kernels are a core component of the Basic Linear Algebra Subprograms (BLAS) API, and there exists a wide variety of BLAS implementations, both open source and proprietary, for almost all HPC platforms.
What follows is a series of benchmarks of several BLAS libraries for the matrix sizes that arise in our application. It is hoped that these benchmarks will help users make an informed choice when deciding on a specific BLAS implementation.
The dimensions of the matrices, with the exception of the final test-case (M = N = K = 2048), are all real-world examples taken from my solver.
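The M, N and K above follow the usual GEMM convention: an M × K matrix A multiplied by a K × N matrix B to give an M × N matrix C. In CBLAS terms a single-precision call looks roughly like the sketch below; the row-major layout and the unit alpha / zero beta are assumptions made purely for illustration, not necessarily what the solver itself uses.

```c
#include <cblas.h>

/* C (M x N) = A (M x K) * B (K x N) in single precision.
 * Row-major storage and alpha = 1, beta = 0 are assumed here
 * purely for illustration. */
void gemm_example(int M, int N, int K,
                  const float *A, const float *B, float *C)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,    /* lda = K for row-major A */
                      B, N,    /* ldb = N for row-major B */
                0.0f, C, N);   /* ldc = N for row-major C */
}
```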
When conducting the benchmarks, the following procedures were employed:
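A minimal sketch of the sort of timing harness such a measurement typically uses is shown below: time a number of GEMM calls and convert the usual 2·M·N·K operation count into a GFLOPS figure. The helper name, repetition count, data initialisation and timing source here are my own assumptions rather than a verbatim description of the procedure.

```c
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: time `reps` single-precision GEMMs of the given
 * shape and return the sustained rate in GFLOPS, using the standard
 * 2*M*N*K operation count. */
static double bench_sgemm(int M, int N, int K, int reps)
{
    float *A = malloc(sizeof(float) * M * K);
    float *B = malloc(sizeof(float) * K * N);
    float *C = malloc(sizeof(float) * M * N);
    for (int i = 0; i < M * K; i++) A[i] = (float) rand() / RAND_MAX;
    for (int i = 0; i < K * N; i++) B[i] = (float) rand() / RAND_MAX;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gflops = 2.0 * M * N * K * reps / secs / 1e9;

    free(A); free(B); free(C);
    return gflops;
}

int main(void)
{
    /* One of the problem sizes from the table below. */
    printf("%.2f GFLOPS\n", bench_sgemm(216, 25000, 216, 10));
    return 0;
}
```

Linking the same source against each BLAS in turn then gives directly comparable figures.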
The following comparison is between ATLAS 3.11.11 and the current Git HEAD of OpenBLAS. Both were compiled using GCC 4.7.2 on a Gentoo Linux system. The CPU was a 3.5 GHz Intel Core i7-3770K with Turbo Boost (to 3.9 GHz) enabled. Peak is ~62.7 GFLOPS in single precision and ~31.4 GFLOPS in double precision.
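As a sanity check, these peak figures follow the usual back-of-the-envelope model of clock rate × SIMD width × floating-point operations per cycle; for AVX on this part I am assuming 8 single-precision (4 double-precision) lanes with one multiply and one add issued per cycle, which lands close to the numbers quoted above:

$$
\text{peak}_{\text{SP}} \approx 3.9\,\text{GHz} \times 8 \times 2 \approx 62\ \text{GFLOPS},
\qquad
\text{peak}_{\text{DP}} \approx 3.9\,\text{GHz} \times 4 \times 2 \approx 31\ \text{GFLOPS}.
$$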
All figures are in GFLOPS; S and D denote single and double precision, respectively.

M | N | K | ATLAS S | ATLAS D | OpenBLAS S | OpenBLAS D |
---|---|---|---|---|---|---|
6 | 25000 | 3 | 8.78 | 5.84 | 6.20 | 4.94 |
8 | 25000 | 4 | 11.31 | 6.63 | 8.83 | 6.26 |
9 | 25000 | 6 | 15.68 | 9.07 | 11.51 | 8.17 |
12 | 25000 | 9 | 20.94 | 11.31 | 14.25 | 10.37 |
12 | 25000 | 10 | 21.80 | 11.22 | 15.18 | 10.84 |
15 | 25000 | 15 | 26.33 | 13.32 | 17.54 | 12.21 |
16 | 25000 | 16 | 22.55 | 12.25 | 21.51 | 13.77 |
18 | 25000 | 21 | 25.80 | 13.37 | 23.02 | 14.14 |
20 | 25000 | 25 | 24.49 | 13.68 | 24.29 | 15.63 |
24 | 25000 | 8 | 24.62 | 11.92 | 17.77 | 11.33 |
24 | 25000 | 36 | 24.91 | 15.18 | 30.73 | 17.01 |
54 | 25000 | 27 | 28.69 | 17.92 | 32.09 | 17.28 |
96 | 25000 | 64 | 35.16 | 22.37 | 45.17 | 22.18 |
150 | 25000 | 125 | 40.65 | 25.70 | 48.41 | 25.26 |
216 | 25000 | 216 | 53.75 | 25.09 | 54.18 | 27.90 |
2048 | 2048 | 2048 | 61.02 | 30.37 | 61.61 | 31.06 |
We note here that ATLAS tends to outperform OpenBLAS at single precision until around M = 24. At this point OpenBLAS pulls away, often delivering around 20% more FLOPS than ATLAS.