A scientific application I develop makes extensive use of matrix-matrix multiplications, so it is important that these operations are performant. Highly tuned multiplication kernels are a core component of the basic linear algebra subprograms (BLAS) API, and there exists a wide variety of BLAS implementations—both open source and proprietary—for almost all HPC platforms.
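For readers less familiar with the API, the workhorse routine here is GEMM, which computes C ← αAB + βC. The sketch below uses NumPy, whose `@` operator forwards to the underlying BLAS `*gemm` of whatever library NumPy was built against; the matrix sizes are purely illustrative, not taken from the benchmarks.

```python
import numpy as np

# GEMM computes C <- alpha*A@B + beta*C, where A is M x K and B is K x N.
M, N, K = 4, 3, 5  # illustrative sizes only
rng = np.random.default_rng(0)
A = rng.random((M, K))
B = rng.random((K, N))
C = rng.random((M, N))

alpha, beta = 1.0, 0.0
# NumPy dispatches the product to the BLAS dgemm kernel it was linked against
C = alpha*(A @ B) + beta*C
```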
What follows is a series of benchmarks, for a variety of BLAS libraries, of the matrix sizes that arise in our application. It is hoped that these benchmarks will help users make an informed choice of BLAS library.
The dimensions of the matrices, with the exception of the final test case (M = N = K = 2048), are all real-world examples taken from my solver.
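A GEMM benchmark of this kind typically times repeated multiplications and converts the mean runtime into GFLOPS using the standard operation count of 2MNK floating-point operations per product. A minimal harness along those lines might look as follows; the function name and repetition count are my own, not from the original benchmarks.

```python
import time
import numpy as np

def bench_gemm(m, n, k, dtype=np.float64, reps=10):
    """Time C = A @ B and report GFLOPS, counting 2*m*n*k flops per product."""
    rng = np.random.default_rng(42)
    a = rng.random((m, k)).astype(dtype)
    b = rng.random((k, n)).astype(dtype)

    a @ b  # warm-up: spin up BLAS thread pools, fault in pages

    start = time.perf_counter()
    for _ in range(reps):
        a @ b
    elapsed = (time.perf_counter() - start) / reps

    return 2*m*n*k / elapsed / 1e9

# e.g. bench_gemm(2048, 2048, 2048) for the final test case
```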
The following procedure was employed when conducting the benchmarks:
The following comparison is between ATLAS 3.11.11 and the current Git HEAD of OpenBLAS. Both were compiled using GCC 4.7.2 on a Gentoo Linux system. The CPU was a 3.5 GHz Intel Core i7 3770K with Turbo Boost (to 3.9 GHz) enabled. Peak performance is ~62.7 GFLOPS in single precision and ~31.4 GFLOPS in double precision.
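For context, single-core peak figures like these follow from the clock rate times the FLOPs the core can issue per cycle: an AVX core of this generation (Ivy Bridge, no FMA) can issue one 8-wide single-precision multiply and one 8-wide add per cycle, i.e. 16 SP or 8 DP FLOPs/cycle. The arithmetic below is a sketch assuming the 3.9 GHz turbo clock, which lands close to the quoted peaks.

```python
# Back-of-the-envelope single-core peak for an AVX (non-FMA) core.
clock_ghz = 3.9            # assumed sustained single-core turbo clock
sp_flops_per_cycle = 16    # 8-wide SP mul + 8-wide SP add per cycle
dp_flops_per_cycle = 8     # 4-wide DP mul + 4-wide DP add per cycle

peak_sp = clock_ghz * sp_flops_per_cycle  # ~62.4 GFLOPS single precision
peak_dp = clock_ghz * dp_flops_per_cycle  # ~31.2 GFLOPS double precision
```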
We note here that ATLAS tends to outperform OpenBLAS in single precision until M = 24, at which point OpenBLAS pulls away, often delivering 20% more FLOPS than ATLAS.