BLAS GEMM Benchmarks

In a scientific application I develop we make extensive use of matrix-matrix multiplications. It is therefore important that these operations are performant. Highly tuned multiplication kernels are a core component of the Basic Linear Algebra Subprograms (BLAS) API, and a wide variety of BLAS implementations, both open source and proprietary, exist for almost all HPC platforms.

What follows is a series of benchmarks, for a variety of BLAS libraries, of the matrix sizes that arise in our application. It is hoped that these benchmarks will help users make an informed choice when settling on a specific BLAS implementation.

The dimensions of the matrices, with the exception of the final test case (M = N = K = 2048), are all real-world examples taken from my solver.
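For reference, GEMM computes C = alpha*A*B + beta*C, where A is M x K, B is K x N and C is M x N. A typical call at these sizes might look as follows; this is a sketch against the CBLAS interface, with row-major storage and no transposition assumed for illustration:

    #include <cblas.h>

    /* Form C = A*B with A (M x K), B (K x N) and C (M x N), all held
       row-major; the leading dimensions are therefore K, N and N */
    void multiply(int M, int N, int K,
                  const float *A, const float *B, float *C)
    {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f, A, K,
                          B, N,
                    0.0f, C, N);
    }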

Methodology

When conducting the benchmarks, the following procedures were employed:
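Benchmarks of this kind are usually obtained by timing repeated GEMM calls and crediting each call with ~2MNK floating-point operations. The harness below is a minimal sketch along those lines, using the CBLAS interface and one of the problem sizes from the table that follows; it is illustrative only, and not necessarily the exact code behind the figures below.

    #include <cblas.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Wall-clock time in seconds */
    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9*ts.tv_nsec;
    }

    int main(void)
    {
        /* One of the problem sizes from the table below */
        const int M = 96, N = 25000, K = 64, nrep = 50;

        float *a = calloc((size_t)M*K, sizeof(float));
        float *b = calloc((size_t)K*N, sizeof(float));
        float *c = calloc((size_t)M*N, sizeof(float));

        double best = 1e300;
        for (int i = 0; i < nrep + 1; i++)
        {
            double t = now();
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        M, N, K, 1.0f, a, K, b, N, 0.0f, c, N);
            t = now() - t;

            /* Discard the first (warm-up) iteration; keep the best time */
            if (i > 0 && t < best)
                best = t;
        }

        /* Each multiplication performs ~2*M*N*K floating-point operations */
        printf("%.2f gigaflops\n", 2.0*M*N*K/best/1e9);

        free(a); free(b); free(c);
        return 0;
    }

Which BLAS is benchmarked is then simply a matter of linking; for example gcc -O2 -std=gnu99 bench.c -lopenblas for OpenBLAS, or -lcblas -latlas for ATLAS.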

Core i7-3770K

The following comparison is between ATLAS 3.11.11 and the current Git HEAD of OpenBLAS. Both were compiled using GCC 4.7.2 on a Gentoo Linux system. The CPU was a 3.5 GHz Intel Core i7-3770K with Turbo Boost (up to 3.9 GHz) enabled. With AVX a single core can execute up to 16 single-precision, or 8 double-precision, floating-point operations per cycle, giving a peak of ~62.7 gigaflops single precision and ~31.4 gigaflops double precision.
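As a quick sanity check on these peaks, one can multiply the clock rate by the per-cycle throughput. The fragment below does this assuming the nominal 3.9 GHz turbo frequency; the resulting ~62.4/~31.2 gigaflops are in line with the figures quoted above.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed single-core turbo frequency of the i7-3770K, in GHz */
        const double freq = 3.9;

        /* With AVX, Ivy Bridge can issue an 8-wide single-precision add
           and an 8-wide single-precision multiply each cycle; the widths
           are halved in double precision */
        const double spc = 16.0, dpc = 8.0;

        printf("Peak SP: ~%.1f gigaflops\n", freq*spc);  /* ~62.4 */
        printf("Peak DP: ~%.1f gigaflops\n", freq*dpc);  /* ~31.2 */

        return 0;
    }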

All figures are in gigaflops; S and D denote single- and double-precision results, respectively.

                       ATLAS             OpenBLAS
   M      N     K     S      D          S      D
   6  25000     3    8.78   5.84      6.20   4.94
   8  25000     4   11.31   6.63      8.83   6.26
   9  25000     6   15.68   9.07     11.51   8.17
  12  25000     9   20.94  11.31     14.25  10.37
  12  25000    10   21.80  11.22     15.18  10.84
  15  25000    15   26.33  13.32     17.54  12.21
  16  25000    16   22.55  12.25     21.51  13.77
  18  25000    21   25.80  13.37     23.02  14.14
  20  25000    25   24.49  13.68     24.29  15.63
  24  25000     8   24.62  11.92     17.77  11.33
  24  25000    36   24.91  15.18     30.73  17.01
  54  25000    27   28.69  17.92     32.09  17.28
  96  25000    64   35.16  22.37     45.17  22.18
 150  25000   125   40.65  25.70     48.41  25.26
 216  25000   216   53.75  25.09     54.18  27.90
2048   2048  2048   61.02  30.37     61.61  31.06

We note here that ATLAS tends to outperform OpenBLAS when working at single precision until M = 24. Beyond this point OpenBLAS pulls away, often delivering around 20% more FLOPS than ATLAS.