In a scientific application I develop we make extensive use of
matrix-matrix multiplications. It is therefore important that
these operations are performant. Highly tuned multiplication
kernels are a core component of the *basic linear algebra
subprograms* (BLAS) API. There exist a wide variety of BLAS
implementations—both open source and proprietary—for almost all
HPC platforms.

What follows is a series of benchmarks, for the matrix sizes that arise in our application, across a variety of BLAS libraries. It is hoped that these benchmarks will help users make an informed decision when choosing a specific BLAS implementation.

The dimensions of the matrices, with the exception of the final test-case (M = N = K = 2048), are all real-world examples taken from my solver.

When conducting the benchmarks, the following procedures were employed:

- all BLAS libraries were compiled/run in *serial*, with all operations running on a single CPU core;
- each benchmark was repeated 5000 times;
- the benchmarking process was pinned to the first core on the system;
- FLOPS were computed as 5000 × (2 × M × N × K)/Δt, where M, N, and K are the relevant dimensions of the matrices and Δt is the wall-clock time;
- all matrices were stored in *row major* order;
- all matrices were initialised with random values, with a consistent seed being used throughout;
- the GEMM parameters α and β were taken as 1.0 and 0.0, respectively;
- the transpose parameters were both set to *CblasNoTrans*;
- all numbers are quoted in *gigaflops*.
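The procedure above can be sketched in a few lines of Python. This is a simplified stand-in, not the harness actually used for the numbers below: it times `np.matmul`, which dispatches to whatever BLAS NumPy was built against, rather than calling `cblas_?gemm` directly, and it does no core pinning. The repetition count and the GFLOPS formula follow the list above.

```python
import time
import numpy as np

def bench_gemm(m, n, k, reps=5000, dtype=np.float64):
    """Time C = A x B over `reps` repetitions and return GFLOPS.

    NumPy arrays are row major by default, matching the benchmark
    setup; matmul with out= corresponds to alpha = 1, beta = 0.
    """
    rng = np.random.default_rng(42)      # consistent seed throughout
    a = rng.random((m, k), dtype=dtype)
    b = rng.random((k, n), dtype=dtype)
    c = np.empty((m, n), dtype=dtype)

    start = time.perf_counter()
    for _ in range(reps):
        np.matmul(a, b, out=c)
    dt = time.perf_counter() - start     # wall-clock time

    # GFLOPS = reps x (2 x M x N x K) / dt, as defined above
    return reps * (2 * m * n * k) / dt / 1e9

print(f"{bench_gemm(96, 25000, 64, reps=5):.2f} GFLOPS")
```

A real harness would also discard a few warm-up iterations so that page faults and cache warming do not pollute the timing.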

The following comparison is between ATLAS 3.11.11 and the current Git HEAD of OpenBLAS. Both were compiled using GCC 4.7.2 on a Gentoo Linux system. The CPU was a 3.5 GHz Intel Core i7 3770K with Turbo Boost (to 3.9 GHz) enabled. Peak is ~62.7 gigaflops single precision and ~31.4 gigaflops double precision.
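The peak figures follow from the core's issue width. The per-cycle FLOP counts below are the standard figures for this microarchitecture (Ivy Bridge), not something stated in the post; the quoted ~62.7/~31.4 presumably use the exact sustained turbo frequency, so this back-of-the-envelope version lands slightly lower.

```python
# An Ivy Bridge core with AVX can issue one 8-wide single-precision
# multiply and one 8-wide add per cycle: 16 SP FLOPs/cycle, and half
# that (8 DP FLOPs/cycle) in double precision.
turbo_hz = 3.9e9

sp_peak = turbo_hz * 16 / 1e9  # single-precision peak, GFLOPS
dp_peak = turbo_hz * 8 / 1e9   # double-precision peak, GFLOPS

print(sp_peak, dp_peak)  # ~62.4 and ~31.2 GFLOPS
```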

| M | N | K | ATLAS (S) | ATLAS (D) | OpenBLAS (S) | OpenBLAS (D) |
|---|---|---|---|---|---|---|
| 6 | 25000 | 3 | 8.78 | 5.84 | 6.20 | 4.94 |
| 8 | 25000 | 4 | 11.31 | 6.63 | 8.83 | 6.26 |
| 9 | 25000 | 6 | 15.68 | 9.07 | 11.51 | 8.17 |
| 12 | 25000 | 9 | 20.94 | 11.31 | 14.25 | 10.37 |
| 12 | 25000 | 10 | 21.80 | 11.22 | 15.18 | 10.84 |
| 15 | 25000 | 15 | 26.33 | 13.32 | 17.54 | 12.21 |
| 16 | 25000 | 16 | 22.55 | 12.25 | 21.51 | 13.77 |
| 18 | 25000 | 21 | 25.80 | 13.37 | 23.02 | 14.14 |
| 20 | 25000 | 25 | 24.49 | 13.68 | 24.29 | 15.63 |
| 24 | 25000 | 8 | 24.62 | 11.92 | 17.77 | 11.33 |
| 24 | 25000 | 36 | 24.91 | 15.18 | 30.73 | 17.01 |
| 54 | 25000 | 27 | 28.69 | 17.92 | 32.09 | 17.28 |
| 96 | 25000 | 64 | 35.16 | 22.37 | 45.17 | 22.18 |
| 150 | 25000 | 125 | 40.65 | 25.70 | 48.41 | 25.26 |
| 216 | 25000 | 216 | 53.75 | 25.09 | 54.18 | 27.90 |
| 2048 | 2048 | 2048 | 61.02 | 30.37 | 61.61 | 31.06 |

We note here that ATLAS tends to outperform OpenBLAS at single precision until M = 24. At this point OpenBLAS pulls away, often delivering some 20% more FLOPS than ATLAS.