BLAS GEMM Benchmarks

In a scientific application I develop we make extensive use of matrix-matrix multiplications. It is therefore important that these operations are performant. Highly tuned multiplication kernels are a core component of the Basic Linear Algebra Subprograms (BLAS) API, and a wide variety of BLAS implementations, both open source and proprietary, exist for almost all HPC platforms.

What follows is a series of benchmarks, for a variety of BLAS libraries, of the matrix sizes that arise in our application. It is hoped that these benchmarks will help users make an informed choice when settling on a specific BLAS implementation.

The dimensions of the matrices, with the exception of the final test case (M = N = K = 2048), are all real-world examples taken from my solver.
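For reference, GEMM computes C = alpha*A*B + beta*C, where A is M x K, B is K x N and C is M x N. A typical call at these sizes might look as follows; this is a sketch against the CBLAS interface, with row-major storage and no transposition assumed for illustration:

    #include <cblas.h>

    /* Form C = A*B with A (M x K), B (K x N) and C (M x N), all held
       row-major; the leading dimensions are therefore K, N and N */
    void multiply(int M, int N, int K,
                  const float *A, const float *B, float *C)
    {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f, A, K,
                          B, N,
                    0.0f, C, N);
    }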

Methodology

When conducting the benchmarks, the following procedures were employed:
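Benchmarks of this kind are usually obtained by timing repeated GEMM calls and crediting each call with ~2MNK floating-point operations. The harness below is a minimal sketch along those lines, using the CBLAS interface and one of the problem sizes from the table that follows; it is illustrative only, and not necessarily the exact code behind the figures below.

    #include <cblas.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Wall-clock time in seconds */
    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9*ts.tv_nsec;
    }

    int main(void)
    {
        /* One of the problem sizes from the table below */
        const int M = 96, N = 25000, K = 64, nrep = 50;

        float *a = calloc((size_t)M*K, sizeof(float));
        float *b = calloc((size_t)K*N, sizeof(float));
        float *c = calloc((size_t)M*N, sizeof(float));

        double best = 1e300;
        for (int i = 0; i < nrep + 1; i++)
        {
            double t = now();
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        M, N, K, 1.0f, a, K, b, N, 0.0f, c, N);
            t = now() - t;

            /* Discard the first (warm-up) iteration; keep the best time */
            if (i > 0 && t < best)
                best = t;
        }

        /* Each multiplication performs ~2*M*N*K floating-point operations */
        printf("%.2f gigaflops\n", 2.0*M*N*K/best/1e9);

        free(a); free(b); free(c);
        return 0;
    }

Which BLAS is benchmarked is then simply a matter of linking; for example gcc -O2 -std=gnu99 bench.c -lopenblas for OpenBLAS, or -lcblas -latlas for ATLAS.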

Core i7-3770K

The following comparison is between ATLAS 3.11.11 and the current Git HEAD of OpenBLAS. Both were compiled using GCC 4.7.2 on a Gentoo Linux system. The CPU was a 3.5 GHz Intel Core i7-3770K with Turbo Boost (up to 3.9 GHz) enabled. With AVX a single core can execute up to 16 single-precision, or 8 double-precision, floating-point operations per cycle, giving a peak of ~62.7 gigaflops single precision and ~31.4 gigaflops double precision.
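As a quick sanity check on these peaks, one can multiply the clock rate by the per-cycle throughput. The fragment below does this assuming the nominal 3.9 GHz turbo frequency; the resulting ~62.4/~31.2 gigaflops are in line with the figures quoted above.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed single-core turbo frequency of the i7-3770K, in GHz */
        const double freq = 3.9;

        /* With AVX, Ivy Bridge can issue an 8-wide single-precision add
           and an 8-wide single-precision multiply each cycle; the widths
           are halved in double precision */
        const double spc = 16.0, dpc = 8.0;

        printf("Peak SP: ~%.1f gigaflops\n", freq*spc);  /* ~62.4 */
        printf("Peak DP: ~%.1f gigaflops\n", freq*dpc);  /* ~31.2 */

        return 0;
    }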

All figures are in gigaflops; S and D denote single- and double-precision results, respectively.

                       ATLAS             OpenBLAS
   M      N     K     S      D          S      D
   6  25000     3    8.78   5.84      6.20   4.94
   8  25000     4   11.31   6.63      8.83   6.26
   9  25000     6   15.68   9.07     11.51   8.17
  12  25000     9   20.94  11.31     14.25  10.37
  12  25000    10   21.80  11.22     15.18  10.84
  15  25000    15   26.33  13.32     17.54  12.21
  16  25000    16   22.55  12.25     21.51  13.77
  18  25000    21   25.80  13.37     23.02  14.14
  20  25000    25   24.49  13.68     24.29  15.63
  24  25000     8   24.62  11.92     17.77  11.33
  24  25000    36   24.91  15.18     30.73  17.01
  54  25000    27   28.69  17.92     32.09  17.28
  96  25000    64   35.16  22.37     45.17  22.18
 150  25000   125   40.65  25.70     48.41  25.26
 216  25000   216   53.75  25.09     54.18  27.90
2048   2048  2048   61.02  30.37     61.61  31.06

We note here that ATLAS tends to outperform OpenBLAS when working at single precision until M = 24. Beyond this point OpenBLAS pulls away, often delivering around 20% more FLOPS than ATLAS.