Last week we were awarded a Director's Discretion project at the IT4Innovations National Supercomputing Centre here in Ostrava, along with a static allocation of the majority of the Intel Xeon Phi accelerated nodes for almost 4 hours. In that time we ran several scalability tests, focusing mainly on evaluating Intel Xeon Phi acceleration of the Hybrid Total FETI solver.
We ran three types of tests: (i) the standard configuration of ESPRESO CPU, which uses a sparse direct solver (in this case the MKL version of PARDISO) to process the stiffness matrices; (ii) using the local Schur complements of the stiffness matrices on the CPU; and finally (iii) using the Schur complements on the Intel Xeon Phi accelerators. The Schur complement method is described here or in more detail in the paper.
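As a rough illustration of the difference between variant (i) and variants (ii) and (iii), here is a toy sketch in plain Python. It is not ESPRESO code. It assumes the local Schur complement has the form S = B K^{-1} B^T for a regular stiffness-like matrix K and a selection matrix B. In Total FETI the subdomain matrices are singular and a generalized inverse is used, so the regular K below is a simplification. All names (K, B, solve, apply_implicit, apply_explicit) are hypothetical.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense A)."""
    n = len(A)
    # Augmented matrix [A | b] in exact rational arithmetic
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


# Hypothetical toy subdomain: a small regular, stiffness-like matrix
K = [[2, -1, 0, 0],
     [-1, 2, -1, 0],
     [0, -1, 2, -1],
     [0, 0, -1, 2]]
# B picks the two boundary unknowns of the toy subdomain
B = [[1, 0, 0, 0],
     [0, 0, 0, 1]]
Bt = transpose(B)


def apply_implicit(lam):
    """Variant (i): every operator application needs a solve with K."""
    return matvec(B, solve(K, matvec(Bt, lam)))


# Variants (ii) and (iii): assemble the small dense S = B K^{-1} B^T once,
# one column per unit vector, then reuse it in every iteration.
m = len(B)
S = transpose([apply_implicit([1 if i == j else 0 for i in range(m)])
               for j in range(m)])


def apply_explicit(lam):
    """Each operator application is now a dense matrix-vector product."""
    return matvec(S, lam)


lam = [3, -5]
print(apply_implicit(lam) == apply_explicit(lam))
```

Both functions apply the same operator. The explicit form trades a one-time assembly cost and extra memory for a dense matrix-vector product in each iteration, which is the kind of workload that suits an accelerator.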
The Salomon machine contains 432 compute nodes, each with two Intel Xeon Phi 7120P accelerators and two 12-core Intel Xeon E5-2680v3 @ 2.5 GHz CPUs. Each 7120P has 16 GB of RAM, so in total ESPRESO can use 32 GB of fast GDDR memory per node. Into this memory we can fit a problem of approximately 3 million DOF decomposed into 1000 subdomains, 500 on each accelerator. We used up to 343 compute nodes (686 accelerators), and this configuration allowed us to solve a 1.033 billion DOF problem, as shown in the figure above.
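The sizing figures can be cross-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the numbers quoted in the text. The variable names are illustrative, and the per-subdomain size is derived by assuming the DOFs are spread evenly over the subdomains.

```python
# Figures quoted in the text
nodes_used = 343
accelerators_per_node = 2
memory_per_accelerator_gb = 16
subdomains_per_accelerator = 500
total_dof = 1.033e9

accelerators = nodes_used * accelerators_per_node                          # 686
memory_per_node_gb = accelerators_per_node * memory_per_accelerator_gb     # 32
subdomains_per_node = accelerators_per_node * subdomains_per_accelerator   # 1000
total_subdomains = nodes_used * subdomains_per_node                        # 343000

# Derived values, assuming an even distribution of DOFs
dof_per_node = total_dof / nodes_used
dof_per_subdomain = dof_per_node / subdomains_per_node

print(f"accelerators:          {accelerators}")
print(f"accelerator memory/node: {memory_per_node_gb} GB")
print(f"subdomains in total:   {total_subdomains}")
print(f"DOF per node:          {dof_per_node:,.0f}")
print(f"DOF per subdomain:     {dof_per_subdomain:,.0f}")
```

The derived value of roughly 3 million DOF per node agrees with the per-node capacity stated above.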
We can see that Intel Xeon Phi acceleration speeds up the iterative solver 2.7 times compared to the original CPU version. When the Schur complement technique is also used on the CPU, the speedup is 2.2.
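The two speedups also imply how much the Schur complement technique alone gains on the CPU. The sketch below assumes both figures refer to the same iterative solver time on the accelerator. The variable names are illustrative and the normalised times are not measurements.

```python
# Speedups quoted in the text (iterative solver, accelerator vs. CPU variants)
speedup_phi_vs_cpu_sparse = 2.7   # vs. the original CPU version (sparse direct solver)
speedup_phi_vs_cpu_schur = 2.2    # vs. the CPU version using Schur complements

# Normalise the accelerator time to 1 unit
time_phi = 1.0
time_cpu_schur = time_phi * speedup_phi_vs_cpu_schur
time_cpu_sparse = time_phi * speedup_phi_vs_cpu_sparse

# Implied gain from the Schur complement technique on the CPU alone
speedup_cpu_schur_vs_cpu_sparse = time_cpu_sparse / time_cpu_schur

print(f"CPU Schur vs. CPU sparse direct: {speedup_cpu_schur_vs_cpu_sparse:.2f}x")
```

Under that assumption, the Schur complement technique on the CPU is roughly 1.2 times faster than the original CPU version, and the accelerator contributes the rest.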
Please note that the first generation of the Intel Xeon Phi coprocessor was released at the same time as the Sandy Bridge architecture of the Intel Xeon processors. Sandy Bridge was followed by Ivy Bridge and then by Haswell, so in this test the CPU is two generations ahead.