Last week we were awarded a Director's Discretion project at the IT4Innovations National Supercomputing Centre here in Ostrava, together with a static allocation of the majority of the Intel Xeon Phi accelerated nodes for almost 4 hours. In this time we ran several scalability tests, focusing mainly on evaluating the Intel Xeon Phi acceleration of the Hybrid Total FETI solver.
We ran three types of tests: (i) the standard configuration of the ESPRESO CPU version, which uses a sparse direct solver (in this case the MKL version of PARDISO) to process the stiffness matrices; (ii) the local Schur complements of the stiffness matrices on the CPU; and finally (iii) the Schur complements on the Intel Xeon Phi accelerator. The Schur complement method is described here or, in more detail, in the paper.
The Salomon machine contains 432 compute nodes, each with two Intel Xeon Phi 7120P accelerators and two 12-core Intel Xeon E5-2680v3 @ 2.5 GHz CPUs. Each 7120P has 16 GB of RAM, so in total ESPRESO can use 32 GB of fast GDDR memory per node. This memory can hold a problem of approximately 3 million DOF decomposed into 1000 subdomains, 500 on each accelerator. We used up to 343 compute nodes (686 accelerators), and this configuration allowed us to solve a problem of 1.033 billion DOF, as shown in the figure above.
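The sizing above can be reproduced with a few lines of arithmetic. This is only a rough sketch: the function and constant names are ours, and because the per-node size is "approximately 3 million" DOF, the total comes out slightly below the reported 1.033 billion.

```python
# Back-of-the-envelope capacity estimate using only the figures quoted above.
PHI_PER_NODE = 2             # Intel Xeon Phi 7120P accelerators per node
PHI_MEMORY_GB = 16           # RAM per accelerator
DOF_PER_NODE = 3_000_000     # "approximately 3 million", so totals are approximate
SUBDOMAINS_PER_NODE = 1000


def capacity(nodes):
    """Return accelerator count, memory and problem size for `nodes` nodes."""
    return {
        "accelerators": nodes * PHI_PER_NODE,
        "memory_gb_per_node": PHI_PER_NODE * PHI_MEMORY_GB,
        "subdomains_per_accelerator": SUBDOMAINS_PER_NODE // PHI_PER_NODE,
        "total_dof": nodes * DOF_PER_NODE,
    }


largest_run = capacity(343)
for name, value in largest_run.items():
    print(f"{name}: {value:,}")
```

With 343 nodes the estimate gives 686 accelerators and roughly 1.03 billion DOF, consistent with the largest run.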
We can see that the Intel Xeon Phi acceleration speeds up the iterative solver 2.7 times compared to the original CPU version. Compared to the CPU version that also uses the Schur complement technique, the speedup is 2.2.
Please note that the first generation of the Intel Xeon Phi coprocessor was released at the same time as the Sandy Bridge architecture of the Intel Xeon processors. Sandy Bridge was followed by Ivy Bridge and then by Haswell, so in this test the CPU is two generations ahead.
The first “larger” tests of the GPU acceleration of the Hybrid Total FETI method in ESPRESO have been performed on the world’s largest GPU-accelerated machine. The problem solved is linear elasticity in 3D.
The tested method uses GPUs to accelerate the processing of the stiffness matrices, which are stored in the form of Schur complements. More about this method can be found here or, in more detail, in the paper.
As the memory of the GPU is limited (the Tesla K20X has “only” 6 GB of RAM) and the Schur complement is stored in general format, we are able to solve only 0.3 million unknowns per GPU. We are currently implementing support for symmetric Schur complements, which will double the problem size solvable per GPU. Taking this into account, we can expect to solve 100,000 unknowns per 1 GB of RAM, and therefore up to 1.6 million unknowns on the newly released Nvidia Tesla P100 accelerator with 16 GB of RAM.
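The estimate above is a simple linear extrapolation from the K20X numbers. A minimal sketch of that arithmetic follows. The function names are ours, and the assumption that capacity grows linearly with memory is taken from the text, not measured.

```python
# Linear extrapolation of solvable problem size per GPU, based on the
# K20X figures quoted above (0.3 million unknowns in 6 GB, general format).
K20X_MEMORY_GB = 6
K20X_UNKNOWNS_GENERAL = 300_000
SYMMETRIC_GAIN = 2           # symmetric storage doubles the solvable size


def unknowns_per_gb(symmetric=True):
    """Unknowns that fit into 1 GB of GPU memory."""
    per_gb = K20X_UNKNOWNS_GENERAL / K20X_MEMORY_GB
    return per_gb * SYMMETRIC_GAIN if symmetric else per_gb


def estimated_unknowns(memory_gb, symmetric=True):
    """Estimated unknowns for a GPU with `memory_gb` of RAM."""
    return round(unknowns_per_gb(symmetric) * memory_gb)


print(estimated_unknowns(6, symmetric=False))   # K20X today, general format
print(estimated_unknowns(16))                   # P100 with symmetric storage
```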
As these tests are executed on Titan, we are comparing the 16-core AMD Opteron 6274 CPU with the Nvidia Tesla K20X GPU. We observe a speedup of up to 3.5 for the iterative solver when the GPU is used in conjunction with the CPU.
As described here, the preprocessing time needed to calculate the Schur complements is longer than the time needed for factorization. Therefore, the full advantage of the GPU acceleration can be taken only if the problem needs a high number of iterations (more than 400).
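The trade-off between a more expensive preprocessing and cheaper iterations can be written as a break-even count. The sketch below is only an illustration: the function is ours, and the timings are hypothetical values chosen so that the break-even lands at the 400 iterations mentioned above. They are not measurements.

```python
import math


def break_even_iterations(prep_factorization, prep_schur,
                          iter_factorization, iter_schur):
    """Smallest iteration count at which the Schur complement variant is
    no slower overall. Returns None if it never catches up."""
    extra_preprocessing = prep_schur - prep_factorization
    saved_per_iteration = iter_factorization - iter_schur
    if saved_per_iteration <= 0:
        return None          # iterations are not faster, so it never pays off
    if extra_preprocessing <= 0:
        return 0             # preprocessing is not slower, it pays off at once
    return math.ceil(extra_preprocessing / saved_per_iteration)


# HYPOTHETICAL timings in seconds, for illustration only.
iterations = break_even_iterations(
    prep_factorization=10.0, prep_schur=60.0,
    iter_factorization=0.25, iter_schur=0.125,
)
print(iterations)
```

Any problem that converges in fewer iterations than this count is better served by the factorization-based variant.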
In April, the work on the scalability of the solver reached the point where we decided to run the first full-scale test of ESPRESO on the ORNL Titan supercomputer. In this test we used only the CPU part of the machine and did not use the GPU accelerators. The physical problem we solved was heat transfer, in other words the Laplace equation in 3D.
The method that allowed us to run such large tests is the Hybrid Total FETI method, which has been implemented in the ESPRESO library under the umbrella of the European project EXA2CT (EXascale Algorithms and Advanced Computational Techniques). For more information see the project website: www.exa2ct.eu.
The test was performed using the 3D cube benchmark generator, which is fully parallel and is able to generate all matrices required by the Hybrid TFETI solver in several seconds. The problem was decomposed into 17,576 clusters:
- 1 cluster of 7.2 million unknowns per compute node
- 1210 subdomains per cluster, each with 6859 unknowns
To sum up, the problem of 124 billion unknowns was decomposed into 21 million subdomains and solved in approximately 160 seconds, including all preprocessing required by the HTFETI method. The stopping criterion was set to 10e-3.
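The sizes quoted above can be cross-checked with a short calculation. Note that 1210 × 6859 is larger than 7.2 million, because neighbouring subdomains share their interface nodes. The sketch below assumes a structured cube mesh with one unknown per node, 26 × 26 × 26 clusters (26³ = 17,576), an 11 × 11 × 10 arrangement of subdomains inside each cluster and 18 elements per subdomain edge (19³ = 6859 nodes). The last two layouts are our inference from the numbers, not something the generator documentation states.

```python
import math

CLUSTERS_PER_EDGE = 26                  # 26**3 == 17,576 clusters
SUBDOMAINS_PER_CLUSTER = (11, 11, 10)   # ASSUMED layout; 11 * 11 * 10 == 1210
ELEMENTS_PER_SUBDOMAIN_EDGE = 18        # ASSUMED; gives 19 nodes per edge


def nodes_along(subdomains):
    """Nodes along one axis when `subdomains` subdomains share interfaces."""
    return subdomains * ELEMENTS_PER_SUBDOMAIN_EDGE + 1


clusters = CLUSTERS_PER_EDGE ** 3
subdomain_unknowns = nodes_along(1) ** 3
cluster_unknowns = math.prod(nodes_along(s) for s in SUBDOMAINS_PER_CLUSTER)
total_subdomains = clusters * math.prod(SUBDOMAINS_PER_CLUSTER)
total_unknowns = math.prod(
    nodes_along(CLUSTERS_PER_EDGE * s) for s in SUBDOMAINS_PER_CLUSTER
)

print(f"clusters:              {clusters:,}")
print(f"unknowns per subdomain: {subdomain_unknowns:,}")
print(f"unknowns per cluster:   {cluster_unknowns:,}")
print(f"subdomains in total:    {total_subdomains:,}")
print(f"unknowns in total:      {total_unknowns:,}")
```

Under these assumptions the calculation gives about 7.2 million unknowns per cluster, about 21 million subdomains and about 124 billion unknowns, which matches the figures reported for the run.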
The largest run used 0.56 petabytes (PB) of memory. The Top500 list contains machines, such as Sequoia or the K computer, with a total memory size close to 1.5 PB. On such machines we expect ESPRESO to be able to solve over 350 billion unknowns.
The experiments show that when scaling from 1,000 to 17,576 compute nodes, the iterative solver exhibits a parallel efficiency of almost 95%. This is the critical part of the HTFETI solver, and therefore the most important observation.
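Since each compute node holds one cluster of fixed size, this is a weak scaling experiment, and the efficiency is the ratio of the solver time on the reference node count to the time on the larger node count. A minimal sketch of that definition follows. The function name is ours and the two timings are hypothetical placeholders, not the measured values.

```python
def weak_scaling_efficiency(time_reference, time_scaled):
    """Weak scaling efficiency: the work per node is constant, so ideal
    scaling keeps the run time unchanged and gives an efficiency of 1.0."""
    if time_reference <= 0 or time_scaled <= 0:
        raise ValueError("timings must be positive")
    return time_reference / time_scaled


# HYPOTHETICAL iterative solver timings in seconds, for illustration only.
efficiency = weak_scaling_efficiency(time_reference=38.0, time_scaled=40.0)
print(f"parallel efficiency: {efficiency:.0%}")
```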
We will soon test strong scalability as well as linear elasticity.
In the last week of November, part of our team visited CSC – IT Helsinki. This Finnish IT center has long been developing ELMER, an open source multiphysics simulation software. This software is designed to solve large-scale problems, and it already contains some domain decomposition methods, including FETI. The goal of our group was to create a connection between the two software packages so that FETI from ESPRESO can be used in ELMER. Compared to the version already implemented in ELMER, our variant of FETI uses a more general routine to detect the singularity of stiffness matrices without stronger assumptions, and it contains many other features (Schur complements in dense format, support for Xeon Phi and GPU accelerators) that make this coupling worthwhile. Moreover, the ESPRESO solver supports the hybrid FETI version (a collection of several subdomains per MPI process plus a reduction of the coarse problem dimension), which will be implemented (activated for ELMER) in the next step.