Last week we were awarded a Director's Discretion project at the IT4Innovations National Supercomputing Centre here in Ostrava, together with a static allocation of the majority of the Intel Xeon Phi accelerated nodes for almost 4 hours. In this time we ran several scalability tests, focusing mainly on evaluating the Intel Xeon Phi acceleration of the Hybrid Total FETI solver.
We ran three types of tests: (i) the standard configuration of the ESPRESO CPU version, which uses a sparse direct solver (in this case the MKL version of PARDISO) to process the stiffness matrices; (ii) the local Schur complements of the stiffness matrices on the CPU; and finally (iii) the Schur complements on the Intel Xeon Phi accelerator. The Schur complement method is described here, or in more detail in the paper.
The Salomon machine contains 432 compute nodes, each with two Intel Xeon Phi 7120P accelerators and two 12-core Intel Xeon E5 CPUs. Each 7120P has 16 GB of RAM, so in total ESPRESO can use 32 GB of fast GDDR memory per node. This memory can hold a problem of approximately 3 million DOF decomposed into 1000 subdomains, 500 on each accelerator. We used up to 343 compute nodes (686 accelerators), and this configuration allowed us to solve a problem of 1.033 billion DOF, as shown in the figure above.
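As a rough cross-check of these figures, the arithmetic can be sketched in a few lines of Python. The constant and function names are illustrative only, and the per-node capacity is the approximate 3 million DOF quoted above, so the estimate lands slightly below the measured 1.033 billion DOF.

```python
# Back-of-the-envelope check of the Salomon figures quoted in the text.
# All names are illustrative; only the numbers come from the post.

ACCELERATORS_PER_NODE = 2       # two Xeon Phi 7120P cards per node
GDDR_PER_ACCELERATOR_GB = 16    # memory on each 7120P
DOF_PER_NODE = 3_000_000        # approximate capacity quoted in the text


def accelerator_memory_per_node_gb() -> int:
    """Fast GDDR memory available to the solver on one node."""
    return ACCELERATORS_PER_NODE * GDDR_PER_ACCELERATOR_GB


def total_accelerators(nodes: int) -> int:
    """Number of accelerators in a run on the given number of nodes."""
    return nodes * ACCELERATORS_PER_NODE


def estimated_total_dof(nodes: int) -> int:
    """Approximate problem size if every node holds DOF_PER_NODE unknowns."""
    return nodes * DOF_PER_NODE


if __name__ == "__main__":
    nodes = 343
    print("GDDR per node [GB]:", accelerator_memory_per_node_gb())
    print("accelerators:", total_accelerators(nodes))
    print("estimated DOF:", estimated_total_dof(nodes))
```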
We can see that the Intel Xeon Phi acceleration speeds up the iterative solver 2.7 times compared to the original CPU version. When the Schur complement technique is used on the CPU as well, the speedup is 2.2.
Please note that the first generation of the Intel Xeon Phi coprocessor was released at the same time as the Sandy Bridge architecture of the Intel Xeon processors. Sandy Bridge was followed by Ivy Bridge and then by Haswell, so in this test the CPU is two generations ahead.
The first “larger” tests of the GPU acceleration of the Hybrid Total FETI method in ESPRESO have been performed on the world’s largest GPU accelerated machine. The problem solved was linear elasticity in 3D.
The tested method uses GPUs to accelerate the processing of the stiffness matrices, which are stored in the form of Schur complements. More about this method can be found here, or in more detail in the paper.
As the memory of the GPU is limited (the Tesla K20X has “only” 6 GB of RAM) and the Schur complement is stored in general format, we are able to solve only 0.3 million unknowns per GPU. We are currently implementing support for symmetric Schur complements, which will double the problem size solvable per GPU. Taking this into account, we can expect to solve 100,000 unknowns per 1 GB of RAM, and therefore up to 1.6 million unknowns on the newly released Nvidia Tesla P100 accelerator with 16 GB of RAM.
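The capacity estimate above is simple proportional scaling, and can be written out as a small Python sketch. The function names are made up for illustration, and the model assumes that the solvable problem size grows linearly with GPU memory, which is the same simplification the estimate in the text makes.

```python
# Proportional estimate of the problem size that fits on one GPU.
# Assumption: unknowns scale linearly with GPU memory, and symmetric
# storage of the Schur complements doubles the capacity (as stated above).


def unknowns_per_gb(measured_unknowns: int, memory_gb: int,
                    symmetric: bool = False) -> int:
    """Unknowns per 1 GB of GPU RAM, derived from one measured configuration."""
    per_gb = measured_unknowns // memory_gb
    return 2 * per_gb if symmetric else per_gb


def estimated_unknowns(target_memory_gb: int, per_gb: int) -> int:
    """Problem size expected to fit on a GPU with the given amount of RAM."""
    return target_memory_gb * per_gb


if __name__ == "__main__":
    # Tesla K20X: 0.3 million unknowns in 6 GB with general storage.
    general = unknowns_per_gb(300_000, 6)
    symmetric = unknowns_per_gb(300_000, 6, symmetric=True)
    print("general storage, unknowns per GB:", general)
    print("symmetric storage, unknowns per GB:", symmetric)
    # Tesla P100 with 16 GB of RAM.
    print("P100 estimate:", estimated_unknowns(16, symmetric))
```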
As these tests were executed on Titan, we are comparing the AMD Opteron 6274 CPU with 16 cores and the Nvidia Tesla K20X GPU. We observe a speedup of up to 3.5 for the iterative solver when the GPU is used in conjunction with the CPU.
As described here, the preprocessing time needed to calculate the Schur complements is longer than that of the factorization. Therefore, the full advantage of the GPU acceleration can be taken only if the problem needs a high number of iterations (more than 400).
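The trade-off between a more expensive preprocessing and cheaper iterations can be illustrated with a simple break-even model. The sketch below is not taken from ESPRESO; the function names are hypothetical, and the timings in the example are invented placeholders (not measurements), chosen only so that the per-iteration speedup is 3.5 and the arithmetic is easy to follow.

```python
import math

# Simple cost model: total time = preprocessing + iterations * time per iteration.
# All timings are in milliseconds and are HYPOTHETICAL, not measured values.


def total_time_ms(preprocessing_ms: int, per_iteration_ms: int,
                  iterations: int) -> int:
    """Total solver time for a given number of iterations."""
    return preprocessing_ms + per_iteration_ms * iterations


def break_even_iterations(prep_fact_ms: int, iter_fact_ms: int,
                          prep_schur_ms: int, iter_schur_ms: int) -> int:
    """Smallest iteration count at which the Schur complement variant
    is no slower than the factorization variant."""
    extra_preprocessing = prep_schur_ms - prep_fact_ms
    saved_per_iteration = iter_fact_ms - iter_schur_ms
    if saved_per_iteration <= 0:
        raise ValueError("Schur complement iterations must be faster")
    return max(0, math.ceil(extra_preprocessing / saved_per_iteration))


if __name__ == "__main__":
    # Placeholder timings: factorization 10 s + 140 ms per iteration,
    # Schur complement 50 s + 40 ms per iteration (speedup 140 / 40 = 3.5).
    n = break_even_iterations(10_000, 140, 50_000, 40)
    print("break-even iterations:", n)
    print("factorization total [ms]:", total_time_ms(10_000, 140, n))
    print("Schur complement total [ms]:", total_time_ms(50_000, 40, n))
```

With these placeholder values the two variants cost the same at the break-even point, and the accelerated variant only pays off for problems that need more iterations than that.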
In April the work on the scalability of the solver reached a state where we decided to run the first full-scale test of ESPRESO on the ORNL Titan supercomputer. In this test we used only the CPU part of the machine and did not use the GPU accelerators. The physical problem we solved was heat transfer, or in other words the Laplace equation in 3D.
The method that allowed us to run such large tests is the Hybrid Total FETI method, which has been implemented in the ESPRESO library under the umbrella of the EXA2CT (EXascale Algorithms and Advanced Computational Techniques) European project. For more information see the project website: www.exa2ct.eu.
The test was performed using the 3D cube benchmark generator, which is fully parallel and is able to generate all matrices required by the Hybrid TFETI solver in several seconds. The problem was decomposed into 17,576 clusters:
- 1 cluster of 7.2 million unknowns per compute node
- 1210 subdomains per cluster, each of 6859 unknowns
To sum up, the problem of 124 billion unknowns was decomposed into 21 million subdomains and solved in approximately 160 seconds, including all the preprocessing required by the HTFETI method. The stopping criterion was set to 10e-3.
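The decomposition figures can be cross-checked with a short Python sketch. The counts of clusters, subdomains and unknowns per subdomain come from the text; the arrangement of subdomains inside a cluster (11 x 11 x 10) and the sharing of interface nodes between neighbouring subdomains are assumptions made here for illustration, not a description of how the benchmark generator is implemented. All names are illustrative.

```python
import math

# Figures quoted in the text.
CLUSTERS = 17_576
SUBDOMAINS_PER_CLUSTER = 1_210
UNKNOWNS_PER_SUBDOMAIN = 6_859   # heat transfer: one unknown per mesh node

# ASSUMPTION (not from the text): subdomains form an 11 x 11 x 10 grid
# in each cluster, and neighbouring subdomains share their interface nodes.
ASSUMED_CLUSTER_LAYOUT = (11, 11, 10)


def exact_cube_root(n: int) -> int:
    """Integer cube root; raises if n is not a perfect cube."""
    r = round(n ** (1 / 3))
    if r ** 3 != n:
        raise ValueError(f"{n} is not a perfect cube")
    return r


def nodes_along_axis(subdomains: int, elements_per_subdomain: int) -> int:
    """Mesh nodes along one axis when interface nodes are counted once."""
    return subdomains * elements_per_subdomain + 1


def unknowns(layout, elements_per_subdomain: int) -> int:
    """Unknowns of a structured grid of subdomains with shared interfaces."""
    return math.prod(nodes_along_axis(s, elements_per_subdomain) for s in layout)


if __name__ == "__main__":
    clusters_per_axis = exact_cube_root(CLUSTERS)                  # 26
    nodes_per_subdomain_edge = exact_cube_root(UNKNOWNS_PER_SUBDOMAIN)  # 19
    elements = nodes_per_subdomain_edge - 1                        # 18

    print("total subdomains:", CLUSTERS * SUBDOMAINS_PER_CLUSTER)
    print("unknowns per cluster:", unknowns(ASSUMED_CLUSTER_LAYOUT, elements))
    global_layout = tuple(clusters_per_axis * s for s in ASSUMED_CLUSTER_LAYOUT)
    print("global unknowns:", unknowns(global_layout, elements))
```

Under the assumed layout the sketch gives about 7.17 million unknowns per cluster and about 124.1 billion unknowns in total, in line with the rounded figures quoted above. The naive product of 1210 and 6859 is larger than 7.2 million because it counts shared interface nodes more than once.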
The memory used for the largest run was 0.56 petabytes (PB). The Top500 list includes machines, such as Sequoia or the K computer, with a total memory size close to 1.5 PB. On such machines we expect ESPRESO to be able to solve over 350 billion unknowns.
The experiments show that when scaling from 1,000 to 17,576 compute nodes, the iterative solver exhibits a parallel efficiency of almost 95%. This is the critical part of the HTFETI solver, and therefore the most important observation.
We are about to test the strong scalability, and also linear elasticity, soon.
The ESPRESO developer team gained access to the Titan machine through a Director's Discretion project. The project was awarded 2,700,000 core-hours. This means that we would be able to use the entire machine for up to 5 hours. This of course will not be the case, as thousands of smaller tests will be needed to arrive at a version that is able to use the entire supercomputer efficiently.
The main objectives of this project are as follows: (1) performance evaluation of ESPRESO H on all 18,000 compute nodes and identification of the bottlenecks at this scale using the parallel problem generator, (2) optimization of the communication layer and all global operations, (3) development and optimization of the GPU accelerated version at large scale, and (4) performance benchmarking of both the CPU and GPU versions using real-life problems.
The goal of the READEX project is to improve the energy efficiency of applications in the field of High-Performance Computing. The project brings together European experts from different ends of the computing spectrum to develop a tools-aided methodology for dynamic auto-tuning, allowing users to automatically exploit the dynamic behaviour of their applications by adjusting the system to the actual resource requirements.
The task of IT4I consists of the evaluation of dynamism in HPC applications; manual tuning, especially of the FETI domain decomposition solvers that combine direct and iterative methods; and the evaluation and validation of the developed tool, taking the results of the manual tuning as the baseline.
More information can be found at http://www.readex.eu.
The Intel® PCC at IT4Innovations National Supercomputing Center (Intel® PCC – IT4I) is developing highly parallel algorithms and libraries optimized for the latest Intel parallel technologies. The main activities of the center are divided into two pillars: Development of highly parallel algorithms and libraries, and Development and support of HPC community codes. The first pillar focuses on the development of state-of-the-art sparse iterative linear solvers combined with appropriate preconditioners and domain decomposition methods, suitable for the solution of very large problems distributed over tens of thousands of Xeon Phi accelerated nodes. The developed solvers will become part of the IT4I in-house ESPRESO (ExaScale PaRallel FETI SOlver) library. The second pillar includes creating an interface between ESPRESO and the existing community codes Elmer and OpenFOAM Extend Project.
More details about the centre can be found at http://ipcc.it4i.cz.