ESPRESO for Accelerators

ESPRESO MIC and ESPRESO GPU are versions that are able to utilize the performance of modern many-core accelerators such as the Intel Xeon Phi or Nvidia GPUs (Graphics Processing Units). They are based on the H version and therefore offer all of its advantages, including very good scalability.

In any FETI solver the most time-consuming part is the application of the generalized inverse of the subdomain system matrices. In the case of ESPRESO H this can take up to 90% of the solver runtime. As the system matrices in FEM are very sparse, the H version uses a sparse direct solver for this operation. Working with sparse structures is in general very inefficient on modern architectures, which rely on powerful SIMD units. Therefore we had to modify both the Total and Hybrid FETI algorithms to use dense data structures instead of sparse ones.

The FETI solver is an iterative solver based on the Conjugate Gradient (CG) method. In every iteration the solver has to perform a sparse matrix-vector multiplication (SpMV) with the gluing matrix B1T (step 1), call the forward and backward substitution of the sparse direct solver (step 2 – solve), and perform one more SpMV with B1 (step 3). These three steps, in particular step 2, are the most time-consuming parts of the solver.
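
For illustration, a minimal sketch of these three steps for a single subdomain is shown below, using Eigen in place of the sparse direct solver (ESPRESO uses PARDISO and distributed data structures) and assuming a regularized, symmetric positive definite K; the names K, B1 and lambda follow the notation used here.

```cpp
#include <Eigen/Sparse>
#include <Eigen/SparseCholesky>

// Per-iteration FETI operator: lambda -> B1 * K^-1 * B1^T * lambda.
// Sketch only: Eigen stands in for PARDISO, K is assumed regularized (SPD).
Eigen::VectorXd apply_F_sparse(
    const Eigen::SparseMatrix<double>& B1,                           // gluing matrix B1
    const Eigen::SimplicialLLT<Eigen::SparseMatrix<double>>& K_llt,  // Cholesky factorization of K
    const Eigen::VectorXd& lambda)                                   // current CG iterate
{
    Eigen::VectorXd x = B1.transpose() * lambda;  // step 1: SpMV with B1^T
    Eigen::VectorXd y = K_llt.solve(x);           // step 2: forward/backward substitution (solve)
    return B1 * y;                                // step 3: SpMV with B1
}
```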

The main idea that allows the efficient acceleration of the ESPRESO solver is to use a dense Schur complement (Sc) instead of the sparse Cholesky decomposition of the sparse system matrix (K).

To efficiently utilize highly parallel many-core processors with powerful SIMD units, we had to avoid the use of sparse structures (here B1 and K) by calculating the Schur complement Sc = B1 * K^-1 * B1^T. This new dense structure is then used to replace the three operations described in the previous paragraph by a single dense matrix-vector multiplication (GEMV). This can be executed very efficiently on any accelerator, as it can take advantage of its fast memory.
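
A corresponding sketch of the accelerated variant: once the dense Sc is available, the whole operator becomes a single dense matrix-vector product (on a GPU this maps to a call such as cublasDgemv). Again, Eigen is used only for illustration.

```cpp
#include <Eigen/Dense>

// With the dense Schur complement Sc = B1 * K^-1 * B1^T assembled in
// preprocessing, the three sparse steps collapse into one dense GEMV.
Eigen::VectorXd apply_F_dense(const Eigen::MatrixXd& Sc,
                              const Eigen::VectorXd& lambda)
{
    return Sc * lambda;  // dense matrix-vector product (GEMV)
}
```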

The bottleneck of this approach is the calculation of the Schur complement itself. We have evaluated two approaches for this calculation using the PARDISO sparse direct solver: (1) calculating the product K^-1 * B1^T and then multiplying the result by B1 from the left; here especially the first step, which calls the solve routine with a large number of right-hand sides (RHS), is extremely time consuming; and (2) using the special functions implemented in PARDISO SC, which are designed to calculate the SC directly and exploit the sparsity of the B1 matrix used as the RHS.
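
A rough sketch of approach (1) is shown below, with Eigen once more standing in for PARDISO and K assumed regularized (SPD); approach (2), the dedicated PARDISO SC routine, is not reproduced here.

```cpp
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <Eigen/SparseCholesky>

// Approach (1): X = K^-1 * B1^T via a multi-RHS solve, then Sc = B1 * X.
// Sketch only: Eigen replaces PARDISO, K is assumed regularized (SPD).
Eigen::MatrixXd assemble_schur_complement(const Eigen::SparseMatrix<double>& K,
                                          const Eigen::SparseMatrix<double>& B1)
{
    Eigen::SimplicialLLT<Eigen::SparseMatrix<double>> K_llt(K);  // factorize K once
    Eigen::SparseMatrix<double> B1T = B1.transpose();
    Eigen::MatrixXd rhs = Eigen::MatrixXd(B1T);  // B1^T as a dense block of right-hand sides
    Eigen::MatrixXd X = K_llt.solve(rhs);        // solve with many RHS: X = K^-1 * B1^T
    return B1 * X;                               // Sc = B1 * K^-1 * B1^T
}
```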

The calculation of the SC is done in the preprocessing stage, and it is therefore the equivalent of performing the factorization in the H version. We have evaluated how much time it takes to calculate the SC using both methods, and also how many times slower it is than calculating the Cholesky decomposition (the factorization step of the solver). The results are shown in the next figure.

Figure: Performance evaluation of the Schur complement calculation using a regular sparse direct solver (PARDISO) and the special routine to calculate the SC (implemented in PARDISO SC). The calculation of the SC is also compared to the calculation of the Cholesky decomposition (factorization) used by ESPRESO H.

It can be seen that the calculation of the Schur complement on Haswell processors (Intel Xeon E5-2680v3, 2.5 GHz, 12 cores) is between 5 and 25 times slower than the factorization. The slowdown becomes more noticeable as the subdomain size grows. In absolute terms, one can expect a slowdown ranging from seconds to tens of seconds.

Example 1: Intel Xeon Phi Acceleration on IT4I Salomon Supercomputer

Hardware: 2x Intel Xeon E5-2680v3, 2.5 GHz, 12 cores (one compute node of the Salomon supercomputer) with 2x Intel Xeon Phi 7120p

Problem size: 2.9 million unknowns decomposed; 2.1 million unknowns undecomposed

Subdomain size: 2187 unknowns

Number of subdomains: 1331

Factorization processing time: 3.1 seconds

Schur complement processing time: 27.6 seconds

Iterative solver runtime for 500 iterations:

  • 2x CPU without Schur complement – 276 seconds
  • 2x Intel Xeon Phi 7120p with Schur complement – 110 seconds (2.5x speedup)

Example 2: Nvidia GPU Acceleration on ORNL Titan Supercomputer 

Hardware: 1x 16-core 2.2 GHz AMD Opteron 6274 (Interlagos) with 1x Nvidia Tesla K20X

Problem size: 0.65 million unknowns decomposed; 0.5 million unknowns undecomposed

Subdomain size: 3993 unknowns

Number of subdomains: 150

Factorization processing time: 2.4 seconds

Schur complement processing time: 34.9 seconds

Iterative solver runtime for 500 iterations:

  • 1x CPU without Schur complement – 78.1 seconds
  • 1x Tesla K20X with Schur complement – 33.1 seconds (2.4x speedup)


The previous two examples show the weak and the strong side of the proposed approach: the penalty is paid once in the preprocessing stage, and all subsequent iterations then execute significantly faster.
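
As a rough illustration of this trade-off, the sketch below estimates from the reported times of the two examples after how many iterations the extra preprocessing cost of the Schur complement is amortized (plain arithmetic on the numbers above; the actual break-even point depends on the subdomain size and hardware).

```cpp
#include <cstdio>

int main()
{
    // Times taken from the two examples above (in seconds):
    // factorization, Schur complement, 500 CPU iterations, 500 accelerated iterations.
    struct Case { const char* name; double t_fact, t_sc, t_cpu500, t_acc500; };
    const Case cases[] = {
        {"Salomon (2x Xeon Phi 7120p)", 3.1, 27.6, 276.0, 110.0},
        {"Titan (1x Tesla K20X)",       2.4, 34.9,  78.1,  33.1},
    };
    for (const Case& c : cases) {
        double extra   = c.t_sc - c.t_fact;                 // one-off preprocessing penalty [s]
        double savings = (c.t_cpu500 - c.t_acc500) / 500.0; // time saved per iteration [s]
        std::printf("%s: break-even after ~%.0f iterations\n", c.name, extra / savings);
    }
    return 0;
}
```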

We can evaluate the usability for three scenarios:

The first scenario is a simple linear elasticity problem. In this case the number of iterations is usually less than 100. It is clear that for this type of problem the preprocessing time is dominant and the presented approach does not bring any advantage.

The second scenario is the group of ill-conditioned linear elasticity problems. These problems typically converge in several hundred iterations. This is the borderline situation where assembling the Schur complements becomes useful and the overall execution time is reduced.

The last group consists of transient and nonlinear problems with a constant tangent stiffness matrix. To solve these problems the solver has to perform tens to hundreds of iterations in every time step or nonlinear solver step. This adds up to thousands of iterations, and the presented approach has the potential to run significantly faster.