Domain Decomposition¶
For more than two decades, the Finite Element Tearing and Interconnecting method (FETI, for more detail see, e.g., in [5], [7]) has been successfully used in the engineering community, primarily for very large problems arising from the discretization of partial differential equations. In such an approach the original structure is artificially decomposed into several nonoverlapping subdomains. Mutual continuity of primal variables between neighboring subdomains is enforced afterwards by dual variables, i.e., Lagrange multipliers (LM). They are usually obtained iteratively by some of the Krylov subspace methods, then the primal solution is evaluated locally on each subdomain. The FETI method is also applicable for contact problems with inequality constrains (see, e.g., [3], [4])
In 2006 Dostál et al. [2] introduced a new variant of an algorithm called Total FETI (or TFETI) in which Dirichlet boundary condition is enforced also by LM. Please note, there is no difference, if the hybrid variant stems from FETI or TFETI. The impact is identical in both cases. This paper addresses the modification of the TFETI variant, and is therefore called HTFETI.
The HTFETI method is a variant of hybrid FETI methods introduced by Klawonn and Rheinbach (see, e.g., [8], [9]) for FETI and FETIDP. In the original approach, the authors gather a number of subdomains in clusters which provides a threelevel domain decomposition approach. Each cluster consists of a certain number of subdomains and for these, a FETIDP system is set up. The clusters are then solved by a traditional FETI approach using projections to treat the non trivial kernels. In contrast, in HTFETI, a TFETI approach is used for the subdomains in each cluster and the FETI approach using projections for clusters.
The main advantage of HTFETI is the ability to solve problems decomposed into a very large number of subdomains.
To solve extremely large problems, the configuration of the HTFETI has to be different (Approach A: 1 cluster per compute node with large number of subdomains  up to several thousand) than for solving moderate problems (Approach B: 1 cluster per CPU core with small number of subdomains  up to a few hundred). In particular, Approach B can use subdomains of approximately 1,000 DOF to solve a 1 billion DOF problem on approximately 200 compute nodes with 64 GB of RAM. Solving the same problem with TFETI requires subdomains of approximately 90,000 DOF.
The comparison of HTFETI and TFETI methods is shown in Fig. 1. Using TFETI with large number of DOF per subdomain is an example of how to solve large problems using this technique. The figure shows that its scalability is not ideal, but is acceptable up to 3 billion DOF. However, the solution time in this range is still shorter than it would be with the use of the HTFETI technique. The figure clearly shows that for problems greater than 3 billion DOF, the HTFETI Approach A is better.
However, the HTFETI can be configured to the other extreme in which small number of subdomains per cluster (here 216) and very small number of DOF per subdomain (here 1536 DOF) are used. This approach is significantly faster than TFETI with large number of DOF per subdomain. Figure also shows that TFETI can be configured with smaller number of DOF per subdomain (here 20,577) to reduce processing time. However, this option is suitable only for very small problems as its scalability is unsatisfactory.
In sum, the figure shows that HTFETI provides better solution time for “smaller problems” up to 3 billion DOF and better scalability for large problems.
In the next section, the theory behind the HTFETI is described, followed by the description of the parallel implementation in the ESPRESO library.
Hybrid Total FETI Method¶
The FETI method, widely known for more than two decades, is an efficient tool for solving largescale problems in structural mechanics via Krylov subspace methods adapted to parallel machines. Although this paper is focused on the HTFETI, this section will first introduce the original FETI method on a simple cantilever beam
followed by an introduction of its hybrid variant.
FETI¶
In the simple engineering problem depicted in Fig. 2, the aim is to get deformation of the structure. A cantilever beam, fixed on the left side and loaded by the pressure \(p_y\) at the top, is discretizated with FEM [11]. The number of nodes is \(n_n^0\), and number of elements is \(n_e\). Above the mesh global stiffness matrix \({\mathbf{K}}_g\) and global RHS \({\mathbf{f}}_g\) (nodal forces) are assembled, whereas both objects are included into nodal equilibrium equation
To get the vector \({\mathbf{u}}_g\) (nodal displacements), Dirichlet boundary condition has to be taken into account due to a singular matrix \({\mathbf{K}}_g\) since it has a nonempty nullspace. Then the linear system can be solved by an iterative or direct solver.
Clearly, this undecomposed approach has its own limitations. The size of \({\mathbf{K}}_g\) can overload the computer memory which is one of the reasons one might employ domain decomposition methods. In the FETI case, the mesh is decomposed into four smaller submeshes (see Fig. 3a)
to avoid assembling the global objects \(\mathbf{K}_g\), \(\mathbf{f}_g\). Decomposition generally causes an increase in the number of nodes in cuts between subdomains, thus the global size of the unknowns of the decomposed problem is always bigger than the original one. In fact, it is not a problem, because any object in a FETI algorithm stored at one computational node does not have such dimension. Equilibrium equation for \(i\)th subdomain is
where \(\mathbf{K}_i\) and \(\mathbf{f}_i\) with the same meaning as before are assembled for all subdomains separately. On the RHS beside the vector \(\mathbf{f}_i\) the product \(\mathbf{B}^\top_i \boldsymbol{\lambda}\) appears. It is actually an additive nodal force vector which acts on the interface between subdomains and it substitutes influence transmitting from neighbors. Those four systems in (2) are supplemented by the global constraint condition
Equilibrium equations (2) together with condition (3) can be written globally as
or shortly
The meaning of symbols in (4) is obvious to detailed expression in (5). Generally, vector \(\mathbf{c}\) contains zero entries. Vector of LM
consists of four parts where the first subvector \(\boldsymbol{\lambda}^{d,1}\) enforces Dirichlet boundary condition, the second one \(\boldsymbol{\lambda}^{g,12}\) enforces connectivity between subdomains 1 and 2, etc. The division of global \(\boldsymbol{\lambda}\) into four parts does not relate to the number of subdomains. These four sets correspond to four cuts caused by the decomposition. The splitting into four subvectors shows how \(\boldsymbol{\lambda}\) can be stored according to numeric superindexes. For instance, the first part \(\boldsymbol{\lambda}^{d,1}\) is related to the first subdomain only, and it will never touch another one, therefore it is stored only on the first subdomain. The second part \(\boldsymbol{\lambda}^{g,12}\) connects subdomain 1 and 2, thus they are stored only there. Evidently, there is no need to assemble the global \(\boldsymbol{\lambda}\) vector on a single node.
In the next step, the primal variables will be eliminated. From Eq. (5) the vector of unknown displacements is
where \(\mathbf{R}=\text{diag}\left(\mathbf{R}_1,~\mathbf{R}_2,~\mathbf{R}_3,~\mathbf{R}_4\right)\) is blockdiagonal matrix, the columns of which define the basis of null space of \(\mathbf{K}\), and the vector \(\boldsymbol{\alpha}\) contains its amplitudes. Mutual relation between them is
The term (6) with the second condition in (5) eliminates the primal variables
similarly, as a combination of the first equation from (5) and Eq (7).
The last two terms together can be expressed as
in which the following substitutions are used
Projected Conjugate Gradient Method¶
In the previous steps, the primal variables \(\mathbf{u}\) were eliminated. The newly derived system of linear equations (10) can be favorably solved by iterative methods, e.g., the conjugate gradient method (CGM).
An approximation of LM used in Algorithm \(1\) in \(i\)th iteration is considered in the form
If an initial guess is chosen as a linear combination of basis vectors of matrix \(\mathbf{G}\) in the form of \(\boldsymbol{\lambda}_0=\mathbf{G} \mathbf{y}\), the vector \(\mathbf{y}\) can be simply determined using the second equation in (10) as
Such initial guess fully accomplishes the second equality in (10), therefore the rest of the approximation in (12) lies in the kernel of \(\mathbf{G}\).
During the iterative process, just the partial gradient \(\mathbf{g}_i\) is kept in the memory, and before using the complete form \(\overline{\mathbf{g}_{i+1}}\) it is modified by \(\boldsymbol{\alpha}_i\) to satisfy condition \(\mathbf{G}^\top\overline{\mathbf{g}_{i+1}}=\mathbf{o}\). Firstly, it appears if the initial conjugate vector \(\mathbf{w}_0\) is used as a contribution to the \(\boldsymbol{\lambda}_0\). To keep \(\mathbf{w}_0\) in the kernel of \(\mathbf{G}\), vector
has to be evaluated to satisfy the condition
The fulfillment of the condition above is equivalent to the projection of the part of the gradient
by the orthogonal projector
Due to the projection applied to each iteration, Algorithm 1 is called Projected Conjugate Gradient Method (PCGM) (for more details see [6]). The iterative process can be accelerated by Lumped or Dirichlet preconditioner, in this case the preconditioned gradient has to be projected again (for more detail see, e.g., [10]).
From FETI to HTFETI¶
FETI algorithm described in subsection FETI is efficient (besides other effects related to the condition of the system etc.) until CP
can be effectively factorized from the time point of view. Its dimension \(n_{cp}\) in linear elasticity depends on the number of subdomains \(n_{sub}\) (in 2D: \(n_{cp}=3\cdot n_{sub}\), in 3D: \(n_{cp}=6\cdot n_{sub}\)), and it is equal to all RBM in a decomposed structure. In Fig. 3, as illustrated using the FETI technique, each subdomain of the decomposed beam has three RBM (in 2D), therefore \(n_{cp}=3 \cdot 4 = 12\). CP dimension reduction by HTFETI can be demonstrated even in this small and simple benchmark. The permutation and splitting of matrix \(\mathbf{B}\) according to Fig. 3c reads
where matrix
consists of constraints gluing subdomains into 2 clusters, and
contains the rest of the equality constraints. This permuted system of the linear equation still has the structure for a FETI algorithm as in (4). The next equation shows the system after the permutation
This system can be rewritten in a simplified form
which can be solved by HTFETI method. The classical FETI method belongs to the group of dual Schur complement methods in which the primal variables \(\mathbf{u}\) are eliminated and satisfied exactly in each iteration. Dual variables \(\boldsymbol{\lambda}\), reaction forces between subdomains, are obtained by an iterative solver. In the case of HTFETI, not only the primal variables but also a subset of dual variables \(\boldsymbol{\lambda}_0\) are eliminated. This modification causes subdomains 1 and 2 to be partially glued onto interface into cluster 1, and subdomains 3 and 4 into cluster 2. After that, the structure behaves like a problem decomposed into two parts, and the dimension of CP is two times smaller accordingly (\(n_{cp}= 3 \cdot n_c = 3 \cdot 2=6\) where \(n_c\) is the number of clusters). The important thing is that the linear system (14) can be handled by the same methodology applied onto in subsections FETI and Projected Conjugate Gradient Method, so long as each object is relabeled by a tilde.
HTFETI  Optimal implementation¶
In the following text the methodology will be described on the first cluster and its subdomains 1, 2.
To keep optimal properties of the HTFETI code, the cluster stiffness matrix \(\tilde{\mathbf{K}}_1\) can not be assembled, but used implicitly. It requires modified routines for factorization and kernel detection. The factorization is obtained by solving \(\tilde{\mathbf{K}}_1\tilde{\mathbf{x}}=\tilde{\mathbf{b}}\) more detailed written as
with \(\mathbf{K}_{1:2}=\text{diag}(\mathbf{K}_1,~\mathbf{K}_2)\) and \(\mathbf{B}_{0,1:2}=(\mathbf{B}_{0,1},~\mathbf{B}_{0,2})\). Notation \(1:2\) points to the first and last ordinal number of the subdomains on the cluster. So for the second cluster it would be \(3:4\). The system (15) can be taken as a FETI problem to be solved not iteratively but by a direct solver. After the dualization it reads
with substitutions
To obtain the vector \(\tilde{\mathbf{x}}=(\mathbf{x}^\top,~\boldsymbol{\mu}^\top)^\top\), both systems , are subsequently solved in three steps
in which singular Schur complement
appears.
In the HTFETI method, not only do the subdomain matrices \(\mathbf{K}_i\) have to be factorized, but also \(\mathbf{F}_{0,1:2}\) and \(\mathbf{S}_{0,1:2}\)  one pair on each cluster. The dimension of \(\mathbf{F}_{0,1:2}\) is controlled by the number of LM which glue subdomains of the cluster, and the dimension of \(\mathbf{S}_{0,1:2}\) is given by the number of subdomains per cluster multiplied by the defect of \(\mathbf{K}_i\).
To get the kernel \(\tilde{\mathbf{R}}_1\) of the first cluster, the following term is written
or shortly
where the kernel is given by
Assuming that the subdomian kernels \(\mathbf{R}_1\) and \(\mathbf{R}_2\) are already known, the determination of \(\mathbf{H}_{1:2}\) remains. The first equation
from does not impose any special conditions onto \(\mathbf{H}_{1:2}\), because the product \(\mathbf{K}_{1:2} \mathbf{R}_{1:2}\) is already a zero matrix. The second equation reads
and it says, \(\mathbf{H}_{1:2}\) is kernel of \(\mathbf{G}_{0,1:2}\). It can be beneficial if kernels of subdomains stored in \(\mathbf{R}_{1,2}\) are obtained analytically and from one coordinate system (see, e.g., eq. (3.7) in ). Then the searched kernel is
where \(\mathbf{I}_{3,3}\in \mathbb{R}^{3\times3}\) is an identity matrix. Matrix \(\mathbf{H}_{1:2}\) is also the kernel of \(\mathbf{S}_{0,1:2}\), which can be regularized by it and then easily factorized.
Reference
[1]  ESPRESO  Exascale Parallel FETI Solver, http://espreso.it4i.cz. 
[2] 

[3] 

[4]  Z. Dostal, T. Kozubek, V. Vondrak, T. Brzobohaty, and A. Markopoulos.: Scalable TFETI algorithm for the solution of multibody contact problems of elasticity. International Journal for Numerical Methods in Engineering, pages n/a–n/a, 2009. 
[5] 

[6]  C. Farhat and F.X. Roux.: A method of finite element tearing and interconnecting and its parallel solution algorithm. International Journal for Numerical Methods in Engineering, 32(6):1205–1227, oct 1991. 
[7] 

[8] 

[9] 

[10] 

[11] 
