Improving the Connectivity Resilience of a Telecommunications Network to Multiple Link Failures Through a Third-Party Network

Currently, telecommunication networks are fully resilient, in terms of connectivity, to single link failures. On the other hand, multiple simultaneous link failures are becoming a concern to network operators, mainly due to malicious human activities. Full connectivity resilience to multiple link failures is too costly and other solutions must be envisaged. For a given maximum number of simultaneous link failures, the connectivity resilience metric adopted here is the minimum number of network node pairs that can still communicate for any set of failing links. In this work, the connectivity resilience to multiple link failures is improved by resorting to a third-party network for temporary additional connectivity (i.e., while the failing links are not reestablished). In such a solution, some nodes must be selected to act as gateway nodes between the two networks. For a given network topology and a given number of gateway nodes, the aim is to select the most appropriate gateway nodes so that the connectivity resilience is improved as much as possible. To address this problem, a Gateway Node Selection (GNS) algorithm is proposed where the most damaging sets of failing links are identified and, then, a set cover problem type is defined and solved to select the gateway nodes. The computational results demonstrate the effectiveness of the proposed GNS algorithm over two well-known network topologies.


INTRODUCTION
The resilience to failures of a given telecommunication network is generally defined as the capacity of the network to maintain its services after the failures. The network resilience to failures can be evaluated at different levels. Low impact (single) failures cause the rerouting of some service flows, potentially degrading service latency. Medium impact failures can also cause link congestion degrading not only service latency but also the available throughput. Large-scale failures have the potential of disconnecting the network in different components causing connectivity disruption between many pairs of network nodes, i.e., services between nodes in different components can no longer be supported.
In general, current telecommunication networks are fully resilient, in terms of connectivity, to single link failures. On the other hand, large-scale failures are becoming a concern to operators of telecommunication networks due to different reasons, as natural disasters [1] or malicious human activities [2]. In the latter case, current telecommunication networks are deployed over optical infrastructures which are vulnerable to many physical-layer attacks [3]. In particular, link cuts are a relatively straightforward method of a physical-layer attack and, depending on the selected links to be cut, it can severely degrade the services supported by the network.
Multiple link failures have been considered in different contexts in the last two decades. The impact of multiple link failures on optical networks based on wavelength division multiplexing (WDM) has been studied long time ago [4][5]. Multiple link failures have been modelled as shared risk link groups (SRLGs), i.e., groups of links with high probability of simultaneous failure [6][7]. The typical example is when multiple links share a single duct and, thus, the unattended cut of the duct makes all links to be simultaneously cut. More recently, multiple link failures have been modelled as regional failures in the context of natural disasters [1,[8][9]. In all these works, there is correlation of some kind between the failing links. Concerning uncorrelated failures, the particular case of dual link failures has also been addressed in different works [10][11][12][13]. Nevertheless, the case of more than two simultaneous uncorrelated failures (which can happen in malicious human activities when the attacker knows the network topology), has only been considered by very few works [14][15][16].
Full connectivity resilience to multiple link failures is too costly for a network operator as it requires too many physical links. Instead, operators need to enhance the connectivity resilience to such failures with solutions that do not represent a prohibitive investment. A recent example is the common emergency packet transport network proposed in [17] for disaster recovery. In that proposal, a third-party entity builds an emergency network with the surviving resources of multiple network operators (affected by a regional disaster) that can be jointly used by them. The emergency network is built in several steps to avoid confidential information leakage between network operators.
Here, we consider a simpler approach which does not require the third-party entity to exist (and also avoids confidential information leakage between operators). Consider an operator of a network deployed on an area (a region, a country or a continent) where other networks (of other operators) also exist. In the case of a multiple link failure, a cost effective solution is to resort to a third-party network to provide temporary connectivity between some network nodes while the failing links are not reestablished. The business relation between the two network operators can be one-way (the third-party operator charges the temporary connectivity provided when needed) or two-way (each operator provides the temporary connectivity when needed by the other), which might even eliminate the service costs charged between operators depending on the Service Level Agreement (SLA) defined between them. In any case, some nodes of one network must be selected to act as gateway nodes to the other network.
The connectivity resilience of the network is assumed to be the minimum number of network node pairs that can still communicate after any set of failing links (for a given maximum number L of simultaneous failures). Consider the example in Fig. 1 where two nodes of the network are set as gateway nodes, each one connecting to a node of the thirdparty network (in general, the number of gateway nodes can be two or more). Then, consider the simultaneous failure of the 3 links highlighted in dashed red in Fig. 2. Without the third-party network, this failure causes the network to be split in two components (one with 2 nodes and one with 5 nodes) and, before the recovery of at least one failing link, only nodes belonging to the same component can communicate between them. With the provision of a temporary "virtual link" by the third-party network between the gateway nodes (with a throughput defined in accordance to the SLA between the two operators), all network nodes can communicate between them. In fact, the two selected gateway nodes are optimal for any simultaneous failure of 3 links: these gateway nodes guarantee that any set of 3 failing links can disconnect at most one node from all others.  For a given network topology and a given number of gateway nodes, the aim is to select the gateway nodes to the third-party operator so that the connectivity resilience is improved as much as possible. To address this problem, a Gateway Node Selection (GNS) algorithm is proposed where the most damaging sets of failing links are identified and, then, a set cover problem type is defined and solved to select the most appropriate gateway nodes.
The paper is organized as follows. Section II describes how the connectivity resilience is evaluated for a given network either without gateway nodes or with a given set of gateway nodes. Section III describes the GNS algorithm. In section IV, a set of computational results is presented and the effectiveness of the GNS algorithm is discussed. Finally, Section V ends with the main conclusions of the work.

II. CONNECTIVITY RESILIENCE EVALUATION
First, consider the connectivity resilience evaluation of a given telecommunications network without gateway nodes. The network is defined by an undirected graph = ( , ). Each node of set is labelled as 1, 2, 3, …, , where = | |. Each link of set is defined by ( , ), with , ∈ and < . For a given a positive integer L, the connectivity resilience of graph for L simultaneous link failures is defined as the minimum number of node pairs that can communication whatever set of failing links. The connectivity resilience is given by the optimal solution of the Critical Link Detection (CLD) optimization problem defined as follows.
Consider a binary variable for each link ( , ) ∈ which is equal to 1 if link ( , ) is a critical (failing) link; or equal to 0, otherwise. Consider also a binary variable for each pair of nodes , ∈ , with < , which is equal to 1 if nodes and can communicate in the surviving network (i.e., in the network without the critical links). The CLD problem is defined by mixed integer linear programming as: Subject to: The objective function (1) is the minimization of the connectivity resilience, i.e., of the number of node pairs that can communicate in the surviving network. Constraint (2) guarantees that the number of critical links is not higher than . Constraints (3) guarantee that the end nodes of a link ( , ) ∈ can communicate if it is not a critical (failing) link, i.e., if = 0.
Constraints (4) guarantee that each pair of nodes , ∈ , with < , can communicate if there is a third node such that can communicate with both and . In constraints (4), { } represents if < , or otherwise (the same meaning for { } ). In order to minimize the number of constraints (4), it is enough to consider as the neighbor nodes of either or . In the notation of constraints (4), set ( , ) is the set of neighbor nodes of the node (among and ) with the lowest degree and, if nodes and are neighbors (i.e., if ( , ) ∈ ), one is excluded as neighbor of the other.
Finally, constraints (5-6) are the variable domain constraints. Note that only variables are declared as binary. Variables do not need to be declared as binary since they will be set to binary values in any optimal solution.
Consider now the connectivity resilience evaluation of a given telecommunications network with a given set of gateway nodes. We can model this case by: • considering an augmented graph ' by adding to graph one extra link per pair of gateway nodes, and • assuming that the extra links never fail (they represent the virtual links provided by the third-party operator only when needed).
So, the connectivity resilience of a network with a set of gateway nodes is the optimal solution of the CLD model defined over the augmented graph ' and setting the variables equal to 0 for all extra links ( , ). This optimization model will be referred henceforward as CLD-S model.

III. GATEWAY NODE SELECTION (GNS) ALGORITHM
Consider the notation introduced in the previous section. The GNS algorithm aims to select a set of nodes on a given graph to be configured as gateway nodes. The algorithm requires the determination not only of the most damaging set of failing links (i.e., the set of Critical Links) but also the second, third (and so on) most damaging sets of failing links.
The solution of the CLD model (presented in the previous section II) gives the most damaging set of failing links. To compute the next (i.e., the second) set of most damaging failing links, we need to add to the CLD model a constraint excluding the first solution from its feasible set. Consider the binary values of the variables of the optimal solution of the CLD represented by and consider the value given by the sum of all values , i.e., = ∑ ( , )∈ . The CLD-1 model is then defined as: Minimize (1) Subject to: The solution of CLD-1 model gives the second most damaging set of failing links. Again, considering the values of of the optimal solution of CLD-1 represented and = ∑ ( , )∈ , the CLD-2 model is defined as: Minimize (1) Subject to: (2-6), (7.1) The solution of CLD-2 model gives the third most damaging set of failing links. All next sets of most damaging failing links are computed by iteratively repeating this procedure. In general, the CLD-k model has one constraint for each of the k most damaging sets of failing links and its optimal solution provides the (k+1) th most damaging set of failing links.
Then, the selection of the gateway nodes is based on the previously computed most damaging sets of failing links. Note that the k th most damaging set, defined by the binary parameters , splits the network in two or more components. By computing the surviving graph = ( , ( , ) ∈ : = 0), we can determine the 2 largest components (in number of nodes). Consider and as the set of nodes of the largest and second largest component of the surviving graph . In order to guarantee that the nodes of the 2 largest components can communicate in the k th most damaging set of failing links, one needs to ensure that one gateway node belongs to and another gateway node belongs to .
The selection of the gateway nodes is based on a set cover problem (SCP). Consider the number of gateway nodes (to be selected) given by and the number of most damaging sets (to be considered) given by . Consider a binary variable for each node ∈ which is equal to 1 if node is selected as a gateway node, or equal to 0 otherwise. Consider also the parameter which is given by the closeness centrality value of node ∈ in graph (i.e., = 1 ∑ ⁄ and is the length of the shortest path from node to node in ). The selection of the gateway nodes is based on the optimal solution of the following SCP-K model: Subject to: The objective function (8) is the maximization of the closeness centrality of the nodes selected as gateway nodes. Constraint (9) guarantee that the number of gateway nodes is not higher than . Constraints (10-11) guarantee that each of the 2 largest components of each set of failing links contains (is covered by) one gateway node. Finally, constraints (12) are the variable domain constraints.
Note that, when the set of constraints (9-12) admits multiple solutions, the objective function (8) gives preference to the selection of more central nodes, minimizing in this way the average shortest path length from each node to the closest gateway node.
One difficulty in using the SCP-K model (8)(9)(10)(11)(12) is that one cannot know in advance the proper value of (i.e., how many most damaging sets of failing links to include in the model). The value of K should be as large as possible without turning the model infeasible. The Gateway Node Selection (GNS) algorithm proposed in this work is an iterative method that, at each iteration, the next most damaging set of failing links is computed and the resulting set cover problem is solved. The method ends when the set cover problem becomes infeasible. During all iterations, the connectivity resilience of each set cover solution is computed and the method selects the solution with the highest connectivity resilience value. The proposed GNS algorithm is presented in Algorithm 1. At the end of the algorithm, is the set of selected gateway nodes and is its resilience value (as provided by the CLD-S model).

IV. COMPUTATIONAL RESULTS
Consider the Janos-us (with 26 nodes, 42 links and 325 node pairs) and Germany50 (with 50 nodes, 88 links and 1225 node pairs) well-known network topologies [18] presented in Fig. 3. The figure also highlights in red the most damaging set of = 6 failing links when no gateway nodes are used. In Germany50, the failing link which is not visible in the figure is link (10,17). In the Janos-us case, the failing links split the topology in one component of 12 nodes, one component of 9 nodes and one component of 5 nodes. So, the connectivity resilience of Janos-us is 12 2 + 9 2 + 5 2 = 112 node pairs which is 34.5% of the total number of node pairs. In the Germany50 case, the failing links split the topology in one component of 34 nodes and one component of 16 nodes. So, the connectivity resilience of Germany50 is 34 2 + 16 2 = 681 node pairs which, in this case, is 55.6% of the total number of node pairs. The GNS algorithm was run for these two network topologies considering the number of failing links = 3, 4, 5 and 6 and the number of gateway nodes = 2, 3 and 4. The GNS algorithm was implemented in MATLAB and all optimization models were solved using CPLEX 12.8 software package. All results were obtained on a PC platform with an Intel Core i7 8 th generation processor and 32 GBytes of RAM. Table I shows the connectivity resilience (in number of node pairs that can communicate) of each network for each value of . Not surprisingly, a higher number of failing links results in a lower connectivity resilience value (more failing links disconnect more node pairs) and the absolute resilience values are higher for Germany50 since its total number of node pairs is higher than the total number of node pairs of Janos-us.  Janos-us  205  157  154  112  Germany50  1084  1000  796  681   Table II presents the computational results of the GNS algorithm for Janos-us network. Column 'Res' presents the connectivity resilience value of each solution. Then, column 'NIter' indicates how many iterations were run by the GNS algorithm and column 'Time' presents the running time (in seconds) of the GNS algorithm. The performance (in number of iterations and running time) observed in Table II shows that the GNS algorithm is efficient for Janos-us (largest running time below 3 minutes) implying that larger cases (in terms of and values) can also be solved. In terms of connectivity resilience, Fig. 4 presents the resilience gains (in percentage) obtained by the selected gateway nodes, i.e., the resilience gains from the values without gateway nodes (presented in Table I for Janos-us) and the values with gateway nodes (presented in Table II). The results in Fig. 4 show that the gains are very much parameter dependent. For example, with B = 2 gateway nodes, a high resilience gain of 40.1% is obtained for L = 4 failing links while an only moderate resilience gain of 11.7% is obtained for L = 5 failing links. Moreover, more gateway nodes do not always provide better resilience: increasing from 2 to 3 gateway nodes has improved the connectivity resilience for all cases but increasing from 3 to 4 gateway nodes has only improved the connectivity resilience for L = 6 failing links.  Table III presents the computational results of the GNS algorithm for Germany50 network (the meaning of each column is the same as in Table II). In the Germany50 case, the running times of GNS algorithm are in the order of some minutes for the problems instances with the largest and values (a little over 12 minutes in the worst case), indicating that the algorithm cannot solve much larger cases (in terms of and values).

Network L = 3 L = 4 L = 5 L = 6
As in the Janos-us case, Fig. 5 presents the resilience gains (in percentage) from the resilience values without gateway nodes (Table I for Germany50) to the resilience values with gateway nodes (Table III). In this case, it is clear that the resilience gains increase for larger number of failing links L, a behavior that is very different to the one observed on the Janos-us problem instances, highlighting also that the resilience gains are also dependent on the topology of the network. Nevertheless, an observation that is common to both networks is that more gateway nodes do not always provide better resilience. In the Germany50 case, increasing from 2 to 3 gateway nodes has only improved the connectivity resilience for L = 4 (very slightly) and for L = 5 (significantly) failing links and did not improve the connectivity resilience for L = 2 and 6 failing links. Like in Janos-us, increasing from 3 to 4 gateway nodes has only improved the connectivity resilience for L = 6 failing links. For illustration purposes, Fig. 6 presents for both networks the gateway nodes (highlighted in blue) provided by the GNS algorithm for B = 4 gateway nodes and considering L = 6 failing links. The figure also shows the most damaging set of failing links (highlighted in red) of these solutions.  = 180 node pairs, which represents an increase of 60.7% from the solution without gateway nodes (presented in Fig. 3). In the Germany50 case, the failing links split the topology in one component of 44 nodes and two components of 3 nodes each. So, the connectivity resilience of Germany50 with the four selected gateway nodes is now 44 2 + 2 × 3 2 = 952 node pairs, which represents an increase of 39.8% from the solution without gateway nodes (presented in Fig. 3).

V. CONCLUSIONS
Multiple simultaneous link failures are becoming a concern to network operators, mainly due to malicious human activities. Current telecommunication networks are deployed over optical infrastructures which are particularly vulnerable to link-cut attacks, a relatively straightforward method of a physical-layer attack. Full connectivity resilience to multiple link failures is too costly and, instead, operators need to enhance the connectivity resilience to multiple link failures with solutions that do not represent a prohibitive investment. One cost effective solution is to use a third-party network to provide temporary connectivity between some network nodes while the failing links are not reestablished. In such a solution, some nodes must be selected to act as gateway nodes between the two networks.
This work has addressed the gateway node selection problem aiming to improve the connectivity resilience of a given network as much as possible to multiple link failures. To address this problem, a Gateway Node Selection (GNS) algorithm was proposed where the most damaging sets of failing links are identified and, then, a set cover problem type is defined and solved to select the gateway nodes.
Two well-known network topologies were used in the computational results and the GNS algorithm was run for the simultaneous failure of up to a maximum of L = 6 links and to the selection of up to B = 4 gateway nodes. The computational results have shown that the proposed algorithm can solve all problem instances. Concerning the connectivity resilience gains obtained by using a third-party network, the results have shown that the gains are very much case dependent, i.e., they depend on the particular network topology, and on the particular values of L and B.