Evaluation and Design of Elastic Optical Networks Resilient to Multiple Node Failures


Amaro de Sousa
Abstract-Consider an existing Elastic Optical Network (EON) with a given topology composed of nodes and connecting fibers, each fiber with a given spectrum capacity. Consider an estimated set of demands to be supported and a routing, modulation and spectrum assignment (RMSA) policy adopted by the operator both for the regular state and for the failure states. First, we address the resilience evaluation of the EON to multiple node failures. We adopt a worst-case approach by identifying the nodes (named critical nodes) whose simultaneous failure maximally reduces the demand percentage that is supported by the network, and we use this percentage as the resilience metric. Then, for the same estimated demands, the same RMSA policy and a fiber budget equal to the total fiber length of the existing network, we address the design problem aiming to determine a new EON maximizing the resilience metric imposed by its critical nodes. We use a multi-start greedy randomized method that generates multiple EONs and returns the best one, i.e., the EON with the highest resilience metric. We run the evaluation and design methods on known network topologies. The computational results let us (i) analyze the efficiency of the methods and (ii) assess how far the resilience of existing networks is from that of the best ones.
Index Terms-EON; Transparent Optical Networks; Critical Node Detection; Resilient Network Design; Disasters

I. INTRODUCTION
Large-scale failures are becoming more frequent and wider in scope, severely disrupting telecommunication networks and services [1]. So, both the impact evaluation of large-scale failures on existing networks and the design of networks more resilient to large-scale failures are becoming key issues (two surveys addressing these issues are [2], on strategies to protect networks against large-scale natural disasters, and [3], on security challenges in communication networks).
Large-scale failures might involve only network links or both network nodes and links (a node failure implies that its links fail). For example, in malicious human attacks, node shutdowns are harder to realize than link cuts but are the most rewarding from the attacker's perspective (a node shutdown also shuts down multiple links). Moreover, power outages shut down nodes but not fiber links, which do not require power supply. Here, we consider as large-scale failures the case of multiple node failures, as they are the most harmful.
In EONs, the optical spectrum of each fiber link is organized in frequency slots (FSs). Each demand between a pair of nodes is routed over an end-to-end lightpath (we assume a transparent optical network). On each direction of a lightpath, data is converted in the source from electrical to optical domain using a modulation format (MF) emitting on a set of contiguous FSs, transmitted through a routing path over the optical network and converted back to electrical domain in the target node. At the network level, multiple lightpaths can be set up if their FSs do not overlap on any fiber link.
Due to many factors, there is a maximum length, named transparent reach, for the routing path of a lightpath. Also, the MF of a lightpath impacts both its transparent reach and its number of FSs. For example, in a single carrier lightpath, a 16-QAM MF carries twice the number of bits/symbol of the QPSK MF but imposes a shorter transparent reach. So, on a shorter routing path, the same line rate (in bits/second) can be transmitted by 16-QAM instead of QPSK at half the symbol rate, which occupies fewer FSs [4]. Moreover, OFDM enables each lightpath to be composed of a set of sub-carriers, which can partially overlap in the spectrum domain, yielding spectrum gains [5], [6]. In this case, multiple sub-carriers with uniform symbol rate and bits/symbol can be selected for a required line rate so that the transparent reach of the lightpath is sufficient for the length of its routing path [4].
So, for a required demand line rate, a more spectrum efficient MF configuration is one that requires a smaller number of FSs but imposes a shorter transparent reach (the main principle behind the distance-adaptive spectrum allocation strategies [4], [7]). When the MF of each required lightpath is fixed, the decision on the routing path and FSs of each lightpath is known as the routing and spectrum assignment (RSA) problem. When multiple MFs are available, the assignment problem includes the selection of the MF configuration for each lightpath, which is known as the routing, modulation and spectrum assignment (RMSA) problem and has been addressed in many works and different contexts [4]-[15].
The failure of multiple nodes on an EON has three different impact factors. First, all demands with a failure node as one of their end nodes are lost. Second, if the failure nodes split the network into different components, all demands with end nodes in different components are also lost. Third, a demand whose end nodes remain in the same network component, but whose lightpath routing path contains at least one failure node, might be reassigned to a different lightpath. Its reassignment might require a different MF on a longer routing path which, in turn, might require more FSs on more links. So, the network might not have enough spectrum resources to reassign lightpaths to all such demands.
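The three impact factors can be illustrated with a minimal Python sketch that sorts demands into the corresponding categories for a given set of failure nodes. The function name, the data layout and the toy network are assumptions made for illustration only; the sketch does not attempt the reassignment itself.

```python
from collections import deque

def classify_demands(nodes, links, demands, paths, failed):
    """Split demands by the impact of a multiple node failure.

    nodes: iterable of node ids; links: iterable of (i, j) pairs;
    demands: dict demand id -> (source, target);
    paths: dict demand id -> node list of the assigned routing path;
    failed: set of failure nodes.
    """
    alive = set(nodes) - set(failed)
    adj = {n: set() for n in alive}
    for i, j in links:
        if i in alive and j in alive:
            adj[i].add(j)
            adj[j].add(i)

    # Label the connected components of the surviving network with BFS.
    comp = {}
    for start in alive:
        if start in comp:
            continue
        comp[start] = start
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp[v] = start
                    queue.append(v)

    result = {"end_node_failed": [], "disconnected": [],
              "disrupted": [], "unaffected": []}
    for d, (s, t) in demands.items():
        if s in failed or t in failed:
            result["end_node_failed"].append(d)   # first impact factor
        elif comp[s] != comp[t]:
            result["disconnected"].append(d)      # second impact factor
        elif any(n in failed for n in paths[d]):
            result["disrupted"].append(d)         # third: needs reassignment
        else:
            result["unaffected"].append(d)
    return result

# Hypothetical 6-node network in which node 4 fails.
links = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
demands = {"a": (1, 4), "b": (1, 5), "c": (5, 6), "d": (1, 2)}
paths = {"a": [1, 3, 4], "b": [1, 3, 4, 5], "c": [5, 4, 6], "d": [1, 2]}
impact = classify_demands(range(1, 7), links, demands, paths, {4})
```

Only the demands in the `disrupted` category compete for the remaining spectrum, which is where the RMSA policy for the failure state comes into play.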
Consider an existing EON with a given topology composed of nodes and connecting fibers, each fiber with a given capacity in number of FSs. Consider an estimated demand set and an RMSA policy adopted by the operator both for the regular and for the failure states. First, we address the resilience evaluation of the EON to multiple node failures. If the critical nodes are given, the RMSA can maximize the total demand still supported when all critical nodes fail, as in [16] where the approach in [17] is adapted to such a case. Here, the critical nodes are not given. Instead, our resilience evaluation adopts a worst-case approach by identifying the critical nodes as the set of nodes whose simultaneous failure maximally reduces the demand percentage that is still supported. So, the critical nodes are the result of an optimization problem and the obtained demand percentage is used as the resilience metric. Then, for the same set of estimated demands, the same RMSA policy and a fiber budget equal to the total fiber length of an existing EON, we address the design problem aiming to determine a new EON maximizing the resilience metric imposed by its critical nodes.
Critical Node Detection (CND) problems have been considered in different contexts [18]-[21] and are gaining special attention in the vulnerability evaluation of telecommunication networks to large-scale failures [2]. Other metrics have been used to evaluate the network vulnerability in other contexts [22] or assuming multiple geographically correlated failures [23]. There are also works on improving the preparedness of networks to multiple failures, some by changing the network topology [24]-[26], and others by proposing strategies to recover from failures [27], [28]. CND is used in [29] as a resilience metric in the optimal robust node selection problem. Both the evaluation and network design of optical networks resilient to multiple node failures were addressed in [30]. In that work, though, the RMSA is not considered, as the spectrum capacity of links is assumed to be infinite, i.e., the third impact factor is ignored in the resilience evaluation.
The paper is organized as follows. Section II describes the RMSA policy considered both for the regular and for the failure states. Section III presents the resilience evaluation method while Section IV presents the EON design method. The computational results are discussed in Section V. Finally, Section VI draws the main conclusions of the work.

II. ROUTING, MODULATION AND SPECTRUM ASSIGNMENT
For a given EON and a given set of demands, the RMSA policy rules the way lightpaths are assigned both in the regular state and in any failure state. Here, we adapt the proposal in [10] to both cases. Consider the EON topology represented by an undirected graph $G = (N, E)$ with a set of nodes $N = \{1, ..., |N|\}$ and a set of links $E \subseteq \{(i, j) \in N \times N : i < j\}$ whose lengths are represented as $l_{ij}$. Set $F = \{1, 2, ..., |F|\}$ is the ordered set of FSs available on each fiber link.
Set $D$ is the estimated set of demands. Each $d \in D$ is defined by its source node $s_d$ and target node $t_d$, $s_d < t_d$ (multiple demands between the same end nodes can exist and we allow their supporting lightpaths to have different routing paths). Here, we assume a single line rate optical network (i.e., all demands require the same line rate in bits/second) but its generalization to multiple line rates is straightforward.
To model the RMSA solution, we need two additional sets. Set $M$ is the set of MF configurations for the considered line rate. Each $m \in M$ is defined by its number of contiguous FSs $n_m$ (this value includes the required guard band between lightpaths) and its transparent reach $T_m$. Set $P_d$ is the set of lightpath candidate paths for demand $d \in D$. The optical length of path $p \in P_d$ is the sum of its link lengths plus a length value $\Delta$ per intermediate node (which models the optical degradation suffered by a lightpath while traversing an intermediate optical switch). Each path $p \in P_d$ is defined by:
• the binary parameters $\beta^p_k$ which are equal to 1 if node $k$ (which can be an end node) is in $p$ or equal to 0 otherwise;
• the binary parameters $\alpha^p_{ij}$ which are equal to 1 if link $(i, j)$, $i < j$, is in $p$ or equal to 0 otherwise;
• the integer parameter $n_p$ indicating the number of FSs of the most efficient MF configuration whose transparent reach is not smaller than the optical length of $p$.
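The parameter $n_p$ can be computed as in the following minimal Python sketch. The function names and the MF table are assumptions for illustration: the $(n_m, T_m)$ pairs below are placeholder values, not the ones used in the computational results, and only $\Delta = 60$ km is taken from Section V.

```python
def optical_length(path, link_length, delta):
    """Sum of the link lengths of the path plus delta per intermediate node."""
    hops = zip(path, path[1:])
    total = sum(link_length[(min(i, j), max(i, j))] for i, j in hops)
    return total + delta * max(len(path) - 2, 0)

def slots_for_path(path, link_length, delta, mf_configs):
    """Return n_p: the smallest n_m among the MF configurations whose
    transparent reach T_m covers the optical length of the path, or None
    if no configuration reaches that far."""
    length = optical_length(path, link_length, delta)
    feasible = [n_m for n_m, reach in mf_configs if reach >= length]
    return min(feasible) if feasible else None

# Hypothetical MF table as (n_m, T_m in km) pairs; illustrative values only.
MF_CONFIGS = [(3, 600), (4, 1200), (5, 2400), (6, 4800)]
DELTA = 60  # km per intermediate node

short_links = {(1, 2): 300, (2, 3): 200}
long_links = {(1, 2): 400, (2, 3): 300}
n_short = slots_for_path([1, 2, 3], short_links, DELTA, MF_CONFIGS)  # 560 km
n_long = slots_for_path([1, 2, 3], long_links, DELTA, MF_CONFIGS)    # 760 km
```

With these placeholder values, the 560 km path fits the most efficient configuration while the 760 km path needs the next one, which is the distance-adaptive trade-off described in the Introduction.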
In [10], each demand has a fixed required number of FSs. In our case, the required number of FSs is $n_p$, which depends on the candidate path $p \in P_d$. We associate to each $d \in D$ the parameter $n_d$ which gives the minimum number of FSs required by any of its candidate paths $p \in P_d$, i.e., $n_d = \min_{p \in P_d} n_p$. Consider set $P_e$ as the set of all candidate paths of all demands that include link $e \in E$. Similar to [10], a collision metric $c_e$ is computed for each link $e \in E$, given by $c_e = \sum_{d \in D} \sum_{p \in (P_d \cap P_e)} n_p$. Then, each candidate path $p \in \cup_{d \in D} P_d$ has an associated path length $l_p = \sum_{e \in p} c_e$ used to break ties when selecting candidate paths. All $l_p$ values are precomputed and used as parameters in the RMSA.
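As a small worked example, the collision metric and the resulting path lengths can be precomputed as in the Python sketch below. The data layout (each candidate path given as a list of links together with its $n_p$) and the function names are assumptions for illustration.

```python
from collections import defaultdict

def collision_metrics(candidates):
    """c_e: sum of n_p over all candidate paths (of all demands) using link e.

    candidates: dict demand -> list of (links, n_p), where links is the list
    of links (i, j), i < j, traversed by the candidate path.
    """
    c = defaultdict(int)
    for paths in candidates.values():
        for links, n_p in paths:
            for e in links:
                c[e] += n_p
    return dict(c)

def path_lengths(candidates, c):
    """l_p: sum of c_e over the links of p, keyed by (demand, path index)."""
    return {(d, k): sum(c[e] for e in links)
            for d, paths in candidates.items()
            for k, (links, _) in enumerate(paths)}

# Two hypothetical demands on a 3-node triangle.
candidates = {
    "d1": [([(1, 2)], 3), ([(1, 3), (2, 3)], 4)],
    "d2": [([(2, 3)], 3)],
}
c = collision_metrics(candidates)   # link (2, 3) is used by two paths: 4 + 3
l = path_lengths(candidates, c)
```

Link (2, 3) is shared by candidate paths of both demands, so it gets the highest collision metric and makes the two-hop path of `d1` less attractive as a tiebreaker.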
The RMSA policy for the regular state is given by Algorithm 1, a greedy algorithm that starts with an empty network (i.e., all FSs are free in all links) and iteratively assigns to a demand $d \in D$ a lightpath $p \in P_d$ and a set of $n_p$ contiguous FSs. Each assignment is the one that packs the lightpaths as much as possible in the lowest part of the spectrum. Algorithm 1 starts by computing the maximum value $n$ among the $n_d$ values of all demands (Step 1) and initializes set $\bar{D}$ with all demands such that $n_d = n$ (Step 3). Then, for each candidate path $p$ of each demand in $\bar{D}$ (Step 6), the algorithm computes the lowest set of $n_p$ contiguous FSs that can be assigned without overlap with previous assignments. The algorithm selects the candidate path whose highest selected FS index is the lowest among all and, as a tiebreaker, the one with the shorter path length $l_p$ (Steps 8 and 9). The selected path and associated set of FSs is used to assign the lightpath to the corresponding demand (Step 12) and the demand is removed from set $\bar{D}$ (Step 13). When $\bar{D}$ becomes empty, $n$ is decreased and the algorithm continues until $n$ reaches 0.

Algorithm 1 (Steps 6-9)
6: for all $p \in P_d$, $d \in \bar{D}$ do
7: $f \leftarrow$ highest FS index of the lowest set of $n_p$ contiguous FSs that can be assigned on $p$ to $d$ without overlap with previous assignments
8: if $f < \bar{f}$ or ($f = \bar{f}$ and $l_p < \bar{l}$) then
9: $\bar{f} \leftarrow f$, $\bar{l} \leftarrow l_p$, $\bar{p} \leftarrow p$ and $\bar{d} \leftarrow d$

A key issue in the RMSA is the set of candidate paths $P_d$ to consider for each demand $d \in D$. In [10], a k-shortest path algorithm is used with $k = 3$. Our tests have shown that a value of $k = 7$ is required, but not necessarily for all demands. The best strategy is to consider for each demand $d$ a number of candidate paths equal to the minimum between 7 and the number of nodes in the shortest path from $s_d$ to $t_d$ plus one.
While the RMSA policy in the regular state is defined by Algorithm 1, in a failure state a slightly different variant is used. Lightpaths not disrupted by any failure node are not changed. So, the algorithm considers the surviving network (i.e., without the failure nodes and incident links) with the FSs occupied by the non-disrupted lightpaths. For the disrupted demands whose end nodes are in the same component, a new set of candidate paths and associated path lengths is computed. Then, the RMSA is similar to Algorithm 1 but considers the demands $d$ in increasing order of their $n_d$ values (as opposed to the decreasing order of Algorithm 1). Since the aim is to reassign as many disrupted lightpaths as possible, our tests have shown that the increasing order is better, on average, since the lightpaths requiring fewer FSs fit better in the initially fragmented spectrum.
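The greedy assignment of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: FS indices are 0-based internally, each candidate path is given as a hypothetical triple (links, $n_p$, $l_p$), and demands for which no candidate path fits within the available FSs are simply reported as blocked.

```python
from collections import defaultdict

def first_fit(occupied, links, n, num_slots):
    """Lowest start s such that FSs s..s+n-1 are free on every link, or None."""
    for s in range(num_slots - n + 1):
        if all(occupied[e].isdisjoint(range(s, s + n)) for e in links):
            return s
    return None

def greedy_rmsa(candidates, num_slots):
    """candidates: dict demand -> list of (links, n_p, l_p).
    Returns (assignment, blocked); assignment maps demand -> (links, start, n_p)."""
    occupied = defaultdict(set)            # link -> set of used FS indices
    assignment, blocked = {}, []
    n_d = {d: min(n_p for _, n_p, _ in paths)
           for d, paths in candidates.items()}
    for n in range(max(n_d.values()), 0, -1):      # decreasing n
        pending = {d for d in candidates if n_d[d] == n}
        while pending:
            best = None
            for d in sorted(pending):
                for links, n_p, l_p in candidates[d]:
                    s = first_fit(occupied, links, n_p, num_slots)
                    if s is None:
                        continue
                    key = (s + n_p, l_p)   # highest FS index, then l_p
                    if best is None or key < best[0]:
                        best = (key, d, links, n_p, s)
            if best is None:               # nothing fits any more
                blocked.extend(sorted(pending))
                break
            _, d, links, n_p, s = best
            for e in links:
                occupied[e].update(range(s, s + n_p))
            assignment[d] = (links, s, n_p)
            pending.remove(d)
    return assignment, blocked

# Hypothetical instance: three demands on a triangle with 8 FSs per link.
candidates = {
    "a": [([(1, 2)], 2, 5)],
    "b": [([(1, 2)], 2, 5), ([(1, 3), (2, 3)], 3, 9)],
    "c": [([(2, 3)], 1, 4)],
}
assignment, blocked = greedy_rmsa(candidates, num_slots=8)
```

In this toy instance, demand `b` is routed over its two-hop candidate because that choice ends at a lower FS index than stacking it above `a` on link (1, 2), which is the packing criterion of Steps 8 and 9. Iterating `n` in increasing order instead gives the ordering used in the failure-state variant.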

III. RESILIENCE EVALUATION PROBLEM
The EON resilience to multiple node failures measures their impact on the network capacity to support the estimated demands. For a given number $c \in \mathbb{N}$ of failure nodes, we adopt a worst-case approach by identifying a set of $c$ critical nodes whose simultaneous failure maximally reduces the demand percentage that is supported.
For a given EON and a given set of lightpaths (assigned by the RMSA policy to a given set of demands), the determination of the critical nodes is a min-max bi-level optimization problem: at the bottom level, the RMSA policy aims to maximize the demand percentage that is supported under a given set of node failures; at the top level, the set of node failures aims to minimize the demand percentage that the RMSA policy is able to support. We solve the problem heuristically by computing two sets of failure nodes, running the RMSA policy for each set and selecting the most damaging set as the critical node set.
The first set of failure nodes is computed by solving the weighted version of the Critical Node Detection (CND) problem shown to be efficiently solved by mixed integer linear programming [21]. To compute the second set of failure nodes, we propose a Node Demand Centrality (NDC) metric and use it in a greedy approach to iteratively select the failure nodes. Next, we describe separately each of the two methods.
CND based method. For each node $i \in N$, consider a binary variable $v_i$ indicating whether $i$ is a critical node or not. For each node pair $(i, j)$, with $i < j$, consider: (i) a weight $w_{ij}$ given by the sum of all demands $d \in D$ whose end nodes are $i$ and $j$, (ii) a set $N_{ij}$ which is the set of nodes adjacent to $i$ (on graph $G$) if the degree of node $i$ is not higher than the degree of node $j$, or the set of nodes adjacent to $j$ otherwise, and (iii) a binary variable $u_{ij}$ which is 1 if nodes $i$ and $j$ are connected or 0 otherwise. For a given number $c$ of critical nodes, the CND problem is defined by objective (1) subject to constraints (2)-(6).
The objective (1) is to minimize the total weighted connectivity, i.e., the sum of the weights of the node pairs that remain connected when the critical nodes are removed. Constraint (2) ensures that at most $c$ nodes are selected as critical nodes (in optimal solutions, $c$ nodes are selected). Constraints (3) guarantee that a pair of adjacent nodes is connected if neither of the two nodes is a critical node. Constraints (4) are an efficient generalization of constraints (3) for the node pairs that are not adjacent in $G$: node pair $(i, j)$ is connected if there is a non-critical node $k \in N_{ij}$ such that $k$ is connected to both $i$ and $j$. Constraints (5)-(6) are the variable domain constraints. As noted in [21], constraints (6) can be replaced by $u_{ij} \geq 0$, reducing the number of binary variables.
The set of failure nodes is computed by determining an optimal solution of this model using an available ILP solver. Note that such a solution is a heuristic solution for our problem since it takes into account neither the transparent reach of lightpaths nor the spectrum capacity of fiber links.
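For reference, a formulation consistent with the description of (1)-(6) can be written as below. It is a reconstruction from the text rather than a verbatim copy of the model in [21]: in particular, the exact form of constraints (4) is an assumption, and $u_{ik}$ denotes the variable of the unordered node pair $\{i, k\}$.

```latex
\begin{align}
\min \quad & \sum_{i, j \in N : i < j} w_{ij}\, u_{ij} \tag{1} \\
\text{s.t.} \quad & \sum_{i \in N} v_i \leq c \tag{2} \\
& u_{ij} + v_i + v_j \geq 1, && (i, j) \in E \tag{3} \\
& u_{ij} \geq u_{ik} + u_{jk} - 1, && (i, j) \notin E,\ k \in N_{ij} \tag{4} \\
& v_i \in \{0, 1\}, && i \in N \tag{5} \\
& u_{ij} \in \{0, 1\}, && i, j \in N : i < j \tag{6}
\end{align}
```

In this form, (3) forces $u_{ij} = 1$ for adjacent pairs with $v_i = v_j = 0$, and (4) propagates connectivity through a neighbour $k$; a critical neighbour does not trigger (4) in an optimal solution because nothing forces its $u$ variables to 1.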
NDC (Node Demand Centrality) based method. The proposed demand centrality of each node $k \in N$ aims to measure the impact of the node failure on the demands between all other node pairs. Let us denote as $p_d$ the lightpath $p \in P_d$ assigned to a demand $d \in D$. The resources used by each lightpath $p_d$, denoted as $S_d$, are given by its number of FSs times the number of hops of its routing path, i.e., $S_d = n_d \times \sum_{(i,j) \in E} \alpha^{p_d}_{ij}$. Then, the node failure impact is measured as a combination of two quantities: (i) $Q_1$, the total demand that can no longer be supported, and (ii) $Q_2$, the minimum resource increase required to reassign new lightpaths to demands that can still be connected.
So, for each node $k \in N$ and for each lightpath assigned to a demand $d$ between a pair of other nodes whose routing path includes $k$, we compute the candidate path $p'_d \in P_d$ that does not include $k$ and requires the least amount of resources $S'_d$. If such a candidate path does not exist, demand $d$ is added to $Q_1$. If it exists and $S'_d > S_d$, the value $S'_d - S_d$ is added to $Q_2$; otherwise, the demand is ignored. At the end, the demand centrality $r_k$ of node $k$ is $r_k = (Z \times Q_1) + Q_2$. The factor $Z$ defines the relative weight between the two quantities. Based on preliminary tests, the best results are obtained when $Z$ is either the highest value of $S'_d - S_d$, if any of such values was added to $Q_2$, or is 1 otherwise.
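The demand centrality of a single node can be sketched in Python as below. The sketch rests on two assumptions that the text does not state explicitly: the value added to $Q_2$ is taken to be the resource increase $S'_d - S_d$ (with $Z$ the largest such increase), and $Q_1$ counts demands, which matches the single line rate setting. The function name and data layout are hypothetical.

```python
def demand_centrality(k, demands, assigned, candidates):
    """Demand centrality r_k = Z * Q1 + Q2 of node k.

    demands: dict d -> (source, target);
    assigned: dict d -> (path_nodes, n_fs) of the current lightpath;
    candidates: dict d -> list of (path_nodes, n_fs) candidate paths.
    Resources of a path: number of FSs times number of hops.
    """
    def resources(path_nodes, n_fs):
        return n_fs * (len(path_nodes) - 1)

    q1 = 0
    increases = []
    for d, (s, t) in demands.items():
        path, n_fs = assigned[d]
        # Only demands between other nodes and routed through k matter.
        if k in (s, t) or k not in path:
            continue
        alternatives = [resources(p, n) for p, n in candidates[d]
                        if k not in p]
        if not alternatives:
            q1 += 1                        # demand can no longer be supported
        else:
            increase = min(alternatives) - resources(path, n_fs)
            if increase > 0:               # assumed contribution to Q2
                increases.append(increase)
    q2 = sum(increases)
    z = max(increases) if increases else 1
    return z * q1 + q2

# Hypothetical example evaluating node 2.
demands = {"a": (1, 3), "b": (1, 5), "c": (2, 3), "d": (4, 3)}
assigned = {"a": ([1, 2, 3], 2), "b": ([1, 2, 5], 2),
            "c": ([2, 3], 2), "d": ([4, 3], 2)}
candidates = {"a": [([1, 2, 3], 2), ([1, 4, 3], 3)],
              "b": [([1, 2, 5], 2)],
              "c": [([2, 3], 2)],
              "d": [([4, 3], 2)]}
r_2 = demand_centrality(2, demands, assigned, candidates)
```

Here demand `a` can be rerouted at a cost of 2 extra resource units and demand `b` has no alternative, so $Z = 2$, $Q_1 = 1$, $Q_2 = 2$ and $r_2 = 4$.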
The set of failure nodes $C$ is determined with a greedy algorithm, presented in Algorithm 2, which uses the demand centrality value of each node to select the failure nodes. The algorithm starts with graph $G$ (representing the EON) and set $D$ (of all demands with lightpaths assigned by the RMSA policy) in Line 1. On each cycle, the algorithm (i) computes the demand centrality $r_k$ of each node $k$ (Lines 4-23), (ii) determines the node $\bar{k}$ with the highest demand centrality (Line 24), (iii) selects node $\bar{k}$ as a failure node (Line 25), (iv) removes from $D$ the demands routed through node $\bar{k}$ (Line 26) and (v) removes node $\bar{k}$ and its incident links from $G$ (Line 27). The algorithm ends when the desired number $c$ of nodes has been determined (Line 28).

IV. NETWORK DESIGN PROBLEM
For the same set of demands, the same RMSA policy and a fiber budget B equal to the total fiber length of an existing EON, the design problem determines a new EON maximizing the resilience metric imposed by its critical nodes.
We have seen in the previous section that the evaluation (i.e., the determination of the resilience metric imposed by the EON critical nodes) is a min-max bi-level optimization problem. In the network design case, since we aim to compute an EON maximizing its evaluation value, this problem is a max-min-max tri-level optimization problem. To solve this problem, we use a multi-start greedy randomized heuristic similar to the one proposed in [30] that generates multiple EONs and returns the one whose resilience metric is the highest.
Algorithm 2 NDC based method
1: Given $G = (N, E)$ and demand set $D$.
7: if $k \in p_d$ then
8: Compute $p'_d \in P_d$ that does not include $k$ and requires the least amount of resources $S'_d$
9: if $p'_d$ does not exist then
10: $c_k \leftarrow c_k + 1$
11: else
12: if $S'_d > S_d$ then
14: $Z \leftarrow \max(Z, \ldots)$
23: end for
24: $\bar{k} \leftarrow$ index $k$ such that $r_k$ is maximal

First, the greedy randomized algorithm (Algorithm 3) is used to compute each new EON. Algorithm 3 starts with an initial graph $G = (N, E_0)$ composed of the set of nodes $N$ of the original EON and the set of links $E_0$ given by the Relative Neighbourhood Graph (RNG) [31]. RNG is defined as follows: nodes $i, j \in N$ are connected by a link if and only if there is no other node $k \in N \setminus \{i, j\}$ such that $l_{ik} \leq l_{ij}$ and $l_{jk} \leq l_{ij}$. Then, the algorithm randomly selects one link $(i_s, j_s)$ at a time until no new link can be added within the remaining budget $B_R$. The probability of each link being selected at each iteration is as follows. Assume $E_s$ is the set of already selected links, $\delta_i$ is the degree of node $i$ in $G = (N, E_s)$ and the remaining budget is $B_R = B - \sum_{(i,j) \in E_s} l_{ij}$. For all node pairs $(i, j) \notin E_s$ such that $l_{ij} \leq B_R$ and at least one of the nodes has the lowest degree in $G$, the probability $P(i, j)$ of selecting link $(i, j)$ is given by (7), while for all other node pairs $(i, j)$, $P(i, j) = 0$.
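The initial RNG topology and the set of node pairs that may receive a non-zero selection probability can be sketched in Python as follows. The function names and the toy coordinates are assumptions, Euclidean distance stands in for the link length $l_{ij}$, and the selection probability itself is deliberately left out of the sketch.

```python
import math
from itertools import combinations

def relative_neighbourhood_graph(nodes, length):
    """Link (i, j) exists iff no other node k has l_ik <= l_ij and l_jk <= l_ij."""
    links = []
    for i, j in combinations(sorted(nodes), 2):
        l_ij = length(i, j)
        if not any(length(i, k) <= l_ij and length(j, k) <= l_ij
                   for k in nodes if k not in (i, j)):
            links.append((i, j))
    return links

def eligible_links(nodes, selected, length, budget):
    """Node pairs not yet selected that fit in the remaining budget B_R and
    have at least one end node with the lowest degree in (N, E_s)."""
    remaining = budget - sum(length(i, j) for i, j in selected)
    degree = {n: 0 for n in nodes}
    for i, j in selected:
        degree[i] += 1
        degree[j] += 1
    lowest = min(degree.values())
    return [(i, j) for i, j in combinations(sorted(nodes), 2)
            if (i, j) not in selected
            and length(i, j) <= remaining
            and (degree[i] == lowest or degree[j] == lowest)]

# Hypothetical planar coordinates for three nodes.
coords = {1: (0.0, 0.0), 2: (2.0, 0.0), 3: (1.0, 0.5)}

def length(i, j):
    return math.dist(coords[i], coords[j])

rng_links = relative_neighbourhood_graph(coords, length)
options = eligible_links(coords, rng_links, length, budget=5.0)
```

In this example node 3 lies between nodes 1 and 2, so the RNG keeps links (1, 3) and (2, 3) and drops (1, 2); that pair then becomes the only eligible addition if the budget allows it.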
Multiple runs of Algorithm 3 generate different EONs. So, in a multi-start greedy randomized algorithm, we run Algorithm 3 multiple times, evaluate the resilience metric of each EON and return the best generated one. The multi-start greedy randomized algorithm is presented in Algorithm 4 with a stopping criterion given by a pre-defined number of iterations. The best EON is defined as $\bar{G}$ with a resilience metric $\bar{z}$. Algorithm 4 starts by initializing $\bar{G} = (N, \emptyset)$ and $\bar{z} = 0$. At each iteration, Algorithm 4 (i) generates a new EON $G'$ (Line 3), (ii) checks if $G'$ is valid (Line 4), (iii) computes its resilience value $z_1$ by the CND based method, (iv) if $z_1$ is better than $\bar{z}$, computes its resilience value $z_2$ by the NDC based method (Algorithm 2) and the resilience value $z$ of $G'$ (Lines 6-8) and (v) if $z$ is better than the resilience value $\bar{z}$ of the current best EON, $\bar{G}$ and $\bar{z}$ are updated accordingly (Lines 9-10).

Algorithm 3 Greedy Randomized Algorithm
Select a link $(i_s, j_s)$ with link probabilities given by (7)

Algorithm 4 Multi-Start Greedy Randomized Algorithm
4: if $G'$ is a valid EON then
5: Compute $z_1$ using the CND based method
6: if $z_1 > \bar{z}$ then
7: Compute $z_2$ using Algorithm 2
8: Compute the resilience value $z$ of $G'$
9: if $z > \bar{z}$ then
10: $\bar{G} \leftarrow G'$, $\bar{z} \leftarrow z$

Two issues require further explanation. First, a randomly generated EON $G'$ is valid (Line 4) if it can support all demands with the RMSA policy. To validate $G'$, we run Algorithm 1 and check if the highest FS index is within the total number of FSs available on each fiber link. Moreover, when the topology of the original EON is 2-connected, we also require $G'$ to be 2-connected (i.e., any single node failure still allows the establishment of a lightpath between any pair of nodes within the transparent reach $T = \max_{m \in M} T_m$).
Second, for each valid EON, Algorithm 4 first computes its resilience value $z_1$ by the CND based method (Line 5). If $z_1 \leq \bar{z}$, the EON cannot be better than the best one found so far and, so, the EON can be discarded without running Algorithm 2. The results show that $z_1$ is computed much faster than $z_2$ and, so, Algorithm 4 is more time efficient in this way.
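The control flow of the multi-start procedure, including the early discard based on $z_1$, can be sketched in Python as below. The generator, validity check and the two evaluation routines are passed in as hypothetical callables, and taking $z = \min(z_1, z_2)$ is an assumption derived from the rule of Section III that the most damaging set of failure nodes defines the resilience value.

```python
def multi_start(generate, is_valid, eval_cnd, eval_ndc, iterations):
    """Return the best EON found and its resilience value.

    generate(): returns a new randomly generated EON;
    is_valid(g): True if g supports all demands (and is 2-connected if required);
    eval_cnd(g), eval_ndc(g): supported demand percentage under the failure
    nodes given by the CND and the NDC based methods, respectively.
    """
    best_graph, best_z = None, 0.0
    for _ in range(iterations):
        g = generate()
        if not is_valid(g):
            continue
        z1 = eval_cnd(g)
        if z1 <= best_z:
            continue          # z <= z1, so g cannot beat the incumbent
        z = min(z1, eval_ndc(g))   # assumed: worst case over the two sets
        if z > best_z:
            best_graph, best_z = g, z
    return best_graph, best_z

# Stub data standing in for real EONs and evaluation routines.
pool = iter(["g1", "g2", "g3", "g4"])
cnd = {"g1": 0.5, "g3": 0.4, "g4": 0.9}
ndc = {"g1": 0.6, "g4": 0.7}
ndc_calls = []

def stub_ndc(g):
    ndc_calls.append(g)
    return ndc[g]

best, z_best = multi_start(lambda: next(pool), lambda g: g != "g2",
                           cnd.get, stub_ndc, iterations=4)
```

In the stub run, `g3` is discarded after the CND based evaluation alone, so the slower NDC based evaluation is invoked only for `g1` and `g4`.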

V. COMPUTATIONAL RESULTS
The computational results are based on 3 network topologies with publicly available information: Germany50 [32], PalmettoNet [33] and Missouri Network Alliance (MissouriNA) [33]. Table I presents their topology characteristics in terms of number of nodes $|N|$ and fiber links $|E|$, minimum ($\delta_{min}$), average ($\bar{\delta}$) and maximum ($\delta_{max}$) node degree and an indication (in column '2-C') of whether the topology is 2-connected. Although the geographical location of nodes is known, the geographical routes of fiber links are not. So, to compute link lengths, we have assumed that links follow the shortest path over the Earth's surface. Table II presents the resulting length characteristics in terms of minimum ($l_{min}$), average ($\bar{l}$), maximum ($l_{max}$) and total ($L$) link length, and diameter, i.e., the highest length among all shortest paths adding $\Delta$ per intermediate node (the length $\Delta$ modeling the degradation suffered by a lightpath on each intermediate node was 60 km).
Concerning fiber capacity, we consider each fiber with a capacity of $|F| = 320$ FSs, which corresponds to a spectral grid of granularity 12.5 GHz. Concerning MF configurations (recall the discussion in the Introduction), realistic transparent reach values are hard to get, not only because new research periodically reports reach gains (new MFs, more efficient signal processing, etc.) but also because equipment vendors do not announce them for their next generation products due to market competition. So, we have considered $|M| = 4$ available MF configurations with number of FSs $n_m$ and transparent reach $T_m$ shown in Table III.
All results were obtained using the optimization software Gurobi Optimizer version 8.0.0, with the programming language MATLAB version 9.4.0.813654 (R2018a), running on a PC with an Intel Core i7-8700, 3.2 GHz and 16 GB RAM. Table IV presents the resilience evaluation results.
Column 'RMSA' presents the runtime (in seconds) of Algorithm 1, showing that the RMSA policy for the regular state takes a considerable share of the total runtime. The resilience evaluation (value and runtime) of the CND and the NDC based methods are presented separately (the best resilience values highlighted in bold). These results show that the CND based method (i.e., computing the critical nodes based on the impact of node failures on the connectivity between the other nodes) is the best heuristic for larger values of $c$, while the NDC based method (i.e., computing the critical nodes based on the impact of the node failures on the supported demand between the other nodes) is the best heuristic for the smallest values of $c$ and only in the Germany50 instances. Concerning runtimes, as observed in Section IV, the CND based method is computed more quickly than the NDC based method.
Tables V and VI present the network design results (running Algorithm 4 with 1000 iterations). Table V shows the performance of the generation and validation part of the multi-start greedy randomized algorithm, showing the number of valid EONs (out of the 1000) and the runtime spent in the generation, topology validation and RMSA validation. Once again, the RMSA is by far the most time-consuming part. Moreover, the more demands are supported, the fewer generated EONs are valid (a behavior easily observed in the Germany50 instances). Table VI presents the results of the evaluation of the valid EONs (resilience values of the original cases are repeated in column 'Original' for comparison). The most important conclusion is that the resilience value of the best EONs is always much higher than that of the original EONs. A second conclusion is that the resilience evaluation also takes a significant amount of runtime. Note that the total runtime of the network design task is given by the sum of the 3 time values of Table V and the time value of Table VI. So, overall, the method takes several hours to run, which is still reasonable for a network design task.
Figure 1 presents the original topologies and the best topologies obtained for $c = 3$ critical nodes and for the instances with more demands of each network. To understand the differences, links of the best topology not in the original topology are highlighted in dashed blue and the critical nodes of each case are represented with red squares. The number of links highlighted in blue clearly shows that the resilience improvement of the network design solutions is obtained with topologies which are very different from the original ones.
Another interesting aspect is the comparison of the node degree distributions between the original topologies and the topologies of the best network design solutions. Figure 2 shows these distributions for the 8 instances with the best EONs obtained for $c = 3$ critical nodes (original topologies in blue and best topologies in green). In the best solutions, there is a decrease in the number of nodes with the lowest and highest degrees and an increase in the number of nodes with degrees closer to the average. This observation also holds for the other values of $c$, showing that EONs resilient to multiple node failures tend to have more homogeneous node degrees.

VI. CONCLUSIONS
In this work, we have considered the evaluation and network design of EONs resilient to multiple node failures. First, we have addressed the resilience of EONs to multiple node failures by identifying the critical nodes whose simultaneous failure maximally reduces the demand percentage that is supported by the network. Then, for the same estimated demands, the same RMSA policy and a fiber budget equal to the total fiber length of an existing EON, we have addressed the design of a new EON maximizing the resilience metric imposed by its critical nodes. For both tasks, we have proposed heuristic methods that were evaluated on known network topologies.
The results showed that the network design solutions are much more resilient to multiple node failures than the original networks. The improvements are obtained with topologies with more homogeneous node degrees which are very different from the original ones. In computing terms, a key aspect is the RMSA, which was the most computationally demanding part of the proposed methods. Note that the adopted RMSA policy assumes a restoration mechanism where disrupted demands are reassigned as much as possible with new lightpaths supporting the same line rate. A topic that deserves further study is to consider bandwidth-squeezed protection/restoration mechanisms (exploiting the advanced flexibility provided by sliceable bandwidth-variable transponders) [6], where the line rate supported by lightpaths is dynamically reduced so that more lightpaths can be accommodated in the case of large-scale failures.