*** : comments regarding open discussion points for the reader
@@@ : comments regarding open discussion points for the author.
      Kick me if there are still too many of these by the time you
      read this document!


The Integration Layer
=====================

Contains:  @@@ To be fixed up when this is a bit more complete.

* Concepts
* API
* Mechanism
* Typical Operations


Concepts
========

* Cluster Controller:

  The single node which is responsible for coordinating a cluster
  transition.  The first task during cluster transition is election of
  such a Cluster Controller.

* Cluster Map:

  A "Cluster Map" is the picture, maintained at any node in a cluster,
  of what the cluster appears to look like from the point of view of
  that node.

  Every cluster has a Cluster Map which indicates the members of that
  cluster.  This is the "Current Cluster Map".

  During a Cluster transition, we must also maintain a "Proposed
  Cluster Map", which describes the state we think we are moving to.
  This map is dynamic: a cluster transition may involve resetting of
  communication links all over the cluster, resulting in many nodes
  temporarily dropping and (hopefully) resuming communication with
  their neighbours, or multiple new nodes finding and rejoining our
  cluster.  The Proposed Cluster Map is updated on each such
  communications event.

  Once the Proposed Cluster Map has stabilised, the Cluster Controller
  makes a Commit Cluster Map out of it and begins a 2-phase commit of
  that new state around the cluster.


API
===

No, there's nothing here yet.


Overview
========

The Integration Layer is supposed to do absolutely nothing while the
cluster is running normally.  Only when circumstances change is it
invoked.  The Events which affect the Integration Layer are usually
generated by the Link Layer.  They include:

++ New Link

   This event merely announces the discovery of a new node on the
   network, and the establishment of a Link to that node.  No link
   state is implicit in this message: we expect a Link State event to
   follow, and no cluster transition is invoked until we get that Link
   State event.

++ Link State

   A Link's state (Down, Reset, Degraded, or Up) has changed.  Oops:
   time to begin a cluster transition.  If a cluster transition is
   already in progress, this requires us to update the Proposed
   Cluster Map and adjust the transition state if necessary.  *Every*
   Link State event causes a cluster transition to be initiated on the
   local node.

++ Operator Intervention

   We must support external commands to modify the cluster map.  For
   example, the operator may request that a specific node be taken out
   of the cluster for maintenance; node shutdown must also perform a
   clean cluster exit.


Operation Requirements
======================

Before launching into a design discussion, we need to be clear about
some of the properties required of the cluster integration layer.
This section will act as a rationale for some of the design decisions
below.

Remember that all cluster transitions merely involve merging or
partitioning of existing clusters.  This property is a result of the
design decision to consider both newly restarted hosts and hosts
evicted from an existing cluster to be single-node clusters in their
own right: the construction of a new, larger cluster of such hosts is
merely a sub-case of the merging together of two clusters, rather than
being a special case in its own right.
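Throughout the rest of this document it helps to have a concrete
picture of the state each node keeps.  The following is a minimal
sketch of the maps and Link Layer events described above; every name
here is illustrative only, not part of the design:

from dataclasses import dataclass, field
from enum import Enum, auto

class LinkState(Enum):
    DOWN = auto()
    RESET = auto()
    DEGRADED = auto()
    UP = auto()

class Event(Enum):
    NEW_LINK = auto()              # link discovered; wait for a Link State event
    LINK_STATE = auto()            # always triggers (or updates) a transition
    OPERATOR_INTERVENTION = auto()

@dataclass
class IntegrationState:
    # The three maps discussed above, as plain sets of node names.
    current_map: set = field(default_factory=set)    # committed membership
    proposed_map: set = field(default_factory=set)   # updated on every comms event
    commit_map: set = field(default_factory=set)     # built once proposed_map stabilises
    in_transition: bool = False

    def on_event(self, event: Event, node: str, state: LinkState | None = None):
        if event is Event.NEW_LINK:
            return                 # no transition until the Link State event arrives
        # Every Link State event (and every operator command) starts or
        # updates a cluster transition on the local node.
        self.in_transition = True
        if event is Event.LINK_STATE:
            if state in (LinkState.UP, LinkState.DEGRADED):
                self.proposed_map.add(node)
            else:
                self.proposed_map.discard(node)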
* Partially connected clusters

  This happens.  We need to deal with the situation where some members
  of a cluster can see some, but not all, of the others.  This can
  result from a fault occurring in a working cluster.  We also need to
  deal with the situation where we are proposing the merging of two or
  more smaller clusters, but where not all of the merging machines can
  see each other.

  We can define the Cluster Superset at any node to be the set of all
  nodes which are connected to that node, or which are in the Cluster
  Superset of any connected nodes.  The Cluster Superset is the
  transitive closure of the direct point-to-point node connectivity
  relation.

  We have several requirements resulting from the need to deal with
  partially connected Supersets.  First of all, the cluster
  integration protocol needs, at some level, to propagate cluster
  connectivity information over all the nodes in the Superset.

  When we merge a set of machines into a larger cluster, we need
  somebody to select the largest fully-connected subgraph of the
  Cluster Superset (or, rather, the subgraph with the most votes) to
  form a new cluster.

  We need to ensure that this configuration is stable.  Any attempt by
  one of the rejected nodes to join the new cluster must be rejected.
  We can ensure this by automatically failing any merge between
  clusters if any of the nodes in one of the merging clusters is
  unable to see any of the nodes in any other cluster in the merge.

* Operation Requirements: Cluster merge

  A cluster merge is the process which occurs whenever any node
  detects the existence of a neighbouring node which has the same
  cluster name but which is not currently a member of its cluster.
  The cluster merge simply combines the two newly connected clusters
  together into one larger cluster.  Given that all nodes start off
  operating as single-node clusters, this mechanism will deal with
  individual nodes joining an existing cluster as well as with two
  previously-partitioned clusters reestablishing connection with each
  other.

  As we just mentioned, it is important for a cluster merge to result
  in a new cluster which actually works.  We require that clusters
  must be fully connected at all times, so a merge must preserve this
  property.

  We can simplify our operation enormously by ensuring that we only
  ever try to merge working clusters.  If we do this, then by the time
  we come to do a merge, the merging clusters will already have
  elected their own Cluster Controller nodes (CCs), so the merge
  becomes a simple matter of one CC handing control of its cluster's
  nodes to another CC.  The CCs can exchange information about each
  other's clusters, and within each merging cluster the new proposed
  cluster membership can be propagated by the CC so that pairwise
  connections between all the newly discovered nodes can be
  established.  If any of these connections fail, then the CCs
  concerned have to abort the merge, and must not make another attempt
  at a merge until some other cluster transition event occurs which
  makes it useful to retry the merge.

  What happens, then, if we attempt a cluster merge and find that
  there are just a few pairwise connections between the clusters which
  cannot be established?  We reject the merge, but it doesn't need to
  stop there.  By breaking up one or more of the smaller clusters to
  evict some of the nodes which are failing to establish cross
  connections, we might be able to establish a new larger cluster.
  However, at the same time we do not want to risk breaking up
  existing clusters unnecessarily just to attempt such a modified
  merge.
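To make the "best fully-connected subgraph" requirement above
concrete, here is a brute-force sketch of computing the Cluster
Superset and then choosing the fully-connected subset with the most
votes.  All names, the vote table and the idea of doing the whole
calculation on one node are illustrative only; clusters are small
enough that brute force is plausible, but the real protocol
distributes this work:

from itertools import combinations

def cluster_superset(start, links):
    """Transitive closure of point-to-point connectivity from `start`.
    `links` maps each node to the set of nodes it can currently see."""
    seen, todo = set(), [start]
    while todo:
        node = todo.pop()
        if node not in seen:
            seen.add(node)
            todo.extend(links.get(node, ()))
    return seen

def fully_connected(nodes, links):
    """True if every pair of nodes in `nodes` can see each other."""
    return all(b in links.get(a, ()) and a in links.get(b, ())
               for a, b in combinations(nodes, 2))

def best_subgraph(superset, links, votes):
    """Fully-connected subset with the most votes (then the most nodes)."""
    best, best_key = frozenset(), (0, 0)
    for size in range(len(superset), 0, -1):
        for subset in combinations(sorted(superset), size):
            if fully_connected(subset, links):
                key = (sum(votes[n] for n in subset), len(subset))
                if key > best_key:
                    best, best_key = frozenset(subset), key
    return best

# Example: C has lost its link to A, so the best two-node subset wins.
links = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
votes = {"A": 1, "B": 1, "C": 1}
print(best_subgraph(cluster_superset("A", links), links, votes))  # frozenset({'A', 'B'})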
  We can introduce a new rule which helps here: after a failed cluster
  merge, the cluster with the most votes never, ever tries to break
  up.  However, we can allow smaller clusters to fragment voluntarily.
  If a CC decides not to proceed with a merge with a higher-voting
  cluster, it can look at which inter-cluster connections succeeded
  and which failed, and can try to identify exactly which of its own
  members are in fact able to see all of the other cluster's members.
  If it finds that some of its nodes would in fact be able to cross
  over to the other cluster, it can evict those members from the local
  cluster.  Those evicted cluster members will immediately attempt a
  cluster merge themselves, and in that process should always look for
  the largest neighbouring cluster to join, which in this case will be
  the larger of the two clusters which tried to merge in the first
  place.

  The result of this will be that we can donate nodes from one cluster
  to another to grow the larger of the clusters, while preventing
  partially-connected nodes from ever disrupting the membership of the
  larger cluster.  The cluster breakup mechanism will therefore never
  break quorum.

* Operation Requirements: Cluster breakup and communication faults

  Cluster breakup is potentially a little more complex, since when a
  disconnection event occurs in a working cluster, we do not yet know
  the extent of the damage: perhaps one single node has failed, or
  perhaps _all_ of the other nodes have become "failed" because our
  network connection to them has died.

  One thing to note straight away is that a single channel error event
  must not be sufficient to trigger cluster disintegration: if only
  one channel in a point-to-point link (ie. only one route of multiple
  redundant routes between a pair of hosts) has died, we must retain
  the cluster.  We still want to trigger a cluster transition of some
  form, as the loss of any channel always causes a complete link reset
  of the communications between those two nodes, and we need to
  recover the context of any messages which were in transit between
  the nodes at the time.  However, if the link does recover
  successfully (say, it fails over to a backup ethernet wire), we can
  record the resulting cluster transition as a null event: we still
  need to perform some recovery to retry any messages which were in
  transit when the error occurred, but we do not need to recover any
  cluster-wide global state in this case.

  *** DISCUSSION POINT: ***

  Does the above point make sense?  We _do_ need to perform some
  recovery in this case, so as far as the cluster integration layer is
  concerned this section is entirely accurate.  However, we need to
  think about the visibility of such null transitions to higher layers
  of the cluster software.

  We can provide reliable data pipes between processes in the cluster
  at a higher level using packet sequence numbers which recover
  themselves over a cluster transition.  Think about two communicating
  high-level processes --- a web server and a CGI server, say --- in
  the case where an unrelated node drops out of the cluster.  Those
  two processes should not be affected by the cluster transition,
  other than at most a temporary stall in the servicing of cluster IO
  and DLM requests.  The IO pipe's internal recovery from channel
  failover can be transparent to the applications.

  It is already planned for cluster transitions to be given a brief
  synopsis of whether any significant node changes have occurred
  during the transition, so by just making the rule that all
  communication errors produce a full cluster transition, we are not
  forcing all of the rest of the cluster software to do full recovery.
  On seeing the flag "no cluster membership change" in the transition
  block, higher layers can simply avoid doing persistent state
  recovery and limit themselves to communications recovery.
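A sketch of what that transition synopsis might look like to the
higher layers, under the rule just described.  The flag name and the
recovery hooks are purely illustrative assumptions:

from dataclasses import dataclass

@dataclass
class TransitionSynopsis:
    """Summary handed to higher layers at the end of a cluster transition."""
    membership_changed: bool          # False for a "null" transition (eg. channel failover)
    lost_nodes: frozenset = frozenset()
    new_nodes: frozenset = frozenset()

def recover(synopsis: TransitionSynopsis):
    # Communications recovery (resending in-transit messages) is always
    # needed, because every transition begins with a link reset.
    replay_in_transit_messages()
    if synopsis.membership_changed:
        # Only a real membership change forces recovery of persistent,
        # cluster-wide state (locks, service placement, and so on).
        recover_persistent_state(synopsis.lost_nodes, synopsis.new_nodes)

def replay_in_transit_messages():
    pass  # placeholder: retry messages queued at the time of the link reset

def recover_persistent_state(lost, new):
    pass  # placeholder: full recovery path for higher layers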
General operation
=================

OK, I admit it.  It would be inaccurate to describe the Integration
Layer's operation, as described in the rest of this document, as being
purely a consequence of the requirements above.  Some of the
requirements already stated are actually requirements which arise from
design decisions, not functional requirements.  Please bear with me on
this --- I *think* that where I've described a mechanism, rather than
a rationale, in the above text, I have good cause, in that the
mechanism represents a broad design decision (rather than a mere point
of detail) which simplifies the entire layer's design.

I'll now try to explain the broad principles of operation I envisage
for the Integration Layer, how it all works together and why the
overall plan makes sense.

You might have noticed that I have simply assumed, in the above, that
every single cluster transition can be expressed as a combination of
cluster breakups and cluster merges.  This really, really simplifies
things enormously.

For one, it means that we never, ever have to deal with individual
nodes in this model: every isolated node is nothing more than a
one-node cluster with an already-elected cluster controller.
Cooperation between different, already-established groups of nodes
therefore reduces to a problem of point-to-point communications
between individual CC nodes.  Such cooperation between
currently-unclustered nodes is *always* a cluster merge operation,
never anything else (except that, possibly, we may want to add
diagnostic and monitoring code on top: that does not invalidate the
Integration mechanism as presented here).

A second major advantage of the merge/breakup concept is that it deals
very cleanly with errors which occur during a cluster transition.  If
a cluster detects an internal error (loss of a node or link) while in
the process of completing a cluster transition, it immediately drops
everything and sorts out its internal problems first.

Thirdly, it also simplifies the management of complex faults in a
cluster where complete any-to-any connectivity is lost.  Such faults
are dealt with in the breakup phase of the transition, and in that
phase we always know _exactly_ who was in the old cluster, and we have
to deal with those nodes alone: we never have to deal with any nodes
outside the cluster until the cluster's own internal state has been
reestablished.

It is important to refer back here to the general design principle of
hierarchical phases of operation.  In this merge/breakup model,
cluster breakup is a more fundamental operation than cluster merge.
Remember that the hierarchical phases principle means that any error
takes us to a previous phase, and forward progress is only ever
achieved by consensus.  By dealing with the cluster breakup phase
prior to the cluster merge phase, we ensure that merging is always a
cooperative operation between working clusters: any internal cluster
error is dealt with before we ever get as far as merging clusters.
If, during a merge, one of the merging clusters detects the loss of an
internal node, that cluster simply aborts the merge and reverts to the
prior phase, reconfiguring itself to deal with its internal error
before bothering with the rest of the world.  The fact that we start
rebuilding this state by reentering the Abort phase must ensure that
we propagate the error to the rest of the nodes we were merging with,
so that they themselves can detect that the merge phase has been
aborted.

The expected consequence of this is that the cluster which detected
the internal loss of a node will quietly resolve its internal state
before trying to join anybody else.  Either it will complete that
after the rest of the Cluster Superset finishes merging (in which case
it will try to trigger another merge), or before (in which case the
in-progress merge will be aborted).  Either way, the previous, aborted
merge is automatically resumed only after the component clusters have
dealt with the node loss.  Cluster merging simply does not have to
deal with node loss: it only has to deal with a general "oops,
something went wrong" error which always results in a regression to
the abort phase.


Mechanism
=========

A Cluster Transition involves progress through a set of distinct
phases.  The execution of each phase may be very rapid if we happen to
know in advance what sort of transition is occurring (eg. controlled
eviction of a single node does not have to perform a complete CC
reelection if the remaining nodes can all still talk to each other
afterwards, and the evicted node was not previously the CC).

This section will contain a complete description of each of these
phases.  The descriptions will begin with the objectives of each
phase: what the pre-conditions and post-conditions are.  The
pre-conditions are sufficient to describe those things which the phase
does NOT have to achieve, but can simply assume have already been
achieved.  The post-conditions describe the objectives of the phase by
establishing the final results that the phase is required to achieve.

++ Abort Phase: tell everybody that something terrible has happened.

   Pre-conditions:

   None at all: we can't assume anything here, since this is the state
   we enter on getting an error of any description, either a recursive
   error during the integration process or a fault detected during
   normal running of the cluster.

   Post-conditions:

   All cluster nodes which are still communicating with us and which
   are in the current Commit Cluster, as well as all nodes which we
   may have been trying to merge with since beginning this cluster
   transition, have also returned to Abort Phase.  The connectivity of
   the Commit Cluster is moderately stable.

   Mechanism:

   The only inputs to the Abort phase are the previous Committed
   Cluster Map, which exists at all times and which is not actually
   modified by the Integration Layer transition mechanism at all (at
   least, not until the entire transition process has completed and a
   new Commit Map has been established); and the current state of each
   Link to nodes with the same cluster name.

   When no Link state changes concerning neighbours in the Commit Map
   have occurred within a certain period (the Cluster Settling
   interval), we can enter Discovery Phase and start sending Discovery
   messages on all Links.  If a neighbour sends us a Discovery message
   before our Settling interval is complete, we buffer it for use once
   we have settled and entered Discovery phase ourselves.  We must not
   actually respond to that message until we have Settled.
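The hierarchical-phases principle and the Settling rule just described
can be sketched as a small state machine.  The phase names follow the
document; the timer value, message buffering and function names are
illustrative assumptions:

import time
from enum import IntEnum

class Phase(IntEnum):
    # Ordered: any error regresses us to ABORT; forward progress is by consensus.
    ABORT = 0
    DISCOVERY = 1
    EVICTION = 2
    PROPAGATION = 3
    PROPOSAL = 4
    COMMIT = 5

SETTLING_INTERVAL = 2.0   # seconds without Link state changes (assumed value)

class AbortPhase:
    def __init__(self):
        self.phase = Phase.ABORT
        self.last_link_change = time.monotonic()
        self.buffered_discovery = []          # Discovery messages received too early

    def on_link_state_change(self):
        # Any Link state change, in any phase, regresses us to Abort.
        self.phase = Phase.ABORT
        self.last_link_change = time.monotonic()

    def on_discovery_message(self, msg):
        if self.phase is Phase.ABORT:
            self.buffered_discovery.append(msg)   # do not respond until Settled
        else:
            process_discovery(msg)

    def poll(self):
        # Once the Commit Map neighbours have been quiet for the Settling
        # interval, enter Discovery and replay anything we buffered.
        settled = time.monotonic() - self.last_link_change >= SETTLING_INTERVAL
        if self.phase is Phase.ABORT and settled:
            self.phase = Phase.DISCOVERY
            for msg in self.buffered_discovery:
                process_discovery(msg)
            self.buffered_discovery.clear()

def process_discovery(msg):
    pass   # placeholder for the Discovery phase handling described below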
++ Discovery phases: gather information about the new state of the
   Cluster Superset.

   Discussion:

   The first process we enter in a cluster transition is the breakup
   phase: the eviction of any unreachable nodes from the old Cluster,
   plus potentially the eviction of further nodes if the remaining
   nodes are no longer all fully connected to each other.  To do this,
   we need somebody to actually decide which "best" fully-connected
   subset of the old Cluster is still viable.  We will call the node
   which makes this decision the Cluster Elector for the breakup
   phase.

   We also need to remember that when we lose contact with a node in
   the cluster, that node may decide to evict itself from our cluster
   and form a new cluster on its own behalf, but communications with
   that node may be reestablished before we have finished our own
   cluster transition.  This is dealt with by observing that the lost
   and recovered node must have a changed Cluster ID, and preventing
   nodes with the wrong Cluster ID from participating actively in the
   Breakup phase.  The discovery phase therefore also needs to
   establish, for each node, whether the neighbouring nodes still have
   the same Cluster ID as we do.

   In this situation, the lost node will just have to wait until all
   of the nodes which have still got the old Cluster ID have finished
   their eviction phase and have entered Cluster Merge phase (in other
   words, the whole mess is dealt with before we leave the cluster
   transition altogether, but we don't have to deal with it just yet:
   the ordering of breakup phase before merge phase naturally deals
   with this sort of fault).  If no nodes survive with that cluster
   ID, then that's fine too --- the resulting smaller clusters will
   enter the merge phase eventually.

   We define the term Related Nodes to refer to the set of neighbours
   of a given node which (a) are still connected to that node, and (b)
   have the same cluster ID (which necessarily implies that they were
   part of the same Commit Set as us previously).  The Related Node
   Superset of the old Commit Cluster Map is the transitive closure of
   the Related Node function, and includes all of the old nodes which
   are connected together over any multi-hop route of Related nodes.

   Cluster transitions can take arbitrary amounts of time, due to the
   fact that the Abort phase will stall for as long as a communication
   Link is unstable (in other words, faults which transition rapidly
   are bad news --- we need to think about mechanisms by which to
   dampen the effects of such nodes, but such a mechanism probably
   involves voluntary exit of a node experiencing such problems and
   doesn't directly change the Integration Layer's behaviour at all).
   As a result of this, we can imagine a not-quite-partition of the
   cluster, in which the cluster partitions but then a single node
   which acts as a bridge between the partitions comes back.

   If that node has not yet completed its own cluster transition
   (evicting itself from the previous cluster in the process), then
   all is fine: that node is still Related to some others, and the
   Abort which follows the link state change when the node rejoins its
   neighbours will eventually lead to the two partitions joining the
   same Related Node Superset.  If the "bridge" node has transitioned,
   on the other hand, we have two Related Node Supersets in the new
   cluster which are theoretically able to merge but which are divided
   by an unrelated node.  At this point we are just doing breakup, not
   merge, so we have to deal with the Related Supersets as quite
   distinct entities.  Therefore, we do _not_ want to pass discovery
   information over the bridge node: discovery information must only
   be passed to Related Nodes, not beyond.
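A sketch of the Related Node relation and its transitive closure,
including the rule that propagation stops at nodes whose Cluster ID
has changed (the "bridge" case above).  The data layout is an
illustrative assumption:

def related_nodes(node, links, cluster_id):
    """Neighbours still connected to `node` which carry the same Cluster ID.
    `links[node]` is the set of currently connected neighbours;
    `cluster_id[node]` is the Cluster ID each node reports."""
    return {n for n in links.get(node, ())
            if cluster_id.get(n) == cluster_id[node]}

def related_node_superset(start, links, cluster_id):
    """Transitive closure of the Related Node relation from `start`.
    A rejoined node with a new Cluster ID is never crossed, so it cannot
    act as a bridge between two Related Node Supersets."""
    seen, todo = set(), [start]
    while todo:
        node = todo.pop()
        if node not in seen:
            seen.add(node)
            todo.extend(related_nodes(node, links, cluster_id))
    return seen

# A---B---X---C---D, where X has already left and reformed on its own:
links = {"A": {"B"}, "B": {"A", "X"}, "X": {"B", "C"}, "C": {"X", "D"}, "D": {"C"}}
cluster_id = {"A": 7, "B": 7, "X": 99, "C": 7, "D": 7}
print(related_node_superset("A", links, cluster_id))   # {'A', 'B'}: X blocks C and D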
   Mechanism:

   For the Cluster Elector, we choose the node with the greatest
   "Value Function", where the Value Function of a Node is some
   function ordered by:

   * -ve number of degraded links to nodes which are still in the
     node's Commit Map.  We want to select a value function which
     minimises the amount of degraded link traffic in the common
     failure modes.  Do we do this best by maximising the number of
     non-degraded links active at the Elector node, or minimising the
     number of degraded links?  Probably the latter, so the value
     function (for now) always prefers nodes with few degraded links.
     Then,

   * CC: the existing CC in the Related Node Superset is always
     preferred as Cluster Elector if it is still reachable via any
     route through that Superset (unless it has many degraded links
     due to recent faults).  Failing that,

   * Total number of votes present on that node, or on nodes Related
     to that node; then, in case of a tie, by

   * Number of other nodes Related to that node; then,

   * "Metric" of that node (a static configuration value intended to
     represent available CPU power); then,

   * Lexicographic order of the node's unique nodename.

   The important point about this function is that it is unambiguous.
   Any node can generate a Value for itself, and pass that to other
   nodes; given that information, any two nodes will always agree on
   which node has the higher Value.  (A tuple-comparison sketch of
   this ordering appears after the phase's pre- and post-conditions
   below.)

   *** Note that by using the degraded link count as a high-priority
   sort key and the node Metric as a low order key, we end up
   preferring to minimise degraded link traffic rather than placing
   the Elector on the most CPU-capable node in the cluster.  Comments?
   Maximising the connectivity of the resulting cluster in the
   Eviction phase may require a good CPU --- should we optimise for
   that rather than optimising for communications at this point?

   What we now need is a mechanism by which all connectivity
   information in the entire Cluster can be propagated to a single
   node which all nodes agree is the best one to be Cluster Elector.
   The mechanism needs to be able to deal with situations like this:

       Node:   A-------B-------C-------D-------E
       Value:  4       1       3       2       5

   where we have very little connectivity in the cluster: maybe only
   the point-to-point serial cables between the nodes have survived,
   and each node can see only its immediate neighbours.  We cannot
   simply propagate connectivity information to our neighbour with the
   best "value", since nodes B and D in this example will end up
   sending their information in opposite directions.

   So, first of all we will decide who is to be the Elector by
   propagating Value information between nodes.  Only once one node
   decides it has won the Election will we propagate the full
   connectivity information, and the Elector will guide that phase by
   explicitly telling the other nodes what decisions it has come to.

++ Discovery Phase.

   Pre-conditions:

   All of our neighbours within the Commit Map have either entered
   abort state or are unreachable: we have Reset our Links to them and
   have waited long enough to be sure that either they have already
   completed the Reset negotiation or they aren't going to.

   Post-conditions:

   Everybody in the local Cluster Superset has propagated its local
   view of cluster connectivity (ie. which other nodes it can still
   see) to a new node which will become the "Cluster Elector", which
   controls the move to the next phase.
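As promised above, the Value Function ordering maps naturally onto
lexicographic tuple comparison.  The field names are illustrative
assumptions; the point is only that every node can compute its own
tuple and any two nodes will order a pair of tuples identically:

from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class ElectionValue:
    # Fields are compared in declaration order, reproducing the ordering in
    # the text: fewer degraded links first, then incumbent CC, then votes,
    # then Related-node count, then Metric, then nodename.
    neg_degraded_links: int     # -(number of degraded links within the Commit Map)
    is_current_cc: bool         # True sorts above False
    related_votes: int          # votes on this node and on nodes Related to it
    related_count: int          # number of other Related nodes
    metric: int                 # static CPU-capacity configuration value
    name_key: tuple             # see below

def name_sort_key(name: str) -> tuple:
    # The text does not say which direction of the lexicographic tie-break
    # wins; here we assume the smaller name does, and invert each byte so
    # that "greatest ElectionValue wins" stays uniform.
    return tuple(-b for b in name.encode())

a = ElectionValue(0, False, 3, 2, 10, name_sort_key("alpha"))
b = ElectionValue(-1, True, 5, 4, 50, name_sort_key("beta"))
print(max(a, b))   # a wins: it has no degraded links, and that key dominates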
   Mechanism:

   In this phase we want to propagate the Election Value function
   between nodes, and we want to propagate connectivity maps for the
   entire cluster back to the Elector.  We choose the following
   algorithm:

   Each node starts by sending its own value function to each of its
   Related neighbours.  Each subsequent message between nodes includes
   a Value Function plus a non-empty list of connectivity maps.  The
   Value Function of any message is the maximum of the Value Functions
   of any nodes whose connectivity is listed in the map.  (The
   connectivity of a node is just a complete list of all other nodes
   still connected to that node, plus the Link state: DEGRADED or UP.)

   When a node has received the initial Value Functions of each of its
   neighbours, it can start to propagate connectivity maps.
   Propagation works from the node with the lowest Value upwards: the
   node with the lowest Value begins by sending its connectivity map
   to its neighbour with the highest Value.  When a node receives such
   messages, it updates its local database of connectivity maps to
   maintain both (a) the total set of all nodes whose connectivity
   information is known, and (b) a list of connectivity maps which are
   known to each neighbour.  If a node discovers, in this manner, a
   new node which has a higher value function than any

   @@@ Described more fully in discovery.txt

++ Eviction Phase

   Pre-conditions:

   One node --- the Elector node --- has a full map of the cluster
   connectivity and knows that it is the Elector.  All other reachable
   Related nodes are still blocked somewhere in the Discovery phase.

   Post-conditions:

   We have established a Cluster Controller within the current Commit
   Cluster to guide us through the rest of the cluster transition, or
   we have left the cluster (and reformed a new cluster which will try
   to merge if possible with any neighbouring clusters).

   The transition from Discovery phase to Eviction Phase is guided by
   the unique Elector node.  The Elector is responsible for making a
   unilateral decision about which of the current members of the
   cluster should survive the transition and which get evicted.  This
   phase also decides which node is going to act as Cluster Controller
   ("CC") for the remainder of the transition.

   Mechanism:

   The Elector node makes a decision about the new shape of the
   cluster.  It must calculate the fully-connected subset of all
   surviving Related nodes which maximises:

   * Total Votes; then,
   * Total Metric (CPU capacity); then,
   * Total number of nodes.

   This subset of nodes becomes the New Cluster Map.  The Elector then
   determines, from that map, the maximal subset of nodes which is
   fully connected by non-degraded links only.  This further subset is
   the "Functioning Subset" of the cluster, and the nodes in that
   subset are "Functioning" nodes.  Other nodes are marked as
   "Degraded" nodes.

   Discussion: Degraded Nodes

   It will be up to individual nodes to decide which of their internal
   services to start, and which to suspend, on a transition between
   Functioning and Degraded mode.

   The most important reason for maintaining Degraded nodes in the
   cluster is to sustain Quorum: any votes belonging to a Degraded
   node still count towards the Quorum in a cluster, and can therefore
   mean the difference between life and death of critical cluster
   services on a cluster failure.  Consider a cluster of 5 voting
   nodes: a communications fault may remove high-quality access to
   three of those nodes, but we may still have degraded links from the
   remaining two nodes to one of the failed nodes.  In that situation,
   we still have 3 votes out of 5 --- enough, barely, for quorum ---
   as long as that degraded node remains in the cluster.  However, if
   we remove it from the cluster (even cleanly, returning its votes),
   we are left with only 2 votes out of 4, and we lose Quorum.
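The quorum arithmetic in that example is worth making explicit.  A
minimal sketch, assuming the usual strict-majority rule (the exact
quorum formula is not spelled out in this document, so treat this as
an assumption):

def has_quorum(present_votes: int, expected_votes: int) -> bool:
    # Assumed rule: strictly more than half of the expected votes must be present.
    return 2 * present_votes > expected_votes

# Five voting nodes, one vote each.  Three lose their high-quality links,
# but one of them is still reachable over a degraded link.
print(has_quorum(present_votes=3, expected_votes=5))   # True:  3/5, quorum held

# Evict the degraded node cleanly (its vote is returned, so expected drops to 4).
print(has_quorum(present_votes=2, expected_votes=4))   # False: 2/4, quorum lost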
   However, in most cases we hope that loss of high-quality
   communications to a node does not lose _all_ degraded
   communications: that's why we have backup serial interconnects, for
   example.  In the above example, if the 2 evicted nodes lose direct
   contact with the surviving cluster, they can still exit from the
   cluster cleanly as long as there is some multi-hop path between the
   nodes.  In this case, Quorum management must be involved in the
   cluster transition.  For good quorum maintenance, it is absolutely
   critical that we perform that eviction in a clean manner which
   allows the expected votes in the surviving cluster to be adjusted
   if necessary.

   Returning to the Eviction mechanism: the Elector then adds to that
   set of nodes any other Related nodes which can see each node in the
   Functioning Cluster, but which have Degraded connections to one or
   more Functioning Nodes.  These nodes with Degraded connections can
   still belong to the new cluster, but they are marked as Degraded,
   not Functioning, members.  They are still present for sysadmin and
   vote tiebreaking purposes.

   Error recovery:

   The Eviction phase is not quite as simple as previous phases with
   respect to error recovery, but it is not too complex to understand
   as long as we bear in mind exactly what sort of recovery we need to
   deal with.  The reason that the Eviction phase introduces new
   complications is that up until this phase, we have not started
   changing the membership of the cluster (as defined by the Cluster
   ID at each neighbouring node).  During Eviction, we are for the
   first time performing a controlled modification of the cluster
   membership.

   Fortunately, we can still reuse the existing error recovery
   mechanisms.  If we lose our Link to a Related node during this
   phase, we just restart the cluster transition by going back to
   Abort phase.  This is fine.  It doesn't matter exactly who in the
   cluster has completed Eviction and who has not.  Any nodes whose
   eviction has completed will now have a new Cluster ID; nodes which
   have not will still have the old Cluster ID.  The Discovery phase
   explicitly limits itself to discovering only Related nodes, so
   after any error, we already have the required recovery mechanism in
   place to work out who has and who has not yet completed the
   Eviction process.

++ Propagation Phase

   This phase propagates all information about nodes still reachable
   in the cluster, and inter-node communication paths still working,
   to the CC node.

++ Proposal Phase

   The CC generates a new Commit Cluster Map, and proposes that map to
   every node.

++ Commit Phase

   The CC completes the transition by promoting the Commit Cluster Map
   to be the Current Cluster Map, and requests all other nodes to do
   the same.  The cluster transition is seen by each node to be
   complete once this occurs: the CC sees the transition first and
   other nodes see it once the Commit message arrives.

Note that within each of these phases, we may have to pass information
to nodes which are not necessarily connected directly to the CC.  If a
cluster fault has ended up producing a partially-connected set of
nodes, then at least CC election information and cluster eviction
information must be passed, preferably along with a new cluster map
indicating to the rejected node just which machines it could not see
in the cluster (for diagnostic purposes, not for correctness: you
cannot base your local correctness on the state of an invisible
node!).
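A sketch of the Proposal and Commit phases from the CC's point of
view, as a conventional two-phase commit.  Message names, the `send`
callback and the exception are assumptions; in particular, messages
may need to be relayed rather than sent directly, as noted above:

def run_proposal_and_commit(cc, new_commit_map, send, current_map):
    """Two-phase commit of the new Commit Cluster Map, driven by the CC.
    `send(node, message)` returns the node's reply (or raises on link error)."""
    members = [n for n in new_commit_map if n != cc]

    # Proposal phase: every member must acknowledge the proposed map.
    for node in members:
        reply = send(node, ("PROPOSE", frozenset(new_commit_map)))
        if reply != "ACK":
            raise TransitionAborted(node)   # regress to Abort phase

    # Commit phase: the CC switches first, then tells everybody else.
    current_map.clear()
    current_map.update(new_commit_map)
    for node in members:
        send(node, ("COMMIT", frozenset(new_commit_map)))
    return current_map

class TransitionAborted(Exception):
    """Raised when any step fails; the caller falls back to the Abort phase."""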
The details of each phase are as follows:

* Abort Phase:

  The aim of the abort phase is to make sure that every node which is
  a member of, or which is connected to, the cluster has entered
  cluster transition state.

  When we first begin the abort phase, we reset every Link within the
  cluster.  No Resets occur after this: they are not necessary, since
  any received communications error represents an implicit guarantee
  that the remote node has entered the cluster transition phase
  itself, anyway.

  We need to be aware that the entire cluster transition has some
  pretty fundamental complications: we can get a new cluster
  transition during it.  For example, a cluster merge can be brought
  to a screaming halt by the death of one node in one of the merging
  clusters.  Therefore, we need to be able to reenter abort phase at
  any point in the cluster transition, and this new abort must also
  reset communication links with the new members we were trying to
  merge with at the time.  We make the simple rule that all Links to
  nodes with the same cluster name must be reset during a transition,
  even if those links are to nodes currently known not to be in our
  cluster.  (Remember, a cluster partition due to partial cluster
  connectivity will result in two or more separate clusters from the
  Integration Layer point of view; although those clusters will have
  different IDs, their basic cluster name will be the same.)

  However, we do not have 100% synchronous cluster-wide state
  transitions (this job would be _so_ much easier if we did!).  When
  we do any state transition, we cannot be entirely sure that our
  neighbours are currently in the same state that we are.  We may be
  in abort phase when our neighbours are not.  They may still be in
  abort phase when we have already decided to exit that phase.

  *** @@@ I need to think about this a good deal more when the exact
  transition between states has been outlined in a great deal more
  detail than we have here. ***

  From this, it also follows that a Link reset from a node not
  currently in our cluster must be treated as a serious error if that
  node is currently trying to merge with us (so we need to reenter
  abort phase), but if a reset arrives on a Link to a node which is
  not currently in the cluster due to a previous failed cluster merge,
  it does not necessarily cause an abort.  It might, however, cause
  the local cluster to begin reevaluation of the partial connectivity
  of the Cluster Superset, initiating a voluntary Abort to start the
  cluster merge process.

* Discovery Phase:

* Election Phase:


The Worry Set
=============

We maintain, at all times, a set of nodes called the Worry Set.  This
set contains a list of all nodes which we care about at the moment.  A
Link State change for a node in the Worry Set always restarts the
cluster transition with a move to Abort Phase.

Why do we need this?  Nodes outside our immediate cluster should not
be able to kill the cluster if they die.
If we have evicted a node (due to voluntary exit or due to a fault) or
have rejected an attempt by that node to merge with our cluster, then
that node may still have a connection to us, but the state of that
connection is not important to the cluster.  We have already decided
that the other node is not going to join in the fun, so we shouldn't
care if that node disappears.

However, things get more complicated during a cluster transition,
where the Current Cluster Map and the New Cluster Map are quite
distinct things.  When we are in the middle of proposing a cluster
merge, for example, the new prospective members of the cluster are not
yet in our cluster, but if a link to one of them disappears, we most
definitely need to abort the merge.  On the other hand, if we have
already decided that some remote node is to be evicted from the
cluster during the cluster transition, and that node then dies, then
we probably don't care.

During the normal running of a cluster, the Worry Set is the set of
all nodes in the Current Cluster Map.

During the cluster breakup phase, the Worry Set must be the same:
during that phase we are concerned above all else with the discovery
of what our old cluster now looks like, finding out which nodes and
which links have survived.  We are not at all interested in any other
nodes at this point.

Into the merge phase, we are in a different situation.  All of the
nodes which contact us and try to merge with us must be added
immediately to our Worry Set, so that the entire merge mechanism can
be aborted and restarted if one of those neighbours dies.  Once into
the merge phase, we add to the Worry Set every member of every cluster
which tries to merge with us.  We remove them from the Worry Set if we
end up deciding to reject the merge.

Once the cluster transition is complete, of course, the new Commit Map
becomes the Worry Set.
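A sketch of the Worry Set rules above, phase by phase.  The method
names are illustrative; the point is simply which membership set feeds
the Worry Set at each stage:

class WorrySet:
    """Nodes whose Link State changes must restart the transition (Abort)."""

    def __init__(self, current_map):
        self.nodes = set(current_map)        # normal running: the Current Cluster Map

    def enter_breakup(self, current_map):
        self.nodes = set(current_map)        # breakup: unchanged, old members only

    def add_merge_candidates(self, other_cluster_members):
        self.nodes |= set(other_cluster_members)   # merging cluster's members join the set

    def reject_merge(self, other_cluster_members):
        self.nodes -= set(other_cluster_members)   # rejected merge: stop worrying about them

    def evict(self, node):
        self.nodes.discard(node)             # already decided to evict: its death is not our problem

    def commit(self, new_commit_map):
        self.nodes = set(new_commit_map)     # transition complete: worry about the new map

    def should_abort(self, node_with_link_change):
        return node_with_link_change in self.nodes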