*** : comments regarding open discussion points for the reader
@@@ : comments regarding open discussion points for the author.
      Kick me if there are still too many of these by the time you
      read this document!


The Integration Layer
=====================

Contains:  @@@ To be fixed up when this is a bit more complete.

* Concepts
* API
* Mechanism
* Typical Operations


Concepts
========

* Cluster Controller:

  The single node which is responsible for coordinating a cluster
  transition.  The first task during cluster transition is election of
  such a Cluster Controller.

* Cluster Map:

  A "Cluster Map" is the picture, maintained at any node in a cluster,
  of what the cluster appears to look like from the point of view of
  that node.

  Every cluster has a Cluster Map which indicates the members of that
  cluster.  This is the "Current Cluster Map".

  During a Cluster transition, we must also maintain a "Proposed
  Cluster Map", which describes the state we think we are moving to.
  This map is dynamic: a cluster transition may involve resetting of
  communication links all over the cluster, resulting in many nodes
  temporarily dropping and (hopefully) resuming communication with
  their neighbours, or multiple new nodes finding and rejoining our
  cluster.  The Proposed Cluster Map is updated on each such
  communications event.

  Once the Proposed Cluster Map has stabilised, the Cluster Controller
  makes a Commit Cluster Map out of it and begins a 2-phase commit of
  that new state around the cluster.


API
===

No, there's nothing here yet.


Overview
========

The Integration Layer is supposed to do absolutely nothing while the
cluster is running normally.  Only when circumstances change is it
invoked.  The Events which affect the Integration Layer are usually
generated by the Link Layer.  They include:

++ New Link

   This event merely announces the discovery of a new node on the
   network, and the establishment of a Link to that node.  No link
   state is implicit in this message: we expect a Link State event to
   follow, and no cluster transition is invoked until we get that Link
   State event.

++ Link State

   A Link's state (Down, Reset, Degraded, or Up) has changed.  Oops:
   time to begin a cluster transition.  If a cluster transition is
   already in progress, this requires us to update the Proposed
   Cluster Map and adjust the transition state if necessary.  *Every*
   Link State event causes a cluster transition to be initiated on the
   local node.

++ Operator Intervention

   We must support external commands to modify the cluster map.  For
   example, the operator may request that a specific node be taken out
   of the cluster for maintenance; node shutdown must also perform a
   clean cluster exit.


Operation Requirements
======================

Before launching into a design discussion, we need to be clear about
some of the properties required of the cluster integration layer.
This section will act as a rationale for some of the design decisions
below.

Remember that all cluster transitions merely involve merging or
partitioning of existing clusters.  This property is a result of the
design decision to consider both newly restarted hosts and hosts
evicted from an existing cluster to be single-node clusters in their
own right: the construction of a new, larger cluster of such hosts is
merely a sub-case of the merging together of two clusters, rather than
being a special case in its own right.
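Throughout the rest of this document it helps to have a concrete
picture of the state each node keeps.  The following is a minimal
sketch of the maps and Link Layer events described above; every name
here is illustrative only, not part of the design:

from dataclasses import dataclass, field
from enum import Enum, auto

class LinkState(Enum):
    DOWN = auto()
    RESET = auto()
    DEGRADED = auto()
    UP = auto()

class Event(Enum):
    NEW_LINK = auto()              # link discovered; wait for a Link State event
    LINK_STATE = auto()            # always triggers (or updates) a transition
    OPERATOR_INTERVENTION = auto()

@dataclass
class IntegrationState:
    # The three maps discussed above, as plain sets of node names.
    current_map: set = field(default_factory=set)    # committed membership
    proposed_map: set = field(default_factory=set)   # updated on every comms event
    commit_map: set = field(default_factory=set)     # built once proposed_map stabilises
    in_transition: bool = False

    def on_event(self, event: Event, node: str, state: LinkState | None = None):
        if event is Event.NEW_LINK:
            return                 # no transition until the Link State event arrives
        # Every Link State event (and every operator command) starts or
        # updates a cluster transition on the local node.
        self.in_transition = True
        if event is Event.LINK_STATE:
            if state in (LinkState.UP, LinkState.DEGRADED):
                self.proposed_map.add(node)
            else:
                self.proposed_map.discard(node)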
* Partially connected clusters

  This happens.  We need to deal with the situation where some members
  of a cluster can see some, but not all, of the others.  This can
  result from a fault occurring in a working cluster.  We also need to
  deal with the situation where we are proposing the merging of two or
  more smaller clusters, but where not all of the merging machines can
  see each other.

  We can define the Cluster Superset at any node to be the set of all
  nodes which are connected to that node, or which are in the Cluster
  Superset of any connected nodes.  The Cluster Superset is the
  transitive closure of the direct point-to-point node connectivity
  relation.

  We have several requirements resulting from the need to deal with
  partially connected Supersets.  First of all, the cluster
  integration protocol needs, at some level, to propagate cluster
  connectivity information over all the nodes in the Superset.

  When we merge a set of machines into a larger cluster, we need
  somebody to select the largest fully-connected subgraph of the
  Cluster Superset (or, rather, the subgraph with the most votes) to
  form a new cluster.

  We need to ensure that this configuration is stable.  Any attempt by
  one of the rejected nodes to join the new cluster must be rejected.
  We can ensure this by automatically failing any merge between
  clusters if any of the nodes in one of the merging clusters is
  unable to see any of the nodes in any other cluster in the merge.

* Operation Requirements: Cluster merge

  A cluster merge is the process which occurs whenever any node
  detects the existence of a neighbouring node which has the same
  cluster name but which is not currently a member of its cluster.
  The cluster merge simply combines the two newly connected clusters
  together into one larger cluster.  Given that all nodes start off
  operating as single-node clusters, this mechanism will deal with
  individual nodes joining an existing cluster as well as with two
  previously-partitioned clusters reestablishing connection with each
  other.

  As we just mentioned, it is important for a cluster merge to result
  in a new cluster which actually works.  We require that clusters
  must be fully connected at all times, so a merge must preserve this
  property.

  We can simplify our operation enormously by ensuring that we only
  ever try to merge working clusters.  If we do this, then by the time
  we come to do a merge, the merging clusters will already have
  elected their own Cluster Controller nodes (CCs), so the merge
  becomes a simple matter of one CC handing control of its cluster's
  nodes to another CC.  The CCs can exchange information about each
  other's clusters, and within each merging cluster the new proposed
  cluster membership can be propagated by the CC so that pairwise
  connections between all the newly discovered nodes can be
  established.  If any of these connections fail, then the CCs
  concerned have to abort the merge, and must not make another attempt
  at a merge until some other cluster transition event occurs which
  makes it useful to retry the merge.

  What happens, then, if we attempt a cluster merge and find that
  there are just a few pairwise connections between the clusters which
  cannot be established?  We reject the merge, but it doesn't need to
  stop there.  By breaking up one or more of the smaller clusters to
  evict some of the nodes which are failing to establish cross
  connections, we might be able to establish a new larger cluster.
  However, at the same time we do not want to risk breaking up
  existing clusters unnecessarily just to attempt such a modified
  merge.
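To make the "best fully-connected subgraph" requirement above
concrete, here is a brute-force sketch of computing the Cluster
Superset and then choosing the fully-connected subset with the most
votes.  All names, the vote table and the idea of doing the whole
calculation on one node are illustrative only; clusters are small
enough that brute force is plausible, but the real protocol
distributes this work:

from itertools import combinations

def cluster_superset(start, links):
    """Transitive closure of point-to-point connectivity from `start`.
    `links` maps each node to the set of nodes it can currently see."""
    seen, todo = set(), [start]
    while todo:
        node = todo.pop()
        if node not in seen:
            seen.add(node)
            todo.extend(links.get(node, ()))
    return seen

def fully_connected(nodes, links):
    """True if every pair of nodes in `nodes` can see each other."""
    return all(b in links.get(a, ()) and a in links.get(b, ())
               for a, b in combinations(nodes, 2))

def best_subgraph(superset, links, votes):
    """Fully-connected subset with the most votes (then the most nodes)."""
    best, best_key = frozenset(), (0, 0)
    for size in range(len(superset), 0, -1):
        for subset in combinations(sorted(superset), size):
            if fully_connected(subset, links):
                key = (sum(votes[n] for n in subset), len(subset))
                if key > best_key:
                    best, best_key = frozenset(subset), key
    return best

# Example: C has lost its link to A, so the best two-node subset wins.
links = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
votes = {"A": 1, "B": 1, "C": 1}
print(best_subgraph(cluster_superset("A", links), links, votes))  # frozenset({'A', 'B'})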
  We can introduce a new rule which helps here: after a failed cluster
  merge, the cluster with the most votes never, ever tries to break
  up.  However, we can allow smaller clusters to fragment voluntarily.
  If a CC decides not to proceed with a merge with a higher-voting
  cluster, it can look at which inter-cluster connections succeeded
  and which failed, and can try to identify exactly which of its own
  members are in fact able to see all of the other cluster's members.
  If it finds that some of its nodes would in fact be able to cross
  over to the other cluster, it can evict those members from the local
  cluster.  Those evicted cluster members will immediately attempt a
  cluster merge themselves, and in that process should always look for
  the largest neighbouring cluster to join, which in this case will be
  the larger of the two clusters which tried to merge in the first
  place.

  The result of this will be that we can donate nodes from one cluster
  to another to grow the larger of the clusters, while preventing
  partially-connected nodes from ever disrupting the membership of the
  larger cluster.  The cluster breakup mechanism will therefore never
  break quorum.

* Operation Requirements: Cluster breakup and communication faults

  Cluster breakup is potentially a little more complex, since when a
  disconnection event occurs in a working cluster, we do not yet know
  the extent of the damage: perhaps one single node has failed, or
  perhaps _all_ of the other nodes have become "failed" because our
  network connection to them has died.

  One thing to note straight away is that a single channel error event
  must not be sufficient to trigger cluster disintegration: if only
  one channel in a point-to-point link (ie. only one route of multiple
  redundant routes between a pair of hosts) has died, we must retain
  the cluster.  We still want to trigger a cluster transition of some
  form, as the loss of any channel always causes a complete link reset
  of the communications between those two nodes, and we need to
  recover the context of any messages which were in transit between
  the nodes at the time.  However, if the link does recover
  successfully (say, it fails over to a backup ethernet wire), we can
  record the resulting cluster transition as a null event: we still
  need to perform some recovery to retry any messages which were in
  transit when the error occurred, but we do not need to recover any
  cluster-wide global state in this case.

  *** DISCUSSION POINT: ***

  Does the above point make sense?  We _do_ need to perform some
  recovery in this case, so as far as the cluster integration layer is
  concerned this section is entirely accurate.  However, we need to
  think about the visibility of such null transitions to higher layers
  of the cluster software.

  We can provide reliable data pipes between processes in the cluster
  at a higher level using packet sequence numbers which recover
  themselves over a cluster transition.  Think about two communicating
  high-level processes --- a web server and a CGI server, say --- in
  the case where an unrelated node drops out of the cluster.  Those
  two processes should not be affected by the cluster transition,
  other than at most a temporary stall in the servicing of cluster IO
  and DLM requests.  The IO pipe's internal recovery from channel
  failover can be transparent to the applications.

  It is already planned for cluster transitions to be given a brief
  synopsis of whether any significant node changes have occurred
  during the transition, so by just making the rule that all
  communication errors produce a full cluster transition, we are not
  forcing all of the rest of the cluster software to do full recovery.
  On seeing the flag "no cluster membership change" in the transition
  block, higher layers can simply avoid doing persistent state
  recovery and limit themselves to communications recovery.
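A sketch of what that transition synopsis might look like to the
higher layers, under the rule just described.  The flag name and the
recovery hooks are purely illustrative assumptions:

from dataclasses import dataclass

@dataclass
class TransitionSynopsis:
    """Summary handed to higher layers at the end of a cluster transition."""
    membership_changed: bool          # False for a "null" transition (eg. channel failover)
    lost_nodes: frozenset = frozenset()
    new_nodes: frozenset = frozenset()

def recover(synopsis: TransitionSynopsis):
    # Communications recovery (resending in-transit messages) is always
    # needed, because every transition begins with a link reset.
    replay_in_transit_messages()
    if synopsis.membership_changed:
        # Only a real membership change forces recovery of persistent,
        # cluster-wide state (locks, service placement, and so on).
        recover_persistent_state(synopsis.lost_nodes, synopsis.new_nodes)

def replay_in_transit_messages():
    pass  # placeholder: retry messages queued at the time of the link reset

def recover_persistent_state(lost, new):
    pass  # placeholder: full recovery path for higher layers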
General operation
=================

OK, I admit it.  It would be inaccurate to describe the Integration
Layer's operation, as described in the rest of this document, as being
purely a consequence of the requirements above.  Some of the
requirements already stated are actually requirements which arise from
design decisions, not functional requirements.  Please bear with me on
this --- I *think* that where I've described a mechanism, rather than
a rationale, in the above text, I have good cause, in that the
mechanism represents a broad design decision (rather than a mere point
of detail) which simplifies the entire layer's design.

I'll now try to explain the broad principles of operation I envisage
for the Integration Layer, how it all works together and why the
overall plan makes sense.

You might have noticed that I have simply assumed, in the above, that
every single cluster transition can be expressed as a combination of
cluster breakups and cluster merges.  This really, really simplifies
things enormously.

For one, it means that we never, ever have to deal with individual
nodes in this model: every isolated node is nothing more than a
one-node cluster with an already-elected cluster controller.
Cooperation between different, already-established groups of nodes
therefore reduces to a problem of point-to-point communications
between individual CC nodes.  Such cooperation between
currently-unclustered nodes is *always* a cluster merge operation,
never anything else (except that, possibly, we may want to add
diagnostic and monitoring code on top: that does not invalidate the
Integration mechanism as presented here).

A second major advantage of the merge/breakup concept is that it deals
very cleanly with errors which occur during a cluster transition.  If
a cluster detects an internal error (loss of a node or link) while in
the process of completing a cluster transition, it immediately drops
everything and sorts out its internal problems first.

Thirdly, it also simplifies the management of complex faults in a
cluster where complete any-to-any connectivity is lost.  Such faults
are dealt with in the breakup phase of the transition, and in that
phase we always know _exactly_ who was in the old cluster, and we have
to deal with those nodes alone: we never have to deal with any nodes
outside the cluster until the cluster's own internal state has been
reestablished.

It is important to refer back here to the general design principle of
hierarchical phases of operation.  In this merge/breakup model,
cluster breakup is a more fundamental operation than cluster merge.
Remember that the hierarchical phases principle means that any error
takes us to a previous phase, and forward progress is only ever
achieved by consensus.  By dealing with the cluster breakup phase
prior to the cluster merge phase, we ensure that merging is always a
cooperative operation between working clusters: any internal cluster
error is dealt with before we ever get as far as merging clusters.
If, during a merge, one of the merging clusters detects the loss of an
internal node, that cluster simply aborts the merge and reverts to the
prior phase, reconfiguring itself to deal with its internal error
before bothering with the rest of the world.  The fact that we start
rebuilding this state by reentering the Abort phase must ensure that
we propagate the error to the rest of the nodes we were merging with,
so that they themselves can detect that the merge phase has been
aborted.

The expected consequence of this is that the cluster which detected
the internal loss of a node will quietly resolve its internal state
before trying to join anybody else.  Either it will complete that
after the rest of the Cluster Superset finishes merging (in which case
it will try to trigger another merge), or before (in which case the
in-progress merge will be aborted).  Either way, the previous, aborted
merge is automatically resumed only after the component clusters have
dealt with the node loss.  Cluster merging simply does not have to
deal with node loss: it only has to deal with a general "oops,
something went wrong" error which always results in a regression to
the abort phase.


Mechanism
=========

A Cluster Transition involves progress through a set of distinct
phases.  The execution of each phase may be very rapid if we happen to
know in advance what sort of transition is occurring (eg. controlled
eviction of a single node does not have to perform a complete CC
reelection if the remaining nodes can all still talk to each other
afterwards, and the evicted node was not previously the CC).

This section will contain a complete description of each of these
phases.  The descriptions will begin with the objectives of each
phase: what the pre-conditions and post-conditions are.  The
pre-conditions are sufficient to describe those things which the phase
does NOT have to achieve, but can simply assume have already been
achieved.  The post-conditions describe the objectives of the phase by
establishing the final results that the phase is required to achieve.

++ Abort Phase: tell everybody that something terrible has happened.

   Pre-conditions:

   None at all: we can't assume anything here, since this is the state
   we enter on getting an error of any description, either a recursive
   error during the integration process or a fault detected during
   normal running of the cluster.

   Post-conditions:

   All cluster nodes which are still communicating with us and which
   are in the current Commit Cluster, as well as all nodes which we
   may have been trying to merge with since beginning this cluster
   transition, have also returned to Abort Phase.  The connectivity of
   the Commit Cluster is moderately stable.

   Mechanism:

   The only inputs to the Abort phase are the previous Committed
   Cluster Map, which exists at all times and which is not actually
   modified by the Integration Layer transition mechanism at all (at
   least, not until the entire transition process has completed and a
   new Commit Map has been established); and the current state of each
   Link to nodes with the same cluster name.

   When no Link state changes concerning neighbours in the Commit Map
   have occurred within a certain period (the Cluster Settling
   interval), we can enter Discovery Phase and start sending Discovery
   messages on all Links.  If a neighbour sends us a Discovery message
   before our Settling interval is complete, we buffer it for use once
   we have settled and entered Discovery phase ourselves.  We must not
   actually respond to that message until we have Settled.
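The hierarchical-phases principle and the Settling rule just described
can be sketched as a small state machine.  The phase names follow the
document; the timer value, message buffering and function names are
illustrative assumptions:

import time
from enum import IntEnum

class Phase(IntEnum):
    # Ordered: any error regresses us to ABORT; forward progress is by consensus.
    ABORT = 0
    DISCOVERY = 1
    EVICTION = 2
    PROPAGATION = 3
    PROPOSAL = 4
    COMMIT = 5

SETTLING_INTERVAL = 2.0   # seconds without Link state changes (assumed value)

class AbortPhase:
    def __init__(self):
        self.phase = Phase.ABORT
        self.last_link_change = time.monotonic()
        self.buffered_discovery = []          # Discovery messages received too early

    def on_link_state_change(self):
        # Any Link state change, in any phase, regresses us to Abort.
        self.phase = Phase.ABORT
        self.last_link_change = time.monotonic()

    def on_discovery_message(self, msg):
        if self.phase is Phase.ABORT:
            self.buffered_discovery.append(msg)   # do not respond until Settled
        else:
            process_discovery(msg)

    def poll(self):
        # Once the Commit Map neighbours have been quiet for the Settling
        # interval, enter Discovery and replay anything we buffered.
        settled = time.monotonic() - self.last_link_change >= SETTLING_INTERVAL
        if self.phase is Phase.ABORT and settled:
            self.phase = Phase.DISCOVERY
            for msg in self.buffered_discovery:
                process_discovery(msg)
            self.buffered_discovery.clear()

def process_discovery(msg):
    pass   # placeholder for the Discovery phase handling described below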
++ Discovery phases: gather information about the new state of the
   Cluster Superset.

   Discussion:

   The first process we enter in a cluster transition is the breakup
   phase: the eviction of any unreachable nodes from the old Cluster,
   plus potentially the eviction of further nodes if the remaining
   nodes are no longer all fully connected to each other.  To do this,
   we need somebody to actually decide which "best" fully-connected
   subset of the old Cluster is still viable.  We will call the node
   which makes this decision the Cluster Elector for the breakup
   phase.

   We also need to remember that when we lose contact with a node in
   the cluster, that node may decide to evict itself from our cluster
   and form a new cluster on its own behalf, but communications with
   that node may be reestablished before we have finished our own
   cluster transition.  This is dealt with by observing that the lost
   and recovered node must have a changed Cluster ID, and preventing
   nodes with the wrong Cluster ID from participating actively in the
   Breakup phase.  The discovery phase therefore also needs to
   establish, for each node, whether the neighbouring nodes still have
   the same Cluster ID as we do.

   In this situation, the lost node will just have to wait until all
   of the nodes which have still got the old Cluster ID have finished
   their eviction phase and have entered Cluster Merge phase (in other
   words, the whole mess is dealt with before we leave the cluster
   transition altogether, but we don't have to deal with it just yet:
   the ordering of breakup phase before merge phase naturally deals
   with this sort of fault).  If no nodes survive with that cluster
   ID, then that's fine too --- the resulting smaller clusters will
   enter the merge phase eventually.

   We define the term Related Nodes to refer to the set of neighbours
   of a given node which (a) are still connected to that node, and (b)
   have the same cluster ID (which necessarily implies that they were
   part of the same Commit Set as us previously).  The Related Node
   Superset of the old Commit Cluster Map is the transitive closure of
   the Related Node function, and includes all of the old nodes which
   are connected together over any multi-hop route of Related nodes.

   Cluster transitions can take arbitrary amounts of time, due to the
   fact that the Abort phase will stall for as long as a communication
   Link is unstable (in other words, faults which transition rapidly
   are bad news --- we need to think about mechanisms by which to
   dampen the effects of such nodes, but such a mechanism probably
   involves voluntary exit of a node experiencing such problems and
   doesn't directly change the Integration Layer's behaviour at all).
   As a result of this, we can imagine a not-quite-partition of the
   cluster, in which the cluster partitions but then a single node
   which acts as a bridge between the partitions comes back.

   If that node has not yet completed its own cluster transition
   (evicting itself from the previous cluster in the process), then
   all is fine: that node is still Related to some others, and the
   Abort which follows the link state change when the node rejoins its
   neighbours will eventually lead to the two partitions joining the
   same Related Node Superset.  If the "bridge" node has transitioned,
   on the other hand, we have two Related Node Supersets in the new
   cluster which are theoretically able to merge but which are divided
   by an unrelated node.  At this point we are just doing breakup, not
   merge, so we have to deal with the Related Supersets as quite
   distinct entities.  Therefore, we do _not_ want to pass discovery
   information over the bridge node: discovery information must only
   be passed to Related Nodes, not beyond.
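A sketch of the Related Node relation and its transitive closure,
including the rule that propagation stops at nodes whose Cluster ID
has changed (the "bridge" case above).  The data layout is an
illustrative assumption:

def related_nodes(node, links, cluster_id):
    """Neighbours still connected to `node` which carry the same Cluster ID.
    `links[node]` is the set of currently connected neighbours;
    `cluster_id[node]` is the Cluster ID each node reports."""
    return {n for n in links.get(node, ())
            if cluster_id.get(n) == cluster_id[node]}

def related_node_superset(start, links, cluster_id):
    """Transitive closure of the Related Node relation from `start`.
    A rejoined node with a new Cluster ID is never crossed, so it cannot
    act as a bridge between two Related Node Supersets."""
    seen, todo = set(), [start]
    while todo:
        node = todo.pop()
        if node not in seen:
            seen.add(node)
            todo.extend(related_nodes(node, links, cluster_id))
    return seen

# A---B---X---C---D, where X has already left and reformed on its own:
links = {"A": {"B"}, "B": {"A", "X"}, "X": {"B", "C"}, "C": {"X", "D"}, "D": {"C"}}
cluster_id = {"A": 7, "B": 7, "X": 99, "C": 7, "D": 7}
print(related_node_superset("A", links, cluster_id))   # {'A', 'B'}: X blocks C and D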
   Mechanism:

   For the Cluster Elector, we choose the node with the greatest
   "Value Function", where the Value Function of a Node is some
   function ordered by:

   * -ve number of degraded links to nodes which are still in the
     node's Commit Map.  We want to select a value function which
     minimises the amount of degraded link traffic in the common
     failure modes.  Do we do this best by maximising the number of
     non-degraded links active at the Elector node, or minimising the
     number of degraded links?  Probably the latter, so the value
     function (for now) always prefers nodes with few degraded links.
     Then,

   * CC: the existing CC in the Related Node Superset is always
     preferred as Cluster Elector if it is still reachable via any
     route through that Superset (unless it has many degraded links
     due to recent faults).  Failing that,

   * Total number of votes present on that node, or on nodes Related
     to that node; then, in case of a tie, by

   * Number of other nodes Related to that node; then,

   * "Metric" of that node (a static configuration value intended to
     represent available CPU power); then,

   * Lexicographic order of the node's unique nodename.

   The important point about this function is that it is unambiguous.
   Any node can generate a Value for itself, and pass that to other
   nodes; given that information, any two nodes will always agree on
   which node has the higher Value.  (A tuple-comparison sketch of
   this ordering appears after the phase's pre- and post-conditions
   below.)

   *** Note that by using the degraded link count as a high-priority
   sort key and the node Metric as a low order key, we end up
   preferring to minimise degraded link traffic rather than placing
   the Elector on the most CPU-capable node in the cluster.  Comments?
   Maximising the connectivity of the resulting cluster in the
   Eviction phase may require a good CPU --- should we optimise for
   that rather than optimising for communications at this point?

   What we now need is a mechanism by which all connectivity
   information in the entire Cluster can be propagated to a single
   node which all nodes agree is the best one to be Cluster Elector.
   The mechanism needs to be able to deal with situations like this:

       Node:   A-------B-------C-------D-------E
       Value:  4       1       3       2       5

   where we have very little connectivity in the cluster: maybe only
   the point-to-point serial cables between the nodes have survived,
   and each node can see only its immediate neighbours.  We cannot
   simply propagate connectivity information to our neighbour with the
   best "value", since nodes B and D in this example will end up
   sending their information in opposite directions.

   So, first of all we will decide who is to be the Elector by
   propagating Value information between nodes.  Only once one node
   decides it has won the Election will we propagate the full
   connectivity information, and the Elector will guide that phase by
   explicitly telling the other nodes what decisions it has come to.

++ Discovery Phase.

   Pre-conditions:

   All of our neighbours within the Commit Map have either entered
   abort state or are unreachable: we have Reset our Links to them and
   have waited long enough to be sure that either they have already
   completed the Reset negotiation or they aren't going to.

   Post-conditions:

   Everybody in the local Cluster Superset has propagated its local
   view of cluster connectivity (ie. which other nodes it can still
   see) to a new node which will become the "Cluster Elector", which
   controls the move to the next phase.
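As promised above, the Value Function ordering maps naturally onto
lexicographic tuple comparison.  The field names are illustrative
assumptions; the point is only that every node can compute its own
tuple and any two nodes will order a pair of tuples identically:

from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class ElectionValue:
    # Fields are compared in declaration order, reproducing the ordering in
    # the text: fewer degraded links first, then incumbent CC, then votes,
    # then Related-node count, then Metric, then nodename.
    neg_degraded_links: int     # -(number of degraded links within the Commit Map)
    is_current_cc: bool         # True sorts above False
    related_votes: int          # votes on this node and on nodes Related to it
    related_count: int          # number of other Related nodes
    metric: int                 # static CPU-capacity configuration value
    name_key: tuple             # see below

def name_sort_key(name: str) -> tuple:
    # The text does not say which direction of the lexicographic tie-break
    # wins; here we assume the smaller name does, and invert each byte so
    # that "greatest ElectionValue wins" stays uniform.
    return tuple(-b for b in name.encode())

a = ElectionValue(0, False, 3, 2, 10, name_sort_key("alpha"))
b = ElectionValue(-1, True, 5, 4, 50, name_sort_key("beta"))
print(max(a, b))   # a wins: it has no degraded links, and that key dominates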
   Mechanism:

   In this phase we want to propagate the Election Value function
   between nodes, and we want to propagate connectivity maps for the
   entire cluster back to the Elector.  We choose the following
   algorithm:

   Each node starts by sending its own value function to each of its
   Related neighbours.  Each subsequent message between nodes includes
   a Value Function plus a non-empty list of connectivity maps.  The
   Value Function of any message is the maximum of the Value Functions
   of any nodes whose connectivity is listed in the map.  (The
   connectivity of a node is just a complete list of all other nodes
   still connected to that node, plus the Link state: DEGRADED or UP.)

   When a node has received the initial Value Functions of each of its
   neighbours, it can start to propagate connectivity maps.
   Propagation works from the node with the lowest Value upwards: the
   node with the lowest Value begins by sending its connectivity map
   to its neighbour with the highest Value.  When a node receives such
   messages, it updates its local database of connectivity maps to
   maintain both (a) the total set of all nodes whose connectivity
   information is known, and (b) a list of connectivity maps which are
   known to each neighbour.  If a node discovers, in this manner, a
   new node which has a higher value function than any

   @@@ Described more fully in discovery.txt

++ Eviction Phase

   Pre-conditions:

   One node --- the Elector node --- has a full map of the cluster
   connectivity and knows that it is the Elector.  All other reachable
   Related nodes are still blocked somewhere in the Discovery phase.

   Post-conditions:

   We have established a Cluster Controller within the current Commit
   Cluster to guide us through the rest of the cluster transition, or
   we have left the cluster (and reformed a new cluster which will try
   to merge if possible with any neighbouring clusters).

   The transition from Discovery phase to Eviction Phase is guided by
   the unique Elector node.  The Elector is responsible for making a
   unilateral decision about which of the current members of the
   cluster should survive the transition and which get evicted.  This
   phase also decides which node is going to act as Cluster Controller
   ("CC") for the remainder of the transition.

   Mechanism:

   The Elector node makes a decision about the new shape of the
   cluster.  It must calculate the fully-connected subset of all
   surviving Related nodes which maximises:

   * Total Votes; then,
   * Total Metric (CPU capacity); then,
   * Total number of nodes.

   This subset of nodes becomes the New Cluster Map.  The Elector then
   determines, from that map, the maximal subset of nodes which is
   fully connected by non-degraded links only.  This further subset is
   the "Functioning Subset" of the cluster, and the nodes in that
   subset are "Functioning" nodes.  Other nodes are marked as
   "Degraded" nodes.

   Discussion: Degraded Nodes

   It will be up to individual nodes to decide which of their internal
   services to start, and which to suspend, on a transition between
   Functioning and Degraded mode.

   The most important reason for maintaining Degraded nodes in the
   cluster is to sustain Quorum: any votes belonging to a Degraded
   node still count towards the Quorum in a cluster, and can therefore
   mean the difference between life and death of critical cluster
   services on a cluster failure.  Consider a cluster of 5 voting
   nodes: a communications fault may remove high-quality access to
   three of those nodes, but we may still have degraded links from the
   remaining two nodes to one of the failed nodes.  In that situation,
   we still have 3 votes out of 5 --- enough, barely, for quorum ---
   as long as that degraded node remains in the cluster.  However, if
   we remove it from the cluster (even cleanly, returning its votes),
   we are left with only 2 votes out of 4, and we lose Quorum.
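The quorum arithmetic in that example is worth making explicit.  A
minimal sketch, assuming the usual strict-majority rule (the exact
quorum formula is not spelled out in this document, so treat this as
an assumption):

def has_quorum(present_votes: int, expected_votes: int) -> bool:
    # Assumed rule: strictly more than half of the expected votes must be present.
    return 2 * present_votes > expected_votes

# Five voting nodes, one vote each.  Three lose their high-quality links,
# but one of them is still reachable over a degraded link.
print(has_quorum(present_votes=3, expected_votes=5))   # True:  3/5, quorum held

# Evict the degraded node cleanly (its vote is returned, so expected drops to 4).
print(has_quorum(present_votes=2, expected_votes=4))   # False: 2/4, quorum lost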
   However, in most cases we hope that loss of high-quality
   communications to a node does not lose _all_ degraded
   communications: that's why we have backup serial interconnects, for
   example.  In the above example, if the 2 evicted nodes lose direct
   contact with the surviving cluster, they can still exit from the
   cluster cleanly as long as there is some multi-hop path between the
   nodes.  In this case, Quorum management must be involved in the
   cluster transition.  For good quorum maintenance, it is absolutely
   critical that we perform that eviction in a clean manner which
   allows the expected votes in the surviving cluster to be adjusted
   if necessary.

   Returning to the Eviction mechanism: the Elector then adds to that
   set of nodes any other Related nodes which can see each node in the
   Functioning Cluster, but which have Degraded connections to one or
   more Functioning Nodes.  These nodes with Degraded connections can
   still belong to the new cluster, but they are marked as Degraded,
   not Functioning, members.  They are still present for sysadmin and
   vote tiebreaking purposes.

   Error recovery:

   The Eviction phase is not quite as simple as previous phases with
   respect to error recovery, but it is not too complex to understand
   as long as we bear in mind exactly what sort of recovery we need to
   deal with.  The reason that the Eviction phase introduces new
   complications is that up until this phase, we have not started
   changing the membership of the cluster (as defined by the Cluster
   ID at each neighbouring node).  During Eviction, we are for the
   first time performing a controlled modification of the cluster
   membership.

   Fortunately, we can still reuse the existing error recovery
   mechanisms.  If we lose our Link to a Related node during this
   phase, we just restart the cluster transition by going back to
   Abort phase.  This is fine.  It doesn't matter exactly who in the
   cluster has completed Eviction and who has not.  Any nodes whose
   eviction has completed will now have a new Cluster ID; nodes which
   have not will still have the old Cluster ID.  The Discovery phase
   explicitly limits itself to discovering only Related nodes, so
   after any error, we already have the required recovery mechanism in
   place to work out who has and who has not yet completed the
   Eviction process.

++ Propagation Phase

   This phase propagates all information about nodes still reachable
   in the cluster, and inter-node communication paths still working,
   to the CC node.

++ Proposal Phase

   The CC generates a new Commit Cluster Map, and proposes that map to
   every node.

++ Commit Phase

   The CC completes the transition by promoting the Commit Cluster Map
   to be the Current Cluster Map, and requests all other nodes to do
   the same.  The cluster transition is seen by each node to be
   complete once this occurs: the CC sees the transition first and
   other nodes see it once the Commit message arrives.

Note that within each of these phases, we may have to pass information
to nodes which are not necessarily connected directly to the CC.  If a
cluster fault has ended up producing a partially-connected set of
nodes, then at least CC election information and cluster eviction
information must be passed, preferably along with a new cluster map
indicating to the rejected node just which machines it could not see
in the cluster (for diagnostic purposes, not for correctness: you
cannot base your local correctness on the state of an invisible
node!).
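A sketch of the Proposal and Commit phases from the CC's point of
view, as a conventional two-phase commit.  Message names, the `send`
callback and the exception are assumptions; in particular, messages
may need to be relayed rather than sent directly, as noted above:

def run_proposal_and_commit(cc, new_commit_map, send, current_map):
    """Two-phase commit of the new Commit Cluster Map, driven by the CC.
    `send(node, message)` returns the node's reply (or raises on link error)."""
    members = [n for n in new_commit_map if n != cc]

    # Proposal phase: every member must acknowledge the proposed map.
    for node in members:
        reply = send(node, ("PROPOSE", frozenset(new_commit_map)))
        if reply != "ACK":
            raise TransitionAborted(node)   # regress to Abort phase

    # Commit phase: the CC switches first, then tells everybody else.
    current_map.clear()
    current_map.update(new_commit_map)
    for node in members:
        send(node, ("COMMIT", frozenset(new_commit_map)))
    return current_map

class TransitionAborted(Exception):
    """Raised when any step fails; the caller falls back to the Abort phase."""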
The details of each phase are as follows:

* Abort Phase:

  The aim of the abort phase is to make sure that every node which is
  a member of, or which is connected to, the cluster has entered
  cluster transition state.

  When we first begin the abort phase, we reset every Link within the
  cluster.  No Resets occur after this: they are not necessary, since
  any received communications error represents an implicit guarantee
  that the remote node has entered the cluster transition phase
  itself, anyway.

  We need to be aware that the entire cluster transition has some
  pretty fundamental complications: we can get a new cluster
  transition during it.  For example, a cluster merge can be brought
  to a screaming halt by the death of one node in one of the merging
  clusters.  Therefore, we need to be able to reenter abort phase at
  any point in the cluster transition, and this new abort must also
  reset communication links with the new members we were trying to
  merge with at the time.  We make the simple rule that all Links to
  nodes with the same cluster name must be reset during a transition,
  even if those links are to nodes currently known not to be in our
  cluster.  (Remember, a cluster partition due to partial cluster
  connectivity will result in two or more separate clusters from the
  Integration Layer point of view; although those clusters will have
  different IDs, their basic cluster name will be the same.)

  However, we do not have 100% synchronous cluster-wide state
  transitions (this job would be _so_ much easier if we did!).  When
  we do any state transition, we cannot be entirely sure that our
  neighbours are currently in the same state that we are.  We may be
  in abort phase when our neighbours are not.  They may still be in
  abort phase when we have already decided to exit that phase.

  *** @@@ I need to think about this a good deal more when the exact
  transition between states has been outlined in a great deal more
  detail than we have here. ***

  From this, it also follows that a Link reset from a node not
  currently in our cluster must be treated as a serious error if that
  node is currently trying to merge with us (so we need to reenter
  abort phase), but if a reset arrives on a Link to a node which is
  not currently in the cluster due to a previous failed cluster merge,
  it does not necessarily cause an abort.  It might, however, cause
  the local cluster to begin reevaluation of the partial connectivity
  of the Cluster Superset, initiating a voluntary Abort to start the
  cluster merge process.

* Discovery Phase:

* Election Phase:


The Worry Set
=============

We maintain, at all times, a set of nodes called the Worry Set.  This
set contains a list of all nodes which we care about at the moment.  A
Link State change for a node in the Worry Set always restarts the
cluster transition with a move to Abort Phase.

Why do we need this?  Nodes outside our immediate cluster should not
be able to kill the cluster if they die.
If we have evicted a node (due to voluntary exit or due to a fault) or
have rejected an attempt by that node to merge with our cluster, then
that node may still have a connection to us, but the state of that
connection is not important to the cluster.  We have already decided
that the other node is not going to join in the fun, so we shouldn't
care if that node disappears.

However, things get more complicated during a cluster transition,
where the Current Cluster Map and the New Cluster Map are quite
distinct things.  When we are in the middle of proposing a cluster
merge, for example, the new prospective members of the cluster are not
yet in our cluster, but if a link to one of them disappears, we most
definitely need to abort the merge.  On the other hand, if we have
already decided that some remote node is to be evicted from the
cluster during the cluster transition, and that node then dies, then
we probably don't care.

During the normal running of a cluster, the Worry Set is the set of
all nodes in the Current Cluster Map.

During the cluster breakup phase, the Worry Set must be the same:
during that phase we are concerned above all else with the discovery
of what our old cluster now looks like, finding out which nodes and
which links have survived.  We are not at all interested in any other
nodes at this point.

Into the merge phase, we are in a different situation.  All of the
nodes which contact us and try to merge with us must be added
immediately to our Worry Set, so that the entire merge mechanism can
be aborted and restarted if one of those neighbours dies.  Once into
the merge phase, we add to the Worry Set every member of every cluster
which tries to merge with us.  We remove them from the Worry Set if we
end up deciding to reject the merge.

Once the cluster transition is complete, of course, the new Commit Map
becomes the Worry Set.
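A sketch of the Worry Set rules above, phase by phase.  The method
names are illustrative; the point is simply which membership set feeds
the Worry Set at each stage:

class WorrySet:
    """Nodes whose Link State changes must restart the transition (Abort)."""

    def __init__(self, current_map):
        self.nodes = set(current_map)        # normal running: the Current Cluster Map

    def enter_breakup(self, current_map):
        self.nodes = set(current_map)        # breakup: unchanged, old members only

    def add_merge_candidates(self, other_cluster_members):
        self.nodes |= set(other_cluster_members)   # merging cluster's members join the set

    def reject_merge(self, other_cluster_members):
        self.nodes -= set(other_cluster_members)   # rejected merge: stop worrying about them

    def evict(self, node):
        self.nodes.discard(node)             # already decided to evict: its death is not our problem

    def commit(self, new_commit_map):
        self.nodes = set(new_commit_map)     # transition complete: worry about the new map

    def should_abort(self, node_with_link_change):
        return node_with_link_change in self.nodes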