Hierarchical Clusters
=====================

last modified: $Id: hierarchy.txt,v 1.4 1999/12/17 14:52:15 sct Exp $

[ Read past the definitions to the justifications bit if you just
  want to get an overview of what we're trying to achieve here. ]


DEFINITIONS
-----------

For the discussion of hierarchical clusters, we need to make a few
definitions.

First of all, let us define what a hierarchical cluster is.

A single, FLAT cluster is some virtual entity born from a
collaboration between (probably nearby) nodes on a network.  We can
call this a FIRST-LEVEL cluster.  Those nodes are the cluster's
MEMBERs, and together they form the cluster's MEMBERSHIP LIST.

Each cluster has, at all times, a unique privileged node known as the
CLUSTER LEADER.  That node plays a key role in coordinating cluster
transitions, but other than that it does not necessarily perform any
special work during the normal running of the cluster.

To combine clusters into larger units, we want to be able to support
the binding of many clusters into a single HIGHER-LEVEL cluster, or
METACLUSTER.  That metacluster is formed from the clustering together
of all of the cluster leaders of its MEMBER CLUSTERS: those cluster
leaders, together, form the metacluster's membership list.

The complete list of all member clusters of a metacluster, and all
the members of those member clusters right down to the first-level
clusters, forms the metacluster's SUBCLUSTER LIST.

The metacluster's subcluster list is therefore a different concept
from its membership list: if a subcluster undergoes a state
transition and elects a new cluster leader, then that new leader will
replace the old one in the metacluster's membership list, but the
metacluster's subcluster list remains the same.

A metacluster of first-level clusters constitutes a second-level
cluster.  A metacluster of those second-level clusters forms a
third-level cluster, and so on.

We can observe that in a large corporate network being managed as a
single cluster (we can dream, can't we? :), we really do not want all
of our labs' test clusters to cause corporate-wide cluster
transitions whenever a test machine dies.  Not at the rate at which I
reboot test machines, at any rate.  This is the "S-Cluster" case.

On the other hand, we can also envisage large compute clusters of
tens or hundreds of thousands of nodes in which we want to support
high rates of cluster traffic on cluster-wide shared resources (such
as cluster-wide locks).  We simply *have* to reclaim such locks when
a node dies (DLM semantics require it), so in this second, large
"L-Cluster" case, every node is significant as far as cluster-wide
recovery is concerned.

This implies that, when looking at any given level of the cluster
hierarchy, there may be nodes underneath which participate in
recovery, as well as nodes which do not.  We define these as PEER
nodes and SATELLITE nodes respectively.  For any cluster, all of the
peers taking part in that cluster's recovery form the cluster's PEER
LIST or PEERAGE.

Any given node may be a member of the peerage for its own cluster and
for any number of higher-level clusters.  Above that level (its PEER
LEVEL) the node is a satellite.  A true satellite in the first-level
cluster has a peer level of zero.

*** MISSING: Define exactly what a satellite's relation with the
    higher-level clusters is.  In particular, does a satellite always
    proxy through peers in its own cluster?  There may not be any
    such nodes.  The satellite may _have_ to go to some outside node
    in a higher-level cluster.
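To make the preceding definitions concrete, here is a minimal sketch
of the kind of data structures they imply.  This is purely
illustrative: none of these type or field names come from existing
code, and a real implementation would obviously need locking,
dynamic sizing and so on.

    /* Illustrative only: hypothetical types, not an existing API. */

    struct node {
        char             name[64];      /* e.g. "DESKTOP" */
        int              peer_level;    /* highest cluster level at which
                                         * this node acts as a peer; 0 for
                                         * a true satellite in its
                                         * first-level cluster */
    };

    struct cluster {
        int              level;         /* 1 for a first-level cluster */
        struct node     *leader;        /* the current CLUSTER LEADER */

        struct node    **members;       /* MEMBERSHIP LIST: ordinary nodes
                                         * for a first-level cluster,
                                         * subcluster leaders for a
                                         * metacluster */
        int              nr_members;

        struct node    **peers;         /* PEERAGE: every node under this
                                         * cluster whose peer level reaches
                                         * this cluster's level */
        int              nr_peers;

        struct cluster **subclusters;   /* member clusters of a metacluster;
                                         * empty for a first-level cluster.
                                         * Its transitive closure, down to
                                         * the first-level clusters, is the
                                         * SUBCLUSTER LIST. */
        int              nr_subclusters;
        struct cluster  *parent;        /* enclosing metacluster, if any */
    };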
*** Think also about what happens if a node is a peer only up to a
    certain level: it loses the ability to be a cluster leader above
    that level.  Does this affect proxying much?  I need to draw this
    out before expanding on it much, and before completing any more
    detail on the recovery specs.

This has one major implication for the design of the integration
layer.  The fact that only cluster leaders can be members of a
higher-level cluster implies that only nodes with the highest peer
level in any cluster may be cluster leaders for that cluster.  If we
did not observe this rule then we might find that a higher-level
cluster could not convene because none of the nodes capable of being
peers at that level were leaders of their respective clusters.


CONNECTIVITY
------------

The peerage for a cluster contains all nodes in that cluster expected
to partake in cluster-wide shared state such as lock management.  Any
peer may be acting as a lock holder or a lock master at any point in
time.  As a result, we should expect a node to be connected to all of
its peers at all times.

However, although a flat cluster guarantees any-to-any connectivity,
a metacluster only guarantees such connectivity between its own
cluster members (ie. the leaders of the subclusters).  We need to
augment this with a fault-reporting mechanism by which any peer can
report suspected loss of connectivity to another, so that one of the
partially connected nodes can be removed from the cluster.  The
metacluster mechanism does not need to provide the fully-connected
guarantee itself as long as there is an independent fault manager
somewhere in the software stack which can deal with a breakdown in
communications.


JUSTIFICATION
-------------

What does this big mess of stuff actually add to our clustering?

First of all, we note that some form of hierarchy is absolutely
necessary in order to scale.  We cannot guarantee any-to-any
connectivity between a million nodes without swamping our
communications fabric.  In an L-Cluster of such a size, we might
expect the MTBF to be under sufficient control that we only suffer a
cluster transition every few minutes, but it is hard to imagine
establishing any-to-any comms between that many machines quickly
enough that we can recover in time for the next failure.

In a distributed S-Cluster of similar size, the situation is
different.  We expect a lower MTBF (or at least a lower mean time
between transitions, MTBT) to begin with, because we may have test
machines and workstations being rebooted for perfectly valid reasons.
We also expect to have much less bandwidth between some of the
metacluster members (we may have remote clusters linked over external
IP connections) and to have very asymmetric load capacity (say, only
a few servers capable of acting as metacluster leaders at each site).

How can we deal with these problems?

First of all, in an L-Cluster, the limiting factor is the requirement
for synchronisation between all machines in the cluster when
something goes wrong.  Verifying any-to-any connectivity, even if it
requires only a million broadcasts (one per node), will be hideously
expensive.  However, with a dense enough fabric, we might expect that
a broadcast from one specific node can be done efficiently enough,
even if it is only simulated via point-to-point interconnects.  We
can use the cluster hierarchy to distribute broadcast messages from
the cluster leader to each cluster member, ie. to the leaders of all
the subclusters, and repeat the process down the cluster hierarchy
until all nodes have been contacted.
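Sketched below is roughly what such a fan-out might look like.  This
is an illustration only: struct cluster and struct node are the
hypothetical structures sketched in the DEFINITIONS section, and
send_to_node() is an invented transport primitive, not an existing
interface.

    #include <stddef.h>

    /* struct cluster and struct node are the illustrative structures
     * sketched in the DEFINITIONS section above. */

    /* Invented transport primitive: deliver one message to one node. */
    void send_to_node(struct node *to, const void *msg, size_t len);

    /*
     * Runs on a cluster leader: deliver msg to every other node at or
     * below this cluster in the hierarchy.  In a metacluster the
     * members are the subcluster leaders, and each of them repeats
     * the process one level down, so each message travels at most
     * O(log(n)) hops and O(n) messages cross the fabric in total.
     */
    static void leader_fanout(struct cluster *c, struct node *self,
                              const void *msg, size_t len)
    {
        int i;

        for (i = 0; i < c->nr_members; i++) {
            if (c->members[i] == self)
                continue;               /* the leader already has msg */
            send_to_node(c->members[i], msg, len);
        }

        for (i = 0; i < c->nr_subclusters; i++) {
            /* In reality each subcluster's leader performs this step
             * itself on receipt of the message; it is shown as a
             * direct recursion here purely for brevity. */
            leader_fanout(c->subclusters[i],
                          c->subclusters[i]->leader, msg, len);
        }
    }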
The acks from each node can likewise be propagated back up the tree
and merged into a single ack at the top level.  The total number of
packets transmitted over the entire fabric is O(n), but the total
forwarding distance is only O(log(n)).

In other words, as long as recovery of the cluster services (such as
the DLM) can be performed using such fan-out/fan-in broadcast+ack
communication primitives, recovery of large numbers of machines
should be feasible in a short time.

By using hierarchical clusters, the individual cluster membership
transitions involved can be lightweight, typically only involving ten
or so nodes at once (the most complex cases involving the death of a
cluster leader).  However, that membership transition can result in
the eventual recovery of large numbers of nodes, because we can allow
a low-level cluster transition to result in peer recovery of any
metaclusters involved.  If a cluster transition only involves the
arrival or departure of nodes which are satellites to the
metacluster, then the metacluster's peer list is unaffected and no
recovery is needed at all.  This is the property which allows us to
conceive of extremely large S-Clusters using this design.

The fact that nodes may be peers only up to a certain level and
satellites above that level allows us to limit the number of peers
which are present in the higher levels of the metacluster.  This both
reduces the number of transitions in those higher levels (keeping
cluster availability high) and reduces the high-level cluster's
traffic.

In short, we envision two scenarios.  In L-Clusters, the top-level
cluster traffic is high but the fabric is dense, and the main job of
the hierarchy is to limit the cost of completing cluster membership
transitions.  Recovery itself is always done at the top level of the
cluster, because most resources in the cluster are shared around the
entire cluster.

In S-Clusters, most resources are local to subgroups in the cluster
hierarchy.  The cluster hierarchy's main job is to prevent there from
being many transitions at all in the top-level cluster, even if
lower-level clusters are transitioning regularly.  The cluster
hierarchy removes the need for the top levels to carry any traffic at
all when an internal node of some subcluster dies, as long as that
node's peer level is less than the metacluster's own level.


NAMING
------

In a cluster hierarchy, any one node may in fact be a member (or a
peer, or a satellite) of many nested clusters.  When a process wants
to access a cluster resource, it will need to indicate which of these
metaclusters it is talking about.  We also need a way to identify
other nodes in the cluster or metacluster.

There is one major requirement for the naming which we must not
forget, however.  It is important that the clustering API supports
the ability for existing clusters to be bound into metaclusters
without upsetting any applications already running on the old
clusters.  In other words, adding a new level of metacluster should
not invalidate the names being used to identify existing clusters and
nodes.  This basically prevents us from using any naming which relies
on a cluster root.

@@@ Possibilities:

Use node names of the form

    "/REDHAT/SCOT/SCT/DEV/DESKTOP"

Valid identifiers for that node will include

    "REDHAT/SCOT/SCT/DEV/DESKTOP"    // omit leading forward slash
    "SCOT/SCT/DEV/DESKTOP"
    "SCT/DEV/DESKTOP"
    "DEV/DESKTOP"
    "DESKTOP"

as seen from the same cluster (ie. the "/REDHAT/SCOT/SCT/DEV"
cluster).
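One way to read the identifier list above is that a relative name may
be qualified against any leading portion of the local cluster's own
fully-qualified name.  The sketch below simply enumerates those
interpretations, most local first; it is an assumption about the
scheme rather than a decided design, and a real resolver would
presumably accept the first candidate which names an existing cluster
or node.  (The "../" forms described next could be layered on top by
first stripping trailing components from the local name.)

    #include <stdio.h>
    #include <string.h>

    /*
     * Print every fully-qualified name that the identifier "id" could
     * refer to when seen from the cluster "local_cluster", by trying
     * successively shorter leading portions of the local cluster's
     * fully-qualified name as a prefix.  Illustrative only: no check
     * is made that any candidate actually exists.
     */
    static void print_candidates(const char *local_cluster, const char *id)
    {
        size_t len = strlen(local_cluster);

        if (id[0] == '/') {             /* already fully qualified */
            printf("%s\n", id);
            return;
        }

        for (;;) {
            printf("%.*s/%s\n", (int)len, local_cluster, id);
            if (len == 0)
                break;                  /* just tried the bare "/<id>" form */
            while (len > 0 && local_cluster[--len] != '/')
                ;                       /* drop one trailing component */
        }
    }

For example, print_candidates("/REDHAT/SCOT/SCT/DEV", "DEV/DESKTOP")
lists "/REDHAT/SCOT/SCT/DEV/DEV/DESKTOP" first and
"/REDHAT/SCOT/SCT/DEV/DESKTOP" second; only the latter exists in the
example above, so that is the name a resolver would settle on.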
We may also have a separate cluster called
"/REDHAT/SCOT/SCT/TEST/TEST1" with a cluster node member called
"/REDHAT/SCOT/SCT/TEST/TEST1/FSTEST1".  From that node, the previous
node could be referred to by its fully qualified cluster name (ie.
the name beginning "/REDHAT/...").  It could also be referred to as

    "../../DEV/DESKTOP"
    "../../../SCT/DEV/DESKTOP"
    "../../../../SCOT/SCT/DEV/DESKTOP"

but not as

    "../../../../../REDHAT/SCOT/SCT/DEV/DESKTOP"

since that many ".."s leads us to something called "/", which does
not resolve to a named cluster.  (Ie. there is no root-level
directory as such.)


RECOVERY
--------

In a hierarchical cluster, we have different membership lists which
can change when transitions occur.  At each level of the cluster, we
have both the cluster membership and the peer membership to worry
about.  These two types of recovery are quite distinct things.

In a large L-Cluster, for example, when a node dies it will cause a
transition in the smallest cluster of which it is a member, and will
of course also cause a peer membership change there.  However, in the
higher-level metaclusters in that L-Cluster, there will be no cluster
membership transitions (unless we were unlucky enough that the node
which died was the cluster leader of its local cluster, in which case
there will be membership transitions at higher levels as a new
cluster leader is elected).  There _will_, however, be a peer
membership change in all parent clusters of the dead node's local
cluster.

That is the whole point of the L-Cluster: cluster leaders in each
metacluster can communicate the information that a certain node is
dead upwards to the top-level metacluster, and we can then take any
necessary corrective action to perform a controlled transition on the
peer membership of that cluster without having to perform expensive
verification of the integrity of the clusters and the connectivity
between every single surviving pair of nodes in the full cluster.

Consider then what happens if we have, for example, a large cluster
filesystem using a distributed lock manager which is consistent over
an entire L-Cluster of several thousand nodes in a large High
Performance Computing cluster.  On the death of a node in that
cluster, a few localised machines perform a cluster membership
transition, and then they perform a peerage recovery (performing
recovery of any cluster services which are operating at the level of
that local cluster).  That local cluster then sends a message to the
next higher-level cluster indicating that there is a change in the
peer list for that cluster, and peerage recovery happens for that
cluster... and so on all the way to the top-level cluster.

Therefore, when one machine joins or leaves the cluster, we do not
have to perform a full cluster membership transition for the whole
cluster: rather, we have the much simpler task of dealing with a
peerage transition in which we are simply told by one of the cluster
members exactly what has happened to the peer list.  We don't have to
work out what went wrong.

This is necessarily a more scalable scenario than having every member
of a 10,000-node cluster negotiate with every single one of its
neighbours every time a node joins or leaves the cluster!  But is it
scalable enough?  Can we perform recovery efficiently for coherent
services such as a DLM on such a large cluster when the peerage
changes?  I believe we can, because the hierarchical nature of the
cluster gives us a way to contact all nodes in the cluster
efficiently.
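The upward notification path described above might look something
like the following.  Again, this is only a sketch: the structures are
the illustrative ones from earlier, and peerage_remove(),
start_peerage_recovery() and report_peer_death() are invented names,
not an existing interface.

    /* Invented helpers, named for illustration only: */
    void peerage_remove(struct cluster *c, struct node *dead);
    void start_peerage_recovery(struct cluster *c);
    void report_peer_death(struct node *leader, struct node *dead);

    /*
     * Runs on a cluster leader which has just been told that the peer
     * "dead" has gone away.  The leader adjusts its own peerage,
     * recovers the services operating at this level, and then simply
     * reports the change to the leader of the enclosing metacluster,
     * which repeats the same step one level up.  No level ever has to
     * work out for itself what went wrong.
     */
    static void handle_peer_death(struct cluster *c, struct node *dead)
    {
        peerage_remove(c, dead);
        start_peerage_recovery(c);      /* e.g. DLM recovery for the
                                         * resources held at this level */

        /* A node which is only a satellite to the enclosing
         * metacluster never generates any traffic there at all. */
        if (c->parent && dead->peer_level >= c->parent->level)
            report_peer_death(c->parent->leader, dead);
    }

The peer-level test in the last step is what keeps S-Cluster noise
out of the higher levels, as discussed in the JUSTIFICATION section.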
On peerage transition, the cluster leader can begin the transition by
sending a command to each of its cluster members.  This is a
hierarchical cluster, so the top-level metacluster members are
actually the cluster leaders of the next-level-down metaclusters, and
those cluster leaders can then fan the message out to all of their
own cluster members, and so on down the cluster hierarchy until all
peers in the entire cluster have been contacted.  Replies can fan in
up the cluster hierarchy in a similar manner.

As long as peerage recovery can be expressed in terms of such
fan-out/fan-in broadcast operations, plus limited point-to-point
traffic between specific nodes, we can use the cluster hierarchy to
make peerage recovery sufficiently scalable to work well even on
enormously large HPC clusters.

What about the S-Cluster case?  In a typical S-Cluster, this
distinction between peer and membership transition again does exactly
what we want it to do.  If we have some remote location in a company
with its own internal metacluster, and further subclusters inside
that according to the pattern of use of the machines there, then in
an S-Cluster we would only have a few machines at that entire site
which were peers for the organisation's higher-level clusters.  As a
result, any node arrivals or departures within that site, as long as
they don't concern those high-level peer servers, will not cause any
membership _or_ peer transition in the top-level clusters.


IRREGULAR HIERARCHIES
---------------------

Think about the implications of:

 * Different cluster subtrees having different depths.

 * Binding clusters together into metaclusters without reconfiguring
   every member of every subcluster.


AUTHENTICATION
--------------

Permission is needed to join a cluster, and to become a peer
(especially for metaclusters).  If we have different cluster
passwords for a metacluster, then how do we decide which one wins
(ie. which is the real metacluster and which is spoofed)?  Over an
organisation, can we realistically rely on nobody doing the wrong
thing?


LocalWords:  subcluster metacluster's metacluster ie subclusters MTBF comms IP
LocalWords:  MTBT acks ack metaclusters MEMBERs cluster's DLM proxying HPC
LocalWords:  transitioning localised neighbours organisation's spoofed