Cluster Structure
=================

The following components form the core of the cluster design:

* Channel layer: point-to-point communications
* Link layer: reliable bound channel sets between node pairs
* Integration layer: forms the cluster topology
* Recovery layer: performs recovery and controlled service
  startup/teardown after a cluster transition.

There are also four key services which are central to the basic
cluster APIs:

* JDB: stores persistent cluster internal state
* Quorum layer: determines who has quorum
* Barrier services: provide cluster-wide synchronisation services
* Namespace service: provides a cluster-wide name/value binding
  service.


Component Descriptions
======================

* Channel layer:

  The low-level physical communications layer.  This layer maintains
  multiple interfaces (which might be IP, serial or SCSI, for
  example).  Neighbourhood discovery is performed on each interface
  via broadcast, and once a neighbour is found, a handshake is
  performed which creates a permanent point-to-point channel to that
  neighbour over the given interface.  The channel supports sequenced
  data delivery, heartbeat link liveness verification, and controlled
  reset on error.

  Interfaces are entirely independent of each other.  If the same
  host is found on multiple interfaces (ie. we have multiple
  connections to that host), the connection on each interface is
  maintained independently of the others.

  Each low-level channel has a metric which determines how "good"
  that channel is for carrying cluster traffic.  A low-performance
  channel is defined as one with a negative metric: serial lines, for
  example, should have this property.  (An illustrative sketch of how
  channel metrics determine link state follows the link layer
  description below.)

* Link layer:

  Built on the channel layer, the link layer constructs a
  higher-level communications mechanism which binds together all
  available channels to any given host.  The link is in one of four
  states at any time:

  DOWN:      No connection to the remote host is held.

  RESET:     We have had an error, and are temporarily resetting all
             of the channels in the link.

  DEGRADED:  At least one channel is up and running, but every
             running channel has a negative metric.

  UP:        At least one "good" channel is up and running.

  When constructing a new cluster state, the upper layers will use a
  degraded link to perform a clean cluster transition which evicts
  one of the nodes on that link from the cluster: the degraded link
  allows this eviction to be done cleanly, adjusting quorum to take
  the evicted node's departure into account.
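  To make the channel/link relationship concrete, here is a minimal
  illustrative C sketch of how a link's state might be derived from
  the channels bound into it.  The type and function names are
  invented for this example and are not part of any interface defined
  in this document.

      #include <stddef.h>

      enum link_state { LINK_DOWN, LINK_RESET, LINK_DEGRADED, LINK_UP };

      struct channel {
          int up;      /* connected and passing heartbeats? */
          int metric;  /* quality metric; negative == low-performance */
      };

      /* Derive the link state from the set of channels bound to one
       * remote host.  "resetting" is set while the link layer is
       * recovering from an error on any of its channels. */
      enum link_state link_compute_state(const struct channel *ch,
                                         size_t nchannels, int resetting)
      {
          size_t i;
          int any_up = 0, any_good = 0;

          if (resetting)
              return LINK_RESET;

          for (i = 0; i < nchannels; i++) {
              if (!ch[i].up)
                  continue;
              any_up = 1;
              if (ch[i].metric >= 0)
                  any_good = 1;          /* at least one "good" channel */
          }

          if (any_good)
              return LINK_UP;
          if (any_up)
              return LINK_DEGRADED;      /* only negative-metric channels */
          return LINK_DOWN;
      }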
* Integration layer:

  This layer performs transitions of the overall cluster topology,
  merging neighbouring clusters, evicting dead or misbehaving members
  and ensuring transactional transitions between cluster topologies.

  We have to define very carefully what a "cluster" is in this
  context.  A cluster is an agreement between a set of nodes that all
  of those nodes can communicate with each other and are able to form
  a useful working group together.  The cluster is not just the set
  of connected nodes: it is the *agreement* of connectivity.  Cluster
  transitions, such as the joining of a new node into the cluster,
  are atomic, transactional operations in which all nodes agree to
  the new group topology.

  A key concept here is the cluster ID.  The ID is unique to a single
  "incarnation" of the cluster.  Whenever a cluster splits into
  multiple pieces, any surviving subcluster which has more than half
  the votes of the old cluster retains its cluster ID: all other
  machines are evicted from the cluster and must obtain a new cluster
  ID.  (As a tie-breaker, a cluster with precisely half the votes
  retains its cluster ID iff it includes the previous cluster's
  "cluster controller" node among its members.)

  A new cluster ID is generated whenever a machine first enters the
  integration layer (either after eviction from a cluster or on
  initial startup).  Such a machine considers itself to form a
  single-node cluster.  As a result, we never have to deal with nodes
  joining or leaving clusters: we only ever have to deal with entire
  clusters joining or splitting from each other.

  The cluster ID includes the nodename of the node which generated
  the ID, and a timestamp.  This is done to generate a cluster ID
  which is unique for all time.

  When a cluster partition occurs, at most one of the resulting new
  smaller clusters will retain the original cluster ID.  However, the
  remaining cluster members are not all left alone to pick up the
  pieces: although they reenter the integration process as
  single-node clusters, they will not leave the cluster transition
  until as many as possible of those nodes have merged into larger
  clusters, each of which keeps the cluster ID of just one of the
  subclusters of which it is composed.

  Each cluster also has a sequence number which is incremented on
  each cluster transition, providing applications with an easy way of
  polling for potentially changed cluster state.

  The cluster state transition which occurs when a new node joins the
  cluster or an existing node loses connectivity for any reason is
  described in integration.txt.
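  As an illustration of the cluster identity described above, the
  following minimal C sketch shows one possible (purely hypothetical)
  in-memory representation of a cluster ID built from a nodename and
  a timestamp, together with the per-incarnation sequence number.
  Nothing here defines a wire format or an API.

      #include <string.h>
      #include <time.h>
      #include <sys/utsname.h>

      struct cluster_id {
          char   nodename[65];   /* node which generated this ID */
          time_t stamp;          /* when the ID was generated */
      };

      struct cluster_state {
          struct cluster_id id;  /* constant for one cluster incarnation */
          unsigned long     seq; /* incremented on every cluster transition */
      };

      /* Called when a node (re)enters the integration layer and so
       * forms a single-node cluster with a brand new identity. */
      void new_cluster_identity(struct cluster_state *cs)
      {
          struct utsname un;

          uname(&un);
          strncpy(cs->id.nodename, un.nodename, sizeof(cs->id.nodename) - 1);
          cs->id.nodename[sizeof(cs->id.nodename) - 1] = '\0';
          cs->id.stamp = time(NULL);
          cs->seq = 0;
      }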
* Recovery manager:

  This is the layer which integrates the "core" of the cluster (for
  want of a better term --- this describes it as well as anything)
  with the Rest Of The World --- ie. services.  The main job of this
  layer (also referred to as the transition manager) is to "recover"
  from cluster transitions completed by the integration layer.
  Recovery constitutes a multitude of operations:

  ++ Internal recovery of all registered permanent services must be
     performed.  This typically involves reconstructing the global
     state of that service based on (a) the set of nodes in the new
     cluster and any differences between that and the previous set
     (note that care must be taken if we take a transition during
     this recovery!!), and (b) the state of this service on those
     nodes.  For example, a distributed lock manager might
     recalculate the hash function used to distribute lock
     directories over the available nodes, and then reconstruct the
     lock database based upon the set of locks currently held on each
     node.

  ++ Recovery of user services.  These services are not necessarily
     running all the time, and do not necessarily possess internal
     global state (although they may well rely on external global
     state, for example the contents of a shared cluster filesystem).
     On cluster transition, we need to be able to allow user services
     which have disappeared from the cluster to be reinstantiated on
     a new node (ie. failover), or to notify already-running services
     of a change of status (eg. failback after the service's original
     node returns).

     Most importantly, we need to manage the order in which these
     services are restarted.  There is no point in restarting the CGI
     server until the underlying cluster filesystem has recovered.
     The recovery manager must therefore know about the dependencies
     between services.  Dependencies may be inferred from the use of
     names in the cluster namespace: the namespace can feed
     dependency information to the recovery manager if appropriate.
     However, the namespace on its own cannot control recovery: it is
     the recovery manager which allows separate services to
     coordinate controlled stage-by-stage recovery over the cluster.

  One final job of the recovery manager is that it must be able to
  disable access to certain cluster services until recovery is
  complete (with or without the cooperation of the service
  concerned).  It is quite possible that an application can continue
  quite blindly over a cluster transition as if nothing had happened
  (this is High Availability in action!), but obviously any requests
  from that application to the cluster filesystem or to the lock
  manager must be deferred while those services are engaged in
  recovery.

* JDB:

  A transactional database is required to store local state.  Every
  node with a non-zero vote (a "voting member" of the cluster) is
  required to have a jdb on local read/write persistent storage in
  which the Quorum layer can perform reliable updates to simple
  configuration and state information.  Sleepycat libdb2 (as found
  GPLed in glibc-2.1) should be quite sufficient.

* Quorum layer:

  Keeps track of "Quorum", or the Majority Voting Rights, in a
  cluster.  Quorum is necessary to protect cluster-wide shared
  persistent state.  It is essential to avoid problems when we have
  "cluster partition": a possible type of fault in which some of the
  cluster members have lost communications with the rest, but where
  the nodes themselves are still working.  In a partitioned cluster,
  we need some mechanism we can rely on to ensure that at most one
  partition has the right to update the cluster's shared persistent
  state.  (That state might be a shared disk, for example.)

  Quorum is maintained by assigning a number of votes to each node.
  This is a configuration property of the node.  The Quorum manager
  keeps track of two separate vote counts: the "Cluster Votes", which
  is the sum of the votes of every node which is a member of the
  cluster, and the "Expected Votes", which is the sum of the votes of
  every node which has ever been seen by any voting member of the
  cluster.  (The storage of those node records is one reason why the
  Quorum layer requires a JDB in this design.)  The cluster has
  Quorum if, and only if, it possesses MORE than half of the Expected
  Votes.  This guarantees that the known nodes which are not in this
  cluster cannot possibly form a Quorum on their own.

  There is one other thing we need to do to be completely secure
  here.  If completely new nodes (ie. those never before seen in the
  cluster) are allowed to join a partition which has no Quorum, we
  must prevent the new nodes from adding their votes to that
  partition and disturbing the Expected Votes calculation of other
  nodes or partitions.  To prevent this, a new node is not allowed to
  vote until it has, at least once, joined a cluster which already
  has Quorum.  This means that during initial cluster configuration,
  the sysadmin must manually enable Voting Rights on at least one
  node before the cluster can obtain Quorum for itself.
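  The quorum test itself is simple enough to state in code.  The
  following is an illustrative sketch only (the structure and
  function names are invented): a cluster has Quorum iff it holds
  strictly more than half of the Expected Votes, and nodes which have
  never yet joined a quorate cluster contribute no votes at all.

      /* "nodes" stands for the set of node records which the Quorum
       * layer keeps in the JDB. */
      struct node_votes {
          int votes;         /* configured vote count for this node */
          int is_member;     /* currently a member of this cluster? */
          int ever_quorate;  /* has it ever joined a cluster with Quorum? */
      };

      int has_quorum(const struct node_votes *nodes, int nnodes)
      {
          int i, cluster_votes = 0, expected_votes = 0;

          for (i = 0; i < nnodes; i++) {
              if (!nodes[i].ever_quorate)
                  continue;          /* new node: not yet allowed to vote */
              expected_votes += nodes[i].votes;
              if (nodes[i].is_member)
                  cluster_votes += nodes[i].votes;
          }

          /* Quorum requires strictly more than half of Expected Votes;
           * integer arithmetic avoids any rounding ambiguity. */
          return 2 * cluster_votes > expected_votes;
      }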
* Barrier services:

  Barrier services provide a basic synchronisation mechanism for any
  group of processes in the cluster.  A barrier operation involves
  all the cooperating processes waiting on the same barrier: only
  when all of them have reached the barrier will any of them be
  allowed to proceed.  The barrier operation is required extensively
  by the recovery code for other services, which is what justifies
  its inclusion as a core cluster service.  (An illustrative usage
  sketch combining barriers with the namespace follows the namespace
  description below.)

* Namespace services:

  The cluster namespace is a non-persistent hierarchical namespace
  into which any node can register names.  The guts of the namespace
  API include mechanisms by which processes can not only register
  names, but also set up dependencies on names.  All failover
  services will be coordinated through the namespace.  If multiple
  nodes try to register the same name (and if they request exclusive
  naming), for example, only one will be granted the name.  If that
  node dies, another node's request for the name will succeed, and
  failover will be triggered once the name assignment is complete.

  The namespace will also provide a natural mechanism for determining
  dependencies between services: any process can register a
  dependency on a name and will receive asynchronous callbacks if the
  condition of that name changes (either if the name disappears or
  its ownership changes due to the death of a process holding the
  name).

  As a consequence of the binding of names to services, it becomes
  easy for any application to find all the services in a cluster or
  to locate the node on which a specific service (such as, for
  example, a failover-capable printer spool queue) currently resides.

  Finally, the namespace aims to make cluster administration simple
  by providing a simple method by which the admin can query all or
  selected bound names in a cluster, much as /proc already provides
  dynamic information on a single Linux node.
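  As a sketch of how the barrier and namespace services might be used
  together by a failover service (this is the usage sketch referred
  to in the barrier description above), consider the following
  fragment.  Every identifier in it (cl_barrier_wait,
  cl_name_register, CL_NAME_LOST and so on) is hypothetical; the real
  client API is not defined in this document.

      /* Assumed cluster client library interfaces (not defined here). */
      extern int cl_barrier_wait(const char *barrier, int nprocs);
      extern int cl_name_register(const char *name, int exclusive,
                                  void (*callback)(const char *name,
                                                   int event));

      #define CL_NAME_LOST 1    /* the name's owner died or released it */

      static void spooler_name_event(const char *name, int event)
      {
          if (event == CL_NAME_LOST)
              /* The owner has gone: re-request the name.  Whichever node
               * wins the exclusive registration performs the failover. */
              cl_name_register(name, 1, spooler_name_event);
      }

      int start_spooler(int candidates)
      {
          /* All candidate nodes rendezvous here; none proceeds until
           * every one of them has finished its local recovery. */
          if (cl_barrier_wait("spooler-recovery", candidates) < 0)
              return -1;

          /* Exactly one node is granted the exclusive name and runs the
           * printer spool queue; the others wait for a LOST callback. */
          return cl_name_register("/services/printer-spool", 1,
                                  spooler_name_event);
      }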