New stuff added since the last distribution of the design docs:

* Thoughts on communications

* Thoughts on an event model for local synchronisation

* Recovery: add concepts of membership recovery and peer recovery to
  the hierarchy document.

* Barrier operations

* Privilege spaces (uid/gid protection on namespaces, locks, barriers
  etc.?)  Include a cookie-authenticating "agent" to allow processes
  to connect to sockets in other privilege spaces.  For local
  communication only, of course!  (This regulates access to the API,
  not to cluster sockets.)  It is especially important to protect
  locks/names etc. against user apps.  Does this need to be
  persistent, even outside the running of the cluster software, so
  that an inactive service still has a namespace which cannot be
  polluted before that service becomes active?

---

Communication primitives required by the communication layer:

+ Point-to-point connectivity.  Keepalive may be selectively disabled
  for non-CC nodes.

+ Broadcast over the whole cluster: for recovery purposes, use
  fan-in/fan-out based on the current cluster geometry.

Integration:

+ Cluster membership transition.

---

Membership stuff:

Nodes can be members of a cluster either as SATELLITES, which do not
participate in recovery of cluster-wide shared state, or as PEERS,
which are always involved in recovery.

+ CLUSTER/METACLUSTER MEMBERSHIP is the membership of the integration
  layer, ie. the list of subcluster leaders which the integration
  layer has bound into a single group.

+ PEER MEMBERSHIP [PEERAGE] is the total leaf-node membership of all
  member clusters and subclusters.  For any level of the cluster
  hierarchy, the group membership ONLY includes nodes which are
  recoverable PEERS at that level.

+ CLUSTER LEADERSHIP DEPTH is the number of cluster layers for which
  any given node is the cluster leader.

Each membership list has a separately maintained incarnation number.
The cluster membership incarnation is maintained by the integration
layer; the group membership incarnation is maintained by the recovery
layer.  (A data-structure sketch appears below.)

---

Naming:

Name cluster trees like filesystems.  Each cluster knows whether it is
the root, and leaf clusters can use "../clus" names to address nearby
clusters in a way which is invariant when we bind new metaclusters out
of our current top-level cluster.

Possibilities:

Use node names of the form "/REDHAT/SCOT/SCT/DEV/DESKTOP".  Valid
identifiers for that node will include

    "REDHAT/SCOT/SCT/DEV/DESKTOP"    // omit leading forward slash
    "SCOT/SCT/DEV/DESKTOP"
    "SCT/DEV/DESKTOP"
    "DEV/DESKTOP"
    "DESKTOP"

as seen from the same cluster (ie. the "/REDHAT/SCOT/SCT/DEV"
cluster).

We may also have a separate cluster called
"/REDHAT/SCOT/SCT/TEST/TEST1" with a cluster node member called
"/REDHAT/SCOT/SCT/TEST/TEST1/FSTEST1".  From that node, the previous
node could be referred to by its fully qualified cluster name (ie. the
name beginning "/REDHAT/...").  It could also be referred to as

    "../../DEV/DESKTOP"
    "../../../SCT/DEV/DESKTOP"
    "../../../../SCOT/SCT/DEV/DESKTOP"

but not "../../../../../REDHAT/SCOT/SCT/DEV/DESKTOP", since that many
".."s leads us to something called "/", which does not resolve to a
named cluster.  (Ie. there is no root-level directory as such.)  (A
resolution sketch appears below.)

---

Recovery is triggered either by:

    Cluster transition, or
    Group membership transition.

---

"DSR": Dynamic Source Routing adaptive routing protocol (get
references from Peter).

---
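
The membership terminology above is easier to pin down with a concrete
data structure.  The C sketch below is illustration only: every type,
field and function name is invented here, and nothing in it is an
agreed interface.  It simply records the satellite/peer distinction
and the separately maintained incarnation number carried by each
per-level membership list.

/*
 * Sketch only: one possible shape for a per-level membership list.
 * All names are invented for illustration; this is not an agreed API.
 */
#include <stdint.h>

enum cl_node_role {
        CL_SATELLITE,           /* not involved in recovery of shared state */
        CL_PEER                 /* always involved in recovery */
};

struct cl_member {
        uint32_t                node_id;
        enum cl_node_role       role;
};

struct cl_membership {
        uint64_t                incarnation;    /* bumped on every transition */
        unsigned int            nr_members;
        struct cl_member        *members;
};

/*
 * The cluster/metacluster membership (the subcluster leaders bound
 * into one group) would be maintained by the integration layer, and
 * the peer membership [peerage] by the recovery layer.  Either way, a
 * transition installs a new member list and bumps the incarnation.
 */
void cl_membership_transition(struct cl_membership *m,
                              struct cl_member *members,
                              unsigned int nr_members)
{
        m->members = members;
        m->nr_members = nr_members;
        m->incarnation++;
}

---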
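
The "../" relative naming above is the same arithmetic as filesystem
path resolution, minus a root directory.  The C sketch below is
illustrative only: the function and argument names are invented, and
it handles only the ".."-prefixed form (the suffix-style identifiers
in the first naming list would need a separate lookup).  It resolves a
relative cluster name against the fully qualified name of the cluster
we are resolving from, and rejects any name with enough ".."s to climb
above the top-level named cluster.

#include <stdio.h>
#include <string.h>

/*
 * Sketch only: resolve a ".."-prefixed relative cluster name against
 * the fully qualified name of the cluster doing the resolving.
 * Returns 0 on success, -1 if the name would climb to "/", which does
 * not resolve to a named cluster.
 */
int cl_resolve_name(const char *base, const char *name,
                    char *out, size_t outlen)
{
        char buf[256];

        if (strlen(base) >= sizeof(buf))
                return -1;
        strcpy(buf, base);

        /* Strip one trailing component of the base for each leading ".." */
        while (strncmp(name, "../", 3) == 0 || strcmp(name, "..") == 0) {
                char *slash = strrchr(buf, '/');
                if (slash == NULL || slash == buf)
                        return -1;      /* would leave us at "/" */
                *slash = '\0';
                name += (name[2] == '/') ? 3 : 2;
        }

        /* Append whatever is left of the relative name */
        if (snprintf(out, outlen, "%s%s%s",
                     buf, *name ? "/" : "", name) >= (int) outlen)
                return -1;
        return 0;
}

int main(void)
{
        char out[256];

        /* From a node in "/REDHAT/SCOT/SCT/TEST/TEST1", refer to the
           "/REDHAT/SCOT/SCT/DEV/DESKTOP" node relatively. */
        if (cl_resolve_name("/REDHAT/SCOT/SCT/TEST/TEST1",
                            "../../DEV/DESKTOP", out, sizeof(out)) == 0)
                printf("%s\n", out);    /* /REDHAT/SCOT/SCT/DEV/DESKTOP */

        /* Five ".."s would land on "/", which is not a named cluster. */
        if (cl_resolve_name("/REDHAT/SCOT/SCT/TEST/TEST1",
                            "../../../../../REDHAT/SCOT/SCT/DEV/DESKTOP",
                            out, sizeof(out)) != 0)
                printf("rejected: no root cluster to resolve against\n");

        return 0;
}

---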
All peers (at any particular level) must be able to see all other
peers at that level.  In a hierarchical cluster we may not always
enforce this, but we _will_ allow an application which detects such an
error to request validation of that link and to provoke a cluster
transition if that link has really failed.  (A sketch of such a
request appears at the end of these notes.)

---

Todo: think (hard) about proxying semantics and recovery.  We can get
a handle on this by thinking of proxy recovery in terms of a full
recovery of all the satellites in the failed proxy's subtree.  This
means, of course, that satellites are affected by recovery, but we
still have the enormous advantage that the recovery is strictly
confined to the failed proxy's subtree, even if the proxied resource
belonged to a higher-level cluster than that.

---

Rethink comms at the integration layer: does it really help to have
broadcast if the broadcast results in N-1 messages anyway?  Is there
any reason to change things if the end goal is to guarantee any-to-any
comms?  Any true broadcast mechanism might help, but we would still
have to build a reliable delivery guarantee on top of it, which smells
strongly of over-engineering to me.

---

Think about recovery of proxy services.  Is it viable to allow
satellites to hold resources, or should we limit them to reading
resources?  Preserving callback semantics in the latter case would be
good, since otherwise we don't get much benefit from the cluster.
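
---

To make the link-validation request above concrete, here is a rough
sketch of the application side of such a check.  cl_validate_link(),
cl_node_t and the CL_LINK_* values are all invented names standing in
for whatever interface we eventually define, and the stub
implementation exists only so that the example compiles.

#include <stdio.h>
#include <stdint.h>

typedef uint32_t cl_node_t;

enum cl_link_status {
        CL_LINK_OK,             /* link is healthy after all */
        CL_LINK_FAILED          /* failure confirmed; a transition follows */
};

/* Stub standing in for the real cluster service. */
static enum cl_link_status cl_validate_link(cl_node_t peer, int level)
{
        (void) peer;
        (void) level;
        return CL_LINK_FAILED;
}

/*
 * Called by an application when a message to a supposed peer fails.
 * Rather than acting on its own view of the failure, the application
 * asks the cluster layer to validate the link; only a confirmed
 * failure provokes a cluster transition.
 */
static void app_peer_unreachable(cl_node_t peer, int level)
{
        switch (cl_validate_link(peer, level)) {
        case CL_LINK_OK:
                printf("node %u reachable after all; retry the operation\n",
                       (unsigned) peer);
                break;
        case CL_LINK_FAILED:
                printf("node %u confirmed down; awaiting cluster transition\n",
                       (unsigned) peer);
                break;
        }
}

int main(void)
{
        app_peer_unreachable(42, 0);
        return 0;
}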