New stuff added since the last distribution of the design docs:

* Thoughts on communications

* Thoughts on an event model for local synchronisation

* Recovery: add concepts of membership recovery and peer recovery to
  the hierarchy document.

* Barrier operations

* Privilege spaces (uid/gid protection on namespaces, locks, barriers
  etc.?)  Include a cookie-authenticating "agent" to allow processes
  to connect to sockets in other privilege spaces.  For local
  communication only, of course!  (This regulates access to the API,
  not to cluster sockets.)  It is especially important to protect
  locks/names etc. against user apps.  Does this need to be
  persistent, even outside the running of the cluster software, so
  that an inactive service still has a namespace which cannot be
  polluted before that service becomes active?

---

Communication primitives required by the communication layer:

+ Point-to-point connectivity.  Keepalive may be selectively disabled
  for non-CC nodes.

+ Broadcast over the whole cluster: for recovery purposes, use
  fan-in/fan-out based on the current cluster geometry.

Integration:

+ Cluster membership transition.

---

Membership stuff:

Nodes can be members of a cluster either as SATELLITES, which do not
participate in recovery of cluster-wide shared state, or as PEERS,
which are always involved in recovery.

+ CLUSTER/METACLUSTER MEMBERSHIP is the membership of the integration
  layer, ie. the list of subcluster leaders which the integration
  layer has bound into a single group.

+ PEER MEMBERSHIP [PEERAGE] is the total leaf-node membership of all
  member clusters and subclusters.  For any level of the cluster
  hierarchy, the group membership ONLY includes nodes which are
  recoverable PEERS at that level.

+ CLUSTER LEADERSHIP DEPTH is the number of cluster layers for which
  any given node is the cluster leader.

Each membership list has a separately maintained incarnation number.
The cluster membership incarnation is maintained by the integration
layer; the group membership incarnation is maintained by the recovery
layer.  (A data-structure sketch appears below.)

---

Naming:

Name cluster trees like filesystems.  Each cluster knows whether it is
the root, and leaf clusters can use "../clus" names to address nearby
clusters in a way which is invariant when we bind new metaclusters out
of our current top-level cluster.

Possibilities:

Use node names of the form "/REDHAT/SCOT/SCT/DEV/DESKTOP".  Valid
identifiers for that node will include

    "REDHAT/SCOT/SCT/DEV/DESKTOP"    // omit leading forward slash
    "SCOT/SCT/DEV/DESKTOP"
    "SCT/DEV/DESKTOP"
    "DEV/DESKTOP"
    "DESKTOP"

as seen from the same cluster (ie. the "/REDHAT/SCOT/SCT/DEV"
cluster).

We may also have a separate cluster called
"/REDHAT/SCOT/SCT/TEST/TEST1" with a cluster node member called
"/REDHAT/SCOT/SCT/TEST/TEST1/FSTEST1".  From that node, the previous
node could be referred to by its fully qualified cluster name (ie. the
name beginning "/REDHAT/...").  It could also be referred to as

    "../../DEV/DESKTOP"
    "../../../SCT/DEV/DESKTOP"
    "../../../../SCOT/SCT/DEV/DESKTOP"

but not "../../../../../REDHAT/SCOT/SCT/DEV/DESKTOP", since that many
".."s leads us to something called "/", which does not resolve to a
named cluster.  (Ie. there is no root-level directory as such.)  (A
resolution sketch appears below.)

---

Recovery is triggered either by:

    Cluster transition, or
    Group membership transition.

---

"DSR": Dynamic Source Routing adaptive routing protocol (get
references from Peter).

---
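
The membership terminology above is easier to pin down with a concrete
data structure.  The C sketch below is illustration only: every type,
field and function name is invented here, and nothing in it is an
agreed interface.  It simply records the satellite/peer distinction
and the separately maintained incarnation number carried by each
per-level membership list.

/*
 * Sketch only: one possible shape for a per-level membership list.
 * All names are invented for illustration; this is not an agreed API.
 */
#include <stdint.h>

enum cl_node_role {
        CL_SATELLITE,           /* not involved in recovery of shared state */
        CL_PEER                 /* always involved in recovery */
};

struct cl_member {
        uint32_t                node_id;
        enum cl_node_role       role;
};

struct cl_membership {
        uint64_t                incarnation;    /* bumped on every transition */
        unsigned int            nr_members;
        struct cl_member        *members;
};

/*
 * The cluster/metacluster membership (the subcluster leaders bound
 * into one group) would be maintained by the integration layer, and
 * the peer membership [peerage] by the recovery layer.  Either way, a
 * transition installs a new member list and bumps the incarnation.
 */
void cl_membership_transition(struct cl_membership *m,
                              struct cl_member *members,
                              unsigned int nr_members)
{
        m->members = members;
        m->nr_members = nr_members;
        m->incarnation++;
}

---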
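
The "../" relative naming above is the same arithmetic as filesystem
path resolution, minus a root directory.  The C sketch below is
illustrative only: the function and argument names are invented, and
it handles only the ".."-prefixed form (the suffix-style identifiers
in the first naming list would need a separate lookup).  It resolves a
relative cluster name against the fully qualified name of the cluster
we are resolving from, and rejects any name with enough ".."s to climb
above the top-level named cluster.

#include <stdio.h>
#include <string.h>

/*
 * Sketch only: resolve a ".."-prefixed relative cluster name against
 * the fully qualified name of the cluster doing the resolving.
 * Returns 0 on success, -1 if the name would climb to "/", which does
 * not resolve to a named cluster.
 */
int cl_resolve_name(const char *base, const char *name,
                    char *out, size_t outlen)
{
        char buf[256];

        if (strlen(base) >= sizeof(buf))
                return -1;
        strcpy(buf, base);

        /* Strip one trailing component of the base for each leading ".." */
        while (strncmp(name, "../", 3) == 0 || strcmp(name, "..") == 0) {
                char *slash = strrchr(buf, '/');
                if (slash == NULL || slash == buf)
                        return -1;      /* would leave us at "/" */
                *slash = '\0';
                name += (name[2] == '/') ? 3 : 2;
        }

        /* Append whatever is left of the relative name */
        if (snprintf(out, outlen, "%s%s%s",
                     buf, *name ? "/" : "", name) >= (int) outlen)
                return -1;
        return 0;
}

int main(void)
{
        char out[256];

        /* From a node in "/REDHAT/SCOT/SCT/TEST/TEST1", refer to the
           "/REDHAT/SCOT/SCT/DEV/DESKTOP" node relatively. */
        if (cl_resolve_name("/REDHAT/SCOT/SCT/TEST/TEST1",
                            "../../DEV/DESKTOP", out, sizeof(out)) == 0)
                printf("%s\n", out);    /* /REDHAT/SCOT/SCT/DEV/DESKTOP */

        /* Five ".."s would land on "/", which is not a named cluster. */
        if (cl_resolve_name("/REDHAT/SCOT/SCT/TEST/TEST1",
                            "../../../../../REDHAT/SCOT/SCT/DEV/DESKTOP",
                            out, sizeof(out)) != 0)
                printf("rejected: no root cluster to resolve against\n");

        return 0;
}

---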
All peers (at any particular level) must be able to see all other
peers at that level.  In a hierarchical cluster we may not always
enforce this, but we _will_ allow an application which detects such an
error to request validation of that link and to provoke a cluster
transition if that link has really failed.  (A sketch of such a
request appears at the end of these notes.)

---

Todo: think (hard) about proxying semantics and recovery.  We can get
a handle on this by thinking of proxy recovery in terms of a full
recovery of all the satellites in the failed proxy's subtree.  This
means, of course, that satellites are affected by recovery, but we
still have the enormous advantage that the recovery is strictly
confined to the failed proxy's subtree, even if the proxied resource
belonged to a higher-level cluster than that.

---

Rethink comms at the integration layer: does it really help to have
broadcast if the broadcast results in N-1 messages anyway?  Is there
any reason to change things if the end goal is to guarantee any-to-any
comms?  Any true broadcast mechanism might help, but we would still
have to build a reliable delivery guarantee on top of it, which smells
strongly of over-engineering to me.

---

Think about recovery of proxy services.  Is it viable to allow
satellites to hold resources, or should we limit them to reading
resources?  Preserving callback semantics in the latter case would be
good, since otherwise we don't get much benefit from the cluster.
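
---

To make the link-validation request above concrete, here is a rough
sketch of the application side of such a check.  cl_validate_link(),
cl_node_t and the CL_LINK_* values are all invented names standing in
for whatever interface we eventually define, and the stub
implementation exists only so that the example compiles.

#include <stdio.h>
#include <stdint.h>

typedef uint32_t cl_node_t;

enum cl_link_status {
        CL_LINK_OK,             /* link is healthy after all */
        CL_LINK_FAILED          /* failure confirmed; a transition follows */
};

/* Stub standing in for the real cluster service. */
static enum cl_link_status cl_validate_link(cl_node_t peer, int level)
{
        (void) peer;
        (void) level;
        return CL_LINK_FAILED;
}

/*
 * Called by an application when a message to a supposed peer fails.
 * Rather than acting on its own view of the failure, the application
 * asks the cluster layer to validate the link; only a confirmed
 * failure provokes a cluster transition.
 */
static void app_peer_unreachable(cl_node_t peer, int level)
{
        switch (cl_validate_link(peer, level)) {
        case CL_LINK_OK:
                printf("node %u reachable after all; retry the operation\n",
                       (unsigned) peer);
                break;
        case CL_LINK_FAILED:
                printf("node %u confirmed down; awaiting cluster transition\n",
                       (unsigned) peer);
                break;
        }
}

int main(void)
{
        app_peer_unreachable(42, 0);
        return 0;
}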