Cluster Structure
=================

The following components form the core of the cluster design:

* Channel layer: point-to-point communications
* Link layer: reliable bound channel sets between node pairs
* Integration layer: forms the cluster topology
* Recovery layer: performs recovery and controlled service
  startup/teardown after a cluster transition.

There are also four key services which are central to the basic
cluster APIs:

* JDB: stores persistent cluster internal state
* Quorum layer: determines who has quorum
* Barrier services: provide cluster-wide synchronisation services
* Namespace service: provides a cluster-wide name/value binding
  service.


Component Descriptions
======================

* Channel layer:

  The low-level physical communications layer.  This layer maintains
  multiple interfaces (which might be IP, serial or SCSI, for
  example).  Neighbourhood discovery is performed on each interface
  via broadcast, and once a neighbour is found, a handshake is
  performed which creates a permanent point-to-point channel to that
  neighbour over the given interface.  The channel supports sequenced
  data delivery, heartbeat link liveness verification, and controlled
  reset on error.

  Interfaces are entirely independent of each other.  If the same
  host is found on multiple interfaces (ie. we have multiple
  connections to that host), the connection on each interface is
  maintained independently of the others.

  Each low-level channel has a metric which determines how "good"
  that channel is for carrying cluster traffic.  A low-performance
  channel is defined as one with a negative metric: serial lines, for
  example, should have this property.  (An illustrative sketch of how
  channel metrics determine link state follows the link layer
  description below.)

* Link layer:

  Built on the channel layer, the link layer constructs a
  higher-level communications mechanism which binds together all
  available channels to any given host.  The link is in one of four
  states at any time:

  DOWN:      No connection to the remote host is held.

  RESET:     We have had an error, and are temporarily resetting all
             of the channels in the link.

  DEGRADED:  At least one channel is up and running, but every
             running channel has a negative metric.

  UP:        At least one "good" channel is up and running.

  When constructing a new cluster state, the upper layers will use a
  degraded link to perform a clean cluster transition which evicts
  one of the nodes on that link from the cluster: the degraded link
  allows this eviction to be done cleanly, adjusting quorum to take
  the evicted node's departure into account.
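  To make the channel/link relationship concrete, here is a minimal
  illustrative C sketch of how a link's state might be derived from
  the channels bound into it.  The type and function names are
  invented for this example and are not part of any interface defined
  in this document.

      #include <stddef.h>

      enum link_state { LINK_DOWN, LINK_RESET, LINK_DEGRADED, LINK_UP };

      struct channel {
          int up;      /* connected and passing heartbeats? */
          int metric;  /* quality metric; negative == low-performance */
      };

      /* Derive the link state from the set of channels bound to one
       * remote host.  "resetting" is set while the link layer is
       * recovering from an error on any of its channels. */
      enum link_state link_compute_state(const struct channel *ch,
                                         size_t nchannels, int resetting)
      {
          size_t i;
          int any_up = 0, any_good = 0;

          if (resetting)
              return LINK_RESET;

          for (i = 0; i < nchannels; i++) {
              if (!ch[i].up)
                  continue;
              any_up = 1;
              if (ch[i].metric >= 0)
                  any_good = 1;          /* at least one "good" channel */
          }

          if (any_good)
              return LINK_UP;
          if (any_up)
              return LINK_DEGRADED;      /* only negative-metric channels */
          return LINK_DOWN;
      }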
* Integration layer:

  This layer performs transitions of the overall cluster topology,
  merging neighbouring clusters, evicting dead or misbehaving members
  and ensuring transactional transitions between cluster topologies.

  We have to define very carefully what a "cluster" is in this
  context.  A cluster is an agreement between a set of nodes that all
  of those nodes can communicate with each other and are able to form
  a useful working group together.  The cluster is not just the set
  of connected nodes: it is the *agreement* of connectivity.  Cluster
  transitions, such as the joining of a new node into the cluster,
  are atomic, transactional operations in which all nodes agree to
  the new group topology.

  A key concept here is the cluster ID.  The ID is unique to a single
  "incarnation" of the cluster.  Whenever a cluster splits into
  multiple pieces, any surviving subcluster which has more than half
  the votes of the old cluster retains its cluster ID: all other
  machines are evicted from the cluster and must obtain a new cluster
  ID.  (As a tie-breaker, a cluster with precisely half the votes
  retains its cluster ID iff it includes the previous cluster's
  "cluster controller" node among its members.)

  A new cluster ID is generated whenever a machine first enters the
  integration layer (either after eviction from a cluster or on
  initial startup).  Such a machine considers itself to form a
  single-node cluster.  As a result, we never have to deal with nodes
  joining or leaving clusters: we only ever have to deal with entire
  clusters joining or splitting from each other.

  The cluster ID includes the nodename of the node which generated
  the ID, and a timestamp.  This is done to generate a cluster ID
  which is unique for all time.

  When a cluster partition occurs, at most one of the resulting new
  smaller clusters will retain the original cluster ID.  However, the
  remaining cluster members are not all left alone to pick up the
  pieces: although they reenter the integration process as
  single-node clusters, they will not leave the cluster transition
  until as many as possible of those nodes have merged into larger
  clusters, each of which keeps the cluster ID of just one of the
  subclusters of which it is composed.

  Each cluster also has a sequence number which is incremented on
  each cluster transition, providing applications with an easy way of
  polling for potentially changed cluster state.

  The cluster state transition which occurs when a new node joins the
  cluster or an existing node loses connectivity for any reason is
  described in integration.txt.
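  As an illustration of the cluster identity described above, the
  following minimal C sketch shows one possible (purely hypothetical)
  in-memory representation of a cluster ID built from a nodename and
  a timestamp, together with the per-incarnation sequence number.
  Nothing here defines a wire format or an API.

      #include <string.h>
      #include <time.h>
      #include <sys/utsname.h>

      struct cluster_id {
          char   nodename[65];   /* node which generated this ID */
          time_t stamp;          /* when the ID was generated */
      };

      struct cluster_state {
          struct cluster_id id;  /* constant for one cluster incarnation */
          unsigned long     seq; /* incremented on every cluster transition */
      };

      /* Called when a node (re)enters the integration layer and so
       * forms a single-node cluster with a brand new identity. */
      void new_cluster_identity(struct cluster_state *cs)
      {
          struct utsname un;

          uname(&un);
          strncpy(cs->id.nodename, un.nodename, sizeof(cs->id.nodename) - 1);
          cs->id.nodename[sizeof(cs->id.nodename) - 1] = '\0';
          cs->id.stamp = time(NULL);
          cs->seq = 0;
      }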
* Recovery manager:

  This is the layer which integrates the "core" of the cluster (for
  want of a better term --- this describes it as well as anything)
  with the Rest Of The World --- ie. services.  The main job of this
  layer (also referred to as the transition manager) is to "recover"
  from cluster transitions completed by the integration layer.
  Recovery constitutes a multitude of operations:

  ++ Internal recovery of all registered permanent services must be
     performed.  This typically involves reconstructing the global
     state of that service based on (a) the set of nodes in the new
     cluster and any differences between that and the previous set
     (note that care must be taken if we take a transition during
     this recovery!!), and (b) the state of this service on those
     nodes.  For example, a distributed lock manager might
     recalculate the hash function used to distribute lock
     directories over the available nodes, and then reconstruct the
     lock database based upon the set of locks currently held on each
     node.

  ++ Recovery of user services.  These services are not necessarily
     running all the time, and do not necessarily possess internal
     global state (although they may well rely on external global
     state, for example the contents of a shared cluster filesystem).
     On cluster transition, we need to be able to allow user services
     which have disappeared from the cluster to be reinstantiated on
     a new node (ie. failover), or to notify already-running services
     of a change of status (eg. failback after the service's original
     node returns).

     Most importantly, we need to manage the order in which these
     services are restarted.  There is no point in restarting the CGI
     server until the underlying cluster filesystem has recovered.
     The recovery manager must therefore know about the dependencies
     between services.  Dependencies may be inferred from the use of
     names in the cluster namespace: the namespace can feed
     dependency information to the recovery manager if appropriate.
     However, the namespace on its own cannot control recovery: it is
     the recovery manager which allows separate services to
     coordinate controlled stage-by-stage recovery over the cluster.

  One final job of the recovery manager is that it must be able to
  disable access to certain cluster services until recovery is
  complete (with or without the cooperation of the service
  concerned).  It is quite possible that an application can continue
  quite blindly over a cluster transition as if nothing had happened
  (this is High Availability in action!), but obviously any requests
  from that application to the cluster filesystem or to the lock
  manager must be deferred while those services are engaged in
  recovery.

* JDB:

  A transactional database is required to store local state.  Every
  node with a non-zero vote (a "voting member" of the cluster) is
  required to have a jdb on local read/write persistent storage in
  which the Quorum layer can perform reliable updates to simple
  configuration and state information.  Sleepycat libdb2 (as found
  GPLed in glibc-2.1) should be quite sufficient.

* Quorum layer:

  Keeps track of "Quorum", or the Majority Voting Rights, in a
  cluster.  Quorum is necessary to protect cluster-wide shared
  persistent state.  It is essential to avoid problems when we have
  "cluster partition": a possible type of fault in which some of the
  cluster members have lost communications with the rest, but where
  the nodes themselves are still working.  In a partitioned cluster,
  we need some mechanism we can rely on to ensure that at most one
  partition has the right to update the cluster's shared persistent
  state.  (That state might be a shared disk, for example.)

  Quorum is maintained by assigning a number of votes to each node.
  This is a configuration property of the node.  The Quorum manager
  keeps track of two separate vote counts: the "Cluster Votes", which
  is the sum of the votes of every node which is a member of the
  cluster, and the "Expected Votes", which is the sum of the votes of
  every node which has ever been seen by any voting member of the
  cluster.  (The storage of those node records is one reason why the
  Quorum layer requires a JDB in this design.)  The cluster has
  Quorum if, and only if, it possesses MORE than half of the Expected
  Votes.  This guarantees that the known nodes which are not in this
  cluster cannot possibly form a Quorum on their own.

  There is one other thing we need to do to be completely secure
  here.  If completely new nodes (ie. those never before seen in the
  cluster) are allowed to join a partition which has no Quorum, we
  must prevent the new nodes from adding their votes to that
  partition and disturbing the Expected Votes calculation of other
  nodes or partitions.  To prevent this, a new node is not allowed to
  vote until it has, at least once, joined a cluster which already
  has Quorum.  This means that during initial cluster configuration,
  the sysadmin must manually enable Voting Rights on at least one
  node before the cluster can obtain Quorum for itself.
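  The quorum test itself is simple enough to state in code.  The
  following is an illustrative sketch only (the structure and
  function names are invented): a cluster has Quorum iff it holds
  strictly more than half of the Expected Votes, and nodes which have
  never yet joined a quorate cluster contribute no votes at all.

      /* "nodes" stands for the set of node records which the Quorum
       * layer keeps in the JDB. */
      struct node_votes {
          int votes;         /* configured vote count for this node */
          int is_member;     /* currently a member of this cluster? */
          int ever_quorate;  /* has it ever joined a cluster with Quorum? */
      };

      int has_quorum(const struct node_votes *nodes, int nnodes)
      {
          int i, cluster_votes = 0, expected_votes = 0;

          for (i = 0; i < nnodes; i++) {
              if (!nodes[i].ever_quorate)
                  continue;          /* new node: not yet allowed to vote */
              expected_votes += nodes[i].votes;
              if (nodes[i].is_member)
                  cluster_votes += nodes[i].votes;
          }

          /* Quorum requires strictly more than half of Expected Votes;
           * integer arithmetic avoids any rounding ambiguity. */
          return 2 * cluster_votes > expected_votes;
      }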
* Barrier services:

  Barrier services provide a basic synchronisation mechanism for any
  group of processes in the cluster.  A barrier operation involves
  all the cooperating processes waiting on the same barrier: only
  when all of them have reached the barrier will any of them be
  allowed to proceed.  The barrier operation is required extensively
  by the recovery code for other services, which is what justifies
  its inclusion as a core cluster service.  (An illustrative usage
  sketch combining barriers with the namespace follows the namespace
  description below.)

* Namespace services:

  The cluster namespace is a non-persistent hierarchical namespace
  into which any node can register names.  The guts of the namespace
  API include mechanisms by which processes can not only register
  names, but also set up dependencies on names.  All failover
  services will be coordinated through the namespace.  If multiple
  nodes try to register the same name (and if they request exclusive
  naming), for example, only one will be granted the name.  If that
  node dies, another node's request for the name will succeed, and
  failover will be triggered once the name assignment is complete.

  The namespace will also provide a natural mechanism for determining
  dependencies between services: any process can register a
  dependency on a name and will receive asynchronous callbacks if the
  condition of that name changes (either if the name disappears or
  its ownership changes due to the death of a process holding the
  name).

  As a consequence of the binding of names to services, it becomes
  easy for any application to find all the services in a cluster or
  to locate the node on which a specific service (such as, for
  example, a failover-capable printer spool queue) currently resides.

  Finally, the namespace aims to make cluster administration simple
  by providing a simple method by which the admin can query all or
  selected bound names in a cluster, much as /proc already provides
  dynamic information on a single Linux node.
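  As a sketch of how the barrier and namespace services might be used
  together by a failover service (this is the usage sketch referred
  to in the barrier description above), consider the following
  fragment.  Every identifier in it (cl_barrier_wait,
  cl_name_register, CL_NAME_LOST and so on) is hypothetical; the real
  client API is not defined in this document.

      /* Assumed cluster client library interfaces (not defined here). */
      extern int cl_barrier_wait(const char *barrier, int nprocs);
      extern int cl_name_register(const char *name, int exclusive,
                                  void (*callback)(const char *name,
                                                   int event));

      #define CL_NAME_LOST 1    /* the name's owner died or released it */

      static void spooler_name_event(const char *name, int event)
      {
          if (event == CL_NAME_LOST)
              /* The owner has gone: re-request the name.  Whichever node
               * wins the exclusive registration performs the failover. */
              cl_name_register(name, 1, spooler_name_event);
      }

      int start_spooler(int candidates)
      {
          /* All candidate nodes rendezvous here; none proceeds until
           * every one of them has finished its local recovery. */
          if (cl_barrier_wait("spooler-recovery", candidates) < 0)
              return -1;

          /* Exactly one node is granted the exclusive name and runs the
           * printer spool queue; the others wait for a LOST callback. */
          return cl_name_register("/services/printer-spool", 1,
                                  spooler_name_event);
      }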