Hierarchical Clusters
=====================

last modified: $Id: hierarchy.txt,v 1.4 1999/12/17 14:52:15 sct Exp $

[ Read past the definitions to the justifications bit if you just
  want to get an overview of what we're trying to achieve here. ]


DEFINITIONS
-----------

For the discussion of hierarchical clusters, we need to make a few
definitions.

First of all, let us define what a hierarchical cluster is.

A single, FLAT cluster is some virtual entity born from a
collaboration between (probably nearby) nodes on a network.  We can
call this a FIRST-LEVEL cluster.  Those nodes are the cluster's
MEMBERs, and together they form the cluster's MEMBERSHIP LIST.

Each cluster has, at all times, a unique privileged node known as the
CLUSTER LEADER.  That node plays a key role in coordinating cluster
transitions, but other than that it does not necessarily perform any
special work during the normal running of the cluster.

To combine clusters into larger units, we want to be able to support
the binding of many clusters into a single HIGHER-LEVEL cluster, or
METACLUSTER.  That metacluster is formed from the clustering together
of all of the cluster leaders of its MEMBER CLUSTERS: those cluster
leaders, together, form the metacluster's membership list.

The complete list of all member clusters of a metacluster, and all
the members of those member clusters right down to the first-level
clusters, forms the metacluster's SUBCLUSTER LIST.

The metacluster's subcluster list is therefore a different concept
from its membership list: if a subcluster undergoes a state
transition and elects a new cluster leader, then that new leader will
replace the old one in the metacluster's membership list, but the
metacluster's subcluster list remains the same.

A metacluster of first-level clusters constitutes a second-level
cluster.  A metacluster of those second-level clusters forms a
third-level cluster, and so on.

We can observe that in a large corporate network being managed as a
single cluster (we can dream, can't we? :), we really do not want all
of our labs' test clusters to cause corporate-wide cluster
transitions whenever a test machine dies.  Not at the rate at which I
reboot test machines, at any rate.  This is the "S-Cluster" case.

On the other hand, we can also envisage large compute clusters of
tens or hundreds of thousands of nodes in which we want to support
high rates of cluster traffic on cluster-wide shared resources (such
as cluster-wide locks).  We simply *have* to reclaim such locks when
a node dies (DLM semantics require it), so in this second, large
"L-Cluster" case, every node is significant as far as cluster-wide
recovery is concerned.

This implies that, when looking at any given level of the cluster
hierarchy, there may be nodes underneath which participate in
recovery, as well as nodes which do not.  We define these as PEER
nodes and SATELLITE nodes respectively.  For any cluster, all of the
peers taking part in that cluster's recovery form the cluster's PEER
LIST or PEERAGE.

Any given node may be a member of the peerage for its own cluster and
for any number of higher-level clusters.  Above that level (its PEER
LEVEL) the node is a satellite.  A true satellite in the first-level
cluster has a peer level of zero.

*** MISSING: Define exactly what a satellite's relation with the
    higher-level clusters is.  In particular, does a satellite always
    proxy through peers in its own cluster?  There may not be any
    such nodes.  The satellite may _have_ to go to some outside node
    in a higher-level cluster.
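To make the preceding definitions concrete, here is a minimal sketch
of the kind of data structures they imply.  This is purely
illustrative: none of these type or field names come from existing
code, and a real implementation would obviously need locking,
dynamic sizing and so on.

    /* Illustrative only: hypothetical types, not an existing API. */

    struct node {
        char             name[64];      /* e.g. "DESKTOP" */
        int              peer_level;    /* highest cluster level at which
                                         * this node acts as a peer; 0 for
                                         * a true satellite in its
                                         * first-level cluster */
    };

    struct cluster {
        int              level;         /* 1 for a first-level cluster */
        struct node     *leader;        /* the current CLUSTER LEADER */

        struct node    **members;       /* MEMBERSHIP LIST: ordinary nodes
                                         * for a first-level cluster,
                                         * subcluster leaders for a
                                         * metacluster */
        int              nr_members;

        struct node    **peers;         /* PEERAGE: every node under this
                                         * cluster whose peer level reaches
                                         * this cluster's level */
        int              nr_peers;

        struct cluster **subclusters;   /* member clusters of a metacluster;
                                         * empty for a first-level cluster.
                                         * Its transitive closure, down to
                                         * the first-level clusters, is the
                                         * SUBCLUSTER LIST. */
        int              nr_subclusters;
        struct cluster  *parent;        /* enclosing metacluster, if any */
    };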
*** Think also about what happens if a node is a peer only up to a
    certain level: it loses the ability to be a cluster leader above
    that level.  Does this affect proxying much?  I need to draw this
    out before expanding on it much, and before completing any more
    detail on the recovery specs.

This has one major implication for the design of the integration
layer.  The fact that only cluster leaders can be members of a
higher-level cluster implies that only nodes with the highest peer
level in any cluster may be cluster leaders for that cluster.  If we
did not observe this rule then we might find that a higher-level
cluster could not convene because none of the nodes capable of being
peers at that level were leaders of their respective clusters.


CONNECTIVITY
------------

The peerage for a cluster contains all nodes in that cluster expected
to partake in cluster-wide shared state such as lock management.  Any
peer may be acting as a lock holder or a lock master at any point in
time.  As a result, we should expect a node to be connected to all of
its peers at all times.

However, although a flat cluster guarantees any-to-any connectivity,
a metacluster only guarantees such connectivity between its own
cluster members (ie. the leaders of the subclusters).  We need to
augment this with a fault-reporting mechanism by which any peer can
report suspected loss of connectivity to another, so that one of the
partially connected nodes can be removed from the cluster.  The
metacluster mechanism does not need to provide the fully-connected
guarantee itself as long as there is an independent fault manager
somewhere in the software stack which can deal with a breakdown in
communications.


JUSTIFICATION
-------------

What does this big mess of stuff actually add to our clustering?

First of all, we note that some form of hierarchy is absolutely
necessary in order to scale.  We cannot guarantee any-to-any
connectivity between a million nodes without swamping our
communications fabric.  In an L-Cluster of such a size, we might
expect the MTBF to be under sufficient control that we only suffer a
cluster transition every few minutes, but it is hard to imagine
establishing any-to-any comms between that many machines quickly
enough that we can recover in time for the next failure.

In a distributed S-Cluster of similar size, the situation is
different.  We expect a lower MTBF (or at least a lower mean time
between transitions, MTBT) to begin with, because we may have test
machines and workstations being rebooted for perfectly valid reasons.
We also expect to have much less bandwidth between some of the
metacluster members (we may have remote clusters linked over external
IP connections) and to have very asymmetric load capacity (say, only
a few servers capable of acting as metacluster leaders at each site).

How can we deal with these problems?

First of all, in an L-Cluster, the limiting factor is the requirement
for synchronisation between all machines in the cluster when
something goes wrong.  Verifying any-to-any connectivity, even if it
requires only a million broadcasts (one per node), will be hideously
expensive.  However, with a dense enough fabric, we might expect that
a broadcast from one specific node can be done efficiently enough,
even if it is only simulated via point-to-point interconnects.  We
can use the cluster hierarchy to distribute broadcast messages from
the cluster leader to each cluster member, ie. to the leaders of all
the subclusters, and repeat the process down the cluster hierarchy
until all nodes have been contacted.
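Sketched below is roughly what such a fan-out might look like.  This
is an illustration only: struct cluster and struct node are the
hypothetical structures sketched in the DEFINITIONS section, and
send_to_node() is an invented transport primitive, not an existing
interface.

    #include <stddef.h>

    /* struct cluster and struct node are the illustrative structures
     * sketched in the DEFINITIONS section above. */

    /* Invented transport primitive: deliver one message to one node. */
    void send_to_node(struct node *to, const void *msg, size_t len);

    /*
     * Runs on a cluster leader: deliver msg to every other node at or
     * below this cluster in the hierarchy.  In a metacluster the
     * members are the subcluster leaders, and each of them repeats
     * the process one level down, so each message travels at most
     * O(log(n)) hops and O(n) messages cross the fabric in total.
     */
    static void leader_fanout(struct cluster *c, struct node *self,
                              const void *msg, size_t len)
    {
        int i;

        for (i = 0; i < c->nr_members; i++) {
            if (c->members[i] == self)
                continue;               /* the leader already has msg */
            send_to_node(c->members[i], msg, len);
        }

        for (i = 0; i < c->nr_subclusters; i++) {
            /* In reality each subcluster's leader performs this step
             * itself on receipt of the message; it is shown as a
             * direct recursion here purely for brevity. */
            leader_fanout(c->subclusters[i],
                          c->subclusters[i]->leader, msg, len);
        }
    }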
The acks from each node can likewise be propagated back up the tree
and merged into a single ack at the top level.  The total number of
packets transmitted over the entire fabric is O(n), but the total
forwarding distance is only O(log(n)).

In other words, as long as recovery of the cluster services (such as
the DLM) can be performed using such fan-out/fan-in broadcast+ack
communication primitives, recovery of large numbers of machines
should be feasible in a short time.

By using hierarchical clusters, the individual cluster membership
transitions involved can be lightweight, typically only involving ten
or so nodes at once (the most complex cases involving the death of a
cluster leader).  However, that membership transition can result in
the eventual recovery of large numbers of nodes, because we can allow
a low-level cluster transition to result in peer recovery of any
metaclusters involved.  If a cluster transition only involves the
arrival or departure of nodes which are satellites to the
metacluster, then the metacluster's peer list is unaffected and no
recovery is needed at all.  This is the property which allows us to
conceive of extremely large S-Clusters using this design.

The fact that nodes may be peers only up to a certain level and
satellites above that level allows us to limit the number of peers
which are present in the higher levels of the metacluster.  This both
reduces the number of transitions in those higher levels (keeping
cluster availability high) and reduces the high-level cluster's
traffic.

In short, we envision two scenarios.  In L-Clusters, the top-level
cluster traffic is high but the fabric is dense, and the main job of
the hierarchy is to limit the cost of completing cluster membership
transitions.  Recovery itself is always done at the top level of the
cluster, because most resources in the cluster are shared around the
entire cluster.

In S-Clusters, most resources are local to subgroups in the cluster
hierarchy.  The cluster hierarchy's main job is to prevent there from
being many transitions at all in the top-level cluster, even if
lower-level clusters are transitioning regularly.  The cluster
hierarchy removes the need for the top levels to carry any traffic at
all when an internal node of some subcluster dies, as long as that
node's peer level is less than the metacluster's own level.


NAMING
------

In a cluster hierarchy, any one node may in fact be a member (or a
peer, or a satellite) of many nested clusters.  When a process wants
to access a cluster resource, it will need to indicate which of these
metaclusters it is talking about.  We also need a way to identify
other nodes in the cluster or metacluster.

There is one major requirement for the naming which we must not
forget, however.  It is important that the clustering API supports
the ability for existing clusters to be bound into metaclusters
without upsetting any applications already running on the old
clusters.  In other words, adding a new level of metacluster should
not invalidate the names being used to identify existing clusters and
nodes.  This basically prevents us from using any naming which relies
on a cluster root.

@@@ Possibilities:

Use node names of the form

    "/REDHAT/SCOT/SCT/DEV/DESKTOP"

Valid identifiers for that node will include

    "REDHAT/SCOT/SCT/DEV/DESKTOP"    // omit leading forward slash
    "SCOT/SCT/DEV/DESKTOP"
    "SCT/DEV/DESKTOP"
    "DEV/DESKTOP"
    "DESKTOP"

as seen from the same cluster (ie. the "/REDHAT/SCOT/SCT/DEV"
cluster).
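One way to read the identifier list above is that a relative name may
be qualified against any leading portion of the local cluster's own
fully-qualified name.  The sketch below simply enumerates those
interpretations, most local first; it is an assumption about the
scheme rather than a decided design, and a real resolver would
presumably accept the first candidate which names an existing cluster
or node.  (The "../" forms described next could be layered on top by
first stripping trailing components from the local name.)

    #include <stdio.h>
    #include <string.h>

    /*
     * Print every fully-qualified name that the identifier "id" could
     * refer to when seen from the cluster "local_cluster", by trying
     * successively shorter leading portions of the local cluster's
     * fully-qualified name as a prefix.  Illustrative only: no check
     * is made that any candidate actually exists.
     */
    static void print_candidates(const char *local_cluster, const char *id)
    {
        size_t len = strlen(local_cluster);

        if (id[0] == '/') {             /* already fully qualified */
            printf("%s\n", id);
            return;
        }

        for (;;) {
            printf("%.*s/%s\n", (int)len, local_cluster, id);
            if (len == 0)
                break;                  /* just tried the bare "/<id>" form */
            while (len > 0 && local_cluster[--len] != '/')
                ;                       /* drop one trailing component */
        }
    }

For example, print_candidates("/REDHAT/SCOT/SCT/DEV", "DEV/DESKTOP")
lists "/REDHAT/SCOT/SCT/DEV/DEV/DESKTOP" first and
"/REDHAT/SCOT/SCT/DEV/DESKTOP" second; only the latter exists in the
example above, so that is the name a resolver would settle on.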
We may also have a separate cluster called
"/REDHAT/SCOT/SCT/TEST/TEST1" with a cluster node member called
"/REDHAT/SCOT/SCT/TEST/TEST1/FSTEST1".  From that node, the previous
node could be referred to by its fully qualified cluster name (ie.
the name beginning "/REDHAT/...").  It could also be referred to as

    "../../DEV/DESKTOP"
    "../../../SCT/DEV/DESKTOP"
    "../../../../SCOT/SCT/DEV/DESKTOP"

but not as

    "../../../../../REDHAT/SCOT/SCT/DEV/DESKTOP"

since that many ".."s leads us to something called "/", which does
not resolve to a named cluster.  (Ie. there is no root-level
directory as such.)


RECOVERY
--------

In a hierarchical cluster, we have different membership lists which
can change when transitions occur.  At each level of the cluster, we
have both the cluster membership and the peer membership to worry
about.  These two types of recovery are quite distinct things.

In a large L-Cluster, for example, when a node dies it will cause a
transition in the smallest cluster of which it is a member, and will
of course also cause a peer membership change there.  However, in the
higher-level metaclusters in that L-Cluster, there will be no cluster
membership transitions (unless we were unlucky enough that the node
which died was the cluster leader of its local cluster, in which case
there will be membership transitions at higher levels as a new
cluster leader is elected).  There _will_, however, be a peer
membership change in all parent clusters of the dead node's local
cluster.

That is the whole point of the L-Cluster: cluster leaders in each
metacluster can communicate the information that a certain node is
dead upwards to the top-level metacluster, and we can then take any
necessary corrective action to perform a controlled transition on the
peer membership of that cluster without having to perform expensive
verification of the integrity of the clusters and the connectivity
between every single surviving pair of nodes in the full cluster.

Consider then what happens if we have, for example, a large cluster
filesystem using a distributed lock manager which is consistent over
an entire L-Cluster of several thousand nodes in a large High
Performance Computing cluster.  On the death of a node in that
cluster, a few localised machines perform a cluster membership
transition, and then they perform a peerage recovery (performing
recovery of any cluster services which are operating at the level of
that local cluster).  That local cluster then sends a message to the
next higher-level cluster indicating that there is a change in the
peer list for that cluster, and peerage recovery happens for that
cluster... and so on all the way to the top-level cluster.

Therefore, when one machine joins or leaves the cluster, we do not
have to perform a full cluster membership transition for the whole
cluster: rather, we have the much simpler task of dealing with a
peerage transition in which we are simply told by one of the cluster
members exactly what has happened to the peer list.  We don't have to
work out what went wrong.

This is necessarily a more scalable scenario than having every member
of a 10,000-node cluster negotiate with every single one of its
neighbours every time a node joins or leaves the cluster!  But is it
scalable enough?  Can we perform recovery efficiently for coherent
services such as a DLM on such a large cluster when the peerage
changes?  I believe we can, because the hierarchical nature of the
cluster gives us a way to contact all nodes in the cluster
efficiently.
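The upward notification path described above might look something
like the following.  Again, this is only a sketch: the structures are
the illustrative ones from earlier, and peerage_remove(),
start_peerage_recovery() and report_peer_death() are invented names,
not an existing interface.

    /* Invented helpers, named for illustration only: */
    void peerage_remove(struct cluster *c, struct node *dead);
    void start_peerage_recovery(struct cluster *c);
    void report_peer_death(struct node *leader, struct node *dead);

    /*
     * Runs on a cluster leader which has just been told that the peer
     * "dead" has gone away.  The leader adjusts its own peerage,
     * recovers the services operating at this level, and then simply
     * reports the change to the leader of the enclosing metacluster,
     * which repeats the same step one level up.  No level ever has to
     * work out for itself what went wrong.
     */
    static void handle_peer_death(struct cluster *c, struct node *dead)
    {
        peerage_remove(c, dead);
        start_peerage_recovery(c);      /* e.g. DLM recovery for the
                                         * resources held at this level */

        /* A node which is only a satellite to the enclosing
         * metacluster never generates any traffic there at all. */
        if (c->parent && dead->peer_level >= c->parent->level)
            report_peer_death(c->parent->leader, dead);
    }

The peer-level test in the last step is what keeps S-Cluster noise
out of the higher levels, as discussed in the JUSTIFICATION section.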
On peerage transition, the cluster leader can begin the transition by
sending a command to each of its cluster members.  This is a
hierarchical cluster, so the top-level metacluster members are
actually the cluster leaders of the next-level-down metaclusters, and
those cluster leaders can then fan the message out to all of their
own cluster members, and so on down the cluster hierarchy until all
peers in the entire cluster have been contacted.  Replies can fan in
up the cluster hierarchy in a similar manner.

As long as peerage recovery can be expressed in terms of such
fan-out/fan-in broadcast operations, plus limited point-to-point
traffic between specific nodes, we can use the cluster hierarchy to
make peerage recovery sufficiently scalable to work well even on
enormously large HPC clusters.

What about the S-Cluster case?  In a typical S-Cluster, this
distinction between peer and membership transition again does exactly
what we want it to do.  If we have some remote location in a company
with its own internal metacluster, and further subclusters inside
that according to the pattern of use of the machines there, then in
an S-Cluster we would only have a few machines at that entire site
which were peers for the organisation's higher-level clusters.  As a
result, any node arrivals or departures within that site, as long as
they don't concern those high-level peer servers, will not cause any
membership _or_ peer transition in the top-level clusters.


IRREGULAR HIERARCHIES
---------------------

Think about the implications of:

 * Different cluster subtrees having different depths.

 * Binding clusters together into metaclusters without reconfiguring
   every member of every subcluster.


AUTHENTICATION
--------------

Permission is needed to join a cluster, and to become a peer
(especially for metaclusters).  If we have different cluster
passwords for a metacluster, then how do we decide which one wins
(ie. which is the real metacluster and which is spoofed)?  Over an
organisation, can we realistically rely on nobody doing the wrong
thing?


LocalWords:  subcluster metacluster's metacluster ie subclusters MTBF comms IP
LocalWords:  MTBT acks ack metaclusters MEMBERs cluster's DLM proxying HPC
LocalWords:  transitioning localised neighbours organisation's spoofed