Quorum
======

*** Note: this whole business is one I want to talk about at some
*** length. There are tradeoffs to be made between simplicity,
*** adaptiveness and predictability. There are lots of interesting
*** things we can do with the quorum stuff but in the end we want to
*** avoid overengineering things. I haven't had time to finish all my
*** thoughts on this yet but I expect we can hash the important issues
*** out together better than I can on my own, especially regarding the
*** impact on system administration. (The real problem revolves around
*** whether the cluster can effectively maintain the "expected votes"
*** completely automatically or not.)

Quorum is at first glance a relatively simple topic. In any cluster,
we have the concept of the number of votes present in active,
surviving members of the cluster; and the total number of votes (the
"Expected Votes") present in the cluster including down or unreachable
nodes. If we can see an overall majority of expected votes, then we
have quorum.

Unfortunately, things are not quite so simple (you _knew_ I was going
to say this, didn't you? :) There are a number of issues which
complicate the whole business:

* Dynamic cluster configuration.

  Ideally, we would like the cluster to maintain Expected Votes
  itself, automatically. When a node first starts up, how does it
  know how many other nodes it has to wait for until it decides that
  it has established a quorum? Existing cluster implementations often
  solve this problem by forcing expected votes to be a static
  configuration variable requiring manual setup. We really want to do
  better than that if we can.

* Voluntary exit versus node failure.

  When a node loses primary communication with its peers, it may still
  have a backup, degraded communication link over which it can
  negotiate a clean, voluntary exit from the cluster. Similarly, if
  the sysadmin brings a node down for routine maintenance, the node
  can withdraw from the cluster cleanly.
  In both cases, we might prefer the vote of that node to be withdrawn
  from the cluster in a controlled manner which adjusts the Expected
  Votes of the remaining nodes, to maximise the chance of retaining
  Quorum (remember, if we remove a Vote, we _always_ make it harder to
  retain Quorum over future failures unless we also adjust the
  Expected Votes to compensate).

To provide a Voluntary Exit mechanism, we want to have a way by which
nodes can withdraw not only their vote, but also their Expected Vote
contribution from the cluster.

If we allow this, the condition must be persistent: the node cannot
then be allowed to rejoin the cluster and give its Expected Vote back
unconditionally. If the Expected Votes could be cast back in all
cases, there would be a danger that a number of nodes which had been
shut down and had voluntarily exited from the cluster might then try
to reform a cluster on their own if they happened to reboot into a
separate partition from the original cluster.

We can deal with these concerns in the following manner:

* Every voting node must maintain a persistent "Voting" flag
  indicating whether or not that node is willing to offer its vote as
  an Expected Vote. The Voting flag must start off cleared.

* If the Voting flag is not set, then that node will never offer its
  vote for the purpose of creating Quorum. However, if it joins a
  cluster which already has Quorum, then it will set its own Voting
  flag and contribute its vote to the Quorum.

* On a voluntary exit, the node will either (a) withdraw its Expected
  Vote and clear its Voting flag, or (b) withdraw its Vote but keep
  its Expected Vote and its Voting flag. This should be
  sysadmin-configurable. Behaviour (a) allows for more dynamic
  cluster self-management, but behaviour (b) allows powered-down nodes
  to remember the cluster Expected Votes on reboot and to form quorum
  themselves. On involuntary exit (loss of cluster connectivity or
  failure of a node), we always have behaviour (b).
  Behaviour (a) is only permitted when the cluster still has Quorum.

One (deliberate) consequence of this is that a set of nodes which have
just been booted for the first time, or rebooted after a clean
shutdown, are not allowed to form quorum on their own. In such a
situation, a node can either join an existing Quorate cluster, or the
sysadmin can forcibly set the Voting flag on one of the nodes, in
which case all of the communicating nodes will see that single-vote
Quorum and will be able to set their own Voting flags.

This is all quite intentional: if we are using the Voting flags then
Expected Votes is dynamic, and we have to take great care to ensure
that a node which has returned its Expected Votes never reconfigures
Quorum on its own.

================================================================

Random thoughts and scribbles about quorum:

What exactly are we trying to achieve?

Exactly one aim: quorum can only be present in at most one place in
the presence of cluster partition.

What about dynamic cluster self-configuration?

Think about new nodes joining: make sure they don't try to form quorum
if they happen not to find the rest of the cluster immediately. On
recovery from failure, however, nodes _must_ be allowed to try to form
quorum.

Think about a partitioned cluster where new nodes join a non-quorate
partition. Does that partition suddenly get quorum if enough new
nodes join? Uh-oh, definitely not.

Solve all of this by making a single prerequisite: expected votes can
only be adjusted if we already have quorum. Bootstrapping quorum will
therefore require operator intervention.

What information do we need to maintain to calculate expected votes?
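
To make the rules above a little more concrete, here is a rough Python
sketch of the majority test and the Voting flag behaviour. It is only
an illustration under simplifying assumptions, not a proposed
implementation: every name in it (has_quorum, Node, Partition,
force_voting, leave) is made up for this example, each node carries
one vote by default, and Expected Votes is tracked once per partition
rather than being stored persistently on every node.

```python
from dataclasses import dataclass


def has_quorum(votes_present: int, expected_votes: int) -> bool:
    """Quorum requires a strict majority of the Expected Votes."""
    return 2 * votes_present > expected_votes


@dataclass
class Node:
    name: str
    votes: int = 1
    voting: bool = False  # persistent Voting flag; starts off cleared


class Partition:
    """Hypothetical model of one set of mutually reachable nodes."""

    def __init__(self):
        self.expected_votes = 0
        self.members = []

    def votes_present(self):
        # Only nodes whose Voting flag is set offer their votes.
        return sum(n.votes for n in self.members if n.voting)

    def quorate(self):
        return has_quorum(self.votes_present(), self.expected_votes)

    def _admit_voters(self):
        # Expected Votes may only be adjusted while we already have quorum.
        if self.quorate():
            for n in self.members:
                if not n.voting:
                    n.voting = True
                    self.expected_votes += n.votes

    def join(self, node):
        # A node whose flag is already set (for example one returning
        # after a failure) is assumed to be counted in expected_votes.
        self.members.append(node)
        self._admit_voters()

    def force_voting(self, node):
        # Operator intervention: bootstrap a single-vote quorum.
        if not node.voting:
            node.voting = True
            self.expected_votes += node.votes
        self._admit_voters()

    def leave(self, node, withdraw_expected=False):
        # Behaviour (a): give back the Expected Vote and clear the flag,
        # permitted only while quorate. Otherwise behaviour (b): the vote
        # goes away but the Expected Vote and the flag are kept.
        if withdraw_expected and node.voting and self.quorate():
            node.voting = False
            self.expected_votes -= node.votes
        self.members.remove(node)
```

In this model, five freshly booted nodes do not form quorum until
force_voting() is called on one of them. After that, two exits under
behaviour (a) bring Expected Votes down to 3, so the cluster survives
one further failure (2 votes out of 3). Under behaviour (b) Expected
Votes stays at 5, and the same failure leaves 2 votes out of 5, which
is not quorate.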