Quorum
======

When it comes down to it, quorum is essentially the deceptively complex
business of establishing some property which can only ever be held in
at most one partition of a partitioned cluster.

Quorum is at first glance a relatively simple topic.  In any cluster,
we have the concept of the number of votes present in active, surviving
members of the cluster; and the total number of votes (the "Expected
Votes") present in the cluster, including down or unreachable nodes.
If we can see an overall majority of expected votes, then we have
quorum.

Unfortunately, things are not quite so simple (you _knew_ I was going
to say this, didn't you? :)  There are a number of issues which
complicate the whole business:

* Dynamic cluster configuration.

  Ideally, we would like the cluster to maintain Expected Votes itself,
  automatically.  When a node first starts up, how does it know how
  many other nodes it has to wait for before it decides that it has
  established a quorum?  Existing cluster implementations often solve
  this problem by forcing expected votes to be a static configuration
  variable requiring manual setup.  We really want to do better than
  that if we can.

* Voluntary exit versus node failure.

  When a node loses primary communication with its peers, it may still
  have a backup, degraded communication link over which it can
  negotiate a clean, voluntary exit from the cluster.  Similarly, if
  the sysadmin brings a node down for routine maintenance, the node can
  withdraw from the cluster cleanly.  In both cases, we might prefer
  the vote of that node to be withdrawn from the cluster in a
  controlled manner which adjusts the Expected Votes of the remaining
  nodes, to maximise the chance of retaining Quorum (remember, if we
  remove a Vote, we _always_ make it harder to retain Quorum over
  future failures unless we also adjust the Expected Votes to
  compensate).

To provide a Voluntary Exit mechanism, we want a way for nodes to
withdraw not only their vote, but also their Expected Vote contribution
from the cluster.  If we allow this, the condition must be persistent:
the node cannot then be allowed to rejoin the cluster and give its
Expected Vote back unconditionally.  If the Expected Votes could be
cast back in all cases, there would be a danger that a number of nodes
which had been shut down and had voluntarily exited from the cluster
might then try to reform a cluster on their own if they happened to
reboot into a separate partition from the original cluster.

We can deal with these concerns in the following manner:

* We will maintain a cluster-wide "quorum database" listing each node
  and the votes owned by that node.

* The quorum database will be persistent, and must be replicated (with
  a serial number for conflict resolution) on every voting node in the
  cluster.

* Modifications can be made to the quorum database ONLY IF QUORUM IS
  ALREADY HELD.

Now, if a node is removed manually from the cluster, the cluster's
expected votes can be adjusted accordingly.  If a cluster partition
occurs and a manually removed or brand-new node joins a non-quorate
partition, it is impossible for quorum to be regained accidentally, no
matter how many such nodes join: quorum is required before those nodes
will be able to vote.
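To make the majority rule above concrete, here is a minimal sketch in
C.  The structure and function names are purely illustrative and are
not taken from any existing cluster implementation; the point is simply
that quorum requires a strict majority of the expected votes, which is
what guarantees that at most one partition can ever hold it.

    #include <stdbool.h>

    /* Illustrative only: a minimal encoding of the majority rule. */
    struct cluster_votes {
        int votes;            /* votes held by active, reachable members */
        int expected_votes;   /* all votes, including down/unreachable nodes */
    };

    /* Quorum requires a strict majority of the expected votes; two
     * halves of a partitioned cluster can never both pass this test. */
    static bool has_quorum(const struct cluster_votes *cv)
    {
        return 2 * cv->votes > cv->expected_votes;
    }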
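The quorum-database rules can be sketched in the same illustrative
style.  Again, every name here is hypothetical, and a real
implementation would also have to handle persistence and replication of
the database itself; the essential points shown are the serial number,
the refusal to modify the database unless quorum is already held, and
the fact that a voluntary exit withdraws the node's Expected Vote
contribution as well as its vote.

    #include <stdbool.h>

    #define MAX_NODES 32          /* arbitrary limit for the sketch */

    /* Hypothetical quorum database: one expected-vote entry per node,
     * plus a serial number used for conflict resolution between
     * replicas. */
    struct quorum_db {
        unsigned long serial;
        int node_votes[MAX_NODES];
    };

    /* Commit a change to a node's vote entry.  Returns 0 on success,
     * or -1 if the change is refused because quorum is not held. */
    static int quorum_db_set_votes(struct quorum_db *db, bool have_quorum,
                                   int node, int votes)
    {
        if (!have_quorum)
            return -1;            /* rule: modify only under quorum */
        db->node_votes[node] = votes;
        db->serial++;             /* new copy is then replicated to all
                                   * voting nodes */
        return 0;
    }

    /* Voluntary exit: withdraw both the node's vote and its
     * expected-vote contribution, so the survivors' quorum threshold
     * drops accordingly. */
    static int quorum_db_voluntary_exit(struct quorum_db *db,
                                        bool have_quorum, int node)
    {
        return quorum_db_set_votes(db, have_quorum, node, 0);
    }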
The observant reader will note that this makes it impossible for a
brand new cluster to obtain quorum.  We must provide a bootstrap
facility to allow the system administrator to add a casting vote to the
quorum database manually before a newly configured cluster can achieve
quorum.  The casting vote will grant quorum (it will be the only vote
in the quorum database, so votes == expected votes), and once that
happens, all of the nodes in the new cluster can then register
themselves in the quorum database.

Of course, if a node leaves the cluster unexpectedly, then its vote
should remain in the quorum database even though the vote is no longer
being cast: the quorum database simply grants a node the right to vote.


Extra votes
===========

The two-node cluster is a notorious special case for quorum.  In this
case, it is in principle impossible to give both nodes the same number
of votes and still have quorum survive a single node failure.  That
makes failover of quorate services hard on a two-node cluster!

The basic problem is that if a node loses contact with its partner, it
has no way to be sure whether the other node is actually dead (and
therefore failover should occur), or whether in fact the failure was in
the connection between the two machines instead.

There are two solutions to this problem in widespread use.  One is to
make sure that the partner is dead by killing it, hard --- SGI's
FailSafe product allows one node to deactivate the power supply on its
partner, for example.  The second is to add an extra, external vote of
some description to act as a tie-breaker.  A "quorum disk" --- a disk,
usually SCSI, which is connected simultaneously to both nodes in the
cluster --- is often used for this.

Any such alternative quorum sources can be integrated into this quorum
design.  The only restriction is that the quorum source must be
registered in the quorum database before its votes may be used.

This cluster design already offers one feature designed to distinguish
between a communications split and a node failure.  The use of
"degraded" backup connections between nodes for the cluster integration
protocol allows controlled negotiation of the eviction of a single node
from the cluster if a partition occurs on the primary cluster
interconnect.

Is there anything we can do if a node dies altogether, though, to
recover its lost vote?  The answer is strictly "no" if we cannot tell
the difference between a dead node and a disconnected node.  However,
if we have faith in our backup communications to the lost node, and if
we can convince ourselves with good confidence that the lost node is in
fact truly dead (and therefore cannot be part of a partitioned
cluster), then we can use a "casting vote" concept to recover quorum:

* Whenever true quorum (quorum without a casting vote) is achieved, any
  casting vote present in the quorum database is removed and a single
  "floating vote" is registered.  The casting and floating votes do not
  contribute towards the expected votes.  The casting vote counts
  towards quorum votes, but the floating vote does not.

* If any cluster transition results in the loss of true quorum but
  semi-quorum remains (we still have exactly half of the expected
  votes), AND if we can verify beyond doubt that at least one of the
  lost nodes is truly dead, THEN we can convert any existing floating
  vote into a casting vote.

* The casting vote mechanism will also be used to "kick-start" an
  initial cluster as described above: manual creation of a casting vote
  will enable quorum to be established in a new cluster.

The method used to judge whether another node is truly dead will have
to be configured by the system administrator.  By default, the cluster
will never risk creating a cluster partition, and will not use a
casting vote.
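As a rough sketch of how the casting and floating vote rules above
might fit together: all names here are again hypothetical, and the
"verified dead" flag stands in for whatever fencing or backup-link
check the administrator has configured.

    #include <stdbool.h>

    enum tiebreak { TIE_NONE, TIE_FLOATING, TIE_CASTING };

    struct quorum_state {
        int votes;            /* real votes from active members */
        int expected_votes;   /* real expected votes; tie-break votes
                               * are never counted here */
        enum tiebreak tb;     /* starts as TIE_NONE, or TIE_CASTING if
                               * the administrator bootstrapped one */
    };

    /* True quorum: a majority of expected votes from real votes alone. */
    static bool true_quorum(const struct quorum_state *q)
    {
        return 2 * q->votes > q->expected_votes;
    }

    /* Effective quorum: a casting vote counts towards the quorum
     * votes, a floating vote does not. */
    static bool quorum(const struct quorum_state *q)
    {
        int v = q->votes + (q->tb == TIE_CASTING ? 1 : 0);
        return 2 * v > q->expected_votes;
    }

    /* Applied after every cluster transition. */
    static void quorum_transition(struct quorum_state *q,
                                  bool lost_node_verified_dead)
    {
        if (true_quorum(q)) {
            /* True quorum held: retire any casting vote and register
             * the single floating vote for possible later use. */
            q->tb = TIE_FLOATING;
        } else if (2 * q->votes == q->expected_votes &&
                   q->tb == TIE_FLOATING &&
                   lost_node_verified_dead) {
            /* Semi-quorum plus proof of death: convert the floating
             * vote into a casting vote to recover quorum. */
            q->tb = TIE_CASTING;
        }
    }

In this sketch the two-node case falls out naturally: with
expected_votes == 2, the survivor of a node failure holds exactly half
of the expected votes, so once it has proven its partner dead it may
convert the floating vote into a casting vote and regain quorum.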