Quorum
======

When it comes down to it, quorum is essentially the deceptively complex
business of establishing some property which can only ever be held in
at most one partition of a partitioned cluster.

Quorum is at first glance a relatively simple topic.  In any cluster,
we have the concept of the number of votes present in active, surviving
members of the cluster; and the total number of votes (the "Expected
Votes") present in the cluster, including down or unreachable nodes.
If we can see an overall majority of expected votes, then we have
quorum.

Unfortunately, things are not quite so simple (you _knew_ I was going
to say this, didn't you? :)  There are a number of issues which
complicate the whole business:

* Dynamic cluster configuration.

  Ideally, we would like the cluster to maintain Expected Votes itself,
  automatically.  When a node first starts up, how does it know how
  many other nodes it has to wait for before it decides that it has
  established a quorum?  Existing cluster implementations often solve
  this problem by forcing expected votes to be a static configuration
  variable requiring manual setup.  We really want to do better than
  that if we can.

* Voluntary exit versus node failure.

  When a node loses primary communication with its peers, it may still
  have a backup, degraded communication link over which it can
  negotiate a clean, voluntary exit from the cluster.  Similarly, if
  the sysadmin brings a node down for routine maintenance, the node can
  withdraw from the cluster cleanly.  In both cases, we might prefer
  the vote of that node to be withdrawn from the cluster in a
  controlled manner which adjusts the Expected Votes of the remaining
  nodes, to maximise the chance of retaining Quorum (remember, if we
  remove a Vote, we _always_ make it harder to retain Quorum over
  future failures unless we also adjust the Expected Votes to
  compensate).

To provide a Voluntary Exit mechanism, we want a way for nodes to
withdraw not only their vote, but also their Expected Vote contribution
from the cluster.  If we allow this, the condition must be persistent:
the node cannot then be allowed to rejoin the cluster and give its
Expected Vote back unconditionally.  If the Expected Votes could be
cast back in all cases, there would be a danger that a number of nodes
which had been shut down and had voluntarily exited from the cluster
might then try to reform a cluster on their own if they happened to
reboot into a separate partition from the original cluster.

We can deal with these concerns in the following manner:

* We will maintain a cluster-wide "quorum database" listing each node
  and the votes owned by that node.

* The quorum database will be persistent, and must be replicated (with
  a serial number for conflict resolution) on every voting node in the
  cluster.

* Modifications can be made to the quorum database ONLY IF QUORUM IS
  ALREADY HELD.

Now, if a node is removed manually from the cluster, the cluster's
expected votes can be adjusted accordingly.  If a cluster partition
occurs and a manually removed or brand-new node joins a non-quorate
partition, it is impossible for quorum to be regained accidentally, no
matter how many such nodes join: quorum is required before those nodes
will be able to vote.
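To make the majority rule above concrete, here is a minimal sketch in
C.  The structure and function names are purely illustrative and are
not taken from any existing cluster implementation; the point is simply
that quorum requires a strict majority of the expected votes, which is
what guarantees that at most one partition can ever hold it.

    #include <stdbool.h>

    /* Illustrative only: a minimal encoding of the majority rule. */
    struct cluster_votes {
        int votes;            /* votes held by active, reachable members */
        int expected_votes;   /* all votes, including down/unreachable nodes */
    };

    /* Quorum requires a strict majority of the expected votes; two
     * halves of a partitioned cluster can never both pass this test. */
    static bool has_quorum(const struct cluster_votes *cv)
    {
        return 2 * cv->votes > cv->expected_votes;
    }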
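The quorum-database rules can be sketched in the same illustrative
style.  Again, every name here is hypothetical, and a real
implementation would also have to handle persistence and replication of
the database itself; the essential points shown are the serial number,
the refusal to modify the database unless quorum is already held, and
the fact that a voluntary exit withdraws the node's Expected Vote
contribution as well as its vote.

    #include <stdbool.h>

    #define MAX_NODES 32          /* arbitrary limit for the sketch */

    /* Hypothetical quorum database: one expected-vote entry per node,
     * plus a serial number used for conflict resolution between
     * replicas. */
    struct quorum_db {
        unsigned long serial;
        int node_votes[MAX_NODES];
    };

    /* Commit a change to a node's vote entry.  Returns 0 on success,
     * or -1 if the change is refused because quorum is not held. */
    static int quorum_db_set_votes(struct quorum_db *db, bool have_quorum,
                                   int node, int votes)
    {
        if (!have_quorum)
            return -1;            /* rule: modify only under quorum */
        db->node_votes[node] = votes;
        db->serial++;             /* new copy is then replicated to all
                                   * voting nodes */
        return 0;
    }

    /* Voluntary exit: withdraw both the node's vote and its
     * expected-vote contribution, so the survivors' quorum threshold
     * drops accordingly. */
    static int quorum_db_voluntary_exit(struct quorum_db *db,
                                        bool have_quorum, int node)
    {
        return quorum_db_set_votes(db, have_quorum, node, 0);
    }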
The observant reader will note that this makes it impossible for a
brand new cluster to obtain quorum.  We must provide a bootstrap
facility to allow the system administrator to add a casting vote to the
quorum database manually before a newly configured cluster can achieve
quorum.  The casting vote will grant quorum (it will be the only vote
in the quorum database, so votes == expected votes), and once that
happens, all of the nodes in the new cluster can then register
themselves in the quorum database.

Of course, if a node leaves the cluster unexpectedly, then its vote
should remain in the quorum database even though the vote is no longer
being cast: the quorum database simply grants a node the right to vote.


Extra votes
===========

The two-node cluster is a notorious special case for quorum.  In this
case, it is in principle impossible to give both nodes the same number
of votes and still have quorum survive a single node failure.  That
makes failover of quorate services hard on a two-node cluster!

The basic problem is that if a node loses contact with its partner, it
has no way to be sure whether the other node is actually dead (and
therefore failover should occur), or whether in fact the failure was in
the connection between the two machines instead.

There are two solutions to this problem in widespread use.  One is to
make sure that the partner is dead by killing it, hard --- SGI's
FailSafe product allows one node to deactivate the power supply on its
partner, for example.  The second is to add an extra, external vote of
some description to act as a tie-breaker.  A "quorum disk" --- a disk,
usually SCSI, which is connected simultaneously to both nodes in the
cluster --- is often used for this.

Any such alternative quorum sources can be integrated into this quorum
design.  The only restriction is that the quorum source must be
registered in the quorum database before its votes may be used.

This cluster design already offers one feature designed to distinguish
between a communications split and a node failure.  The use of
"degraded" backup connections between nodes for the cluster integration
protocol allows controlled negotiation of the eviction of a single node
from the cluster if a partition occurs on the primary cluster
interconnect.

Is there anything we can do if a node dies altogether, though, to
recover its lost vote?  The answer is strictly "no" if we cannot tell
the difference between a dead node and a disconnected node.  However,
if we have faith in our backup communications to the lost node, and if
we can convince ourselves with good confidence that the lost node is in
fact truly dead (and therefore cannot be part of a partitioned
cluster), then we can use a "casting vote" concept to recover quorum:

* Whenever true quorum (quorum without a casting vote) is achieved, any
  casting vote present in the quorum database is removed and a single
  "floating vote" is registered.  The casting and floating votes do not
  contribute towards the expected votes.  The casting vote counts
  towards quorum votes, but the floating vote does not.

* If any cluster transition results in the loss of true quorum but
  semi-quorum remains (we still have exactly half of the expected
  votes), AND if we can verify beyond doubt that at least one of the
  lost nodes is truly dead, THEN we can convert any existing floating
  vote into a casting vote.

* The casting vote mechanism will also be used to "kick-start" an
  initial cluster as described above: manual creation of a casting vote
  will enable quorum to be established in a new cluster.

The method used to judge whether another node is truly dead will have
to be configured by the system administrator.  By default, the cluster
will never risk creating a cluster partition, and will not use a
casting vote.
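As a rough sketch of how the casting and floating vote rules above
might fit together: all names here are again hypothetical, and the
"verified dead" flag stands in for whatever fencing or backup-link
check the administrator has configured.

    #include <stdbool.h>

    enum tiebreak { TIE_NONE, TIE_FLOATING, TIE_CASTING };

    struct quorum_state {
        int votes;            /* real votes from active members */
        int expected_votes;   /* real expected votes; tie-break votes
                               * are never counted here */
        enum tiebreak tb;     /* starts as TIE_NONE, or TIE_CASTING if
                               * the administrator bootstrapped one */
    };

    /* True quorum: a majority of expected votes from real votes alone. */
    static bool true_quorum(const struct quorum_state *q)
    {
        return 2 * q->votes > q->expected_votes;
    }

    /* Effective quorum: a casting vote counts towards the quorum
     * votes, a floating vote does not. */
    static bool quorum(const struct quorum_state *q)
    {
        int v = q->votes + (q->tb == TIE_CASTING ? 1 : 0);
        return 2 * v > q->expected_votes;
    }

    /* Applied after every cluster transition. */
    static void quorum_transition(struct quorum_state *q,
                                  bool lost_node_verified_dead)
    {
        if (true_quorum(q)) {
            /* True quorum held: retire any casting vote and register
             * the single floating vote for possible later use. */
            q->tb = TIE_FLOATING;
        } else if (2 * q->votes == q->expected_votes &&
                   q->tb == TIE_FLOATING &&
                   lost_node_verified_dead) {
            /* Semi-quorum plus proof of death: convert the floating
             * vote into a casting vote to recover quorum. */
            q->tb = TIE_CASTING;
        }
    }

In this sketch the two-node case falls out naturally: with
expected_votes == 2, the survivor of a node failure holds exactly half
of the expected votes, so once it has proven its partner dead it may
convert the floating vote into a casting vote and regain quorum.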