Quorum
======

*** Note: this whole business is one I want to talk about at some
*** length.  There are tradeoffs to be made between simplicity,
*** adaptiveness and predictability.  There are lots of interesting
*** things we can do with the quorum stuff but in the end we want to
*** avoid overengineering things.  I haven't had time to finish all my
*** thoughts on this yet but I expect we can hash the important issues
*** out together better than I can on my own, especially regarding the
*** impact on system administration.  (The real problem revolves around
*** whether the cluster can effectively maintain the "expected votes"
*** completely automatically or not.)

Quorum is at first glance a relatively simple topic.  In any cluster,
we have the concept of the number of votes present in active,
surviving members of the cluster; and the total number of votes (the
"Expected Votes") present in the cluster including down or unreachable
nodes.  If we can see an overall majority of expected votes, then we
have quorum.
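
As a minimal sketch of that basic rule, assuming "overall majority"
means strictly more than half of the Expected Votes (the function name
`has_quorum` is purely illustrative, not part of any existing
implementation):

```python
def has_quorum(visible_votes: int, expected_votes: int) -> bool:
    """Return True if the visible votes are a strict majority of expected votes."""
    if expected_votes <= 0:
        # Assumption: a cluster with no expected votes is never quorate.
        return False
    # Strict majority: exactly half is not enough, so two equal-sized
    # partitions can never both claim quorum.
    return 2 * visible_votes > expected_votes
```

With five one-vote nodes, a partition that can see three of them is
quorate and the partition holding the other two is not; with four
nodes split two and two, neither side is.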

Unfortunately, things are not quite so simple (you _knew_ I was going
to say this, didn't you? :) There are a number of issues which
complicate the whole business:

* Dynamic cluster configuration.  Ideally, we would like the cluster
  to maintain Expected Votes itself, automatically.  When a node first
  starts up, how does it know how many other nodes it has to wait for
  until it decides that it has established a quorum?  Existing cluster
  implementations often solve this problem by forcing expected votes
  to be a static configuration variable requiring manual setup.  We
  really want to do better than that if we can.

* Voluntary exit versus node failure.  When a node loses primary
  communication with its peers, it may still have a backup, degraded
  communication link over which it can negotiate a clean, voluntary
  exit from the cluster.  Similarly, if the sysadmin brings a node
  down for routine maintenance, the node can withdraw from the cluster
  cleanly.  In both cases, we might prefer the vote of that node to be
  withdrawn from the cluster in a controlled manner which adjusts the
  Expected Votes of the remaining nodes, to maximise the chance of
  retaining Quorum (remember, if we remove a Vote, we _always_ make it
  harder to retain Quorum over future failures unless we also adjust
  the Expected Votes to compensate).
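
The remark about removing a Vote can be checked with a little
arithmetic.  The sketch below assumes one vote per node and a strict
majority rule; `votes_needed` and `spare_votes` are hypothetical
helper names introduced only for this illustration.

```python
def votes_needed(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def spare_votes(present_votes: int, expected_votes: int) -> int:
    """Votes that can still be lost before quorum goes (negative: no quorum)."""
    return present_votes - votes_needed(expected_votes)


# Four one-vote nodes, one of which leaves the cluster.
before_exit = spare_votes(4, 4)          # 1: one further failure is survivable
exit_unadjusted = spare_votes(3, 4)      # 0: any further failure loses quorum
exit_adjusted = spare_votes(3, 3)        # 1: the margin is restored
```

Leaving Expected Votes at 4 after the exit means all three remaining
nodes must stay up; lowering it to 3 lets the cluster survive one more
failure.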

To provide a Voluntary Exit mechanism, we want to have a way by which
nodes can withdraw not only their vote, but also their Expected Vote
contribution from the cluster.  If we allow this, the condition must
be persistent: the node cannot then be allowed to rejoin the cluster
and give its Expected Vote back unconditionally.  If the Expected
Votes could be cast back in all cases, there would be a danger that a
number of nodes which had been shut down and had voluntarily exited
from the cluster might then try to re-form a cluster on their own if
they happened to reboot into a separate partition from the original
cluster.


We can deal with these concerns in the following manner:

* Every voting node must maintain a persistent "Voting" flag
  indicating whether or not that node is willing to offer its vote as
  an Expected Vote.  The Voting flag must start off cleared.

* If the Voting flag is not set, then that node will never offer its
  vote for the purpose of creating Quorum.  However, if it joins a
  cluster which already has Quorum, then it will set its own Voting
  flag and contribute its vote to the Quorum.

* On a voluntary exit, the node will either (a) withdraw its Expected
  Vote and clear its Voting flag, or (b) withdraw its Vote but keep
  its Expected Vote and its Voting flag.  This should be
  sysadmin-configurable.  Behaviour (a) allows for more dynamic
  cluster self-management, but behaviour (b) allows powered-down nodes
  to remember the cluster Expected Votes on reboot and to form quorum
  themselves.

  On involuntary exit (loss of cluster connectivity or failure of a
  node), we always have behaviour (b).

  Behaviour (a) is only permitted when the cluster still has Quorum.


One (deliberate) consequence of this is that a set of nodes which have
just been booted for the first time, or rebooted after a clean
shutdown, are not allowed to form quorum on their own.  In such a
situation, the nodes can either join an existing Quorate cluster, or
the sysadmin can forcibly set the Voting flag on one of the nodes, in
which case all of the communicating nodes will see that single-vote
Quorum and will be able to set their own Voting flags.  This is all
quite intentional: if we are using the Voting flags then Expected
Votes is dynamic, and we have to take great care to ensure that a node
which has withdrawn its Expected Vote never reconfigures Quorum on its
own.
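
The Voting flag rules and the bootstrap consequence above can be
sketched roughly as follows.  This is a toy model under stated
assumptions, not a proposed implementation: the `Node` and `Cluster`
classes and all of their method names are hypothetical, the
"persistent" flag is just an in-memory attribute, a `Cluster` object
stands for a single partition's view, and behaviour (a) is assumed to
be checked against quorum as it stands just before the node leaves.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    name: str
    votes: int = 1
    voting: bool = False  # persistent Voting flag; must start off cleared


class Cluster:
    """One partition's view of the cluster (hypothetical model)."""

    def __init__(self, expected_votes=0):
        self.members = []                     # nodes currently reachable
        self.expected_votes = expected_votes  # remembered value; 0 if brand new

    def present_votes(self):
        # Only nodes whose Voting flag is set contribute a vote.
        return sum(n.votes for n in self.members if n.voting)

    def quorate(self):
        return 2 * self.present_votes() > self.expected_votes

    def _admit_waiting(self):
        # A node with a cleared flag may set it only inside a quorate cluster.
        for node in self.members:
            if not node.voting and self.quorate():
                node.voting = True
                self.expected_votes += node.votes

    def join(self, node):
        # A node whose flag is already set is assumed to be counted in
        # expected_votes already (it left under behaviour (b)).
        self.members.append(node)
        self._admit_waiting()

    def force_voting(self, node):
        # Sysadmin override, used to bootstrap the very first quorum.
        if not node.voting:
            node.voting = True
            self.expected_votes += node.votes
        self._admit_waiting()

    def voluntary_exit(self, node, withdraw_expected):
        # Behaviour (a) needs quorum; otherwise fall back to behaviour (b).
        if withdraw_expected and node.voting and self.quorate():
            node.voting = False
            self.expected_votes -= node.votes
        self.members.remove(node)

    def fail(self, node):
        # Involuntary exit: always behaviour (b), nothing is withdrawn.
        self.members.remove(node)


# Three freshly booted nodes cannot form quorum by themselves.
a, b, c = Node("a"), Node("b"), Node("c")
cluster = Cluster()
for n in (a, b, c):
    cluster.join(n)
fresh_quorate = cluster.quorate()        # False: nobody is voting yet

cluster.force_voting(a)                  # sysadmin bootstraps on node a
bootstrapped = cluster.expected_votes    # 3: b and c set their flags too

cluster.voluntary_exit(c, withdraw_expected=True)  # behaviour (a)
cluster.fail(b)                          # involuntary, behaviour (b)
after_failure = cluster.quorate()        # False: a holds 1 of 2 expected votes
```

Had c left under behaviour (b) instead, Expected Votes would have
stayed at 3 and c would have kept its Voting flag across the reboot.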


================================================================

Random thoughts and scribbles about quorum:

What exactly are we trying to achieve?  Exactly one aim: when the
cluster is partitioned, quorum can be present in at most one
partition.

What about dynamic cluster self-configuration?  Think about new nodes
joining: make sure they don't try to form quorum if they happen not to
find the rest of the cluster immediately.  On recovery from failure,
however, nodes _must_ be allowed to try to form quorum.

Think about a partitioned cluster where new nodes join a non-quorate
partition.  Does that partition suddenly get quorum if enough new
nodes join?  uhoh, definitely not.

Solve all of this by making a single prerequisite: expected votes can
only be adjusted if we already have quorum.  Bootstrapping quorum will
therefore require operator intervention.
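
Rough sketch of that prerequisite, to check it against the
"new nodes join a non-quorate partition" case.  `admit_new_node` is a
made-up name, and a strict majority rule is assumed:

```python
def admit_new_node(present_votes, expected_votes, new_votes=1):
    """Return (present, expected) after a new node asks to join a partition."""
    quorate = 2 * present_votes > expected_votes
    if not quorate:
        # No quorum: expected votes must not change, and the newcomer's
        # vote is not counted towards quorum.
        return present_votes, expected_votes
    # Quorate: the newcomer adds both a vote and an expected vote.
    return present_votes + new_votes, expected_votes + new_votes


# Five-vote cluster split 3/2; three brand-new nodes turn up in the
# minority partition.
minority = (2, 5)
for _ in range(3):
    minority = admit_new_node(*minority)

# The same three nodes turning up in the majority partition instead.
majority = (3, 5)
for _ in range(3):
    majority = admit_new_node(*majority)
```

The minority stays at (2, 5) however many nodes arrive, so it never
becomes quorate; the majority grows to (6, 8) and remains quorate.
Naively counting the newcomers in the minority would have given it 5
votes against 5 expected, i.e. a second quorum.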

What information do we need to maintain to calculate expected votes?