Barrier Operations
==================

$Id: barrier.txt,v 1.3 1999/12/16 16:52:17 sct Exp $


Goals and discussion
--------------------

Barrier functionality is commonly required in distributed systems.  It
provides a synchronisation mechanism by which multiple nodes can
coordinate activity.

The basic definition of a barrier is that it provides a guaranteed
synchronisation point in distributed processing which is valid over all
nodes in the cluster: it strictly divides time into a pre-barrier and a
post-barrier phase.  The barrier may not necessarily complete at exactly
the same time on every node, but there is an absolute guarantee that the
barrier will not complete on any node unless all other nodes have begun
the barrier.

Why is this useful?  Sometimes in cluster operations you have some task
which is broken up into phases, and you need to make sure that all nodes
in the cluster have completed one phase of the task before you let
anybody move on to the next phase.

Cluster recovery (after a cluster transition occurs) is full of such
cases: for example, when you recover a shared-data service such as the
cluster namespace, you want to suspend processing of new namespace
requests everywhere before you start any recovery; and you want to make
sure that all nodes have completed their namespace recovery before you
allow any node to start processing new namespace requests.  Barriers
between these phases allow such synchronisation, and it is because the
cluster's basic recovery process requires barriers that a barrier API
must be included in the core cluster services.


Functional requirements
-----------------------

There are several different things that we want barriers to be able to
provide.  In thinking about barrier functionality, remember:

* You cannot have a barrier without knowing in advance what the
  membership of the barrier is.

  Say you want to synchronise the start of some work being performed in
  a distributed application.  The definition of the barrier is that once
  all clients have requested the barrier, the barrier completes.
  However, when the first two or three processes in the application
  start up and request their barrier, how does the barrier API know that
  there are in fact more processes, not yet connected to the barrier,
  which will want to participate in it?

  In other words, as the processes are starting up, we are faced with
  the possibility that all processes *currently* connected to the
  barrier have triggered a barrier progression, but that there are still
  processes which have not reached the barrier because they have not yet
  even registered with the barrier API.

  So, we cannot progress the barrier until we know that all users of
  that barrier have registered with the barrier API, *and* all
  registered clients of the barrier API have requested the barrier
  itself.  In other words, we have to have some mechanism in the API for
  closing off the list of participants in the barrier.

Therefore, barrier use is in two phases.  First, one or more processes
around the cluster register as users of a named barrier.  Secondly, we
perform barrier synchronisations: any process can request a barrier, and
the operation only completes when all processes with an existing
connection to the barrier have made the same request.
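As an illustration only, here is a minimal, single-process analogue of
those two phases, written with POSIX threads.  The real barrier API is
distributed across nodes and this document does not define its C
binding, so every name below is hypothetical; the sketch simply models
the rule that a barrier wait cannot complete until every registered
member has reached the barrier.

    /*
     * Hypothetical local analogue of the two-phase barrier: phase 1 is
     * barrier_register(), phase 2 is barrier_wait().  This is not the
     * cluster API, only a model of its semantics within one process.
     */
    #include <pthread.h>

    struct barrier {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int registered;        /* members who have joined the barrier */
        int arrived;           /* members currently blocked in barrier_wait() */
        unsigned generation;   /* distinguishes successive barrier cycles */
    };

    void barrier_init(struct barrier *b)
    {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->cond, NULL);
        b->registered = 0;
        b->arrived = 0;
        b->generation = 0;
    }

    /* Phase 1: declare membership before any synchronisation is attempted. */
    void barrier_register(struct barrier *b)
    {
        pthread_mutex_lock(&b->lock);
        b->registered++;
        pthread_mutex_unlock(&b->lock);
    }

    /* Phase 2: block until every registered member has reached the barrier. */
    void barrier_wait(struct barrier *b)
    {
        unsigned gen;

        pthread_mutex_lock(&b->lock);
        gen = b->generation;
        if (++b->arrived == b->registered) {
            /* Last member in: release everyone and begin a new cycle. */
            b->arrived = 0;
            b->generation++;
            pthread_cond_broadcast(&b->cond);
        } else {
            while (gen == b->generation)
                pthread_cond_wait(&b->cond, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }

Note that the sketch only works because registration is complete before
any member calls barrier_wait(); closing off the membership list is
exactly the problem discussed above.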
There are several different ways we may want to specify barrier
membership.

Most application processes will want to be able to use an API which
either says "there will be exactly N members of the barrier, don't
allow it to progress until that many processes have joined", "All
barrier joins have been registered, enable the barrier now", or "Trust
me, I'll not advance the barrier until all interested processes have
joined, so freeze the membership list as soon as you see a barrier
advance."

In addition, recovery really wants a special case: "Exactly one process
from each node in the cluster will participate in the barrier, and the
barrier is destroyed on cluster transition."

Having this functionality present in the barrier API allows cluster
recovery and cluster startup to share the barrier API as a way of
synchronising during cluster transitions.  In particular, it means that
when a node starts up, the server processes implementing higher-level
cluster APIs can use the barrier API to ensure that recovery of the
existing nodes in the cluster waits for the corresponding server
processes on the new node to be up and running before the barrier is
advanced.

Multi-stage recovery is important for many services: for example, in
the cluster namespace, we must wait for all nodes to stall processing
of new requests before we start recovering the namespace state; and we
must wait for all nodes to finish that recovery before re-enabling the
API processing.  We can use multi-step barriers for that.

We associate a cluster-wide count with each barrier.  Whenever a
request is made for a barrier "NAME:<count>", the request will be
blocked until all other connected barrier clients have also made a
request for a barrier "NAME:<n>" where n >= count.  This allows some
clients to miss out some stages of the barrier sequence if they want
to.
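As a sketch of how that counted rule behaves, here is a purely local
model, again using POSIX threads.  All names are hypothetical (the
document does not specify the API's binding): a request for stage
`count' blocks until every registered client has requested some stage
n >= count, so a client which jumps straight to stage 3 also satisfies
waiters at stages 1 and 2.

    /*
     * Hypothetical local model of the multi-stage barrier rule: each
     * client records the highest stage it has requested, and a request
     * for stage `count' completes once the minimum over all clients is
     * at least `count'.
     */
    #include <pthread.h>
    #include <stdlib.h>

    struct staged_barrier {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int nclients;          /* number of registered clients */
        int *stage;            /* highest stage requested by each client */
    };

    int staged_barrier_init(struct staged_barrier *b, int nclients)
    {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->cond, NULL);
        b->nclients = nclients;
        b->stage = calloc(nclients, sizeof(*b->stage));
        return b->stage ? 0 : -1;
    }

    /* Smallest stage reached across all clients. */
    static int min_stage(struct staged_barrier *b)
    {
        int i, min = b->stage[0];

        for (i = 1; i < b->nclients; i++)
            if (b->stage[i] < min)
                min = b->stage[i];
        return min;
    }

    /* Client `id' requests stage `count'; returns once every client
     * has requested a stage n >= count. */
    void staged_barrier_wait(struct staged_barrier *b, int id, int count)
    {
        pthread_mutex_lock(&b->lock);
        if (count > b->stage[id])
            b->stage[id] = count;
        pthread_cond_broadcast(&b->cond);
        while (min_stage(b) < count)
            pthread_cond_wait(&b->cond, &b->lock);
        pthread_mutex_unlock(&b->lock);
    }

In the namespace-recovery example above, stage 1 would correspond to
"all nodes have stalled new requests" and stage 2 to "all nodes have
finished recovering namespace state".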