Barrier Operations
==================

$Id: barrier.txt,v 1.3 1999/12/16 16:52:17 sct Exp $


Goals and discussion
--------------------

Barrier functionality is commonly required in distributed systems.  It
provides a synchronisation mechanism by which multiple nodes can
coordinate activity.

The basic definition of a barrier is that it provides a guaranteed
synchronisation point in distributed processing which is valid over all
nodes in the cluster: it strictly divides time into a pre-barrier and a
post-barrier phase.  The barrier may not necessarily complete at exactly
the same time on every node, but there is an absolute guarantee that the
barrier will not complete on any node unless all other nodes have begun
the barrier.

Why is this useful?  Sometimes in cluster operations you have some task
which is broken up into phases, and you need to make sure that all nodes
in the cluster have completed one phase of the task before you let
anybody move on to the next phase.

Cluster recovery (after a cluster transition occurs) is full of such
cases: for example, when you recover a shared-data service such as the
cluster namespace, you want to suspend processing of new namespace
requests everywhere before you start any recovery; and you want to make
sure that all nodes have completed their namespace recovery before you
allow any node to start processing new namespace requests.  Barriers
between these phases allow such synchronisation, and it is because the
cluster's basic recovery process requires barriers that a barrier API
must be included in the core cluster services.


Functional requirements
-----------------------

There are several different things that we want barriers to be able to
provide.  In thinking about barrier functionality, remember:

* You cannot have a barrier without knowing in advance what the
  membership of the barrier is.

  Say you want to synchronise the start of some work being performed in
  a distributed application.  The definition of the barrier is that once
  all clients have requested the barrier, the barrier completes.
  However, when the first two or three processes in the application
  start up and request their barrier, how does the barrier API know that
  there are in fact more processes, not yet connected to the barrier,
  which will want to participate in it?

  In other words, as the processes are starting up, we are faced with
  the possibility that all processes *currently* connected to the
  barrier have triggered a barrier progression, but that there are still
  processes which have not reached the barrier because they have not yet
  even registered with the barrier API.

  So, we cannot progress the barrier until we know that all users of
  that barrier have registered with the barrier API, *and* all
  registered clients of the barrier API have requested the barrier
  itself.  In other words, we have to have some mechanism in the API for
  closing off the list of participants in the barrier.

Therefore, barrier use is in two phases.  First, one or more processes
around the cluster register as users of a named barrier.  Secondly, we
perform barrier synchronisations: any process can request a barrier, and
the operation only completes when all processes with an existing
connection to the barrier have made the same request.
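As an illustration only, here is a minimal, single-process analogue of
those two phases, written with POSIX threads.  The real barrier API is
distributed across nodes and this document does not define its C
binding, so every name below is hypothetical; the sketch simply models
the rule that a barrier wait cannot complete until every registered
member has reached the barrier.

    /*
     * Hypothetical local analogue of the two-phase barrier: phase 1 is
     * barrier_register(), phase 2 is barrier_wait().  This is not the
     * cluster API, only a model of its semantics within one process.
     */
    #include <pthread.h>

    struct barrier {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int registered;        /* members who have joined the barrier */
        int arrived;           /* members currently blocked in barrier_wait() */
        unsigned generation;   /* distinguishes successive barrier cycles */
    };

    void barrier_init(struct barrier *b)
    {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->cond, NULL);
        b->registered = 0;
        b->arrived = 0;
        b->generation = 0;
    }

    /* Phase 1: declare membership before any synchronisation is attempted. */
    void barrier_register(struct barrier *b)
    {
        pthread_mutex_lock(&b->lock);
        b->registered++;
        pthread_mutex_unlock(&b->lock);
    }

    /* Phase 2: block until every registered member has reached the barrier. */
    void barrier_wait(struct barrier *b)
    {
        unsigned gen;

        pthread_mutex_lock(&b->lock);
        gen = b->generation;
        if (++b->arrived == b->registered) {
            /* Last member in: release everyone and begin a new cycle. */
            b->arrived = 0;
            b->generation++;
            pthread_cond_broadcast(&b->cond);
        } else {
            while (gen == b->generation)
                pthread_cond_wait(&b->cond, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }

Note that the sketch only works because registration is complete before
any member calls barrier_wait(); closing off the membership list is
exactly the problem discussed above.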
There are several different ways we may want to specify barrier
membership.

Most application processes will want to be able to use an API which
either says "there will be exactly N members of the barrier, don't
allow it to progress until that many processes have joined", "All
barrier joins have been registered, enable the barrier now", or "Trust
me, I'll not advance the barrier until all interested processes have
joined, so freeze the membership list as soon as you see a barrier
advance."

In addition, recovery really wants a special case: "Exactly one process
from each node in the cluster will participate in the barrier, and the
barrier is destroyed on cluster transition."

Having this functionality present in the barrier API allows cluster
recovery and cluster startup to share the barrier API as a way of
synchronising during cluster transitions.  In particular, it means that
when a node starts up, the server processes implementing higher-level
cluster APIs can use the barrier API to ensure that recovery of the
existing nodes in the cluster waits for the corresponding server
processes on the new node to be up and running before the barrier is
advanced.

Multi-stage recovery is important for many services: for example, in
the cluster namespace, we must wait for all nodes to stall processing
of new requests before we start recovering the namespace state; and we
must wait for all nodes to finish that recovery before re-enabling the
API processing.  We can use multi-step barriers for that.

We associate a cluster-wide count with each barrier.  Whenever a
request is made for a barrier "NAME:<count>", the request will be
blocked until all other connected barrier clients have also made a
request for a barrier "NAME:<n>" where n >= count.  This allows some
clients to miss out some stages of the barrier sequence if they want
to.
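As a sketch of how that counted rule behaves, here is a purely local
model, again using POSIX threads.  All names are hypothetical (the
document does not specify the API's binding): a request for stage
`count' blocks until every registered client has requested some stage
n >= count, so a client which jumps straight to stage 3 also satisfies
waiters at stages 1 and 2.

    /*
     * Hypothetical local model of the multi-stage barrier rule: each
     * client records the highest stage it has requested, and a request
     * for stage `count' completes once the minimum over all clients is
     * at least `count'.
     */
    #include <pthread.h>
    #include <stdlib.h>

    struct staged_barrier {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int nclients;          /* number of registered clients */
        int *stage;            /* highest stage requested by each client */
    };

    int staged_barrier_init(struct staged_barrier *b, int nclients)
    {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->cond, NULL);
        b->nclients = nclients;
        b->stage = calloc(nclients, sizeof(*b->stage));
        return b->stage ? 0 : -1;
    }

    /* Smallest stage reached across all clients. */
    static int min_stage(struct staged_barrier *b)
    {
        int i, min = b->stage[0];

        for (i = 1; i < b->nclients; i++)
            if (b->stage[i] < min)
                min = b->stage[i];
        return min;
    }

    /* Client `id' requests stage `count'; returns once every client
     * has requested a stage n >= count. */
    void staged_barrier_wait(struct staged_barrier *b, int id, int count)
    {
        pthread_mutex_lock(&b->lock);
        if (count > b->stage[id])
            b->stage[id] = count;
        pthread_cond_broadcast(&b->cond);
        while (min_stage(b) < count)
            pthread_cond_wait(&b->cond, &b->lock);
        pthread_mutex_unlock(&b->lock);
    }

In the namespace-recovery example above, stage 1 would correspond to
"all nodes have stalled new requests" and stage 2 to "all nodes have
finished recovering namespace state".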