General Issues around Programming Interfaces
============================================

$Id: api.txt,v 1.1 1999/12/13 12:45:37 sct Exp $

Here we will discuss a number of issues relevant to the creation of
clustering APIs.  These issues only concern the API as expressed on a
single node.  They may deal with communication between a user process
and a cluster server process on one node, for example, but we will be
completely ignoring issues to do with communication between cluster
nodes here.


Issues to be dealt with uniformly:
----------------------------------

In designing the way we build APIs for the cluster core services,
there are a number of specific problems we have to solve, including:

* Dealing with client death
* Security
* Cluster node naming
* The specialised needs of recoverable services
* Efficient ways to specify async event delivery

Those are the problems themselves, but we also need to bear in mind
that to keep the code maintainable we need to aim for:

* Consistent APIs

* Isolation of the future complexity of hierarchical clusters in a
  forward-compatible manner

* Modular implementation of the common API functionality such as
  security and event delivery

* Flexible implementation.  As an absolute minimum, I require that all
  the APIs being developed are capable of being implemented both in
  kernel space (only on Linux boxes) and in user space (using threads
  if necessary, and preferably portable to other Unixen).


Dealing with client death
-------------------------

However the API is implemented, it is absolutely imperative that the
code implementing the service on each node is able to detect the death
of a client process which is holding a cluster resource, so that it
can release that resource.

If the service is implemented in the kernel then it is obviously easy
enough to track process death, but we need this functionality for
services implemented by server daemons too.  A cheap way of obtaining
this notification is for the service to be implemented over sockets.
Unix domain sockets have reasonable efficiency for local interprocess
communication, and the server can easily detect the death of a process
connected by such a socket.

A kernel-based implementation of cluster functionality will probably
use dedicated syscalls instead, but kernel implementations always have
more freedom to play clever games to achieve the necessary
functionality.  It is the user-space implementation which is more
constrained by available Unix functionality, so right at the start I
will make the assumption that all requests passed between client
processes using the cluster APIs and the cluster service daemons
implementing those services will be carried over unix domain sockets.
(Other communication channels, such as signals and shared memory, can
of course be used to augment the socket-based communication.)
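
As a minimal sketch of how a user-space service daemon might detect
client death this way, consider the following.  The socket path, the
single-client structure and release_client_resources() are
illustrative placeholders only, and error checking is mostly omitted.

    /*
     * Sketch only: a service daemon noticing client death through a
     * unix domain socket.  SOCK_PATH and release_client_resources()
     * are placeholders, not part of any proposed API.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/un.h>
    #include <unistd.h>

    #define SOCK_PATH "/tmp/cluster-example-socket"

    static void release_client_resources(int client_fd)
    {
        /* Drop any locks, barriers or name bindings held by this
         * client: this is where the real cleanup work would go. */
        printf("client on fd %d died, releasing its resources\n",
               client_fd);
    }

    int main(void)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        int listen_fd, client_fd;
        char buf[256];
        ssize_t n;

        strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
        unlink(SOCK_PATH);

        listen_fd = socket(AF_UNIX, SOCK_STREAM, 0);
        bind(listen_fd, (struct sockaddr *) &addr, sizeof(addr));
        listen(listen_fd, 5);

        client_fd = accept(listen_fd, NULL, NULL);
        while ((n = read(client_fd, buf, sizeof(buf))) > 0) {
            /* ... decode and service the request in buf ... */
        }

        /* read() returning 0 (EOF) or an error means the peer has
         * gone: the client exited or crashed, so reclaim whatever
         * cluster resources it held. */
        release_client_resources(client_fd);
        close(client_fd);
        return 0;
    }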
Security
--------

Why do we need security?  Our filesystems, shared printer queues and
cluster namespace will all be using cluster resources.  They will
_all_ require barriers for recovery; some will require cluster locks,
some will require cluster name bindings.  If we allow unrestricted
access to these resources by any user, then we have just allowed an
unprivileged user to violate the integrity of the entire cluster core.

Simple Unix uid mapping is not sufficient to solve the security
problem.  In a large heterogeneous hierarchical cluster there may be
many uid spaces participating in the cluster.  We also want to support
partitioned namespaces: not only should individual resources (which we
might conceivably uid-protect) be inaccessible to the wrong user, we
may sometimes also want the namespace itself to be unbrowsable to
unprivileged clients.  Thus we need protection on the whole namespace
too, not just uid protection on individual objects.

For the purposes of the core cluster APIs, I propose a very simple
mechanism to allow for multiple security domains: simply use the
filesystem to mediate the communication between client processes and
the cluster service daemons, and use filenames to name security
domains.  For example, we might have pathnames

    /tmp/cluster//sockets/namespace/USER
    /tmp/cluster//sockets/namespace/SYSTEM

which refer to unix-domain sockets by which client processes can
connect and send requests to the local node's cluster namespace
daemon.  There would be no difference between these two files except
for permissions: normal filesystem modes and attributes can be used to
set appropriate permissions on each such socket file.  (A sketch of
how such per-domain sockets might be set up appears at the end of this
section.)  Note that we define these two security domains to be valid
on all cluster systems.  SYSTEM is the default security domain in
which privileged cluster services operate; USER is the default domain
for application cluster requests.

There is a second advantage to this mechanism: it allows us to pass
security credentials for these cluster sockets around using standard
unix fd-passing.  This leaves the option open that at some point we
can implement a security server to authenticate individual processes'
access to higher privilege levels, granting those privileges by
passing back an appropriate socket fd.

If we have many such security domains, then obviously we want some way
to ensure that they can be set up automatically in each socket
directory and that permissions are set up appropriately on each host.
However, at this stage this mechanism simply lets us say that security
domains via socket permissions provide the functionality we want: the
management on top of that is another layer which we can ignore for
now.  (The API won't care who is responsible for setting up these
sockets as long as they exist.)

If we want to apply uid-based ownerships to objects within a specific
cluster service then we are certainly free to do so, so long as we can
detect the uid associated with a connection to one of the server
sockets.

Note that the core API library services being shared by the various
cluster APIs can use these security domains in whatever way they want.
They may choose to define each security domain as a separate namespace
or to have them share a namespace.  They may choose to allow browsing
between security domains or not.  The correct decision may very well
depend on the service: lock manager domains may want to be totally
opaque to each other, whereas the whole point of the cluster namespace
service is to make all the cluster name bindings visible in a single
namespace, for example.

*** I'm open to suggestions concerning whether or not uid/gid/mode
*** protection should be implemented as a required option in the APIs.
*** The requirement that the cluster should still work if we cross uid
*** mapping domains makes this unclear.  The real question is: should
*** arbitrary unprivileged code have access to the cluster namespaces
*** at all?  If so, then uid protection on resources is necessary.
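
To make the socket-permission idea above concrete, here is a minimal
sketch of how a namespace daemon might create its USER and SYSTEM
sockets.  The paths (with the cluster-name component omitted), the
mode bits chosen, and the reliance on connect() honouring the socket
file's permissions (true on Linux, not on every Unix) are all
illustrative assumptions; error handling is minimal and the socket
directories are assumed to exist already.

    /*
     * Sketch only: one listening socket per security domain, with
     * ordinary filesystem permissions controlling who may connect.
     * On Linux, connect() on a unix domain socket requires write
     * permission on the socket file, which is what enforces the
     * domain boundary here.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/stat.h>
    #include <sys/un.h>
    #include <unistd.h>

    static int make_domain_socket(const char *path, mode_t mode)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
        unlink(path);
        if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
            chmod(path, mode) < 0 ||     /* the domain's access policy */
            listen(fd, 16) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

    int main(void)
    {
        /* USER domain: any local process may connect. */
        int user_fd = make_domain_socket(
            "/tmp/cluster/sockets/namespace/USER", 0666);

        /* SYSTEM domain: only the daemon's own uid (or root) may
         * connect, so only privileged services get in. */
        int sys_fd = make_domain_socket(
            "/tmp/cluster/sockets/namespace/SYSTEM", 0600);

        if (user_fd < 0 || sys_fd < 0)
            perror("make_domain_socket");
        /* ... accept() loops for each domain socket follow ... */
        return 0;
    }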
Cluster Node Naming
-------------------

Consider a hierarchical cluster containing a top level cluster ORG,
and subclusters all the way down to ORG/UK/EDIN/DEV/TEST/TEST1.  A
node in that cluster --- say, ORG/UK/EDIN/DEV/TEST/TEST1/TESTBOX2 ---
may want to participate in cluster services at any of these levels.
We need to be able to specify exactly which of these layered clusters
we are referencing in every single API call to a clustered resource.

Fortunately, the pathname-based access to cluster services proposed
for security domains is also ideal for separating out access to
different clusters: we simply include a cluster-name component in the
pathname used to access the cluster socket.  We already have a
requirement that any cluster name will never appear more than once in
a fully qualified hierarchical cluster or node name, so these names
are guaranteed to represent unique references to a specific level of
the cluster hierarchy on any given node.

As far as the API is concerned, I propose that all access to cluster
services by a process be preceded by a call to

    extern struct cluster_handle *
    get_cluster_handle(const char *cluster_name,
                       const char *security_domain,
                       int flags);

and that the "struct cluster_handle *" returned by this call be used
as the first parameter to every subsequent call to cluster resource
APIs for that cluster.  This allows the user to talk to multiple
clusters and multiple security domains at once, while still hiding the
details of how we might implement these features.


Recoverable Services
--------------------

Things are more complex when we are doing recovery.  The APIs must be
able to continue to work selectively during recovery.

As an example, the barrier API may be being used by a pair of
cooperating user processes on two different nodes when a third node
joins or leaves the cluster.  That cluster transition does not affect
the barrier, so the user applications should not notice any change:
the barrier API should just be stalled temporarily during the
transition.  However, an internal cluster service such as the
namespace service is a completely different matter: it may rely on the
barrier API in order to synchronise its own recovery.  Similarly, lock
manager traffic may be suspended during a transition and resumed
afterwards, but a clustered filesystem may want to perform its own
lock manager operations during the recovery period.

Note that this example shows that we must be prepared to have the same
resource visible both during recovery and for normal operations: we
cannot simply partition the resources visible during recovery into a
separate security domain as described above.

The default behaviour for the API _must_ be that a cluster transition
causes a temporary stall but no other observable behaviour for
applications which are not using resources affected by the transition.
For APIs such as the namespace and lock manager, a node death
releasing a resource ought to be largely indistinguishable from a
voluntary release of the same resource as far as the effect on other
nodes in the cluster is concerned.

So we have two problems:

* How do we specify that a specific API request is recovery-privileged
  and should not wait until the end of recovery? and

* How do we restrict this functionality to privileged processes only?

We can achieve both of these by adding a recovery-ok flag to the
"flags" argument of the get_cluster_handle call, and restricting that
flag to be legal only on the SYSTEM security domain.  That allows an
application to obtain a separate cluster handle to pass to API calls
which need to run during recovery, and it avoids polluting the API of
other calls with recovery information.
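
The following is a hypothetical usage sketch, not a definitive part of
the API: the header name cluster.h, the flag name CLUSTER_RECOVERY_OK
and the cluster_lock() call are assumptions made for illustration;
only get_cluster_handle() itself is proposed above.

    /*
     * Sketch only: obtaining cluster handles for normal and for
     * recovery-privileged use.  cluster.h, CLUSTER_RECOVERY_OK and
     * cluster_lock() are hypothetical placeholders.
     */
    #include <stddef.h>
    #include "cluster.h"    /* hypothetical header declaring the API */

    void example(void)
    {
        /* Ordinary application handle: USER domain, requests stall
         * during cluster transitions. */
        struct cluster_handle *ch =
            get_cluster_handle("ORG/UK/EDIN/DEV/TEST/TEST1",
                               "USER", 0);

        /* A privileged cluster service asking for a handle whose
         * requests may proceed while recovery is in progress.  This
         * is only legal in the SYSTEM security domain. */
        struct cluster_handle *rch =
            get_cluster_handle("ORG/UK/EDIN/DEV/TEST/TEST1",
                               "SYSTEM", CLUSTER_RECOVERY_OK);

        if (ch == NULL || rch == NULL)
            return;             /* error handling elided */

        /* Every subsequent cluster resource call takes the handle as
         * its first parameter, e.g. a hypothetical lock request: */
        /* cluster_lock(ch, "some/lock/name", LOCK_EX); */
    }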
Asynchronous Event Delivery
---------------------------

One of the important things that the cluster API must be able to do is
to deliver events asynchronously to processes.  Cluster transitions,
loss of a barrier if a participating node or process dies, or
notification that another process wants to steal a lock are all async
events which a process using the cluster API will want to deal with.

So, how do we handle cluster callbacks to the user process?  The
normal Unix mechanism for this is to use signals, of course.  However,
we have several problems with that:

* We do not necessarily want our cluster service daemons to run with
  privileges to kill every process in the system;

* It is impossible to pass arbitrary data through a signal on Unixen
  which do not implement posix queued signals;

* The client cannot tell how many signals were received; and

* There are a limited number of signals available but arbitrarily many
  different events which may occur in a cluster (in a lock manager
  API, for example, each lock a client process owns may have its own
  callback routine to be called when that lock is wanted by another
  node).

We can address all of these problems by using sockets.  We can allow
the client to give the server an arbitrary cookie of information to be
associated with a potential future event.  If that event occurs, the
server will simply pass that cookie back to the client via a unix
domain socket reserved for these callbacks.  Even if the Unix
implementation being used only supports SIGIO, that is enough for the
local library part of the API to trap the IO, decode the cookie and
perform the callback.

This obviously restricts the caller's freedom to use SIGIO.  That's
unfortunate, but unavoidable if realtime SIGIO is not available.  On
recent Linux kernels we can use realtime queued SIGIO to deliver a
different signal for the async callback socket, to avoid interfering
with the normal signals.  Obviously, a mechanism by which the signals
can be blocked is required, and this mechanism must be exposed in a
portable way to the user.

More of a problem, we must have a way to synchronise these cookies
with other foreground events.  That is the responsibility of the API
library in the user process.  For example, if the user creates a lock
with a callback, then deletes that lock, there is a chance that the
server has already sent us a callback on that lock: the API library
must, when it decodes the cookie, detect that the callback is no
longer valid and silently discard it rather than delivering it to the
application.
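
As a rough illustration of the client-library side of the cookie
mechanism, the sketch below assumes a dedicated non-blocking callback
socket and a fixed-size cookie table, neither of which is specified
above.  It shows registration, cancellation and dispatch, including
the silent discard of a cookie whose callback has already been
cancelled.

    /*
     * Sketch only: the API library's cookie table.  The cookie format
     * and table layout are illustrative, not a proposed wire protocol.
     * The callback socket is assumed to be non-blocking so that the
     * drain loop terminates once the queue is empty.
     */
    #include <stdint.h>
    #include <unistd.h>

    #define MAX_CALLBACKS 256

    struct callback_entry {
        int valid;                 /* cleared when the resource goes away */
        void (*fn)(void *arg);
        void *arg;
    };

    static struct callback_entry callbacks[MAX_CALLBACKS];

    /* Register a callback and return the cookie which the server
     * should echo back when the corresponding event fires. */
    static uint32_t register_callback(void (*fn)(void *), void *arg)
    {
        for (uint32_t i = 0; i < MAX_CALLBACKS; i++) {
            if (!callbacks[i].valid) {
                callbacks[i] = (struct callback_entry) {
                    .valid = 1, .fn = fn, .arg = arg };
                return i;
            }
        }
        return (uint32_t) -1;      /* table full */
    }

    /* Called when the resource is released locally: any callback
     * already in flight from the server must now be discarded. */
    static void cancel_callback(uint32_t cookie)
    {
        if (cookie < MAX_CALLBACKS)
            callbacks[cookie].valid = 0;
    }

    /* Drain the callback socket (from a SIGIO handler or a poll loop)
     * and dispatch each cookie that is still valid. */
    static void dispatch_callbacks(int callback_fd)
    {
        uint32_t cookie;

        while (read(callback_fd, &cookie, sizeof(cookie)) == sizeof(cookie)) {
            if (cookie < MAX_CALLBACKS && callbacks[cookie].valid) {
                callbacks[cookie].valid = 0;
                callbacks[cookie].fn(callbacks[cookie].arg);
            }
            /* else: stale or unknown cookie, silently discarded */
        }
    }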