General Issues around Programming Interfaces
============================================

$Id: api.txt,v 1.1 1999/12/13 12:45:37 sct Exp $

Here we will discuss a number of issues relevant to the creation of
clustering APIs.  These issues only concern the API as expressed on a
single node.  They may deal with communication between a user process
and a cluster server process on one node, for example, but we will be
completely ignoring issues to do with communication between cluster
nodes here.


Issues to be dealt with uniformly:
----------------------------------

In designing the way we build APIs for the cluster core services,
there are a number of specific problems we have to solve, including:

* Dealing with client death
* Security
* Cluster node naming
* The specialised needs of recoverable services
* Efficient ways to specify async event delivery

Those are the problems themselves, but we also need to bear in mind
that to keep the code maintainable we need to aim for:

* Consistent APIs

* Isolation of the future complexity of hierarchical clusters in a
  forward-compatible manner

* Modular implementation of the common API functionality such as
  security and event delivery

* Flexible implementation.  As an absolute minimum, I require that all
  the APIs being developed are capable of being implemented both in
  kernel space (only on Linux boxes) and in user space (using threads
  if necessary, and preferably portable to other Unixen).


Dealing with client death
-------------------------

However the API is implemented, it is absolutely imperative that the
code implementing the service on each node is able to detect the death
of a client process which is holding a cluster resource, so that it
can release that resource.

If the service is implemented in the kernel then it is obviously easy
enough to track process death, but we need this functionality for
services implemented by server daemons too.  A cheap way of obtaining
this notification is for the service to be implemented over sockets.
Unix domain sockets have reasonable efficiency for local interprocess
communication, and the server can easily detect the death of a process
connected by such a socket.

A kernel-based implementation of cluster functionality will probably
use dedicated syscalls instead, but kernel implementations always have
more freedom to play clever games to achieve the necessary
functionality.  It is the user-space implementation which is more
constrained by available Unix functionality, so right at the start I
will make the assumption that all requests passed between client
processes using the cluster APIs and the cluster service daemons
implementing those services will be carried over unix domain sockets.
(Other communication channels, such as signals and shared memory, can
of course be used to augment the socket-based communication.)
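
As a minimal sketch of how a user-space service daemon might detect
client death this way, consider the following.  The socket path, the
single-client structure and release_client_resources() are
illustrative placeholders only, and error checking is mostly omitted.

    /*
     * Sketch only: a service daemon noticing client death through a
     * unix domain socket.  SOCK_PATH and release_client_resources()
     * are placeholders, not part of any proposed API.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/un.h>
    #include <unistd.h>

    #define SOCK_PATH "/tmp/cluster-example-socket"

    static void release_client_resources(int client_fd)
    {
        /* Drop any locks, barriers or name bindings held by this
         * client: this is where the real cleanup work would go. */
        printf("client on fd %d died, releasing its resources\n",
               client_fd);
    }

    int main(void)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        int listen_fd, client_fd;
        char buf[256];
        ssize_t n;

        strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
        unlink(SOCK_PATH);

        listen_fd = socket(AF_UNIX, SOCK_STREAM, 0);
        bind(listen_fd, (struct sockaddr *) &addr, sizeof(addr));
        listen(listen_fd, 5);

        client_fd = accept(listen_fd, NULL, NULL);
        while ((n = read(client_fd, buf, sizeof(buf))) > 0) {
            /* ... decode and service the request in buf ... */
        }

        /* read() returning 0 (EOF) or an error means the peer has
         * gone: the client exited or crashed, so reclaim whatever
         * cluster resources it held. */
        release_client_resources(client_fd);
        close(client_fd);
        return 0;
    }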
Security
--------

Why do we need security?  Our filesystems, shared printer queues and
cluster namespace will all be using cluster resources.  They will
_all_ require barriers for recovery; some will require cluster locks,
some will require cluster name bindings.  If we allow unrestricted
access to these resources by any user, then we have just allowed an
unprivileged user to violate the integrity of the entire cluster core.

Simple Unix uid mapping is not sufficient to solve the security
problem.  In a large heterogeneous hierarchical cluster there may be
many uid spaces participating in the cluster.  We also want to support
partitioned namespaces: not only should individual resources (which we
might conceivably uid-protect) be inaccessible to the wrong user, we
may sometimes also want the namespace itself to be unbrowsable to
unprivileged clients.  Thus we need protection on the whole namespace
too, not just uid protection on individual objects.

For the purposes of the core cluster APIs, I propose a very simple
mechanism to allow for multiple security domains: simply use the
filesystem to mediate the communication between client processes and
the cluster service daemons, and use filenames to name security
domains.  For example, we might have pathnames

    /tmp/cluster//sockets/namespace/USER
    /tmp/cluster//sockets/namespace/SYSTEM

which refer to unix-domain sockets by which client processes can
connect and send requests to the local node's cluster namespace
daemon.  There would be no difference between these two files except
for permissions: normal filesystem modes and attributes can be used to
set appropriate permissions on each such socket file.  (A sketch of
how such per-domain sockets might be set up appears at the end of this
section.)  Note that we define these two security domains to be valid
on all cluster systems.  SYSTEM is the default security domain in
which privileged cluster services operate; USER is the default domain
for application cluster requests.

There is a second advantage to this mechanism: it allows us to pass
security credentials for these cluster sockets around using standard
unix fd-passing.  This leaves the option open that at some point we
can implement a security server to authenticate individual processes'
access to higher privilege levels, granting those privileges by
passing back an appropriate socket fd.

If we have many such security domains, then obviously we want some way
to ensure that they can be set up automatically in each socket
directory and that permissions are set up appropriately on each host.
However, at this stage this mechanism simply lets us say that security
domains via socket permissions provide the functionality we want: the
management on top of that is another layer which we can ignore for
now.  (The API won't care who is responsible for setting up these
sockets as long as they exist.)

If we want to apply uid-based ownerships to objects within a specific
cluster service then we are certainly free to do so, so long as we can
detect the uid associated with a connection to one of the server
sockets.

Note that the core API library services being shared by the various
cluster APIs can use these security domains in whatever way they want.
They may choose to define each security domain as a separate namespace
or to have them share a namespace.  They may choose to allow browsing
between security domains or not.  The correct decision may very well
depend on the service: lock manager domains may want to be totally
opaque to each other, whereas the whole point of the cluster namespace
service is to make all the cluster name bindings visible in a single
namespace, for example.

*** I'm open to suggestions concerning whether or not uid/gid/mode
*** protection should be implemented as a required option in the APIs.
*** The requirement that the cluster should still work if we cross uid
*** mapping domains makes this unclear.  The real question is: should
*** arbitrary unprivileged code have access to the cluster namespaces
*** at all?  If so, then uid protection on resources is necessary.
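
To make the socket-permission idea above concrete, here is a minimal
sketch of how a namespace daemon might create its USER and SYSTEM
sockets.  The paths (with the cluster-name component omitted), the
mode bits chosen, and the reliance on connect() honouring the socket
file's permissions (true on Linux, not on every Unix) are all
illustrative assumptions; error handling is minimal and the socket
directories are assumed to exist already.

    /*
     * Sketch only: one listening socket per security domain, with
     * ordinary filesystem permissions controlling who may connect.
     * On Linux, connect() on a unix domain socket requires write
     * permission on the socket file, which is what enforces the
     * domain boundary here.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/stat.h>
    #include <sys/un.h>
    #include <unistd.h>

    static int make_domain_socket(const char *path, mode_t mode)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
        unlink(path);
        if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
            chmod(path, mode) < 0 ||     /* the domain's access policy */
            listen(fd, 16) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

    int main(void)
    {
        /* USER domain: any local process may connect. */
        int user_fd = make_domain_socket(
            "/tmp/cluster/sockets/namespace/USER", 0666);

        /* SYSTEM domain: only the daemon's own uid (or root) may
         * connect, so only privileged services get in. */
        int sys_fd = make_domain_socket(
            "/tmp/cluster/sockets/namespace/SYSTEM", 0600);

        if (user_fd < 0 || sys_fd < 0)
            perror("make_domain_socket");
        /* ... accept() loops for each domain socket follow ... */
        return 0;
    }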
Cluster Node Naming
-------------------

Consider a hierarchical cluster containing a top level cluster ORG,
and subclusters all the way down to ORG/UK/EDIN/DEV/TEST/TEST1.  A
node in that cluster --- say, ORG/UK/EDIN/DEV/TEST/TEST1/TESTBOX2 ---
may want to participate in cluster services at any of these levels.
We need to be able to specify exactly which of these layered clusters
we are referencing in every single API call to a clustered resource.

Fortunately, the pathname-based access to cluster services proposed
for security domains is also ideal for separating out access to
different clusters: we simply include a cluster-name component in the
pathname used to access the cluster socket.  We already have a
requirement that any cluster name will never appear more than once in
a fully qualified hierarchical cluster or node name, so these names
are guaranteed to represent unique references to a specific level of
the cluster hierarchy on any given node.

As far as the API is concerned, I propose that all access to cluster
services by a process be preceded by a call to

    extern struct cluster_handle *
    get_cluster_handle(const char *cluster_name,
                       const char *security_domain,
                       int flags);

and that the "struct cluster_handle *" returned by this call be used
as the first parameter to every subsequent call to cluster resource
APIs for that cluster.  This allows the user to talk to multiple
clusters and multiple security domains at once, while still hiding the
details of how we might implement these features.


Recoverable Services
--------------------

Things are more complex when we are doing recovery.  The APIs must be
able to continue to work selectively during recovery.

As an example, the barrier API may be being used by a pair of
cooperating user processes on two different nodes when a third node
joins or leaves the cluster.  That cluster transition does not affect
the barrier, so the user applications should not notice any change:
the barrier API should just be stalled temporarily during the
transition.  However, an internal cluster service such as the
namespace service is a completely different matter: it may rely on the
barrier API in order to synchronise its own recovery.  Similarly, lock
manager traffic may be suspended during a transition and resumed
afterwards, but a clustered filesystem may want to perform its own
lock manager operations during the recovery period.

Note that this example shows that we must be prepared to have the same
resource visible both during recovery and for normal operations: we
cannot simply partition the resources visible during recovery into a
separate security domain as described above.

The default behaviour for the API _must_ be that a cluster transition
causes a temporary stall but no other observable behaviour for
applications which are not using resources affected by the transition.
For APIs such as the namespace and lock manager, a node death
releasing a resource ought to be largely indistinguishable from a
voluntary release of the same resource as far as the effect on other
nodes in the cluster is concerned.

So we have two problems:

* How do we specify that a specific API request is recovery-privileged
  and should not wait until the end of recovery? and

* How do we restrict this functionality to privileged processes only?

We can achieve both of these by adding a recovery-ok flag to the
"flags" argument of the get_cluster_handle call, and restricting that
flag to be legal only on the SYSTEM security domain.  That allows an
application to obtain a separate cluster handle to pass to API calls
which need to run during recovery, and it avoids polluting the API of
other calls with recovery information.
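
The following is a hypothetical usage sketch, not a definitive part of
the API: the header name cluster.h, the flag name CLUSTER_RECOVERY_OK
and the cluster_lock() call are assumptions made for illustration;
only get_cluster_handle() itself is proposed above.

    /*
     * Sketch only: obtaining cluster handles for normal and for
     * recovery-privileged use.  cluster.h, CLUSTER_RECOVERY_OK and
     * cluster_lock() are hypothetical placeholders.
     */
    #include <stddef.h>
    #include "cluster.h"    /* hypothetical header declaring the API */

    void example(void)
    {
        /* Ordinary application handle: USER domain, requests stall
         * during cluster transitions. */
        struct cluster_handle *ch =
            get_cluster_handle("ORG/UK/EDIN/DEV/TEST/TEST1",
                               "USER", 0);

        /* A privileged cluster service asking for a handle whose
         * requests may proceed while recovery is in progress.  This
         * is only legal in the SYSTEM security domain. */
        struct cluster_handle *rch =
            get_cluster_handle("ORG/UK/EDIN/DEV/TEST/TEST1",
                               "SYSTEM", CLUSTER_RECOVERY_OK);

        if (ch == NULL || rch == NULL)
            return;             /* error handling elided */

        /* Every subsequent cluster resource call takes the handle as
         * its first parameter, e.g. a hypothetical lock request: */
        /* cluster_lock(ch, "some/lock/name", LOCK_EX); */
    }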
Asynchronous Event Delivery
---------------------------

One of the important things that the cluster API must be able to do is
to deliver events asynchronously to processes.  Cluster transitions,
loss of a barrier if a participating node or process dies, or
notification that another process wants to steal a lock are all async
events which a process using the cluster API will want to deal with.

So, how do we handle cluster callbacks to the user process?  The
normal Unix mechanism for this is to use signals, of course.  However,
we have several problems with that:

* We do not necessarily want our cluster service daemons to run with
  privileges to kill every process in the system;

* It is impossible to pass arbitrary data through a signal on Unixen
  which do not implement posix queued signals;

* The client cannot tell how many signals were received; and

* There are a limited number of signals available but arbitrarily many
  different events which may occur in a cluster (in a lock manager
  API, for example, each lock a client process owns may have its own
  callback routine to be called when that lock is wanted by another
  node).

We can address all of these problems by using sockets.  We can allow
the client to give the server an arbitrary cookie of information to be
associated with a potential future event.  If that event occurs, the
server will simply pass that cookie back to the client via a unix
domain socket reserved for these callbacks.  Even if the Unix
implementation being used only supports SIGIO, that is enough for the
local library part of the API to trap the IO, decode the cookie and
perform the callback.

This obviously restricts the caller's freedom to use SIGIO.  That's
unfortunate, but unavoidable if realtime SIGIO is not available.  On
recent Linux kernels we can use realtime queued SIGIO to deliver a
different signal for the async callback socket, to avoid interfering
with the normal signals.  Obviously, a mechanism by which the signals
can be blocked is required, and this mechanism must be exposed in a
portable way to the user.

More of a problem, we must have a way to synchronise these cookies
with other foreground events.  That is the responsibility of the API
library in the user process.  For example, if the user creates a lock
with a callback, then deletes that lock, there is a chance that the
server has already sent us a callback on that lock: the API library
must, when it decodes the cookie, detect that the callback is no
longer valid and silently discard it rather than delivering it to the
application.
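
As a rough illustration of the client-library side of the cookie
mechanism, the sketch below assumes a dedicated non-blocking callback
socket and a fixed-size cookie table, neither of which is specified
above.  It shows registration, cancellation and dispatch, including
the silent discard of a cookie whose callback has already been
cancelled.

    /*
     * Sketch only: the API library's cookie table.  The cookie format
     * and table layout are illustrative, not a proposed wire protocol.
     * The callback socket is assumed to be non-blocking so that the
     * drain loop terminates once the queue is empty.
     */
    #include <stdint.h>
    #include <unistd.h>

    #define MAX_CALLBACKS 256

    struct callback_entry {
        int valid;                 /* cleared when the resource goes away */
        void (*fn)(void *arg);
        void *arg;
    };

    static struct callback_entry callbacks[MAX_CALLBACKS];

    /* Register a callback and return the cookie which the server
     * should echo back when the corresponding event fires. */
    static uint32_t register_callback(void (*fn)(void *), void *arg)
    {
        for (uint32_t i = 0; i < MAX_CALLBACKS; i++) {
            if (!callbacks[i].valid) {
                callbacks[i] = (struct callback_entry) {
                    .valid = 1, .fn = fn, .arg = arg };
                return i;
            }
        }
        return (uint32_t) -1;      /* table full */
    }

    /* Called when the resource is released locally: any callback
     * already in flight from the server must now be discarded. */
    static void cancel_callback(uint32_t cookie)
    {
        if (cookie < MAX_CALLBACKS)
            callbacks[cookie].valid = 0;
    }

    /* Drain the callback socket (from a SIGIO handler or a poll loop)
     * and dispatch each cookie that is still valid. */
    static void dispatch_callbacks(int callback_fd)
    {
        uint32_t cookie;

        while (read(callback_fd, &cookie, sizeof(cookie)) == sizeof(cookie)) {
            if (cookie < MAX_CALLBACKS && callbacks[cookie].valid) {
                callbacks[cookie].valid = 0;
                callbacks[cookie].fn(callbacks[cookie].arg);
            }
            /* else: stale or unknown cookie, silently discarded */
        }
    }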