A Distributed Lock Manager for Linux

Peter J. Braam, Stelias Computing

Purpose:
 - review some of the VAX Cluster DLM design
 - highlight discussion points for a Linux implementation

Background information: VAX Clusters
 - Roy Davis' book from DTP
 - OpenVMS Programming Concepts manual, chapter 17
 - OpenVMS System Services Reference manual, ENQ(W), DEQ, GETLKI(W)

Resource Directories and Masters and Lock Owners
------------------------------------------------

As part of acquiring a lock on an object, the object needs to be named.
Resources name the objects one wants to lock.  A resource is a string with
an agreed maximum length:

    struct lck_resname {
            char   *rn_name;
            uint16  rn_len;
            int32   rn_hash;
    };

VAX Clusters pack this data differently: a 31 byte string is available for
the name, and the first byte holds the length.  Perhaps this is a useful
future optimization.

Resources are managed as resource trees.  Each name is unique to a parent.
To acquire a lock on a resource, a system must first acquire a lock on all
ancestors in the resource tree.

All locks on resources in a particular tree are managed by a single system.
Usually the resource master is the first system to acquire a lock on the
root of that tree.

To locate the system managing a particular resource, a distributed resource
directory is consulted.  The system that knows the mastery of a particular
resource is found by hashing the resource name.  The hash function gives an
index into the lock directory weight vector (LDWV), which lists the managing
system for each hash value.  Note that systems can be listed multiple times
in the lock directory weight vector, allowing more powerful systems to
manage a larger share of the resource directory.

If the lock directory does not find a system managing a particular resource,
it records the calling system as the new master.  When the last lock on a
resource tree is removed, the resource directory should be informed that the
resource is "free" again.

Remarks:
 1. We will initially use the dcache hash function.
 2. How do we identify systems in the cluster?  This is done through a
    membership database which holds the addresses of each system.  There
    will be multiple channels between systems.
 3. Do we run the resource directory service as a separate daemon, or as
    part of the lock server on each system?

Resource Directory API (internal to the lock manager)

 - res_getmgr(struct lck_resname *, struct sysid *);
   Called by a system when it needs to find the resource master.  If the
   resource directory already has the resource, it returns the sysid of the
   mastering node.  Otherwise it returns 0, indicating that the calling
   system will master the resource tree.

 - res_free(struct lck_resname *);
   Called by the resource master when the last lock or lock request for a
   resource is released.  The resource directory releases the resource.
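To make the directory lookup concrete, the sketch below maps a resource name
onto the node holding its directory entry.  The ldwv structure, the
ldwv_lookup() name and the fixed vector size are illustrative assumptions,
not an agreed interface; the hash is assumed to have been computed into
rn_hash already.

    /* Lock directory weight vector: a system with weight n occupies n slots. */
    struct ldwv {
            uint32       ldwv_nentries;      /* number of slots in use */
            struct sysid ldwv_entry[64];     /* managing system per slot */
    };

    /* Return the system holding the directory entry for this resource. */
    static struct sysid ldwv_lookup(struct ldwv *v, struct lck_resname *res)
    {
            return v->ldwv_entry[(uint32)res->rn_hash % v->ldwv_nentries];
    }

The caller would then send its res_getmgr() request to that node; a zero
reply means the caller itself becomes master of the tree.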
We will have security domains naming lock resource name spaces.  For each
lock name space we have separate connections to deliver requests and events,
with access control managed through the connection.  We probably need a
system name space and one name space per cluster process (e.g. a clustered
database) at the minimum.  XXX So any process can start a new lock resource
name space, right?

The master of a resource will be the system that manages all lock requests
for that resource.  Lock requests are separate data structures from the lock
data.  For every resource mastered, queues of granted, converting and
waiting lock requests are managed, as well as a master copy of the lock
data.  This mastering system manages locks both for itself and for other
systems requesting a lock on a resource in the same tree.

Locks can be accessed on a system through a lockid - the lock id is local to
the system.  If a system has a lock on a resource but is not the master of
that resource, the system holds a copy of the resource, lock and request
information.  This is a duplicate copy; the other copy of the lock
information is on the master.  If a cluster transition takes place, locks
need to be remastered, and these duplicate copies allow that to happen when
a resource master leaves the cluster.

Locks on resources are handled as a tree.  Before a process can lock a
resource it must lock all resources which are ancestors of that resource.
No lock can be dequeued while locks on its descendants remain.  The benefits
of a tree organization of locks are twofold:
 - low level locks can be held at coarse granularity
 - resource names only need to be unique to the parent

Locking
-------

Every system in the cluster will provide lock service, and each cluster
member can request a lock on a resource, remove a lock, or convert a lock to
a different type of lock, provided it can get a connection to the resource
name space.

The system mastering the resource will maintain lock management data.  Lock
management data involves:
 - being able to find lock data given the resource name
 - maintaining a list of granted lock requests on a resource
 - maintaining a list of requests for lock conversions (i.e. a change of
   mode of the lock)
 - maintaining a list of waiting new lock requests

    struct lck_req {
            uint32          req_id;
            struct listhead req_chain;   /* on granted, converting, waiting queue */
            lck_mode        req_mode;    /* requested mode */
    };

Like VAX Clusters we will have six modes of locks.  Informally, the locks
request read and write permissions on a resource.  Formally there is a
compatibility table determining which locks can exist concurrently.

    EX  Exclusive          RW   No other process can get R or W access
    PW  Protected Write    RW   No other process can get W access
    PR  Protected Read     R    No other process can get W access
    CW  Concurrent Write   W    No restrictions on other processes
    CR  Concurrent Read    R    No restrictions on other processes
    NL  Null               -    To indicate an interest in a resource

These lock modes and their compatibility are discussed in the VAX Clusters
book.  Compatibility governs new lock requests and conversions, i.e. a
conversion or new lock is granted only if the new mode is compatible with
all granted modes.

         EX  PW  PR  CW  CR  NL
    EX    0   0   0   0   0   1
    PW    0   0   0   0   1   1
    PR    0   0   1   0   1   1
    CW    0   0   0   1   1   1
    CR    0   1   1   1   1   1
    NL    1   1   1   1   1   1

Note that this table is symmetric.  By associating a level with each lock
mode, a necessary condition for conversion is that the level is not
increasing.

    int lck_compat(lck_mode a, lck_mode b)

The lock levels are, from highest to lowest:
 - EX
 - PW
 - PR and CW
 - CR
 - NL

There are two types of requests for locks:
 - get a new lock on a resource
 - convert a lock on a resource

When a new lock is requested, the lock manager will grant it if and only if
the following three conditions hold:
 - the queue of ungranted conversion requests is empty
 - the queue of ungranted new lock requests is empty
 - the granted locks on the resource all have a mode compatible with the
   requested mode

A conversion request can be granted immediately if:
 - the requested mode is compatible with the modes of all the granted locks

If it cannot be granted immediately, the lock manager places the conversion
request at the end of the conversion queue.  When that happens, the request
will be reconsidered when all other conversion requests have been granted or
canceled.  A lock which is in the conversion queue is still granted in the
mode from which it needs to convert.
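A minimal sketch of the compatibility check and of the grant rule for new
locks, assuming the mode ordering NL < CR < CW < PR < PW < EX; the enum
values, the array representation of the granted queue and the helper name
may_grant_new() are illustrative only.

    typedef enum { LCK_NL = 0, LCK_CR, LCK_CW, LCK_PR, LCK_PW, LCK_EX } lck_mode;

    /* Compatibility matrix from the table above, 1 meaning compatible. */
    static const int lck_compat_matrix[6][6] = {
            /*        NL CR CW PR PW EX */
            /* NL */ { 1, 1, 1, 1, 1, 1 },
            /* CR */ { 1, 1, 1, 1, 1, 0 },
            /* CW */ { 1, 1, 1, 0, 0, 0 },
            /* PR */ { 1, 1, 0, 1, 0, 0 },
            /* PW */ { 1, 1, 0, 0, 0, 0 },
            /* EX */ { 1, 0, 0, 0, 0, 0 },
    };

    int lck_compat(lck_mode a, lck_mode b)
    {
            return lck_compat_matrix[a][b];
    }

    /* Grant rule for a new lock: both ungranted queues must be empty and the
     * requested mode must be compatible with every granted lock. */
    int may_grant_new(lck_mode requested, const lck_mode *granted, int ngranted,
                      int nconverting, int nwaiting)
    {
            int i;

            if (nconverting > 0 || nwaiting > 0)
                    return 0;
            for (i = 0; i < ngranted; i++)
                    if (!lck_compat(granted[i], requested))
                            return 0;
            return 1;
    }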
Note that the lock manager places a lock request on either the granted, the
conversion, or the waiting queue, and remembers both the granted and the
requested mode for locks in the conversion queue.  Hence two conversion
requests on a granted lock cannot be placed simultaneously in the VAX lock
manager.  We still need to define what we do with a second conversion
request coming in for a lock.  Probably it should overwrite the first, but
with a warning.

Three events lead the lock manager on a system to revisit the conversion and
waiting queues:
 - a lock in the granted queue is canceled
 - a conversion request succeeds
 - another ungranted request in the waiting or conversion queue is granted

When one of these events takes place, all requests in the conversion queue
are visited first, and then those in the waiting queue.
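One possible shape of that revisit pass, reusing the lck_mode and lck_compat
definitions from the previous sketch.  The queue representation and helper
names are placeholders, and I assume (as VMS does) that a converting lock's
own granted mode is not counted against its conversion request:

    struct lck_qreq {
            struct lck_qreq *next;
            lck_mode         req_mode;       /* mode requested */
            lck_mode         grt_mode;       /* currently granted mode (conversions) */
            int              is_conversion;
    };

    struct grant_counts {
            int count[6];                    /* granted locks per mode */
    };

    /* Is 'mode' compatible with the granted set?  'exclude' removes one lock
     * in that mode from the check (the converting lock itself), -1 for none. */
    static int grantable(const struct grant_counts *g, lck_mode mode, int exclude)
    {
            int m;
            for (m = 0; m < 6; m++) {
                    int n = g->count[m] - (m == exclude ? 1 : 0);
                    if (n > 0 && !lck_compat((lck_mode)m, mode))
                            return 0;
            }
            return 1;
    }

    /* Revisit pass: conversions first, then waiting requests, in queue order,
     * stopping each queue at the first request that still cannot be granted. */
    static void resource_revisit(struct lck_qreq **converting,
                                 struct lck_qreq **waiting,
                                 struct grant_counts *g)
    {
            struct lck_qreq **queues[2] = { converting, waiting };
            int q;

            for (q = 0; q < 2; q++) {
                    while (*queues[q]) {
                            struct lck_qreq *req = *queues[q];
                            int excl = req->is_conversion ? (int)req->grt_mode : -1;

                            if (!grantable(g, req->req_mode, excl))
                                    break;
                            if (req->is_conversion)
                                    g->count[req->grt_mode]--;
                            g->count[req->req_mode]++;
                            *queues[q] = req->next;
                            /* ...move req to the granted list, fire its
                             * completion routine, notify the owner... */
                    }
            }
    }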
The VAX Clusters book highlights two problems:
 - There is an "indefinite postponement" problem with this algorithm:
   successful immediate conversion requests can hold up queued conversion
   requests.
 - New NL requests are queued and postponed if conversion requests exist.

There are two optional flags to change the behaviour when this problem is
expected and the application wants to explicitly avoid it:
 - pass a flag to queue the conversion even if it can be granted
   immediately.  This leads to first-come-first-served handling of ALL
   conversion requests, not just of those that cannot be granted
   immediately.  (QUEUECONV)
 - pass a flag to expedite a new NL lock request.  (EXPEDITE)

Another flag on the request allows conversion requests either to be granted
immediately, or to be rejected when they cannot be granted.  (NOQUEUE)

There is great variety in the exact behaviour of the lock manager when a
request is processed and granted.  We discuss this below.

Lock blocks contain the information held by the master of the lock and by
the owner.  If the owner of the lock equals the master there is only one
copy of the block.  The lock block contains "links" for a variety of lists
on which it sits, the sysid of the system owning the lock, a process id for
the process which requested the lock, the requested mode of the lock, and
the granted mode of the lock.  The lock owner holds a completion routine
address and a blocking trigger address, as well as a semaphore.  The node
mastering the lock has to notify owners of events that need action.

The VAX Clusters book goes into considerable detail about the memory
allocation for lock blocks.  The lock id is an index into a complicated
table.  Probably a standard slab allocation, with the lockid being the
pointer to the lock block, is easier.  But that suffers from the fact that
lockids would be re-used rapidly, and it lacks locality of reference for
lockids.  Lockids are local to a system and are not passed over the wire
(don't think so)?

Lock Services
-------------

Lock services should be available in three forms:
 - as a kernel level library
 - as a kernel level daemon
 - through a new system call

VAX Clusters made use of a lock status block:

    struct lck_status {
            uint16 lckst_cond;
            uint16 lckst_reserved;
            uint32 lckst_id;
            char   lckst_val[16];
    };

XXX Seems reasonable.  Let's see how the API evolves.

When a lock or a conversion is requested, the caller passes in a pointer to
an allocated lck_status block.  (XXX how does the caller get this lck_status
block?)  The lock id is filled in by the lock service on the system
requesting the lock, before anything else is done, and returned to the
caller in all cases.  So the lock id may be used for further conversion or
cancellation requests.  The lckst_val block is for the use of applications
to exchange information.  The lckst_cond gives the condition in which the
request resulted, i.e. whether the lock was granted, converted etc.

The following status values are returned by the VAX DLM:

For lock conversion and acquisition:
 1. NORMAL: success
 2. ABORT: the lock was dequeued before ENQ could grant the lock
 3. CANCEL: the conversion request was canceled before it could be granted
 4. DEADLOCK: a deadlock was detected
 5. ILLRSDM: no adequate permissions on this resource domain
 6. NODOMAIN: not a valid resource domain for this system
 7. VALNOTVALID: the lock value block is not valid (see below)

For dequeueing:
 1. NORMAL: success
 2. ACCVIO: the value block specified by the caller cannot be accessed
 3. CANCELGRANT: the cancellation request was unsuccessful because the
    conversion request had been granted already
 4. ILLRSDM: no adequate permissions on this resource domain
 5. IVLOCKID: an invalid lockid was specified, or not enough privileges
 6. SUBLOCKS: the lock has sublocks and cannot be dequeued

Note that we fill in the lockid as part of the synchronous component of lock
request processing, i.e. before any blocking operation occurs we allocate
the lock_id.  The advantage of doing this is that lock requests which have
not yet reached the queues can be canceled (see below).

New and conversion lock requests have as parameters:
 - requested mode -- mode to acquire or convert to
 - flags -- indicate special action to be undertaken by the lock manager
 - a resource name -- what needs to be locked
 - a resource domain id and access mode -- probably we ignore this in the
   first instance
 - a lck_status block -- with the id of a lock to convert
 - id of a parent lock -- optional, for acquiring a sublock
 - semaphore
 - completion routine
 - blocking trap
 - a pointer to parameters for either trap

They return an integer containing status information.  The return value is
computed immediately, without blocking, by the lock manager, and contains
condition information that can be determined by the lock manager on the
local system, without having to contact the resource master on another
system.  The conditions returned are:
 - L_NORMAL    - lock was granted or queued
 - L_SYNC      - lock or conversion was granted synchronously
 - EACCESS     - error accessing the lock status block or resource name (segv)
 - L_BADMODE   - bad mode specified
 - L_CVTNOTGR  - conversion on an ungranted lock
 - L_CVTNOPAR  - parent not granted on a sublock
 - L_NOTQUEUED - the lock could not be granted immediately and the NOQUEUE
                 flag was given
 - L_BADRES    - the resource length exceeds the system bound
 - L_BADST     - the lock status block is not present
 - L_BADLCK    - no lock id was given when required (for conversion)

The status code in the lock status block contains further information after
the request has been fully processed:
 - L_SUCCESS - the lock was granted or converted
 - L_ABORT   - the lock was dequeued before it could be acquired
 - L_CANCEL  - the lock was canceled before the request could be granted

Flags can give the following indications to the lock manager:
 - LCK_NOQUEUE   : don't queue the request, return a failure status
 - LCK_SYNCSTS   : return L_SYNC if the lock was synchronously acquired or
                   converted; if so, do not fire a completion routine.
                   Otherwise the return code is L_SUCCESS and the completion
                   routine and semaphore are triggered when the
                   conversion/acquisition completes
 - LCK_SYSLCK    : the resource name is a system resource name (otherwise a
                   process/user resource is assumed)
 - LCK_CONVERT   : the lock is a conversion request
 - LCK_CVTSYS    : convert the process owned lock to a system lock
 - LCK_EXPEDITE  : grant NL lock requests without queueing
 - LCK_QUEUECONV : place the conversion request on the queue

The following flags might be useful when we have deadlock detection and lock
quota implemented, but not initially:
 - LCK_NOQUOTA    : the lock is not charged quota for the lock request
 - LCK_NODLCK     : the process is OK if the lock is merely queued (switches
                    off deadlock detection)
 - LCK_NOBLCKDLCK : the process will release its lock when asked to give it
                    up, and the firing of the blocking completion may block
                    but should not imply deadlock

Further error conditions can occur when the resource management is more
complete, when we have quota or access control on resources.
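Pulling together the lock block description and the request parameters
above, one possible layout of a lock block is sketched below.  All field
names are invented for illustration; the master's copy of the block would
omit the owner-side trap addresses and semaphore.

    struct lck_block {
            struct listhead    lkb_chain;       /* granted, converting or waiting queue */
            struct listhead    lkb_owner_chain; /* per-process list of acquired locks */
            uint32             lkb_id;          /* lockid, local to this system */
            struct sysid       lkb_owner;       /* system owning the lock */
            uint32             lkb_pid;         /* process which requested the lock */
            lck_mode           lkb_granted;     /* granted mode */
            lck_mode           lkb_requested;   /* requested mode (for conversions) */
            uint32             lkb_flags;       /* LCK_* flags from the request */
            struct lck_status *lkb_status;      /* caller's lock status block */
            void             (*lkb_complete)(void *); /* completion routine */
            void             (*lkb_blocking)(void *); /* blocking trigger */
            void              *lkb_astarg;      /* parameter passed to both traps */
            struct semaphore  *lkb_sem;         /* semaphore for synchronous waiters */
            char               lkb_val[16];     /* lock value block */
    };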
For each locking call we will have a blocking and a non-blocking version.
The completion routine should be fired when locks are granted or canceled,
and the blocking trigger when a new lock cannot be granted, or an existing
one cannot be converted, because of a granted lock.

I'm proposing to add one flag not present on VMS:
 - LCK_BLCKRETRY : if a new or conversion request comes in, and this flag is
   set for every lock for which a blocking trigger is fired, the lock
   manager will retry the acquisition or conversion when the blocking
   triggers have run.  With this flag on a lock request, a system indicates
   that it will dequeue the lock during the blocking trigger.  This is very
   useful for file systems and avoids a second lock request.

If a user level process makes lock requests, these routines should be queued
for execution in user mode in the process; otherwise the kernel can execute
them without delay.

??? Do we do this as for signal handlers?  Which comes first, the signal or
the lock traps?

??? Does the lock manager wait for the triggers and completions to have run
before returning to the client?

??? If a process sleeps on synchronous lock acquisition, what signals do we
let in?

??? We are going to hang a list of acquired locks & queued requests in the
process structure.  Process cleanup should dequeue granted locks or convert
them to system owned locks.  Which of the two depends on the application -
opinions here?  Queued requests should be canceled - this should likely be
done before dequeuing the granted locks.

Lock cancellation takes a few forms.  Lock requests can be canceled and
granted locks can be dequeued.  Variants of dequeueing are:
 - cancel/dequeue a particular request/lock
 - cancel all sublocks of a parent
 - cancel all locks of a process

The VMS dequeueing mechanism passes the lockid of the lock to be released as
a parameter to the lock dequeue request.  Since this is not known until the
lock has been queued, it seems impossible to cancel a queued lock request
before it has been queued (this is a race condition that is problematic in
process termination).

Locking API
-----------

    int lck_enqueue (???? event_flag,        /* IN, flag to be set when lock is
                                                granted or canceled */
                     uint32 lck_mode,        /* IN, requested mode */
                     struct lck_status *lsb, /* OUT lock status block, with version */
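This prototype is clearly incomplete.  Filled out along the lines of the
parameter list given earlier, it might look as follows; every name here is a
placeholder, and the semaphore takes the role that the event flag plays on
VMS:

    int lck_enqueue(uint32              lck_flags,  /* IN  LCK_* flags */
                    uint32              lck_mode,   /* IN  requested mode */
                    struct lck_resname *resname,    /* IN  resource to lock */
                    uint32              domain,     /* IN  resource domain id */
                    uint32              parent_id,  /* IN  parent lockid, 0 if none */
                    struct lck_status  *lsb,        /* OUT status block; lckst_id of an
                                                       existing lock for a conversion */
                    struct semaphore   *sem,        /* IN  signalled on completion */
                    void (*completion)(void *arg),  /* IN  completion routine */
                    void (*blocking)(void *arg),    /* IN  blocking trigger */
                    void               *arg);       /* IN  argument for both traps */

A matching dequeue call would take the lockid from the status block plus
flags selecting the cancel/dequeue variants listed above.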
Lock version/value blocks
-------------------------

Lock blocks can contain version information.  This version information can
be used by applications to communicate what version of the resource they
want to lock.  Versions can be:
 - written: the lock value information is copied from the lock status block
   in the request to the resource lock value field held by the lock manager
 - returned: the lock value block is copied from the resource to the lock
   request's lock value field
 - none: the lock value fields are not changed

                     new mode
    held mode   NL     CR     CW     PR     PW     EX
    NL          ret    ret    ret    ret    ret    ret
    CR          none   ret    ret    ret    ret    ret
    CW          none   none   ret    ret    ret    ret
    PR          none   none   none   ret    ret    ret
    PW          write  write  write  write  write  ret
    EX          write  write  write  write  write  write

Note that this table essentially says that whenever a protected write or
exclusive lock is released or demoted, the value is written into the
resource.  When a lock is promoted (say from NL to CR), the lock value
information is returned to the requestor.

Lock value blocks are 16 bytes long in VMS.
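The write/return/none rules collapse into a small helper, using the lck_mode
ordering from the earlier sketch (NL < CR < CW < PR < PW < EX); the enum of
actions is illustrative:

    enum lck_val_action { LCK_VAL_NONE, LCK_VAL_RETURN, LCK_VAL_WRITE };

    /* Action on the lock value block when a lock held in mode 'held' is
     * converted to mode 'wanted', following the table above. */
    enum lck_val_action lck_val_action(lck_mode held, lck_mode wanted)
    {
            if (held == LCK_EX)
                    return LCK_VAL_WRITE;
            if (held == LCK_PW)
                    return wanted == LCK_EX ? LCK_VAL_RETURN : LCK_VAL_WRITE;
            /* NL, CR, CW, PR: return the value on promotion or equal mode,
             * leave it alone on demotion. */
            return wanted >= held ? LCK_VAL_RETURN : LCK_VAL_NONE;
    }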
The completion and blocking triggers
------------------------------------

 - Completion routine: with lock acquisition or conversion a completion
   routine can be passed.  This routine is run when the request is granted
   or converted.  It is also delivered when the lock or conversion request
   is canceled (dequeue with the cancel flag set).  The routine can be
   passed a parameter (such as the address of the lock status block,
   although that should perhaps be included by default).

 - Blocking routine: this routine is called for locks that have been granted
   already and are holding up another lock acquisition or conversion
   request.  The parameter for the completion routine is also given to the
   blocking routine.

Cluster transitions and the DLM
-------------------------------

The basis for the algorithm is the following sequence of crude operations:
 1. When the new cluster membership is known, each system can compute a new
    LDWV to locate resource directories.  Discard existing resource
    directories.
 2. Each node mastering a resource should discard the locks it doesn't own
    itself.  For each resource on which this node still has locks, it should
    register itself as the resource master with the resource directory.
 3. Each node should re-acquire each lock it owns but does not master.

The problem with this algorithm is that it does much more than is needed in
many cases, and it can lead to unfortunate mastery: trees will be remastered
by the first system to touch them.  This is bad in case that system is slow,
and in case that system has many fewer locks in the resource tree than
another system, which now has to use it as a master.  This situation can be
improved by first letting systems with a nonzero weight in the LDWV remaster
their locks, and only then the other systems.

Another problem with the basic algorithm is that too much work is often
done:
 - if a system with weight 0 leaves or enters, nothing needs to be done
 - when a system joins the cluster, only the directory needs to be affected,
   not the lock database
 - when a system leaves the cluster, the minimum that needs to be done is:
    a) deleting locks held by the system
    b) adjusting directory information if the weight was not 0
    c) remastering trees mastered by the leaving system
    d) granting locks blocked by locks held on the leaving system

Changes in this direction are possible and greatly reduce the rebuild times
of the lock database.  Systems with lock directory weight 0 do not get to
master trees if there are at least 2 systems with non-zero weight in the
cluster.

??? I'd like to discuss how methods like these can use the cluster
infrastructure to organise things like:
 - first let nonzero weight systems do their part
 - then let zero weight systems do their bit

A further improvement to rebuilding comes from organizing clean exits from
the cluster, by asking the node that is leaving to hand off mastery of the
trees it masters.

??? It seems reasonable to me to let the cluster controller control the
recovery of the lock database.

Dynamic remastering
-------------------

The next improvement to the lock manager is to move lock mastery on the
basis of the power of the systems and of activity.  The rules for mastering
become:
 - a resource is initially mastered by the first system to acquire a lock
 - if multiple systems acquire a lock on a resource, systems with higher
   lock directory weight should master the resource
 - if multiple systems of the same weight acquire locks, the system with the
   highest activity count should master the resource

The rules are subject to thresholds.  Activity is measured every second by
forming a geometric average of lock acquisition, conversion and dequeueing
operations.

When a system mastering a resource is servicing a request from another
system, it can mark the lock for remastering if:
 - the weight of the other system is bigger than its own weight.  This
   causes a "check for better master" flag to be set.
 - the activity on the lock is at least xx (ops/sec) and the activity of the
   remote system is at least xx (ops/sec) higher than its own activity.
   This sets the "remaster pending" flag.

Every 8 seconds the lock manager looks at the resource trees it masters and
can move up to 5 of them during each such run.  A tree it masters is moved
if one of the following two situations is true:
 - the "check for better master" flag is set
 - the "remaster pending" flag is set and
    a) the benefit outweighs the cost, or
    b) the benefit does not outweigh the cost, but the same system has set
       the "remaster pending" flag three times during the previous 24
       seconds

The cost is the number of locks that need to move; the benefit is 16 times
the difference in activity between the new master and the current master.

A different form of remastery concerns trees that are mastered by one node
but used by just a single other node.  In this case it makes sense to move
the mastery of that tree to the node using the tree.  VMS does this once a
minute.

??? The remastering of locks may require some form of two phase commit
between the resource directory, the current master and the new master.  How
can this be organized?
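The per-tree decision in that periodic scan could look roughly like this;
the structure fields and the should_remaster() name are placeholders, with
the thresholds taken from the description above:

    /* Per-tree statistics kept by the current master about one remote
     * candidate master. */
    struct tree_stats {
            int better_master;     /* "check for better master" flag set     */
            int remaster_pending;  /* "remaster pending" flag set            */
            int pending_marks;     /* times the same system set that flag in
                                      the previous 24 seconds                */
            int nlocks;            /* locks that would have to move (cost)   */
            int local_activity;    /* our ops/sec on this tree               */
            int remote_activity;   /* the candidate's ops/sec on this tree   */
    };

    static int should_remaster(const struct tree_stats *t)
    {
            int cost    = t->nlocks;
            int benefit = 16 * (t->remote_activity - t->local_activity);

            if (t->better_master)
                    return 1;
            if (t->remaster_pending && (benefit > cost || t->pending_marks >= 3))
                    return 1;
            return 0;
    }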