A Distributed Lock Manager for Linux

Peter J. Braam, Stelias Computing

Purpose:
 - review some of the VAX Cluster DLM design
 - highlight discussion points for a Linux implementation

Background information: VAX Clusters
 - Roy Davis' book from DTP
 - OpenVMS Programming Concepts manual, chapter 17
 - OpenVMS System Services Reference manual, ENQ(W), DEQ, GETLKI(W)

Resource Directories and Masters and Lock Owners
------------------------------------------------

As part of acquiring a lock on an object, the object needs to be named.
Resources name the objects one wants to lock.  A resource is a string with
an agreed maximum length:

    struct lck_resname {
            char   *rn_name;
            uint16  rn_len;
            int32   rn_hash;
    };

VAX Clusters pack this data differently: a 31 byte string is available for
the name, and the first byte holds the length.  Perhaps this is a useful
future optimization.

Resources are managed as resource trees.  Each name is unique to a parent.
To acquire a lock on a resource, a system must first acquire a lock on all
ancestors in the resource tree.

All locks on resources in a particular tree are managed by a single system.
Usually the resource master is the first system to acquire a lock on the
root of that tree.

To locate the system managing a particular resource, a distributed resource
directory is consulted.  The system that knows the mastery of a particular
resource is found by hashing the resource name.  The hash function gives an
index into the lock directory weight vector (LDWV), which lists the managing
system for each hash value.  Note that systems can be listed multiple times
in the lock directory weight vector, allowing more powerful systems to
manage a larger share of the resource directory.

If the lock directory does not find a system managing a particular resource,
it records the calling system as the new master.  When the last lock on a
resource tree is removed, the resource directory should be informed that the
resource is "free" again.

Remarks:
 1. We will initially use the dcache hash function.
 2. How do we identify systems in the cluster?  This is done through a
    membership database which holds the addresses of each system.  There
    will be multiple channels between systems.
 3. Do we run the resource directory service as a separate daemon, or as
    part of the lock server on each system?

Resource Directory API (internal to the lock manager)

 - res_getmgr(struct lck_resname *, struct sysid *);
   Called by a system when it needs to find the resource master.  If the
   resource directory already has the resource, it returns the sysid of the
   mastering node.  Otherwise it returns 0, indicating that the calling
   system will master the resource tree.

 - res_free(struct lck_resname *);
   Called by the resource master when the last lock or lock request for a
   resource is released.  The resource directory releases the resource.
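To make the directory lookup concrete, the sketch below maps a resource name
onto the node holding its directory entry.  The ldwv structure, the
ldwv_lookup() name and the fixed vector size are illustrative assumptions,
not an agreed interface; the hash is assumed to have been computed into
rn_hash already.

    /* Lock directory weight vector: a system with weight n occupies n slots. */
    struct ldwv {
            uint32       ldwv_nentries;      /* number of slots in use */
            struct sysid ldwv_entry[64];     /* managing system per slot */
    };

    /* Return the system holding the directory entry for this resource. */
    static struct sysid ldwv_lookup(struct ldwv *v, struct lck_resname *res)
    {
            return v->ldwv_entry[(uint32)res->rn_hash % v->ldwv_nentries];
    }

The caller would then send its res_getmgr() request to that node; a zero
reply means the caller itself becomes master of the tree.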
We will have security domains naming lock resource name spaces.  For each
lock name space we have separate connections to deliver requests and events,
with access control managed through the connection.  We probably need a
system name space and one name space per cluster process (e.g. a clustered
database) at the minimum.  XXX So any process can start a new lock resource
name space, right?

The master of a resource will be the system that manages all lock requests
for that resource.  Lock requests are separate data structures from the lock
data.  For every resource mastered, queues of granted, converting and
waiting lock requests are managed, as well as a master copy of the lock
data.  This mastering system manages locks both for itself and for other
systems requesting a lock on a resource in the same tree.

Locks can be accessed on a system through a lockid - the lock id is local to
the system.  If a system has a lock on a resource but is not the master of
that resource, the system holds a copy of the resource, lock and request
information.  This is a duplicate copy; the other copy of the lock
information is on the master.  If a cluster transition takes place, locks
need to be remastered, and these duplicate copies allow that to happen when
a resource master leaves the cluster.

Locks on resources are handled as a tree.  Before a process can lock a
resource it must lock all resources which are ancestors of that resource.
No lock can be dequeued while locks on its descendants remain.  The benefits
of a tree organization of locks are twofold:
 - low level locks can be held at coarse granularity
 - resource names only need to be unique to the parent

Locking
-------

Every system in the cluster will provide lock service, and each cluster
member can request a lock on a resource, remove a lock, or convert a lock to
a different type of lock, provided it can get a connection to the resource
name space.

The system mastering the resource will maintain lock management data.  Lock
management data involves:
 - being able to find lock data given the resource name
 - maintaining a list of granted lock requests on a resource
 - maintaining a list of requests for lock conversions (i.e. a change of
   mode of the lock)
 - maintaining a list of waiting new lock requests

    struct lck_req {
            uint32          req_id;
            struct listhead req_chain;   /* on granted, converting, waiting queue */
            lck_mode        req_mode;    /* requested mode */
    };

Like VAX Clusters we will have six modes of locks.  Informally, the locks
request read and write permissions on a resource.  Formally there is a
compatibility table determining which locks can exist concurrently.

    EX  Exclusive          RW   No other process can get R or W access
    PW  Protected Write    RW   No other process can get W access
    PR  Protected Read     R    No other process can get W access
    CW  Concurrent Write   W    No restrictions on other processes
    CR  Concurrent Read    R    No restrictions on other processes
    NL  Null               -    To indicate an interest in a resource

These lock modes and their compatibility are discussed in the VAX Clusters
book.  Compatibility governs new lock requests and conversions, i.e. a
conversion or new lock is granted only if the new mode is compatible with
all granted modes.

         EX  PW  PR  CW  CR  NL
    EX    0   0   0   0   0   1
    PW    0   0   0   0   1   1
    PR    0   0   1   0   1   1
    CW    0   0   0   1   1   1
    CR    0   1   1   1   1   1
    NL    1   1   1   1   1   1

Note that this table is symmetric.  By associating a level with each lock
mode, a necessary condition for conversion is that the level is not
increasing.

    int lck_compat(lck_mode a, lck_mode b)

The lock levels are, from highest to lowest:
 - EX
 - PW
 - PR and CW
 - CR
 - NL

There are two types of requests for locks:
 - get a new lock on a resource
 - convert a lock on a resource

When a new lock is requested, the lock manager will grant it if and only if
the following three conditions hold:
 - the queue of ungranted conversion requests is empty
 - the queue of ungranted new lock requests is empty
 - the granted locks on the resource all have a mode compatible with the
   requested mode

A conversion request can be granted immediately if:
 - the requested mode is compatible with the modes of all the granted locks

If it cannot be granted immediately, the lock manager places the conversion
request at the end of the conversion queue.  When that happens, the request
will be reconsidered when all other conversion requests have been granted or
canceled.  A lock which is in the conversion queue is still granted in the
mode from which it needs to convert.
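A minimal sketch of the compatibility check and of the grant rule for new
locks, assuming the mode ordering NL < CR < CW < PR < PW < EX; the enum
values, the array representation of the granted queue and the helper name
may_grant_new() are illustrative only.

    typedef enum { LCK_NL = 0, LCK_CR, LCK_CW, LCK_PR, LCK_PW, LCK_EX } lck_mode;

    /* Compatibility matrix from the table above, 1 meaning compatible. */
    static const int lck_compat_matrix[6][6] = {
            /*        NL CR CW PR PW EX */
            /* NL */ { 1, 1, 1, 1, 1, 1 },
            /* CR */ { 1, 1, 1, 1, 1, 0 },
            /* CW */ { 1, 1, 1, 0, 0, 0 },
            /* PR */ { 1, 1, 0, 1, 0, 0 },
            /* PW */ { 1, 1, 0, 0, 0, 0 },
            /* EX */ { 1, 0, 0, 0, 0, 0 },
    };

    int lck_compat(lck_mode a, lck_mode b)
    {
            return lck_compat_matrix[a][b];
    }

    /* Grant rule for a new lock: both ungranted queues must be empty and the
     * requested mode must be compatible with every granted lock. */
    int may_grant_new(lck_mode requested, const lck_mode *granted, int ngranted,
                      int nconverting, int nwaiting)
    {
            int i;

            if (nconverting > 0 || nwaiting > 0)
                    return 0;
            for (i = 0; i < ngranted; i++)
                    if (!lck_compat(granted[i], requested))
                            return 0;
            return 1;
    }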
Note that the lock manager places a lock request on either the granted, the
conversion, or the waiting queue, and remembers both the granted and the
requested mode for locks in the conversion queue.  Hence two conversion
requests on a granted lock cannot be placed simultaneously in the VAX lock
manager.  We still need to define what we do with a second conversion
request coming in for a lock.  Probably it should overwrite the first, but
with a warning.

Three events lead the lock manager on a system to revisit the conversion and
waiting queues:
 - a lock in the granted queue is canceled
 - a conversion request succeeds
 - another ungranted request in the waiting or conversion queue is granted

When one of these events takes place, all requests in the conversion queue
are visited first, and then those in the waiting queue.
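One possible shape of that revisit pass, reusing the lck_mode and lck_compat
definitions from the previous sketch.  The queue representation and helper
names are placeholders, and I assume (as VMS does) that a converting lock's
own granted mode is not counted against its conversion request:

    struct lck_qreq {
            struct lck_qreq *next;
            lck_mode         req_mode;       /* mode requested */
            lck_mode         grt_mode;       /* currently granted mode (conversions) */
            int              is_conversion;
    };

    struct grant_counts {
            int count[6];                    /* granted locks per mode */
    };

    /* Is 'mode' compatible with the granted set?  'exclude' removes one lock
     * in that mode from the check (the converting lock itself), -1 for none. */
    static int grantable(const struct grant_counts *g, lck_mode mode, int exclude)
    {
            int m;
            for (m = 0; m < 6; m++) {
                    int n = g->count[m] - (m == exclude ? 1 : 0);
                    if (n > 0 && !lck_compat((lck_mode)m, mode))
                            return 0;
            }
            return 1;
    }

    /* Revisit pass: conversions first, then waiting requests, in queue order,
     * stopping each queue at the first request that still cannot be granted. */
    static void resource_revisit(struct lck_qreq **converting,
                                 struct lck_qreq **waiting,
                                 struct grant_counts *g)
    {
            struct lck_qreq **queues[2] = { converting, waiting };
            int q;

            for (q = 0; q < 2; q++) {
                    while (*queues[q]) {
                            struct lck_qreq *req = *queues[q];
                            int excl = req->is_conversion ? (int)req->grt_mode : -1;

                            if (!grantable(g, req->req_mode, excl))
                                    break;
                            if (req->is_conversion)
                                    g->count[req->grt_mode]--;
                            g->count[req->req_mode]++;
                            *queues[q] = req->next;
                            /* ...move req to the granted list, fire its
                             * completion routine, notify the owner... */
                    }
            }
    }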
The VAX Clusters book highlights two problems:
 - There is an "indefinite postponement" problem with this algorithm:
   successful immediate conversion requests can hold up queued conversion
   requests.
 - New NL requests are queued and postponed if conversion requests exist.

There are two optional flags to change the behaviour when this problem is
expected and the application wants to explicitly avoid it:
 - pass a flag to queue the conversion even if it can be granted
   immediately.  This leads to first-come-first-served handling of ALL
   conversion requests, not just of those that cannot be granted
   immediately.  (QUEUECONV)
 - pass a flag to expedite a new NL lock request.  (EXPEDITE)

Another flag on the request allows conversion requests either to be granted
immediately, or to be rejected when they cannot be granted.  (NOQUEUE)

There is great variety in the exact behaviour of the lock manager when a
request is processed and granted.  We discuss this below.

Lock blocks contain the information held by the master of the lock and by
the owner.  If the owner of the lock equals the master there is only one
copy of the block.  The lock block contains "links" for a variety of lists
on which it sits, the sysid of the system owning the lock, a process id for
the process which requested the lock, the requested mode of the lock, and
the granted mode of the lock.  The lock owner holds a completion routine
address and a blocking trigger address, as well as a semaphore.  The node
mastering the lock has to notify owners of events that need action.

The VAX Clusters book goes into considerable detail about the memory
allocation for lock blocks.  The lock id is an index into a complicated
table.  Probably a standard slab allocation, with the lockid being the
pointer to the lock block, is easier.  But that suffers from the fact that
lockids would be re-used rapidly, and it lacks locality of reference for
lockids.  Lockids are local to a system and are not passed over the wire
(don't think so)?

Lock Services
-------------

Lock services should be available in three forms:
 - as a kernel level library
 - as a kernel level daemon
 - through a new system call

VAX Clusters made use of a lock status block:

    struct lck_status {
            uint16 lckst_cond;
            uint16 lckst_reserved;
            uint32 lckst_id;
            char   lckst_val[16];
    };

XXX Seems reasonable.  Let's see how the API evolves.

When a lock or a conversion is requested, the caller passes in a pointer to
an allocated lck_status block.  (XXX how does the caller get this lck_status
block?)  The lock id is filled in by the lock service on the system
requesting the lock, before anything else is done, and returned to the
caller in all cases.  So the lock id may be used for further conversion or
cancellation requests.  The lckst_val block is for the use of applications
to exchange information.  The lckst_cond gives the condition in which the
request resulted, i.e. whether the lock was granted, converted etc.

The following status values are returned by the VAX DLM:

For lock conversion and acquisition:
 1. NORMAL: success
 2. ABORT: the lock was dequeued before ENQ could grant the lock
 3. CANCEL: the conversion request was canceled before it could be granted
 4. DEADLOCK: a deadlock was detected
 5. ILLRSDM: no adequate permissions on this resource domain
 6. NODOMAIN: not a valid resource domain for this system
 7. VALNOTVALID: the lock value block is not valid (see below)

For dequeueing:
 1. NORMAL: success
 2. ACCVIO: the value block specified by the caller cannot be accessed
 3. CANCELGRANT: the cancellation request was unsuccessful because the
    conversion request had been granted already
 4. ILLRSDM: no adequate permissions on this resource domain
 5. IVLOCKID: an invalid lockid was specified, or not enough privileges
 6. SUBLOCKS: the lock has sublocks and cannot be dequeued

Note that we fill in the lockid as part of the synchronous component of lock
request processing, i.e. before any blocking operation occurs we allocate
the lock_id.  The advantage of doing this is that lock requests which have
not yet reached the queues can be canceled (see below).

New and conversion lock requests have as parameters:
 - requested mode -- mode to acquire or convert to
 - flags -- indicate special action to be undertaken by the lock manager
 - a resource name -- what needs to be locked
 - a resource domain id and access mode -- probably we ignore this in the
   first instance
 - a lck_status block -- with the id of a lock to convert
 - id of a parent lock -- optional, for acquiring a sublock
 - semaphore
 - completion routine
 - blocking trap
 - a pointer to parameters for either trap

They return an integer containing status information.  The return value is
computed immediately, without blocking, by the lock manager, and contains
condition information that can be determined by the lock manager on the
local system, without having to contact the resource master on another
system.  The conditions returned are:
 - L_NORMAL    - lock was granted or queued
 - L_SYNC      - lock or conversion was granted synchronously
 - EACCESS     - error accessing the lock status block or resource name (segv)
 - L_BADMODE   - bad mode specified
 - L_CVTNOTGR  - conversion on an ungranted lock
 - L_CVTNOPAR  - parent not granted on a sublock
 - L_NOTQUEUED - the lock could not be granted immediately and the NOQUEUE
                 flag was given
 - L_BADRES    - the resource length exceeds the system bound
 - L_BADST     - the lock status block is not present
 - L_BADLCK    - no lock id was given when required (for conversion)

The status code in the lock status block contains further information after
the request has been fully processed:
 - L_SUCCESS - the lock was granted or converted
 - L_ABORT   - the lock was dequeued before it could be acquired
 - L_CANCEL  - the lock was canceled before the request could be granted

Flags can give the following indications to the lock manager:
 - LCK_NOQUEUE   : don't queue the request, return a failure status
 - LCK_SYNCSTS   : return L_SYNC if the lock was synchronously acquired or
                   converted; if so, do not fire a completion routine.
                   Otherwise the return code is L_SUCCESS and the completion
                   routine and semaphore are triggered when the
                   conversion/acquisition completes
 - LCK_SYSLCK    : the resource name is a system resource name (otherwise a
                   process/user resource is assumed)
 - LCK_CONVERT   : the lock is a conversion request
 - LCK_CVTSYS    : convert the process owned lock to a system lock
 - LCK_EXPEDITE  : grant NL lock requests without queueing
 - LCK_QUEUECONV : place the conversion request on the queue

The following flags might be useful when we have deadlock detection and lock
quota implemented, but not initially:
 - LCK_NOQUOTA    : the lock is not charged quota for the lock request
 - LCK_NODLCK     : the process is OK if the lock is merely queued (switches
                    off deadlock detection)
 - LCK_NOBLCKDLCK : the process will release its lock when asked to give it
                    up, and the firing of the blocking completion may block
                    but should not imply deadlock

Further error conditions can occur when the resource management is more
complete, when we have quota or access control on resources.
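Pulling together the lock block description and the request parameters
above, one possible layout of a lock block is sketched below.  All field
names are invented for illustration; the master's copy of the block would
omit the owner-side trap addresses and semaphore.

    struct lck_block {
            struct listhead    lkb_chain;       /* granted, converting or waiting queue */
            struct listhead    lkb_owner_chain; /* per-process list of acquired locks */
            uint32             lkb_id;          /* lockid, local to this system */
            struct sysid       lkb_owner;       /* system owning the lock */
            uint32             lkb_pid;         /* process which requested the lock */
            lck_mode           lkb_granted;     /* granted mode */
            lck_mode           lkb_requested;   /* requested mode (for conversions) */
            uint32             lkb_flags;       /* LCK_* flags from the request */
            struct lck_status *lkb_status;      /* caller's lock status block */
            void             (*lkb_complete)(void *); /* completion routine */
            void             (*lkb_blocking)(void *); /* blocking trigger */
            void              *lkb_astarg;      /* parameter passed to both traps */
            struct semaphore  *lkb_sem;         /* semaphore for synchronous waiters */
            char               lkb_val[16];     /* lock value block */
    };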
For each locking call we will have a blocking and a non-blocking version.
The completion routine should be fired when locks are granted or canceled,
and the blocking trigger when a new lock cannot be granted, or an existing
one cannot be converted, because of a granted lock.

I'm proposing to add one flag not present on VMS:
 - LCK_BLCKRETRY : if a new or conversion request comes in, and this flag is
   set for every lock for which a blocking trigger is fired, the lock
   manager will retry the acquisition or conversion when the blocking
   triggers have run.  With this flag on a lock request, a system indicates
   that it will dequeue the lock during the blocking trigger.  This is very
   useful for file systems and avoids a second lock request.

If a user level process makes lock requests, these routines should be queued
for execution in user mode in the process; otherwise the kernel can execute
them without delay.

??? Do we do this as for signal handlers?  Which comes first, the signal or
the lock traps?

??? Does the lock manager wait for the triggers and completions to have run
before returning to the client?

??? If a process sleeps on synchronous lock acquisition, what signals do we
let in?

??? We are going to hang a list of acquired locks & queued requests in the
process structure.  Process cleanup should dequeue granted locks or convert
them to system owned locks.  Which of the two depends on the application -
opinions here?  Queued requests should be canceled - this should likely be
done before dequeuing the granted locks.

Lock cancellation takes a few forms.  Lock requests can be canceled and
granted locks can be dequeued.  Variants of dequeueing are:
 - cancel/dequeue a particular request/lock
 - cancel all sublocks of a parent
 - cancel all locks of a process

The VMS dequeueing mechanism passes the lockid of the lock to be released as
a parameter to the lock dequeue request.  Since this is not known until the
lock has been queued, it seems impossible to cancel a queued lock request
before it has been queued (this is a race condition that is problematic in
process termination).

Locking API
-----------

    int lck_enqueue (???? event_flag,        /* IN, flag to be set when lock is
                                                granted or canceled */
                     uint32 lck_mode,        /* IN, requested mode */
                     struct lck_status *lsb, /* OUT lock status block, with version */
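This prototype is clearly incomplete.  Filled out along the lines of the
parameter list given earlier, it might look as follows; every name here is a
placeholder, and the semaphore takes the role that the event flag plays on
VMS:

    int lck_enqueue(uint32              lck_flags,  /* IN  LCK_* flags */
                    uint32              lck_mode,   /* IN  requested mode */
                    struct lck_resname *resname,    /* IN  resource to lock */
                    uint32              domain,     /* IN  resource domain id */
                    uint32              parent_id,  /* IN  parent lockid, 0 if none */
                    struct lck_status  *lsb,        /* OUT status block; lckst_id of an
                                                       existing lock for a conversion */
                    struct semaphore   *sem,        /* IN  signalled on completion */
                    void (*completion)(void *arg),  /* IN  completion routine */
                    void (*blocking)(void *arg),    /* IN  blocking trigger */
                    void               *arg);       /* IN  argument for both traps */

A matching dequeue call would take the lockid from the status block plus
flags selecting the cancel/dequeue variants listed above.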
Lock version/value blocks
-------------------------

Lock blocks can contain version information.  This version information can
be used by applications to communicate what version of the resource they
want to lock.  Versions can be:
 - written: the lock value information is copied from the lock status block
   in the request to the resource lock value field held by the lock manager
 - returned: the lock value block is copied from the resource to the lock
   request's lock value field
 - none: the lock value fields are not changed

                     new mode
    held mode   NL     CR     CW     PR     PW     EX
    NL          ret    ret    ret    ret    ret    ret
    CR          none   ret    ret    ret    ret    ret
    CW          none   none   ret    ret    ret    ret
    PR          none   none   none   ret    ret    ret
    PW          write  write  write  write  write  ret
    EX          write  write  write  write  write  write

Note that this table essentially says that whenever a protected write or
exclusive lock is released or demoted, the value is written into the
resource.  When a lock is promoted (say from NL to CR), the lock value
information is returned to the requestor.

Lock value blocks are 16 bytes long in VMS.
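The write/return/none rules collapse into a small helper, using the lck_mode
ordering from the earlier sketch (NL < CR < CW < PR < PW < EX); the enum of
actions is illustrative:

    enum lck_val_action { LCK_VAL_NONE, LCK_VAL_RETURN, LCK_VAL_WRITE };

    /* Action on the lock value block when a lock held in mode 'held' is
     * converted to mode 'wanted', following the table above. */
    enum lck_val_action lck_val_action(lck_mode held, lck_mode wanted)
    {
            if (held == LCK_EX)
                    return LCK_VAL_WRITE;
            if (held == LCK_PW)
                    return wanted == LCK_EX ? LCK_VAL_RETURN : LCK_VAL_WRITE;
            /* NL, CR, CW, PR: return the value on promotion or equal mode,
             * leave it alone on demotion. */
            return wanted >= held ? LCK_VAL_RETURN : LCK_VAL_NONE;
    }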
The completion and blocking triggers
------------------------------------

 - Completion routine: with lock acquisition or conversion a completion
   routine can be passed.  This routine is run when the request is granted
   or converted.  It is also delivered when the lock or conversion request
   is canceled (dequeue with the cancel flag set).  The routine can be
   passed a parameter (such as the address of the lock status block,
   although that should perhaps be included by default).

 - Blocking routine: this routine is called for locks that have been granted
   already and are holding up another lock acquisition or conversion
   request.  The parameter for the completion routine is also given to the
   blocking routine.

Cluster transitions and the DLM
-------------------------------

The basis for the algorithm is the following sequence of crude operations:
 1. When the new cluster membership is known, each system can compute a new
    LDWV to locate resource directories.  Discard existing resource
    directories.
 2. Each node mastering a resource should discard the locks it doesn't own
    itself.  For each resource on which this node still has locks, it should
    register itself as the resource master with the resource directory.
 3. Each node should re-acquire each lock it owns but does not master.

The problem with this algorithm is that it does much more than is needed in
many cases, and it can lead to unfortunate mastery: trees will be remastered
by the first system to touch them.  This is bad in case that system is slow,
and in case that system has many fewer locks in the resource tree than
another system, which now has to use it as a master.  This situation can be
improved by first letting systems with a nonzero weight in the LDWV remaster
their locks, and only then the other systems.

Another problem with the basic algorithm is that too much work is often
done:
 - if a system with weight 0 leaves or enters, nothing needs to be done
 - when a system joins the cluster, only the directory needs to be affected,
   not the lock database
 - when a system leaves the cluster, the minimum that needs to be done is:
    a) deleting locks held by the system
    b) adjusting directory information if the weight was not 0
    c) remastering trees mastered by the leaving system
    d) granting locks blocked by locks held on the leaving system

Changes in this direction are possible and greatly reduce the rebuild times
of the lock database.  Systems with lock directory weight 0 do not get to
master trees if there are at least 2 systems with non-zero weight in the
cluster.

??? I'd like to discuss how methods like these can use the cluster
infrastructure to organise things like:
 - first let nonzero weight systems do their part
 - then let zero weight systems do their bit

A further improvement to rebuilding comes from organizing clean exits from
the cluster, by asking the node that is leaving to hand off mastery of the
trees it masters.

??? It seems reasonable to me to let the cluster controller control the
recovery of the lock database.

Dynamic remastering
-------------------

The next improvement to the lock manager is to move lock mastery on the
basis of the power of the systems and of activity.  The rules for mastering
become:
 - a resource is initially mastered by the first system to acquire a lock
 - if multiple systems acquire a lock on a resource, systems with higher
   lock directory weight should master the resource
 - if multiple systems of the same weight acquire locks, the system with the
   highest activity count should master the resource

The rules are subject to thresholds.  Activity is measured every second by
forming a geometric average of lock acquisition, conversion and dequeueing
operations.

When a system mastering a resource is servicing a request from another
system, it can mark the lock for remastering if:
 - the weight of the other system is bigger than its own weight.  This
   causes a "check for better master" flag to be set.
 - the activity on the lock is at least xx (ops/sec) and the activity of the
   remote system is at least xx (ops/sec) higher than its own activity.
   This sets the "remaster pending" flag.

Every 8 seconds the lock manager looks at the resource trees it masters and
can move up to 5 of them during each such run.  A tree it masters is moved
if one of the following two situations is true:
 - the "check for better master" flag is set
 - the "remaster pending" flag is set and
    a) the benefit outweighs the cost, or
    b) the benefit does not outweigh the cost, but the same system has set
       the "remaster pending" flag three times during the previous 24
       seconds

The cost is the number of locks that need to move; the benefit is 16 times
the difference in activity between the new master and the current master.

A different form of remastery concerns trees that are mastered by one node
but used by just a single other node.  In this case it makes sense to move
the mastery of that tree to the node using the tree.  VMS does this once a
minute.

??? The remastering of locks may require some form of two phase commit
between the resource directory, the current master and the new master.  How
can this be organized?
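The per-tree decision in that periodic scan could look roughly like this;
the structure fields and the should_remaster() name are placeholders, with
the thresholds taken from the description above:

    /* Per-tree statistics kept by the current master about one remote
     * candidate master. */
    struct tree_stats {
            int better_master;     /* "check for better master" flag set     */
            int remaster_pending;  /* "remaster pending" flag set            */
            int pending_marks;     /* times the same system set that flag in
                                      the previous 24 seconds                */
            int nlocks;            /* locks that would have to move (cost)   */
            int local_activity;    /* our ops/sec on this tree               */
            int remote_activity;   /* the candidate's ops/sec on this tree   */
    };

    static int should_remaster(const struct tree_stats *t)
    {
            int cost    = t->nlocks;
            int benefit = 16 * (t->remote_activity - t->local_activity);

            if (t->better_master)
                    return 1;
            if (t->remaster_pending && (benefit > cost || t->pending_marks >= 3))
                    return 1;
            return 0;
    }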