Making a Service Highly Available

So, you have a service and would like to make it highly available.  How do you go about it?  This document is a step-by-step guide for doing just that.

An Overview of High-Availability Concepts

High-Availability systems basically do three things:
  1. Start and stop resources (services) so that continuous operation is provided.
  2. Monitor servers so that crashes and hangs are detected.
  3. Monitor resources so that it is known whether the service is really working.

When a cluster first comes up, the HA system starts the services.  When a service or server stops, the HA system simply restarts the service on a working machine.  In this way, it is analogous to a cluster-wide init process.  From the point of view of the application and the clients of the service, it is as though the application crashed and restarted.  If your application and clients can tolerate a very quick crash-and-reboot cycle, then you can make your application highly available.
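
To make the "cluster-wide init process" idea concrete, here is a minimal sketch of one monitoring pass.  It is only an illustration: the names Node, place and recover are made up for this example and do not come from any particular HA package, and a real HA system does a great deal more (membership, fencing, and so on).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    alive: bool = True                          # result of server monitoring
    running: set = field(default_factory=set)   # services started on this node


def place(service, nodes):
    """Start `service` on the first working node; return that node or None."""
    for node in nodes:
        if node.alive:
            node.running.add(service)
            return node
    return None


def recover(service, nodes, service_ok):
    """One monitoring pass.  `service_ok` is the resource monitor's verdict.

    If the node hosting the service is down, or the service is not working,
    stop it where it was and start it again on the first working node
    (which may be the same node if only the service failed).
    """
    current = next((n for n in nodes if service in n.running), None)
    if current is not None and current.alive and service_ok:
        return current                          # healthy: nothing to do
    if current is not None:
        current.running.discard(service)        # stop it where it was
    return place(service, nodes)


# Hypothetical two-node cluster running one service.
nodes = [Node("node1"), Node("node2")]
place("httpd", nodes)                           # cluster start-up: runs on node1
nodes[0].alive = False                          # node1 crashes
print(recover("httpd", nodes, service_ok=True).name)   # node2
```

From the client's point of view, the only visible effect is the short gap between the crash and the restart on node2.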

There are really only two things that you have to do to make a service highly available:
  1. Figure out what resources your application needs to run.
  2. Make those resources available to at least two nodes of your cluster.

Step 1: Figure out what resources your application needs to run.

Before you can make an application highly available, you have to know what resources it takes to make the application run correctly.  Sit down with an editor or piece of paper or a whiteboard or whatever you like best.  Think about your application architecture.  Examples of resources which one might need to make available to support a given service include the application software itself and its configuration information.

Step 2: Make the resources available to at least two nodes of your cluster.

Now that you know what resources you need to run, you need to decide how to make them available to at least two nodes of your cluster.
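
If you prefer an editor to a whiteboard, the inventory can be written down as data and checked mechanically.  The following is a rough sketch, not a tool from any HA package: the Resource class, the node names and the "service IP address" entry are all hypothetical, chosen only to show the "at least two nodes" check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    nodes: frozenset    # names of the cluster nodes where the resource is available


def under_provisioned(resources, minimum=2):
    """Return the names of resources available on fewer than `minimum` nodes."""
    return [r.name for r in resources if len(r.nodes) < minimum]


# Hypothetical inventory for a two-node cluster.
inventory = [
    Resource("application software", frozenset({"node1", "node2"})),
    Resource("configuration information", frozenset({"node1", "node2"})),
    Resource("service IP address", frozenset({"node1"})),
]
print(under_provisioned(inventory))   # ['service IP address']
```

Anything the check reports is a resource that would be missing after a failover, so it still needs a decision about how to make it available on a second node.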

There are basically two different techniques for making the resources available on the various nodes of the cluster:

Duplicating Resources

If you choose to duplicate a resource, you have to worry about the effects of the two copies of the resource being different.  If it's OK for them to be different, then you're lucky.  If they have to be in reasonable synchronization, you can still count yourself fortunate.  If they have to be exactly (or nearly) identical, then you have a more interesting problem which may take some thought to solve effectively.  In general, things are easy when there is little data which changes infrequently, and hard when there is a great deal of data which changes frequently.
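
One way to find out how far two copies have drifted apart is to compare checksums.  The sketch below assumes, purely for illustration, that both copies are visible as directories from the machine doing the check (for example, after copying or mounting one of them); digest_tree and drift are names invented for this example.

```python
import hashlib
from pathlib import Path


def digest_tree(root):
    """Map each file's path (relative to `root`) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def drift(copy_a, copy_b):
    """Return the sorted relative paths that are missing from one copy or differ."""
    a, b = digest_tree(copy_a), digest_tree(copy_b)
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))
```

An empty result means the two copies are identical.  A check like this only detects differences; it is suited to small, slowly changing data, and reading every file becomes expensive when there is a great deal of data which changes frequently.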

There are several basic techniques which can be used for duplicating resources between nodes of a cluster.

Let's take a few examples of things which are progressively more difficult to duplicate effectively.

Application/service software

It is very common to install certain pieces of application software on both machines of a failover pair.  It is easy to do, and since most software changes relatively infrequently, keeping the two copies in step is pretty simple.  Even so, a little more thought brings up a few more considerations about this simple case.

Configuration Information

Most applications have some kind of configuration information.  Fortunately, since configuration information isn't often bulky, it doesn't often pose problems