Making a Service Highly Available
So, you have a service, and would like to make it highly available. How
do you go about it? This document is a step-by-step guide for
doing just that.
An Overview of High-Availability Concepts
High-Availability systems basically do three things:
- Start and stop resources (services) so that continuous operation is
provided
- Monitor servers so that crashes and hangs are detected
- Monitor resources so that you know whether the service is really working.
When a cluster first comes up, the HA system starts the services up. When
a service or server stops, the HA system simply restarts the service on a
working machine. In this way, it is analogous to a cluster-wide init
process. From the point of view of the application and the clients
of the services, it is as though the application crashed and restarted. If
your application and its clients can tolerate a very quick crash-and-restart
cycle, then you can make your application highly available.
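The "cluster-wide init" analogy above can be sketched in a few lines of
shell: check whether the service is still alive, and restart it if it has
died. This is only an illustration of the idea, not a real HA system; the
pidfile path and command are whatever your service uses.

```shell
# Minimal sketch of what an HA system does for one resource: if the
# service has died, restart it -- much like init respawning a process.
ensure_running() {
    # $1 = pidfile; remaining args = command that starts the service
    pidfile=$1; shift
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        return 0                 # still alive; nothing to do
    fi
    "$@" &                       # (re)start the service in the background
    echo $! > "$pidfile"         # remember its pid for the next check
}
```

A real HA system runs a check like this periodically, and additionally moves
the service to another machine when the whole node fails.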
There are really only two things that you have to do to make a service highly
available:
- Figure out the complete set of resources that are needed to make the service
run
- Make sure those resources are available to every machine that might
run the service
Examples of resources which one might need to make available to support a
given service:
- IP addresses
- Internet addresses
- Databases
- Configuration files
- Service software (binaries, help files, etc.)
- Devices (disks, tapes, printers, etc.)
Before you can make an application highly available, you have to know what
resources it takes to make the application run correctly.
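As a concrete example of the first resource in the list above, a common way
to "make an IP address available" to whichever node currently runs the
service is IP address takeover: the active node adds a floating alias
address. The sketch below only builds the iproute2 command; the address
10.0.0.50/24 and interface eth0 are made-up examples, and actually adding
an address requires root.

```shell
# Build the iproute2 command an HA takeover script would run to claim a
# floating service address.  Address and interface are illustrative.
ip_takeover_cmd() {
    addr=$1   # e.g. 10.0.0.50/24 -- the service's floating address
    dev=$2    # e.g. eth0         -- interface to add it on
    echo "ip addr add $addr dev $dev"
}
# To actually take the address over (as root):
#   eval "$(ip_takeover_cmd 10.0.0.50/24 eth0)"
```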
Step 1: Figure out what resources your application needs to run.
Sit down with an editor or piece of paper or a whiteboard or whatever you
like best. Think about your application architecture.
- Where does it get its input?
- Where does it store its information?
- Where does it perform its output?
- How are the components connected?
- How does it start up?
- What happens to clients when your server crashes and restarts?
Now that you know what resources you need to run, you need to decide how
to make them available to at least two nodes of your cluster.
There are basically two different techniques for making the resources available
on the various nodes of the cluster:
- Duplicate them
- Share them
Duplicating Resources
If you choose to duplicate a resource, you have to worry about the effects
of the two different copies of the resource being different. If it's
OK for them to be different, then you're lucky. If they merely have to be
in reasonable synchronization, you can still count yourself fortunate. If
they have to be exactly (or very nearly) identical, then you have a more
interesting problem which may take some thought to solve effectively. In general,
things are easy when there is little data which changes infrequently, and
hard when there is a great deal of data which changes frequently.
There are several basic techniques which can be used for duplicating resources
between nodes of a cluster. They are:
- Manual copying (installing from a common CD, etc.)
- Automated replication from a common "master" source (using rsync, scp,
etc.)
- Application-specific (built-in) replication techniques (DNS, NIS, LDAP,
etc.)
- General replication techniques (scheduled rsync, DRBD, etc.)
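As an illustration of the "automated replication from a master" technique,
a periodic push with rsync could look like the sketch below. The host name
node2 and the paths are assumptions; in practice a script like this would
run from cron or a similar scheduler on the master node.

```shell
# Push one directory tree from the "master" copy to a replica, deleting
# files that no longer exist on the master.  The destination may be a
# local path or a remote host:path (node2 is a made-up host name).
replicate() {
    src=$1; dst=$2
    rsync -a --delete "$src"/ "$dst"/
}
# e.g. from cron on the master:
#   replicate /etc/myservice node2:/etc/myservice
```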
Let's take a few examples of things which are progressively more difficult
to duplicate effectively.
Application/service software
It is very common to install certain pieces of application software on both
machines of a failover pair. It is easy to do, and since most software
changes relatively infrequently, keeping the copies current takes little
effort. A little more thought, however, brings up a few considerations
about even this simple case.
- What are the implications for licensing the software?
- What is the effect of the two copies of the software being out of sync
when a failover occurs?
- What procedures are you going to implement to make sure both copies
of the software get updated?
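For the last point, a simple way to notice that the two installed copies
have drifted apart is to compare a checksum of each software tree. One
possible sketch (md5sum from GNU coreutils is assumed; the path /opt/app
is illustrative):

```shell
# Order-independent checksum of every regular file under a directory.
# Run it on each node and compare the results to detect out-of-sync
# copies of the software.
tree_sum() {
    (cd "$1" && find . -type f -exec md5sum {} + | sort | md5sum)
}
# e.g. compare tree_sum output from both nodes (say via ssh) and alert
# if they differ.
```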
Configuration Information
Most applications have some kind of configuration information. Fortunately,
configuration information isn't often bulky, so it doesn't often pose problems