Making a Service Highly Available
So, you have a service, and would like to make it highly available. How
do you go about it? This document is a step-by-step guide for
doing just that.
An Overview of High-Availability Concepts
High-Availability systems basically do three things:
- Start and stop resources (services) so that continuous operation is
provided
- Monitor servers so that crashes and hangs are detected
- Monitor resources so that you know whether the service is really working.
When a cluster first comes up, the HA system starts the services up. When
a service or server stops, the HA system simply restarts the service on a
working machine. In this way, it is analogous to a cluster-wide init
process. From the point of view of the application and the clients
of the services, it is as though the application crashed and restarted. If
your application and its clients can tolerate a very quick crash-and-restart
cycle, then you can make your application highly available.
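The "cluster-wide init" analogy above can be sketched in a few lines of
shell: check whether the service is still alive, and restart it if it has
died. This is only an illustration of the idea, not a real HA system; the
pidfile path and command are whatever your service uses.

```shell
# Minimal sketch of what an HA system does for one resource: if the
# service has died, restart it -- much like init respawning a process.
ensure_running() {
    # $1 = pidfile; remaining args = command that starts the service
    pidfile=$1; shift
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        return 0                 # still alive; nothing to do
    fi
    "$@" &                       # (re)start the service in the background
    echo $! > "$pidfile"         # remember its pid for the next check
}
```

A real HA system runs a check like this periodically, and additionally moves
the service to another machine when the whole node fails.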
There are really only two things that you have to do to make a service highly
available:
- Figure out the complete set of resources that are needed to make the service
run
- Make sure those resources are available to every machine that might
run the service
Examples of resources which one might need to make available to support a
given service:
- IP addresses
- Internet addresses
- Databases
- Configuration files
- Service software (binaries, help files, etc.)
- Devices (disks, tapes, printers, etc.)
Before you can make an application highly available, you have to know what
resources it takes to make the application run correctly.
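As a concrete example of the first resource in the list above, a common way
to "make an IP address available" to whichever node currently runs the
service is IP address takeover: the active node adds a floating alias
address. The sketch below only builds the iproute2 command; the address
10.0.0.50/24 and interface eth0 are made-up examples, and actually adding
an address requires root.

```shell
# Build the iproute2 command an HA takeover script would run to claim a
# floating service address.  Address and interface are illustrative.
ip_takeover_cmd() {
    addr=$1   # e.g. 10.0.0.50/24 -- the service's floating address
    dev=$2    # e.g. eth0         -- interface to add it on
    echo "ip addr add $addr dev $dev"
}
# To actually take the address over (as root):
#   eval "$(ip_takeover_cmd 10.0.0.50/24 eth0)"
```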
Step 1: Figure out what resources your application needs to run.
Sit down with an editor or piece of paper or a whiteboard or whatever you
like best. Think about your application architecture.
- Where does it get its input?
- Where does it store its information?
- Where does it perform its output?
- How are the components connected?
- How does it start up?
- What happens to clients when your server crashes and restarts?
Now that you know what resources you need to run, you need to decide how
to make them available to at least two nodes of your cluster.
There are basically two different techniques for making the resources available
on the various nodes of the cluster:
- Duplicate them
- Share them
Duplicating Resources
If you choose to duplicate a resource, you have to worry about the effects
of the two different copies of the resource being different. If it's
OK for them to be different, then you're lucky. If they merely have to be
in reasonable synchronization, you can still count yourself fortunate. If
they have to be exactly (or very nearly) identical, then you have a more
interesting problem which may take some thought to solve effectively. In general,
things are easy when there is little data which changes infrequently, and
hard when there is a great deal of data which changes frequently.
There are several basic techniques which can be used for duplicating resources
between nodes of a cluster. They are:
- Manual copying (installing from a common CD, etc.)
- Automated replication from a common "master" source (using rsync, scp,
etc.)
- Application-specific (built-in) replication techniques (DNS, NIS, LDAP,
etc.)
- General replication techniques (scheduled rsync, DRBD, etc.)
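As an illustration of the "automated replication from a master" technique,
a periodic push with rsync could look like the sketch below. The host name
node2 and the paths are assumptions; in practice a script like this would
run from cron or a similar scheduler on the master node.

```shell
# Push one directory tree from the "master" copy to a replica, deleting
# files that no longer exist on the master.  The destination may be a
# local path or a remote host:path (node2 is a made-up host name).
replicate() {
    src=$1; dst=$2
    rsync -a --delete "$src"/ "$dst"/
}
# e.g. from cron on the master:
#   replicate /etc/myservice node2:/etc/myservice
```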
Let's take a few examples of things which are progressively more difficult
to duplicate effectively.
Application/service software
It is very common to install certain pieces of application software on both
machines of a failover pair. It is easy to do, and since most software
changes relatively infrequently, keeping the copies current takes little
effort. A little more thought, however, brings up a few considerations
about even this simple case.
- What are the implications for licensing the software?
- What is the effect of the two copies of the software being out of sync
when a failover occurs?
- What procedures are you going to implement to make sure both copies
of the software get updated?
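For the last point, a simple way to notice that the two installed copies
have drifted apart is to compare a checksum of each software tree. One
possible sketch (md5sum from GNU coreutils is assumed; the path /opt/app
is illustrative):

```shell
# Order-independent checksum of every regular file under a directory.
# Run it on each node and compare the results to detect out-of-sync
# copies of the software.
tree_sum() {
    (cd "$1" && find . -type f -exec md5sum {} + | sort | md5sum)
}
# e.g. compare tree_sum output from both nodes (say via ssh) and alert
# if they differ.
```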
Configuration Information
Most applications have some kind of configuration information. Fortunately,
configuration information isn't often bulky, so it doesn't often pose problems