Linux-HA: Proposed Development Roadmap


My desire is to see Linux-HA software be competitive with commercial HA systems within two years.  This means that by 1Q2001, Linux-HA will be listed and reviewed as a high-quality, viable solution to High-Availability issues in an article similar to the D. H. Brown HA system review.  This is a good goal and, from my perspective, an ambitious one.  For this time period, I intend to exclude Linux hardware, service, operations, support, and other related matters from consideration.

HA Functions

Here are the high-level system functions which D. H. Brown used to review their candidate HA systems.  Functions over which Linux has no control (like hardware configurations) have been separated out and put at the end of this document.

In order to create some kind of a priority order out of these tasks, I've divided them into three not-altogether-arbitrary categories.  These are:

Primary HA functions
  Things no self-respecting HA system can do without.
Secondary HA functions
  Things which are extremely nice, but not as centrally important as the Primary HA functions.
Ignored HA functions
  Things which are, from my perspective, outside the scope of Linux-HA for one reason or another.

Primary HA Functions

  1. Cluster Backup and Recovery -- node failover technologies
  2. Cluster Configurability -- ability to operate with a wide variety of hardware and software configurations
  3. HA administration -- tools and interfaces to ease cluster management
  4. Hardware and software RAID
  5. In-System Failure recovery -- ability to recover from errors within a node without failing over to another node

Secondary HA Functions

  1. In-System failure avoidance -- monitoring error rates and parameters to take components out of service before they fail
  2. In-system Service Processor Features -- enable reboot recovery and operations with vendor failure notification
  3. Single-System Image -- ability to make resources within the cluster appear to be present on all systems
  4. Disaster recovery -- remote data duplication, geographic mirroring, and remote failover
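
As an illustration of the first item above (in-system failure avoidance), here is a minimal sketch of the idea of watching an error rate and flagging a component before it fails outright.  Everything in it is an assumption made for illustration: the class name ErrorRateMonitor, the threshold, and the sliding time window are hypothetical and are not part of any existing Linux-HA code.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical monitor: flags a component when recent errors exceed a threshold."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors          # errors tolerated inside one window
        self.window_seconds = window_seconds  # length of the sliding window
        self._error_times = deque()           # timestamps of recent errors

    def record_error(self, now):
        """Record one error observed at time `now` (in seconds)."""
        self._error_times.append(now)

    def should_retire(self, now):
        """Return True if more than max_errors fell within the last window."""
        cutoff = now - self.window_seconds
        # Drop errors that are older than the window.
        while self._error_times and self._error_times[0] < cutoff:
            self._error_times.popleft()
        return len(self._error_times) > self.max_errors


monitor = ErrorRateMonitor(max_errors=3, window_seconds=60.0)
for t in (0.0, 10.0, 20.0):
    monitor.record_error(t)
early = monitor.should_retire(now=25.0)  # three errors: at the limit, not over it
monitor.record_error(30.0)
late = monitor.should_retire(now=35.0)   # four errors inside 60 seconds
```

A real facility would take its readings from hardware or driver counters and would need somewhere to report the decision, which is the infrastructure question this roadmap raises.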

Ignored HA Functions

  1. Cluster Concurrent Database Access -- support for parallel databases
  2. In-System Online Service

Current Linux HA assessment

Linux currently has facilities for performing portions of these services to some degree.  In most cases, the facilities for each of these functions are not well-integrated with each other into a system. Below is a table of facilities which Linux currently has some degree of support for.

| Facility | Linux technologies | Comments |
|---|---|---|
| Backup and Recovery | FAKE, heartbeat | Not integrated, and not enough.  The core cluster management subsystem is missing.  Without a journalling filesystem, we can't have shared disks in a practical sense.  CODA might be enough for the short term. |
| Configurability | bazaar :-) | I expect that Linux could eventually lead the pack here, because of the nature of the development model, and the fact that Linux runs on many hardware platforms.  I expect people to implement HA systems consisting of two PCs with a couple of serial cables.  Really need a good resource model. |
| HA Administration | Customized scripts (bash, Perl, etc.) | This is actually pretty much the same as saying that we don't have any, but our source is open.  I think this is pretty important. |
| Hardware and Software RAID | SCSI-based RAID, and the md driver.  Large-scale hardware RAID solutions tend to be vendor-specific.  Also, Linux will have CODA, which is analogous to a RAID facility. | We're actually in reasonable shape here, but I think we lack integration with large-scale RAID devices. |
| In-System Failure Recovery | ifconfig, FAKE | We have some basic tools to allow this to take place, but no management infrastructure to decide what to do and when to do it.  We have little or nothing to allow us to fail over disks to other controllers.  Device drivers could be a help with this item. |
| In-System Failure Avoidance | lm78 voltage and temperature monitors | Missing basic infrastructure to hook it into. |
| In-System Service Processor Features | watchdog driver and hardware devices | Not enough for some cases.  Should hook into the heartbeat driver. (oooh!) |
| Single-System Image | CODA and NIS automount maps do this to some extent | The usual meaning of this not-very-well-defined phrase would normally include more than having the filesystem maps look nearly the same.  I'm not sure of everything that this means. |
| Disaster Recovery | CODA could be a help here.  Multicasting.  Routing code (?) | Need multicast or other specialized heartbeat systems, along with fancy routing technologies.  This sounds like a big deal. |

As I read this, the thing that is critically missing is the centerpiece of the HA system -- the core HA system manager.  It is the item to which all the other pieces connect.
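
To make the idea of a central manager a little more concrete, here is a rough sketch of one possible shape for it: a component that records heartbeats from nodes, declares a node failed after a period of silence, and notifies whatever subsystems have registered an interest.  This is only an illustration under stated assumptions.  The class ClusterManager, its methods, and the timeout value are all hypothetical, and none of this describes an existing Linux-HA interface.

```python
class ClusterManager:
    """Hypothetical core manager: tracks node heartbeats and notifies subsystems."""

    def __init__(self, dead_after):
        self.dead_after = dead_after  # seconds of silence before a node is declared dead
        self._last_seen = {}          # node name -> time of its last heartbeat
        self._dead = set()            # nodes already reported as failed
        self._handlers = []           # callbacks that take the failed node's name

    def on_node_failure(self, handler):
        """Register a subsystem callback (for example, an IP takeover routine)."""
        self._handlers.append(handler)

    def heartbeat(self, node, now):
        """Record a heartbeat from `node`; a returning node is considered alive again."""
        self._last_seen[node] = now
        self._dead.discard(node)

    def check(self, now):
        """Declare silent nodes dead and notify each handler once per failure."""
        failed = []
        for node, seen in self._last_seen.items():
            if node not in self._dead and now - seen > self.dead_after:
                self._dead.add(node)
                failed.append(node)
        for node in failed:
            for handler in self._handlers:
                handler(node)
        return failed


takeovers = []
manager = ClusterManager(dead_after=5.0)
manager.on_node_failure(takeovers.append)  # stand-in for a real takeover action
manager.heartbeat("node-a", now=0.0)
manager.heartbeat("node-b", now=0.0)
manager.heartbeat("node-a", now=4.0)
first = manager.check(now=6.0)   # node-b has been silent for 6 seconds
second = manager.check(now=7.0)  # node-b was already reported, so nothing new
```

The point of the sketch is the shape rather than the details: the heartbeat source, the failure policy, and the recovery actions plug into one place instead of being separate scripts.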