Friday, January 1, 2016

How to make a RAID array failsafe

The computer that we used to manage the source code for our Defense Nuclear Agency prototype was an IBM System 95 server.  One of the nice features of this system was a three-drive RAID array.  The special property of this array was that if one of the drives failed, the contents of that drive were recomputed on the fly from the other two good drives.  The drives were hot-swappable, so it was possible to eject the bad drive and replace it with a new one while the system was running.  The new drive would then be formatted and its contents rebuilt from the other two drives while the system was in use.  Very cool technology!
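For the curious, here is a minimal sketch of how that kind of rebuild can work.  I am assuming a parity-based scheme in the style of RAID 5, where the parity is the XOR of the data blocks; I don't know exactly how the IBM controller did it.  The function name xor_blocks and the sample data are made up for illustration.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Two data blocks and the parity block computed from them
# (hypothetical contents, one block per drive).
drive_a = b"source c"
drive_b = b"ode v1.0"
parity = xor_blocks(drive_a, drive_b)

# Simulate losing drive_a: XOR of the two survivors yields the lost block.
rebuilt_a = xor_blocks(drive_b, parity)
print(rebuilt_a == drive_a)
```

The same trick recovers any one of the three blocks from the other two, which is also why losing a second drive before the rebuild finishes means losing data.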

One day we decided to reboot the server.  When we did, the OS/2 operating system produced an error message on startup saying that one of the drives in the RAID array had failed.  We were shocked.  We should have received an indication of this when the drive actually failed.

So what happened?  These special hot-swappable hard drives have an error light that turns on and a piezo buzzer that sounds when the drive fails.  The trouble is that the drive failed so completely that its piezo buzzer failed too.  We never heard the audible alarm that we were supposed to hear.  This was a vulnerable moment for us.  If one more of the hard drives had failed, we would have lost work.  We did back up our code periodically to an external cartridge, but we had not done so for weeks.

So clearly IBM had some work to do in perfecting its RAID products.  For example, if the piezo buzzer on each drive had been designed to sound when any of the other drives failed, the alarm might have proven much more effective, since it would not have depended on the failed drive itself.
