Monthly Archives: July 2009

Things you never wish to see shown on your RAID LED display

84 DRIVE FAILURE BOX #1, BAY1
102 VOLUME #0 STATE INTERIM RECOVERY
102 VOLUME #1 STATE INTERIM RECOVERY
102 VOLUME #2 STATE INTERIM RECOVERY
102 VOLUME #3 STATE INTERIM RECOVERY
102 VOLUME #4 STATE INTERIM RECOVERY
102 VOLUME #5 STATE INTERIM RECOVERY
84 DRIVE FAILURE BOX #1, BAY2
84 DRIVE FAILURE BOX #1, BAY3
101 VOLUME #0 STATE FAILED
101 VOLUME #1 STATE FAILED
101 VOLUME #2 STATE FAILED
101 VOLUME #3 STATE FAILED
101 VOLUME #4 STATE FAILED
101 VOLUME #5 STATE FAILED
84 DRIVE FAILURE BOX #1, BAY4
84 DRIVE FAILURE BOX #1, BAY5
84 DRIVE FAILURE BOX #1, BAY6
84 DRIVE FAILURE BOX #1, BAY7
84 DRIVE FAILURE BOX #1, BAY8
84 DRIVE FAILURE BOX #1, BAY9
84 DRIVE FAILURE BOX #1, BAY10
84 DRIVE FAILURE BOX #1, BAY11
84 DRIVE FAILURE BOX #1, BAY12

Pretty ugly, eh?  It’s the kind of error that brings me out in a cold sweat every time I get emails from our users.  Generally they are complaints that the databases are running slowly, that files are disappearing from directories, that home directories are empty, or that the filesystems have become read-only.

Of course, when I go to look at the machine, the display tells me that an entire box of drives (we have two boxes with 12 drives in each) has apparently failed all at once.  The RAID volumes can’t survive such a loss of drives, hence we see INTERIM RECOVERY followed by STATE FAILED as more drives drop out of the array.
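For what it’s worth, here is a minimal sketch of how that display output could be boiled down to "which bays dropped out, and what state did each volume end up in".  It assumes you have somehow captured the messages as plain text, and the message formats are inferred only from the lines quoted above, not from any HP documentation.  The `summarise` function and the regular expressions are my own hypothetical names, not part of any HP tool.

```python
import re

# Message formats inferred from the display output quoted in this post only.
# Lines that match neither pattern are ignored.
DRIVE_RE = re.compile(r"^(\d+) DRIVE FAILURE BOX #(\d+), BAY\s*(\d+)$")
VOLUME_RE = re.compile(r"^(\d+) VOLUME #(\d+) STATE (.+)$")


def summarise(lines):
    """Return (failed bays per box, last reported state per volume)."""
    failed_bays = {}
    volume_state = {}
    for raw in lines:
        line = raw.strip()
        m = DRIVE_RE.match(line)
        if m:
            box, bay = int(m.group(2)), int(m.group(3))
            failed_bays.setdefault(box, set()).add(bay)
            continue
        m = VOLUME_RE.match(line)
        if m:
            # Later messages overwrite earlier ones, so we keep the final state.
            volume_state[int(m.group(2))] = m.group(3)
    return failed_bays, volume_state


# A shortened excerpt of the output shown at the top of this post.
sample = """\
84 DRIVE FAILURE BOX #1, BAY1
102 VOLUME #0 STATE INTERIM RECOVERY
84 DRIVE FAILURE BOX #1, BAY2
101 VOLUME #0 STATE FAILED
""".splitlines()

bays, volumes = summarise(sample)
print(bays)
print(volumes)
```

Run over the full listing, this would report all twelve bays of box 1 as failed and every volume ending in the FAILED state, which is exactly the progression described above.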

The weird thing, of course, is that there’s nothing wrong with the drives at all.  They’re sat there blinking little green lights at me, telling me they are just fine.

The unit is an HP StorageWorks Modular Smart Array 1000, and I have to doubt the Smart moniker in this case, as it is the single most unreliable piece of hardware we own, apart from perhaps the HP blades it is attached to.  Apple RAID units, Transtec RAID units, all the RAID5’d servers seem to pretty much be able to hold themselves together, but not this one.

Every time this happens we get an engineer called out.  They plug a serial console into the unit, reset the error states on the drives and volumes, reboot the RAID, and everything comes up smelling of roses.  However, trying to get them to send an engineer out is an exercise in frustration.  It would of course be possible to effect this fix ourselves, given a laptop with a serial port and one of HP’s magical and deeply proprietary 259992-001 console serial cables.  Do we have one with our kit?  No.  How much do they cost?  About £120.  How much did we spend on the kit in the first place?  Well over £100,000.

I will never, ever buy or recommend the purchase of another bit of HP kit as long as I am in the position to do so.  Grr.