"The duration of this rebuild operation is important because a quick rebuild lowers the probability that data will be lost through a second drive failure before the rebuild completes."
That's a quote from one of those bought-and-paid-for "technical reports" that some vendors like to use as marketing tools.
The problem is that it's not true (or, at the very least, it's not as true as they'd like it to be). The unfortunate fact is that improving RAID rebuild times does close to nothing to reduce the risk of double disk failures in parity RAID groups.
Here's the inaccurate theory:
You have a parity RAID group, and one disk fails. The array immediately picks out a hot spare and starts rebuilding the data from the lost disk, but it's working without a net until that data can be rebuilt. Murphy's law kicks in, and all of a sudden a disk in the SAME RAID GROUP has a head crash, so you can no longer rebuild your data. You then go get your last backup tape and update your resume while the restore progresses.
The problem is that the math doesn't match the anecdotal frequency of the notorious double disk failure. Basically, that kind of thing shouldn't happen very often at all. Seagate advertises a 1.2 million hour MTBF on the Barracuda ES.2 SATA drives (the kind they sell to storage vendors like EMC, IBM, and others). Let's assume that number is, well, optimistic and divide it by 3 for a 400,000 hour MTBF. Then let's assume that it takes 48 hours to complete a 5-disk RAID group rebuild. The chance of one of those 4 surviving members failing during those 48 hours is:
(48*4)/400,000 = .00048, or .048%, or once in about 2,100 rebuilds.
Now that's a lot of rebuilds before you actually hit a problem. Keep in mind that we're actually using 1/3 of Seagate's MTBF rating for its least expensive drive. But I can't seem to go a week without meeting with someone that's had one of these bite them in the last two years. I'll grant you that a lot of the people I talk to meet with me exactly because they're unhappy with their storage vendor, but it's still a huge discrepancy.
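For anyone who wants to plug in their own numbers, here's a minimal sketch of the MTBF arithmetic above. The function name is just for illustration, and it uses the same simple linear approximation as the back-of-the-envelope figure (disk-hours of exposure divided by MTBF), which is only reasonable when the rebuild window is tiny compared to the MTBF.

```python
def second_failure_probability(surviving_disks, rebuild_hours, mtbf_hours):
    """Approximate chance that any surviving disk fails outright during the rebuild.

    Linear approximation: total disk-hours of exposure divided by the MTBF.
    """
    return surviving_disks * rebuild_hours / mtbf_hours


# Figures from the text: 4 surviving disks, 48-hour rebuild,
# and a pessimistic 400,000-hour MTBF (one third of the advertised 1.2 million).
p = second_failure_probability(surviving_disks=4, rebuild_hours=48, mtbf_hours=400_000)
print(f"probability per rebuild: {p:.5f} ({p:.3%})")
print(f"roughly one failure in {round(1 / p):,} rebuilds")
```

Try it with the advertised 1.2 million hour MTBF, or with a shorter rebuild window, to see how small this number stays.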
Some folks will blame the second failure on the increased duty cycle during a rebuild, or will reach for bathtub curves and "bad batch of disks" theories. But that's all missing the point.
What's really happening:
The truth is actually pretty mundane - it has to do with undetected errors on the surviving disks in the RAID group. When you do a rebuild from parity, part of that process means that you have to read every single bit on every single surviving disk in the RAID group. If you can't read even one of those bits, you're not going to be able to determine what was on the failed disk. And the array usually reports it as a double disk failure. Which, in fact, it is. You have one completely failed disk, followed by a read error on a surviving disk.
Disk vendors measure this stuff, and publish it in the spec sheets for their drives. It's expressed by Seagate, for example, as "Nonrecoverable Read Errors per Bits Read." Basically, this is what appears to be a huge number that represents the average number of bits you can expect to successfully read off the disk before you encounter a bit you can't read.
This happens all the time - we just don't notice it until it happens during a rebuild. Under normal operations, if an array sends a read request to a disk and doesn't get a response, it can always just rebuild the requested data from parity or the mirror. It's just a little burp, but if you're rebuilding a parity RAID group, the array goes from a little burp to a full-on run for the bathroom.
And this brings the math closer to what we all would expect from experience. For example, if we can expect that we'll encounter a read error with every 100 terabytes of data read, and we know that the rebuild of a 4+1 RAID group made up of 1TB drives is going to require 4 terabytes of data to be read, we'll see about one in 25 rebuilds will fail (unless the array vendor has some magic to help out).
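Here's the same read-error estimate as a small sketch. The names are mine, and the "one error per 100 terabytes" figure is the example rate from the paragraph above, not a number from any particular spec sheet. I'm also assuming read errors are independent and rare, so the chance of hitting at least one is modeled as 1 - exp(-expected errors); for small values that lands very close to the simple ratio.

```python
import math


def rebuild_read_error_probability(data_read_tb, tb_per_read_error):
    """Chance of hitting at least one unreadable spot while reading data_read_tb.

    Assumes independent, rare errors (a Poisson-style model), which is a
    simplification of how real drives behave.
    """
    expected_errors = data_read_tb / tb_per_read_error
    return 1 - math.exp(-expected_errors)


# Example from the text: a 4+1 group of 1TB drives means 4TB must be read,
# with an assumed rate of one read error per 100TB read.
data_read_tb = 4
tb_per_read_error = 100

simple_ratio = data_read_tb / tb_per_read_error
p = rebuild_read_error_probability(data_read_tb, tb_per_read_error)

print(f"simple ratio:        {simple_ratio:.4f} (1 in {1 / simple_ratio:.0f})")
print(f"at least one error:  {p:.4f}")
```

Either way you slice it, the answer is on the order of a few percent per rebuild, which is a very different world from the MTBF-based estimate.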
The implications:
So, knowing this, what can we expect? Well, first, we can expect that faster parity rebuilds don't translate into higher reliability - most of the time, whether a second drive "fails" is a matter of fate that's set before the first disk ever fails - a faster rebuild only means that you'll find out about the failure a little sooner. Second, we know that our exposure to the problem directly correlates to the size of the disks involved - the bigger the disks, or more accurately, the more data in the RAID group, the higher the chance you'll hit one of these read errors.
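To put the two risks side by side, here's a rough sketch that combines them and shows what happens as the rebuild gets faster. Everything here is an assumption for illustration: the function name, the default figures (4 surviving disks, 400,000-hour MTBF, 4TB read, one read error per 100TB), and the idea that the two failure modes are independent.

```python
import math


def rebuild_risk(rebuild_hours, surviving_disks=4, mtbf_hours=400_000,
                 data_read_tb=4, tb_per_read_error=100):
    """Return (whole_disk_risk, read_error_risk, combined_risk) for one rebuild."""
    # Risk that a surviving disk dies outright: depends on the rebuild time.
    whole_disk = surviving_disks * rebuild_hours / mtbf_hours
    # Risk of an unreadable spot: depends only on how much data must be read.
    read_error = 1 - math.exp(-data_read_tb / tb_per_read_error)
    # Treat the two as independent and ask for "either one happens".
    combined = 1 - (1 - whole_disk) * (1 - read_error)
    return whole_disk, read_error, combined


for hours in (48, 24, 12):
    whole_disk, read_error, combined = rebuild_risk(hours)
    print(f"{hours:>2}h rebuild: disk failure {whole_disk:.5f}, "
          f"read error {read_error:.5f}, combined {combined:.5f}")
```

Under these assumptions, cutting the rebuild time to a quarter barely moves the combined number, because the read-error term doesn't care how fast you read. Doubling data_read_tb, on the other hand, moves it a lot.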
What can be done to reduce the risk:
There are a number of techniques that can be employed to reduce the risk. Some vendors implement them, some don't, but I'd certainly use them as evaluation criteria for any storage buy, whether direct-attached or SAN-attached.
- Background verification - This is a process that verifies the integrity of the data on disk on an ongoing basis, the goal being to find the latent errors on the drive while you still have the ability to recover from them.
- Predictive disk failure - This is when the array pre-emptively fails a troubled disk before it actually fails. Usually, the data are copied from the failing disk, rather than being rebuilt from parity. This reduces the time to fail out the disk and reduces load on the array since parity is not re-calculated. Most importantly, you can suffer a read error during the process and the RAID group can survive.
- RAID-6 - Since parity is calculated twice, you can survive a read error at any point in time. The chance that the same block would be bad in two places is statistically insignificant.
In most environments, a combination of verification and predictive disk failure makes double disk failure a very remote possibility. RAID-6 becomes attractive when you have a need for very large RAID groups with very large disks with relatively high unrecoverable read error rates (typically SATA, although the vendors have improved those numbers significantly in the last few generations).
As for the white paper and vendor from which this came, I'm not sure what to think. The vendor is trying to mislead customers into thinking that there is a linear relationship between rebuild rates and data availability in a parity RAID scheme. Either this is a shell game to hide the fact they don't have enterprise class features on their arrays or they really don't know the true cause of double disk failures. The first is pretty shameful, and the second is pretty scary.