


Kbase P117906: A note about RAID 5
Author   Progress Software Corporation - Progress
Access   Public
Published   10/10/2008
Status: Verified

GOAL:

A note about RAID 5

Information regarding RAID 5

FACT(s) (Environment):

All Supported Operating Systems
Progress/OpenEdge Versions

FIX:

Hardware performance improvements (CPU, memory, disk controllers, disks, etc.) have affected RAID 5 performance characteristics as well.

When Progress first began doing database performance work in 1990, it was on a Sequent box with 16 MHz 386 processors and many 330 megabyte Fujitsu SCSI disks spinning at around 2,400 rpm. SCSI controllers were 5 MB/sec then.

Today, processors are almost 250 times faster, drives spin 4 times faster, disk controllers run at 320 MB/sec, drives have a thousand times the capacity, and there are vast amounts of memory. File systems have improved considerably as well. Single-disk benchmarks commonly achieve 50 megabytes per second or more when writing sequentially.

In general, disk capacity has grown at a rate of 100x per decade, and disk bandwidth has grown at 10x per decade. Even so, one should be extremely cautious when considering RAID 5, particularly for smallish arrays.

Digression:
Note that at 50 megabytes per second, assuming it can sustain that rate indefinitely, which is unlikely, it would take 2 hours and 48 minutes to read the entire contents of a 500 GB disk drive.
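
For reference, a quick sketch of that arithmetic in Python (taking 500 GB as 500,000 MB; the 2:48 figure above presumably rounds or uses slightly different units):

capacity_mb = 500 * 1000          # 500 GB expressed in MB
rate_mb_per_sec = 50              # assumed sustained sequential read rate

seconds = capacity_mb / rate_mb_per_sec
hours, minutes = divmod(seconds / 60, 60)
print(f"{int(hours)} h {int(minutes)} min")   # prints: 2 h 46 min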

In a simple RAID 5 array, up to a point, write performance is a function of the number of disk drives in the array. Let's look at a couple of examples:

Small array
Assume 4 disk drives configured for RAID 5 (3 drives is the smallest possible RAID 5 array), with a passable disk controller (the cheapest ones aren't). Disk blocks are striped over the 4 disks, and for every 3 data blocks there is one "parity block". 25 percent of the storage space is used for parity blocks.

The parity block contains an exclusive or (xor) of the contents of the other 3 blocks (other types of parity schemes can also be used). The parity blocks are arranged so that they rotate among the drives and are evenly distributed over them. When you xor 3 blocks together to produce a 4th block, you can recover any one of the 4 blocks by xor'ing the remaining 3 together. So if a single drive fails, its contents can be reconstructed by reading all the data on the remaining drives and doing the appropriate calculation.
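
As a rough illustration of the xor parity idea (not how any particular controller implements it, and the block contents here are made up):

def xor_blocks(*blocks):
    # Byte-wise xor of equally sized blocks.
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

d0 = bytes([0x11, 0x22, 0x33, 0x44])   # data block on drive 0
d1 = bytes([0xAA, 0xBB, 0xCC, 0xDD])   # data block on drive 1
d2 = bytes([0x01, 0x02, 0x03, 0x04])   # data block on drive 2

parity = xor_blocks(d0, d1, d2)        # parity block, stored on drive 3

# Simulate losing drive 1: xor'ing the surviving blocks with the parity
# block reproduces the missing data.
assert xor_blocks(d0, d2, parity) == d1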

When you write a data block to one of the disks, you also have to write a new parity block to another disk. That is done by subtracting out the old data and adding in the new (xor is like an add without carrying). Assuming the previous copy of the block being written and the copy of the parity block are both in cache, just two disk writes are required. If not, one or both have to be read first.
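
Continuing the sketch above, the "subtract out the old, add in the new" step is itself just another xor, because a block xor'ed with itself cancels out:

new_d1 = bytes([0x55, 0x66, 0x77, 0x88])        # new contents for drive 1

# new parity = old parity xor old data xor new data
new_parity = xor_blocks(parity, d1, new_d1)

# Same result as recomputing the parity from scratch.
assert new_parity == xor_blocks(d0, new_d1, d2)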

That means with 4 drives, in the best case, you can do two writes at a time, if you are writing to two disjoint pairs of drives and the controller can support that. Not all writes will be disjoint, so sometimes a write has to wait its turn.

Or, if nothing is writing, you can in theory read from all 4 drives at the same time. Some controllers can almost accomplish that and others can't.
The write bandwidth for the 4-disk RAID 5 array is thus a bit less than half the read bandwidth and about the same as for two separate drives.


Large array
Now make the array bigger. Assume the RAID 5 array has 20 disk drives.
It has one parity block for every 19 data blocks. 5 percent of the storage space is used for parity.

As before, writing a data block requires updating the parity block as well.
That means you can do 10 simultaneous writes to disjoint pairs of drives, or 20 simultaneous reads. In theory. Most controllers can't do that, so you need a fancier controller in the array, or multiple controllers behind a fast interface of some sort, like a Fibre Channel adapter.
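
A back-of-the-envelope sketch of the best-case figures used in both examples, assuming a simple layout with one parity block per stripe and each write touching a disjoint data/parity pair:

def raid5_best_case(n_drives):
    parity_overhead = 1.0 / n_drives     # fraction of space used for parity
    max_parallel_writes = n_drives // 2  # each write touches a disjoint pair of drives
    max_parallel_reads = n_drives        # one read outstanding per drive
    return parity_overhead, max_parallel_writes, max_parallel_reads

for n in (4, 20):
    overhead, writes, reads = raid5_best_case(n)
    print(f"{n} drives: {overhead:.0%} parity, "
          f"{writes} simultaneous writes, {reads} simultaneous reads")
# 4 drives: 25% parity, 2 simultaneous writes, 4 simultaneous reads
# 20 drives: 5% parity, 10 simultaneous writes, 20 simultaneous reads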

There are other complications. There's the problem of those parity blocks.
If they aren't cached, they have to be read from disk, which is not a good thing. And, since pairs of blocks are being written and one write might fail, the controller has to do some sort of two-phase commit so it knows which block to recover when a failure occurs and only one of the two blocks was written. So it needs some cache memory to log the two-phase commit and to keep copies of blocks that are being written. That cache memory has to be backed up with battery power in case the power fails. How long the array can survive without power varies greatly from one product to another.

That memory can also be used to cache some reads, to avoid disk reads when the cache can satisfy them. But: a disk array of 20 140 GB SCSI drives has a total capacity of 2,800 GB. A 2 GB cache can hold only a small fraction of that. Yes, you can make the cache bigger, but even with 32 GB you only have just over 1 percent. So the cache efficiency will usually be limited, even taking into account that you probably aren't filling the whole array with data and the part being actively used will be less than that /most/ of the time.
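
The cache-coverage arithmetic for the same 20 x 140 GB example works out like this:

drives, drive_gb = 20, 140
capacity_gb = drives * drive_gb                   # 2,800 GB in total

for cache_gb in (2, 32):
    print(f"{cache_gb} GB cache covers "
          f"{cache_gb / capacity_gb:.2%} of {capacity_gb} GB")
# 2 GB cache covers 0.07% of 2800 GB
# 32 GB cache covers 1.14% of 2800 GB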

All of that adds to the expense and reduces the overall throughput a bit, as compared to the simple calculation above.

Still, with 20 drives, the overall performance will be significantly better than with just 4. Not too surprising.

It is quite possible that the /normal/ write workload on the 20 disk array is low enough that you can get acceptable performance from a RAID 5 setup. This is highly unlikely with a 4 drive array.

But several other problems remain.

When a drive fails (and one will, every now and then), it must be replaced and its contents reconstructed. While that is being done, the contents of all 19 other drives have to be read. The disk array might conceivably have dual-port controllers in it to reduce interference; that costs more. Still, it will take a fairly long time to read all the disks, and during that time performance for normal operations will be reduced.
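
To put a very rough number on "a fairly long time", here is a hedged estimate using the 140 GB drives from the example and an assumed 50 MB/sec sustained read rate per drive; both figures are illustrative, and a real rebuild is slower because it competes with normal work:

drive_gb = 140            # per-drive capacity from the example above
read_mb_per_sec = 50      # assumed sustained read rate per drive

# Best case: the 19 surviving drives are read in parallel, so the elapsed
# time is governed by reading one whole drive end to end.
seconds = (drive_gb * 1000) / read_mb_per_sec
print(f"~{seconds / 3600:.1f} hours just reading, before parity math, "
      f"writing the replacement drive, or serving normal work")
# ~0.8 hours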

For any type of bulk write operation, like disk-to-disk backups, dumps and loads, etc., write performance may also become a problem. Writes can be cached, and that helps, but it also reduces the effectiveness of the cache for reading.

A simple RAID 5 array can tolerate /one/ disk failure at a time. But as you make the array bigger, the probability of two failures goes up. With RAID 10, by striping mirrored pairs, the array can tolerate multiple failures as long as you don't lose two drives in the same pair.
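
A small illustration of that point with the 20 drive example: if two drives fail close together, a simple RAID 5 array always loses data, while a RAID 10 array of 10 mirrored pairs loses data only when both failures happen to land in the same pair:

from math import comb

drives = 20
pairs = drives // 2                      # RAID 10: 10 mirrored pairs
double_failures = comb(drives, 2)        # 190 ways two drives can fail together
fatal_for_raid10 = pairs                 # fatal only if both halves of one pair fail

print(f"RAID 5 : all {double_failures} double failures lose data")
print(f"RAID 10: {fatal_for_raid10} of {double_failures} "
      f"({fatal_for_raid10 / double_failures:.1%}) lose data")
# RAID 10: 10 of 190 (5.3%) lose data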

To sum up, decisions have gotten more complicated. The storage vendors increasingly emphasize other things and talk less and less about performance. Yes, you can get acceptable performance from a RAID 5 array if it is big enough, but there are still disadvantages. Regardless of what kind of array is used for databases, forget about capacity and worry about throughput and reliability. The more drives the better.