March 5, 2008

On Parity Lost

I just finished reading Parity Lost and Parity Regained, a paper presented at FAST ’08 by authors from the University of Wisconsin, Madison (including the first and senior authors) and NetApp. The paper discusses parity pollution, where, in the words of the authors, “corrupt data in one block of a stripe spreads to other blocks through various parity calculations.”

With the NetApp employees as (middle) authors, the paper seems to lean slightly towards a NetApp perspective, but it does mention other filesystems, including ZFS, both specifically and later by implication when discussing filesystems that use parental checksums, RAID and scrubbing to protect data. (Interestingly, the first specific mention of ZFS contains a gaffe where they describe it as using “RAID-4” instead of RAID-Z.) The authors attempt to quantify the probability of loss or corruption – arriving at 0.486% probability per year for RAID-Scrub-Parent Checksum (implying ZFS) and 0.031% probability per year for RAID-Scrub-Block Checksum-Physical ID-Logical ID (WAFL) when using nearline drives in a RAID configuration with 4 data disks and 1 parity disk and a read-write-scrub access pattern of 0.2-0.2-0.6.
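To put those two annual figures side by side, here is a small back-of-the-envelope sketch. The two percentages are the ones quoted above; the multi-year extrapolation is my own and assumes each year is independent, which the paper does not claim. The names (`p_parent_checksum`, `p_wafl_style`, `cumulative`) are just labels I made up for this example.

```python
# Annual probabilities of loss or corruption as quoted from the paper,
# converted from percent to fractions.
p_parent_checksum = 0.486 / 100   # RAID-Scrub-Parent Checksum (ZFS-like model)
p_wafl_style = 0.031 / 100        # RAID-Scrub-Block Checksum-Physical ID-Logical ID

# How many times more likely the ZFS-like model is to lose data per year.
ratio = p_parent_checksum / p_wafl_style


def cumulative(p_annual, years):
    """Chance of at least one event in `years` years.

    ASSUMPTION (mine, not the paper's): years are independent trials.
    """
    return 1 - (1 - p_annual) ** years


if __name__ == "__main__":
    print(f"ratio: {ratio:.1f}x")
    for years in (1, 5, 10):
        print(years,
              f"{cumulative(p_parent_checksum, years):.3%}",
              f"{cumulative(p_wafl_style, years):.3%}")
```

Either way, both numbers are small, but the gap between the two schemes in the paper's model is more than an order of magnitude.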

Unfortunately, it seems the authors have not actually analyzed ZFS, just a less-adequate filesystem similar to it. The authors do quietly raise this point in the discussion section, first noting “Another interesting scheme that could be analyzed is one with RAID-Z protection (instead of RAID-4 or RAID-5), where only full-stripe writes are performed and data is protected with parental checksums.” With only full-stripe writes, parity is always computed from the data being written rather than from blocks read back from disk, which removes one of the two mechanisms leading to polluted parity and data loss in their ZFS-like model; the other mechanism is scrubbing.
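To make the pollution mechanism concrete, here is a toy sketch of a 4+1 XOR-parity stripe. It is my own simplification, not code from the paper or from any real RAID implementation: the block sizes, the helper names `xor_blocks` and `rebuild`, and the byte values are all made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, parity, lost):
    """Reconstruct block `lost` from the other data blocks plus parity."""
    others = [blk for i, blk in enumerate(stripe) if i != lost]
    return xor_blocks(others + [parity])


# Toy stripe: 4 data blocks of 4 bytes each, plus one parity block.
good = [bytes([n]) * 4 for n in (1, 2, 3, 4)]
parity = xor_blocks(good)

# Block 2 is silently corrupted on disk; parity is still clean,
# so the original contents can still be rebuilt from the other blocks.
on_disk = list(good)
on_disk[2] = b"\xff" * 4
recoverable_before = rebuild(on_disk, parity, 2) == good[2]

# A partial-stripe write to block 0 recomputes parity from whatever the
# disks return, including the corrupt block 2. A scrub that resolves a
# mismatch by recomputing parity from the data blocks has the same effect.
on_disk[0] = b"\x09" * 4
parity = xor_blocks(on_disk)
recoverable_after = rebuild(on_disk, parity, 2) == good[2]

print(recoverable_before, recoverable_after)
```

Before the partial-stripe write, the corrupt block can be rebuilt from parity; afterwards the parity agrees with the corrupt data, and the original contents of block 2 are gone for good.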

However, at least to my untrained eye, it seems like ZFS’s scrubbing mechanism won’t cause polluted parity, either. Later in the discussion the authors state:

In the absence of techniques such as version mirroring, schemes that protect data by placing checksum or identity protections on the access path should use the same access path for disk scrubbing, parity calculation, and reconstructing data.

This is what I understand ZFS to do; according to Bob Netherton’s blog at Sun, “ZFS scrubbing and resilvering only touches blocks that contain actual data.” Jeff Bonwick writes in his Sun blog entry about RAID-Z:

Well, the tricky bit here is RAID-Z reconstruction. Because the stripes are all different sizes, there’s no simple formula like “all the disks XOR to zero.” You have to traverse the filesystem metadata to determine the RAID-Z geometry. […] Going through the metadata means that ZFS can validate every block against its 256-bit checksum as it goes. Traditional RAID products can’t do this; they simply XOR the data together blindly.

Which brings us to the coolest thing about RAID-Z: self-healing data. In addition to handling whole-disk failure, RAID-Z can also detect and correct silent data corruption. Whenever you read a RAID-Z block, ZFS compares it against its checksum. If the data disks didn’t return the right answer, ZFS reads the parity and then does combinatorial reconstruction to figure out which disk returned bad data. It then repairs the damaged disk and returns good data to the application.

Note that part about combinatorial reconstruction – if I understand correctly, combined with full-stripe writes, this will make the chances of parity pollution in ZFS under the Parity Lost model zero.
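The combinatorial reconstruction Bonwick describes can be sketched roughly as follows. This is only my reading of the quoted passage, not ZFS code: it assumes single XOR parity, uses SHA-256 from Python's `hashlib` as a stand-in for the 256-bit checksum, and the function name `read_with_self_heal` is hypothetical.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(blocks):
    """Stand-in for the parent's 256-bit checksum over the logical block."""
    return hashlib.sha256(b"".join(blocks)).digest()


def read_with_self_heal(data_blocks, parity, expected):
    """Return (good_blocks, repaired_index), where repaired_index is None
    if the data already matched the checksum stored in the parent."""
    data_blocks = list(data_blocks)
    if checksum(data_blocks) == expected:
        return data_blocks, None
    # Try each disk in turn as the suspect: rebuild its block from the
    # others plus parity, and keep the first combination that verifies.
    for i in range(len(data_blocks)):
        others = data_blocks[:i] + data_blocks[i + 1:]
        candidate = xor_blocks(others + [parity])
        trial = data_blocks[:i] + [candidate] + data_blocks[i + 1:]
        if checksum(trial) == expected:
            return trial, i
    raise IOError("no single-block reconstruction matches the checksum")


if __name__ == "__main__":
    good = [bytes([n]) * 4 for n in (1, 2, 3, 4)]
    parity = xor_blocks(good)
    expected = checksum(good)
    bad = list(good)
    bad[1] = b"\xff" * 4            # silent corruption on one disk
    healed, which = read_with_self_heal(bad, parity, expected)
    print(healed == good, which)
```

The key point for parity pollution is that the checksum, which is stored in the parent block rather than alongside the data, decides which copy is right; neither the data nor the parity is trusted blindly.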

I assume that – especially with three authors from NetApp on board – the Parity Lost analysis of WAFL is correct, and that parity pollution is a real, albeit rare, problem on NetApp filers. However, even if I’m wrong about ZFS not having this problem, it’s worth noting that ZFS gives you non-parity-based RAID options which will certainly be immune to this problem, while NetApp doesn’t – that is to say, you can create mirrors in addition to RAID-Z and RAID-Z2 arrays with ZFS, but you have to choose between RAID-4 and RAID-DP with NetApp.

One final note: It’s disappointing that there don’t appear to be any published papers from Sun that describe the workings of ZFS. The authors of Parity Lost had to cite Sun employee blogs and a PDF slideshow as their references for ZFS; those were also the best sources that I could find. I realize that Sun did release the source code, and in a way, that’s the ultimate documentation, but it just isn’t the same as a paper in a refereed journal.
