On Parity Lost
I just finished reading a paper presented at FAST ’08 from the University of Wisconsin, Madison (including the first and senior authors) and NetApp: Parity Lost and Parity Regained. The paper discusses the concept of parity pollution, where, in the words of the authors, “corrupt data in one block of a stripe spreads to other blocks through various parity calculations.”
With the NetApp employees as (middle) authors, the paper seems to have a slight orientation towards a NetApp perspective, but it does mention other filesystems, including ZFS both specifically and later by implication when discussing filesystems that use parental checksums, RAID and scrubbing to protect data. (Interestingly, the first specific mention of ZFS contains a gaffe where they refer to it using “RAID-4” instead of RAID-Z.) The authors make an attempt to quantify the probability of loss or corruption – arriving at 0.486% probability per year for RAID-Scrub-Parent Checksum (implying ZFS) and 0.031% probability per year for RAID-Scrub-Block Checksum-Physical ID-Logical ID (WAFL) when using nearline drives in a 4 data disk, 1 parity disk RAID configuration and a read-write-scrub access pattern of 0.2-0.2-0.6.