fio vs. ZFS compression
Summary: When testing ZFS read performance with fio, compression settings on the file system may cause you to test cache performance instead of physical disk performance.
Background: Testing was done on a FreeBSD 8.3-STABLE system, with an eleven-disk, non-root zpool:
$ zpool status pool: tank01 state: ONLINE scan: scrub repaired 0 in 0h11m with 0 errors on Fri Jun 14 17:20:12 2013 config: NAME STATE READ WRITE CKSUM tank01 ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 da1 ONLINE 0 0 0 da2 ONLINE 0 0 0 mirror-1 ONLINE 0 0 0 da3 ONLINE 0 0 0 da4 ONLINE 0 0 0 mirror-2 ONLINE 0 0 0 da5 ONLINE 0 0 0 da6 ONLINE 0 0 0 mirror-3 ONLINE 0 0 0 da7 ONLINE 0 0 0 da8 ONLINE 0 0 0 mirror-4 ONLINE 0 0 0 da9 ONLINE 0 0 0 da10 ONLINE 0 0 0 logs da11 ONLINE 0 0 0 errors: No known data errors
ZFS and zpool versions were 5 and 28, respectively. Recordsize was 128K. Compression was set to “on”.
Data disks were Toshiba model MK1001TRKB; the separate log was an STEC ZeusRAM C018. Drives were connected to a single LSI SAS2008 HBA.
The system had a single 2.13GHz Intel Xeon E5606 processor, and 12GB of RAM; via
/boot/loader.conf, ARC size was limited to 6GB.
$test types of “read” and “randread”, fio was run as follows, five times for each test:
fio --directory=. --name=$filename --rw=$test --bs=128k --size=36G --numjobs=1 --time_based --runtime=60 --group_reporting
System activity was monitored during each run using a combination of sysctl, iostat, vmstat and top. Fio “IO file” size was measured using “du -h”.
IO files were then written to using the following command:
fio --directory=. --name=$filename --rw=write --bs=128k --size=36G --numjobs=1 --group_reporting
After writing to the IO files, tests were re-run, again five times per test.
Fio appears to create new IO files in a highly-compressible format. With compression on, files that have not been written to are 512 bytes in size; without compression, they are the full size specified in the fio command line (36GB in this case). Once the files have been written to, on a file system with compression turned on, they grew in size to about 14GB (i.e. the files had a compression ratio slightly better than 2:1).
Performance was substantially better on files that had not been written to as opposed to those that had been written to; run-to-run variation, as measured by the standard deviation, was larger on the written files:
|Test||mean MB/s unwritten file||stdev MB/s unwritten file||mean IOPs unwritten file||stdev IOPs unwritten file||mean MB/s written file||stdev MB/s written file||mean IOPs written file||stdev IOPs written file|
top values suggest that benchmark performance was bounded by the CPU on unwritten files, and zpool disks on the written files.
sysctl counters and iostat indicated that effectively no reads for the unwritten files were served from disk but came instead from (prefetch) cache; written files exercised the disks, but when data was read from cache for the written files, it came predominantly from the ARC.
The randread/written file results, in aggregate, have far greater variation as a proportion of the achieved performance; looking at individual runs an interesting pattern emerges:
|Test number||MB/s||IOPs||ARC cache hit ratio||MRU hits||MFU hits|
Specifically, performance got better with each run, apparently as a result of ARC caching. Initially, cache hits seem to be drawn from the MRU, but by the fourth and fifth tests, the MFU is more heavily used. The ARC caches uncompressed data, but even at 36GB of file data with a 6GB ARC, it is reasonable that some proportion of “random” read data will be served from the ARC. The possible implication that the ARC is successfully able to adapt to fio’s random read workload would be interesting to look at in greater depth.
Conclusion: Although this is a limited set of data, two results can reasonably be drawn from it:
- The combination of cache and compression in ZFS can have impressive performance benefits; and
- ARC and prefetch cache are each relevant in different performance domains.
Acknowledgements: I am indebted to my former employer for allowing me free usage of the system tested in this blog post.
Filling in the Missing Parts of NetApp’s API
Late last year, NetApp released long-overdue Python and Ruby support in their SDK, officially known as the NetApp Manageability SDK. The SDK download is – oddly and unfortunately – still buried behind a paywall, and you have to submit a web form about how you plan to use it to get access to the download; otherwise it’s available to all.
But perhaps there’s good reason for hiding the download away: There are still large gaps in the API. For instance, say you want to change the security mode of a qtree? You’re out of luck. (Makes one wonder how NetApp implements this functionality in OnCommand System Manager – are they eating their own dogfood?)
That said, if you’re willing to venture off the beaten (and supported) path, you can use the undocumented system-cli API call. Here’s how I’m using it in a Python wrapper I’m working on that makes the SDK feel a little bit less like handling thinly-varnished XML:
Replacing a Failed NetApp Drive with an Un-zeroed Spare
Jason Boche has a post on the method he used to replace a failed drive on a filer with an un-zeroed spare (transferred from a lab machine); my procedure was a little different.
In this example, I’ll be installing a replacement drive pulled from aggr0 on another filer. Note that this procedure is not relevant for drive failures covered by a support contract, where you will receive a zeroed replacement drive directly from NetApp.
- Physically remove failed drive and replace with working drive. This will generate log messages similar to the following:
May 27 11:02:36 filer01 [raid.disk.missing: info]: Disk 1b.51 Shelf 3 Bay 3 [NETAPP X268_SGLXY750SSX AQNZ] S/N [5QD599LZ] is missing from the system May 27 11:03:00 filer01 [monitor.globalStatus.ok: info]: The system's global status is normal. May 27 11:03:16 filer01 [scsi.cmd.notReadyCondition: notice]: Disk device 0a.51: Device returns not yet ready: CDB 0x12: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x0)(7715). May 27 11:03:25 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves. May 27 11:03:27 filer01 [diskown.changingOwner: info]: changing ownership for disk 0a.51 (S/N P8G9SMDF) from unowned (ID -1) to filer01 (ID 135027165) May 27 11:03:27 filer01 [raid.assim.rg.missingChild: error]: Aggregate foreign:aggr0, rgobj_verify: RAID object 0 has only 1 valid children, expected 14. May 27 11:03:27 filer01 [raid.assim.plex.missingChild: error]: Aggregate foreign:aggr0, plexobj_verify: Plex 0 only has 0 working RAID groups (2 total) and is being taken offline May 27 11:03:27 filer01 [raid.assim.mirror.noChild: ALERT]: Aggregate foreign:aggr0, mirrorobj_verify: No operable plexes found. May 27 11:03:27 filer01 [raid.assim.tree.foreign: error]: raidtree_verify: Aggregate aggr0 is a foreign aggregate and is being taken offline. Use the 'aggr online' command to bring it online. May 27 11:03:27 filer01 [raid.assim.tree.dupName: error]: Duplicate aggregate names found, an instance of foreign:aggr0 is being renamed to foreign:aggr0(1). May 27 11:03:28 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves. May 27 11:04:40 filer01 [asup.smtp.sent: notice]: System Notification mail sent: System Notification from filer01 (RAID VOLUME FAILED) ERROR May 27 11:04:42 filer01 [asup.post.sent: notice]: System Notification message posted to NetApp: System Notification from filer01 (RAID VOLUME FAILED) ERROR
Note line 6, where it identifies the newly-added disk as part of “foreign:aggr0” and missing the rest of its RAID group; “foreign:aggr0” is taken offline in line 9. In line 10, “foreign:aggr0” is renamed to “foreign:aggr0(1)” because the filer already has an aggr0, as you might expect. Be sure to note the new aggregate name, as you will need it for later steps.
- Verify aggregate status and names:
filer01> aggr status Aggr State Status Options aggr0 online raid_dp, aggr root aggr1 online raid_dp, aggr aggr0(1) failed raid_dp, aggr diskroot, lost_write_protect=off, foreign partial aggr2 online raid_dp, aggr nosnap=on
- Double-check the name of the foreign, offline aggregate that was brought in with the replacement drive, and destroy it:
filer01> aggr destroy aggr0(1) Are you sure you want to destroy this aggregate? yes Aggregate 'aggr0(1)' destroyed.
- Verify that the aggregate has been removed:
filer01> aggr status Aggr State Status Options aggr0 online raid_dp, aggr root aggr1 online raid_dp, aggr aggr2 online raid_dp, aggr nosnap=on
- Zero the new spare. First, confirm it is un-zeroed:
filer01> vol status -s Spare disks RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks) --------- ------ ------------- ---- ---- ---- ----- -------------- -------------- Spare disks for block or zoned checksum traditional volumes or aggregates spare 0a.53 0a 3 5 FC:B - ATA 7200 635555/1301618176 635858/1302238304 (not zeroed) spare 0a.69 0a 4 5 FC:B - ATA 7200 635555/1301618176 635858/1302238304 spare 1b.51 1b 3 3 FC:A - ATA 7200 635555/1301618176 635858/1302238304 (not zeroed) spare 1b.61 1b 3 13 FC:A - ATA 7200 635555/1301618176 635858/1302238304 spare 1b.87 1b 5 7 FC:A - ATA 7200 847555/1735794176 847827/1736350304 spare 1b.89 1b 5 9 FC:A - ATA 7200 847555/1735794176 847827/1736350304
In this example, we actually have two un-zeroed spares – the newly replaced drive (1b.51) and another drive (0a.53). Zero them both:
filer01> disk zero spares
And verify that they have been zeroed:
filer01> vol status -s Spare disks RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks) --------- ------ ------------- ---- ---- ---- ----- -------------- -------------- Spare disks for block or zoned checksum traditional volumes or aggregates spare 0a.53 0a 3 5 FC:B - ATA 7200 635555/1301618176 635858/1302238304 spare 0a.69 0a 4 5 FC:B - ATA 7200 635555/1301618176 635858/1302238304 spare 1b.51 1b 3 3 FC:A - ATA 7200 635555/1301618176 635858/1302238304 spare 1b.61 1b 3 13 FC:A - ATA 7200 635555/1301618176 635858/1302238304 spare 1b.87 1b 5 7 FC:A - ATA 7200 847555/1735794176 847827/1736350304 spare 1b.89 1b 5 9 FC:A - ATA 7200 847555/1735794176 847827/1736350304
- Done. You have replaced a failed drive with a zeroed spare.
Running NetApp’s aggrSpaceCheck without turning on RSH
When upgrading a NetApp filer from a pre-7.3 release to 7.3, metadata is apparently moved from within the FlexVol into the containing aggregate. If your aggregate is tight on space – more than 96% full – NetApp requires that you complete extra verification steps to ensure that you can complete the upgrade. From the Data ONTAP® 18.104.22.168 Release Notes (NOW login required):
If you suspect that your system has almost used all of its free space, or if you use thin provisioning, you should check the amount of space in use by each aggregate. If any aggregate is 97 percent full or more, do not proceed with the upgrade until you have used the Upgrade Advisor or aggrSpaceCheck tools to determine your system capacity and plan your upgrade.
Upgrade Advisor is a great tool, and I heartily recommend you use it for your upgrade. However, it doesn’t give you a lot of visibility into what’s being checked for here. Lucky for us, NetApp offers an alternative tool: aggrSpaceCheck (NOW login required).
Duplicity to Amazon S3 on FreeBSD: Building on the work of others
(This post adds only a couple small details to work described at randys.org and cenolan.com – go there for background on this post and useful scripts for automated Duplicity backup to S3.)
First off, if you want to use Duplicity installed from FreeBSD Ports to backup to Amazon S3, be sure to also install the
If you attempt to run the backup script described at randys.org or cenolan.com from cron, you may run into an error similar to the following:
Practical Limits of NetApp Deduplication
I’ve blogged before about the limits of NetApp’s A-SIS (Deduplication). In practical use, however, those limits can be even lower – here’s why:
Suppose, for example, that you have a FAS2050; the maximum size FlexVol that you can dedupe is 1 TB. If the volume has ever been larger than 1 TB and then shrunk below that limit, it can’t be deduped, and, of course, you can’t grow a volume with A-SIS enabled beyond 1 TB. Fair enough, you say – but consider those limitations in the case of a volume where you aren’t sure how large it will eventually grow.
If you think your volume could eventually grow beyond 1 TB (deduped), and you’re getting a healthy 50% savings from dedupe you’ll actually need to undo A-SIS at 500GB. If you let your deduped data approach filling a 1TB volume, you will not be able to run “sis undo” – you’ll run out of space. TR-3505 has this to say about it:
Note that if sis undo starts processing and then there is not enough space to undeduplicate, it will stop, complain with a message about insufficient space, and leave the flexible volume dense. All data is still accessible, but some block sharing is still occurring. Use “df –s” to understand how much free space you really have and then either grow the flexible volume or delete data or Snapshot copies to provide the needed free space.
Bottom line: Either be absolutely sure you won’t ever need to grow your volume beyond the A-SIS limitations of your hardware platform, or run “sis undo” before the sum of the “used” and “saved” columns of “df -s” reaches the volume limit.
Postscript: If you were thinking – like I was – that ONTAP 7.3 would up the A-SIS limitations, apparently you need to think again.
Postscript 2: See also NOW KB35784, as pointed out by Dan C on Scott Lowe’s blog.
Amazon Elastic Block Store is out!
Amazon’s much-awaited Elastic Block Store for EC2 is out this morning; I’m excited to give this a try. A couple downers from the announcement: The pricing is somewhat high – $0.10 per allocated GB per month plus $0.10 per 1 million I/O requests – and the reliability isn’t where I’d like it to be. Specifically, Amazon notes:
Volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that will typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives.
Because Amazon EBS servers are replicated within a single Availability Zone, mirroring data across multiple Amazon EBS volumes in the same Availability Zone will not significantly improve volume durability.
That last sentence makes it sound like there is a 0.1% – 0.5% chance of catastrophic data loss of many distinct EBS volumes in an availability zone. If that’s the case, that’s scary – off the top of my head, I’d say your run-of-the mill “Enterprise” SAN doesn’t have a one-in-two hundred risk of catastrophic failure per year.
More links, not all of which I’ve had a chance to fully digest yet:
On Parity Lost
I just finished reading a paper presented at FAST ’08 from the University of Wisconsin, Madison (including the first and senior authors) and NetApp: Parity Lost and Parity Regained. The paper discusses the concept of parity pollution, where, in the words of the authors, “corrupt data in one block of a stripe spreads to other blocks through various parity calculations.”
With the NetApp employees as (middle) authors, the paper seems to have a slight orientation towards a NetApp perspective, but it does mention other filesystems, including ZFS both specifically and later by implication when discussing filesystems that use parental checksums, RAID and scrubbing to protect data. (Interestingly, the first specific mention of ZFS contains a gaffe where they refer to it using “RAID-4” instead of RAID-Z.) The authors make an attempt to quantify the probability of loss or corruption – arriving at 0.486% probability per year for RAID-Scrub-Parent Checksum (implying ZFS) and 0.031% probability per year for RAID-Scrub-Block Checksum-Physical ID-Logical ID (WAFL) when using nearline drives in a 4 data disk, 1 parity disk RAID configuration and a read-write-scrub access pattern of 0.2-0.2-0.6.
VMware’s Comparison of Storage Protocol Performance
VMware has just released a paper entitled Comparison of Storage Protocol Performance (seen at Scale the Mind and blog.scottlowe.org); maybe this will help deflate some of the too-often repeated speculation that NFS is too slow for VMware ESX.