fio vs. ZFS compression

Summary: When testing ZFS read performance with fio, compression settings on the file system may cause you to test cache performance instead of physical disk performance.

Background: Testing was done on a FreeBSD 8.3-STABLE system, with an eleven-disk, non-root zpool:

$ zpool status
  pool: tank01
 state: ONLINE
  scan: scrub repaired 0 in 0h11m with 0 errors on Fri Jun 14 17:20:12 2013
config:

	NAME        STATE     READ WRITE CKSUM
	tank01      ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    da1     ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    da3     ONLINE       0     0     0
	    da4     ONLINE       0     0     0
	  mirror-2  ONLINE       0     0     0
	    da5     ONLINE       0     0     0
	    da6     ONLINE       0     0     0
	  mirror-3  ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	    da8     ONLINE       0     0     0
	  mirror-4  ONLINE       0     0     0
	    da9     ONLINE       0     0     0
	    da10    ONLINE       0     0     0
	logs
	  da11      ONLINE       0     0     0

errors: No known data errors

ZFS and zpool versions were 5 and 28, respectively. Recordsize was 128K. Compression was set to “on”.
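
These properties can be checked from the command line. The commands below are a sketch only, assuming the top-level tank01 dataset (rather than a child dataset) was the one used for the tests:

# Assumes the top-level tank01 dataset was used for the tests.
$ zfs get version,recordsize,compression tank01
$ zpool get version tank01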

Data disks were Toshiba model MK1001TRKB; the separate log was an STEC ZeusRAM C018. Drives were connected to a single LSI SAS2008 HBA.

The system had a single 2.13GHz Intel Xeon E5606 processor and 12GB of RAM; the ARC was limited to 6GB via vfs.zfs.arc_max in /boot/loader.conf.
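
Limiting the ARC is a single loader tunable; a minimal sketch of the /boot/loader.conf entry for the 6GB cap, plus a way to confirm it after a reboot:

# /boot/loader.conf -- cap the ARC at 6GB (takes effect on the next boot)
vfs.zfs.arc_max="6G"

# Confirm after reboot (the value is reported in bytes):
$ sysctl vfs.zfs.arc_max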

Tests run:

For each $test value of “read” and “randread”, fio was run as follows, five times per test (a sketch of the full loop appears after the command):

fio --directory=. \
  --name=$filename \
  --rw=$test \
  --bs=128k \
  --size=36G \
  --numjobs=1 \
  --time_based \
  --runtime=60 \
  --group_reporting
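
The exact driver script is not reproduced here; a minimal sketch of the loop implied above (the $filename value and the log file names are illustrative only):

#!/bin/sh
# Run each read test five times against the same IO file, one fio log per run.
filename=fiotest        # illustrative name only
for test in read randread; do
    for run in 1 2 3 4 5; do
        fio --directory=. \
          --name=$filename \
          --rw=$test \
          --bs=128k \
          --size=36G \
          --numjobs=1 \
          --time_based \
          --runtime=60 \
          --group_reporting > ${test}.run${run}.log
    done
done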

System activity was monitored during each run using a combination of sysctl, iostat, vmstat and top. Fio “IO file” size was measured using “du -h”.
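
The exact monitoring invocations aren't shown above; something along the following lines captures the figures discussed below (the 5-second interval and the file glob are assumptions):

# In separate terminals during each run:
$ iostat -x 5                      # per-disk throughput and busy time
$ vmstat 5                         # CPU, memory and paging activity
$ top -S                           # per-process and kernel-thread CPU usage
$ sysctl kstat.zfs.misc.arcstats   # ARC hit/miss counters

# IO file size on disk (before and after the write pass):
$ du -h ${filename}*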

IO files were then written to using the following command:

fio --directory=. \
  --name=$filename \
  --rw=write \
  --bs=128k \
  --size=36G \
  --numjobs=1 \
  --group_reporting

After writing to the IO files, tests were re-run, again five times per test.

Results:

Fio appears to create new IO files in a highly compressible format. With compression on, files that have not yet been written to occupy only 512 bytes on disk; without compression, they occupy the full size specified on the fio command line (36GB in this case). Once the files had been written to on the compression-enabled file system, they grew to about 14GB on disk, a compression ratio of roughly 2.5:1.
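
The difference between logical and on-disk size is easy to see with ls and du, and zfs's compressratio property gives the dataset-level view (the tank01 dataset name is assumed, as before):

# Logical size as seen by applications vs. space actually allocated on disk:
$ ls -lh ${filename}*
$ du -h  ${filename}*        # 512 bytes before the write pass, ~14GB after
# Dataset-level view of the same effect:
$ zfs get compressratio,used tank01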

Performance was substantially better on files that had not been written to than on those that had; run-to-run variation, as measured by the standard deviation, was also larger on the written files:

Test      File state   mean MB/s   stdev MB/s    mean IOPS   stdev IOPS
read      unwritten       2458.4         11.7        19666           93
read      written          897.4         33.4         7178          267
randread  unwritten       2337.8          2.3        18703           20
randread  written           40.6         14.2          324          114

The vmstat, iostat and top output suggests that benchmark performance was bounded by the CPU for the unwritten files, and by the zpool disks for the written files.

The sysctl counters and iostat output indicated that effectively no reads of the unwritten files were served from disk; they came instead from the (prefetch) cache. Reads of the written files did exercise the disks, and the cache hits that did occur for those files came predominantly from the ARC.
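
On this vintage of FreeBSD, the counters in question live under kstat.zfs.misc.arcstats; a sketch of the ones worth watching (the exact OID names can differ between ZFS versions):

# Demand vs. prefetch hits, and whether ARC hits land in the MRU or MFU:
$ sysctl kstat.zfs.misc.arcstats.hits \
         kstat.zfs.misc.arcstats.misses \
         kstat.zfs.misc.arcstats.demand_data_hits \
         kstat.zfs.misc.arcstats.prefetch_data_hits \
         kstat.zfs.misc.arcstats.mru_hits \
         kstat.zfs.misc.arcstats.mfu_hits

# Confirm whether the pool's data disks are actually being read:
$ zpool iostat -v tank01 5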

The randread results for the written file, taken in aggregate, show far greater variation as a proportion of the achieved performance; looking at the individual runs, an interesting pattern emerges:

Run    MB/s    IOPS  ARC cache hit ratio  MRU hits  MFU hits
1      20.1     160                 0.15       856       646
2      36.2     289                 0.51      6060      2790
3      40.4     323                 0.56      5504      5289
4      47.8     382                 0.63      1348     12992
5      58.3     466                 0.69      1761     17633

Specifically, performance improved with each run, apparently as a result of ARC caching. Initially, cache hits appear to be drawn mostly from the MRU, but by the fourth and fifth runs the MFU is more heavily used. The ARC caches uncompressed data, but even with 36GB of file data and a 6GB ARC, it is reasonable that some proportion of the “random” reads will be served from the ARC. The possible implication, that the ARC successfully adapts to fio’s random read workload over successive runs, would be interesting to look at in greater depth.
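
For reference, the hit ratios in the table above can be approximated by snapshotting the ARC counters around each run; the sketch below assumes the ratio is simply hits / (hits + misses) over the interval, which may not match exactly how the figures above were derived:

#!/bin/sh
# Usage: ./arc-ratio.sh <command to run>   (script name is illustrative)
h0=$(sysctl -n kstat.zfs.misc.arcstats.hits)
m0=$(sysctl -n kstat.zfs.misc.arcstats.misses)
"$@"
h1=$(sysctl -n kstat.zfs.misc.arcstats.hits)
m1=$(sysctl -n kstat.zfs.misc.arcstats.misses)
hits=$((h1 - h0)); misses=$((m1 - m0))
echo "ARC hit ratio: $(echo "scale=2; $hits / ($hits + $misses)" | bc)"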

Conclusion: Although this is a limited data set, two conclusions can reasonably be drawn from it:

  • The combination of cache and compression in ZFS can have impressive performance benefits; and
  • ARC and prefetch cache are each relevant in different performance domains.

Acknowledgements: I am indebted to my former employer for allowing me free usage of the system tested in this blog post.