Jason Boche has a post on the method he used to replace a failed drive on a filer with an un-zeroed spare (transferred from a lab machine); my procedure was a little different.
In this example, I’ll be installing a replacement drive pulled from aggr0 on another filer. Note that this procedure is not relevant for drive failures covered by a support contract, where you will receive a zeroed replacement drive directly from NetApp.
- Physically remove failed drive and replace with working drive. This will generate log messages similar to the following:
May 27 11:02:36 filer01 [raid.disk.missing: info]: Disk 1b.51 Shelf 3 Bay 3 [NETAPP X268_SGLXY750SSX AQNZ] S/N [5QD599LZ] is missing from the system May 27 11:03:00 filer01 [monitor.globalStatus.ok: info]: The system's global status is normal. May 27 11:03:16 filer01 [scsi.cmd.notReadyCondition: notice]: Disk device 0a.51: Device returns not yet ready: CDB 0x12: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x0)(7715). May 27 11:03:25 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves. May 27 11:03:27 filer01 [diskown.changingOwner: info]: changing ownership for disk 0a.51 (S/N P8G9SMDF) from unowned (ID -1) to filer01 (ID 135027165) May 27 11:03:27 filer01 [raid.assim.rg.missingChild: error]: Aggregate foreign:aggr0, rgobj_verify: RAID object 0 has only 1 valid children, expected 14. May 27 11:03:27 filer01 [raid.assim.plex.missingChild: error]: Aggregate foreign:aggr0, plexobj_verify: Plex 0 only has 0 working RAID groups (2 total) and is being taken offline May 27 11:03:27 filer01 [raid.assim.mirror.noChild: ALERT]: Aggregate foreign:aggr0, mirrorobj_verify: No operable plexes found. May 27 11:03:27 filer01 [raid.assim.tree.foreign: error]: raidtree_verify: Aggregate aggr0 is a foreign aggregate and is being taken offline. Use the 'aggr online' command to bring it online. May 27 11:03:27 filer01 [raid.assim.tree.dupName: error]: Duplicate aggregate names found, an instance of foreign:aggr0 is being renamed to foreign:aggr0(1). May 27 11:03:28 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves. May 27 11:04:40 filer01 [asup.smtp.sent: notice]: System Notification mail sent: System Notification from filer01 (RAID VOLUME FAILED) ERROR May 27 11:04:42 filer01 [asup.post.sent: notice]: System Notification message posted to NetApp: System Notification from filer01 (RAID VOLUME FAILED) ERROR
Note line 6, where it identifies the newly-added disk as part of “foreign:aggr0” and missing the rest of its RAID group; “foreign:aggr0” is taken offline in line 9. In line 10, “foreign:aggr0” is renamed to “foreign:aggr0(1)” because the filer already has an aggr0, as you might expect. Be sure to note the new aggregate name, as you will need it for later steps.
- Verify aggregate status and names:
filer01> aggr status Aggr State Status Options aggr0 online raid_dp, aggr root aggr1 online raid_dp, aggr aggr0(1) failed raid_dp, aggr diskroot, lost_write_protect=off, foreign partial aggr2 online raid_dp, aggr nosnap=on
- Double-check the name of the foreign, offline aggregate that was brought in with the replacement drive, and destroy it:
filer01> aggr destroy aggr0(1) Are you sure you want to destroy this aggregate? yes Aggregate 'aggr0(1)' destroyed.
- Verify that the aggregate has been removed:
filer01> aggr status Aggr State Status Options aggr0 online raid_dp, aggr root aggr1 online raid_dp, aggr aggr2 online raid_dp, aggr nosnap=on
- Zero the new spare. First, confirm it is un-zeroed:
filer01> vol status -s Spare disks RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks) --------- ------ ------------- ---- ---- ---- ----- -------------- -------------- Spare disks for block or zoned checksum traditional volumes or aggregates spare 0a.53 0a 3 5 FC:B - ATA 7200 635555/1301618176 635858/1302238304 (not zeroed) spare 0a.69 0a 4 5 FC:B - ATA 7200 635555/1301618176 635858/1302238304 spare 1b.51 1b 3 3 FC:A - ATA 7200 635555/1301618176 635858/1302238304 (not zeroed) spare 1b.61 1b 3 13 FC:A - ATA 7200 635555/1301618176 635858/1302238304 spare 1b.87 1b 5 7 FC:A - ATA 7200 847555/1735794176 847827/1736350304 spare 1b.89 1b 5 9 FC:A - ATA 7200 847555/1735794176 847827/1736350304
In this example, we actually have two un-zeroed spares – the newly replaced drive (1b.51) and another drive (0a.53). Zero them both:
filer01> disk zero spares
And verify that they have been zeroed:
filer01> vol status -s Spare disks RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks) --------- ------ ------------- ---- ---- ---- ----- -------------- -------------- Spare disks for block or zoned checksum traditional volumes or aggregates spare 0a.53 0a 3 5 FC:B - ATA 7200 635555/1301618176 635858/1302238304 spare 0a.69 0a 4 5 FC:B - ATA 7200 635555/1301618176 635858/1302238304 spare 1b.51 1b 3 3 FC:A - ATA 7200 635555/1301618176 635858/1302238304 spare 1b.61 1b 3 13 FC:A - ATA 7200 635555/1301618176 635858/1302238304 spare 1b.87 1b 5 7 FC:A - ATA 7200 847555/1735794176 847827/1736350304 spare 1b.89 1b 5 9 FC:A - ATA 7200 847555/1735794176 847827/1736350304
- Done. You have replaced a failed drive with a zeroed spare.
I’ve just begun to pull together some interesting data on a series of Linux VM crashes I’ve seen. I don’t have a resolution yet, but some interesting patterns have emerged.
A CentOS 4.x or 5.x guest will crash with a message similar to the following on its console:
[<f883b299>] .text.lock.scsi_error+0x19/0x34 [scsi_mod]
[<f88c19ce>] mptscsih_io_done+0x5ee/0x608 [mptscsi] (…)
RIP [<ffffffff8014c562>] list_del+0x48/0x71 RSP <ffffffff80425d00> <0>Kernel Panic - not syncing: Fatal exception
A hard reset (i.e. pressing the reset button on the VM’s console) is required to reboot the guest.
When upgrading a NetApp filer from a pre-7.3 release to 7.3, metadata is apparently moved from within the FlexVol into the containing aggregate. If your aggregate is tight on space – more than 96% full – NetApp requires that you complete extra verification steps to ensure that you can complete the upgrade. From the Data ONTAP® 18.104.22.168 Release Notes (NOW login required):
If you suspect that your system has almost used all of its free space, or if you use thin provisioning, you should check the amount of space in use by each aggregate. If any aggregate is 97 percent full or more, do not proceed with the upgrade until you have used the Upgrade Advisor or aggrSpaceCheck tools to determine your system capacity and plan your upgrade.
Upgrade Advisor is a great tool, and I heartily recommend you use it for your upgrade. However, it doesn’t give you a lot of visibility into what’s being checked for here. Lucky for us, NetApp offers an alternative tool: aggrSpaceCheck (NOW login required).
There have been a few posts elsewhere discussing file-level recovery for Linux VMs on NetApp NFS datastores, but none that have dealt specifically with Linux LVM-encapsulated partitions.
Here’s our in-house procedure for recovery; note that we do not have FlexClone licensed on our filers.
Scott Lowe recently linked to a VMware KB article entitled Storing swap files on VMFS when running virtual machines from NFS. The article (from 3/31/2008) is perhaps the latest word from VMware in the frustrating back-and-forth on whether placing an ESX VM’s swap on NFS is acceptable or not.
I’ve blogged before about the limits of NetApp’s A-SIS (Deduplication). In practical use, however, those limits can be even lower – here’s why:
Suppose, for example, that you have a FAS2050; the maximum size FlexVol that you can dedupe is 1 TB. If the volume has ever been larger than 1 TB and then shrunk below that limit, it can’t be deduped, and, of course, you can’t grow a volume with A-SIS enabled beyond 1 TB. Fair enough, you say – but consider those limitations in the case of a volume where you aren’t sure how large it will eventually grow.
If you think your volume could eventually grow beyond 1 TB (deduped), and you’re getting a healthy 50% savings from dedupe you’ll actually need to undo A-SIS at 500GB. If you let your deduped data approach filling a 1TB volume, you will not be able to run “sis undo” – you’ll run out of space. TR-3505 has this to say about it:
Note that if sis undo starts processing and then there is not enough space to undeduplicate, it will stop, complain with a message about insufficient space, and leave the flexible volume dense. All data is still accessible, but some block sharing is still occurring. Use “df –s” to understand how much free space you really have and then either grow the flexible volume or delete data or Snapshot copies to provide the needed free space.
Bottom line: Either be absolutely sure you won’t ever need to grow your volume beyond the A-SIS limitations of your hardware platform, or run “sis undo” before the sum of the “used” and “saved” columns of “df -s” reaches the volume limit.
Postscript: If you were thinking – like I was – that ONTAP 7.3 would up the A-SIS limitations, apparently you need to think again.
I just finished reading a paper presented at FAST ’08 from the University of Wisconsin, Madison (including the first and senior authors) and NetApp: Parity Lost and Parity Regained. The paper discusses the concept of parity pollution, where, in the words of the authors, “corrupt data in one block of a stripe spreads to other blocks through various parity calculations.”
With the NetApp employees as (middle) authors, the paper seems to have a slight orientation towards a NetApp perspective, but it does mention other filesystems, including ZFS both specifically and later by implication when discussing filesystems that use parental checksums, RAID and scrubbing to protect data. (Interestingly, the first specific mention of ZFS contains a gaffe where they refer to it using “RAID-4” instead of RAID-Z.) The authors make an attempt to quantify the probability of loss or corruption – arriving at 0.486% probability per year for RAID-Scrub-Parent Checksum (implying ZFS) and 0.031% probability per year for RAID-Scrub-Block Checksum-Physical ID-Logical ID (WAFL) when using nearline drives in a 4 data disk, 1 parity disk RAID configuration and a read-write-scrub access pattern of 0.2-0.2-0.6.