Replacing a Failed NetApp Drive with an Un-zeroed Spare
Jason Boche has a post on the method he used to replace a failed drive on a filer with an un-zeroed spare (transferred from a lab machine); my procedure was a little different.
In this example, I’ll be installing a replacement drive pulled from aggr0 on another filer. Note that this procedure is not relevant for drive failures covered by a support contract, where you will receive a zeroed replacement drive directly from NetApp.
- Physically remove the failed drive and replace it with the working drive. This will generate log messages similar to the following:
    May 27 11:02:36 filer01 [raid.disk.missing: info]: Disk 1b.51 Shelf 3 Bay 3 [NETAPP X268_SGLXY750SSX AQNZ] S/N [5QD599LZ] is missing from the system
    May 27 11:03:00 filer01 [monitor.globalStatus.ok: info]: The system's global status is normal.
    May 27 11:03:16 filer01 [scsi.cmd.notReadyCondition: notice]: Disk device 0a.51: Device returns not yet ready: CDB 0x12: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x0)(7715).
    May 27 11:03:25 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves.
    May 27 11:03:27 filer01 [diskown.changingOwner: info]: changing ownership for disk 0a.51 (S/N P8G9SMDF) from unowned (ID -1) to filer01 (ID 135027165)
    May 27 11:03:27 filer01 [raid.assim.rg.missingChild: error]: Aggregate foreign:aggr0, rgobj_verify: RAID object 0 has only 1 valid children, expected 14.
    May 27 11:03:27 filer01 [raid.assim.plex.missingChild: error]: Aggregate foreign:aggr0, plexobj_verify: Plex 0 only has 0 working RAID groups (2 total) and is being taken offline
    May 27 11:03:27 filer01 [raid.assim.mirror.noChild: ALERT]: Aggregate foreign:aggr0, mirrorobj_verify: No operable plexes found.
    May 27 11:03:27 filer01 [raid.assim.tree.foreign: error]: raidtree_verify: Aggregate aggr0 is a foreign aggregate and is being taken offline. Use the 'aggr online' command to bring it online.
    May 27 11:03:27 filer01 [raid.assim.tree.dupName: error]: Duplicate aggregate names found, an instance of foreign:aggr0 is being renamed to foreign:aggr0(1).
    May 27 11:03:28 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves.
    May 27 11:04:40 filer01 [asup.smtp.sent: notice]: System Notification mail sent: System Notification from filer01 (RAID VOLUME FAILED) ERROR
    May 27 11:04:42 filer01 [asup.post.sent: notice]: System Notification message posted to NetApp: System Notification from filer01 (RAID VOLUME FAILED) ERROR
Note the sixth log entry (raid.assim.rg.missingChild), which identifies the newly added disk as part of “foreign:aggr0” and reports that the rest of its RAID group is missing; “foreign:aggr0” is taken offline in the ninth entry (raid.assim.tree.foreign). In the tenth entry (raid.assim.tree.dupName), “foreign:aggr0” is renamed to “foreign:aggr0(1)” because the filer already has an aggr0. Be sure to note the new aggregate name, as you will need it in the following steps.
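If the filer's messages are also collected somewhere as plain text (a syslog host, for example), the renamed aggregate can be picked out mechanically rather than by eye. The following is a minimal sketch, not part of the original procedure: the function name is hypothetical, and the pattern is based only on the wording of the single dupName message shown above, so other Data ONTAP releases may phrase it differently.

```python
import re

# Pattern modelled on the one raid.assim.tree.dupName message in this post.
# Assumption: the message wording is stable; this is not a documented format.
DUP_NAME = re.compile(
    r"\[raid\.assim\.tree\.dupName: \w+\]: .*?"
    r"an instance of foreign:(?P<old>\S+) is being renamed to "
    r"foreign:(?P<new>\S+?)\.(?:\s|$)"
)


def renamed_foreign_aggregates(log_text):
    """Return (old_name, new_name) pairs for every dupName rename found."""
    return [(m.group("old"), m.group("new")) for m in DUP_NAME.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "May 27 11:03:27 filer01 [raid.assim.tree.dupName: error]: "
        "Duplicate aggregate names found, an instance of foreign:aggr0 "
        "is being renamed to foreign:aggr0(1)."
    )
    for old, new in renamed_foreign_aggregates(sample):
        print(f"foreign aggregate {old} was renamed to {new}")
```

The new name returned here is the one to pass to aggr destroy later, so it is still worth confirming it against aggr status before acting on it.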
- Verify aggregate status and names:
    filer01> aggr status
               Aggr State      Status            Options
              aggr0 online     raid_dp, aggr     root
              aggr1 online     raid_dp, aggr
           aggr0(1) failed     raid_dp, aggr     diskroot, lost_write_protect=off,
                               foreign
                               partial
              aggr2 online     raid_dp, aggr     nosnap=on
- Double-check the name of the foreign, offline aggregate that was brought in with the replacement drive, and destroy it:
    filer01> aggr destroy aggr0(1)
    Are you sure you want to destroy this aggregate? yes
    Aggregate 'aggr0(1)' destroyed.
- Verify that the aggregate has been removed:
    filer01> aggr status
               Aggr State      Status            Options
              aggr0 online     raid_dp, aggr     root
              aggr1 online     raid_dp, aggr
              aggr2 online     raid_dp, aggr     nosnap=on
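The before-and-after comparison of aggr status can also be scripted if the console output has been captured as text. This is a rough sketch under stated assumptions: the function names are hypothetical, the set of state keywords is a guess (only "online" and "failed" appear in this post), and a row is taken to be any line whose second whitespace-separated token is one of those keywords, which skips the header and the wrapped continuation lines.

```python
import re

# Assumed state keywords; only "online" and "failed" appear in the post itself.
STATES = {"online", "offline", "restricted", "failed"}


def aggregate_states(aggr_status_output):
    """Map aggregate name -> state from captured 'aggr status' text."""
    states = {}
    for line in aggr_status_output.splitlines():
        tokens = line.split()
        # Aggregate rows look like: <name> <state> <status...> <options...>
        if len(tokens) >= 2 and tokens[1] in STATES:
            states[tokens[0]] = tokens[1]
    return states


def renamed_leftovers(states):
    """Names ending in '(N)', the suffix given to renamed foreign aggregates."""
    return sorted(name for name in states if re.search(r"\(\d+\)$", name))


if __name__ == "__main__":
    captured = """\
filer01> aggr status
           Aggr State      Status            Options
          aggr0 online     raid_dp, aggr     root
       aggr0(1) failed     raid_dp, aggr     diskroot, lost_write_protect=off,
                           foreign
                           partial
"""
    print(renamed_leftovers(aggregate_states(captured)))
```

An empty list from renamed_leftovers after the destroy step corresponds to the clean aggr status listing above.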
- Zero the new spare. First, confirm it is un-zeroed:
    filer01> vol status -s

    Spare disks

    RAID Disk  Device  HA  SHELF BAY  CHAN Pool Type RPM   Used (MB/blks)    Phys (MB/blks)
    ---------  ------  -------------  ---- ---- ---- ----- --------------    --------------
    Spare disks for block or zoned checksum traditional volumes or aggregates
    spare      0a.53   0a    3   5    FC:B  -   ATA  7200  635555/1301618176 635858/1302238304 (not zeroed)
    spare      0a.69   0a    4   5    FC:B  -   ATA  7200  635555/1301618176 635858/1302238304
    spare      1b.51   1b    3   3    FC:A  -   ATA  7200  635555/1301618176 635858/1302238304 (not zeroed)
    spare      1b.61   1b    3   13   FC:A  -   ATA  7200  635555/1301618176 635858/1302238304
    spare      1b.87   1b    5   7    FC:A  -   ATA  7200  847555/1735794176 847827/1736350304
    spare      1b.89   1b    5   9    FC:A  -   ATA  7200  847555/1735794176 847827/1736350304
In this example, we actually have two un-zeroed spares – the newly replaced drive (1b.51) and another drive (0a.53). Zero them both:
filer01> disk zero spares
Zeroing runs in the background and can take a long time on large drives. Once it has finished, verify that the spares have been zeroed:
    filer01> vol status -s

    Spare disks

    RAID Disk  Device  HA  SHELF BAY  CHAN Pool Type RPM   Used (MB/blks)    Phys (MB/blks)
    ---------  ------  -------------  ---- ---- ---- ----- --------------    --------------
    Spare disks for block or zoned checksum traditional volumes or aggregates
    spare      0a.53   0a    3   5    FC:B  -   ATA  7200  635555/1301618176 635858/1302238304
    spare      0a.69   0a    4   5    FC:B  -   ATA  7200  635555/1301618176 635858/1302238304
    spare      1b.51   1b    3   3    FC:A  -   ATA  7200  635555/1301618176 635858/1302238304
    spare      1b.61   1b    3   13   FC:A  -   ATA  7200  635555/1301618176 635858/1302238304
    spare      1b.87   1b    5   7    FC:A  -   ATA  7200  847555/1735794176 847827/1736350304
    spare      1b.89   1b    5   9    FC:A  -   ATA  7200  847555/1735794176 847827/1736350304
- Done. You have replaced a failed drive with a zeroed spare.
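For anyone who checks spares on several filers, the "(not zeroed)" check from the zeroing step is easy to script against captured vol status -s output. This is only a sketch: the function name is hypothetical, and it assumes spare rows begin with the word "spare" followed by the device name and end with the literal "(not zeroed)" flag, as in the listing in this post.

```python
def unzeroed_spares(vol_status_output):
    """Return device names of spare rows flagged '(not zeroed)'."""
    devices = []
    for line in vol_status_output.splitlines():
        tokens = line.split()
        # Data rows start with lowercase 'spare'; the 'Spare disks' headings
        # start with a capital S and are skipped by this comparison.
        is_spare_row = len(tokens) >= 2 and tokens[0] == "spare"
        if is_spare_row and line.rstrip().endswith("(not zeroed)"):
            devices.append(tokens[1])
    return devices


if __name__ == "__main__":
    captured = """\
Spare disks
spare 0a.53 0a 3 5 FC:B - ATA 7200 635555/1301618176 635858/1302238304 (not zeroed)
spare 0a.69 0a 4 5 FC:B - ATA 7200 635555/1301618176 635858/1302238304
"""
    print(unzeroed_spares(captured))
```

An empty result after disk zero spares has completed matches the final listing above, where no row carries the flag.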