May 28, 2011

Replacing a Failed NetApp Drive with an Un-zeroed Spare

Jason Boche has a post on the method he used to replace a failed drive on a filer with an un-zeroed spare (transferred from a lab machine); my procedure was a little different.

In this example, I’ll be installing a replacement drive pulled from aggr0 on another filer. Note that this procedure is not relevant for drive failures covered by a support contract, where you will receive a zeroed replacement drive directly from NetApp.

Physically remove failed drive and replace with working drive. This will generate log messages similar to the following:

May 27 11:02:36 filer01 [raid.disk.missing: info]: Disk 1b.51 Shelf 3 Bay 3 [NETAPP   X268_SGLXY750SSX AQNZ] S/N [5QD599LZ] is missing from the system
May 27 11:03:00 filer01 [monitor.globalStatus.ok: info]: The system's global status is normal. 
May 27 11:03:16 filer01 [scsi.cmd.notReadyCondition: notice]: Disk device 0a.51: Device returns not yet ready: CDB 0x12: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x0)(7715).
May 27 11:03:25 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves.
May 27 11:03:27 filer01 [diskown.changingOwner: info]: changing ownership for disk 0a.51 (S/N P8G9SMDF) from unowned (ID -1) to filer01 (ID 135027165)
May 27 11:03:27 filer01 [raid.assim.rg.missingChild: error]: Aggregate foreign:aggr0, rgobj_verify: RAID object 0 has only 1 valid children, expected 14.
May 27 11:03:27 filer01 [raid.assim.plex.missingChild: error]: Aggregate foreign:aggr0, plexobj_verify: Plex 0 only has 0 working RAID groups (2 total) and is being taken offline
May 27 11:03:27 filer01 [raid.assim.mirror.noChild: ALERT]: Aggregate foreign:aggr0, mirrorobj_verify: No operable plexes found.
May 27 11:03:27 filer01 [raid.assim.tree.foreign: error]: raidtree_verify: Aggregate aggr0 is a foreign aggregate and is being taken offline. Use the 'aggr online' command to bring it online.
May 27 11:03:27 filer01 [raid.assim.tree.dupName: error]: Duplicate aggregate names found, an instance of foreign:aggr0 is being renamed to foreign:aggr0(1).
May 27 11:03:28 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves.
May 27 11:04:40 filer01 [asup.smtp.sent: notice]: System Notification mail sent: System Notification from filer01 (RAID VOLUME FAILED) ERROR
May 27 11:04:42 filer01 [asup.post.sent: notice]: System Notification message posted to NetApp: System Notification from filer01 (RAID VOLUME FAILED) ERROR

Note line 6, where it identifies the newly-added disk as part of “foreign:aggr0” and missing the rest of its RAID group; “foreign:aggr0” is taken offline in line 9. In line 10, “foreign:aggr0” is renamed to “foreign:aggr0(1)” because the filer already has an aggr0, as you might expect. Be sure to note the new aggregate name, as you will need it for later steps.

Verify aggregate status and names:

filer01> aggr status
           Aggr State           Status            Options
          aggr0 online          raid_dp, aggr     root
          aggr1 online          raid_dp, aggr     
       aggr0(1) failed          raid_dp, aggr     diskroot, lost_write_protect=off,
                                foreign           
                                partial           
          aggr2 online          raid_dp, aggr     nosnap=on

Double-check the name of the foreign, offline aggregate that was brought in with the replacement drive, and destroy it:
```
filer01> aggr destroy aggr0(1)
Are you sure you want to destroy this aggregate? yes
Aggregate 'aggr0(1)' destroyed.
```

Verify that the aggregate has been removed:

filer01> aggr status          
           Aggr State           Status            Options
          aggr0 online          raid_dp, aggr     root
          aggr1 online          raid_dp, aggr     
          aggr2 online          raid_dp, aggr     nosnap=on

Zero the new spare. First, confirm it is un-zeroed:

filer01> vol status -s

Spare disks

RAID Disk	Device	HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
---------	------	------------- ---- ---- ---- ----- --------------    --------------
Spare disks for block or zoned checksum traditional volumes or aggregates
spare   	0a.53	0a    3   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 (not zeroed)
spare   	0a.69	0a    4   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 
spare   	1b.51	1b    3   3   FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 (not zeroed)
spare   	1b.61	1b    3   13  FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 
spare   	1b.87	1b    5   7   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304 
spare   	1b.89	1b    5   9   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304

In this example, we actually have two un-zeroed spares – the newly replaced drive (1b.51) and another drive (0a.53). Zero them both:

filer01> disk zero spares

And verify that they have been zeroed:

filer01> vol status -s

Spare disks

RAID Disk	Device	HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
---------	------	------------- ---- ---- ---- ----- --------------    --------------
Spare disks for block or zoned checksum traditional volumes or aggregates
spare   	0a.53	0a    3   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 
spare   	0a.69	0a    4   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 
spare   	1b.51	1b    3   3   FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 
spare   	1b.61	1b    3   13  FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 
spare   	1b.87	1b    5   7   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304 
spare   	1b.89	1b    5   9   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304

Done. You have replaced a failed drive with a zeroed spare.

Thinking Sysadmin

Replacing a Failed NetApp Drive with an Un-zeroed Spare

One comment

Share this:

Related

One comment