Replacing a Failed NetApp Drive with an Un-zeroed Spare

Jason Boche has a post on the method he used to replace a failed drive on a filer with an un-zeroed spare (transferred from a lab machine); my procedure was a little different.

In this example, I’ll be installing a replacement drive pulled from aggr0 on another filer. Note that this procedure is not relevant for drive failures covered by a support contract, where you will receive a zeroed replacement drive directly from NetApp.

  • Physically remove the failed drive and replace it with the working drive. This will generate log messages similar to the following:
    May 27 11:02:36 filer01 [raid.disk.missing: info]: Disk 1b.51 Shelf 3 Bay 3 [NETAPP   X268_SGLXY750SSX AQNZ] S/N [5QD599LZ] is missing from the system
    May 27 11:03:00 filer01 [monitor.globalStatus.ok: info]: The system's global status is normal. 
    May 27 11:03:16 filer01 [scsi.cmd.notReadyCondition: notice]: Disk device 0a.51: Device returns not yet ready: CDB 0x12: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x0)(7715).
    May 27 11:03:25 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves.
    May 27 11:03:27 filer01 [diskown.changingOwner: info]: changing ownership for disk 0a.51 (S/N P8G9SMDF) from unowned (ID -1) to filer01 (ID 135027165)
    May 27 11:03:27 filer01 [raid.assim.rg.missingChild: error]: Aggregate foreign:aggr0, rgobj_verify: RAID object 0 has only 1 valid children, expected 14.
    May 27 11:03:27 filer01 [raid.assim.plex.missingChild: error]: Aggregate foreign:aggr0, plexobj_verify: Plex 0 only has 0 working RAID groups (2 total) and is being taken offline
    May 27 11:03:27 filer01 [raid.assim.mirror.noChild: ALERT]: Aggregate foreign:aggr0, mirrorobj_verify: No operable plexes found.
    May 27 11:03:27 filer01 [raid.assim.tree.foreign: error]: raidtree_verify: Aggregate aggr0 is a foreign aggregate and is being taken offline. Use the 'aggr online' command to bring it online.
    May 27 11:03:27 filer01 [raid.assim.tree.dupName: error]: Duplicate aggregate names found, an instance of foreign:aggr0 is being renamed to foreign:aggr0(1).
    May 27 11:03:28 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves.
    May 27 11:04:40 filer01 [asup.smtp.sent: notice]: System Notification mail sent: System Notification from filer01 (RAID VOLUME FAILED) ERROR
    May 27 11:04:42 filer01 [asup.post.sent: notice]: System Notification message posted to NetApp: System Notification from filer01 (RAID VOLUME FAILED) ERROR
    

    Note the sixth message (raid.assim.rg.missingChild), which identifies the newly added disk as part of “foreign:aggr0” and missing the rest of its RAID group; “foreign:aggr0” is taken offline in the ninth message (raid.assim.tree.foreign). In the tenth message (raid.assim.tree.dupName), “foreign:aggr0” is renamed to “foreign:aggr0(1)” because the filer already has an aggr0. Be sure to note the new aggregate name, as you will need it in later steps.

  • Verify aggregate status and names:
    filer01> aggr status
               Aggr State           Status            Options
              aggr0 online          raid_dp, aggr     root
              aggr1 online          raid_dp, aggr     
           aggr0(1) failed          raid_dp, aggr     diskroot, lost_write_protect=off,
                                    foreign           
                                    partial           
              aggr2 online          raid_dp, aggr     nosnap=on
    
  • Double-check the name of the foreign, offline aggregate that was brought in with the replacement drive, and destroy it:
    filer01> aggr destroy aggr0(1)
    Are you sure you want to destroy this aggregate? yes
    Aggregate 'aggr0(1)' destroyed.
    
  • Verify that the aggregate has been removed:
    filer01> aggr status          
               Aggr State           Status            Options
              aggr0 online          raid_dp, aggr     root
              aggr1 online          raid_dp, aggr     
              aggr2 online          raid_dp, aggr     nosnap=on
    
  • Zero the new spare. First, confirm it is un-zeroed:
    filer01> vol status -s
    
    Spare disks
    
    RAID Disk	Device	HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
    ---------	------	------------- ---- ---- ---- ----- --------------    --------------
    Spare disks for block or zoned checksum traditional volumes or aggregates
    spare   	0a.53	0a    3   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 (not zeroed)
    spare   	0a.69	0a    4   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 
    spare   	1b.51	1b    3   3   FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 (not zeroed)
    spare   	1b.61	1b    3   13  FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 
    spare   	1b.87	1b    5   7   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304 
    spare   	1b.89	1b    5   9   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304 
    

    In this example, we actually have two un-zeroed spares – the newly replaced drive (1b.51) and another drive (0a.53). Zero them both:

    filer01> disk zero spares
    

    Zeroing runs in the background and can take hours on large drives. Once it has finished, verify that the “(not zeroed)” flag is gone:

    filer01> vol status -s
    
    Spare disks
    
    RAID Disk	Device	HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
    ---------	------	------------- ---- ---- ---- ----- --------------    --------------
    Spare disks for block or zoned checksum traditional volumes or aggregates
    spare   	0a.53	0a    3   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 
    spare   	0a.69	0a    4   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 
    spare   	1b.51	1b    3   3   FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 
    spare   	1b.61	1b    3   13  FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 
    spare   	1b.87	1b    5   7   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304 
    spare   	1b.89	1b    5   9   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304 
    
  • Done. You have replaced a failed drive with a zeroed spare.
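
If you save the console output to a file, the two things to look for in the steps above (the renamed foreign aggregate and the un-zeroed spares) can be pulled out with a short script. This is only a rough sketch in Python: the function names are my own invention, and the parsing assumes the exact message and column layout shown in this post, which may differ on other Data ONTAP releases.

```python
import re

# Assumption: the dupName message is worded exactly as in the log excerpt
# in this post. Other Data ONTAP releases may phrase it differently.
DUP_NAME = re.compile(
    r"raid\.assim\.tree\.dupName.*?instance of (\S+) is being renamed to (\S+?)\.?$"
)


def renamed_foreign_aggregates(log_text):
    """Return (old_name, new_name) pairs, with the 'foreign:' prefix stripped.

    Hypothetical helper, not part of any NetApp tooling.
    """
    pairs = []
    for line in log_text.splitlines():
        match = DUP_NAME.search(line.rstrip())
        if match:
            old, new = (name.split(":", 1)[-1] for name in match.groups())
            pairs.append((old, new))
    return pairs


def unzeroed_spares(vol_status_text):
    """Return device names of spares flagged '(not zeroed)' in `vol status -s` output.

    Hypothetical helper; relies on data rows starting with the word 'spare'
    followed by the device name, as in the listing in this post.
    """
    devices = []
    for line in vol_status_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "spare" and "(not zeroed)" in line:
            devices.append(fields[1])
    return devices


# Sample input copied from the output shown earlier in this post.
SAMPLE_LOG = (
    "May 27 11:03:27 filer01 [raid.assim.tree.dupName: error]: Duplicate "
    "aggregate names found, an instance of foreign:aggr0 is being renamed "
    "to foreign:aggr0(1).\n"
)

SAMPLE_SPARES = """\
Spare disks for block or zoned checksum traditional volumes or aggregates
spare    0a.53  0a    3   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 (not zeroed)
spare    0a.69  0a    4   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304
spare    1b.51  1b    3   3   FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 (not zeroed)
"""

if __name__ == "__main__":
    print(renamed_foreign_aggregates(SAMPLE_LOG))  # [('aggr0', 'aggr0(1)')]
    print(unzeroed_spares(SAMPLE_SPARES))          # ['0a.53', '1b.51']
```

The script only reads text; it does not talk to the filer. I would still type the aggr destroy and disk zero spares commands by hand, after checking the names it reports against aggr status.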

One comment

  1. eddie

    Exactly what I need to do. I guess I got a spare disk that was part of “vol1” on another filer and ontap automatically created the vol1 aggregate and offlined it.