Replacing a Failed NetApp Drive with an Un-zeroed Spare

Jason Boche has a post on the method he used to replace a failed drive on a filer with an un-zeroed spare (transferred from a lab machine); my procedure was a little different.

In this example, I’ll be installing a replacement drive pulled from aggr0 on another filer. Note that this procedure is not relevant for drive failures covered by a support contract, where you will receive a zeroed replacement drive directly from NetApp.

  • Physically remove failed drive and replace with working drive. This will generate log messages similar to the following:
    May 27 11:02:36 filer01 [raid.disk.missing: info]: Disk 1b.51 Shelf 3 Bay 3 [NETAPP   X268_SGLXY750SSX AQNZ] S/N [5QD599LZ] is missing from the system
    May 27 11:03:00 filer01 [monitor.globalStatus.ok: info]: The system's global status is normal. 
    May 27 11:03:16 filer01 [scsi.cmd.notReadyCondition: notice]: Disk device 0a.51: Device returns not yet ready: CDB 0x12: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x0)(7715).
    May 27 11:03:25 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves.
    May 27 11:03:27 filer01 [diskown.changingOwner: info]: changing ownership for disk 0a.51 (S/N P8G9SMDF) from unowned (ID -1) to filer01 (ID 135027165)
    May 27 11:03:27 filer01 [raid.assim.rg.missingChild: error]: Aggregate foreign:aggr0, rgobj_verify: RAID object 0 has only 1 valid children, expected 14.
    May 27 11:03:27 filer01 [raid.assim.plex.missingChild: error]: Aggregate foreign:aggr0, plexobj_verify: Plex 0 only has 0 working RAID groups (2 total) and is being taken offline
    May 27 11:03:27 filer01 [raid.assim.mirror.noChild: ALERT]: Aggregate foreign:aggr0, mirrorobj_verify: No operable plexes found.
    May 27 11:03:27 filer01 [raid.assim.tree.foreign: error]: raidtree_verify: Aggregate aggr0 is a foreign aggregate and is being taken offline. Use the 'aggr online' command to bring it online.
    May 27 11:03:27 filer01 [raid.assim.tree.dupName: error]: Duplicate aggregate names found, an instance of foreign:aggr0 is being renamed to foreign:aggr0(1).
    May 27 11:03:28 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves.
    May 27 11:04:40 filer01 [asup.smtp.sent: notice]: System Notification mail sent: System Notification from filer01 (RAID VOLUME FAILED) ERROR
    May 27 11:04:42 filer01 [asup.post.sent: notice]: System Notification message posted to NetApp: System Notification from filer01 (RAID VOLUME FAILED) ERROR
    

    Note line 6, where it identifies the newly-added disk as part of “foreign:aggr0” and missing the rest of its RAID group; “foreign:aggr0” is taken offline in line 9. In line 10, “foreign:aggr0” is renamed to “foreign:aggr0(1)” because the filer already has an aggr0, as you might expect. Be sure to note the new aggregate name, as you will need it for later steps.

  • Verify aggregate status and names:
    filer01> aggr status
               Aggr State           Status            Options
              aggr0 online          raid_dp, aggr     root
              aggr1 online          raid_dp, aggr     
           aggr0(1) failed          raid_dp, aggr     diskroot, lost_write_protect=off,
                                    foreign           
                                    partial           
              aggr2 online          raid_dp, aggr     nosnap=on
    
  • Double-check the name of the foreign, offline aggregate that was brought in with the replacement drive, and destroy it:
    filer01> aggr destroy aggr0(1)
    Are you sure you want to destroy this aggregate? yes
    Aggregate 'aggr0(1)' destroyed.
    
  • Verify that the aggregate has been removed:
    filer01> aggr status          
               Aggr State           Status            Options
              aggr0 online          raid_dp, aggr     root
              aggr1 online          raid_dp, aggr     
              aggr2 online          raid_dp, aggr     nosnap=on
    
  • Zero the new spare. First, confirm it is un-zeroed:
    filer01> vol status -s
    
    Spare disks
    
    RAID Disk	Device	HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
    ---------	------	------------- ---- ---- ---- ----- --------------    --------------
    Spare disks for block or zoned checksum traditional volumes or aggregates
    spare   	0a.53	0a    3   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 (not zeroed)
    spare   	0a.69	0a    4   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 
    spare   	1b.51	1b    3   3   FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 (not zeroed)
    spare   	1b.61	1b    3   13  FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 
    spare   	1b.87	1b    5   7   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304 
    spare   	1b.89	1b    5   9   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304 
    

    In this example, we actually have two un-zeroed spares – the newly replaced drive (1b.51) and another drive (0a.53). Zero them both:

    filer01> disk zero spares
    

    And verify that they have been zeroed:

    filer01> vol status -s
    
    Spare disks
    
    RAID Disk	Device	HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
    ---------	------	------------- ---- ---- ---- ----- --------------    --------------
    Spare disks for block or zoned checksum traditional volumes or aggregates
    spare   	0a.53	0a    3   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 
    spare   	0a.69	0a    4   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 
    spare   	1b.51	1b    3   3   FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 
    spare   	1b.61	1b    3   13  FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 
    spare   	1b.87	1b    5   7   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304 
    spare   	1b.89	1b    5   9   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304 
    
  • Done. You have replaced a failed drive with a zeroed spare.
Advertisements

HAProxy and Keepalived: Example Configuration

HAProxy is load balancer software that allows you to proxy HTTP and TCP connections to a pool of back-end servers; Keepalived – among other uses – allows you to create a redundant pair of HAProxy servers by moving an IP address between HAProxy hosts in an active-passive configuration.
Continue reading

Dropping Tarballs with Puppet

I frequently find myself using Puppet to expand tarballs in various locations, sometimes fiddling with a directory name here or there. In fact, I do it so often, that I created a “define” for it earlier this week. This could be a little more polished, but in the spirit of sharing first drafts, here goes:

# Small define to expand a tarball at a location; assumes File[$title]
# definition of tarball and installation of pax:

define baselayout::drop_tarball($dest, $dir_name, $dir_sub='') {

  # $dest: cwd in which expansion is done
  # $dir_name: name of top level directory created in $dest
  # $dir_sub: regexp to -s for pax - not supported for .zip archives

  if ($dir_sub) {
    $regexp = "-s $dir_sub"
  } else {
    $regexp = ''
  }

  # CentOS' pax doesn't support "-j" flag; therefore, run pax after
  # bzcat in a pipeline. Twiddle path to bzcat as distro-appropriate:
  case $operatingsystem {
    CentOS: {
      $bzcat = "/usr/bin/bzcat"
    }
    Ubuntu: {
      $bzcat = "/bin/bzcat"
    }
  }
  
  # Choose expansion method based on file suffix:
  if (($title =~ /.tar.gz$/) or ($title =~ /.tgz$/)) {
    $expand = "/usr/bin/pax -rz $regexp < $title"
  } elsif (($title =~ /.tar.bz2$/) or ($title =~ /.tbz$/)) {
    $expand = "$bzcat $title | /usr/bin/pax -r $regexp"
  } elsif ( $title =~ /.zip$/ ) {
    $expand = "/usr/bin/unzip $title"
  }
  
  exec { "drop_tarball $title":
    command => $expand,
    cwd => $dest,
    creates => "${dest}/${dir_name}",
    require => File[$title],
  }
  
}

The definition is written for Ubuntu and CentOS, assumes pax is installed on the system, and that a file resource for the tarball is defined before the definition is called. Pax is used instead of tar to facilitate renaming the top-level directory of the tarball. Zipped directories are also support, but without rename functionality.

I’ll update a gist as I develop the definition.

Comments welcome.

Sunday Project: Installing CyanogenMod on an HTC Hero

This weekend, I finally rooted my old but functional Sprint HTC Hero, and installed CyanogenMod 6.1.0 on it. Below are my notes on the process.

(Note: This post is more of a compilation than any original work of my own; I’ve tried to reference sources for the information that I present here as best as I recall; please post any links I should have included in the comments.)
Continue reading

Using an OpenLDAP Proxy to Work Around Solaris/Active Directory Issues

There is a long-standing bug in (Open)Solaris and derivatives (including NexentaStor) that breaks Active Directory interoperability:

Beginning with Windows Server 2003, Active Directory supports VLV searches. Every VLV search request must be accompanied by 2 request controls: the SSS control and the VLV control. However, Active Directory imposes some general criteria on the SSS control:

1. Cannot sort based on more than one sort keys/attributes.
2. Cannot sort based on the “distinguishedName” attribute (presumably Microsoft does not use the “DN” attribute).
3. Cannot sort based on a constructed attribute (presumably an attribute not stored on Active Directory).

Unfortunately, Solaris LDAP clients use 2 sort keys/attributes: “cn” and “uid” in the SSS control. Subsequently, when dumping a container or a naming database, Solaris LDAP clients would receive LDAP_UNAVAILABLE_CRITICAL_EXTENSION.

$ ldaplist passwd
ldaplist: Object not found (LDAP ERROR (12): Unavailable critical extension.)

This issue has been detailed elsewhere, including at utexas.edu. There appear to be at least four solutions:

  1. Wait for the fix from Sun Oracle to reach the light of day: this bug was apparently fixed in SNV 144. (I expect the fix is out in Solaris 11 Express now, but have not tested this myself.)
  2. Apply the hotfix in Microsoft’s KB886683 to your domain controllers, which will disable VLV.
  3. Run separate ADAM instances with VLV disabled, and point your Solaris machines at them instead of directly at your domain controllers. From the blog post linked above, it sounds like the University of Texas chose this route.
  4. Use OpenLDAP as a proxy in front of Active Directory; configure your Solaris machines to use the proxies instead of Active Directory servers. This is the solution detailed in this blog post.

Continue reading