thinking sysadmin

qstat -u aleonard -s z

Archive for the ‘storage’ Category

Replacing a Failed NetApp Drive with an Un-zeroed Spare

Jason Boche has a post on the method he used to replace a failed drive on a filer with an un-zeroed spare (transferred from a lab machine); my procedure was a little different.

In this example, I’ll be installing a replacement drive pulled from aggr0 on another filer. Note that this procedure is not relevant for drive failures covered by a support contract, where you will receive a zeroed replacement drive directly from NetApp.

  • Physically remove the failed drive and replace it with the working drive. This will generate log messages similar to the following:
    May 27 11:02:36 filer01 [raid.disk.missing: info]: Disk 1b.51 Shelf 3 Bay 3 [NETAPP   X268_SGLXY750SSX AQNZ] S/N [5QD599LZ] is missing from the system
    May 27 11:03:00 filer01 [monitor.globalStatus.ok: info]: The system's global status is normal.
    May 27 11:03:16 filer01 [scsi.cmd.notReadyCondition: notice]: Disk device 0a.51: Device returns not yet ready: CDB 0x12: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x0)(7715).
    May 27 11:03:25 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves.
    May 27 11:03:27 filer01 [diskown.changingOwner: info]: changing ownership for disk 0a.51 (S/N P8G9SMDF) from unowned (ID -1) to filer01 (ID 135027165)
    May 27 11:03:27 filer01 [raid.assim.rg.missingChild: error]: Aggregate foreign:aggr0, rgobj_verify: RAID object 0 has only 1 valid children, expected 14.
    May 27 11:03:27 filer01 [raid.assim.plex.missingChild: error]: Aggregate foreign:aggr0, plexobj_verify: Plex 0 only has 0 working RAID groups (2 total) and is being taken offline
    May 27 11:03:27 filer01 [raid.assim.mirror.noChild: ALERT]: Aggregate foreign:aggr0, mirrorobj_verify: No operable plexes found.
    May 27 11:03:27 filer01 [raid.assim.tree.foreign: error]: raidtree_verify: Aggregate aggr0 is a foreign aggregate and is being taken offline. Use the 'aggr online' command to bring it online.
    May 27 11:03:27 filer01 [raid.assim.tree.dupName: error]: Duplicate aggregate names found, an instance of foreign:aggr0 is being renamed to foreign:aggr0(1).
    May 27 11:03:28 filer01 [sfu.firmwareUpToDate: info]: Firmware is up-to-date on all disk shelves.
    May 27 11:04:40 filer01 [asup.smtp.sent: notice]: System Notification mail sent: System Notification from filer01 (RAID VOLUME FAILED) ERROR
    May 27 11:04:42 filer01 [asup.post.sent: notice]: System Notification message posted to NetApp: System Notification from filer01 (RAID VOLUME FAILED) ERROR
    

    Note the raid.assim.rg.missingChild message (the sixth log line), which identifies the newly added disk as part of “foreign:aggr0” and missing the rest of its RAID group; “foreign:aggr0” is then taken offline in the raid.assim.tree.foreign message (line 9). In the raid.assim.tree.dupName message (line 10), “foreign:aggr0” is renamed to “foreign:aggr0(1)” because the filer already has an aggr0, as you might expect. Be sure to note the new aggregate name, as you will need it in later steps.

  • Verify aggregate status and names:
    filer01> aggr status
               Aggr State           Status            Options
              aggr0 online          raid_dp, aggr     root
              aggr1 online          raid_dp, aggr
           aggr0(1) failed          raid_dp, aggr     diskroot, lost_write_protect=off,
                                    foreign
                                    partial
              aggr2 online          raid_dp, aggr     nosnap=on
    
  • Double-check the name of the foreign, offline aggregate that was brought in with the replacement drive, and destroy it:
    filer01> aggr destroy aggr0(1)
    Are you sure you want to destroy this aggregate? yes
    Aggregate 'aggr0(1)' destroyed.
    
  • Verify that the aggregate has been removed:
    filer01> aggr status
               Aggr State           Status            Options
              aggr0 online          raid_dp, aggr     root
              aggr1 online          raid_dp, aggr
              aggr2 online          raid_dp, aggr     nosnap=on
    
  • Zero the new spare. First, confirm it is un-zeroed:
    filer01> vol status -s
    
    Spare disks
    
    RAID Disk	Device	HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
    ---------	------	------------- ---- ---- ---- ----- --------------    --------------
    Spare disks for block or zoned checksum traditional volumes or aggregates
    spare   	0a.53	0a    3   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304 (not zeroed)
    spare   	0a.69	0a    4   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304
    spare   	1b.51	1b    3   3   FC:A   -  ATA   7200 635555/1301618176 635858/1302238304 (not zeroed)
    spare   	1b.61	1b    3   13  FC:A   -  ATA   7200 635555/1301618176 635858/1302238304
    spare   	1b.87	1b    5   7   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304
    spare   	1b.89	1b    5   9   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304
    

    In this example, we actually have two un-zeroed spares – the newly replaced drive (1b.51) and another drive (0a.53). Zero them both:

    filer01> disk zero spares
    

    And verify that they have been zeroed:

    filer01> vol status -s
    
    Spare disks
    
    RAID Disk	Device	HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
    ---------	------	------------- ---- ---- ---- ----- --------------    --------------
    Spare disks for block or zoned checksum traditional volumes or aggregates
    spare   	0a.53	0a    3   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304
    spare   	0a.69	0a    4   5   FC:B   -  ATA   7200 635555/1301618176 635858/1302238304
    spare   	1b.51	1b    3   3   FC:A   -  ATA   7200 635555/1301618176 635858/1302238304
    spare   	1b.61	1b    3   13  FC:A   -  ATA   7200 635555/1301618176 635858/1302238304
    spare   	1b.87	1b    5   7   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304
    spare   	1b.89	1b    5   9   FC:A   -  ATA   7200 847555/1735794176 847827/1736350304
    
  • Done. You have replaced a failed drive with a zeroed spare.
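If you would rather not eyeball the console output, the two checks in the steps above (finding the renamed foreign aggregate and spotting un-zeroed spares) can be sketched in Python. This is only an illustration and not part of the original procedure: it assumes you have saved the log messages and the "vol status -s" output as text, that their format matches the examples shown in this post, and the function names are made up for this sketch.

```python
import re

# Matches the tail of the raid.assim.tree.dupName message shown in the log above.
RENAME_RE = re.compile(r"is being renamed to foreign:(\S+?)\.?$")


def renamed_foreign_aggregate(log_text):
    """Return the new name given to a renamed foreign aggregate, or None."""
    for line in log_text.splitlines():
        if "raid.assim.tree.dupName" in line:
            match = RENAME_RE.search(line.strip())
            if match:
                return match.group(1)
    return None


def unzeroed_spares(vol_status_text):
    """Return the device names of spares flagged '(not zeroed)'."""
    devices = []
    for line in vol_status_text.splitlines():
        fields = line.split()
        is_spare_row = len(fields) >= 2 and fields[0] == "spare"
        if is_spare_row and line.rstrip().endswith("(not zeroed)"):
            devices.append(fields[1])
    return devices


# Sample text copied from the console output in this post.
SAMPLE_LOG = (
    "May 27 11:03:27 filer01 [raid.assim.tree.dupName: error]: "
    "Duplicate aggregate names found, an instance of foreign:aggr0 "
    "is being renamed to foreign:aggr0(1)."
)

SAMPLE_SPARES = """\
Spare disks for block or zoned checksum traditional volumes or aggregates
spare   0a.53 0a 3 5  FC:B - ATA 7200 635555/1301618176 635858/1302238304 (not zeroed)
spare   0a.69 0a 4 5  FC:B - ATA 7200 635555/1301618176 635858/1302238304
spare   1b.51 1b 3 3  FC:A - ATA 7200 635555/1301618176 635858/1302238304 (not zeroed)
spare   1b.61 1b 3 13 FC:A - ATA 7200 635555/1301618176 635858/1302238304
"""

if __name__ == "__main__":
    print(renamed_foreign_aggregate(SAMPLE_LOG))
    print(unzeroed_spares(SAMPLE_SPARES))
```

Run against the sample text, the first function reports the name to pass to "aggr destroy" and the second lists the spares that "disk zero spares" still has to process. Treat the result as a cross-check only; always confirm the aggregate name on the filer itself before destroying anything.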

Written by Andy

May 28th, 2011 at 5:42 am

Posted in storage

NexentaStor in front of a NetApp FC LUN using MPxIO

  1. Create a Fibre Channel LUN on your NetApp and map it to your NexentaStor machine (I’m using version 3.0.2 in this example). Here, I’ve created a 10GB LUN on a filer running ONTAP 7.2:
    netapp01> lun show /vol/nexenta01/lun01/lun
            /vol/nexenta01/lun01/lun      10g (10737418240)   (r/w, online, mapped)
    

    There are eight paths from our NetApp to our NexentaStor appliance, so the LUN appears eight times on the “qlc” adapter (the eight “qlc” rows in the “show lun” output below):

    nmc@nexenta01:/$ lunsync
    Cleanup obsolete (dangling) device links?  Yes
    Re-enumerating LUNs... done.
    
    nmc@nexenta01:/$ show lun
    LUN ID      Device    Type         Size       Volume     Mounted Attach GUID
    c0t0d0      sd0       disk         272.3GB    syspool    no      mega_sas 60024e805102c100118a3fa70ae8937a
    c1t0d0      sd128     cdrom        No Media              no      ata    -
    c2t5*DDDd0  sd6       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
    c2t5*DDDd0  sd4       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
    c2t5*DDDd0  sd7       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
    c2t5*DDDd0  sd5       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
    c3t5*DDDd0  sd3       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
    c3t5*DDDd0  sd2       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
    c3t5*DDDd0  sd8       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
    c3t5*DDDd0  sd1       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
    syspo~/swap           zvol         1.0GB      syspool    no
    
    Read the rest of this entry »
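The path count in the “show lun” listing above can be double-checked by grouping rows by GUID. The following Python sketch is an illustration under stated assumptions: the listing has been saved as text in the same column layout as shown here, the last two columns of each FC row are the attach driver and the GUID, and the function name is hypothetical.

```python
from collections import Counter


def paths_per_guid(show_lun_text, attach="qlc"):
    """Count how many rows (paths) each GUID has on the given adapter."""
    counts = Counter()
    for line in show_lun_text.splitlines():
        fields = line.split()
        # FC LUN rows end with: ... <mounted> <attach> <guid>
        if len(fields) >= 3 and fields[-2] == attach:
            counts[fields[-1]] += 1
    return counts


# Sample rows copied from the listing in this post.
SAMPLE_SHOW_LUN = """\
LUN ID      Device    Type         Size       Volume     Mounted Attach GUID
c0t0d0      sd0       disk         272.3GB    syspool    no      mega_sas 60024e805102c100118a3fa70ae8937a
c1t0d0      sd128     cdrom        No Media              no      ata    -
c2t5*DDDd0  sd6       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c2t5*DDDd0  sd4       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c2t5*DDDd0  sd7       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c2t5*DDDd0  sd5       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c3t5*DDDd0  sd3       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c3t5*DDDd0  sd2       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c3t5*DDDd0  sd8       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c3t5*DDDd0  sd1       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
syspo~/swap           zvol         1.0GB      syspool    no
"""

if __name__ == "__main__":
    for guid, count in paths_per_guid(SAMPLE_SHOW_LUN).items():
        print(guid, count)
```

A single GUID with a count equal to the number of expected paths suggests that every row is the same LUN seen over a different path, which is the situation MPxIO is meant to collapse into one device.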

Written by Andy

May 28th, 2010 at 9:35 am

Running NetApp’s aggrSpaceCheck without turning on RSH

When upgrading a NetApp filer from a pre-7.3 release to 7.3, metadata is apparently moved from within the FlexVol into the containing aggregate. If your aggregate is tight on space – more than 96% full – NetApp requires that you complete extra verification steps to ensure that you can complete the upgrade. From the Data ONTAP® 7.3.1.1 Release Notes (NOW login required):

If you suspect that your system has almost used all of its free space, or if you use thin provisioning, you should check the amount of space in use by each aggregate. If any aggregate is 97 percent full or more, do not proceed with the upgrade until you have used the Upgrade Advisor or aggrSpaceCheck tools to determine your system capacity and plan your upgrade.

Upgrade Advisor is a great tool, and I heartily recommend you use it for your upgrade. However, it doesn’t give you a lot of visibility into what’s being checked for here. Lucky for us, NetApp offers an alternative tool: aggrSpaceCheck (NOW login required).
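The 97 percent threshold quoted from the release notes is simple arithmetic, sketched below in Python. This is not what aggrSpaceCheck or Upgrade Advisor actually does; it only illustrates the "is this aggregate too full to upgrade without further checks" test. The function names and sample figures are invented, and how ONTAP itself rounds the percentage is an assumption not covered here.

```python
def percent_full(used_kb, total_kb):
    """Return aggregate usage as a percentage of total capacity."""
    if total_kb <= 0:
        raise ValueError("total_kb must be positive")
    return 100.0 * used_kb / total_kb


def needs_capacity_planning(used_kb, total_kb, threshold=97.0):
    """True if usage is at or above the threshold from the release notes."""
    return percent_full(used_kb, total_kb) >= threshold


if __name__ == "__main__":
    # Hypothetical aggregates: (name, used KB, total KB)
    aggregates = [("aggr0", 965, 1000), ("aggr1", 970, 1000)]
    for name, used, total in aggregates:
        flag = "check before upgrading" if needs_capacity_planning(used, total) else "ok"
        print(name, round(percent_full(used, total), 1), flag)
```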
Read the rest of this entry »

Written by Andy

June 24th, 2009 at 3:08 pm

Posted in storage

NetApp FAS2020 aggregate capacity on ONTAP 7.3.1 – now 16TB

My NetApp FAS 2020 Sizing post remains popular nearly a year after I wrote it. However, with ONTAP 7.3.1 (and later releases) out, it’s also out of date. Here’s current information from p. 33 of the ONTAP 7.3.1.1 release notes (NOW login required):

Beginning with Data ONTAP 7.3.1, FAS2020 systems support aggregates up to 16 TB raw capacity,
provided that the root volume is hosted in a dedicated aggregate (that is, one that contains only the root
volume and no user data).

The release notes go on to point out an alternative to the dedicated root aggregate – having two spare disks per controller.

It’s nice to see the FAS2020 finally getting a maximum aggregate size on par with the rest of NetApp’s product line. However, in an era where 2TB drives are available from Western Digital – and presumably other manufacturers before too long – ONTAP’s 16TB aggregate limit grows increasingly anachronistic.

Written by Andy

June 23rd, 2009 at 10:41 am

Posted in storage

SnapManager for Exchange/SnapVault Integration Requirements

Update: NetApp has a KB article in NOW addressing this: Using SnapVault to Archive SnapManager for Exchange Backups Sets. Bottom line: You do not necessarily need ONTAP 7.3, Protection Manager and DataFabric Manager to send SnapManager for Exchange snapshots to a SnapVault secondary.

We recently acquired SnapManager for Exchange (SME) at my place of employment. We have an existing NetApp deployment consisting of two primary filers in a SnapVault arrangement with a third filer. The SME install is part of an upgrade from Exchange 2003 (on DAS) to 2007 (on Fibre Channel storage).

What we missed prior to purchasing SME: If you want to use SnapVault with SME, you need two additional pieces of software: Protection Manager and NetApp Management Console (part of DataFabric Manager, apparently). Here’s what p. 408 of the SnapManager® 5.0 for Microsoft® Exchange Installation and Administration Guide (NOW login required) says:

The following are the software dependencies for integrating SnapManager with
data set and SnapVault:

◆ Protection Manager 3.7 and later
◆ NetApp Management Console 3.7 and later
◆ SnapDrive for Windows 6.0 and later
◆ Data ONTAP 7.3 or later

Wish I’d known that sooner.

(This is the point where some random NetApp fanboy pops down to the comments and fires off something about how NetApp is the greatest storage company ever, and if I’d done appropriate due diligence, I wouldn’t have missed this requirement. My advice: Spare us, smart guy. I’m writing this post to make it easier for other NetApp customers to do their “due diligence”.)

Written by Andy

June 18th, 2009 at 11:00 am

Duplicity to Amazon S3 on FreeBSD: Building on the work of others

(This post adds only a couple small details to work described at randys.org and cenolan.com – go there for background on this post and useful scripts for automated Duplicity backup to S3.)

First off, if you want to use Duplicity installed from FreeBSD Ports to back up to Amazon S3, be sure to also install the devel/py-boto and security/pinentry-curses ports.

If you attempt to run the backup script described at randys.org or cenolan.com from cron, you may run into an error similar to the following:
Read the rest of this entry »

Written by Andy

March 2nd, 2009 at 12:47 pm

Posted in freebsd,storage

New Year’s Resolution: Stop shouting at my disk arrays

Apparently, disk arrays are sensitive sorts that respond poorly when yelled at:

Makes me wonder how much engineering that I never thought about goes into designing disk shelves to keep drives insulated from vibrations. The Fishworks analytics interface is dazzling – wish I had that yesterday when I was looking at a possible Exchange I/O performance issue with perfmon…

Written by Andy

January 1st, 2009 at 9:10 am

Posted in storage

Fishworks on the VMware HCL

I was checking out VMware’s new online searchable HCL and I noticed that the new Sun Unified Storage Systems were on it. That was fast – and now I’m really curious as to how the systems with flash drives perform as storage for ESX.

Written by Andy

December 11th, 2008 at 12:29 pm

Posted in storage,virtualization

Fishworks’ LDAP Schema Definition

Quick notes on configuring LDAP in Fishworks, gleaned from my experience working with the VMware simulator:

As I noted in my “quick walk” post’s comments, I had difficulty getting LDAP working initially on my corporate Active Directory network. The crux for me turned out to be getting the LDAP Schema Definitions correct. Here are the settings that worked correctly for me, authenticating against an AD instance with the schema extended by Microsoft’s Services for Unix add-on (other LDAP schemata will, of course, need different mappings):

USERS
Search descriptor: Don’t leave this blank – according to the Fishworks documentation this “sets the LDAP search descriptor, attribute mappings and object class mappings for users and groups. By default, the search descriptor for users is ou=people,dc=example,dc=com, and for groups is ou=group,dc=example,dc=com” – so what you enter will be site-specific.

Attribute mappings:

  • uid=msSFU30Name
  • uidNumber=msSFU30UidNumber
  • gidNumber=msSFU30GidNumber

Object class mappings:

  • posixAccount=User

GROUPS
Search descriptor: Again, don’t leave this blank – enter the appropriate value for your site.

Attribute mappings:

  • gidNumber=msSFU30GidNumber
  • uniqueMember=msSFU30PosixMember

Object class mappings:

  • posixGroup=group
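To make the effect of these mappings concrete, here is a small Python sketch of what an attribute and object class mapping conceptually does to a lookup: a query phrased in POSIX terms is rewritten into the names the SFU-extended AD schema uses. This is only an illustration of the idea, not Fishworks’ actual implementation; the function name is hypothetical, and the sketch does no escaping of the search value.

```python
# The mappings listed above, written as lookup tables.
USER_ATTRIBUTE_MAP = {
    "uid": "msSFU30Name",
    "uidNumber": "msSFU30UidNumber",
    "gidNumber": "msSFU30GidNumber",
}
USER_OBJECTCLASS_MAP = {"posixAccount": "User"}

GROUP_ATTRIBUTE_MAP = {
    "gidNumber": "msSFU30GidNumber",
    "uniqueMember": "msSFU30PosixMember",
}
GROUP_OBJECTCLASS_MAP = {"posixGroup": "group"}


def translate_filter(object_class, attribute, value, class_map, attr_map):
    """Rewrite a POSIX-style lookup into the directory's own schema names.

    Names without a mapping are passed through unchanged.
    """
    mapped_class = class_map.get(object_class, object_class)
    mapped_attr = attr_map.get(attribute, attribute)
    return "(&(objectClass={})({}={}))".format(mapped_class, mapped_attr, value)


if __name__ == "__main__":
    print(translate_filter("posixAccount", "uid", "alice",
                           USER_OBJECTCLASS_MAP, USER_ATTRIBUTE_MAP))
    print(translate_filter("posixGroup", "gidNumber", "1001",
                           GROUP_OBJECTCLASS_MAP, GROUP_ATTRIBUTE_MAP))
```

The user name “alice” and GID 1001 are placeholders. The point is that a lookup for a posixAccount by uid ends up as a search for a User object by msSFU30Name, which is why a wrong or duplicated mapping breaks name resolution.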

How did I know that the schema definition mappings were the problem? The logs gave it away: Maintenance -> Logs -> System, where I saw messages similar to the following: “libsldap: Status: 0 Mesg: Unable to set value: schema map already existed for ‘User’.”

How did I know that I had the schema definitions working? Share settings that I had created using numeric UIDs and GIDs automatically became mapped to the correct user and group names.

I’ll update this post if I find additional configuration that may be necessary.

Written by Andy

November 18th, 2008 at 5:02 pm

Posted in storage

ElasticFish?

(In the spirit of Joerg Moellenkamp’s thought experiments:)

That virtualized Fishworks appliance got me thinking: What if you combined this with this? Yeah, managing Elastic Block Store devices would require some changes, but, if you needed a NAS for your EC2 instances…

Written by Andy

November 12th, 2008 at 3:21 pm

Posted in storage,virtualization
