<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>thinking sysadmin</title>
	<atom:link href="http://andyleonard.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://andyleonard.com</link>
	<description>qstat -u aleonard -s z</description>
	<lastBuildDate>Fri, 30 Jul 2010 17:47:40 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>NexentaStor in front of a NetApp FC LUN using MPxIO</title>
		<link>http://andyleonard.com/2010/05/28/nexentastor-in-front-of-a-netapp-fc-lun-using-mpxio/</link>
		<comments>http://andyleonard.com/2010/05/28/nexentastor-in-front-of-a-netapp-fc-lun-using-mpxio/#comments</comments>
		<pubDate>Fri, 28 May 2010 17:35:13 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[storage]]></category>
		<category><![CDATA[ALUA]]></category>
		<category><![CDATA[fc]]></category>
		<category><![CDATA[fcp]]></category>
		<category><![CDATA[fibre channel]]></category>
		<category><![CDATA[lun]]></category>
		<category><![CDATA[mpio]]></category>
		<category><![CDATA[mpxio]]></category>
		<category><![CDATA[netapp]]></category>
		<category><![CDATA[nexenta]]></category>
		<category><![CDATA[nexentastor]]></category>
		<category><![CDATA[opensolaris]]></category>
		<category><![CDATA[solaris]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=489</guid>
		<description><![CDATA[
Create a Fibre Channel LUN on your NetApp and map it to your NexentaStor machine (I&#8217;m using version 3.0.2 in this example).  For this example, I&#8217;ve created a 10GB LUN on a filer running ONTAP 7.2:

netapp01&#62; lun show /vol/nexenta01/lun01/lun
        /vol/nexenta01/lun01/lun      10g (10737418240) [...]]]></description>
			<content:encoded><![CDATA[<ol>
<li>Create a Fibre Channel LUN on your NetApp and map it to your NexentaStor machine (I&#8217;m using version 3.0.2 in this example).  For this example, I&#8217;ve created a 10GB LUN on a filer running ONTAP 7.2:
<pre class="brush: bash; light: true;">
netapp01&gt; lun show /vol/nexenta01/lun01/lun
        /vol/nexenta01/lun01/lun      10g (10737418240)   (r/w, online, mapped)
</pre>
<p>There are eight paths from our NetApp to our NexentaStor appliance, so the LUN appears eight times on the &#8220;qlc&#8221; adapter (lines 9-16 below):</p>
<pre class="brush: bash; highlight: [9,10,11,12,13,14,15,16];">
nmc@nexenta01:/$ lunsync
Cleanup obsolete (dangling) device links?  Yes
Re-enumerating LUNs... done.

nmc@nexenta01:/$ show lun
LUN ID      Device    Type         Size       Volume     Mounted Attach GUID
c0t0d0      sd0       disk         272.3GB    syspool    no      mega_sas 60024e805102c100118a3fa70ae8937a
c1t0d0      sd128     cdrom        No Media              no      ata    -
c2t5*DDDd0  sd6       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c2t5*DDDd0  sd4       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c2t5*DDDd0  sd7       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c2t5*DDDd0  sd5       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c3t5*DDDd0  sd3       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c3t5*DDDd0  sd2       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c3t5*DDDd0  sd8       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c3t5*DDDd0  sd1       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
syspo~/swap           zvol         1.0GB      syspool    no
</pre>
</li>
<p><span id="more-489"></span></p>
<li>In <a href="http://kb.hurricane-ridge.com/storage/nexenta/getting-acces-to-a-shell-in-nexentastor">NexentaStor &#8220;expert&#8221; mode</a>, enable MPxIO for your Fibre Channel HBA (schedule this for a maintenance window, as it requires a reboot):
<pre class="brush: bash; light: true;">
root@nexenta01:/volumes# stmsboot -L
stmsboot: MPXIO disabled
root@nexenta01:/volumes# stmsboot -e -D fp
WARNING: This operation will require a reboot.
Do you want to continue ? [y/n] (default: y)
updating //platform/i86pc/boot_archive
updating //platform/i86pc/amd64/boot_archive
The changes will come into effect after rebooting the system.
Reboot the system now ? [y/n] (default: y)
</pre>
<p>Note that this will not have any immediately noticeable effect after rebooting:</p>
<pre class="brush: bash; light: true;">
nmc@nexenta01:/$ lunsync
Cleanup obsolete (dangling) device links?  Yes

Re-enumerating LUNs... done.

nmc@nexenta01:/$ show lun
LUN ID      Device    Type         Size       Volume     Mounted Attach GUID
c0t0d0      sd0       disk         272.3GB    syspool    no      mega_sas 60024e805102c100118a3fa70ae8937a
c1t0d0      sd128     cdrom        No Media              no      ata    -
c2t5*DDDd0  sd6       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c2t5*DDDd0  sd4       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c2t5*DDDd0  sd7       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c2t5*DDDd0  sd5       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c3t5*DDDd0  sd3       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c3t5*DDDd0  sd2       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c3t5*DDDd0  sd8       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
c3t5*DDDd0  sd1       disk         10GB                  no      qlc    60a98000486e542f5034577076716469
syspo~/swap           zvol         1.0GB      syspool    no             -
</pre>
<p>However, in expert mode, you will now see the following:</p>
<pre class="brush: bash; light: true;">
root@nexenta01:/volumes# stmsboot -L
stmsboot: No STMS devices have been found
</pre>
</li>
<li>Enable ALUA (Asymmetric Logical Unit Access) on the initiator group on the NetApp:
<pre class="brush: bash; light: true;">
netapp01&gt; igroup show -v nexenta01
    nexenta01 (FCP):
        OS Type: solaris
        Member: 21:00:00:aa:bb:cc:dd:ee (logged in on: 0b, 0d, vtic)
        Member: 21:01:00:aa:bb:cc:dd:ee (logged in on: 0b, 0d, vtic)
netapp01&gt; igroup set nexenta01 alua yes
netapp01&gt; igroup show -v nexenta01
    nexenta01 (FCP):
        OS Type: solaris
        Member: 21:00:00:aa:bb:cc:dd:ee (logged in on: 0b, 0d, vtic)
        Member: 21:01:00:aa:bb:cc:dd:ee (logged in on: 0b, 0d, vtic)
        ALUA: Yes
</pre>
</li>
<li>Reconfigure and re-scan your NexentaStor HBA; note that the LUN is now attached to &#8220;mpxio&#8221; where it was previously attached to &#8220;qlc&#8221;:
<pre class="brush: bash; highlight: [10];">
nmc@nexenta01:/$ lunsync -r
Cleanup obsolete (dangling) device links?  Yes
Re-scanning HBAs... done.
Re-enumerating LUNs... done.

nmc@nexenta01:/$ show lun
LUN ID      Device    Type         Size       Volume     Mounted Attach GUID
c0t0d0      sd0       disk         272.3GB    syspool    no      mega_sas 60024e805102c100118a3fa70ae8937a
c1t0d0      sd128     cdrom        No Media              no      ata    -
c4t6*469d0  sd9       disk         10GB                  no      mpxio  60a98000486e542f5034577076716469
syspo~/swap           zvol         1.0GB      syspool    no             -
</pre>
<p>In NexentaStor expert mode, note that <code>stmsboot</code> now shows devices:</p>
<pre class="brush: bash; light: true;">
root@nexenta01:/volumes# stmsboot -L
non-STMS device name                    STMS device name
------------------------------------------------------------------
/dev/rdsk/c3t500A09869657ADDDd0 /dev/rdsk/c4t60A98000486E542F5034577076716469d0
/dev/rdsk/c3t500A09889657ADDDd0 /dev/rdsk/c4t60A98000486E542F5034577076716469d0
/dev/rdsk/c3t500A09888657ADDDd0 /dev/rdsk/c4t60A98000486E542F5034577076716469d0
/dev/rdsk/c3t500A09868657ADDDd0 /dev/rdsk/c4t60A98000486E542F5034577076716469d0
/dev/rdsk/c2t500A09869657ADDDd0 /dev/rdsk/c4t60A98000486E542F5034577076716469d0
/dev/rdsk/c2t500A09889657ADDDd0 /dev/rdsk/c4t60A98000486E542F5034577076716469d0
/dev/rdsk/c2t500A09888657ADDDd0 /dev/rdsk/c4t60A98000486E542F5034577076716469d0
/dev/rdsk/c2t500A09868657ADDDd0 /dev/rdsk/c4t60A98000486E542F5034577076716469d0
</pre>
<p>You can now create a NexentaStor volume on your LUN.</p></li>
</ol>
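<p>A quick way to confirm that every path collapsed into a single MPxIO device is to count the rows per STMS device name in the <code>stmsboot -L</code> output.  The following is a minimal sketch, not part of NexentaStor: the function name <code>count_paths</code> is made up for this example, and it assumes the two-column output format shown above.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper: count non-STMS paths per STMS (MPxIO) device.
# Reads "stmsboot -L" style output on stdin; assumes the two-column
# layout shown in this post. Header and separator lines are skipped
# because they do not start with /dev/rdsk/.
count_paths() {
    awk '$1 ~ /^\/dev\/rdsk\// && $2 ~ /^\/dev\/rdsk\// { n[$2]++ }
         END { for (dev in n) print n[dev], dev }'
}

# Example with two rows copied from the output above:
count_paths <<'EOF'
non-STMS device name                    STMS device name
------------------------------------------------------------------
/dev/rdsk/c3t500A09869657ADDDd0 /dev/rdsk/c4t60A98000486E542F5034577076716469d0
/dev/rdsk/c2t500A09869657ADDDd0 /dev/rdsk/c4t60A98000486E542F5034577076716469d0
EOF
```

<p>On the appliance itself this would be run in expert mode as <code>stmsboot -L | count_paths</code>; for the setup in this post it should report eight paths for the one LUN.</p>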
<p><a href="http://twitter.com/complex/status/14855930808">Hat Tip</a> to @complex on Twitter.</p>
<p>Reference: <a href="http://www.nexenta.com/corp/index.php?option=com_content&#038;task=view&#038;id=245&#038;Itemid=119">Is it possible to use I/O multipathing? How?</a></p>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2010/05/28/nexentastor-in-front-of-a-netapp-fc-lun-using-mpxio/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Installing the F5 FirePass VPN Client on Ubuntu 10.04 AMD64</title>
		<link>http://andyleonard.com/2010/05/20/installing-the-f5-firepass-vpn-client-on-ubuntu-10-04-amd64/</link>
		<comments>http://andyleonard.com/2010/05/20/installing-the-f5-firepass-vpn-client-on-ubuntu-10-04-amd64/#comments</comments>
		<pubDate>Thu, 20 May 2010 19:12:21 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[operating systems]]></category>
		<category><![CDATA[10.04]]></category>
		<category><![CDATA[f5]]></category>
		<category><![CDATA[firefox]]></category>
		<category><![CDATA[firepass]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[lucid lynx]]></category>
		<category><![CDATA[lynx]]></category>
		<category><![CDATA[mozilla]]></category>
		<category><![CDATA[ssl vpn]]></category>
		<category><![CDATA[ubuntu]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=474</guid>
		<description><![CDATA[Disclaimer: I am not a FirePass administrator, only an end user, and I have no other relationship with F5.  There may be better methods to address this issue; please comment if you know of one.
See also: f5vpn-login.py, described here, and brought to my attention by sh4k3sph3r3.  A CLI FirePass client is quite likely a better [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Disclaimer:</strong> I am not a FirePass administrator, only an end user, and I have no other relationship with F5.  There may be better methods to address this issue; please comment if you know of one.</p>
<p><strong>See also:</strong> <a href="http://fuhm.net/software/f5vpn-login/">f5vpn-login.py</a>, described <a href="http://fuhm.net/software/f5vpn-login/README">here</a>, and brought to my attention by <a href="http://andyleonard.com/2010/05/20/installing-the-f5-firepass-vpn-client-on-ubuntu-10-04-amd64/#comment-439">sh4k3sph3r3</a>.  A CLI FirePass client is quite likely a better solution than separate browser instances, etc.</p>
<p><strong>Preliminaries:</strong> Although the F5 FirePass SSL VPN product supports Linux, as best as I can tell, that support is somewhat limited: My understanding is that they officially claim support for 32-bit installs only, and they do not appear to track new distribution releases particularly aggressively.  F5 has also been somewhat slow in supporting new browser versions: They <a href="http://devcentral.f5.com/weblogs/f5news/archive/2008/10/06/firepass-v6.0.3-released.aspx">announced support for Firefox 3</a> on October 6, 2008, nearly four months after its release and with only two months to go before Firefox 2 was end-of-lifed.  For Firefox 3.6 support, a comment on the post linked above states that you need to request a special hot fix from F5 (which my site has not applied).  There is no Google Chrome support that I am aware of.</p>
<p>Further, F5&#8217;s automated client installation tools have unfortunately never worked for me on Linux, even when the architecture and browser are in their support matrix.  The manual download instruction links are also broken on the FirePass install I connect to.</p>
<p><strong>Solution:</strong> Install a dedicated 32-bit build of a supported Firefox version and create a single-purpose Firefox profile for VPN use.  Then add the FirePass client to that browser and to the operating system.<br />
<span id="more-474"></span><br />
For the Firefox install, follow the &#8220;Manual Installation&#8221; instructions from the <a href="https://help.ubuntu.com/community/FirefoxNewVersion/MozillaBuilds">Ubuntu Community Documentation</a> site.  Install version 3.5 if your site does not have the hotfix mentioned above.</p>
<p>Be sure to create a new Firefox profile in your account for use with FirePass.  I also recommend modifying the launcher script from the Ubuntu documentation so that it takes you straight to your FirePass site (https://firepass.example.com/ for the purposes of this post):</p>
<pre class="brush: bash;">
#!/bin/bash
exec &quot;$HOME/firefox/firefox&quot; -P mozilla-build https://firepass.example.com/
</pre>
<p>Next, download the client components from your F5 site; again, assuming firepass.example.com, retrieve and save:</p>
<p>https://firepass.example.com/vdesk/vpn/nogzip/downloads.php/linux/np_F5_SSL_VPN.so</p>
<p>and</p>
<p>https://firepass.example.com/vdesk/vpn/nogzip/downloads.php/linux/SSLVpn.tgz</p>
<p>Move np_F5_SSL_VPN.so to the plugins directory of the new Firefox installation &#8211; ~/firefox/plugins if following the Ubuntu documentation.  Based on the file layout, it appears that F5 intended for you to extract SSLVpn.tgz at the root of your file system.  Instead of following this bad practice, extract the SSLVpn.tgz tarball in scratch space as root and manually move the files into place:</p>
<pre class="brush: bash; light: true;">
cp SSLVpn.tgz /tmp
cd /tmp
sudo tar -xvpzf SSLVpn.tgz
# inspect extracted files here...
sudo mkdir -p /usr/local/lib/F5Networks/SSLVPN
cd /tmp/usr/local/lib/F5Networks/SSLVPN
sudo cp -Rp etc svpn var /usr/local/lib/F5Networks/SSLVPN/
</pre>
<p>Using the bash script above, you should now be able to launch your purpose-built FirePass browser installation and have it &#8220;just work&#8221; for Network Access.  Good luck!</p>
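<p>The staging approach above can also be sketched as a small helper.  This is a sketch under assumptions rather than anything F5 ships: the function name <code>stage_tarball</code> is made up, and it only unpacks the archive into a private scratch directory so that its layout can be inspected before anything is copied into place.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper: unpack a tarball into a fresh scratch directory
# instead of / so its contents can be reviewed first.
# Prints the scratch directory path on success.
stage_tarball() {
    local tarball=$1 scratch
    [ -f "$tarball" ] || { echo "no such file: $tarball" >&2; return 1; }
    scratch=$(mktemp -d) || return 1
    tar -xzpf "$tarball" -C "$scratch" || return 1
    echo "$scratch"
}

# Usage (the download path is only an example):
#   dir=$(stage_tarball ~/Downloads/SSLVpn.tgz) && find "$dir" -ls
```

<p>Using <code>mktemp -d</code> instead of extracting directly in /tmp avoids mixing the archive's usr/ tree with whatever else already lives there.</p>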
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2010/05/20/installing-the-f5-firepass-vpn-client-on-ubuntu-10-04-amd64/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Automatic ZFS Snapshot Rotation on FreeBSD</title>
		<link>http://andyleonard.com/2010/04/07/automatic-zfs-snapshot-rotation-on-freebsd/</link>
		<comments>http://andyleonard.com/2010/04/07/automatic-zfs-snapshot-rotation-on-freebsd/#comments</comments>
		<pubDate>Thu, 08 Apr 2010 03:59:02 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[freebsd]]></category>
		<category><![CDATA[anacron]]></category>
		<category><![CDATA[auto-snapshot]]></category>
		<category><![CDATA[bash]]></category>
		<category><![CDATA[snapshot]]></category>
		<category><![CDATA[zfs]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=460</guid>
		<description><![CDATA[OpenSolaris has ZFS Automatic Snapshots; FreeBSD, while it has ZFS, doesn&#8217;t have a comparable feature that I&#8217;m aware of.  So I wrote my own, zfs-snapshot.sh:


#!/usr/local/bin/bash

# Path to ZFS executable:
ZFS=/sbin/zfs

# Parse arguments:
TARGET=$1
SNAP=$2
COUNT=$3

# Function to display usage:
usage() {
    scriptname=`/usr/bin/basename $0`
    echo &#34;$scriptname: Take and rotate snapshots on a ZFS file [...]]]></description>
			<content:encoded><![CDATA[<p>OpenSolaris has <a href="http://blogs.sun.com/timf/en_IE/entry/zfs_automatic_snapshots_in_nv">ZFS Automatic Snapshots</a>; FreeBSD, while it has ZFS, doesn&#8217;t have a comparable feature that I&#8217;m aware of.  So I wrote my own, <code>zfs-snapshot.sh</code>:<br />
<span id="more-460"></span></p>
<pre class="brush: bash;">
#!/usr/local/bin/bash

# Path to ZFS executable:
ZFS=/sbin/zfs

# Parse arguments:
TARGET=$1
SNAP=$2
COUNT=$3

# Function to display usage:
usage() {
    scriptname=`/usr/bin/basename $0`
    echo &quot;$scriptname: Take and rotate snapshots on a ZFS file system&quot;
    echo
    echo &quot;  Usage:&quot;
    echo &quot;  $scriptname target snap_name count&quot;
    echo
    echo &quot;  target:    ZFS file system to act on&quot;
    echo &quot;  snap_name: Base name for snapshots, to be followed by a '.' and&quot;
    echo &quot;             an integer indicating relative age of the snapshot&quot;
    echo &quot;  count:     Number of snapshots in the snap_name.number format to&quot;
    echo &quot;             keep at one time.  Newest snapshot ends in '.0'.&quot;
    echo
    exit 1
}

# Basic argument checks: exactly three arguments are required.
if [ -z &quot;$COUNT&quot; ] ; then
    usage
fi

if [ -n &quot;$4&quot; ] ; then
    usage
fi

# Snapshots are numbered starting at 0; $max_snap is the highest numbered
# snapshot that will be kept.
max_snap=$(($COUNT - 1))

# Clean up oldest snapshot:
if [ -d /${TARGET}/.zfs/snapshot/${SNAP}.${max_snap} ] ; then
    $ZFS destroy -r ${TARGET}@${SNAP}.${max_snap}
fi

# Rename existing snapshots:
dest=$max_snap
while [ $dest -gt 0 ] ; do
    src=$(($dest - 1))
    if [ -d /${TARGET}/.zfs/snapshot/${SNAP}.${src} ] ; then
        $ZFS rename -r ${TARGET}@${SNAP}.${src} ${TARGET}@${SNAP}.${dest}
    fi
    dest=$(($dest - 1))
done

# Create new snapshot:
$ZFS snapshot -r ${TARGET}@${SNAP}.0
</pre>
<p>From the command line, call the script something like the following:</p>
<pre class="brush: bash; light: true;">
./zfs-snapshot.sh tank weekly 5
</pre>
<p>This would take a recursive snapshot of the &#8220;tank&#8221; zpool with the basename weekly, rotating through five snapshots with names &#8220;weekly.0&#8221; through &#8220;weekly.4&#8221;.  This allows you to implement a snapshot scheme roughly similar to NetApp&#8217;s hourly-daily-weekly scheme, if you like.  Because my FreeBSD workstation isn&#8217;t on 24&#215;7, I run only the hourly snapshots out of <code>/etc/crontab</code>:</p>
<pre class="brush: bash; gutter: false;">
# Automated ZFS backups (hourly):
0 * * * * root /root/bin/zfs-snapshot.sh tank hourly 25
</pre>
<p>And daily/weekly/monthly snapshots out of <code>/usr/local/etc/anacrontab</code> (from the sysutils/anacron port):</p>
<pre class="brush: bash; highlight: [9,11,13];">
PATH=/bin:/sbin:/usr/bin:/usr/sbin

# days		make sure the command is executed at least every 'days' days
# delay		delay in minutes, before a command starts
# id		unique id of a command

# days	delay	id		command
1	5	daily		periodic daily
1	10	daily_snap	/root/bin/zfs-snapshot.sh tank daily 8
7	15	weekly		periodic weekly
7	30	weekly_snap	/root/bin/zfs-snapshot.sh tank weekly 5
30	60	monthly		periodic monthly
30	90	monthly_snap	/root/bin/zfs-snapshot.sh tank monthly 13
</pre>
<p>(Of course, this isn&#8217;t as cool as the Gnome-integrated <a href="http://blogs.sun.com/erwann/entry/zfs_on_the_desktop_zfs">Time</a> <a href="http://blogs.sun.com/erwann/entry/new_time_slider_features_in">Slider</a> in OpenSolaris, but it scratches my itch sufficiently.)</p>
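<p>To see the order of operations in <code>zfs-snapshot.sh</code> without touching a pool, the rotation can be sketched as a dry run.  The function <code>plan_rotation</code> below is made up for this illustration: it takes the list of existing snapshot numbers as arguments instead of checking <code>.zfs/snapshot</code>, and it only prints the <code>zfs</code> commands the script would issue.</p>

```shell
#!/usr/bin/env bash
# Hypothetical dry run of the rotation in zfs-snapshot.sh.
# Usage: plan_rotation target snap_name count [existing numbers...]
plan_rotation() {
    local target=$1 snap=$2 count=$3
    shift 3
    local existing=" $* "          # space-padded for whole-word matching
    local max=$((count - 1)) dest src

    # The oldest snapshot is destroyed first to free its name.
    case $existing in
        *" $max "*) echo "zfs destroy -r ${target}@${snap}.${max}" ;;
    esac

    # Remaining snapshots shift up by one, oldest first.
    dest=$max
    while [ "$dest" -gt 0 ]; do
        src=$((dest - 1))
        case $existing in
            *" $src "*) echo "zfs rename -r ${target}@${snap}.${src} ${target}@${snap}.${dest}" ;;
        esac
        dest=$((dest - 1))
    done

    # The new snapshot always takes the .0 slot.
    echo "zfs snapshot -r ${target}@${snap}.0"
}

# Three snapshots kept, all three slots currently occupied:
plan_rotation tank weekly 3 0 1 2
```

<p>Renaming from the oldest snapshot down matters: going the other way would try to rename onto a name that is still in use.</p>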
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2010/04/07/automatic-zfs-snapshot-rotation-on-freebsd/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Drupal Deployment Sysadmin Best Practices</title>
		<link>http://andyleonard.com/2009/12/09/drupal-deployment-sysadmin-best-practices/</link>
		<comments>http://andyleonard.com/2009/12/09/drupal-deployment-sysadmin-best-practices/#comments</comments>
		<pubDate>Thu, 10 Dec 2009 04:14:55 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[drupal]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=401</guid>
		<description><![CDATA[Drupal is a popular open source CMS reportedly used on tens of thousands of sites ranging from personal blogs to whitehouse.gov; for readers of this blog, it probably requires no further introduction.
Despite its many desirable features and continuing popularity, Drupal is not without its shortcomings, as many readers are also likely aware.  Although Drupal [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://drupal.org/">Drupal</a> is a popular open source CMS reportedly used on tens of thousands of sites ranging from personal blogs to <a href="http://www.whitehouse.gov/">whitehouse.gov</a>; for readers of this blog, it probably requires no further introduction.</p>
<p>Despite its many desirable features and continuing popularity, Drupal is not without its shortcomings, as many readers are also likely aware.  Although Drupal has an active and responsive <a href="http://drupal.org/security-team">security team</a>, the software has a long track record of requiring frequent security patches &#8211; Secunia has seven 2009 advisories for <a href="http://secunia.com/advisories/product/17839/?task=advisories_2009">Drupal 6.x</a> listed as of this writing.  Although the comparison is by its nature apples-to-oranges, this ranks Drupal behind similarly large and complex PHP projects such as <a href="http://secunia.com/advisories/product/6745/?task=advisories_2009">WordPress 2.x</a> (5) and <a href="http://secunia.com/advisories/product/5879/?task=advisories_2009">Gallery 2.x</a> (0) &#8211; and the number for Drupal does not include dozens of additional advisories for Drupal modules.  Further, Drupal has <a href="http://drupal.org/node/360605">struggled and lagged</a> with support for PHP 5.3.x, suggesting to this outside observer that the project is having difficulties maintaining its codebase.</p>
<p>All that being said, I do not personally believe that the above issues rule out using Drupal; the benefits outweigh the shortcomings.  So, assuming the question is not whether to deploy Drupal, but how to do so most securely and efficiently, my recommendations from a systems administration perspective are below.<br />
<span id="more-401"></span><br />
<strong>Goals</strong></p>
<p>The recommendations described stem from three goals for a Drupal installation:</p>
<ol>
<li><strong>Security</strong> &#8211; Avoid compromise of your site and system.</li>
<li><strong>Flexibility</strong> &#8211; Create an environment that allows for easy modification of the OS, server and application, and provides for straightforward roll-back of those changes should the need arise.</li>
<li><strong>Stability and Security over Performance</strong> &#8211; Although not mutually exclusive, designing for stability and security can entail trade-offs that may impact performance.  I don&#8217;t directly address performance in these recommendations.</li>
</ol>
<p>The recommendations below are written assuming that the sysadmin (you) responsible for the Drupal server and the developer are two different people.  If you wear both hats, the recommendations can obviously be adapted to fit your dual role.</p>
<p><strong>Recommendations</strong></p>
<ol>
<li><strong>Insist on the latest version of Drupal, and plan to upgrade as Drupal releases come out.</strong>  Some developers may be hesitant to use the latest (stable) version of Drupal, but Drupal&#8217;s security track record dictates this; anything else is not serving the site owner&#8217;s interests.</li>
<li><strong>Use the latest supported version of PHP.</strong>  Drupal&#8217;s PHP version requirements are more brittle than those of most applications; the latest version of PHP is almost always the most stable and secure version.  Combining the two is obvious: Aggressively track the latest version of PHP that Drupal supports.  Additionally, use the latest version of your preferred web server and patch as it is updated.  Note that this implies a source-based installation of PHP and your web server instead of a package-based installation from your distribution.  Carefully consider your build and deployment strategies with an eye towards reproducibility, documentation and ease of roll-back should a problem arise.</li>
<li><strong>Isolate Drupal from other applications.</strong>  Run Drupal on its own system &#8211; physical or virtual &#8211; and in doing so, reduce operating complexity while limiting collateral damage if your Drupal instance should be compromised.  Depending on your backup strategy, don&#8217;t overlook the possibility that running a VM may have obvious recovery and forensic advantages over a physical install should your Drupal site get broken into.</li>
<li><strong>mod_security</strong> &#8211; Consider integrating a &#8220;web application firewall&#8221; such as <a href="http://www.modsecurity.org/">mod_security</a> directly into your web server.  Carefully evaluate the trade-off of additional complexity for enhanced security in your environment.</li>
<li><strong>Deploy a host-based firewall for inbound and outbound traffic</strong> &#8211; This has two functions &#8211; prevent your server from being compromised through a hole in a service other than your web server, and limit damage to other systems if/when you are compromised.  There&#8217;s a decent chance that the only external service you need exposed other than HTTP for your Drupal site is SSH for remote management &#8211; and you can probably lock that down to a very limited set of IP addresses.  And, in the situation where your server is compromised, outbound rules can reduce the ability of your server to attack other machines: Consider, for example, the effect of an outbound rule on port 25 if an attacker attempts to use your compromised server as a spam bot.</li>
<li><strong>Use SELinux</strong> &#8211; Security-Enhanced Linux provides the ability to audit and possibly deny actions on your system through <a href="http://en.wikipedia.org/wiki/Mandatory_access_control">mandatory access control policies</a>.  You will likely want to run SELinux in &#8220;permissive&#8221; mode before your site is publicly available, at which point you would switch to &#8220;enforcing&#8221; mode; &#8220;audit2allow&#8221; and &#8220;audit2why&#8221; can be very helpful tools when developing policies.  See &#8220;man selinux&#8221; for more information.</li>
<li><strong>Use a Sysadmin Staging Server</strong> &#8211; The site developer likely has a staging instance of Drupal for testing out their code changes; deploy a similar environment for testing Sysadmin changes, such as PHP updates or the latest version of Drupal with your code.  Consider using a software testing framework such as <a href="http://seleniumhq.org/">Selenium</a> to automate the tests you run on the sysadmin staging site.</li>
</ol>
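<p>As an illustration of the host-based firewall recommendation, the sketch below prints a candidate iptables rule set for review rather than applying it.  It is a starting point under assumptions, not a complete policy: the function name <code>print_rules</code> and the management network 192.0.2.0/24 (a documentation range) are placeholders, and it assumes the host never needs to send mail directly and ignores IPv6.</p>

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print iptables commands for a Drupal web host.
# Nothing is applied; review the output before running it as root.
print_rules() {
    local mgmt_net=$1   # network allowed to reach SSH (placeholder)
    cat <<EOF
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp -s ${mgmt_net} --dport 22 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 25 -j REJECT
iptables -P INPUT DROP
EOF
}

print_rules 192.0.2.0/24
```

<p>The default-deny policy is printed last so that an existing SSH session is already covered by the ESTABLISHED rule when the policy takes effect.  The outbound port 25 rule is the spam-bot example from the list above.</p>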
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2009/12/09/drupal-deployment-sysadmin-best-practices/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Test Driving Google Public DNS (Updated with OpenDNS comparison)</title>
		<link>http://andyleonard.com/2009/12/03/test-driving-google-publi-dns/</link>
		<comments>http://andyleonard.com/2009/12/03/test-driving-google-publi-dns/#comments</comments>
		<pubDate>Thu, 03 Dec 2009 19:31:16 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[comcast]]></category>
		<category><![CDATA[dns]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[opendns]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=414</guid>
		<description><![CDATA[Google announced its Public DNS service this morning, claiming enhanced performance and security; I took it for a brief test drive with the following results.
(See bottom of post for an update running similar tests on OpenDNS.)
Methods: I searched Google for keywords that I believed fell somewhere between obscure and common and collected the first ten [...]]]></description>
			<content:encoded><![CDATA[<p>Google announced its <a href="http://code.google.com/speed/public-dns/">Public DNS</a> service this morning, claiming enhanced <a href="http://code.google.com/speed/public-dns/docs/performance.html">performance</a> and <a href="http://code.google.com/speed/public-dns/docs/security.html">security</a>; I took it for a brief test drive with the following results.</p>
<p>(See bottom of post for an update running similar tests on OpenDNS.)</p>
<p><strong>Methods:</strong> I searched Google for <a href="http://www.google.com/search?q=lightweight+backpacking">keywords</a> that I believed fell somewhere between obscure and common and collected the first ten hostnames printed on the screen.  I then used local installations of dig to query a collection of DNS servers for the hostnames&#8217; A records and collected the response times.  The different resolvers used were:</p>
<ul>
<li>A local BIND installation (127.0.0.1, cache empty) with Comcast Internet connectivity;</li>
<li>A Comcast DNS server (68.87.69.150) via Comcast Internet connectivity;</li>
<li>My employer&#8217;s internal caching DNS;</li>
<li>Google (8.8.8.8) via my employer&#8217;s Internet connectivity (mostly Level 3);</li>
<li>Google (8.8.8.8) via Comcast; and</li>
<li>Google (8.8.8.8) via an Amazon EC2 instance in us-east-1a.</li>
</ul>
<p>Anticipating a bimodal distribution of results, I assumed high latency responses were cache misses, while low latency responses were cache hits, and categorized results correspondingly.<br />
<span id="more-414"></span><br />
<strong>Limitations:</strong> Chiefly, the small number of hostnames queried.  Results from a larger group of domains would be more conclusive.</p>
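<p>The timing collection described under Methods can be sketched in shell.  This is an illustration, not the script used for the numbers in this post: the function names and the 100 ms hit/miss threshold are assumptions made for this example.  It relies on the <code>;; Query time: N msec</code> line that dig prints in its statistics footer.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helpers for timing DNS lookups with dig.

# Extract the query time in ms from dig output on stdin.
query_time() {
    awk '/^;; Query time:/ { print $4 }'
}

# Label a latency as a likely cache hit or miss.
# The 100 ms default threshold is an assumption, not a measured cutoff.
classify() {
    local ms=$1 threshold=${2:-100}
    if [ "$ms" -lt "$threshold" ]; then echo hit; else echo miss; fi
}

# Time an A-record lookup of each hostname against one resolver.
# Usage: time_lookups 8.8.8.8 example.com example.org
time_lookups() {
    local server=$1 host ms
    shift
    for host in "$@"; do
        ms=$(dig @"$server" "$host" A | query_time)
        # Skip hosts where dig produced no timing line (e.g. a timeout).
        [ -n "$ms" ] || { echo "$host lookup-failed"; continue; }
        echo "$host $ms $(classify "$ms")"
    done
}
```

<p>Running <code>time_lookups</code> twice in a row against the same resolver gives a rough view of which names were already cached on the first pass.</p>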
<p><strong>Results:</strong> Given in the format of Server/Connectivity: Cache Miss/Cache Hit</p>
<ul>
<li>Local BIND server/Comcast: 319ms/0ms</li>
<li>Comcast/Comcast: 166ms/14ms</li>
<li>Google/Comcast: no misses/73ms</li>
<li>Employer/Level 3: 235ms/30ms</li>
<li>Google/Level 3: 204ms/44ms</li>
<li>Google/EC2: 190ms/4ms</li>
</ul>
<p>I concluded that Google/Comcast had no misses by testing another set of obscure hostnames twice each, noting that the first query was slower (~120ms) and the second query was similar in latency to the results above (70ms).  (My belief is that I inadvertently pre-populated the cache for Google/Comcast by my tests elsewhere.)</p>
<p><strong>Discussion:</strong></p>
<ul>
<li><strong>It&#8217;s all about cache hits.</strong>  Whichever resolver gives you the most cache hits will give you the best performance; in these tests, cache misses were anywhere from about five times to well over an order of magnitude slower than cache hits.  In this extremely limited test, the cache-hits champion appears to be Google.  Excluding Google/Comcast, where I believe I pre-populated Google&#8217;s cache, Google had a 50% cache hit rate, while Comcast and my employer only hit 20%.</li>
<li><strong>Location, location, location.</strong>  Secondary to cache hits, the closer the resolver is to you, the better.  Looking at the Comcast results, it&#8217;s hard to get closer than localhost, and, as seems logical, Comcast&#8217;s resolvers have lower cached latency than Google&#8217;s.  <strong>Running a local caching resolver forwarding to Google may be a desirable configuration.</strong></li>
<li><strong>Resolver behavior matters.</strong>  Comcast is notorious for <a href="http://www.dslreports.com/shownews/Comcast-DNS-Redirection-Goes-Nationwide-103762">poor behavior</a>.  It&#8217;s reasonable to expect that Google will be mining your DNS query data.  <strong>Running a slower but directly-controlled local, non-forwarding server may be preferable for privacy and security reasons.</strong></li>
</ul>
<p><strong>Update:</strong>  At the <a href="http://twitter.com/tscalzott/status/6313122948">suggestion</a> of <a href="http://twitter.com/tscalzott">@tscalzott</a>, I researched OpenDNS&#8217;s performance with the same set of hostnames via the same connectivity on their DNS resolvers at 208.67.222.222.  This time, however, I queried the A record for each hostname twice in rapid succession to ascertain how many of my queries were served from OpenDNS&#8217;s cache.  Results are in the format:</p>
<p>DNS Server/Connectivity: Cache miss/Cache hit &#8211; cache hit rate</p>
<ul>
<li>OpenDNS/Comcast: 218ms/30ms &#8211; 20%</li>
<li>OpenDNS/EC2: 144ms/2ms &#8211; 10%</li>
<li>OpenDNS/Level 3: 230ms/4ms &#8211; 40%</li>
</ul>
<p>Compared to Google, OpenDNS had similar latency for cache misses and lower latency for cache hits, but appears to have a lower cache hit rate.  It seems likely that the latency &#8220;winner&#8221; for each user&#8217;s individual situation will depend on where they are on the Internet relative to the nearest Google and OpenDNS installations.  Google&#8217;s greater cache hit rate suggests it may offer better service, but testing a larger number of hostnames would be necessary before being able to state that with any certainty.</p>
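<p>The query-twice method described above can be sketched in a few lines of shell.  This is a rough illustration rather than the exact commands I ran: the function names are made up for the example, www.example.com is a placeholder hostname, and it assumes dig is installed.</p>

```shell
#!/bin/bash
# Sketch: time two back-to-back A-record lookups against one resolver.
# The first answer may be a cache miss; the second should be a cache hit.

# Extract the millisecond figure from dig's ";; Query time: N msec" line.
parse_query_time() {
  awk '/Query time:/ { print $4 }'
}

# query_ms RESOLVER HOSTNAME -> prints the query time in ms
query_ms() {
  dig "@$1" "$2" A +noall +stats | parse_query_time
}

# compare_lookups RESOLVER HOSTNAME -> prints "HOSTNAME first=Nms second=Mms"
compare_lookups() {
  local first second
  first=$(query_ms "$1" "$2")
  second=$(query_ms "$1" "$2")
  echo "$2 first=${first}ms second=${second}ms"
}

# Example (needs network access and dig):
#   compare_lookups 208.67.222.222 www.example.com
```

<p>A first query that is much slower than the second suggests the name was not in the resolver&#8217;s cache; two similar times suggest it already was.</p>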
<p><strong>Disclosure:</strong> I use Google&#8217;s free Apps services to host personal email, and I use their public sites (Search, Reader, News, Analytics, etc.) extensively.  I recently attended a Google Apps for the Enterprise dog-and-pony show where I received a number of small tchotchkes; my wife took the notebook, I kept the pen and binder and threw the rest away.  My employer uses Postini.  I tried OpenDNS briefly several months back, but did not use them long-term because of limitations in my own configuration.</p>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2009/12/03/test-driving-google-publi-dns/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Migrating from self-hosted email to Google Apps for Domains</title>
		<link>http://andyleonard.com/2009/11/24/migrating-from-self-hosted-email-to-google-apps-for-domains/</link>
		<comments>http://andyleonard.com/2009/11/24/migrating-from-self-hosted-email-to-google-apps-for-domains/#comments</comments>
		<pubDate>Wed, 25 Nov 2009 03:01:40 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[personal tech]]></category>
		<category><![CDATA[utility computing]]></category>
		<category><![CDATA[cyrus]]></category>
		<category><![CDATA[dns]]></category>
		<category><![CDATA[gmail]]></category>
		<category><![CDATA[imap]]></category>
		<category><![CDATA[imapsync]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=366</guid>
		<description><![CDATA[I recently moved my personal email from a self-managed Exim/Cyrus setup on a dedicated FreeBSD server to Gmail (Google Apps for Domains).  This migration was motivated by a desire to reduce expenses, reduce time spent managing mail software and the importance of email (for me, personally) dropping to a level where I was willing [...]]]></description>
			<content:encoded><![CDATA[<p>I recently moved my personal email from a self-managed Exim/Cyrus setup on a dedicated FreeBSD server to Gmail (<a href="http://www.google.com/apps/intl/en/group/index.html">Google Apps for Domains</a>).  This migration was motivated by a desire to reduce expenses, reduce time spent managing mail software and the importance of email (for me, personally) dropping to a level where I was willing to accept the risks inherent in outsourcing it.  Details of the exact process I used to migrate mail are below.</p>
<p><strong>Assumptions:</strong> An IMAP interface to your current email, basic competency at managing DNS, and the ability to run the <a href="http://www.linux-france.org/prj/imapsync/README">imapsync</a> Perl script (built via FreeBSD ports in my case, but installation should be straightforward under most UNIX or Linux systems).<br />
<span id="more-366"></span><br />
<strong>0.</strong> Ensure your domain registration is up to date and with a reputable registrar (Pro tip: if you have to wade through pages of deceptive and needless up-sells with your registrar, they&#8217;re not reputable), and fully independent from anything you have set up with Google.  Ensure your DNS configuration is completely correct; if you&#8217;re unsure whether or not it is, buy a membership at <a href="http://www.dnsstuff.com/">DNSstuff</a>, run its tests on your domains, and fix the relevant problems.  If you don&#8217;t fully host your own DNS &#8211; and chances are, if you&#8217;re outsourcing your email, you probably don&#8217;t host your own DNS &#8211; I recommend using two independent DNS providers.  Hosted DNS service that&#8217;s worth spending money with will be able to function as a secondary and allow zone transfers &#8211; and, while you&#8217;re at it, make sure your chosen provider supports <a href="http://www.rfc-archive.org/getrfc.php?rfc=1996">DNS NOTIFY</a> &#8211; it&#8217;ll make changing your MX records in a timely fashion that much easier.  Whatever you do, don&#8217;t use your registrar&#8217;s DNS servers.</p>
<p>You occasionally see <a href="http://discuss.joelonsoftware.com/default.asp?biz.5.730915.0">sad tales of Google Apps woe</a> which could have been mitigated by independent domain registration and a moderate knowledge of DNS.  Don&#8217;t risk becoming another bitter messageboard-posting statistic.</p>
<p>In summary: Don&#8217;t understand DNS or don&#8217;t have time for it?  Stop reading now; either hire a consultant to do the work, or accept that outsourced email probably isn&#8217;t right for you.</p>
<p><strong>0.1.</strong> This goes hand-in-hand with step 0 above: Have an off-Google backup system for your email &#8211; one that runs unattended, automatically, just like a &#8220;real&#8221; backup system would.  Google has its strong points, but customer service for their free products isn&#8217;t one of them.  If Google should update their <a href="http://en.wikipedia.org/wiki/Don%27t_be_evil">unofficial motto</a> to a more succinct &#8220;be evil&#8221;, you want to have an &#8220;out&#8221; and the ability to migrate your mail elsewhere.  Imapsync, linked to above and described below, can be adapted for this purpose.</p>
<p><strong>0.2.</strong> Install and test imapsync before beginning this process.  This is the best tool that I&#8217;ve found for syncing mailboxes between servers; all others I tried were unreliable, and documentation below will be specific to it.  In theory, another tool could work &#8211; IMAP is IMAP and a standard &#8211; but I leave using other migration tools as an exercise for the reader.</p>
<p><strong>1.</strong> Drop your MX record TTL to a suitably low value; I used 60 seconds during my migration.  Let the older, presumably greater TTL age off prior to making any further changes to your MX records (as in step 7 below).</p>
<p><strong>2.</strong> <a href="http://www.google.com/a/cpanel/domain/new">Sign up</a> your domain for Google Apps, and use DNS to verify control.</p>
<p><strong>3.</strong> Create users for the domain, and accept terms.  A best practice would be to use different passwords on Gmail from what users currently have, and do note that you will need both old and new passwords for the synchronization step.  (If you are currently running a Cyrus IMAP server with sasldb, passwords can be extracted by running &#8220;db3_dump185 -p sasldb2.db&#8221; as a user that can read sasldb.)</p>
<p><strong>4.</strong> In the Google Apps control panel, create distribution lists and aliases (&#8220;nicknames&#8221;) for your users as appropriate.</p>
<p><strong>5.</strong> For each Gmail mailbox, enable IMAP: Settings > Forwarding and POP/IMAP: Enable IMAP.</p>
<p><strong>6.</strong> If appropriate (and assuming you have other users in your domain), have users verify their logins and familiarize themselves with Gmail&#8217;s options.</p>
<p><strong>7.</strong> Change your <a href="http://www.google.com/support/a/bin/answer.py?answer=33352">MX records</a> to point to Google; keep your TTL low in case you need to change back to your old servers.</p>
<p><strong>8.</strong> Verify delivery to your new Gmail mailboxes by sending test messages from a third-party address.</p>
<p><strong>9.</strong> Set an appropriate <a href="http://www.google.com/support/a/bin/answer.py?hl=en&#038;answer=33786">SPF record</a> for your domain in DNS.</p>
<p><strong>10.</strong> Sync your old mailboxes with your new; I recommend running imapsync directly on your mail server, if possible, for best performance.  Example imapsync session for a Cyrus inbox:</p>
<p><code>imapsync --host1 mail.example.com --port1 143 --user1 user@example.com --password1 [...] --prefix1 INBOX. --host2 imap.gmail.com --port2 993 --user2 user@example.com --password2 [...] --ssl2 --folder INBOX</code></p>
<p>(Insert passwords as appropriate.)</p>
<p>Example imapsync session for a Cyrus folder other than the inbox:</p>
<p><code>imapsync --host1 mail.example.com --port1 143 --user1 user@example.com --password1 [...] --prefix1 INBOX. --host2 imap.gmail.com --port2 993 --user2 user@example.com --password2 [...] --ssl2 --folder INBOX.saved-messages</code></p>
<p>Note that Gmail has certain reserved labels that you cannot sync directly to, such as &#8220;Sent&#8221;.  In this case, you&#8217;ll need to use the &#8220;--regextrans2&#8221; flag to sync to a different folder:</p>
<p><code>imapsync --host1 mail.example.com --port1 143 --user1 user@example.com --password1 [...] --prefix1 INBOX. --host2 imap.gmail.com --port2 993 --user2 user@example.com --password2 [...] --ssl2 --folder INBOX.Sent --regextrans2 's/Sent/Old-Sent/'</code></p>
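<p>With more than a couple of folders, it can help to generate the commands rather than type them.  The following is a minimal sketch that uses only the imapsync flags shown above; the variable names (MAIL_HOST, MAIL_USER, OLD_PASSWORD, NEW_PASSWORD) and the function name are invented for this example, and the folder list is a placeholder.  It only prints the commands so they can be reviewed before anything is run.</p>

```shell
#!/bin/bash
# Sketch: print one imapsync command per Cyrus folder, reusing the flags
# from the examples above. Host, user, and folder names are placeholders.

MAIL_HOST="mail.example.com"
MAIL_USER="user@example.com"

# imapsync_cmd FOLDER -> prints the command line for syncing that folder.
# Passwords stay as literal $OLD_PASSWORD / $NEW_PASSWORD references,
# to be set by the operator in the shell that finally runs the commands.
imapsync_cmd() {
  local folder="$1"
  local cmd="imapsync --host1 $MAIL_HOST --port1 143 --user1 $MAIL_USER"
  cmd="$cmd --password1 \"\$OLD_PASSWORD\" --prefix1 INBOX."
  cmd="$cmd --host2 imap.gmail.com --port2 993 --user2 $MAIL_USER"
  cmd="$cmd --password2 \"\$NEW_PASSWORD\" --ssl2 --folder $folder"
  # Gmail reserves "Sent", so map it to another label as shown above.
  if [ "$folder" = "INBOX.Sent" ]; then
    cmd="$cmd --regextrans2 's/Sent/Old-Sent/'"
  fi
  echo "$cmd"
}

# Print the commands for review; nothing is executed here.
for folder in INBOX INBOX.saved-messages INBOX.Sent; do
  imapsync_cmd "$folder"
done
```

<p>Printing instead of executing is deliberate: a typo in a folder name is much cheaper to catch on screen than halfway through a sync.</p>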
<p><strong>11.</strong> After double checking that your old MX record&#8217;s TTL has passed, and no new mail is being delivered to your old mail server, decommission it as necessary.</p>
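<p>One way to double check the MX records and their TTLs is to ask dig for just the answer section.  This is a sketch under the assumption that dig is available; the function names are made up for the example and example.com is a placeholder.</p>

```shell
#!/bin/bash
# Sketch: list the TTL and target of each MX record for a domain.

# Parse "name TTL IN MX preference target" lines from dig's answer section.
parse_mx_answer() {
  awk '$4 == "MX" { print $2, $6 }'
}

# mx_ttls DOMAIN [RESOLVER] -> prints "TTL target" per MX record
mx_ttls() {
  if [ -n "$2" ]; then
    dig "@$2" "$1" MX +noall +answer | parse_mx_answer
  else
    dig "$1" MX +noall +answer | parse_mx_answer
  fi
}

# Example (needs network access and dig):
#   mx_ttls example.com
```

<p>Note that a caching resolver reports the time remaining on its cached copy, so the TTL it shows counts down; query one of your authoritative servers (the optional second argument) to see the configured value.</p>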
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2009/11/24/migrating-from-self-hosted-email-to-google-apps-for-domains/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Interesting Linux VM Crash Pattern</title>
		<link>http://andyleonard.com/2009/11/20/interesting-linux-vm-crash-pattern/</link>
		<comments>http://andyleonard.com/2009/11/20/interesting-linux-vm-crash-pattern/#comments</comments>
		<pubDate>Fri, 20 Nov 2009 19:09:16 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[virtualization]]></category>
		<category><![CDATA[centos]]></category>
		<category><![CDATA[crash]]></category>
		<category><![CDATA[dell]]></category>
		<category><![CDATA[iscsi]]></category>
		<category><![CDATA[kernel panic]]></category>
		<category><![CDATA[mptscsi]]></category>
		<category><![CDATA[netapp]]></category>
		<category><![CDATA[nfs]]></category>
		<category><![CDATA[rhel]]></category>
		<category><![CDATA[vmware]]></category>
		<category><![CDATA[vmware esx]]></category>
		<category><![CDATA[vsmp]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=343</guid>
		<description><![CDATA[I&#8217;ve just begun to pull together some interesting data on a series of Linux VM crashes I&#8217;ve seen.  I don&#8217;t have a resolution yet, but some interesting patterns have emerged.
Crash Symptoms
A CentOS 4.x or 5.x guest will crash with a message similar to the following on its console:
CentOS 4.x:
[&#60;f883b299&#62;] .text.lock.scsi_error+0x19/0x34 [scsi_mod]
[&#60;f88c19ce&#62;] mptscsih_io_done+0x5ee/0x608 [mptscsi] (…)
[&#60;c02de564&#62;] [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve just begun to pull together some interesting data on a series of Linux VM crashes I&#8217;ve seen.  I don&#8217;t have a resolution yet, but some interesting patterns have emerged.</p>
<p><strong>Crash Symptoms</strong></p>
<p>A CentOS 4.x or 5.x guest will crash with a message similar to the following on its console:</p>
<p>CentOS 4.x:</p>
<p><code>[&lt;f883b299&gt;] .text.lock.scsi_error+0x19/0x34 [scsi_mod]<br />
[&lt;f88c19ce&gt;] mptscsih_io_done+0x5ee/0x608 [mptscsi] (…)<br />
[&lt;c02de564&gt;] common_interrupt+0x18/0x20<br />
[&lt;c02ddb54&gt;] system_call+0x0/0x30</code></p>
<p>CentOS 5.x:</p>
<p><code>RIP  [&lt;ffffffff8014c562&gt;] list_del+0x48/0x71 RSP &lt;ffffffff80425d00&gt; &lt;0&gt;Kernel Panic - not syncing: Fatal exception</code></p>
<p>A hard reset (i.e. pressing the reset button on the VM&#8217;s console) is required to reboot the guest.<br />
<span id="more-343"></span><br />
<strong>Further Details</strong></p>
<p>Five different VMs have encountered this issue, running at a mix of close-to-current CentOS 4.x and 5.x patch levels.  Guest kernel versions when the crash occurred were 2.6.18-128.7.1.el5 and 2.6.18-128.1.10.el5 (5.x) and 2.6.9-89.0.9.ELsmp (4.x).  Memory allocations on affected guests range from 512MB to 3072MB.  Notably, all affected VMs are using SMP &#8211; each has 2 vCPUs &#8211; having been created before our in-house practices followed <a href="http://blogs.vmware.com/performance/2008/06/esx-scheduler-s.html">VMware guidelines</a> and discouraged use of SMP on ESX guests when unnecessary.  One VM was created via P2V; the rest were created <em>de novo</em> on virtual hardware.</p>
<p>All crashes have happened on a single node in an ESX 3.5 HA cluster composed of four Dell PowerEdge 1950s.  ESX hosts have tracked the latest VMware patches closely.  COS memory on the ESX host in question was increased from the default to 800MB prior to the three most recent crashes; in other words, the COS memory increase appears to have had no effect on the crashes.  DRS is in use, set to &#8220;fully automated&#8221; and &#8220;apply recommendations with three or more stars&#8221; and no virtual machine rules have been created to control DRS host placement.</p>
<p>All guests are on the same NFS data store, served from a NetApp filer running ONTAP 7.2.x.  One guest had its vmswap placed separately on an iSCSI data store; the rest have their swap stored on NFS with the VM.  No log messages were seen on the filer during the event, although a log message similar to the following has been seen several times on the ESX host:</p>
<p><code>vmkernel: 43:07:27:51.725 cpu2:2185)WARNING: NFS: 4590: Can't find call with serial number -2146566055</code></p>
<p>Curiously, all crashes have happened in the evening, in the 10 o&#8217;clock hour, after nightly backups have been completed.  Backups are created using a combination of VMware and NetApp snapshots via a script similar to one detailed on <a href="http://vmwaretips.com/wp/2008/12/05/netapp-snapshots-in-esx-take-2/">vmwaretips.com</a>.  No substantial load or latency has been recorded on the NetApp during the crashes, and weeks have passed between events.</p>
<p><strong>Speculation</strong></p>
<p>Explanations I&#8217;m leaning towards, ranked by my judgment of their likelihood:</p>
<p>1) <strong>Hardware issue.</strong>  Assuming a random distribution of VMs &#8211; recall that DRS is in use and no virtual machine rules are in place &#8211; the odds of all five crashes happening on one host out of four are slim: 1 in 1024 for a particular host chosen in advance, or 1 in 256 for all five landing on the same host, whichever one it is.  Unfortunately, by all measures we&#8217;ve used, including the VI Client&#8217;s &#8220;Health Status&#8221; and Dell OMSA, there are no hardware issues with the host.</p>
<p>Further, the distribution of VMs is not truly random.  DRS migrations are infrequent in this cluster, and the largest determinant of guest location is migration following hosts being placed into maintenance mode for patching.</p>
<p>If it is a hardware issue, it&#8217;s subtle, and possibly only brought to the fore by the following issues.</p>
<p>2) <strong>Red Hat Enterprise Linux bug</strong> &#8211; which, by extension, is typically equivalent to a CentOS bug.  In fact, this issue appears to have been raised with Red Hat already in bugs <a href="https://bugzilla.redhat.com/show_bug.cgi?id=197158">197158</a> and <a href="https://bugzilla.redhat.com/show_bug.cgi?id=228108">228108</a> &#8211; but, according to the bug reports, the issue is resolved, and the patches have since been ported downstream to CentOS.  However, perhaps the issue is not truly resolved &#8211; see <a href="https://bugzilla.redhat.com/show_bug.cgi?id=228108#c35">comment 35</a> in 228108.</p>
<p>3) <strong>vSMP Bug.</strong>  The majority of our Linux VMs are uniprocessor and appear so far to be immune to this issue; it is striking that the crash has only occurred on dual processor guests.  I cannot articulate a mechanism for multiple vCPUs causing this crash, however.</p>
<p>4) <strong>NetApp issue.</strong>  This appears to be a storage issue at some level, considering the mptscsi and NFS messages noted above, so performance of the NetApp filer would be a natural place for further investigation.  However, we monitor the performance of our filer relatively closely, using the ONTAP SDK and Cacti, and nothing unusual was recorded during any crash.  It seems unusual that all VMs reside on the same data store, but that data store shares an aggregate with multiple other unaffected data stores, and several LUNs are served from the same aggregate to non-ESX machines without complaint.</p>
<p>I have not yet opened a case with VMware on this issue &#8211; or Dell, or NetApp, for that matter &#8211; but if and when I do, I&#8217;ll update here to the extent possible.</p>
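<p>The odds quoted under the hardware explanation can be checked with a little arithmetic.  With four equally likely hosts and five independent crashes, there are 4<sup>5</sup> = 1024 possible placements; exactly one of them puts every crash on a host picked in advance, and four of them put every crash on the same host, whichever it is.  The sketch below just does that multiplication; the function name is invented for the example, and the uniform, independent placement is an assumption that (as noted above) does not really hold in this cluster.</p>

```shell
#!/bin/bash
# Sketch: odds that N independent crashes all land on one host when each
# VM is equally likely to be on any of H hosts (an idealized assumption).

# same_host_odds HOSTS CRASHES -> prints both ways of counting the odds
same_host_odds() {
  local hosts="$1" crashes="$2" total=1 i=0
  # total = HOSTS ^ CRASHES, the number of possible placements
  while [ "$i" -lt "$crashes" ]; do
    total=$((total * hosts))
    i=$((i + 1))
  done
  # One placement puts every crash on a single pre-chosen host.
  echo "specific host: 1 in $total"
  # HOSTS placements put every crash on the same host, whichever it is.
  echo "any single host: 1 in $((total / hosts))"
}

same_host_odds 4 5
# prints:
#   specific host: 1 in 1024
#   any single host: 1 in 256
```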
<p><strong>Update 11/20/2009:</strong> Prompted by a helpful comment from nate below, I looked up and verified the NFS settings across the cluster.  They are the same across all hosts, and are as follows:</p>
<p><code>NFS.IndirectSend 0<br />
NFS.DiskFileLockUpdate 10<br />
NFS.LockUpdateTimeout 5<br />
NFS.LockRenewMaxFailureNumber 3<br />
NFS.LockDisable 0<br />
NFS.HeartbeatFrequency 12<br />
NFS.HeartbeatTimeout 5<br />
NFS.HeartbeatDelta 5<br />
NFS.HeartbeatMaxFailures 10<br />
NFS.MaxVolumes 8<br />
NFS.SendBufferSize 264<br />
NFS.ReceiveBufferSize 128<br />
NFS.VolumeRemountFrequency 30<br />
NFS.UDPRetransmitDelay 700</code></p>
<p>The only values that are changed from default are HeartbeatFrequency and HeartbeatMaxFailures, to match NetApp&#8217;s recommendations in <a href="http://media.netapp.com/documents/tr-3428.pdf">TR-3428</a>. </p>
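<p>For anyone wanting to make the same comparison, the settings can be read from the ESX service console with esxcfg-advcfg.  The following is a sketch rather than the exact procedure I used: the function name and the ADVCFG override variable are invented for the example, and only the four heartbeat options are listed to keep it short.</p>

```shell
#!/bin/bash
# Sketch: print selected NFS advanced settings on an ESX host so they can
# be compared across a cluster. Assumes esxcfg-advcfg is on the PATH of
# the service console; ADVCFG is a hypothetical override for dry runs.

NFS_OPTIONS="HeartbeatFrequency HeartbeatTimeout HeartbeatDelta HeartbeatMaxFailures"

# show_nfs_settings -> one line of esxcfg-advcfg output per option
show_nfs_settings() {
  local opt
  for opt in $NFS_OPTIONS; do
    "${ADVCFG:-esxcfg-advcfg}" -g "/NFS/$opt"
  done
}

# Example: run on each host, then diff the saved files.
#   show_nfs_settings > "/tmp/nfs-settings.$(hostname)"
```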
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2009/11/20/interesting-linux-vm-crash-pattern/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Keeping your RHEL VMs from crushing your storage at 4:02am</title>
		<link>http://andyleonard.com/2009/11/19/keeping-your-rhel-vms-from-crushing-your-storage-at-402am/</link>
		<comments>http://andyleonard.com/2009/11/19/keeping-your-rhel-vms-from-crushing-your-storage-at-402am/#comments</comments>
		<pubDate>Thu, 19 Nov 2009 19:39:30 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[operating systems]]></category>
		<category><![CDATA[centos]]></category>
		<category><![CDATA[locate]]></category>
		<category><![CDATA[mlocate]]></category>
		<category><![CDATA[rhel]]></category>
		<category><![CDATA[scientific linux]]></category>
		<category><![CDATA[slocate]]></category>
		<category><![CDATA[updatedb]]></category>
		<category><![CDATA[virtualization]]></category>
		<category><![CDATA[vmware]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=315</guid>
		<description><![CDATA[Running a lot of Red Hat VMs in your virtual infrastructure, on shared storage?  CentOS, Scientific Linux, both versions 4 and 5, they count for these purposes; Fedora should likely be included too.  Do you have the slocate (version 4.x and earlier) or mlocate (version 5.x) RPMs installed?  If you&#8217;re uncertain, check [...]]]></description>
			<content:encoded><![CDATA[<p>Running a lot of Red Hat VMs in your virtual infrastructure, on shared storage?  CentOS, Scientific Linux, both versions 4 and 5, they count for these purposes; Fedora should likely be included too.  Do you have the slocate (version 4.x and earlier) or mlocate (version 5.x) RPMs installed?  If you&#8217;re uncertain, check using the following:</p>
<p><code>> rpm -q slocate<br />
slocate-2.7-13.el4.8.i386</code></p>
<p>or</p>
<p><code>> rpm -q mlocate<br />
mlocate-0.15-1.el5.2.x86_64</code></p>
<p>If so, multiple RHEL VMs plus mlocate or slocate may be adding up to an array-crushing 4:02am shared storage load and latency spike for you.  Before being addressed, this spike was bad enough at my place of employment (when combined with a NetApp Sunday-morning disk scrub) to cause a Windows VM to crash with I/O errors.  Ouch.<br />
<span id="more-315"></span><br />
<strong>Details and ideas for resolution:</strong></p>
<p>By default, a line in /etc/crontab runs the scripts within /etc/cron.daily at 4:02am each morning:</p>
<p><code>02 4 * * * root run-parts /etc/cron.daily</code></p>
<p>One of those scripts &#8211; mlocate.cron or slocate.cron, depending on your OS version &#8211; launches updatedb; as the man page says, &#8220;updatedb  creates  or  updates  a  database  used by locate(1).&#8221;  (The &#8220;locate&#8221; binary is a filesystem search tool, see &#8220;man locate&#8221; for more information.)  Updatedb refreshes its database by walking the filesystem, generating a fair amount of I/O on a single system.  Imagine upwards of thirty of these running in parallel through VMDKs on one shared storage system carrying out internal maintenance at the same time, and you&#8217;re pretty much picturing the problem my employer had.</p>
<p>I see <strong>three options</strong> for addressing this issue:</p>
<p><strong>1) Uninstall mlocate or slocate.</strong>  If you don&#8217;t currently use &#8220;locate&#8221; and you&#8217;re not interested in learning to use a tool that will likely make you more effective at your job (again, see &#8220;man locate&#8221;), this is probably the best option.  (Yeah, I know, people that fit this bill generally don&#8217;t read blogs more technical than <a href="http://perezhilton.com/">this one</a>, so I could probably have skipped it here.  Consider it an option for completeness, or if you really need to strip down an install.)</p>
<p><strong>2) Disable the scheduled job by removing mlocate.cron or slocate.cron from /etc/cron.daily.</strong>  This keeps locate available for your use, but requires that you update locate&#8217;s database ad-hoc and interactively by running the following as root:</p>
<p><code># updatedb</code></p>
<p>This will take a few minutes to return, depending on the size of your file systems.</p>
<p>I don&#8217;t recommend this option either; at least it doesn&#8217;t fit the way I work.  I often find myself using locate in high-pressure situations in which I need to quickly get a file location on a system.  Waiting minutes for updatedb to return is extra painful when every second counts.</p>
<p><strong>3) Stagger when updatedb runs by inserting a random delay into the script.</strong>  This is my preferred alternative; locate&#8217;s database is kept current automatically, and your storage doesn&#8217;t have to bear a sudden spike in load.  I implemented this by adding the lines in <strong>bold</strong> (lines 2-7 if your browser doesn&#8217;t display the bold text clearly): </p>
<p><code>#!/bin/sh<br />
<strong># sleep up to two hours before launching job:<br />
value=$RANDOM<br />
while [ $value -gt 7200 ] ; do<br />
  value=$RANDOM<br />
done<br />
sleep $value</strong><br />
nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }')<br />
renice +19 -p $$ >/dev/null 2>&#038;1<br />
/usr/bin/updatedb -f "$nodevs"<br />
</code></p>
<p>The added code inserts a pseudo-random sleep delay of up to two hours before updatedb runs, with the key being the built-in Bash variable <a href="http://tldp.org/LDP/abs/html/randomvar.html">$RANDOM</a>.  In our environment, this removed a 2000 IOPS spike at 4:02am, and eliminated a corresponding jump in filer latency.  Obviously, adjust the delay period as appropriate for your environment.  Additionally, be sure to add this change to your configuration management or installation management tools so that all of your RHEL and RHEL-derived VMs get the updated script.</p>
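<p>A shorter variant of the same idea uses modulo arithmetic instead of a retry loop.  This is an alternative sketch, not what I deployed; the function name is made up for the example, and it relies on $RANDOM being provided by Bash.</p>

```shell
#!/bin/bash
# Sketch: pick the delay with modulo instead of a retry loop.
# $RANDOM is a Bash variable yielding an integer from 0 to 32767.

# random_delay MAX -> prints a pseudo-random integer in [0, MAX)
random_delay() {
  echo $((RANDOM % $1))
}

# Usage inside the cron script, before updatedb runs:
#   sleep "$(random_delay 7200)"
```

<p>The trade-off: because 32768 is not a multiple of 7200, the modulo version slightly favors shorter delays, while the retry loop in the script above spreads delays evenly.  For smoothing out a storage load spike, either should be adequate.</p>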
<p>Using $RANDOM to avoid this variant of the <a href="http://en.wikipedia.org/wiki/Thundering_herd_problem">thundering herd problem</a> also works nicely for a range of similar problems; I believe I first saw it at <a href="http://www.moundalexis.com/archives/000076.php">Moundalexis.com</a>.</p>
<p>(This problem may apply to other Linux distributions being run as VMs, and FreeBSD does something equivalent &#8211; weekly &#8211; with /etc/periodic/weekly/310.locate.  A similar solution can be applied to these environments, if necessary.)</p>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2009/11/19/keeping-your-rhel-vms-from-crushing-your-storage-at-402am/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Running NetApp&#8217;s aggrSpaceCheck without turning on RSH</title>
		<link>http://andyleonard.com/2009/06/24/running-netapps-aggrspacecheck-without-turning-on-rsh/</link>
		<comments>http://andyleonard.com/2009/06/24/running-netapps-aggrspacecheck-without-turning-on-rsh/#comments</comments>
		<pubDate>Wed, 24 Jun 2009 22:08:43 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[storage]]></category>
		<category><![CDATA[aggrSpaceCheck]]></category>
		<category><![CDATA[netapp]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=299</guid>
		<description><![CDATA[When upgrading a NetApp filer from a pre-7.3 release to 7.3, metadata is apparently moved from within the FlexVol into the containing aggregate.  If your aggregate is tight on space &#8211; more than 96% full &#8211; NetApp requires that you complete extra verification steps to ensure that you can complete the upgrade.  From [...]]]></description>
			<content:encoded><![CDATA[<p>When upgrading a NetApp filer from a pre-7.3 release to 7.3, metadata is apparently moved from within the FlexVol into the containing aggregate.  If your aggregate is tight on space &#8211; more than 96% full &#8211; NetApp requires that you complete extra verification steps to ensure that you can complete the upgrade.  From the <a href="http://now.netapp.com/NOW/knowledge/docs/ontap/rel7311/pdfs/ontap/rnote.pdf">Data ONTAP® 7.3.1.1 Release Notes</a> (NOW login required):</p>
<blockquote><p>If you suspect that your system has almost used all of its free space, or if you use thin provisioning, you should check the amount of space in use by each aggregate. If any aggregate is 97 percent full or more, do not proceed with the upgrade until you have used the Upgrade Advisor or aggrSpaceCheck tools to determine your system capacity and plan your upgrade.</p></blockquote>
<p>Upgrade Advisor is a great tool, and I heartily recommend you use it for your upgrade.  However, it doesn&#8217;t give you a lot of visibility into what&#8217;s being checked for here.  Lucky for us, NetApp offers an alternative tool: <a href="http://now.netapp.com/NOW/download/tools/aggrSpaceCheck/download.shtml">aggrSpaceCheck</a> (NOW login required).<br />
<span id="more-299"></span><br />
AggrSpaceCheck, as written, relies on you having rsh turned on for access to the filer &#8211; something that you probably locked down when you were still wearing acid-washed jeans.  Assuming you don&#8217;t have rsh access, you&#8217;ll see an error like this when you attempt to run aggrSpaceCheck:</p>
<pre class="brush: bash; light: true;">&gt; perl aggrSpaceCheck.pl -filer toaster01

aggrSpaceCheck V1.0.0 Copyright (c) 2008 NetApp 

Could not retrieve Aggregate free space. Could not get any aggregates in this filer
</pre>
<p>The fix is easy, however, if you&#8217;re using SSH: Edit the aggrSpaceCheck.pl file, replacing &#8220;rsh&#8221; with &#8220;ssh&#8221; (you only need to actually edit it in the line where &#8220;$remCmd&#8221; is defined, but changing rsh to ssh elsewhere won&#8217;t hurt).  You will be prompted for root&#8217;s SSH password repeatedly &#8211; once for each command run remotely on the filer:</p>
<pre class="brush: bash; light: true;">&gt; perl aggrSpaceCheck.pl -filer toaster01

aggrSpaceCheck V1.0.0 Copyright (c) 2008 NetApp
root@toaster01's password:
root@toaster01's password:
root@toaster01's password:
root@toaster01's password:
root@toaster01's password:
root@toaster01's password:
root@toaster01's password:
root@toaster01's password:
root@toaster01's password:
root@toaster01's password: 

Aggregate aggr0 requires 54.18GB for volume metadata; 137.76GB is available.
Aggregate aggr0 has enough free space for you to upgrade to
Data ONTAP 7.3 or later.
</pre>
<p>There&#8217;s a solution for this too, of course: Enable SSH key pair authentication from your management host to your filer &#8211; no more password prompts:</p>
<pre class="brush: bash; light: true;">&gt; perl aggrSpaceCheck.pl -filer toaster01

aggrSpaceCheck V1.0.0 Copyright (c) 2008 NetApp 

Aggregate aggr0 requires 54.18GB for volume metadata; 137.76GB is available.
Aggregate aggr0 has enough free space for you to upgrade to
Data ONTAP 7.3 or later.
</pre>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2009/06/24/running-netapps-aggrspacecheck-without-turning-on-rsh/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>NetApp FAS2020 aggregate capacity on ONTAP 7.3.1 &#8211; now 16TB</title>
		<link>http://andyleonard.com/2009/06/23/netapp-fas2020-aggregate-capacity-on-ontap-7-3-1/</link>
		<comments>http://andyleonard.com/2009/06/23/netapp-fas2020-aggregate-capacity-on-ontap-7-3-1/#comments</comments>
		<pubDate>Tue, 23 Jun 2009 17:41:45 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[storage]]></category>
		<category><![CDATA[fas2020]]></category>
		<category><![CDATA[netapp]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=289</guid>
		<description><![CDATA[My NetApp FAS 2020 Sizing post remains popular nearly a year after I wrote it.  However, with ONTAP 7.3.1 (and later releases) out, it&#8217;s also out of date.  Here&#8217;s current information from p. 33 of the ONTAP 7.3.1.1 release notes (NOW login required):
Beginning with Data ONTAP 7.3.1, FAS2020 systems support aggregates up to [...]]]></description>
			<content:encoded><![CDATA[<p>My <a href="http://andyleonard.com/2008/08/04/netapp-fas-2020-sizing/">NetApp FAS 2020 Sizing</a> post remains popular nearly a year after I wrote it.  However, with ONTAP 7.3.1 (and later releases) out, it&#8217;s also out of date.  Here&#8217;s current information from p. 33 of the <a href="http://now.netapp.com/NOW/knowledge/docs/ontap/rel7311/pdfs/ontap/rnote.pdf">ONTAP 7.3.1.1 release notes</a> (NOW login required):</p>
<blockquote><p>Beginning with Data ONTAP 7.3.1, FAS2020 systems support aggregates up to 16 TB raw capacity,<br />
provided that the root volume is hosted in a dedicated aggregate (that is, one that contains only the root<br />
volume and no user data).</p></blockquote>
<p>The release notes go on to point out an alternative to the dedicated root aggregate &#8211; having two spare disks per controller.</p>
<p>It&#8217;s nice to see the FAS2020 finally getting a maximum aggregate size on par with the rest of NetApp&#8217;s product line.  However, in an era where 2TB drives are available from Western Digital &#8211; and presumably other manufacturers before too long &#8211; ONTAP&#8217;s 16TB aggregate limit grows increasingly anachronistic.</p>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2009/06/23/netapp-fas2020-aggregate-capacity-on-ontap-7-3-1/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
