<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>thinking sysadmin &#187; vmware</title>
	<atom:link href="http://andyleonard.com/tag/vmware/feed/" rel="self" type="application/rss+xml" />
	<link>http://andyleonard.com</link>
	<description>qstat -u aleonard -s z</description>
	<lastBuildDate>Tue, 28 Feb 2012 04:47:09 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>VMware Tools Upgrade on CentOS Enables Host Time Sync (plus fix)</title>
		<link>http://andyleonard.com/2010/11/12/vmware-tools-upgrade-on-centos-enables-host-time-sync-plus-fix/</link>
		<comments>http://andyleonard.com/2010/11/12/vmware-tools-upgrade-on-centos-enables-host-time-sync-plus-fix/#comments</comments>
		<pubDate>Fri, 12 Nov 2010 18:53:22 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[virtualization]]></category>
		<category><![CDATA[clock]]></category>
		<category><![CDATA[esxi]]></category>
		<category><![CDATA[ntp]]></category>
		<category><![CDATA[time]]></category>
		<category><![CDATA[time sync]]></category>
		<category><![CDATA[timekeeping]]></category>
		<category><![CDATA[vmware]]></category>
		<category><![CDATA[vmware tools]]></category>
		<category><![CDATA[vsphere]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=540</guid>
		<description><![CDATA[After bringing some CentOS guests from an ESX 3.5 environment to an ESXi 4.1 environment and performing a VMware Tools upgrade, I noticed log messages on the VMs similar to the following: Along with console messages about the cmos clock such as: Inspecting the affected VMs, the clock appeared to be losing almost a second [...]]]></description>
			<content:encoded><![CDATA[<p>After bringing some CentOS guests from an ESX 3.5 environment to an ESXi 4.1 environment and performing a VMware Tools upgrade, I noticed log messages on the VMs similar to the following:</p>
<pre class="brush: plain; light: true; title: ; notranslate">
Nov 12 09:07:18 node01 ntpd[2574]: time reset +175.995101 s
</pre>
<p>Along with console messages about the cmos clock such as:</p>
<pre class="brush: plain; light: true; title: ; notranslate">
time.c can't update cmos clock from 0 to 59
</pre>
<p>Inspecting the affected VMs, the clock appeared to be losing almost a second each second, despite ntpd being up and running and kernel options set appropriately.  Further investigation revealed that &#8220;Synchronize guest time with host&#8221; had been silently enabled for the guest during the Tools upgrade, contrary to VMware&#8217;s <a href="http://kb.vmware.com/kb/1006427">Timekeeping best practices</a>.</p>
<p>To be fair, I don&#8217;t know how widespread this problem is &#8211; it could be particular to CentOS, ESX 3.5 to 4.1 migrations, the fact that the virtual hardware hasn&#8217;t yet been upgraded from version 4 to version 7, or even my method of upgrading the tools.  However, once you know to look for this issue, the resolution is simple: Disable host time sync.  You can do this manually, or, if you use Puppet to manage your Linux VMs, the following manifest snippet will automate this for you (assuming you have a &#8220;vmware-tools&#8221; Service):</p>
<pre class="brush: plain; title: ; notranslate">
exec { &quot;Disable host time sync&quot;:
  onlyif =&gt; &quot;/usr/bin/test `/usr/bin/vmware-toolbox-cmd timesync status` = 'Enabled'&quot;,
  command =&gt; &quot;/usr/bin/vmware-toolbox-cmd timesync disable&quot;,
  require =&gt; Service[&quot;vmware-tools&quot;],
}
</pre>
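<p>For the manual route, the same check can be sketched as a small shell function.  This is only a sketch: it reuses the two <code>vmware-toolbox-cmd timesync</code> subcommands from the manifest above, while the <code>TOOLBOX_CMD</code> variable and the function name are my own assumptions, added so the path can be overridden.</p>

```shell
#!/bin/sh
# Path to the VMware Tools CLI (same path as in the Puppet manifest);
# TOOLBOX_CMD is a hypothetical override, not something VMware Tools defines.
TOOLBOX_CMD="${TOOLBOX_CMD:-/usr/bin/vmware-toolbox-cmd}"

# Disable host time sync only if the tools report it as enabled.
disable_host_timesync() {
  # The status command may exit non-zero; only its printed text is compared here.
  status=$("$TOOLBOX_CMD" timesync status) || true
  if [ "$status" = "Enabled" ]; then
    "$TOOLBOX_CMD" timesync disable
  else
    echo "host time sync is not enabled; nothing to do"
  fi
}
```

<p>Run <code>disable_host_timesync</code> as root on each affected guest, then confirm that ntpd stops logging large time resets.</p>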
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2010/11/12/vmware-tools-upgrade-on-centos-enables-host-time-sync-plus-fix/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Put Down the Saw and Get the Glue: Working Around VMware KB1022751</title>
		<link>http://andyleonard.com/2010/09/23/put-down-the-saw-and-get-the-glue-working-around-vmware-kb1022751/</link>
		<comments>http://andyleonard.com/2010/09/23/put-down-the-saw-and-get-the-glue-working-around-vmware-kb1022751/#comments</comments>
		<pubDate>Thu, 23 Sep 2010 22:18:53 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[virtualization]]></category>
		<category><![CDATA[cisco]]></category>
		<category><![CDATA[esxi]]></category>
		<category><![CDATA[etherchannel]]></category>
		<category><![CDATA[link aggregation]]></category>
		<category><![CDATA[vmware]]></category>
		<category><![CDATA[vsphere]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=519</guid>
		<description><![CDATA[VMware KB article 1022751 lays out the details of an interesting bug in ESXi 4.0 and 4.1 pretty plainly: When trying to team NICs using EtherChannel, the network connectivity is disrupted on an ESXi host. This issue occurs because NIC teaming properties do not propagate to the Management Network portgroup in ESXi. When you configure [...]]]></description>
			<content:encoded><![CDATA[<p>VMware KB article <a href="http://kb.vmware.com/kb/1022751">1022751</a> lays out the details of an interesting bug in ESXi 4.0 and 4.1 pretty plainly:</p>
<blockquote><p>When trying to team NICs using EtherChannel, the network connectivity is disrupted on an ESXi host. This issue occurs because NIC teaming properties do not propagate to the Management Network portgroup in ESXi.  When you configure the ESXi host for NIC teaming by setting the Load Balancing to Route based on ip hash, this configuration is not propagated to Management Network portgroup.</p></blockquote>
<p>(Note that load balancing by IP hash is the <a href="http://kb.vmware.com/kb/1004048">only supported option</a> for EtherChannel link aggregation.)</p>
<p>Unfortunately, the KB article&#8217;s workaround &#8211; there is no patch that I&#8217;m aware of &#8211; requires network connectivity to the host via the vSphere Client.  But what do you do if you&#8217;ve just sawed off the branch you&#8217;re sitting on network-wise, and can no longer connect with the vSphere client?<br />
<span id="more-519"></span><br />
Enable Local Tech Support Mode on the ESXi host, log in as root and run the following command:</p>
<pre class="brush: plain; title: ; notranslate">
vim-cmd hostsvc/net/portgroup_set --nicteaming-policy=loadbalance_ip vSwitch0 &quot;Management Network&quot;
</pre>
<p>(Replace &#8220;vSwitch0&#8221; and &#8220;Management Network&#8221; with the appropriate vSwitch and portgroup as necessary.)</p>
<p>You may also find that while both NICs are active for the vSwitch, one will be in &#8220;standby&#8221; for the portgroup &#8211; a configuration not supported for IP hash load balancing.  It would be reasonable to think that you could fix this with the following, but you can&#8217;t (see error on lines 2-8):</p>
<pre class="brush: plain; title: ; notranslate">
~ # vim-cmd hostsvc/net/portgroup_set --nicorderpolicy-active=vmnic0,vmnic1 vSwitch0 &quot;Management Network&quot;
(vmodl.fault.InvalidArgument) {
   dynamicType = &lt;unset&gt;,
   faultCause = (vmodl.MethodFault) null,
   invalidProperty = &lt;unset&gt;,
   msg = &quot;A specified parameter was not correct.
&quot;,
}
</pre>
<p>VMware appears to have known about this bug for a while now &#8211; try searching the VMware Communities for some workarounds dating back to the 3.x days, including some from VMware employees &#8211; so resolving it is presumably either extremely difficult or not currently a high priority.  However, you will likely be able to reach the ESXi host using the vSphere Client after fixing the portgroup NIC teaming policy, so you can fix the standby NIC assignment in the GUI.</p>
<p>If you find yourself attempting to automate an ESXi install with Kickstart and don&#8217;t want to make fixing the portgroup through the vSphere Client part of your install process, consider not using EtherChannel at all for the Management Network &#8211; just use active and standby NICs, perhaps in a configuration similar to Kendrick Coleman&#8217;s <a href="http://kendrickcoleman.com/index.php?/Tech-Blog/esxi-41-kickstart-install-wip.html">ESXi 4.1 Kickstart Install</a> blog post.</p>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2010/09/23/put-down-the-saw-and-get-the-glue-working-around-vmware-kb1022751/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Interesting Linux VM Crash Pattern</title>
		<link>http://andyleonard.com/2009/11/20/interesting-linux-vm-crash-pattern/</link>
		<comments>http://andyleonard.com/2009/11/20/interesting-linux-vm-crash-pattern/#comments</comments>
		<pubDate>Fri, 20 Nov 2009 19:09:16 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[virtualization]]></category>
		<category><![CDATA[centos]]></category>
		<category><![CDATA[crash]]></category>
		<category><![CDATA[dell]]></category>
		<category><![CDATA[iscsi]]></category>
		<category><![CDATA[kernel panic]]></category>
		<category><![CDATA[mptscsi]]></category>
		<category><![CDATA[netapp]]></category>
		<category><![CDATA[nfs]]></category>
		<category><![CDATA[rhel]]></category>
		<category><![CDATA[vmware]]></category>
		<category><![CDATA[vmware esx]]></category>
		<category><![CDATA[vsmp]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=343</guid>
		<description><![CDATA[I&#8217;ve just begun to pull together some interesting data on a series of Linux VM crashes I&#8217;ve seen. I don&#8217;t have a resolution yet, but some interesting patterns have emerged. Crash Symptoms A CentOS 4.x or 5.x guest will crash with a message similar to the following on its console: CentOS 4.x: [&#60;f883b299&#62;] .text.lock.scsi_error+0x19/0x34 [scsi_mod] [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve just begun to pull together some interesting data on a series of Linux VM crashes I&#8217;ve seen.  I don&#8217;t have a resolution yet, but some interesting patterns have emerged.</p>
<p><strong>Crash Symptoms</strong></p>
<p>A CentOS 4.x or 5.x guest will crash with a message similar to the following on its console:</p>
<p>CentOS 4.x:</p>
<p><code>[&lt;f883b299&gt;] .text.lock.scsi_error+0x19/0x34 [scsi_mod]<br />
[&lt;f88c19ce&gt;] mptscsih_io_done+0x5ee/0x608 [mptscsi] (…)<br />
[&lt;c02de564&gt;] common_interrupt+0x18/0x20<br />
[&lt;c02ddb54&gt;] system_call+0x0/0x30</code></p>
<p>CentOS 5.x:</p>
<p><code>RIP  [&lt;ffffffff8014c562&gt;] list_del+0x48/0x71 RSP &lt;ffffffff80425d00&gt; &lt;0&gt;Kernel Panic - not syncing: Fatal exception</code></p>
<p>A hard reset (i.e. pressing the reset button on the VM&#8217;s console) is required to reboot the guest.<br />
<span id="more-343"></span><br />
<strong>Further Details</strong></p>
<p>Five different VMs have encountered this issue, running at a mix of close-to-current CentOS 4.x and 5.x patch levels.  Guest kernel versions when the crash occurred were 2.6.18-128.7.1.el5 and 2.6.18-128.1.10.el5 (5.x) and 2.6.9-89.0.9.ELsmp (4.x).  Memory allocations on affected guests range from 512MB to 3072MB.  Notably, all affected VMs are using SMP &#8211; each has 2 vCPUs &#8211; having been created before our in-house practices followed <a href="http://blogs.vmware.com/performance/2008/06/esx-scheduler-s.html">VMware guidelines</a> and discouraged use of SMP on ESX guests when unnecessary.  One VM was created via P2V; the rest were created <em>de novo</em> on virtual hardware.</p>
<p>All crashes have happened on a single node in an ESX 3.5 HA cluster composed of four Dell PowerEdge 1950s.  ESX hosts have tracked the latest VMware patches closely.  COS memory on the ESX host in question was increased from the default to 800MB prior to the three most recent crashes; in other words, the COS memory increase appears to have had no effect on the crashes.  DRS is in use, set to &#8220;fully automated&#8221; and &#8220;apply recommendations with three or more stars&#8221; and no virtual machine rules have been created to control DRS host placement.</p>
<p>All guests are on the same NFS data store, served from a NetApp filer running ONTAP 7.2.x.  One guest had its vmswap placed separately on an iSCSI data store; the rest have their swap stored on NFS with the VM.  No log messages were seen on the filer during the event, although a log message similar to the following has been seen several times on the ESX host:</p>
<p><code>vmkernel: 43:07:27:51.725 cpu2:2185)WARNING: NFS: 4590: Can't find call with serial number -2146566055</code></p>
<p>Curiously, all crashes have happened in the evening, in the 10 o&#8217;clock hour, after nightly backups have been completed.  Backups are created using a combination of VMware and NetApp snapshots via a script similar to one detailed on <a href="http://vmwaretips.com/wp/2008/12/05/netapp-snapshots-in-esx-take-2/">vmwaretips.com</a>.  No substantial load or latency has been recorded on the NetApp during the crashes, and weeks have passed between events.</p>
<p><strong>Speculation</strong></p>
<p>Explanations I&#8217;m leaning towards, ranked by my judgment of their likelihood:</p>
<p>1) <strong>Hardware issue.</strong>  Assuming a random distribution of VMs &#8211; recall that DRS is in use and no virtual machine rules are in place &#8211; the odds of all five crashes happening on this particular host out of four are slim: 1 in 1024 (or 1 in 256 for all five landing on the same host, whichever one it is).  Unfortunately, by all measures we&#8217;ve used, including the VI Client&#8217;s &#8220;Health Status&#8221; and Dell OMSA, there are no hardware issues with the host.</p>
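<p>The arithmetic behind that estimate can be checked with a few lines of shell.  This is a sketch under the same assumption as above, namely that each crash lands on one of the four hosts independently and with equal probability; the function name is made up for illustration.</p>

```shell
#!/bin/sh
# Odds that every crash lands on the same host, assuming each crash is
# placed independently and uniformly across the hosts.
same_host_odds() {
  hosts=$1
  crashes=$2
  total=1
  i=0
  # total = hosts^crashes, the number of equally likely placements
  while [ "$i" -lt "$crashes" ]; do
    total=$(( total * hosts ))
    i=$(( i + 1 ))
  done
  # Exactly one placement puts every crash on one named host...
  echo "one particular host: 1 in $total"
  # ...and there is one such placement per host.
  echo "any single host: 1 in $(( total / hosts ))"
}

same_host_odds 4 5
# one particular host: 1 in 1024
# any single host: 1 in 256
```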
<p>Further, the distribution of VMs is not truly random.  DRS migrations are infrequent in this cluster, and the largest determinant of guest location is migration following hosts being placed into maintenance mode for patching.</p>
<p>If it is a hardware issue, it&#8217;s subtle, and possibly only brought to the fore by the following issues.</p>
<p>2) <strong>Red Hat Enterprise Linux bug</strong> &#8211; which, by extension, is typically equivalent to a CentOS bug.  In fact, this issue appears to have been raised with Red Hat already in bugs <a href="https://bugzilla.redhat.com/show_bug.cgi?id=197158">197158</a> and <a href="https://bugzilla.redhat.com/show_bug.cgi?id=228108">228108</a> &#8211; but, according to the bug reports, the issue is resolved, and the patches have since been ported downstream to CentOS.  However, perhaps the issue is not truly resolved &#8211; see <a href="https://bugzilla.redhat.com/show_bug.cgi?id=228108#c35">comment 35</a> in 228108.</p>
<p>3) <strong>vSMP Bug.</strong>  The majority of our Linux VMs are uniprocessor and appear so far to be immune to this issue; it is striking that the crash has only occurred on dual processor guests.  I cannot articulate a mechanism for multiple vCPUs causing this crash, however.</p>
<p>4) <strong>NetApp issue.</strong>  This appears to be a storage issue at some level, considering the mptscsi and NFS messages noted above, so performance of the NetApp filer would be a natural place for further investigation.  However, we monitor the performance of our filer relatively closely, using the ONTAP SDK and Cacti, and nothing unusual was recorded during any crash.  It seems unusual that all VMs reside on the same data store, but that data store shares an aggregate with multiple other unaffected data stores, and several LUNs are served from the same aggregate to non-ESX machines without complaint.</p>
<p>I have not yet opened a case with VMware on this issue &#8211; or Dell, or NetApp, for that matter &#8211; but if and when I do, I&#8217;ll update here to the extent possible.</p>
<p><strong>Update 11/20/2009:</strong> Prompted by a helpful comment from nate below, I looked up and verified the NFS settings across the cluster.  They are the same across all hosts, and are as follows:</p>
<p><code>NFS.IndirectSend 0<br />
NFS.DiskFileLockUpdate 10<br />
NFS.LockUpdateTimeout 5<br />
NFS.LockRenewMaxFailureNumber 3<br />
NFS.LockDisable 0<br />
NFS.HeartbeatFrequency 12<br />
NFS.HeartbeatTimeout 5<br />
NFS.HeartbeatDelta 5<br />
NFS.HeartbeatMaxFailures 10<br />
NFS.MaxVolumes 8<br />
NFS.SendBufferSize 264<br />
NFS.ReceiveBufferSize 128<br />
NFS.VolumeRemountFrequency 30<br />
NFS.UDPRetransmitDelay 700</code></p>
<p>The only values that are changed from default are HeartbeatFrequency and HeartbeatMaxFailures, to match NetApp&#8217;s recommendations in <a href="http://media.netapp.com/documents/tr-3428.pdf">TR-3428</a>. </p>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2009/11/20/interesting-linux-vm-crash-pattern/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Keeping your RHEL VMs from crushing your storage at 4:02am</title>
		<link>http://andyleonard.com/2009/11/19/keeping-your-rhel-vms-from-crushing-your-storage-at-402am/</link>
		<comments>http://andyleonard.com/2009/11/19/keeping-your-rhel-vms-from-crushing-your-storage-at-402am/#comments</comments>
		<pubDate>Thu, 19 Nov 2009 19:39:30 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[operating systems]]></category>
		<category><![CDATA[centos]]></category>
		<category><![CDATA[locate]]></category>
		<category><![CDATA[mlocate]]></category>
		<category><![CDATA[rhel]]></category>
		<category><![CDATA[scientific linux]]></category>
		<category><![CDATA[slocate]]></category>
		<category><![CDATA[updatedb]]></category>
		<category><![CDATA[virtualization]]></category>
		<category><![CDATA[vmware]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=315</guid>
		<description><![CDATA[Running a lot of Red Hat VMs in your virtual infrastructure, on shared storage? CentOS, Scientific Linux, both versions 4 and 5, they count for these purposes; Fedora should likely be included too. Do you have the slocate (version 4.x and earlier) or mlocate (version 5.x) RPMs installed? If you&#8217;re uncertain, check using the following: [...]]]></description>
			<content:encoded><![CDATA[<p>Running a lot of Red Hat VMs in your virtual infrastructure, on shared storage?  CentOS and Scientific Linux, in both versions 4 and 5, count for these purposes; Fedora should likely be included too.  Do you have the slocate (version 4.x and earlier) or mlocate (version 5.x) RPMs installed?  If you&#8217;re uncertain, check using the following:</p>
<p><code>> rpm -q slocate<br />
slocate-2.7-13.el4.8.i386</code></p>
<p>or</p>
<p><code>> rpm -q mlocate<br />
mlocate-0.15-1.el5.2.x86_64</code></p>
<p>If so, multiple RHEL VMs plus mlocate or slocate may be adding up to an array-crushing 4:02am shared storage load and latency spike for you.  Before being addressed, this spike was bad enough at my place of employment (when combined with a NetApp Sunday-morning disk scrub) to cause a Windows VM to crash with I/O errors.  Ouch.<br />
<span id="more-315"></span><br />
<strong>Details and ideas for resolution:</strong></p>
<p>By default, a line in /etc/crontab runs the scripts within /etc/cron.daily at 4:02am each morning:</p>
<p><code>02 4 * * * root run-parts /etc/cron.daily</code></p>
<p>One of those scripts &#8211; mlocate.cron or slocate.cron, depending on your OS version &#8211; launches updatedb; as the man page says, &#8220;updatedb  creates  or  updates  a  database  used by locate(1).&#8221;  (The &#8220;locate&#8221; binary is a filesystem search tool; see &#8220;man locate&#8221; for more information.)  Updatedb refreshes its database by walking the filesystem, generating a fair amount of I/O on a single system.  Imagine upwards of thirty of these running in parallel through VMDKs on one shared storage system carrying out internal maintenance at the same time, and you&#8217;re pretty much picturing the problem my employer had.</p>
<p>I see <strong>three options</strong> for addressing this issue:</p>
<p><strong>1) Uninstall mlocate or slocate.</strong>  If you don&#8217;t currently use &#8220;locate&#8221; and you&#8217;re not interested in learning to use a tool that will likely make you more effective at your job (again, see &#8220;man locate&#8221;), this is probably the best option.  (Yeah, I know, people that fit this bill generally don&#8217;t read blogs more technical than <a href="http://perezhilton.com/">this one</a>, so I could probably have skipped it here.  Consider it an option for completeness, or if you really need to strip down an install.)</p>
<p><strong>2) Disable the scheduled job by removing mlocate.cron or slocate.cron from /etc/cron.daily.</strong>  This keeps locate available for your use, but requires that you update locate&#8217;s database ad-hoc and interactively by running the following as root:</p>
<p><code># updatedb</code></p>
<p>This will take a few minutes to return, depending on the size of your file systems.</p>
<p>I don&#8217;t recommend this option either; at least it doesn&#8217;t fit the way I work.  I often find myself using locate in high-pressure situations in which I need to quickly get a file location on a system.  Waiting minutes for updatedb to return is extra painful when every second counts.</p>
<p><strong>3) Stagger when updatedb runs by inserting a random delay into the script.</strong>  This is my preferred alternative; locate&#8217;s database is kept current automatically, and your storage doesn&#8217;t have to bear a sudden spike in load.  I implemented this by adding the lines in <strong>bold</strong> (lines 2-7 if your browser doesn&#8217;t display the bold text clearly): </p>
<p><code>#!/bin/sh<br />
<strong># sleep up to two hours before launching job:<br />
value=$RANDOM<br />
while [ $value -gt 7200 ] ; do<br />
  value=$RANDOM<br />
done<br />
sleep $value</strong><br />
nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }')<br />
renice +19 -p $$ >/dev/null 2>&#038;1<br />
/usr/bin/updatedb -f "$nodevs"<br />
</code></p>
<p>The added code inserts a pseudo-random sleep delay of up to two hours before updatedb runs, with the key being the built-in Bash variable <a href="http://tldp.org/LDP/abs/html/randomvar.html">$RANDOM</a>.  (On RHEL and its derivatives /bin/sh is Bash, so $RANDOM is available despite the #!/bin/sh line.)  In our environment, this removed a 2000 IOPS spike at 4:02am, and eliminated a corresponding jump in filer latency.  Obviously, adjust the delay period as appropriate for your environment.  Additionally, be sure to add this change to your configuration management or installation management tools so that all of your RHEL and RHEL-derived VMs get the updated script.</p>
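<p>If you prefer something shorter than the retry loop, a modulo on $RANDOM gives much the same result.  The following is a sketch, not what we deployed: it assumes Bash (for $RANDOM), and the <code>random_delay</code> helper name is made up for illustration.</p>

```shell
#!/bin/bash
# Hypothetical helper: print a pseudo-random delay in the range [0, max) seconds.
# $RANDOM yields 0..32767, so with max=7200 the delays below 3968 seconds are
# very slightly more likely than the rest; that bias is harmless for staggering.
random_delay() {
  max=${1:-7200}
  echo $(( RANDOM % max ))
}

# In the cron script, the bold lines would then shrink to:
#   sleep "$(random_delay 7200)"
```

<p>The loop in the original script avoids the small modulo bias by discarding values above 7200; either approach spreads the updatedb runs across the two-hour window.</p>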
<p>Using $RANDOM to avoid this variant of the <a href="http://en.wikipedia.org/wiki/Thundering_herd_problem">thundering herd problem</a> also works nicely for a range of similar problems; I believe I first saw it at <a href="http://www.moundalexis.com/archives/000076.php">Moundalexis.com</a>.</p>
<p>(This problem may apply to other Linux distributions being run as VMs, and FreeBSD does something equivalent &#8211; weekly &#8211; with /etc/periodic/weekly/310.locate.  A similar solution can be applied to these environments, if necessary.)</p>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2009/11/19/keeping-your-rhel-vms-from-crushing-your-storage-at-402am/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>VMware/NFS/NetApp SnapRestore/Linux LVM Single File Recovery Notes</title>
		<link>http://andyleonard.com/2009/06/01/vmwarenfsnetapp-snaprestorelinux-lvm-single-file-recovery-notes/</link>
		<comments>http://andyleonard.com/2009/06/01/vmwarenfsnetapp-snaprestorelinux-lvm-single-file-recovery-notes/#comments</comments>
		<pubDate>Mon, 01 Jun 2009 21:55:54 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[virtualization]]></category>
		<category><![CDATA[backup]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[netapp]]></category>
		<category><![CDATA[nfs]]></category>
		<category><![CDATA[snaprestore]]></category>
		<category><![CDATA[vmware]]></category>
		<category><![CDATA[vmware esx]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=241</guid>
		<description><![CDATA[There have been a few posts elsewhere discussing file-level recovery for Linux VMs on NetApp NFS datastores, but none that have dealt specifically with Linux LVM-encapsulated partitions. Here&#8217;s our in-house procedure for recovery; note that we do not have FlexClone licensed on our filers. Prerequisites An existing VMware ESX infrastructure, connected to a NetApp filer [...]]]></description>
			<content:encoded><![CDATA[<p>There have been a few posts <a href="http://storagefoo.blogspot.com/2007/10/vmware-over-nfs-backup-trickscontinued.html">elsewhere</a> discussing file-level recovery for Linux VMs on NetApp NFS datastores, but none that have dealt specifically with Linux LVM-encapsulated partitions.</p>
<p>Here&#8217;s our in-house procedure for recovery; note that we do not have FlexClone licensed on our filers.<br />
<span id="more-241"></span><br />
<strong>Prerequisites</strong></p>
<ul>
<li>An existing VMware ESX infrastructure, connected to a NetApp filer NFS datastore; SnapRestore speeds the recovery process but is not mandatory &#8211; see discussion below.</li>
<li>A backup script or system which coordinates VMware snapshots with NetApp snapshots &#8211; perhaps something along the lines of <a href="http://vmwaretips.com/wp/2008/12/05/netapp-snapshots-in-esx-take-2/">Rick Scherer&#8217;s script</a>.</li>
<li>A dedicated Linux restore VM, at a similar version level to the rest of your Linux VM infrastructure.  This VM should have LVM support, but <em>should not have any volume groups (VGs) or logical volumes (LVs) configured</em> &#8211; volume group and logical volume names on the VMDK you are restoring from must not conflict with VGs and LVs already in use on the restore system; the simplest way to guarantee this is to simply not have any VGs or LVs.</li>
</ul>
<p><strong>Restore Procedure</strong></p>
<ul>
<li>Restore the VMDK file from the appropriate snapshot to a <em>new location</em> in the datastore.  With SnapRestore, this can be done as follows (one line in the filer CLI, restoring from snapshot sv_daily.0 to a new file &#8211; again, <strong>be extremely careful not to overwrite the current version of the VMDK in your datastore</strong>, consider restoring to an entirely different directory in the FlexVol):
<p><code>snap restore -t file -s sv_daily.0<br />
-r /vol/vmware04_sis/system.example.com/system.example.com-restore.vmdk<br />
/vol/vmware04_sis/system.example.com/system.example.com.vmdk</code></p>
<p>Follow the prompts, verifying the restore path is correct and is not the path to your existing VMDK.  Do the same for the flat VMDK file (again, one line, and, as before, <strong>use caution to make sure you do not clobber an existing file</strong>):</p>
<p><code>snap restore -t file -s sv_daily.0<br />
-r /vol/vmware04_sis/system.example.com/system.example.com-restore-flat.vmdk<br />
/vol/vmware04_sis/system.example.com/system.example.com-flat.vmdk</code></p>
<p>Without SnapRestore, you can simply mount the NFS export of the datastore on a Linux machine and use &#8220;cp&#8221; to copy the files out of the snapshot.  For flat VMDK files, expect this copy to run for a substantial amount of time compared to the nearly-instant recovery SnapRestore offers.</li>
<li>Manually edit the line below &#8220;# Extent description&#8221; in the recovered .vmdk file to match the path to the recovered flat VMDK.  In this case, it would look something like this:<br />
<code># Extent description<br />
RW 20971520 VMFS "system.example.com-restore-flat.vmdk"</code></li>
<li>Attach the recovered VMDK to your powered-off restore host.  Boot the restore host.</li>
<li>Once your restore host is up, use &#8220;pvscan&#8221;, &#8220;vgscan&#8221; and &#8220;lvscan&#8221; (each without arguments) as root to examine available LVM components.  Then, use the &#8220;lvchange&#8221; command to activate the necessary volume group (in this case, &#8220;VolGroup00&#8221;):<br />
<code># lvchange -ay VolGroup00</code></li>
<li>Mount the appropriate logical volume &#8211; for example, LogVol00 in VolGroup00:<br />
<code>mount -o ro /dev/VolGroup00/LogVol00 /mnt</code><br />
Restore files by copying them out of /mnt.</li>
</ul>
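<p>For the non-SnapRestore path in the procedure above, the copy and the descriptor edit can be wrapped in two small shell functions that refuse to clobber an existing file.  Treat this as a sketch: the function names and the <code>/mnt/datastore</code> mount point are assumptions, and it presumes the filer exposes snapshots through the usual <code>.snapshot</code> directory on the NFS export.</p>

```shell
#!/bin/sh
# Copy a file out of a snapshot directory, refusing to overwrite anything.
copy_from_snapshot() {
  src=$1
  dest=$2
  if [ -e "$dest" ]; then
    echo "refusing to overwrite existing file: $dest" >&2
    return 1
  fi
  cp -p "$src" "$dest"
}

# Point the extent line of a recovered descriptor at the recovered flat file.
# Rewrites via a temporary file so the descriptor is replaced only on success.
repoint_descriptor() {
  descriptor=$1
  old_flat=$2
  new_flat=$3
  sed "s|\"$old_flat\"|\"$new_flat\"|" "$descriptor" > "$descriptor.tmp" &&
    mv "$descriptor.tmp" "$descriptor"
}

# Example, with the datastore export mounted at the hypothetical /mnt/datastore:
#   vm=/mnt/datastore/system.example.com
#   snap=/mnt/datastore/.snapshot/sv_daily.0/system.example.com
#   copy_from_snapshot "$snap/system.example.com.vmdk" "$vm/system.example.com-restore.vmdk"
#   copy_from_snapshot "$snap/system.example.com-flat.vmdk" "$vm/system.example.com-restore-flat.vmdk"
#   repoint_descriptor "$vm/system.example.com-restore.vmdk" \
#     system.example.com-flat.vmdk system.example.com-restore-flat.vmdk
```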
<p><strong>Cleanup</strong></p>
<ul>
<li>Shut down the Linux restore host.</li>
<li>Remove the recovery VMDK &#8211; the files restored with SnapRestore or by &#8220;cp&#8221; above &#8211; from the restore host in the VMware Infrastructure Client.</li>
<li>Delete the recovery .vmdk and -flat.vmdk files in the NFS datastore.  <strong>Don&#8217;t screw up here: Be sure to delete the recovery files only, not the working VMDK.</strong></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2009/06/01/vmwarenfsnetapp-snaprestorelinux-lvm-single-file-recovery-notes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ESX VM swap on NFS: If it crashes, try something else</title>
		<link>http://andyleonard.com/2009/02/04/esx-vm-swap-on-nfs-if-it-crashes-try-something-else/</link>
		<comments>http://andyleonard.com/2009/02/04/esx-vm-swap-on-nfs-if-it-crashes-try-something-else/#comments</comments>
		<pubDate>Thu, 05 Feb 2009 00:58:42 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[virtualization]]></category>
		<category><![CDATA[nfs]]></category>
		<category><![CDATA[vmware]]></category>
		<category><![CDATA[vmware esx]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=218</guid>
		<description><![CDATA[I&#8217;ve written about running VMware ESX with VM swap on an NFS datastore previously &#8211; specifically whether or not it was supported/recommended: ESX Swap on NFS or Not? VMware about ESX swap on NFS: It’s okay After writing the second post, I thought the issue was pretty much resolved: From multiple sources, the consensus seemed [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve written about running VMware ESX with VM swap on an NFS datastore previously &#8211; specifically whether or not it was supported/recommended:</p>
<ul>
<li><a href="http://andyleonard.com/2008/10/17/esx-swap-on-nfs-or-not/">ESX Swap on NFS or Not?</a></li>
<li><a href="http://andyleonard.com/2008/11/24/vmware-about-esx-swap-on-nfs-its-okay/">VMware about ESX swap on NFS: It’s okay</a></li>
</ul>
<p>After writing the second post, I thought the issue was pretty much resolved: From multiple sources, the consensus seemed to be that running ESX with VM swap on NFS would be fine.  Imagine my surprise (and disappointment) at seeing the following VMware KB article 1008091, updated yesterday: <a href="http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&#038;cmd=displayKC&#038;externalId=1008091">An ESX virtual machine on NFS fails with swap errors</a>.  Further details are in the article itself, but VMware&#8217;s KB site is throwing intermittent errors for me at the moment, so I&#8217;ll provide the money quote:</p>
<blockquote><p>The reliability of the virtual machine can be improved by relocating the swap file location to a non-NFS datastore. Either SAN or local storage datastores improve virtual machine stability.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2009/02/04/esx-vm-swap-on-nfs-if-it-crashes-try-something-else/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>VMware: Not kidding about VMotion GigE Requirement</title>
		<link>http://andyleonard.com/2009/02/03/vmware-not-kidding-about-vmotion-gige-requirement/</link>
		<comments>http://andyleonard.com/2009/02/03/vmware-not-kidding-about-vmotion-gige-requirement/#comments</comments>
		<pubDate>Wed, 04 Feb 2009 00:42:06 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[virtualization]]></category>
		<category><![CDATA[ethernet]]></category>
		<category><![CDATA[vmotion]]></category>
		<category><![CDATA[vmware]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=206</guid>
		<description><![CDATA[In case you&#8217;re curious/adventurous/broke enough to try configuring your VMotion network on Fast Ethernet instead of Gigabit Ethernet, here&#8217;s what you can expect. First, a warning from your VI client that you&#8217;re venturing into unsupported territory: And then, if you go ahead with the VMotion, a slight pause on the VM in question.  The following [...]]]></description>
			<content:encoded><![CDATA[<p>In case you&#8217;re curious/adventurous/broke enough to try configuring your VMotion network on Fast Ethernet instead of Gigabit Ethernet, here&#8217;s what you can expect.</p>
<p>First, a warning from your VI client that you&#8217;re venturing into unsupported territory:</p>
<div id="attachment_208" class="wp-caption aligncenter" style="width: 310px"><a href="http://andyleonard.com/wp-content/uploads/2009/02/vmotion.png"><img class="size-medium wp-image-208" title="vmotion" src="http://andyleonard.com/wp-content/uploads/2009/02/vmotion-300x182.png" alt="A friendly warning" width="300" height="182" /></a><p class="wp-caption-text">A friendly warning</p></div>
<p>And then, if you go ahead with the VMotion, a slight pause on the VM in question.  The following is output from running <code>while true; do date; sleep 1; done</code> on a Linux guest during the VMotion:</p>
<pre>Tue Feb  3 13:23:17 PST 2009
Tue Feb  3 13:23:18 PST 2009
Tue Feb  3 13:23:19 PST 2009
Tue Feb  3 13:23:20 PST 2009
Tue Feb  3 13:23:21 PST 2009
Tue Feb  3 13:23:22 PST 2009
Tue Feb  3 13:24:12 PST 2009
Tue Feb  3 13:24:13 PST 2009
Tue Feb  3 13:24:14 PST 2009
Tue Feb  3 13:24:15 PST 2009
Tue Feb  3 13:24:16 PST 2009</pre>
<p>Note the fifty-second pause between 13:23:22 and 13:24:12?  Ouch&#8230;</p>
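<p>If you would rather not eyeball the timestamps, the stall can be picked out automatically. The following is only a minimal sketch, not something from the original test: <code>find_gaps</code> is a hypothetical helper name, and it assumes the samples arrive as epoch seconds (for example from <code>date +%s</code> on a Linux guest), one per line.</p>

```shell
# find_gaps THRESHOLD
# Hypothetical helper: reads one epoch timestamp (whole seconds) per line
# on stdin and reports every jump between consecutive samples that is
# larger than THRESHOLD seconds.
find_gaps() {
  threshold=$1
  prev=""
  while read -r now; do
    if [ -n "$prev" ]; then
      delta=$((now - prev))
      if [ "$delta" -gt "$threshold" ]; then
        echo "gap of ${delta}s between $prev and $now"
      fi
    fi
    prev=$now
  done
}
```

<p>Something like <code>while true; do date +%s; sleep 1; done | find_gaps 2</code> could then be left running on the guest during the VMotion. A threshold of two seconds is an arbitrary choice that tolerates ordinary scheduling jitter while still flagging a stall like the one above.</p>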
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2009/02/03/vmware-not-kidding-about-vmotion-gige-requirement/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Fishworks on the VMware HCL</title>
		<link>http://andyleonard.com/2008/12/11/fishworks-on-the-vmware-hcl/</link>
		<comments>http://andyleonard.com/2008/12/11/fishworks-on-the-vmware-hcl/#comments</comments>
		<pubDate>Thu, 11 Dec 2008 19:29:44 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[storage]]></category>
		<category><![CDATA[virtualization]]></category>
		<category><![CDATA[fishworks]]></category>
		<category><![CDATA[sun]]></category>
		<category><![CDATA[vmware]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=186</guid>
		<description><![CDATA[I was checking out VMware&#8217;s new online searchable HCL and I noticed that the new Sun Unified Storage Systems were on the HCL. That was fast &#8211; and now I&#8217;m really curious as to how the systems with flash drives perform as storage for ESX.]]></description>
			<content:encoded><![CDATA[<p>I was checking out VMware&#8217;s new <a href="http://www.vmware.com/resources/compatibility/search.php">online searchable HCL</a> and I noticed that the new <a href="http://www.sun.com/storage/disk_systems/unified_storage/">Sun Unified Storage Systems</a> were <a href="http://www.vmware.com/resources/compatibility/search.php?action=search&#038;deviceCategory=san&#038;productId=1&#038;keyBasic=Unified+Storage+System&#038;maxDisplayRows=50&#038;key=Sun&#038;release[]=-1&#038;datePosted=-1">on the HCL</a>.  That was fast &#8211; and now I&#8217;m really curious as to how the systems with flash drives perform as storage for ESX.</p>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2008/12/11/fishworks-on-the-vmware-hcl/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>VMware about ESX swap on NFS: It&#8217;s okay</title>
		<link>http://andyleonard.com/2008/11/24/vmware-about-esx-swap-on-nfs-its-okay/</link>
		<comments>http://andyleonard.com/2008/11/24/vmware-about-esx-swap-on-nfs-its-okay/#comments</comments>
		<pubDate>Mon, 24 Nov 2008 18:32:42 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[virtualization]]></category>
		<category><![CDATA[esx]]></category>
		<category><![CDATA[nfs]]></category>
		<category><![CDATA[swap]]></category>
		<category><![CDATA[vmware]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=178</guid>
		<description><![CDATA[Paul Manning, from VMware, in response to a question I asked in the VI:OPS forums: The current best practice for NFS is to not seperate the VM swap space from the VMhome directory on a NFS datastore. The reason for the originial recommendation was just good old fashioned conservitiveness. More at the forum post, including [...]]]></description>
			<content:encoded><![CDATA[<p>Paul Manning, from VMware, in <a href="http://viops.vmware.com/home/message/1672?tstart=0#1672">response</a> to a question I asked in the VI:OPS forums:</p>
<blockquote><p>The current best practice for NFS is to not seperate the VM swap space from the VMhome directory on a NFS datastore. The reason for the originial recommendation was just good old fashioned conservitiveness.</p></blockquote>
<p>The forum post has more, including the reasoning behind the old recommendation to separate swap when using NFS &#8211; thanks, Paul, you made my day.</p>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2008/11/24/vmware-about-esx-swap-on-nfs-its-okay/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Links, 9/10/2008</title>
		<link>http://andyleonard.com/2008/09/10/links-9102008/</link>
		<comments>http://andyleonard.com/2008/09/10/links-9102008/#comments</comments>
		<pubDate>Wed, 10 Sep 2008 19:57:08 +0000</pubDate>
		<dc:creator>Andy</dc:creator>
				<category><![CDATA[link dump]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[timekeeping]]></category>
		<category><![CDATA[vmware]]></category>

		<guid isPermaLink="false">http://andyleonard.com/?p=80</guid>
		<description><![CDATA[Timekeeping best practices for Linux &#8211; &#8220;This article presents best practices for Linux timekeeping. These recommendations include specifics on the particular kernel command line options to use for the Linux operating system of interest. There is also a description of the recommended settings and usage for NTP time sync, configuration of VMware Tools time synchronization, [...]]]></description>
			<content:encoded><![CDATA[<ul>
<li><a href="http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&#038;cmd=displayKC&#038;externalId=1006427">Timekeeping best practices for Linux</a> &#8211; &#8220;This article presents best practices for Linux timekeeping. These recommendations include specifics on the particular kernel command line options to use for the Linux operating system of interest. There is also a description of the recommended settings and usage for NTP time sync, configuration of VMware Tools time synchronization, and Virtual Hardware Clock configuration, to achieve best timekeeping results.&#8221;  Where has this document been since I started deploying VMware?  Oh, wait, looks like it may have been written on August 19th&#8230; Still, thanks, VMware &#8211; exactly what I wanted!</li>
<li><a href="http://viops.vmware.com/home/index.jspa">VI:OPS</a> &#8211; A new VMware site: &#8220;We created VI:OPS to widen the discussion beyond pure, deep technical by adding five topics that VMware staff, partners and customers talk about all the time but where there is no online collaboration facility for these topics.&#8221;  I found the above link through a post on this site.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://andyleonard.com/2008/09/10/links-9102008/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

