Tagged: vmware

VMware Tools Upgrade on CentOS Enables Host Time Sync (plus fix)

After bringing some CentOS guests from an ESX 3.5 environment to an ESXi 4.1 environment and performing a VMware Tools upgrade, I noticed log messages on the VMs similar to the following:

Nov 12 09:07:18 node01 ntpd[2574]: time reset +175.995101 s

Along with console messages about the cmos clock such as:

time.c can't update cmos clock from 0 to 59

Inspecting the affected VMs, the clock appeared to be losing almost a second each second, despite ntpd being up and running and kernel options set appropriately. Further investigation revealed that “Synchronize guest time with host” had been silently enabled for the guest during the Tools upgrade, contrary to VMware’s Timekeeping best practices.

To be fair, I don’t know how widespread this problem is – it could be particular to CentOS, ESX 3.5 to 4.1 migrations, the fact that the virtual hardware hasn’t yet been upgraded from version 4 to version 7, or even my method of upgrading the tools. However, once you know to look for this issue, the resolution is simple: Disable host time sync. You can do this manually, or, if you use Puppet to manage your Linux VMs, the following manifest snippet will automate this for you (assuming you have a “vmware-tools” Service):

exec { "Disable host time sync":
  onlyif => "/usr/bin/test `/usr/bin/vmware-toolbox-cmd timesync status` = 'Enabled'",
  command => "/usr/bin/vmware-toolbox-cmd timesync disable",
  require => Service["vmware-tools"],
}
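
For reference, these are the commands the Puppet resource wraps; you can run them by hand inside any affected guest to check and then turn off host time sync:

/usr/bin/vmware-toolbox-cmd timesync status
/usr/bin/vmware-toolbox-cmd timesync disable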

Put Down the Saw and Get the Glue: Working Around VMware KB1022751

VMware KB article 1022751 lays out the details of an interesting bug in ESXi 4.0 and 4.1 pretty plainly:

When trying to team NICs using EtherChannel, the network connectivity is disrupted on an ESXi host. This issue occurs because NIC teaming properties do not propagate to the Management Network portgroup in ESXi. When you configure the ESXi host for NIC teaming by setting the Load Balancing to Route based on ip hash, this configuration is not propagated to Management Network portgroup.

(Note that load balancing by IP hash is the only supported option for EtherChannel link aggregation.)

Unfortunately, the KB article’s workaround – there is no patch that I’m aware of – requires network connectivity to the host via the vSphere Client. But what do you do if you’ve just sawed off the branch you’re sitting on network-wise, and can no longer connect with the vSphere client?
Continue reading

Interesting Linux VM Crash Pattern

I’ve just begun to pull together some data on a series of Linux VM crashes I’ve seen. I don’t have a resolution yet, but some interesting patterns have emerged.

Crash Symptoms

A CentOS 4.x or 5.x guest will crash with a message similar to the following on its console:

CentOS 4.x:

[<f883b299>] .text.lock.scsi_error+0x19/0x34 [scsi_mod]
[<f88c19ce>] mptscsih_io_done+0x5ee/0x608 [mptscsi] (…)
[<c02de564>] common_interrupt+0x18/0x20
[<c02ddb54>] system_call+0x0/0x30

CentOS 5.x:

RIP  [<ffffffff8014c562>] list_del+0x48/0x71 RSP <ffffffff80425d00> <0>Kernel Panic - not syncing: Fatal exception

A hard reset (i.e. pressing the reset button on the VM’s console) is required to reboot the guest.
Continue reading

Keeping your RHEL VMs from crushing your storage at 4:02am

Running a lot of Red Hat VMs in your virtual infrastructure, on shared storage? CentOS and Scientific Linux, versions 4 and 5, count for these purposes too; Fedora should likely be included as well. Do you have the slocate (version 4.x and earlier) or mlocate (version 5.x) RPMs installed? If you’re uncertain, check using the following:

> rpm -q slocate
slocate-2.7-13.el4.8.i386

or

> rpm -q mlocate
mlocate-0.15-1.el5.2.x86_64

If so, multiple RHEL VMs plus mlocate or slocate may be adding up to an array-crushing 4:02am shared storage load and latency spike for you. Before being addressed, this spike was bad enough at my place of employment (when combined with a NetApp Sunday-morning disk scrub) to cause a Windows VM to crash with I/O errors. Ouch.
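
If you want to confirm where that load is coming from before reading on, the schedule is easy to spot, and one crude mitigation is to stagger it. The sketch below assumes the stock RHEL/CentOS crontab and mlocate's script name on 5.x (it's slocate.cron on 4.x):

# The daily jobs, updatedb included, fire from the stock crontab:
grep cron.daily /etc/crontab
# 02 4 * * * root run-parts /etc/cron.daily

# Rough per-VM stagger: add a random 0-45 minute delay near the top of
# /etc/cron.daily/mlocate.cron so the guests don't all hit the array at once:
sleep $((RANDOM % 2700))
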
Continue reading

VMware/NFS/NetApp SnapRestore/Linux LVM Single File Recovery Notes

There have been a few posts elsewhere discussing file-level recovery for Linux VMs on NetApp NFS datastores, but none that have dealt specifically with Linux LVM-encapsulated partitions.

Here’s our in-house procedure for recovery; note that we do not have FlexClone licensed on our filers.
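
As a rough outline of the Linux-side steps once the snapshot copy of the vmdk is attached to a helper VM (a sketch only: the device, volume group, and logical volume names below are CentOS defaults and will differ; using a separate helper VM sidesteps duplicate volume group names):

# the restored vmdk shows up in the helper VM as a new disk, e.g. /dev/sdb,
# with the LVM physical volume on its second partition (/dev/sdb2 by default)
pvscan && vgscan                        # pick up the LVM metadata
vgchange -ay VolGroup00                 # activate the guest's volume group
mkdir -p /mnt/recover
mount -o ro /dev/VolGroup00/LogVol00 /mnt/recover
# copy out what you need, then clean up:
umount /mnt/recover
vgchange -an VolGroup00
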
Continue reading

ESX VM swap on NFS: If it crashes, try something else

I’ve written previously about running VMware ESX with VM swap on an NFS datastore, specifically about whether or not it was supported/recommended.

After writing the second post, I thought the issue was pretty much resolved: from multiple sources, the consensus seemed to be that running ESX with VM swap on NFS would be fine. Imagine my surprise (and disappointment) at seeing VMware KB article 1008091, updated yesterday: An ESX virtual machine on NFS fails with swap errors. Further details are in the article itself, but VMware’s KB site is throwing intermittent errors for me at the moment, so I’ll provide the money quote:

The reliability of the virtual machine can be improved by relocating the swap file location to a non-NFS datastore. Either SAN or local storage datastores improve virtual machine stability.
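
If you decide to make that move, one per-VM approach (a sketch; "local-datastore1" is a placeholder name, and sched.swap.dir is the option vSphere uses for per-VM swapfile placement) is to point the swap directory at a non-NFS datastore in the VM's .vmx while it is powered off. Cluster and host settings in the vSphere Client also expose a Virtual Machine Swapfile Location option that accomplishes the same thing more broadly.

sched.swap.dir = "/vmfs/volumes/local-datastore1"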

VMware: Not kidding about VMotion GigE Requirement

In case you’re curious/adventurous/broke enough to try configuring your VMotion network on Fast Ethernet instead of Gigabit Ethernet, here’s what you can expect.

First, a warning from your VI client that you’re venturing into unsupported territory:

A friendly warning

And then, if you go ahead with the VMotion, a slight pause on the VM in question.  The following is output from running while true; do date; sleep 1; done on a Linux guest during the VMotion:

Tue Feb  3 13:23:17 PST 2009
Tue Feb  3 13:23:18 PST 2009
Tue Feb  3 13:23:19 PST 2009
Tue Feb  3 13:23:20 PST 2009
Tue Feb  3 13:23:21 PST 2009
Tue Feb  3 13:23:22 PST 2009
Tue Feb  3 13:24:12 PST 2009
Tue Feb  3 13:24:13 PST 2009
Tue Feb  3 13:24:14 PST 2009
Tue Feb  3 13:24:15 PST 2009
Tue Feb  3 13:24:16 PST 2009

Note the fifty-second pause between 13:23:22 and 13:24:12? Ouch…

VMware about ESX swap on NFS: It’s okay

Paul Manning, from VMware, in response to a question I asked in the VI:OPS forums:

The current best practice for NFS is to not separate the VM swap space from the VM home directory on an NFS datastore. The reason for the original recommendation was just good old-fashioned conservativeness.

More at the forum post, including the reasoning behind the old recommendation to separate swap when using NFS – thanks, Paul, you made my day.

Quick and Dirty VMware ESX Patching

On the ESX console, do the following:

  • Read the documentation for each patch.
  • Group patches that can be installed together into a directory, possibly an NFS mount available on all your ESX hosts.
  • Change into the patch directory and untar the patches:

    for i in *.tgz; do
      tar -xvzf "$i"
    done

  • Install the patches:

    for i in *; do
      if [ -d "$i" ]; then
        cd "$i"
        esxupdate --noreboot update
        cd ..
      fi
    done

  • Reboot.
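
Afterward, assuming your build’s esxupdate supports it, you can confirm what landed by listing the installed bulletins:

    esxupdate query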