Category: virtualization

VMware Tools Upgrade on CentOS Enables Host Time Sync (plus fix)

After bringing some CentOS guests from an ESX 3.5 environment to an ESXi 4.1 environment and performing a VMware Tools upgrade, I noticed log messages on the VMs similar to the following:

Nov 12 09:07:18 node01 ntpd[2574]: time reset +175.995101 s

Along with console messages about the cmos clock such as:

time.c can't update cmos clock from 0 to 59

On the affected VMs, the clock appeared to be losing almost a second every second, despite ntpd being up and running and the kernel clock options being set appropriately. Further investigation revealed that “Synchronize guest time with host” had been silently enabled for the guest during the Tools upgrade, contrary to VMware’s timekeeping best practices.
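For reference, VMware’s timekeeping guidance for NTP inside Linux guests amounts to a small ntp.conf; a representative example (server names here are placeholders, not from my environment) looks something like this, with the `tinker panic 0` line being the piece specifically called out for VMs:

```
# /etc/ntp.conf on a VM guest (illustrative; server names are placeholders)
tinker panic 0               # don't let ntpd give up after a large one-time offset
restrict 127.0.0.1
server ntp1.example.com
server ntp2.example.com
driftfile /var/lib/ntp/drift
```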

To be fair, I don’t know how widespread this problem is – it could be particular to CentOS, ESX 3.5 to 4.1 migrations, the fact that the virtual hardware hasn’t yet been upgraded from version 4 to version 7, or even my method of upgrading the tools. However, once you know to look for this issue, the resolution is simple: Disable host time sync. You can do this manually, or, if you use Puppet to manage your Linux VMs, the following manifest snippet will automate this for you (assuming you have a “vmware-tools” Service):

exec { "Disable host time sync":
  onlyif => "/usr/bin/test `/usr/bin/vmware-toolbox-cmd timesync status` = 'Enabled'",
  command => "/usr/bin/vmware-toolbox-cmd timesync disable",
  require => Service["vmware-tools"],
}
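If you’d rather do it by hand, or just see what Puppet will do, the same check-then-disable logic can be expressed as a short shell script. This is a sketch: it assumes `vmware-toolbox-cmd timesync status` prints either `Enabled` or `Disabled`, and the executable-bit guard makes it a harmless no-op on machines without the Tools installed.

```shell
#!/bin/sh
# Manual equivalent of the Puppet resource: disable host time sync
# only if it is currently enabled.
TOOLBOX=/usr/bin/vmware-toolbox-cmd

timesync_enabled() {
    # Assumes "timesync status" prints "Enabled" or "Disabled"
    [ "$1" = "Enabled" ]
}

if [ -x "$TOOLBOX" ]; then
    status="$("$TOOLBOX" timesync status)"
    if timesync_enabled "$status"; then
        "$TOOLBOX" timesync disable
    fi
fi
```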

Put Down the Saw and Get the Glue: Working Around VMware KB1022751

VMware KB article 1022751 lays out the details of an interesting bug in ESXi 4.0 and 4.1 pretty plainly:

When trying to team NICs using EtherChannel, the network connectivity is disrupted on an ESXi host. This issue occurs because NIC teaming properties do not propagate to the Management Network portgroup in ESXi. When you configure the ESXi host for NIC teaming by setting the Load Balancing to Route based on ip hash, this configuration is not propagated to Management Network portgroup.

(Note that load balancing by IP hash is the only supported option for EtherChannel link aggregation.)
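For reference on the switch side: ESX/ESXi 4.x speaks only static EtherChannel, not LACP or PAgP, so the channel group has to be forced on. A minimal Cisco IOS sketch, with interface numbers as placeholders:

```
! Static EtherChannel toward the ESXi host (interface numbers are placeholders)
interface range GigabitEthernet0/1 - 2
 channel-group 1 mode on     ! "on" = static; LACP/PAgP negotiation modes won't work
```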

Unfortunately, the KB article’s workaround – there is no patch that I’m aware of – requires network connectivity to the host via the vSphere Client. But what do you do if you’ve just sawed off the branch you’re sitting on network-wise, and can no longer connect with the vSphere client?

Interesting Linux VM Crash Pattern

I’ve just begun to pull together some interesting data on a series of Linux VM crashes I’ve seen. I don’t have a resolution yet, but a few patterns have emerged.

Crash Symptoms

A CentOS 4.x or 5.x guest will crash with a message similar to the following on its console:

CentOS 4.x:

[<f883b299>] .text.lock.scsi_error+0x19/0x34 [scsi_mod]
[<f88c19ce>] mptscsih_io_done+0x5ee/0x608 [mptscsi] (…)
[<c02de564>] common_interrupt+0x18/0x20
[<c02ddb54>] system_call+0x0/0x30

CentOS 5.x:

RIP  [<ffffffff8014c562>] list_del+0x48/0x71
 RSP <ffffffff80425d00>
<0>Kernel panic - not syncing: Fatal exception

A hard reset (i.e. pressing the reset button on the VM’s console) is required to reboot the guest.

VMware/NFS/NetApp SnapRestore/Linux LVM Single File Recovery Notes

There have been a few posts elsewhere discussing file-level recovery for Linux VMs on NetApp NFS datastores, but none that have dealt specifically with Linux LVM-encapsulated partitions.

Here’s our in-house procedure for recovery; note that we do not have FlexClone licensed on our filers.
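As a preview of the LVM-specific part, the activation steps after attaching the restored vmdk to a helper VM look roughly like the following dry-run sketch. All the names (`/dev/sdb`, `vg_guest`, `lv_root`, the mount point) are placeholders, and it assumes the helper VM’s own volume group has a different name than the guest’s; drop the `run` wrapper to execute the commands for real.

```shell
#!/bin/sh
# Dry-run sketch: each step is echoed rather than executed.
run() { echo "would run: $*"; }

run kpartx -av /dev/sdb                        # map partitions on the attached restored disk
run vgscan                                     # rescan so LVM sees the guest's volume group
run vgchange -ay vg_guest                      # activate the (placeholder) volume group
run mount /dev/vg_guest/lv_root /mnt/recovery  # mount the LV to copy files out
```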

ESX VM swap on NFS: If it crashes, try something else

I’ve written previously about running VMware ESX with VM swap on an NFS datastore – specifically, whether or not it was supported and recommended.

After writing the second post, I thought the issue was pretty much resolved: from multiple sources, the consensus seemed to be that running ESX with VM swap on NFS would be fine. Imagine my surprise (and disappointment) at seeing VMware KB article 1008091, updated yesterday: An ESX virtual machine on NFS fails with swap errors. Further details are in the article itself, but VMware’s KB site is throwing intermittent errors for me at the moment, so I’ll provide the money quote:

The reliability of the virtual machine can be improved by relocating the swap file location to a non-NFS datastore. Either SAN or local storage datastores improve virtual machine stability.
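The relocation can be done per cluster or per host in the vSphere Client, or per VM. For the per-VM case, a sketch of the relevant .vmx setting (the datastore name is a placeholder for a local VMFS volume):

```
# .vmx snippet: keep this VM's swap off NFS ("local_vmfs" is a placeholder)
sched.swap.dir = "/vmfs/volumes/local_vmfs/"
```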

VMware: Not kidding about VMotion GigE Requirement

In case you’re curious/adventurous/broke enough to try configuring your VMotion network on Fast Ethernet instead of Gigabit Ethernet, here’s what you can expect.

First, a warning from your VI client that you’re venturing into unsupported territory:

A friendly warning

And then, if you go ahead with the VMotion, a slight pause on the VM in question.  The following is output from running while true; do date; sleep 1; done on a Linux guest during the VMotion:

Tue Feb  3 13:23:17 PST 2009
Tue Feb  3 13:23:18 PST 2009
Tue Feb  3 13:23:19 PST 2009
Tue Feb  3 13:23:20 PST 2009
Tue Feb  3 13:23:21 PST 2009
Tue Feb  3 13:23:22 PST 2009
Tue Feb  3 13:24:12 PST 2009
Tue Feb  3 13:24:13 PST 2009
Tue Feb  3 13:24:14 PST 2009
Tue Feb  3 13:24:15 PST 2009
Tue Feb  3 13:24:16 PST 2009

Notice the fifty-second pause between 13:23:22 and 13:24:12? Ouch…
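If you capture the loop’s output to a file, the worst stall is easy to compute with a little awk. A sketch (`max_gap` is a name I made up; it assumes the `date` output format above, where the time of day is the fourth field, and it ignores midnight wraparound):

```shell
#!/bin/sh
# Print the largest gap, in seconds, between consecutive timestamps
# in `date` output like "Tue Feb  3 13:23:17 PST 2009" (time = field 4).
max_gap() {
    awk '{
        split($4, t, ":")                  # HH:MM:SS -> seconds since midnight
        s = t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1 && s - prev > max) max = s - prev
        prev = s
    } END { print max + 0 }'
}

printf '%s\n' "Tue Feb  3 13:23:22 PST 2009" \
              "Tue Feb  3 13:24:12 PST 2009" | max_gap   # prints 50
```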

VMware about ESX swap on NFS: It’s okay

Paul Manning, from VMware, in response to a question I asked in the VI:OPS forums:

The current best practice for NFS is to not separate the VM swap space from the VM home directory on an NFS datastore. The reason for the original recommendation was just good old-fashioned conservativeness.

More at the forum post, including the reasoning behind the old recommendation to separate swap when using NFS – thanks, Paul, you made my day.