centos – Thinking Sysadmin

Interesting Linux VM Crash Pattern

anleonard — Fri, 20 Nov 2009 19:09:16 +0000

I’ve just begun to pull together some interesting data on a series of Linux VM crashes I’ve seen. I don’t have a resolution yet, but some interesting patterns have emerged.

Crash Symptoms

A CentOS 4.x or 5.x guest will crash with a message similar to the following on its console:

CentOS 4.x:

[] .text.lock.scsi_error+0x19/0x34 [scsi_mod]
[] mptscsih_io_done+0x5ee/0x608 [mptscsi]
(…)
[] common_interrupt+0x18/0x20
[] system_call+0x0/0x30

CentOS 5.x:

RIP [] list_del+0x48/0x71
RSP
<0>Kernel Panic - not syncing: Fatal exception

A hard reset (i.e. pressing the reset button on the VM’s console) is required to reboot the guest.

Further Details

Five different VMs have encountered this issue, running at a mix of close-to-current CentOS 4.x and 5.x patch levels. Guest kernel versions when the crash occurred were 2.6.18-128.7.1.el5 and 2.6.18-128.1.10.el5 (5.x) and 2.6.9-89.0.9.ELsmp (4.x). Memory allocations on affected guests range from 512MB to 3072MB. Notably, all affected VMs are using SMP – each has 2 vCPUs – having been created before our in-house practices followed VMware guidelines and discouraged use of SMP on ESX guests when unnecessary. One VM was created via P2V; the rest were created de novo on virtual hardware.

All crashes have happened on a single node in an ESX 3.5 HA cluster composed of four Dell PowerEdge 1950s. ESX hosts have tracked the latest VMware patches closely. COS memory on the ESX host in question was increased from the default to 800MB prior to the three most recent crashes; in other words, the COS memory increase appears to have had no effect on the crashes. DRS is in use, set to “fully automated” and “apply recommendations with three or more stars” and no virtual machine rules have been created to control DRS host placement.

All guests are on the same NFS data store, served from a NetApp filer running ONTAP 7.2.x. One guest had its vmswap placed separately on an iSCSI data store; the rest have their swap stored on NFS with the VM. No log messages were seen on the filer during the event, although a log message similar to the following has been seen several times on the ESX host:

vmkernel: 43:07:27:51.725 cpu2:2185)WARNING: NFS: 4590: Can't find call with serial number -2146566055
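To see how often that warning shows up, and to line its timestamps up against the crash times, something like the following could be run on the ESX service console. This is only a minimal sketch: the default log path /var/log/vmkernel is an assumption about a stock ESX 3.5 install, and the function names are made up for illustration.

```shell
#!/bin/bash
# Pattern for the NFS warning quoted above (serial number left unanchored).
NFS_WARNING="WARNING: NFS: .*Can't find call with serial number"

# Print the number of matching warnings in a vmkernel log.
# The default path is an assumption; pass a different file as $1.
count_nfs_warnings() {
    local logfile="${1:-/var/log/vmkernel}"
    grep -c "$NFS_WARNING" "$logfile"
}

# Print the matching lines themselves, so their timestamps can be
# compared with the times of the guest crashes.
list_nfs_warnings() {
    local logfile="${1:-/var/log/vmkernel}"
    grep "$NFS_WARNING" "$logfile"
}
```

Running count_nfs_warnings against each host's log would also show whether the warning is specific to the host where the crashes occur.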

Curiously, all crashes have happened in the evening, in the 10 o’clock hour, after nightly backups have been completed. Backups are created using a combination of VMware and NetApp snapshots via a script similar to one detailed on vmwaretips.com. No substantial load or latency has been recorded on the NetApp during the crashes, and weeks have passed between events.

Speculation

Explanations I’m leaning towards, ranked by my judgment of their likelihood:

1) Hardware issue. Assuming a random distribution of VMs – recall that DRS is in use and no virtual machine rules are in place – the odds of all five crashes happening on the same host out of four are slim: 1 in 1024 for this particular host, (1/4)^5, or 1 in 256 for any single host, 4 × (1/4)^5. Unfortunately, by all measures we’ve used, including the VI Client’s “Health Status” and Dell OMSA, there are no hardware issues with the host.

Further, the distribution of VMs is not truly random. DRS migrations are infrequent in this cluster, and the largest determinant of guest location is migration following hosts being placed into maintenance mode for patching.

If it is a hardware issue, it’s subtle, and possibly only brought to the fore by the following issues.

2) Red Hat Enterprise Linux bug – which, by extension, is typically equivalent to a CentOS bug. In fact, this issue appears to have been raised with Red Hat already in bugs 197158 and 228108 – but, according to the bug reports, the issue is resolved, and the patches have since been ported downstream to CentOS. However, perhaps the issue is not truly resolved – see comment 35 in 228108.

3) vSMP Bug. The majority of our Linux VMs are uniprocessor and appear so far to be immune to this issue; it is striking that the crash has only occurred on dual processor guests. I cannot articulate a mechanism for multiple vCPUs causing this crash, however.

4) NetApp issue. This appears to be a storage issue at some level, considering the mptscsi and NFS messages noted above, so performance of the NetApp filer would be a natural place for further investigation. However, we monitor the performance of our filer relatively closely, using the ONTAP SDK and Cacti, and nothing unusual was recorded during any crash. It seems unusual that all VMs reside on the same data store, but that data store shares an aggregate with multiple other unaffected data stores, and several LUNs are served from the same aggregate to non-ESX machines without complaint.

I have not yet opened a case with VMware on this issue – or Dell, or NetApp, for that matter – but if and when I do, I’ll update here to the extent possible.

Update 11/20/2009: Prompted by a helpful comment from nate below, I looked up and verified the NFS settings across the cluster. They are the same across all hosts, and are as follows:

NFS.IndirectSend 0
NFS.DiskFileLockUpdate 10
NFS.LockUpdateTimeout 5
NFS.LockRenewMaxFailureNumber 3
NFS.LockDisable 0
NFS.HeartbeatFrequency 12
NFS.HeartbeatTimeout 5
NFS.HeartbeatDelta 5
NFS.HeartbeatMaxFailures 10
NFS.MaxVolumes 8
NFS.SendBufferSize 264
NFS.ReceiveBufferSize 128
NFS.VolumeRemountFrequency 30
NFS.UDPRetransmitDelay 700

The only values that are changed from default are HeartbeatFrequency and HeartbeatMaxFailures, to match NetApp’s recommendations in TR-3428.
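For anyone wanting to run the same comparison, the values can be read from the service console on each host with esxcfg-advcfg -g. The loop below is a rough sketch rather than the exact commands I ran: the ADVCFG variable and the function name are hypothetical conveniences, and the option list is simply the one shown above.

```shell
#!/bin/bash
# Command used to query advanced settings. ADVCFG is a hypothetical
# override so the loop can be exercised away from an ESX host.
ADVCFG="${ADVCFG:-esxcfg-advcfg}"

# NFS advanced options, as listed in the post.
NFS_OPTIONS="IndirectSend DiskFileLockUpdate LockUpdateTimeout
LockRenewMaxFailureNumber LockDisable HeartbeatFrequency HeartbeatTimeout
HeartbeatDelta HeartbeatMaxFailures MaxVolumes SendBufferSize
ReceiveBufferSize VolumeRemountFrequency UDPRetransmitDelay"

# Query each option in turn; esxcfg-advcfg -g prints the current value.
show_nfs_settings() {
    local opt
    for opt in $NFS_OPTIONS; do
        "$ADVCFG" -g "/NFS/$opt"
    done
}
```

Saving the output of show_nfs_settings from each host to a file and running diff across the files is a quick way to confirm the hosts agree.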

Keeping your RHEL VMs from crushing your storage at 4:02am

anleonard — Thu, 19 Nov 2009 19:39:30 +0000

Running a lot of Red Hat VMs in your virtual infrastructure, on shared storage? CentOS and Scientific Linux, versions 4 and 5, count for these purposes; Fedora should likely be included too. Do you have the slocate (version 4.x and earlier) or mlocate (version 5.x) RPM installed? If you’re uncertain, check using the following:

> rpm -q slocate
slocate-2.7-13.el4.8.i386

> rpm -q mlocate
mlocate-0.15-1.el5.2.x86_64

If so, multiple RHEL VMs plus mlocate or slocate may be adding up to an array-crushing 4:02am shared storage load and latency spike for you. Before being addressed, this spike was bad enough at my place of employment (when combined with a NetApp Sunday-morning disk scrub) to cause a Windows VM to crash with I/O errors. Ouch.

Details and ideas for resolution:

By default, a line in /etc/crontab runs the scripts within /etc/cron.daily at 4:02am each morning:

02 4 * * * root run-parts /etc/cron.daily

One of those scripts – mlocate.cron or slocate.cron, depending on your OS version – launches updatedb; as the man page says, “updatedb creates or updates a database used by locate(1).” (The “locate” binary is a filesystem search tool, see “man locate” for more information.) Updatedb refreshes its database by walking the filesystem, generating a fair amount of I/O on a single system. Imagine upwards of thirty of these running in parallel through VMDKs on one shared storage system carrying out internal maintenance at the same time, and you’re pretty much picturing the problem my employer had.
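To confirm which daily job launches updatedb on a given guest, a quick grep over the cron directory is enough. A minimal sketch, with a made-up function name; the default directory is the /etc/cron.daily mentioned above:

```shell
#!/bin/bash
# List the scripts in a cron directory that mention updatedb.
# Defaults to /etc/cron.daily; pass another directory as $1.
find_updatedb_jobs() {
    local dir="${1:-/etc/cron.daily}"
    # -l prints only the names of files that contain a match.
    grep -l 'updatedb' "$dir"/* 2>/dev/null
}
```

On a stock install this should print a single path ending in mlocate.cron or slocate.cron.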

I see three options for addressing this issue:

1) Uninstall mlocate or slocate. If you don’t currently use “locate” and you’re not interested in learning to use a tool that will likely make you more effective at your job (again, see “man locate”), this is probably the best option. (Yeah, I know, people that fit this bill generally don’t read blogs more technical than this one, so I could probably have skipped it here. Consider it an option for completeness, or if you really need to strip down an install.)

2) Disable the scheduled job by removing mlocate.cron or slocate.cron from /etc/cron.daily. This keeps locate available for your use, but requires that you update locate’s database ad-hoc and interactively by running the following as root:

# updatedb

This will take a few minutes to return, depending on the size of your file systems.

I don’t recommend this option either; at least it doesn’t fit the way I work. I often find myself using locate in high-pressure situations in which I need to quickly get a file location on a system. Waiting minutes for updatedb to return is extra painful when every second counts.

3) Stagger when updatedb runs by inserting a random delay into the script. This is my preferred alternative; locate’s database is kept current automatically, and your storage doesn’t have to bear a sudden spike in load. I implemented this by adding the lines in bold (lines 2-7 if your browser doesn’t display the bold text clearly):

#!/bin/sh
# sleep up to two hours before launching job:
value=$RANDOM
while [ $value -gt 7200 ] ; do
  value=$RANDOM
done
sleep $value
nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }')
renice +19 -p $$ >/dev/null 2>&1
/usr/bin/updatedb -f "$nodevs"

The added code inserts a pseudo-random sleep delay of up to two hours before updatedb runs, with the key being the built-in Bash variable $RANDOM, which yields a new integer between 0 and 32767 each time it is referenced. In our environment, this removed a 2000 IOPS spike at 4:02am, and eliminated a corresponding jump in filer latency. Obviously, adjust the delay period as appropriate for your environment. Additionally, be sure to add this change to your configuration management or installation management tools so that all of your RHEL and RHEL-derived VMs get the updated script.
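If you want the same delay in several cron scripts, the loop can be wrapped in a small function. The following is a sketch only: random_delay is a hypothetical name, and the upper bound check reflects the fact that $RANDOM never exceeds 32767, so a larger maximum would simply never be reached.

```shell
#!/bin/bash
# Print a pseudo-random number of seconds between 0 and MAX (default 7200).
# Values above MAX are discarded and redrawn, so every value in range is
# equally likely (unlike $RANDOM % MAX, which slightly favors low values).
random_delay() {
    local max="${1:-7200}" value
    if [ "$max" -lt 0 ] || [ "$max" -gt 32767 ]; then
        echo "random_delay: max must be between 0 and 32767" >&2
        return 1
    fi
    value=$RANDOM
    while [ "$value" -gt "$max" ]; do
        value=$RANDOM
    done
    echo "$value"
}
```

In a cron script this would be used as sleep "$(random_delay 7200)" ahead of the real work. Note that it relies on Bash; /bin/sh is Bash on RHEL and CentOS, but not on every system.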

Using $RANDOM to avoid this variant of the thundering herd problem also works nicely for a range of similar problems; I believe I first saw it at Moundalexis.com.

(This problem may apply to other Linux distributions being run as VMs, and FreeBSD does something equivalent – weekly – with /etc/periodic/weekly/310.locate. A similar solution can be applied to these environments, if necessary.)

Kickstarting CentOS 5.1 – Not from a yum repository any more

anleonard — Fri, 15 Feb 2008 14:47:27 +0000

In the past, I’ve used our local mirror of the CentOS yum repository to kickstart machines booted using PXE; apparently, this no longer works with CentOS 5.1, although it did with 5.0. If you attempt to do so, after the initial PXE boot, you get the following message:

The CentOS installation tree in that directory does not seem to match your boot media.

The solution? Download the 5.1 DVD ISO image, copy its contents to your disk and re-run the “pxeos” command to use that as your installation tree. I’m not sure if this was an oversight or a conscious change on the part of the CentOS project (or if the yum repository method was never supported in the first place), but it stumped me for a little while, so I thought I’d post it here.
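For the copy step, loop-mounting the ISO and copying everything out, hidden files included, is one way to do it. This is a hedged sketch rather than the exact commands I used: copy_iso_tree is a made-up name, any paths you pass are your own, and it needs to run as root for the mount.

```shell
#!/bin/bash
# Copy the full contents of a DVD ISO image into a directory that can
# then be used as an installation tree. Usage: copy_iso_tree ISO TREE
copy_iso_tree() {
    local iso="$1" tree="$2" mnt rc
    if [ ! -f "$iso" ]; then
        echo "copy_iso_tree: no such ISO image: $iso" >&2
        return 1
    fi
    mnt=$(mktemp -d) || return 1
    # Loop-mount the image read-only on a temporary mount point.
    if ! mount -o loop,ro "$iso" "$mnt"; then
        rmdir "$mnt"
        return 1
    fi
    mkdir -p "$tree"
    # "$mnt"/. (rather than "$mnt"/*) also copies dotfiles such as .discinfo.
    cp -a "$mnt"/. "$tree"/
    rc=$?
    umount "$mnt"
    rmdir "$mnt"
    return $rc
}
```

Once the copy finishes, point pxeos at the new directory as described above.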