Interesting Linux VM Crash Pattern
I’ve just begun to pull together data on a series of Linux VM crashes I’ve seen. I don’t have a resolution yet, but some interesting patterns have emerged.
Crash Symptoms
A CentOS 4.x or 5.x guest will crash with a message similar to the following on its console:
CentOS 4.x:
[<f883b299>] .text.lock.scsi_error+0x19/0x34 [scsi_mod]
[<f88c19ce>] mptscsih_io_done+0x5ee/0x608 [mptscsi] (…)
[<c02de564>] common_interrupt+0x18/0x20
[<c02ddb54>] system_call+0x0/0x30
CentOS 5.x:
RIP [<ffffffff8014c562>] list_del+0x48/0x71
RSP <ffffffff80425d00>
<0>Kernel Panic - not syncing: Fatal exception
A hard reset (i.e. pressing the reset button on the VM’s console) is required to reboot the guest.
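As a rough aid for spotting these two signatures in captured console output, a small Python sketch along the following lines could be used. The function name and labels are placeholders invented for illustration; the patterns are built only from the symbols that appear in the traces above and would need extending for other kernels.

```python
import re

# Patterns built from the two console traces quoted above.
# The labels are illustrative placeholders, not official names.
SIGNATURES = {
    "CentOS 4.x (mptscsih_io_done)": re.compile(
        r"mptscsih_io_done\+0x[0-9a-f]+/0x[0-9a-f]+"
    ),
    "CentOS 5.x (list_del)": re.compile(
        r"RIP\s+\[<[0-9a-f]+>\]\s+list_del\+0x[0-9a-f]+/0x[0-9a-f]+"
    ),
}


def classify_trace(console_text):
    """Return the label of every known signature found in console_text."""
    return [
        label
        for label, pattern in SIGNATURES.items()
        if pattern.search(console_text)
    ]
```

Passing the text of a saved console capture to `classify_trace` returns a list of matching labels, or an empty list if neither signature is present.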
Further Details
Five different VMs have encountered this issue, running at a mix of close-to-current CentOS 4.x and 5.x patch levels. Guest kernel versions when the crash occurred were 2.6.18-128.7.1.el5 and 2.6.18-128.1.10.el5 (5.x) and 2.6.9-89.0.9.ELsmp (4.x). Memory allocations on affected guests range from 512MB to 3072MB. Notably, all affected VMs are using SMP – each has 2 vCPUs – having been created before our in-house practices followed VMware guidelines and discouraged use of SMP on ESX guests when unnecessary. One VM was created via P2V; the rest were created de novo on virtual hardware.
All crashes have happened on a single node in an ESX 3.5 HA cluster composed of four Dell PowerEdge 1950s. ESX hosts have tracked the latest VMware patches closely. COS memory on the ESX host in question was increased from the default to 800MB prior to the three most recent crashes; in other words, the COS memory increase appears to have had no effect on the crashes. DRS is in use, set to “fully automated” and “apply recommendations with three or more stars”, and no virtual machine rules have been created to control DRS host placement.
All guests are on the same NFS data store, served from a NetApp filer running ONTAP 7.2.x. One guest had its vmswap placed separately on an iSCSI data store; the rest have their swap stored on NFS with the VM. No log messages were seen on the filer during the event, although a log message similar to the following has been seen several times on the ESX host:
vmkernel: 43:07:27:51.725 cpu2:2185)WARNING: NFS: 4590: Can't find call with serial number -2146566055
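To line these warnings up against crash times, it may help to pull them out of the vmkernel log programmatically. The sketch below is one possible approach in Python. It assumes the leading field is host uptime in days:hours:minutes:seconds.milliseconds, which is an interpretation of the format rather than something confirmed here, and the function name is a placeholder.

```python
import re

# Matches vmkernel lines shaped like the one quoted above.
# Assumption: the first field is host uptime as days:hours:minutes:seconds.ms.
NFS_WARNING = re.compile(
    r"vmkernel: (?P<d>\d+):(?P<h>\d+):(?P<m>\d+):(?P<s>\d+\.\d+) "
    r"cpu\d+:\d+\)WARNING: NFS: \d+: "
    r"Can['’]t find call with serial number (?P<serial>-?\d+)"
)


def find_nfs_warnings(lines):
    """Return a list of (uptime_seconds, serial) for each matching log line."""
    hits = []
    for line in lines:
        match = NFS_WARNING.search(line)
        if match:
            uptime = (
                int(match["d"]) * 86400
                + int(match["h"]) * 3600
                + int(match["m"]) * 60
                + float(match["s"])
            )
            hits.append((uptime, int(match["serial"])))
    return hits
```

Feeding it the lines of a saved log file, for example `find_nfs_warnings(open(path))` with `path` pointing at a copy of the log, yields one tuple per warning. These can then be compared with the host's boot time to estimate when each warning occurred.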
Curiously, all crashes have happened in the evening, in the 10 o’clock hour, after nightly backups have been completed. Backups are created using a combination of VMware and NetApp snapshots via a script similar to one detailed on vmwaretips.com. No substantial load or latency has been recorded on the NetApp during the crashes, and weeks have passed between events.
Speculation
Explanations I’m leaning towards, ranked by my judgment of their likelihood:
1) Hardware issue. Assuming a random distribution of VMs (recall that DRS is in use and no virtual machine rules are in place), the odds of all five crashes happening on one particular host out of four are slim: 1 in 1024, or 1 in 256 if the question is only whether all five land on the same host, whichever it is. Unfortunately, by all measures we’ve used, including the VI Client’s “Health Status” and Dell OMSA, there are no hardware issues with the host.
Further, the distribution of VMs is not truly random. DRS migrations are infrequent in this cluster, and the largest determinant of guest location is migration following hosts being placed into maintenance mode for patching.
If it is a hardware issue, it’s subtle, and possibly only brought to the fore by the following issues.
2) Red Hat Enterprise Linux bug, which, by extension, is typically equivalent to a CentOS bug. In fact, this issue appears to have been raised with Red Hat already in bugs 197158 and 228108, but according to the bug reports, the issue is resolved, and the patches have since been ported downstream to CentOS. However, perhaps the issue is not truly resolved; see comment 35 in 228108.
3) vSMP Bug. The majority of our Linux VMs are uniprocessor and appear so far to be immune to this issue; it is striking that the crash has only occurred on dual processor guests. I cannot articulate a mechanism for multiple vCPUs causing this crash, however.
4) NetApp issue. This appears to be a storage issue at some level, considering the mptscsi and NFS messages noted above, so performance of the NetApp filer would be a natural place for further investigation. However, we monitor the performance of our filer relatively closely, using the ONTAP SDK and Cacti, and nothing unusual was recorded during any crash. It seems unusual that all affected VMs reside on the same data store, but that data store shares an aggregate with multiple other unaffected data stores, and several LUNs are served from the same aggregate to non-ESX machines without complaint.
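The odds quoted under the hardware explanation can be checked with a few lines of Python. This is only a sketch under the same simplifying assumption made there, namely that each crash lands on any of the four hosts independently and with equal probability. The function names are placeholders.

```python
from fractions import Fraction


def p_all_on_named_host(hosts, crashes):
    """Probability that every crash lands on one host chosen in advance."""
    return Fraction(1, hosts) ** crashes


def p_all_on_same_host(hosts, crashes):
    """Probability that every crash lands on the same host, whichever it is."""
    return hosts * Fraction(1, hosts) ** crashes


print(p_all_on_named_host(4, 5))  # 1/1024
print(p_all_on_same_host(4, 5))   # 1/256
```

The 1 in 1024 figure corresponds to a host picked beforehand; if any of the four hosts would have looked equally suspicious, the relevant figure is 1 in 256. Either way the caveat about VM placement not being truly random still applies.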
I have not yet opened a case with VMware on this issue – or Dell, or NetApp, for that matter – but if and when I do, I’ll update here to the extent possible.
Update 11/20/2009: Prompted by a helpful comment from nate below, I looked up and verified the NFS settings across the cluster. They are the same across all hosts, and are as follows:
NFS.IndirectSend 0
NFS.DiskFileLockUpdate 10
NFS.LockUpdateTimeout 5
NFS.LockRenewMaxFailureNumber 3
NFS.LockDisable 0
NFS.HeartbeatFrequency 12
NFS.HeartbeatTimeout 5
NFS.HeartbeatDelta 5
NFS.HeartbeatMaxFailures 10
NFS.MaxVolumes 8
NFS.SendBufferSize 264
NFS.ReceiveBufferSize 128
NFS.VolumeRemountFrequency 30
NFS.UDPRetransmitDelay 700
The only values that are changed from default are HeartbeatFrequency and HeartbeatMaxFailures, to match NetApp’s recommendations in TR-3428.
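For anyone wanting to repeat the cross-host comparison, a sketch like the following could do it in Python. It assumes each host's settings have been saved as text in the same "name value" layout as the list above. The host names in the usage note are hypothetical, and how the text is collected from each host is left out.

```python
def parse_settings(text):
    """Parse lines of the form 'NFS.Name value' into a dict of name -> int."""
    settings = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            settings[parts[0]] = int(parts[1])
    return settings


def diff_settings(per_host):
    """Return {setting: {host: value}} for settings that differ between hosts.

    per_host maps a host name to the dict produced by parse_settings.
    A setting missing on a host shows up with the value None.
    """
    names = set().union(*(s.keys() for s in per_host.values()))
    diffs = {}
    for name in sorted(names):
        values = {host: s.get(name) for host, s in per_host.items()}
        if len(set(values.values())) > 1:
            diffs[name] = values
    return diffs
```

With the saved text for two hosts in `text_a` and `text_b`, calling `diff_settings({"esx-a": parse_settings(text_a), "esx-b": parse_settings(text_b)})` returns an empty dict when the hosts agree, which is the result described above.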
Just for reference, I run about 100 VMs with 2.6.18-128.1.10.el5 on CentOS 5.2 (5.3 kernel on 5.2 distro) with the SMP kernel, though there is only 1 vCPU per VM. All of them are on local storage on Dell R610. I also have about a dozen VMs on the same kernel running with 2 vCPUs.
I run about 40 physical servers on the same kernel on native R610 with 8 cores; no panics there either.
I run the SMP kernel on everything even if it’s only 1 vCPU, because you never know if you may need to upgrade to 2 or more CPUs and if you do I don’t want to have to change kernels. Though now that I think about it I think RHEL/CentOS 5.x kernels are all SMP now vs on 4.x where they had SMP and UP kernels, but not 100% sure.
Never had a panic of any sort on any of them.
To me your problem looks related to storage, so I would look to NetApp or VMware’s NFS stuff. I would also consider hooking up at least one host via Fibre Channel or iSCSI, running some of the VMs off of that, and seeing what you get.
My own infrastructure is split into two classes. Our edge web servers, which are at several physical locations, all run off of local storage. The other class is most of our back end stuff or QA or internal IT, all of which is Fibre Channel connected (most of it is boot from SAN, with the exception of some older ESXi 3.5 systems that don’t support that) to our 3PAR T400 storage system.
In case it helps you in tracing the issue.
Thanks for the input and suggestions, Nate – much appreciated.
I failed to mention in the post that we are running another cluster of ESX 3.5 hosts off the same filer (and same aggregate) using a Fibre Channel LUN without issue, FWIW. Guests on that cluster include Linux VMs, but all without vSMP. That suggests it’s not the filer itself, but it could still be the storage protocol, or the host hardware, or vSMP.
(And now that I’ve written that, I’m curious about verifying the VMware NFS settings on the problem host. I’ll post an update on what I find there.)
I’ve posted the NFS settings above; unfortunately, they don’t appear to deviate from default/recommended settings.