Tuesday, July 20, 2010

What happens when a VMware ESX host loses redundant fibre channel (FC) links to a datastore

I’ve been fortunate enough to be involved in a project for a law firm deploying 10 ESX hosts across 2 co-located, geographically dispersed sites for DR (disaster recovery). Beyond all the other technologies I got to work with: EMC, VMware SRM, vSphere, the list goes on (I love this datacenter virtualization stuff), I came across an interesting discovery during the testing phase. I was responsible for testing the ESX clusters and their redundancy, both network and storage, so I had to create a test plan. Aside from the test cases for HA, DRS, the Nexus 1000v (what a disaster) and all the bells and whistles of the technologies involved, one test case revealed something completely new to me, and that was the FC paths to storage. We used EMC PowerPath to provide the redundant FC links to the fabric switches, and I included the following test case:

| Category | Test | Command / Procedure | Expected Behavior | Result | Notes |
| --- | --- | --- | --- | --- | --- |
| VMware - HA / EMC Storage | HA virtual machine restart | Choose a host with a test virtual machine and disconnect both FC cables. | Virtual machine restarts on another host. | | |

As shown in the table above, I anticipated that once the ESX host lost both of its paths to the fabric, and with them connectivity to the datastore, VMware HA would restart the virtual machine on another host. That did not happen, and here’s how it looks if you run the test above:

1. Once you disconnect the two FC cables, navigate to the host’s Configuration tab –> Storage Adapters and click on the vmhba. You will see that all the other datastores are gone, aside from whichever datastore a powered-on virtual machine resides on.


2. Clicking on the Paths tab will show the following:


3. Great, so the path is indicated as dead. So what does the information for the virtual machine show?


4. Interestingly enough, testvm still shows as powered on, on the very host that lost all of its FC paths to the EMC SAN. So what happens if I try to open the console window?


5. Here we see an “Unable to connect to the MKS: Virtual machine config file does not exist.” message. No surprise here; the host did lose access to the virtual machine’s files.

So long story short, I went ahead and posted a question on the VMware community forums, and someone responded telling me that HA doesn’t restart virtual machines for lost FC connections. This didn’t surprise me, as the 3.5 training course I attended always talked about “host isolation” or the host actually being down. I then did some tests with the VM Monitoring feature, which monitors the heartbeat via VMware Tools, to see if that would restart the virtual machine, and found that it does indeed restart it, but only:

1. After you reconnect the FC links.

2. On the same host, not another host.

The post I wrote on the forums hasn’t gotten many responses from other users in the community, and Google searches don’t appear to yield many results either (or maybe I’m not typing in the right search terms), but I’ve come to believe that there is no solution for this short of some form of manual scripting (possibly a forced reboot of the host) combined with monitoring of the FC links.
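To illustrate the kind of manual scripting I have in mind, here is a minimal sketch that parses the host’s multipath listing and flags when every FC path is dead. This is only a sketch: the `esxcfg-mpath -l` command exists on ESX of this vintage, but the exact output format varies by build, so the `state:` parsing and the sample WWN strings below are assumptions you would adjust for your environment.

```python
# Sketch of a "monitor the FC links" script: flag the host when every
# path in the multipath listing is dead. The "state: <word>" output
# format and the sample identifiers are assumptions; adjust to match
# what `esxcfg-mpath -l` actually prints on your build.

import re
import subprocess


def parse_path_states(mpath_output):
    """Extract the state of each path from esxcfg-mpath-style output."""
    return re.findall(r"state:\s*(\w+)", mpath_output, flags=re.IGNORECASE)


def all_paths_dead(mpath_output):
    """True when at least one path is listed and every one of them is dead."""
    states = parse_path_states(mpath_output)
    return bool(states) and all(s.lower() == "dead" for s in states)


def host_lost_storage():
    """Run the path-listing command on the host and evaluate the result."""
    output = subprocess.run(
        ["esxcfg-mpath", "-l"], capture_output=True, text=True
    ).stdout
    return all_paths_dead(output)


if __name__ == "__main__":
    # Placeholder WWNs, both paths dead:
    sample = "fc.2000001b3281b4a5 state: dead\nfc.2000001b3281c6d2 state: dead\n"
    print(all_paths_dead(sample))  # True
```

A wrapper could poll `host_lost_storage()` on a schedule and, if it stays true past a threshold, alert an operator or trigger the forced host reboot mentioned above.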

My colleague who was with me reached out to a former coworker of his in the datacenter field, and surprisingly he thought that VM Monitoring would restart it. I’m certain it doesn’t, because I left the links disconnected for 30 minutes and confirmed that the virtual machine was indeed off by reviewing the Windows event logs and seeing a 30-minute gap between events.
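The event-log check above boils down to finding the largest gap between consecutive event timestamps. Here is a small sketch of that check; the timestamp strings below are made-up examples, and the format string is an assumption you would match to however you export the Windows event log.

```python
# Sketch of the event-log gap check: given exported event timestamps,
# find the longest silence. A gap roughly matching the outage window
# confirms the guest was actually down. Timestamps are illustrative.

from datetime import datetime, timedelta


def largest_gap(timestamps):
    """Return the longest interval between consecutive event timestamps."""
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps)
    return max((b - a for a, b in zip(times, times[1:])), default=timedelta(0))


events = [
    "2010-07-20 10:00:05",
    "2010-07-20 10:05:12",
    "2010-07-20 10:35:40",  # ~30 minutes of silence while the VM was down
]
print(largest_gap(events) > timedelta(minutes=25))  # True
```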


Anonymous said...

We've encountered the same problem with ESXi 4.1 (build 260247) cluster and EMC AX4 storage array.

Virtualization said...

When you get the error message:

Unable to connect to MKS: Virtual machine config file does not exist

It is likely that the IP address of that VM is still pingable, so vCenter does not see anything wrong with it.
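[Editor's note: the commenter's point is easy to verify during the test. A minimal reachability check might look like the following; the guest IP shown is a placeholder.]

```python
# Quick check of the comment's claim: while the host's FC paths are dead,
# the guest may still answer pings, so HA/vCenter sees nothing wrong.

import platform
import subprocess


def is_pingable(ip):
    """Send one ICMP echo request; True if the target replies."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False  # ping binary not available on this system
    return result.returncode == 0


# Example with a placeholder guest IP:
# is_pingable("192.0.2.10")
```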

Here is a potential solution to the problem:

I've seen this issue come up on the VMware forums a few times with no solution. I can confirm that the above method fixes the issue: the datastore becomes browsable again and the VM functions as it did before.