Monday, July 12, 2010

Considerations for VMware HA on UCS when performing northbound Cisco 3750 switch maintenance

A client called me one evening asking why he couldn't access vCenter. The vCenter server was virtualized and ran in a cluster of 7 ESX 4.0 hosts on a redundant UCS setup (2 fabric interconnects, 2 chassis, 2 3750 switches). Since he couldn't even ping the vCenter server, yet no errors were logged in UCSM, I told him to use the VI Client to connect to each blade's ESX host directly. Connecting to each ESX host one at a time, he eventually found the host holding the vCenter VM and powered it on. Once he managed to get vCenter up and the VI Client connected, he noticed that all of his VMs were powered off.

The client was obviously worried and confused, and asked me for an explanation. I was on the road and couldn't log in to check the logs, so I asked him what had changed. He told me nothing had changed within ESX or the UCS, other than that the 2 Cisco 3750 switches had been swapped out. Since losing the northbound connection from the fabric interconnects to the 3750s would cause ESX to assume that the vmnics were disconnected, I knew immediately that every host had declared itself isolated and that, by default, HA would power down the virtual machines. Because all of the UCS blades depended on these connections, every ESX host was isolated at the same time, so none remained on the network to power the virtual machines back on.

Basically, what I told the client was that if he ever performs maintenance on the switches in the future, he should do one of the following:

1. Swap them out 1 at a time, ensuring that 1 of them stays online with its 10Gb link to the UCS up.
2. Disable HA before performing maintenance.
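
To make the reasoning behind these two options concrete, here is a minimal Python sketch of the scenario. It is a toy model only, not VMware's actual HA algorithm: the function name, host names, VM names and switch labels are all made up for illustration, and it assumes that surviving hosts have enough spare capacity to restart VMs from isolated hosts.

```python
def vms_running_after_outage(placement, uplinks, switches_down, ha_enabled=True):
    """Toy model of the HA isolation response (hypothetical, for illustration).

    placement     -- dict: host name -> list of VM names on that host
    uplinks       -- dict: host name -> set of northbound switches it relies on
    switches_down -- set of switches taken offline at the same time
    Returns the set of VMs still running after the outage.
    """
    running = {vm for vms in placement.values() for vm in vms}
    if not ha_enabled:
        # No HA means no isolation response: VMs stay powered on,
        # although they are unreachable while the switches are down.
        return running

    # A host counts as isolated when none of its northbound switches is left.
    isolated = [h for h in placement if not (uplinks[h] - switches_down)]
    survivors = [h for h in placement if h not in isolated]
    if survivors:
        # Assumption: surviving hosts restart VMs from any isolated host.
        return running

    # Every host is isolated: each powers off its VMs and nobody restarts them.
    for host in isolated:
        running -= set(placement[host])
    return running


# Hypothetical 7-host cluster, every blade depending on both 3750 switches.
hosts = [f"esx{i}" for i in range(1, 8)]
placement = {h: [f"vm-{h}"] for h in hosts}
placement["esx1"].append("vcenter")
uplinks = {h: {"3750-A", "3750-B"} for h in hosts}

both_swapped = vms_running_after_outage(placement, uplinks, {"3750-A", "3750-B"})
one_at_a_time = vms_running_after_outage(placement, uplinks, {"3750-A"})
ha_disabled = vms_running_after_outage(
    placement, uplinks, {"3750-A", "3750-B"}, ha_enabled=False
)
```

In this model, swapping both switches at once with HA enabled leaves no VM running, while either option above keeps every VM, including vCenter, powered on.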

1 comment:

mj1pate said...

No comments on this one yet? Hard to believe. This bit us as well when we were in the early deployment stages. Very good advice, and the topic of coordinating UCS/networking HA with VMware HA deserves more treatment.