Monday, October 4, 2010

Experiencing slow boot times after upgrading from ESX 3.5 to ESX/ESXi 4.0 or 4.1? Check your RDMs.

I ran into an interesting problem the other day while refreshing one of our client’s Virtual Infrastructure 3 (VI3) hosts to vSphere 4.1. The ESX cluster had clustered SQL servers configured with MSCS and raw device mappings from a NetApp SAN. Since we were adding additional hosts to the cluster, my plan was to start off by installing ESXi 4.1.0 on the new hosts, test to ensure the hosts were configured properly (i.e. network, storage, etc.) and stable, then schedule a window to migrate the existing virtual machines over. After getting a few of the new hosts installed and included in the initiator groups for the existing LUNs, I noticed that the servers would take upwards of 15 to 20 minutes to boot. Watching the console during boot-up, I could see that it got stuck at the LUN detection line.

After going through additional troubleshooting steps to isolate the problem (I didn’t know it was the RDMs yet), I went ahead and called VMware. What I was told was that the timeout for synchronous commands during boot-up was changed in ESX and ESXi 4.x, which causes the host to take a very long time to boot when LUNs with persistent SCSI reservations are presented to it. In our case, it was the LUNs presented to the SQL cluster built on top of Microsoft Cluster Services (MSCS). The support engineer then proceeded to give me instructions documented in an internal KB that hadn’t yet been published to the public, to change the timeout parameter to a lower value:

For vSphere 4.1 you need to modify the Scsi.CRTimeoutDuringBoot parameter from the GUI:

  1. Go to Host > Configuration > Advanced Settings.
  2. Select SCSI.
  3. Change the Scsi.CRTimeoutDuringBoot value to 10000.

Once I changed the value to 10000 (which equals 10 seconds, since the value is in milliseconds), the host booted up properly.
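If you'd rather script this than click through the GUI, the same setting should be reachable with esxcfg-advcfg from the service console (or Tech Support Mode on ESXi). This is a sketch based on the assumption that the advanced option path mirrors the GUI name, i.e. /Scsi/CRTimeoutDuringBoot; verify the path on your own host first:

```shell
# Assumption: the option path matches the GUI name Scsi.CRTimeoutDuringBoot.
# Check the current value before changing anything
esxcfg-advcfg -g /Scsi/CRTimeoutDuringBoot

# Lower the reservation-conflict timeout during boot to 10000 ms (10 seconds)
esxcfg-advcfg -s 10000 /Scsi/CRTimeoutDuringBoot
```

Handy if you have a batch of new hosts to build and don't want to touch each one through the vSphere Client.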

The support engineer also had me verify the following, which I wasn’t able to find in ESXi 4.1.0:

To modify the Scsi.UWConflictRetries parameter from the GUI:

  1. Go to Host > Configuration > Advanced settings.
  2. In the Advanced settings window, select SCSI.
  3. Change the Scsi.UWConflictRetries value from the default 1000 to 80.
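The steps above should also have a command-line equivalent. Again, this is an assumption on my part that the option path mirrors the GUI name (/Scsi/UWConflictRetries); since I couldn't find the setting in ESXi 4.1.0, the get command is a quick way to confirm whether it exists on your build at all:

```shell
# Assumption: the option path matches the GUI name Scsi.UWConflictRetries.
# If this option doesn't exist on your host, the get will fail
esxcfg-advcfg -g /Scsi/UWConflictRetries

# Drop the retry count from the default 1000 to 80
esxcfg-advcfg -s 80 /Scsi/UWConflictRetries
```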


I was told that the retry value in ESX 3.5 is apparently set to a lower value, which is why we never experienced this issue until we started upgrading the hosts.
