Tuesday, January 25, 2011

VMware ESXi 4.1 slow boot up seemingly stuck at: vmw_vaaip_netapp, vmkibft, vmfs3 modules / processes

For those who have come across one of my earlier posts a few months back:

Experiencing slow boot up times during an update from ESX 3.5 to ESX/ESXi 4.0 or 4.1? Check your RDMs

http://terenceluk.blogspot.com/2010/10/slow-boot-up-times-on-esxi-41-do-you.html

… this week was finally booked for planning and upgrading the rest of the hosts in the VI3 cluster for one of our clients we providing hosting service for.  Through reading my earlier post mentioned above, I realized that I did not provide enough details for the environment so let me do so now before I proceed with what was discovered through troubleshooting with VMware:

NetApp Model: FAS2020

Data ONTAP: Release 7.2.6.1

ESX 3.5 Servers (old hosts): HP ProLiant DL380 G5 w/LPe11000 FC HBA

ESX 3.5 Version: Build 3.5.0, 153875

New ESXi 4.1 Server (new hosts) as mentioned in previous post: HP ProLiant DL360 G6 w/Brocade-425/825 FC HBA

ESXi 4.1 Version: Build 4.1.0, 260247

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Since we’ve designed the cluster as N+1, I was able to  VMotion all of the VMs off of one of the hosts so that I was able to take it down for the upgrade but I noticed that after upgrading the host to ESXi 4.0 Build: 260247, it would seemingly get stuck at this screen for upwards to 5 minutes:

vmw_vaaip_netapp loaded successfully.

clip_image001

Since I’ve encountered a similar problem a few months back when we deployed the new server mentioned in my earlier post on the new HP server, where there was a known issue with MSCS RDMs that causes the boot process to take an extremely long, I went ahead and changed the CRTimeoutDuringBoot to 10,000 (10 seconds) as I did with the newer server but it still took a long time to boot.  I then proceed to try changing it to 5000 then to 5 yet it still took a very long time to boot which lead me to believe that the change for that setting did speed up the process a bit (not sure because I didn’t time it) but it may be stuck at some other process that we’re not aware about because when I stood around to monitor the screens, I noticed that it would stay at the following one for awhile as well:

vmkibft loaded successfully.

clip_image001[4]

… then here for awhile as well:

vmfs3 loaded successfully.

clip_image001[6]

After thinking about this for a bit, I suspected that the screens where the boot up process seemed to be stuck on may not be accurate because ESXi may have completed the process shown on the screen but stuck on the following one.  What I’ve also noticed was that if I unplugged the FC cables from the server, the boot up process would be extremely fast (what we’re normally used to).  Since I felt that I wasn’t going to get anywhere with this, I went ahead and called VMware.

So what did we find?  Based on the logs during the boot up (ESXi only stores 1 and overwrites it after every reboot) the server is taking approximately 12 to 14 minutes to boot.  Further reviewing the entries in the logs lead us to 2 issues we thought were causing the slow boot up times:

  1. Timeouts during reads on the RDM LUNs.
  2. “Read capacity” errors for the RDM LUNs.

Let me separate the findings for these 2 issues:

Timeouts during reads on the RDM LUNs

As per the following article: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1016106, we tried different values for the Scsi.CRTimeoutDuringBoot value and while we suspect it did help the boot times, the best we were able to do was still 9 minutes with the value set to “1”.  Note that there was no Scsi.UWConflictRetries to configure because this only existed in ESXi 4.0 and that it was supposed to be fixed in 4.1.  Seeing how this wasn’t the cause, we looked at the logs line by line and it we could see that the processes we saw in the screenshots took 1 or 2 minutes to complete and there were no significant gaps between the line entries that suggest there was a specific process in which the boot up process was stuck on. 

Jan 25 20:18:57 vmkernel: 0:00:00:34.957 cpu7:4682)VMKAPICore: 3786: Loading Module vmw_vaaip_netapp

Jan 25 20:18:57 vmkernel: 0:00:00:34.972 cpu7:4684)vmw_vaaip_netapp loaded successfully.

Notice how there was no gap in time lapse at all for this module?

The VMware engineer essentially said: “It just looks like there’s a lot happening during the boot up but the server isn’t stuck on anything.”  Errors that we saw such as these:

Jan 25 20:18:57 vmkernel: 0:00:00:35.112 cpu6:4682)WARNING: ScsiDeviceIO: 5172: READ CAPACITY on device "naa.60a98000503354706b4a5a446b657147" from Plugin "NMP" failed. I/O error

… were expected because they were RDMs and ESXi was skipping them after a few retires.  As the VMware storage engineer plowed through numerous engineering notes, he believes that it's just taking a long time with the retries and because we have multiple RDMs, the boot process will take long.  All the information he could find points to the how this may be fixed in a future release even though there is no set date.

From here on, I felt that we were on a dead end so I asked the engineer about another problem that we had before which I blogged about a few months back and that was the read capacity errors on RDMs: http://terenceluk.blogspot.com/2010/10/problem-when-upgrading-from-esx-35-to.html.  What I basically wanted to ask was whether he could confirm that information I had received from another engineer a few months back that when I proceed with the upgrade to 4.1 and release the lock on the LUNs, all the host would see it.  With this question, we ended up with the next issue.

“Read capacity” errors for the RDM LUNs

The engineer had a look at the problem and then said he’s not entirely convinced that this won’t be a problem if we follow the procedures in 4.1.  He says that we can’t just assume that all the 4.1 hosts would read the LUNs properly if another 4.1 host ends up locking it.  By this he means this:

We shutdown the MSCS VMs

Shutdown the ESX 3.5 servers

Rescan the RDMs on all 4.1 hosts

We see the capacity for each LUN

Boot up the active MSCS node

Boot up the passive MSCS node

… but what happens if we ever suffer a host failure or have to reboot a host?  Will rebooted host be able to read the LUN capacity?  Nobody really knows and the only way to know for sure is to test it (I will update this post when I get the chance to).

Further digging on his internal database shows that there was a workaround for cases similar and that was to set the path to “fixed” which it already was and then execute the following command:

esxcli storage core device setconfig --perennially-reserved yes

Unfortunately, this command did not appear to exist in 4.1 because the closest command that existed was esxcli corestorage but executing that with the parameters did not appear to be a valid command.  I also went ahead and also tried logging onto an ESXi 4.0 Update 2 server to test this out and the result ended up being the same as 4.1.

Another note mentioned in his internal knowledge base showed that a fix may have been put into the next version but again, there is no official release date or even a name for it at this point.  More research eventually lead us to the NetApp FAS and whether 4K block size would cause this.  I’m not much of a NetApp expert but I do recall 4K is the default block size.  Yet another dead end.

One of the other attempts the engineer thought of was VAAI.  Since this was a new feature in 4.1, perhaps we could try disabling it with the following commands:

~ # vsish -e set /config/DataMover/intOpts/HardwareAcceleratedInit 0

~ # vsish -e set /config/DataMover/intOpts/HardwareAcceleratedMove 0

Unfortunately, executing these commands and rebooting the host didn’t speed up the process or allow us to read the capacity of the RDM LUNs.  This then lead the engineer to review the compatibility guide:

http://www.vmware.com/resources/compatibility/search.php

image

We searched for NetApp:

image

… then reviewed the notes for the FAS2000 series:

image

Hovering over the notes show the following:

See KB 101073 for information about VMW_PSP_RR(RoundRobin)

image

Data ONTAP versions 7.3, 7.3.1, 7.3.1.1L1, & 7.3.2 are all compatible, and therefore supported by this ESX release.

image

Data ONTAP 8.0.1 supports only the FAS3040 & FAS3070 models within the FAS3000 Series.

image

What’s obvious is that the NetApp Data ONTAP for our filer needs to be upgraded but what’s strange is that 7.2 was not listed as supported in ESX 3.5 Update 2 yet our MSCS is working fine for 2 years:

image

At this point, we felt that we’ve exhausted all of our options and it was best to try contacting NetApp so they can verify the notes they have internally as the information published by VMware is what was given to them since validation is done by the storage vendor.

So what are my plans?  We’re lucky to have more than N+1 capacity with the new server so I’m going to separate the nodes into 2 clusters preserving the legacy 3.5 servers for the time being as I do some more tests prior to moving the SQL clusters onto new ESXi 4.1 servers.

Update

Since the server I was working with had no VMs on it, I took the opportunity to reinstall the server with ESXi 4.0 Update 2 to test out whether:

  1. It booted up faster
  2. I was able to read the LUNs

The result was pretty much the same with one exception and that was I could drastically improve the boot up speed to less than 5 minutes by changing the advanced settings variable Scsi.UWConflictRetries from 1000 (default) to 80 as per the KB: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1016106 I mentioned above which was extremely tolerable:

image

One of the items I should have included in this post was the BIOS version so I took a screenshot of it today:

image

I also took a photo of the boot up screen to show a bit more of the details for the BIOS:

image

As shown in the screenshot above, the BIOS is extremely old and since this server has been happily chugging along for 2 years, I’m not surprised with what I saw.  I did talk to the manager about whether we should try updating the firmware but after our discussion, we believe that the issue is most likely the NetApp so we’ll defer the update to a later time to get this host back up. 

As for whether there was any difference for the inability to read the LUN capacity, the results were exactly the same where the logs would throw the I/O errors for Read Capacity and VI Client would show the LUNs as 0.00 B:

image

I will write a separate post about what I saw during the slow boot up with ESXi 4.0 Update 2 in another post so stay tuned.

2 comments:

MattM said...

I am curious Terence, have you experienced any issues with transmission pauses on your Nexus switches when connected to a C250? We have been getting in the millions of pauses and Cisco TAC is jerking us around on the phone. Any suggestions on what to look at? We have updated the QLogic drivers, dropped que depth to 128 from the default 65000, updated firmware on the Nexus 5010s, nothing...

sakebombers said...

Terence,

I also experience similar problems when I tried to upgrade my ESXi host from 4.0 to 4.1. Very slow boot up, to a point that I felt there was something wrong with the upgrade. Also a LUN that was previously attached would not be recognized. The paths to the LUN via the storage adapter was visible, but the ability read/write to the storage was not there. First thought it was my SANs that was having the issue, but with the similarity you had compared to mine it sounds like a VMware issue. Any new update since you last went back to 4.0?

Chris