Thursday, October 14, 2010

Problem when upgrading from ESX 3.5 to ESXi 4.0 or 4.1 – Why are my RDM LUNs showing up as “0.00 B” for Capacity?

Update – February 1st, 2011:  I finally managed to get a small window to do a few tests with both ESXi 4.0 Update 2 and ESXi 4.1 to complete some tests and the following Part #2 post shows the results: http://terenceluk.blogspot.com/2011/02/part-2-problem-when-upgrading-from-esx.html

--------------------------------------------------------------------------------------------------------------------------------------------------------------------

I ran into an interesting problem while upgrading a VMware Virtual Infrastructure 3 (ESX 3.5) environment to vSphere ESXi 4.1 last week when I configured the new ESXi 4.1 to be included in the initiator group for the existing RDM LUNs that the other ESX 3.5 hosts had access to. Other than the slow boot up issue I encountered as noted in my previous post (http://terenceluk.blogspot.com/2010/10/slow-boot-up-times-on-esxi-41-do-you.html), I also noticed that all the RDM LUNs that the ESX 3.5 hosts were using had the Capacity show up as “0.00 B”.

image

As shown in the highlighted items in the screenshot above, the LUNs do not have their Capacity values showing properly and what’s common between all of these LUNs is that they are all Raw Device Mappings.

So after spending a few hours troubleshooting and being pressed for time to get a resolution, I went ahead and initiated a call to VMware support. The VMware support engineer began by reviewing the logs during ESXi’s bootup as well as a rescan operation (command: less /var/log/messages) and found that there were entries listed as:

Oct 1 19:02:52 0 minute(s) and 0 second(s)
Oct 1 19:02:52 Vpxa:
Oct 1 19:02:52 Vpxa: [2010-10-01 19:02:52.044 FFE77B10 verbose 'App'] [VpxaVMAP::Invoke] Command returned successfully
Oct 1 19:03:00 vmkernel: 0:01:06:16.227 cpu6:4102)ScsiDeviceIO: 1672: Command 0x1a to device "naa.60a98000503354706b4a50617a34354f" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Oct 1 19:03:00 vmkernel: 0:01:06:16.227 cpu1:5415)WARNING: ScsiDeviceIO: 5172: READ CAPACITY on device "naa.60a98000503354706b4a50617a34354f" from Plugin "NMP" failed. I/O error
Oct 1 19:03:01 Vpxa: [2010-10-01 19:03:01.236 1B3D5B90 warning 'App'] [VpxaHalStats] Unexpected return result. Expect 1 sample, receive 2
Oct 1 19:03:01 Vpxa: [2010-10-01 19:03:01.252 FFE77B10 verbose 'App'] [VpxaInvtHost] Increment master gen. no to (128): Event:VpxaHalEvent::CheckQueuedEvents
Oct 1 19:03:10 vmkernel: 0:01:06:26.254 cpu6:4102)ScsiDeviceIO: 1672: Command 0x25 to device "naa.60a98000503354706b4a5a446b657147" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Oct 1 19:03:10 vmkernel: 0:01:06:26.254 cpu1:5415)WARNING: ScsiCore: 1399: Invalid sense buffer: error=0x0, valid=0x0, segment=0x0, key=0x0
Oct 1 19:03:20 vmkernel: 0:01:06:36.269 cpu6:4102)ScsiDeviceIO: 1672: Command 0x1a to device "naa.60a98000503354706b4a5a446b657147" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Oct 1 19:03:20 vmkernel: 0:01:06:36.269 cpu1:5415)WARNING: ScsiDeviceIO: 5172: READ CAPACITY on device "naa.60a98000503354706b4a5a446b657147" from Plugin "NMP" failed. I/O error
Oct 1 19:03:20 vmkernel: 0:01:06:36.311 cpu1:5415)Vol3: 1604: Could not open device 'naa.600508b1001030384241333630300300:5' for probing: Permission denied
Oct 1 19:03:20 vmkernel: 0:01:06:36.312 cpu1:5415)Vol3: 644: Could not open device 'naa.600508b1001030384241333630300300:5' for volume open: Permission denied
Oct 1 19:03:20 vmkernel: 0:01:06:36.313 cpu1:5415)Vol3: 1604: Could not open device 'naa.600508b1001030384241333630300300:6' for probing: Permission denied
Oct 1 19:03:20 vmkernel: 0:01:06:36.314 cpu1:5415)Vol3: 644: Could not open device 'naa.600508b1001030384241333630300300:6' for volume open: Permission denied
Oct 1 19:03:20 vmkernel: 0:01:06:36.315 cpu1:5415)Vol3: 1604: Could not open device 'naa.600508b1001030384241333630300300:8' for probing: Permission denied
Oct 1 19:03:20 vmkernel: 0:01:06:36.316 cpu1:5415)Vol3: 644: Could not open device 'naa.600508b1001030384241333630300300:8' for volume open: Permission denied
Oct 1 19:03:20 Hostd: [2010-10-01 19:03:20.360 500CBB90 verbose 'Hostsvc::DatastoreSystem'] VmfsUpdate: got VMFS message [N11HostdCommon18VmkernelUpdateVmfsE:0x4fde19d0]

Through breaking down the log entry, we can see the following:

1. The log entry below shows that the vmkernel issued a 0x1a command to the device which essentially senses the mode of the LUN (http://en.wikipedia.org/wiki/SCSI_command) but it fails with the code 0x5.

Oct 1 19:03:00 vmkernel: 0:01:06:16.227 cpu6:4102)ScsiDeviceIO: 1672: Command 0x1a to device "naa.60a98000503354706b4a50617a34354f" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2. The same goes for the line:

Oct 1 19:03:00 vmkernel: 0:01:06:16.227 cpu1:5415)WARNING: ScsiDeviceIO: 5172: READ CAPACITY on device "naa.60a98000503354706b4a50617a34354f" from Plugin "NMP" failed. I/O error
Oct 1 19:03:01 Vpxa: [2010-10-01 19:03:01.236 1B3D5B90 warning 'App'] [VpxaHalStats] Unexpected return result. Expect 1 sample, receive 2
Oct 1 19:03:01 Vpxa: [2010-10-01 19:03:01.252 FFE77B10 verbose 'App'] [VpxaInvtHost] Increment master gen. no to (128): Event:VpxaHalEvent::CheckQueuedEvents
Oct 1 19:03:10 vmkernel: 0:01:06:26.254 cpu6:4102)ScsiDeviceIO: 1672: Command 0x25 to device "naa.60a98000503354706b4a5a446b657147" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

Read Capacity (which is our problem) correlates to 0x25 SCSI code listed in the error log.

We continued to create test LUNs on the NetApp and soon believed that the following reasons were why the new host to be unable to read the LUNs:

  1. The LUNs were all RDMs.
  2. The RDMs were all being used by a Microsoft Clustering Services (MSCS) with SQL Server 2008 clustering configured.

What the support engineer said was that she believes the “failed to read LUNs' capacity” behavior is the expected since the Microsoft cluster maintains a persistent SCSI reservation on the RDMs and the reason why the old ESX 3.5 hosts that hosts the passive VM node was able to see the capacity is because the LUNs were presented the hosts prior to setting up the MSCS/SQL clustered VMs. What she basically wanted me to test to confirm this was to schedule some time to shutdown the cluster, perform a rescan and see if the capacity gets displayed properly.

Since I didn’t want to wait till the night of the cutover when we refresh all the hosts to test this, I went ahead and requested a short 10 minute outage for a cluster that wasn’t as critical (we had 2 SQL clusters on the ESX host) to do the test and the following are the results:

Once I was able to shutdown the 2 SQL clustered nodes, I logged directly onto the ESXi 4.1 host and did a rescan on the storage adapters:

image

The rescan completed but the capacity for the LUNs still weren’t showing up:

image

Seeing how I still had 5 minutes left for the 10 minute outage window, I went ahead and restarted the ESXi 4.1 host to see if it’ll pick it up and voila! The LUNs’ capacity showed up properly after a server reboot:

image

Summary

Problem: When you perform a rescan operation on a new ESX host’s storage adapter that has been given access to RDMs that are being used on other ESX hosts for MSCS/SQL Clusters, the capacity column displays zero bytes.

Reason: The reason is because MSCS/SQL clusters persistently has SCSI reservations on the LUN and therefore causes the newly added host to get a permission denied.

Solution: Simply shutdown the MSCS/SQL cluster virtual machines that access these LUNs and reboot the new host. Once the host is rebooted, the LUNs’ capacity will show up correctly.

No comments: