Saturday, November 6, 2010

VMware ESXi 4.1.0 stuck at “Initializing scheduler …” screen on boot up with Cisco UCS C210 M2 servers

Update March 9, 2011:

It looks like we finally have a firmware fix from Cisco: http://terenceluk.blogspot.com/2011/03/firmware-update-fix-for-vmware-esxi-41.html

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

I have been working in two separate virtualization environments with new Cisco UCS C210 M2 servers, and in both I intermittently saw the VMware ESXi 4.1.0 Build 260247 boot process hang at the “Initializing scheduler …” screen, as shown in the following photo I took with my phone:

image

What made this difficult was that the hang is intermittent and not easy to reproduce simply by rebooting the server. I spent some time searching the internet to see whether anyone else had run into it and found a post from a user with an HP server experiencing a similar problem, but nothing on UCS. Realizing that I wasn’t going to get much further on my own, I posted a question on the Cisco Support Community forums and almost immediately received a reply from another user who said they had the same problem. Shortly thereafter, another user posted a reply that I’ll copy and paste here:

Hi Terence / Clint,

Cisco is aware of this problem and has been working with VMware to address this issue - as this issue has also been seen on other vendor's servers. Additionally, it only seems to be affecting platforms running ESXi.

On the Cisco side, this issue is being tracked via CSCtj19224 - ESXi stuck at Initializing Scheduler - (CCO Account Required to view details)

The workaround is to disable legacy USB within the BIOS.

Hope that helps.

Thanks,
Michael

Link to the post: https://supportforums.cisco.com/thread/2050592?tstart=0

So as it turns out, this isn’t specific to Cisco UCS, which was no surprise since the similar post I found involved an HP server.

In case anyone out there is wondering what the process of disabling legacy USB on a Cisco C210 M2 server looks like, the following are screenshots I took on one of the servers:

image

image

image

image

image

Select Legacy USB Support.

image

Set it to Disabled and press F10 to save:

image

Update November 7, 2010:

I followed up by asking whether this affects 4.0 as well and got the following response today:

In response to Terence, from what I can tell, it seems to mainly affect ESXi 4.1 - based on the number of cases I could find, however I have seen a couple of cases on ESXi 4.0 as well.

Update November 16, 2010:

I got another response on the forums and it looks like this affects ESXi 4.0 as well:

I ran into this tonight while doing a 1000v upgrade. UCS B250 M1 blades with ESXi 4.0 (Build 236512). Disabling the Legacy USB resolved the issue.

14 comments:

Unknown said...

You can also include the Cisco UCS C250 M2 in the list of affected servers.
Thanks for this post; I was resetting the CMOS every time I rebooted the servers.
Regards

Anonymous said...

Add the IBM x3650 M3 to the list as well. Unplugging the USB->PS2 adapter solved the issue.

Peter Cronwright said...

Looks like this is fixed now in 1.4(1m)

"ESXi boot up no longer intermittently hangs at the initializing scheduler. (CSCtj19224)"

Terence Luk said...

Awesome! Thanks for the heads up Peter.

Paul B said...

Just upgraded our test standalone UCS to 1.4m this afternoon and it still hung on Initialising Scheduler after installing ESXi 4.1. Disabling legacy USB in the BIOS seems to have resolved the issue, but it is difficult to be definite at this point because the issue is intermittent.

Anonymous said...

I am also in the same situation:
VMware ESXi 4.1.0 on HP Proliant DL385 G7 hangs at loading VMKernel.
It says "VMKernel Loaded successfully" but hangs after five bars.

Anonymous said...

Saw this on an IBM BladeCenter model 7870 (HS22) where USB and media trays are shared with the BladeCenter chassis. Was gonna try disabling legacy USB, but a coworker suggested moving the "M/T" assignment to another blade (effectively moving USB and media to another blade). ESXi immediately continued its bootup once we did that.

Rob Rech said...

Apparently this also affects vSphere 5.1 with the R210 too. This setting resolved the issue for me.

chrisgriner said...

This issue is also seen with 5.1.0 on IBM System x3650 M3 (7945AC1).

I tried disabling legacy USB, but the only thing that worked was unplugging the ps2-2-usb adapter for my KVM.

Unknown said...

We have this same issue on our brand new x3650 M4 7915 server and cannot seem to resolve it with IBM. I tried legacy boot-mode and UEFI modes with no success. Can you help me resolve it?

Unknown said...

See latest update on my issue: https://communities.vmware.com/message/2461204#2461204

VM-Ware said...

I've installed the licensed VMware ESXi 4.1 and, most of the time, it's working perfectly. Randomly, however, I lose connectivity to the virtual machine that has the SAP application installed on it. During this timeout period, the application gets stuck at the client end.

General Server Details:

HP DL380-G5 Proliant
RAID level: 0 + 5

Separate VLAN for management

This, to me, indicates that the issue isn't with networking outside of the ESX host, but rather within the virtual machine or the virtual switch. I've moved the VM to another ESXi host but the problem persists.

Another curious sign is the ping latency from the Local Traffic Manager out to a VM node (same ESXi host):

PING 172.16.xxx.xxx (172.16.xxx.xxx) 56(84) bytes of data.
64 bytes from 172.16.xxx.xxx: icmp_seq=1 ttl=128 time=7.25 ms
64 bytes from 172.16.xxx.xxx: icmp_seq=2 ttl=128 time=9.26 ms
64 bytes from 172.16.xxx.xxx: icmp_seq=3 ttl=128 time=10.2 ms
64 bytes from 172.16.xxx.xxx: icmp_seq=4 ttl=128 time=10.2 ms
64 bytes from 172.16.xxx.xxx: icmp_seq=5 ttl=128 time=9.12 ms
64 bytes from 172.16.xxx.xxx: icmp_seq=6 ttl=128 time=10.3 ms

--- 172.16.xxx.xxx ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5035ms

rtt min/avg/max/mdev = 7.252/9.421/10.319/1.091 ms

@AndrewPWR:

1. Nothing logged to any of the /var/log files that would be of any help.

2. Performance graphs don't indicate that I'm hitting any sort of ceiling.

3. Outages last for 1 - 2 minutes, then traffic resumes on its own.
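
For anyone trying to catch outages like the ones described above, here is a minimal Python sketch that parses Linux-style ping output like the sample quoted earlier and reports latency figures plus any gaps in the icmp_seq numbering. The function name summarize_ping and the returned dictionary keys are hypothetical, and the sketch assumes the reply-line format shown in the sample; it is not a tool used by anyone in this thread.

```python
import re
import statistics

# Sample copied from the ping output quoted above (addresses are masked in the original).
SAMPLE = """\
PING 172.16.xxx.xxx (172.16.xxx.xxx) 56(84) bytes of data.
64 bytes from 172.16.xxx.xxx: icmp_seq=1 ttl=128 time=7.25 ms
64 bytes from 172.16.xxx.xxx: icmp_seq=2 ttl=128 time=9.26 ms
64 bytes from 172.16.xxx.xxx: icmp_seq=3 ttl=128 time=10.2 ms
64 bytes from 172.16.xxx.xxx: icmp_seq=4 ttl=128 time=10.2 ms
64 bytes from 172.16.xxx.xxx: icmp_seq=5 ttl=128 time=9.12 ms
64 bytes from 172.16.xxx.xxx: icmp_seq=6 ttl=128 time=10.3 ms
"""

# Matches the sequence number and round-trip time on each reply line.
REPLY = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")


def summarize_ping(output):
    """Return min/avg/max latency and the sequence numbers that never got a reply.

    Hypothetical helper: gaps are only detected up to the highest sequence
    number that did receive a reply.
    """
    replies = {int(seq): float(ms) for seq, ms in REPLY.findall(output)}
    if not replies:
        raise ValueError("no ping replies found")
    times = list(replies.values())
    missing = [s for s in range(1, max(replies) + 1) if s not in replies]
    return {
        "min": min(times),
        "avg": statistics.mean(times),
        "max": max(times),
        "missing": missing,
    }


if __name__ == "__main__":
    print(summarize_ping(SAMPLE))
```

Running this against a long capture taken during an outage would list the dropped sequence numbers in "missing", which, together with the ping interval, gives a rough idea of how long traffic stopped.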



After trying different methodologies and configurations and using different network latency test tools, we finally found, with the help of Mr. Marc (Sr. Infrastructure Specialist) @ SDN Singapore, that the bug is in the VMXNET 3 driver. All the reports and statistics were forwarded to the VM support center, and after one week they resolved the bug by releasing a driver patch. Details are below.
Name: ESXi410-201404001
Ver: 4.1.0 Patch 12
Release: 2015-04-20
Build: 1682698
I will do my best in the future to identify these types of bugs, which will help us and others run all our live applications flawlessly.
We are also trying to upgrade and migrate to the latest versions.
