Monday, July 19, 2010

Problems when updating Client's UCS Firmware

One of the emails I sent out after completing my first firmware update:

Some other notes worth mentioning during the firmware update:

Updating Passive Fabric

I followed Cisco's step-by-step instructions for updating the firmware: http://www.cisco.com/en/US/docs/unified_computing/ucs/sw/upgrading/from1.1.1/to1.2.1/UpgradingCiscoUCSFrom1.1.1To1.2.1_chapter4.html

While performing the following step:

Activating the Fabric Interconnect Firmware for a Cluster Configuration

Activating the Firmware on a Subordinate Fabric Interconnect to Release 1.2(1)

Once I brought the firmware version from 1.1 to 1.2, Fabric B (passive) threw an IOM 1 error on Chassis 2. When I navigated to the “High availability” status of Fabric B, the Ready value was No but the State was Up. The description of the problem was: chassis configuration incomplete. When I viewed the properties of IOM 1 on Chassis 2, the Faults tab indicated that the module had been removed. I checked the status of the failed IOM and noticed that all the servers were in the failed state. I confirmed that all 4 servers in Chassis 2 were offline, as I was not able to open a KVM session to, or ping the service console IP of, any of the 4 ESX servers.

The document basically states the following:

Step 9

Verify the high availability status of the subordinate fabric interconnect.

If the High Availability Status area for the fabric interconnect does not show the following values, contact Cisco Technical Support immediately. Do not continue to update the primary fabric interconnect.

Field Name: Required Value

Ready field: Yes

State field: Up

I was a bit worried that I would have to call Cisco that night but, as it turned out, after 5 to 10 minutes or so the missing IOM came back and the Ready field changed to Yes on Fabric B.
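For anyone who would rather script that wait than keep refreshing the page, here is a minimal Python sketch of the Step 9 check with a grace period. It does not talk to UCSM at all: `get_status` is a hypothetical callable you would have to supply yourself (however you read the Ready and State values), and the 15-minute timeout and 30-second interval are my own assumptions based on the 5 to 10 minutes I saw, not Cisco guidance.

```python
import time
from typing import Callable, Dict

# Required values from Step 9 of the upgrade document.
REQUIRED = {"ready": "Yes", "state": "Up"}


def ha_is_healthy(status: Dict[str, str]) -> bool:
    """Return True only if both the Ready and State fields match Step 9."""
    return all(status.get(key) == value for key, value in REQUIRED.items())


def wait_for_ha(
    get_status: Callable[[], Dict[str, str]],  # hypothetical: you supply this
    timeout_s: float = 900.0,   # assumption: 15 minutes of grace
    interval_s: float = 30.0,   # assumption: re-check every 30 seconds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the HA status is healthy or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if ha_is_healthy(get_status()):
            return True
        if clock() >= deadline:
            return False  # still not Ready: time to call Cisco support
        sleep(interval_s)
```

If `wait_for_ha` returns False, the document's advice still applies: do not continue to the primary fabric interconnect.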

Another note I’d like to make is that updating the fabric takes a lot of time. Don’t sit on the Firmware Activation page watching the status show Activating; the Fabric’s properties page shows the progress as a percentage.

Updating Active Fabric

Five minutes into activating the active fabric, I got kicked out of UCSM. I was able to reconnect to UCSM via the passive Fabric, but upon connecting, Chassis 1 and 2 and Fabric A and B were all highlighted in red, meaning they had faults. After about 5 minutes they started turning orange and then yellow, indicating that they were slowly getting back to better health. While the status of Chassis 1 and FI A was still yellow/orange, I tried to ping the service console of one of the ESX blades on Fabric A and did not get a reply. I did get a reply when I pinged the blades on Fabric B, though.
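The ping test I did by hand could be scripted along these lines. This is only a sketch: the fabric names and addresses in the usage note are made up (documentation-range IPs), the grouping of blades by fabric is something you would have to fill in yourself, and `ping -c 1` assumes a Linux or macOS style ping (Windows uses `-n 1`).

```python
import subprocess
from typing import Callable, Dict, List


def ping_once(host: str) -> bool:
    """Send a single ICMP echo request; True if the host replied."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],  # Linux/macOS syntax (assumption)
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,  # give up if ping itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def check_hosts(
    hosts_by_fabric: Dict[str, List[str]],
    probe: Callable[[str], bool] = ping_once,
) -> Dict[str, Dict[str, bool]]:
    """Probe every host and report reachability grouped by fabric."""
    return {
        fabric: {host: probe(host) for host in hosts}
        for fabric, hosts in hosts_by_fabric.items()
    }
```

Calling `check_hosts({"A": ["192.0.2.11"], "B": ["192.0.2.21"]})` every minute or so during the activation would show which service consoles stop and resume answering.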

I guess it’s safe to say that as long as the active fabric is getting updated, the servers will be disrupted:

[Screenshot]

Once the activation was complete, with the activation status showing Ready:

[Screenshot]

… I ran into the same situation as when updating Fabric B: UCSM displayed an error indicating that IOM 1 on Chassis 1 was missing:

[Screenshots]

What was interesting this time, even though it does make sense since Fabric A is the primary, is that an additional IOM, IOM 2 on Chassis 2, was also reported as failed/missing:

[Screenshot]

The status gradually switched from red to orange, then to yellow, and finally to green. When the update for Fabric A finally completed, I noticed that it was now the subordinate:

[Screenshot]

The whole exercise of updating the firmware took more than an hour and a half to complete, so remember to allow enough time for these updates in the future.
