Sunday, December 12, 2010

Fixing a Cisco UCS B-Series RAID 1 mirror disk when its status is shown as “Unknown” in UCS Manager

A few months ago, a friend asked me about a strange problem he had run into with Cisco’s UCS B-Series servers: when navigating to a server’s Inventory –> Storage tab, he would see the status of one of the blade’s local disks shown as unknown:

Disk 2

ID: 2

Product Name:

Vendor:

Revision: 0

PID:

Serial Number:

Block Size (Bytes): unknown

Number of Blocks: unknown

Size (MB): unknown

States

Operability: N/A

Presence: equipped

image

The server was otherwise running fine, and the event logs in UCSM did not show any errors.

Unfortunately, I wasn’t able to help troubleshoot directly, since it wouldn’t have been ethical for him to give me access, so I told him I’d try to reproduce the problem in our test environment when I got the chance. Since I had been spending some time setting up a NetApp in our datacenter and had 2 test blades with no critical applications on them, I thought I’d try to reproduce it there.
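The symptom in the listing above can be written down as a simple check. This is only a minimal sketch: the field names and the Disk 2 values are copied from the Inventory –> Storage listing, while the dictionary layout, the function name `inventory_is_unknown`, and the "healthy" comparison disk are assumptions made up for illustration.

```python
# Field names and the Disk 2 values come from the Inventory -> Storage listing;
# the dict layout and the healthy example are hypothetical.
SIZE_FIELDS = ("Block Size (Bytes)", "Number of Blocks", "Size (MB)")


def inventory_is_unknown(disk):
    """Return True if the disk is equipped but UCSM shows no size details."""
    equipped = disk.get("Presence", "").lower() == "equipped"
    sizes_unknown = all(
        str(disk.get(field, "unknown")).lower() == "unknown"
        for field in SIZE_FIELDS
    )
    return equipped and sizes_unknown


disk2 = {
    "ID": "2",
    "Revision": "0",
    "Block Size (Bytes)": "unknown",
    "Number of Blocks": "unknown",
    "Size (MB)": "unknown",
    "Operability": "N/A",
    "Presence": "equipped",
}

# Hypothetical healthy disk; the numbers are invented purely for contrast.
healthy = {
    **disk2,
    "Block Size (Bytes)": "512",
    "Number of Blocks": "1000",
    "Size (MB)": "1",
    "Operability": "operable",
}

print(inventory_is_unknown(disk2))    # True
print(inventory_is_unknown(healthy))  # False
```

The point of checking Presence together with the size fields is that an empty slot would also have no details; the odd case here is a disk that is equipped yet still reports nothing.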

Testing Procedure

First off, our UCS chassis, FIs, blades, etc. were all on firmware version 1.3(1n):

clip_image002

I had been thinking for a few weeks about how I could recreate this error, and what I came up with was to break the RAID 1 mirror by pulling out the 2nd drive to see if I would end up with the same error. After pulling the drive and then inserting it back in, I noticed that the 2nd drive’s LED had turned yellow:

IMG01642-20101209-1450

With a visual warning now showing, I logged into UCSM to check disk 2 in the Inventory –> Storage tab of the blade server, and it did indeed show the status as “unknown”:

image

My first thought was that the RAID was now probably in a degraded state, and since UCSM didn’t provide this information, I decided to reboot the server and go into the LSI SAS controller utility to see what the status was.

Note: Unfortunately, I got tied up with a few things, so I didn’t get back to this until 5 hours later.

Restart the server:

image

If you have quiet boot turned on, you won’t see the prompt to get into the LSI SAS controller utility, so make sure you turn it off in the BIOS:

image

When you see the LSI SAS controller displayed during boot, press Ctrl-C to get into the utility:

image

What I noticed when going into RAID Properties and reviewing the RAID status was that the controller did not actually list the array as degraded:

image

***This could be because the 5-hour delay before I continued the test gave the controller time to rebuild the RAID automatically, which means I will have to run this test again to determine whether that is true.

Seeing that the RAID was in good health, I logged back into UCSM to see whether the status of the drive had changed, but nothing had:

image

Thinking logically about service profiles and how they’re applied to servers, I suspect that the hardware information is pulled from the B-Series blade when the service profile is originally applied and, based on the behavior I see here, it doesn’t look like it is refreshed periodically.

So what now? The first thing I could think of was to re-acknowledge the server, because doing so triggers a rediscovery that may pull the hardware information again.

image

As the following warning message indicates, make sure you don’t have anything critical running on this blade, as the server will be rebooted:

image

For some reason, I’ve always found it fascinating to watch the FSM tab while a task like this runs:

image

Once the re-acknowledge task completed, I navigated back to the Inventory –> Storage tab and could now see the information for Disk 2:

image
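If you would rather confirm the result from a script than from the GUI, something along these lines might work. It is a rough sketch, not something I ran as part of this test: it assumes Cisco's ucsmsdk Python package is installed and that local disks are exposed as `StorageLocalDisk` objects with `dn`, `presence` and `size` attributes, so check those names against your SDK and firmware version. The helper names `fetch_local_disks` and `stale_disks`, the address and credentials, and the sample records are all hypothetical.

```python
from types import SimpleNamespace


def stale_disks(disks):
    """Return disks that are equipped but whose size is still 'unknown'."""
    return [
        d for d in disks
        if getattr(d, "presence", "") == "equipped"
        and getattr(d, "size", "unknown") == "unknown"
    ]


def fetch_local_disks(ip, username, password):
    """Query UCSM for local disk objects (assumes ucsmsdk is installed)."""
    # Imported here so the rest of the sketch runs without the SDK.
    from ucsmsdk.ucshandle import UcsHandle

    handle = UcsHandle(ip, username, password)
    handle.login()
    try:
        # Class name assumed from the UCSM object model; verify for your release.
        return handle.query_classid("StorageLocalDisk")
    finally:
        handle.logout()


# Offline illustration with made-up records standing in for SDK objects.
before_reack = [
    SimpleNamespace(dn="disk-1", presence="equipped", size="1"),
    SimpleNamespace(dn="disk-2", presence="equipped", size="unknown"),
]
after_reack = [
    SimpleNamespace(dn="disk-1", presence="equipped", size="1"),
    SimpleNamespace(dn="disk-2", presence="equipped", size="1"),
]

print([d.dn for d in stale_disks(before_reack)])  # ['disk-2']
print([d.dn for d in stale_disks(after_reack)])   # []
```

Against a real system you would pass the result of `fetch_local_disks("ucsm.example.local", "admin", "password")` (placeholder values) to `stale_disks`, once before and once after the re-acknowledge.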

Once I get more time, I’ll try to update this post with the results of checking the RAID status in the SAS controller without a 5-hour delay after breaking the RAID.

2 comments:

Anonymous said...

Terence thanks for the Awesome Posting! I just ran into the same issue and followed your steps and got it working!

Thanks a bunch!

Grzeg said...

Thanks buddy for the great and very well documented postage! It helped me to fix the unknown disk status.