I took a bit of time to think about what title I should give this post and while I’m sure no one is going to Google/Bing “Ops”, I figured someone might type in “deleted service profiles” and may end up finding this post. So without further delay, let’s dive into the problem and resolution.
Problem
In case you’re wondering, no, I didn’t delete any active / running Cisco UCS B Series servers’ Service Profiles. I was, however, asked to look into a problem where I eventually found that someone else did. What was interesting about the troubleshooting process was that I was told the following:
“There’s a problem with all of the ESX and ESXi hosts on the blades.”
“It looks like the network is down.”
“The only server that didn’t have a problem was the Windows 2008 server.”
The warnings and faults I managed to find in the logs didn’t really tell me that the service profile associated with the blade was no longer there. The following was essentially what I saw when my colleague asked me to look at it:
Ok, so there are errors and faults with the components and mismatching firmware versions.
Let’s try reverting / updating the firmware to match the other components. Result: problem still exists.
Let’s try resetting the connectivity of the NICs:
Result: problem still exists.
Ok, let’s try shutting down the server and re-acknowledging it:
Result: problem still exists.
What sort of caught my eye during the Re-acknowledge process was this:
Notice how the Service Profile field is blank?
Interestingly, if I left the blade shut down, I saw the following:
The Service Profile field’s blank! What’s going on?
This was when I realized someone had deleted 3 service profiles that belonged to blades hosting ESX and ESXi. The service profile for the Windows Server was still there and hence was humming along.
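In hindsight, the telltale sign was a blade that was up and running but showed an empty Service Profile field. A minimal sketch of that check in Python, assuming the blade inventory has been copied into plain dictionaries; the field names `dn`, `power` and `service_profile` and the sample values are hypothetical and are not UCS Manager's own schema:

```python
def blades_missing_profile(blades):
    """Return the DNs of blades that are powered on but list no service profile."""
    return [
        blade["dn"]
        for blade in blades
        if blade["power"] == "on" and not blade["service_profile"].strip()
    ]


# Hypothetical inventory, typed in by hand for illustration only.
blades = [
    {"dn": "sys/chassis-1/blade-1", "power": "on", "service_profile": "win2008"},
    {"dn": "sys/chassis-1/blade-4", "power": "on", "service_profile": ""},
    {"dn": "sys/chassis-1/blade-5", "power": "off", "service_profile": ""},
]

suspects = blades_missing_profile(blades)
print(suspects)
```

A blade that is powered off with no profile is ignored here, since that is what an unused blade normally looks like.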
So I created a new service profile that matched the deleted one and tried to assign it.
Not good because as you can see, it failed with:
Description: Service profile <profilename> configuration failed due to compute-unavailable,insufficient-resources
ID: 6241031
Cause: configuration-failure
Code: F0327
Original severity: major
Definitely not good. Can I still boot up this server with an empty service profile? Hitting the Reset option I get the following:
Error performing reset
Parsing Exception at line : 1 invalid XML character: 23
Digging deeper into the minor faults shows:
Description: Assignment of service profile <profilename> to server sys/chassis-1/blade-4 failed
ID: 6241036
Cause: assignment-failed
Code: F0689
Original severity: minor
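To compare the two faults side by side, the text copied from the fault details can be parsed into dictionaries. This is only a rough sketch: the helper name `parse_fault` is made up for this example, and the input is the pasted text shown above, not a response from any UCS API.

```python
def parse_fault(text):
    """Turn 'Key: value' lines into a dict, splitting on the first colon only."""
    fault = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon at all
            fault[key.strip()] = value.strip()
    return fault


# The two faults exactly as they appear in this post.
raw_faults = [
    """Description: Service profile <profilename> configuration failed due to compute-unavailable,insufficient-resources
ID: 6241031
Cause: configuration-failure
Code: F0327
Original severity: major""",
    """Description: Assignment of service profile <profilename> to server sys/chassis-1/blade-4 failed
ID: 6241036
Cause: assignment-failed
Code: F0689
Original severity: minor""",
]

faults = [parse_fault(text) for text in raw_faults]
for fault in faults:
    print(fault["Code"], fault["Original severity"], fault["Cause"])
```

Read together, the major fault (F0327) says the profile could not be configured because the compute resource was unavailable, and the minor fault (F0689) names the blade the assignment failed on.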
Ok, so now I’m stuck with a server that I shut down and can no longer boot up. I’d have to say I started feeling a bit nervous, even though I did ask which non-critical server I could test on and this was the one I was given. Note that I was able to boot it up, but because I’m writing this post 3 weeks later, I can’t remember how I did it. What I can remember is that it wasn’t too difficult and it was just a small workaround with the controls available in UCS Manager.
Solution
Now that I finally understood what the problem was and knowing there probably wasn’t a Cisco document that tells you what to do, I sat down and thought about this logically. The bright idea (or at least I thought so) I had was that since:
- The server fails to associate with a new profile.
- The server can still be booted up with the old profile that’s no longer there.
- UCS Manager maintains information about blades that are added to the chassis.
Perhaps the only way to remove this ghost profile is to decommission the server? I remembered reading in the Cisco documentation that the decommission option allows you to remove a blade from the configuration. Before asking for permission to do this, I did a quick search to confirm what I remembered:
Decommissioning a Server
This procedure removes the server from the configuration. As long as the server physically remains in the Cisco UCS instance, Cisco UCS Manager considers the server to be decommissioned and ignores it.
Ok, now that I’d confirmed it, I asked for permission from the manager who owns these blades and who had been getting a lot of heat to get them up and going. The response was: “Just make it happen and get it back up!”
So I proceeded with this theory and went ahead to do the following:
Server Maintenance –> Decommission
I made sure there were no errors during the decommissioning process, then chose Resolve Slot Issue to re-acknowledge the server, which puts it back into the database:
No errors so far:
Discovery has started:
Overall Status is now unassociated:
Let’s now try to associate this blade to a service profile:
Looking good:
Success!
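The order of the steps matters, so here is a toy model of the sequence in Python. The state and action names are labels invented for this sketch, not UCS Manager terms or API calls, and the table only encodes what was observed in this post: associating a profile failed while the blade was stuck, and worked once the blade had been decommissioned and rediscovered.

```python
# Allowed (state, action) -> next state, based only on what this post observed.
TRANSITIONS = {
    ("stuck", "decommission"): "decommissioned",
    ("decommissioned", "resolve_slot_issue"): "discovering",
    ("discovering", "discovery_complete"): "unassociated",
    ("unassociated", "associate_profile"): "associated",
}


def run(start, actions):
    """Apply the actions in order and return the final state."""
    state = start
    for action in actions:
        key = (state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} while {state}")
        state = TRANSITIONS[key]
    return state


recovery = [
    "decommission",
    "resolve_slot_issue",
    "discovery_complete",
    "associate_profile",
]
final_state = run("stuck", recovery)
print(final_state)
```

Calling `run("stuck", ["associate_profile"])` raises a `ValueError` in this model, which mirrors the failed assignment earlier in the post.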
From there on, I repeated the same procedure for the other 2 servers and got the blades back up. Hope this helps anyone out there that may encounter this problem.
5 comments:
You could unbind the profile from the template - for the server which is associated with this profile.
Disassociate the Service Profile from the server. Now it is not associated and the server is freed up as well. Once the tasks are complete in FSM, associate the profile back to the appropriate server and then bind the profile back to the template you want.
This is helpful when you have vNICs with VLANs and vHBAs configured, so you don't have to redo any zoning on your FC switch side, etc. Hope it helps someone in a similar situation.
I followed these steps to remove the ghost profile, but no luck. I can still see the ghost profiles, which are not visible in UCS Central. Any advice appreciated.
I encountered something similar too - a ghosted service profile. Luckily, I still have the choice to erase and start from scratch (my entire setup), or open a TAC case and see if they can assist me.
Follow-up: I did the same for a ghosted blank profile, which caused a similar mismatch (vif-mismatch). Decommissioning, resolving and re-acknowledging fixed my issue!
Thank you!
another good one Terence, cheers.