Monday, September 27, 2010

How to forcefully reassign assigned disks on a NetApp Filer

A few days ago we had to do some emergency maintenance on a new NetApp shelf. Because the filer was running an older version of Data ONTAP, some of the commands we needed did not exist on it, so we ended up using a combination of the GUI and the CLI to work around the problem.

Note: Some of these commands may not be necessary, but I wanted to list every step we took to get this working, so I’ve highlighted the steps that are possibly not needed in red. Please also note that we had only a small window to work in, so the instructions below may not follow best practices.

Task: Disks in the new shelf have been assigned to 2 separate controllers (6 each).

Problem: We could not find a command to unassign the disks in this version of Data ONTAP, so we used a combination of the GUI and the CLI to remove the disks from one controller and then reassign them to the other.

NetApp Information

Filer: Some Name

Model: FAS2020

Version: 7.2.6.1

---------------------------------------------------------------------------------------------------------------------------------------------

Running disk show returns the following:

login as: root
root@172.20.1.131's password:

FAC01> disk show
DISK OWNER POOL SERIAL NUMBER
------------ ------------- ----- -------------
0c.00.11 FAC02 (135048330) Pool0 4AD7XX3Q00009925DGCC
0c.00.4 FAC02 (135048330) Pool0 4AD7XX4800009928G98U
0c.00.2 FAC02 (135048330) Pool0 4AD7XX8G00009928PPLH
0c.00.6 FAC02 (135048330) Pool0 3QP0ZC2P00009926HBXK
0c.00.0 FAC02 (135048330) Pool0 4AD7XXC900009928PP3P
0c.00.7 FAC01 (135048291) Pool0 4AD7XXEW00009928GNHA
0c.00.10 FAC01 (135048291) Pool0 4NE20JG400009926HEZS
0c.00.3 FAC01 (135048291) Pool0 4NE20EC100009928G4P0
0c.00.1 FAC01 (135048291) Pool0 3QP0WZT400009926KCBN
0c.00.5 FAC01 (135048291) Pool0 4NE20FK5000099171W7V
0c.00.8 FAC01 (135048291) Pool0 3QP0ZRF100009926HEJT
0c.00.9 FAC01 (135048291) Pool0 4AD7XXHG00009928GDQW
0b.18 FAC01 (135048291) Pool0 XX-XXXXX8038093
0b.22 FAC01 (135048291) Pool0 XX-XXXXX8095587
0b.21 FAC01 (135048291) Pool0 XX-XXXXX7977856
0b.20 FAC01 (135048291) Pool0 XX-XXXXX8039600
0b.24 FAC01 (135048291) Pool0 XX-XXXXX8039114
0b.17 FAC01 (135048291) Pool0 XX-XXXXX8015493
0b.29 FAC02 (135048330) Pool0 XX-XXXXX8095049
0b.23 FAC02 (135048330) Pool0 XX-XXXXX8039807
0b.19 FAC01 (135048291) Pool0 XX-XXXXX7977641
0b.27 FAC02 (135048330) Pool0 XX-XXXXX8039645
0b.28 FAC01 (135048291) Pool0 XX-XXXXX7978194
0b.26 FAC01 (135048291) Pool0 XX-XXXXX8039342
0b.25 FAC02 (135048330) Pool0 XX-XXXXX8039954
0b.16 FAC01 (135048291) Pool0 XX-XXXXX7977866
FAC01>

What we want to do is move disks 0b.17, 0b.19 and 0b.21 from FAC01 to FAC02.
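Before changing ownership, it helps to confirm from the disk show output which controller currently owns each target disk. The Python sketch below assumes the column layout captured above: disk name, owner, system ID in parentheses, pool, serial number. The helper names parse_disk_show and disks_owned_by are hypothetical, and SAMPLE is an abridged excerpt of the listing above.

```python
import re

# One data row of 7-Mode "disk show" as captured in this post, e.g.
#   0b.17 FAC01 (135048291) Pool0 XX-XXXXX8015493
# Columns: disk name, owner hostname, owner system ID, pool, serial number.
ROW = re.compile(r"^(\S+)\s+(\S+)\s+\((\d+)\)\s+(\S+)\s+(\S+)$")


def parse_disk_show(text):
    """Return {disk_name: {"owner", "sysid", "pool", "serial"}} for each data row.

    Header, separator and prompt lines do not match ROW and are skipped.
    """
    disks = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            name, owner, sysid, pool, serial = match.groups()
            disks[name] = {"owner": owner, "sysid": int(sysid),
                           "pool": pool, "serial": serial}
    return disks


def disks_owned_by(disks, owner, prefix=""):
    """Sorted names of disks owned by `owner`, optionally filtered by name prefix."""
    return sorted(n for n, d in disks.items()
                  if d["owner"] == owner and n.startswith(prefix))


# Abridged excerpt of the first "disk show" above (new shelf on adapter 0b).
SAMPLE = """\
DISK OWNER POOL SERIAL NUMBER
------------ ------------- ----- -------------
0b.18 FAC01 (135048291) Pool0 XX-XXXXX8038093
0b.21 FAC01 (135048291) Pool0 XX-XXXXX7977856
0b.17 FAC01 (135048291) Pool0 XX-XXXXX8015493
0b.29 FAC02 (135048330) Pool0 XX-XXXXX8095049
0b.19 FAC01 (135048291) Pool0 XX-XXXXX7977641
0b.25 FAC02 (135048330) Pool0 XX-XXXXX8039954
"""

disks = parse_disk_show(SAMPLE)
to_move = ["0b.17", "0b.19", "0b.21"]
# Sanity check before touching anything: every target must belong to FAC01.
assert all(disks[d]["owner"] == "FAC01" for d in to_move)
print(disks_owned_by(disks, "FAC02", prefix="0b."))  # ['0b.25', '0b.29']
```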

Running disk ? lists the available options:

FAC01> disk ?
usage: disk <options>
Options are:
fail [-i] [-f] <disk_name> - fail a file system disk
remove [-w] <disk_name> - remove a spare disk
swap - prepare (quiet) bus for swap
unswap - undo disk swap and resume service
scrub { start | stop } - start or stop disk scrubbing
assign {<disk_name> | all | -n <count> | auto} [-p <pool>] [-o <ownername>] [-s <sysid>] [-c block|zoned] [-f] - assign a disk to a filer or all unowned disks by specifying "all" or <count> number of unowned disks
show [-o <ownername> | -s <sysid> | -n | -v | -a] - lists disks and owners
replace {start [-f] [-m] <disk_name> <spare_disk_name>} | {stop <disk_name>} - replace a file system disk with a spare disk or stop replacing
zero spares - Zero all spare disks
checksum {<disk_name> | all} [-c block|zoned]
sanitize { start | abort | status | release } - sanitize one or more disks
maint { start | abort | status | list } - run maintenance tests on one or more disks
FAC01>

We tried the remove and replace options, but as their descriptions indicate, both expect you to be moving spare disks around. Since we were pressed for time to get the disks reassigned, we went into the GUI and took these disks offline by setting them to be removed.

After removing these 3 disks in the GUI, disk show returns the following. Note that 0b.17, 0b.19 and 0b.21 are still owned by FAC01, but their POOL column now reads FAILED:

FAC01> disk show
DISK OWNER POOL SERIAL NUMBER
------------ ------------- ----- -------------
0c.00.11 FAC02 (135048330) Pool0 4AD7XX3Q00009925DGCC
0c.00.4 FAC02 (135048330) Pool0 4AD7XX4800009928G98U
0c.00.2 FAC02 (135048330) Pool0 4AD7XX8G00009928PPLH
0c.00.6 FAC02 (135048330) Pool0 3QP0ZC2P00009926HBXK
0c.00.0 FAC02 (135048330) Pool0 4AD7XXC900009928PP3P
0c.00.7 FAC01 (135048291) Pool0 4AD7XXEW00009928GNHA
0c.00.10 FAC01 (135048291) Pool0 4NE20JG400009926HEZS
0c.00.3 FAC01 (135048291) Pool0 4NE20EC100009928G4P0
0c.00.1 FAC01 (135048291) Pool0 3QP0WZT400009926KCBN
0c.00.5 FAC01 (135048291) Pool0 4NE20FK5000099171W7V
0c.00.8 FAC01 (135048291) Pool0 3QP0ZRF100009926HEJT
0c.00.9 FAC01 (135048291) Pool0 4AD7XXHG00009928GDQW
0b.18 FAC01 (135048291) Pool0 XX-XXXXX8038093
0b.22 FAC01 (135048291) Pool0 XX-XXXXX8095587
0b.21 FAC01 (135048291) FAILED XX-XXXXX7977856
0b.20 FAC01 (135048291) Pool0 XX-XXXXX8039600
0b.24 FAC01 (135048291) Pool0 XX-XXXXX8039114
0b.17 FAC01 (135048291) FAILED XX-XXXXX8015493
0b.29 FAC02 (135048330) Pool0 XX-XXXXX8095049
0b.23 FAC02 (135048330) Pool0 XX-XXXXX8039807
0b.19 FAC01 (135048291) FAILED XX-XXXXX7977641
0b.27 FAC02 (135048330) Pool0 XX-XXXXX8039645
0b.28 FAC01 (135048291) Pool0 XX-XXXXX7978194
0b.26 FAC01 (135048291) Pool0 XX-XXXXX8039342
0b.25 FAC02 (135048330) Pool0 XX-XXXXX8039954
0b.16 FAC01 (135048291) Pool0 XX-XXXXX7977866
FAC01>
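Picking out the FAILED entries by eye is error-prone in a long listing. The minimal Python sketch below assumes the same column layout and lists every disk whose POOL column reads FAILED, so you can check that only the intended drives were failed. failed_disks is a hypothetical helper name, and SAMPLE is an abridged excerpt of the output above.

```python
def failed_disks(disk_show_text):
    """Return the sorted names of disks whose POOL column reads FAILED.

    Assumes the 7-Mode "disk show" layout captured in this post, where a data
    row splits into five fields: name, owner, (sysid), pool, serial.
    """
    failed = []
    for line in disk_show_text.splitlines():
        fields = line.split()
        # The "(sysid)" check skips the header row, which also has five fields.
        if len(fields) == 5 and fields[2].startswith("(") and fields[3] == "FAILED":
            failed.append(fields[0])
    return sorted(failed)


# Abridged excerpt of the listing above.
SAMPLE = """\
DISK OWNER POOL SERIAL NUMBER
------------ ------------- ----- -------------
0b.22 FAC01 (135048291) Pool0 XX-XXXXX8095587
0b.21 FAC01 (135048291) FAILED XX-XXXXX7977856
0b.17 FAC01 (135048291) FAILED XX-XXXXX8015493
0b.23 FAC02 (135048330) Pool0 XX-XXXXX8039807
0b.19 FAC01 (135048291) FAILED XX-XXXXX7977641
"""

print(failed_disks(SAMPLE))  # ['0b.17', '0b.19', '0b.21']
```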

Since the disk reassign command can only be run in maintenance mode, or in advanced mode during a takeover, we used the disk remove_ownership command instead. Before we could run it, we had to elevate our privilege level to advanced:

FAC01> priv set advanced
Warning: These advanced commands are potentially dangerous; use
them only when directed to do so by Network Appliance
personnel.
FAC01*>

Then we executed:

FAC01*> disk remove_ownership 0b.17
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y

FAC01*> disk remove_ownership 0b.19
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y

FAC01*> disk remove_ownership 0b.21
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y

FAC01*>

Running disk show after these commands completed returned the following:

FAC01> disk show
DISK OWNER POOL SERIAL NUMBER
------------ ------------- ----- -------------
0c.00.11 FAC02 (135048330) Pool0 4AD7XX3Q00009925DGCC
0c.00.4 FAC02 (135048330) Pool0 4AD7XX4800009928G98U
0c.00.2 FAC02 (135048330) Pool0 4AD7XX8G00009928PPLH
0c.00.6 FAC02 (135048330) Pool0 3QP0ZC2P00009926HBXK
0c.00.0 FAC02 (135048330) Pool0 4AD7XXC900009928PP3P
0c.00.7 FAC01 (135048291) Pool0 4AD7XXEW00009928GNHA
0c.00.10 FAC01 (135048291) Pool0 4NE20JG400009926HEZS
0c.00.3 FAC01 (135048291) Pool0 4NE20EC100009928G4P0
0c.00.1 FAC01 (135048291) Pool0 3QP0WZT400009926KCBN
0c.00.5 FAC01 (135048291) Pool0 4NE20FK5000099171W7V
0c.00.8 FAC01 (135048291) Pool0 3QP0ZRF100009926HEJT
0c.00.9 FAC01 (135048291) Pool0 4AD7XXHG00009928GDQW
0b.18 FAC01 (135048291) Pool0 XX-XXXXX8038093
0b.22 FAC01 (135048291) Pool0 XX-XXXXX8095587
0b.21 FAC01 (135048291) FAILED XX-XXXXX7977856
0b.20 FAC01 (135048291) Pool0 XX-XXXXX8039600
0b.24 FAC01 (135048291) Pool0 XX-XXXXX8039114
0b.17 FAC01 (135048291) FAILED XX-XXXXX8015493
0b.29 FAC02 (135048330) Pool0 XX-XXXXX8095049
0b.23 FAC02 (135048330) Pool0 XX-XXXXX8039807
0b.19 FAC01 (135048291) FAILED XX-XXXXX7977641
0b.27 FAC02 (135048330) Pool0 XX-XXXXX8039645
0b.28 FAC01 (135048291) Pool0 XX-XXXXX7978194
0b.26 FAC01 (135048291) Pool0 XX-XXXXX8039342
0b.25 FAC02 (135048330) Pool0 XX-XXXXX8039954
0b.16 FAC01 (135048291) Pool0 XX-XXXXX7977866
FAC01>

It looks as though nothing changed, and here’s why:

FAC01*> disk remove_ownership 0b.17
Note: Disks may be automatically assigned to this node, since option disk.auto_assign is on.
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y

Notice that the message says disk.auto_assign is on. For these disks to remain unassigned, we needed to turn the option off, which we did on both controllers:

FAC01*> options disk.auto_assign off
You are changing option disk.auto_assign which applies to both members of
the cluster in takeover mode.
This value must be the same in both cluster members prior to any takeover
or giveback, or that next takeover/giveback may not work correctly.
Sun Sep 26 21:25:53 EST [PHMSFAC01: reg.options.cf.change:warning]: Option disk.auto_assign changed on one cluster node.
FAC01*>

FAC02*> options disk.auto_assign off
You are changing option disk.auto_assign which applies to both members of
the cluster in takeover mode.
This value must be the same in both cluster members prior to any takeover
or giveback, or that next takeover/giveback may not work correctly.
Sun Sep 26 21:25:53 EST [PHMSFAC01: reg.options.cf.change:warning]: Option disk.auto_assign changed on one cluster node.
FAC02*>

Now, in order to reassign these disks, we had to unfail them with the command:

disk unfail <disk name>

disk unfail 0b.19

disk unfail 0b.17

disk unfail 0b.21

Here’s what the SSH session looks like:

FAC01*> disk unfail 0b.17
disk unfail: unfailing disk 0b.17...
FAC01*> Sun Sep 26 21:28:16 EST [FAC01: raid.disk.unfail.reassim:info]: Disk 0b.17 Shelf 1 Bay 1 [WDC WD1002FBYS-05ASX NA01] S/N [WD-WMATV8015493] was unfailed, and is now being reassimilated
disk unfail 0b.19
disk unfail: unfailing disk 0b.19...
FAC01*> Sun Sep 26 21:28:23 EST [FAC01: raid.disk.unfail.reassim:info]: Disk 0b.19 Shelf 1 Bay 3 [WDC WD1002FBYS-05ASX NA01] S/N [WD-WMATV7977641] was unfailed, and is now being reassimilated
disk unfail 0b.21
disk unfail: unfailing disk 0b.21...
FAC01*> Sun Sep 26 21:28:27 EST [FAC01: raid.disk.unfail.reassim:info]: Disk 0b.21 Shelf 1 Bay 5 [WDC WD1002FBYS-05ASX NA01] S/N [WD-WMATV7977856] was unfailed, and is now being reassimilated
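Each raid.disk.unfail.reassim message also reports the shelf, bay and full serial number of the disk. That makes it a handy cross-check that the right physical drives were unfailed. The Python sketch below assumes the message wording shown in this transcript, and other Data ONTAP releases may word it differently. unfailed_disks is a hypothetical helper name.

```python
import re

# Matches the raid.disk.unfail.reassim messages in this transcript. The wording
# is taken from the Data ONTAP 7.2.6.1 output above and may differ elsewhere.
UNFAIL = re.compile(
    r"raid\.disk\.unfail\.reassim:info\]: Disk (\S+) Shelf (\d+) Bay (\d+) "
    r"\[.*?\] S/N \[(\S+)\] was unfailed"
)


def unfailed_disks(console_text):
    """Map each unfailed disk name to (shelf, bay, serial number)."""
    return {m.group(1): (int(m.group(2)), int(m.group(3)), m.group(4))
            for m in UNFAIL.finditer(console_text)}


LOG = """\
FAC01*> Sun Sep 26 21:28:16 EST [FAC01: raid.disk.unfail.reassim:info]: Disk 0b.17 Shelf 1 Bay 1 [WDC WD1002FBYS-05ASX NA01] S/N [WD-WMATV8015493] was unfailed, and is now being reassimilated
FAC01*> Sun Sep 26 21:28:23 EST [FAC01: raid.disk.unfail.reassim:info]: Disk 0b.19 Shelf 1 Bay 3 [WDC WD1002FBYS-05ASX NA01] S/N [WD-WMATV7977641] was unfailed, and is now being reassimilated
FAC01*> Sun Sep 26 21:28:27 EST [FAC01: raid.disk.unfail.reassim:info]: Disk 0b.21 Shelf 1 Bay 5 [WDC WD1002FBYS-05ASX NA01] S/N [WD-WMATV7977856] was unfailed, and is now being reassimilated
"""

for disk, (shelf, bay, serial) in sorted(unfailed_disks(LOG).items()):
    print(f"{disk}: shelf {shelf}, bay {bay}, S/N {serial}")
```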

Now that we’ve unfailed the disks and turned off disk.auto_assign, we can run the remove_ownership command again:

FAC01*> disk remove_ownership 0b.17
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y
FAC01*> disk remove_ownership 0b.19
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y
FAC01*> disk remove_ownership 0b.21
Volumes must be taken offline. Are all impacted volumes offline(y/n)?? y
FAC01*> disk show
DISK OWNER POOL SERIAL NUMBER
------------ ------------- ----- -------------
0c.00.11 FAC02 (135048330) Pool0 3QP0YV3Q00009925DGCC
0c.00.4 FAC02 (135048330) Pool0 3QP0YV4800009928G98U
0c.00.2 FAC02 (135048330) Pool0 3QP0YV8G00009928PPLH
0c.00.6 FAC02 (135048330) Pool0 3QP0ZC2P00009926HBXK
0c.00.0 FAC02 (135048330) Pool0 3QP0YVC900009928PP3P
0c.00.7 FAC01 (135048291) Pool0 3QP0YVEW00009928GNHA
0c.00.10 FAC01 (135048291) Pool0 3QP10JG400009926HEZS
0c.00.3 FAC01 (135048291) Pool0 3QP10EC100009928G4P0
0c.00.1 FAC01 (135048291) Pool0 3QP0WZT400009926KCBN
0c.00.5 FAC01 (135048291) Pool0 3QP10FK5000099171W7V
0c.00.8 FAC01 (135048291) Pool0 3QP0ZRF100009926HEJT
0c.00.9 FAC01 (135048291) Pool0 3QP0YVHG00009928GDQW
0b.18 FAC01 (135048291) Pool0 WD-WMATV8038093
0b.22 FAC01 (135048291) Pool0 WD-WMATV8095587
0b.20 FAC01 (135048291) Pool0 WD-WMATV8039600
0b.24 FAC01 (135048291) Pool0 WD-WMATV8039114
0b.29 FAC02 (135048330) Pool0 WD-WMATV8095049
0b.23 FAC02 (135048330) Pool0 WD-WMATV8039807
0b.27 FAC02 (135048330) Pool0 WD-WMATV8039645
0b.28 FAC01 (135048291) Pool0 WD-WMATV7978194
0b.26 FAC01 (135048291) Pool0 WD-WMATV8039342
0b.25 FAC02 (135048330) Pool0 WD-WMATV8039954
0b.16 FAC01 (135048291) Pool0 WD-WMATV7977866
NOTE: Currently 3 disks are unowned. Use 'disk show -n' for additional information.
FAC01*>
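Unowned disks drop out of the default disk show listing. One way to see which disks became unowned is to compare the listings from before and after remove_ownership, then check the result against the NOTE line. The Python sketch below assumes the five-column layout above. Only the plural form of the NOTE wording appears in this post, so the singular form in the pattern is a guess. listed_disks and newly_unowned are hypothetical helper names.

```python
import re

# Assumed wording of the summary line; only "3 disks are unowned" appears above.
NOTE = re.compile(r"NOTE: Currently (\d+) disks? (?:is|are) unowned")


def listed_disks(disk_show_text):
    """Names of all disks that appear as owned data rows in "disk show" output."""
    names = set()
    for line in disk_show_text.splitlines():
        fields = line.split()
        # Data rows have five fields with the owner's system ID in parentheses.
        if len(fields) == 5 and fields[2].startswith("("):
            names.add(fields[0])
    return names


def newly_unowned(before, after):
    """Disks present in `before` but missing from `after`, plus the NOTE count."""
    missing = sorted(listed_disks(before) - listed_disks(after))
    note = NOTE.search(after)
    reported = int(note.group(1)) if note else 0
    return missing, reported


BEFORE = """\
0b.18 FAC01 (135048291) Pool0 XX-XXXXX8038093
0b.21 FAC01 (135048291) FAILED XX-XXXXX7977856
0b.17 FAC01 (135048291) FAILED XX-XXXXX8015493
0b.19 FAC01 (135048291) FAILED XX-XXXXX7977641
"""
AFTER = """\
0b.18 FAC01 (135048291) Pool0 WD-WMATV8038093
NOTE: Currently 3 disks are unowned. Use 'disk show -n' for additional information.
"""

missing, reported = newly_unowned(BEFORE, AFTER)
# The two should agree if no other disks were already unowned beforehand.
print(missing, reported, len(missing) == reported)
```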

Notice that 3 disks are now reported as unowned, and that 0b.17, 0b.19 and 0b.21 no longer appear in the listing. The final step is to log on to the controller you want to assign the disks to and run the following:

FAC02*> disk assign 0b.17
disk assign: Assign failed for one or more disks in the disk list.
FAC02*> disk assign 0b.17
Sun Sep 26 21:39:34 EST [FAC02: diskown.changingOwner:info]: changing ownership for disk 0b.17 (S/N WD-WMATV8015493) from unowned (ID -1) to FAC02 (ID 135048330)
FAC02*> disk assign 0b.19
Sun Sep 26 21:39:39 EST [FAC02: diskown.changingOwner:info]: changing ownership for disk 0b.19 (S/N WD-WMATV7977641) from unowned (ID -1) to FAC02 (ID 135048330)
FAC02*> disk assign 0b.21
Sun Sep 26 21:39:43 EST [FAC02: diskown.changingOwner:info]: changing ownership for disk 0b.21 (S/N WD-WMATV7977856) from unowned (ID -1) to FAC02 (ID 135048330)
FAC02*>
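The diskown.changingOwner messages record the old and new owner of each disk. They can therefore confirm that every disk went from unowned (ID -1) to FAC02. The Python sketch below assumes the message format shown above. ownership_changes is a hypothetical helper name.

```python
import re

# Matches diskown.changingOwner messages in the format shown above.
CHANGING_OWNER = re.compile(
    r"changing ownership for disk (\S+) \(S/N (\S+)\) "
    r"from (\S+) \(ID (-?\d+)\) to (\S+) \(ID (-?\d+)\)"
)


def ownership_changes(console_text):
    """Map disk name -> (old owner, new owner, new owner's system ID)."""
    return {m.group(1): (m.group(3), m.group(5), int(m.group(6)))
            for m in CHANGING_OWNER.finditer(console_text)}


LOG = """\
Sun Sep 26 21:39:34 EST [FAC02: diskown.changingOwner:info]: changing ownership for disk 0b.17 (S/N WD-WMATV8015493) from unowned (ID -1) to FAC02 (ID 135048330)
Sun Sep 26 21:39:39 EST [FAC02: diskown.changingOwner:info]: changing ownership for disk 0b.19 (S/N WD-WMATV7977641) from unowned (ID -1) to FAC02 (ID 135048330)
Sun Sep 26 21:39:43 EST [FAC02: diskown.changingOwner:info]: changing ownership for disk 0b.21 (S/N WD-WMATV7977856) from unowned (ID -1) to FAC02 (ID 135048330)
"""

changes = ownership_changes(LOG)
expected = {"0b.17", "0b.19", "0b.21"}
# True only if exactly the expected disks moved, and all of them to FAC02's ID.
ok = set(changes) == expected and all(
    new == "FAC02" and sysid == 135048330 for _old, new, sysid in changes.values())
print("all three disks now owned by FAC02:", ok)
```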

Running disk show now shows the 3 disks assigned to the other active controller, FAC02.

Apologies for the extra steps I included; some of them may not be required to move assigned disks from one controller to the other.

10 comments:

Anonymous said...

Whew. Rather fraught way of going about it.

Try:

FAC01> options disk.auto_assign off
FAC01> disk assign 0b.17 0b.19 0b.21 -s unowned -f

FAC02> disk assign 0b.17 0b.19 0b.21

Done.

Charlie
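Charlie's sequence generalizes to any list of disks and any pair of controllers. The Python sketch below only assembles the three commands quoted in his comment as text to type at each node's prompt. It does not connect to a filer. reassign_commands is a hypothetical helper name.

```python
def reassign_commands(disks, source, target):
    """Build Charlie's three-step sequence as (node, command) pairs.

    Uses only the commands quoted in the comment; they are meant to be typed
    by hand at each node's console.
    """
    disk_list = " ".join(disks)
    return [
        (source, "options disk.auto_assign off"),
        # -s unowned with -f releases ownership from the current owner.
        (source, f"disk assign {disk_list} -s unowned -f"),
        # The target node then claims the now-unowned disks.
        (target, f"disk assign {disk_list}"),
    ]


for node, cmd in reassign_commands(["0b.17", "0b.19", "0b.21"], "FAC01", "FAC02"):
    print(f"{node}> {cmd}")
```

For the disks in this post it prints the same three lines as the comment: FAC01> options disk.auto_assign off, FAC01> disk assign 0b.17 0b.19 0b.21 -s unowned -f, and FAC02> disk assign 0b.17 0b.19 0b.21.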

Terence Luk said...

Thanks Charlie. I don't remember whether we tried the command but will definitely try it in the future.

Jertek said...

But extra typing is one way to keep fingers warm in a data center lol...

Anonymous said...

Thank you for the post. It was helpful to me. - LMS

Anonymous said...

Thanks! -f solved the issue for me after receiving a refurbished disk that we couldn't get to be owned by our systems.

Ryan Betts said...

thanks mate, came in handy today. hope things are well in BDA.

Terence Luk said...

Great to know this old post was able to help you, Ryan!

Anonymous said...

Thank you very much, my problem has been solved.

Anonymous said...

I've got an issue where I can assign a new shelf of disks to a node with no problem, but when I assign the shelf, or even the disks one at a time, it automatically creates an aggregate (aggr0, aggr1, aggr1(1)) with those disks. That sounds good, but it seems to be an aggregate that I can't do anything with. What makes it worse is that CDOT (8.1) doesn't even seem to recognize the automatically created aggregate as a true aggregate, because it won't show up in the list of aggregates with aggregate show. If it were recognized, I should be able to delete the aggregate and those disks should then show up as spares. Basically, what I would like to do is simply assign ownership of that shelf to a node and have the disks show up as hot spares so that I can actually do something with them (like create an aggregate that I can then create volumes/LUNs on). This is the way it should work under normal circumstances.

Has anybody ever run into this issue? Any help would be much appreciated!

Mail Pro said...

Hello,
I have the same problem as you. How did you solve it?