
vxdisk list showing errors on multiple disks, and I am unable to start cluster on slave node.

Created: 14 Apr 2014 • Updated: 01 May 2014 | 16 comments
This issue has been solved. See solution.

Hello,

If anybody has had the same experience and can help me, I would be very thankful.

I am using Solaris 10 (x86, 141445-09) + EMC PowerPath (5.5.P01_b002) + VxVM (5.0,REV=04.15.2007.12.15) on a two-node cluster.

This is a file server cluster.

I added a couple of new LUNs, and when I tried to scan for the new disks, the "vxdisk scandisks" command hung. After that I was unable to run any VxVM operation on that node; every command hangs.

I rebooted the server in a maintenance window (before the reboot I switched all SGs to the 2nd node).

After that reboot the node is unable to join the cluster, with this reason:

2014/04/13 01:04:48 VCS WARNING V-16-10001-1002 (filesvr1) CVMCluster:cvm_clus:online:CVMCluster start failed on this node.
2014/04/13 01:04:49 VCS INFO V-16-2-13001 (filesvr1) Resource(cvm_clus): Output of the completed operation (online)
ERROR:
2014/04/13 01:04:49 VCS ERROR V-16-10001-1005 (filesvr1) CVMCluster:???:monitor:node - state: out of cluster
reason: Cannot find disk on slave node: retry to add a node failed 

 

Apr 13 01:10:09 s_local@filesvr1 vxvm: vxconfigd: [ID 702911 daemon.warning] V-5-1-8222 slave: missing disk 1306358680.76.filesvr1
Apr 13 01:10:09 s_local@filesvr1 vxvm: vxconfigd: [ID 702911 daemon.warning] V-5-1-7830 cannot find disk 1306358680.76.filesvr1
Apr 13 01:10:09 s_local@filesvr1 vxvm: vxconfigd: [ID 702911 daemon.error] V-5-1-11092 cleanup_client: (Cannot find disk on slave node) 222

 

Here is the output from the 2nd node (working fine):

 

Disk:   emcpower33s2
type:   auto
flags:  online ready private autoconfig shared autoimport imported
guid:   {665c6838-1dd2-11b2-b1c1-00238b8a7c90}
udid:   DGC%5FVRAID%5FCKM00111001420%5F6006016066902C00915931414A86E011
site:    -
diskid: 1306358680.76.filesvr1
dgname: fileimgdg
dgid:   1254302839.50.filesvr1
clusterid: filesvrvcs
info:   format=cdsdisk,privoffset=256,pubslice=2,privslice=2

And here is the output from the node where I see the problem:

 

Device:    emcpower33s2
devicetag: emcpower33
type:      auto
flags:     error private autoconfig
pubpaths:  block=/dev/vx/dmp/emcpower33s2 char=/dev/vx/rdmp/emcpower33s2
guid:      {665c6838-1dd2-11b2-b1c1-00238b8a7c90}
udid:      DGC%5FVRAID%5FCKM00111001420%5F6006016066902C00915931414A86E011
site:      -
errno:     Configuration request too large
Multipathing information:
numpaths:   1
emcpower33c     state=enabled

 

Can anybody help me?

I am not sure what "Configuration request too large" means here.

 


starflyfly's picture

Hi,

For online LUN add/remove you need to follow some best practices, such as cleaning the device tree at the OS layer and making sure there are no stale entries in /dev/[r]dsk.

Also, 5.0 is an old version, so you may hit issues during online reconfiguration.

Currently, the error may not really be "Configuration request too large"; it may actually be a "Permission Denied" error.

 

I suggest you try:

1. Back up the data from node 2 (the working node).

2. Stop CVM on node 2.     <<<< you need to stop the application first

3. Deport the dg.

4. Make sure all disks look fine, on node 1:

       cd /etc/vx
       mv /etc/vx/disk.info /etc/vx/disk.info.old
       mv array.info array.info.orig
       rm /dev/vx/rdmp/*
       rm /dev/vx/dmp/*
       rm /dev/dsk/*
       rm /dev/rdsk/*
       devfsadm -Cv
       vxconfigd -k
       vxdisk list     <<< check that all disks are in a good state

5. Start CVM on both nodes.

            

 

If the answer has helped you, please mark as Solution.

rsharma1's picture

1. Run vxdisk -o alldgs list on the second (problem) node to identify the disk groups that currently have one or more disks in error state.

2. Execute "vxdg -g <dgname> flush" from the CVM master for the disk groups identified in step 1.

3. Try to online the cvm SG via VCS (hagrp -online cvm -sys node2).
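
Step 1 can be awkward in practice, because a disk in error state shows "-" in the GROUP column on the problem node. A minimal sketch of one way around that, assuming both listings were saved to files and that a given LUN has the same emcpower device name on both nodes (that holds for emcpower33s2 in this thread, but it is not guaranteed in general). The function name `groups_with_error_disks` is hypothetical:

```shell
# groups_with_error_disks PROBLEM_LIST WORKING_LIST
# Both arguments are files holding saved `vxdisk -o alldgs list` output.
# Prints the disk groups (taken from the working node) that own a device
# which the problem node reports as "error".
# Assumption: device names are identical on both nodes.
groups_with_error_disks() {
    awk '
        # First file (problem node): remember devices whose status is "error".
        NR == FNR { if ($NF == "error") bad[$1] = 1; next }
        # Second file (working node): print the group of each remembered device,
        # stripping the parentheses shown for groups not imported locally.
        ($1 in bad) { g = $4; gsub(/[()]/, "", g); if (g != "-") print g }
    ' "$1" "$2" | sort -u
}
```

The resulting names are the candidates for the "vxdg -g <dgname> flush" in step 2.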

SOLUTION
Gaurav Sangamnerkar's picture

For CVM to work correctly, it is very important that all nodes in the cluster see the SAME number of shared disks; otherwise a node will have problems joining the CVM cluster.

# vxdisk -o alldgs list |grep -i shared | wc -l

Do both nodes see the same number of disks?
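
The comparison can be sketched as two small shell helpers. This is only an illustration: `count_shared` and `compare_shared_counts` are hypothetical names, and the listing is read from stdin so it can come from a live command or from a saved file.

```shell
# Count the lines mentioning "shared" in `vxdisk -o alldgs list` output (stdin).
# Equivalent to `grep -i shared | wc -l`, without the padded output of wc.
count_shared() {
    grep -ic 'shared'
}

# compare_shared_counts COUNT_A COUNT_B
# Reports whether two nodes see the same number of shared disks.
compare_shared_counts() {
    if [ "$1" -eq "$2" ]; then
        echo "match: both nodes see $1 shared disks"
    else
        echo "mismatch: $1 vs $2 shared disks"
        return 1
    fi
}

# Example (on each node): vxdisk -o alldgs list | count_shared
```

Feeding the two per-node counts to `compare_shared_counts` returns a non-zero status on a mismatch, which makes it easy to use in a pre-check script.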

 

G
 

PS: If you are happy with the answer provided, please mark the post as solution. You can do so by clicking link "Mark as Solution" below the answer provided.
 

lukaskison@gmail.com's picture

Hello, 

 

@Gaurav - no. On the primary node (the working one) every disk is in the "online shared" state, while on the second node half of the disks are "online shared" and the other half are just in the error state.

 

Here is the list from the problem node:

 

05:24:59 [root@filesvr1:~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
emcpower0s2  auto:cdsdisk    -            (fileimgdg)  online shared
emcpower1s2  auto:cdsdisk    -            (fileimgdg)  online shared
emcpower2s2  auto:cdsdisk    -            (fileimgdg)  online shared
emcpower3s2  auto:cdsdisk    -            (fileimgdg)  online shared
emcpower4s2  auto:cdsdisk    -            (fileimgdg)  online shared
emcpower5s2  auto:cdsdisk    -            (fileimgdg)  online shared
emcpower6s2  auto:cdsdisk    -            (fileimgdg)  online shared
emcpower7s2  auto:cdsdisk    -            (fileimgdg)  online shared
emcpower8s2  auto            -            -            error
emcpower9s2  auto            -            -            error
emcpower10s2 auto            -            -            error
emcpower11s2 auto            -            -            error
emcpower12s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower13s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower14s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower15s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower16s2 auto            -            -            error
emcpower17s2 auto            -            -            error
emcpower18s2 auto            -            -            error
emcpower19s2 auto            -            -            error
emcpower20s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower21s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower22s2 auto:cdsdisk    -            (fileimgdg)  online shared
emcpower23s2 auto            -            -            error
emcpower24s2 auto            -            -            error
emcpower25s2 auto            -            -            error
emcpower26s2 auto            -            -            error
emcpower27s2 auto            -            -            error
emcpower28s2 auto            -            -            error
emcpower29s2 auto            -            -            error
emcpower30s2 auto:cdsdisk    -            (filesrvfendg) online
emcpower31s2 auto            -            -            error
emcpower32s2 auto            -            -            error
emcpower33s2 auto            -            -            error
emcpower34s2 auto            -            -            error
emcpower35s2 auto            -            -            error
emcpower36s2 auto            -            -            error
emcpower37s2 auto            -            -            error
emcpower38s2 auto            -            -            error
emcpower39s2 auto            -            -            error
emcpower40s2 auto            -            -            error
emcpower41s2 auto            -            -            error
emcpower42s2 auto            -            -            error
emcpower43s2 auto:cdsdisk    -            (filesrvfendg) online
emcpower44s2 auto:cdsdisk    -            (filesrvfendg) online
emcpower45s2 auto            -            -            error
emcpower46s2 auto            -            -            error
emcpower47s2 auto            -            -            error
emcpower48s2 auto:cdsdisk    -            (fileimgdg)  online shared
emcpower49s2 auto:cdsdisk    -            (fileimgdg)  online shared
emcpower50s2 auto:cdsdisk    -            (fileimgdg)  online shared
emcpower51s2 auto:cdsdisk    -            (fileimgdg)  online shared
emcpower52s2 auto            -            -            error
emcpower53s2 auto            -            -            error
emcpower54s2 auto            -            -            error
emcpower55s2 auto            -            -            error
emcpower56s2 auto:cdsdisk    -            (fileimgdg)  online shared
emcpower57s2 auto:cdsdisk    -            (fileimgdg)  online shared
emcpower58s2 auto:cdsdisk    -            (fileimgdg)  online shared
emcpower59s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower60s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower61s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower62s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower63s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower64s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower65s2 auto:cdsdisk    -            (fileimgdg)  online shared
emcpower66s2 auto:cdsdisk    -            (fileimgdg)  online shared
emcpower67s2 auto:cdsdisk    -            (fileimgdg)  online shared
emcpower68s2 auto:cdsdisk    -            (fileimgdg)  online shared
emcpower69s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower70s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower71s2 auto:cdsdisk    -            (sambadg)    online shared
emcpower72s2 auto:cdsdisk    -            (sambadg)    online shared
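
A listing of this length is easier to judge as a tally per status. A minimal sketch, assuming the input is only the command output (header plus device lines, without the shell prompt line); `status_tally` is a hypothetical helper name:

```shell
# Tally the STATUS column of `vxdisk -o alldgs list` output read from stdin.
# The status is everything from the 5th field onward, e.g. "online shared".
status_tally() {
    awk '$1 != "DEVICE" && NF >= 5 {
        s = $5
        for (i = 6; i <= NF; i++) s = s " " $i
        count[s]++
    }
    END { for (s in count) print count[s], s }' | sort -k2
}

# Example: vxdisk -o alldgs list | status_tally
```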

 

rsharma1's picture

As Gaurav asked above, could you please post the output of the command below from *both* nodes:

vxdisk -o alldgs list |grep -i shared | wc -l

If the numbers match on both nodes and you still have a rejoin issue, could you then follow the "vxdg -g <dgname> flush" steps mentioned above?

lukaskison@gmail.com's picture

Hello,

 

No, on the problem node half of the disks are in the error state and the other half are online shared.

On the second node (working fine) all disks are in the online shared state.

 

Lukas

lukaskison@gmail.com's picture

19:28:36 [root@filesvr2:~]# vxdisk -o alldgs list |grep -i shared | wc -l
      70

 

19:28:52 [root@filesvr1:~]# vxdisk -o alldgs list |grep -i shared | wc -l
      36

 

 

 

novonil_choudhuri's picture

Are you able to see the disks from the OS? What is the output of # echo | format ?

Are you able to see all the paths from EMC PowerPath?

Best Regards,

Novonil

lukaskison@gmail.com's picture

Yes. vxdisk list sees 36 disks in error and 34 are fine (online shared); powermt display dev=all sees all 70 disks without any problems (all paths are up).

Solaris sees all disks in the format command.

I've attached the output of the commands.

Attachment: output.txt (222.4 KB)

Gaurav Sangamnerkar's picture

To summarize, the issue here is not CVM at this stage. As mentioned above, for the CVM cluster to work correctly you first need to have 70 shared disks on node filesvr1 as well, and those disks must be accessible and not in error state. A disk in error state indicates that VxVM is somehow unable to read the private region of the disk.

The problem here is one layer below: you need to find out why the disks are in error state. The first output you gave in the original post suggests that the EMC paths are enabled from the DMP view. So here is what I would recommend:

1. Confirm that the storage is all right: all 70 LUNs are intact and assigned to filesvr1.

2. Verify on the SAN switches that the zoning is correct and that filesvr1 should see 70 LUNs.

3. Check that the OS sees all the disks correctly in the format output, with no disk in "drive not available" or any other inconsistent state. For any disk that is in error state above, try to access it with the "format" command and see if you can read the disk label. The "vxdisk -e list" command will help you map the emcpowerxx names to c#t#d# names.

4. Verify that PowerPath can see the disks (powermt display dev=all) and that no paths are dead.

5. One more check you can do is on I/O fencing. Do you have I/O fencing on these disks? Are there any stale keys lying on these disks (vxfenadm -g <diskgroup> -s /dev/rdsk/cxtxdx)? It is quite possible that fencing is restricting I/O access to these disks.

Once the above steps come out clean, you might need to run "vxdctl enable" so that vxconfigd re-establishes its connections with the disks. After vxdctl enable, it is quite possible that you will need to do a "vxreattach" of the disks to the disk group; however, the disks must be out of error state for that.

G

 

 

lukaskison@gmail.com's picture

I think that vxfen is causing all my problems, because I can access the error disks via the format command. I can see the disk size, layout, partitioning, everything. But I have no idea how to confirm that vxfen is really causing all these problems.

Gaurav Sangamnerkar's picture

Well, we can't say that I/O fencing is the cause; if a reservation were the issue, even format would have problems reading the disks. Can you get the outputs below to confirm the fencing part? (The commands below won't cause any harm; they just read keys from the disks.)

Create a file with some of the error disks in it:

# cat /tmp/diskfile
/dev/rdsk/c5t15d64s0
/dev/rdsk/c5t15d65s0
/dev/rdsk/c5t15d66s0   (you can get cxtxdx names with vxdisk -e list)

 

# vxfenadm -s all -f /tmp/diskfile

# vxfenadm -r all -f /tmp/diskfile

 

G

 

Gaurav Sangamnerkar's picture

OK, my bad ... -s came after 5.1.

Try:

# vxfenadm -g all -f /tmp/diskfile    ( -g for registration keys)

# vxfenadm -r all -f /tmp/diskfile     (-r for reservation keys)

 

G

 

lukaskison@gmail.com's picture

Hello all,

Running vxdg -g <disk_group> flush from the CVM master node fixed my problem.

All disks are visible after a rescan and the node successfully joined the cluster.

Again, thanks for your time, guys!

Pankaj Tandon's picture

Just to explain why `vxdg flush` (the solution provided by rsharma1) worked, and why the joiner node initially failed to join the cluster.

Details:

As part of the join protocol:

- The CVM master sends the joiner the list of online disks (from imported shared disk groups) that it can see.

- The joiner then checks whether it can see those disks. If not, it creates a dummy entry by fetching some basic properties of the disks from other nodes of the cluster.

 

Why the node join failed in this case:

- There were some disks on the master that were in online state but had actually lost connectivity globally (on all nodes of the cluster), and hence were potential candidates to be detached (error state). But we don't proactively detach disks (for some valid reasons); a disk only gets detached as part of I/O to that disk.

- So any joiner is expected to have connectivity to these disks as well (since they are online on the master), or the join will fail.

 

How `vxdg flush` solved this problem:

- vxdg flush triggers some private region I/Os on the disks as part of refreshing the contents.

- As mentioned previously, these I/Os detach the globally disconnected disks (disconnected on all nodes of the cluster) on the master.

NOTE: We have handled some of these (likely) scenarios proactively, but not all.
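
For reference, the flush that fixed this thread's problem can be wrapped in a small loop when several shared disk groups are involved. This is only a sketch: `flush_groups` and its `-n` dry-run switch are hypothetical, the disk group names in the example come from the listing earlier in the thread, and the real command has to be run on the CVM master.

```shell
# flush_groups [-n] DG...
# Runs `vxdg -g DG flush` for each named disk group.
# With -n (dry run) it only prints the commands it would run.
flush_groups() {
    dry_run=0
    if [ "$1" = "-n" ]; then
        dry_run=1
        shift
    fi
    for dg in "$@"; do
        if [ "$dry_run" -eq 1 ]; then
            echo "vxdg -g $dg flush"
        else
            # Stop at the first failure so it can be investigated.
            vxdg -g "$dg" flush || return 1
        fi
    done
}

# Example, on the CVM master: flush_groups -n fileimgdg sambadg
```

Running it with -n first shows exactly which commands would be issued before anything touches the disks.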

 

Thanks & Regards,

Pankaj Tandon (CVM Team)