I/O fencing cannot start after a node reboot because the last digit of the coordinator disk's serial number for an EMC Symmetrix LUN has changed.

Article:TECH47370  |  Created: 2006-01-10  |  Updated: 2009-01-26  |  Article URL http://www.symantec.com/docs/TECH47370
Article Type
Technical Solution

Product(s)

Environment

Issue



I/O fencing cannot start after a node reboot because the last digit of the coordinator disk's serial number for an EMC Symmetrix LUN has changed.

Error



VCS FEN ERROR V-11-1-23 Got the serial number of 785E4000a for coordinator disk from remote node. Corresponding value for local node is 785E4000g.
VCS FEN ERROR V-11-1-11 Snapshot different. Dropping out of cluster.

Solution



The issue:

With EMC Symmetrix storage, a rebooted node cannot rejoin the cluster because fencing cannot start on the joining node. The in-kernel record of one coordinator disk has a serial number with a trailing digit of "e" (for instance, 98EB6000e), while the physical disk reports a trailing digit of "a" (for instance, 98EB6000a).

# vxfenconfig -c
VCS FEN vxfenconfig NOTICE Driver will use SCSI-3 compliant disks.
VCS FEN vxfenconfig ERROR V-11-2-1006 List of coordinator disks in running cluster is different than local node.
       Unable to configure vxfen.

For Veritas Cluster Server (tm), Veritas Storage Foundation (tm) Cluster File System (SFCFS), and Veritas Storage Foundation (tm) for Oracle/RAC (SFRAC) 4.0 and 4.1, the following errors are logged in /var/adm/messages:
Apr  4 11:08:06 pltc107 vmunix: VCS FEN ERROR V-11-1-23 Got the serial number of 785E4000a for coordinator disk
Apr  4 11:08:06 pltc107 vmunix:  from remote node. Corresponding value for local node is 785E4000g.
Apr  4 11:08:06 pltc107 vmunix: VCS FEN ERROR V-11-1-11 Snapshot different. Dropping out of cluster.
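
When checking a node for this condition, it can help to pull the two serial numbers out of the V-11-1-23 message and confirm that they differ only in the trailing digit. The following is a minimal Python sketch under stated assumptions, not a Symantec tool: the function name parse_mismatch and the regular expression are hypothetical and are based only on the message text shown above, including the way syslog wraps the message across two lines.

```python
import re

# Pattern built from the V-11-1-23 message text quoted in this article.
PATTERN = re.compile(
    r"V-11-1-23 Got the serial number of (\w+) for coordinator disk\s+"
    r"from remote node\. Corresponding value for local node is (\w+)"
)

def parse_mismatch(log_lines):
    """Return (remote, local, trailing_only), or None if no V-11-1-23 message is found."""
    # Strip the syslog prefix up to "vmunix:" and rejoin the wrapped message.
    text = " ".join(
        re.sub(r"^.*?vmunix:\s*", "", line).strip() for line in log_lines
    )
    match = PATTERN.search(text)
    if match is None:
        return None
    remote, local = match.groups()
    # True when the serial numbers match except for their last character.
    trailing_only = remote[:-1] == local[:-1] and remote[-1] != local[-1]
    return remote, local, trailing_only

sample = [
    "Apr  4 11:08:06 pltc107 vmunix: VCS FEN ERROR V-11-1-23 Got the serial number of 785E4000a for coordinator disk",
    "Apr  4 11:08:06 pltc107 vmunix:  from remote node. Corresponding value for local node is 785E4000g.",
    "Apr  4 11:08:06 pltc107 vmunix: VCS FEN ERROR V-11-1-11 Snapshot different. Dropping out of cluster.",
]
print(parse_mismatch(sample))  # ('785E4000a', '785E4000g', True)
```

A result whose third element is True means that only the trailing digit differs, which is consistent with the issue described in this article.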

For SFCFS and SFRAC 3.5, no messages are logged in /var/adm/messages, but running vxfendebug on the joining node shows the following errors:
lbolt: 808800 ../vxfen_io.c ln   68 VXFEN: vxfen_vrfsm_cback: start
lbolt: 808800 ../vxfen_io.c ln  118 VXFEN: vxfen_vrfsm_cback: received VRFSM_TK_SNAP_DATA
lbolt: 808800 ../vxfen_fence.c ln  971 VXFEN:vxfen_recv_snapshot_msg: begin
lbolt: 808800 ../vxfen_fence.c ln 1002 VXFEN:vxfen_recv_snapshot_msg from: 0 serial_num:  98EB6000e
lbolt: 808800 ../vxfen_fence.c ln 1023 VXFEN:vxfen_recv_snapshot_msg: end
lbolt: 808800 ../vxfen_io.c ln  365 VXFEN: vxfen_vrfsm_cback: end
lbolt: 808800 ../vxfen_fence.c ln   55 vxfen_skip_multipaths: skipping the device: device_num: 0 serial_num: 98BA8000a npaths: 1
lbolt: 808800 ../vxfen_fence.c ln  934 VXFEN: vxfen_verify_disk_same: Disk received in snapshot are different!!!

The issue (documented as Incident# 586153) is caused by the way I/O fencing handles EMC Symmetrix LUN serial numbers. The current vxfen driver includes byte 44 of the standard SCSI inquiry data in the LUN serial number. EMC uses this byte as a protection indicator, so the same serial number can appear differently to host systems depending on the features enabled for that LUN. Byte 44 is interpreted as follows:

Bit 7 = 0
Bit 6 = 1
Bit 5 = Mirrored device
Bit 4 = RAID-protected device
Bit 3 = 0
Bit 2 = RDF-enabled
Bit 1 = Primary RDF device
Bit 0 = 1
(Contact EMC Technical Support for details on the meaning of individual bits)
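
The bit layout above can be checked against the trailing digits seen in the error messages. The following is a minimal Python sketch, assuming that the trailing digit of the reported serial number is the ASCII rendering of byte 44. The names FEATURE_BITS and decode_byte44 are hypothetical and are not part of any Symantec or EMC tool.

```python
# Feature bits of byte 44, taken from the list above.
# Bits 7, 6, 3 and 0 are fixed (0, 1, 0, 1) and carry no feature information.
FEATURE_BITS = {
    5: "mirrored device",
    4: "RAID-protected device",
    2: "RDF-enabled",
    1: "primary RDF device",
}

def decode_byte44(trailing_char):
    """Return the features flagged by the trailing serial-number character."""
    value = ord(trailing_char)  # assumption: the character is byte 44 in ASCII
    return [
        name
        for bit, name in sorted(FEATURE_BITS.items(), reverse=True)
        if value & (1 << bit)
    ]

for ch in "aeg":
    print(ch, format(ord(ch), "08b"), decode_byte44(ch))
# a 01100001 ['mirrored device']
# e 01100101 ['mirrored device', 'RDF-enabled']
# g 01100111 ['mirrored device', 'RDF-enabled', 'primary RDF device']
```

Under this assumption, a change from "a" to "e" or "g" corresponds to RDF being enabled on a mirrored LUN, which matches the SRDF example given in the workaround.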

When a node reboots, its vxfen driver tries to join the existing cluster. The existing cluster checks that the coordinator disks have the same serial numbers. vxfen reads 9 bytes for the serial number, and on Symmetrix the 9th byte (byte 44 of the SCSI inquiry data) is used for the LUN type. EMC requires only 8 bytes of the serial number to be read if the SPC-2 bit is not set, but vxfen reads 9 bytes by default whether or not the SPC-2 bit is set. When the LUN type is changed on the storage side, which should be transparent to the host, the 9th byte changes, but the running cluster still holds the old serial number in core. A node that is rebooted afterwards reads the updated serial number with the different trailing digit, so its serial number no longer matches the one held by the running nodes. As a result, the rebooted node cannot join the cluster.
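
The effect of the extra byte can be illustrated with the serial numbers quoted earlier. The following Python sketch is not the vxfen driver's comparison code or the point patch; the function same_disk and the constant SERIAL_LEN_NO_SPC2 are hypothetical names used only to show how a 9-byte comparison fails where an 8-byte comparison succeeds.

```python
# Per the text above, only 8 bytes are valid when the SPC-2 bit is not set.
SERIAL_LEN_NO_SPC2 = 8

def same_disk(local_serial, remote_serial, length):
    """Compare the first `length` characters of two serial numbers."""
    return local_serial[:length] == remote_serial[:length]

in_core = "98EB6000e"  # value still held by the running cluster
on_disk = "98EB6000a"  # value read by the rebooted node

print(same_disk(in_core, on_disk, 9))                   # False
print(same_disk(in_core, on_disk, SERIAL_LEN_NO_SPC2))  # True
```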

Workaround for cluster server versions 3.5 and 4.0:

Option 1: Requires downtime.
Shut down the entire cluster and restart the vxfen driver on all nodes so that kernel memory is refreshed with the new serial number.
Symantec recommends applying the vxfen driver point patches (for Etrack Incident 586153) on all nodes in the cluster and restarting the vxfen driver to prevent this issue from happening again.

Option 2:
Revert the LUN setup change made on the coordinator disks. For example, if Symmetrix Remote Data Facility (SRDF) functionality was enabled for the coordinator disks, it can be disabled because coordinator disks do not need SRDF functionality. This requires a storage-level change; check with the storage administrator to explore the options. If the LUN type can be changed back, no downtime is required on the running cluster, but the issue can happen again. Plan to upgrade to a 5.x release or apply the required point patch when downtime is available.

Contact Symantec Enterprise Technical Support if the point patch is required.



Supplemental Materials

Source: ETrack
Value: 586153
Description: EMC has only 8 bytes of serial number if SPC-2 bit is not set, while we read 12 bytes into the buffer

Legacy ID



282863



