Serial Split Brain (SSB) condition caused Cluster Volume Manager (CVM) Master Takeover to fail

Article: TECH177212  |  Created: 2011-12-18  |  Updated: 2011-12-21  |  Article URL: http://www.symantec.com/docs/TECH177212
Article Type: Technical Solution


Issue



When a Serial Split Brain (SSB) condition is detected by the new CVM master on Veritas Volume Manager (VxVM) versions 5.0 and 5.1, the default CVM behaviour causes the new CVM master to leave the cluster, which results in cluster-wide downtime.


Error



During the CVM Master Takeover, the new CVM master detected the SSB condition and decided to leave the cluster.

VxVM vxconfigd NOTICE V-5-1-7899 CVM_VOLD_CHANGE command received
V-5-1-0 Preempting CM NID 1
VxVM vxconfigd NOTICE V-5-1-9576 Split Brain. da id is 0.5, while dm id is 0.4 for dm cvmdgA-01
VxVM vxconfigd WARNING V-5-1-8060 master: could not delete shared disk groups
VxVM vxconfigd ERROR V-5-1-7934 Disk group cvmdgA: Disabled by errors
VxVM vxconfigd ERROR V-5-1-7934 Disk group cvmdgB: Disabled by errors
...
VxVM vxconfigd ERROR V-5-1-11467 kernel_fail_join() :           Reconfiguration interrupted: Reason is transition to role failed (12, 1)
VxVM vxconfigd NOTICE V-5-1-7901 CVM_VOLD_STOP command received
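
Of the messages above, the V-5-1-9576 notice is the one that carries both SSB IDs and the affected DM name. The following is a minimal sketch for pulling those values out of saved vxconfigd messages. It assumes the message layout shown above; the function name ssb_notices is hypothetical and is not part of VxVM.

```shell
# ssb_notices: read vxconfigd messages on stdin and print
# "<dm name> <da id> <dm id>" for every V-5-1-9576 Split Brain notice.
# Hypothetical helper; assumes the message layout shown in this article.
ssb_notices() {
    sed -n 's/.*V-5-1-9576 Split Brain\. da id is \([0-9.]*\), while dm id is \([0-9.]*\) for dm \(.*\)$/\3 \1 \2/p'
}

# Usage (the file name is only an example):
#   ssb_notices < saved_vxconfigd_messages.log
```

Lines that do not match the notice are ignored, so the whole log can be fed in unfiltered.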
 


Environment



Normally SSB should not occur on a diskgroup that is already imported, whether it is a CVM-shared diskgroup or a local diskgroup. However, in one customer case SSB was detected during the CVM Master Takeover and caused cluster-wide downtime because all the CVM volumes (and hence the CFS filesystems) became unavailable. This customer case occurred on HP-UX 11.23 with VxVM version 5.0MP2RP2.


Cause



SSB is a condition where the on-disk SSB ID of a physical disk (Disk Access (DA) record) doesn't match the SSB ID of a logical disk (Disk Media (DM) record) in the diskgroup configuration.   The following is an example.

The DA record can be shown with the "vxdisk list <da>" command.  For example,

# vxdisk list c158t0d1
Device:    c158t0d1
devicetag: c158t0d1
type:      auto
clusterid: mycluster
disk:      name=cvmdgA-01 id=1305276345.3469.hosta
group:     name=cvmdgA id=1241454792.3390.hosta
info:      format=cdsdisk,privoffset=128
flags:     online ready private autoconfig shared autoimport imported
....
update:    time=1320137343 seqno=0.32
ssb:       actual_seqno=0.5                <<< on-disk DA SSB ID
headers:   0 120
configs:   count=1 len=24072
logs:      count=1 len=3648
Defined regions:
 config   priv 000024-000119[000096]: copy=01 offset=000000 enabled
 config   priv 000128-024103[023976]: copy=01 offset=000096 enabled
 log      priv 024104-027751[003648]: copy=01 offset=000000 disabled
 lockrgn  priv 027752-027823[000072]: part=00 offset=000000
Multipathing information:
numpaths:  12
c162t0d1        state=enabled
......

The DM record can be shown with the "vxprint -m -g <dg> -d <dm>" command.  For example,

# vxprint -m -g cvmdgA -d cvmdgA-01
dm   cvmdgA-01
        tutil0="
        tutil1="
        tutil2="
        da_name=c158t0d1
        device_tag=c158t0d1
.....
        ssbid=0.4                   <<< DM SSB ID in the diskgroup configuration
.....

For an imported diskgroup the two SSB ID's should always be the same, but in the above example the DA SSB ID doesn't match the DM SSB ID and this causes the CVM Master Takeover to fail.
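
The comparison described above can be sketched for a single disk as follows. This is an illustration only: the helper names da_ssbid and dm_ssbid are hypothetical, and they assume the "ssb:" and "ssbid=" output lines look like the examples in this article.

```shell
# Hypothetical helpers; each reads command output on stdin.

# da_ssbid: print the actual_seqno value from the "ssb:" line of
# "vxdisk list <da>" output (the on-disk DA SSB ID).
da_ssbid() {
    awk '$1 == "ssb:" { sub(/^actual_seqno=/, "", $2); print $2 }'
}

# dm_ssbid: print the ssbid value from "vxprint -m -g <dg> -d <dm>"
# output (the DM SSB ID in the diskgroup configuration).
dm_ssbid() {
    awk '$1 ~ /^ssbid=/ { sub(/^ssbid=/, "", $1); print $1 }'
}

# Usage on a system with VxVM, using the names from the example above:
#   da=$(vxdisk list c158t0d1 | da_ssbid)
#   dm=$(vxprint -m -g cvmdgA -d cvmdgA-01 | dm_ssbid)
#   [ "$da" = "$dm" ] || echo "SSB ID mismatch: DA=$da DM=$dm"
```

With the outputs shown above, the DA value is 0.5 and the DM value is 0.4, so the mismatch message would be printed.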

 

 


Solution



Symantec is still investigating why the SSB IDs became inconsistent even after the diskgroup was imported. While the issue is under investigation, Symantec Engineering will backport an enhancement from VxVM 6.0 to 5.0 and 5.1 to minimize the impact of SSB during CVM Master Takeover. With the fix applied, when SSB is detected in a diskgroup during the CVM Master Takeover, CVM will disable only that diskgroup and keep the other diskgroups imported, and the new CVM master will not leave the cluster. Please refer to the Etrack incident listed in the Supplemental Materials section of this article.

Workaround:

The following shell script can be used to detect if the SSB ID's are consistent before the CVM master takeover.

=========== BEGIN check_ssbid.sh ==============
#!/bin/ksh
# Compare the SSB ID stored in the diskgroup configuration (DM records)
# with the on-disk SSB ID (DA records) for every disk in a diskgroup.

PATH=/usr/bin:/usr/sbin

if (( $# != 1 ))
then
    echo "Usage: check_ssbid.sh <diskgroup>"
    exit 1
fi

DG=$1

# The temp files are kept after the run so the two lists can be inspected.
TMPFILE1=$(mktemp /tmp/check_ssbid.$$.XXXXXX)
if [ -z "$TMPFILE1" ]
then
    echo "Failed to create temp file"
    exit 1
fi

TMPFILE2=$(mktemp /tmp/check_ssbid.$$.XXXXXX)
if [ -z "$TMPFILE2" ]
then
    echo "Failed to create temp file"
    exit 1
fi

# Build a sorted "<dm name> <ssbid>" list from the diskgroup configuration.
CONFIG_SSB_LIST=$(vxprint -m -d -g "$DG" | egrep '^dm|ssb' | sed 's/=/ /' |
    awk '{print $NF}' | xargs -n 2 | sort)

echo "$CONFIG_SSB_LIST" > "$TMPFILE1"

# Build the same list from the on-disk DA record of each disk.
echo "$CONFIG_SSB_LIST" | awk '{print $1}' | xargs -n 1 vxdisk -g "$DG" list |
    egrep '^disk:|^ssb:' | sed 's/ id=.*//' | sed 's/=/ /' |
    awk '{print $NF}' | xargs -n 2 | sort > "$TMPFILE2"

echo "SSB from diskgroup configuration in tempfile $TMPFILE1"
echo "SSB from DA in tempfile $TMPFILE2"

diff "$TMPFILE1" "$TMPFILE2"

if (( $? == 0 ))
then
    echo "SSB IDs are consistent"
    exit 0
else
    echo "Inconsistent SSB IDs found"
    exit 1
fi
============== END check_ssbid.sh ================

The above script will compare the on-disk DA SSB IDs and the diskgroup configuration DM SSB IDs and report any discrepancy.   For example, the following output shows that the SSB ID's are consistent.

# ./check_ssbid.sh datadg
SSB from diskgroup configuration in tempfile /tmp/check_ssbid.6765.yZaWnn
SSB from DA in tempfile /tmp/check_ssbid.6765.FZa4nn
SSB IDs are consistent

If the script detects the SSB condition, the SSB feature can be temporarily turned off for the diskgroup before the master takeover.

# vxdg -g <dg> set ssb=off

For example,

# vxdg -g cvmdgA set ssb=off

After the SSB ID's are made consistent, the SSB attribute of the diskgroup can be turned on again.

# vxdg -g cvmdgA set ssb=on
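
When several diskgroups have to be checked before a planned master takeover, the check script can be run in a loop. The following is a minimal sketch, not a supported tool. The function name check_all_dgs and the CHECK variable are hypothetical, and the function only prints the vxdg command described above; it does not change any diskgroup setting.

```shell
# CHECK points at the check_ssbid.sh script from this article
# (hypothetical variable; override it if the script lives elsewhere).
CHECK=${CHECK:-./check_ssbid.sh}

# check_all_dgs <dg>...: run the check for each named diskgroup and print
# the suggested workaround command for any diskgroup that fails the check.
# Returns 1 if at least one diskgroup failed, 0 otherwise.
check_all_dgs() {
    rc=0
    for dg in "$@"
    do
        if "$CHECK" "$dg" >/dev/null 2>&1
        then
            echo "$dg: consistent"
        else
            echo "$dg: inconsistent, consider: vxdg -g $dg set ssb=off"
            rc=1
        fi
    done
    return $rc
}

# Usage:
#   check_all_dgs cvmdgA cvmdgB
```

Note that check_ssbid.sh also exits with status 1 when it cannot create its temp files, so a diskgroup reported as inconsistent should be confirmed by running the script directly and reviewing its diff output.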

 

In order to fix the inconsistency of the SSB IDs, the diskgroup has to be deported and imported again.    The following is the procedure.

1. Shut down the applications and unmount all the CFS filesystems.

2. Deport the diskgroup

# vxdg deport cvmdgA

3. Make sure that there is only one diskgroup configuration pool in the diskgroup.    (If there are two pools of diskgroup configuration, then it is a real SSB diskgroup corruption and it will be necessary to use the proper procedure to fix the SSB corruption before importing the diskgroup again.   Please refer to the VxVM Administrator's Guide for the procedure.)

# vxdg -g <dg> listssbinfo

4. If the "vxdg listssbinfo" command reports only one pool, then the diskgroup can be imported again by choosing any one of the enabled diskgroup configuration copies.

Check that the disk has an enabled diskgroup configuration copy by using the "vxdisk list <da>" command.   For example,

# vxdisk list hds9500-alua0_91 | egrep 'disk|config.*enabled'
disk:      name=cvmdgH5 id=1166620221.318.hosta                <<< disk ID
info:      format=cdsdisk,privoffset=256,pubslice=2,privslice=2
public:    slice=2 offset=2304 len=957696 disk_offset=0
private:   slice=2 offset=256 len=2048 disk_offset=0
 config   priv 000048-000239[000192]: copy=01 offset=000000 enabled       <<< enabled config copy
 config   priv 000256-001343[001088]: copy=01 offset=000192 enabled

Import the diskgroup by using that copy of the configuration.

# vxdg -o selectcp=<disk ID> -s import <dg>

For example,

# vxdg -o selectcp=1166620221.318.hosta -s import cvmdgA
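
Step 4 can be sketched as a small helper that reads "vxdisk list <da>" output, confirms that the disk has an enabled configuration copy, and prints the matching import command. This is an illustration under the assumption that the output looks like the example above. The function name import_cmd is hypothetical, and the function only prints the command; it does not run it.

```shell
# import_cmd <dg>: read "vxdisk list <da>" output on stdin. If the disk has
# at least one enabled configuration copy, print the shared import command
# using that disk's ID and return 0; otherwise print nothing and return 1.
import_cmd() {
    awk -v dg="$1" '
        # Pick the id= field from the "disk:" line.
        $1 == "disk:" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^id=/) id = substr($i, 4)
        }
        # Remember whether any config region is marked enabled.
        $1 == "config" && $NF == "enabled" { enabled = 1 }
        END {
            if (id != "" && enabled) {
                print "vxdg -o selectcp=" id " -s import " dg
                exit 0
            }
            exit 1
        }'
}

# Usage on a system with VxVM, using the names from the example above:
#   vxdisk list hds9500-alua0_91 | import_cmd cvmdgA
```

Printing the command instead of running it leaves the decision to import with the administrator, after the "vxdg listssbinfo" check in step 3 has been reviewed.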

 


Supplemental Materials

Source: ETrack
Value: 2527793
Description: pinnacle:cc:After one site failure, io fails on other site & all nodes of the other site going out of cluster



