Replication and I/O may stop responding after upgrading to Volume Replicator 5.0, 5.0MP1, 5.0MP2 and 5.0MP3 if the disk group version is not upgraded at the same time (Updated December 2 2009)

Article:TECH63045  |  Created: 2008-01-04  |  Updated: 2010-01-14  |  Article URL http://www.symantec.com/docs/TECH63045
Article Type
Technical Solution


Environment

Issue



Replication and I/O may stop responding after upgrading to Volume Replicator 5.0, 5.0MP1, 5.0MP2 and 5.0MP3 if the disk group version is not upgraded at the same time (Updated December 2 2009)

Solution



Update December 2 2009

Disk group version must be upgraded to 140 if the required patch is not applied

If VVR is upgraded to 5.0, 5.0MP1, 5.0MP2 or 5.0MP3 without the required patch, the disk group version must be upgraded to 140 immediately to avoid a replication or I/O hang. Perform the upgrade as soon as the system comes up on the new VVR version, before any further reboot.

Upgrade the disk group version to the latest (140) using the following command:
   
# vxdg upgrade diskgroup_name

Check the disk group version with the following command:

# vxdg list diskgroup_name | grep version

For example,

# vxdg upgrade vvr1dg

# vxdg list vvr1dg |grep version
version:   140
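
The upgrade and check above can be combined into a guard that upgrades a disk group only when its version is below 140. The following is a minimal sketch, not a Symantec-supplied script. The function names dg_version, dg_needs_upgrade and upgrade_dg_if_needed are hypothetical, and the sketch assumes the "version:" line has the format shown in the example output above.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of VxVM.

# Read `vxdg list <diskgroup>` output on stdin and print the version number.
dg_version() {
    awk '$1 == "version:" { print $2; exit }'
}

# Succeed (exit 0) only when the given version is a number below 140.
dg_needs_upgrade() {
    case "$1" in
        ''|*[!0-9]*) return 2 ;;   # missing or non-numeric version
    esac
    [ "$1" -lt 140 ]
}

# Upgrade the named disk group only if its version is below 140,
# then print the resulting version line.
upgrade_dg_if_needed() {
    dg=$1
    ver=$(vxdg list "$dg" | dg_version)
    if dg_needs_upgrade "$ver"; then
        vxdg upgrade "$dg" && vxdg list "$dg" | grep version
    fi
}

# Usage (run as root on a host with VxVM installed):
#   upgrade_dg_if_needed vvr1dg
```

If the version cannot be read, the guard does nothing, so a failed "vxdg list" never triggers an upgrade.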


The patch for this issue is now available for the following platforms:

Solaris Sparc and Solaris x64:
Veritas Storage Foundation (tm) and High Availability Solutions 5.0 Maintenance Pack 3 Rolling Patch 1 - Solaris x64  http://support.veritas.com/docs/316672
Veritas Storage Foundation (tm) and High Availability Solutions 5.0 Maintenance Pack 3 Rolling Patch 1 - Solaris Sparc  http://support.veritas.com/docs/316671

AIX:
Veritas Storage Foundation (tm) and High Availability Solutions 5.0 Maintenance Pack 3 Rolling Patch 1 - AIX    http://support.veritas.com/docs/328479

Linux:
Veritas Storage Foundation (tm) and High Availability Solutions 5.0 Maintenance Pack 3 Rolling Patch 3 - Linux      http://support.veritas.com/docs/330546


This Tech Note will be updated again when patches for the other platforms become available.


Introduction
After an upgrade to Veritas Volume Replicator (VVR) 5.0, 5.0 MP1, 5.0 MP2 or 5.0 MP3, replication and I/O might stop responding in certain scenarios. This happens because the disk group version remains at its previous value unless you explicitly upgrade it.

Details
The Etrack incident (listed in the Supplemental Materials section of this TechNote) affects systems running VVR 5.0 with a disk group version less than 140. When a system is rebooted, Veritas Volume Manager may need to recover the data volume. The recovery procedure involves reading data from the Storage Replicator Log (SRL) and writing it back to the data volume. Because of the Etrack incident, a wrong generation number can be used during the recovery, which causes replication and I/O to hang. If the wrong generation number is written to the SRL, the problem persists even after the system is rebooted. The only way out of this situation is to detach the RLINK.

What is Affected
All earlier versions of VVR that are upgraded to 5.0, 5.0 MP1, 5.0 MP2 or base 5.0 MP3 may be affected by this issue.

How to Determine if Affected
The RLINK may go into the 'detached' and 'stale' state when the disk group version of the Replicated Volume Group (RVG) at the primary or secondary is less than 140 and one of the following occurs:

1.  With replication configured on shared disk groups, the master node of the primary Cluster Volume Manager/Veritas Volume Replicator cluster fails over to the surviving node.

2.  With replication configured on a private disk group, the primary node reboots, initiating a Storage Replicator Log recovery.

To check the disk group version, run the following command:

# vxdg list dgname | grep version

Note: When the disk group version displayed by the above command is less than 140, replication may stop responding.


To check for the RLINK state, run the following command:

# vxprint -Pl | grep flags

Note:  When the RLINK status displayed by the above command lists 'detached' and 'stale', refer to the Workaround documented below.


To check for the RVG (Replicated Volume Group) status, run the following command:

# vxprint -Vl | grep flags

Note:  When the RVG status displayed by the above command lists 'passthru' and 'srl_header_err', refer to the Workaround documented below.
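
The RLINK and RVG checks above can be scripted by testing the "flags:" lines for the keywords named in the notes. The following is a minimal sketch, assuming each "flags:" line lists its keywords separated by spaces. The helper name flags_have_all is hypothetical and is not a VVR command.

```shell
#!/bin/sh
# Sketch only: flags_have_all is a hypothetical helper, not a VVR command.

# Read `vxprint -Pl` or `vxprint -Vl` output on stdin. Succeed (exit 0) if
# any single "flags:" line contains every keyword passed as an argument.
flags_have_all() {
    awk -v want="$*" '
        BEGIN { n = split(want, kw, " ") }
        $1 == "flags:" {
            hit = 0
            for (i = 1; i <= n; i++)
                for (f = 2; f <= NF; f++)
                    if ($f == kw[i]) { hit++; break }
            if (hit == n) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage (on a host with VVR installed):
#   vxprint -Pl | flags_have_all detached stale \
#       && echo "An RLINK is detached and stale: see the Workaround"
#   vxprint -Vl | flags_have_all passthru srl_header_err \
#       && echo "An RVG shows passthru and srl_header_err: see the Workaround"
```

Because the keywords must appear together on one "flags:" line, a detached RLINK and a different stale RLINK do not produce a false match.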


The system can also experience an I/O hang.

If the wrong generation number is written to the SRL, the problem becomes permanent and persists across system reboots, and the only solution is to detach the RLINK. (In some situations detaching the RLINK may not be enough, and you may need to dissociate the SRL as well.)


Workaround

If the replication hang or I/O hang persists even after a system reboot, the wrong generation number has most likely been written to the SRL. To get out of this situation, the RLINK has to be detached. The RVG will then need to be resynchronized.

The RLINK can be detached with the following command.

# vxrlink -f -g <diskgroup> det <rlink>

For example,

# vxrlink -f -g vvr1dg det rlk_host2bge3_rvg2
VxVM VVR vxrlink INFO V-5-1-6466 Data volumes are in use. Before restarting replication a complete synchronization of the secondary data volumes must be performed.

If vxconfigd hangs before you have a chance to detach the RLINK on the VVR primary, you may need to detach the RLINK on the secondary and reboot the VVR primary server again.   This will prevent the RLINK from getting connected.

The RVG can be resynchronized and replication restarted using one of the following options.

Option 1 - Using autosync:

# vradmin -g <diskgroup> -a startrep <rvg> <remote host>

For example,

# vradmin -g vvr1dg -a startrep rvg2 alaw2bge3
Message from Primary:
VxVM VVR vxrlink WARNING V-5-1-3359 Attaching rlink to non-empty rvg. Autosync will be performed.
VxVM VVR vxrlink INFO V-5-1-3614 Secondary data volumes detected with rvg rvg2 as parent:
VxVM VVR vxrlink INFO V-5-1-6183 vvol21:       len=204800               primary_datavol=vvol21
VxVM VVR vxrlink INFO V-5-1-6183 vvol22:       len=204800               primary_datavol=vvol22
VxVM VVR vxrlink INFO V-5-1-3365 Autosync operation has started

Option 2 - Using difference-based synchronization and checkpoint

# vradmin -g <diskgroup> -c <checkpoint> syncrvg <rvg> <remote host>

# vradmin -g <diskgroup> -c <checkpoint> startrep <rvg> <remote host>

For example,

# vradmin -g vvr1dg -c ckptA syncrvg rvg2 alaw2bge3
Message from Primary:
VxVM VVR vxrsync INFO V-5-52-2233 Starting differences volume synchronization to remote
VxVM VVR vxrsync INFO V-5-52-2211    Source host:         192.168.33.1
VxVM VVR vxrsync INFO V-5-52-2212    Destination host(s): 192.168.33.2
VxVM VVR vxrsync INFO V-5-52-2213    Total volumes:       2
VxVM VVR vxrsync INFO V-5-52-2214    Total size:          200.000 M
Eps_time Dest_host       Src_vol     Dest_vol     F'shed/Tot_sz  Diff  Done
00:00:01 192.168.33.2    vvol21      vvol21           0M/100M      0%    0%
00:00:03 192.168.33.2    vvol21      vvol21         100M/100M     <1%  100%
00:00:03 192.168.33.2    vvol22      vvol22           0M/100M      0%    0%
00:00:05 192.168.33.2    vvol22      vvol22         100M/100M     <1%  100%
VxVM VVR vxrsync INFO V-5-52-2219 VxRSync operation completed.
VxVM VVR vxrsync INFO V-5-52-2220 Total elapsed time: 0:00:05

# vradmin -g vvr1dg -c ckptA startrep rvg2
Message from Primary:
VxVM VVR vxrlink INFO V-5-1-3614 Secondary data volumes detected with rvg rvg2 as parent:
VxVM VVR vxrlink INFO V-5-1-6183 vvol21:       len=204800               primary_datavol=vvol21
VxVM VVR vxrlink INFO V-5-1-6183 vvol22:       len=204800               primary_datavol=vvol22


SRL may need to be dissociated if the RVG recovery hangs even after the RLINK is detached

In some situations the incorrect generation number in the SRL may prevent the RVG recovery even if the RLINK is detached. In this case the SRL will need to be dissociated.

If vxconfigd hangs during system boot, perform the following steps before dissociating the SRL. These steps prevent the RVG recovery that triggers the vxconfigd hang.

1. Create the file /etc/vx/reconfig.d/state.d/install-db

# touch /etc/vx/reconfig.d/state.d/install-db

2. Reboot the system

# shutdown -i 6 -g 0 -y

3. Because the file "install-db" exists, vxconfigd will not start automatically. Run the following commands to start vxconfigd.

# vxiod set 10
# vxconfigd -m enable

4. Import the disk groups whose disk group version is less than 140.

# vxdg import "diskgroup"

5. Detach the RLINK

# vxrlink -f -g "diskgroup" det "rlink"

6. Dissociate the SRL

# vxvol -g "diskgroup" dis "SRL_volume"

You can now upgrade the disk group version by using the procedure provided at the beginning of this TechNote. After the disk group version is upgraded, the SRL can be associated back to the RVG with the following command.

# vxvol -g "diskgroup" aslog "RVG_name" "SRL_volume"

The replication can then be restarted using the steps provided earlier.
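
Steps 4 through 6, the disk group upgrade and the SRL re-association must be run in order with consistent names. The following is a minimal sketch of a helper that only prints that command sequence for review before anything is run by hand. The function name srl_recovery_plan is hypothetical, and the sketch uses only the commands already shown in this TechNote.

```shell
#!/bin/sh
# Sketch only: srl_recovery_plan is a hypothetical helper. It prints the
# commands from steps 4-6 and the follow-up steps; it does not run them.
srl_recovery_plan() {
    if [ "$#" -ne 4 ]; then
        echo "usage: srl_recovery_plan <diskgroup> <rlink> <rvg> <srl_volume>" >&2
        return 1
    fi
    dg=$1 rlink=$2 rvg=$3 srl=$4
    cat <<EOF
vxdg import $dg
vxrlink -f -g $dg det $rlink
vxvol -g $dg dis $srl
vxdg upgrade $dg
vxdg list $dg | grep version
vxvol -g $dg aslog $rvg $srl
EOF
}

# Usage (the SRL volume name srl_vol is illustrative only):
#   srl_recovery_plan vvr1dg rlk_host2bge3_rvg2 rvg2 srl_vol
```

Printing rather than executing keeps the operator in control of each step, which matters here because detaching the RLINK forces a full resynchronization of the secondary.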



Formal Resolution

Symantec has acknowledged that the above-mentioned issue is present in the current version(s) of the product(s) mentioned at the end of this article. Symantec is committed to product quality and satisfied customers. Symantec currently is addressing this issue by way of a patch to the current version of the software.

Please be sure to refer back to this document periodically, as any changes to the status of the issue will be reflected here. A link to the patch download will be added to this document when it becomes available. Please note that Symantec reserves the right to remove any fix from the targeted release if it does not pass quality assurance tests.  Symantec's plans are subject to change and any action taken by you based on the above information or your reliance upon the above information is made at your own risk.

Best Practices:
Symantec strongly recommends the following best practices:
1. Always perform a full backup prior to and after any changes to your environment.
2. Always make sure that your environment is running the latest version and patch level.
How to Subscribe to Software Alerts
If this TechNote was not received from the Symantec Email Notification Service as a Software Alert, please subscribe at the following link:    http://maillist.entsupport.symantec.com/subscribe.asp



Supplemental Materials

Source: ETrack
Value: 1385126
Description: [VVR] I/O hang due to wrong generation number assignment after recovery


Legacy ID



308183



