Possible data corruption when attaching an unusable data plex to a data volume, using the Fastresync feature

Article:TECH52426  |  Created: 2007-01-25  |  Updated: 2008-01-25  |  Article URL http://www.symantec.com/docs/TECH52426
Article Type
Technical Solution

Product(s)

Environment

Issue



Possible data corruption when attaching an unusable data plex to a data volume, using the Fastresync feature

Solution



Symantec has recently discovered an issue where attaching an unusable data plex (a plex whose subdisks are physically unavailable) to a parent volume, using the Fastresync feature, can cause data corruption.

Due to the way in which Volume Manager processes plex reattachment transactions, the following can occur:

  • The reattachment operation can fail because the plex being attached is unusable.
  • An appropriate error message may be returned.
  • The data change object (DCO) attached to the parent volume and used for Fastresync functionality is incorrectly cleared.

Therefore, if the unusable plex is later repaired and the cleared DCO is used with the Fastresync feature to resynchronize the repaired plex with the parent volume:

  • Not all changes which have taken place to the parent volume will be flushed to the repaired plex.
  • Inconsistencies between the two plexes can result in data corruption.

Note: Only the volumes involved in the failed plex reattachment, and therefore the volumes using the cleared DCO, will be impacted.


Affected Versions:

This issue affects all currently supported versions of Veritas Volume Manager:

  • 5.0
  • 4.1
  • 4.0
  • 3.5 for HP-UX
  • The latest maintenance packs for these versions

This issue can occur on all supported UNIX platforms:

  • Solaris
  • HP-UX
  • Linux
  • AIX

This issue does not occur for DCO version 20 (i.e. DCO logs created with vxsnap prepare).
To determine the version of the DCO log used by a volume, use the following commands:

To determine the name of a volume's DCO log:
# vxprint -g <disk_group> -F%dco_name <volume>

To determine the version of the DCO log:
# vxprint -g <disk_group> -F%version <dco_name>
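
The two lookups above can be chained in a small shell helper. The following is a minimal sketch, not a Symantec-supplied tool: the function name dco_version_of is hypothetical, and treating an empty or "-" result as "no DCO associated" is an assumption about vxprint output that should be verified on your system.

```shell
#!/bin/sh
# Sketch: report the DCO name and DCO version for a single volume.
# dco_version_of is a hypothetical helper; it only wraps the two
# vxprint commands shown in this TechNote.
# Usage: dco_version_of <disk_group> <volume>
dco_version_of() {
    dg=$1
    vol=$2
    # Look up the name of the DCO object associated with the volume.
    dco=$(vxprint -g "$dg" -F%dco_name "$vol") || return 1
    # Assumption: an empty value or "-" means no DCO is associated.
    if [ -z "$dco" ] || [ "$dco" = "-" ]; then
        echo "$vol: no DCO associated"
        return 0
    fi
    # Look up the version of that DCO.
    ver=$(vxprint -g "$dg" -F%version "$dco") || return 1
    echo "$vol: dco=$dco version=$ver"
}
```

Calling dco_version_of testdg testvol prints a single line of the form "testvol: dco=<dco_name> version=<version>".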

To upgrade a DCO log from version 0 to version 20, so that it is no longer affected by this issue, use the following procedure:

Upgrade the disk group containing the version 0 DCO log to the latest disk group version supported by your version of Volume Manager:
# vxdg upgrade <disk_group>

Determine which volumes in the disk group have DCO logs associated:
# vxprint -g <disk_group> -F "%name" -e "v_hasdcolog"

The version of each volume's DCO log can now be determined using the vxprint -F%version command described earlier. The following commands should then be repeated for each volume that has a version 0 DCO log and is being upgraded to version 20:

If the volume has a DRL log plex associated, this should be removed. Note that, by default, the following command removes a single log; specify nlog=n to remove multiple logs:
# vxassist -g <disk_group> remove log <volume> [nlog=n]

If the volume has any snapshot volumes associated, these should be reattached and resynchronised to the parent volume before continuing. Note that if you believe the DCO log has already been incorrectly cleared due to this issue, the snapshot volumes should not use FMR (Fastresync) to synchronise with the parent volume:
# vxassist -g <disk_group> snapback <snapshot_volume>

Fast resync functionality should be disabled for the volume:
# vxedit -g <disk_group> set fastresync=off <volume>

The version 0 DCO log should be removed from the volume:
# vxassist -g <disk_group> remove log <volume> logtype=dco

The volume should have a version 20 DCO associated and be prepared for fast resync with the vxsnap command:
# vxsnap -g <disk_group> prepare <volume>

Note that version 20 DCO logs also provide DRL functionality, so there is no need to recreate any separate DRL logs that were previously used for this volume.
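
To find which volumes in a disk group still need the procedure above, the lookup commands from this TechNote can be combined in a loop. This is a minimal sketch under stated assumptions: vxprint is assumed to print one volume name per line, a version 0 DCO is assumed to be reported as the literal value 0, and list_v0_dco_volumes is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list the volumes in a disk group whose DCO log is version 0.
# list_v0_dco_volumes is a hypothetical helper built only from the
# vxprint commands shown in this TechNote.
# Usage: list_v0_dco_volumes <disk_group>
list_v0_dco_volumes() {
    dg=$1
    # Volumes in the disk group that have a DCO log associated.
    vxprint -g "$dg" -F "%name" -e "v_hasdcolog" |
    while read -r vol; do
        # Skip blank lines in the listing.
        [ -n "$vol" ] || continue
        # stdin is redirected so the inner commands cannot consume the list.
        dco=$(vxprint -g "$dg" -F%dco_name "$vol" </dev/null)
        ver=$(vxprint -g "$dg" -F%version "$dco" </dev/null)
        # Assumption: a version 0 DCO is reported as the string "0".
        if [ "$ver" = "0" ]; then
            echo "$vol"
        fi
    done
}
```

Each volume name printed by list_v0_dco_volumes <disk_group> is a candidate for the per-volume upgrade steps; the helper itself changes nothing.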

Example of attaching an unusable plex to a data volume:

Initially, there is a volume containing a detached data plex (testvol-02). Note that the plex's underlying subdisk (testdg02-01) is still marked as enabled, even though the physical disk on which the subdisk is located has failed. This is because there has been no Input/Output (I/O) to the physical disk to cause Volume Manager to fail the subdisk:

v  testvol      -            ENABLED  ACTIVE   262144   SELECT    -        fsgen
pl testvol-01   testvol      ENABLED  ACTIVE   262144   CONCAT    -        RW
sd testdg01-01  testvol-01   testdg01 0        262144   0         c3t10d0  ENA
pl testvol-02   testvol      DETACHED STALE    262144   CONCAT    -        RW
sd testdg02-01  testvol-02   testdg02 0        262144   0         c3t11d0  ENA
dc testvol_dco  testvol      testvol_dcl
v  testvol_dcl  -            ENABLED  ACTIVE   144      SELECT    -        gen
pl testvol_dcl-01 testvol_dcl ENABLED ACTIVE   144      CONCAT    -        RW
sd testdg01-02  testvol_dcl-01 testdg01 262144 144      0         c3t10d0  ENA
pl testvol_dcl-02 testvol_dcl DETACHED STALE   144      CONCAT    -        RW
sd testdg02-02  testvol_dcl-02 testdg02 262144 144      0         c3t11d0  ENA


Because the subdisk, and therefore the testvol-02 plex, looks usable, it appears possible to reattach the plex to the parent volume. However, due to issues with the underlying storage, the attach fails:

root@bishbosh# vxplex -g testdg att testvol testvol-02
VxVM vxplex ERROR V-5-1-1278 Volume testvol, plex testvol-02, block 0: Plex write:
       Error: Write failure
VxVM vxplex ERROR V-5-1-2005 sd testdg02-01 in plex testvol-02 failed during attach
VxVM vxplex ERROR V-5-1-10127 changing plex testvol-02:
       Plex contains unusable subdisk
VxVM vxplex ERROR V-5-1-407 Attempting to cleanup after failure ...

Despite the above messages, this operation may have caused the parent volume's DCO to be incorrectly cleared. Therefore, if the unusable plex is repaired (the failed disk is replaced or reattached) and Fastresync is used to sync the detached plex (testvol-02) to the enabled active plex (testvol-01), not all changes which have occurred to the enabled active plex will be flushed to the repaired plex.



Permanent Fix:

Hot fixes have been released for Volume Manager 4.0 and 4.1 on Solaris only.

Before applying these hot fixes, install the following versions of Volume Manager:

  • Volume Manager 4.0 MP2 Rolling Patch 7
  • Volume Manager 4.1 MP2 Rolling Patch 1


Fixes for all other platforms and versions of Volume Manager are currently in development. As they become available, Symantec will update this TechNote.



Workaround:

To avoid this issue:

Ensure that the underlying storage of the plex is available before reattaching the plex to a parent volume.
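
Because a subdisk can remain marked as enabled until I/O is attempted, one way to apply this check is to read from each underlying device before attaching. The following is a minimal sketch under stated assumptions, not a Symantec-supplied procedure: device_readable and attach_if_readable are hypothetical helper names, the device paths must be supplied by the administrator (they are platform specific), and a successful read does not guarantee that writes will succeed.

```shell
#!/bin/sh
# Sketch: attach a plex only after a read test on its underlying devices.
# Both function names are hypothetical helpers for illustration.

# Return success only if one 8 KB block can be read from the device path.
device_readable() {
    dd if="$1" of=/dev/null bs=8192 count=1 >/dev/null 2>&1
}

# Usage: attach_if_readable <disk_group> <volume> <plex> <device>...
attach_if_readable() {
    dg=$1
    vol=$2
    plex=$3
    shift 3
    # Refuse to attach if any listed device fails the read test.
    for dev in "$@"; do
        if ! device_readable "$dev"; then
            echo "not attaching $plex: cannot read $dev" >&2
            return 1
        fi
    done
    # All devices answered a read; attempt the attach shown in this TechNote.
    vxplex -g "$dg" att "$vol" "$plex"
}
```

The read test is deliberately conservative: if any device cannot be read, vxplex att is never invoked, so the failed attach that clears the DCO is not triggered by this helper.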

If you believe that a DCO may already have been incorrectly cleared:

1. Remove the DCO from the parent volume and recreate it.
2. Perform a full resynchronization on any plexes detached from the parent volume when the failed plex attachment was performed. Do not use Fastresync.

Example of recreating the DCO and performing a full resynchronization of a detached plex:

Note: A volume with a detached data plex exists. There was an attempt to reattach this data plex to the parent volume while the plex was unusable and it is now believed that the DCO has been incorrectly cleared. The plex has now been repaired (i.e. the underlying physical disk has been reattached). However, the DCO requires recreation followed by a full resynchronization of the detached plex:

v  testvol      -            ENABLED  ACTIVE   262144   SELECT    -        fsgen
pl testvol-01   testvol      ENABLED  ACTIVE   262144   CONCAT    -        RW
sd testdg01-01  testvol-01   testdg01 0        262144   0         c3t10d0  ENA
pl testvol-02   testvol      DISABLED RECOVER  262144   CONCAT    -        WO
sd testdg02-01  testvol-02   testdg02 0        262144   0         c3t11d0  ENA
dc testvol_dco  testvol      testvol_dcl
v  testvol_dcl  -            ENABLED  ACTIVE   144      SELECT    -        gen
pl testvol_dcl-01 testvol_dcl ENABLED ACTIVE   144      CONCAT    -        RW
sd testdg01-02  testvol_dcl-01 testdg01 262144 144      0         c3t10d0  ENA
pl testvol_dcl-02 testvol_dcl DISABLED RECOVER 144      CONCAT    -        RW
sd testdg02-02  testvol_dcl-02 testdg02 262144 144      0         c3t11d0  ENA


Remove and recreate the DCO as follows:

# vxassist -g testdg remove log testvol logtype=dco
# vxassist -g testdg addlog testvol logtype=dco


Reattach the data plex:

Note: Because the DCO has been recreated (correctly cleared) while testvol-02 has been detached, Volume Manager forces a full resynchronization of the data plex:

# vxplex -g testdg att testvol testvol-02


Recreating the DCO is recommended for completeness. However, the following alternative methods force a full resynchronization of the detached plex without recreating the DCO:

1. Use '-o nofmr' with 'vxplex att' to avoid the use of Fastresync when performing a plex attachment:

# vxplex -g testdg -o nofmr att testvol testvol-02


2. Use the following commands to:

  • Disable Fastresync for the volume.
  • Reattach the detached plex.
  • Re-enable Fastresync once the plex synchronization is complete.

# vxvol -g testdg set fastresync=off testvol
# vxplex -g testdg att testvol testvol-02
# vxvol -g testdg set fastresync=on testvol
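
The three commands above can be wrapped so that Fastresync is re-enabled only when the attach succeeds. This is a minimal sketch under stated assumptions: full_resync_attach is a hypothetical helper name, and vxplex att is assumed to run in the foreground and return only after the synchronization has finished.

```shell
#!/bin/sh
# Sketch: force a full resynchronization by disabling Fastresync around
# the attach. full_resync_attach is a hypothetical helper name.
# Usage: full_resync_attach <disk_group> <volume> <plex>
full_resync_attach() {
    dg=$1
    vol=$2
    plex=$3
    # Disable Fastresync so the attach cannot use the DCO.
    vxvol -g "$dg" set fastresync=off "$vol" || return 1
    # Assumption: vxplex att returns only when synchronization is complete.
    if vxplex -g "$dg" att "$vol" "$plex"; then
        # Re-enable Fastresync only after a successful attach.
        vxvol -g "$dg" set fastresync=on "$vol"
    else
        echo "attach of $plex failed; Fastresync left off on $vol" >&2
        return 1
    fi
}
```

Leaving Fastresync disabled after a failed attach is a design choice in this sketch: it avoids further attach attempts using the DCO until the underlying storage problem has been investigated.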



To obtain rolling patches, hot fixes, or further information about this issue, please contact Symantec Enterprise Support.




Supplemental Materials

Source: ETrack
Value: 1047820
Description: Parent etrack incident for this defect

Legacy ID

289667



